MIT Think 02

I didn’t have enough time to finish the research plan section of the Think proposal; it would have taken too long. I did manage to finish the future work and project budget sections. This was an interesting exercise: it is the first time I have seriously drawn up a budget for a research project and thought about how much I need to spend and how to spend it wisely.
Future Work
Although building my own model has been a fun task in and of itself, I hope to put my work to the test in solving real-world problems. First and foremost, I hope that DeepPLA can be applied to help medical researchers find a cure for COVID-19. To this end, my project still needs to overcome some challenges.
Firstly, model precision is largely determined by the choice of embedding algorithms. The embedding modules I am currently using, however, generate vector representations for the drugs and the proteins that are out of balance (300 elements for a drug and 6165 elements for a protein). This imbalance skews the model heavily toward the protein information, potentially resulting in lower precision. To solve this problem, I will experiment with different embedding algorithms. Recently I have had my eye on an exciting machine learning method, dual learning, which Microsoft developed for language translation. After communicating with the developer, I learned that this method has the potential for accurate and fast protein embedding, which could improve the performance of my model.
Secondly, to achieve higher precision, the model could be entirely rebuilt around a graph neural network approach. This would be time-consuming, but it could substantially improve performance while keeping the input simple. I would benefit from expert advice on how to optimize my model architecture.
Lastly, the database I am currently using, BindingDB, does not provide access to commercial datasets, which contain more data relevant to finding a cure for COVID-19. This limitation makes the database less representative of the full body of knowledge on drug candidate molecules and COVID-19 proteins. Therefore, I hope to cooperate with drug development companies and test my model on their target proteins. My research would also benefit from their feedback on DeepPLA’s prediction performance. Although I have made my model publicly accessible on GitHub for wider exposure, as a high school student I find it hard to get feedback from end users on my own.
I believe that my research can benefit a lot from the MIT THINK program, where I can meet creative minds and find inspiration for my work in their research. I would appreciate the opportunity to connect with experts in related fields and receive their advice on optimizing my machine learning frameworks and building better models to represent drug-target interactions. Access to high-performance GPU clusters would also be a great help in speeding up model development.
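To make the imbalance concrete, here is a minimal Python sketch. Only the two vector lengths (300 and 6165) come from the text above. The average pooling step, the `average_pool` and `protein_share` helpers, and the placeholder vectors are assumptions for illustration; this is not the embedding method DeepPLA uses, only one simple way a long vector could be shrunk to match a short one.

```python
from statistics import fmean

DRUG_DIM = 300      # drug embedding length stated in the text
PROTEIN_DIM = 6165  # protein embedding length stated in the text


def protein_share(drug_dim, protein_dim):
    """Fraction of the concatenated input that comes from the protein."""
    return protein_dim / (drug_dim + protein_dim)


def average_pool(vec, out_dim):
    """Shrink vec to out_dim values by averaging near-equal contiguous bins.

    Hypothetical stand-in for a learned reduction layer.
    """
    n = len(vec)
    if not 0 < out_dim <= n:
        raise ValueError("out_dim must be between 1 and len(vec)")
    # Bin boundaries: bin i covers vec[bounds[i]:bounds[i + 1]].
    bounds = [i * n // out_dim for i in range(out_dim + 1)]
    return [fmean(vec[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


# Placeholder vectors, not real embeddings.
drug_vec = [0.0] * DRUG_DIM
protein_vec = [float(i % 7) for i in range(PROTEIN_DIM)]

pooled_protein = average_pool(protein_vec, DRUG_DIM)
model_input = drug_vec + pooled_protein  # list concatenation

print(f"protein share before pooling: {protein_share(DRUG_DIM, PROTEIN_DIM):.1%}")
print(f"protein share after pooling:  {protein_share(DRUG_DIM, len(pooled_protein)):.1%}")
print("balanced input length:", len(model_input))
```

With the raw embeddings, roughly 95% of the concatenated input comes from the protein; after pooling both halves contribute equally. Whether equal sizes actually improve precision is something that would still have to be tested.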

Timeline and Budget

Milestones and Evaluations	Deadline
** Data cleaning and processing	May 2021
** Design the first version of the model	Aug 2021
Compare with baseline models on MSE, R-squared, and running time	Jan 2022
Revision 1: reduce the protein embedding; match current performance	Feb 2022
Revision 2: adapt dual learning; reach higher performance (R-squared > 0.7)	Mar 2022
Predict real-life DTI to evaluate drugs for COVID-19; prepare manuscript for publication	Jun 2022
** Completed milestones
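
The baseline comparison in the timeline relies on MSE, R-squared, and running time. A minimal sketch of how these three numbers could be computed is below. The function names and the affinity values are made up for illustration, and `dummy_model` is a hypothetical placeholder, not DeepPLA or any of its baselines.

```python
import time


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def timed(predict, inputs):
    """Run predict(inputs) and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = predict(inputs)
    return predictions, time.perf_counter() - start


# Made-up binding affinities, for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]


def dummy_model(inputs):
    """Hypothetical placeholder that returns fixed predictions."""
    return [5.5, 6.0, 6.5, 8.0]


y_pred, seconds = timed(dummy_model, inputs=None)
print("MSE:", mse(y_true, y_pred))
print("R-squared:", r_squared(y_true, y_pred))
print("meets R-squared > 0.7 target:", r_squared(y_true, y_pred) > 0.7)
print(f"running time: {seconds:.6f} s")
```

Lower MSE and higher R-squared are better; the 0.7 threshold is the Revision 2 target from the timeline.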

Budget

GPU Computing Cluster	Aliyun Cloud Service	$100/month	https://www.aliyun.com/
