Machine Learning NoteBook 0813

8/13: to-do

draw figures of the predictions at different epochs

the run on the outer test set did not go well; try other datasets instead

current status

finished the test on the outer test set, but the accuracy is not good
- trying some other datasets for the test:
  - KIBA: processing
  - DAVIS: waiting; may not include $K_i$ values

data refinement statistics

| step | description | records |
| --- | --- | --- |
| 0 | original | 2278226 |
| 1 | drop multichain entries | 2169710 |
| 2 | keep only data with a $K_i$ value | 490605 |
| 3 | count how often each molecule and sequence occurs; remove data with molecules occurring fewer than 3 times and sequences occurring fewer than 6 times | 288115 |
| 4 | remove invalid $K_i$ values (e.g. $K_i = 0$) | 250481 |
| 5 | embed molecules and sequences; remove data that cannot be embedded | 250343 |
| 6 | remove data with $pK_i$ ($\log_{10} K_i$) higher than $8$ | 249517 |
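
A rough sketch of how steps 2, 3, 4 and 6 could look in pandas. The column names (`smiles`, `sequence`, `ki`) are made up for illustration, step 3 is read here as "drop a row if either its molecule or its sequence is too rare", and the cutoff follows the $\log_{10} K_i$ expression from the table literally. Steps 1 and 5 depend on the raw file layout and the embedding models, so they are left out.

```python
import numpy as np
import pandas as pd


def refine(df: pd.DataFrame, min_mol: int = 3, min_seq: int = 6,
           max_log_ki: float = 8.0) -> pd.DataFrame:
    """Apply steps 2, 3, 4 and 6 of the refinement table.

    The column names "smiles", "sequence" and "ki" are hypothetical.
    """
    # Step 2: keep only rows that have a K_i value.
    df = df.dropna(subset=["ki"])

    # Step 3: count occurrences of each molecule and sequence, then keep
    # rows whose molecule and sequence both occur often enough
    # (assumed reading of the note).
    mol_count = df["smiles"].map(df["smiles"].value_counts())
    seq_count = df["sequence"].map(df["sequence"].value_counts())
    df = df[(mol_count >= min_mol) & (seq_count >= min_seq)]

    # Step 4: remove invalid K_i values (zero or negative).
    df = df[df["ki"] > 0]

    # Step 6: drop rows whose log10(K_i) is above the cutoff.
    df = df[np.log10(df["ki"]) <= max_log_ki]
    return df.reset_index(drop=True)


# Toy input with made-up values, only to exercise each step.
raw = pd.DataFrame({
    "smiles": ["A", "A", "A", "B", "C", "D", "D", "D"],
    "sequence": ["S"] * 8,
    "ki": [10.0, 0.0, 1e9, 5.0, np.nan, 100.0, 200.0, 300.0],
})
refined = refine(raw)
print(len(raw), "->", len(refined))
```

Logging `len(df)` after every step would reproduce the "records" column of the table for the real data.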

need to do tomorrow

notice

found a new Python package, PyTDC, for downloading data: homepage

pip install PyTDC

to download data, use one of:

from tdc.multi_pred import DTI

data = DTI(name = 'BindingDB_Kd')
data = DTI(name = 'BindingDB_IC50')
data = DTI(name = 'BindingDB_Ki')
data = DTI(name = 'KIBA')
data = DTI(name = 'DAVIS')

split = data.get_split()

split is a dictionary with 3 keys:
- "train", "valid" and "test"; each value is a pd.DataFrame
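
A small sketch for checking the subset sizes of such a split. `summarize_split` is a hypothetical helper, not part of PyTDC, and the toy frames below (with a placeholder column `Y`) only stand in for the result of `data.get_split()` so the snippet runs without downloading anything.

```python
import pandas as pd


def summarize_split(split: dict) -> pd.DataFrame:
    """Return the row count and fraction of each subset in a split dict."""
    sizes = pd.Series({name: len(frame) for name, frame in split.items()})
    return pd.DataFrame({"rows": sizes, "fraction": sizes / sizes.sum()})


# Stand-in for data.get_split(): three small frames with made-up values.
toy_split = {
    "train": pd.DataFrame({"Y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]}),
    "valid": pd.DataFrame({"Y": [8.0]}),
    "test": pd.DataFrame({"Y": [9.0, 10.0]}),
}
summary = summarize_split(toy_split)
print(summary)
```

With the real data, `summarize_split(split)` should work the same way, since it only relies on `len()` of each DataFrame.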

figures and references for the paper

| epoch | test r^2 |
| --- | --- |
| nan | nan |
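
A possible way to fill this table once per-epoch predictions are saved. This assumes "test r^2" means the coefficient of determination, $1 - SS_{res}/SS_{tot}$ (it could also mean the squared Pearson correlation, which gives different values). The function names and the numbers are made up and are not results.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_table(y_true, preds_by_epoch):
    """Format {epoch: predictions} as rows of the epoch / test r^2 table."""
    lines = ["| epoch | test r^2 |", "| --- | --- |"]
    for epoch in sorted(preds_by_epoch):
        score = r_squared(y_true, preds_by_epoch[epoch])
        lines.append(f"| {epoch} | {score:.3f} |")
    return "\n".join(lines)


# Made-up numbers purely to exercise the functions; not real results.
y_true = [1.0, 2.0, 3.0, 4.0]
preds_by_epoch = {
    1: [2.5, 2.5, 2.5, 2.5],  # always predicting the mean gives r^2 = 0
    2: [1.0, 2.0, 3.0, 4.0],  # perfect predictions give r^2 = 1
}
print(r2_table(y_true, preds_by_epoch))
```

The same `(epoch, score)` pairs can be passed to a plotting library to draw the figure from the to-do list.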
