8/13: to-do
- draw figures of the predictions at different epochs
- results on the outer test set are poor; try other datasets instead
current status
- finished testing on the outer test set, but the accuracy is poor
- trying other datasets for the test:
  - KIBA: processing
  - DAVIS: waiting (may not include $K_i$ values)
data refinement statistics
step | description | records remaining |
---|---|---|
0 | original | 2278226 |
1 | drop multichain entries | 2169710 |
2 | keep only records with a $K_i$ value | 490605 |
3 | count how often each molecule and each sequence occurs; remove molecules occurring fewer than 3 times and sequences occurring fewer than 6 times | 288115 |
4 | remove invalid $K_i$ values (e.g. $K_i = 0$) | 250481 |
5 | embed molecules and sequences; remove records that cannot be embedded | 250343 |
6 | remove records with $pK_i$ ($\log_{10} K_i$) greater than $8$ | 249517 |
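
The filtering steps in the table can be sketched with pandas. This is only an illustration under assumptions, not the script that produced the counts: the column names `smiles`, `sequence`, and `ki` and the function `filter_ki_records` are hypothetical, step 3 is read as two separate filters (molecule count and sequence count), and the transformed value follows the table's parenthetical ($\log_{10} K_i$). Steps 1 and 5 (multichain and embedding) are left out because they depend on tools not described here.

```python
import numpy as np
import pandas as pd


def filter_ki_records(df, min_mol=3, min_seq=6, max_log_ki=8.0):
    """Apply steps 2, 3, 4 and 6 of the table to a DataFrame.

    Assumed (hypothetical) columns: 'smiles', 'sequence', 'ki'.
    """
    # Step 2: keep only records that report a K_i value.
    out = df.dropna(subset=["ki"])

    # Step 3: count occurrences, then drop rare molecules and rare sequences.
    mol_counts = out["smiles"].map(out["smiles"].value_counts())
    seq_counts = out["sequence"].map(out["sequence"].value_counts())
    out = out[(mol_counts >= min_mol) & (seq_counts >= min_seq)]

    # Step 4: drop invalid values such as K_i = 0.
    out = out[out["ki"] > 0]

    # Step 6: transform as written in the table (log10 of K_i) and
    # drop values above the threshold.
    out = out.assign(log_ki=np.log10(out["ki"]))
    return out[out["log_ki"] <= max_log_ki].reset_index(drop=True)


# Toy data, invented purely to exercise the function.
toy = pd.DataFrame({
    "smiles": ["CCO"] * 6 + ["CCN"],
    "sequence": ["MKT"] * 7,
    "ki": [1.0, 10.0, 0.0, None, 1e9, 100.0, 5.0],
})
filtered = filter_ki_records(toy)
```

If the sign convention is actually $-\log_{10}$, only the `assign` line needs to change; the thresholds are keyword arguments so they can be adjusted without touching the logic.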
need to do tomorrow
notice
found a new Python package for downloading data: homepage
```shell
pip install PyTDC
```
- to download data, use:

```python
from tdc.multi_pred import DTI

# Each call loads a different dataset; keep only the one you need.
data = DTI(name = 'BindingDB_Kd')
data = DTI(name = 'BindingDB_IC50')
data = DTI(name = 'BindingDB_Ki')
data = DTI(name = 'KIBA')
data = DTI(name = 'DAVIS')

# Split the loaded dataset into train / valid / test partitions.
split = data.get_split()
```

`split` is a dictionary with 3 keys: `"train"`, `"test"`, and `"valid"`; each value is a `pd.DataFrame`.
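
A quick way to sanity-check the dictionary returned by `get_split()` is to tabulate the size of each partition. The helper below is a minimal sketch; `summarize_split` is a hypothetical name, and the toy dictionary only mimics the shape described above (three keys, each mapping to a DataFrame) so the example runs without downloading anything.

```python
import pandas as pd


def summarize_split(split):
    """Return row counts and fractions for each partition of a split dict."""
    total = sum(len(frame) for frame in split.values())
    rows = [
        {"partition": name, "rows": len(frame), "fraction": len(frame) / total}
        for name, frame in split.items()
    ]
    return pd.DataFrame(rows)


# Toy stand-in for the dictionary returned by data.get_split().
toy_split = {
    "train": pd.DataFrame({"Y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]}),
    "valid": pd.DataFrame({"Y": [8.0]}),
    "test": pd.DataFrame({"Y": [9.0, 10.0]}),
}
summary = summarize_split(toy_split)
```

With a real dataset the call would be `summarize_split(data.get_split())`.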
figures and references for the paper
| epoch | test $r^2$ |
| --- | --- |
| nan | nan |
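
For filling in the test $r^2$ column, one common definition is the coefficient of determination, $1 - SS_{res}/SS_{tot}$. The sketch below assumes that definition; the notes do not say whether this or the squared Pearson correlation is intended, and the function name `r_squared` is illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares between targets and predictions.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares around the mean of the targets.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy values, invented only to show the two reference points of the metric.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A perfect prediction scores 1, and always predicting the mean of the targets scores 0, which makes the metric easy to compare across epochs.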