8/13: to-do
- draw figures of the predictions at different epochs
- the external test set does not perform well; try other datasets instead
current status
- finished the test on the external test set, but the accuracy is poor
- trying other datasets for the test:
  - KIBA: processing
  - DAVIS: waiting; may not include $K_i$ values
refine data statistic
| step | description | records remaining |
|---|---|---|
| 0 | original | 2278226 |
| 1 | drop multichain entries | 2169710 |
| 2 | keep only records with a $K_i$ value | 490605 |
| 3 | count how often each molecule and each sequence occurs; remove records with molecules occurring fewer than 3 times and sequences occurring fewer than 6 times | 288115 |
| 4 | remove invalid $K_i$ values (e.g. $K_i = 0$) | 250481 |
| 5 | embed molecules and sequences; remove records that cannot be embedded | 250343 |
| 6 | remove records with $pK_i$ ($\log_{10} K_i$) higher than $8$ | 249517 |
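As a rough illustration of steps 2, 3, 4 and 6 in the table, here is a minimal pandas sketch on toy data. The column names (`smiles`, `sequence`, `ki`) and the function name `refine` are assumptions made for this example, not the names used in the actual pipeline. Step 3 is read here as a single-pass filter that keeps a record only if its molecule occurs at least 3 times and its sequence at least 6 times; the real filter may differ.

```python
import numpy as np
import pandas as pd


def refine(df, min_mol=3, min_seq=6, max_log_ki=8.0):
    """Filter a toy frame following steps 2, 3, 4 and 6 (hypothetical column names)."""
    counts = {"original": len(df)}

    # Step 2: keep only records that have a Ki value
    df = df.dropna(subset=["ki"])
    counts["has_ki"] = len(df)

    # Step 3: count occurrences once, then keep frequent molecules and sequences
    mol_n = df["smiles"].map(df["smiles"].value_counts())
    seq_n = df["sequence"].map(df["sequence"].value_counts())
    df = df[(mol_n >= min_mol) & (seq_n >= min_seq)]
    counts["frequent"] = len(df)

    # Step 4: remove invalid Ki values such as Ki = 0
    df = df[df["ki"] > 0]
    counts["valid_ki"] = len(df)

    # Step 6: remove records whose log10(Ki) is higher than the threshold
    df = df.assign(log_ki=np.log10(df["ki"]))
    df = df[df["log_ki"] <= max_log_ki]
    counts["final"] = len(df)
    return df, counts


# Toy data only; these values are not taken from the real dataset
toy = pd.DataFrame({
    "smiles":   ["M1", "M1", "M1", "M1", "M2", "M2", "M2", "M3", "M1"],
    "sequence": ["S1", "S1", "S1", "S1", "S1", "S1", "S1", "S2", "S1"],
    "ki":       [10.0, 0.0, 1e9, 5.0, 100.0, 20.0, 30.0, 50.0, np.nan],
})
refined, counts = refine(toy)
```

Recording the count after each step makes it easy to rebuild a table like the one above whenever a threshold changes.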
need to do tomorrow
notice
found a new Python package for downloading data, PyTDC: homepage
```shell
pip install PyTDC
```
- to download data, use:

```python
from tdc.multi_pred import DTI

# Pick one dataset; each call below reassigns `data`
data = DTI(name = 'BindingDB_Kd')
data = DTI(name = 'BindingDB_IC50')
data = DTI(name = 'BindingDB_Ki')
data = DTI(name = 'KIBA')
data = DTI(name = 'DAVIS')

split = data.get_split()
```

`split` is a dictionary with 3 keys: `"train"`, `"test"`, `"valid"`; each key maps to a `pd.DataFrame`.
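A quick way to check such a split is to count the rows in each partition. The sketch below uses a stand-in dictionary with the same three keys rather than downloading anything; the helper `summarize_split` and the single `Y` column in the toy frames are assumptions for illustration only.

```python
import pandas as pd


def summarize_split(split):
    """Return the number of rows in each partition of a split dictionary."""
    return {name: len(frame) for name, frame in split.items()}


# Stand-in for the dictionary described above (toy values, not real data)
toy_split = {
    "train": pd.DataFrame({"Y": [1.0, 2.0, 3.0]}),
    "valid": pd.DataFrame({"Y": [4.0]}),
    "test": pd.DataFrame({"Y": [5.0, 6.0]}),
}
sizes = summarize_split(toy_split)
```

The same helper should work unchanged on the real `split`, since it only relies on the values being DataFrames.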
figures and reference for paper
| epoch | test $R^2$ |
|---|---|
| nan | nan |
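To fill in a table like this, the test $R^2$ can be computed per epoch from saved predictions. The sketch below uses the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$ with NumPy; the function name `r_squared`, the dictionary `preds_by_epoch`, and all numbers are toy assumptions, not results from the actual model.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy targets and per-epoch predictions (illustrative values only)
y_test = [5.0, 6.0, 7.0, 8.0]
preds_by_epoch = {
    1: [6.5, 6.5, 6.5, 6.5],  # always predicts the mean of y_test
    2: [5.0, 6.0, 7.0, 8.0],  # matches y_test exactly
}
scores = {epoch: r_squared(y_test, pred) for epoch, pred in preds_by_epoch.items()}
```

Predicting the mean gives a score of 0 and a perfect prediction gives 1, which is a useful sanity check before plotting the per-epoch curve.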