Machine Learning NoteBook 0813

Technique Sharing

8/13: to-do

  • draw figures of the predictions at different epochs
  • the model does not perform well on the outer test set; try other datasets instead

current status

  • finished the test on the outer test set, but the accuracy is not good
    • trying some other datasets for the test:
      • KIBA – processing
      • DAVIS – waiting; may not include $K_i$ values

data refinement statistics

| step | description | data count |
| --- | --- | --- |
| 0 | original | 2278226 |
| 1 | drop multichain entries | 2169710 |
| 2 | keep only data with a $K_i$ value | 490605 |
| 3 | count how many times each molecule and each sequence occurs; remove data with molecules occurring fewer than 3 times and sequences occurring fewer than 6 times | 288115 |
| 4 | remove invalid $K_i$ values (e.g. $K_i = 0$) | 250481 |
| 5 | embed molecules and sequences; remove data that cannot be embedded | 250343 |
| 6 | remove data with $pK_i$ ($\log_{10} K_i$) higher than $8$ | 249517 |
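
The filtering steps above could be sketched in pandas roughly as follows. This is only an illustration, not the script actually used: the column names `smiles`, `sequence`, `ki` and `n_chains` are hypothetical, step 5 (embedding) is left out because it depends on the embedding model, and step 3 is read as "keep rows whose molecule occurs at least 3 times and whose sequence occurs at least 6 times".

```python
import numpy as np
import pandas as pd


def refine(df, min_mol=3, min_seq=6, max_log_ki=8.0):
    """Apply steps 1-4 and 6 of the refinement table to a raw DataFrame.

    All column names are assumptions made for this sketch.
    """
    # Step 1: drop multichain entries (assumes a chain-count column).
    df = df[df["n_chains"] == 1]

    # Step 2: keep only rows that have a Ki value.
    df = df[df["ki"].notna()]

    # Step 3: count occurrences once, then filter on both counts.
    mol_counts = df["smiles"].map(df["smiles"].value_counts())
    seq_counts = df["sequence"].map(df["sequence"].value_counts())
    df = df[(mol_counts >= min_mol) & (seq_counts >= min_seq)]

    # Step 4: remove invalid Ki values such as Ki = 0.
    df = df[df["ki"] > 0]

    # Step 6: remove rows whose log10(Ki) is higher than the cutoff.
    df = df[np.log10(df["ki"]) <= max_log_ki]

    return df.reset_index(drop=True)
```

Printing `len(df)` after each step would reproduce a count column like the one in the table.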

need to do tomorrow

notice

  • found a new Python package, PyTDC, for downloading the data: homepage

    pip install PyTDC
    • to download the data, use:
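    from tdc.multi_pred import DTI

    # each call loads one dataset; keep only the line for the dataset needed,
    # since every assignment overwrites the previous `data`
    data = DTI(name = 'BindingDB_Kd')
    data = DTI(name = 'BindingDB_IC50')
    data = DTI(name = 'BindingDB_Ki')
    data = DTI(name = 'KIBA')
    data = DTI(name = 'DAVIS')

    # split the loaded dataset into train / valid / test
    split = data.get_split()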
    • split is a dictionary with 3 keys:
      • "train", "valid", "test"; each value is a pd.DataFrame
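
A quick way to sanity-check a split dictionary like this is to tabulate how many rows each part holds. The helper below is a small sketch; `summarize_split` is a made-up name, not part of PyTDC, and it only assumes that every value in the dictionary is a pd.DataFrame.

```python
import pandas as pd


def summarize_split(split):
    """Return one row per split with its row count and share of the total."""
    total = sum(len(frame) for frame in split.values())
    rows = [
        {"split": name, "rows": len(frame), "fraction": len(frame) / total}
        for name, frame in split.items()
    ]
    return pd.DataFrame(rows)
```

With PyTDC installed, something like `summarize_split(data.get_split())` should work on the dictionary described above.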

figures and reference for paper

| epoch | test r^2 |
| --- | --- |
| nan | nan |
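
To fill in the table, the test r^2 for each epoch could be computed with a helper like the one below. This is a sketch that assumes r^2 means the coefficient of determination (1 minus residual sum of squares over total sum of squares); if the squared Pearson correlation is intended, the formula would differ. The name `r_squared` is hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Calling it once per saved epoch on the test-set predictions gives one table row per epoch.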