Machine Learning Notebook 0806

Technique Sharing
  • first attempt at the network

  • fix the slow-processing problem (with the sequences embedded first)

    • do not build the DataLoader from a DataFrame; convert the DataFrame to tensors before passing the data to the DataLoader
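    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # train: slice the DataFrame once and convert each slice to a tensor up front
    # unsqueeze(dim=1) adds a channel axis, giving shape (N, 1, features)
    train_seq = torch.tensor(np.array(train_data.iloc[:, 305:])).unsqueeze(dim=1).to(torch.float32)
    train_mol = torch.tensor(np.array(train_data.iloc[:, 5:305])).unsqueeze(dim=1).to(torch.float32)
    train_Ki = torch.tensor(np.array(train_data.iloc[:, 4]))

    # the DataLoader now indexes plain tensors instead of DataFrame rows
    trainDataset = TensorDataset(train_mol, train_seq, train_Ki)
    trainDataLoader = DataLoader(trainDataset, batch_size=128)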
  • add a normalization layer after ReLU
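
The normalization placement from the last item can be sketched as below. This is only a minimal illustration, not the actual network: the class name `ConvBlock`, the choice of `BatchNorm1d`, the channel count, and the kernel size are all assumptions. The dummy input is shaped like `train_mol`, i.e. (batch, 1, 300).

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Conv1d -> ReLU -> BatchNorm1d (normalization placed after the activation)."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # padding = kernel_size // 2 keeps the sequence length unchanged for odd kernels
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.act(self.conv(x)))


# Dummy batch shaped like train_mol: (batch, channels=1, features=300)
x = torch.randn(8, 1, 300)
block = ConvBlock(in_channels=1, out_channels=16)  # 16 channels is an arbitrary choice
y = block(x)
print(y.shape)  # torch.Size([8, 16, 300])
```

Stacking several such blocks gives one normalization layer per convolution.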

need to do tomorrow

  • check the embedding results
  • improve the network accuracy

notice

  • when building the CNN part of the network, add a normalization layer after every Conv layer
  • don’t put a DataFrame into a Dataset
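
To make the DataFrame point concrete, here is a rough sketch of the two patterns side by side. The toy frame, its column layout, and the `FrameDataset` class are hypothetical and only for illustration; the real data uses different columns.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class FrameDataset(Dataset):
    """Slow pattern: every __getitem__ indexes the DataFrame and builds new tensors."""

    def __init__(self, frame: pd.DataFrame):
        self.frame = frame

    def __len__(self) -> int:
        return len(self.frame)

    def __getitem__(self, idx: int):
        row = self.frame.iloc[idx]  # per-sample pandas indexing is the expensive part
        features = torch.tensor(row.iloc[1:].to_numpy(dtype=np.float32))
        target = torch.tensor(row.iloc[0], dtype=torch.float32)
        return features, target


# Hypothetical toy frame: column 0 is the target, columns 1..4 are features.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((256, 5)))

# Faster pattern: convert once, then let TensorDataset index plain tensors.
features = torch.tensor(frame.iloc[:, 1:].to_numpy(), dtype=torch.float32)
targets = torch.tensor(frame.iloc[:, 0].to_numpy(), dtype=torch.float32)

fast_loader = DataLoader(TensorDataset(features, targets), batch_size=128)
slow_loader = DataLoader(FrameDataset(frame), batch_size=128)

# Both loaders yield the same batches; only the per-item cost differs.
for (xf, yf), (xs, ys) in zip(fast_loader, slow_loader):
    assert torch.allclose(xf, xs) and torch.allclose(yf, ys)
```

The batches are identical, so switching to the tensor version changes only the loading speed, not the training data.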

figures and references for the paper

  • about the data:

    • statistical overview of protein length (excluding a single entry longer than 7k)

    Figure 1

    • x label: protein sequence length
    • y label: number of sequences
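
A figure like this could be produced roughly as follows. This is a sketch under assumptions: "7k" is read as a length of 7000, the bin width of 100 and the output file name are arbitrary, the helper names are made up, and the example lengths are toy values, not the real data.

```python
from collections import Counter


def filter_lengths(lengths, max_len=7000):
    """Drop outliers longer than max_len (assumed meaning of the 7k cutoff)."""
    return [n for n in lengths if n <= max_len]


def bin_counts(lengths, bin_width=100):
    """Count sequences per bin; each key is the left edge of its bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def plot_length_histogram(lengths, bin_width=100, path="figure1.png"):
    """Save a bar chart of the binned lengths with the axis labels noted above."""
    # Imported here so the helpers above also work without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    counts = bin_counts(lengths, bin_width)
    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()), width=bin_width, align="edge")
    ax.set_xlabel("protein sequence length")
    ax.set_ylabel("number of sequences")
    fig.savefig(path)
    plt.close(fig)


# Toy lengths for illustration only; the last one plays the role of the outlier.
lengths = [120, 350, 360, 980, 7096]
kept = filter_lengths(lengths)
counts = bin_counts(kept)
print(counts)  # {100: 1, 300: 2, 900: 1}
```

Calling `plot_length_histogram(kept)` then writes the chart to `figure1.png`.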