Calculate Features

This notebook reads a dataset of protein sequences with their fold classifications and calculates a feature vector for each protein sequence.


Rule 3: Use Divisions to Make Steps Clear. We use one cell for each distinct task.

Rule 4: Modularize Code. To avoid duplicating code, we have collected several functions in protvectors.py. These functions are also used in 4-Predict.

Rule 8: Share and Explain Your Data. To enable reproducibility, we provide a local copy of a Word2vec model in the /data directory and a file that describes the datasets, including download locations and dates.


In [1]:
import pandas as pd
import protvectors
In [2]:
# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted
In [3]:
df = pd.read_json("./intermediate_data/foldClassification.json")

Create a Word2vec Model

We use the ProtVec model (Asgari et al.) to calculate a 100-dimensional feature vector for each protein sequence. ProtVec uses a Word2vec model (Mikolov et al.) that has been trained on 546,790 sequences in Swiss-Prot. For training, each sequence was split into three lists of non-overlapping 3-grams, one per starting offset, giving 546,790 × 3 = 1,640,370 sequences of 3-grams. The 3-grams represent "biological words" in a protein sequence. In this notebook we use overlapping 3-grams, e.g., sequence: SRMPSPP -> 3-grams: SRM RMP MPS PSP SPP. The ProtVec model is available for download at: https://github.com/ehsanasgari/Deep-Proteomics.

Asgari E, Mofrad MR (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics, PLoS One. 10(11):e0141287. doi: 10.1371/journal.pone.0141287.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. p. 3111–3119.

Read ProtVec Model

Next we read a local copy of the ProtVec model. The ProtVec model is represented as a dictionary, with the 3-gram as the key and the 100-dimensional feature vector as the value.

In [4]:
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")

print("Example ProtVec for 3-gram SRM:\n", protvec['SRM'])
Example ProtVec for 3-gram SRM:
 [-0.349053 -0.034172 -0.14602  -0.112906  0.318846  0.100117 -0.104718
 -0.194695 -0.08249   0.016351 -0.181182  0.109543  0.067238 -0.027135
  0.222703  0.073312 -0.074177 -0.087137 -0.27853   0.003309 -0.065516
 -0.035587  0.042179  0.169955  0.155156 -0.07882   0.203758  0.129488
 -0.009507 -0.033186 -0.007172 -0.039388  0.243934  0.009303  0.043914
 -0.018962 -0.23077  -0.136273  0.027782  0.232346 -0.2341    0.102889
 -0.054253 -0.111376  0.106518 -0.027139 -0.139712 -0.049569  0.057983
 -0.157097  0.090227  0.0228    0.114038  0.017181 -0.015422 -0.035576
 -0.014446  0.000584 -0.292332  0.003074  0.097327  0.072325  0.138753
  0.028772 -0.023035  0.024519  0.123589  0.021453  0.286168  0.094651
 -0.145597  0.132008 -0.104951  0.121934 -0.042467 -0.075287  0.306096
  0.096278 -0.121827  0.167771  0.059359 -0.169576  0.018486 -0.143597
  0.211764  0.171916  0.200995  0.190091 -0.142053  0.022641  0.204606
 -0.083642  0.016121 -0.147855  0.001436 -0.124035  0.00538  -0.177881
  0.116058  0.195754]
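
The dictionary structure described above can be illustrated with a minimal sketch. The actual layout of protVec_100d_3grams.csv and the implementation of protvectors.read_protvectors are not shown in this notebook, so the tab-separated layout, the helper name read_vectors, and the 3-dimensional toy vectors below are assumptions for illustration only.

```python
import io

import numpy as np


def read_vectors(handle, sep="\t"):
    """Hypothetical reader: one line per 3-gram, followed by its vector values.

    The real file format may differ; this only illustrates the resulting
    dictionary of 3-gram -> NumPy vector.
    """
    vectors = {}
    for line in handle:
        fields = line.strip().split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        vectors[fields[0]] = np.array(fields[1:], dtype=float)
    return vectors


# Toy in-memory "file" with two 3-grams and 3-dimensional vectors (made-up values)
toy_file = io.StringIO("SRM\t0.5\t-1.0\t0.25\nRMP\t0.0\t2.0\t-0.5\n")
toy_vectors = read_vectors(toy_file)
print(toy_vectors["SRM"])
```

Storing the vectors in a dictionary keeps the lookup of each 3-gram a constant-time operation, which matters when every sequence contributes hundreds of 3-grams.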

Create 3-grams of the Protein Sequence

Next, we create 3-grams for the protein sequences in our dataset.

In [5]:
# add column ngram to dataframe
df['ngram'] = df.sequence.apply(protvectors.ngrammer, n=3)
df.head(3)
Out[5]:
Exptl. FreeRvalue R-factor alpha beta coil foldClass length pdbChainId resolution secondary_structure sequence ngram
1 XRAY 0.26 0.19 0.469945 0.046448 0.483607 alpha 366 16VP.A 2.100 CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...
1000 XRAY 0.23 0.18 0.504630 0.004630 0.490741 alpha 216 1PBW.B 2.000 CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...
10002 XRAY 0.26 0.22 0.716172 0.006601 0.277228 alpha 303 4TQ3.A 2.408 CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ...
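
The ngram column above contains overlapping 3-grams that advance one residue at a time. The implementation of protvectors.ngrammer is not shown here, so the following is only a sketch of how such a split might be written; the function name overlapping_ngrams is hypothetical.

```python
def overlapping_ngrams(sequence, n=3):
    """Return all overlapping n-grams of a sequence, sliding by one residue.

    Illustrative stand-in, not the notebook's protvectors.ngrammer.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(overlapping_ngrams("SRMPSPP"))
# ['SRM', 'RMP', 'MPS', 'PSP', 'SPP']
```

A sequence shorter than n yields an empty list in this sketch.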

Create a Fixed-sized Feature Vector

Here we create a 100-dimensional feature vector by summing the ProtVectors of all 3-grams in a protein sequence and then standardizing each feature vector to zero mean and unit variance.

In [6]:
df[feature_col] = df.ngram.apply(protvectors.apply_protvectors, protvec=protvec)

df.head(3)
Out[6]:
Exptl. FreeRvalue R-factor alpha beta coil foldClass length pdbChainId resolution secondary_structure sequence ngram features
1 XRAY 0.26 0.19 0.469945 0.046448 0.483607 alpha 366 16VP.A 2.100 CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... [-2.618341208445193, -0.37215537192569575, 0.1...
1000 XRAY 0.23 0.18 0.504630 0.004630 0.490741 alpha 216 1PBW.B 2.000 CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... [-2.4130836608297224, -0.5122827315971855, 0.1...
10002 XRAY 0.26 0.22 0.716172 0.006601 0.277228 alpha 303 4TQ3.A 2.408 CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... [-2.6375752438981404, 0.18385725798670652, 0.2...
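
The summation and standardization can be sketched as follows. This is not the code of protvectors.apply_protvectors, which is not shown in this notebook. The sketch assumes that standardization is applied across the components of each individual vector, and that 3-grams missing from the model are skipped; both are assumptions, as are the function name and the 4-dimensional toy vectors.

```python
import numpy as np


def sum_and_standardize(ngrams, vectors):
    """Sum the vectors of all known n-grams, then scale the result to
    zero mean and unit variance across its components (assumed behavior)."""
    known = [vectors[g] for g in ngrams if g in vectors]  # assumption: skip unknown 3-grams
    if not known:
        raise ValueError("no n-gram of this sequence is present in the model")
    total = np.sum(known, axis=0)
    std = total.std()
    if std == 0:
        raise ValueError("summed vector is constant and cannot be standardized")
    return (total - total.mean()) / std


# Toy model with made-up 4-dimensional vectors
toy_model = {
    "SRM": np.array([1.0, 2.0, 3.0, 4.0]),
    "RMP": np.array([0.0, 1.0, 0.0, 1.0]),
}
toy_features = sum_and_standardize(["SRM", "RMP", "XXX"], toy_model)
print(toy_features.mean(), toy_features.std())
```

Because the vectors are summed, every sequence maps to a vector of the same size regardless of its length, which is what a downstream classifier requires.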

Save DataFrame with Feature Vectors

We save the Pandas DataFrame with protein sequences, fold classifications, and feature vectors to a JSON file for further analysis.

In [7]:
df.to_json("./intermediate_data/features.json")
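
As a small illustration of why JSON is a convenient intermediate format here, the sketch below round-trips a toy DataFrame whose features column holds lists. The column values are made up, and the round trip goes through an in-memory string rather than the file written above.

```python
import io

import pandas as pd

# Toy DataFrame with a list-valued feature column (made-up values)
toy_df = pd.DataFrame(
    {
        "foldClass": ["alpha", "beta"],
        "features": [[0.5, -1.25], [1.0, 2.0]],
    },
    index=[1, 1000],
)

# Serialize to a JSON string and read it back
json_text = toy_df.to_json()
restored = pd.read_json(io.StringIO(json_text))

print(restored["features"].iloc[0])
```

The list-valued cells come back as Python lists, so a later notebook can convert them to NumPy arrays before fitting a model.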

Next step

After you have saved the dataset here, run the next step in the workflow, 3-FitModel.ipynb, or go back to 0-Workflow.ipynb.


Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018