This notebook reads a dataset of protein sequences and their fold classifications and calculates a feature vector for each protein sequence.
Rule 3: Use Divisions to Make Steps Clear. We use one cell for each distinct task.
Rule 4: Modularize Code. To avoid duplicating code, we have collected several functions in protvectors.py. These functions are also used in 4-Predict.
Rule 8: Share and Explain Your Data. To enable reproducibility, we provide a local copy of a Word2vec model in the ./data directory and a file that describes the datasets with download locations and dates.
import pandas as pd
import protvectors
# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted
df = pd.read_json("./intermediate_data/foldClassification.json")
We use the ProtVec model (Asgari and Mofrad, 2015) to calculate a 100-dimensional feature vector for each protein sequence. ProtVec uses a Word2vec model (Mikolov et al., 2013) that has been trained on 546,790 sequences in Swiss-Prot using 546,790 × 3 = 1,640,370 sequences of 3-grams. The 3-grams represent "biological words" in a protein sequence, e.g., sequence: SRMPSPP -> 3-grams: SRM RMP MPS PSP SPP. The ProtVec model is available for download at: https://github.com/ehsanasgari/Deep-Proteomics.
Asgari E, Mofrad MR (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 10(11):e0141287. doi: 10.1371/journal.pone.0141287.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems, p. 3111–3119.
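The 3-gram example above can be reproduced with a few lines of Python. This is a minimal sketch, not the code in protvectors.py: the function names `overlapping_ngrams` and `shifted_frames` are hypothetical. The second function reflects our assumption, based on our reading of Asgari and Mofrad, that the factor of 3 in the training set comes from splitting each sequence into three lists of non-overlapping 3-grams, one per starting offset.

```python
def overlapping_ngrams(sequence, n=3):
    """Return every overlapping n-gram, sliding one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def shifted_frames(sequence, n=3):
    """Return n lists of non-overlapping n-grams, one per starting offset.

    Assumption: this mirrors how one sequence yields three training
    "sentences" of 3-grams; the notebook text only states the total count.
    """
    frames = []
    for offset in range(n):
        frame = [sequence[i:i + n]
                 for i in range(offset, len(sequence) - n + 1, n)]
        frames.append(frame)
    return frames


# The example sequence from the text
print(overlapping_ngrams("SRMPSPP"))
for frame in shifted_frames("SRMPSPP"):
    print(frame)
```

Taken together, the three shifted frames contain the same 3-grams as the overlapping list, only grouped by starting offset.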
Next, we read a local copy of the ProtVec model. The ProtVec model is represented as a dictionary, with the 3-gram as the key and the 100-dimensional feature vector as the value.
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")
print("Example ProtVec for 3-gram SRM:\n", protvec['SRM'])
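To illustrate the dictionary representation, here is a minimal sketch of how such a file could be loaded. It is not the implementation of `protvectors.read_protvectors`: the function name `read_vectors` is hypothetical, and we assume a tab-delimited layout with the 3-gram in the first column followed by the vector components. The toy data uses 3 dimensions instead of 100, with made-up values.

```python
import csv
import io


def read_vectors(handle, delimiter="\t"):
    """Build a dict mapping each 3-gram to its vector (a list of floats).

    Assumption: one row per 3-gram, with the 3-gram in the first column
    and the vector components in the remaining columns.
    """
    vectors = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        vectors[row[0]] = [float(value) for value in row[1:]]
    return vectors


# Toy in-memory "file" with invented 3-dimensional vectors
toy_file = io.StringIO("SRM\t0.1\t-0.2\t0.3\nRMP\t0.0\t0.5\t-0.5\n")
toy_vectors = read_vectors(toy_file)
print("Toy vector for 3-gram SRM:", toy_vectors["SRM"])
```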
Next, we create 3-grams for the protein sequences in our dataset.
# add column ngram to dataframe
df['ngram'] = df.sequence.apply(protvectors.ngrammer, n=3)
df.head(3)
Here we create a 100-dimensional feature vector for each protein sequence by summing the ProtVec vectors of all its 3-grams, and then standardize each feature vector to zero mean and unit variance.
df[feature_col] = df.ngram.apply(protvectors.apply_protvectors, protvec=protvec)
df.head(3)
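The summation and standardization step can be sketched as follows. This is an illustration under stated assumptions, not the code of `protvectors.apply_protvectors`: the function name `sum_and_standardize` is hypothetical, we assume that 3-grams missing from the model are skipped, and we assume that standardization is applied across the components of each individual vector. The toy vectors are 3-dimensional with made-up values.

```python
import numpy as np


def sum_and_standardize(ngrams, vectors):
    """Sum the vectors of all known 3-grams, then standardize the result."""
    # Assumption: 3-grams that are not in the model are ignored
    known = [vectors[gram] for gram in ngrams if gram in vectors]
    if not known:
        return None  # no usable 3-grams in this sequence
    total = np.sum(known, axis=0)
    centered = total - total.mean()
    std = total.std()
    # Guard against division by zero for a constant vector
    return centered / std if std > 0 else centered


# Toy model with invented values
toy_model = {
    "SRM": np.array([1.0, 0.0, 2.0]),
    "RMP": np.array([0.0, 1.0, 3.0]),
}
features = sum_and_standardize(["SRM", "RMP", "XXX"], toy_model)
print(features, features.mean(), features.std())
```

Summing the 3-gram vectors makes the raw magnitude grow with sequence length; standardizing each vector afterwards puts sequences of different lengths on a comparable scale.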
We save the dataframe with protein sequences, fold classifications, and feature vectors to a JSON file for further analysis.
df.to_json("./intermediate_data/features.json")
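As a small check of the save step, the sketch below round-trips a toy dataframe through JSON in memory to show that list-valued feature columns survive serialization. The column names mirror the ones used in this notebook; the values are invented for illustration, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# Toy dataframe with a list-valued feature column (values are made up)
df_demo = pd.DataFrame({
    "sequence": ["SRMPSPP", "MKV"],
    "foldClass": ["alpha", "beta"],
    "features": [[0.5, -1.0, 0.5], [1.0, 0.0, -1.0]],
})

# Serialize to a JSON string instead of a file, then read it back
json_text = df_demo.to_json()
restored = pd.read_json(io.StringIO(json_text))

print(restored)
```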
After you have saved the dataset, run the next step in the workflow, 3-FitModel.ipynb, or go back to 0-Workflow.ipynb.