Calculate Features

This notebook reads a dataset with protein sequence and protein fold classification and calculates a feature vector for each protein sequence.

Rule 3: Use Divisions to Make Steps Clear. We use one cell for each distinct task.

Rule 4: Modularize Code. To avoid duplicating code, we have collected several functions in protvectors.py. These functions are also used in 4-Predict.

Rule 8: Share and Explain Your Data. To enable reproducibility, we provide a local copy of a Word2vec model in the /data directory and a file that describes the datasets, with download locations and dates.

import pandas as pd
import protvectors

# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted

df = pd.read_json("./intermediate_data/foldClassification.json")

Create a Word2vec Model

We use the ProtVec model (Asgari et al.) to calculate a 100-dimensional feature vector for each protein sequence. ProtVec uses a Word2vec model (Mikolov et al.) that has been trained on 546,790 sequences in Swiss-Prot using 546,790 × 3 = 1,640,370 sequences of 3-grams. The 3-grams represent "biological words" in a protein sequence, e.g., sequence: SRMPSPP -> 3-grams: SRM RMP MPS PSP SPP. The ProtVec model is available for download at: https://github.com/ehsanasgari/Deep-Proteomics.

Asgari E, Mofrad MR (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics, PLoS One. 10(11):e0141287. doi: 10.1371/journal.pone.0141287.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. p. 3111–3119.
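The factor of 3 in the training-set size can be illustrated with a short sketch. As we read Asgari and Mofrad, each training sequence is split into three lists of non-overlapping 3-grams, one per starting offset. The function name shifted_ngram_lists is hypothetical and is not part of protvectors.py or the ProtVec code.

```python
def shifted_ngram_lists(sequence, n=3):
    """Split a sequence into n lists of non-overlapping n-grams, one per offset.

    Hypothetical helper for illustration only.
    """
    lists = []
    for offset in range(n):
        frame = sequence[offset:]
        # Step by n so the n-grams do not overlap; drop an incomplete tail.
        grams = [frame[i:i + n] for i in range(0, len(frame) - n + 1, n)]
        lists.append(grams)
    return lists


frames = shifted_ngram_lists("SRMPSPP")
for offset, grams in enumerate(frames):
    print(offset, " ".join(grams))
```

Under this reading, one protein sequence contributes three training "sentences", which is consistent with 546,790 × 3 = 1,640,370.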

Read ProtVec Model

Next we read a local copy of the ProtVec model. The ProtVec model is represented as a dictionary, with the 3-gram as the key and the 100-dimensional feature vector as the value.

protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")

print("Example ProtVec for 3-gram SRM:\n", protvec['SRM'])

Example ProtVec for 3-gram SRM:
 [-0.349053 -0.034172 -0.14602  -0.112906  0.318846  0.100117 -0.104718
 -0.194695 -0.08249   0.016351 -0.181182  0.109543  0.067238 -0.027135
  0.222703  0.073312 -0.074177 -0.087137 -0.27853   0.003309 -0.065516
 -0.035587  0.042179  0.169955  0.155156 -0.07882   0.203758  0.129488
 -0.009507 -0.033186 -0.007172 -0.039388  0.243934  0.009303  0.043914
 -0.018962 -0.23077  -0.136273  0.027782  0.232346 -0.2341    0.102889
 -0.054253 -0.111376  0.106518 -0.027139 -0.139712 -0.049569  0.057983
 -0.157097  0.090227  0.0228    0.114038  0.017181 -0.015422 -0.035576
 -0.014446  0.000584 -0.292332  0.003074  0.097327  0.072325  0.138753
  0.028772 -0.023035  0.024519  0.123589  0.021453  0.286168  0.094651
 -0.145597  0.132008 -0.104951  0.121934 -0.042467 -0.075287  0.306096
  0.096278 -0.121827  0.167771  0.059359 -0.169576  0.018486 -0.143597
  0.211764  0.171916  0.200995  0.190091 -0.142053  0.022641  0.204606
 -0.083642  0.016121 -0.147855  0.001436 -0.124035  0.00538  -0.177881
  0.116058  0.195754]
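The dictionary layout described above can be sketched as follows. This is not the implementation of protvectors.read_protvectors. It assumes a tab-delimited layout with the 3-gram in the first column, which may differ from the actual file, and it reads a toy 3-dimensional table from memory instead of the real 100-dimensional model. The name read_vectors is hypothetical.

```python
import csv
import io

import numpy as np


def read_vectors(handle, delimiter="\t"):
    """Read rows of '<3-gram> <v1> <v2> ...' into a dict of NumPy arrays.

    Hypothetical helper; the delimiter and column order are assumptions.
    """
    vectors = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        key, values = row[0], row[1:]
        vectors[key] = np.array(values, dtype=float)
    return vectors


# Toy data standing in for the model file (3 dimensions instead of 100).
toy_file = io.StringIO("SRM\t0.1\t-0.2\t0.3\nRMP\t0.0\t0.5\t-0.5\n")
toy_vectors = read_vectors(toy_file)
print(toy_vectors["SRM"])
```

Keying the dictionary by 3-gram makes the later lookup for each 3-gram in a sequence a constant-time operation.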

Create 3-grams of the Protein Sequence

Next, we create 3-grams for the protein sequences in our dataset.

# add column ngram to dataframe
df['ngram'] = df.sequence.apply(protvectors.ngrammer, n=3)
df.head(3)
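For readers without protvectors.py at hand, the following minimal sketch shows one way to produce overlapping 3-grams such as SRM RMP MPS PSP SPP for SRMPSPP. The function sliding_ngrams is a hypothetical stand-in; protvectors.ngrammer may be implemented differently.

```python
def sliding_ngrams(sequence, n=3):
    """Return all overlapping n-grams of a sequence, moving one residue at a time.

    Hypothetical stand-in for illustration only.
    """
    # A sequence shorter than n yields an empty list.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


example = sliding_ngrams("SRMPSPP")
print(example)
```

A sequence of length L yields L - n + 1 overlapping n-grams, so the 7-residue example gives five 3-grams.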

Create a Fixed-sized Feature Vector

Here we create a 100-dimensional feature vector by adding up the ProtVectors for all 3-grams in a protein sequence and then standardizing each feature vector to zero mean and unit variance.

df[feature_col] = df.ngram.apply(protvectors.apply_protvectors, protvec=protvec)

df.head(3)
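The summing and standardizing step can be sketched as below. This is an illustration under stated assumptions, not the code of protvectors.apply_protvectors: the name sum_and_standardize is hypothetical, 3-grams missing from the model are simply skipped, and standardization is taken over the components of each individual vector. The toy model uses 3 dimensions instead of 100.

```python
import numpy as np


def sum_and_standardize(ngrams, vectors, dim):
    """Sum the vectors of all known n-grams, then scale to zero mean, unit variance.

    Hypothetical helper for illustration only.
    """
    total = np.zeros(dim)
    for gram in ngrams:
        if gram in vectors:  # assumption: unknown n-grams are ignored
            total += vectors[gram]
    centered = total - total.mean()
    std = total.std()
    # Guard against division by zero for constant vectors.
    return centered if std == 0 else centered / std


toy_model = {
    "SRM": np.array([1.0, 2.0, 3.0]),
    "RMP": np.array([0.0, 2.0, 4.0]),
}
feature = sum_and_standardize(["SRM", "RMP", "XXX"], toy_model, dim=3)
print(feature)
```

Because the vectors are summed, every sequence maps to a vector of the same length regardless of how many 3-grams it contains.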

Save DataFrame with Feature Vectors

We save the Pandas dataframe with protein sequences, fold classifications, and feature vectors to a JSON file for further analysis.

df.to_json("./intermediate_data/features.json")
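As a sketch of how a later step might consume the saved file, the example below round-trips a toy dataframe through JSON and converts the list-valued feature column into a 2-D NumPy array. It uses an in-memory buffer instead of ./intermediate_data/features.json, and the toy values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy dataframe with the same two column names used in this notebook.
toy = pd.DataFrame({
    "foldClass": ["alpha", "beta"],
    "features": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# to_json() without a path returns a JSON string; read it back from a buffer.
buffer = io.StringIO(toy.to_json())
restored = pd.read_json(buffer)

# Each cell holds a list, so stack the lists into an (n_samples, n_features) array.
X = np.array(restored["features"].tolist())
y = restored["foldClass"].to_numpy()
print(X.shape, y)
```

JSON keeps the per-row feature lists intact, which a flat CSV export would not do without extra parsing.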

Next step

After you have saved the dataset here, run the next step in the workflow, 3-FitModel.ipynb, or go back to 0-Workflow.ipynb.

Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018

Output of df.head(3) after adding the ngram column:

	Exptl.	FreeRvalue	R-factor	alpha	beta	coil	foldClass	length	pdbChainId	resolution	secondary_structure	sequence	ngram
1	XRAY	0.26	0.19	0.469945	0.046448	0.483607	alpha	366	16VP.A	2.100	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...	[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...
1000	XRAY	0.23	0.18	0.504630	0.004630	0.490741	alpha	216	1PBW.B	2.000	CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...	MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...	[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...
10002	XRAY	0.26	0.22	0.716172	0.006601	0.277228	alpha	303	4TQ3.A	2.408	CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC...	MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS...	[MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ...

Output of df.head(3) after adding the features column:

	Exptl.	FreeRvalue	R-factor	alpha	beta	coil	foldClass	length	pdbChainId	resolution	secondary_structure	sequence	ngram	features
1	XRAY	0.26	0.19	0.469945	0.046448	0.483607	alpha	366	16VP.A	2.100	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...	[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...	[-2.618341208445193, -0.37215537192569575, 0.1...
1000	XRAY	0.23	0.18	0.504630	0.004630	0.490741	alpha	216	1PBW.B	2.000	CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...	MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...	[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...	[-2.4130836608297224, -0.5122827315971855, 0.1...
10002	XRAY	0.26	0.22	0.716172	0.006601	0.277228	alpha	303	4TQ3.A	2.408	CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC...	MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS...	[MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ...	[-2.6375752438981404, 0.18385725798670652, 0.2...