Make a Prediction

In this notebook you can enter a protein sequence and predict its fold class.


Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to provide a text box in which users can enter a protein sequence of their choice and run a prediction. The text box is pre-filled with a default sequence so that the notebook produces a reproducible result.


In [1]:
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import protvectors
from ipywidgets import widgets
from IPython.display import display  # explicit import for the display() call below

Enter a Protein Sequence in Text Box

We have populated the text box with a default sequence from PDB chain 5YU2.A (expected result: alpha+beta).

In [2]:
text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')
In [3]:
display(text_box)
In [4]:
sequence = text_box.value
print("Make prediction for:", sequence)
Make prediction for: GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL
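
Because the sequence is typed in by the user, it can help to normalize and check it before computing features. The following is a minimal sketch, not part of the original notebook: the helper name clean_sequence and the restriction to the 20 standard one-letter amino acid codes are assumptions made for illustration.

```python
# Hypothetical helper for illustration; not part of the notebook or of protvectors.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove whitespace, upper-case the input, and reject unexpected residue codes."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("Sequence is empty")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("Unexpected residue codes: " + ", ".join(invalid))
    return seq


print(clean_sequence("gaasm kiin\nttrl"))
```

In the notebook, such a check could be applied to text_box.value before the 3-grams are calculated.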

Load Classifier Model

In [5]:
classifier = joblib.load("./intermediate_data/classifier")
In [6]:
classifier
Out[6]:
SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=13, shrinking=True,
  tol=0.001, verbose=False)

Calculate 3-grams

In [7]:
ngrams = protvectors.ngrammer(sequence, n=3)
print(ngrams)
['GAA', 'AAS', 'ASM', 'SMK', 'MKI', 'KII', 'IIN', 'INT', 'NTT', 'TTR', 'TRL', 'RLP', 'LPE', 'PEA', 'EAL', 'ALG', 'LGP', 'GPY', 'PYS', 'YSH', 'SHA', 'HAT', 'ATV', 'TVV', 'VVN', 'VNG', 'NGM', 'GMV', 'MVY', 'VYT', 'YTS', 'TSG', 'SGQ', 'GQI', 'QIP', 'IPL', 'PLN', 'LNV', 'NVD', 'VDG', 'DGK', 'GKI', 'KIV', 'IVS', 'VSA', 'SAD', 'ADV', 'DVQ', 'VQA', 'QAQ', 'AQT', 'QTK', 'TKQ', 'KQV', 'QVL', 'VLE', 'LEN', 'ENL', 'NLK', 'LKV', 'KVV', 'VVL', 'VLE', 'LEE', 'EEA', 'EAG', 'AGS', 'GSD', 'SDL', 'DLN', 'LNS', 'NSV', 'SVA', 'VAK', 'AKA', 'KAT', 'ATI', 'TIF', 'IFI', 'FIK', 'IKD', 'KDM', 'DMN', 'MND', 'NDF', 'DFQ', 'FQK', 'QKI', 'KIN', 'INE', 'NEV', 'EVY', 'VYG', 'YGQ', 'GQY', 'QYF', 'YFN', 'FNE', 'NEH', 'EHK', 'HKP', 'KPA', 'PAR', 'ARS', 'RSC', 'SCV', 'CVE', 'VEV', 'EVA', 'VAR', 'ARL', 'RLP', 'LPK', 'PKD', 'KDV', 'DVK', 'VKV', 'KVE', 'VEI', 'EIE', 'IEL', 'ELV', 'LVS', 'VSK', 'SKI', 'KIK', 'IKE', 'KEL']
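
The output above consists of overlapping 3-grams, each shifted by one residue. As a rough sketch of that idea (the function name overlapping_ngrams is hypothetical, and the actual implementation of protvectors.ngrammer may differ), a sliding window over the sequence produces the same kind of list:

```python
# Illustrative sketch only; protvectors.ngrammer may be implemented differently.
def overlapping_ngrams(sequence, n=3):
    """Return all overlapping n-grams of a sequence, shifted by one position."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# First seven residues of the default sequence
print(overlapping_ngrams("GAASMKI"))
```

A sequence of length L yields L - n + 1 n-grams, so a sequence shorter than n yields an empty list.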

Read ProtVec Model

In [8]:
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")

Calculate Feature Vector using ProtVec Model

In [9]:
featureVector = protvectors.apply_protvectors(ngrams, protvec)
print(featureVector)
[-2.3101338  -0.32331702  0.78323473 -2.17371844 -0.05491938 -0.04513578
  1.22565468 -0.4836296   0.29705189  2.28707543  0.22305553 -0.41752745
 -0.39105919  0.70440273  0.13918634  0.19040849  0.84646709 -1.56859546
  0.02784964 -1.49127177 -0.01911258 -1.30944049 -1.42153082  0.01448804
  0.36608489  0.39708845 -0.16598204 -0.15441528 -0.33733611 -1.21403695
 -0.5650013  -1.30023446  0.56057566 -0.1993203   0.07892812  1.4882538
  0.08757783 -0.22068366  0.53207356 -0.09555411  0.20772675  0.67549063
  0.52102914 -0.12743451 -0.47274557  0.02531047 -0.91127284 -0.41035579
  0.00657577  1.81890208  0.12983772 -0.76028579  1.77282759 -1.40223342
  1.04664272 -1.91512564  0.0619787   0.3228361  -0.30558078 -2.91303999
 -0.25224169  1.90002018  0.20970349  0.03197095  1.54426113  0.86039855
  0.22558089  2.08030942  0.80783672 -1.10335774  0.15124303  0.28051646
  0.7273708   1.120928    0.87064261 -0.78159259  0.86080856 -0.02225375
  0.7175734  -0.98596892 -1.03833133 -0.07469874  0.04270149 -0.09935889
  1.53536898 -0.9162694  -1.40590277 -0.72739099 -1.47466438  0.48371854
  0.35492522 -1.17893066  0.41448665  0.640024    1.64565464  0.09323905
 -0.32028101 -1.52077313  0.63261307  2.31153588]
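
The feature vector has 100 components, matching the 100-dimensional ProtVec embedding. The sketch below shows one common way to build such a vector: summing the embedding vectors of all 3-grams in the sequence. This is an assumption about what protvectors.apply_protvectors does, not a description of its actual code; the function name sum_ngram_vectors, the toy 4-dimensional embedding table, and the choice to skip unknown 3-grams are all made up for illustration.

```python
import numpy as np

# Toy 4-dimensional embedding table with made-up values;
# the real ProtVec model used in this notebook has 100 dimensions.
toy_protvec = {
    "GAA": np.array([0.1, -0.2, 0.0, 0.5]),
    "AAS": np.array([0.3, 0.1, -0.4, 0.0]),
    "ASM": np.array([-0.2, 0.2, 0.1, 0.1]),
}


def sum_ngram_vectors(ngrams, embeddings, dim):
    """Sum the embedding vectors of the given n-grams into one feature vector."""
    total = np.zeros(dim)
    for gram in ngrams:
        vec = embeddings.get(gram)
        if vec is not None:  # assumption: n-grams missing from the table are skipped
            total += vec
    return total


toy_vector = sum_ngram_vectors(["GAA", "AAS", "ASM", "XXX"], toy_protvec, dim=4)
print(toy_vector)
```

Summing yields a fixed-length vector regardless of sequence length, which is what a classifier such as the SVC loaded above needs as input.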

Predict Fold Class

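We use our classification model to predict the fold class and the probability of each class. The class returned by classifier.predict is reported as the final result; for the default sequence it is also the class with the highest probability.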

In [10]:
predictions = classifier.predict([featureVector])
probabilities = classifier.predict_proba([featureVector])

print("Sequence:")
print(sequence)
print("\nProbabilities:")
print(classifier.classes_)
print(probabilities[0])
print("\nPrediction:", predictions[0])
Sequence:
GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL

Probabilities:
['alpha' 'alpha+beta' 'beta']
[0.15417294 0.73021587 0.11561119]

Prediction: alpha+beta
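
The probabilities are printed in the same order as classifier.classes_, so the two arrays can be paired to make the result easier to read. The following sketch uses the values from the output above; the variable names are chosen for illustration only.

```python
import numpy as np

# Values copied from the notebook output above
classes = np.array(["alpha", "alpha+beta", "beta"])
probs = np.array([0.15417294, 0.73021587, 0.11561119])

# Pair each class label with its probability and sort from most to least likely
ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label:<12}{p:.3f}")

# Label with the highest probability
top_label = classes[np.argmax(probs)]
print("Most probable class:", top_label)
```

For an SVC trained with probability=True, scikit-learn notes that predict_proba can in some cases disagree with predict, so it is worth reporting both, as the notebook does.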

Note the limitations of the model (see 3-FitModel.ipynb). This is not a state-of-the-art model for predicting protein fold classes; rather, it serves as an example of how to create a reproducible and interactive workflow with Jupyter Notebooks.


Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018