In this notebook, you can enter a protein sequence and predict its fold class.
Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to present a text box in which users can enter a protein sequence of their choice and run a prediction. A default sequence is provided so that the result is reproducible.
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import protvectors
from ipywidgets import widgets
text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')
display(text_box)
sequence = text_box.value
print("Make prediction for:", sequence)
classifier = joblib.load("./intermediate_data/classifier")
classifier
ngrams = protvectors.ngrammer(sequence, n=3)
print(ngrams)
protvec = protvectors.read_protvectors("./data/protVec_100d_3grams.csv")
featureVector = protvectors.apply_protvectors(ngrams, protvec)
print(featureVector)
We use our classification model to predict the fold class. The class with the highest probability is reported as the final result.
predictions = classifier.predict([featureVector])
probabilities = classifier.predict_proba([featureVector])
print("Sequence:")
print(sequence)
print("\nProbabilities:")
print(classifier.classes_)
print(probabilities[0])
print("\nPrediction:", predictions[0])
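The relationship between the printed probabilities and the reported prediction can be sketched in plain Python. This is a minimal illustration, not part of the original workflow: `top_class` is a hypothetical helper, and the class labels and probabilities below are made-up values. It assumes that the classifier's `predict` returns the class with the highest `predict_proba` value, which holds for most scikit-learn classifiers but is not guaranteed for every estimator.

```python
def top_class(classes, probabilities):
    """Return (label, probability) for the most probable class.

    `classes` and `probabilities` are parallel sequences, in the same
    order as `classifier.classes_` and one row of `predict_proba`.
    """
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best_index], probabilities[best_index]


# Hypothetical labels and probabilities, for illustration only
example_classes = ["alpha", "alpha+beta", "beta"]
example_probabilities = [0.2, 0.7, 0.1]

label, probability = top_class(example_classes, example_probabilities)
print("Most probable class:", label, "with probability", probability)
```

In the notebook, the same helper could be called as `top_class(classifier.classes_, probabilities[0])` to check that the reported prediction matches the highest probability.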
Note the limitations of the model (see 3-FitModel.ipynb). This is not a state-of-the-art model for predicting protein fold classes. Rather, it serves as an example of how to create a reproducible and interactive workflow with Jupyter Notebooks.