{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Make a Prediction\n", "In this notebook you can enter a protein sequence and predict the fold classification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Rule 9: Design Your Notebooks to Be Read, Run, and Explored.** We use ipywidgets to present to users a text box to execute a prediction for a protein sequence of their choice. We provide a default sequence to generate a reproducible result.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.externals import joblib\n", "import protvectors\n", "from ipywidgets import widgets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enter a Protein Sequence in Text Box\n", "We have populated the text box with a default sequence from PDB chain [5YU2.A](https://www.rcsb.org/structure/5YU2) (expected result: alpha+beta)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "391ce7f51aca431387188610ba44d563", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Textarea(value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVY…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(text_box)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Make prediction for: GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL\n" ] } ], "source": [ "sequence = text_box.value\n", "print(\"Make prediction for:\", sequence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Classifier model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "classifier = joblib.load(\"./intermediate_data/classifier\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=True, random_state=13, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate 3-grams" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['GAA', 'AAS', 'ASM', 'SMK', 'MKI', 'KII', 'IIN', 'INT', 'NTT', 'TTR', 'TRL', 'RLP', 'LPE', 'PEA', 'EAL', 'ALG', 'LGP', 'GPY', 'PYS', 'YSH', 'SHA', 'HAT', 'ATV', 'TVV', 'VVN', 'VNG', 'NGM', 'GMV', 'MVY', 'VYT', 'YTS', 'TSG', 'SGQ', 'GQI', 'QIP', 'IPL', 'PLN', 'LNV', 'NVD', 'VDG', 'DGK', 'GKI', 'KIV', 'IVS', 'VSA', 'SAD', 'ADV', 'DVQ', 'VQA', 'QAQ', 'AQT', 'QTK', 'TKQ', 'KQV', 'QVL', 'VLE', 'LEN', 'ENL', 'NLK', 'LKV', 'KVV', 'VVL', 'VLE', 'LEE', 'EEA', 'EAG', 'AGS', 'GSD', 'SDL', 'DLN', 'LNS', 'NSV', 'SVA', 'VAK', 'AKA', 'KAT', 'ATI', 'TIF', 'IFI', 'FIK', 'IKD', 'KDM', 'DMN', 'MND', 'NDF', 'DFQ', 'FQK', 'QKI', 'KIN', 'INE', 'NEV', 'EVY', 'VYG', 'YGQ', 'GQY', 'QYF', 'YFN', 'FNE', 'NEH', 'EHK', 'HKP', 'KPA', 'PAR', 'ARS', 'RSC', 'SCV', 'CVE', 'VEV', 'EVA', 'VAR', 'ARL', 'RLP', 'LPK', 'PKD', 'KDV', 'DVK', 'VKV', 'KVE', 'VEI', 'EIE', 'IEL', 'ELV', 'LVS', 'VSK', 'SKI', 'KIK', 'IKE', 'KEL']\n" ] } ], "source": [ "ngrams = protvectors.ngrammer(sequence, n=3)\n", "print(ngrams)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read ProtVec Model" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "protvec = protvectors.read_protvectors(\"./data/protVec_100d_3grams.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate Feature Vector using ProtVec Model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-2.3101338 -0.32331702 0.78323473 -2.17371844 -0.05491938 -0.04513578\n", " 1.22565468 -0.4836296 0.29705189 2.28707543 0.22305553 -0.41752745\n", " -0.39105919 0.70440273 0.13918634 0.19040849 0.84646709 -1.56859546\n", " 0.02784964 -1.49127177 -0.01911258 -1.30944049 -1.42153082 0.01448804\n", " 0.36608489 0.39708845 -0.16598204 -0.15441528 -0.33733611 -1.21403695\n", " -0.5650013 -1.30023446 0.56057566 -0.1993203 0.07892812 1.4882538\n", " 0.08757783 -0.22068366 0.53207356 -0.09555411 0.20772675 0.67549063\n", " 0.52102914 -0.12743451 -0.47274557 0.02531047 -0.91127284 -0.41035579\n", " 0.00657577 1.81890208 0.12983772 -0.76028579 1.77282759 -1.40223342\n", " 1.04664272 -1.91512564 0.0619787 0.3228361 -0.30558078 -2.91303999\n", " -0.25224169 1.90002018 0.20970349 0.03197095 1.54426113 0.86039855\n", " 0.22558089 2.08030942 0.80783672 -1.10335774 0.15124303 0.28051646\n", " 0.7273708 1.120928 0.87064261 -0.78159259 0.86080856 -0.02225375\n", " 0.7175734 -0.98596892 -1.03833133 -0.07469874 0.04270149 -0.09935889\n", " 1.53536898 -0.9162694 -1.40590277 -0.72739099 -1.47466438 0.48371854\n", " 0.35492522 -1.17893066 0.41448665 0.640024 1.64565464 0.09323905\n", " -0.32028101 -1.52077313 0.63261307 2.31153588]\n" ] } ], "source": [ "featureVector = protvectors.apply_protvectors(ngrams, protvec)\n", "print(featureVector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict Fold Class\n", "We use our classification model to predict the fold class. The class with the highest probability is reported as the final result." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence:\n", "GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL\n", "\n", "Probabilities:\n", "['alpha' 'alpha+beta' 'beta']\n", "[0.15417294 0.73021587 0.11561119]\n", "\n", "Prediction: alpha+beta\n" ] } ], "source": [ "predictions = classifier.predict([featureVector])\n", "probabilities = classifier.predict_proba([featureVector])\n", "\n", "print(\"Sequence:\")\n", "print(sequence)\n", "print(\"\\nProbabilities:\")\n", "print(classifier.classes_)\n", "print(probabilities[0])\n", "print(\"\\nPrediction:\", predictions[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the limitations of the model (see [3-FitModel.ipynb](./3-FitModel.ipynb)). This is not a state-of-the art model to predict protein fold classes, rather it serves as an example how to create a reproducible and interactive workflow with Jupyter Notebooks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }