{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make a Prediction\n",
    "In this notebook you can enter a protein sequence and predict the fold classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**Rule 9: Design Your Notebooks to Be Read, Run, and Explored.** We use ipywidgets to present to users a text box to execute a prediction for a protein sequence of their choice. We provide a default sequence to generate a reproducible result.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.externals import joblib\n",
    "import protvectors\n",
    "from ipywidgets import widgets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Enter a Protein Sequence in Text Box\n",
    "We have populated the text box with a default sequence from PDB chain [5YU2.A](https://www.rcsb.org/structure/5YU2) (expected result: alpha+beta)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_box = widgets.Textarea(description='Sequence:', value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "391ce7f51aca431387188610ba44d563",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Textarea(value='GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVY…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(text_box)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Make prediction for: GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL\n"
     ]
    }
   ],
   "source": [
    "sequence = text_box.value\n",
    "print(\"Make prediction for:\", sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Classifier model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifier = joblib.load(\"./intermediate_data/classifier\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,\n",
       "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=True, random_state=13, shrinking=True,\n",
       "  tol=0.001, verbose=False)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate 3-grams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['GAA', 'AAS', 'ASM', 'SMK', 'MKI', 'KII', 'IIN', 'INT', 'NTT', 'TTR', 'TRL', 'RLP', 'LPE', 'PEA', 'EAL', 'ALG', 'LGP', 'GPY', 'PYS', 'YSH', 'SHA', 'HAT', 'ATV', 'TVV', 'VVN', 'VNG', 'NGM', 'GMV', 'MVY', 'VYT', 'YTS', 'TSG', 'SGQ', 'GQI', 'QIP', 'IPL', 'PLN', 'LNV', 'NVD', 'VDG', 'DGK', 'GKI', 'KIV', 'IVS', 'VSA', 'SAD', 'ADV', 'DVQ', 'VQA', 'QAQ', 'AQT', 'QTK', 'TKQ', 'KQV', 'QVL', 'VLE', 'LEN', 'ENL', 'NLK', 'LKV', 'KVV', 'VVL', 'VLE', 'LEE', 'EEA', 'EAG', 'AGS', 'GSD', 'SDL', 'DLN', 'LNS', 'NSV', 'SVA', 'VAK', 'AKA', 'KAT', 'ATI', 'TIF', 'IFI', 'FIK', 'IKD', 'KDM', 'DMN', 'MND', 'NDF', 'DFQ', 'FQK', 'QKI', 'KIN', 'INE', 'NEV', 'EVY', 'VYG', 'YGQ', 'GQY', 'QYF', 'YFN', 'FNE', 'NEH', 'EHK', 'HKP', 'KPA', 'PAR', 'ARS', 'RSC', 'SCV', 'CVE', 'VEV', 'EVA', 'VAR', 'ARL', 'RLP', 'LPK', 'PKD', 'KDV', 'DVK', 'VKV', 'KVE', 'VEI', 'EIE', 'IEL', 'ELV', 'LVS', 'VSK', 'SKI', 'KIK', 'IKE', 'KEL']\n"
     ]
    }
   ],
   "source": [
    "ngrams = protvectors.ngrammer(sequence, n=3)\n",
    "print(ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read ProtVec Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "protvec = protvectors.read_protvectors(\"./data/protVec_100d_3grams.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate Feature Vector using ProtVec Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-2.3101338  -0.32331702  0.78323473 -2.17371844 -0.05491938 -0.04513578\n",
      "  1.22565468 -0.4836296   0.29705189  2.28707543  0.22305553 -0.41752745\n",
      " -0.39105919  0.70440273  0.13918634  0.19040849  0.84646709 -1.56859546\n",
      "  0.02784964 -1.49127177 -0.01911258 -1.30944049 -1.42153082  0.01448804\n",
      "  0.36608489  0.39708845 -0.16598204 -0.15441528 -0.33733611 -1.21403695\n",
      " -0.5650013  -1.30023446  0.56057566 -0.1993203   0.07892812  1.4882538\n",
      "  0.08757783 -0.22068366  0.53207356 -0.09555411  0.20772675  0.67549063\n",
      "  0.52102914 -0.12743451 -0.47274557  0.02531047 -0.91127284 -0.41035579\n",
      "  0.00657577  1.81890208  0.12983772 -0.76028579  1.77282759 -1.40223342\n",
      "  1.04664272 -1.91512564  0.0619787   0.3228361  -0.30558078 -2.91303999\n",
      " -0.25224169  1.90002018  0.20970349  0.03197095  1.54426113  0.86039855\n",
      "  0.22558089  2.08030942  0.80783672 -1.10335774  0.15124303  0.28051646\n",
      "  0.7273708   1.120928    0.87064261 -0.78159259  0.86080856 -0.02225375\n",
      "  0.7175734  -0.98596892 -1.03833133 -0.07469874  0.04270149 -0.09935889\n",
      "  1.53536898 -0.9162694  -1.40590277 -0.72739099 -1.47466438  0.48371854\n",
      "  0.35492522 -1.17893066  0.41448665  0.640024    1.64565464  0.09323905\n",
      " -0.32028101 -1.52077313  0.63261307  2.31153588]\n"
     ]
    }
   ],
   "source": [
    "featureVector = protvectors.apply_protvectors(ngrams, protvec)\n",
    "print(featureVector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict Fold Class\n",
    "We use our classification model to predict the fold class. The class with the highest probability is reported as the final result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sequence:\n",
      "GAASMKIINTTRLPEALGPYSHATVVNGMVYTSGQIPLNVDGKIVSADVQAQTKQVLENLKVVLEEAGSDLNSVAKATIFIKDMNDFQKINEVYGQYFNEHKPARSCVEVARLPKDVKVEIELVSKIKEL\n",
      "\n",
      "Probabilities:\n",
      "['alpha' 'alpha+beta' 'beta']\n",
      "[0.15417294 0.73021587 0.11561119]\n",
      "\n",
      "Prediction: alpha+beta\n"
     ]
    }
   ],
   "source": [
    "predictions = classifier.predict([featureVector])\n",
    "probabilities = classifier.predict_proba([featureVector])\n",
    "\n",
    "print(\"Sequence:\")\n",
    "print(sequence)\n",
    "print(\"\\nProbabilities:\")\n",
    "print(classifier.classes_)\n",
    "print(probabilities[0])\n",
    "print(\"\\nPrediction:\", predictions[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the limitations of the model (see [3-FitModel.ipynb](./3-FitModel.ipynb)). This is not a state-of-the art model to predict protein fold classes, rather it serves as an example how to create a reproducible and interactive workflow with Jupyter Notebooks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n",
    "\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}