{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predict Fold Type of a Protein from Protein Sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The notebooks in this directory demonstrate and apply the \"Ten Rules for Reproducible Research in Jupyter Notebooks\". Throughout the notebooks we refer to some the rules we applied.**\n",
    "\n",
    "**For example, this notebook demonstrates:**\n",
    "\n",
    "---\n",
    "\n",
    "**Rule 1: Tell a Story for an Audience.** This notebook was developed to learn how to apply a simple machine learning model to predict protein features based on protein sequences.\n",
    "\n",
    "**Rule 3: Use Divisions to Make Steps Clear.** We broke the workflow into separate notebooks and use this top-level notebook to explain and organize the workflow.\n",
    "\n",
    "**Rule 7: Build a Pipeline.** This notebook describes the entire workflow from data preparation, feature calculation, model fitting, to prediction. The modularity makes it easy to replace one of the steps, for example, use a different method to calculate features or apply a different machine learning model.\n",
    "\n",
    "\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Proteins have four different levels of structure – primary, secondary, tertiary and quaternary. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are:\n",
    "* Alpha helices\n",
    "* Beta sheets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can classify proteins into three major fold classes based on their predominant secondary structure content:\n",
    "* alpha: contains predominantly alpha helices\n",
    "* beta: contains predominantly beta sheets\n",
    "* alpha+beta: contains alpha helices and beta sheets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goal\n",
    "This notebook demonstrates how to create a reproducible record using a machine learning model. We train the model to predict the fold class of a protein given its amino acid sequence using a representative set of 3D structures from the Protein Data Bank.\n",
    "\n",
    "**Run the following notebooks and explore how we applied the Ten Simple Rules.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Create Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we create a dataset with protein secondary structure information obtained from 3D protein chains.\n",
    "\n",
    "Run the following notebook to extract secondary structure information from a representative set of protein chains downloaded from the RCSB Protein Data Bank and assign a fold class to each protein chain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1-CreateDataset.ipynb](./1-CreateDataset.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook saves the dataset in the file `./intermediate_data/foldClassification.json`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Calculate Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Protein sequences cannot be directly used for machine learning. Here we use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.\n",
    "\n",
    "Run the following notebook to calculate feature vectors. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[2-CalculateFeatures.ipynb](./2-CalculateFeatures.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook saves the dataset with feature vectors in the file `./intermediate_data/features.json`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Fit a Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we fit a 3-state classification model using the feature vectors and the given fold classification from the Protein Data Bank dataset.\n",
    "\n",
    "Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[3-FitModel.ipynb](./3-FitModel.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook saves the classification model in the file `./intermediate_data/classifier`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Make a Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we use the trained classifier to predict the fold class from a protein sequence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[4-Predict.ipynb](./4-Predict.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Version and Hardware Information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**Rule 5: Record Dependencies.** Here we use the watermark extension to print software, operating system, and hardware version information.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPython 3.6.3\n",
      "IPython 6.3.1\n",
      "\n",
      "ipywidgets 7.4.0\n",
      "matplotlib 2.2.2\n",
      "numpy 1.14.5\n",
      "pandas 0.22.0\n",
      "sklearn 0.20.0\n",
      "\n",
      "compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)\n",
      "system     : Darwin\n",
      "release    : 17.5.0\n",
      "machine    : x86_64\n",
      "processor  : i386\n",
      "CPU cores  : 4\n",
      "interpreter: 64bit\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n",
    "\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}