Predict Fold Type of a Protein from Protein Sequence

The notebooks in this directory demonstrate and apply the "Ten Rules for Reproducible Research in Jupyter Notebooks". Throughout the notebooks we refer to some of the rules we applied.

For example, this notebook demonstrates:


Rule 1: Tell a Story for an Audience. We developed this notebook to show how to apply a simple machine learning model to predict protein features from protein sequences.

Rule 3: Use Divisions to Make Steps Clear. We broke the workflow into separate notebooks and use this top-level notebook to explain and organize the workflow.

Rule 7: Build a Pipeline. This notebook describes the entire workflow, from data preparation through feature calculation and model fitting to prediction. The modularity makes it easy to replace one of the steps, for example, to use a different method to calculate features or to apply a different machine learning model.


Introduction

Proteins have four different levels of structure – primary, secondary, tertiary and quaternary. Secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are:

  • Alpha helices
  • Beta sheets

We can classify proteins into three major fold classes based on their predominant secondary structure content:

  • alpha: contains predominantly alpha helices
  • beta: contains predominantly beta sheets
  • alpha+beta: contains both alpha helices and beta sheets
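
As a minimal sketch of this classification, a fold class can be assigned from the fraction of residues found in each secondary structure type. The function name, the one-letter codes ('H' for helix, 'E' for strand), and the threshold values below are illustrative assumptions, not the values used in the notebooks.

```python
def assign_fold_class(sec_struct, min_major=0.25, max_minor=0.05):
    """Assign a fold class from a per-residue secondary structure string.

    Assumption: 'H' marks a helix residue and 'E' a strand residue;
    every other character counts as coil. Thresholds are hypothetical.
    """
    n = len(sec_struct)
    if n == 0:
        return "other"

    # Fraction of residues in helices and in strands.
    alpha = sec_struct.count("H") / n
    beta = sec_struct.count("E") / n

    if alpha >= min_major and beta <= max_minor:
        return "alpha"
    if beta >= min_major and alpha <= max_minor:
        return "beta"
    if alpha >= min_major and beta >= min_major:
        return "alpha+beta"
    # Chains without a clear predominant content fall outside the three classes.
    return "other"


print(assign_fold_class("HHHHHHCCCC"))
print(assign_fold_class("HHHHEEEECC"))
```

Chains that match none of the three classes are labeled "other" here so that they can be filtered out before model fitting.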

Goal

This notebook demonstrates how to create a reproducible record of a machine learning workflow. We train a model to predict the fold class of a protein from its amino acid sequence, using a representative set of 3D structures from the Protein Data Bank.

Run the following notebooks and explore how we applied the Ten Simple Rules.

1. Create Dataset

First, we create a dataset with protein secondary structure information obtained from 3D protein chains.

Run the following notebook to extract secondary structure information from a representative set of protein chains downloaded from the RCSB Protein Data Bank and assign a fold class to each protein chain.

This notebook saves the dataset in the file ./intermediate_data/foldClassification.json.

2. Calculate Features

Protein sequences cannot be directly used for machine learning. Here we use the Word2vec method to calculate a fixed-sized feature vector for each protein sequence.

Run the following notebook to calculate feature vectors.

This notebook saves the dataset with feature vectors in the file ./intermediate_data/features.json.
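
The idea of a fixed-size feature vector can be sketched as follows: split the sequence into overlapping k-mers ("words"), look up an embedding vector for each word, and average the vectors. This is only an illustration under stated assumptions. In practice the embeddings would be learned by a Word2vec model; here toy_embeddings is a hypothetical hand-made table, and the function names, the choice of k=3, and the averaging step are our assumptions, not necessarily what the feature notebook does.

```python
def split_kmers(sequence, k=3):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_to_vector(sequence, embeddings, k=3):
    """Average the embedding vectors of all known k-mers in a sequence.

    The result has the same length for every sequence, regardless of
    how long the sequence is. K-mers missing from the table are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    count = 0
    for kmer in split_kmers(sequence, k):
        vec = embeddings.get(kmer)
        if vec is None:
            continue
        for j, x in enumerate(vec):
            total[j] += x
        count += 1
    if count == 0:
        return total  # all zeros when no k-mer is known
    return [x / count for x in total]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"MKV": [1.0, 0.0], "KVL": [0.0, 1.0], "VLA": [1.0, 1.0]}

vector = sequence_to_vector("MKVLA", toy_embeddings)
print(len(vector))
```

Because every sequence maps to a vector of the same dimension, sequences of different lengths can be passed to the same classifier.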

3. Fit a Model

Next, we fit a three-class classification model using the feature vectors and the given fold classification from the Protein Data Bank dataset.

Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set.

This notebook saves the classification model in the file ./intermediate_data/classifier.
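
Evaluating a three-class model on a test set comes down to comparing predicted and true fold classes. The sketch below computes the accuracy and the confusion counts in plain Python; the labels in y_true and y_pred are made-up examples, not results from the notebooks, and the function names are our own. The model notebook may report different or additional metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels, for illustration only.
y_true = ["alpha", "beta", "alpha+beta", "alpha", "beta"]
y_pred = ["alpha", "beta", "alpha", "alpha", "alpha+beta"]

print(accuracy(y_true, y_pred))
for (true_cls, pred_cls), n in sorted(confusion_counts(y_true, y_pred).items()):
    print(true_cls, "->", pred_cls, ":", n)
```

The confusion counts show which fold classes the model mixes up, which a single accuracy value hides.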

4. Make a Prediction

Finally, we use the trained classifier to predict the fold class from a protein sequence.

Version and Hardware Information


Rule 5: Record Dependencies. Here we use the watermark extension to print software, operating system, and hardware version information.


In [1]:
%load_ext watermark
%watermark -v -m -p ipywidgets,matplotlib,numpy,pandas,sklearn
CPython 3.6.3
IPython 6.3.1

ipywidgets 7.4.0
matplotlib 2.2.2
numpy 1.14.5
pandas 0.22.0
sklearn 0.20.0

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018