Fit Model¶

This notebook fits a 3-state classification model on a training set and calculates metrics on a test set.

Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to present the user with a pull-down menu to select a machine learning model.

import pandas as pd
import mlutils
from sklearn import svm, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
from ipywidgets import widgets

# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted

Read data set with fold type classifications and feature vectors¶

df = pd.read_json("./intermediate_data/features.json")

print("Total number of data:", df.shape[0], "\n")
df.head(2)

Total number of data: 5370

Split dataset into a training set and a test set¶

train, test = train_test_split(df, test_size=0.25, random_state=13, stratify=df[value_col])
print("Training set size:", train.shape[0], "\n")
print(train[value_col].value_counts())
print()
print("Test set size:", test.shape[0], "\n")
print(test[value_col].value_counts())

Training set size: 4027 

alpha         2107
alpha+beta    1266
beta           654
Name: foldClass, dtype: int64

Test set size: 1343 

alpha         703
alpha+beta    422
beta          218
Name: foldClass, dtype: int64

Select a Classification Method (default SVM)¶

Using the pull-down menu, you can select one of three machine learning models and compare the performance.

method = widgets.Dropdown(options=['SVM', 'LogisticRegression', "NeuralNetwork"],description='Method')

display(method)

Train a classifier¶

if (method.value == 'SVM'):
    classifier = svm.SVC(gamma='auto', class_weight='balanced', random_state=13, probability=True)
elif (method.value == 'LogisticRegression'):
    classifier = LogisticRegression(class_weight='balanced', random_state=13, solver='lbfgs', multi_class='auto', max_iter=500)
elif (method.value == 'NeuralNetwork'):
    # Neural network with one hidden layer of 20 nodes
    classifier = MLPClassifier(hidden_layer_sizes = (20), random_state=13, early_stopping=True)


classifier.fit(train[feature_col].tolist(), train[value_col])

SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=13, shrinking=True,
  tol=0.001, verbose=False)

Make prediction for the test set¶

predicted = classifier.predict(test[feature_col].tolist())
expected = test[value_col]

Calculate metrics for the test set¶

print("Classification metrics:\n")
print(metrics.classification_report(expected, predicted))

cm = metrics.confusion_matrix(expected, predicted)
mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=True, title='Normalized Confusion Matrix')

Classification metrics:

              precision    recall  f1-score   support

       alpha       0.88      0.78      0.83       703
  alpha+beta       0.65      0.73      0.69       422
        beta       0.71      0.80      0.75       218

   micro avg       0.77      0.77      0.77      1343
   macro avg       0.75      0.77      0.76      1343
weighted avg       0.78      0.77      0.77      1343

mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=False, title='Confusion Matrix')

Results¶

The three classification methods: SVM, Logistic Regression, and Neural Network perform about the same on the test dataset. We have not optimized any parameters. We leave this as an excercise for the reader.

For all three methods, the prediction of the mixed class: alpha+beta has the lowest precision and recall.

Limitations of the Model

The feature vectors are created by summing ProtVectors for all 3-grams of a protein sequence and this process averages individual contributions. This step may contribute to the lower performance for the alpha+beta class, because alpha and beta related features are averaged together. In addition, the beta class is underrepresented.

The limiting factor appears to the expressiveness of the feature vector using the ProtVec model. Furthermore, we used a cutoff of 25% alpha and/or beta content to define the fold classes. This means that the model will not perform well on protein sequences with minimal alpha or beta content.

Alternative feature vectors can be easily explored by replacing the CalculateFeatures step with another method.

Save the Classification Model¶

joblib.dump(classifier, "./intermediate_data/classifier")

['./intermediate_data/classifier']

Next Step¶

After you saved the classification model here, run the next step in the workflow 4-Predict.ipynb or go back go back to 0-Workflow.ipynb.

Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018

	Exptl.	FreeRvalue	R-factor	alpha	beta	coil	features	foldClass	length	ngram	pdbChainId	resolution	secondary_structure	sequence
1	XRAY	0.26	0.19	0.469945	0.046448	0.483607	[-2.6183412084, -0.37215537190000003, 0.140630...	alpha	366	[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...	16VP.A	2.1	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...
1000	XRAY	0.23	0.18	0.504630	0.004630	0.490741	[-2.4130836608, -0.5122827316, 0.1969318015, -...	alpha	216	[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...	1PBW.B	2.0	CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...	MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...