This notebook fits a 3-state classification model on a training set and calculates metrics on a test set.
Rule 9: Design Your Notebooks to Be Read, Run, and Explored. We use ipywidgets to present the user with a pull-down menu to select a machine learning model.
import pandas as pd
import mlutils
from sklearn import svm, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
from ipywidgets import widgets
# column names
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted
df = pd.read_json("./intermediate_data/features.json")
print("Total number of data:", df.shape[0], "\n")
df.head(2)
train, test = train_test_split(df, test_size=0.25, random_state=13, stratify=df[value_col])
print("Training set size:", train.shape[0], "\n")
print(train[value_col].value_counts())
print()
print("Test set size:", test.shape[0], "\n")
print(test[value_col].value_counts())
Using the pull-down menu below, you can select one of three machine learning models and compare their performance. Note that changing the selection does not re-execute downstream cells; re-run the cells below after picking a different model.
method = widgets.Dropdown(options=['SVM', 'LogisticRegression', 'NeuralNetwork'], description='Method')
display(method)
if method.value == 'SVM':
    classifier = svm.SVC(gamma='auto', class_weight='balanced', random_state=13, probability=True)
elif method.value == 'LogisticRegression':
    classifier = LogisticRegression(class_weight='balanced', random_state=13, solver='lbfgs', multi_class='auto', max_iter=500)
elif method.value == 'NeuralNetwork':
    # Neural network with one hidden layer of 20 nodes (note the one-element tuple)
    classifier = MLPClassifier(hidden_layer_sizes=(20,), random_state=13, early_stopping=True)
classifier.fit(train[feature_col].tolist(), train[value_col])
predicted = classifier.predict(test[feature_col].tolist())
expected = test[value_col]
print("Classification metrics:\n")
print(metrics.classification_report(expected, predicted))
cm = metrics.confusion_matrix(expected, predicted)
mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=True, title='Normalized Confusion Matrix')
mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=False, title='Confusion Matrix')
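mlutils is a local helper module shipped with this workflow. For orientation, here is a minimal sketch of what a plot_confusion_matrix helper like this might look like (assuming matplotlib and NumPy; the actual mlutils implementation may differ):

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix'):
    # Hypothetical stand-in for the helper in mlutils:
    # plot a confusion matrix with per-class labels, optionally row-normalized.
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    ticks = np.arange(len(classes))
    plt.xticks(ticks, classes, rotation=45)
    plt.yticks(ticks, classes)
    fmt = '.2f' if normalize else 'd'
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], fmt), ha='center',
                     color='white' if cm[i, j] > cm.max() / 2 else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()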
The three classification methods (SVM, logistic regression, and neural network) perform about equally well on the test set. We have not optimized any hyperparameters; we leave this as an exercise for the reader.
For all three methods, the mixed class (alpha+beta) has the lowest precision and recall.
Limitations of the Model
The feature vectors are created by summing the ProtVectors of all 3-grams of a protein sequence, a process that averages out individual contributions. This step may contribute to the lower performance for the alpha+beta class, because alpha- and beta-related features are averaged together. In addition, the beta class is underrepresented.
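To make the averaging effect concrete, here is a minimal sketch of the summation step described above (protvec stands for a hypothetical lookup table mapping 3-grams to their ProtVec embeddings; the actual featurization is performed in the CalculateFeatures step):

import numpy as np

def protein_to_feature_vector(sequence, protvec):
    # Enumerate all overlapping 3-grams of the sequence
    ngrams = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    # Summing the per-3-gram embeddings blends alpha- and beta-related
    # signals into a single vector
    return np.sum([protvec[ng] for ng in ngrams], axis=0)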
The limiting factor appears to be the expressiveness of the feature vectors derived from the ProtVec model. Furthermore, we used a cutoff of 25% alpha and/or beta content to define the fold classes, which means the model will not perform well on protein sequences with minimal alpha or beta content.
Alternative feature vectors can be easily explored by replacing the CalculateFeatures step with another method.
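As an illustration only (not part of the original workflow), a simple 20-dimensional amino acid composition vector could serve as one such alternative; composition_vector is a hypothetical replacement for the featurization used above:

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def composition_vector(sequence):
    # Fraction of each of the 20 standard amino acids in the sequence
    total = max(len(sequence), 1)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]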
joblib.dump(classifier, "./intermediate_data/classifier")
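The saved model can later be restored in 4-Predict.ipynb with joblib, for example:

classifier = joblib.load("./intermediate_data/classifier")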
After you have saved the classification model here, run the next step in the workflow, 4-Predict.ipynb, or go back to 0-Workflow.ipynb.