{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fit Model\n", "This notebook fits a 3-state classification model on a training set and calculates metrics on a test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Rule 9: Design Your Notebooks to Be Read, Run, and Explored.** We use ipywidgets to present the user with a pull-down menu to select a machine learning model.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import mlutils\n", "from sklearn import svm, metrics\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.externals import joblib\n", "from ipywidgets import widgets" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# column names\n", "feature_col = \"features\" # feature vector\n", "value_col = \"foldClass\" # fold class to be predicted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read data set with fold type classifications and feature vectors" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_json(\"./intermediate_data/features.json\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of data: 5370 \n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Exptl.FreeRvalueR-factoralphabetacoilfeaturesfoldClasslengthngrampdbChainIdresolutionsecondary_structuresequence
1XRAY0.260.190.4699450.0464480.483607[-2.6183412084, -0.37215537190000003, 0.140630...alpha366[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...16VP.A2.1CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...
1000XRAY0.230.180.5046300.0046300.490741[-2.4130836608, -0.5122827316, 0.1969318015, -...alpha216[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...1PBW.B2.0CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...
\n", "
" ], "text/plain": [ " Exptl. FreeRvalue R-factor alpha beta coil \\\n", "1 XRAY 0.26 0.19 0.469945 0.046448 0.483607 \n", "1000 XRAY 0.23 0.18 0.504630 0.004630 0.490741 \n", "\n", " features foldClass length \\\n", "1 [-2.6183412084, -0.37215537190000003, 0.140630... alpha 366 \n", "1000 [-2.4130836608, -0.5122827316, 0.1969318015, -... alpha 216 \n", "\n", " ngram pdbChainId \\\n", "1 [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... 16VP.A \n", "1000 [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... 1PBW.B \n", "\n", " resolution secondary_structure \\\n", "1 2.1 CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "1000 2.0 CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... \n", "\n", " sequence \n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... \n", "1000 MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Total number of data:\", df.shape[0], \"\\n\")\n", "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split dataset into a training set and a test set" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set size: 4027 \n", "\n", "alpha 2107\n", "alpha+beta 1266\n", "beta 654\n", "Name: foldClass, dtype: int64\n", "\n", "Test set size: 1343 \n", "\n", "alpha 703\n", "alpha+beta 422\n", "beta 218\n", "Name: foldClass, dtype: int64\n" ] } ], "source": [ "train, test = train_test_split(df, test_size=0.25, random_state=13, stratify=df[value_col])\n", "print(\"Training set size:\", train.shape[0], \"\\n\")\n", "print(train[value_col].value_counts())\n", "print()\n", "print(\"Test set size:\", test.shape[0], \"\\n\")\n", "print(test[value_col].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select a Classification Method (default SVM)\n", "Using the pull-down menu, you can select one of three machine 
learning models and compare their performance." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "method = widgets.Dropdown(options=['SVM', 'LogisticRegression', 'NeuralNetwork'], description='Method')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c304a2d147484533bca674876ec1235f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Method', options=('SVM', 'LogisticRegression', 'NeuralNetwork'), value='SVM')" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(method)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train a classifier" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n", " max_iter=-1, probability=True, random_state=13, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instantiate the classifier selected in the pull-down menu\n", "if method.value == 'SVM':\n", "    classifier = svm.SVC(gamma='auto', class_weight='balanced', random_state=13, probability=True)\n", "elif method.value == 'LogisticRegression':\n", "    classifier = LogisticRegression(class_weight='balanced', random_state=13, solver='lbfgs', multi_class='auto', max_iter=500)\n", "elif method.value == 'NeuralNetwork':\n", "    # Neural network with one hidden layer of 20 nodes\n", "    classifier = MLPClassifier(hidden_layer_sizes=(20,), random_state=13, early_stopping=True)\n", "\n", "# Fit on the training set; each feature vector is stored as a list in one column\n", "classifier.fit(train[feature_col].tolist(), train[value_col])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make prediction for the test set" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "predicted = 
classifier.predict(test[feature_col].tolist())\n", "expected = test[value_col]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate metrics for the test set" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classification metrics:\n", "\n", " precision recall f1-score support\n", "\n", " alpha 0.88 0.78 0.83 703\n", " alpha+beta 0.65 0.73 0.69 422\n", " beta 0.71 0.80 0.75 218\n", "\n", " micro avg 0.77 0.77 0.77 1343\n", " macro avg 0.75 0.77 0.76 1343\n", "weighted avg 0.78 0.77 0.77 1343\n", "\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Classification metrics:\\n\")\n", "print(metrics.classification_report(expected, predicted))\n", "\n", "cm = metrics.confusion_matrix(expected, predicted)\n", "mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=True, title='Normalized Confusion Matrix')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mlutils.plot_confusion_matrix(cm, classifier.classes_, normalize=False, title='Confusion Matrix')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "The three classification methods: SVM, Logistic Regression, and Neural Network perform about the same on the test dataset. We have not optimized any parameters. We leave this as an excercise for the reader.\n", "\n", "For all three methods, the prediction of the mixed class: alpha+beta has the lowest precision and recall. \n", "\n", "**Limitations of the Model**\n", "\n", "The feature vectors are created by summing ProtVectors for all 3-grams of a protein sequence and this process averages individual contributions. This step may contribute to the lower performance for the alpha+beta class, because alpha and beta related features are averaged together. In addition, the beta class is underrepresented.\n", "\n", "The limiting factor appears to the expressiveness of the feature vector using the ProtVec model. Furthermore, we used a cutoff of 25% alpha and/or beta content to define the fold classes. This means that the model will not perform well on protein sequences with minimal alpha or beta content.\n", "\n", "Alternative feature vectors can be easily explored by replacing the CalculateFeatures step with another method." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the Classification Model" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['./intermediate_data/classifier']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joblib.dump(classifier, \"./intermediate_data/classifier\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Step\n", "After you saved the classification model here, run the next step in the workflow [4-Predict.ipynb](./4-Predict.ipynb) or go back go back to [0-Workflow.ipynb](./0-Workflow.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }