{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculate Features\n", "This notebook reads a dataset with protein sequence and protein fold classification and calculates a feature vector for each protein sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "\n", "**Rule 3: Use Divisions to Make Steps Clear.** We use one cell for each distinct task.\n", "\n", "\n", "**Rule 4: Modularize Code.** To avoid duplicating code, we have collected several functions in protvectors.py. These functions are also used in 4-Predict.\n", "\n", "\n", "**Rule 8: Share and Explain Your Data.** To enable reproducibility we provide a local copy of a Word2vec model in the /data directory and a file that describes the datasets with download locations and dates.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import protvectors" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# column names\n", "feature_col = \"features\" # feature vector\n", "value_col = \"foldClass\" # fold class to be predicted" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_json(\"./intermediate_data/foldClassification.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a Word2vec Model\n", "We use the **ProtVec model** (Asgari et al.) to calculate a 100-dimensional feature vector for each protein sequence. ProtVec uses a Word2vec model (Mikolov et al.) that has been trained on 546,790 sequences in [Swiss-Prot](https://web.expasy.org/docs/swiss-prot_guideline.html) using 546,790 × 3 = 1,640,370 sequences of 3-grams. The 3-grams represent \"biological words\" in a protein sequence, e.g., sequence: SRMPSPP -> 3-grams: SRM RMP MPS PSP SPP. The **ProtVec** model is available for download at: https://github.com/ehsanasgari/Deep-Proteomics.\n", "\n", "Asgari E, Mofrad MR (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics, PLoS One. 10(11):e0141287. doi: [10.1371/journal.pone.0141287](https://doi.org/10.1371/journal.pone.0141287).\n", "\n", "Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, Distributed representations of words and phrases and their compositionality. In: [Advances in neural information processing systems; 2013. p. 3111–3119.](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read ProtVec Model\n", "Next we read a local copy of the ProtVec model. The ProtVec model is represented as a dictionary, with the 3-gram as the key and the 100-dimensional feature vector as the value." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Example ProtVec for 3-gram SRM:\n", " [-0.349053 -0.034172 -0.14602 -0.112906 0.318846 0.100117 -0.104718\n", " -0.194695 -0.08249 0.016351 -0.181182 0.109543 0.067238 -0.027135\n", " 0.222703 0.073312 -0.074177 -0.087137 -0.27853 0.003309 -0.065516\n", " -0.035587 0.042179 0.169955 0.155156 -0.07882 0.203758 0.129488\n", " -0.009507 -0.033186 -0.007172 -0.039388 0.243934 0.009303 0.043914\n", " -0.018962 -0.23077 -0.136273 0.027782 0.232346 -0.2341 0.102889\n", " -0.054253 -0.111376 0.106518 -0.027139 -0.139712 -0.049569 0.057983\n", " -0.157097 0.090227 0.0228 0.114038 0.017181 -0.015422 -0.035576\n", " -0.014446 0.000584 -0.292332 0.003074 0.097327 0.072325 0.138753\n", " 0.028772 -0.023035 0.024519 0.123589 0.021453 0.286168 0.094651\n", " -0.145597 0.132008 -0.104951 0.121934 -0.042467 -0.075287 0.306096\n", " 0.096278 -0.121827 0.167771 0.059359 -0.169576 0.018486 -0.143597\n", " 0.211764 0.171916 0.200995 0.190091 -0.142053 0.022641 0.204606\n", " -0.083642 0.016121 -0.147855 0.001436 -0.124035 0.00538 -0.177881\n", " 0.116058 0.195754]\n" ] } ], "source": [ "protvec = protvectors.read_protvectors(\"./data/protVec_100d_3grams.csv\")\n", "\n", "print(\"Example ProtVec for 3-gram SRM:\\n\", protvec['SRM'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create 3-grams of the Protein Sequence\n", "Next, we create 3-grams for the protein sequences in our dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Exptl.FreeRvalueR-factoralphabetacoilfoldClasslengthpdbChainIdresolutionsecondary_structuresequencengram
1XRAY0.260.190.4699450.0464480.483607alpha36616VP.A2.100CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...
1000XRAY0.230.180.5046300.0046300.490741alpha2161PBW.B2.000CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...
10002XRAY0.260.220.7161720.0066010.277228alpha3034TQ3.A2.408CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC...MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS...[MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ...
\n", "
" ], "text/plain": [ " Exptl. FreeRvalue R-factor alpha beta coil foldClass \\\n", "1 XRAY 0.26 0.19 0.469945 0.046448 0.483607 alpha \n", "1000 XRAY 0.23 0.18 0.504630 0.004630 0.490741 alpha \n", "10002 XRAY 0.26 0.22 0.716172 0.006601 0.277228 alpha \n", "\n", " length pdbChainId resolution \\\n", "1 366 16VP.A 2.100 \n", "1000 216 1PBW.B 2.000 \n", "10002 303 4TQ3.A 2.408 \n", "\n", " secondary_structure \\\n", "1 CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "1000 CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... \n", "10002 CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... \n", "\n", " sequence \\\n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... \n", "1000 MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... \n", "10002 MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... \n", "\n", " ngram \n", "1 [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... \n", "1000 [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... \n", "10002 [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add column ngram to dataframe\n", "df['ngram'] = df.sequence.apply(protvectors.ngrammer, n=3)\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a Fixed-sized Feature Vector\n", "Here we create a 100-dimensional feature vector by adding up the ProtVectors for all 3-grams in a protein sequence and standardize each feature vector to zero-mean and unit-variance. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Exptl.FreeRvalueR-factoralphabetacoilfoldClasslengthpdbChainIdresolutionsecondary_structuresequencengramfeatures
1XRAY0.260.190.4699450.0464480.483607alpha36616VP.A2.100CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...[SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ...[-2.618341208445193, -0.37215537192569575, 0.1...
1000XRAY0.230.180.5046300.0046300.490741alpha2161PBW.B2.000CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT...MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL...[MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ...[-2.4130836608297224, -0.5122827315971855, 0.1...
10002XRAY0.260.220.7161720.0066010.277228alpha3034TQ3.A2.408CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC...MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS...[MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ...[-2.6375752438981404, 0.18385725798670652, 0.2...
\n", "
" ], "text/plain": [ " Exptl. FreeRvalue R-factor alpha beta coil foldClass \\\n", "1 XRAY 0.26 0.19 0.469945 0.046448 0.483607 alpha \n", "1000 XRAY 0.23 0.18 0.504630 0.004630 0.490741 alpha \n", "10002 XRAY 0.26 0.22 0.716172 0.006601 0.277228 alpha \n", "\n", " length pdbChainId resolution \\\n", "1 366 16VP.A 2.100 \n", "1000 216 1PBW.B 2.000 \n", "10002 303 4TQ3.A 2.408 \n", "\n", " secondary_structure \\\n", "1 CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... \n", "1000 CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... \n", "10002 CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... \n", "\n", " sequence \\\n", "1 SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... \n", "1000 MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... \n", "10002 MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... \n", "\n", " ngram \\\n", "1 [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... \n", "1000 [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... \n", "10002 [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... \n", "\n", " features \n", "1 [-2.618341208445193, -0.37215537192569575, 0.1... \n", "1000 [-2.4130836608297224, -0.5122827315971855, 0.1... \n", "10002 [-2.6375752438981404, 0.18385725798670652, 0.2... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[feature_col] = df.ngram.apply(protvectors.apply_protvectors, protvec=protvec)\n", "\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save DataFrame with Feature Vectors\n", "We save the dataset with protein sequence, fold classification, and feature vectors as a Pandas dataframe for further analysis." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df.to_json(\"./intermediate_data/features.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next step\n", "After you saved the dataset here, run the next step in the workflow [3-FitModel.ipynb](./3-FitModel.ipynb) or go back go back to [0-Workflow.ipynb](./0-Workflow.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Authors:** [Peter W. Rose](mailto:pwrose.ucsd@gmail.com), Shih-Cheng Huang, UC San Diego, October 1, 2018\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }