Create Dataset

This notebook extracts from the Protein Data Bank information about the secondary structure of proteins. The ultimate goal is to assign a fold classification from a protein sequence.


Rule 2: Document the Process, Not Just the Results. Here we describe the steps how to produce the dataset.

Rule 7: Build a Pipeline. Besides documenting all steps, the entire process of dataset creation from the original data files in the /data directory is automated. There are no manual steps.

Rule 8: Share and Explain Your Data. To enable reproducibility we provide a /data directory with data files and a file that describes the datasets with download locations and dates.


In [1]:
# column names
value_col = "foldClass" # fold class to be predicted
In [2]:
import pandas as pd
import numpy as np     
import pdbutils 

Read a Representative Set of Protein Chains

Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the PISCES website with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 Å.

Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. doi: 10.1093/nar/gki402

In [3]:
representatives = pdbutils.read_pisces_representatives('./data/cullpdb_pc25_res3.0_R1.0_d180920_chains14051.gz')
print("Number of representative PDB chains:", representatives.shape[0], "\n")
representatives.head()
Number of representative PDB chains: 14051 

Out[3]:
length Exptl. resolution R-factor FreeRvalue pdbChainId
0 330 XRAY 2.2 0.16 0.29 12AS.A
1 366 XRAY 2.1 0.19 0.26 16VP.A
2 348 XRAY 2.6 0.22 0.34 1A0I.A
3 413 XRAY 1.7 0.19 0.22 1A12.C
4 108 XRAY 2.0 0.21 0.25 1A1X.A

Read Protein Sequence and Secondary Structure Data

Secondary structure of proteins is most commonly assigned with the DSSP method.

Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. doi 0.1002/bip.360221211

DSSP defines 8 classes of secondary structure:

DSSP Class Code
4-turn helix (α helix), min. length 4 residues H
Isolated β-bridge (single pair β-sheet hydrogen bond formation) B
Extended strand in parallel or anti-parallel β-sheet conformation, min length 2 residues E
3-turn helix (310 helix), min. length 3 residues G
5-turn helix (π helix), min. length 5 residues I
Hydrogen bonded turn (3, 4 or 5 turn) T
Bend (the only non-hydrogen-bond based assignment) S
Coil (residues which are not in any of the above conformations) C

The RCSB Protein Data Bank provides protein sequences and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe.

The secondary_structure string below is the secondary structure assignment for each amino acid residue in the sequence.

In [4]:
sec_struct = pdbutils.read_secondary_structure('./data/ss_dis.txt.gz')
print("Number of protein chains in PDB:", sec_struct.shape[0], "\n")
sec_struct.head()
Number of protein chains in PDB: 402007 

Out[4]:
pdbChainId secondary_structure sequence
0 101M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
1 102L.A CCHHHHHHHHHCCEEEEEECTTSCEEEETTEEEESSSCTTTHHHHH... MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
2 102M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
3 103L.A CCHHHHHHHHHCCEEEEEECTTSCEEEETTEECCCCCCCCCHHHHH... MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
4 103M.A CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT... MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...

Find the Intersection between the two Datasets

Note the high level of redundancy in the PDB: the representative set contains about 14,000 protein chains, whereas the entire PDB contains more than 400,000 protein chains.

By merging the representative set of protein chains with the secondary structure dataset we obtain the intersection between the two sets. Both datasets contain a unique identifier column pdbChainId that is used to join the datasets.

In [5]:
df = sec_struct.merge(representatives, left_on='pdbChainId', right_on='pdbChainId', how='inner')
print("Number of representative chains with secondary structure data:", df.shape[0])
df.head()
Number of representative chains with secondary structure data: 13791
Out[5]:
pdbChainId secondary_structure sequence length Exptl. resolution R-factor FreeRvalue
0 12AS.A CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... 330 XRAY 2.2 0.16 0.29
1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY 2.1 0.19 0.26
2 1A0I.A CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... 348 XRAY 2.6 0.22 0.34
3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY 1.7 0.19 0.22
4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY 2.0 0.21 0.25

Calculate Secondary Structure Content

To reduce the DSSP 8-state classification to a 3-state classification we group related secondary structure elements: alpha (I, H, G), beta (E, B), and coil (S, T, C). Then we calculate the fraction of the amino acid residues in each of the three classes.

In [6]:
def alpha_fraction(s):
    return (s.count('I') + s.count('H') + s.count('G')) / len(s)


def beta_fraction(s):
    return (s.count('E') + s.count('B')) / len(s)


def coil_fraction(s):
    return (s.count('S') + s.count('T') + s.count('C')) / len(s)


df['alpha'] = df.secondary_structure.apply(alpha_fraction)                                                                               
df['beta'] = df.secondary_structure.apply(beta_fraction)                                                                           
df['coil'] = df.secondary_structure.apply(coil_fraction)                                                                 

df.head()
Out[6]:
pdbChainId secondary_structure sequence length Exptl. resolution R-factor FreeRvalue alpha beta coil
0 12AS.A CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC... MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD... 330 XRAY 2.2 0.16 0.29 0.345455 0.206061 0.448485
1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY 2.1 0.19 0.26 0.469945 0.046448 0.483607
2 1A0I.A CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT... VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT... 348 XRAY 2.6 0.22 0.34 0.232759 0.318966 0.448276
3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY 1.7 0.19 0.22 0.038741 0.418886 0.542373
4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY 2.0 0.21 0.25 0.037037 0.472222 0.490741

Classify Sequences by Secondary Structure Content

Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.

  • alpha: predominantly alpha (> 25%)
  • beta: predominantly beta (> 25%)
  • alpha+beta: significant alpha (> 25%) and beta (> 25%)
  • other: cases that do not fit into the 3 classes above

Protein chains in the other class will be ignored in the subsequent analysis.

In [7]:
def protein_fold_class(data, min_threshold, max_threshold):
    """
    Returns fold classification:
    "alpha", "beta", "alpha+beta", and "other" based upon the 
    fraction of alpha and beta.
    """
    if data.alpha > max_threshold and data.beta < min_threshold:                                
        return "alpha"                                                                        
    elif data.beta > max_threshold and data.alpha < min_threshold:                              
        return "beta"                                                                         
    elif data.alpha > max_threshold and data.beta > max_threshold:                              
        return "alpha+beta"                                                                   
    else:                                                                                     
        return "other"
In [8]:
# assign protein fold class
df[value_col] = df.apply(protein_fold_class, min_threshold=0.05, max_threshold=0.25, axis=1)

# exclude protein chains without a dominant classification from further analysis.
df = df[df[value_col] != 'other']

print("Dataset size", df.shape[0])
df.head()
Dataset size 5370
Out[8]:
pdbChainId secondary_structure sequence length Exptl. resolution R-factor FreeRvalue alpha beta coil foldClass
1 16VP.A CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... 366 XRAY 2.10 0.19 0.26 0.469945 0.046448 0.483607 alpha
3 1A12.C CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC... RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV... 413 XRAY 1.70 0.19 0.22 0.038741 0.418886 0.542373 beta
4 1A1X.A CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE... GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ... 108 XRAY 2.00 0.21 0.25 0.037037 0.472222 0.490741 beta
5 1A2X.B CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHTCCCCCCCCCCCCCCC GDEEKRNRAITARRQHLKSVMLQIAATELEKEEGRREAEKQNYLAEH 47 XRAY 2.30 0.22 0.33 0.574468 0.000000 0.425532 alpha
8 1A62.A CBHHHHHTSCHHHHHHHHHTTTCCCCTTSCHHHHHHHHHHHHHHTT... MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAKSG... 130 XRAY 1.55 0.22 0.25 0.284615 0.261538 0.453846 alpha+beta

Save Dataset

We save the representative dataset with protein sequence and fold classification as a Pandas dataframe for further analysis.

In [9]:
df.to_json("./intermediate_data/foldClassification.json")

Next step

After you saved the dataset here, run the next step in the workflow 2-CalculateFeatures.ipynb or go back to 0-Workflow.ipynb.


Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018