Create Dataset¶

This notebook extracts from the Protein Data Bank information about the secondary structure of proteins. The ultimate goal is to assign a fold classification from a protein sequence.

Rule 2: Document the Process, Not Just the Results. Here we describe the steps how to produce the dataset.

Rule 7: Build a Pipeline. Besides documenting all steps, the entire process of dataset creation from the original data files in the /data directory is automated. There are no manual steps.

Rule 8: Share and Explain Your Data. To enable reproducibility we provide a /data directory with data files and a file that describes the datasets with download locations and dates.

# column names
value_col = "foldClass" # fold class to be predicted

import pandas as pd
import numpy as np     
import pdbutils

Read a Representative Set of Protein Chains¶

Protein sequences in the Protein Data Bank (PDB) are redundant. For example, there are more than 300 structures of hemoglobin in the PDB. To reduce biases in our analysis, we use a representative set of protein chains, i.e., a set of protein chains with minimal sequence identity among its members. For this analysis we downloaded a non-redundant set from the PISCES website with a maximum of 25% sequence identity and an X-ray resolution of <= 3.0 Å.

Wang G, Dunbrack RL Jr (2005) PISCES: recent improvements to a PDB sequence culling server, Nucleic Acids Res. 33, W94-8. doi: 10.1093/nar/gki402

representatives = pdbutils.read_pisces_representatives('./data/cullpdb_pc25_res3.0_R1.0_d180920_chains14051.gz')
print("Number of representative PDB chains:", representatives.shape[0], "\n")
representatives.head()

Number of representative PDB chains: 14051

Read Protein Sequence and Secondary Structure Data¶

Secondary structure of proteins is most commonly assigned with the DSSP method.

Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22, 2577–637. doi 0.1002/bip.360221211

DSSP defines 8 classes of secondary structure:

DSSP Class	Code
4-turn helix (α helix), min. length 4 residues	H
Isolated β-bridge (single pair β-sheet hydrogen bond formation)	B
Extended strand in parallel or anti-parallel β-sheet conformation, min length 2 residues	E
3-turn helix (3₁₀ helix), min. length 3 residues	G
5-turn helix (π helix), min. length 5 residues	I
Hydrogen bonded turn (3, 4 or 5 turn)	T
Bend (the only non-hydrogen-bond based assignment)	S
Coil (residues which are not in any of the above conformations)	C

The RCSB Protein Data Bank provides protein sequences and DSSP secondary structure assignments. The method below reads a local copy of this file and returns the data as a Pandas dataframe.

The secondary_structure string below is the secondary structure assignment for each amino acid residue in the sequence.

sec_struct = pdbutils.read_secondary_structure('./data/ss_dis.txt.gz')
print("Number of protein chains in PDB:", sec_struct.shape[0], "\n")
sec_struct.head()

Number of protein chains in PDB: 402007

Find the Intersection between the two Datasets¶

Note the high level of redundancy in the PDB: the representative set contains about 14,000 protein chains, whereas the entire PDB contains more than 400,000 protein chains.

By merging the representative set of protein chains with the secondary structure dataset we obtain the intersection between the two sets. Both datasets contain a unique identifier column pdbChainId that is used to join the datasets.

df = sec_struct.merge(representatives, left_on='pdbChainId', right_on='pdbChainId', how='inner')
print("Number of representative chains with secondary structure data:", df.shape[0])
df.head()

Number of representative chains with secondary structure data: 13791

Calculate Secondary Structure Content¶

To reduce the DSSP 8-state classification to a 3-state classification we group related secondary structure elements: alpha (I, H, G), beta (E, B), and coil (S, T, C). Then we calculate the fraction of the amino acid residues in each of the three classes.

def alpha_fraction(s):
    return (s.count('I') + s.count('H') + s.count('G')) / len(s)


def beta_fraction(s):
    return (s.count('E') + s.count('B')) / len(s)


def coil_fraction(s):
    return (s.count('S') + s.count('T') + s.count('C')) / len(s)


df['alpha'] = df.secondary_structure.apply(alpha_fraction)                                                                               
df['beta'] = df.secondary_structure.apply(beta_fraction)                                                                           
df['coil'] = df.secondary_structure.apply(coil_fraction)                                                                 

df.head()

Classify Sequences by Secondary Structure Content¶

Next we classify each protein chain into one of four classes. We use a threshold of 25% to define a predominant class.

alpha: predominantly alpha (> 25%)
beta: predominantly beta (> 25%)
alpha+beta: significant alpha (> 25%) and beta (> 25%)
other: cases that do not fit into the 3 classes above

Protein chains in the other class will be ignored in the subsequent analysis.

def protein_fold_class(data, min_threshold, max_threshold):
    """
    Returns fold classification:
    "alpha", "beta", "alpha+beta", and "other" based upon the 
    fraction of alpha and beta.
    """
    if data.alpha > max_threshold and data.beta < min_threshold:                                
        return "alpha"                                                                        
    elif data.beta > max_threshold and data.alpha < min_threshold:                              
        return "beta"                                                                         
    elif data.alpha > max_threshold and data.beta > max_threshold:                              
        return "alpha+beta"                                                                   
    else:                                                                                     
        return "other"

# assign protein fold class
df[value_col] = df.apply(protein_fold_class, min_threshold=0.05, max_threshold=0.25, axis=1)

# exclude protein chains without a dominant classification from further analysis.
df = df[df[value_col] != 'other']

print("Dataset size", df.shape[0])
df.head()

Dataset size 5370

Save Dataset¶

We save the representative dataset with protein sequence and fold classification as a Pandas dataframe for further analysis.

df.to_json("./intermediate_data/foldClassification.json")

Next step¶

After you saved the dataset here, run the next step in the workflow 2-CalculateFeatures.ipynb or go back to 0-Workflow.ipynb.

Authors: Peter W. Rose, Shih-Cheng Huang, UC San Diego, October 1, 2018

	length	Exptl.	resolution	R-factor	FreeRvalue	pdbChainId
0	330	XRAY	2.2	0.16	0.29	12AS.A
1	366	XRAY	2.1	0.19	0.26	16VP.A
2	348	XRAY	2.6	0.22	0.34	1A0I.A
3	413	XRAY	1.7	0.19	0.22	1A12.C
4	108	XRAY	2.0	0.21	0.25	1A1X.A

	pdbChainId	secondary_structure	sequence
0	101M.A	CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...	MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
1	102L.A	CCHHHHHHHHHCCEEEEEECTTSCEEEETTEEEESSSCTTTHHHHH...	MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
2	102M.A	CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...	MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
3	103L.A	CCHHHHHHHHHCCEEEEEECTTSCEEEETTEECCCCCCCCCHHHHH...	MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
4	103M.A	CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHHCGGGGGGCTT...	MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...

	pdbChainId	secondary_structure	sequence	length	Exptl.	resolution	R-factor	FreeRvalue
0	12AS.A	CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC...	MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...	330	XRAY	2.2	0.16	0.29
1	16VP.A	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...	366	XRAY	2.1	0.19	0.26
2	1A0I.A	CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT...	VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT...	348	XRAY	2.6	0.22	0.34
3	1A12.C	CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...	RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...	413	XRAY	1.7	0.19	0.22
4	1A1X.A	CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...	GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...	108	XRAY	2.0	0.21	0.25

	pdbChainId	secondary_structure	sequence	length	Exptl.	resolution	R-factor	FreeRvalue	alpha	beta	coil
0	12AS.A	CCCCHHHHHHHHHHHHHHHHHHHHHHHCEEECCCCSEEETTSSCSC...	MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...	330	XRAY	2.2	0.16	0.29	0.345455	0.206061	0.448485
1	16VP.A	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...	366	XRAY	2.1	0.19	0.26	0.469945	0.046448	0.483607
2	1A0I.A	CTTCCCCEEEEECCHHHHHHHHHHHSSEEEEECCCSEEEEEEEETT...	VNIKTNPFKAVSFVESAIKKALDNAGYLIAEIKYDGVRGNICVDNT...	348	XRAY	2.6	0.22	0.34	0.232759	0.318966	0.448276
3	1A12.C	CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...	RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...	413	XRAY	1.7	0.19	0.22	0.038741	0.418886	0.542373
4	1A1X.A	CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...	GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...	108	XRAY	2.0	0.21	0.25	0.037037	0.472222	0.490741

	pdbChainId	secondary_structure	sequence	length	Exptl.	resolution	R-factor	FreeRvalue	alpha	beta	coil	foldClass
1	16VP.A	CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS...	SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL...	366	XRAY	2.10	0.19	0.26	0.469945	0.046448	0.483607	alpha
3	1A12.C	CCCCCCCCCCCCCCCCCCCTTCCCCCBEEEEEEECTTSTTCSCTTC...	RRSPPADAIPKSKKVKVSHRSHSTEPGLVLTLGQGDVGQLGLGENV...	413	XRAY	1.70	0.19	0.22	0.038741	0.418886	0.542373	beta
4	1A1X.A	CCCCCCCCCCCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEE...	GSAGEDVGAPPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQ...	108	XRAY	2.00	0.21	0.25	0.037037	0.472222	0.490741	beta
5	1A2X.B	CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHTCCCCCCCCCCCCCCC	GDEEKRNRAITARRQHLKSVMLQIAATELEKEEGRREAEKQNYLAEH	47	XRAY	2.30	0.22	0.33	0.574468	0.000000	0.425532	alpha
8	1A62.A	CBHHHHHTSCHHHHHHHHHTTTCCCCTTSCHHHHHHHHHHHHHHTT...	MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAKSG...	130	XRAY	1.55	0.22	0.25	0.284615	0.261538	0.453846	alpha+beta