
Conformational Dynamics in MSMBuilder3

Kyle A. Beauchamp
Updated Feb. 27, 2015 (msmbuilder v3.1)

Old-School Analysis of MD Data

  • Analysis happens in walled gardens (Gromacs, Amber, VMD)
  • Exclusively command line interfaces, C and Fortran code
  • Duplication of statistical algorithms by non-experts (e.g. chemists)
  • Possible code maintainability issues?

Jarvis Patrick Clustering in Gromacs

Real code from Gromacs:


static void jarvis_patrick(int n1, real **mat, int M, int P,
                           real rmsdcut, t_clusters *clust) {
    t_dist *row;
    t_clustid *c;
    int **nnb;
    int i, j, k, cid, diff, max;
    gmx_bool bChange;
    real **mcpy = NULL;

    if (rmsdcut < 0) {
        rmsdcut = 10000;
    }

    /* First we sort the entries in the RMSD matrix row by row.
     * This gives us the nearest neighbor list.
     */

Jarvis Patrick Clustering in Gromacs (Cont.)


// Five more pages of this
// You get the idea

// Also, how do we even use this function?
static void jarvis_patrick(int n1, real **mat, int M, int P,
                           real rmsdcut, t_clusters *clust);

Enter Data Science

  • Machine learning is mainstream now!
  • Thousands of experts are using machine learning approaches
  • Well-tested, performant, and easy-to-use implementations are available
  • Writing your own is not the way to go! (See the sketch below.)
    • E.g., is clustering so special and MD-specific that we need our own custom algorithms and implementations? No.
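
For contrast with the Gromacs excerpt above, here is a minimal sketch of off-the-shelf clustering in scikit-learn (the data is random and purely illustrative):


>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> X = np.random.normal(size=(100, 3))
>>> labels = KMeans(n_clusters=4).fit_predict(X)  # cluster assignment per sample

A few readable lines, maintained and tested by the scikit-learn community.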

MSMBuilder3: Design

Builds on scikit-learn idioms:

  • Everything is a Model.
  • Models are fit() on data.
  • Models learn attributes_.
  • Pipeline() concatenates models.
  • Uses best practices (e.g., cross-validation).
http://rmcgibbo.org/posts/whats-new-in-msmbuilder3/

Everything is a Model()!


>>> import msmbuilder.cluster
>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4)

>>> import msmbuilder.decomposition
>>> tica = msmbuilder.decomposition.tICA(n_components=3)

>>> import msmbuilder.msm
>>> msm = msmbuilder.msm.MarkovStateModel()

Hyperparameters go in the constructor.

Actually, everything is a sklearn.base.BaseEstimator()
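
Because every model is a BaseEstimator, hyperparameter introspection comes for free; a minimal sketch (the exact dict contents depend on the estimator):


>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4)
>>> clusterer.get_params()              # hyperparameters as a dict, e.g. {'n_clusters': 4, ...}
>>> clusterer.set_params(n_clusters=8)  # update hyperparameters in place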

Models fit() data!


>>> import numpy as np
>>> import msmbuilder.cluster

>>> trajectories = [np.random.normal(size=(100, 3))]

>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4, n_init=10)
>>> clusterer.fit(trajectories)

>>> clusterer.cluster_centers_

array([[-0.22340896,  1.0745301 , -0.40222902],
       [-0.25410827, -0.11611431,  0.95394687],
       [ 1.34302485,  0.14004818,  0.01130485],
       [-0.59773874, -0.82508303, -0.95703567]])


Estimated parameters always have trailing underscores!
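
The underscore also tells you when an attribute exists: estimated parameters are only set by fit(). A sketch of the failure mode (standard sklearn-style behavior is assumed here):


>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4)
>>> clusterer.cluster_centers_  # raises AttributeError: only fit() sets this attribute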

fit() acts on lists of sequences


>>> import numpy as np
>>> import msmbuilder.msm

>>> trajectories = [np.array([0, 0, 0, 1, 1, 1, 0, 0])]

>>> msm = msmbuilder.msm.MarkovStateModel()
>>> msm.fit(trajectories)

>>> msm.transmat_

array([[ 0.75      ,  0.25      ],
       [ 0.33333333,  0.66666667]])

This is different from sklearn, which uses 2D arrays.
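
Because fit() takes a list, trajectories of different lengths can be pooled into a single model; a minimal sketch:


>>> import numpy as np
>>> import msmbuilder.msm

>>> trajectories = [np.array([0, 0, 0, 1, 1, 1, 0, 0]),
...                 np.array([1, 1, 0, 0, 0, 0])]

>>> msm = msmbuilder.msm.MarkovStateModel()
>>> msm.fit(trajectories)  # transition counts are pooled across both sequences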

Models transform() data!


>>> import numpy as np
>>> import msmbuilder.cluster

>>> trajectories = [np.random.normal(size=(100, 3))]

>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=8, n_init=10)
>>> clusterer.fit(trajectories)
>>> Y = clusterer.transform(trajectories)
>>> Y

[array([5, 6, 6, 0, 5, 5, 1, 6, 1, 7, 5, 7, 4, 2, 2, 2, 5, 3, 0, 0, 1, 3, 0,
        5, 5, 0, 4, 0, 0, 3, 4, 7, 3, 5, 5, 5, 6, 1, 1, 0, 0, 7, 4, 4, 2, 6,
        1, 4, 2, 0, 2, 4, 4, 5, 2, 6, 3, 2, 0, 6, 3, 0, 7, 7, 7, 0, 0, 0, 3,
        3, 2, 7, 6, 7, 2, 5, 1, 0, 3, 6, 3, 2, 0, 5, 0, 3, 4, 2, 5, 4, 1, 5,
        5, 4, 3, 3, 7, 2, 1, 4], dtype=int32)]

transform() moves each data item from one "space" / representation into another.
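
The same idiom covers continuous maps, not just discrete cluster assignments; a minimal sketch with tICA (default lag time assumed, data illustrative):


>>> import numpy as np
>>> import msmbuilder.decomposition

>>> trajectories = [np.random.normal(size=(100, 3))]

>>> tica = msmbuilder.decomposition.tICA(n_components=2)
>>> tica.fit(trajectories)
>>> Y = tica.transform(trajectories)  # list of (100, 2) arrays in tICA space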

Pipeline() concatenates models!


>>> import numpy as np
>>> import msmbuilder.cluster, msmbuilder.msm
>>> from sklearn.pipeline import Pipeline

>>> trajectories = [np.random.normal(size=(100, 3))]

>>> clusterer = msmbuilder.cluster.KMeans(n_clusters=2, n_init=10)
>>> msm = msmbuilder.msm.MarkovStateModel()
>>> pipeline = Pipeline([("clusterer", clusterer), ("msm", msm)])

>>> pipeline.fit(trajectories)
>>> msm.transmat_

array([[ 0.53703704,  0.46296296],
       [ 0.53333333,  0.46666667]])

Data "flows" through transformations in the pipeline.

Featurizing Trajectories

Featurizers wrap MDTraj analysis functions behind the transform() method


>>> import numpy as np
>>> from msmbuilder.featurizer import DihedralFeaturizer
>>> from msmbuilder.example_datasets import fetch_alanine_dipeptide
>>> from matplotlib.pyplot import hexbin, plot

>>> trajectories = fetch_alanine_dipeptide()["trajectories"]
>>> featurizer = DihedralFeaturizer(
...        ["phi", "psi"], sincos=False)
>>> X = featurizer.transform(trajectories)
>>> phi, psi = np.rad2deg(np.concatenate(X).T)

>>> hexbin(phi, psi)

Loading Trajectories


>>> import glob
>>> import mdtraj as md

>>> filenames = glob.glob("./Trajectories/*.h5")
>>> trajectories = [md.load(filename) for filename in filenames]

Note: for big datasets, you can get fancy with md.iterload.
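
md.iterload streams a trajectory in fixed-size chunks instead of loading it whole; a minimal sketch (the filename and the per-chunk process() function are hypothetical):


>>> import mdtraj as md

>>> for chunk in md.iterload("./Trajectories/traj0.h5", chunk=100):
...     process(chunk)  # process() stands in for your per-chunk analysis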

Old-school MSMs


>>> import numpy as np
>>> from msmbuilder import example_datasets, cluster, msm, featurizer
>>> from sklearn.pipeline import make_pipeline

>>> dataset = example_datasets.alanine_dipeptide.fetch_alanine_dipeptide()  # From Figshare!
>>> trajectories = dataset["trajectories"]  # List of MDTraj Trajectory Objects

>>> clusterer = cluster.KCenters(n_clusters=10, metric="rmsd")
>>> msm_model = msm.MarkovStateModel()

>>> pipeline = make_pipeline(clusterer, msm_model)
>>> pipeline.fit(trajectories)

Old-school MSMs (contd.)

# ...
>>> dih_featurizer = featurizer.DihedralFeaturizer(["phi", "psi"], sincos=False)
>>> X = dih_featurizer.transform(trajectories)
>>> phi, psi = np.rad2deg(np.concatenate(X).T)

>>> hexbin(phi, psi)
>>> phi, psi = np.rad2deg(dih_featurizer.transform([clusterer.cluster_centers_])[0].T)
>>> plot(phi, psi, 'w*', markersize=25)

Old-school MSMs (contd.)

# ...
>>> # KCenters fit with metric="rmsd" stores its centers as an md.Trajectory,
>>> # so they can be written straight to a PDB file:
>>> clusterer.cluster_centers_.save("./cluster_centers.pdb")

Cross Validation


from sklearn.cross_validation import KFold

cv = KFold(len(trajectories), n_folds=5)

# "model" is any MSMBuilder estimator or pipeline, e.g. the one built above
for fold, (train_index, test_index) in enumerate(cv):
    train_data = [trajectories[i] for i in train_index]
    test_data = [trajectories[i] for i in test_index]
    model.fit(train_data)
    model.score(test_data)

Scikit-learn's GridSearchCV and RandomizedSearchCV can automate this search, as sketched below.
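
A sketch with GridSearchCV scanning a Pipeline hyperparameter via the step__param naming convention (grid values illustrative; the import path matches scikit-learn of this era):


from sklearn.grid_search import GridSearchCV

param_grid = {"clusterer__n_clusters": [5, 10, 50]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(trajectories)
print(search.best_params_)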