- Analysis happens in walled gardens (Gromacs, Amber, VMD)
- Exclusively command line interfaces, C and Fortran code
- Duplication of statistical algorithms by non-experts (e.g. chemists)
- Possible code maintainability issues?
Kyle A. Beauchamp
Updated Feb. 27, 2015 (msmbuilder v3.1)
static void jarvis_patrick(int n1, real **mat, int M, int P, real rmsdcut, t_clusters *clust) { t_dist *row; t_clustid *c; int **nnb; int i, j, k, cid, diff, max; gmx_bool bChange; real **mcpy = NULL; if (rmsdcut < 0) { rmsdcut = 10000; } /* First we sort the entries in the RMSD matrix row by row. * This gives us the nearest neighbor list. */
// Five more pages of this // You get the idea // Also, how do we even use this function? static void jarvis_patrick(int n1, real **mat, int M, int P, real rmsdcut, t_clusters *clust);
Builds on scikit-learn idioms:
Model
.fit()
on data.attributes_
.Pipeline()
concatenate models.Model()
!>>> import msmbuilder.cluster >>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4) >>> import msmbuilder.decomposition >>> tica = msmbuilder.decomposition.tICA(n_components=3) >>> import msmbuilder.msm >>> msm = msmbuilder.msm.MarkovStateModel()
Hyperparameters go in the constructor.
fit()
data!>>> import msmbuilder.cluster >>> trajectories = [np.random.normal(size=(100, 3))] >>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4, n_init=10) >>> clusterer.fit(trajectories) >>> clusterer.cluster_centers_ array([[-0.22340896, 1.0745301 , -0.40222902], [-0.25410827, -0.11611431, 0.95394687], [ 1.34302485, 0.14004818, 0.01130485], [-0.59773874, -0.82508303, -0.95703567]])
Estimated parameters always have trailing underscores!
fit()
acts on lists of sequences>>> import msmbuilder.msm >>> trajectories = [np.array([0, 0, 0, 1, 1, 1, 0, 0])] >>> msm = msmbuilder.msm.MarkovStateModel() >>> msm.fit(trajectories) >>> msm.transmat_ array([[ 0.75 , 0.25 ], [ 0.33333333, 0.66666667]])
This is different from sklearn, which uses 2D arrays.
transform()
data!>>> import msmbuilder.cluster >>> trajectories = [np.random.normal(size=(100, 3))] >>> clusterer = msmbuilder.cluster.KMeans(n_clusters=4, n_init=10) >>> clusterer.fit(trajectories) >>> Y = clusterer.transform(trajectories) [array([5, 6, 6, 0, 5, 5, 1, 6, 1, 7, 5, 7, 4, 2, 2, 2, 5, 3, 0, 0, 1, 3, 0, 5, 5, 0, 4, 0, 0, 3, 4, 7, 3, 5, 5, 5, 6, 1, 1, 0, 0, 7, 4, 4, 2, 6, 1, 4, 2, 0, 2, 4, 4, 5, 2, 6, 3, 2, 0, 6, 3, 0, 7, 7, 7, 0, 0, 0, 3, 3, 2, 7, 6, 7, 2, 5, 1, 0, 3, 6, 3, 2, 0, 5, 0, 3, 4, 2, 5, 4, 1, 5, 5, 4, 3, 3, 7, 2, 1, 4], dtype=int32)]
Moving the data-items from one "space" / representation into another.
Pipeline()
concatenates models!>>> import msmbuilder.cluster, msmbuilder.msm >>> from sklearn.pipeline import Pipeline >>> trajectories = [np.random.normal(size=(100, 3))] >>> clusterer = msmbuilder.cluster.KMeans(n_clusters=2, n_init=10) >>> msm = msmbuilder.msm.MarkovStateModel() >>> pipeline = Pipeline([("clusterer", clusterer), ("msm", msm)]) >>> pipeline.fit(trajectories) >>> msm.transmat_ array([[ 0.53703704, 0.46296296], [ 0.53333333, 0.46666667]])
Data "flows" through transformations in the pipeline.
Featurizers wrap MDTraj functions via the transform()
function
>>> from msmbuilder.featurizer import DihedralFeaturizer >>> from msmbuilder.example_datasets import fetch_alanine_dipeptide >>> from matplotlib.pyplot import hexbin, plot >>> trajectories = fetch_alanine_dipeptide()["trajectories"] >>> featurizer = DihedralFeaturizer( ... ["phi", "psi"], sincos=False) >>> X = featurizer.transform(trajectories) >>> phi, psi = np.rad2deg(np.concatenate(X).T) >>> hexbin(phi, psi)
>>> import glob >>> import mdtraj as md >>> filenames = glob.glob("./Trajectories/*.h5") >>> trajectories = [md.load(filename) for filename in filenames]
Note: for big datasets, you can get fancy with md.iterload
.
>>> from msmbuilder import example_datasets, cluster, msm, featurizer >>> from sklearn.pipeline import make_pipeline >>> dataset = example_datasets.alanine_dipeptide.fetch_alanine_dipeptide() # From Figshare! >>> trajectories = dataset["trajectories"] # List of MDTraj Trajectory Objects >>> clusterer = cluster.KCenters(n_clusters=10, metric="rmsd") >>> msm_model = msm.MarkovStateModel() >>> pipeline = make_pipeline(clusterer, msm_model) >>> pipeline.fit(trajectories)
# ... >>> dih_featurizer = featurizer.DihedralFeaturizer(["phi", "psi"], sincos=False) >>> X = dih_featurizer.transform(trajectories) >>> phi, psi = np.rad2deg(np.concatenate(X).T) >>> hexbin(phi, psi) >>> phi, psi = np.rad2deg(dih_featurizer.transform([clusterer.cluster_centers_])[0].T) >>> plot(phi, psi, 'w*', markersize=25)
# ... >>> clusterer.cluster_centers_.save("./cluster_centers.pdb")
from sklearn.cross_validation import KFold cv = KFold(len(trajectories), n_folds=5) for fold, (train_index, test_index) in enumerate(cv): train_data = [trajectories[i] for i in train_index] test_data = [trajectories[i] for i in test_index] model.fit(train_data) model.score(test_data)
Also scikit-learn's GridSearchCV
and RandomizedSearchCV
.