Centrally managed scientific software environments

The speed, depth and quality of open source scientific software development have grown consistently for many years. Additionally, the modular nature of languages like Python has resulted in a large number of dependencies needing to be packaged, to the point where no individual or small team can remain abreast of the frequent changes that take place in the scientific software stack. This combination of pace and breadth means that centrally deployed scientific software stacks date quickly, and scientific lab software environments are often out of date, missing new features and bug fixes.

The challenge is heightened by the tension between users' need for stability and their need for modern software. Logically we arrive at either software self-service or the need to deploy multiple centrally managed software environments *to the same systems*.

The solution

This article outlines the tools and a workflow that enable the deployment of multiple up-to-date, centrally managed scientific software environments.

Conda has emerged as the leading tool for managing individual scientific software environments, and was chosen as a basis for extension into the centrally managed use case.

To separate concerns and to simplify development, the deployment was broken into three distinct phases: build, resolve and deploy.

Unlike conda itself, we have separated the "resolve" and "deploy" stages, allowing us to resolve the environment once and then subsequently deploy to a number of different targets such as RPM, tarball or directly to disk.
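The separation of "resolve" from "deploy" can be sketched in a few lines of Python. This is a minimal illustration only, not the code of any tool described here: the function names, the package versions and the RPM file naming are all hypothetical. The point it shows is that the environment is resolved once into a fixed manifest, and every deployment target consumes that same manifest.

```python
def resolve(spec):
    """Stand-in for the resolve phase: pin each requested package.

    The available versions are hypothetical; a real resolve would consult
    a conda channel.
    """
    available = {"numpy": "1.11.1", "scipy": "0.18.0"}
    return tuple(sorted((name, available[name]) for name in spec))


def plan_directory_deploy(manifest, prefix):
    """Describe a direct-to-disk deployment of the resolved manifest."""
    return ["install {}-{} into {}".format(n, v, prefix) for n, v in manifest]


def plan_rpm_deploy(manifest, env_name):
    """Describe an RPM deployment (the file naming is purely illustrative)."""
    return ["{}-pkg-{}-{}.rpm".format(env_name, n, v) for n, v in manifest]


# Resolve once, then hand the same manifest to each deployment target.
manifest = resolve(["scipy", "numpy"])
for line in plan_directory_deploy(manifest, "/opt/envs/default"):
    print(line)
for line in plan_rpm_deploy(manifest, "default"):
    print(line)
```

Because both planners receive an identical manifest, the deployed environments cannot drift apart even when the deployment mechanisms differ.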

Conda channel

A conda channel is simply a platform-specific directory of conda distributions (.tar.bz2 files). A distribution holds a single package: its metadata, including version and dependencies, together with the pre-compiled files that should be installed. When conda is used to install a distribution from a channel, it recursively resolves the package's dependencies and places all necessary files in the desired destination.
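To make the idea of recursive dependency resolution concrete, here is a deliberately simplified sketch in Python. It is not conda's algorithm: real resolution must also satisfy version constraints, which this toy ignores, and it assumes the dependency graph has no cycles. The index contents are illustrative data, not taken from any real channel.

```python
# Toy index: package name -> direct dependencies (illustrative data, no versions).
INDEX = {
    "iris": ["numpy", "netcdf4"],
    "netcdf4": ["numpy", "hdf5"],
    "numpy": [],
    "hdf5": [],
}


def resolve(package, index, resolved=None):
    """Return the package plus all transitive dependencies, dependencies first."""
    if resolved is None:
        resolved = []
    for dep in index[package]:
        if dep not in resolved:
            resolve(dep, index, resolved)
    if package not in resolved:
        resolved.append(package)
    return resolved


print(resolve("iris", INDEX))
```

Asking for a single package therefore yields an install order in which every dependency appears before the package that needs it.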

Generating a conda distribution is best done with conda-build. The input of conda-build is a "recipe" containing metadata about the package, its dependencies, and build steps needed to compile the package. The metadata is stored as a YAML document and is therefore ideal for putting under a VCS such as git.

To facilitate the sharing and canonicalisation of recipes in a single location, conda-forge has been developed. It takes a community-oriented maintenance model that enables package developers and software packagers to come together in a single place and collaborate publicly. Because of this, the majority of recipes for a lab's software stack can be drawn from this public resource into a private repository of recipes. At the time of writing (Sept 2016), there are over 300 maintainers of over 1400 recipes on conda-forge.

As part of developing conda-forge and its predecessor "conda-recipes-scitools", conda-build-all was produced to orchestrate the building of a collection of recipes against a matrix of targeted dependency versions (e.g. python 2.7 and 3.5). This same tool can be used for building a private repository of recipes.
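The notion of building recipes against a matrix of dependency versions can be illustrated with a short Python sketch. This is not how conda-build-all is implemented or invoked; the recipe names and the numpy versions are assumptions made for the example (only python 2.7 and 3.5 come from the text above).

```python
from itertools import product

# Hypothetical recipes and target versions, for illustration only.
recipes = ["recipe-a", "recipe-b"]
matrix = {"python": ["2.7", "3.5"], "numpy": ["1.10", "1.11"]}


def build_cases(recipes, matrix):
    """Pair every recipe with every combination of target dependency versions."""
    keys = sorted(matrix)
    cases = []
    for recipe in recipes:
        for combo in product(*(matrix[key] for key in keys)):
            cases.append((recipe, dict(zip(keys, combo))))
    return cases


for recipe, versions in build_cases(recipes, matrix):
    print(recipe, versions)
```

With two recipes and two versions each of two dependencies, the sketch yields eight builds, which shows how quickly the matrix grows and why orchestration tooling is useful.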

With the build of all recipes complete, we can now move the distributions (the things that were actually built) into a conda channel. At this point individuals could use vanilla conda to manage personal environments by pointing at the created channel(s). Combined with site-wide conda configuration, this option is extremely valuable for development purposes as well as for enabling early testing of recently built distributions.

Environment definition

Whilst conda can be used to manage personal environments, it doesn't easily enable centrally managed deployments in a reproducible manner. Instead we separate the "resolve" and "deploy" steps that are combined in the normal "conda create/install".

Environment definitions themselves are best tracked through a VCS such as git, as this enables environment diffing and provides an audit trail of what changed and when.

conda-gitenv takes a git repository containing an environment specification and resolves this into a full manifest of the environment in a distinct "manifest" branch within the repo. We are able to represent a "release" of a software environment by using git tags, and are therefore able to benefit from continuous development to automatically resolve the environment on a regular basis whilst having the option of putting a human in the loop to do the actual tagging.
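The value of keeping resolved manifests under version control is easiest to see with a diff between two releases. The sketch below assumes a simplified manifest represented as a mapping of package name to pinned version; this is not the actual manifest format written by conda-gitenv, and the package versions are made up for the example.

```python
def diff_manifests(old, new):
    """Compare two manifests given as {package name: pinned version} dicts."""
    added = {name: ver for name, ver in new.items() if name not in old}
    removed = {name: ver for name, ver in old.items() if name not in new}
    changed = {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }
    return added, removed, changed


# Two hypothetical tagged releases of the same environment.
release_1 = {"numpy": "1.10.4", "scipy": "0.17.1", "nose": "1.3.7"}
release_2 = {"numpy": "1.11.1", "scipy": "0.17.1", "pytest": "3.0.2"}

added, removed, changed = diff_manifests(release_1, release_2)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

In practice git itself provides this comparison between tags; the sketch only spells out what a reviewer is looking for before tagging a release.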

Environment deployment

With the environment definition in the form of a git repo in place, we now wish to deploy it. Decoupling of the resolve and deploy steps means that we are able to deploy the same environment in a multitude of ways.

One deployment mechanism that ships with conda-gitenv deploys environments directly to disk. This gives us feature parity with conda itself, with the additional benefit of having the environment git repository to track changes to the deployed environments. This would be an excellent choice for deploying environments to a centralised directory such as a network-mounted disk.

Another deployment option is conda-rpms, which turns the environments into a collection of RPMs for installation on compatible systems (e.g. Red Hat Enterprise Linux, Fedora).

Summary

By separating the build, resolve and deploy phases we have achieved a highly customisable scientific software deployment mechanism.

We have developed conda-forge to enable community-wide recipe sharing and conda-build-all to enable building of these recipes into a fully featured conda channel of distributions.

Using any desired conda channel, conda-gitenv was developed to enable environment representation within a git repository for diffing, reviewing and tagging purposes.

With a git repository containing environment definitions we can deploy the environment in any desired form, including through RPM or by writing directly to disk.