Automating data analysis pipelines

Overview
Install make
Test drive make and RStudio
Hands-on activity
More examples
Resources

Although we spend a lot of time working with data interactively, this sort of hands-on babysitting is not always appropriate. We have a philosophy of “source is real” in this class, and that philosophy can be implemented on a grander scale. Just as we save R code in a script so we can replay analytical steps, we can also record how a series of scripts and commands work together to produce a set of analytical results. This is what we mean by automating data analysis or building an analytical pipeline.

Overview

Slides: why and how we automate data analyses, plus examples.

Install `make`

2015-11-17 NOTE: since we have already set up a build environment for R packages, it is my hope that everyone has Make. These instructions were from 2014, when we did everything in a different order. Cross your fingers and ignore!

Windows installation

(If you are running Mac OS or Linux, make should already be installed.)

Test drive `make` and RStudio

Test drive of make.

Walk before you run! Prove that make is actually installed and that it can be found and executed from the shell and from RStudio. It is also important to tell RStudio NOT to replace tabs with spaces when editing a Makefile, because make requires each recipe line to begin with a tab character (this applies to any text editor).
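A minimal shell sketch of that check, assuming a POSIX-style shell. The file name `Makefile.demo` and the target name `hello` are made up for illustration. `printf` writes the recipe line with a real tab, so the result does not depend on editor settings.

```shell
# Is make on the PATH, and which version is it?
if command -v make >/dev/null 2>&1; then
  make --version | head -n 1
else
  echo "make not found on PATH"
fi

# Write a one-target Makefile; \t puts a real tab before the recipe line.
printf 'hello:\n\t@echo "make works"\n' > Makefile.demo

# Run the target (only if make is available).
if command -v make >/dev/null 2>&1; then
  make -f Makefile.demo hello
fi
```

If make is installed, the last command should print `make works`. An error such as "missing separator" usually means the tab was turned into spaces somewhere along the way.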

Hands-on activity

This fully developed example shows you:

- How to run an R script non-interactively
- How to use make
  - to record which files are inputs vs. intermediates vs. outputs
  - to capture how scripts and commands convert inputs to outputs
  - to re-run parts of an analysis that are out-of-date
- The intersection of R and make, i.e. how to
  - run snippets of R code
  - run an entire R script
  - render an R Markdown document (or R script)
- The interface between RStudio and make
- How to use make from the shell
- How Git facilitates the process of building a pipeline
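To make these points concrete, here is a rough sketch of what such a pipeline could look like. It is not the activity's actual Makefile. Every file name (`words.txt`, `histogram.R`, `histogram.tsv`, `report.Rmd`) is a hypothetical stand-in, and the two placeholder files are empty. The sketch assumes a POSIX-style shell and is best run in an empty scratch directory. The dry run (`make -n`) only prints the commands make would execute, so R does not need to be installed to try it.

```shell
# Placeholder inputs so the dry run can resolve every prerequisite.
# (Hypothetical names; real versions would hold actual R code.)
touch histogram.R report.Rmd

# Recipe lines must start with a tab. They are written here with four
# spaces and converted, so copy-and-paste cannot silently break them.
tab=$(printf '\t')
sed "s/^    /${tab}/" > Makefile <<'EOF'
all: report.html

# Input: created by a one-line snippet of R code.
words.txt:
    Rscript -e 'writeLines(c("apple", "banana", "cherry"), "words.txt")'

# Intermediate: produced by running an entire R script.
histogram.tsv: histogram.R words.txt
    Rscript histogram.R

# Output: an R Markdown document rendered to HTML.
report.html: report.Rmd histogram.tsv
    Rscript -e 'rmarkdown::render("report.Rmd")'

clean:
    rm -f words.txt histogram.tsv report.html

.PHONY: all clean
EOF

# Dry run: print the commands make would execute, without running R.
if command -v make >/dev/null 2>&1; then
  make -n all
fi
```

Each rule names a target, the files it depends on, and the command that builds it. With real scripts in place, running `make` after editing `histogram.R` would rebuild `histogram.tsv` and `report.html` but leave `words.txt` alone, because only targets older than their prerequisites are out-of-date.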

2015-11-19 Andrew MacDonald translated the above into a pipeline for the remake package from Rich Fitzjohn: see this gist.

More examples

There are three more toy pipelines, using the Lord of the Rings data, that reinforce:

01_automation-example_just-r: use of an R script as a pseudo-Makefile
02_automation-example_r-and-make: use of a simple Makefile
03_automation-example_render-without-rstudio: use of rmarkdown::render() from a Makefile, as the default way of running an R script or an R Markdown document, leading to pretty HTML reports without any mouse clicks
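One way to make `rmarkdown::render()` the default is a pattern rule, sketched below under the same caveats: the names `Makefile.render` and `analysis.R` are invented for illustration, and this is not the Makefile from the third example. The sketch assumes a POSIX-style shell and uses a dry run, so neither R nor the rmarkdown package is needed to see which command make would choose.

```shell
# An empty stand-in for a real R script (hypothetical name).
touch analysis.R

# Write the Makefile with four-space indents, then convert them to the
# tabs that make requires at the start of each recipe line.
tab=$(printf '\t')
sed "s/^    /${tab}/" > Makefile.render <<'EOF'
# Any .html file can be built from the .Rmd of the same name...
%.html: %.Rmd
    Rscript -e 'rmarkdown::render("$<")'

# ...or, failing that, from the .R script of the same name.
%.html: %.R
    Rscript -e 'rmarkdown::render("$<")'
EOF

# Dry run: show the command make would use to build analysis.html.
if command -v make >/dev/null 2>&1; then
  make -n -f Makefile.render analysis.html
fi
```

In the recipe, `$<` is make's automatic variable for the first prerequisite, so the same rule serves every report in the directory.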

Resources

xkcd comic on automation. ‘Automating’ comes from the roots ‘auto-’ meaning ‘self-’, and ‘mating’, meaning ‘screwing’.

Karl Broman covers GNU Make in his course Tools for Reproducible Research (see first week)

Karl Broman also wrote An introduction to Make, aimed at stats / data science types

Using Make for reproducible scientific analyses, blog post by Ben Morris

Software Carpentry’s Slides on Make

Zachary M. Jones wrote GNU Make for Reproducible Data Analysis

Keeping tabs on your data analysis workflow, blog post by Adam Laiacano, who works at Tumblr

Mike Bostock, of D3.js and New York Times fame, explains Why Use Make: “it’s about the benefits of capturing workflows via a file-based dependency-tracking build system”

Make for Data Scientists, blog post by Paul Butler, who also made a beautiful map of Facebook connections using R

Other, more modern data-oriented alternatives to make

Drake, a kind of “make for data”
Nextflow for “data-driven computational pipelines”
maker, “Make-like build management, re-imagined for R”

Managing Projects with GNU Make, Third Edition, by Robert Mecklenburg is a fantastic book but, sadly, is very focused on compiling software

RStudio’s website documenting R Markdown is generated from this repo using this 20-line Makefile, which is sort of amazing. This is why we study regular expressions and follow filename conventions, people!

littler is an R package maintained by Dirk Eddelbuettel that “provides the r program, a simplified command-line interface for GNU R.”