R + Docker: Why & How

class: center, middle, inverse, title-slide

# R + Docker: Why & How
### Clayton Yochum
### November 9, 2017

---

layout: true
background-color: #fafaef
<div class="my-footer"><img src="mc_logo_rectangle.png" style="height: 30px;"/></div>

---

## What is Docker?

- Something people _will not_ shut the hell up about 
--
(myself included)

- docker.com: 
--
"Docker is the world's leading software containerization platform"

- _containerization_?

**My Take** 
--
-- Docker is a tool for _packaging software_ so it:

- is easy to run the same way in different places
--
.pull-right[(portable)]

- is easy to throw away and start over
--
.pull-right[(disposable)]

- runs the same way tomorrow, next month, next year
--
.pull-right[(reproducible)]

---
class: center, middle

## We increasingly need to not just _write_ code, but _run it_ somewhere else

## Managing servers is a lot of work, and we want to spend as little time doing it as possible

## **Docker lets me be lazy and worry less about the systems running our code**

---

## If only we knew sooner...

Issues we could have solved with Docker:

- We have a shiny app on AWS we can no longer run locally

- We have another shiny app that takes 1-2 hours to re-deploy when things go wrong

- We've delivered code that took our client multiple hours to find and install non-R dependencies for

Packrat is ~~great~~okay at managing R package dependencies, but Docker makes it easy to handle _the other stuff_:

- system-level dependencies (java, curl, ssl, etc.)

- unpackaged R code (analysis, modeling, etc.)

- R version

---

## Okay, so what's a container?

- It's where your code goes!

- _Not_ a new concept; been in linux for decades 
--
(but Docker started in 2013)

- _Containerizing_ some code mean packing it up into a single object that has both your code and everything your code needs to run

- We call those objects container _images_

- You need special software (docker) to run images, but it should run exactly the same anywhere you run it

- A _container_ is an instance of an image deployed on some computer

---

## Did you just describe a virtual machine?

--
Yeah, pretty much! But:

- Containers are more _lightweight_ and _portable_ than VM's

- Containers are easier to deploy, and you can fit more on one system

- It's easier to tell a container what you need than to tell a VM

- Docker : containers :: Vagrant : virtual machines

- VM's emulate a computer _down to the hardware_, but containers make more assumptions, and share common pieces between them

---

![](cont-vs-vm.png)

_source: Docker, Inc._

---

## Why so damn popular?

.pull-left[
In Software Development:

- ensure consistency between dev/staging/production environments

- makes "dev-ops" easier

- many small pieces (microservices) are easier to evolve than one big piece (monolith)
]

.pull-right[
In Data Science:

- ensure the model runs when it's not on your computer

- would rather build models than babysit servers

- single container can have many different pieces like R, Python, Jupyter, PyTorch, TensorFlow, etc.
]

---

## How do we use it?

--
.pull-left[
You'll Need:

- docker installed (possibly in a VM)

- access to the terminal

- network/internet access
]

.pull-right[
Steps:

1. write a Dockerfile (what your code needs)

2. build an image from the Dockerfile

3. run that image somewhere
]

[The Rocker group](https://github.com/rocker-org) builds & maintains awesome R images on [Docker Hub](hub.docker.com), including:

- plain R

- R + RStudio

- R + RStudio + Tidyverse

- R + Shiny Server

- R + tex and related stuff

Using one of these let's us skip right to step 3, running the image.

---

## Running RStudio Server in a container

If you have docker setup, it's one command to start running RStudio Server with all the Tidyverse pre-installed:

```bash
docker run -d -p 80:8787 rocker/tidyverse
```

This will:

1. Download an image, layer-by-layer, from Docker Hub to your computer

2. Run that image, which includes starting RStudio Server inside the image

3. Expose the RStudio port (8787) through the local machine's port 80

Now open a browser, go to `localhost` (or the IP of your VM), and login to RStudio Server with user/pass `rstudio`/`rstudio`!

Sure, you already have R, RStudio, etc. setup _locally_, but how easy is it to get the same setup on a beefy AWS server?

---

## Other useful docker commands

```bash
# list running containers
docker ps

# stop a container
docker stop <name>

# start a stopped container
docker start <name>

# list runnning AND stopped containers
docker ps -a

# remove a stopped container
docker rm <name>

# list images currently on system
docker images

# remove an image
docker rmi <image-name>  # can't remove unless all related containers are removed
```

---

## Adding our own stuff

For a recent client deliverable, we used that [`rocker/tidyverse`](https://github.com/rocker-org/rocker-versioned/blob/master/tidyverse/3.4.2/Dockerfile) image as a _base_, and added our own code to it by writing our own _Dockerfile_:

```Dockerfile
FROM rocker/tidyverse:3.4.2

USER rstudio

WORKDIR /home/rstudio

COPY clientpackage ./clientpackage/
COPY holdout-files ./holdout-files/
COPY models ./models/
COPY predict-script.R README.Rmd report/report.Rmd ./

RUN R -e "devtools::install('clientpackage', dependencies = TRUE)"

USER root
```

Then we setup Gitlab's CI tools so every time the code updates, Gitlab will build the image for us!

---

## Adding our own stuff

--
Last time we did a bunch of modeling for a certain client, we were using the R package `h2o`, which requires Java to run.
--
 I made a [small Dockerfile](https://github.com/MethodsConsultants/tidyverse-h2o/blob/master/Dockerfile) to add that on top of the `rocker/tidyverse` image:

```Dockerfile
FROM rocker/tidyverse:latest

RUN apt-get update -qq \
  && apt-get -y --no-install-recommends install \
    default-jdk \
    default-jre \
  && R CMD javareconf \
  && install2.r --error \
    --repos 'http://cran.rstudio.com' \
    h2o
```

Because this is public on our Github, I [configured Docker Hub](https://hub.docker.com/r/methodsconsultants/tidyverse-h2o/) to automatically re-build the image anytime the Dockerfile _or the base image_ changes
--
, _for free_!

Then anyone connected to the internet can use this as easily as we used `rocker/tidyverse`:

```bash
docker run -d -p 80:8787 methodsconsultants/tidyverse-h2o
```

---

## Adding stuff outside the Dockerfile

When doing research inside a container, we might not want to write our own Dockerfile that copies our code and/or data into the image

We instead use the `-v` option of `docker run` to make files on our computer available to the container:

```bash
docker run -d -p 80:8787 -v ~/someproject:/home/rstudio/someproject rocker/tidyverse
```

Then if you add/delete/edit those files from within docker, the changes will happen on the host filesystem

For this specific image, t's important to attach our local volume to the home of user `rstudio` since that's the only user available by default

We could use environment variables to make a different user and/or set a password (again, image-specific)

---

## Things we didn't have time for

--
- networking

- environment variables

- why the layered approach is so awesome

- how to combine with CI tools

- pulling images from non-Docker Hub locations (e.g. private Gitlab registries)

- using `docker machine` to make docker hosts for us

- combining Docker with Packrat

- _orchestrating_ containers with Kubernetes, Docker Swarm, AWS ECS

---

## Parting thoughts

--
- Getting your Dockerfile right can take more work up-front, but is often worth the time saved down the line
--
 -- especially when combined with CI tools
    
--

- If you figure out how to install something _once_, you don't have to sweat it again

- Having code that runs _here_ but not _there_ is so last decade

- **Docker gives you portability _now_ and reproducibility _later_.**