To run a Docker container, we use the command docker run
followed by the name of the image we want to run. You should already have an ubuntu
image, so let's try to run it
docker run ubuntu
Nothing happens. In this case, we need to give the container a command to execute, or run it interactively. In the first case, we will use the echo
command to print the traditional "Hello World" message.
docker run ubuntu echo "Hello World"
Hello World
or we could print the current date and time
docker run ubuntu date
Fri Nov 11 11:38:39 UTC 2016
In both cases, we ran a single command, printed the output and then exited. To launch an interactive session, we can change the arguments to the run
command to attach standard input (stdin) with -i
and allocate a pseudo-terminal (tty) with -t
. This will drop us into a terminal inside the container. You should notice that the username changes to root
and the machine name is the ID of the container. You can exit the container using exit
docker run -i -t ubuntu /bin/bash
By default, we are running the latest version of Ubuntu. A very useful feature of Docker is that containers can be versioned, so we can always go back to a previous version. There is a file /etc/lsb-release
that will tell us which version of Ubuntu is being run;
docker run ubuntu cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
Alternative versions can be executed by specifying a “tag” after the container name. Ubuntu has tags that correspond to particular versions of the OS.
docker run ubuntu:14.04 cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
Let’s say that someone has released a new tool that sounds amazing. You are itching to try it out, but from past experience you know how much of a pain it can be to install new software. Moreover, if you are experimenting with lots of different packages, your software
directory can get polluted with tools that you run only once. This is a situation where Docker can really help.
As an example, the author of delly
(a tool for calling structural variants) has created a Docker image that we can run. We have previously been running the ubuntu
image, which is an official image in Docker. Typically, Docker images are uploaded to the central Docker Hub
repository and submitted under a particular username or project (dellytools
in this case). It is common for the corresponding Dockerfile to be kept under version control on GitHub.
The main program, delly,
can be run from the container to display its help information;
docker run dellytools/delly delly
**********************************************************************
Program: Delly
This is free software, and you are welcome to redistribute it under
certain conditions (GPL); for license details use '-l'.
This program comes with ABSOLUTELY NO WARRANTY; for details use '-w'.
Delly (Version: 0.7.6)
Contact: Tobias Rausch (rausch@embl.de)
**********************************************************************
Usage: delly <command> <arguments>
Commands:
call discover and genotype structural variants
merge merge structural variants across VCF/BCF files and within a single VCF/BCF file
filter filter somatic or germline structural variants
As before, we can run delly interactively with the -it
argument (shorthand for -i -t). However, once we are inside the container we cannot automatically see the contents of our own hard drive.
docker run -it dellytools/delly /bin/bash
We can mount volumes from a particular location on our host drive onto the file system used in the docker container. Let’s mount the example/
directory of the course materials folder. Currently, this contains an example .bam
file (mapped reads for a small region on chromosome 21) and index. The analysis we are going to do will also require a reference genome, which you can download from UCSC.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz -P example/
gunzip example/chr21.fa.gz
The -v
argument is used to mount a volume, in the form -v FROM:TO
. In this example, we mount the example/
sub-directory of the current working directory to a folder /data/
inside the container. FROM needs to be the full path to the directory, so we can use the pwd
command to expand the current working directory.
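As a quick sketch of how the argument gets built (using the example/ directory from this section), the shell expands the backticks before docker ever sees them:

```shell
# `pwd` is expanded by the shell before docker runs, so the -v
# argument receives an absolute FROM path, as Docker requires
echo "`pwd`/example/:/data"
```

The `$(pwd)` form is the modern equivalent of the backticks used here; the exact prefix printed depends on where you run the command from.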
Note that on Mac OS X, you may need to tell Docker which directories can be mounted. The default setup includes /Users/
and /Volumes/
, but if you want to use something else you’ll need to add it to the list in Preferences.
docker run -ti -v `pwd`/example/:/data dellytools/delly /bin/bash
Once the container is launched, we can list the contents of /data/
, which should hopefully match the local contents of example/
####Run this inside the delly container####
ls -l /data/
total 73700
-rw-r--r-- 1 ubuntu ubuntu 49092500 Nov 11 08:58 chr21.fa
-rw-r--r-- 1 ubuntu ubuntu 26342948 Nov 11 09:37 test.bam
-rw-r--r-- 1 ubuntu ubuntu 28176 Nov 11 09:37 test.bam.bai
Note that this directory doesn’t have to already exist in the directory structure of the container; Docker will create the mount point for us
docker run -v `pwd`/example:/mark dellytools/delly ls -l /mark/
Once the volume is mounted inside the container, anything written to /data/
will be visible in the host directory. Exit the delly
container (CTRL + D
), and re-run with the following. A new file should be created in example/
docker run -v `pwd`/example/:/data dellytools/delly touch /data/hello.txt
ls example/
chr21.fa
hello.txt
test.bam
test.bam.bai
The actual command to run delly
on our example data is as follows. When specifying the bam and reference files, we have to give the paths as they appear inside the container, i.e. under /data/
in this case.
docker run -v `pwd`/example/:/data dellytools/delly delly call -t DUP -o /data/test.bcf -g /data/chr21.fa /data/test.bam
[fai_load] build FASTA index.
[2016-Nov-11 11:39:04] delly call -t DUP -o /data/test.bcf -g /data/chr21.fa /data/test.bam
[2016-Nov-11 11:39:04] Paired-end clustering
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:05] Split-read alignment
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:06] Junction read annotation
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:06] Breakpoint spanning coverage annotation
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:07] Read-depth annotation
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:07] Genotyping
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
[2016-Nov-11 11:39:07] Library statistics
Sample: HCC1143
RG: ID=HCC1143,ReadSize=76,AvgDist=1,EstCov=76,MappedAsPair=0.999035,Median=156,MAD=37,Layout=2,MaxSize=489,MinSize=0,UniqueDiscordantPairs=42
[2016-Nov-11 11:39:07] Done.
delly
should manage to call a known tandem duplication in this example dataset. But more importantly, we’ve managed to install and run the tool in a relatively painless manner. If you look in the example directory you should find a test.bcf
output file. You can convert and view this in the more common (and human-readable) VCF format using the bcftools
utility, available as part of SAMtools. But what if we don’t have samtools
installed? Well, luckily the delly
container also contains versions of samtools
and bcftools.
Exercise: run bcftools
from inside the delly container to view the contents of the test.bcf
file output by delly
.
HINT bcftools view
can be used to print a bcf file in human readable form.
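If you get stuck, one possible approach is sketched below (not the only answer): reuse the -v mount from the earlier delly call, but invoke bcftools instead. The command is assembled into a variable first so you can inspect it before running it yourself.

```shell
# Sketch: same mount as before, but run bcftools (bundled in the
# dellytools/delly image) instead of delly
cmd="docker run -v $(pwd)/example/:/data dellytools/delly bcftools view /data/test.bcf"
echo "$cmd"
# Run the echoed command yourself (requires docker and the example data)
```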
The Bioconductor team now distribute various Docker containers that are built upon the rocker project. This is a convenient way of running an R instance with the latest versions (or older versions, if you prefer) of particular packages installed. It is also possible to run such containers in a cloud environment such as Amazon. Various flavours of container are available, as described on the Bioconductor website. For example, there are pre-built containers containing sequencing, proteomics, and flow cytometry packages.
To run such a container, we can use the following. Here the -p
argument publishes a port, in the form -p HOST:CONTAINER, so that the container's port is reachable from the host.
docker run -p 8787:8787 bioconductor/release_base
If we open a web browser and type the address http://localhost:8787
, you should see an RStudio Server login page. The username and password are both rstudio
. After logging in you will have a fully-functioning RStudio instance inside your web browser. If you require a package, you can use install.packages
or biocLite
in the usual fashion.
If you require command line R, rather than RStudio, you can do
docker run -ti bioconductor/release_base R
Don’t forget that if you want to save your results to the host machine you will need to mount a volume as we discussed above.
docker run -v `pwd`:/home/rstudio -ti bioconductor/release_base R
You could even release a docker container for the analysis of your paper, as Stephen Eglen from DAMTP has done for his 2014 paper. Running this container will load RStudio with all the packages and scripts available. The implications for reproducibility are tremendous.
docker run -d -p 8787:8787 sje30/waverepo
Building such an image is fairly painless, as one can extend the existing Bioconductor image to include your own packages and data. We will explore this in the following sections.
The Bioconductor project has a 6-month release cycle, and package authors are required to make sure that their package can be compiled and run with the latest version of R and other Bioconductor software. This process typically involves compiling the latest development (and potentially unstable) version of R. In the days before Docker, this would mean getting the latest .tar.gz
file from the CRAN repository and compiling it on my desktop machine. With Docker I can get the latest version onto my machine without cluttering it up with lots of different R versions.
docker run -it bioconductor/devel_base R
The Sanger Institute have released their entire cancer genome analysis pipeline via Docker. This is part of an initiative with the Pan-Cancer genomes project, and other major sequencing centres (Broad, DKFZ) are also committed to releasing software this way.
For the end-user, it is a single command to install the software encompassing the entire pipeline.
docker pull quay.io/wtsicgp/cgp_in_a_box
Some configuration is required in order to run the pipeline on your own data. However, instructions to download and analyse a test dataset are included on the website.