Introduction to sleuth

Overview

sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

The use of boostraps to ascertain and correct for technical variation in experiments.
Implemention of a flexible response error measurement model for inference that allows for a multitude of experimental designs.
Interactive plots that enable real-time exploratory data analysis.

To use sleuth, RNA-Seq data must first be quantified with kallisto, which is a program for very fast RNA-Seq quantification based on pseudo-alignment. An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has also been designed to be lightweight and fast, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes.

The model sleuth uses for performing differential analysis is a general linear model where there is error in the response. Formally, in the case of two conditions being assayed, for a transcript $t$ in a sample $i$ , the (log) “true” unobserved abundance $y_i$ measured in read counts is modeled by

$y_{t,i} = \beta_{t,0} + \beta_{t,1} x_{t,i} + \epsilon_{t,i}$

where $x_{t,i}$ is an indicator variable describing the condition, $\beta_{t,0}$ and $\beta_{t,1}$ are parameters of the model and $\epsilon_{t,i}$ is biological “noise”. However, conditioned on $x_i$ and $y_i$ the estimated number of counts from the observations in the experiment is given by

$d_{t,i} = y_{t,i} + \zeta_{t,i}$

where $\zeta_{t,i}$ represents technical “noise”, i.e. uncertainty in the measurement due to effects other than biological variability. The sleuth model incorporates the assumptions that the expectation $E(\zeta_{t,i}|y_{t,i}) = 0$ , that $E(d_{t,i}) = \beta_{t,0} + \beta_{t,1} x_{t,i}$ and that the response error is additive, i.e. if the variance $V(\epsilon_{t,i}) = \sigma_t^2$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau_t^2$ then the variance $V(d_{t,i}) = \sigma_{t}^2 + \tau_{t}^2$ . sleuth makes use of the boostraps from kallisto to estimate the $\tau_t^2$ , and after subtracting the estimated technical variance the $\sigma_t^2$ are estimated via a shrinkage procedure similar to that used in Limma Voom.

More generally, sleuth can be used to analyze multiple conditions organized in complex experimental designs. In the general case the unobserved model is given by

$Y_t = X_t \beta_t + \epsilon_t$

where $t$ is a transcript, $Y_t$ is a vector of length $n$ where $n$ is the number of samples, $X_t$ is an $n \times p$ matrix where $p$ is the number of covariates, $\beta_t$ represents the effect sizes and is a vector of length $p$ , and $\epsilon_t$ is a vector of length $n$ . The observed response is described by

$D_t = Y_t + \zeta_t$

where $\zeta_t$ and $D_t$ are vectors of length $n$ .

sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.

Installation

To install sleuth start R and first install rhdf5 by typing:

source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")

Then install devtools by typing

install.packages("devtools")

and install sleuth by typing

devtools::install_github("pachterlab/sleuth")

Next load sleuth with

library("sleuth")

Example

To explain how to use sleuth we provide an example based on the data in the “Cuffdiff2 paper”:

Differential analysis of gene regulation at transcript resolution with RNA-seq by Cole Trapnell, David G Henderickson, Martin Savageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology 31, 46–53 (2013).

The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repA, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HoxA1 knockdown) that will be compared with sleuth.

To analyze the data, first download the raw reads, install kallisto and then quantify the data with boostraps as described here. This step can be skipped for the purposes of the vignette, by downloading the kallisto processed data directly by clicking here.

The first step in a sleuth analysis is to specify where the kallisto results are stored. Begin by storing the base directory of the results in a variable,

base_dir <- "~/Downloads/cuffdiff2_data_kallisto_results"

Next get the list of sample IDs with

sample_id <- dir(file.path(base_dir,"results"))

The result can be displayed by typing

sample_id

## [1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"

In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).

A list of paths to the kallisto results indexed by the sample IDs is collated with

kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "results", id, "kallisto"))
kal_dirs

##                                                                SRR493366 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493366/kallisto" 
##                                                                SRR493367 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493367/kallisto" 
##                                                                SRR493368 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493368/kallisto" 
##                                                                SRR493369 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493369/kallisto" 
##                                                                SRR493370 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493370/kallisto" 
##                                                                SRR493371 
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493371/kallisto"

The next step is to load an auxillary table that describes the experimental design and the relationship between the kallisto directories and the samples:

s2c <- read.table(file.path(base_dir, "hiseq_info.txt"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c

##      sample condition
## 1 SRR493366  scramble
## 2 SRR493367  scramble
## 3 SRR493368  scramble
## 4 SRR493369   HOXA1KD
## 5 SRR493370   HOXA1KD
## 6 SRR493371   HOXA1KD

Now, we must enter the directories into a column in the table describing the experiment. This column must be labeled path, otherwise sleuth will throw an error. This is to ensure that the user can check which samples correspond to which kallisto runs.

s2c <- dplyr::mutate(s2c, path = kal_dirs)

The user should check whether or not the order is correct. In this case, the kallisto output is correctly matched with the sample identifiers.

print(s2c)

##      sample condition
## 1 SRR493366  scramble
## 2 SRR493367  scramble
## 3 SRR493368  scramble
## 4 SRR493369   HOXA1KD
## 5 SRR493370   HOXA1KD
## 6 SRR493371   HOXA1KD
##                                                                     path
## 1 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493366/kallisto
## 2 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493367/kallisto
## 3 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493368/kallisto
## 4 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493369/kallisto
## 5 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493370/kallisto
## 6 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493371/kallisto

Now the “sleuth object” can be constructed. This requires four commands that (1) load the kallisto processed data into the object (2) estimate parameters for the sleuth response error measurement (full) model (3) estimate parameters for the sleuth reduced model, and (4) perform differential analysis (testing). On a laptop the four steps should take about a few minutes altogether.

First type

so <- sleuth_prep(s2c, ~ condition)

## reading in kallisto results

## ......

##

## normalizing est_counts

## 50844 targets passed the filter

## normalizing tpm

## merging in metadata

## normalizing bootstrap samples

## summarizing bootstraps

then fit the full model

so <- sleuth_fit(so)

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

Next, we fit the reduced model. In this case, the reduced model is the intercept-only model:

so <- sleuth_fit(so, ~1, 'reduced')

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

and finally perform the test:

so <- sleuth_lrt(so, 'reduced', 'full')

In general, we can test models that are nested using the likelihood ratio test. Viewing models which have been fit can be done using the models() function.

models(so)

## [  full  ]
## formula:  ~condition 
## coefficients:
##  (Intercept)
##      conditionscramble
## [  reduced  ]
## formula:  ~1 
## coefficients:
##  (Intercept)

Including gene names into transcript-level analysis

At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the kallisto estimates, the model fit, and the testing. In other words it contains the entire analysis of the data. There is, however, one piece of information that can be useful to add in, but that is optional. In reading the kallisto output sleuth has no information about genes, but this can be added allowing for searching and analysis by gene instead of transcript.

Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):

First, install biomaRt with

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

Then collect gene names with

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = 'ensembl.org')

and add them into the sleuth table with

t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
    "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)

## reading in kallisto results

## ......

##

## normalizing est_counts

## 50844 targets passed the filter

## normalizing tpm

## merging in metadata

## normalizing bootstrap samples

## summarizing bootstraps

so <- sleuth_fit(so)

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

so <- sleuth_fit(so, ~1, 'reduced')

## fitting measurement error models

## shrinkage estimation

## computing variance of betas

so <- sleuth_lrt(so, 'reduced', 'full')

This addition of metadata to transcript IDs is very general, and can be used to add in other information.

The best way to view the results is to generate the Shiny webpage that allows for exploratory data analysis:

sleuth_live(so)

To generate a table of results for analysis within R type

results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')

Gene level analysis

Assuming biomaRt has been installed as in the previous step, sleuth can also be run in the ‘gene aggregation’ mode. In addition to requiring a target_mapping, a string which references a column to aggregate by, (aggregation_column). In our example, we could use ens_gene or ext_gene. It is preferable to use ens_gene as ext_gene tends to be a bit redundant.

The modified sleuth prep command looks as follows:

so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,
  aggregation_column = 'ens_gene')

## reading in kallisto results

## ......

##

## normalizing est_counts

## 50844 targets passed the filter

## normalizing tpm

## merging in metadata

## normalizing bootstrap samples

## summarizing bootstraps

## aggregating by column: ens_gene

## 14412 genes passed the filter

The remainder of the pipeline is unchanged. When running sleuth_live or sleuth_results, the gene column you selected will be listed as target_id, rather than the transcript name.

sleuth_prep might take a bit longer here because aggregation has to be done amongst all of the bootstraps as well. We are currently working to speed this up and expect to release it along with several memory improvements in the next version (0.28.2). One way to speed this up is to use more processes (assuming you have more cores). Internally, sleuth uses the function parallel::mclapply. You can set the number of cores as follows:

# set the number of available cores to 4
options(mc.cores = 4L)