**sleuth** is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

- The use of bootstraps to ascertain and correct for technical variation in experiments.
- Implementation of a flexible response error measurement model for inference that allows for a multitude of experimental designs.
- Interactive plots that enable real-time exploratory data analysis.

To use **sleuth**, RNA-Seq data must first be quantified with **kallisto**, which is a program for *very* fast RNA-Seq quantification based on pseudo-alignment. An important feature of **kallisto** is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. **kallisto** can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. **sleuth** has also been designed to be lightweight and fast, and therefore RNA-Seq analysis with **kallisto** and **sleuth** is tractable on a laptop computer in a matter of minutes.

The model **sleuth** uses for performing differential analysis is a general linear model with error in the response. Formally, in the case of two conditions being assayed, for a transcript \(t\) in a sample \(i\), the (log) “true” unobserved abundance \(y_{t,i}\), measured in read counts, is modeled by

\[ y_{t,i} = \beta_{t,0} + \beta_{t,1} x_{t,i} + \epsilon_{t,i} \]

where \(x_{t,i}\) is an indicator variable describing the condition, \(\beta_{t,0}\) and \(\beta_{t,1}\) are parameters of the model, and \(\epsilon_{t,i}\) is biological “noise”. However, conditioned on \(x_{t,i}\) and \(y_{t,i}\), the estimated number of counts from the observations in the experiment is given by

\[ d_{t,i} = y_{t,i} + \zeta_{t,i} \]

where \(\zeta_{t,i}\) represents technical “noise”, i.e. uncertainty in the measurement due to effects other than biological variability. The **sleuth** model incorporates the assumptions that the expectation \(E(\zeta_{t,i}|y_{t,i}) = 0\), that \(E(d_{t,i}) = \beta_{t,0} + \beta_{t,1} x_{t,i}\), and that the response error is *additive*, i.e. if the variance \(V(\epsilon_{t,i}) = \sigma_t^2\) and the variance \(V(\zeta_{t,i}|y_{t,i}) = \tau_t^2\), then the variance \(V(d_{t,i}) = \sigma_{t}^2 + \tau_{t}^2\). **sleuth** makes use of the bootstraps from **kallisto** to estimate the \(\tau_t^2\), and after subtracting the estimated technical variance the \(\sigma_t^2\) are estimated via a shrinkage procedure similar to that used in limma voom.
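This additive decomposition can be illustrated with a small simulation (a sketch for intuition only, with invented numbers; it is not sleuth's actual estimation procedure):

```r
# Simulate the additive error model for a single transcript:
# biological noise epsilon (variance sigma^2) and technical noise
# zeta (variance tau^2) combine so that Var(d) = sigma^2 + tau^2.
set.seed(42)
n      <- 100000  # number of simulated observations (invented)
sigma2 <- 0.50    # biological variance (invented)
tau2   <- 0.25    # technical variance (invented)
y <- 2 + rnorm(n, sd = sqrt(sigma2))  # "true" unobserved (log) abundance
d <- y + rnorm(n, sd = sqrt(tau2))    # observed estimate with technical noise
var(d)  # close to sigma2 + tau2 = 0.75
```

In practice the simulated quantities are of course not available: sleuth estimates \(\tau_t^2\) from the kallisto bootstraps and attributes the remaining variance to \(\sigma_t^2\).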

More generally, **sleuth** can be used to analyze multiple conditions organized in complex experimental designs. In the general case the unobserved model is given by

\[ Y_t = X_t \beta_t + \epsilon_t \]

where \(t\) is a transcript, \(Y_t\) is a vector of length \(n\) where \(n\) is the number of samples, \(X_t\) is an \(n \times p\) matrix where \(p\) is the number of covariates, \(\beta_t\) represents the effect sizes and is a vector of length \(p\), and \(\epsilon_t\) is a vector of length \(n\). The observed response is described by

\[ D_t = Y_t + \zeta_t \]

where \(\zeta_t\) and \(D_t\) are vectors of length \(n\).
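For instance, the two-condition model above is recovered by taking \(p = 2\), with an intercept column and an indicator column:

\[ X_t = \begin{pmatrix} 1 & x_{t,1} \\ \vdots & \vdots \\ 1 & x_{t,n} \end{pmatrix}, \qquad \beta_t = \begin{pmatrix} \beta_{t,0} \\ \beta_{t,1} \end{pmatrix}, \]

so that the \(i\)-th entry of \(X_t \beta_t\) is \(\beta_{t,0} + \beta_{t,1} x_{t,i}\).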

**sleuth** has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into **sleuth** and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.

To install **sleuth** start R and first install `rhdf5` by typing:

```
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
```

Then install devtools by typing

`install.packages("devtools")`

and install **sleuth** by typing

`devtools::install_github("pachterlab/sleuth")`

Next load **sleuth** with

`library("sleuth")`

To explain how to use **sleuth** we provide an example based on the data in the “Cuffdiff2 paper”:

- Differential analysis of gene regulation at transcript resolution with RNA-seq by Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology **31**, 46–53 (2013).

The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repB, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HOXA1 knockdown) that will be compared with **sleuth**.

To analyze the data, first download the raw reads, install **kallisto** and then quantify the data with bootstraps as described here. This step can be skipped for the purposes of the vignette by downloading the **kallisto** processed data directly by clicking here.

The first step in a **sleuth** analysis is to specify where the **kallisto** results are stored. Begin by storing the base directory of the results in a variable,

`base_dir <- "~/Downloads/cuffdiff2_data_kallisto_results"`

Next get the list of sample IDs with

`sample_id <- dir(file.path(base_dir,"results"))`

The result can be displayed by typing

`sample_id`

`## [1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"`

In the box above, lines beginning with `##` show the output of the command (in what follows we include the output that should appear with each command).

A list of paths to the **kallisto** results indexed by the sample IDs is collated with

```
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "results", id, "kallisto"))
kal_dirs
```

```
## SRR493366
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493366/kallisto"
## SRR493367
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493367/kallisto"
## SRR493368
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493368/kallisto"
## SRR493369
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493369/kallisto"
## SRR493370
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493370/kallisto"
## SRR493371
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493371/kallisto"
```

The next step is to load an auxiliary table that describes the experimental design and the relationship between the kallisto directories and the samples:

```
s2c <- read.table(file.path(base_dir, "hiseq_info.txt"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c
```

```
## sample condition
## 1 SRR493366 scramble
## 2 SRR493367 scramble
## 3 SRR493368 scramble
## 4 SRR493369 HOXA1KD
## 5 SRR493370 HOXA1KD
## 6 SRR493371 HOXA1KD
```

Now, we must enter the directories into a column in the table describing the experiment. This column must be labeled `path`, otherwise sleuth will throw an error. This is to ensure that the user can check which samples correspond to which kallisto runs.

`s2c <- dplyr::mutate(s2c, path = kal_dirs)`

The user should check whether or not the order is correct. In this case, the kallisto output is correctly matched with the sample identifiers.

`print(s2c)`

```
## sample condition
## 1 SRR493366 scramble
## 2 SRR493367 scramble
## 3 SRR493368 scramble
## 4 SRR493369 HOXA1KD
## 5 SRR493370 HOXA1KD
## 6 SRR493371 HOXA1KD
## path
## 1 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493366/kallisto
## 2 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493367/kallisto
## 3 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493368/kallisto
## 4 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493369/kallisto
## 5 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493370/kallisto
## 6 ~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493371/kallisto
```
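The correspondence can also be verified programmatically rather than only by eye (a sketch assuming the directory layout above, where each path ends in `<sample>/kallisto`):

```r
# Each path has the form .../results/<sample>/kallisto, so the sample
# ID should match the name of the parent directory of its path.
# stopifnot() halts with an error if any row is mismatched.
stopifnot(all(basename(dirname(s2c$path)) == s2c$sample))
```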

Now the “sleuth object” can be constructed. This requires four commands that (1) load the kallisto processed data into the object, (2) estimate parameters for the **sleuth** response error measurement (full) model, (3) estimate parameters for the **sleuth** reduced model, and (4) perform differential analysis (testing). On a laptop the four steps should take a few minutes altogether.

First type

`so <- sleuth_prep(s2c, ~ condition)`

```
## reading in kallisto results
## ......
## 
## normalizing est_counts
## 50844 targets passed the filter
## normalizing tpm
## merging in metadata
## normalizing bootstrap samples
## summarizing bootstraps
```

then fit the full model

`so <- sleuth_fit(so)`

```
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
```

Next, we fit the reduced model. In this case, the reduced model is the intercept-only model:

`so <- sleuth_fit(so, ~1, 'reduced')`

```
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
```

and finally perform the test:

`so <- sleuth_lrt(so, 'reduced', 'full')`

In general, we can test models that are nested using the likelihood ratio test. Models that have been fit can be viewed using the `models()` function.

`models(so)`

```
## [ full ]
## formula: ~condition
## coefficients:
## (Intercept)
## conditionscramble
## [ reduced ]
## formula: ~1
## coefficients:
## (Intercept)
```

At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the **kallisto** estimates, the model fit, and the testing. In other words, it contains the entire analysis of the data. There is, however, one optional piece of information that can be useful to add. When reading the kallisto output **sleuth** has no information about *genes*, but this can be added, allowing for searching and analysis by gene instead of transcript.

Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):

First, install biomaRt with

```
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
```

Then collect gene names with

```
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                         dataset = "hsapiens_gene_ensembl",
                         host = 'ensembl.org')
```

and add them into the **sleuth** table with

```
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                     "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                     ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)
```

```
## reading in kallisto results
## ......
## 
## normalizing est_counts
## 50844 targets passed the filter
## normalizing tpm
## merging in metadata
## normalizing bootstrap samples
## summarizing bootstraps
```

`so <- sleuth_fit(so)`

```
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
```

`so <- sleuth_fit(so, ~1, 'reduced')`

```
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
```

`so <- sleuth_lrt(so, 'reduced', 'full')`

This addition of metadata to transcript IDs is very general, and can be used to add in other information.

The best way to view the results is to generate the Shiny webpage that allows for exploratory data analysis:

`sleuth_live(so)`

To generate a table of results for analysis within R type

`results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')`
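The returned table includes a `qval` column (Benjamini–Hochberg adjusted p-values). A common follow-up, sketched here in base R with a conventional 0.05 false discovery rate threshold, is to extract the significant transcripts:

```r
# Keep transcripts significant at FDR 0.05 and order them from most
# to least significant; which() drops rows where qval is NA.
significant <- results_table[which(results_table$qval <= 0.05), ]
significant <- significant[order(significant$qval), ]
head(significant, 20)
```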

Assuming `biomaRt` has been installed as in the previous step, sleuth can also be run in ‘gene aggregation’ mode. In addition to a `target_mapping`, this mode requires a string referencing the column to aggregate by (`aggregation_column`). In our example, we could use `ens_gene` or `ext_gene`. It is preferable to use `ens_gene`, as `ext_gene` tends to be a bit redundant.

The modified sleuth prep command looks as follows:

```
so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,
                  aggregation_column = 'ens_gene')
```

```
## reading in kallisto results
## ......
## 
## normalizing est_counts
## 50844 targets passed the filter
## normalizing tpm
## merging in metadata
## normalizing bootstrap samples
## summarizing bootstraps
## aggregating by column: ens_gene
## 14412 genes passed the filter
```

The remainder of the pipeline is unchanged. When running `sleuth_live` or `sleuth_results`, the gene column you selected will be listed as `target_id`, rather than the transcript name.

`sleuth_prep` might take a bit longer here because aggregation has to be done amongst all of the bootstraps as well. We are currently working to speed this up and expect to release it along with several memory improvements in the next version (0.28.2). One way to speed this up is to use more processes (assuming you have more cores). Internally, `sleuth` uses the function `parallel::mclapply`. You can set the number of cores as follows:

```
# set the number of available cores to 4
options(mc.cores = 4L)
```