library(scater)
library(data.table)
library(cowplot)
library(DT)
library(knitr)
library(scran)
opts_chunk$set(cache = FALSE, warning = FALSE)

1 Load the Zeisel et al mouse cortex data

The mouse cortex data from Zeisel et al, 2015 have been made available as UMI (unique molecular identifier) counts at the Linnarsson Lab website.

Here we download the data directly from the Linnarrsson Lab page and produce an SCESet object, which is the data structure used in the scater package.

First load the data for metadata for mRNA genes:

## download mRNA data
mrna_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
              nrows = 10, header = FALSE)
mrna_meta <- as.data.frame(t(mrna_meta), stringsAsFactors = FALSE)
colnames(mrna_meta) <- mrna_meta[2,]
mrna_meta <- mrna_meta[-c(1,2),]
rownames(mrna_meta) <- mrna_meta$cell_id
datatable(mrna_meta[1:6,])

Now load the count data for the mRNA genes:

mrna <- fread("http://linnarssonlab.org/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
              skip = 10)

## 
Read 0.0% of 19972 rows
Read 19972 rows and 3007 (of 3007) columns from 0.113 GB file in 00:00:05

mrna <- as.data.frame(mrna)
rownames(mrna) <- mrna$V1
mrna_gene_cluster <- mrna$V2
mrna <- mrna[, -c(1,2)]
colnames(mrna) <- mrna_meta$cell_id
datatable(mrna[1:20, 1:6])

Next load the metadata for mitochondrial genes:

## download data for mitochondrial genes
mtgenes_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_mito_17-Aug-2014.txt",
                      nrows = 10, header = FALSE)
mtgenes_meta <- as.data.frame(t(mtgenes_meta), stringsAsFactors = FALSE)
colnames(mtgenes_meta) <- mtgenes_meta[2,]
mtgenes_meta <- mtgenes_meta[-c(1,2),]
rownames(mtgenes_meta) <- mtgenes_meta$cell_id
datatable(mtgenes_meta[1:6,])

Followed by the count data for mitochondrial genes:

mtgenes <- fread("http://linnarssonlab.org/blobs/cortex/expression_mito_17-Aug-2014.txt",
                 skip = 10)
mtgenes <- as.data.frame(mtgenes)
rownames(mtgenes) <- mtgenes$V1
mtgenes <- mtgenes[, -c(1,2)]
colnames(mtgenes) <- mtgenes_meta$cell_id
datatable(mtgenes[1:10, 1:6])

Finally load the metadata for ERCC spike-ins:

## download data for ERCC spike-ins
ercc_meta <- fread("http://linnarssonlab.org/blobs/cortex/expression_spikes_17-Aug-2014.txt",
                   nrows = 10, header = FALSE)
ercc_meta <- as.data.frame(t(ercc_meta), stringsAsFactors = FALSE)
colnames(ercc_meta) <- ercc_meta[2,]
ercc_meta <- ercc_meta[-c(1,2),]
rownames(ercc_meta) <- ercc_meta$cell_id
datatable(ercc_meta[1:6,])

and then the count data for the ERCC spike-ins:

ercc <- fread("http://linnarssonlab.org/blobs/cortex/expression_spikes_17-Aug-2014.txt",
                   skip = 10)
ercc <- as.data.frame(ercc)
rownames(ercc) <- ercc$V1
ercc <- ercc[, -c(1,2)]
colnames(ercc) <- ercc_meta$cell_id
datatable(ercc[1:10, 1:6])

Now, after checking that the cell IDs match up in the different data frames we have loaded, combine data and form into an SCESet object:

identical(mtgenes_meta[rownames(mrna_meta),], mrna_meta)
identical(ercc_meta[rownames(mrna_meta),], mrna_meta)
## cell metadata matches as long as cells are matched
mtgenes <- mtgenes[, rownames(mrna_meta)]
identical(colnames(mtgenes), colnames(mrna))
ercc <- ercc[, rownames(mrna_meta)]
identical(colnames(ercc), colnames(mrna))

## combine expression values
counts <- rbind(mrna, mtgenes, ercc)
dim(counts)

## [1] 20063  3005

pd <- new("AnnotatedDataFrame", mrna_meta)
fd <- data.frame(gene_cluster = c(mrna_gene_cluster, 
                                rep(NA, (nrow(ercc) + nrow(mtgenes)))))
rownames(fd) <- rownames(counts)
fd <- new("AnnotatedDataFrame", fd)
sce_zeisel_raw <- newSCESet(countData = counts, phenoData = pd, featureData = fd)
sce_zeisel_raw

## SCESet (storageMode: environment)
## assayData: 20063 features, 3005 samples 
##   element names: counts, cpm, exprs, is_exprs 
## protocolData: none
## phenoData
##   sampleNames: 1772071015_C02 1772071017_G12 ... 1772058148_F03
##     (3005 total)
##   varLabels: tissue group # ... level2class (10 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Tspan12 Tshz1 ... ERCC-00171 (20063 total)
##   fvarLabels: gene_cluster
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

This data were produced using UMIs, so cpm (counts-per-million) should be the correct units for analysis (transcript length becomes irrelevant as we are counting molecules directly).

2 Calculate QC metrics

Calculate QC metrics with scater’s calculateQCMetrics function. It is helpful to define “feature controls”, that is features (here genes) that can be used for QC purposes. Typically, it is helpful to define ERCC spike-ins and mitochondrial genes as feature controls, which is exactly what we do here:

ercc_genes <- grep("ERCC", featureNames(sce_zeisel_raw))
mt_genes <- grep("mt-", featureNames(sce_zeisel_raw))
sce_zeisel_raw <- calculateQCMetrics(
    sce_zeisel_raw, feature_controls = list(ERCC = ercc_genes, mt = mt_genes))

For this data set the original authors classified the cells into the broad cell classes shown in the output below.

table(sce_zeisel_raw$level1class)

## 
## astrocytes_ependymal    endothelial-mural         interneurons 
##                  224                  235                  290 
##            microglia     oligodendrocytes        pyramidal CA1 
##                   98                  820                  939 
##         pyramidal SS 
##                  399

We will explore the cell types throughout the QC procedures described subsequently.

3 QC the genes

Scater allows you to set minimum QC thresholds for a gene to be considered sufficiently expressed in your downstream analysis. Here, using the inbuilt is_exprs() function, we enforce that a gene must have least one count in at least 98 cells (as this is the smallest group of cells as identified by Zeisel et al). The rationale behind this is that we would want to keep a gene if it was expressed in just one group of cells, but we don’t want genes with very sparse expression overall.

keep_gene <- rowSums(is_exprs(sce_zeisel_raw)) >= 98
fData(sce_zeisel_raw)$use <- keep_gene

We retain 12386 genes for the analysis and drop 7677 lowly-expressed genes from the analysis.

It can be useful to plot gene expression frequency versus mean expression level to assess the effects of technical dropout in the dataset. We fit a non-linear least squares curve for the relationship between expression frequency and mean expression and use this to define the number of genes above high technical dropout and the numbers of genes that are expressed (here defined as at least 1 count) in at least 50% and at least 25% of cells. A subset of genes to be treated as feature controls can be specified, otherwise any feature controls previously defined are used.

plotQC(sce_zeisel_raw, type = "exprs")

The plotQC() function provides several useful QC plots, such as the example below that considers the the number of reads consumed by the top 50 expressed genes. Aside from spike-ins, these are typically mitochondrial and housekeeping genes. Here, as with most single-cell experiments, a large proportion of reads are being are taken up by uninteresting biology.

plotQC( sce_zeisel_raw[fData(sce_zeisel_raw)$use, ], 
        type = "highest-expression", col_by_variable = "level1class" )

As well as the expected ERCC and mitochondrial genes among the most expressed, we see Actb, involved in cell motility, structure and integrity.

4 QC the cells

A useful first step is flagging/failing poorly performing cells. This can be done from the sample meta-data using the automated QC metrics generated above,
any additional sequencing metrics from sequencing aligner/mapping software,
and additional cell phenotypes such as from imaging. For the sake of demonstration, here we focus on four metrics. Others you may want to consider are % reads mapped to mitochondrial genes, library PCR duplication rate,
and mean sequencing bias per cell.

plotPhenoData() can be used to explore specific sample meta-data values. For example, in the plots below we can see how different QC metrics distinguish the cells.

p1 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "log10_total_counts", 
                                  y = "pct_counts_feature_controls_mt", 
                                  colour = "level1class")) 
p2 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "log10_total_counts", 
                                  y = "pct_counts_feature_controls_ERCC", 
                                  colour = "level1class")) 
p3 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features", 
                                  y = "pct_counts_feature_controls_mt", 
                                  colour = "level1class")) 
p4 <- plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features", 
                                  y = "pct_counts_feature_controls_ERCC", 
                                  colour = "level1class")) 
plot_grid(p1, p2, p3, p4, labels = letters[1:4], nrow = 2)

plotPhenoData(sce_zeisel_raw, aes_string(x = "total_features", 
                                  y = "pct_counts_feature_controls", 
                                  colour = "filter_on_pct_counts_feature_controls",
                                  shape_by = "filter_on_total_counts")) +
    geom_vline(xintercept = 1000, linetype = 2)

Use QC metrics to select a subset of cells for use.

sce_zeisel_raw$use <- (sce_zeisel_raw$total_features > 1000 & #sufficient features 
                           sce_zeisel_raw$total_counts > 5000 & # sufficient molucules counted
                           !sce_zeisel_raw$filter_on_pct_counts_feature_controls & # sufficient endogenous RNA
                           
                           !sce_zeisel_raw$filter_on_total_features & # remove cells with unusual numbers of genes
                           !sce_zeisel_raw$is_cell_control # controls shouldn't be used in downstream analysis
)
table(sce_zeisel_raw$use)

## 
## FALSE  TRUE 
##    27  2978

This would lead us to drop a further 27 cells from this dataset.

Box plots aren’t particularly useful for visualising sparse data, so plot() applied to an SCESet object helps visualise all cells as a cumulative proportion of reads per cell. You can see from the plot below that the two failed cells have curves that look more like the blank controls.

Cumulative expression plots:

plot(sce_zeisel_raw, block1 = "level1class", colour_by = "level2class", 
     exprs_values = "counts")

We observe that certain cell types (e.g. oligodendrocytes and microglia) have a larger proportion of cells with their libraries accounted for by a small number of cells than to pyramidal cells.

plot(sce_zeisel_raw, block1 = "level1class", colour_by = "use", 
     exprs_values = "counts")

The plots above show that some (but certainly not all) of the cells we have opted not to use have a large proportion of their library accounted for by a handful of very highly-expressed genes.

Scater allows users total flexibility to run their favourite dimension reduction methods, as decribed here and in the supporting help files. Here we use plotPCA() to further explore the cells. The t-SNE plot works particularly nicely for this dataset to separate the different cell types as identified by Zeisel et al.

Let’s take a look at a PCA plot:

sce_zeisel_raw <- plotPCA(sce_zeisel_raw, colour_by = "tissue", 
                          return_SCESet = TRUE)

plotReducedDim(sce_zeisel_raw, colour_by = "level1class")

Take a look using a diffusion map (better suited to cells distributed along a process):

plotDiffusionMap(sce_zeisel_raw, colour_by = "level1class")

And finally look at a t-SNE plot:

plotTSNE(sce_zeisel_raw, colour_by = "level1class", rand_seed = 20160128)

The t-SNE gives nice, tight cell-type clusters.

plotPCA(sce_zeisel_raw, colour_by = "use", shape_by = "level1class" )

Another option available in scater is to conduct PCA on a set of QC metrics. The advantage of doing this is that the QC metrics focus on technical aspects of the libraries that are likely to distinguish problematics cells. Automatic outlier detection on PCA plots using QC metrics is available to help identify potentially problematic cells.

By default, the following metrics are used for PCA-based outlier detection:

pct_counts_top_100_features
total_features
pct_counts_feature_controls
n_detected_feature_controls
log10_counts_endogenous_features
log10_counts_feature_controls

A particular set of variables to be used can be specified with the selected_variables argument as shown in the example below.

## PCA on the phenoData cannot handle missing values
## for the exercise here we thus set cdna_recovered_in_ng_per_ul to 100 where 
## there are NA values actually (creating a new dummy variable)
sce_zeisel_raw <- plotPCA(sce_zeisel_raw, size_by = "total_features", 
                          shape_by = "filter_on_total_features", 
                          pca_data_input = "pdata", detect_outliers = TRUE, 
                          return_SCESet = TRUE)

## The following cells/samples are detected as outliers:
## 
## Variables with highest loadings for PC1 and PC2:
## 
##                                            PC1         PC2
## ---------------------------------  -----------  ----------
## pct_counts_top_100_features          0.5125750   0.0770133
## pct_counts_feature_controls          0.5060459   0.1395898
## log10_counts_feature_controls        0.1337814   0.6556969
## n_detected_feature_controls          0.0683787   0.6457144
## total_features                      -0.4766248   0.2540722
## log10_counts_endogenous_features    -0.4810821   0.2512876

This dataset has already been carefully QC’d by the original authors and this PCA plot confirms that, with the PCA on QC metrics detecting no further outliers.

With scater, any specific set of features based on prior knowledge can be used for PCA, t-SNE or diffusion maps. A feature set to use can be defined by supplying the feature_set argument to plotPCA, plotTSNE or plotDiffusionMap. This allows, for example, using only housekeeping features or control features or cell cycle genes to produce reduced-dimension plots. The plots below use only the spike-in genes defined as such earlier.

p1 <- plotPCA(sce_zeisel_raw, feature_set = fData(sce_zeisel_raw)$is_feature_control, 
               colour_by = "use", shape_by = "tissue", 
               size_by = "outlier" ) + ggtitle("PCA")
p2 <- plotDiffusionMap(sce_zeisel_raw, 
                       feature_set = fData(sce_zeisel_raw)$is_feature_control, 
                       colour_by = "use", shape_by = "tissue", 
                       size_by = "outlier" ) + ggtitle("Diffusion map")
multiplot(p1, p2, cols = 2)

In addition to outliers, we prefer to use cells that have a good coverage of detected genes. After gene filtering, we demand that cells have detectable expression for at least 10% of retained genes.

keep_cell <- (colSums(is_exprs(sce_zeisel_raw[fData(sce_zeisel_raw)$use,])) > 
                  0.1 * sum(fData(sce_zeisel_raw)$use))
sce_zeisel_raw$use <- (sce_zeisel_raw$use & keep_cell)

After this procedure we retain 2931 cells and drop 74 cells.

5 Investigate explanatory variables

Once you have filtered cells and genes, a next step is to explore technical drivers of variability in the data to inform data normalisation before downstream analysis.

Experimental design is a critical, but neglected, aspect of sc-RNA-seq studies. To the best of our knowledge, methods like those described in this section for exploring experimental and QC variables and the experimental design, do not feature in any scRNA-seq software packages apart from scater. There are a very large number of potential confounders, artifacts and biases in sc-RNA-seq studies. Exploring the effects of such explanatory variables (both those recorded during the experiment and computed QC metrics) is crucial for appropriate modeling of the data. The scater package provides a set of methods specifically for quality control of experimental and explanatory variables, which will be demonstrated briefly here.

Using the plotPCA() function we can see that principal component one appears to be driven by differences in number of genes detected (“total features”). Differences in number of detected genes is a common driver of cell clustering and can be result of biology (e.g. different cell types, cell cycle). Indeed, in the PCA here we see that the endothelial-mural and astrocyte-ependymal cells have typically many fewer features than the other cell types.

However, number of genes detected often has a strong technical component due to variably recovered RNA, reverse transcription, or library amplification. Its effect can also be notably non-linear, affecting low expressed and high expressed genes differently. The plotQC() function can be used to explore the the marginal % variance explained (per gene) of the various technical factors. In the second plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.

# we can easily subset to only plot genes and cells of interest
plotPCA( sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use],  
         colour_by = "level1class", shape_by = "tissue", size_by = "total_features" )

One can also easily produce plots to identify principal components that correlate with experimental and QC variables of interest. The function plotQC with the option type = "find-pcs" ranks the principal components in decreasing order of R-squared from a linear model regressing PC value against the variable of interest. The default behaviour is to show the relationships between the variable of interest and the six principal components with the strongest relationship to the variable (as measured by R-squared). This works both for continous and categorical variables. This type of plot can indicate which explanatory variables may be driving differences between cells as detected by PCA and highlight which PCs are associated with the variable. The “total features” variable shows very strong correlation with both principal components 1 and 2.

p1 <- plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
        type = "find-pcs", variable = "total_features" )
p2 <- plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
        type = "find-pcs", variable = "level1class" ) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
plot_grid(p1, p2, ncol = 2, labels = letters[1:2])

Very clearly, the first principal component is affected strongly by the total number of features detected in a cell.

The relative importance of different explanatory variables can be explored with some of the plotQC function options. Supplying the type = "expl" argument to plotQC computes the marginal R-squared for each variable in the when fitting a linear model regressing expression values for each gene against just that variable, and displays a density plot of the gene-wise marginal R-squared values for the variables. The default approach looks at all variables in the slot of the object and plots the top nvars_to_plot variables (default is 10).

Alternatively, one can choose a subset of variables to plot in this manner, which we do here. The density curves for marginal R-squared show the relative importance of different variables for explaining variance in expression between cells. In the plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.

plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
        type = "explanatory-variables", 
        variables = c("pct_counts_top_100_features", "total_features", 
                      "pct_counts_feature_controls", "level1class",
                      "n_detected_feature_controls", 
                      "log10_counts_endogenous_features",
                      "log10_counts_feature_controls", "tissue") )

This analysis indicates that total number of features detected and the sequencing depth (number of counts) for endogenous genes, in particular, have substantial explanatory power for many genes, so these variables are good candidates for conditioning out in a normalisation step, or including in downstream statistical models. The number of detected feature controls (spike-in genes) does not appear to be an important explanatory variable.

The plotQC() function can also be used to produce a pairs plot of explanatory variables (with the same calls as above, but with method = "pairs"). The plot below shows this use case for looking at the % counts from the top 100 most-expressed genes, the total number of expressed genes, the % of counts from feature controls, the number of detected feature controls, the number of counts (on the log-10 scale) from endogenous features, the number of counts (log-10 scale) from feature controls and sample type. The explanatory variables are ordered by the median R-squared of the variable across all genes, and this value is reported on the plot. This type of plot is useful for finding correlations between experimental and QC variables with substantial explanatory power.

plotQC(sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
        type = "expl", method = "pairs", 
        variables = c("pct_counts_top_100_features", "total_features", 
                      "pct_counts_feature_controls", 
                      "n_detected_feature_controls", 
                      "log10_counts_endogenous_features",
                      "log10_counts_feature_controls", "level1class"),
        theme_size = 6)

The plot above shows that the variables level1class (cell type), total_features, log10_counts_endogenous_features, pct_counts_top_100_features and pct_counts_feature_controls are all potentially important explanatory variables, while tissue is less important and n_detected_feature_controls and log10_counts_feature_controls do not look important at all.

6 Normalise the data

After important explanatory variables have been identified with the tools shown above, their effects can be accounted for in subsequent statistical models, or they can be conditioned out using normaliseExprs(), if so desired. If a design matrix incorporating a selection of explanatory variables is supplied as an argument to normaliseExprs, then normalised expression values returned for each feature will be the residuals from a linear model fitted with the design matrix, after any size-factor normalisation has been applied to the expression data.

Normalising single-cell RNA-seq data is a topic in its infancy, but many of the basic principles still apply. How much you choose to initially correct for technical factors depends on your question of interest and the ease with which they can be accounted for in downstream models.

6.1 Size-factor normalisation

In scater it is easy to perform size-factor normalisation using the “TMM” approach (the default in the edgeR package), the “RLE” approach (default in the DESeq package), an “upper-quartile approach” (as proposed by Bullard et al (2010)) or simple scaling by total counts. These approaches were designed for bulk RNA-seq and thus are not optimal for single-cell data, but are provided for convenience and interest. Careful thought should be given to applying these scale-factor normalisation methods as underlying assumptions may not be appropriate for all single-cell datasets.

For instance, the RLE size-factor method does not work for this data set, as the distribution of expression values for each cell has too many zeroes for this method to handle. Similarly, the Bullard et al method only works here for quantile values greater than 90%, because the quantile values less than 90% give a value of zero for numerous cells - when this is the case, the normalisation method fails. If inappropriate normalisation factors are computed, a warning is provided and normalisation factors are set to 1 (the same as applying method="none").

However, Lun et al (2016) recently published a size-factor normalisation method specifically designed for scRNA-seq data. This method performs much better on single-cell data and its implementation in the Bioconductor package scran allows seamless integration into the scater workflow. (The scran package itself depends on scater).

We recommend using the scran size-factor normalisation approach and demonstrate its use here.

The code below demonstrates how to carry out size-factor normalisation on a subsetted SCESet object, normalising either to ERCC spike-in genes (not shown) or to endogenous genes. Under certain circumstances either may be appropriate.

## subset to form a QC'd version of the data
sce_zeisel_qc <- sce_zeisel_raw[fData(sce_zeisel_raw)$use | 
                            fData(sce_zeisel_raw)$is_feature_control, 
                        sce_zeisel_raw$use]
endog_genes <- !fData(sce_zeisel_qc)$is_feature_control

## size factor normalisation with scran 
qclust <- quickCluster(sce_zeisel_qc)
sce_zeisel_qc <- computeSumFactors(sce_zeisel_qc, clusters = qclust)
summary(sce_zeisel_qc$size_factor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1538  0.6005  1.0300  1.2650  1.6750  5.5790

sce_zeisel_qc <- normalise(sce_zeisel_qc)

# sce_zeisel_qc <- normaliseExprs(sce_zeisel_qc, method = "upperquartile", 
#                                 feature_set = endog_genes, p = 0.99)
# set_exprs(sce_zeisel_qc, "norm_exprs_sf") <- norm_exprs(sce_zeisel_qc)
# summary(sce_zeisel_qc$norm_factors)

Here normalisation using size factors using the computeSumFactors function from the scran package is undertaken. The quickCluster method from scran was used to obtain a rough clustering of cells to guide the size-factor computation step.

We can investigate the effects of normalisation with PCA plots, here using only endogenous gene expression values to produce the plots.

## subset again so that only endogenous genes are used
plt_pca_prenorm <- plotPCA(
    sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
    exprs_values = "exprs", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue") + 
    ggtitle("PCA - no normalisation") + 
    theme(legend.position = "bottom")
plt_pca_endog_norm <- plotPCA(
    sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "exprs", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue") + 
    ggtitle("PCA - endogenous size-factor normalisation") + 
    theme(legend.position = "bottom")
multiplot(plt_pca_prenorm, plt_pca_endog_norm, cols = 2)

Here, where there are a number of distrinct cell types, we see that size-factor normalisation has little effect on the large-scale structure in the dataset as determined by PCA.

However, in the t-SNE plots below the overall structure is very similar in both cases, but we see that the normalised data brings combines the two clusters of oligodendrocytes from the un-normalised plot, and that the microglia and endothelial-mural cells are better differentiated with the normalised data.

## subset again so that only endogenous genes are used
plt_tsne_prenorm <- plotTSNE(
    sce_zeisel_raw[fData(sce_zeisel_raw)$use, sce_zeisel_raw$use], 
    exprs_values = "exprs", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue",
    random_seed = 20160203) + 
    ggtitle("t-SNE - no normalisation") + 
    theme(legend.position = "bottom")
plt_tsne_endog_norm <- plotTSNE(
    sce_zeisel_qc[fData(sce_zeisel_qc)$use,], exprs_values = "exprs", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue",
    random_seed = 20160203) + 
    ggtitle("t-SNE - endogenous size-factor normalisation") + 
    theme(legend.position = "bottom")
multiplot(plt_tsne_prenorm, plt_tsne_endog_norm, cols = 2)

6.2 Regressing out explanatory variables

In the section above we used the plotQC() function to identify the following technical variables that (marginally) explained a substantial proportion of the variance for many genes: * log10_counts_endogenous_features * pct_counts_top_100_features * pct_counts_feature_controls. The normaliseExprs() function enables such explanatory variables to be regressed out of the expression values for downstream analyses. (NB: in some settings total_features may also be considered a techinical variable, but here it could reflect real biology, so we do not regress it out.)

The residuals of a linear regression are returned to norm_exprs(object). Here, the exprs values in the SCESet are assumed to be on a log scale - if they are not, then this regression approach may not work reliably. For similar reasons, norm_counts, norm_tpm, norm_cpm and norm_fpkm values are not affected by regressing out a set of explanatory variables. As using regression residuals is unreliable for these strictly non-negative quantities, they are only affected by size-factor normalisation.

To regress out a set of explanatory variables, we simply need to supply a “design matrix” to normaliseExprs that contains the matrix of values for the variables we wish to regress out. This is easy to construct in R with the built-in model.matrix function. Here we regress out the four explanatory variables listed above, combined with the same upper-quartile size-factor normalisation as above. To retain all normalised expression values, here we save the normalised expression residuals to a new object (norm_exprs_resid) and then add these values to a slot with the same name in the SCESet object:

design <- model.matrix(~sce_zeisel_qc$log10_counts_endogenous_features +
                           sce_zeisel_qc$pct_counts_top_100_features +
                           sce_zeisel_qc$pct_counts_feature_controls)
# sce_zeisel_qc <- normaliseExprs(sce_zeisel_qc, design = design,
#                                 method = "upperquartile", 
#                                 feature_set = endog_genes, p = 0.99)
sce_zeisel_qc <- normaliseExprs(sce_zeisel_qc, design = design,
                                method = "none", exprs_values = "exprs",
                                feature_set = endog_genes, 
                                return_norm_as_exprs = FALSE)
set_exprs(sce_zeisel_qc, "norm_exprs_resid") <- norm_exprs(sce_zeisel_qc)
smoothScatter(exprs(sce_zeisel_qc), get_exprs(sce_zeisel_qc,
                                                   "norm_exprs_resid"),
              xlab = "Size-factor normalised expression values",
              ylab = "Size-factor normalised expression residuals")
abline(0, 1, col = "firebrick", lty = 2)

As the smoothed scatter plot above shows, regressing out the explanatory variables does alter expression levels quite substantially, because there are several technical factors that have large effects on expression.

As previously, we can investigate the affects of normalisation on large-scale cell population structure.

## subset again so that only endogenous genes are used
plt_pca_endog_norm_resid <- plotPCA(
    sce_zeisel_qc, exprs_values = "norm_exprs_resid", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue") + 
    ggtitle("PCA - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
#multiplot(plt_pca_endog_norm, plt_pca_endog_norm_resid, cols = 2)
plt_pca_endog_norm_resid

Earlier PCA plots suggested that technical factors were strongly affecting the PCA results. The plot above confirms this - when important technical variables are regressed out, the cells of different level 1 clases group together more closely when looking at the first two principal components.

p1 <- plotQC(sce_zeisel_qc, type = "find-pcs", 
             variable = "pct_counts_feature_controls", 
             exprs_values = "norm_exprs_resid")
p2 <- plotQC(sce_zeisel_qc, type = "find-pcs", variable = "level1class",
             exprs_values = "norm_exprs_resid") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
plot_grid(p1, p2, ncol = 2, labels = letters[1:2])

The QC plot above shows that PCs no longer tag total_features, and that later PCs (2, 4, 5, 7) capture more signal for differences in cell type.

## subset again so that only endogenous genes are used
plt_pca_endog_norm_resid_n7 <- plotPCA(
    sce_zeisel_qc, exprs_values = "norm_exprs_resid", ncomponents = 7,
    colour_by = "level1class", shape_by = "tissue") + 
    ggtitle("PCA - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
#multiplot(plt_pca_endog_norm, plt_pca_endog_norm_resid, cols = 2)
plt_pca_endog_norm_resid_n7

It is still eminently possible to distinguish between cell types, but with strong technical effects (which here appear confounded with cell type to a substantial extent) the later PCs are required to separate cell types with PCA.

## subset again so that only endogenous genes are used
plt_tsne_endog_norm_resid <- plotTSNE(
    sce_zeisel_qc, exprs_values = "norm_exprs_resid", 
    size_by = "total_features",  colour_by = "level1class", shape_by = "tissue",
    random_seed = 20160203) + 
    ggtitle("t-SNE - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
#multiplot(plt_tsne_endog_norm, plt_tsne_endog_norm_resid, cols = 2)
plt_tsne_endog_norm_resid

plt_tsne_endog_norm_resid2 <- plotTSNE(
    sce_zeisel_qc, exprs_values = "norm_exprs_resid", 
    size_by = "total_features",  colour_by = "level2class", shape_by = "tissue",
    random_seed = 20160203) + 
    ggtitle("t-SNE - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
plt_tsne_endog_norm_resid2

The user will need to make a decision for each dataset on the value of regressing out explanatory variables for their analysis. The example here highlights a tricky situation in which technical sources of variation seem to be confounded with biology of interest to a substantial effect.

NB: Users who prefer North American to British/Australian spelling can normalize their data with normalizeExprs(), which is exactly the same as normaliseExprs().

6.3 Customised normalisation approaches

“Customised” normalisation approaches can be easily incorporated into the scater workflow, because it is very easy to add data to an SCESet obejct with the set_exprs function. In a subsequent scater tutorial we provide an example in which we normalise and standardise counts conditioned on expression level. One advantage of this approach is that a biologically ‘noisy’ gene is naturally defined as one with greater dispersion than other genes at a similar expression level. In the normalised data these are genes with a mean absolute deviation greater than 1.

Other sophisticated normalisation approaches can be attempted, such as those implemented in the BASiCS package. Alternative normalisation approaches can easily be incorporated into a scater workflow.

Thus, after convenient pre-processing, QC and normalisation with , the data are well organised (with feature and cell metadata and many data transformations), clean and tidy, and are ready for further statistical modeling and analysis.

7 Downstream analysis

Expression data at a single-cell resolution not only allows testing for differential expression, but exploring how this is dependent on sub-types of cells and/or how genes coexpress with each other across cells. As with normalisation, single-cell analysis methodology is an area in its infancy and deserving of discussion that is beyond this case study. Here we simply demonstrate how QC’ed and normalised data contained within an SCESet object allows for easy downstream interrogation. Let’s assume we are interested in defining differential expression as change in expression frequency. This can be done with a standard generalised linear model and the qvalue package to control false discovery rates.
Here, we simply test whether the expression frequency of each gene for endothelial-mural cells is different from that of astrocytes-ependymal cells (which is the default baseline in this model).

library("qvalue")
celltype <- sce_zeisel_qc$level1class
de <- data.frame(t(apply(1 * is_exprs(sce_zeisel_qc), 1, # test change in expression frequency 
                         function(y) coef( 
                             summary(glm(y ~ celltype, family = "binomial")))[2, c(1, 4)])), 
                 check.names = FALSE )
de$qvalue <- qvalue(de[, "Pr(>|z|)"], fdr.level = 0.05)$qvalues # control for false disovery rate
de <- de[order(de$qvalue, decreasing = FALSE ), ] # order by global statistical evidence
datatable(de[1:1000,])

In scater the plotExpression function enables the convenient visualisation of expression values for a set of features. Here, the expression values for the six most DE genes for expression frequency between cell types are shown. The units for expression in the plot can be defined with the exprs_values argument (the expression values must exist with the provided name in the assayData slot of the SCESet object; if not the default exprs values will be used, with a warning). As with other plots in scater we can use phenotype data variables to define the colour and shape of the points.

plotExpression(sce_zeisel_qc, rownames(de)[1:6], x = "level1class", ncol = 3,
               colour_by = "level1class") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

plotExpression(sce_zeisel_qc, tail(rownames(de), n = 6), exprs_values = "norm_exprs", 
               x = "level1class", colour_by = "level1class", 
               shape_by = "tissue", ncol = 3) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

As one would hope, the genes with the least evidence for differential expression frequency are spike-in genes.

The scater package also makes it easy to investigate correlations between genes and cells by supporting cell-cell and gene-gene distances in an SCESet object. Below we show how cell-cell distances can be defined and then how these distances can easily be used to produce multidimensional-scaling (MDS) plots to investigate the structure between cells.

cellDist(sce_zeisel_qc) <- (1 - cor(norm_exprs(sce_zeisel_qc), 
                                    method = "spearman")) / 2

plotMDS(sce_zeisel_qc, size_by = "total_features",  colour_by = "level1class", 
        shape_by = "tissue") + 
    ggtitle("MDS plot based on spearman correlation")

With the plotMDS function the user can apply their favourite way of defining the distance between cells to then plot the cells in a low-dimensional space. The plot above suggests that Spearman correlation is not the best distance metric for single-cell RNA-seq data unless you are particularly interested in separating cells by total number of detected genes. Other metrics may work substantially better.

Finally, let us save the QC’d version of the data that can be used for any subsequent analyses.

save(sce_zeisel_qc, file = "zeisel_sce_qc.RData", compress = "xz", 
     compression_level = 9)

8 Technical stuff

Scater has been tested on Mac OS X and Linux environments and requires the R packages:

Biobase
BiocGenerics
ggplot2
methods

and imports the packages:

biomaRt
data.table
dplyr
edgeR
grid
limma
matrixStats
plyr
reshape2
rhdf5
rjson
viridis

This case study was run using the following platform and R package versions:

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] qvalue_2.4.0        scran_1.0.0         BiocParallel_1.6.0 
##  [4] knitr_1.13          DT_0.1              cowplot_0.6.2      
##  [7] data.table_1.9.6    scater_1.1.2        ggplot2_2.1.0      
## [10] Biobase_2.32.0      BiocGenerics_0.18.0
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.10            minqa_1.2.4           colorspace_1.2-6     
##   [4] mvoutlier_2.0.6       RcppEigen_0.3.2.8.1   rjson_0.2.15         
##   [7] class_7.3-14          dynamicTreeCut_1.63-1 pls_2.5-0            
##  [10] proxy_0.4-15          MatrixModels_0.4-1    cvTools_0.3.2        
##  [13] AnnotationDbi_1.34.0  mvtnorm_1.0-5         splines_3.3.0        
##  [16] sROC_0.1-2            tximport_1.0.0        robustbase_0.92-5    
##  [19] Formula_1.2-1         jsonlite_0.9.20       nloptr_1.0.4         
##  [22] robCompositions_2.0.0 pbkrtest_0.4-6        cluster_2.0.4        
##  [25] shinydashboard_0.5.1  shiny_0.13.2          rrcov_1.3-11         
##  [28] assertthat_0.1        Matrix_1.2-6          lazyeval_0.1.10      
##  [31] limma_3.28.4          formatR_1.4           acepack_1.3-3.3      
##  [34] htmltools_0.3.5       quantreg_5.21         tools_3.3.0          
##  [37] igraph_1.0.1          gtable_0.2.0          reshape2_1.4.1       
##  [40] dplyr_0.4.3           Rcpp_0.12.4.5         sgeostat_1.0-27      
##  [43] nlme_3.1-128          lmtest_0.9-34         stringr_1.0.0        
##  [46] lme4_1.1-12           mime_0.4              XML_3.98-1.4         
##  [49] edgeR_3.14.0          DEoptimR_1.0-4        zlibbioc_1.18.0      
##  [52] MASS_7.3-45           zoo_1.7-13            scales_0.4.0         
##  [55] VIM_4.4.1             rhdf5_2.16.0          SparseM_1.7          
##  [58] RColorBrewer_1.1-2    yaml_2.1.13           gridExtra_2.2.1      
##  [61] biomaRt_2.28.0        rpart_4.1-10          reshape_0.8.5        
##  [64] latticeExtra_0.6-28   stringi_1.0-1         RSQLite_1.0.0        
##  [67] highr_0.6             S4Vectors_0.10.0      pcaPP_1.9-60         
##  [70] e1071_1.6-7           destiny_1.2.0         chron_2.3-47         
##  [73] matrixStats_0.50.2    bitops_1.0-6          evaluate_0.9         
##  [76] lattice_0.20-33       htmlwidgets_0.6       labeling_0.3         
##  [79] GGally_1.0.1          plyr_1.8.3            magrittr_1.5         
##  [82] R6_2.1.2              IRanges_2.6.0         Hmisc_3.17-4         
##  [85] DBI_0.4-1             foreign_0.8-66        mgcv_1.8-12          
##  [88] survival_2.39-4       scatterplot3d_0.3-36  RCurl_1.95-4.8       
##  [91] sp_1.2-3              nnet_7.3-12           car_2.1-2            
##  [94] KernSmooth_2.23-15    rmarkdown_0.9.6       viridis_0.3.4        
##  [97] grid_3.3.0            FNN_1.1               vcd_1.4-1            
## [100] digest_0.6.9          xtable_1.8-2          httpuv_1.3.3         
## [103] stats4_3.3.0          munsell_0.4.3

Single-cell dataset: Zeisel mouse cortex

Davis McCarthy

2016-05-12