1 Investigating mouse mesoderm differentiation data

This analysis requires scater, as well as the following CRAN packages:

cowplot
DT
knitr
RBGL
dynamicTreeCut
RColorBrewer
gplots

and the following Bioconductor packages:

scran

and the dpt package can be installed from Fabian Theis’s webpage:

install.packages('devtools')
setRepositories(ind = 1:2)
devtools::install_url(paste0('https://www.helmholtz-muenchen.de/fileadmin/ICB/software/',
                             c('destiny/destiny_1.3.4.tar.gz', 'dpt_0.6.0.tar.gz')))

We will demonstrate the pre-processing of the data from Scialdone et al, 2016 with scater and scran, and explore the cells with visualisation tools in scater and with pseudotime estimation methods from dpt.

library(scater)
library(ggplot2)
library(cowplot)
library(DT)
library(knitr)
library(scran)
opts_chunk$set(cache = FALSE, warning = FALSE)

2 Load the Scialdone et al mouse mesoderm data

The mouse mesoderm data from Scialdone et al, 2016 have been made available as gene-level counts at Cambridge University Stem Cells website.

We first need to download the data (counts.gz and unzip to counts.txt) and the cell metadata (metadata.txt) to the directory containing this Rmd file and read in the data and metadata from those files.

First load the data for metadata for mRNA genes:

## download mRNA data
mrna_meta <- read.delim("metadata.txt", sep = " ")
rownames(mrna_meta) <- mrna_meta$cellName
datatable(mrna_meta[1:6,])

Now load the count data for the mRNA genes:

## read in data
mrna <- read.delim("counts.txt", sep = " ", nrows = 41389)
datatable(mrna[1:20, 1:4])

Combine data and metadata to form an SCESet object:

## combine expression values
pd <- new("AnnotatedDataFrame", mrna_meta)
fd <- data.frame(ensembl_gene_id = rownames(mrna))
rownames(fd) <- rownames(mrna)
fd <- new("AnnotatedDataFrame", fd)
sce_scialdone_raw <- newSCESet(countData = mrna, phenoData = pd, 
                               featureData = fd)
sce_scialdone_raw <- getBMFeatureAnnos(
    sce_scialdone_raw, filters = "ensembl_gene_id",
    attributes = c("ensembl_gene_id", "mgi_symbol",
                   "chromosome_name", "gene_biotype"), 
    feature_symbol = "mgi_symbol",
    feature_id = "ensembl_gene_id", biomart = "ENSEMBL_MART_ENSEMBL",
    dataset = "mmusculus_gene_ensembl", host = "www.ensembl.org")
featureNames(sce_scialdone_raw) <- paste(
    featureNames(sce_scialdone_raw), 
    fData(sce_scialdone_raw)$mgi_symbol, sep = "_")
sce_scialdone_raw

## SCESet (storageMode: lockedEnvironment)
## assayData: 41388 features, 1205 samples 
##   element names: counts, exprs 
## protocolData: none
## phenoData
##   sampleNames: SLX.8343.N701_N501 SLX.8343.N701_N502 ...
##     SLX.9562.N712_N517 (1205 total)
##   varLabels: cellName embryoStage ... ssc (7 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001_Gnai3 ENSMUSG00000000003_Pbsn
##     ... ENSMUSG00000102132_Gm21813 (41388 total)
##   fvarLabels: ensembl_gene_id mgi_symbol ... feature_id (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Now we have an SCESet object containing the expression data with cell and gene metadata combined.

This data were produced using read alignment with GSNAP and gene-level quantification with HTSeq, so cpm (counts-per-million) should be the correct units for analysis.

3 Calculate QC metrics

Calculate QC metrics with scater’s calculateQCMetrics function. It is helpful to define “feature controls”, that is features (here genes) that can be used for QC purposes. Typically, it is helpful to define ERCC spike-ins and mitochondrial genes as feature controls. However, the Scialdone data from the website have already been QC’d, so the 1205 cells for which we have data here are those that passed the QC filters described in the Scialdone et al (2016) paper. Here we show computing QC metrics with the mitochondrial genes identified as a set of control features (the ERCC spike-in gene expression values are not present):

mt_genes <- grep("mt-", fData(sce_scialdone_raw)$mgi_symbol)
sce_scialdone_raw <- calculateQCMetrics(
    sce_scialdone_raw, feature_controls = list(mt = mt_genes))

For this data set the original authors classified the cells into the broad cell categories shown in the output below, based on embryo stage and cell surface markers.

sce_scialdone_raw$cellCategory <- factor(
    sce_scialdone_raw$cellCategory, 
    levels = c("E6.5", "PS-Flk1+", "NP-Flk1+", "NP-CD41+Flk1+", "NP-CD41+Flk1-",
               "HF-Flk1+", "HF-CD41+Flk1+", "HF-CD41+Flk1-"))
table(sce_scialdone_raw$cellCategory)

## 
##          E6.5      PS-Flk1+      NP-Flk1+ NP-CD41+Flk1+ NP-CD41+Flk1- 
##           501           138           159            55            45 
##      HF-Flk1+ HF-CD41+Flk1+ HF-CD41+Flk1- 
##           139            78            90

In the paper, they use t-SNE plots to visualize the population structure of the cells. We can replicate that with the default t-SNE plot easily with the scater function plotTSNE:

plotTSNE(sce_scialdone_raw, colour_by = "cellCategory")

By default, plotTSNE uses the 500 most variable genes in the dataset. In the plot above, we can easily distinguish the E6.5 cells from the others, just as Scialdone et al presented in their paper.

We will explore the cell types throughout the QC procedures described subsequently, and particularly as we subset the genes to those more relevant for the analysis we should see clearer structure in the cell populations.

4 QC the genes

Scater allows you to set minimum QC thresholds for a gene to be considered sufficiently expressed in your downstream analysis. Here, we enforce that a gene must have least five counts in at least 45 cells (as this is the smallest group of cells as identified by Scialdone et al). The rationale behind this is that we would want to keep a gene if it was expressed in just one group of cells, but we don’t want genes with very sparse expression overall.

keep_gene <- rowMeans(counts(sce_scialdone_raw)) >= 1
fData(sce_scialdone_raw)$use <- keep_gene 
fData(sce_scialdone_raw)$use[mt_genes] <- FALSE

We retain 14246 genes for the analysis and drop 27142 lowly-expressed genes from the analysis.

It can be useful to plot gene expression frequency versus mean expression level to assess the effects of technical dropout in the dataset. We fit a non-linear least squares curve for the relationship between expression frequency and mean expression and use this to define the number of genes above high technical dropout and the numbers of genes that are expressed (here defined as at least 1 count) in at least 50% and at least 25% of cells. A subset of genes to be treated as feature controls can be specified, otherwise any feature controls previously defined are used.

plotQC(sce_scialdone_raw, type = "exprs")

The plotQC() function provides several useful QC plots, such as the example below that considers the the number of reads consumed by the top 50 expressed genes. Aside from spike-ins, these are typically mitochondrial and housekeeping genes. Here, as with most single-cell experiments, a large proportion of reads are being are taken up by uninteresting biology.

plotQC(sce_scialdone_raw[fData(sce_scialdone_raw)$use, ], 
       type = "highest-expression", col_by_variable = "cellCategory" )

As well as the expected mitochondrial genes among the most expressed, we see Actb, involved in cell motility, structure and integrity.

5 QC the cells

A useful first step is flagging/failing poorly performing cells. This can be done from the sample meta-data using the automated QC metrics generated above,
any additional sequencing metrics from sequencing aligner/mapping software,
and additional cell phenotypes such as from imaging. For the sake of demonstration, here we focus on four metrics. Others you may want to consider are % reads mapped to mitochondrial genes, library PCR duplication rate,
and mean sequencing bias per cell.

plotPhenoData() can be used to explore specific sample meta-data values. For example, in the plots below we can see how different QC metrics distinguish the cells.

p1 <- plotPhenoData(sce_scialdone_raw, aes_string(x = "log10_total_counts", 
                                  y = "pct_counts_feature_controls_mt", 
                                  colour = "cellCategory")) 
p2 <- plotPhenoData(sce_scialdone_raw, aes_string(x = "total_features", 
                                  y = "pct_counts_feature_controls_mt", 
                                  colour = "cellCategory")) 
plot_grid(p1, p2, labels = letters, nrow = 2)

plotPhenoData(sce_scialdone_raw, aes_string(x = "total_features", 
                                  y = "pct_counts_feature_controls", 
                                  colour = "filter_on_pct_counts_feature_controls",
                                  shape_by = "filter_on_total_counts")) +
    geom_vline(xintercept = 1000, linetype = 2)

Look at cumulative expression plot for these cells. Theis plot indicates that there are a few cells with low complexity libraries (e.g. top 100 genes accounting for more than 50% of counts) that could be filtered out.

plot(sce_scialdone_raw, exprs_values = "counts", block1 = "embryoStage",
     colour = "cellCategory", ncol = 2)

Take a look at a PCA plot using QC metrics to see if any cells are flagged as outliers.

sce_scialdone_raw <- plotPCA(sce_scialdone_raw, size_by = "total_features", 
                             shape_by = "embryoStage", pca_data_input = "pdata",
                             detect_outliers = TRUE, return_SCESet = TRUE)

## sROC 0.1-2 loaded

## The following cells/samples are detected as outliers:
## SLX.8343.N701_N502
## SLX.8343.N710_N501
## SLX.8344.N701_N502
## SLX.8344.N701_N508
## SLX.8344.N703_N504
## SLX.8344.N705_N501
## SLX.8344.N705_N503
## SLX.8345.N701_N501
## SLX.8345.N701_N502
## SLX.8345.N701_N503
## SLX.8345.N701_N506
## SLX.8345.N706_N501
## SLX.8345.N708_N501
## SLX.8408.N701_N502
## SLX.8408.N701_N503
## SLX.8408.N701_N504
## SLX.8408.N701_N506
## SLX.8408.N701_N507
## SLX.8408.N701_N508
## SLX.8408.N703_N507
## SLX.8408.N704_N507
## SLX.8408.N705_N505
## SLX.8408.N705_N507
## SLX.8408.N708_N502
## SLX.8409.N710_N501
## SLX.8411.N701_N504
## SLX.8411.N701_N505
## SLX.8411.N701_N506
## SLX.8411.N701_N507
## SLX.8411.N701_N508
## SLX.8411.N702_N504
## SLX.8411.N702_N505
## SLX.8411.N702_N508
## SLX.8411.N703_N501
## SLX.8411.N703_N506
## SLX.8411.N704_N502
## SLX.8411.N704_N504
## SLX.8411.N705_N505
## SLX.8411.N706_N504
## SLX.8411.N706_N505
## SLX.8411.N706_N508
## SLX.8411.N707_N505
## SLX.8411.N708_N503
## SLX.8411.N708_N507
## SLX.8411.N708_N508
## SLX.8411.N709_N503
## SLX.8411.N709_N506
## SLX.8411.N709_N507
## SLX.8411.N710_N508
## SLX.8411.N711_N508
## SLX.8411.N712_N507
## SLX.9559.N711_N502
## SLX.9560.N701_N502
## SLX.9561.N701_N502
## SLX.9561.N702_N504
## SLX.9561.N712_N517
## SLX.9562.N701_N517
## SLX.9562.N703_N508
## Variables with highest loadings for PC1 and PC2:
## 
##                                            PC1          PC2
## ---------------------------------  -----------  -----------
## pct_counts_top_100_features         -0.0441437   -0.6047886
## total_features                      -0.2308390    0.5078794
## pct_counts_feature_controls         -0.3634284   -0.5293013
## log10_counts_endogenous_features    -0.4709040    0.2915054
## n_detected_feature_controls         -0.4755686    0.0461208
## log10_counts_feature_controls       -0.6039606   -0.0950100

This QC check flags 58 cells as outliers from the multivariate outlier detection in the PCA.

In addition to outliers, we prefer to use cells that have a good coverage of detected genes. After gene filtering, we demand that cells have detectable expression for at least 30% of retained genes.

keep_cell <- (
    colSums(counts(sce_scialdone_raw)[fData(sce_scialdone_raw)$use,] > 5) > 
        0.3 * sum(fData(sce_scialdone_raw)$use))
sce_scialdone_raw$keep_cell_cov <- keep_cell

After this procedure we retain 0 cells and drop 68 cells.

Use QC metrics to select a subset of cells for use.

Now we will proceed to filtering out potentially problematic cells. We apply the following criteria for filtering:

total counts > 50,000
total features > 5,000
% dropout < 80%
% counts from MT genes < 20%
% counts from top 200 features < 60%
cell not identified as outlier above.

sce_scialdone_filt <- filter(sce_scialdone_raw, total_counts > 5e04, 
                             pct_dropout < 80, !outlier, keep_cell_cov,
                             pct_counts_feature_controls_mt < 20,
                             pct_counts_top_100_features < 50, 
                             total_features > 5000)
sce_scialdone_raw$use <- (cellNames(sce_scialdone_raw) %in% 
                              cellNames(sce_scialdone_filt))
dim(sce_scialdone_filt)

## Features  Samples 
##    41388     1137

This would lead us to drop a further 68 cells from this dataset.

Box plots aren’t particularly useful for visualising sparse data, so plot() applied to an SCESet object helps visualise all cells as a cumulative proportion of reads per cell. You can see from the plot below that all of the cells retained for analysis have relatively high-complexity libraries.

All of the cells with lower-complexity libraries (plus others) have been filtered out by our QC approach here.

plot(sce_scialdone_raw, block1 = "embryoStage", colour_by = "use", 
     exprs_values = "counts", ncol = 2)

The profiles for the remaining cells look very good.

plot(sce_scialdone_filt, exprs_values = "counts", block1 = "embryoStage",
     colour = "cellCategory", ncol = 2)

The plots above show that some (but certainly not all) of the cells we have opted not to use have a larger proportion of their library accounted for by a handful of very highly-expressed genes.

6 Filter lowly-expressed genes

We retain genes for analysis if they have average counts greater than 1. We will also recompute QC metrics after this filtering.

keep_gene <- (rowMeans(counts(sce_scialdone_filt)) >= 1)
sce_scialdone_filt <- sce_scialdone_filt[keep_gene,]
mt_genes <- grepl("^mt-", fData(sce_scialdone_filt)$mgi_symbol)
sce_scialdone_filt <- calculateQCMetrics(
    sce_scialdone_filt, feature_controls = list(MT = mt_genes))
sce_scialdone_filt <- 
    sce_scialdone_filt[!fData(sce_scialdone_filt)$is_feature_control,]

7 Investigate explanatory variables

Once you have filtered cells and genes, a next step is to explore technical drivers of variability in the data to inform data normalisation before downstream analysis.

Experimental design is a critical, but neglected, aspect of sc-RNA-seq studies. To the best of our knowledge, methods like those described in this section for exploring experimental and QC variables and the experimental design, do not feature in any scRNA-seq software packages apart from scater. There are a very large number of potential confounders, artifacts and biases in sc-RNA-seq studies. Exploring the effects of such explanatory variables (both those recorded during the experiment and computed QC metrics) is crucial for appropriate modeling of the data. The scater package provides a set of methods specifically for quality control of experimental and explanatory variables, which will be demonstrated briefly here.

Using the plotPCA() function we can see that principal component one appears to be driven by differences in number of genes detected (“total features”). Differences in number of detected genes is a common driver of cell clustering and can be result of biology (e.g. different cell types, cell cycle). Indeed, in the PCA here we see that the endothelial-mural and astrocyte-ependymal cells have typically many fewer features than the other cell types.

However, number of genes detected often has a strong technical component due to variably recovered RNA, reverse transcription, or library amplification. Its effect can also be notably non-linear, affecting low expressed and high expressed genes differently. The plotQC() function can be used to explore the the marginal % variance explained (per gene) of the various technical factors. In the second plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.

# we can easily subset to only plot genes and cells of interest
plotPCA(sce_scialdone_filt, colour_by = "cellCategory", 
        size_by = "total_features" )

One can also easily produce plots to identify principal components that correlate with experimental and QC variables of interest. The function plotQC with the option type = "find-pcs" ranks the principal components in decreasing order of R-squared from a linear model regressing PC value against the variable of interest. The default behaviour is to show the relationships between the variable of interest and the six principal components with the strongest relationship to the variable (as measured by R-squared). This works both for continous and categorical variables. This type of plot can indicate which explanatory variables may be driving differences between cells as detected by PCA and highlight which PCs are associated with the variable. The “total features” variable shows very strong correlation with both principal components 1 and 2.

p1 <- plotQC(sce_scialdone_filt, type = "find-pcs", variable = "total_features")
p2 <- plotQC(sce_scialdone_filt, type = "find-pcs", variable = "cellCategory" ) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
plot_grid(p1, p2, ncol = 2, labels = letters)

Very clearly, the first and second principal components here are driven by the different cell types, while total features is most strongly correlated with just the seventh principal component. As we would hope after rigorous QC, biological signal overwhelms technical effects.

The relative importance of different explanatory variables can be explored with some of the plotQC function options. Supplying the type = "expl" argument to plotQC computes the marginal R-squared for each variable in the when fitting a linear model regressing expression values for each gene against just that variable, and displays a density plot of the gene-wise marginal R-squared values for the variables. The default approach looks at all variables in the slot of the object and plots the top nvars_to_plot variables (default is 10).

Alternatively, one can choose a subset of variables to plot in this manner, which we do here. The density curves for marginal R-squared show the relative importance of different variables for explaining variance in expression between cells. In the plot we can see that it’s not unusual for gene coverage to explain more than 10% of the expression variance of a gene.

plotQC(sce_scialdone_filt, type = "explanatory-variables", 
        variables = c("pct_counts_top_100_features", "total_features", 
                      "pct_counts_feature_controls", "cellCategory", 
                      "embryoStage", "embryo",
                      "n_detected_feature_controls", 
                      "log10_counts_endogenous_features",
                      "log10_counts_feature_controls") )

This analysis indicates that the biological factors embryo, cellCategory and embryoStage have the most substantial (marginal) explanatory power for many genes. The technical effects have less substantial effects, with the strongest being total_features (number of expression genes). We may wish to regress out some of the technical effects as part of a normalization step, but it may not be vital for these data, where the biological signal is very strong.

The plotQC() function can also be used to produce a pairs plot of explanatory variables (with the same calls as above, but with method = "pairs"). The plot below shows this use case for looking at the % counts from the top 100 most-expressed genes, the total number of expressed genes, the % of counts from feature controls, the number of detected feature controls, the number of counts (on the log-10 scale) from endogenous features, the number of counts (log-10 scale) from feature controls and sample type. The explanatory variables are ordered by the median R-squared of the variable across all genes, and this value is reported on the plot. This type of plot is also useful for finding correlations between experimental and QC variables with substantial explanatory power.

plotQC(sce_scialdone_filt, 
        type = "expl", method = "pairs", 
        variables = c("pct_counts_top_100_features", "total_features", 
                      "pct_counts_feature_controls", "embryo", "embryoStage",
                      "n_detected_feature_controls", 
                      "log10_counts_endogenous_features",
                      "log10_counts_feature_controls", "cellCategory"),
        theme_size = 6)

The plot above shows that the variables embryo, embryoStage and cellCategory are strongly correlated (for obvious reasons). We can see that variables like log10_counts_feature_controls and pct_counts_features_controls are correlated. Here, the correlated variables are expected, but this plot can be very useful for uncovering variables with substantial explanatory power that may not be obviously correlated.

8 Normalise the data

After important explanatory variables have been identified with the tools shown above, their effects can be accounted for in subsequent statistical models, or they can be conditioned out using normaliseExprs(), if so desired. If a design matrix incorporating a selection of explanatory variables is supplied as an argument to normaliseExprs, then normalised expression values returned for each feature will be the residuals from a linear model fitted with the design matrix, after any size-factor normalisation has been applied to the expression data.

Normalising single-cell RNA-seq data is a topic in its infancy, but many of the basic principles still apply. How much you choose to initially correct for technical factors depends on your question of interest and the ease with which they can be accounted for in downstream models.

8.1 Size-factor normalisation

In scater it is easy to perform size-factor normalisation using the recent method from, Lun et al (2016). Their size-factor normalisation method is specifically designed for scRNA-seq data and method performs much better on single-cell data. Its implementation in the Bioconductor package scran allows seamless integration into the scater workflow. (The scran package itself depends on scater).

We recommend using the scran size-factor normalisation approach and demonstrate its use here.

The code below demonstrates how to carry out size-factor normalisation on a subsetted SCESet object, normalising either to ERCC spike-in genes (not shown) or to endogenous genes. Under certain circumstances either may be appropriate.

We do not have ERCC spikes in this dataset, so we cannot use the spike normalisation methods in scran.

## size factor normalisation with scran 
qclust <- quickCluster(sce_scialdone_filt)
sce_scialdone_qc <- computeSumFactors(sce_scialdone_filt, clusters = qclust)
summary(sce_scialdone_qc$size_factor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1477  0.6633  0.9053  1.0000  1.2320  4.8920

sce_scialdone_qc <- normalise(sce_scialdone_qc)

Here normalisation using size factors using the computeSumFactors function from the scran package is undertaken. The quickCluster method from scran was used to obtain a rough clustering of cells to guide the size-factor computation step.

Looking at a t-SNE plot using the normalised data yields perhaps slightly more distinct cell-type clusters, although the effect here is subtle.

plotTSNE(sce_scialdone_qc, exprs_values = "exprs", ntop = 1000,  
         colour_by = "cellCategory", shape_by = "embryoStage",
         random_seed = 20160203) + 
    ggtitle("t-SNE - endogenous size-factor normalisation")

8.2 Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training data set, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test data set can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function using a pre-trained set of marker pairs for human data.

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
                                package = "scran"))
# library(org.Mm.eg.db)
# anno <- select(org.Mm.eg.db, keys = fData(sce_scialdone_qc)$mgi_symbol, 
#                keytype = "SYMBOL", column = "ENSEMBL")
# ensembl <- anno$ENSEMBL[match(fData(sce_scialdone_qc)$mgi_symbol, anno$SYMBOL)]
register(SerialParam())
assignments <- cyclone(sce_scialdone_qc, mm.pairs, 
                       gene.names = fData(sce_scialdone_qc)$ensembl_gene_id)
sce_scialdone_qc$G1_score <- assignments$score$G1
sce_scialdone_qc$G2M_score <- assignments$score$G2M
sce_scialdone_qc$cell_cycle_phase <- assignments$phases
plotPhenoData(sce_scialdone_qc, aes(G1_score, G2M_score))

ggplot(pData(sce_scialdone_qc), aes(G1_score, G2M_score)) + geom_hex()

The cell cycle phase inference from cyclone looks reasonable here as we see most cells getting either a very low G1 score or a very low G2/M score (or both), which means that all of the cells can be assigned to a cell cycle phase with high confidence.

8.3 Regressing out explanatory variables

In the section above we used the plotQC() function to identify the following technical variables that (marginally) explained a moderate but appreciable proportion of the variance for many genes: * log10_counts_endogenous_features * pct_counts_top_100_features * pct_counts_feature_controls. The normaliseExprs() function enables such explanatory variables to be regressed out of the expression values for downstream analyses. (NB: in some settings total_features may also be considered a techinical variable, but here it could reflect real biology, so we do not regress it out.)

The residuals of a linear regression are returned to norm_exprs(object). Here, the exprs values in the SCESet are assumed to be on a log scale - if they are not, then this regression approach may not work reliably. For similar reasons, norm_counts, norm_tpm, norm_cpm and norm_fpkm values are not affected by regressing out a set of explanatory variables. As using regression residuals is unreliable for these strictly non-negative quantities, they are only affected by size-factor normalisation.

To regress out a set of explanatory variables, we simply need to supply a “design matrix” to normaliseExprs that contains the matrix of values for the variables we wish to regress out. This is easy to construct in R with the built-in model.matrix function. Here we regress out the G1 and G2M scores computed with cyclone above and then save the normalised expression residuals to a new slot (norm_exprs_resid) in the SCESet object:

design <- model.matrix(~sce_scialdone_qc$G1_score +
                           sce_scialdone_qc$G2M_score)

sce_scialdone_qc <- normaliseExprs(sce_scialdone_qc, design = design,
                                method = "none", exprs_values = "exprs",
                                return_norm_as_exprs = FALSE)
set_exprs(sce_scialdone_qc, "norm_exprs_resid") <- norm_exprs(sce_scialdone_qc)
smoothScatter(exprs(sce_scialdone_qc), get_exprs(sce_scialdone_qc,
                                                   "norm_exprs_resid"),
              xlab = "Size-factor normalised expression values",
              ylab = "Size-factor normalised expression residuals")
abline(0, 1, col = "firebrick", lty = 2)

As the smoothed scatter plot above shows, regressing out the cell cycle scores does alter expression levels quite substantially.

We can investigate the effects of using normalisation residuals as opposed to just size-factor normalisation on large-scale cell population structure using PCA.

## subset again so that only endogenous genes are used
p1 <- plotPCA(sce_scialdone_qc, exprs_values = "exprs", 
              colour_by = "cellCategory", shape_by = "embryoStage") + 
    ggtitle("PCA - endogenous size-factor normalisation") + 
    theme(legend.position = "bottom")
p2 <- plotPCA(sce_scialdone_qc, exprs_values = "norm_exprs_resid", 
    colour_by = "cellCategory", shape_by = "embryoStage") + 
    ggtitle("PCA - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
plot_grid(p1, p2, labels = letters)

Thus, regressing out these technical factors brings the cells closer together in PCA-space.

## subset again so that only endogenous genes are used
p1 <- plotTSNE(sce_scialdone_qc, exprs_values = "exprs", ntop = 1000,
              colour_by = "cellCategory", shape_by = "embryoStage",
              rand_seed = 20160727) + 
    ggtitle("tSNE - endogenous size-factor normalisation") + 
    theme(legend.position = "bottom")
p2 <- plotTSNE(sce_scialdone_qc, exprs_values = "norm_exprs_resid", 
    colour_by = "cellCategory", shape_by = "embryoStage", ntop = 1000,
    rand_seed = 20160727) + 
    ggtitle("tSNE - endogenous size-factor normalisation residuals") + 
    theme(legend.position = "bottom")
plot_grid(p1, p2, labels = letters)

The t-SNE plots are not drastically affected by using residuals compared with size-factor normalised expression values. This is encouraging, as it suggests that the cell-type differences in expression profiles are much stronger than differences between cells due to nuisance effects like the cell cycle.

The QC plot below shows that PCs no longer tag cell_cycle_phase, but that PCs 1-5 are still strongly explained by cellCategory.

p1 <- plotQC(sce_scialdone_qc, type = "find-pcs", 
             variable = "cell_cycle_phase", 
             exprs_values = "norm_exprs_resid")
p2 <- plotQC(sce_scialdone_qc, type = "find-pcs", variable = "cellCategory",
             exprs_values = "norm_exprs_resid") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))
plot_grid(p1, p2, ncol = 2, labels = letters[1:2])

The user will need to make a decision for each dataset on the value of regressing out explanatory variables for their analysis. In the example here, regressing out cell cycle effects does not have a drastic impact, because the biological signal (differences between cell types) in the data is already the main source of variation between cells. Going forward, we will use the expression residuals, just to minimise any effects of cell cycle.

#exprs(sce_scialdone_qc) <- get_exprs(sce_scialdone_qc, "norm_exprs_resid")

NB: Users who prefer North American to British/Australian spelling can normalize their data with normalizeExprs(), which is exactly the same as normaliseExprs().

9 Identify HVGs from the normalized log-expression

We use methods from the scran package to identify highly variable genes (HVGs) from the normalized log-expression. This approach decomposes the biological and technical components of the variance by fitting a mean-variance trend to the endogenous genes, and using spike-in genes as diagnostic (due to their small number they are not used for fitting the trend).

The top HVGs are identified by ranking genes on their biological components. This can be used to prioritize genes for further investigation. In general, we consider a gene to be a HVG if it has a biological component of at least 1. For transformed expression values on the log2 scale, this means that gene expression will vary by at least 2-fold around the mean due to biological effects.

It is recommended to check the distribution of expression values for the top HVGs to ensure that the variance estimate is not dominated by one or two outlier cells.

We estimate highly-variable genes from expression values before regressing out the cell cycle scores (otherwise the distributional assumptions go awry for the variance).

var.fit <- trendVar(sce_scialdone_qc, trend = "loess", use.spikes = FALSE,
                    span = 0.2)
var.out <- decomposeVar(sce_scialdone_qc, var.fit)

plot(var.out$mean, var.out$total, pch = 16, cex = 0.6, 
     xlab = "Mean log-expression", ylab = "Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col = "dodgerblue", lwd = 2)

top.hvgs <- order(var.out$bio, decreasing = TRUE)
write.table(file = "scialdone_hvg.tsv", var.out[top.hvgs,], sep = "\t", 
            quote = FALSE, col.names = NA)
head(var.out[top.hvgs,])

##                               mean    total      bio     tech
## ENSMUSG00000031155_Pim2   5.023620 20.13079 16.65210 3.478687
## ENSMUSG00000047751_Utf1   4.028361 19.79861 15.03519 4.763422
## ENSMUSG00000045394_Epcam  5.096136 18.17101 14.79584 3.375171
## ENSMUSG00000023906_Cldn6  3.986524 19.22544 14.41776 4.807677
## ENSMUSG00000031297_Slc7a3 3.854211 17.64708 12.70564 4.941439
## ENSMUSG00000031074_Fgf3   3.913692 17.55565 12.67309 4.882553

examined <- top.hvgs[1:10]
all.names <- matrix(rownames(sce_scialdone_qc)[examined], 
                    nrow = length(examined), ncol = ncol(sce_scialdone_qc))
plotExpression(sce_scialdone_qc, examined, show_median = TRUE)

table(var.out$bio > var.out$tech)

## 
## FALSE  TRUE 
## 14004   276

table(var.out$bio > 1)

## 
## FALSE  TRUE 
## 12965  1315

n_hvgs <- sum(var.out$bio > 1)

There are 1315 genes with a biological component of the variance greater than 1. We will consider these to be the highly variable genes.

10 Identifying correlated gene pairs with Spearman’s rho

Another useful procedure is to identify the HVGs that are highly correlated with one another. This distinguishes between HVGs caused by random noise and those involved in driving systematic differences between subpopulations. Gene pairs with significantly large positive or negative values for Spearman’s rho are identified using the correlatePairs function. Note that we only apply this function for the top set of HVGs – doing so for all possible gene pairs would require too much computational time and may prioritize uninteresting genes that have strong correlations but low variance, e.g., tightly co-regulated house-keeping genes.

For this analysis we will use the normalised expression residuals as expression values.

set.seed(100)
var.cor <- correlatePairs(get_exprs(sce_scialdone_qc, "norm_exprs_resid"),
                          subset.row = top.hvgs[1:n_hvgs])
write.table(file = "scialdone_cor.tsv", var.cor, sep = "\t", 
            quote = FALSE, row.names = FALSE)
head(var.cor)

##                      gene1                        gene2       rho
## 1 ENSMUSG00000032494_Tdgf1 ENSMUSG00000095834_Tdgf1-ps1 0.9709433
## 2 ENSMUSG00000023906_Cldn6      ENSMUSG00000000303_Cdh1 0.9627381
## 3 ENSMUSG00000023906_Cldn6     ENSMUSG00000018569_Cldn7 0.9595003
## 4  ENSMUSG00000000303_Cdh1     ENSMUSG00000018569_Cldn7 0.9512579
## 5  ENSMUSG00000047751_Utf1     ENSMUSG00000023906_Cldn6 0.9452250
## 6  ENSMUSG00000047751_Utf1     ENSMUSG00000018569_Cldn7 0.9406871
##        p.value          FDR
## 1 1.999998e-06 4.750982e-06
## 2 1.999998e-06 4.750982e-06
## 3 1.999998e-06 4.750982e-06
## 4 1.999998e-06 4.750982e-06
## 5 1.999998e-06 4.750982e-06
## 6 1.999998e-06 4.750982e-06

sig.cor <- var.cor$FDR <= 0.05
summary(sig.cor)

##    Mode   FALSE    TRUE    NA's 
## logical  250646  613309       0

Larger sets of correlated genes can be assembled by treating genes as nodes in a graph and each pair of genes with significantly large correlations as an edge. In this manner, an undirected graph can be constructed using methods in the RBGL package. Highly connected subgraphs can then be identified and defined as gene sets. This provides a convenient summary of the pairwise correlations between genes.

library(RBGL)

## Loading required package: graph

g <- ftM2graphNEL(cbind(var.cor$gene1, var.cor$gene2)[sig.cor,], 
                  W = NULL, V = NULL, edgemode = "undirected")
# cl <- highlyConnSG(g)$clusters
# cl <- cl[order(lengths(cl), decreasing = TRUE)]
# cl <- cl[lengths(cl) > 2]
# length(cl)

Here we obtain just one large cluster, however.

Significant correlations provide evidence for substructure in the data set, i.e., subpopulations of cells with systematic differences in their expression profiles. The number of significantly correlated HVG pairs represents the strength of the substructure. If many pairs were significant, this would indicate that the subpopulations were clearly defined and distinct from one another.

The correlation results can also be used directly in follow-up experiments to verify if any substructure is present. This is done by using sets of correlated HVGs as markers in procedures such as fluorescence-activated cell sorting, immunohistochemistry or RNA flourescence in situ hybridization. In this manner, the existence of subpopulations with distinct expression patterns for the chosen HVGs can be experimentally validated. Negatively correlated pairs may be particularly useful as they provide more power to discriminate between subpopulations. In the simplest example, a subpopulation would be positive for one marker and negative for the other while the reverse would be true for a different subpopulation, thus allowing the two subpopulations to be easily distinguished.

11 Using correlated HVGs for further data exploration

For further analyses, we focus on the significantly correlated HVGs for which any substructure should be most pronounced. This may allow us to identify subpopulations that would have otherwise been masked by random noise in the expression profiles. (Even greater resolution can be achieved by looking at one or more subsets of correlated HVGs, identified above as the highly connected subgraphs. This focuses on specific aspects of biology that might have been otherwise masked by stronger effects in the data, e.g., metabolic state being masked by cell type. However, this requires some pre-existing knowledge to determine which genes are relevant and which subsets should be used.)

chosen <- unique(c(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))
fData(sce_scialdone_qc)$high_var_gene <- FALSE
fData(sce_scialdone_qc)[chosen, "high_var_gene"] <- TRUE
norm.exprs <- get_exprs(sce_scialdone_qc, "norm_exprs_resid")[chosen,, drop = FALSE]

We construct a simple dendrogram to group together cells with similar expression patterns across the chosen genes. Here, we cluster on Euclidean distances to provide greater sensitivity to differences in expression for low numbers of genes. Ward’s clustering criterion is used to minimize the total variance within each cluster.

library(igraph)
## dist from rho
gg <- graph_from_data_frame(var.cor, directed = FALSE)
corr_mat <- get.adjacency(gg, attr = "rho")
rho_dist <- as.dist((1 - corr_mat) / 2)
rho_tree <- hclust(rho_dist, method = "ward.D2")

## euclidean dist
cellDist(sce_scialdone_qc) <- dist(t(norm.exprs))
my.tree <- hclust(cellDist(sce_scialdone_qc), method = "ward.D2")

In addition, a tree cut can be used to explicitly define subpopulations of cells from the dendrogram. Note that some tuning of the cut height h is required to obtain satisfactory results for each data set.

library(dynamicTreeCut)
library(RColorBrewer)
library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

## cluster genes
rho.clusters <- unname(cutreeDynamic(rho_tree, distM = as.matrix(rho_dist), 
                                    verbose = 0))
heat.vals <- t(norm.exprs) - colMeans(norm.exprs)
clust.col <- brewer.pal(max(rho.clusters), "Set3")
heatmap.2(heat.vals, col = bluered, symbreak = TRUE, trace = 'none', 
          cexRow = 0.3, ColSideColors = clust.col[rho.clusters],
          Colv = as.dendrogram(rho_tree))

pdf("scialdone_heat_genes.pdf", width = 30, height = 30)
heatmap.2(heat.vals, col = bluered, symbreak = TRUE, trace = 'none', 
          cexRow = 0.3, ColSideColors = clust.col[rho.clusters],
          Colv = as.dendrogram(rho_tree))
dev.off()

## quartz_off_screen 
##                 2

We can visualize the constructed dendrogram with a heatmap. All expression values are mean-centred for each gene to highlight the relative expression between cells. We recommend storing the heatmap at a sufficiently high resolution so that the relevant genes can be easily identified for further examination.

## cluster cells - euclidean dist
my.clusters <- unname(cutreeDynamic(my.tree, 
                                    distM = as.matrix(cellDist(sce_scialdone_qc)),
                                    verbose = 0))
heat.vals <- norm.exprs - rowMeans(norm.exprs)
clust.col <- brewer.pal(max(my.clusters), "Set3")
heatmap.2(heat.vals, col = bluered, symbreak = TRUE, trace = 'none', 
          cexRow = 0.3, ColSideColors = clust.col[my.clusters],
          Colv = as.dendrogram(my.tree))

pdf("scialdone_heat_cells.pdf", width = 20, height = 40)
heatmap.2(heat.vals, col = bluered, symbreak = TRUE, trace = 'none', 
          cexRow = 0.3, ColSideColors = clust.col[my.clusters], 
          Colv = as.dendrogram(my.tree))
dev.off()

## quartz_off_screen 
##                 2

sce_scialdone_qc$hier_cluster <- my.clusters
sce_scialdone_qc$original_cluster <- as.integer(sce_scialdone_qc$cluster)

Finally, dimensionality reduction can be applied using only the set of correlated HVGs to highlight any substructure that might be present. This is shown in Figure 12 for both PCA and t-SNE plots, though in this case, focusing on HVGs does not provide any additional separation into distinct subpopulations. A more informative strategy is to colour cells in the plot based on the expression of a gene of interest. This improves visualization by highlighting changes in expression across the cell population.

out.pca <- plotPCA(sce_scialdone_qc, exprs_values="norm_exprs_resid", 
                   feature_set = chosen, colour_by = "ENSMUSG00000031162_Gata1") +
    theme(legend.position = "bottom")
out.tsne <- plotTSNE(sce_scialdone_qc, exprs_values = "norm_exprs_resid", 
                     feature_set = chosen, rand_seed = 100,
                     colour_by = "ENSMUSG00000031162_Gata1") +
        theme(legend.position = "bottom")
plot_grid(out.pca, out.tsne, labels = letters)

We can also compare the cluster of cells defined by the authors of the original study and the hierarchical clustering approach defined above. Both the authors’ original cluster and the hierarchical clustering we have applied neatly delineates identifiable clusters in the t-SNE plot.

There is a large degree of agreement between the two clusterings. Original cluster 10 and our cluster 1 coincide perfectly. Original clusters 7 and 8 make up our cluster 6. Original cluster 3 is very close to our cluster 2. There are more original clusters than there are in our hierarchical clustering, but the remaining clusters nevertheless show a large degree of overlap.

sce_scialdone_qc <- plotTSNE(sce_scialdone_qc, exprs_values = "norm_exprs_resid", 
              feature_set = chosen,  rand_seed = 100, perplexity = 150,
              return_SCESet = TRUE, draw_plot = FALSE)
p1 <- plotReducedDim(sce_scialdone_qc, colour_by = "original_cluster") +
    theme(legend.position = "bottom")
p2 <- plotReducedDim(sce_scialdone_qc, colour_by = "hier_cluster") +
        theme(legend.position = "bottom")
plot_grid(p1, p2, labels = letters)

Showing the “cellCategory” variable (from known embryo stage and FACS marker results), shows that our cluster 1 almost perfectly captures the E6.5 cells. Furthermore, we observe high expression of Pou5f1 (Oct4) in this cluster, and low expression in the other cells.

p3 <- plotReducedDim(sce_scialdone_qc, colour_by = "cellCategory") +
        theme(legend.position = "bottom")
p4 <- plotReducedDim(sce_scialdone_qc, colour_by = "ENSMUSG00000024406_Pou5f1") +
        theme(legend.position = "bottom")
plot_grid(p3, p4, labels = letters)

12 Data visualization

Scater allows users total flexibility to run their favourite dimension reduction methods, as demonstrated to some extect above and decribed in the supporting help files. Previously we have used plotPCA() to explore the cell population. The t-SNE plot works particularly nicely for this dataset to separate the different cell types as identified by Scialdone et al, as seen above.

With scater, any specific set of features based on prior knowledge can be used for PCA, t-SNE or diffusion maps. A feature set to use can be defined by supplying the feature_set argument to plotPCA, plotTSNE or plotDiffusionMap. This allows, for example, using only housekeeping features or control features or cell cycle genes or highly-variable genes (as we have done here) to produce reduced-dimension plots.

Since this study looks at differentiating cells, using a diffusion map (better suited to cells distributed along a process) is also a natural choice for visualising the cells. Here, the diffusion map looks similar to PCA, but with the E6.5 cells more tightly grouped.

sce_scialdone_qc <- plotDiffusionMap(sce_scialdone_qc, return_SCESet = TRUE,
                                     feature_set = chosen, rand_seed = 100, 
                                     draw_plot = FALSE)
p1 <- plotReducedDim(sce_scialdone_qc, colour_by = "cellCategory") +
    theme(legend.position = "bottom")
p2 <- plotReducedDim(sce_scialdone_qc, colour_by = "hier_cluster") +
    theme(legend.position = "bottom")
plot_grid(p1, p2, labels = letters)

The two-dimensional arc of cells in the diffusion map suggests a continuous process of differentiation undergone by the cells. Also, we can store the diffusion map coordinates in the reducedDimension slot of the SCESet object so that we do not need to recompute the diffusion map each time we want to produce a variation on the plot.

A neat feature of the dimension-reduction plots in scater is that we can easily colour plotted points by the expression values of a gene of interest. Here, we reproduce the diffusion map plots above showing the expression of the differentiation marker genes Nanog, Pou5f1 (Oct4) and Gata1.

p1 <- plotReducedDim(sce_scialdone_qc, colour_by = "ENSMUSG00000012396_Nanog") +
        theme(legend.position = "bottom", legend.text = element_text(size = 8))
p2 <- plotReducedDim(sce_scialdone_qc, colour_by = "ENSMUSG00000024406_Pou5f1") +
    theme(legend.position = "bottom", legend.text = element_text(size = 8))
p3 <- plotReducedDim(sce_scialdone_qc, colour_by = "ENSMUSG00000031162_Gata1") +
        theme(legend.position = "bottom", legend.text = element_text(size = 8))
plot_grid(p1, p2, p3, labels = letters, ncol = 3)

We see that Pou5f1 and Nanog are highly expressed in the E6.5 cells and decrease in expression across the other clusters. Nanog is almost exclusibely expressed in the E6.5 cells. Conversely, Gata1 is highly expressed in the cells in the bottom right corner of the diffusion map plot, suggested this is the “end” of a differentiation process that the diffusion map appears to tease out.

With the plotMDS function the user can apply their favourite way of defining the distance between cells to then plot the cells in a low-dimensional space. The plot below suggests that Spearman correlation works reasonably well for this dataset. Other metrics may work substantially better here and in other contexts.

p1 <- plotMDS(sce_scialdone_qc, colour_by = "cellCategory") +
            theme(legend.position = "bottom")
p2 <- plotMDS(sce_scialdone_qc, colour_by = "ENSMUSG00000024406_Pou5f1") +
            theme(legend.position = "bottom")
plot_grid(p1, p2, labels = letters, ncol = 2)

In scater the plotExpression function enables the convenient visualisation of expression values for a set of features. Here, the expression values for the six most DE genes for expression frequency between cell types are shown. The units for expression in the plot can be defined with the exprs_values argument (the expression values must exist with the provided name in the assayData slot of the SCESet object; if not the default exprs values will be used, with a warning). As with other plots in scater we can use phenotype data variables to define the colour and shape of the points.

Here we take a set of genes presented in the Scialdone et al (2016) paper as relevant for their analysis and explore their behaviour some more with scater tools.

As the plot below shows, Podxl, Pou5f1, and Smtnl2 are expressed in most cells, whereas the other genes here are only expressed at reasonable levels in a small subset of cells.

feats <- c("ENSMUSG00000024406_Pou5f1", "ENSMUSG00000012396_Nanog",
           "ENSMUSG00000031162_Gata1", "ENSMUSG00000052217_Hbb-bh1",
           "ENSMUSG00000063060_Sox7", "ENSMUSG00000058794_Nfe2",
           "ENSMUSG00000066652_Lefty2", "ENSMUSG00000030699_Tbx6",
           "ENSMUSG00000021835_Bmp4", "ENSMUSG00000025608_Podxl",
           "ENSMUSG00000045382_Cxcr4", "ENSMUSG00000034664_Itga2b",
           "ENSMUSG00000006362_Cbfa2t3", "ENSMUSG00000024986_Hhex",
           "ENSMUSG00000038700_Hoxb5", "ENSMUSG00000042286_Stab1",
           "ENSMUSG00000045667_Smtnl2", "ENSMUSG00000041361_Myzap")
feats <- feats[order(fData(sce_scialdone_qc)[feats, "mgi_symbol"])]
plotExpression(sce_scialdone_qc, feats)

pluri_feats <- c("ENSMUSG00000024406_Pou5f1", "ENSMUSG00000012396_Nanog",
                 "ENSMUSG00000025608_Podxl")
mid_feats <- c("ENSMUSG00000021835_Bmp4", "ENSMUSG00000045382_Cxcr4",
                "ENSMUSG00000024986_Hhex", "ENSMUSG00000038700_Hoxb5",
                "ENSMUSG00000066652_Lefty2",  "ENSMUSG00000041361_Myzap",
               "ENSMUSG00000045667_Smtnl2", "ENSMUSG00000030699_Tbx6",
               "ENSMUSG00000042286_Stab1")
later_feats <- c( "ENSMUSG00000006362_Cbfa2t3", "ENSMUSG00000031162_Gata1",
                  "ENSMUSG00000052217_Hbb-bh1", "ENSMUSG00000034664_Itga2b",
                  "ENSMUSG00000058794_Nfe2", "ENSMUSG00000063060_Sox7")

sce_scialdone_qc$hier_cluster <- as.factor(sce_scialdone_qc$hier_cluster)
plotExpression(sce_scialdone_qc, pluri_feats, x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
    ggtitle("Pluripotency genes")

plotExpression(sce_scialdone_qc, mid_feats[1:3], x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
        ggtitle("'Mid' genes")

plotExpression(sce_scialdone_qc, mid_feats[4:6], x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
    ggtitle("'Mid' genes")

plotExpression(sce_scialdone_qc, mid_feats[7:9], x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
    ggtitle("'Mid' genes")

plotExpression(sce_scialdone_qc, later_feats[1:3], x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
    ggtitle("'Later' genes")

plotExpression(sce_scialdone_qc, later_feats[4:6], x = "cellCategory", 
               colour_by = "hier_cluster", ncol = 1) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1)) +
    ggtitle("'Later' genes")

These scRNA-seq data certainly capture known biology during embryonic differentiation from E6.5 to E7.5.

13 Ordering cells in pseudotime

13.1 Diffusion Pseudotime

A first approach to ordering cells in pseudotime is “diffusion pseudotime”, or DPT. This approach measures transitions on all length scales between cells using diffusion-like random walks. This allows us to identify cells that undergo branching decisions or are in metastable states.

We define a “root cell” for the DPT calculation as a cell with high Pou5f1, Nanog and Podxl expression, using this as an example of a more “pluripotent” cell. The diffusion pseudotime values for the other cells are relative to this cell. This ensures that diffusion pseudotime will run in the “right” direction, that is increased differentiation as pseudotime increases. The diffusion pseudotime values for the other cells are relative to this cell.

A plot of the first two diffusion map components as computed with DPT shows some separation of the different cell populations based on FACS marker values (a), and no separation of cells by plate used for processing (b).

library(dpt)
ww <- which(sce_scialdone_qc$cellCategory == "E6.5" & 
        exprs(sce_scialdone_qc)[pluri_feats[1],] > 9 &
        exprs(sce_scialdone_qc)[pluri_feats[2],] > 8 &
        exprs(sce_scialdone_qc)[pluri_feats[3],] > 6)
sce_scialdone_qc$root_cell <- FALSE
sce_scialdone_qc$root_cell[ww[1]] <- TRUE

ts <- Transitions(sce_scialdone_qc)
pt <- dpt(ts, branching = TRUE, root = which(sce_scialdone_qc$root_cell))
ev <- eigen(as.matrix(ts@transitions), TRUE)$vectors
dm <- as.data.frame(ev[, -1])
colnames(dm) <- paste0('DC', seq_len(ncol(dm)))
redDim(sce_scialdone_qc) <- as.matrix(dm)
sce_scialdone_qc$DPT <- pt$DPT / diff(range( pt$DPT))
sce_scialdone_qc$DPT_rank <- rank(sce_scialdone_qc$DPT)
sce_scialdone_qc$DPT_branch <- pt$Branch

plot_dpt(ts, pt, 1:2)

p1 <- plotReducedDim(sce_scialdone_qc, colour_by = "cellCategory") + 
    theme(legend.position = "bottom") +
    ggtitle("Diffusion map showing FACS cell categories")
p2 <- plotReducedDim(sce_scialdone_qc, colour_by = "hier_cluster") + 
    theme(legend.position = "bottom") +
    ggtitle("Diffusion map showing hierarchical cluster")
plot_grid(p1, p2, labels = letters)

Of course, the DPT reduced-dimension plot is almost identical to the earlier diffusion maps (dpt has a slightly different implementation).

First, we can look at DPT in relation to previous hierarchical clustering of the cells as well as the defined FACS/embryo stage cell categories. We see that the cell clusters have a strong relationship to DPT. For example, hierarchical cluster 1 (which contains most of the E6.5 cells) has low DPT values/ranks. We see that Flk+ cells occupy the middle DPT range, and CD41+ cells generally have high DPT values. This is encouraging, as it indicates that the DPT trajectory is capturing the differentiation process of interest in these data.

p1 <- plotPhenoData(sce_scialdone_qc, 
              aes(x = hier_cluster, y = DPT, colour = cellCategory)) +
    theme(legend.position = "bottom", legend.text = element_text(size = 8))
p2 <- plotPhenoData(sce_scialdone_qc, 
              aes(x = hier_cluster, y = DPT_rank, colour = cellCategory)) +
        theme(legend.position = "bottom", legend.text = element_text(size = 8))
plot_grid(p1, p2, labels = letters)

We can look at the behaviour of specific genes over diffusion pseudotime. We can use DPT either as a continuous measurement of pseudotime, or use DPT rank to have a discrete pseudotemporal ordering of cells. We see from both kind of plot below that Pou5f1 is highly expressed in almost all E6.5 cells, and then decreases steadily in expression with pseudotime. Nanog is highly expressed earlier in DPT and decreases steadily to near-zero for high DPT. Podxl is almost ubiquitously expressed across all genes, so does not show a particularly strong relationship to DPT.

plotExpression(sce_scialdone_qc, pluri_feats, "DPT", 
               colour_by = "cellCategory", show_violin = FALSE, 
               show_smooth = FALSE, ncol = 3) +
    geom_smooth(aes(group = 1), linetype = 2, colour = "firebrick",
                alpha = 0.5, se = TRUE, method = "loess") +
    coord_cartesian(ylim = c(0, 9)) +
    theme(legend.position = "bottom") + ggtitle("Expression against DPT")

plotExpression(sce_scialdone_qc, pluri_feats, "DPT_rank", 
               colour_by = "cellCategory", show_violin = FALSE, 
               show_smooth = FALSE, ncol = 3) +
    geom_smooth(aes(group = 1), linetype = 2, colour = "firebrick",
                alpha = 0.5, se = TRUE, span = 0.75, method = "loess") +
    theme(legend.position = "bottom") + ggtitle("Expression against DPT rank")

plotExpression(sce_scialdone_qc, pluri_feats, "DPT_rank", 
               colour_by = "hier_cluster", show_violin = FALSE, 
               show_smooth = FALSE, ncol = 3) +
    geom_smooth(aes(group = 1), linetype = 2, colour = "firebrick",
                alpha = 0.5, se = TRUE, span = 0.75, method = "loess") +
    theme(legend.position = "bottom") + ggtitle("Expression against DPT rank")

The discrete values for DPT give cleaner, more interpretable plots in this case, so we will use these values below.

Plotting expression against DPT rank for the ‘mid’ genes reveals some of them to have a peak in expression levels in the middle of the DPT range (Bmp4, Cxcr4 and Tbx6), while some have peaks in expression late in DPT (Hhex, Hoxb5, Myzap and Stab1). Lefty2 decreases mildly in expression over DPT, while Smtnl2 is highly expressed across DPT, increasing in expression somewhat.

plotExpression(sce_scialdone_qc, mid_feats, "DPT_rank", 
               colour_by = "cellCategory", show_violin = FALSE, 
               show_smooth = FALSE, ncol = 3) +
    geom_smooth(aes(group = 1), linetype = 2, colour = "firebrick",
                alpha = 0.5, se = TRUE, span = 0.75, method = "loess") +
    theme(legend.position = "bottom") + ggtitle("Expression against DPT rank")

The ‘later’ set of genes convincingly display increased expression with DPT. All six of these genes show virtually no expression in E6.5 cells and then increase sharply, with only Sox7 showing a dip in expression at the high end of DPT.

plotExpression(sce_scialdone_qc, later_feats, "DPT_rank", 
               colour_by = "cellCategory", show_violin = FALSE, 
               show_smooth = FALSE, ncol = 3) +
    geom_smooth(aes(group = 1), linetype = 2, colour = "firebrick",
                alpha = 0.5, se = TRUE, span = 0.75, method = "loess") +
    theme(legend.position = "bottom") + ggtitle("Expression against DPT rank")

14 Save processed data to disk

Finally, let us save the QC’d version of the data that can be used for any subsequent analyses.

saveRDS(sce_scialdone_qc, file = "scialdone_sce_qc.rds")

15 Technical stuff

Scater has been tested on Mac OS X and Linux environments and requires the R packages:

Biobase
BiocGenerics
ggplot2
methods

and imports the packages:

biomaRt
data.table
dplyr
edgeR
grid
limma
matrixStats
plyr
reshape2
rhdf5
rjson
viridis

This case study was run using the following platform and R package versions:

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dpt_0.6.0             gplots_3.0.1          RColorBrewer_1.1-2   
##  [4] dynamicTreeCut_1.63-1 igraph_1.0.1          RBGL_1.49.1          
##  [7] graph_1.51.0          hexbin_1.27.1         scran_1.1.7          
## [10] BiocParallel_1.7.5    knitr_1.13            DT_0.1               
## [13] cowplot_0.6.2         scater_1.1.12         ggplot2_2.1.0        
## [16] Biobase_2.33.0        BiocGenerics_0.19.2  
## 
## loaded via a namespace (and not attached):
##   [1] Hmisc_3.17-4          RcppEigen_0.3.2.8.1   plyr_1.8.4           
##   [4] lazyeval_0.2.0        sp_1.2-3              shinydashboard_0.5.1 
##   [7] splines_3.3.1         digest_0.6.9          htmltools_0.3.5      
##  [10] viridis_0.3.4         gdata_2.17.0          memoise_1.0.0        
##  [13] magrittr_1.5          cluster_2.0.4         limma_3.29.17        
##  [16] matrixStats_0.50.2    xts_0.9-7             colorspace_1.2-6     
##  [19] rrcov_1.3-11          dplyr_0.5.0           RCurl_1.95-4.8       
##  [22] jsonlite_1.0          tximport_1.1.3        lme4_1.1-12          
##  [25] survival_2.39-5       zoo_1.7-13            gtable_0.2.0         
##  [28] zlibbioc_1.19.0       MatrixModels_0.4-1    car_2.1-2            
##  [31] DEoptimR_1.0-6        SparseM_1.7           VIM_4.5.0            
##  [34] scales_0.4.0          sgeostat_1.0-27       mvtnorm_1.0-5        
##  [37] DBI_0.4-1             GGally_1.2.0          edgeR_3.15.2         
##  [40] Rcpp_0.12.6           sROC_0.1-2            xtable_1.8-2         
##  [43] laeken_0.4.6          foreign_0.8-66        proxy_0.4-16         
##  [46] Formula_1.2-1         stats4_3.3.1          vcd_1.4-1            
##  [49] htmlwidgets_0.7       FNN_1.1               acepack_1.3-3.3      
##  [52] reshape_0.8.5         XML_3.98-1.4          nnet_7.3-12          
##  [55] locfit_1.5-9.1        labeling_0.3          reshape2_1.4.1       
##  [58] AnnotationDbi_1.35.4  munsell_0.4.3         tools_3.3.1          
##  [61] RSQLite_1.0.0         pls_2.5-0             evaluate_0.9         
##  [64] stringr_1.0.0         cvTools_0.3.2         yaml_2.1.13          
##  [67] robustbase_0.92-6     caTools_1.17.1        nlme_3.1-128         
##  [70] mime_0.5              quantreg_5.26         formatR_1.4          
##  [73] biomaRt_2.29.2        pbkrtest_0.4-6        beeswarm_0.2.3       
##  [76] e1071_1.6-7           smoother_1.1          tibble_1.1           
##  [79] statmod_1.4.24        robCompositions_2.0.0 pcaPP_1.9-60         
##  [82] stringi_1.1.1         highr_0.6             lattice_0.20-33      
##  [85] Matrix_1.2-6          nloptr_1.0.4          lmtest_0.9-34        
##  [88] data.table_1.9.6      bitops_1.0-6          httpuv_1.3.3         
##  [91] R6_2.1.2              latticeExtra_0.6-28   KernSmooth_2.23-15   
##  [94] gridExtra_2.2.1       vipor_0.4.3           IRanges_2.7.12       
##  [97] boot_1.3-18           MASS_7.3-45           gtools_3.5.0         
## [100] assertthat_0.1        chron_2.3-47          destiny_1.3.4        
## [103] rhdf5_2.17.2          rjson_0.2.15          S4Vectors_0.11.10    
## [106] mgcv_1.8-13           grid_3.3.1            rpart_4.1-10         
## [109] class_7.3-14          minqa_1.2.4           rmarkdown_1.0        
## [112] mvoutlier_2.0.6       Rtsne_0.11            TTR_0.23-1           
## [115] scatterplot3d_0.3-37  shiny_0.13.2          ggbeeswarm_0.5.0

Pre-processing Data from Scialdone et al 2016

Davis McCarthy

2016-08-19