NOTE: This document was generated with sctransform version 0.3.2.9007

Introduction

With this vignette we introduce the non-parametric differential expression test for sparse non-negative data as seen in single-cell RNA sequencing.

The test is a model-free randomization test where the observed difference in mean between two groups is compared against a null distribution that is obtained by random shuffling of the group labels.

Given the observed difference and the null distribution, empirical p-values are calculated: emp_pval = (b + 1) / (R + 1) where b is the number of times the absolute difference in mean from a random permutation is at least as large as the absolute value of the observed difference in mean, R is the number of random permutations. This is an upper bound of the real empirical p-value that would be obtained by enumerating all possible group label permutations.

Additionally, we approximate the empirical null distribution with a normal distribution and turn the observed difference in mean into a z-score and then into a p-value. Finally, all p-values (for the tested genes) are adjusted using the Benjamini & Hochberg method (fdr).

The log2FC values in the output are log2(mean1 / mean2).

The current implementation only supports sparse matrices of class dgCMatrix from the Matrix package and has been optimized for speed (see benchmarking results below).

Load some data

We use the publicly available “10k PBMCs from a Healthy Donor (v3 chemistry)” data (11,769 cells) from 10x Genomics available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3

The data is the same as the demo dataset used in Azimuth. We load the results obtained from Azimuth (Azimuth version: 0.3.2; Seurat version: 4.0.0; Reference version: 1.0.0; default parameters in web-app).

We then keep only those cell types that have at least 5 cells with a prediction and mapping score > 0.66 and further remove all genes that have not been detected in at least 5 cells.

counts <- Seurat::Read10X_h5("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_filtered_feature_bc_matrix.h5")
predictions <- read.delim("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_azimuth_pred.tsv", 
    row.names = 1)

tab <- table(predictions$predicted.celltype.l2, predictions$predicted.celltype.l2.score > 
    0.66 & predictions$mapping.score > 0.66)
keep_types <- rownames(tab)[tab[, 2] >= 5]
keep_cells <- rownames(predictions)[predictions$predicted.celltype.l2 %in% keep_types]

counts <- counts[, keep_cells]
counts <- counts[rowSums(counts > 0) >= 5, ]
predictions <- predictions[keep_cells, ]

cell_types <- factor(predictions$predicted.celltype.l2, levels = names(sort(table(predictions$predicted.celltype.l2), 
    decreasing = TRUE)))

We now have a count matrix of 19088 genes and 11729 cells with the following cell type labels:

data.frame(table(cell_types))
cell_types Freq
CD14 Mono 3469
CD4 TCM 2335
B naive 1062
MAIT 587
CD4 Naive 565
NK 551
CD8 TEM 397
CD16 Mono 385
CD8 Naive 365
Eryth 319
B intermediate 269
B memory 264
gdT 236
Platelet 179
CD4 TEM 143
CD8 TCM 140
cDC2 135
Treg 103
pDC 88
HSPC 28
NK_CD56bright 27
NK Proliferating 24
ILC 23
Plasmablast 19
cDC1 16

Motivation

Here we illustrate the concept of the test using CD14 Monocytes as group 1 and all remaining cells as group 2. We will show two example genes: HINT1 (not differentially expressed), and CD14.

goi <- c("HINT1", "CD14")
df <- melt(t(as.matrix(counts[goi, , drop = FALSE])), varnames = c("cell", "gene"), 
    value.name = "counts")
df$cell_type <- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
# calculate the (geometric) mean per group
df_sum <- group_by(df, gene, cell_type) %>% summarise(mean = expm1(mean(log1p(counts))), 
    mid = median(range(counts)), .groups = "drop")
# and the difference of means
df_diff <- group_by(df_sum, gene) %>% summarise(diff_mean = mean[1] - mean[2], label = sprintf("Difference in mean: %1.2g\nlog2 fold-change: %1.2g", 
    diff_mean, log2(mean[1]/mean[2])), x = max(mid), y = Inf, .groups = "drop")
p1 <- ggplot(df, aes(counts, y = ..density.., fill = cell_type)) + geom_histogram(binwidth = 1, 
    position = "identity", alpha = 0.4) + geom_vline(data = df_sum, aes(xintercept = mean, 
    color = cell_type)) + geom_label(data = df_diff, aes(x, y, label = label), vjust = 1, 
    inherit.aes = FALSE, size = 3) + facet_wrap(~gene, scales = "free") + xlab("Gene counts") + 
    ylab("Proportion of cells") + ggtitle("Observed data and differences in geometric mean")
plot(p1)

The plot above shows the UMI counts per gene per group. Also shown is the difference in mean (mean1 - mean2) and the log2 fold-change (log2(mean1 / mean2)). To find out whether the observed difference in mean is significant we look at the null distribution of difference in mean, i.e. we shuffle the labels (here we use 99 repetitions) and calculate the difference in mean.

# calculate null distribution of difference in mean for each gene
grp <- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
tmp_counts <- counts[goi, , drop = FALSE]
R <- 99
diff_mean_null <- sapply(1:R, function(i) {
    mean_r <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = tmp_counts, group = grp, 
        eps = 1, shuffle = TRUE)
    mean_r[, 1] - mean_r[, 2]
})
df_null <- melt(diff_mean_null, varnames = c("gene", "iteration"), value.name = "diff_mean")

p2 <- ggplot(df_null, aes(diff_mean)) + geom_histogram(bins = 33) + facet_wrap(~gene, 
    scales = "free") + xlab("Difference in geometric mean") + ylab("Count") + ggtitle("Null distribution of differences in geometric mean")
plot(p2)

The null distribution of ‘difference in mean’ shown above indicates what values to expect if the null is true (no difference in mean between the two groups). We can use the distribution to obtain an empirical p-value by asking how often the absolute value of the null distribution is larger or equal to the observed difference in mean. We use the absolute value since this is a two-tailed test, and use a pseudo-count in nominator and denominator when turning the observed frequencies into p-values (see Phipson and Smyth (2010) and Hartwig (2013) for discussions).


# given the null distribution, get empirical p-value, fit a gaussian and get
# approximated p-value
df_res <- left_join(df_null, df_diff, by = "gene") %>% group_by(gene) %>% summarise(emp_pval = (sum((abs(diff_mean.x) - 
    abs(diff_mean.y)) >= 0) + 1)/(R + 1), sds = sqrt(sum(diff_mean.x^2)/(R - 1)), 
    zscore = (diff_mean.y[1] - mean(diff_mean.x))/sds, pval = 2 * pnorm(-abs(zscore)), 
    min_r = min(diff_mean.x), max_r = max(diff_mean.x), mean_r = mean(diff_mean.x), 
    observed = diff_mean.y[1], .groups = "drop")
df_fit <- group_by(df_res, gene) %>% summarise(x = seq(from = min(min_r, observed), 
    to = max(max_r, observed), length.out = 333), y = dnorm(x = x, mean = mean_r, 
    sd = sds), .groups = "drop")
df_anno <- group_by(df_res, gene) %>% summarise(x = max(max_r, observed), y = Inf, 
    label = sprintf("Empirical p-value: %1.2g\nApprox. p-value: %1.2g", emp_pval, 
        pval))

p3 <- ggplot(df_null, aes(diff_mean, y = ..density..)) + geom_histogram(bins = 33, 
    aes(fill = "gray70")) + geom_line(data = df_fit, aes(x = x, y = y, linetype = "1")) + 
    geom_vline(data = df_res, aes(xintercept = observed, linetype = "2"), show_guide = FALSE) + 
    geom_label(data = df_anno, aes(x, y, label = label), hjust = 1, vjust = 1, size = 3) + 
    facet_wrap(~gene, scales = "free") + xlab("Difference in geometric mean") + ylab("Distribution density") + 
    ggtitle("Using the null distribution to obtain p-values") + scale_fill_manual(name = "", 
    values = "gray70", labels = sprintf("null distribution", R)) + scale_linetype_manual(name = "", 
    values = c(1, 2), labels = c("Approximated null", "Observed difference\nin mean"))
plot(p3)

The lowest possible empirical p-value is 1/(R+1) whith R being the number of random permutation used. However, the gaussian approximation of the null distribution allows us to calculate z-scores and consequently p-values that are lower than that. While the approximation using a gaussian might not be exact, especially for genes with very low detection rate or when cell numbers are very low, it generally agrees well with the empirical data.

Example 1: DE of CD14 Mono vs CD16 Mono (one vs one)

First, we will take the count matrix and fit a model using sctransform::vst, and in a second step obtain corrected counts (with the sequencing depth effect removed). Then compare the two groups.

vst_out <- vst(umi = counts, method = "qpoisson", residual_type = "none", return_cell_attr = TRUE, 
    verbosity = 0)
counts_corrected <- correct_counts(x = vst_out, umi = counts, verbosity = 0)

By default sctransform::diff_mean_test applies some moderate pre-filtering and tests only genes with:

  • absolute log2-fold-change of at least log2(1.2) (0.2630344) AND
  • mean value of at least 0.05 in at least one of the tested groups AND
  • at least 5 non-zero observations in the group with higher mean

Here we disable the first filter, but require a mean of at least 0.1 in at least one of the groups. We show results as a volcano plot and highlight the top DE genes (based on p-value or log-fold-change).

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = c("CD14 Mono", "CD16 Mono"), log2FC_th = 0, mean_th = 0.1), max_iterations = 1, 
    filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
359MB 0.4265963 1 1 2.34s

The runtime for the tests was 2.34s. In total, 6650 genes were tested. 949 genes showed a mean that was significantly different between groups (FDR <= 0.05 based on empirical p-values) AND had an absolute log2 fold-change of at least 1 (i.e. abs(log2(mean1/mean2)) >= 1).

top_markers <- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>% 
    filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <= 
        4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj, 
    zscore)

p1 <- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj < 
    0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") + 
    geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") + 
    ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")

show(p1)

Top markers per cell type

filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)

Example 2: Top markers for all cell types (each vs rest)

Here we use the test to find genes that are high for each cell type compared to the rest. This is the default behavior of the test function. To speed things up, we use fewer random permutations (49) and test only the 222 genes with highest log2 fold-change.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    R = 49, only_pos = TRUE, only_top_n = 222, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
974MB 0 1 0 5.24s

Show one plot per cell type and highlight the top 4 markers with respect to p-value and the top 4 markers with respect to log2FC.

top_markers <- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <= 
    4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1, 
    mean2, log2FC, zscore, emp_pval_adj)

p1 <- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj < 
    0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers, 
    mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 3) + theme(legend.position = "bottom")
show(p1)

Table of top markers per cell type

DT::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2)

Example 3: Unique markers for all cell types (all vs all)

Since the test is pretty fast we can run all pairwise comparisons to try to find genes that are cell type specific across many individual comparisons.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = "all_vs_all", R = 49, mean_th = 0.1, log2FC_th = 0, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
71.9GB 0.789407 1 118 2.49m

There are 300 unique cell type pairs in the output. On average 5334 genes were tested in each comparison. Below we list the top genes per cell type ranked by fold change and the number of times it was a cell type marker (absolute log2FC >= 0 and p-value <= 0.05). We also show the minimum log2 fold-change and the maximum p-value across all comparisons.

log2fc_th <- 0  #log2(1.2)
p_th <- 0.05
fn <- function(fc, pval, fc_th = 0, p_th = Inf) {
    worst <- which.min(fc)
    wp <- pval[worst]
    if (fc[worst] < 0) {
        wp <- 1 - wp
    }
    data.frame(n = sum(fc >= fc_th & pval <= p_th), log2FC = fc[worst], pval = wp)
}
tmp1 <- group_by(de_res, gene, group1) %>% summarise(fn(log2FC, pval))

tmp1 <- group_by(de_res, gene, group1) %>% summarise(n = sum(log2FC >= log2fc_th), 
    n_sig = sum(log2FC >= log2fc_th & pval <= p_th), log2FC = min(log2FC), pval = max(pval, 
        log2FC < 0))

tmp2 <- group_by(de_res, gene, group2) %>% summarise(n = sum(log2FC <= -log2fc_th), 
    n_sig = sum(log2FC <= -log2fc_th & pval <= p_th), log2FC = min(-log2FC), pval = max(pval, 
        log2FC < 0))

top_markers <- full_join(tmp1, tmp2, by = c(group1 = "group2", gene = "gene")) %>% 
    rename(cell_type = group1) %>% mutate(comparisons = mapply(sum, n.x, n.y, na.rm = TRUE), 
    n_sig = mapply(sum, n_sig.x, n_sig.y, na.rm = TRUE), log2FC = pmin(log2FC.x, 
        log2FC.y, na.rm = TRUE), pval = pmax(pval.x, pval.y, na.rm = TRUE)) %>% arrange(cell_type, 
    -comparisons, -log2FC * n_sig, pval) %>% group_by(cell_type) %>% slice_head(n = 4) %>% 
    select(cell_type, gene, comparisons, log2FC, pval)

DT::datatable(top_markers, rownames = FALSE, caption = "Cell type specific markers") %>% 
    DT::formatRound(4:5, digits = 2)

Heatmap of top markers

mat <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = counts_corrected[unique(top_markers$gene), 
    ], group = cell_types, eps = 1, shuffle = FALSE)
mat <- t(scale(t(mat)))

p1 <- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype, 
    levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) + 
    geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white", 
    high = "black", name = "Geometric mean of each gene per cell type, scaled per gene") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL, 
    y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 
    0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)

Example 4: CD4 vs CD8 T cells (some vs some)

We can also test one or more groups against one or more other groups. As an example we test CD4 vs CD8 cells (in each group we combine Naive, TCM, TEM subgroups). The cell type identities for the comparison are passed as a list of length two to the compare argument of the diff_mean_test function.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = list(c("CD4 Naive", "CD4 TCM", "CD4 TEM"), c("CD8 Naive", "CD8 TCM", 
        "CD8 TEM")), log2FC_th = 0, mean_th = 0.1), max_iterations = 1, filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
304MB 0 1 0 1.56s
top_markers <- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>% 
    filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <= 
        4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj, 
    zscore)

p1 <- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj < 
    0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") + 
    geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") + 
    ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")

show(p1)

Top markers per group

filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE), caption = "Higher in CD8") %>% DT::formatRound(2:7, digits = 2)
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE), caption = "Higher in CD4") %>% DT::formatRound(2:7, digits = 2)

Example 5: Using the test with Seurat

The functionality outlined in the examples above can also be applied to Seurat objects. We demonstrate this with an example below where we perform unsupervised clustering followed by cluster marker identification (as in example 2).

library("Seurat")
#> Attaching SeuratObject
s <- CreateSeuratObject(counts = counts)
# run sctransform
s <- SCTransform(s, verbose = FALSE, method = "qpoisson")
# 'standard' workflow
s <- RunPCA(s, verbose = FALSE)
s <- RunUMAP(s, dims = 1:30, verbose = FALSE)
s <- FindNeighbors(s, dims = 1:30, verbose = FALSE)
s <- FindClusters(s, verbose = FALSE)
DimPlot(s, label = TRUE) + NoLegend() + ggtitle("Unsupervised clustering")

The cells have been grouped into 22 clusters. We are now going to use sctransform::diff_mean_test to find the top cluster markers. For this we access the normalized counts that are in the SCT assay and the cluster labels.

bm <- bench::mark(de_res <- diff_mean_test(y = GetAssayData(s, assay = "SCT", slot = "counts"), 
    group_labels = s$seurat_clusters, R = 49, only_pos = TRUE, only_top_n = 222, 
    verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
856MB 0 1 0 4.75s
top_markers <- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <= 
    4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1, 
    mean2, log2FC, zscore, emp_pval_adj)

p1 <- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj < 
    0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers, 
    mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 4) + theme(legend.position = "bottom")
show(p1)

Table of top markers per cluster

DT::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2)

Seurat heatmap

marker_genes <- group_by(de_res, group1) %>% filter(rank(-log2FC, ties.method = "first") <= 
    4) %>% pull(gene)
# make sure all markers are in the scale.data slot (by default only the highly
# variable genes are there)
s <- GetResidual(s, features = marker_genes, verbose = FALSE)
DoHeatmap(s, features = marker_genes, slot = "scale.data")

Example 6: Conserved markers after integration

Here we show how to use the test to identify the conserved (or consistent) cluster markers after multi-sample integration. This is different from the examples above because additionally to a cluster label the cells also have a sample label and we run the tests per sample and combine the results.

First, follow the Seurat integration vignette to obtain an integrated object.

library("Seurat")
library("SeuratData")
#> Registered S3 method overwritten by 'cli':
#>   method     from         
#>   print.boxx spatstat.geom
#> ── Installed datasets ───────────────────────────────────── SeuratData v0.2.1 ──
#> ✓ bmcite   0.3.0                        ✓ pbmc3k   3.1.4
#> ✓ ifnb     3.1.0                        ✓ ssHippo  3.1.4
#> ✓ panc8    3.0.2                        ✓ stxBrain 0.1.1
#> ────────────────────────────────────── Key ─────────────────────────────────────
#> ✓ Dataset loaded successfully
#> > Dataset built with a newer version of Seurat than installed
#> ❓ Unknown version of Seurat installed
LoadData("ifnb")
#> An object of class Seurat 
#> 14053 features across 13999 samples within 1 assay 
#> Active assay: RNA (14053 features, 0 variable features)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, method = "qpoisson", verbose = FALSE)
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000, 
    verbose = FALSE)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features, 
    verbose = FALSE)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT", 
    anchor.features = features, verbose = FALSE)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT", 
    verbose = FALSE)

Note that in this example both samples have been sequenced to similar depth. When working with data where there is a stark discrepancy between sequencing depth, one should make sure that the corrected counts used for differential expression testing were generated using the same ‘target’ depth. Since this is not supported by Seurat::SCTransform it requires running sctransform::correct() with identical cell meta data for all samples and setting the as_is parameter to TRUE. However, in this example we can proceed without any extra steps.

In an unsupervised analysis one would now perform dimensionality reduction and clustering. To keep it simple, we skip these steps and directly use the cell type labels that are provided with the original seurat object.

Now we identify cluster marker genes (each vs rest) that are conserved, i.e. consistent across conditions.

bm <- bench::mark(de_res <- diff_mean_test_conserved(y = GetAssayData(immune.combined.sct, 
    assay = "SCT", slot = "counts"), group_labels = immune.combined.sct$seurat_annotations, 
    sample_labels = immune.combined.sct$stim, only_pos = TRUE, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc gc.sec n_itr n_gc total_time
1.33GB 0.7549027 1 10 13.2s

The function we used calls diff_mean_test repeatedly (once per sample given this balanced design) and aggregates the results per group and gene. The output table has the following columns

  • group1 Cell type label of the frist group of cells
  • group2 Label of the second group of cells; currently fixed to ‘rest’
  • gene Gene name (from rownames of input matrix)
  • n_tests Number of tests this gene participated in for this group (max 2 in this example)
  • log2FC_min,median,max Summary statistics for log2FC across the tests
  • mean1,2_median Median of group mean across the tests
  • pval_max Maximum of p-values across tests
  • de_tests Number of tests that showed this gene having a log2FC going in the same direction as log2FC_median and having a p-value <= pval_th

The output is ordered by group1, -de_tests, -abs(log2FC_median), pval_max

Show the top marker genes

top_markers <- filter(de_res, de_tests == 2, p.adjust(pval_max) <= 0.001) %>% group_by(group1) %>% 
    filter(rank(-log2FC_median, ties.method = "first") <= 6)
DT::datatable(top_markers, rownames = FALSE, caption = "Conserved cell type specific markers") %>% 
    DT::formatRound(5:9, digits = 2) %>% DT::formatSignif(10)

Show a heatmap of the marker genes. The mean expression per gene is scaled separately for each sample (CTRL and STIM) but they are shown interleaved to highlight that the differential expression is conserved across samples.

celltypes <- immune.combined.sct$seurat_annotations
samples <- immune.combined.sct$stim
immune.combined.sct$type_by_stim <- factor(x = paste(celltypes, samples, sep = " "), 
    levels = paste(rep(levels(top_markers$group1), each = length(unique(samples))), 
        unique(samples)))
mat <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = GetAssayData(immune.combined.sct, 
    assay = "SCT", slot = "counts")[top_markers$gene, ], group = factor(immune.combined.sct$type_by_stim), 
    eps = 1, shuffle = FALSE)
# scale samples separately
sel <- grepl(" CTRL$", colnames(mat))
mat[, sel] <- t(scale(t(mat[, sel])))
sel <- grepl(" STIM$", colnames(mat))
mat[, sel] <- t(scale(t(mat[, sel])))
mat <- mat[order(apply(mat, 1, which.max)), ]

p1 <- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype, 
    levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) + 
    geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white", 
    high = "black", name = "Geometric mean of each gene per cell type and sample, scaled per gene and sample") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL, 
    y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 
    0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)

To identify differences between stimulated and control conditions within a celltype, use the test as shown in the examples above. For example, below we look for genes differentially expressed between conditions within B cells.

de_res <- diff_mean_test(y = GetAssayData(immune.combined.sct, assay = "SCT", slot = "counts"), 
    group_labels = immune.combined.sct$type_by_stim, compare = c("B CTRL", "B STIM"), 
    verbosity = 0)
top_markers1 <- arrange(de_res, zscore) %>% head(10)
top_markers2 <- arrange(de_res, -zscore) %>% head(10)
rbind(top_markers1, top_markers2) %>% select(gene, mean1, mean2, log2FC, pval_adj) %>% 
    DT::datatable(rownames = FALSE, caption = "Top 20 B cell DE genes between control (group 1) and stimulated (group 2)") %>% 
    DT::formatRound(2:4, digits = 2) %>% DT::formatSignif(5)

Session info and runtime

Session info

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] stxBrain.SeuratData_0.1.1 ssHippo.SeuratData_3.1.4 
#>  [3] pbmc3k.SeuratData_3.1.4   panc8.SeuratData_3.0.2   
#>  [5] ifnb.SeuratData_3.1.0     bmcite.SeuratData_0.3.0  
#>  [7] SeuratData_0.2.1          SeuratObject_4.0.0       
#>  [9] Seurat_4.0.2.9002         patchwork_1.1.0.9000     
#> [11] ggrepel_0.8.2             dplyr_1.0.5              
#> [13] knitr_1.30                sctransform_0.3.2.9007   
#> [15] reshape2_1.4.4            ggplot2_3.3.2            
#> [17] Matrix_1.2-18            
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rtsne_0.15            colorspace_2.0-0      deldir_0.1-29        
#>   [4] ellipsis_0.3.1        ggridges_0.5.2        rstudioapi_0.13      
#>   [7] spatstat.data_2.1-0   leiden_0.3.3          listenv_0.8.0        
#>  [10] farver_2.0.3          DT_0.16               bit64_4.0.5          
#>  [13] RSpectra_0.16-0       codetools_0.2-16      splines_4.0.2        
#>  [16] polyclip_1.10-0       jsonlite_1.7.2        ica_1.0-2            
#>  [19] cluster_2.1.0         png_0.1-7             uwot_0.1.10          
#>  [22] shiny_1.5.0           spatstat.sparse_2.0-0 compiler_4.0.2       
#>  [25] httr_1.4.2            assertthat_0.2.1      fastmap_1.0.1        
#>  [28] bench_1.1.1           lazyeval_0.2.2        cli_2.4.0            
#>  [31] later_1.1.0.1         formatR_1.8           htmltools_0.5.1.1    
#>  [34] tools_4.0.2           igraph_1.2.6          gtable_0.3.0         
#>  [37] glue_1.4.2            RANN_2.6.1            rappdirs_0.3.1       
#>  [40] Rcpp_1.0.5            scattermore_0.7       vctrs_0.3.5          
#>  [43] nlme_3.1-149          crosstalk_1.1.0.1     lmtest_0.9-38        
#>  [46] xfun_0.19             stringr_1.4.0         globals_0.13.1       
#>  [49] mime_0.9              miniUI_0.1.1.1        lifecycle_1.0.0      
#>  [52] irlba_2.3.3           goftest_1.2-2         future_1.19.1        
#>  [55] profmem_0.5.0         MASS_7.3-53           zoo_1.8-8            
#>  [58] scales_1.1.1          spatstat.core_2.0-0   promises_1.1.1       
#>  [61] spatstat.utils_2.1-0  parallel_4.0.2        RColorBrewer_1.1-2   
#>  [64] yaml_2.2.1            reticulate_1.16       pbapply_1.4-3        
#>  [67] gridExtra_2.3         rpart_4.1-15          stringi_1.5.3        
#>  [70] highr_0.8             rlang_0.4.10          pkgconfig_2.0.3      
#>  [73] matrixStats_0.57.0    evaluate_0.14         lattice_0.20-41      
#>  [76] ROCR_1.0-11           purrr_0.3.4           tensor_1.5           
#>  [79] htmlwidgets_1.5.2     labeling_0.4.2        cowplot_1.1.0        
#>  [82] bit_4.0.4             tidyselect_1.1.0      RcppAnnoy_0.0.18     
#>  [85] plyr_1.8.6            magrittr_2.0.1        R6_2.5.0             
#>  [88] generics_0.0.2        DBI_1.1.0             pillar_1.4.7         
#>  [91] withr_2.3.0           mgcv_1.8-33           fitdistrplus_1.1-1   
#>  [94] survival_3.2-3        abind_1.4-5           tibble_3.0.4         
#>  [97] future.apply_1.6.0    crayon_1.3.4.9000     hdf5r_1.3.2          
#> [100] KernSmooth_2.23-17    spatstat.geom_2.0-1   plotly_4.9.2.1       
#> [103] rmarkdown_2.5         grid_4.0.2            data.table_1.13.2    
#> [106] digest_0.6.27         xtable_1.8-4          tidyr_1.1.2          
#> [109] httpuv_1.5.4          munsell_0.5.0         viridisLite_0.3.0

Runtime

print(proc.time() - tic)
#>     user   system  elapsed 
#> 1075.652  170.359 1259.250

References

Hartwig, Fernando Pires. 2013. “Two-Tailed P-Values Calculation in Permutation Based Tests: A Warning Against ‘Asymptotic Bias’ in Randomized Clinical Trials.” Journal of Clinical Trials, September. https://doi.org/10.4172/2167-0870.1000145.
Phipson, Belinda, and Gordon K Smyth. 2010. “Permutation P-Values Should Never Be Zero: Calculating Exact P-Values When Permutations Are Randomly Drawn.” Statistical Applications in Genetics and Molecular Biology 9 (1). https://doi.org/10.2202/1544-6115.1585.