NOTE: This document was generated with
sctransform
version 0.3.2.9007
With this vignette we introduce the non-parametric differential expression test for sparse non-negative data as seen in single-cell RNA sequencing.
The test is a model-free randomization test where the observed difference in mean between two groups is compared against a null distribution that is obtained by random shuffling of the group labels.
Given the observed difference and the null distribution, empirical p-values are calculated: emp_pval = (b + 1) / (R + 1)
where b
is the number of times the absolute difference in mean from a random permutation is at least as large as the absolute value of the observed difference in mean, R
is the number of random permutations. This is an upper bound of the real empirical p-value that would be obtained by enumerating all possible group label permutations.
Additionally, we approximate the empirical null distribution with a normal distribution and turn the observed difference in mean into a z-score and then into a p-value. Finally, all p-values (for the tested genes) are adjusted using the Benjamini & Hochberg method (fdr).
The log2FC values in the output are log2(mean1 / mean2)
.
The current implementation only supports sparse matrices of class dgCMatrix
from the Matrix
package and has been optimized for speed (see benchmarking results below).
We use the publicly available “10k PBMCs from a Healthy Donor (v3 chemistry)” data (11,769 cells) from 10x Genomics available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
The data is the same as the demo dataset used in Azimuth. We load the results obtained from Azimuth (Azimuth version: 0.3.2; Seurat version: 4.0.0; Reference version: 1.0.0; default parameters in web-app).
We then keep only those cell types that have at least 5 cells with a prediction and mapping score > 0.66 and further remove all genes that have not been detected in at least 5 cells.
<- Seurat::Read10X_h5("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_filtered_feature_bc_matrix.h5")
counts <- read.delim("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_azimuth_pred.tsv",
predictions row.names = 1)
<- table(predictions$predicted.celltype.l2, predictions$predicted.celltype.l2.score >
tab 0.66 & predictions$mapping.score > 0.66)
<- rownames(tab)[tab[, 2] >= 5]
keep_types <- rownames(predictions)[predictions$predicted.celltype.l2 %in% keep_types]
keep_cells
<- counts[, keep_cells]
counts <- counts[rowSums(counts > 0) >= 5, ]
counts <- predictions[keep_cells, ]
predictions
<- factor(predictions$predicted.celltype.l2, levels = names(sort(table(predictions$predicted.celltype.l2),
cell_types decreasing = TRUE)))
We now have a count matrix of 19088 genes and 11729 cells with the following cell type labels:
data.frame(table(cell_types))
cell_types | Freq |
---|---|
CD14 Mono | 3469 |
CD4 TCM | 2335 |
B naive | 1062 |
MAIT | 587 |
CD4 Naive | 565 |
NK | 551 |
CD8 TEM | 397 |
CD16 Mono | 385 |
CD8 Naive | 365 |
Eryth | 319 |
B intermediate | 269 |
B memory | 264 |
gdT | 236 |
Platelet | 179 |
CD4 TEM | 143 |
CD8 TCM | 140 |
cDC2 | 135 |
Treg | 103 |
pDC | 88 |
HSPC | 28 |
NK_CD56bright | 27 |
NK Proliferating | 24 |
ILC | 23 |
Plasmablast | 19 |
cDC1 | 16 |
Here we illustrate the concept of the test using CD14 Monocytes as group 1 and all remaining cells as group 2. We will show two example genes: HINT1 (not differentially expressed), and CD14.
<- c("HINT1", "CD14")
goi <- melt(t(as.matrix(counts[goi, , drop = FALSE])), varnames = c("cell", "gene"),
df value.name = "counts")
$cell_type <- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
df# calculate the (geometric) mean per group
<- group_by(df, gene, cell_type) %>% summarise(mean = expm1(mean(log1p(counts))),
df_sum mid = median(range(counts)), .groups = "drop")
# and the difference of means
<- group_by(df_sum, gene) %>% summarise(diff_mean = mean[1] - mean[2], label = sprintf("Difference in mean: %1.2g\nlog2 fold-change: %1.2g",
df_diff log2(mean[1]/mean[2])), x = max(mid), y = Inf, .groups = "drop")
diff_mean, <- ggplot(df, aes(counts, y = ..density.., fill = cell_type)) + geom_histogram(binwidth = 1,
p1 position = "identity", alpha = 0.4) + geom_vline(data = df_sum, aes(xintercept = mean,
color = cell_type)) + geom_label(data = df_diff, aes(x, y, label = label), vjust = 1,
inherit.aes = FALSE, size = 3) + facet_wrap(~gene, scales = "free") + xlab("Gene counts") +
ylab("Proportion of cells") + ggtitle("Observed data and differences in geometric mean")
plot(p1)
The plot above shows the UMI counts per gene per group. Also shown is the difference in mean (mean1 - mean2) and the log2 fold-change (log2(mean1 / mean2)). To find out whether the observed difference in mean is significant we look at the null distribution of difference in mean, i.e. we shuffle the labels (here we use 99 repetitions) and calculate the difference in mean.
# calculate null distribution of difference in mean for each gene
<- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
grp <- counts[goi, , drop = FALSE]
tmp_counts <- 99
R <- sapply(1:R, function(i) {
diff_mean_null <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = tmp_counts, group = grp,
mean_r eps = 1, shuffle = TRUE)
1] - mean_r[, 2]
mean_r[,
})<- melt(diff_mean_null, varnames = c("gene", "iteration"), value.name = "diff_mean")
df_null
<- ggplot(df_null, aes(diff_mean)) + geom_histogram(bins = 33) + facet_wrap(~gene,
p2 scales = "free") + xlab("Difference in geometric mean") + ylab("Count") + ggtitle("Null distribution of differences in geometric mean")
plot(p2)
The null distribution of ‘difference in mean’ shown above indicates what values to expect if the null is true (no difference in mean between the two groups). We can use the distribution to obtain an empirical p-value by asking how often the absolute value of the null distribution is larger or equal to the observed difference in mean. We use the absolute value since this is a two-tailed test, and use a pseudo-count in nominator and denominator when turning the observed frequencies into p-values (see Phipson and Smyth (2010) and Hartwig (2013) for discussions).
# given the null distribution, get empirical p-value, fit a gaussian and get
# approximated p-value
<- left_join(df_null, df_diff, by = "gene") %>% group_by(gene) %>% summarise(emp_pval = (sum((abs(diff_mean.x) -
df_res abs(diff_mean.y)) >= 0) + 1)/(R + 1), sds = sqrt(sum(diff_mean.x^2)/(R - 1)),
zscore = (diff_mean.y[1] - mean(diff_mean.x))/sds, pval = 2 * pnorm(-abs(zscore)),
min_r = min(diff_mean.x), max_r = max(diff_mean.x), mean_r = mean(diff_mean.x),
observed = diff_mean.y[1], .groups = "drop")
<- group_by(df_res, gene) %>% summarise(x = seq(from = min(min_r, observed),
df_fit to = max(max_r, observed), length.out = 333), y = dnorm(x = x, mean = mean_r,
sd = sds), .groups = "drop")
<- group_by(df_res, gene) %>% summarise(x = max(max_r, observed), y = Inf,
df_anno label = sprintf("Empirical p-value: %1.2g\nApprox. p-value: %1.2g", emp_pval,
pval))
<- ggplot(df_null, aes(diff_mean, y = ..density..)) + geom_histogram(bins = 33,
p3 aes(fill = "gray70")) + geom_line(data = df_fit, aes(x = x, y = y, linetype = "1")) +
geom_vline(data = df_res, aes(xintercept = observed, linetype = "2"), show_guide = FALSE) +
geom_label(data = df_anno, aes(x, y, label = label), hjust = 1, vjust = 1, size = 3) +
facet_wrap(~gene, scales = "free") + xlab("Difference in geometric mean") + ylab("Distribution density") +
ggtitle("Using the null distribution to obtain p-values") + scale_fill_manual(name = "",
values = "gray70", labels = sprintf("null distribution", R)) + scale_linetype_manual(name = "",
values = c(1, 2), labels = c("Approximated null", "Observed difference\nin mean"))
plot(p3)
The lowest possible empirical p-value is 1/(R+1) whith R being the number of random permutation used. However, the gaussian approximation of the null distribution allows us to calculate z-scores and consequently p-values that are lower than that. While the approximation using a gaussian might not be exact, especially for genes with very low detection rate or when cell numbers are very low, it generally agrees well with the empirical data.
First, we will take the count matrix and fit a model using sctransform::vst
, and in a second step obtain corrected counts (with the sequencing depth effect removed). Then compare the two groups.
<- vst(umi = counts, method = "qpoisson", residual_type = "none", return_cell_attr = TRUE,
vst_out verbosity = 0)
<- correct_counts(x = vst_out, umi = counts, verbosity = 0) counts_corrected
By default sctransform::diff_mean_test
applies some moderate pre-filtering and tests only genes with:
Here we disable the first filter, but require a mean of at least 0.1 in at least one of the groups. We show results as a volcano plot and highlight the top DE genes (based on p-value or log-fold-change).
<- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types,
bm compare = c("CD14 Mono", "CD16 Mono"), log2FC_th = 0, mean_th = 0.1), max_iterations = 1,
filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
359MB | 0.4265963 | 1 | 1 | 2.34s |
The runtime for the tests was 2.34s. In total, 6650 genes were tested. 949 genes showed a mean that was significantly different between groups (FDR <= 0.05 based on empirical p-values) AND had an absolute log2 fold-change of at least 1 (i.e. abs(log2(mean1/mean2)) >= 1
).
<- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>%
top_markers filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <=
4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj,
zscore)
<- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj <
p1 0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") +
geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") +
ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")
show(p1)
Top markers per cell type
filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE,
searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE,
searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)
Here we use the test to find genes that are high for each cell type compared to the rest. This is the default behavior of the test function. To speed things up, we use fewer random permutations (49) and test only the 222 genes with highest log2 fold-change.
<- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types,
bm R = 49, only_pos = TRUE, only_top_n = 222, verbosity = 0))
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
974MB | 0 | 1 | 0 | 5.24s |
Show one plot per cell type and highlight the top 4 markers with respect to p-value and the top 4 markers with respect to log2FC.
<- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <=
top_markers 4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1,
mean2, log2FC, zscore, emp_pval_adj)
<- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj <
p1 0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers,
mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 3) + theme(legend.position = "bottom")
show(p1)
Table of top markers per cell type
::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2) DT
Since the test is pretty fast we can run all pairwise comparisons to try to find genes that are cell type specific across many individual comparisons.
<- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types,
bm compare = "all_vs_all", R = 49, mean_th = 0.1, log2FC_th = 0, verbosity = 0))
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
71.9GB | 0.789407 | 1 | 118 | 2.49m |
There are 300 unique cell type pairs in the output. On average 5334 genes were tested in each comparison. Below we list the top genes per cell type ranked by fold change and the number of times it was a cell type marker (absolute log2FC >= 0 and p-value <= 0.05). We also show the minimum log2 fold-change and the maximum p-value across all comparisons.
<- 0 #log2(1.2)
log2fc_th <- 0.05
p_th <- function(fc, pval, fc_th = 0, p_th = Inf) {
fn <- which.min(fc)
worst <- pval[worst]
wp if (fc[worst] < 0) {
<- 1 - wp
wp
}data.frame(n = sum(fc >= fc_th & pval <= p_th), log2FC = fc[worst], pval = wp)
}<- group_by(de_res, gene, group1) %>% summarise(fn(log2FC, pval))
tmp1
<- group_by(de_res, gene, group1) %>% summarise(n = sum(log2FC >= log2fc_th),
tmp1 n_sig = sum(log2FC >= log2fc_th & pval <= p_th), log2FC = min(log2FC), pval = max(pval,
< 0))
log2FC
<- group_by(de_res, gene, group2) %>% summarise(n = sum(log2FC <= -log2fc_th),
tmp2 n_sig = sum(log2FC <= -log2fc_th & pval <= p_th), log2FC = min(-log2FC), pval = max(pval,
< 0))
log2FC
<- full_join(tmp1, tmp2, by = c(group1 = "group2", gene = "gene")) %>%
top_markers rename(cell_type = group1) %>% mutate(comparisons = mapply(sum, n.x, n.y, na.rm = TRUE),
n_sig = mapply(sum, n_sig.x, n_sig.y, na.rm = TRUE), log2FC = pmin(log2FC.x,
na.rm = TRUE), pval = pmax(pval.x, pval.y, na.rm = TRUE)) %>% arrange(cell_type,
log2FC.y, -comparisons, -log2FC * n_sig, pval) %>% group_by(cell_type) %>% slice_head(n = 4) %>%
select(cell_type, gene, comparisons, log2FC, pval)
::datatable(top_markers, rownames = FALSE, caption = "Cell type specific markers") %>%
DT::formatRound(4:5, digits = 2) DT
Heatmap of top markers
<- sctransform:::row_gmean_grouped_dgcmatrix(matrix = counts_corrected[unique(top_markers$gene),
mat group = cell_types, eps = 1, shuffle = FALSE)
], <- t(scale(t(mat)))
mat
<- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype,
p1 levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) +
geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white",
high = "black", name = "Geometric mean of each gene per cell type, scaled per gene") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL,
y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0,
0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)
We can also test one or more groups against one or more other groups. As an example we test CD4 vs CD8 cells (in each group we combine Naive, TCM, TEM subgroups). The cell type identities for the comparison are passed as a list of length two to the compare
argument of the diff_mean_test
function.
<- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types,
bm compare = list(c("CD4 Naive", "CD4 TCM", "CD4 TEM"), c("CD8 Naive", "CD8 TCM",
"CD8 TEM")), log2FC_th = 0, mean_th = 0.1), max_iterations = 1, filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
304MB | 0 | 1 | 0 | 1.56s |
<- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>%
top_markers filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <=
4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj,
zscore)
<- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj <
p1 0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") +
geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") +
ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")
show(p1)
Top markers per group
filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE,
searching = FALSE), caption = "Higher in CD8") %>% DT::formatRound(2:7, digits = 2)
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE,
searching = FALSE), caption = "Higher in CD4") %>% DT::formatRound(2:7, digits = 2)
The functionality outlined in the examples above can also be applied to Seurat objects. We demonstrate this with an example below where we perform unsupervised clustering followed by cluster marker identification (as in example 2).
library("Seurat")
#> Attaching SeuratObject
<- CreateSeuratObject(counts = counts)
s # run sctransform
<- SCTransform(s, verbose = FALSE, method = "qpoisson")
s # 'standard' workflow
<- RunPCA(s, verbose = FALSE)
s <- RunUMAP(s, dims = 1:30, verbose = FALSE)
s <- FindNeighbors(s, dims = 1:30, verbose = FALSE)
s <- FindClusters(s, verbose = FALSE)
s DimPlot(s, label = TRUE) + NoLegend() + ggtitle("Unsupervised clustering")
The cells have been grouped into 22 clusters. We are now going to use sctransform::diff_mean_test
to find the top cluster markers. For this we access the normalized counts that are in the SCT
assay and the cluster labels.
<- bench::mark(de_res <- diff_mean_test(y = GetAssayData(s, assay = "SCT", slot = "counts"),
bm group_labels = s$seurat_clusters, R = 49, only_pos = TRUE, only_top_n = 222,
verbosity = 0))
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
856MB | 0 | 1 | 0 | 4.75s |
<- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <=
top_markers 4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1,
mean2, log2FC, zscore, emp_pval_adj)
<- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj <
p1 0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers,
mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 4) + theme(legend.position = "bottom")
show(p1)
Table of top markers per cluster
::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2) DT
<- group_by(de_res, group1) %>% filter(rank(-log2FC, ties.method = "first") <=
marker_genes 4) %>% pull(gene)
# make sure all markers are in the scale.data slot (by default only the highly
# variable genes are there)
<- GetResidual(s, features = marker_genes, verbose = FALSE)
s DoHeatmap(s, features = marker_genes, slot = "scale.data")
Here we show how to use the test to identify the conserved (or consistent) cluster markers after multi-sample integration. This is different from the examples above because additionally to a cluster label the cells also have a sample label and we run the tests per sample and combine the results.
First, follow the Seurat integration vignette to obtain an integrated object.
library("Seurat")
library("SeuratData")
#> Registered S3 method overwritten by 'cli':
#> method from
#> print.boxx spatstat.geom
#> ── Installed datasets ───────────────────────────────────── SeuratData v0.2.1 ──
#> ✓ bmcite 0.3.0 ✓ pbmc3k 3.1.4
#> ✓ ifnb 3.1.0 ✓ ssHippo 3.1.4
#> ✓ panc8 3.0.2 ✓ stxBrain 0.1.1
#> ────────────────────────────────────── Key ─────────────────────────────────────
#> ✓ Dataset loaded successfully
#> > Dataset built with a newer version of Seurat than installed
#> ❓ Unknown version of Seurat installed
LoadData("ifnb")
#> An object of class Seurat
#> 14053 features across 13999 samples within 1 assay
#> Active assay: RNA (14053 features, 0 variable features)
<- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, method = "qpoisson", verbose = FALSE)
ifnb.list <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000,
features verbose = FALSE)
<- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features,
ifnb.list verbose = FALSE)
<- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
immune.anchors anchor.features = features, verbose = FALSE)
<- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT",
immune.combined.sct verbose = FALSE)
Note that in this example both samples have been sequenced to similar depth. When working with data where there is a stark discrepancy between sequencing depth, one should make sure that the corrected counts used for differential expression testing were generated using the same ‘target’ depth. Since this is not supported by Seurat::SCTransform
it requires running sctransform::correct()
with identical cell meta data for all samples and setting the as_is
parameter to TRUE. However, in this example we can proceed without any extra steps.
In an unsupervised analysis one would now perform dimensionality reduction and clustering. To keep it simple, we skip these steps and directly use the cell type labels that are provided with the original seurat object.
Now we identify cluster marker genes (each vs rest) that are conserved, i.e. consistent across conditions.
<- bench::mark(de_res <- diff_mean_test_conserved(y = GetAssayData(immune.combined.sct,
bm assay = "SCT", slot = "counts"), group_labels = immune.combined.sct$seurat_annotations,
sample_labels = immune.combined.sct$stim, only_pos = TRUE, verbosity = 0))
::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details") knitr
mem_alloc | gc.sec | n_itr | n_gc | total_time |
---|---|---|---|---|
1.33GB | 0.7549027 | 1 | 10 | 13.2s |
The function we used calls diff_mean_test
repeatedly (once per sample given this balanced design) and aggregates the results per group and gene. The output table has the following columns
The output is ordered by group1, -de_tests, -abs(log2FC_median), pval_max
Show the top marker genes
<- filter(de_res, de_tests == 2, p.adjust(pval_max) <= 0.001) %>% group_by(group1) %>%
top_markers filter(rank(-log2FC_median, ties.method = "first") <= 6)
::datatable(top_markers, rownames = FALSE, caption = "Conserved cell type specific markers") %>%
DT::formatRound(5:9, digits = 2) %>% DT::formatSignif(10) DT
Show a heatmap of the marker genes. The mean expression per gene is scaled separately for each sample (CTRL and STIM) but they are shown interleaved to highlight that the differential expression is conserved across samples.
<- immune.combined.sct$seurat_annotations
celltypes <- immune.combined.sct$stim
samples $type_by_stim <- factor(x = paste(celltypes, samples, sep = " "),
immune.combined.sctlevels = paste(rep(levels(top_markers$group1), each = length(unique(samples))),
unique(samples)))
<- sctransform:::row_gmean_grouped_dgcmatrix(matrix = GetAssayData(immune.combined.sct,
mat assay = "SCT", slot = "counts")[top_markers$gene, ], group = factor(immune.combined.sct$type_by_stim),
eps = 1, shuffle = FALSE)
# scale samples separately
<- grepl(" CTRL$", colnames(mat))
sel <- t(scale(t(mat[, sel])))
mat[, sel] <- grepl(" STIM$", colnames(mat))
sel <- t(scale(t(mat[, sel])))
mat[, sel] <- mat[order(apply(mat, 1, which.max)), ]
mat
<- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype,
p1 levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) +
geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white",
high = "black", name = "Geometric mean of each gene per cell type and sample, scaled per gene and sample") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL,
y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0,
0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)
To identify differences between stimulated and control conditions within a celltype, use the test as shown in the examples above. For example, below we look for genes differentially expressed between conditions within B cells.
<- diff_mean_test(y = GetAssayData(immune.combined.sct, assay = "SCT", slot = "counts"),
de_res group_labels = immune.combined.sct$type_by_stim, compare = c("B CTRL", "B STIM"),
verbosity = 0)
<- arrange(de_res, zscore) %>% head(10)
top_markers1 <- arrange(de_res, -zscore) %>% head(10)
top_markers2 rbind(top_markers1, top_markers2) %>% select(gene, mean1, mean2, log2FC, pval_adj) %>%
::datatable(rownames = FALSE, caption = "Top 20 B cell DE genes between control (group 1) and stimulated (group 2)") %>%
DT::formatRound(2:4, digits = 2) %>% DT::formatSignif(5) DT
Session info
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] stxBrain.SeuratData_0.1.1 ssHippo.SeuratData_3.1.4
#> [3] pbmc3k.SeuratData_3.1.4 panc8.SeuratData_3.0.2
#> [5] ifnb.SeuratData_3.1.0 bmcite.SeuratData_0.3.0
#> [7] SeuratData_0.2.1 SeuratObject_4.0.0
#> [9] Seurat_4.0.2.9002 patchwork_1.1.0.9000
#> [11] ggrepel_0.8.2 dplyr_1.0.5
#> [13] knitr_1.30 sctransform_0.3.2.9007
#> [15] reshape2_1.4.4 ggplot2_3.3.2
#> [17] Matrix_1.2-18
#>
#> loaded via a namespace (and not attached):
#> [1] Rtsne_0.15 colorspace_2.0-0 deldir_0.1-29
#> [4] ellipsis_0.3.1 ggridges_0.5.2 rstudioapi_0.13
#> [7] spatstat.data_2.1-0 leiden_0.3.3 listenv_0.8.0
#> [10] farver_2.0.3 DT_0.16 bit64_4.0.5
#> [13] RSpectra_0.16-0 codetools_0.2-16 splines_4.0.2
#> [16] polyclip_1.10-0 jsonlite_1.7.2 ica_1.0-2
#> [19] cluster_2.1.0 png_0.1-7 uwot_0.1.10
#> [22] shiny_1.5.0 spatstat.sparse_2.0-0 compiler_4.0.2
#> [25] httr_1.4.2 assertthat_0.2.1 fastmap_1.0.1
#> [28] bench_1.1.1 lazyeval_0.2.2 cli_2.4.0
#> [31] later_1.1.0.1 formatR_1.8 htmltools_0.5.1.1
#> [34] tools_4.0.2 igraph_1.2.6 gtable_0.3.0
#> [37] glue_1.4.2 RANN_2.6.1 rappdirs_0.3.1
#> [40] Rcpp_1.0.5 scattermore_0.7 vctrs_0.3.5
#> [43] nlme_3.1-149 crosstalk_1.1.0.1 lmtest_0.9-38
#> [46] xfun_0.19 stringr_1.4.0 globals_0.13.1
#> [49] mime_0.9 miniUI_0.1.1.1 lifecycle_1.0.0
#> [52] irlba_2.3.3 goftest_1.2-2 future_1.19.1
#> [55] profmem_0.5.0 MASS_7.3-53 zoo_1.8-8
#> [58] scales_1.1.1 spatstat.core_2.0-0 promises_1.1.1
#> [61] spatstat.utils_2.1-0 parallel_4.0.2 RColorBrewer_1.1-2
#> [64] yaml_2.2.1 reticulate_1.16 pbapply_1.4-3
#> [67] gridExtra_2.3 rpart_4.1-15 stringi_1.5.3
#> [70] highr_0.8 rlang_0.4.10 pkgconfig_2.0.3
#> [73] matrixStats_0.57.0 evaluate_0.14 lattice_0.20-41
#> [76] ROCR_1.0-11 purrr_0.3.4 tensor_1.5
#> [79] htmlwidgets_1.5.2 labeling_0.4.2 cowplot_1.1.0
#> [82] bit_4.0.4 tidyselect_1.1.0 RcppAnnoy_0.0.18
#> [85] plyr_1.8.6 magrittr_2.0.1 R6_2.5.0
#> [88] generics_0.0.2 DBI_1.1.0 pillar_1.4.7
#> [91] withr_2.3.0 mgcv_1.8-33 fitdistrplus_1.1-1
#> [94] survival_3.2-3 abind_1.4-5 tibble_3.0.4
#> [97] future.apply_1.6.0 crayon_1.3.4.9000 hdf5r_1.3.2
#> [100] KernSmooth_2.23-17 spatstat.geom_2.0-1 plotly_4.9.2.1
#> [103] rmarkdown_2.5 grid_4.0.2 data.table_1.13.2
#> [106] digest_0.6.27 xtable_1.8-4 tidyr_1.1.2
#> [109] httpuv_1.5.4 munsell_0.5.0 viridisLite_0.3.0
Runtime
print(proc.time() - tic)
#> user system elapsed
#> 1075.652 170.359 1259.250