With this vignette we introduce the concept of using regularized regression for normalization and show some examples.

Load some data

First we will follow the Seurat clustering tutorial and load a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.

pbmc_data <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
class(pbmc_data)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(pbmc_data)
#> [1] 32738  2700

pbmc_data is a sparse matrix of UMI counts (32,738 genes as rows and 2,700 cells as columns).

Inspect data

We will now calculate some properties and visually inspect the data. Our main interest is in the general trends not in individual outliers. Neither genes nor cells that stand out are important at this step, instead we want to focus on the global trends.

Derive gene and cell attributes from the UMI matrix.

gene_attr <- data.frame(mean = rowMeans(pbmc_data), detection_rate = rowMeans(pbmc_data > 
    0), var = apply(pbmc_data, 1, var))
gene_attr$log_mean <- log10(gene_attr$mean)
gene_attr$log_var <- log10(gene_attr$var)
rownames(gene_attr) <- rownames(pbmc_data)
cell_attr <- data.frame(n_umi = colSums(pbmc_data), n_gene = colSums(pbmc_data > 
    0))
rownames(cell_attr) <- colnames(pbmc_data)
ggplot(gene_attr, aes(log_mean, log_var)) + geom_point(alpha = 0.3, shape = 16) + 
    geom_density_2d(size = 0.3) + geom_abline(intercept = 0, slope = 1, color = "red")
Mean-variance relationship

Mean-variance relationship

For the genes, we can see that up to a mean UMI count of ca. 0.1 the variance follows the line through the origin with slop one, i.e. variance and mean are roughly equal as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.

# add the expected detection rate under Poisson model
x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x, detection_rate = 1 - dpois(0, lambda = 10^x))
ggplot(gene_attr, aes(log_mean, detection_rate)) + geom_point(alpha = 0.3, shape = 16) + 
    geom_line(data = poisson_model, color = "red") + theme_gray(base_size = 8)
Mean-detection-rate relationship

Mean-detection-rate relationship

In line with the previous plot, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0 suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.

ggplot(cell_attr, aes(n_umi, n_gene)) + geom_point(alpha = 0.3, shape = 16) + geom_density_2d(size = 0.3)
UMI detected genes relationship

UMI detected genes relationship

The more UMI counts a cell has, the more genes are detected. In this data set this seems to be an almost linear relationship (at least within the UMI range of most of the cells).

General idea of transformation

Based on the observations above, which are not unique to this particular data set, we propose to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here the other variables can be used to model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we will first fit model parameters per gene, and then use the relationship between gene mean and parameter values to fit parameters, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual which can be interpreted as the number of standard deviations an observed count is away from the expected mean. If the model accurately describes the mean-variance relationship and the dependency of mean and latent factors, then the result should have mean zero and a stable variance across the range of expression.

Estimate model parameters and transform data

The vst function estimates model parameters and performs the variance stabilizing transformation. Here we use the log10 of the total UMI counts of a cell as variable for sequencing depth for each cell. After data transformation we plot the model parameters as a function of gene mean (geometric mean).

set.seed(44)
vst_out <- sctransform::vst(pbmc_data, latent_var = c("log_umi"), return_gene_attr = TRUE, 
    return_cell_attr = TRUE, verbosity = 1)
#> Calculating cell attributes from input UMI matrix: log_umi
#> Variance stabilizing transformation of count matrix of size 12572 by 2700
#> Model formula is y ~ log_umi
#> Get Negative Binomial regression parameters per gene
#> Using 2000 genes, 2700 cells
#> Found 120 outliers - those will be ignored in fitting/regularization step
#> Second step: Get residuals using fitted parameters for 12572 genes
#> Calculating gene attributes
#> Wall clock passed: Time difference of 17.82073 secs
sctransform::plot_model_pars(vst_out, show_theta = TRUE)

Internally vst performs Poisson regression per gene with \(log(\mu) = \beta_0 + \beta_1 x\), where \(x\) is log_umi, the base 10 logarithm of the total number of UMI counts in each cell, and \(\mu\) are the expected number of UMI counts of the given gene. The previous plot shows \(\beta_0\) (Intercept), the \(\beta_1\) coefficient log_umi, and the maximum likelihood estimate of the overdispersion parameter theta under the negative binomial model. Under the negative binomial model the variance of a gene depends on the expected UMI counts and theta: \(\mu + \frac{\mu^2}{\theta}\). Also shown is the overdispersion factor (od_factor), defined as \(1 + \frac{m}{\theta}\) where \(m\) is the gene mean.

In a second step the regularized model parameters are used to turn observed UMI counts into Pearson residuals.

Inspect model

We will look at several genes in more detail.

sctransform::plot_model(vst_out, pbmc_data, c("MALAT1", "RPL10", "FTL"), plot_residual = TRUE)

For three highly expressed genes we see the regression factor on the x-axis and UMI counts and Pearson residuals on the y-axis. The pink line shows the expected UMI counts given the model, and the shaded region spans one standard deviation from the expected value. We can see that the FTL gene is present in two states (low and high). In case of FTL regularization was particular important since otherwise we would have overestimated the variance and as a result seen lower residuals, as we can see in the following plot.

sctransform::plot_model(vst_out, pbmc_data, c("FTL"), plot_residual = TRUE, show_nr = TRUE, 
    arrange_vertical = FALSE)

The blue line and shaded region show the non-regularized model fit, and the right panel the corresponding residuals.

Here are some other highly variable genes that benefit from regularization.

sctransform::plot_model(vst_out, pbmc_data, c("GNLY", "S100A9"), plot_residual = TRUE, 
    show_nr = TRUE)

GNLY is detected in cells with few UMI and genes and the unconstrained model has a high variance, large intercept and a negative regression coefficient. S100A9 is another example of overestimated variance in the non-regularized model.

Examine the overall properties of the transformed data.

ggplot(vst_out$gene_attr, aes(residual_mean)) + geom_histogram(binwidth = 0.01)

ggplot(vst_out$gene_attr, aes(residual_variance)) + geom_histogram(binwidth = 0.1) + 
    geom_vline(xintercept = 1, color = "red") + xlim(0, 10)

After transformation the mean of the genes is close to zero, and most genes have a variance around one. This suggests that overall the regularized negative binomial model is a suitable model that describes the effect of sequencing depth on UMI counts. Further, after transformation there is no relationship between gene mean and variance, as the next plot shows.

ggplot(vst_out$gene_attr, aes(log10(gmean), residual_variance)) + geom_point(alpha = 0.3, 
    shape = 16) + geom_density_2d(size = 0.3)

We can look at the most variable genes by simply sorting the genes by residual variance.

head(round(vst_out$gene_attr[order(-vst_out$gene_attr$residual_variance), ], 2), 
    22)
#>         detection_rate gmean variance residual_mean residual_variance
#> S100A9            0.36  1.09   278.68          2.49             74.50
#> GNLY              0.18  0.32    45.24          1.50             65.52
#> LYZ               0.60  2.09   564.11          2.49             56.28
#> S100A8            0.28  0.72    78.55          1.81             50.23
#> NKG7              0.30  0.69    52.68          1.64             41.50
#> FTL               0.99 10.68  2008.69          2.14             32.86
#> GZMB              0.12  0.21    12.54          0.78             25.72
#> IGLL5             0.04  0.06     8.89          0.37             24.57
#> FTH1              0.99  9.48  1204.32          1.39             19.88
#> CCL5              0.32  0.68    25.04          1.11             19.68
#> HLA-DRA           0.56  1.88   186.67          1.27             16.38
#> PPBP              0.03  0.04     9.58          0.19             16.07
#> PF4               0.02  0.02     2.72          0.17             14.86
#> CST3              0.40  1.18   100.57          1.07             13.44
#> CD74              0.84  3.48   262.76          1.00             13.33
#> CCL4              0.11  0.13     4.57          0.33             13.22
#> GNG11             0.01  0.02     0.73          0.14             12.43
#> FGFBP2            0.11  0.15     2.82          0.40             11.47
#> SDPR              0.02  0.02     1.26          0.14             11.45
#> CD79A             0.16  0.24     5.73          0.48             11.33
#> GZMH              0.10  0.14     2.68          0.37              9.84
#> FCER1A            0.02  0.03     1.93          0.12              9.75

These genes are markers of cell populations that we expect to see. Note how they represent a great range of mean UMI and detection rate values.