The standard preprocessing pipeline for single-cell RNA-seq data includes sequencing depth normalization followed by log-transformation [1, 2]. The normalization aims to remove technical variability associated with cell-to-cell differences in sequencing depth, whereas the log-transformation is supposed to make the variance of gene counts approximately independent of the mean expression. Two recent papers argue that neither step works very well in practice [3, 4]. Instead, both papers suggest modeling UMI (unique molecular identifier) data with count models, explicitly accounting for the cell-to-cell variation in sequencing depth (defined here as the total UMI count per cell). Hafemeister and Satija [3] use a negative binomial (NB) regression model (scTransform package in R), while Townes et al. [4] propose Poisson generalized principal component analysis (GLM-PCA). At first glance, these two models seem very different.

Here, we show that the model used by Hafemeister and Satija [3] has an overly flexible parametrization, resulting in noisy parameter estimates. As a consequence, the original paper employs post hoc smoothing to correct for this. We show that a more parsimonious model produces stable estimates even without smoothing and is equivalent to a special case of GLM-PCA. We then demonstrate that the estimates of gene-specific overdispersion in the original paper are strongly biased and further argue that UMI data do not require gene-specific overdispersion parameters to account for technical noise. Rather, the technical variability is consistent with the same overdispersion parameter shared between all genes. We use available negative control datasets to estimate this technical overdispersion. Furthermore, we compare Pearson residuals, GLM-PCA, and variance-stabilizing transformations for highly variable gene selection and as data transformation for downstream processing.

Our code in Python is available online. Analytic Pearson residuals will be included in the upcoming Scanpy 1.9 [5].


Analytic Pearson residuals

A common modeling assumption for UMI or read count data without biological variability is that each gene \(g\) takes up a certain fraction \(p_{g}\) of the total amount \(n_{c}\) of counts in cell \(c\) [4, 6–10]. The observed UMI counts \(X_{{cg}}\) are then modeled as Poisson or negative binomial (NB) [11] samples with expected value \(\mu_{{cg}} = p_{g} n_{c}\) without zero-inflation [10, 12]:

$$\begin{array}{*{20}l} X_{{cg}} \sim \text{Poisson}(\mu_{{cg}}) \:\:\text{or}\:\: \text{NB}(\mu_{{cg}}, \theta), \end{array} $$
$$\begin{array}{*{20}l} \mu_{{cg}} = n_{c} p_{g}. \end{array} $$

The Poisson model has a maximum likelihood solution (see “Methods”) that can be written in closed form as \(\hat {n}_{c} = \sum _{g} X_{{cg}}\) (sequencing depths), \(\hat {p}_{g} = \sum _{c} X_{{cg}} / \sum _{c} \hat {n}_{c}\), or, put together,

$$ \hat{\mu}_{{cg}} = \frac{\sum_{j} X_{{cj}} \cdot \sum_{i} X_{{ig}} }{\sum_{{ij}} X_{{ij}}} $$

For the negative binomial model this holds only approximately. Using this solution, the Pearson residuals are given by

$$ Z_{{cg}} = \frac{X_{{cg}} - \hat{\mu}_{{cg}}}{\sqrt{\hat{\mu}_{{cg}}+\hat{\mu}_{{cg}}^{2}/\theta}}, $$

where \(\mu _{{cg}}+\mu _{{cg}}^{2}/\theta \) is the NB variance and \(\theta = \infty\) gives the Poisson limit. The variance of Pearson residuals is, up to a constant, equal to the Pearson \(\chi^{2}\) goodness-of-fit statistic [13] and quantifies how much each gene deviates from this constant-expression model. As pointed out by Aedin Culhane [14], singular value decomposition of the Pearson residuals under the Poisson model is known as correspondence analysis [15–18], a method with a longstanding history [19].
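The closed-form expressions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `analytic_pearson_residuals` and the toy count matrix are assumptions made for this example, and the sketch assumes that every cell and every gene has at least one count, so that all fitted means are positive.

```python
import numpy as np

def analytic_pearson_residuals(X, theta=100.0):
    """Pearson residuals of a cells-by-genes count matrix under mu_cg = n_c * p_g."""
    X = np.asarray(X, dtype=float)
    depth = X.sum(axis=1, keepdims=True)      # n_c, shape (cells, 1)
    gene_sum = X.sum(axis=0, keepdims=True)   # sum over cells of X_cg, shape (1, genes)
    mu = depth * gene_sum / X.sum()           # rank-one fitted means mu_cg
    # theta = np.inf makes mu**2 / theta vanish, giving the Poisson variance mu
    return (X - mu) / np.sqrt(mu + mu**2 / theta)

# Hypothetical toy data: 3 cells x 3 genes
counts = np.array([[4, 0, 1],
                   [2, 3, 0],
                   [1, 1, 8]])
Z = analytic_pearson_residuals(counts, theta=100.0)
Z_poisson = analytic_pearson_residuals(counts, theta=np.inf)
```

Because the NB variance is never smaller than the Poisson variance, the NB residuals are never larger in magnitude than the Poisson residuals for the same data.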

Hafemeister and Satija [3] suggested using Pearson residuals from a related NB regression model for highly variable gene (HVG) selection and also as a data transformation for downstream processing. In parallel, Townes et al. [4] suggested using deviance residuals (see “Methods”) from the same Poisson model as above for HVG selection and also for PCA as an approximation to their GLM-PCA. In the next sections, we discuss the relationships between these approaches.

The regression model in scTransform is overspecified

Hafemeister and Satija [3] used the 33k PBMC (peripheral blood mononuclear cells, an immune cell class that features several distinct subpopulations) dataset from 10X Genomics in their work on normalization of UMI datasets. For each gene g in this dataset, the authors fit an independent NB regression

$$\begin{array}{*{20}l} X_{{cg}} \sim \text{NB}(\mu_{{cg}},\theta_{g}) \end{array} $$
$$\begin{array}{*{20}l} \ln(\mu_{{cg}}) = \beta_{0g} + \beta_{1g} \log_{10}(\hat{n}_{c}). \end{array} $$

Here, \(\theta_{g}\) is the gene-specific overdispersion parameter, \(\hat {n}_{c}\) are observed sequencing depths as defined above, and \(\beta_{0g}\) and \(\beta_{1g}\) are the gene-specific intercept and slope. The natural logarithm follows from the logarithmic link function that is used in NB regression by default. The original paper estimates \(\beta_{0g}\) and \(\beta_{1g}\) using Poisson regression and then uses the obtained estimates to find the maximum likelihood estimate of \(\theta_{g}\). The resulting estimates for each gene are shown in Fig. 1a–c, reproducing Figure 2A from Hafemeister and Satija [3].

Fig. 1

Regression model of Hafemeister and Satija [3] compared to the offset model. Each dot corresponds to a model fit to the counts of a single gene in the 33k PBMC dataset (10x Genomics, n = 33,148 cells). Following Hafemeister and Satija [3], we included only the 16,809 genes that were detected in at least five cells. Color denotes the local point density from low (blue) to high (yellow). Expression mean was computed as \(\frac {1}{n}\sum _{c} X_{{cg}}\). a Intercept estimates \(\hat {\beta }_{0g}\) in the original regression model. Dashed line: Analytic solution for \(\hat {\beta }_{0g}\) in the offset model we propose. b Slope estimates \(\hat {\beta }_{1g}\). Dashed line: β1g= ln(10)≈2.3. c Overdispersion estimates \(\hat {\theta }_{g}\). d Relationship between slope and intercept estimates (ρ=−0.91). e Intercept estimates in the offset model, where the slope coefficient is fixed to 1. Dashed line shows the analytic solution, which is a linear function of gene mean. f Overdispersion estimates \(\hat {\theta }_{g}\) on simulated data with true θ=10 (dashed line) for all genes. g Overdispersion estimates \(\hat {\theta }_{g}\) on the same simulated data as in f, but now with 100 instead of 10 iterations in the optimizer (R, MASS package). Cases for which the optimization diverged to infinity or resulted in spuriously large estimates (\(\hat {\theta }_{g}>10^{6}\)) are shown at \(\hat {\theta }_{g}=\infty \) with some jitter. Dashed line: true value θ=10. h Variance of Pearson residuals in the offset model. The residuals were computed analytically, assuming θ=100 for all genes. Following Hafemeister and Satija [3], we clipped the residuals to a maximum value of \(\sqrt {n}\). Dashed line indicates unit variance. Red dots show the genes identified in the original paper as most variable

The authors observed that the estimates \(\hat {\beta }_{0g}\) and \(\hat {\beta }_{1g}\) were unstable and showed high variance for genes with low average expression (Fig. 1a–b). They addressed this with a “regularization” procedure that re-set all estimates to the local kernel average estimate for a given expression level. This is similar to some approaches to bulk RNA-seq analysis [6, 7] but with post hoc correction instead of Bayesian shrinkage. This kernel smoothing resulted in an approximately linear increase of the intercept with the logarithm of the average gene expression (Fig. 1a) and an approximately constant slope value of \(\hat {\beta }_{1g}\approx 2.3\) (Fig. 1b). The nature of these dependencies was left unexplained. Moreover, we found that \(\hat {\beta }_{0g}\) and \(\hat {\beta }_{1g}\) were strongly correlated (ρ=−0.91), especially for weakly expressed genes (Fig. 1d). Together, these clear symptoms of overfitting suggest that the regression model was overspecified.

Indeed, the theory calls for a less flexible model. As explained above, a common modeling assumption (Eq. 2) is that \(\mu_{{cg}} = p_{g} n_{c}\), or equivalently

$$ \ln(\mu_{{cg}}) = \ln(p_{g}) + \ln(n_{c}) = \beta_{0g} + \ln(n_{c}). $$

We see that under this assumption, the slope \(\beta_{1g}\) does not need to be fit at all and should be fixed to 1 if \(\ln(n_{c})\) is used as predictor. Not only does this suggest an alternative, simpler parametrization of the model, but it also explains why Hafemeister and Satija [3] found that \(\hat {\beta }_{1g} \approx 2.3\): they used \(\log_{10}(n_{c}) = \ln(n_{c})/\ln(10)\) instead of \(\ln(n_{c})\) as predictor, and so obtained \(\ln(10) \approx 2.3\) as the average slope.
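The change of logarithm base can be checked numerically. The sketch below uses noise-free, hypothetical values for the sequencing depths and for the gene fraction (both are assumptions for illustration) and fits a straight line to \(\ln(\mu_{{cg}})\) using either predictor.

```python
import numpy as np

depths = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])  # hypothetical n_c
p_g = 1e-3                                                   # hypothetical gene fraction
log_mu = np.log(p_g * depths)                                # ln(mu_cg) under mu_cg = p_g * n_c

# Straight-line fits of ln(mu) against the two candidate predictors
slope_log10, intercept_log10 = np.polyfit(np.log10(depths), log_mu, deg=1)
slope_ln, intercept_ln = np.polyfit(np.log(depths), log_mu, deg=1)

print(f"slope with log10(n_c): {slope_log10:.4f}")   # close to ln(10)
print(f"slope with ln(n_c):    {slope_ln:.4f}")      # close to 1
```

With \(\ln(n_{c})\) as predictor the fitted slope is 1 and the intercept is \(\ln(p_{g})\); with \(\log_{10}(n_{c})\) the same line has slope \(\ln(10)\).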

Under the assumption of Eq. 7, a Poisson or NB regression model should be specified using \(\ln(n_{c})\) as predictor with a fixed slope of 1, a so-called offset (Eqs. 5 and 7). This way, the resulting model has only one free parameter and is not overspecified. Moreover, the Poisson offset model is equivalent to Eqs. 1 and 2 and so, as explained above, has an analytic solution

$$ \hat{\beta}_{0g} = \ln\left(\textstyle\sum_{c} X_{{cg}}/\textstyle\sum_{c} n_{c}\right) = \ln\left(\frac{1}{n}\textstyle\sum_{c} X_{{cg}}\right) - \ln\left(\frac{1}{n}\textstyle\sum_{c} n_{c}\right), $$

which forms a straight line when plotted against the log-transformed average gene expression \(\frac {1}{n}\sum _{c} X_{{cg}}\) (Fig. 1e). This provides an explanation for the linear trend in \(\hat {\beta }_{0g}\) in the original two-parameter model (Fig. 1a).
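To illustrate that the analytic intercept is indeed the maximum likelihood solution of the Poisson offset model, the following sketch fits the single free parameter by Newton iterations and compares the result with the closed form. The function name `fit_offset_intercept`, the simulated depths, and the gene fraction are assumptions made for this example.

```python
import numpy as np

def fit_offset_intercept(x, depths, n_iter=100, tol=1e-12):
    """Newton iterations for beta_0g in ln(mu_c) = beta_0g + ln(n_c), Poisson likelihood."""
    total_counts, total_depth = x.sum(), depths.sum()
    beta = 0.0
    for _ in range(n_iter):
        fitted = np.exp(beta) * total_depth        # sum over cells of mu_c at current beta
        step = (total_counts - fitted) / fitted    # score divided by Fisher information
        beta += step
        if abs(step) < tol:
            break
    return beta

rng = np.random.default_rng(0)
depths = rng.integers(1000, 5000, size=200).astype(float)  # hypothetical n_c
x = rng.poisson(2e-3 * depths)                             # counts of one simulated gene

beta_newton = fit_offset_intercept(x, depths)
beta_analytic = np.log(x.sum() / depths.sum())
```

Both values agree to numerical precision, so no iterative fitting is needed in practice.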

In practice, our one-parameter offset model and the original two-parameter model after smoothing arrive at qualitatively similar results (Fig. 1h). However, we argue that the one-parameter model is more appealing from a theoretical perspective, has an analytic solution, and does not require post hoc averaging of the coefficients across genes.

The offset regression model is equivalent to the rank-one GLM-PCA

The offset regression model turns out to be a special case of GLM-PCA [4]. There, the UMI counts are modeled as

$$\begin{array}{*{20}l} X_{{cg}} \sim \text{Poisson}(\mu_{{cg}}) \:\:\text{or}\:\: \text{NB}(\mu_{{cg}}, \theta_{g}), \end{array} $$
$$\begin{array}{*{20}l} \mu_{{cg}} = n_{c} \exp\left(\sum_{l=0}^{k} U_{{cl}} V_{{lg}}\right) = n_{c} \exp\left(V_{0g} + \sum_{l=1}^{k} U_{{cl}} V_{{lg}}\right), \end{array} $$

assuming \(k+1\) latent factors, with \(U\) and \(V\) playing the role of principal components and corresponding eigenvectors in standard PCA. Importantly, the first latent factor is constrained to \(U_{c0} = 1\) for all cells \(c\), such that \(V_{0g}\) can be interpreted as gene-specific intercepts. If the data are modeled without any further latent factors, Eq. 10 reduces to

$$ \ln(\mu_{{cg}}) = V_{0g} + \ln(n_{c}), $$

which is identical to Eq. 7 with \(V_{0g} = \beta_{0g}\). This shows that the proposed one-parameter offset regression model is exactly equivalent to the intercept-only rank-one GLM-PCA.

Overdispersion estimates in scTransform are biased

After discussing the overparametrization of the systematic component of the scTransform model, we now turn to the NB noise model employed by Hafemeister and Satija [3]. The \(\hat {\theta }_{g}\) estimates in the original paper are monotonically increasing with the average gene expression, both before and after kernel smoothing (Fig. 1c). At face value, this suggests a biologically meaningful relationship between the expression strength and the overdispersion parameter \(\theta_{g}\). However, this conclusion is in fact unsupported by the data.

To demonstrate this, we simulated a dataset with NB-distributed counts \(\widetilde X_{{cg}} \sim \text {NB}(\mu _{{cg}}, \theta =10)\) with μcg given by Eq. 3 using Xcg of the PBMC dataset. Applying the original estimation procedure to this simulated dataset showed the same positive correlation of \(\hat {\theta }_{g}\) with the average expression as in real data (Fig. 1f), strongly suggesting that it does not represent an underlying technical or biological cause, but only the estimation bias. Low-expressed genes had a larger bias and only for genes with the highest average expression was the true θ=10 estimated correctly.
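A simulation of this kind can be sketched with NumPy. The snippet below only covers the sampling step, not the estimation of \(\hat{\theta}_{g}\), and it uses hypothetical depths and gene fractions in place of the PBMC-derived means; the helper name `simulate_nb_counts` is likewise an assumption for this example.

```python
import numpy as np

def simulate_nb_counts(mu, theta, rng):
    """Draw NB(mu, theta) counts with mean mu and variance mu + mu**2 / theta."""
    mu = np.asarray(mu, dtype=float)
    # NumPy parametrizes the NB by (number of successes, success probability);
    # n = theta and p = theta / (theta + mu) give the mean/overdispersion form used here
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p)

rng = np.random.default_rng(42)
depths = rng.integers(1000, 3000, size=5000).astype(float)  # hypothetical n_c
fractions = np.array([1e-4, 1e-3, 1e-2])                    # hypothetical p_g
mu = np.outer(depths, fractions)                            # mu_cg = n_c * p_g
sim = simulate_nb_counts(mu, theta=10.0, rng=rng)
```

Since the true overdispersion is known by construction, any systematic trend in the estimates obtained from such data must come from the estimation procedure.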

Moreover, the \(\hat {\theta }_{g}\) estimates strongly depended on the exact details of the estimation procedure. Using the R optimizer (MASS package) with its default 10 iterations, as Hafemeister and Satija [3] did, led to multiple convergence warnings for the simulated data in Fig. 1f. Increasing this maximum number of iterations to 100 eliminated most convergence warnings, but caused 49.9% of the estimates to diverge to infinity or above \(10^{10}\) (Fig. 1g). These instabilities are likely due to shallow maxima in the NB likelihood w.r.t. \(\theta\) [20].

The above arguments show that the overdispersion parameter estimates in Hafemeister and Satija [3] for genes with low expression were strongly biased. In practice, however, the predicted variance \(\mu + \mu^{2}/\theta\) is only weakly affected by the exact value of \(\theta\) for low expression means \(\mu\), and so the bias reported here does not substantially affect the Pearson residuals (see below). Also, many of the weakly expressed genes may be filtered out during preprocessing in actual applications. We note that large errors in NB overdispersion parameter estimates have been extensively described in other fields, with simulation studies showing that estimation bias occurs especially for low NB means, small sample sizes, and large true values of \(\theta\) [21–23], i.e., for samples that are close to the Poisson distribution. Note also that post hoc smoothing [3] can reduce the variance of the \(\hat {\theta }_{g}\) estimates, but does not reduce the bias.
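A short numerical check makes the weak dependence on \(\theta\) at low means concrete. The expression means below are arbitrary illustrative values, not taken from any dataset.

```python
def nb_variance(mu, theta):
    """Variance of an NB distribution with mean mu and overdispersion theta."""
    return mu + mu**2 / theta

# Ratio of predicted variances for theta = 10 versus theta = 100
ratios = {mu: nb_variance(mu, 10.0) / nb_variance(mu, 100.0)
          for mu in (0.01, 1.0, 10.0)}

for mu, ratio in ratios.items():
    print(f"mu = {mu}: variance ratio = {ratio:.3f}")
```

For a mean of 0.01 the two predicted variances differ by less than 0.1%, whereas for a mean of 10 they differ by a factor of about 1.8, which is why the choice of \(\theta\) matters mainly for highly expressed genes.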

Negative control datasets suggest low overdispersion

To avoid noisy and biased estimates, we suggest using one common \(\theta\) value shared between all genes. Of course, any given dataset would be better fit using gene-specific values \(\theta_{g}\). However, our goal is not the best possible fit: we want the model to account only for technical variability, but not biological variability, e.g., between cell types; this kind of variability should manifest itself as high residual variance.

Rather than estimating the \(\theta\) value from a biologically heterogeneous dataset such as PBMC, we think it is more appropriate to estimate the technical overdispersion using negative control datasets, collected without any biological variability [12]. We analyzed several such datasets spanning different droplet- and plate-based sequencing protocols (10x Genomics, inDrop, MicrowellSeq) and compared the \(\hat {\theta }_{g}\) estimates to the estimates obtained using simulated NB data with various known values of \(\theta \in \{10, 100, 1000, \infty\}\). For the simulations, we used the empirically observed sample sizes and sequencing depths. We found that across different protocols, negative control data were consistent with overdispersion \(\theta \approx 100\) or larger (Additional file 1: Figure S1). The plateau at \(\theta \approx 10\) in the PBMC data visible in Fig. 1c could reflect biological and not technical variability. At the same time, negative control data did not exactly conform to the Poisson model (\(\theta = \infty\)), but the likely overdispersion parameter values (\(\theta \approx 100\)) were large enough to make the Poisson model acceptable in practice [10, 24, 25]. A parallel work reached the same conclusion [26].

Analytic Pearson residuals select biologically relevant genes

Both Hafemeister and Satija [3] and Townes et al. [4] suggested using Pearson/deviance residuals based on models that only account for technical variability, in order to identify biologically variable genes. Indeed, genes showing biological variability should have higher variance than predicted by such a model. As explained above, Pearson residuals in the model given by Eqs. 1 and 2 (or, equivalently, the offset regression model or rank-one GLM-PCA) can be conveniently written in closed form:

$$ Z_{{cg}} = \frac{X_{{cg}} - \hat{\mu}_{{cg}}}{\sqrt{\hat{\mu}_{{cg}}+\hat{\mu}_{{cg}}^{2}/\theta}},\;\;\hat{\mu}_{{cg}} = \frac{\sum_{j} X_{{cj}} \cdot \sum_{i} X_{{ig}}}{\sum_{i,j} X_{{ij}}},\;\;\theta=100. $$

For most genes in the PBMC data, the variance of the Pearson residuals was close to 1, indicating that this model predicted the variance of the data correctly and suggesting that most genes did not show biological variability (Fig. 1h). Using \(\theta = 100\) led to several high-expression genes selected as biologically variable that would not be selected with a lower \(\theta\) (e.g., Malat1), but overall, using \(\theta = 10\), \(\theta = 100\), or even the Poisson model with \(\theta = \infty\) led to only minor differences (Additional file 1: Figure S2a–c). Using analytic Pearson residuals for HVG selection yielded a very similar result compared to using Pearson residuals from the smoothed regression presented in Hafemeister and Satija [3], with almost the same set of genes identified as biologically variable (Fig. 1h, Additional file 1: Figure S2e). This suggests that our model is sufficient to identify biologically relevant genes.

It is instructive to compare the variance of Pearson residuals to the variance that one gets after explicit sequencing depth normalization followed by a variance-stabilizing transformation. For Poisson data, the square root transformation \(\sqrt {x}\) is approximately variance-stabilizing, and several modifications exist in the literature [27], such as the Anscombe transformation \(2\sqrt {x+3/8}\) [28] and the Freeman-Tukey transformation \(\sqrt {x}+\sqrt {x+1}\) [29]. Normalizing UMI counts Xcg by sequencing depths nc (and multiplying the result by the median sequencing depth 〈nc〉 across all cells; “median normalization”) followed by one of the square-root transformations has been advocated for UMI data processing [30, 31].
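The transformations named above are simple to write down. The sketch below is an illustration under stated assumptions: the function names and the toy count matrix are hypothetical, and median normalization is implemented as division by the per-cell depth followed by multiplication by the median depth, as described in the text.

```python
import numpy as np

def median_normalize(X):
    """Divide each cell by its sequencing depth and rescale by the median depth."""
    X = np.asarray(X, dtype=float)
    depth = X.sum(axis=1, keepdims=True)
    return X / depth * np.median(depth)

def sqrt_transform(x):
    return np.sqrt(x)

def anscombe(x):
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def freeman_tukey(x):
    return np.sqrt(x) + np.sqrt(x + 1.0)

# Hypothetical toy data: 3 cells x 3 genes
counts = np.array([[4, 0, 1],
                   [2, 3, 0],
                   [1, 1, 8]])
normalized = median_normalize(counts)
gene_variances = {name: f(normalized).var(axis=0)
                  for name, f in [("sqrt", sqrt_transform),
                                  ("anscombe", anscombe),
                                  ("freeman_tukey", freeman_tukey)]}
```

The per-gene variances in `gene_variances` are the quantities that would be compared against the variance of Pearson residuals.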

Comparing the gene variances after the square-root transformation (Fig. 2a) with those of Pearson residuals (Fig. 2b) in the PBMC dataset showed that the square-root transformation is not sufficient for variance stabilization. Particularly affected are low-expression genes that have variance close to zero after the square-root transform [32]. For example, platelet marker genes such as Tubb1 have low average expression (because platelets are a rare population in the PBMC dataset) and do not show high variance after any kind of square-root transform (another example was given by the B-cell marker Cd79a). At the same time, Pearson residuals correctly indicate that these genes have high variance and are biologically meaningful (Fig. 2c). For the genes with higher average expression, some differentially expressed genes like the monocyte marker Lyz or the abovementioned Malat1 showed high variance in both approaches. However, the selection based on the square-root transform also included high-expression genes like Fos, which showed noisy and biologically unspecific expression patterns (Fig. 2c). Similar patterns were observed in the full-retina dataset [33] (Additional file 1: Figure S3).

Fig. 2

Selection of variable genes. In the first two panels, each dot shows the variance of a single gene in the PBMC dataset after applying a normalization method. The dotted horizontal line shows a threshold adjusted to select 100 most variable genes. Red dots mark 100 genes that are selected by the other method, i.e., that are above the threshold in the other panel. Stars indicate genes shown in the last panel. a Gene variance after sequencing depth normalization, median-scaling, and the square-root transformation. b Variance of Pearson residuals (assuming θ=100). c t-SNE of the entire PBMC dataset (see Additional file 1: Figure S4), colored by expression of four example genes (after sequencing depth normalization and square-root transform). Platelet marker Tubb1 with low average expression is only selected by Pearson residuals. Arrows indicate the platelet cluster. Fos is only selected by the square root-based method, and does not show a clear type-specific expression pattern. Malat1 (expressed everywhere apart from platelets) and monocyte marker Lyz with higher average expression are selected by both methods

The gene with the highest average expression in the PBMC dataset, Malat1, showed clear signs of biologically meaningful variability, e.g., it is not expressed in platelets (Fig. 2c). While this gene is selected as biologically variable based on Pearson residuals with θ≈100 as we propose (Fig. 2b), it was not selected by Hafemeister and Satija [3] who effectively used θ≈10 (Fig. 1c, h, Additional file 1: Figure S2). This again suggests that θ≈100 is more appropriate than θ≈10 to model technical variability of UMI counts.

Pearson residuals may even be “too sensitive” in that genes that are only expressed in a handful of cells may get very large residual variance. Hafemeister and Satija [3] suggested clipping residuals to \([-\sqrt {n}, \sqrt {n}]\). We found that this step avoids large residual variance in very weakly expressed genes (Additional file 1: Figure S2d, see “Methods” for more details). The variance of unclipped Pearson residuals under the Poisson model (\(\theta = \infty\)) was very similar to the Fano factor of counts after median normalization (Additional file 1: Figure S2f) and less useful for HVG selection compared to the clipped residuals.
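The effect of clipping on HVG selection can be sketched as follows. The helper names `residual_variance` and `select_hvgs` and the simulated data are assumptions for this illustration; \(n\) is taken to be the number of cells, as in the Fig. 1 caption. Gene 0 is constructed to be expressed in a single cell only.

```python
import numpy as np

def residual_variance(X, theta=100.0, clip=True):
    """Per-gene variance of analytic Pearson residuals, optionally clipped to +-sqrt(n)."""
    X = np.asarray(X, dtype=float)
    n_cells = X.shape[0]
    mu = X.sum(axis=1, keepdims=True) * X.sum(axis=0, keepdims=True) / X.sum()
    Z = (X - mu) / np.sqrt(mu + mu**2 / theta)
    if clip:
        bound = np.sqrt(n_cells)
        Z = np.clip(Z, -bound, bound)
    return Z.var(axis=0)

def select_hvgs(X, n_top, **kwargs):
    """Indices of the n_top genes with the largest residual variance."""
    return np.argsort(residual_variance(X, **kwargs))[::-1][:n_top]

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(400, 50))   # hypothetical homogeneous population
counts[:, 0] = 0
counts[0, 0] = 30                           # gene 0 is expressed in one cell only

var_raw = residual_variance(counts, clip=False)
var_clipped = residual_variance(counts, clip=True)
top = select_hvgs(counts, n_top=5)
```

Without clipping, the single outlier cell gives gene 0 by far the largest residual variance; with clipping, its variance drops to roughly the level of the other genes.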

Lastly, gene selection by the widely used log(1+x)-transform as well as by the variance of deviance residuals as suggested by Townes et al. [4] led to results very similar to those described above for the square-root transform: many biologically meaningful genes were not selected, as all three methods overly favored high-expression genes (Additional file 1: Figure S2g–i). In conclusion, none of these transformations is sufficiently variance-stabilizing. In practice, many existing HVG selection methods take the mean-variance relationship into account when performing the selection (e.g., seurat and seurat_v3 methods [34, 35] as implemented in Scanpy [5]). We benchmarked their performance in the next section.

Analytic Pearson residuals separate cell types better than other methods

Next, we studied the effect of different normalization approaches on PCA representations and t-SNE embeddings. The first approach is median normalization, followed by the square-root transform [30, 31]. We used 50 principal components of the resulting data matrix to construct a t-SNE embedding. The second approach is computing Pearson residuals according to Eq. 12 with θ=100, followed by PCA reduction to 50 components. The third approach is computing 50 components of negative binomial GLM-PCA with θ=100 [4]. We used the same initialization to construct all t-SNE embeddings to ease the visual comparison [36].
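The second approach, Pearson residuals followed by PCA, can be sketched with a plain singular value decomposition. This is an illustration on hypothetical toy counts, not the pipeline used for the figures; the function name `pearson_residual_pca` is an assumption, and the sketch omits residual clipping and the subsequent t-SNE step.

```python
import numpy as np

def pearson_residual_pca(X, n_components=50, theta=100.0):
    """PCA scores of analytic Pearson residuals, computed via SVD."""
    X = np.asarray(X, dtype=float)
    mu = X.sum(axis=1, keepdims=True) * X.sum(axis=0, keepdims=True) / X.sum()
    Z = (X - mu) / np.sqrt(mu + mu**2 / theta)
    Z -= Z.mean(axis=0)                               # center each gene before PCA
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]     # principal component scores

rng = np.random.default_rng(1)
counts = rng.poisson(1.5, size=(300, 80))             # hypothetical toy counts
scores = pearson_residual_pca(counts, n_components=50)
```

The resulting `scores` matrix has one row per cell and would serve as input to the t-SNE embedding.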

We applied these methods to the full PBMC dataset (Additional file 1: Figure S4), three retinal datasets [33, 37, 38] (Fig. 3), and a large organogenesis dataset with n=2 million cells [39] (Fig. 4). For smaller datasets, the resulting embeddings were mostly similar, suggesting comparable performance between methods. Hafemeister and Satija [3] argued that using Pearson residuals reduces the amount of variance in the embedding explained by the sequencing depth variation, compared to sequencing depth normalization and log-transformation. We argue that this effect was mostly due to the large factor that the authors used for re-scaling the counts after normalization (Additional file 1: Figure S5): large scale factors and/or small pseudocounts (ε in log(x+ε)) are known to introduce spurious variation into the distribution of normalized counts [4, 41]. For the PBMC dataset, all three t-SNE embeddings showed a similar amount of sequencing depth variation across the embedding space (Additional file 1: Figure S4g–i). Performing the embeddings on 1,000 genes with the largest Pearson residual variance did not noticeably affect the embedding quality (Additional file 1: Figure S4d–f).

Fig. 3

t-SNE embeddings of three retinal datasets. Panels in each column are based on a different data transformation method with PCA or GLM-PCA reduction to 50 dimensions (see “Methods”), and each row shows a different retinal dataset. We did not perform any gene selection here. Colors correspond to cell type labels provided by the original papers. a–c Full-retina dataset (DropSeq) [33], containing all retinal cell types (including glia and vascular cells). 24,769 cells. d–f Bipolar cell dataset (DropSeq) [37]. 13,987 cells. g–i Retinal ganglion cell dataset (10X v2) [38]. 15,750 cells

Fig. 4

t-SNE embeddings of the organogenesis dataset. All panels show t-SNE embeddings of the organogenesis dataset [39] (2,058,652 cells), colored by the 38 main clusters identified by the original authors. All panels use 2,000 genes with the largest Pearson residual variance. Each panel shows a total of 2,026,641 cells, excluding 32,011 putative doublets identified in the original paper. All t-SNE embeddings were done with exaggeration 4 [36, 40]. a Depth normalization, median scaling, log-transformation and PCA with 50 principal components. b Same as in a, but with an additional standardization step that scales the normalized and log-transformed expression of each gene to mean zero and unit variance, as in the original paper [39]. c GLM-PCA with 50 dimensions (NB model with shared overdispersion as a free parameter, estimated to be \(\hat {\theta }=0.56\)). d Analytic Pearson residuals with θ=100 and PCA with 50 principal components. The scattered small islands do not belong to single clusters but instead are spuriously enriched in single embryos. e Same as in d, but after removing batch effect genes (“Methods”). Text labels correspond to the developmental trajectories identified in the original paper [39] (uppercase: multi-cluster trajectories, lowercase: single-cluster trajectories)

However, on closer inspection, embeddings based on Pearson residuals consistently outperformed the other two. For example, while the Pearson residual embeddings clearly separated fine cell types in the full-retina dataset [33], the square-root embedding mixed some of them (we observed the same when using the log-transform). For the same dataset, GLM-PCA embedding did not fully separate some of the biologically distinct cell types. Furthermore, GLM-PCA embeddings often featured Gaussian-shaped blobs with no internal structure (Fig. 3), suggesting that some fine manifold structure was lost, possibly due to convergence difficulties.

Embedding the organogenesis dataset [39] using Pearson residuals uncovered a strong and surprising batch artifact: hitherto unnoticed, several genes were highly expressed exclusively in small subsets of cells, with each subset coming from a single embryo. These subsets appeared as isolated islands in the t-SNE embedding (Fig. 4), allowing us to uncover and remove this batch effect (Additional file 1: Figure S6), leading to the final, biologically interpretable embedding (Fig. 4). In contrast, embeddings based on log-transform or GLM-PCA did not show this batch artifact at all. GLM-PCA took days to converge (Table 1) and could recover only the coarse structure of the data. Interestingly, the final embedding based on Pearson residuals was broadly similar to the embedding obtained after log-transform and standardization of each gene, as expected given that Pearson residuals stabilize the variance by construction (Fig. 4). Together, these qualitative observations suggest that analytic Pearson residuals can represent small, distinct subpopulations in large datasets better than other methods.

Table 1 Runtimes for different normalization pipelines

To quantify the performance of dimensionality reduction methods, we performed a systematic benchmark using the Zhengmix8eq dataset with known ground truth labels [42] (Fig. 5). This dataset consists of PBMC cells FACS-sorted into eight different cell types, with the types occurring in roughly equal proportions. To make the setup more challenging, we added 10 pseudo-genes expressed only in a group of 50 cells, effectively creating a ninth, rare, cell type (see “Methods”). We used six methods to select 2,000 HVGs (and additionally omitted HVG selection) and ten methods for data transformation and dimensionality reduction to 50 dimensions. We assessed the resulting (6+1)·10=70 pipelines using kNN classification of cell types. We used the macro F1 score (harmonic mean between precision and recall, averaged across classes) because this metric fairly averages classifier performance across classes of unequal size. Together, the F1 score of the kNN classifier quantifies how well each pipeline separated cell types in the 50-dimensional representation (Fig. 5c). We did not include approaches that use depth normalization with inferred size factors [44] in this comparison.
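The difference between macro F1 and overall accuracy in the presence of a rare class can be shown on a small example. The labels and the deliberately poor classifier below are hypothetical and unrelated to the benchmark data; `macro_f1` is a plain-Python helper written for this illustration.

```python
def macro_f1(y_true, y_pred):
    """Per-class F1 scores averaged with equal weight for every class."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2 * TP / (2 * TP + FP + FN); defined as 0 if the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100            # classifier that misses the rare type entirely

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.3f}")
```

Here the accuracy stays at 0.95 although the rare class is never recovered, while the macro F1 score falls below 0.5 because the rare class contributes an F1 of zero with the same weight as the common class.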

Fig. 5

Benchmarking the effect of normalization on cell type separation in reduced dimensionality. We used the Zhengmix8eq dataset with eight ground truth FACS-sorted cell types [42, 43] (3,994 cells) and added ten pseudo-genes expressed in a random group of 50 cells from one type. All HVG selection methods were set up to select 2,000 genes, and all normalization and dimensionality reduction methods reduced the data to 50 dimensions. For details see “Methods”. a t-SNE embedding after the seurat_v3 HVG selection as implemented in Scanpy, followed by depth normalization, median scaling, square-root transform, and PCA. Colors denote ground truth cell types, the artificially added type is shown in red. b t-SNE embedding after HVG selection by Pearson residuals (θ=100), followed by transformation to Pearson residuals (θ=100), and PCA. Black arrow points at the artificially added type. c Macro F1 score (harmonic mean between precision and recall, averaged across classes to counteract class imbalance) for kNN classification (k=15) of nine ground truth cell types for each of the 70 combinations of HVG selection and data transformation approaches

The pipeline that used analytic Pearson residuals for both gene selection and data transformation outperformed all other pipelines with respect to cell type classification performance. In contrast, popular methods for HVG selection (e.g., seurat_v3 as implemented in Scanpy [5, 35]) combined with log or square-root transformations after depth normalization performed worse and in particular were often unable to separate the rare cell type (Fig. 5a,b; see Additional file 1: Figure S7 for additional embeddings). The performance of GLM-PCA was also poor, likely due to convergence issues (with 15-dimensional, and not 50-dimensional, output spaces, GLM-PCA performed on par with Pearson residuals; data not shown), in agreement with what we reported above for the retinal datasets. Finally, deviance residuals [4] were clearly outperformed by Pearson residuals both as gene selection criterion and as data transformation. This is due to the reduced sensitivity of deviance residuals to low- or medium-expression genes (Additional file 1: Figure S2i). Note that in terms of the overall classification accuracy no pipeline outperformed Pearson residuals but many pipelines performed similarly well; this is because overall accuracy is not sensitive to the rare cell type, unlike the macro F1 score.

For this dataset, not using gene selection at all performed similarly well to HVG selection using Pearson residuals (Fig. 5c), but in general HVG selection is a recommended step in scRNA-seq data analysis [1, 2] and here Pearson residuals performed the best. Also, log-transformed counts that were standardized performed similarly well to Pearson residuals (Fig. 5c), in agreement with the above observations on the organogenesis dataset (Fig. 4b). Nevertheless, the same organogenesis example showed that Pearson residuals can be more sensitive (Fig. 4b, d).

Analytic Pearson residuals are fast to compute

The studied normalization pipelines differ in both space and time complexity. UMI count data are typically very sparse (e.g., in the PBMC dataset, 95% of entries are zero) and can be efficiently stored as a sparse matrix object. Sequencing depth normalization and square-root or log-transformation do not affect the zeros, preserving the sparsity of the matrix, and PCA can be run directly on a sparse matrix. In contrast, Pearson residuals form a dense matrix without any zeros and so can take a large amount of memory to store (4.5 GB for the PBMC dataset). For large datasets this can become prohibitive (but note that a smart implementation may be able to avoid storing a dense matrix in memory [45]). In contrast, GLM-PCA can be run directly on a sparse matrix but takes a long time to converge (Table 1), becoming prohibitively slow for bigger datasets.
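The memory figure for the dense residual matrix follows from simple arithmetic. The sketch below assumes 8-byte floating point values for the dense matrix; the sparse estimate additionally assumes 4-byte values plus 4-byte column indices per non-zero entry and ignores row pointers, so it is only a rough order-of-magnitude comparison.

```python
n_cells, n_genes = 33_148, 16_809      # PBMC dimensions quoted in the Fig. 1 caption
bytes_per_dense_value = 8              # assumption: float64 storage

dense_gb = n_cells * n_genes * bytes_per_dense_value / 1e9

# Rough sparse estimate: 5% non-zero entries (from the text),
# assumed 4-byte value + 4-byte column index per stored entry
nonzero_entries = 0.05 * n_cells * n_genes
sparse_gb = nonzero_entries * (4 + 4) / 1e9

print(f"dense residual matrix: {dense_gb:.1f} GB")
print(f"sparse count matrix:   {sparse_gb:.1f} GB (rough estimate)")
```

Under these assumptions the dense matrix takes about 4.5 GB, roughly twenty times more than the sparse count matrix.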

Computational complexity can be greatly reduced if gene selection is performed in advance. After selecting 1000 genes, Pearson residuals do not require a lot of memory (0.3 GB for the PBMC dataset) and so can be conveniently used. Note that the Pearson residual variance can be computed per gene, without storing the entire residual matrix in memory. GLM-PCA, however, remained slow even after gene selection (4 h vs. 4 s for Pearson residuals for the PBMC dataset; 2 days vs. 4 minutes for the organogenesis dataset; Table 1).
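As an illustration of the per-gene computation mentioned above, the following minimal sketch evaluates the residual variance in blocks of genes, so that only one block of residuals is held in memory at a time. It assumes the expected counts are given by the product of cell depth and gene fraction, the negative binomial variance with fixed θ=100, and clipping to \(\pm \sqrt {n}\); the function name, the chunk size, and the simulated demo data are hypothetical, a dense input array is used for simplicity, and genes with zero total counts are assumed to have been removed.

```python
import numpy as np


def pearson_residual_variance(counts, theta=100.0, chunk_size=500):
    """Per-gene variance of clipped Pearson residuals, computed chunk by chunk."""
    counts = np.asarray(counts, dtype=float)
    n_cells, n_genes = counts.shape
    depths = counts.sum(axis=1, keepdims=True)     # n_c, shape (cells, 1)
    gene_sums = counts.sum(axis=0, keepdims=True)  # shape (1, genes)
    total = counts.sum()
    clip = np.sqrt(n_cells)

    variances = np.empty(n_genes)
    for start in range(0, n_genes, chunk_size):
        stop = min(start + chunk_size, n_genes)
        # Expected counts mu_cg = n_c * p_g for this block of genes only.
        mu = depths * gene_sums[:, start:stop] / total
        z = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        variances[start:stop] = np.clip(z, -clip, clip).var(axis=0)
    return variances


# Simulated Poisson counts (illustrative only): 300 cells, 40 genes.
rng = np.random.default_rng(0)
demo_counts = rng.poisson(rng.uniform(0.5, 5.0, size=40), size=(300, 40))
demo_variances = pearson_residual_variance(demo_counts, chunk_size=7)
```

The result does not depend on the chunk size; smaller chunks only trade speed for a smaller memory footprint.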


We reviewed and contrasted different methods for normalization of UMI count data. We showed that without post hoc smoothing, the negative binomial regression model of Hafemeister and Satija [3] exhibits high variance in its parameter estimates because it is overspecified, which is why it had to be smoothed in the first place. We argued that instead of smoothing an overspecified model, one should resort to a more parsimonious and theoretically motivated model specification involving an offset term. This made the model equivalent to the rank-one GLM-PCA of Townes et al. [4] and yielded a simple analytic solution, closely related to correspondence analysis [17]. Further, we showed that the estimates of per-gene overdispersion parameter θg in the original paper exhibit substantial and systematic bias. We used negative control datasets from different experimental protocols to show that UMI counts have low overdispersion and technical variation is well described by θ≈100 shared across genes.

We found that the approach developed by Hafemeister and Satija [3] and implemented in the R package scTransform in practice yields Pearson residuals that are often similar to our analytic Pearson residuals with fixed overdispersion parameter (Additional file 1: Figure S2e). We argue that our model with its analytic solution is attractive for reasons of parsimony, theoretical simplicity, and computational speed. Moreover, it provides an explanation for the linear trends in the smoothed estimates in the original paper. We have integrated Pearson residuals into upcoming Scanpy 1.9 [5].

Following our manuscript, scTransform was updated to scTransform v2 and now uses the offset model formulation [46]. At the same time, the authors argue that the dependence of the overdispersion parameter θg on the gene expression strength is not entirely explained by the estimation bias. To reduce the bias, scTransform v2 uses glmGamPoi [47] to estimate the offsets β0g and the overdispersion parameters θg (which are then smoothed). The authors also refer to the bulk RNA-seq literature, where it has been observed that the overdispersion parameter grows monotonically with gene expression [6, 48, 49]. Given the difficulties with estimating overdispersion for low expression means (see above), we believe that this question requires further investigation. However, as argued above, whether θ is assumed to be constant or is allowed to vary between genes has very little effect on the resulting Pearson residuals.

A parallel publication [50] suggested a Bayesian procedure named Sanity for estimating expression strength underlying the observed UMI counts, based on Poisson likelihood and Bayesian shrinkage. Importantly, Pearson residuals do not aim to estimate the underlying expression strength; rather, they quantify how strongly each observed UMI count deviates from the null model of constant expression across cells. These two approaches can have opposite effects on marker genes of rare cell types: the Bayesian procedure shrinks their expression towards zero whereas our approach yields large Pearson residuals. We argued here that this emphasis on rare cell types is useful for many downstream tasks, but if the interest lies in true expression, approaches like Sanity may be more appropriate. Future work should perform comprehensive benchmarks on a variety of tasks [51].

On the practical side, we showed that Pearson residuals outperform other methods for selecting biologically variable genes. They are also better than other preprocessing methods for downstream analysis: in a systematic benchmarking effort, we demonstrated that Pearson residuals provide a good basis for general-purpose dimensionality reduction and for constructing 2D embeddings of single-cell UMI data. In particular, they are well suited for identifying rare cell types and their genetic markers. Applying gene selection prior to dimensionality reduction makes the computational cost of using Pearson residuals negligible. We conclude that analytic Pearson residuals provide a theory-based, fast, and convenient method for normalization of UMI datasets.


Mathematical details

Analytic solution

The log-likelihood for the model defined in Eqs. 1 and 2

$$ X_{{cg}} \sim \text{Poisson}(n_{c} p_{g}) $$

can be, up to a constant, written as

$$ \mathcal{L} = \sum_{{cg}} \left[ X_{{cg}}\ln(n_{c} p_{g}) - n_{c} p_{g}\right], $$

where we used the Poisson density \(p(x)=e^{-\mu }\mu ^{x}/x!\). Taking partial derivatives with respect to \(n_{c}\) and \(p_{g}\) and setting them to zero, one obtains

$$ \hat{n}_{c} = \frac{\sum_{g} X_{{cg}}}{\sum_{g} \hat{p}_{g}},\;\;\;\hat{p}_{g} = \frac{\sum_{c} X_{{cg}}}{\sum_{c} \hat{n}_{c}}. $$

This is a family of solutions. Setting \(\sum _{g} \hat {p}_{g} = 1\), we obtain Eq. 3 and the formulas for \(\hat {n}_{c}\) and \(\hat {p}_{g}\) given in the “Analytic Pearson residuals” section.
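The stationarity conditions can be checked numerically. The sketch below, on simulated Poisson counts (an illustrative assumption, not one of the analyzed datasets), evaluates both partial derivatives of the log-likelihood at the analytic solution with \(\sum _{g} \hat {p}_{g} = 1\), and also shows that rescaling \(\hat {n}_{c}\) and \(\hat {p}_{g}\) in opposite directions leaves the fitted means unchanged, which is why the solutions form a family.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 20)).astype(float)  # cells x genes

# Analytic solution with the normalization sum_g p_g = 1.
n_hat = X.sum(axis=1)            # sequencing depth of each cell
p_hat = X.sum(axis=0) / X.sum()  # fraction of all counts belonging to each gene
mu_hat = np.outer(n_hat, p_hat)  # fitted means mu_cg = n_c * p_g

# Partial derivatives of L = sum_cg [X_cg ln(n_c p_g) - n_c p_g].
grad_n = X.sum(axis=1) / n_hat - p_hat.sum()
grad_p = X.sum(axis=0) / p_hat - n_hat.sum()

# Rescaling n by a and p by 1/a leaves mu, and hence the likelihood, unchanged.
a = 3.0
mu_rescaled = np.outer(a * n_hat, p_hat / a)
```

Both gradients vanish up to floating-point error, and `mu_rescaled` coincides with `mu_hat`.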

This derivation does not generalize to the negative binomial model with density

$$ p(x) = \frac{\Gamma(x+\theta)}{x!\, \Gamma(\theta)}\left(\frac{\theta}{\theta+\mu}\right)^{\theta} \left(\frac{\mu}{\theta+\mu}\right)^{x}, $$

where the log-likelihood (for fixed θ), up to a constant, is

$$ \mathcal{L} = \sum_{{cg}} \left[ X_{{cg}}\ln(n_{c} p_{g}) - (X_{{cg}}+\theta)\ln(n_{c} p_{g} + \theta)\right]. $$

This does not have an analytic maximum likelihood solution. However, for large θ values Eq. 3 can be taken as an approximate solution.

Deviance residuals

Deviance is defined as the doubled difference between the log-likelihood of the saturated model and the log-likelihood of the actual model. The saturated model, in our case, is a full rank model with \(\hat {\mu }_{{cg}}^{*}=X_{{cg}}\). For the Poisson model, the deviance can therefore be obtained from Eq. 14 and is equal to

$$ \mathcal{D} = 2\sum_{{cg}} \left[ X_{{cg}} \ln\frac{X_{{cg}}}{\hat{\mu}_{{cg}}} - \left(X_{{cg}} - \hat{\mu}_{{cg}}\right)\right], $$

where the terms with \(\hat {\mu }_{{cg}} = X_{{cg}}\) are taken to be zero.

Deviance residuals are defined as square roots of the respective deviance terms, such that the sum of squared deviance residuals is equal to the deviance (note that for the Gaussian case this already holds true for the raw residuals, because the saturated model has zero log-likelihood, and the deviance is simply the squared error). It follows that for the Poisson model deviance residuals [4] are given by

$$ Z_{{cg}} = \text{sign}\left(X_{{cg}} - \hat{\mu}_{{cg}}\right) \sqrt{2 \left[X_{{cg}} \ln\frac{X_{{cg}}}{\hat{\mu}_{{cg}}} - \left(X_{{cg}}-\hat{\mu}_{{cg}}\right)\right]} $$

Similarly, for the negative binomial model with fixed θ, the deviance residuals follow from Eq. 17 and are given by

$$ Z_{{cg}} = \text{sign}\left(X_{{cg}} - \hat{\mu}_{{cg}}\right) \sqrt{2 \left[X_{{cg}} \ln\frac{X_{{cg}}}{\hat{\mu}_{{cg}}} - \left(X_{{cg}}+\theta\right) \ln\frac{X_{{cg}}+\theta}{\hat{\mu}_{{cg}}+\theta} \right]} $$

It is easy to verify that this formula reduces to the Poisson case when θ→∞. When computing deviance residuals, we estimated \(\hat {\mu }_{{cg}}\) using Eq. 3.
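The two residual formulas above can be written down directly. The following is a minimal sketch rather than the code used for the paper: the function names and simulated counts are hypothetical, terms of the form \(X\ln (X/\hat {\mu })\) are taken to be zero when X=0 (an assumed convention), and a guard against tiny negative values from rounding is added before the square root. The demo also compares the negative binomial residuals at a very large θ with the Poisson residuals.

```python
import numpy as np


def _x_log_ratio(x, mu):
    """x * ln(x / mu), taken to be 0 wherever x == 0."""
    out = np.zeros_like(x, dtype=float)
    pos = x > 0
    out[pos] = x[pos] * np.log(x[pos] / mu[pos])
    return out


def poisson_deviance_residuals(X, mu):
    d = 2 * (_x_log_ratio(X, mu) - (X - mu))
    return np.sign(X - mu) * np.sqrt(np.maximum(d, 0.0))


def nb_deviance_residuals(X, mu, theta=100.0):
    d = 2 * (_x_log_ratio(X, mu)
             - (X + theta) * np.log((X + theta) / (mu + theta)))
    return np.sign(X - mu) * np.sqrt(np.maximum(d, 0.0))


# Simulated counts; fitted means are depth times gene fraction.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(200, 30)).astype(float)
mu_hat = np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum()

z_poisson = poisson_deviance_residuals(X, mu_hat)
z_nb_large_theta = nb_deviance_residuals(X, mu_hat, theta=1e6)
```

With θ this large, the two sets of residuals agree closely, in line with the limit stated above.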

Clipping Pearson residuals

Clipping Pearson residuals to \(\pm \sqrt {n}\) as suggested by Hafemeister and Satija [3] is needed to avoid large residual variance in rarely expressed genes (Additional file 1: Figure S2d). The intuition behind this heuristic is as follows. Consider a UMI dataset with n cells containing a biologically distinct rare population P of size m≪n. Let this population have a marker gene with expression following Poisson(λ) for the cells from P, and zero expression for all n−m remaining cells. For simplicity we assume the Poisson model here, and further assume that all cells have the same sequencing depth.

The expected average expression of this gene is λm/n and so the expected Pearson residual value for this gene for the cells from P is \((\lambda -\lambda m/n)/\sqrt {\lambda m/n} = (n-m)\sqrt {\lambda /(nm)} \approx \sqrt {\lambda n/m}\).

With the clipping threshold \(\sqrt {n}\), clipping will happen whenever λ>m, i.e., when the population P is either very small or has very large UMI counts. For example, a population of 10 cells having a marker gene with the within-population mean expression of 20 UMIs, will result in clipped residuals, as if the within-population mean expression were ∼ 10 UMIs. This may have a large effect on the leading principal components (even PC1) if the data contain a very small number of cells with strong marker gene expression.
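The example of a 10-cell population with mean marker expression of 20 UMIs can be worked through numerically. The sketch below uses the expected-residual formula from this section; the dataset size n=10,000 is an assumption made only for illustration (the argument itself requires just m≪n), and the function name is hypothetical.

```python
import numpy as np


def expected_marker_residual(lam, m, n):
    """Expected Poisson Pearson residual of the marker gene within the rare population."""
    mean_expression = lam * m / n
    return (lam - mean_expression) / np.sqrt(mean_expression)


n, m, lam = 10_000, 10, 20          # n is an illustrative assumption
residual = expected_marker_residual(lam, m, n)  # close to sqrt(lam * n / m)
threshold = np.sqrt(n)
is_clipped = residual > threshold

# Within-population mean whose expected residual sits exactly at the threshold;
# for m << n this is approximately m.
lam_at_threshold = m * (n / (n - m)) ** 2
```

Here the expected residual exceeds \(\sqrt {n}\), so it is clipped to the value that a within-population mean of about m=10 UMIs would have produced.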

Pearson residuals of biologically variable genes

It is instructive to observe the effect Pearson residuals have on genes that have the same variance of log-expression but different expression means. Consider a gene that has expression μ in half of the cells and is upregulated by a factor of two in the other half of the cells. Then its expression mean is 1.5μ, and the Pearson residuals are close to \(\pm 0.5\mu /\sqrt {1.5\mu } \approx \pm 0.4 \sqrt {\mu }\), i.e., the variance of Pearson residuals grows linearly with μ. This makes sense because for higher-expressed genes there is more statistical certainty about over-Poisson variability, but at the same time highlights that Pearson residuals do not aim to estimate the underlying (log-)expression, unlike, e.g., Sanity [50].
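A short numerical illustration of this example follows. It is a sketch that ignores sampling noise and sequencing depth variation (assumptions made for clarity), so the residuals are computed from the two underlying expression levels directly; the function name and the chosen values of μ are hypothetical.

```python
import numpy as np


def noise_free_residuals(mu):
    """Poisson Pearson residuals of the two expression levels mu and 2 * mu."""
    expression = np.array([mu, 2 * mu])  # the two equally sized halves
    mean = expression.mean()              # equals 1.5 * mu
    return (expression - mean) / np.sqrt(mean)


for mu in (0.1, 1.0, 10.0, 100.0):
    z = noise_free_residuals(mu)
    # The variance of the two residuals equals mu / 6, i.e. it is linear in mu,
    # while the variance of log-expression, (ln 2 / 2)^2, is the same for all mu.
    print(f"mu={mu}: residuals={z}, variance={z.var()}, mu/6={mu / 6}")
```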

Experimental details

Analyzed datasets and preprocessing

The datasets used are listed in Table 2. For the organogenesis dataset and the FACS-sorted PBMC dataset, we applied no further filtering. In all remaining datasets, we excluded genes that were expressed in fewer than 5 cells, following Hafemeister and Satija [3]. The data were downloaded as UMI count tables, following the links in the original publications. Direct links to all data sources are given in our GitHub repository.

Table 2 Overview of UMI datasets used for analysis

HVG selection

For gene selection using sqrt(CPMedian), Pearson residuals, and deviance residuals, we applied the respective data transformation and used the variance after transformation as selection criterion. For Seurat and Seurat_v3 methods, we used the respective Scanpy implementations. In brief, these two methods regress out the mean-variance relationship, and return an estimate of the “excess” variance for each gene [34, 35]. For scTransform, we used the corresponding R package [3]. The Fano factor was computed after normalizing by sequencing depth and scaling by median sequencing depth.
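As one concrete instance of the variance-based criteria described above, the sketch below selects genes by the Fano factor computed after depth normalization and scaling by the median depth. The function name, the use of the population variance (ddof=0), and the simulated demo data are assumptions; the same top-k pattern applies when the criterion is the variance after any of the other transformations.

```python
import numpy as np


def select_hvgs_by_fano(counts, n_top=2000):
    """Return indices of the n_top genes with the largest Fano factor, and all Fano factors."""
    counts = np.asarray(counts, dtype=float)
    depths = counts.sum(axis=1, keepdims=True)
    normalized = counts / depths * np.median(depths)  # "counts per median"
    fano = normalized.var(axis=0) / normalized.mean(axis=0)
    top = np.argsort(fano)[::-1][:n_top]
    return np.sort(top), fano


# Simulated data: Poisson background plus one gene upregulated in half of the cells.
rng = np.random.default_rng(0)
demo = rng.poisson(2.0, size=(400, 50))
demo[:200, 7] = rng.poisson(20.0, size=200)
selected, fano = select_hvgs_by_fano(demo, n_top=5)
```

In this toy example the upregulated gene has by far the largest Fano factor and is among the selected genes.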

Data transformation and dimensionality reduction

We used the following abbreviations to denote data transformations: sqrt(CPMedian) — normalization by sequencing depth, followed by scaling by the median depth across all cells (“counts per median”), followed by the square-root transform; log(CPMedian + 1) — normalization by sequencing depth, followed by scaling by the median depth across all cells, followed by log(x+1) transform; log(CPMedian + 1) + standardization — same as log(CPMedian + 1), but followed by centering each gene at mean zero and scaling it to unit variance; log(CPM + 1) — normalization by sequencing depth, followed by scaling by one million (“counts per million”), followed by log(x+1) transform. Pearson residuals were computed with Eq. 12 and then clipped to \(\pm \sqrt {n}\). Deviance residuals were computed with Eq. 20.

All of these methods were typically followed by dimensionality reduction by PCA to 50 dimensions using the Scanpy implementation [5], unless otherwise stated.

Further, we used three variants of GLM-PCA to transform raw counts and reduce dimensionality down to 50 in a joint step: Poisson GLM-PCA, negative binomial GLM-PCA with estimation of a single overdispersion parameter θ shared across genes, and negative binomial GLM-PCA with fixed shared θ. In Townes et al. [4], the authors only used the first two methods. Whenever possible, we used the glmpca-py implementation with default settings. When we reduced the PBMC dataset to 1000 genes for Additional file 1: Figure S4f, GLM-PCA did not converge with default penalty 1, so we increased it to 5, following the tuning procedure used in the authors’ R implementation. Similarly, negative binomial GLM-PCA with estimation of θ did not converge on the benchmark dataset (Fig. 5) when we used gene selection by either deviance residuals (θ=100) or Pearson residuals (θ=10). For these two cases, we had to increase the penalty to 10. On the organogenesis dataset, the Python implementation did not converge within reasonable time, so for this dataset, we resorted to the R implementation. It uses a different optimization method and employs stochastic minibatches. All reported GLM-PCA results for this dataset are for batch size 10,000, as batch sizes 100 and 1,000 (default) resulted in considerably longer runtimes. Because the R implementation does not support NB GLM-PCA with fixed θ, for this dataset, we used GLM-PCA with jointly fit \(\hat {\theta }\).

Unless otherwise stated, all residuals and GLM-PCA with fixed θ used θ=100. Whenever gene selection was performed prior to a data transformation that required sequencing depths, we computed those depths using the sum over selected genes only.
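The depth-based transformations abbreviated at the start of this subsection can be sketched as follows. This is a minimal NumPy illustration and not the Scanpy-based code used for the analyses; the function names are hypothetical, the natural logarithm is assumed for log(x+1), and standardization uses the population standard deviation.

```python
import numpy as np


def counts_per_median(counts):
    """Normalize by sequencing depth, then scale by the median depth across cells."""
    counts = np.asarray(counts, dtype=float)
    depths = counts.sum(axis=1, keepdims=True)
    return counts / depths * np.median(depths)


def sqrt_cpmedian(counts):
    return np.sqrt(counts_per_median(counts))


def log_cpmedian(counts):
    return np.log1p(counts_per_median(counts))  # natural log is an assumption


def log_cpmedian_standardized(counts):
    y = log_cpmedian(counts)
    return (y - y.mean(axis=0)) / y.std(axis=0)  # per gene: zero mean, unit variance


def log_cpm(counts):
    counts = np.asarray(counts, dtype=float)
    depths = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / depths * 1e6)       # "counts per million"


rng = np.random.default_rng(0)
demo = rng.poisson(3.0, size=(100, 20))          # simulated counts, cells x genes
standardized = log_cpmedian_standardized(demo)
```

When gene selection precedes the transformation, passing only the selected columns (e.g., `counts[:, selected]`) makes the depths sum over the selected genes only, as described above.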

Benchmarking cell type separation with kNN classification

We used the Zhengmix8eq dataset with known ground truth labels obtained by FACS-sorting of eight PBMC cell types [42, 43]. There were 400–600 cells in each cell type. We created a ninth, artificial population from 50 randomly selected B-cells (marked blue in Fig. 5a–b). To mimic a separate cell type, we added 10 pseudo marker genes that had zero expression everywhere apart from those 50 cells. For those 50 cells, UMI values were simulated as \(\text {Poisson}(n_{i} p)\), where \(n_{i}\) is the sequencing depth of the i-th selected cell (range: 452–9,697), and expression fraction p was set to 0.001.
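The construction of the artificial population can be sketched as below. This is an illustration under stated assumptions: the function name, the random seed, and the simulated stand-in count matrix are hypothetical, and the sequencing depths are taken from the counts before the marker genes are appended.

```python
import numpy as np


def add_pseudo_markers(counts, cell_idx, n_markers=10, p=0.001, seed=0):
    """Append marker genes that are zero everywhere except in the selected cells."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    depths = counts[cell_idx].sum(axis=1)  # n_i of each selected cell
    markers = np.zeros((counts.shape[0], n_markers), dtype=counts.dtype)
    # UMI values ~ Poisson(n_i * p), drawn independently for each marker gene.
    markers[cell_idx] = rng.poisson(depths[:, None] * p,
                                    size=(len(cell_idx), n_markers))
    return np.hstack([counts, markers])


# Simulated stand-in for a count matrix: 500 cells x 100 genes.
rng = np.random.default_rng(1)
demo = rng.poisson(10.0, size=(500, 100))
rare_cells = rng.choice(500, size=50, replace=False)
augmented = add_pseudo_markers(demo, rare_cells)
```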

We then applied the 70 normalization pipelines shown in Fig. 5c to this dataset. Each pipeline either used one of the six methods to select 2000 HVGs or proceeded without HVG selection, followed by one of the ten methods for data transformation and dimensionality reduction to 50 dimensions. To assess cell type separation in this output space, we used a kNN classifier with a leave-one-out cross-validation procedure: For each cell, we trained a kNN classifier on the remaining n−1 cells. This resulted in a class prediction for each cell based on the majority vote of its k=15 neighboring cells. We quantified the performance of this prediction by computing the macro F1 score (harmonic mean between precision and recall, averaged across classes to counteract class imbalance). We used the sklearn implementations for kNN classification and the F1 score [55].
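The evaluation step can be illustrated with a self-contained sketch. Note that this is a plain NumPy stand-in, not the sklearn implementation used for the reported results: the function name is hypothetical, Euclidean distance is assumed, and ties are broken in favor of the first class in sorted label order. Leave-one-out is realized by excluding each cell from its own neighbor list.

```python
import numpy as np


def knn_loo_macro_f1(embedding, labels, k=15):
    """Leave-one-out kNN majority-vote prediction, scored with the macro F1."""
    embedding = np.asarray(embedding, dtype=float)
    classes, y = np.unique(labels, return_inverse=True)

    # Pairwise squared Euclidean distances (feasible for a few thousand cells).
    sq = (embedding**2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2 * embedding @ embedding.T
    np.fill_diagonal(dist, np.inf)  # leave-one-out: a cell never votes for itself

    neighbors = np.argpartition(dist, k, axis=1)[:, :k]
    votes = np.apply_along_axis(np.bincount, 1, y[neighbors],
                                minlength=len(classes))
    predicted = votes.argmax(axis=1)

    f1_scores = []
    for c in range(len(classes)):
        tp = np.sum((predicted == c) & (y == c))
        fp = np.sum((predicted == c) & (y != c))
        fn = np.sum((predicted != c) & (y == c))
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn > 0 else 0.0)
    return float(np.mean(f1_scores))  # unweighted mean over classes


# Two well-separated simulated clusters in a 5-dimensional output space.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(20, 1, (30, 5))])
cell_types = np.array(["A"] * 30 + ["B"] * 30)
score = knn_loo_macro_f1(points, cell_types, k=15)
```

Because every class contributes equally to the average, a rare population that is absorbed into a neighboring cluster lowers the score noticeably even when overall accuracy barely changes.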

Measuring runtimes

All runtimes given in Table 1 are wall times from running the code in a Docker container with an Ubuntu 18 system on a machine with 256 GB RAM and 2×24 CPUs at 2.1 GHz (Xeon Silver 4116 Dodecacore). The Docker container was restricted to use at most 30 CPU threads. To reduce overhead, we did not use Scanpy for timing experiments, and instead used numpy for basic computations and sklearn for PCA with default settings. Note that we used different implementations of GLM-PCA for the PBMC and organogenesis datasets (see above for details).

t-SNE embeddings

All t-SNE embeddings were made following recommendations from a recent paper [36] using the FIt-SNE implementation [56]. We used the PCA (or, when applicable, GLM-PCA) representation of the data as input. We used default FIt-SNE parameters, including the automatically chosen learning rate. For initialization, we used the first two principal components of the data, scaled such that PC1 had standard deviation 0.0001 (as is default in FIt-SNE). The initialization was shared among all embeddings shown in the same figure, i.e., PCs of one data representation were used to initialize all other embeddings as well. For all datasets apart from the organogenesis one, we used a combination of perplexities 30 and n/100, where n is the sample size [36]. For the organogenesis dataset embeddings, we used perplexity 30 and exaggeration 4 [40].
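The initialization and the perplexity choice described above can be prepared as in the following sketch. Only the inputs are computed here; the call to FIt-SNE itself is omitted because its interface is not described in this section. The function names and the simulated principal components are hypothetical.

```python
import numpy as np


def scaled_pca_init(pcs, target_std=1e-4):
    """First two principal components, rescaled so that PC1 has the target standard deviation."""
    init = np.array(pcs, dtype=float)[:, :2]
    return init / init[:, 0].std() * target_std


def perplexity_combination(n_cells):
    """Perplexities 30 and n/100, to be used jointly."""
    return [30, n_cells / 100]


rng = np.random.default_rng(0)
demo_pcs = rng.normal(size=(5000, 50)) * np.linspace(10, 1, 50)  # simulated PCA output
init = scaled_pca_init(demo_pcs)
perplexities = perplexity_combination(demo_pcs.shape[0])
```

Both components are divided by the same factor, so the relative scale of PC1 and PC2 is preserved.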