Normalizing need not be the norm: count-based math for analyzing single-cell data

Church, Samuel H.; Mah, Jasmine L.; Wagner, Günter; Dunn, Casey W.

doi:10.1007/s12064-023-00408-x

Research
Published: 10 November 2023

Theory in Biosciences, volume 143, pages 45–62 (2024)

Samuel H. Church¹,
Jasmine L. Mah¹,
Günter Wagner^1,2,3,4 &
…
Casey W. Dunn¹

Abstract

Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.

Data availability

The countland package for Python and R, as well as all data and code required to reproduce the results shown here, are available at https://github.com/shchurch/countland.

References

Ahlmann-Eltze C, Huber W (2021) Comparison of transformations for single-cell RNA-seq data. bioRxiv 2021–06
Booeshaghi AS, Hallgrímsdóttir IB, Gálvez-Merchán Á, Pachter L (2022) Depth normalization for single-cell genomics count data. bioRxiv
Cao Y, Kitanovski S, Küppers R, Hoffmann D (2021) UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol 39:158–159
Chari T, Banerjee J, Pachter L (2021) The specious art of single-cell genomics. bioRxiv
Dong B, Lin MM, Park H (2018) Integer matrix approximation and data mining. J Sci Comput 75:198–224
Freytag S, Tian L, Lönnstedt I et al (2018) Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research 7:
Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810
Hafemeister C, Satija R (2019) Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20:1–15
Hicks SC, Townes FW, Teng M, Irizarry RA (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562–578
Houle D, Pélabon C, Wagner GP, Hansen TF (2011) Measurement and meaning in biology. Q Rev Biol 86:3–34
Jiang R, Sun T, Song D, Li JJ (2022) Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol 23:1–24
John CR, Watson D, Barnes MR et al (2020) Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics 36:1159–1166
Lin MM, Dong B, Chu MT (2005) Integer matrix factorization and its application. Technical Reports
Liu S, Trapnell C (2016) Single-cell transcriptome sequencing: Recent advances and remaining challenges. F1000Research 5:
Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: A tutorial. Mol Syst Biol 15:e8746
Lun A (2018) Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. bioRxiv 404962
Musser JM, Schippers KJ, Nickel M et al (2021) Profiling cellular diversity in sponges informs animal cell type and nervous system evolution. Science 374:717–723
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst 14:1–8
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Perros I, Papalexakis EE, Park H et al (2018) SUSTain: Scalable unsupervised scoring for tensors and its application to phenotyping. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp 2080–2089
Qiu P (2020) Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun 11:1–9
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:1–9
Saliba A-E, Westermann AJ, Gorski SA, Vogel J (2014) Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res 42:8845–8860
Sarkar A, Stephens M (2021) Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 53:770–777
Silverman JD, Roche K, Mukherjee S, David LA (2020) Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 18:2789–2798
Svensson V (2020) Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38:147–150
Townes FW, Hicks SC, Aryee MJ, Irizarry RA (2019) Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol 20:1–16
Vallejos CA, Risso D, Scialdone A et al (2017) Normalizing single-cell RNA sequencing data: Challenges and opportunities. Nat Methods 14:565–571
Van den Berge K, Hembach KM, Soneson C et al (2019) RNA sequencing data: Hitchhiker’s guide to expression analysis. Annu Rev Biomed Data Sci 2:139–173
Van Verk MC, Hickman R, Pieterse CM, Van Wees SC (2013) RNA-seq: Revelation of the messengers. Trends Plant Sci 18:175–179
Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci 131:281–285
Wang Z, Gerstein M, Snyder M (2009) RNA-seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
Zheng GX, Terry JM, Belgrader P et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:1–12
Ziegenhain C, Vieth B, Parekh S et al (2017) Comparative analysis of single-cell RNA sequencing methods. Mol Cell 65:631–643

Acknowledgements

We thank Abby Skwara, Kevin O’Neil, Milo S. Johnson, Seth Donoughe, and members of the Dunn lab for comments on drafts of this manuscript. We thank Jacob Musser for providing insight on analysis of the sponge dataset. We thank the anonymous reviewers for their recommendations. This material is based upon work supported by the NSF Postdoctoral Research Fellowship in Biology under Grant No. 2109502.

Author information

Authors and Affiliations

Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
Samuel H. Church, Jasmine L. Mah, Günter Wagner & Casey W. Dunn
Yale Systems Biology Institute, Yale University, New Haven, CT, USA
Günter Wagner
Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA
Günter Wagner
Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
Günter Wagner

Corresponding author

Correspondence to Samuel H. Church.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 4559 KB)

Appendix 1. Properties of the count-space over zero and the natural numbers

This appendix provides additional background and information on mathematical concepts relevant to the analysis of transcript count matrices.

The natural numbers, $\mathbb {N}_0$

The natural numbers are the positive integers. Depending on the definition, the natural numbers may also include zero (i.e. all non-negative integers), a convention we follow here so that $\mathbb {N}_0=\{0,1,2,3,\ldots \}$. We can classify sets of numbers by the operations under which they are closed, meaning that applying the operation to elements of the set always yields an element that is also in the set. One familiar structure is a field, which is closed under addition and multiplication, along with their inverses, subtraction and division (excluding division by zero); examples include the rational numbers $\mathbb {Q}$ and the real numbers $\mathbb {R}$. The natural numbers form a semiring, because $\mathbb {N}_0$ is closed under addition and multiplication, but not under subtraction or division.

$\mathbb {N}_0$ contains the additive identity element, 0, that can be added to any element to return the same element (e.g. $x+0=x$). $\mathbb {N}_0$ also contains the multiplicative identity element, 1, that can be multiplied with any element to return the same element (e.g. $x \cdot 1=x$).

Inverses are elements that combine with a given element to return the identity element under a specified operation. For example, the negative numbers are inverses under addition, because $x+(-x)$ returns the additive identity element, 0. However, because $\mathbb {N}_0$ does not contain negative numbers, it does not have additive inverses. Similarly, reciprocal values are inverses under multiplication, because $x(1/x)$ returns the multiplicative identity element, 1. $\mathbb {N}_0$ likewise does not contain reciprocals (other than 1 itself), and therefore does not have multiplicative inverses.

Subtraction can be defined as the addition of an additive inverse (a negative number), and division can be defined as multiplication with a multiplicative inverse (a reciprocal). Because these two inverses are not contained in $\mathbb {N}_0$, there is no subtraction or division. This is equivalent to the observation that $\mathbb {N}_0$ is not closed under subtraction or division.
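The closure properties described above can be checked on a small example. The following is a minimal sketch in Python; the helper `in_n0` and the values `a` and `b` are hypothetical names chosen for illustration and are not part of countland.

```python
from fractions import Fraction


def in_n0(x):
    """Return True if x is a non-negative integer, i.e. an element of N_0."""
    return Fraction(x).denominator == 1 and x >= 0


a, b = 3, 5  # two example counts, both in N_0

# Apply each operation and ask whether the result is still in N_0.
closure = {
    "a + b": in_n0(a + b),           # 8 stays in N_0
    "a * b": in_n0(a * b),           # 15 stays in N_0
    "a - b": in_n0(a - b),           # -2 is negative, so it leaves N_0
    "a / b": in_n0(Fraction(a, b)),  # 3/5 is not an integer, so it leaves N_0
}

for operation, stays_in_n0 in closure.items():
    print(operation, "in N_0:", stays_in_n0)
```

Some individual subtractions and divisions do land in $\mathbb {N}_0$ (e.g. $5-3$), but closure requires this for every pair of elements, which is why a single counterexample suffices.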

Count-space over $\mathbb {N}_0$

The count-space over $\mathbb {N}_0$ contains the vectors $\textbf{V}$ such that

$$\begin{aligned} \textbf{V} = \{(a_1,a_2,a_3,\ldots ,a_n): a_1,a_2,a_3,\ldots ,a_n \in \mathbb {N}_0 \} \end{aligned}$$

The count-space over $\mathbb {N}_0$ is a semimodule that is restricted to the integer lattice located over the upper right quadrant of a coordinate system (in $n$ dimensions, the non-negative orthant), inclusive of the origin and axes. Certain operations are possible in this restricted space, while others are not. For example, we can apply the operation of the dot product because this operation requires only multiplication and addition of vector elements. However, unlike a vector space over a field, in the count-space over $\mathbb {N}_0$ there are no angles between vectors. Calculating angles from the dot product requires division by vector length, and $\mathbb {N}_0$ is not closed under division.
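To illustrate why the dot product is available in count-space while angles are not, here is a minimal Python sketch. The vectors `cell_a` and `cell_b` and the helper `dot` are made-up examples, not data or functions from the paper or from countland.

```python
import math


def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(u, v))


# Two hypothetical cells measured across five genes (counts in N_0).
cell_a = [2, 0, 1, 0, 3]
cell_b = [1, 4, 0, 0, 2]

# The dot product of two count vectors is itself a count.
shared = dot(cell_a, cell_b)  # 2*1 + 0*4 + 1*0 + 0*0 + 3*2 = 8

# An angle needs square roots and division, which leave N_0.
cosine = shared / (math.sqrt(dot(cell_a, cell_a)) * math.sqrt(dot(cell_b, cell_b)))

print(shared, cosine)
```

The dot product stays an integer, whereas the cosine used to define an angle is a real number outside $\mathbb {N}_0$.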

Furthermore, in count-space, vector length is not a Euclidean measure of distance as there is no equivalent measure of distance in a space without subtraction or square roots. Instead of Euclidean distance, we can use the number of integer steps in a positive direction as a measure of length, which is equivalent to the magnitude of the Manhattan distance between the vector terminus and the origin, though calculating the Manhattan distance generally requires subtraction.
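As a small illustration of this notion of length, the sketch below counts integer steps from the origin using addition only, and then confirms that the result matches the Manhattan distance to the origin. The function name `count_length` and the example vector are assumptions made for illustration.

```python
def count_length(v):
    """Length of a count vector as the total number of unit steps from the origin."""
    if any(not isinstance(x, int) or x < 0 for x in v):
        raise ValueError("all elements must be non-negative integers")
    return sum(v)  # addition only; no subtraction or square roots needed


cell = [2, 0, 1, 0, 3]  # hypothetical counts for five genes

length = count_length(cell)

# The general Manhattan distance formula uses subtraction and absolute values,
# but measured from the origin it reduces to the same sum.
origin = [0] * len(cell)
manhattan = sum(abs(x - o) for x, o in zip(cell, origin))

print(length, manhattan)
```

For a cell vector, this length is simply the total number of transcripts counted in that cell.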

Vector rotation is not possible in count-space as it would require rotation matrices with new basis vectors that include negative elements. Vector reflections are possible, because we are free to permute our count matrix, as are shears of vectors. Some vector projections are possible, but not all. For example, if we project vector $\textbf{b}$ onto vector $\textbf{a}$, the result will be a multiple $q$ of $\textbf{a}$, $q\textbf{a}$. The value of that multiple will be equal to $q=(\textbf{a}^T\textbf{b})/(\textbf{a}^T\textbf{a})$, which requires division to calculate unless $\textbf{a}^T\textbf{a} = 1$. Over $\mathbb {N}_0$, that only happens when there is a single entry that is 1, i.e. when vector $\textbf{a}$ is one of the original basis vectors. Therefore we can project vectors onto basis vectors, but not onto arbitrary vectors (e.g. we cannot project one cell vector onto another). Projecting onto basis vectors is the equivalent of multiplying some values by 0 while retaining others.
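The projection argument can be worked through numerically. The sketch below computes the multiple $q=(\textbf{a}^T\textbf{b})/(\textbf{a}^T\textbf{a})$ as an exact fraction for a basis vector and for an arbitrary vector; the vectors and the helper names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    return sum(x * y for x, y in zip(u, v))


def projection_multiple(a, b):
    """Exact value of q = (a.b)/(a.a) for projecting b onto a."""
    denominator = dot(a, a)
    if denominator == 0:
        raise ValueError("cannot project onto the zero vector")
    return Fraction(dot(a, b), denominator)


b = [2, 0, 1, 3]   # hypothetical cell vector
e4 = [0, 0, 0, 1]  # basis vector for the fourth gene, so e4.e4 = 1
a = [1, 1, 0, 2]   # arbitrary count vector, a.a = 6

q_basis = projection_multiple(e4, b)     # 3/1: no division needed
q_arbitrary = projection_multiple(a, b)  # 8/6 = 4/3: not in N_0

# Projecting onto a basis vector keeps one count and zeroes the others.
projected = [int(q_basis) * x for x in e4]

print(q_basis, q_arbitrary, projected)
```

In this example the projection onto the basis vector returns a count vector, while the projection onto the arbitrary vector would require scaling by a fraction.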

Without rotation and vector projection, it is clear that certain complex operations like principal component analysis that rely on these are not possible in this vector space.

High-dimensional, low-magnitude count-space over $\mathbb {N}_0$

While the above pertains to the count-space over $\mathbb {N}_0$ in general, there are interesting properties of the very high dimensional, low magnitude spaces that describe scRNA-seq count data.

Most gene expression datasets contain measurements for many thousands of genes, meaning this count-space has many thousands of dimensions. Furthermore, due to the sparse nature of these count matrices, it is difficult if not impossible to find a reasonable lower-dimensional approximation. In other words, because many features contain only a few, non-overlapping observations, there is no way to reduce the rank of this matrix without discarding features.

Because most measures of gene expression are low-magnitude integers, most cell vectors terminate only a few steps from the origin in any given direction. This does not mean that vectors are close to the origin overall. Vector length is non-Euclidean; it is calculated as the sum of steps back to the origin (Manhattan distance), not the distance along a diagonal (Euclidean distance).

Because the vast majority of values in the count matrix are zero, cell vectors are perpendicular to each other in many directions. This may result in the outcome that cell-cell similarity has more to do with the number and distribution of non-zero observations than expression magnitude. This has been demonstrated by the fact that binary transformations of scRNA-seq data to zero/non-zero contain enough information to recapitulate major patterns in the data (Qiu 2020).
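A toy example may help make this concrete. In the sketch below, three made-up sparse cell vectors are compared with the dot product, first on the counts and then on a zero/non-zero binarization. The data and the helper `binarize` are assumptions for illustration and do not reproduce the analysis in Qiu (2020).

```python
def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    return sum(x * y for x, y in zip(u, v))


def binarize(v):
    """Map each count to 1 if the gene was detected and 0 otherwise."""
    return [1 if x > 0 else 0 for x in v]


# Hypothetical sparse cells measured across six genes.
cell_a = [3, 0, 0, 1, 0, 0]
cell_b = [0, 2, 0, 0, 0, 0]
cell_c = [1, 0, 0, 5, 0, 0]

# Cells with no detected genes in common have a dot product of zero.
ab = dot(cell_a, cell_b)  # 0

# Cells that share detected genes have a positive dot product.
ac = dot(cell_a, cell_c)  # 3*1 + 1*5 = 8

# On binarized data, the dot product counts the shared detected genes.
shared_genes_ac = dot(binarize(cell_a), binarize(cell_c))  # 2
shared_genes_ab = dot(binarize(cell_a), binarize(cell_b))  # 0

print(ab, ac, shared_genes_ac, shared_genes_ab)
```

Whether a pair of cells has a zero or non-zero dot product is determined entirely by which genes are detected in both, which is the information the binarized vectors retain.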

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Church, S.H., Mah, J.L., Wagner, G. et al. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci. 143, 45–62 (2024). https://doi.org/10.1007/s12064-023-00408-x

Received: 05 June 2023
Accepted: 13 October 2023
Published: 10 November 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s12064-023-00408-x
