Normalizing need not be the norm: count-based math for analyzing single-cell data

  • Research
Theory in Biosciences

Abstract

Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.

Data availability

The countland package for Python and R, as well as all data and code required to reproduce the results shown here, are available at https://github.com/shchurch/countland.

References

  • Ahlmann-Eltze C, Huber W (2021) Comparison of transformations for single-cell RNA-seq data. bioRxiv 2021–06

  • Booeshaghi AS, Hallgrímsdóttir IB, Gálvez-Merchán Á, Pachter L (2022) Depth normalization for single-cell genomics count data. bioRxiv

  • Cao Y, Kitanovski S, Küppers R, Hoffmann D (2021) UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol 39:158–159

  • Chari T, Banerjee J, Pachter L (2021) The specious art of single-cell genomics. bioRxiv

  • Dong B, Lin MM, Park H (2018) Integer matrix approximation and data mining. J Sci Comput 75:198–224

  • Freytag S, Tian L, Lönnstedt I, et al (2018) Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research 7:

  • Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810

  • Hafemeister C, Satija R (2019) Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20:1–15

  • Hicks SC, Townes FW, Teng M, Irizarry RA (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562–578

  • Houle D, Pélabon C, Wagner GP, Hansen TF (2011) Measurement and meaning in biology. Q Rev Biol 86:3–34

  • Jiang R, Sun T, Song D, Li JJ (2022) Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol 23:1–24

  • John CR, Watson D, Barnes MR et al (2020) Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics 36:1159–1166

  • Lin MM, Dong B, Chu MT (2005) Integer matrix factorization and its application. Technical Reports

  • Liu S, Trapnell C (2016) Single-cell transcriptome sequencing: Recent advances and remaining challenges. F1000Research 5:

  • Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: A tutorial. Mol Syst Biol 15:e8746

  • Lun A (2018) Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. bioRxiv 404962

  • Musser JM, Schippers KJ, Nickel M et al (2021) Profiling cellular diversity in sponges informs animal cell type and nervous system evolution. Science 374:717–723

  • Ng A, Jordan M, Weiss Y (2001) On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst 14:1–8

  • Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: Machine learning in python. The Journal of Machine Learning Research 12:2825–2830

  • Perros I, Papalexakis EE, Park H, et al (2018) SUSTain: Scalable unsupervised scoring for tensors and its application to phenotyping. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp 2080–2089

  • Qiu P (2020) Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun 11:1–9

  • Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:1–9

  • Saliba A-E, Westermann AJ, Gorski SA, Vogel J (2014) Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res 42:8845–8860

  • Sarkar A, Stephens M (2021) Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 53:770–777

  • Silverman JD, Roche K, Mukherjee S, David LA (2020) Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 18:2789–2798

  • Svensson V (2020) Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38:147–150

  • Townes FW, Hicks SC, Aryee MJ, Irizarry RA (2019) Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol 20:1–16

  • Vallejos CA, Risso D, Scialdone A et al (2017) Normalizing single-cell RNA sequencing data: Challenges and opportunities. Nat Methods 14:565–571

  • Van den Berge K, Hembach KM, Soneson C et al (2019) RNA sequencing data: Hitchhiker’s guide to expression analysis. Annual Review of Biomedical Data Science 2:139–173

  • Van Verk MC, Hickman R, Pieterse CM, Van Wees SC (2013) RNA-seq: Revelation of the messengers. Trends Plant Sci 18:175–179

  • Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci 131:281–285

  • Wang Z, Gerstein M, Snyder M (2009) RNA-seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

  • Zheng GX, Terry JM, Belgrader P et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:1–12

  • Ziegenhain C, Vieth B, Parekh S et al (2017) Comparative analysis of single-cell RNA sequencing methods. Mol Cell 65:631–643

Acknowledgements

We thank Abby Skwara, Kevin O’Neil, Milo S. Johnson, Seth Donoughe, and members of the Dunn lab for comments on drafts of this manuscript. We thank Jacob Musser for providing insight on analysis of the sponge dataset. We thank the anonymous reviewers for their recommendations. This material is based upon work supported by the NSF Postdoctoral Research Fellowship in Biology under Grant No. 2109502.

Author information

Corresponding author

Correspondence to Samuel H. Church.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 4559 KB)

Appendix 1. Properties of the count-space over zero and the natural numbers

This appendix provides additional background and information on mathematical concepts relevant to the analysis of transcript count matrices.

The natural numbers, \(\mathbb {N}_0\)

The natural numbers are all positive integers. Depending on the definition, the natural numbers may also include zero (i.e. all non-negative integers), a convention we follow here so that \(\mathbb {N}_0=\{0,1,2,3,\ldots \}\). We can classify sets of numbers by the operations under which they are closed, meaning that applying the operation to elements of the set always yields an element of the set. One familiar structure is the field, which is closed under addition and multiplication, along with their inverse operations, subtraction and division (except division by zero); examples include the rational numbers \(\mathbb {Q}\) and the real numbers \(\mathbb {R}\). The natural numbers form a semiring, because \(\mathbb {N}_0\) is closed under addition and multiplication, but not under subtraction or division.

\(\mathbb {N}_0\) contains the additive identity element, 0, that can be added to any element to return the same element (e.g. \(x+0=x\)). \(\mathbb {N}_0\) also contains the multiplicative identity element, 1, that can be multiplied with any element to return the same element (e.g. \(x\cdot 1=x\)).

Inverses are elements that return the identity element under a specified operation. For example, the negative numbers are inverses under addition, because \(x+(-x)\) returns the additive identity element, 0. However, because \(\mathbb {N}_0\) does not contain negative numbers, no element other than 0 has an additive inverse. Similarly, reciprocals are inverses under multiplication, because \(x(1/x)\) returns the multiplicative identity element, 1. \(\mathbb {N}_0\) likewise does not contain the reciprocal of any element other than 1, so its elements in general do not have multiplicative inverses.

Subtraction can be defined as the addition of an additive inverse (a negative number), and division can be defined as multiplication with a multiplicative inverse (a reciprocal). Because these inverses are in general not contained in \(\mathbb {N}_0\), subtraction and division are not defined for arbitrary pairs of elements. This is equivalent to the observation that \(\mathbb {N}_0\) is not closed under subtraction or division.
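
As an illustration of closure, the following minimal Python sketch checks whether the result of each elementary operation on two example counts remains in \(\mathbb {N}_0\). The helper name is_count and the example values are assumptions chosen for illustration; they are not part of countland.

```python
from fractions import Fraction


def is_count(x):
    """Return True if x is an element of N_0 (a non-negative integer)."""
    return isinstance(x, int) and not isinstance(x, bool) and x >= 0


# Two example counts (hypothetical values).
a, b = 3, 5

sum_ab = a + b                # addition: result is again a count
product_ab = a * b            # multiplication: result is again a count
difference_ab = a - b         # subtraction: result is negative here
quotient_ab = Fraction(a, b)  # exact division: result is a non-integer fraction

closure = {
    "addition": is_count(sum_ab),
    "multiplication": is_count(product_ab),
    "subtraction": is_count(difference_ab),
    "division": is_count(quotient_ab),
}
```

For these values, only addition and multiplication return an element of \(\mathbb {N}_0\). A single counterexample is enough to show that a set is not closed under an operation, which is why one pair of counts suffices for subtraction and division.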

Count-space over \(\mathbb {N}_0\)

The count-space over \(\mathbb {N}_0\) contains the vectors \(\textbf{V}\) such that

$$\begin{aligned} \textbf{V} = \{(a_1,a_2,a_3,\ldots ,a_n): a_1,a_2,a_3,\ldots ,a_n \in \mathbb {N}_0 \} \end{aligned}$$

The count-space over \(\mathbb {N}_0\) is a semimodule restricted to the integer lattice over the non-negative orthant of a coordinate system (the upper right quadrant in two dimensions), inclusive of the origin and axes. Certain operations are possible in this restricted space, while others are not. For example, we can apply the dot product, because this operation requires only multiplication and addition of vector elements. However, unlike in a vector space over a field, in the count-space over \(\mathbb {N}_0\) there are no angles between vectors: calculating an angle from the dot product requires division by the vector lengths, and \(\mathbb {N}_0\) is not closed under division.
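
To make this concrete, here is a minimal Python sketch of the dot product between cell vectors that uses only multiplication and addition of counts. The function name dot and the three example cells are hypothetical; this is not the countland implementation.

```python
def dot(u, v):
    """Dot product of two count vectors using only multiplication and addition."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of genes")
    total = 0
    for x, y in zip(u, v):
        total += x * y
    return total


# Hypothetical transcript counts for three cells across five genes.
cell_a = [0, 2, 1, 0, 3]
cell_b = [1, 2, 0, 0, 1]
cell_c = [4, 0, 0, 1, 0]

ab = dot(cell_a, cell_b)  # 0*1 + 2*2 + 1*0 + 0*0 + 3*1 = 7
ac = dot(cell_a, cell_c)  # no gene is detected in both cells, so the result is 0
```

Because every intermediate value is a product or a sum of counts, the result is itself a count and no step leaves \(\mathbb {N}_0\).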

Furthermore, in count-space, vector length is not a Euclidean measure of distance, as there is no equivalent measure in a space without subtraction or square roots. Instead of Euclidean distance, we can use the number of integer steps in a positive direction as a measure of length, i.e. the sum of the vector's elements. This is equal to the Manhattan distance between the vector terminus and the origin, though calculating the Manhattan distance between two arbitrary points generally requires subtraction.
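
A short Python sketch of this measure of length, under the assumption of a small hypothetical cell vector, contrasts it with the Euclidean length, which requires a square root and generally falls outside \(\mathbb {N}_0\).

```python
import math

# Hypothetical transcript counts for one cell across five genes.
cell = [0, 2, 1, 0, 3]

# Length in count-space: the number of unit steps from the origin,
# computed with addition only.
count_length = sum(cell)

# The squared Euclidean length is still a count (products and sums only) ...
squared_euclidean = sum(x * x for x in cell)

# ... but the Euclidean length itself is a count only if that value is a perfect square.
root = math.isqrt(squared_euclidean)
euclidean_is_count = root * root == squared_euclidean
```

Here count_length is 6, while the squared Euclidean length is 14, whose square root is not an integer.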

Vector rotation is not possible in count-space, as it would require rotation matrices with new basis vectors that include negative elements. Vector reflections are possible, because we are free to permute our count matrix, as are shears of vectors. Some vector projections are possible, but not all. For example, if we project vector \(\textbf{b}\) onto vector \(\textbf{a}\), the result will be a multiple \(q\) of \(\textbf{a}\), \(q\textbf{a}\). The value of that multiple will be equal to \(q=(\textbf{a}^T\textbf{b})/(\textbf{a}^T\textbf{a})\), which requires division to calculate unless \(\textbf{a}^T\textbf{a} = 1\). Over \(\mathbb {N}_0\), that only happens when \(\textbf{a}\) has a single non-zero entry and that entry is 1, i.e. when vector \(\textbf{a}\) is one of the original basis vectors. Therefore we can project vectors onto basis vectors, but not onto arbitrary vectors (e.g. we cannot project one cell vector onto another). Projecting onto basis vectors is the equivalent of multiplying some values by 0 while retaining others.
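
The projection argument can be sketched in Python as follows. The function names and example vectors are assumptions made for illustration. The multiple \(q\) is returned as a numerator and denominator so that no division is performed.

```python
def dot(u, v):
    """Dot product using only multiplication and addition of counts."""
    return sum(x * y for x, y in zip(u, v))


def projection_multiple(a, b):
    """Return (a^T b, a^T a), the numerator and denominator of q."""
    return dot(a, b), dot(a, a)


def project_onto_basis(b, i):
    """Project b onto the i-th basis vector: keep entry i, multiply the rest by 0."""
    return [x if j == i else 0 for j, x in enumerate(b)]


b = [0, 2, 1, 0, 3]        # hypothetical cell vector
basis_1 = [0, 1, 0, 0, 0]  # basis vector for the second gene (index 1)
other = [1, 2, 0, 0, 1]    # a second hypothetical cell vector

q_basis = projection_multiple(basis_1, b)  # (2, 1): denominator is 1, so q = 2
q_other = projection_multiple(other, b)    # (7, 6): q = 7/6 is not in N_0
projected = project_onto_basis(b, 1)       # [0, 2, 0, 0, 0]
```

When the denominator is 1 the multiple is simply the numerator, and the projection stays in count-space. For the second cell vector the multiple would be 7/6, so the projection is not defined over \(\mathbb {N}_0\).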

Without rotation and vector projection, it is clear that certain complex operations like principal component analysis that rely on these are not possible in this vector space.

High-dimensional, low-magnitude count-space over \(\mathbb {N}_0\)

While the above pertains to the count-space over \(\mathbb {N}_0\) in general, there are interesting properties of the very high dimensional, low magnitude spaces that describe scRNA-seq count data.

Most gene expression datasets contain measurements for many thousands of genes, meaning this count-space has many thousands of dimensions. Furthermore, due to the sparse nature of these count matrices, it is difficult if not impossible to find a reasonable lower-dimensional approximation. In other words, because many features contain only a few, non-overlapping observations, there is no way to reduce the rank of this matrix without discarding features.

Because most measures of gene expression are low-magnitude integers, most cell vectors terminate only a few steps from the origin in any given direction. This does not mean that vectors are close to the origin overall. Vector length is non-Euclidean; it is calculated as the sum of steps back to the origin (Manhattan distance), not the distance along a diagonal (Euclidean distance).
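
As a worked illustration under an assumed example (not drawn from the datasets analyzed here), consider a cell vector \(\textbf{v}\) with a count of 1 for each of \(n\) genes and 0 for all others. Its length in count-space, and the Euclidean length it would be assigned in a vector space over \(\mathbb {R}\), are

$$\begin{aligned} \sum _{i} v_i = n, \qquad \sqrt{\sum _{i} v_i^2} = \sqrt{n} \end{aligned}$$

For \(n = 2500\), the vector terminates a single step from the origin in each occupied direction, yet it is 2500 steps from the origin in total, compared with a Euclidean length of 50.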

Because the vast majority of values in the count matrix are zero, cell vectors are perpendicular to each other in many directions. This may result in the outcome that cell-cell similarity has more to do with the number and distribution of non-zero observations than expression magnitude. This has been demonstrated by the fact that binary transformations of scRNA-seq data to zero/non-zero contain enough information to recapitulate major patterns in the data (Qiu 2020).
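
The role of non-zero observations can be illustrated with a minimal Python sketch. The helper binarize and the example cells are hypothetical, and this is a simplified illustration rather than the procedure used by Qiu (2020) or by countland.

```python
def dot(u, v):
    """Dot product using only multiplication and addition of counts."""
    return sum(x * y for x, y in zip(u, v))


def binarize(cell):
    """Map each count to 1 if the gene was detected and 0 otherwise."""
    return [1 if x > 0 else 0 for x in cell]


# Hypothetical transcript counts for two cells across six genes.
cell_a = [0, 5, 1, 0, 0, 2]
cell_b = [0, 1, 0, 0, 3, 9]

# On binarized vectors, the dot product counts genes detected in both cells.
shared_genes = dot(binarize(cell_a), binarize(cell_b))

# On raw counts, only those shared genes contribute non-zero terms.
count_dot = dot(cell_a, cell_b)  # 5*1 + 2*9 = 23
```

In this example two genes are detected in both cells, and those two genes alone determine the dot product of the raw counts; every other term is multiplied by a zero.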

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Church, S.H., Mah, J.L., Wagner, G. et al. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci. 143, 45–62 (2024). https://doi.org/10.1007/s12064-023-00408-x

  • DOI: https://doi.org/10.1007/s12064-023-00408-x
