Abstract
Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.
Data availability
The countland package for Python and R, as well as all data and code required to reproduce the results shown here, are available at https://github.com/shchurch/countland.
References
Ahlmann-Eltze C, Huber W (2021) Comparison of transformations for single-cell RNA-seq data. bioRxiv 2021–06
Booeshaghi AS, Hallgrímsdóttir IB, Gálvez-Merchán Á, Pachter L (2022) Depth normalization for single-cell genomics count data. bioRxiv
Cao Y, Kitanovski S, Küppers R, Hoffmann D (2021) UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol 39:158–159
Chari T, Banerjee J, Pachter L (2021) The specious art of single-cell genomics. bioRxiv
Dong B, Lin MM, Park H (2018) Integer matrix approximation and data mining. J Sci Comput 75:198–224
Freytag S, Tian L, Lönnstedt I et al (2018) Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research 7:
Grün D, van Oudenaarden A (2015) Design and analysis of single-cell sequencing experiments. Cell 163:799–810
Hafemeister C, Satija R (2019) Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20:1–15
Hicks SC, Townes FW, Teng M, Irizarry RA (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562–578
Houle D, Pélabon C, Wagner GP, Hansen TF (2011) Measurement and meaning in biology. Q Rev Biol 86:3–34
Jiang R, Sun T, Song D, Li JJ (2022) Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol 23:1–24
John CR, Watson D, Barnes MR et al (2020) Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics 36:1159–1166
Lin MM, Dong B, Chu MT (2005) Integer matrix factorization and its application. Technical Reports
Liu S, Trapnell C (2016) Single-cell transcriptome sequencing: Recent advances and remaining challenges. F1000Research 5:
Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: A tutorial. Mol Syst Biol 15:e8746
Lun A (2018) Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. bioRxiv 404962
Musser JM, Schippers KJ, Nickel M et al (2021) Profiling cellular diversity in sponges informs animal cell type and nervous system evolution. Science 374:717–723
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst 14:1–8
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Perros I, Papalexakis EE, Park H et al (2018) SUSTain: Scalable unsupervised scoring for tensors and its application to phenotyping. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp 2080–2089
Qiu P (2020) Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun 11:1–9
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:1–9
Saliba A-E, Westermann AJ, Gorski SA, Vogel J (2014) Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res 42:8845–8860
Sarkar A, Stephens M (2021) Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 53:770–777
Silverman JD, Roche K, Mukherjee S, David LA (2020) Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 18:2789–2798
Svensson V (2020) Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38:147–150
Townes FW, Hicks SC, Aryee MJ, Irizarry RA (2019) Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol 20:1–16
Vallejos CA, Risso D, Scialdone A et al (2017) Normalizing single-cell RNA sequencing data: Challenges and opportunities. Nat Methods 14:565–571
Van den Berge K, Hembach KM, Soneson C et al (2019) RNA sequencing data: Hitchhiker’s guide to expression analysis. Annual Review of Biomedical Data Science 2:139–173
Van Verk MC, Hickman R, Pieterse CM, Van Wees SC (2013) RNA-seq: Revelation of the messengers. Trends Plant Sci 18:175–179
Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci 131:281–285
Wang Z, Gerstein M, Snyder M (2009) RNA-seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
Zheng GX, Terry JM, Belgrader P et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:1–12
Ziegenhain C, Vieth B, Parekh S et al (2017) Comparative analysis of single-cell RNA sequencing methods. Mol Cell 65:631–643
Acknowledgements
We thank Abby Skwara, Kevin O’Neil, Milo S. Johnson, Seth Donoughe, and members of the Dunn lab for comments on drafts of this manuscript. We thank Jacob Musser for providing insight on analysis of the sponge dataset. We thank the anonymous reviewers for their recommendations. This material is based upon work supported by the NSF Postdoctoral Research Fellowship in Biology under Grant No. 2109502.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Supplementary Information
Appendix 1. Properties of the count-space over zero and the natural numbers
This appendix provides additional background and information on mathematical concepts relevant to the analysis of transcript count matrices.
The natural numbers, \(\mathbb {N}_0\)
The natural numbers are all positive integers. Depending on the definition, the natural numbers may also include zero (i.e. all non-negative integers), a convention we follow here so that \(\mathbb {N}_0=\{0,1,2,3,\ldots \}\). We can classify sets of numbers by the operations under which they are closed, meaning that applying the operation to elements of the set always yields an element that is also in the set. One familiar structure is a field, which is closed under addition and multiplication as well as their inverse operations, subtraction and division (except division by zero); examples include the rational numbers \(\mathbb {Q}\) and the real numbers \(\mathbb {R}\). The natural numbers instead form a semiring, because \(\mathbb {N}_0\) is closed under addition and multiplication, but not under subtraction or division.
\(\mathbb {N}_0\) contains the additive identity element, 0, that can be added to any element to return the same element (e.g. \(x+0=x\)). \(\mathbb {N}_0\) also contains the multiplicative identity element, 1, that can be multiplied with any element to return the same element (e.g. \(x\cdot 1=x\)).
Inverses are elements that return the identity elements under specified operations. For example, the negative numbers are inverses under addition, because \(x+(-x)\) returns the additive identity element, 0. However, because \(\mathbb {N}_0\) does not contain negative numbers, its non-zero elements have no additive inverses. Similarly, reciprocal values are inverses under multiplication, because \(x(1/x)\) returns the multiplicative identity element, 1. \(\mathbb {N}_0\) likewise does not contain the reciprocal of any element other than 1, so its elements generally have no multiplicative inverses.
Subtraction can be defined as the addition of an additive inverse (a negative number), and division can be defined as multiplication with a multiplicative inverse (a reciprocal). Because these two inverses are not contained in \(\mathbb {N}_0\), there is no subtraction or division. This is equivalent to the observation that \(\mathbb {N}_0\) is not closed for subtraction or division.
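The closure properties described above can be spot-checked numerically. The following is a minimal sketch in Python, not part of the countland package; the helper `in_n0` and the sample range are hypothetical choices made for illustration, and a check over a finite sample illustrates closure rather than proving it.

```python
from fractions import Fraction
from itertools import product


def in_n0(x):
    """Return True if x is a non-negative whole number, i.e. an element of N_0."""
    value = Fraction(x)
    return value.denominator == 1 and value >= 0


# A small, arbitrary sample of N_0 (an assumption made for illustration).
sample = range(6)
pairs = list(product(sample, repeat=2))

# Addition and multiplication never leave N_0.
closed_add = all(in_n0(a + b) for a, b in pairs)
closed_mul = all(in_n0(a * b) for a, b in pairs)

# Subtraction can leave N_0 (e.g. 2 - 5 = -3).
closed_sub = all(in_n0(a - b) for a, b in pairs)

# Division can leave N_0 (e.g. 1 / 2); division by zero is skipped.
closed_div = all(in_n0(Fraction(a, b)) for a, b in pairs if b != 0)

print(closed_add, closed_mul, closed_sub, closed_div)
# True True False False
```

Exact fractions are used instead of floating-point division so that a quotient such as 4/2 is still recognized as the whole number 2.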
Count-space over \(\mathbb {N}_0\)
The count-space over \(\mathbb {N}_0\) contains the vectors \(\textbf{V}\) such that every element is a count, i.e. \(\textbf{V}=(v_1,v_2,\ldots ,v_n)\) with \(v_i\in \mathbb {N}_0\) for all \(i\).
The count-space over \(\mathbb {N}_0\) is a semimodule restricted to the integer lattice over the non-negative region of a coordinate system (the upper right quadrant in two dimensions), inclusive of the origin and axes. Certain operations are possible in this restricted space, while others are not. For example, we can apply the dot product because this operation requires only multiplication and addition of vector elements. However, unlike in a vector space over a field, in the count-space over \(\mathbb {N}_0\) there are no angles between vectors. Calculating angles from the dot product requires division by vector length, and \(\mathbb {N}_0\) is not closed under division.
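As an illustration of this contrast, the sketch below computes the dot product of two made-up cell vectors and then the cosine of the angle between them. The vectors `cell_a` and `cell_b` and the helper `dot` are hypothetical and chosen only for this example; this is plain Python, not the countland API.

```python
import math


def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    return sum(x * y for x, y in zip(u, v))


# Two hypothetical cells with counts for five genes.
cell_a = [3, 0, 1, 0, 2]
cell_b = [1, 4, 0, 0, 2]

# The dot product of two count vectors is itself a count.
shared = dot(cell_a, cell_b)  # 3*1 + 0*4 + 1*0 + 0*0 + 2*2 = 7

# The cosine of the angle needs division and a square root,
# so the result is generally not an element of N_0.
cosine = shared / math.sqrt(dot(cell_a, cell_a) * dot(cell_b, cell_b))

print(shared, isinstance(shared, int))
print(cosine.is_integer())
```

The first quantity stays in the count-space, while the second only exists once the data are moved into the real numbers.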
Furthermore, in count-space, vector length is not a Euclidean measure of distance as there is no equivalent measure of distance in a space without subtraction or square roots. Instead of Euclidean distance, we can use the number of integer steps in a positive direction as a measure of length, which is equivalent to the magnitude of the Manhattan distance between the vector terminus and the origin, though calculating the Manhattan distance generally requires subtraction.
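A short sketch of the two notions of length, assuming a hypothetical five-gene cell vector: the step-count length is the sum of the counts and stays in \(\mathbb {N}_0\), whereas the Euclidean length requires a square root and generally does not.

```python
import math

# Hypothetical cell vector (counts for five genes).
cell = [3, 0, 1, 0, 2]

# Number of unit steps from the origin to the vector terminus:
# only addition of counts is needed.
count_length = sum(cell)  # 3 + 0 + 1 + 0 + 2 = 6

# Euclidean length needs a square root and leaves N_0 here.
euclidean_length = math.sqrt(sum(x * x for x in cell))  # sqrt(14)

print(count_length)
print(euclidean_length.is_integer())
```

For a count vector, this step-count length is simply the total number of transcripts counted in the cell.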
Vector rotation is not possible in count-space as it would require rotation matrices with new basis vectors that include negative elements. Vector reflections are possible, because we are free to permute our count matrix, as are shears of vectors. Some vector projections are possible, but not all. For example, if we project vector \(\textbf{b}\) onto vector \(\textbf{a}\), the result will be a multiple \(q\) of \(\textbf{a}\), \(q\textbf{a}\). The value of that multiple will be equal to \(q=(\textbf{a}^T\textbf{b})/(\textbf{a}^T\textbf{a})\), which requires division to calculate unless \(\textbf{a}^T\textbf{a} = 1\). Over \(\mathbb {N}_0\), that only happens when there is a single entry that is 1, i.e. when vector \(\textbf{a}\) is one of the original basis vectors. Therefore we can project vectors onto basis vectors, but not onto arbitrary vectors (e.g. we cannot project one cell vector onto another). Projecting onto basis vectors is the equivalent of multiplying some values by 0 while retaining others.
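The projection argument can be checked with exact arithmetic. The sketch below is an illustration under assumed inputs: the vectors and the helper `projection_multiple` are hypothetical, and fractions are used so that it is visible when \(q\) leaves \(\mathbb {N}_0\).

```python
from fractions import Fraction


def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    return sum(x * y for x, y in zip(u, v))


def projection_multiple(a, b):
    """Exact value of q = (a^T b) / (a^T a) for projecting b onto a."""
    return Fraction(dot(a, b), dot(a, a))


# Hypothetical cell vectors and one of the original basis vectors.
cell_b = [3, 0, 1, 0, 2]
cell_a = [1, 4, 0, 0, 2]
basis_1 = [1, 0, 0, 0, 0]

# Projecting onto a basis vector: a^T a = 1, so q is a count.
q_basis = projection_multiple(basis_1, cell_b)  # 3
projected = [int(q_basis) * x for x in basis_1]  # [3, 0, 0, 0, 0]

# Projecting onto another cell vector: q = 7/21 = 1/3, not in N_0.
q_cell = projection_multiple(cell_a, cell_b)

print(q_basis, projected)
print(q_cell)
```

The projection onto the basis vector keeps the count for the first gene and sets every other entry to zero, which matches the description above.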
Without rotation and vector projection, it is clear that certain complex operations like principal component analysis that rely on these are not possible in this vector space.
High-dimensional, low-magnitude count-space over \(\mathbb {N}_0\)
While the above pertains to the count-space over \(\mathbb {N}_0\) in general, there are interesting properties of the very high dimensional, low magnitude spaces that describe scRNA-seq count data.
Most gene expression datasets contain measurements for many thousands of genes, meaning this count-space has many thousands of dimensions. Furthermore, due to the sparse nature of these count matrices, it is difficult if not impossible to find a reasonable lower-dimensional approximation. In other words, because many features contain only a few, non-overlapping observations, there is no way to reduce the rank of this matrix without discarding features.
Because most measures of gene expression are low-magnitude integers, most cell vectors terminate only a few steps from the origin in any given direction. This does not mean that vectors are close to the origin overall. Vector length is non-Euclidean; it is calculated as the sum of steps back to the origin (Manhattan distance), not the distance along a diagonal (Euclidean distance).
Because the vast majority of values in the count matrix are zero, cell vectors are perpendicular to each other in many directions. This may mean that cell-cell similarity has more to do with the number and distribution of non-zero observations than with expression magnitude. This is supported by the finding that binary transformations of scRNA-seq data to zero/non-zero contain enough information to recapitulate major patterns in the data (Qiu 2020).
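A toy illustration of this point, using three made-up sparse cell vectors (all names and values here are hypothetical): cells with no genes detected in common have a dot product of zero, and after reducing counts to zero/non-zero, the dot product counts the genes detected in both cells.

```python
def dot(u, v):
    """Dot product using only multiplication and addition of elements."""
    return sum(x * y for x, y in zip(u, v))


def binarize(v):
    """Reduce counts to detected (1) / not detected (0)."""
    return [1 if x > 0 else 0 for x in v]


# Hypothetical sparse cells with counts for six genes.
cell_a = [5, 0, 0, 2, 0, 0]
cell_b = [0, 3, 0, 1, 0, 0]
cell_c = [0, 0, 4, 0, 0, 1]

# No gene is detected in both cell_a and cell_c: the vectors are perpendicular.
print(dot(cell_a, cell_c))  # 0

# cell_a and cell_b share exactly one detected gene (the fourth).
print(dot(cell_a, cell_b))                      # 2 * 1 = 2
print(dot(binarize(cell_a), binarize(cell_b)))  # 1 gene in common
```

In this example the pattern of which cell pairs have a non-zero dot product is the same before and after binarization; only the magnitudes differ.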
About this article
Cite this article
Church, S.H., Mah, J.L., Wagner, G. et al. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci. 143, 45–62 (2024). https://doi.org/10.1007/s12064-023-00408-x
DOI: https://doi.org/10.1007/s12064-023-00408-x