Abstract
Single-cell analysis is currently one of the most high-resolution techniques to study biology. The large, complex datasets that have been generated have spurred numerous developments in computational biology, in particular the use of advanced statistics and machine learning. This review attempts to explain the deeper theoretical concepts that underpin current state-of-the-art analysis methods. Single-cell analysis is covered from cell, through instruments, to current and upcoming models. The aim of this review is to spread concepts which are not yet in common use, especially from topology and generative processes, and to show how new statistical models can be developed to capture more of biology. This opens epistemological questions regarding our ontology and models, and some pointers will be given to how natural language processing (NLP) may help overcome our cognitive limitations for understanding single-cell data.
Introduction
Single-cell datasets are among the most complex data currently generated, and the field is a major driver for new bioinformatic methods. Datasets can now encompass up to one million observations, with 20,000–50,000 measurements per cell, making it hard to even visualize the data. Furthermore, the data is incredibly noisy, and relies on our ability to detect individual molecules. To overcome the noise and extract a meaningful interpretation, it is usually not sufficient to look at individual cells. Instead, the data is fitted to increasingly advanced statistical models.
Statistics is commonly concerned with data, or observations (what we measure), a model (what generates the data), and underlying variables (frequently hidden and abstract in nature). In this review, we will be concerned with the nature of the data and how different models can help us explain it. In modern statistical language, the relationship between data and variables can be recast in the following general form:
$$\text{Single-cell observation}_i \sim \text{Model}[\text{latent variables}_i]$$
Here, the model is the choice of the statistical distribution, parameterized by hidden (latent) variables. These abstract variables need to be given a meaning by the analyst, and a philosophical discussion cannot be avoided (discussed in the second part of this review). Most latent variables are specific to each observation (cell), while some may be shared by all observations or simply considered part of the model itself. The latent variable space typically has a lower dimension than the data space (Fig. 1a), representing the aggregation of knowledge. Both frequentist and Bayesian statistics are in use, with Bayesian models becoming increasingly popular. Their strength lies especially in their ability to model complex noise, which arises from both the biology and the technical measurement. Bayesian models furthermore support updating (i.e., adding data to a previously fitted model), avoiding complete recomputation as new evidence surfaces.
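The updating property mentioned above can be illustrated with a minimal sketch. It assumes a Poisson count model with a conjugate Gamma prior on the rate, a textbook case chosen here purely for illustration; the function name, prior values, and counts are hypothetical and not taken from any single-cell package.

```python
def update_gamma_poisson(shape, rate, counts):
    """Return the posterior Gamma(shape, rate) of a Poisson rate after observing counts."""
    # Conjugacy: the shape grows by the sum of counts, the rate by the number of observations
    return shape + sum(counts), rate + len(counts)

prior = (1.0, 1.0)      # hypothetical weak prior on the Poisson rate
batch1 = [3, 0, 2]      # counts from a first batch of cells
batch2 = [4, 1]         # counts arriving later

# Updating the already fitted posterior with the new batch ...
posterior_sequential = update_gamma_poisson(*update_gamma_poisson(*prior, batch1), batch2)
# ... gives the same result as refitting on all data at once
posterior_all = update_gamma_poisson(*prior, batch1 + batch2)
posterior_mean = posterior_all[0] / posterior_all[1]
```

Only the two posterior parameters need to be stored between batches, which is what makes recomputation unnecessary in this simple case. Models without conjugate priors generally need approximate inference instead.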
There is usually not an obviously “correct” choice of model for single-cell data, but model choice can be motivated by a hypothesis of the nature of the data. However, one hypothesis can correspond to multiple models, and multiple models can correspond to one hypothesis (Gelman and Hill 2006) (Fig. 1b). For example, the abundance of RNA in a tube can increase both because the number of molecules increases and because the gene switches to a longer isoform (Fig. 1c). How this affects RNA-seq depends on the precise chemical details in the library preparation, which is why those are covered extensively in this review. The conceptual difference between hypotheses and models is also especially important to be aware of in the context of hypothesis testing, where the fit of one model is compared to another (Fig. 1d). Another way of motivating models is in terms of their interpretability: a model might fit the data very well, but because of its complexity, it might be hard to interpret. Finally, model choice can be motivated in terms of how well a model fits not-yet-seen data. The machine learning (ML) field has especially emphasized this aspect and provided tools such as cross-validation and penalization to improve upon it. It is now even possible to fit models for which classically there would not be enough data given the number of parameters. These general topics of statistics and ML are beyond the scope of this review.
This review is organized around our cognitive view of modeling (Fig. 1e). Beyond the general form equation, it will also cover how singlecell data is physically generated, how data is preprocessed, and how the latent space can be interpreted from topological and biological standpoints. A minimum of mathematics has been used but readers are encouraged to visit dedicated literature to fill the gaps (Armstrong; Norris 1997; Debnath and Mikusinski 2005), especially in linear algebra (Eie et al. 2023) and statistics (McElreath 2020).
The standard analysis pipeline
In the beginning of single-cell analysis, tools were reused from bulk RNA-seq, such as the Bowtie2 aligner (Langmead and Salzberg 2012), and DESeq2 (Love et al. 2014) for differential expression. Single-cell data is, however, fundamentally different in that there are many more samples (cells) compared to bulk (tissue averages). Thus, techniques from unsupervised ML were borrowed to aid visualization and comparison. While this worked well, it required a rather skilled bioinformatician. This has led to the development of several user-friendly R and Python packages, which streamline the historically most common operations. Among the most well-known such packages are Seurat (Satija et al. 2015), Signac (Stuart et al. 2021), monocle (Cao et al. 2019), and ArchR (Granja et al. 2021) for R, and Scanpy (Wolf et al. 2018) for Python.
The standard pipeline (Fig. 2a) proceeds as follows: (1) alignment of sequencing data to a reference genome, (2) gathering data for each cell, (3) reducing the sequencing data into per-cell features, e.g., gene expression levels, transcription factor binding level, or enhancer accessibility, (4) further quality control with doublet removal and feature selection, (5) dimensional reduction and clustering, (6) comparison of cells and clusters. In every step, data is removed (Fig. 2b). Ideally, there would only be one step, but it would be too slow to be practical. In practice, there are therefore several data representations, and the process should be seen as a funnel to a p-value or plot. Each data representation needs to contain enough information to be reduced into the next representation (e.g., the normal count reduction makes it impossible to distinguish the scenario outlined in Fig. 1c in later steps). This review thus inevitably covers most steps in the analysis process. One aspect not covered is that there are specific file formats for the intermediate representations (e.g., AnnData (Wolf et al. 2018), loom, https://github.com/mojaveazure/loomR, or Arrow (Granja et al. 2021)), which limits one from easily making changes in these representations.
The archetype single-cell chemistries
Before delving into how data is represented, one must understand how the data arises and to what it physically corresponds. Two archetype library preparation methods (RNA-seq and ATAC-seq), colloquially called “chemistries,” will be covered in this section. These chemistries can now be performed on the same input cell, enabling close comparison of these modalities and sharing of the latent space (Argelaguet et al. 2018; Gayoso et al. 2021). The principles behind the multiome protocols (Lee et al. 2020) (simultaneously measuring more than one biological aspect of the same cell) are virtually the same as for RNA-seq and ATAC-seq separately.
Single-cell RNA-seq
scRNA-sequencing is a method to quantify which genes are transcribed. The focus is normally on mature mRNA, which carries a 5′ nucleotide cap and a 3′ poly-adenine (3′ polyA) tail. Furthermore, introns will at some point be spliced out. Because the mRNA levels enable such broad interpretation of the function and behavior of a cell at a given moment, RNA-seq has become the workhorse for much biological exploration. scRNA-sequencing is also the basis of more complex protocols, such as single-cell Ribo-seq (VanInsberghe et al. 2021), which can tell where ribosomes are located. RNA-seq was the first omics protocol miniaturized for single-cell applications, with multiwell plate Smart-seq2 (Picelli et al. 2013) becoming the most popular protocol. Other competing protocols existed; for example, CEL-seq2 (Hashimshony et al. 2016) only captured the 3′ part, while STRT-seq (Islam et al. 2012) captured the 5′ part (in the idealized scenario). The first step in any protocol is reverse transcription (RT). Because of the abundant ribosomal RNA, virtually all protocols so far have used oligo-dT RT primers which bind to the 3′ polyA of the mRNA. However, oligo-dT primers can also bind anywhere inside the RNA but with lower efficiency, typically at stretches of three or more adenines (Kozak 1991). This internal priming is essential for estimating “RNA velocity” (La Manno et al. 2018; Bergen et al. 2021), which relies on quantifying unspliced mRNA that would not be captured if priming occurred only at the 3′ polyA tail.
Almost all single-cell protocols require early addition of flanking PCR handles, which enable a first PCR (pre-amplification) using cDNA-sequence-independent PCR primers. The first PCR handle is included in the RT primer. A key innovation is the use of template switching, where RT can continue from the 5′ end of the RNA to another oligo. This preferentially occurs if the RNA has a 5′ cap, and if the RT enzyme adds additional nucleotides past the RNA, which can be made complementary to an oligo (the “template switching oligo,” TSO). The TSO can be designed to carry a second PCR handle. With known common 5′ and 3′ flanking sequences, the first PCR is then easily designed (Fig. 3a). To aid in read deduplication (discussed later), this is also the step in which a unique molecular identifier (UMI) is introduced, as a stretch of random nucleotides in either the RT primer or TSO.
Next, all libraries aimed for Illumina short-read sequencing require fragmentation of the cDNA down to sizes that can bind to the flow cell (below 1.5 kb, ideally 700 bp). Smart-seq uses Tn5, an enzyme that simultaneously fragments the cDNA and adds new PCR handles (Fig. 3b). This protocol is simple, but the ends of input DNA are lost unless 3′/5′ adapters are added somehow (such as by extended TSO and RT primers in Smart-seq3 (Hagemann-Jensen et al. 2020)). Furthermore, because Tn5 adds s5/s7 tags randomly, the 50% of fragments that are s5s5 or s7s7 will be suppressed in the PCR after tagmentation. As an alternative to Tn5, some protocols instead use enzymatic shearing, dA-tailing, and sticky adapter ligation (Fig. 3c). CEL-seq2 (Hashimshony et al. 2016) and STRT-seq (Islam et al. 2012) also fragment the cDNA, but target either the 5′ or 3′ PCR handle from the cDNA preparation to enrich for the corresponding fragments. This is possible only if different adapter sequences are used for 5′ and 3′ respectively (s5 + s7). After a second PCR, the fragments will contain cDNA, library indices, and sequences that will bind to the Illumina flow cell (Fig. 3d).
Single-cell protocols can also be performed with microfluidics, where a cell is encapsulated in a droplet containing enzymes, buffer, and a bead with oligos (Klein et al. 2015; Macosko et al. 2015). These protocols have severe constraints: buffers cannot be changed, adding liquid to droplets is challenging, and each cell somehow has to obtain a unique library index. For multiwell plates, different index oligos are added to each well, and the well-index relation is known. Droplets instead depend on beads having oligos that carry a random library index (in this context called the “cell barcode”), with sufficiently many indices such that no cells obtain the same index (Fig. 3e and f). Because it is hard to add liquid to droplets, PCR is instead performed after droplets have been de-emulsified and pooled. This restricts the addition of the cell barcode to the RT step. Because the barcode can only be added to the 3′ or 5′ of the cDNA (RT primer or TSO), only these parts of the cDNA are normally sequenced after preparation with the 10x Genomics Chromium (Fig. 3g).
Finally, split-and-pool-type protocols should be briefly mentioned (Vitak et al. 2017; Rosenberg et al. 2018). These use the cell itself as the “droplet,” by careful permeabilization. The cell barcode is made up of a combination of oligos, with each extra oligo added after pooling the cells and splitting them into a new 96/384-well plate. These protocols are less constrained than microfluidic droplet-based protocols, but overall share the limitation of only capturing 3′ or 5′ RNA. Split-and-pool protocols can also be implemented using the 10x Genomics Chromium (Datlinger et al. 2021).
While this section only covers a fraction of the chemistries ever made, knowing this much about RNA-seq is sufficient to be able to statistically model most existing data. Recent advances in chemistry will however introduce new statistical challenges, e.g., there are alternatives to oligo-dT RT. For example, microSPLiT uses in vitro polyadenylation to also capture bacterial mRNA efficiently (Kuchina et al. 2021). Similarly, this can be done on eukaryotic fragmented mRNA, enabling full-RNA capture also in the 10x Genomics Chromium droplet system (as in VASA-seq (Salmen et al. 2022)). Because Tn5 also digests RNA:DNA hybrids, the PCR before tagmentation can also be avoided (Di et al. 2022; Xu et al. 2022). As a benefit, the fragmentation sites for each input RNA molecule are virtually unique, simplifying deduplication and the relation between reads and input RNA molecules.
Single-cell ATAC-seq
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a method that assesses which regions of the chromatin are accessible (“open”) using the transposase Tn5 (Yan et al. 2020). This technology relies on the assumption that regions of the DNA that are not actively transcribed (or replicated) tend to be highly compacted around histones (“closed”), while active regions are not. When Tn5 tagmentation is done on genomic DNA (gDNA), before purification, the fragmentation patterns will thus be dictated by gDNA-binding proteins, marking unshielded gDNA as open (Fig. 3h). Since tagmentation is well-suited for small-input single-cell applications (Buenrostro et al. 2015), it has been adapted to also enable scChIP-seq (e.g., single-cell CUT&Tag (Bartosovic et al. 2021)). By adding an H3K9me3-targeting chromodomain to the Tn5 enzyme, both open and closed regions can be assessed in parallel, enabling the measurement of “chromatin velocity” (Tedesco et al. 2022). scATAC-seq is thus the basis for a family of related protocols.
Because mitochondrial DNA (mtDNA) is abundant and highly accessible, a key first step in ATAC-seq is the extraction of the nuclei. This step can however be relaxed for niche protocols that use somatic mutations in the mtDNA for lineage tracing purposes (Ludwig et al. 2019). During nuclei extraction, the nuclei are also permeabilized, giving access to the gDNA. Because Tn5 binding is separate from the subsequent gDNA fragmentation, which is induced by heating or denaturing agents, Tn5 can be added to nuclei in bulk prior to single-cell separation of the nuclei (Chen et al. 2018). The individual separation of nuclei can then be done both in multiwell plates and in microfluidic droplets. A PCR is finally performed to attach library indices, or cell barcodes in the case of droplets. This is thus unlike 10x Genomics Chromium droplet RNA-seq, where cell barcodes are instead added by RT.
Comparison of readout methods
One major difference between RNA-seq and ATAC-seq is the number of possible molecules overlapping the same genomic region (Fig. 3h). In ATAC-seq, assuming a diploid genome, there can be at most 2 fragments. For RNA-seq, there is however no limit on the amount of RNA for one gene. Furthermore, more than one RT primer can bind to one RNA molecule, even if oligo-dT primers are used. Thus, the UMI represents the number of RT events, not the number of RNA molecules.
Another difference is that there are more steps in the RNA-seq chemistry. Assuming that one RNA molecule results in one cDNA molecule, the subsequent fragmentation step still produces more than one sequenceable fragment per RNA molecule. This is not just the case for Smart-seq2, but also for 5′ and 3′ capture methods, which produce overlapping fragments of different lengths (Fig. 3i). This must be kept in mind during UMI-based deduplication. In principle, the overlapping fragments can be combined into one single read based on the UMI, but the authors do not know of any such software.
Based on RNA-seq and ATAC-seq, additional steps can lead to new types of readouts. Metabolic RNA labeling protocols to measure rate of transcription (Qiu et al. 2020) are effectively RNA-seq protocols. Protein levels can be measured by using oligo-tagged antibodies, thus turning protein detection into a sequencing problem. CITE-seq and REAP-seq are two ways in which antibody-attached oligos enable detection by RNA-seq protocols (Stoeckius et al. 2017; Peterson et al. 2017; Mimitou et al. 2019) (Fig. 3j). As an alternative, tagged lipids (such as 10x Genomics CellPlex) can label a variety of cells. All of these follow the chemistry expected for RNA-seq, except that no fragmentation is needed, since the fragments are already sufficiently small to fit Illumina sequencers. Instead, a separate enrichment PCR is sufficient to selectively extract them from the pre-amplified cDNA library. Labeling of cells can also be done in an ATAC-seq-compatible manner through a different oligo design (Mimitou et al. 2021). In addition, the ATAC-seq protocol can be modified for detection of CRISPR sgRNAs (Pierce et al. 2021), and for quantification of other genomic regions (unpublished).
To the author’s knowledge, there are only two fundamentally different archetype readouts not covered by this review. Single-cell Hi-C generates a map of location-to-location abundances (Nagano et al. 2013; Stevens et al. 2017; Zhang et al. 2022). The other readout is single-cell whole-genome sequencing (Gawad et al. 2016). These are both rather niche protocols, requiring complex statistics tailored for the purpose, and are thus not covered in this review.
From single-cell chemistry to statistics
Sequencing data preprocessing and initial data reductions
Independently of the chemistry used, the first steps of alignment and barcode-to-cell association are largely the same. If the data stems from a multiwell experiment, then classical bulk RNA-seq or ATAC-seq tools have commonly been used. For droplet data or more complex chemistries, dedicated tools exist that scale better for large datasets. Cell Ranger operates on 10x Genomics Chromium data, but can be replaced by the faster and more flexible STARsolo (Blibaum et al. 2019) pipeline (which is also suitable for multiwell plates). The output alignment will, for each sequencing read, contain information about position (chromosome name, from, to), and sequence differences vs the reference genome (Fig. 4a). This is more information than is commonly used, and to speed up later algorithms, most information is filtered/reduced in a manner that depends on the needs of subsequent steps.
For RNA-seq, only information about which gene the read overlaps is typically retained. Any UMI is used to further deduplicate reads. The result is counts of fragments per gene and per cell, called the count table. Some computational methods retain more information about the read. RNA velocity, for example, stores whether the read is intronic or exonic, and thus reduces the reads to two counts per gene and cell. Interestingly, the 10x Genomics software Cell Ranger only counts exonic reads by default, but can optionally count everything (recommended for single-nuclei protocols). More information exists, including about isoform usage, but novel computational approaches need to be developed to make better use of this.
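The reduction from aligned reads to a count table can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of Cell Ranger, STARsolo, or any RNA velocity tool: reads are assumed to be already assigned to a gene and flagged as intronic or exonic, and the tuple layout, function name, and barcodes are hypothetical.

```python
from collections import defaultdict

def count_table(reads):
    """Reduce reads to {(cell, gene): [exonic_count, intronic_count]}.

    reads: iterable of (cell_barcode, gene, umi, is_intronic) tuples.
    Reads sharing cell, gene and UMI are treated as copies of one molecule.
    """
    seen = set()
    counts = defaultdict(lambda: [0, 0])
    for cell, gene, umi, is_intronic in reads:
        key = (cell, gene, umi)
        if key in seen:
            continue  # duplicate of an already counted molecule
        seen.add(key)
        # Simplification: the first read seen for a molecule decides its class
        counts[(cell, gene)][1 if is_intronic else 0] += 1
    return dict(counts)

reads = [
    ("AAAC", "Gata3", "ACGT", False),
    ("AAAC", "Gata3", "ACGT", False),  # PCR duplicate of the read above
    ("AAAC", "Gata3", "TTGA", True),   # intronic read, different molecule
    ("CCGT", "Gata3", "ACGT", False),  # same UMI but another cell
]
table = count_table(reads)
```

Summing the two entries per cell and gene gives the ordinary count table, while keeping them separate corresponds to the two counts per gene and cell used for RNA velocity.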
ATAC-seq analysis is more challenging because, unlike for genes, there is no accepted “list of enhancers.” Instead, this list is defined for each dataset by collecting fragments across all cells and performing peak calling in a manner identical to bulk ATAC-seq or ChIP-seq. MACS2 (Zhang et al. 2008) is commonly used, while Cell Ranger has its own algorithm. With peaks defined, it is then possible to collect fragment counts per peak. It is an open question as to what other information can be extracted; after later analysis steps, it is possible to reanalyze the raw reads to detect transcription factor (TF) binding sites (“TF footprinting” (Bentsen et al. 2020)). This shows that in some cases, even if information is not retained in a reduction, it may be possible to backtrack to the raw data to extract further information.
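Once peaks are defined, collecting fragment counts per peak is an interval-overlap problem. The sketch below is deliberately naive (every fragment is compared against every peak) and assumes half-open coordinates; the function name and data layout are hypothetical, and production tools typically rely on sorted or indexed intervals instead.

```python
from collections import Counter

def count_fragments_per_peak(fragments, peaks):
    """Count overlaps between fragments and peaks.

    fragments: iterable of (cell, chrom, start, end)
    peaks: list of (chrom, start, end)
    Intervals are half-open: [start, end).
    """
    counts = Counter()
    for cell, chrom, start, end in fragments:
        for peak in peaks:
            p_chrom, p_start, p_end = peak
            # Two half-open intervals overlap if each starts before the other ends
            if chrom == p_chrom and start < p_end and end > p_start:
                counts[(cell, peak)] += 1
    return counts

peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
fragments = [
    ("cell1", "chr1", 150, 250),
    ("cell1", "chr1", 190, 520),  # spans both peaks
    ("cell2", "chr1", 300, 400),  # falls between the peaks
    ("cell2", "chr2", 150, 180),  # different chromosome
]
peak_counts = count_fragments_per_peak(fragments, peaks)
```

Whether a fragment spanning two peaks should count toward both, as it does here, is a design choice that differs between tools.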
It is sometimes possible to skip the slow alignment step and immediately count the feature overlaps. Alevin (Srivastava et al. 2019) does so for RNA-seq by instead solving a k-mer deconvolution problem over all reads. This requires a reference of expected sequences of high confidence, which cannot be fulfilled for variable regions such as the T cell receptor (TCR). This approach is however particularly promising, as it avoids the problems of non-uniquely aligned reads, and the counting speed is sufficiently high that downstream statistics can be done by bootstrapping. One can thus expect a future single-cell pipeline that is entirely based on a k-mer sequencing data representation instead of the current count matrices.
Filtering cells and features
The resulting count table is, in more general terms, said to consist of quantified “features” for each cell. To speed up computation and avoid false positives, the number of features can be reduced (Fig. 2a panel 4). For example, if across all cells, a gene is always expressed or an enhancer is always open, then it does not contribute much information. In the early days, ERCC (External RNA Controls Consortium) spike-in RNA was used to estimate technical variance. These days, however, features are selected purely by comparing their dispersion to that of other similar features. For gene expression, this means other transcripts of similar abundance (Brennecke et al. 2013). Genes having less variance than the expected technical variance can also be ignored. This approach does however not apply to ATAC-seq data, which is of rather binary nature (open or closed). Instead, a weighting scheme such as TF-IDF is later applied to emphasize the most informative features (described later).
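A minimal sketch of dispersion-based feature selection is given below. It uses a simplification that is an assumption of this example: each gene is compared against the Poisson expectation (variance equal to the mean, i.e., a Fano factor of 1) rather than against other genes of similar abundance as in Brennecke et al. (2013). The gene names and counts are made up.

```python
from statistics import mean, pvariance

def fano_factors(counts_by_gene):
    """Return variance/mean per gene, skipping genes that are never detected."""
    factors = {}
    for gene, counts in counts_by_gene.items():
        m = mean(counts)
        if m > 0:
            factors[gene] = pvariance(counts) / m
    return factors

counts_by_gene = {
    "housekeeping": [5, 6, 5, 4, 5, 5],   # always expressed, little variation
    "marker": [0, 0, 12, 0, 15, 0],       # expressed in a subset of cells
    "silent": [0, 0, 0, 0, 0, 0],         # never detected
}
factors = fano_factors(counts_by_gene)
# Keep genes that vary more than expected from Poisson sampling alone
selected = [gene for gene, f in factors.items() if f > 1.0]
```

With only six cells the estimates are of course unreliable; the point is only that the uniformly expressed gene is discarded while the variable one is kept.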
Another useful reduction is to remove observations of low quality. Cells for which there are few reads (low coverage) do not contribute much information and are usually removed based on a lower cutoff (no gold standard exists). There may also be free floating RNA or DNA, which may enter droplets of other cells (background). The background can be modeled and its effect removed to a certain extent (Young and Behjati 2020), but cells with few reads are particularly vulnerable to this bias. RNA also sticks to cells, and this may cause bias if it preferentially sticks to neighboring cells, since cells are not randomly distributed in the tissue. Finally, more than one cell can enter a library, especially if droplet microfluidics is used. Such droplets can be detected by ML after simulating the mixing of all cells. Several such packages exist (Wolock et al. 2019).
Unique molecular identifiers and barcode correction
Because of the limited amount of input DNA/RNA, after PCR, it is highly likely that some fragments will be sequenced more than once. To avoid double counting, some fragments are equipped with unique molecular identifiers (UMIs). These are simply stretches of N nucleotides (random mix of ATCG), with a suitable length depending on the expected number of duplicate fragments. If two fragments are equal, and also share a UMI, then they are assumed to have the same origin (Fig. 4b). Removal of extra copies is called deduplication. UMIs cannot be attached during PCR, but only at steps that happen once per molecule: ligation, reverse transcription and template switching. In Smart-seq2-style protocols, it is impossible to add UMIs representing RT events for all final fragments, as fragmentation by Tn5 separates most inner cDNA fragments from 5′ or 3′ UMIs. Smart-seq3 has, however, also added UMIs to the 3′- and 5′-most fragments (Hagemann-Jensen et al. 2020).
The handling of UMIs is generally a matter of preprocessing, as is done by Cell Ranger. Because sequencing errors can occur, UMIs may need to be bioinformatically corrected. If a small number of reads contain UMIs similar to highly abundant UMIs, then these reads can be assumed to be due to sequencing errors (Smith et al. 2017). If many of the possible UMIs are used, then the assumption of UMIs being unique breaks down, and it may be necessary to treat them with more advanced statistics. This follows from the birthday paradox (Wikipedia contributors 2022): the chance of at least two UMIs being shared, if picked randomly from a large pool, increases surprisingly fast with the number of molecules. The details are not covered here, but it is important to be aware of the problem so as to design UMIs and experiments accordingly.
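The birthday paradox can be made concrete with a short calculation. The sketch assumes that UMIs are drawn uniformly from all 4^L sequences of length L, which real oligo synthesis only approximates, and that n counts the molecules that would otherwise be indistinguishable (for example, fragments from the same gene in the same cell). The function name and the numbers are illustrative.

```python
import math

def umi_collision_probability(n_molecules, umi_length):
    """Probability that at least two of n molecules share a UMI."""
    n_umis = 4 ** umi_length
    if n_molecules > n_umis:
        return 1.0  # pigeonhole: more molecules than distinct UMIs
    # P(all unique) = product over i of (1 - i / n_umis); summed in log space for stability
    log_p_unique = sum(math.log1p(-i / n_umis) for i in range(n_molecules))
    return 1.0 - math.exp(log_p_unique)

# A 10-nt UMI offers 4**10 = 1,048,576 sequences, yet 1000 molecules
# already give a collision probability of roughly 38%
p_collision = umi_collision_probability(1000, 10)
```

Although only about 0.1% of the available UMIs are used in this example, a shared UMI is far from unlikely, which is why UMI length should be chosen with the expected number of molecules in mind.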
Cell barcodes in droplet data can be corrected similarly to UMIs, but with higher confidence if the random sequence comes from a predefined list (called a whitelist). In this case, there are several algorithms enabling the design of oligo sequences that can be corrected efficiently (Buschmann and Bystrykh 2013). As an example, the 10x Genomics Chromium RNA-seq chemistries use whitelists of up to 1.4 M cell barcodes.
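Whitelist-based correction can be sketched as a nearest-neighbor lookup. The version below accepts an observed barcode only if it matches the whitelist exactly or has a single whitelist entry within Hamming distance 1. This is a simplified illustration, not the algorithm of any particular pipeline (which may additionally weigh base qualities or barcode abundance), and the 4-nt barcodes are hypothetical.

```python
def correct_barcode(observed, whitelist):
    """Return the corrected barcode, or None if no unambiguous match exists."""
    if observed in whitelist:
        return observed
    candidates = set()
    # Enumerate all sequences that differ from the observed one at exactly one position
    for i in range(len(observed)):
        for base in "ACGT":
            if base != observed[i]:
                neighbor = observed[:i] + base + observed[i + 1:]
                if neighbor in whitelist:
                    candidates.add(neighbor)
    # Accept only an unambiguous correction
    return candidates.pop() if len(candidates) == 1 else None

whitelist = {"AAAA", "CCCC", "AATT"}
exact = correct_barcode("AAAA", whitelist)      # already valid
fixed = correct_barcode("CCCA", whitelist)      # one mismatch from CCCC
ambiguous = correct_barcode("AAAT", whitelist)  # one mismatch from both AAAA and AATT
unknown = correct_barcode("GGGG", whitelist)    # too far from any entry
```

The ambiguous case shows why whitelist design matters: if all valid barcodes differ from each other at several positions, single errors can always be corrected uniquely.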
Toward a statistical model
In the early days of model fitting, little or no regard was given to statistical distributions. For simplicity, the Euclidean distance (data vs fitted value) was commonly minimized, which in hindsight was a good choice, as it has strong links to the normal distribution. The normal distribution is frequently a correct choice because it arises naturally for any variable that is the average of several other stochastic processes, a result known as the central limit theorem. In biology, such averaging is common (e.g., a phenotype is usually the total result of many interacting genes). That said, the average might rather be on a log scale. This happens if the effects are multiplicative (e.g., one mutation increases length by 10%, and another mutation 10% on top, resulting in 1.1*1.1 = 1.21, which is more than 1 + 0.1 + 0.1 = 1.20 in the additive case). This is also implicit in any use of “fold change” in gene expression analysis, as it ignores the absolute gene expression level. Normal and log-normal distributions are thus reasonably good and common choices to model biological processes. However, these are continuous distributions while sequencing data is discrete and has a rather different shape near zero (Fig. 4c). Luckily, modern computing has enabled the use of more appropriate distributions.
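The link between the Euclidean distance and the normal distribution can be made explicit. As a sketch, assume n independent observations y_i with fitted values ŷ_i and a common, fixed variance σ² (symbols introduced here for illustration only). The negative log-likelihood L is then:

```latex
y_i \sim \mathrm{Normal}[\hat{y}_i, \sigma^2]
\quad\Longrightarrow\quad
-\log L = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \frac{n}{2} \log(2\pi\sigma^2)
```

Since the last term does not depend on the fitted values, minimizing the squared Euclidean distance is the same as maximizing the likelihood under normally distributed noise. Applying the same argument to log-transformed data corresponds to the log-normal case.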
The most important distribution for discrete sequencing data is the Poisson [λ] distribution. The Poisson distribution can be physically motivated as modeling the number of decays (happening with rate λ) within a certain time from a radioactive source (Fig. 4d). This is because of an intrinsic physical property, namely that any decay event is uncorrelated with when the last decay event happened. In other words, it is a memoryless process. Such processes are widely modeled, even if not completely memoryless, because they are easy to handle mathematically. Sequencing can be thought of as a process of picking random DNA molecules from a semi-infinite tube. Even if the DNA has been PCR amplified, the probability of picking a copy of a previous molecule is negligible. This makes it approximately a memoryless process, where the Poisson rate parameter is dictated by the total number of reads and the fraction of molecules expected to come from the gene/enhancer of interest.
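As a small worked example of how the rate parameter arises, the sketch below multiplies a total read count by the fraction of molecules from one gene and evaluates the resulting Poisson probabilities. Both numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k counts given rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

total_reads = 10_000      # hypothetical number of reads for one cell
gene_fraction = 0.0002    # hypothetical fraction of molecules from the gene
lam = total_reads * gene_fraction   # expected count for this gene: 2

p_zero = poisson_pmf(0, lam)                                  # chance of seeing no reads at all
p_at_least_five = 1.0 - sum(poisson_pmf(k, lam) for k in range(5))
```

Even with an expected count of 2, the gene goes undetected in roughly one cell out of seven (exp(-2) ≈ 0.135), purely as a result of sampling.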
The Poisson distribution is relevant to any sequencing context, but does not take biological properties into account. As such, it usually underestimates the variance. The Negative Binomial distribution, NB [rate, dispersion], is a natural extension that in addition to rate also has a dispersion (variance) parameter. It is a well-studied distribution that is the de facto standard for bulk RNA-seq analysis (Love et al. 2014). NB is also equivalent to a Poisson distribution whose rate parameter is in turn Gamma distributed. Thus, NB-distributed counts can be expected from a sequencer if the continuous Gamma distribution represents the biological variation, and Poisson the sampling by sequencing. This makes the NB distribution a first choice when analyzing data from new sequencing protocols.
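The Gamma-Poisson view of the NB distribution can be checked by simulation. In this sketch, the Gamma distribution is parameterized by its mean and a shape parameter (a choice made for this example; the inverse of the shape plays the role of the dispersion), so that the expected variance is mean + mean²/shape. The Poisson sampler uses Knuth's simple multiplication method, and all parameter values are arbitrary.

```python
import math
import random
from statistics import mean, pvariance

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_gamma_poisson(mu, shape, n, seed=0):
    """Draw n counts whose Poisson rate is itself Gamma distributed with mean mu."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) has mean alpha * beta, so the scale is mu / shape
    return [sample_poisson(rng.gammavariate(shape, mu / shape), rng) for _ in range(n)]

counts = sample_gamma_poisson(mu=5.0, shape=2.0, n=20_000)
sample_mean = mean(counts)
sample_variance = pvariance(counts)
expected_variance = 5.0 + 5.0 ** 2 / 2.0   # mean + mean^2 / shape = 17.5
```

The simulated variance should be well above the mean, whereas a pure Poisson sample with the same mean would have a variance close to 5. This extra variance is what the dispersion parameter captures.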
Statistics for RNA-seq
Single-cell data is highly noisy and calls for more complex models than bulk equivalents. The memoryless assumption behind Poisson is less appropriate because the pool of DNA is no longer semi-infinite. Rather, in our experience, 30% or more of the final DNA molecules from a 10x Genomics Chromium scRNA-seq library can be duplicates (calling for UMI-based deduplication). To motivate the best statistical models, it is thus necessary to understand details of the central dogma, how the final DNA molecules arise in the library preparation, and what physical properties thus can be expected.
It was noticed early on that RNA-seq counts for a gene follow a zero-inflated distribution (large number of 0 values, Fig. 4e), starting a heated debate on the nature of single-cell data and whether there are biological reasons why some genes “drop out.” One study suggests that the zero inflation problem is rather overrated, and that no zero inflation is observed for ERCC spike-in RNA (Svensson 2020). However, it is easy to see that zero inflation is a concern for plate-based full-length RNA-seq (Fig. 4f). When a single RNA molecule is present, it can give rise to multiple counts; however, when the RNA molecule is not present, the count will be exactly zero. Our group has noticed that a single RNA molecule can give rise to multiple cDNA molecules, even in regular 10x Genomics Chromium droplet chemistry (unpublished). Statistically, the problem can also be seen as zero inflation (dropout rate q) of count C (Poisson distributed):
$$\text{Observed count} = Z \cdot C, \qquad Z \sim \text{Bernoulli}[1-q], \qquad C \sim \text{Poisson}[\lambda]$$
Here, the variable Z denotes the presence of the RNA molecule, and C the cDNA count distribution during presence. However, this only applies to the case of a single RNA molecule, and as such is likely to fit best to lowly expressed genes. Alternatively (and more correctly), the count can be seen as a sum of highly correlated variables. However, this treatment is difficult mathematically. A serious practical concern is identifiability (can the parameters be fitted given the data?): the zero-inflated NB (ZINB) model has three parameters (mean, dispersion, dropout rate), while NB has two (mean, dispersion) and Poisson only one (mean). If the dropout rate cannot be reliably fitted, then this might affect later steps, such as differential expression (Kharchenko et al. 2014; Finak et al. 2015).
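The zero-inflated view described above (presence Z with dropout rate q, Poisson count C during presence) can be explored numerically. The sketch below assumes that the observed count is the product of Z and C, with Z equal to 1 with probability 1 - q. The parameter values are arbitrary and only serve to illustrate the identifiability concern.

```python
import math

def zip_pmf(k, lam, q):
    """P(count = k) when count = Z * C, Z ~ Bernoulli(1 - q), C ~ Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # Zeros arise either from dropout (probability q) or from the Poisson itself
    return (q if k == 0 else 0.0) + (1.0 - q) * poisson

# A lowly expressed gene without any dropout ...
p_zero_no_dropout = zip_pmf(0, lam=0.1, q=0.0)

# ... has the same fraction of zeros as a ten-fold higher rate with heavy dropout
lam_high = 1.0
q_matched = (p_zero_no_dropout - math.exp(-lam_high)) / (1.0 - math.exp(-lam_high))
p_zero_dropout = zip_pmf(0, lam=lam_high, q=q_matched)
```

The two parameter sets can only be told apart through the nonzero counts, which are rare for lowly expressed genes. This is one way to see why the dropout rate can be hard to fit reliably.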
Zero inflation can also have biological origins. It has been proposed that cells produce RNA in bursts (Fig. 4g) and the kinetics of polymerase binding/releasing has been fitted (Kim and Marioni 2013). We have noticed that the kinetics depends on the promoter type and that burstiness is higher for immune genes (Hagai et al. 2018). Others have seen that broadly, enhancers control burst frequency, while core promoters control burst size (Larsson et al. 2019). More work is needed to make good statistical use of what we know so far, and one must be aware that p-values produced by common differential expression software are greatly inflated, and that some genes fit the statistical distributions less well than others. The need for biological replicates to obtain correct p-values has been raised (Squair et al. 2021). As sequencing and library preparation prices have dropped, and multiplexing has become easier, biological replicates must again become the norm. Statistically, p-values can be calculated over pseudobulk samples, where the counts for a gene are summed over several similar (clustered) cells, and the resulting pseudobulks are compared using bulk RNA-seq tools (Robinson et al. 2010; Love et al. 2014). This is a straightforward, albeit arbitrary method compared to a hierarchical statistical single-cell model, e.g.:
A model that explicitly models each step of the biology and library preparation process can better integrate knowledge about, e.g., promoter/enhancer architecture, and make differential expression more about a specific biological aspect (promoter binding, enhancer use). As more data becomes available, and computing power increases, more elaborate statistical models can be expected.
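The pseudobulk aggregation mentioned above can be sketched in a few lines. This is a minimal illustration rather than the procedure of any particular package: the function name `pseudobulk` and the toy cluster and sample labels are hypothetical, and the summed vectors would still need to be passed to a bulk RNA-seq tool for the actual test.

```python
import numpy as np

def pseudobulk(counts, cluster, sample):
    """Sum raw counts over all cells sharing the same (cluster, sample) pair.

    counts  : (n_cells, n_genes) array of raw counts
    cluster : one cluster label per cell
    sample  : one biological replicate label per cell
    Returns a dict mapping (cluster, sample) -> summed count vector.
    """
    counts = np.asarray(counts)
    out = {}
    for i, key in enumerate(zip(cluster, sample)):
        out[key] = out.get(key, 0) + counts[i]
    return out

# Toy data: four cells, two genes, two clusters, two replicates.
counts = np.array([[1, 0], [2, 3], [0, 5], [4, 1]])
pb = pseudobulk(counts, ["T", "T", "B", "T"], ["s1", "s1", "s1", "s2"])
```

Keeping the replicate label in the key is the point of the exercise: the unit of comparison becomes the replicate, not the individual cell.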
Statistics for ATAC-seq
ATAC-seq is fundamentally different from RNA-seq, and while there is a standard workflow (Baek and Lee 2020), the statistics have been much less discussed. Because there is no fragmentation, there is less concern about zero inflation. The counts have a rather firm upper limit, but regular Poisson statistics still fit the data well; however, some authors have preferred to binarize the data, resulting in a binary statistical distribution (as done, e.g., by Signac (Stuart et al. 2021)).
The open questions about ATAC-seq are rather about the meaning of the data. Tn5 has been shown to have a sequence bias and yields different data than the older DNase hypersensitivity assay (Karabacak Calviello et al. 2019). It is also possible to pinpoint transcription factor binding sites (or DNA binding proteins in general) as “holes” in the ATAC-seq peaks (Fig. 4h). In comparison to ChIP-seq data, we have also seen sites with strong ChIP-seq peaks but no corresponding ATAC-seq peak (Henriksson et al. 2019), showing that Tn5 cannot always access TF sites (Fig. 4i). One should thus ask what a “site” is. Enhancers are still defined using bulk ATAC-seq peak detection methods (Zhang et al. 2008; Granja et al. 2021; Stuart et al. 2021), applied to the single-cell data but ignoring which cell each fragment originates from. This simplifies analysis but likely misses out on discoveries that the single-cell ATAC-seq data has yet to offer. Thus, much more work remains in the area of scATAC-seq analysis.
Size factors and sequencing depth
Cells may differ in terms of the depth to which they are sequenced (i.e., the number of molecules counted). The reasons are not fully understood, but contributing factors include inefficient cell lysis, unevenness in droplet size and content, stochastic enzymatic effects, and the stochasticity of sequencing. What is clear is that if cells are compared using any Euclidean-type measure, then they will be organized according to the total number of molecules rather than which molecules are present. All packages thus normalize cells using a correcting factor, termed a “size factor”; in the simplest case, it is a division by the total number of molecules. More elaborate corrections are used for bulk RNA-seq (Love et al. 2014), but they take more time to compute and appear to be unnecessary for single-cell data.
Not much attention is given to the size factor, but some statistical notes are in order. First, RNA abundance differs between cell types, e.g., activated T cells can contain over ten times more RNA than naive T cells. Such differences are currently normalized away. Second, the less RNA/DNA that stems from the cell, the more reads might stem from background free RNA/DNA (Young and Behjati 2020). Dividing by the total number of molecules per cell is thus not a correct normalization for cells of low abundance. A quick solution is to remove such cells; however, because there is no clear cutoff for what constitutes a “low abundance cell,” this is not a perfect solution. Overall, the current size factor normalization appears to work well in practice but analysts must be aware of potential exceptions when this ceases to be the case.
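The simplest size factor correction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `normalize_by_size_factor` is hypothetical, and rescaling to the median total count is one common convention, not something prescribed by the text.

```python
import numpy as np

def normalize_by_size_factor(counts, target=None):
    """Divide each cell (row) by its size factor.

    The size factor is the cell's total count divided by a common target
    depth (assumed here to be the median total count across cells).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    if np.any(totals == 0):
        raise ValueError("cells with zero total counts must be removed first")
    if target is None:
        target = np.median(totals)
    size_factors = totals / target
    return counts / size_factors[:, None], size_factors

# Two cells with the same composition but a tenfold difference in depth.
counts = np.array([[10, 0, 10], [1, 0, 1]])
normalized, size_factors = normalize_by_size_factor(counts)
```

After normalization the two cells become identical, which also illustrates the first caveat above: a genuine tenfold difference in RNA content would be normalized away in exactly the same manner.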
From statistics to the underlying dimensions
Even for something as simple as the Normal distribution N(μ, σ), we assume that the parameters μ and σ have some sort of interpretable meaning (here, the mean and the standard deviation). While single-cell data can be described by a distribution, which has over 20,000 dimensions (genes or enhancers), understanding the data means that we can find a significantly smaller set of “hyperparameters,” making up a latent space to which we can assign meaning. This can be expressed as:
Finding a suitable latent space, and a transformation from this small space to the larger data space, is called dimensional reduction (DR). Because there is no best way of doing this, and there are tradeoffs in latent space size and shape vs interpretability, a plethora of approaches has been developed.
Linear dimensional reductions
The simplest form of dimension reduction is linear DR. All linear transformations from latent space X to data Y can be described by matrices:
\[ Y = WX \]
where the matrix W decides how the space is reshaped by combining rotations, translations, and skews of the data (Fig. 5a). For DR, the aim is to choose W such that as much information as possible is moved to the first few dimensions. It is, however, not clear what, exactly, constitutes information, and the quality of the reduction relies on using the right definition.
PCA (principal component analysis) is historically the most common DR algorithm (Pearson 1901) and it informed the development of all other DR methods. In PCA, the first dimensions of the latent space are called principal components, and this name is also commonly used for other methods. However, PCA itself refers to the case when W is chosen so as to (1) maximize the variance of the data along the first dimensions, and (2) make the dimensions uncorrelated, by being orthogonal (Fig. 5b). Only the first dimensions are kept, as these are assumed to contain the largest variation and thus information. The choice of W is almost unique, and its computation is extremely fast, as W ends up being the eigenvectors of the covariance matrix. As eigenvalue problems are well studied, PCA is also rather intuitive compared to all other DR algorithms. Together, this has kept PCA very popular, even today, even though the resulting DR seldom captures the relevant biology well. Ignoring the interpretability, it can also be used as a first data reduction step before using more sophisticated nonlinear DR algorithms, such as UMAP (described later). PCA can be unstable for noisy data, but improved variants exist (such as https://github.com/facebookarchive/fbpca).
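The statement that W consists of eigenvectors of the covariance matrix can be made concrete with a short sketch. The function name `pca` and the simulated data are assumptions for illustration; production tools typically use faster randomized or truncated solvers instead of a full eigendecomposition.

```python
import numpy as np

def pca(data, n_components=2):
    """PCA via eigendecomposition of the covariance matrix.

    data : (n_cells, n_features) array
    Returns the projected data, the matrix W whose columns are the leading
    eigenvectors, and the variance captured by each component.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    return centered @ W, W, eigvals[order]

# Simulated data with most of the variance along the first two features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 1.0, 1.0])
scores, W, variances = pca(data, n_components=2)
```

The two properties named in the text can be checked directly: the columns of `W` are orthonormal, and the projected coordinates are uncorrelated with variances equal to the leading eigenvalues.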
Independent component analysis (ICA) is similar to PCA, but W is picked based on other criteria (Alaa 2020). Several possible criteria exist, but the intuition as to why PCA might not give the most informative latent space is shown in Fig. 5c. Instead of maximizing variance, ICA can aim to maximize the skew of the data. The issue with ICA over PCA is that the answer is less constrained (not unique), and the algorithm is considerably slower. ICA has been used for single-cell analysis in, for example, the Monocle2 package (Van den Berge et al. 2020). It stands as a good reminder that PCA is not the only option.
Nonnegative matrix factorization (NMF) has also been used for single-cell analysis (DeBruine et al. 2021). Unlike PCA and ICA, the requested number of reduced dimensions (here called factors) is given up front (Fig. 5d). Several different numbers of dimensions are tested to find the optimal DR, but it can also be based on the expected meaning of the dimensions. Frameworks such as f-scLVM enable the analyst to also predefine some of the dimensions based on known genes (Buettner et al. 2017). This type of “bias” can help steer the model to increase interpretability. Linear models are straightforward to extend to perform multi-omics data integration (such as the MOFA package (Argelaguet et al. 2018)).
Overall, linear DR models are much easier to interpret than nonlinear models, and can easily be solved for advanced statistical distributions (ZINB and beyond). It is thus unlikely that they ever will go completely out of fashion, and even if a nonlinear model is deployed, it is still good to have a linear model to benchmark against.
Correcting for batch effects
When single-cell libraries are generated, common variations may be introduced that affect all cells. The sources are not well understood, but variation arises as the mixes of enzymes and buffers differ between runs. This effect (commonly known as a batch effect) may cause cells to not be directly comparable. Several dedicated algorithms exist to try to correct for this effect, for example, Harmony (Korsunsky et al. 2019), MNN (Haghverdi et al. 2018), and BBKNN (Polański et al. 2020). Scanpy and Seurat have other algorithms included. The problem was already studied for bulk RNA-seq data, and some older algorithms can also be used (Risso et al. 2014), provided that they are fast enough for today’s huge single-cell datasets. The performance of many algorithms has been benchmarked (Tran et al. 2020).
The earliest method for removing batch effects was regular PCA, where any differences in principal component 1 (PC1) frequently were due to batch differences. Thus, the differences could be handled by discarding the first component. This highlights the link of batch effects to DR. The most conservative way of statistically handling them is to introduce a categorical latent variable that represents the batch. This also holds for nonlinear methods discussed later; for example, the neural network–based model SCVI (Gayoso et al. 2022) can take the batch ID as a covariate. The only caveat with this approach is that it may be too conservative for practical use. Modern batch correction algorithms can instead use anchor cells, which are cells deemed similar enough between datasets that they can be overlapped in the batch integration. Cosine distance has been suggested as a way of finding anchor cells (Haghverdi et al. 2018). The best way of handling batches has not yet been settled, but the idea that batch effects can correspond to a latent variable is crucial. If other experimental variables are known between single-cell datasets, it may be reasonable to expect them to appear as latent variables (implicitly or explicitly).
Other common transformations
Not all transformations are linear; more generally, counts can be modeled as
where T is a transformation over a normal distribution. It can be beneficial to undo the effect of this transformation, as the untransformed variable then follows a simpler Normal distribution. It also affects our view of distance between points. For example, consider the commonly used \({l}^{2}\)-distance, that is, \(d(x,y) = {(x - y)}^{2}\). It is implicitly assumed that points of equal distance do not change distance if moved together in space; this assumption does not, however, hold after most transformations (Fig. 5e). The solution if T is an exponential transformation is simply to compute log(x). However, because log(0) = −∞, for RNA-seq count tables it is more common to compute log(x + 1), where 1 is a pseudocount. Several alternate transformations exist and have been compared (Ahlmann-Eltze and Huber 2023).
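The effect of the log(x + 1) transformation on distances can be seen in a small numerical sketch. The three count values are chosen purely for illustration.

```python
import numpy as np

# Three counts, each roughly tenfold apart once the pseudocount is added.
counts = np.array([0.0, 9.0, 99.0])
logged = np.log1p(counts)          # computes log(x + 1); log1p(0) is 0, not -inf

raw_gaps = np.diff(counts)         # gaps between neighbors on the raw scale
log_gaps = np.diff(logged)         # gaps between neighbors on the log scale
```

On the raw scale the second gap is ten times the first, whereas after the transformation both gaps equal log(10): equal fold changes become equal distances.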
Transformations can possibly more intuitively be seen as a means to put suitable weights on different features before other analyses such as PCA are performed. For ATAC-seq, the most common weighting approach is TF-IDF (term frequency–inverse document frequency) (Cusanovich et al. 2018). This method, borrowed from linguistic analysis, does two things in this context: (1) It normalizes counts across one cell, as a more efficient tagmentation can lead to all of the chromatin appearing more accessible; this is equivalent to size factor normalization as done for RNA-seq. (2) Regions that are accessible in all cells do not help distinguish different types of cells, and are thus given lower weight (Fig. 5f).
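The two roles of TF-IDF can be illustrated with a minimal sketch. Several TF-IDF variants are in use and packages differ in the exact formula; the version below (row-normalized term frequency, and log(1 + N/n) as inverse document frequency) is an assumption chosen for clarity, and the function name `tfidf` is hypothetical.

```python
import numpy as np

def tfidf(accessibility):
    """Weight a cells x regions matrix by TF-IDF.

    TF  : each cell is divided by its total count (per-cell normalization).
    IDF : regions open in many cells receive a lower weight.
    Assumes every cell and every region has at least one count.
    """
    m = np.asarray(accessibility, dtype=float)
    tf = m / m.sum(axis=1, keepdims=True)
    n_cells = m.shape[0]
    idf = np.log(1.0 + n_cells / m.sum(axis=0))
    return tf * idf

# Region 0 is open in all three cells; region 2 is open in only one.
accessibility = np.array([[1, 1, 0],
                          [1, 0, 1],
                          [1, 1, 0]])
weights = tfidf(accessibility)
```

In the second cell, the ubiquitous region 0 ends up with a lower weight than the cell-specific region 2, although both had the same raw count.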
The importance of the transformations described here is easily overlooked due to their speed and simplicity; however, they are crucial steps. Importantly, these models have no free parameters, and their choice can be motivated by an understanding of the underlying biology and chemistry. They should thus be considered before using more advanced models, with neural networks being the extreme opposite (described later).
Cell states and latent space topology
Nonlinear models were developed because many processes (including many in biology) simply are not linear. Unfortunately, many different nonlinear models can fit the same data, and they are hard to interpret (Fig. 6a). Before using a nonlinear model, it is thus important to have an idea of how one wants the model to behave, and how it might behave. This requires a fair bit of abstract mathematical thinking that will be presented here.
Topology is a subdiscipline of mathematics that is focused on the properties of spaces and surfaces (Armstrong 1983). It makes concepts such as path-connectedness concrete. For example, one might ask the question, “is there a way to connect a point A to a point B?” If point A is taken to be among pluripotent cells, and B is taken to be another cell type, this type of connectedness is equivalent to the biological question “can type A cells differentiate into type B cells?” (Fig. 6b). While this question is fairly easy for a human to answer if the cells are reduced to a 2D plane, it is not obvious in the higher dimensional (20,000 for all genes) space. Luckily, the field of topology is set up to handle any number of dimensions. Anyone analyzing higher dimensional data should thus be interested in topology.
Discrete topologies and basic nonlinear dimensional reduction
One problem with topology is that mathematicians usually have well-defined spaces (based on an equation). Single-cell biologists, on the other hand, just have a finite set of noisy observations. A link is achieved by approximating topologies from the data points using k-simplexes, a multidimensional equivalent of a triangle (Fig. 6c). These simplexes can be built by letting each corner be a single-cell observation. The edges remain lines, and the lines can be picked by distance. How they should be picked is an open problem, but typically cells are connected to the k nearest other cells, also known as the nearest neighbors (the result is called a kNN graph, Fig. 6d). If a Euclidean space is assumed, i.e., the Euclidean distance is used, then the kNN graph can be computed quickly even for large numbers of cells.
The kNN graph is the input to the most commonly used nonlinear DR tools for single-cell data: t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) (McInnes et al. 2018). While the algorithms can work with the full set of cell–cell distances, the closest neighbors are the most relevant, and focusing on these speeds up the computation tremendously. The output of each of these algorithms is a set of 2D coordinates (or a user choice of dimensions, with 3D frequently being useful), where the distance between the points reflects their distance in the higher dimensional space (Fig. 6e).
The greatest issue with t-SNE and UMAP is that interpretation is difficult (discussed further in the later section on cell types). While these methods are good for giving an unbiased overview of the data, the latent space axes have no meaning, and the distance might not reflect the most interesting biology. Thus, other nonlinear models that help capture the biology of interest are recommended after initial data investigation.
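A kNN graph under the Euclidean distance can be sketched by brute force as below. The function name `knn_graph` is hypothetical, and the all-pairs computation is an assumption made for readability: it scales quadratically with the number of cells, whereas tools for large datasets rely on approximate nearest neighbor search.

```python
import numpy as np

def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbors.

    points : (n_cells, n_dims) array
    Also returns the full distance matrix (self-distances set to infinity).
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)       # a cell is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return neighbors, dist

# Two well-separated groups of cells in a 2D toy space.
points = np.array([[0, 0], [1, 0], [0, 2], [10, 10], [11, 10]])
neighbors, dist = knn_graph(points, k=1)
```

With k = 1 no edge crosses between the two groups, so the graph already reflects the fact that the toy data consists of two disconnected pieces.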
Cell state space dynamics and trajectory inference
While t-SNE and UMAP aim to simplify the data in terms of mapping them directly to a lower dimensional space, another approach is to find a simplified topology. These topologies can, but need not, lend themselves to easy presentation. While nonlinear factor analysis can handle topologies such as lines and planes, it cannot handle tree topologies (Fig. 6b), as relevant for cell fate decisions during, for example, differentiation. Because of the historically close link to differentiation over time, the use of algorithms to analyze line or tree topologies is called pseudotime analysis, or trajectory inference. Several pseudotime algorithms have been proposed and compared (Saelens et al. 2019).
Trajectory inference rests on several assumptions. First, if the trajectory inference is for a time-based process, then the data must contain cells representing all the time points. Since cells commonly do not respond at the same rate, this is frequently the case, but cells from several time points may need to be mixed. Most crucially, no algorithm can prove that a pseudotime trajectory exists; it is an assumption. RNA velocity (La Manno et al. 2018; Bergen et al. 2021) and RNA metabolic labeling (Qiu et al. 2022), which measure the vector field, try to overcome this limitation. Some algorithms try to find the type of trajectories, while others simply accept user input. Thus, the latent space can be shaped by prior knowledge, or it can be unbiased.
Several methods build on algorithms that find the minimum spanning tree (MST), i.e., the subset of edges in a graph with the smallest total weight (distance) that still connects all the vertices (Fig. 7a). For single-cell data, the kNN graph is the input to the MST algorithm. However, due to noise, the MST graph can become rather complicated. To avoid overfitting, several methods are thus used to simplify the graph. Slingshot (Street et al. 2018) and Monocle (Trapnell et al. 2014) are examples of MST-based algorithms. MST has the advantage (and disadvantage) of not enforcing the number of endpoints per se. Other approaches exist that simplify the graph, not necessarily to a tree, such as PAGA (Wolf et al. 2019).
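How an MST is extracted from pairwise distances can be sketched with Prim's algorithm. This is a generic textbook construction and not the implementation used by Slingshot or Monocle; the function name is hypothetical, and for simplicity the input is assumed to be a dense, symmetric distance matrix of a connected graph rather than a sparse kNN graph.

```python
import numpy as np

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (parent, child, weight) edges connecting all vertices.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                       # start growing the tree from vertex 0
    best_cost = dist[0].copy()              # cheapest known link into the tree
    best_parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        cost = np.where(in_tree, np.inf, best_cost)
        j = int(np.argmin(cost))            # closest vertex not yet in the tree
        edges.append((int(best_parent[j]), j, float(cost[j])))
        in_tree[j] = True
        closer = dist[j] < best_cost        # does j offer a cheaper link?
        best_cost = np.where(closer, dist[j], best_cost)
        best_parent = np.where(closer, j, best_parent)
    return edges

# Four cells placed along a line at positions 0, 1, 3, and 7.
positions = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])
edges = minimum_spanning_tree(dist)
```

For cells lying along a line, the tree simply chains neighboring cells, which is the ordering a pseudotime method would then read off.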
Knowledge can be extracted from the graph representations in several ways. For simple graphs, such as trees, a subset of cells can be ordered from the tip of one branch to another. Another approach is to study the dynamics of cells, assuming they transition semirandomly over the neighbor graph. The simplest suitable statistical models are memoryless processes. These also correspond to a Newtonian model of the evolution of the cell state X at time t, which is assumed to contain all information needed to predict the future:
Memoryless processes of this kind are also called Markov chains, and can be described by graphs having transition probabilities on each edge (Norris 1997) (Fig. 7b). Thus, the single-cell neighbor graph with suitable probabilities assigned can be treated with powerful Markov chain theory. The jumping probabilities can be uniform, or informed by other data such as RNA velocity (Lange et al. 2022). If some vertices in the graph only have incoming edges, then these are denoted as absorbing states; a random walk will at some point get stuck in one of these (Fig. 7c). If there are multiple such absorbing states, then it is easy to compute which end state is the most likely, and what the average number of random jumps is until it is reached (corresponding to total time). This can be used to estimate the likelihood of a given type of differentiation. If no absorbing states exist, then it is possible to calculate the stationary distribution, i.e., how likely it is for a cell to be in a given state, independent of where the cell starts (assuming a property called ergodicity, usually fulfilled for single-cell data). These are just the simplest Markov chain concepts, and the chain can be designed to model and answer various biological questions.
One caveat with current pseudotime methods is that they provide no information about why any branching occurs. An analogy is a previous study of ours on speciation (Henriksson et al. 2010): to learn why speciation occurs, the concept of speciation had to be removed from the model. Similarly, it may be speculated, cell fate decision-making cannot be understood using algorithms that enforce a simplified model of a branching event. In particular, the local topology at the branch is not understood at all. Unbiased Markov chain–based modeling is likely a good bet for anyone trying to further understand cell fates.
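The absorbing-state calculations mentioned above follow from standard Markov chain theory and can be sketched as below. The four-state chain and the function name `absorption_analysis` are hypothetical; a real analysis would build the transition matrix from the neighbor graph.

```python
import numpy as np

def absorption_analysis(P, absorbing):
    """Absorption probabilities and expected number of jumps.

    P         : (n, n) transition matrix, rows summing to 1
    absorbing : indices of the absorbing states
    Returns the transient state indices, a matrix of probabilities
    (transient state x absorbing state), and the expected number of jumps
    from each transient state until absorption.
    """
    P = np.asarray(P, dtype=float)
    absorbing = list(absorbing)
    transient = [i for i in range(P.shape[0]) if i not in absorbing]
    Q = P[np.ix_(transient, transient)]     # transient -> transient
    R = P[np.ix_(transient, absorbing)]     # transient -> absorbing
    N = np.linalg.inv(np.eye(len(transient)) - Q)   # fundamental matrix
    return transient, N @ R, N.sum(axis=1)

# States 0 and 3 are end fates; states 1 and 2 are intermediate cells
# that jump to either neighbor with equal probability.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
transient, fate_probability, expected_jumps = absorption_analysis(P, [0, 3])
```

In this toy chain, a cell in state 1 ends up in fate 0 with probability 2/3 and in fate 3 with probability 1/3, after two jumps on average.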
Donuts and the cell cycle
Topology offers tools to reason about the shape of the latent space. One of the most famous results from topology is that a coffee cup is equivalent to a donut (Fig. 8a). This result more specifically tells us about a certain type of connectedness: can one line from A to B be deformed in a continuous manner so as to overlap another line from A to B? It turns out that there are lines on the donut which do not have this property (Fig. 8b), because the hole in the middle restricts the deformations. The same property holds for a coffee cup. In topology, the coffee cup is equivalent to the donut in the sense of having the same line-deformation connectedness properties.
The concept of topological equivalence is important to us because we can reason about simpler spaces instead of the high-dimensional raw data. The cell cycle can be thought of as a circle (which indeed is the way it is commonly drawn, Fig. 8c). The cell cycle state can be identified from RNA-seq data using common workflows (part of both Seurat and Scanpy), and cells can be annotated as being in either G1, S, or G2M phase. However, the standard workflow does not order the cells within these phases, limiting the resolution at which cell cycle–linked events can be studied. In limited cases, it might be possible to use a pseudotime algorithm, but because linear pseudotime has the implicit assumption of a start and an end, linear pseudotime is topologically incompatible with the cell cycle (Fig. 8c).
There are several attempts at ordering the cells according to the cell cycle. The common RNA-seq analysis method for categorizing cell cycle state is based on a list of marker genes, and a PCA is used to reduce the number of dimensions to 2. It is likely not a coincidence that the smallest number of dimensions in which a circle (or cycle) can be represented is also 2. In principle, the angle in this reduced space can be used to order the cells, although we have never seen this performed.
One caveat with topological reasoning is that we only have a finite number of samples, and thus only approximate knowledge of the space. Can cells ever reside in holes of the topology? If the points or the kNN graph are assumed to have volume, then this question can be addressed by testing different volumes. This has been applied to single-cell data (Rizvi et al. 2017) and the concept enables a range of analyses, such as how “small” or isolated a gene regulatory program is (Fig. 8d).
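The idea of ordering cells by their angle in a two-dimensional reduced space can be sketched as follows. This is a hypothetical illustration of the principle only: the function name `cycle_order` is invented here, the embedding is simulated rather than derived from marker genes, and the centroid is assumed to lie inside the cycle.

```python
import numpy as np

def cycle_order(embedding):
    """Order cells by their angle around the centroid of a 2D embedding.

    embedding : (n_cells, 2) array, e.g., two principal components
    Returns each cell's angle in [0, 2*pi) and the cell indices sorted by angle.
    """
    emb = np.asarray(embedding, dtype=float)
    centered = emb - emb.mean(axis=0)
    angle = np.arctan2(centered[:, 1], centered[:, 0]) % (2 * np.pi)
    return angle, np.argsort(angle)

# Simulated cells: eight evenly spaced cycle positions, stored in shuffled order.
position = np.array([3, 0, 6, 1, 7, 4, 2, 5])
true_angle = 0.3 + position * 2 * np.pi / 8
embedding = np.column_stack([np.cos(true_angle), np.sin(true_angle)]) + [2.0, -1.0]
angle, order = cycle_order(embedding)
```

Sorting by angle recovers the simulated progression through the cycle. Note that the starting point and the direction of travel are arbitrary on a circle and would have to be fixed using known marker genes.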
Housekeeping genes, group theory, and product spaces
Group theory is a topic of abstract algebra and describes the mathematical structures (groups) generated by binary operators, such as addition or multiplication. A formal definition is beyond this review, but the main use is in analyzing symmetries. It has had success in X-ray crystallography, where it can be used to prove the number of possible crystal structure symmetries. While addition is a function over the space of numbers, it can also be used to organize, e.g., how many turns plasmid dsDNA is wound. The state can be represented by the winding number (Fig. 8e). The operator of interest is in this case “PositiveWinding,” though an inverse “NegativeWinding” can be derived. This winding operator operates on circular dsDNA as objects, adding the turns of one plasmid to another. Because winding behaves exactly as integers over addition, an isomorphism can be defined over classes of topologies to numbers (thus, one can speak of winding numbers, instead of complex geometric objects). The main point of group theory is that various symmetries can be argued based on the properties of the operator, and the operator can operate over topologies. Conversely, symmetry usually implies some type of group, e.g., a rotated circle is still a circle (making the rotation operator an identity operator in this case). The breaking of symmetries has also been studied in relation to groups in embryogenesis (Kumar and Bentley 2003). Past positive examples make the use of group theory tempting also in a single-cell latent space setting.
Groups are related to product spaces. We can imagine a representation of the cell state to be a position in “the space of cell cycle,” and simultaneously a position in “the space of cell differentiation.” The set of all possible two positions together (cell cycle, differentiation) then makes up a higher dimensional “product space” (Fig. 8f). Such a latent space can naturally be expected to capture more of the biology than just each of them alone.
Unfortunately, there are few pure “operators” in biology that operate on only one space, and thus the product space analogy breaks down. Even processes considered to be “housekeeping” are intertwined with other processes. Several examples can be given for the cell cycle: activation of naive T cells is essentially a synchronized entry into the cell cycle; and cells in the skin preferentially divide during the night (Beri and Milgraum 2016), linking it with circadian rhythm. T cell migration in and out of lymph nodes is also linked with the circadian rhythm (Druzd et al. 2017). These are just some examples of how seemingly housekeeping processes are linked to more specialized processes, and how there are no truly independent spaces. This is the most important take home message—biology does not conform well to idealized representations such as independent spaces. Nevertheless, in limited cases, it can be a good approximation, and fitting data to idealized models is a natural part of hypothesis testing.
Nonlinear models and neural networks
Graph-based reasoning allows one to compare cells in a topologically relevant manner, but it does not provide an explicit function to and from the latent space to the data space the way PCA, NMF, or other methods do. Having an explicit function is a requirement for using most statistical tools. Unfortunately, it is normally difficult to motivate the choice of a nonlinear function for higher dimensional data. One way out is to allow a large range of differently shaped functions; this can be done by using neural networks (NNs). These are inspired by neurons, built up by many small simple units that together can produce complex behavior. The smallest modern “neuron” typically looks like this (Fig. 9a and b):
\[ y = \mathrm{ReLU}\left( \sum_{i} {w}_{i} {x}_{i} + b \right) \]
where the nonlinear function ReLU is defined as follows:
\[ \mathrm{ReLU}(x) = \max(0, x) \]
Several layers of neurons make up a neural network (Fig. 9c). The optimal input weights \({w}_{i}\) and \(b\) are computed using optimization, e.g., by minimizing the difference of the neural network output vs the given data (also known as the reconstruction error):
There is little special about this mathematical construction, except that it has a “well-behaved” derivative, which helps optimization over many layers of neurons (avoiding what is called the “vanishing gradient” problem, which occurs when lower-layer connection weights in a deep NN become static, hindering or halting further training of the NN). Secondly, it is in fact just many multiplications and additions organized in a coherent way, and computing it fits well with how graphics processing units (GPUs, graphics cards) are designed. GPUs enable an order of magnitude faster solving of NNs. Since NNs are nowadays easy to use and fast to compute, many methods are based on them. Note, however, that linear functions are a special case of neural networks, and everything here could be designed for these as well.
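A single neuron, a layer, and a reconstruction error can be sketched in a few lines. The function names and toy numbers are assumptions for illustration, and the squared Euclidean error is only one possible choice of reconstruction error.

```python
import numpy as np

def relu(x):
    """ReLU nonlinearity: negative values are set to zero."""
    return np.maximum(0.0, x)

def neuron(x, w, b):
    """One neuron: weighted sum of the inputs plus a bias, passed through ReLU."""
    return relu(np.dot(w, x) + b)

def layer(x, W, b):
    """A layer is many neurons sharing the same inputs (one row of W per neuron)."""
    return relu(W @ x + b)

def reconstruction_error(x, x_hat):
    """Squared Euclidean difference between the data and the network output."""
    return float(np.sum((np.asarray(x) - np.asarray(x_hat)) ** 2))

x = np.array([1.0, 2.0])
silent = neuron(x, np.array([0.5, -1.0]), 0.25)   # weighted sum is negative
active = neuron(x, np.array([0.5, 1.0]), 0.25)    # weighted sum is positive
hidden = layer(x, np.array([[0.5, -1.0], [0.5, 1.0]]), np.array([0.25, 0.25]))
```

The layer is nothing more than a matrix multiplication followed by an elementwise maximum, which is why the computation maps so well onto GPUs.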
A final important note is that the NN reconstruction does not rely on using the Euclidean distance between data points. The reconstruction error model can easily be modified to better incorporate statistical properties. Because the Euclidean distance enforces a certain type of topology, and thus latent space structure, the use of NNs can thus open the door to biologically more relevant latent spaces.
Generative processes
Methods such as PCA, ICA, NMF, and UMAP were primarily developed to map points from data to latent space. An alternative approach is to develop algorithms that map from latent space to data. However, since the latent space has a smaller dimension, it cannot easily cover all of the larger space (Fig. 9d; filling the data space is possible with space-filling curves (Armstrong), but they are only of theoretical interest). A solution to this is to consider multivalued functions; that is, f(x) can return more than one value. In the case of generative processes (GPs), the returned values follow a probability distribution. A basic generative process from latent space X to data space Y can look as follows (Fig. 9e):
This generative process was constructed manually and illustrates the concept. It shows the enormous freedom in the choice of latent space and GP. However, for most applications, an algorithm is used to fit a suitable GP to the data. Because these algorithms have little a priori knowledge of the data, highly flexible NN nonlinear functions are commonly used in combination with simple statistical distributions. A hypothetical example could be:
Quasilinear versions have also been tested on single-cell data, with the aim of providing explainable mappings (Svensson et al. 2020). To find the best NN, this has to be recast as an optimization problem. Two examples of networks/algorithms that realize this in practice are variational autoencoders (VAEs) and Generative Adversarial Networks (GANs).
Autoencoders (AEs) and variational autoencoders
Autoencoders (AEs) were first developed independently of GPs, but were later fused with the concept, resulting in variational autoencoders (VAEs) (Kingma and Welling 2013). In addition to the function from latent space to data (in this context, the GP is called the decoder), a function is also sought from data to the latent space (called the encoder). These are considered meaningful if a data point, once encoded and then decoded, is approximately reconstructed (Fig. 9f). The solution to the reconstruction problem is typically not unique, giving rise to many possible latent spaces. Furthermore, the latent space might not have sufficient structure from a human standpoint (Fig. 9g).
To increase the latent space structure, one commonly aims to organize it such that samples close in the latent space correspond to similar output data. Thus, the latent space is smoothened, avoiding the scenario in Fig. 9g. For single-cell data, this has been achieved by trying to fit the underlying distribution rather than directly trying to reconstruct the data (Eraslan et al. 2019), a trick still in use (Gayoso et al. 2022). However, a more common solution to smoothen the latent space is to replace plain AEs with VAEs. For VAEs, unlike AEs, the latent space is a probability distribution (Fig. 9h). This distribution is commonly called the variational distribution, \({q}_{\theta }\,(z \mid x)\). The latent variables, z, are sampled from this distribution, and the random sampling is what ensures that similar latent space points result in similar data points. Figure 9f shows the basic structure of a VAE, where the NN parameters \(\theta\) and \(\phi\) are found by minimizing the following loss function over all data points \({x}_{i}\):
\[ \mathcal{L}(\theta, \phi) = \sum_{i} \left( -\mathbb{E}_{z \sim {q}_{\theta }(z \mid {x}_{i})} \left[ \log {p}_{\phi }({x}_{i} \mid z) \right] + \mathrm{KL}\left( {q}_{\theta }(z \mid {x}_{i}) \,\|\, p(z) \right) \right) \]
The first part of the function is the reconstruction loss, going from latent space z to data space x. In other words, it is a measurement for how similar the reconstructed data is to the original input. The second half of the loss function measures the KL divergence (a measure of difference between two distributions) between the variational distribution and \(p\,(z)\). The distribution of \(p\,(z)\) greatly determines the final latent space structure but is almost always set to be a Normal distribution, a rather unbiased choice.
The minimization of the loss function could be done with any nonlinear optimizer, but because the random samples of the latent variables \(z\) from the variational distribution \({q}_{\theta }\,(z \mid {x}_{i})\) have high variance (Kingma and Welling 2013), convergence would be poor. The solution to this problem is known as the reparameterization trick. Unfortunately, this trick limits which variational distributions can be used for the latent space, as the distribution must have certain symmetries. The Gaussian distribution, which is commonly used, is sufficient for most cases; however, other distributions have been tested (Ding and Regev 2021).
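For a Gaussian variational distribution and a standard Normal prior, the reparameterization trick and the KL term take a simple form, sketched below. The function names and the parameterization by mean and log-variance are assumptions reflecting common practice; an actual VAE would obtain `mu` and `log_var` from the encoder network and differentiate through the sample.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick for a diagonal Gaussian.

    Instead of sampling z directly, standard Normal noise is drawn and then
    shifted and scaled, so that z is a deterministic function of the
    distribution parameters and the noise.
    """
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
z = sample_latent(mu, np.array([0.0, 0.0]), rng)          # unit-variance sample
kl_same = kl_to_standard_normal(np.zeros(2), np.zeros(2))  # identical distributions
kl_shifted = kl_to_standard_normal(mu, np.zeros(2))        # mean shifted by 1
```

The KL term is zero only when the variational distribution equals the prior, so it pulls every encoded cell toward the same region of the latent space while the reconstruction term pulls them apart.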
VAEs for single-cell analysis
The link between VAEs and the known single-cell statistics is through the GP \({p}_{\phi }\,({x}_{i} \mid z)\). In other words, given a latent space point, what is the statistical distribution of the data? The toolkit SCVI (Gayoso et al. 2022) proposes the following GP for RNA-seq (Lopez et al. 2018) (simplified to show key concepts). The latent space has been split into z and L, and the GP below is thus actually \({p}_{\phi }\,({x}_{i} \mid z, L)\):
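A reconstruction of this simplified GP from the description that follows is given here. It is an assumption rather than the exact published equation: \({\mu }_{L}\) and \({\sigma }_{L}\) denote the fitted log-normal parameters of the size factor, and the remaining ZINB parameters (dispersion and zero-inflation) are left unspecified.

\[
\begin{aligned}
z &\sim \mathrm{Normal}(0, I) \\
L &\sim \mathrm{LogNormal}({\mu }_{L}, {\sigma }_{L}^{2}) \\
\varrho &= \mathrm{NN}_{\phi }(z, \mathrm{batch}) \\
X &\sim \mathrm{ZINB}(\varrho \cdot L, \ldots )
\end{aligned}
\]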
This real-life example shows how different types of data distributions can be modeled on top of a latent space. The latent space has been split such that z represents cell type, and L the sequencing depth (size factor). The log-normal distribution of L ensures that the size factor is always positive and close to the fitted value. However, the model also works well if L is simply replaced with the fitted size factor for the particular cell. Next, the variable \(\varrho\) is found using a NN, which optionally can use information about, e.g., which batch the cell comes from (for batch correction). The variable \(\varrho\) can be thought of as representing the RNA levels of an ideal cell, without zero inflation or differences in sequencing depth. To account for sequencing depth, it is later simply rescaled as \(\varrho\)*L. Finally, the data X follows a ZINB distribution with \(\varrho\)*L as the idealized average cell. However, NB or Poisson can equally well be used instead.
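To make the generative direction concrete, the following sketch draws synthetic counts for one cell by following the steps described above. It is a toy stand-in, not the SCVI implementation: a fixed linear map followed by a softmax replaces the decoder NN, the dispersion and dropout probability are single hypothetical constants, and all function and parameter names are invented for illustration.

```python
import math
import random


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_poisson(rate, rng):
    """Knuth's method, applied in chunks of at most 30 so exp(-rate) never underflows."""
    count = 0
    while rate > 0.0:
        chunk = min(rate, 30.0)
        threshold, product = math.exp(-chunk), rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        rate -= chunk
    return count


def sample_cell(decoder_weights, dispersion, dropout_prob, rng,
                log_depth_mean=math.log(50.0), log_depth_sd=0.3):
    """Draw (z, L, rho, counts) for one cell from the toy generative process."""
    z = [rng.gauss(0.0, 1.0) for _ in decoder_weights[0]]      # cell state, z ~ Normal
    depth = rng.lognormvariate(log_depth_mean, log_depth_sd)   # size factor L > 0
    logits = [sum(w * zj for w, zj in zip(row, z)) for row in decoder_weights]
    rho = softmax(logits)              # idealized expression (assumed to sum to 1)
    counts = []
    for fraction in rho:
        mean = fraction * depth        # rescale by sequencing depth: rho * L
        # Negative binomial as a gamma-Poisson mixture with the given mean.
        rate = rng.gammavariate(dispersion, mean / dispersion)
        count = sample_poisson(rate, rng)
        if rng.random() < dropout_prob:   # zero inflation
            count = 0
        counts.append(count)
    return z, depth, rho, counts


rng = random.Random(0)
decoder_weights = [[0.5, -0.2], [-0.3, 0.4], [0.1, 0.1]]  # 3 genes, 2 latent dimensions
z, depth, rho, counts = sample_cell(decoder_weights, dispersion=2.0,
                                    dropout_prob=0.1, rng=rng)
print(counts)
```

Setting `dropout_prob=0.0` turns the ZINB into a plain NB, which mirrors the remark that simpler count distributions can be swapped in.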
This example shows the enormous flexibility of the VAE framework. Given a latent space, which can be given any number of dimensions and shape, a neural network can transform it into the parameters of any choice of probability distribution. Other GPs represent CITE-seq (Gayoso et al. 2021) and ATAC-seq (Xiong et al. 2019; Ashuach et al. 2022), and can easily share the latent space z for multiomics integration.
Several open problems remain for VAEs. In SCVI, a plain neural network is used to encode data into z, without any use of known single-cell statistics, which might limit the reconstruction. Also, several users have reported VAEs to be “brittle,” giving rather different solutions for small changes in the input or in the NN structure. If the data is limited then, as for most NN-based algorithms, VAEs struggle to fit a good model. This requires tuning of the NN architecture, in terms of the number of layers and the number of neurons in each layer. An ideal model of data should not rely on technical parameters, and in this regard, VAEs have a long road ahead. Nevertheless, their flexibility in statistical formulation suggests that the trip is worth the effort. At the current forefront are different ways of structuring the latent space such that the coordinates have meanings, for example by tying them to genes involved in the same biological process (Lotfollahi et al. 2023). The same idea can be used to shape the neural networks and reduce the number of parameters (Elmarakeby et al. 2021). With fewer parameters, less data is needed and the brittleness may be overcome.
Graph neural networks
One class of neural network–based algorithms operates on data organized in graphs: graph convolutional networks (GCNs). They have for example been used to predict properties of molecules, where the atoms and their connectivities make up a graph (Reiser et al. 2022). The famous AlphaFold algorithm for predicting protein structures likewise treats its task as a graph inference problem (Jumper et al. 2021). For single-cell data, the graph is commonly, but not necessarily, the kNN graph. While GCNs thus rely on the latent space induced by the Euclidean distance metric, they permit entirely different problem formulations.
For GCNs, the question is: can knowledge be gathered at a local point of the graph, NN (G_{local}), and can it be sufficiently propagated by repeated application of NN layers, NN_{1} (NN_{2} (G_{local})) (Fig. 9i)? This approach can be motivated by a field of mathematics, fixed point theory (Debnath and Mikusinski 2005), which studies functions f and points x_{fix} such that f (x_{fix}) = x_{fix}. This is related to the repeated application of functions, and convergence to these fixed points: f (f (f (…f (x)))) → x_{fix}. Such a fixed point is thought to represent the fully extracted knowledge of the graph. In practice, however, too many iterated function applications result in “oversmoothing” (Cai and Wang 2020), a state in which the discriminatory power actually decreases. Thus, in reality, it may not be a good idea to run as many iterations as possible.
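The tension between propagation and oversmoothing can be seen in a toy example. The sketch replaces each NN layer by plain neighbor averaging, which is an assumption made for illustration (real GCN layers have learned weights and nonlinearities), and applies it repeatedly to a small hypothetical graph with two groups of nodes.

```python
def propagate(features, neighbors):
    """One mean-aggregation step: average each node with its neighbors (self included)."""
    return [
        (features[node] + sum(features[n] for n in neighbors[node]))
        / (1 + len(neighbors[node]))
        for node in range(len(features))
    ]


def spread(features):
    """Largest difference between any two nodes; zero means no discriminatory power."""
    return max(features) - min(features)


# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
features = [1.0, 1.2, 0.8, -1.0, -0.9, -1.1]  # one hypothetical scalar per node

history = [features]
for _ in range(50):
    history.append(propagate(history[-1], neighbors))
spreads = [spread(f) for f in history]

print("after 1 step:", [round(v, 2) for v in history[1]])
print("spread after 0, 1, 50 steps:", spreads[0], spreads[1], spreads[-1])
```

After one step the two groups still have opposite signs. After 50 steps the spread across all nodes has nearly vanished: the values approach a fixed point of the averaging map, but one that no longer distinguishes the groups.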
Because GCNs are such a broad topic, and can further be combined with VAEs (Wang et al. 2021), this section primarily lists some useful single-cell applications. Here ~ should be read as some sort of VAE formulation, and the notation is primarily for illustration:

GeneExpression = F (G). In this case, the aim is to compute the expression levels of a cell by comparing it to its neighboring cells (Gu et al. 2022). This can be used to calculate the expression of an “idealized cell,” without the technical or biological noise, which is also called denoising. Some downstream algorithms prefer smoothed data, and it can also be used to aid visualization.

CellType = F (G). Instead of relying on clustering, and annotating cell types from the average gene expression levels of each cluster, annotation can be done directly by investigating each cell and its neighboring cells. Since clustering requires manually providing settings about the resolution (or expected cluster size), the GCN approach is less arbitrary.

F_{protein} (G) ~ F_{ATAC} (G) ~ F_{RNA} (G). Different data modalities can be compared (or “integrated”) using some variant of graph neural networks. Again, this avoids the need for clustering, but it further has the advantage that the graphs need not be the same. For example, the graphs based on ATAC-seq need not correspond to those for RNA-seq.

F_{RNA} (G_{RNA}) ~ F_{gene_homology} (G_{genes}). There are cases when two orthogonal graphs are being estimated. One such case is the comparison of cells between species (Liu et al. 2023); the traditional Euclidean distance between cells is problematic because it is not clear which genes in species A should be compared to which genes in species B. However, assuming that the cells are lined up correctly, and with some knowledge of homology (based on gene sequences), it is possible to find which genes correlate and thus correspond. This is a circular dependency: The cell–cell correspondences depend on the gene–gene correspondence which depends on the cell–cell correspondences. Circular problems of this type beg for an iterative algorithm that solves both problems at the same time. GCN fitting is iterative and naturally matches the structure of the problem.
It is still early days for the use of graph NNs for single-cell analysis. Likely any single-cell problem can be formulated within this framework, and the combination with VAEs also enables them to capture an interpretable latent space. GCNs are thus a prime area for novel research.
Clustering, language, and cell types
The most primitive latent space representation is that of categories of cells. Formally:
The representation has the advantage that it is easy to perform pairwise comparisons of clusters, and the categories can be given memorable names. Batches and different treatments of cells usually map to categories, but categories are primarily discussed in the context of cell types.
There are many ways of classifying cells as different types, including tools that compare the gene expression to databases of profiles (Abdelaal et al. 2019). However, the most common methods split the kNN graph in such a way that the sum (“cost”) of the cut edges is minimized (Fig. 10a). The family of spectral graph cutting methods has not gained much traction, but has been used (Schwartz et al. 2020). By far the most common methods are now Louvain (Blondel et al. 2008) and Leiden (Traag et al. 2019), available through packages like Scanpy and Seurat. The use of graph cutting clustering algorithms thus makes the matter of cell type categorization a topological one, where the algorithm objective function need not correspond to what the user wanted (Fig. 10b). The resolution, and thereby indirectly the number of clusters, also has to be specified by the user, and there is seldom an objectively “correct” number of clusters. This brings this review to the longest-standing open question in the single-cell community.
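How the objective function and the resolution interact can be illustrated on a toy graph. The sketch computes the plain cut cost and the modularity score (the objective optimized by Louvain and one of the objectives supported by Leiden) for three hand-picked partitions. The graph, the partitions, and the function names are hypothetical, and no optimization is performed.

```python
def cut_cost(edges, labels):
    """Total weight of the edges whose endpoints lie in different clusters."""
    return sum(w for a, b, w in edges if labels[a] != labels[b])


def modularity(edges, labels, resolution=1.0):
    """Sum over clusters of (internal weight / m) - resolution * (cluster degree / 2m)^2."""
    total = sum(w for _, _, w in edges)
    internal, cluster_degree = {}, {}
    for a, b, w in edges:
        for node in (a, b):
            cluster_degree[labels[node]] = cluster_degree.get(labels[node], 0.0) + w
        if labels[a] == labels[b]:
            internal[labels[a]] = internal.get(labels[a], 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - resolution * (d / (2.0 * total)) ** 2
        for c, d in cluster_degree.items()
    )


# Toy kNN-like graph: two triangles joined by a single edge, unit weights.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0)]
one_cluster = [0, 0, 0, 0, 0, 0]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 0, 1, 2, 2]

for name, labels in [("one", one_cluster), ("two", two_clusters),
                     ("three", three_clusters)]:
    print(name, cut_cost(edges, labels),
          round(modularity(edges, labels), 3),
          round(modularity(edges, labels, resolution=3.0), 3))
```

The plain cut cost is trivially minimized by not cutting at all, which is why practical objectives add a term that rewards splitting. With that term, the preferred partition depends on the resolution: at resolution 1 the two-cluster split scores higher than the three-cluster split, while at resolution 3 the order is reversed.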
What is a cell type?
Cell types were originally identified by their morphology, as this was the first property that could be observed. The function of the cells also entered the definitions; neurons and muscle cells are, for example, very distinct. This is a type of natural history, where cells were simply grouped together (like all biological samples) to create a sense of order. Overall, cell type definitions have largely followed what industry and technology have had to offer (such as microscopes), an intertwining called technoscience (Pickstone 2001). As new measurement tools became available, such as Fluorescence-activated Cell Sorting (FACS) machines to sort by surface marker proteins, definitions have increasingly moved to be based on the cellular content. FACS has played a major role in immunology, where a huge number of cell types have been defined by a growing list of proteins. The most common way to qualitatively annotate cell types in single-cell data is to use these lists of cell type–specific “marker genes,” after clustering by, e.g., Leiden. Sequencing and proteomics have, however, challenged old surface marker–based definitions. One issue is that some marker proteins have only low levels of corresponding mRNA, and thus are not suitable for single-cell RNA-seq cell annotation. Another, even bigger, issue is that the mRNA need not be present in every cell, despite the protein being there. A debate has raged whether this is for technical or biological reasons, but consensus is moving toward the idea that mRNA is produced in bursts (Golding et al. 2005; Raj et al. 2006; Larsson et al. 2019). Because the protein carries the actual function, the mRNA need not be present at all times, thus pointing toward inherent biological reasons. The solution in the single-cell world has been to rely on unsupervised clustering and to use the average profile of each cluster for annotation. When clusters are distinct, the annotation is usually beyond doubt. However, for some cells, with subtypes of neurons, T cells, and monocytes as notable examples, the boundaries between classically defined subtypes are not clear (Fig. 10c).
A way around the unclear boundaries is to refer to the cell differentiation history, classically called the “lineage tree.” Thus, a cell type is not just a separate category, but somehow linked to other categories with shorter or longer distance. For comparison with other types of cells, it primarily makes sense to compare with other cell types in the same lineage (Fig. 10d). This is already implicitly performed by analysts by setting the clustering resolution appropriately, but the lineage relationship could in principle be stored in the annotation as well. Concepts from comparison of genes across species, such as inparalogs (Sonnhammer and Östlund 2015), could be borrowed to make the correct comparisons more formal.
However, in many single-cell datasets, a cluster exists based on the cell cycle alone. Are dividing cells their own cell type? Many would argue against it and would rather call it a cell state, thus questioning the ability of current clustering methods to define cell types. The solution calls for either clustering algorithms that interpret the latent space topology differently, or topologies that fit better with our notion of cell types. Alternatively, the nomenclature needs updating. The Human Cell Atlas is, for example, trying to update the cell type definitions (Osumi-Sutherland et al. 2021). But sometimes it is not even clear what should be annotated (Fig. 10e); muscle cells are multinucleated, suggesting that the outer cell membrane is the object of interest. However, epithelial cells are stuck together and share space through gap junctions, suggesting that an epithelium is just one single large cell. This motivates a higher-level view on the cell type problem.
Sociologists provide several external views. One view emphasizes power relationships (Foucault 1995), such as reviewers for grant agencies and journals upholding the use of certain terminology (the poststructuralist view). Different review boards may also be interested in different genes depending on their technologies; e.g., for T helper type 2 cells, a genomics panel would be concerned with GATA3 expression using sequencing, while an immunology panel would look at IL4 secretion using flow cytometry (the historical materialism/Marxist view). The constructivist view sees this simply as new concepts being developed in the light of old concepts: as we specialize and socially compartmentalize, we develop different new concepts that need not agree. The logician Wittgenstein argues, using his language game model (Wittgenstein 1998), that language need not be “logically correct”; it merely has to serve a function within a certain context. Interestingly, the ML community is just about to bring back many of these abstract and questioned concepts, but in a new quantifiable shape.
Semantics and language models
The use of language models deserves extra attention in the context of latent representations. What is knowledge and what is meaning? This philosophical problem has a long history, and with the advent of logic, it has been argued that “meaning” is what someone tries to convey in a sentence (or utterance). Sentences can then be described in a type of logic (T-theories in the Davidsonian tradition of philosophy (Glüer 2011)). This idea has caught on in the study of causality. Classical statistics does not handle causality, only correlation; it has been speculated that the reason causality remained mathematically largely undeveloped until recently is that language is such a great causal inference system (Pearl and Mackenzie 2018). Causality is, however, no simple matter, and Aristotle already tried to understand it in more detail (Haig 2020). If we think of understanding biology in terms of causality (“calcium is released because X bound to Y”), then this can be understood in terms of language, but we still do not understand language itself well. However, computers are good at modeling language (natural language processing, NLP); does this mean that computers can understand biology? And how well can language serve as a latent representation for the data we see, that is (Fig. 10f):
The NLP field is these days centered around transformers (Vaswani et al. 2017), a type of NN over strings of data that excels at keeping references to earlier parts of the data string (Fig. 10g). If the data string is a sequence of characters or words, then it maps directly to language processing. But transformers can also process sound, and more recently they have even been used to generate realistic images from textual descriptions (Ramesh et al. 2021). One type of transformer, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018), has already been applied to single-cell RNA-seq data (Yang et al. 2022). Similarly, ATAC-seq (Bravo González-Blas et al. 2019) and multiomic data have also been modeled (Cui et al. 2023).
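The mechanism by which transformers keep references to other parts of the string is attention (Vaswani et al. 2017). A minimal sketch of scaled dot-product self-attention follows. It makes simplifying assumptions that are not part of any published model: the learned query, key, and value projections are replaced by the identity, every position may attend to every other position (as in BERT), and the token embeddings are hypothetical.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Each output is a weighted average of all token vectors, with weights
    given by the softmax of scaled dot products with the querying token.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * value[d] for a, value in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights


# Three hypothetical token embeddings; the first and last are identical.
tokens = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[2]])
```

The last token places almost all of its weight on itself and on the identical first token, regardless of how far apart they are in the string.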
From the discussion of clusters, and how these do not always map well to classical cell types, one point about language should already be clear: if our language is imprecise, then so is our ability to model and understand. If the vocabulary is poor, then this limits any statistical model on top. However, this field has a future in that at some point, our models must map to human understanding, which arguably is based on the brain’s language center. It must also connect to previous research, written up as language in articles and largely inaccessible to anything but NLP. This is thus a topic that should be followed with great interest and especially taught in any biology class on the theory of science.
Data that still begs for representations
This limited review covers the most common single-cell concepts and how these can be mapped to latent representations. However, how to best represent newer types of readouts remains an open problem.
RNA-seq also captures genomic sequence information, for example, SNPs. This can be used to assign cells to donors (for humans, not inbred mice), for example, by Vireo (Huang et al. 2019). This is a simple categorical representation. Several studies, however, attempt to trace cell lineages (“lineaging”) from accumulated mutations. The underlying representation is then a (lineage) tree. This representation is invalid for cells that have fused, e.g., myocytes, and rather calls for a directed acyclic graph. Because of lack of data, such representations have not yet been developed to the authors’ knowledge.
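The difference between the two representations is small at the data structure level, as the following sketch suggests. The cell names and the parent-list encoding are hypothetical, chosen only to show that a fused cell has more than one parent and therefore breaks the tree property while remaining a valid directed acyclic graph.

```python
def is_tree(parents):
    """A lineage is a tree (or forest) if every cell has at most one parent."""
    return all(len(p) <= 1 for p in parents.values())


def ancestors(cell, parents):
    """All cells from which `cell` descends, following every parent link."""
    seen, stack = set(), list(parents.get(cell, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


# Each cell maps to the list of cells it derives from.
lineage = {
    "zygote": [],
    "myoblast_a": ["zygote"],
    "myoblast_b": ["zygote"],
    "fused_myocyte": ["myoblast_a", "myoblast_b"],  # cell fusion: two parents
}

print(is_tree(lineage))
print(sorted(ancestors("fused_myocyte", lineage)))
```

Mutations accumulated along both parental branches would be expected in the fused cell, which is the information a tree cannot hold.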
Spatial transcriptomics will bring the next level of challenges to the single-cell world. Frameworks such as STELLAR (Brbić et al. 2022) and SPICEMIX (Chidester et al. 2023) attempt to model spatial transcriptomics data given the cells’ positions in the tissue. Further combination of spatial and single-cell data enables other types of statistics (Svensson et al. 2018; Liu et al. 2021; Kleshchevnikov et al. 2022). Some spatial methods measure the 3D location of individual RNA molecules, which can be more informative than just having the counts for each cell. The location is an important part of the regulation; cells may contain stress granules, which may soak up proteins and RNA to temporarily disable them. RNA may also be kept disabled for rapid activation. The size of neurons also makes the location of RNA important. Representations of data for these scenarios remain underdeveloped or nonexistent; a challenge is to find suitable informative simplifications. Because 10x Genomics only recently announced a commercial technology for spatial resolution of single RNA molecules, we can expect an explosion of data analysis methods in this field.
Finally, genetic perturbations cause cells to shift within the latent spaces, or move outside what the latent space can describe if it is just based on unperturbed cells. The single-cell field is slowly moving toward being able to perturb large numbers of genes (even genome-wide) (Replogle et al. 2020, 2022; Peidli et al. 2022). How should the latent space be set up to cope with such a large amount of information? Of related interest is the prediction of the effects of perturbations (Qiu et al. 2022). If the vector field of cells in the cell state space can be estimated, for example, by RNA velocity (Bergen et al. 2021), or measured by metabolic labeling (Qiu et al. 2022), then this can also inform about the ideal latent space structure.
Concluding remarks
This review has hopefully managed to convey that the matter of latent spaces, or representations, is at the heart of understanding biology at the single-cell level. Luckily, easy-to-use frameworks have been developed, which at least are good enough for testing new representations (turning a representation into a new analysis package further requires software engineering skills). Complex hierarchical Bayesian models (Gelman and Hill 2006) can be directly formulated using, for example, Stan (Carpenter et al. 2017). While solving them is slow in Stan, the flexibility makes up for it. Another option for fitting Bayesian models, approximately but fast, is Variational Inference, which goes beyond the VAE example in this review. Bayesian models have recently gained much traction, especially in conjunction with the use of GPUs. PyTorch (https://pytorch.org/) and TensorFlow (https://www.tensorflow.org/) are two frameworks in which it is fairly straightforward to formulate variational inference problems. Interested readers should investigate SCVI (Gayoso et al. 2022) and scArches (Lotfollahi et al. 2022), which use this for solving VAEs, and Cell2location (Kleshchevnikov et al. 2022), which uses this to link single-cell data to spatial transcriptomics data. A large number of free lectures on ML methods are now also available on common streaming platforms, with further examples on GitHub.
Data availability
The code used to generate some of the graphs is available at GitHub, https://github.com/henrikssonlab/singlecell_review2023.
References
Abdelaal T, Michielsen L, Cats D et al (2019) A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol 20:194. https://doi.org/10.1186/s13059-019-1795-z
Ahlmann-Eltze C, Huber W (2023) Comparison of transformations for single-cell RNA-seq data. Nat Methods. https://doi.org/10.1038/s41592-023-01814-1
Alaa T (2020) Independent component analysis: an introduction. Appl Comput Inform 17:222–249. https://doi.org/10.1016/j.aci.2018.08.006
Argelaguet R, Velten B, Arnol D et al (2018) Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 14:e8124. https://doi.org/10.15252/msb.20178124
Armstrong MA (1983) Basic topology. Springer-Verlag
Ashuach T, Reidenbach DA, Gayoso A, Yosef N (2022) PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep Methods 2:100182. https://doi.org/10.1016/j.crmeth.2022.100182
Baek S, Lee I (2020) Single-cell ATAC sequencing analysis: from data preprocessing to hypothesis generation. Comput Struct Biotechnol J 18:1429–1439
Bartosovic M, Kabbe M, Castelo-Branco G (2021) Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat Biotechnol 39:825–835. https://doi.org/10.1038/s41587-021-00869-9
Bentsen M, Goymann P, Schultheis H et al (2020) ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat Commun 11:4267. https://doi.org/10.1038/s41467-020-18035-1
Bergen V, Soldatov RA, Kharchenko PV, Theis FJ (2021) RNA velocity-current challenges and future perspectives. Mol Syst Biol 17:e10282. https://doi.org/10.15252/msb.202110282
Beri K, Milgraum SS (2016) Rhyme and reason: the role of circadian rhythms in skin and its implications for physicians. Future Sci OA 2:FSO115. https://doi.org/10.4155/fsoa-2016-0007
Blibaum A, Werner J, Dobin A (2019) STARsolo: single-cell RNA-seq analyses beyond gene expression
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech 2008:P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008
Bravo González-Blas C, Minnoye L, Papasokrati D et al (2019) cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat Methods 16:397–400. https://doi.org/10.1038/s41592-019-0367-1
Brbić M, Cao K, Hickey JW et al (2022) Annotation of spatially resolved single-cell data with STELLAR. Nat Methods 19:1411–1418. https://doi.org/10.1038/s41592-022-01651-8
Brennecke P, Anders S, Kim JK et al (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods 10:1093–1095. https://doi.org/10.1038/nmeth.2645
Buenrostro JD, Wu B, Litzenburger UM et al (2015) Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523:486–490. https://doi.org/10.1038/nature14590
Buettner F, Pratanwanich N, McCarthy DJ et al (2017) f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol 18:212. https://doi.org/10.1186/s13059-017-1334-8
Buschmann T, Bystrykh LV (2013) Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics 14:272. https://doi.org/10.1186/1471-2105-14-272
Cai C, Wang Y (2020) A note on over-smoothing for graph neural networks. arXiv [cs.LG]
Cao J, Spielmann M, Qiu X et al (2019) The single-cell transcriptional landscape of mammalian organogenesis. Nature 566:496–502. https://doi.org/10.1038/s41586-019-0969-x
Carpenter B, Gelman A, Hoffman M et al (2017) Stan: a probabilistic programming language. J Stat Softw 76:1–32. https://doi.org/10.18637/jss.v076.i01
Chen X, Miragaia RJ, Natarajan KN, Teichmann SA (2018) A rapid and robust method for single cell chromatin accessibility profiling. Nat Commun 9:5345. https://doi.org/10.1038/s41467-018-07771-0
Chidester B, Zhou T, Alam S, Ma J (2023) SPICEMIX enables integrative single-cell spatial modeling of cell identity. Nat Genet 55:78–88. https://doi.org/10.1038/s41588-022-01256-z
Cusanovich DA, Hill AJ, Aghamirzaie D et al (2018) A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174:1309–1324.e18. https://doi.org/10.1016/j.cell.2018.06.052
Cui H, Wang C, Maan H, Wang B (2023) scGPT: towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv 2023.04.30.538439. https://doi.org/10.1101/2023.04.30.538439
Datlinger P, Rendeiro AF, Boenke T et al (2021) Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing. Nat Methods 18:635–642
Debnath L, Mikusinski P (2005) Introduction to Hilbert spaces with applications. Academic Press
DeBruine ZJ, Melcher K, Triche TJ (2021) Fast and robust non-negative matrix factorization for single-cell experiments. bioRxiv 2021.09.01.458620. https://doi.org/10.1101/2021.09.01.458620
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv [cs.CL]
Di L, Liu B, Lyu Y et al (2022) Rapid and sensitive single-cell RNA sequencing with SHERRY2. BMC Biol 20:213. https://doi.org/10.1186/s12915-022-01416-x
Ding J, Regev A (2021) Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces. Nat Commun 12:2554. https://doi.org/10.1038/s41467-021-22851-4
Druzd D, Matveeva O, Ince L et al (2017) Lymphocyte circadian clocks control lymph node trafficking and adaptive immune responses. Immunity 46:120–132. https://doi.org/10.1016/j.immuni.2016.12.011
Eie M, Chang ST, Kuttler K (2023) A first course in linear algebra. Lyryx Learning Inc, Calgary, Alberta, Canada
Elmarakeby HA, Hwang J, Arafeh R et al (2021) Biologically informed deep neural network for prostate cancer discovery. Nature 598:348–352. https://doi.org/10.1038/s41586-021-03922-4
Eraslan G, Simon LM, Mircea M et al (2019) Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 10:390. https://doi.org/10.1038/s41467-018-07931-2
Finak G, McDavid A, Yajima M et al (2015) MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16:278. https://doi.org/10.1186/s13059-015-0844-5
Foucault M (1995) Discipline and punish: the birth of the prison. Knopf Doubleday Publishing Group
Gawad C, Koh W, Quake SR (2016) Single-cell genome sequencing: current state of the science. Nat Rev Genet 17:175–188. https://doi.org/10.1038/nrg.2015.16
Gayoso A, Steier Z, Lopez R et al (2021) Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 18:272–282. https://doi.org/10.1038/s41592-020-01050-x
Gayoso A, Lopez R, Xing G et al (2022) A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol 40:163–166. https://doi.org/10.1038/s41587-021-01206-w
Gelman A, Hill J (2006) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942
Glüer K (2011) Donald Davidson: a short introduction. Oxford University Press, USA. https://doi.org/10.1093/acprof:osobl/9780195382976.001.0001
Golding I, Paulsson J, Zawilski SM, Cox EC (2005) Real-time kinetics of gene activity in individual bacteria. Cell 123:1025–1036. https://doi.org/10.1016/j.cell.2005.09.031
Granja JM, Corces MR, Pierce SE et al (2021) ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 53:403–411. https://doi.org/10.1038/s41588-021-00790-6
Gu H, Cheng H, Ma A et al (2022) scGNN 2.0: a graph neural network tool for imputation and clustering of single-cell RNA-Seq data. Bioinformatics 38:5322–5325. https://doi.org/10.1093/bioinformatics/btac684
Hagai T, Chen X, Miragaia RJ et al (2018) Gene expression variability across cells and species shapes innate immunity. Nature 563:197–202. https://doi.org/10.1038/s41586-018-0657-2
Hagemann-Jensen M, Ziegenhain C, Chen P et al (2020) Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol 38:708–714. https://doi.org/10.1038/s41587-020-0497-0
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36:421–427. https://doi.org/10.1038/nbt.4091
Haig D (2020) From Darwin to Derrida: selfish genes, social selves, and the meanings of life. MIT Press. https://doi.org/10.7551/mitpress/12384.001.0001
Hashimshony T, Senderovich N, Avital G et al (2016) CEL-Seq2: sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol 17:77. https://doi.org/10.1186/s13059-016-0938-8
Henriksson J, Chen X, Gomes T et al (2019) Genome-wide CRISPR screens in T helper cells reveal pervasive crosstalk between activation and differentiation. Cell 176:882–896.e18. https://doi.org/10.1016/j.cell.2018.11.044
Henriksson J, Lundh T, Wennberg B (2010) A model of sympatric speciation through reinforcement. Kinet Relat Models 3:143–163. https://doi.org/10.3934/krm.2010.3.143
Huang Y, McCarthy DJ, Stegle O (2019) Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference. Genome Biol 20:273. https://doi.org/10.1186/s13059-019-1865-2
Islam S, Kjällquist U, Moliner A et al (2012) Highly multiplexed and strand-specific single-cell RNA 5’ end sequencing. Nat Protoc 7:813–828. https://doi.org/10.1038/nprot.2012.022
Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589. https://doi.org/10.1038/s41586-021-03819-2
Karabacak Calviello A, Hirsekorn A, Wurmus R et al (2019) Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol 20:42. https://doi.org/10.1186/s13059-019-1654-y
Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differential expression analysis. Nat Methods 11:740–742. https://doi.org/10.1038/nmeth.2967
Kim JK, Marioni JC (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol 14:R7. https://doi.org/10.1186/gb-2013-14-1-r7
Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv [stat.ML]
Klein AM, Mazutis L, Akartuna I et al (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161:1187–1201. https://doi.org/10.1016/j.cell.2015.04.044
Kleshchevnikov V, Shmatko A, Dann E et al (2022) Cell2location maps fine-grained cell types in spatial transcriptomics. Nat Biotechnol 40:661–671
Korsunsky I, Millard N, Fan J et al (2019) Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16:1289–1296. https://doi.org/10.1038/s41592-019-0619-0
Kozak M (1991) Structural features in eukaryotic mRNAs that modulate the initiation of translation. J Biol Chem 266:19867–19870. https://doi.org/10.1016/S0021-9258(18)54860-2
Kuchina A, Brettner LM, Paleologu L et al (2021) Microbial single-cell RNA sequencing by split-pool barcoding. Science 371. https://doi.org/10.1126/science.aba5257
Kumar S, Bentley PJ (eds) (2003) On growth, form and computers, 1st edn. Academic Press. https://doi.org/10.1016/B9780124287655/500347
La Manno G, Soldatov R, Zeisel A et al (2018) RNA velocity of single cells. Nature 560:494–498. https://doi.org/10.1038/s41586-018-0414-6
Lange M, Bergen V, Klein M et al (2022) CellRank for directed single-cell fate mapping. Nat Methods 19:159–170. https://doi.org/10.1038/s41592-021-01346-6
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359. https://doi.org/10.1038/nmeth.1923
Larsson AJM, Johnsson P, Hagemann-Jensen M et al (2019) Genomic encoding of transcriptional burst kinetics. Nature 565:251–254. https://doi.org/10.1038/s41586-018-0836-1
Lee J, Hyeon DY, Hwang D (2020) Single-cell multiomics: technologies and data analysis methods. Exp Mol Med 52:1428–1442. https://doi.org/10.1038/s12276-020-0420-2
Liu B, Li Y, Zhang L (2021) Analysis and visualization of spatial transcriptomic data. Front Genet 12:785290. https://doi.org/10.3389/fgene.2021.785290
Liu X, Shen Q, Zhang S (2023) Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network. Genome Res 33:96–111. https://doi.org/10.1101/gr.276868.122
Lopez R, Regier J, Cole MB et al (2018) Deep generative modeling for single-cell transcriptomics. Nat Methods 15:1053–1058. https://doi.org/10.1038/s41592-018-0229-2
Lotfollahi M, Naghipourfar M, Luecken MD et al (2022) Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 40:121–130. https://doi.org/10.1038/s41587-021-01001-7
Lotfollahi M, Rybakov S, Hrovatin K et al (2023) Biologically informed deep learning to query gene programs in single-cell atlases. Nat Cell Biol 25:337–350. https://doi.org/10.1038/s41556-022-01072-x
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:550. https://doi.org/10.1186/s13059-014-0550-8
Ludwig LS, Lareau CA, Ulirsch JC et al (2019) Lineage tracing in humans enabled by mitochondrial mutations and single-cell genomics. Cell 176:1325–1339.e22. https://doi.org/10.1016/j.cell.2019.01.022
Macosko EZ, Basu A, Satija R et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
McElreath R (2020) Statistical rethinking: a Bayesian course with examples in R and Stan. CRC Press. https://doi.org/10.1201/9780429029608
McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML]. https://doi.org/10.21105/joss.00861
Mimitou EP, Cheng A, Montalbano A et al (2019) Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat Methods 16:409–412. https://doi.org/10.1038/s41592-019-0392-0
Mimitou EP, Lareau CA, Chen KY et al (2021) Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat Biotechnol 39:1246–1258. https://doi.org/10.1038/s41587-021-00927-2
Nagano T, Lubling Y, Stevens TJ et al (2013) Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502:59–64. https://doi.org/10.1038/nature12593
Norris JR (1997) Markov chains. Cambridge University Press
Osumi-Sutherland D, Xu C, Keays M et al (2021) Cell type ontologies of the Human Cell Atlas. Nat Cell Biol 23:1129–1135. https://doi.org/10.1038/s41556-021-00787-7
Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Basic Books
Pearson K (1901) LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philo Mag J Sci 2:559–572. https://doi.org/10.1080/14786440109462720
Peidli S, Green TD, Shen C et al (2022) scPerturb: information resource for harmonized single-cell perturbation data. bioRxiv 2022.08.20.504663. https://doi.org/10.1101/2022.08.20.504663
Peterson VM, Zhang KX, Kumar N et al (2017) Multiplexed quantification of proteins and transcripts in single cells. Nat Biotechnol 35:936–939. https://doi.org/10.1038/nbt.3973
Picelli S, Björklund ÅK, Faridani OR et al (2013) Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods 10:1096–1098. https://doi.org/10.1038/nmeth.2639
Pickstone JV (2001) Ways of knowing: a new history of science, technology, and medicine. University of Chicago Press
Pierce SE, Granja JM, Greenleaf WJ (2021) High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer. Nat Commun 12:2969. https://doi.org/10.1038/s41467-021-23213-w
Polański K, Young MD, Miao Z et al (2020) BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36:964–965. https://doi.org/10.1093/bioinformatics/btz625
Qiu Q, Hu P, Qiu X et al (2020) Massively parallel and time-resolved RNA sequencing in single cells with scNT-seq. Nat Methods 17:991–1001. https://doi.org/10.1038/s41592-020-0935-4
Qiu X, Zhang Y, Martin-Rufino JD et al (2022) Mapping transcriptomic vector fields of single cells. Cell 185:690–711.e45. https://doi.org/10.1016/j.cell.2021.12.045
Raj A, Peskin CS, Tranchina D et al (2006) Stochastic mRNA synthesis in mammalian cells. PLoS Biol 4:e309. https://doi.org/10.1371/journal.pbio.0040309
Ramesh A, Pavlov M, Goh G et al (2021) Zero-shot text-to-image generation. arXiv [cs.CV]
Reiser P, Neubert M, Eberhard A et al (2022) Graph neural networks for materials science and chemistry. Commun Mater 3:93. https://doi.org/10.1038/s43246-022-00315-6
Replogle JM, Norman TM, Xu A et al (2020) Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat Biotechnol.
Replogle JM, Saunders RA, Pogson AN et al (2022) Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185:2559–2575.e28. https://doi.org/10.1016/j.cell.2022.05.013
Risso D, Ngai J, Speed TP, Dudoit S (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32:896–902. https://doi.org/10.1038/nbt.2931
Rizvi AH, Camara PG, Kandror EK et al (2017) Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol 35:551–560. https://doi.org/10.1038/nbt.3854
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140. https://doi.org/10.1093/bioinformatics/btp616
Rosenberg AB, Roco CM, Muscat RA et al (2018) Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360:176–182. https://doi.org/10.1126/science.aam8999
Saelens W, Cannoodt R, Todorov H, Saeys Y (2019) A comparison of single-cell trajectory inference methods. Nat Biotechnol 37:547–554. https://doi.org/10.1038/s41587-019-0071-9
Salmen F, De Jonghe J, Kaminski TS et al (2022) High-throughput total RNA sequencing in single cells using VASA-seq. Nat Biotechnol. https://doi.org/10.1038/s41587-022-01361-8
Satija R, Farrell JA, Gennert D et al (2015) Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33:495–502. https://doi.org/10.1038/nbt.3192
Schwartz GW, Zhou Y, Petrovic J et al (2020) TooManyCells identifies and visualizes relationships of single-cell clades. Nat Methods 17:405–413. https://doi.org/10.1038/s41592-020-0748-5
Smith T, Heger A, Sudbery I (2017) UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res 27:491–499. https://doi.org/10.1101/gr.209601.116
Sonnhammer ELL, Östlund G (2015) InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 43:D234–D239. https://doi.org/10.1093/nar/gku1203
Squair JW, Gautier M, Kathe C et al (2021) Confronting false discoveries in single-cell differential expression. Nat Commun 12:5692. https://doi.org/10.1038/s41467-021-25960-2
Srivastava A, Malik L, Smith T et al (2019) Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol 20:65. https://doi.org/10.1186/s13059-019-1670-y
Stevens TJ, Lando D, Basu S et al (2017) 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature 544:59–64. https://doi.org/10.1038/nature21429
Stoeckius M, Hafemeister C, Stephenson W et al (2017) Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 14:865–868. https://doi.org/10.1038/nmeth.4380
Street K, Risso D, Fletcher RB et al (2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19:477. https://doi.org/10.1186/s12864-018-4772-0
Stuart T, Srivastava A, Madad S et al (2021) Single-cell chromatin state analysis with Signac. Nat Methods 18:1333–1341. https://doi.org/10.1038/s41592-021-01282-5
Svensson V (2020) Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38:147–150. https://doi.org/10.1038/s41587-019-0379-5
Svensson V, Teichmann SA, Stegle O (2018) SpatialDE: identification of spatially variable genes. Nat Methods 15:343–346. https://doi.org/10.1038/nmeth.4636
Svensson V, Gayoso A, Yosef N, Pachter L (2020) Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36:3418–3421. https://doi.org/10.1093/bioinformatics/btaa169
Tedesco M, Giannese F, Lazarević D et al (2022) Chromatin Velocity reveals epigenetic dynamics by single-cell profiling of heterochromatin and euchromatin. Nat Biotechnol 40:235–244. https://doi.org/10.1038/s41587-021-01031-1
Traag VA, Waltman L, van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9:5233. https://doi.org/10.1038/s41598-019-41695-z
Tran HTN, Ang KS, Chevrier M et al (2020) A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21:12. https://doi.org/10.1186/s13059-019-1850-9
Trapnell C, Cacchiarelli D, Grimsby J et al (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32:381–386. https://doi.org/10.1038/nbt.2859
Van den Berge K, Roux de Bézieux H, Street K et al (2020) Trajectory-based differential expression analysis for single-cell sequencing data. Nat Commun 11:1201. https://doi.org/10.1038/s41467-020-14766-3
VanInsberghe M, van den Berg J, Andersson-Rolf A et al (2021) Single-cell Ribo-seq reveals cell cycle-dependent translational pausing. Nature 597:561–565. https://doi.org/10.1038/s41586-021-03887-4
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. arXiv [cs.CL]
Vitak SA, Torkenczy KA, Rosenkrantz JL et al (2017) Sequencing thousands of single-cell genomes with combinatorial indexing. Nat Methods 14:302–308. https://doi.org/10.1038/nmeth.4154
Wang J, Ma A, Chang Y et al (2021) scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat Commun 12:1882. https://doi.org/10.1038/s41467-021-22197-x
Wikipedia contributors (2022) Birthday problem. In: Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Birthday_problem&oldid=1125950199
Wittgenstein L (1998) Philosophical investigations, 4th edn. Wiley-Blackwell
Wolf FA, Angerer P, Theis FJ (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19:15. https://doi.org/10.1186/s13059-017-1382-0
Wolf FA, Hamey FK, Plass M et al (2019) PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 20:59. https://doi.org/10.1186/s13059-019-1663-x
Wolock SL, Lopez R, Klein AM (2019) Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst 8:281–291.e9. https://doi.org/10.1016/j.cels.2018.11.005
Xiong L, Xu K, Tian K et al (2019) SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat Commun 10:4576. https://doi.org/10.1038/s41467-019-12630-7
Xu W, Yang W, Zhang Y et al (2022) ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat Methods 19:1243–1249. https://doi.org/10.1038/s41592-022-01601-4
Yan F, Powell DR, Curtis DJ, Wong NC (2020) From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biol 21:1–16. https://doi.org/10.1186/s13059-020-1929-3
Yang F, Wang W, Wang F et al (2022) scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 4:852–866. https://doi.org/10.1038/s42256-022-00534-z
Young MD, Behjati S (2020) SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. Gigascience 9:giaa151. https://doi.org/10.1093/gigascience/giaa151
Zhang Y, Liu T, Meyer CA et al (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9:R137. https://doi.org/10.1186/gb-2008-9-9-r137
Zhang R, Zhou T, Ma J (2022) Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol 40:254–261. https://doi.org/10.1038/s41587-021-01034-y
Funding
Open access funding provided by Umeå University. I.S.M. is supported by the Umeå industrial doctoral school (Företagsforskarskolan) of Umeå University, and by Sartorius. J.H. is supported by Vetenskapsrådet grant number 2021-06602.
Ethics declarations
Ethical approval
Not applicable to this review.
Consent to participate
Not applicable to this review.
Consent for publication
Not applicable to this review.
Conflict of interest
I.S.M. is partially funded by Sartorius. The other authors declare no conflict of interest.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mihai, I.S., Chafle, S. & Henriksson, J. Representing and extracting knowledge from single-cell data. Biophys Rev 16, 29–56 (2024). https://doi.org/10.1007/s12551-023-01091-4
DOI: https://doi.org/10.1007/s12551-023-01091-4