Introduction

Single-cell technologies have emerged as powerful methods for unraveling tissue cellular heterogeneity and studying molecular phenotypes at the resolution of individual cells. Despite their remarkable potential, extracting meaningful biological insights from single-cell readouts still poses major analytical challenges. Such challenges arise from the need to integrate multiple interrelated tasks, including data normalization1, denoising2, and inter-sample harmonization and/or identification of a shared low-dimensional space3,4. These processing steps are intertwined with the analytical goals of comparing biological conditions5 or experimental perturbations6, extracting pathway-level activity metrics, or studying gene regulatory networks (GRNs)7,8.

While existing methodologies have made significant strides in addressing these challenges individually, a model that can unify all these concepts into a single framework has remained elusive. For instance, the most common workflow for differential gene expression (DGE) analysis requires sequential application of normalization, inter-sample integration, and cell-type/cluster identification, followed by a pseudo-bulk DGE analysis for each cell type across conditions of interest5,9,10. In addition to being limited to the analysis of discrete cell clusters, this sequential approach ignores the interplay between inter-sample integration and DGE analysis: integration depends on gene expression shifts across conditions, and DGE identification depends on integration. This is also true for other downstream analyses, such as pathway and GRN activity estimation, that are mutually influenced by normalization, low-dimensional projection, and integration steps. For example, incorporating prior biological information, in the form of gene networks or functionally related gene sets, may help identify interpretable latent factors in expression data and deconvolve biological variability from technical noise8.

Furthermore, existing tools are primarily designed to perform such analyses on the gene expression space or other modalities in which the biological quantity of interest, e.g., mRNA abundance, is connected to a single type of observation, e.g., unique molecular identifier (UMI) counts. Many biological processes, however, are better measured as the ratio of two quantities. For example, alternative cassette exon splicing is commonly quantified as the ratio of abundances of isoforms that include the exon vs. isoforms in which the exon is skipped11. As another example, the ratio of spliced to unspliced transcript abundances is informative about the processing and/or decay rates of mRNAs12,13. Analysis of such ratio-based modalities is particularly challenging when the observed data from the two opposing quantities are sparse (e.g., sparse UMI counts for each of the spliced and unspliced forms of a transcript). Few methods can perform dimensionality reduction on the latent space of ratio-based modalities14 and, to our knowledge, no method exists for their inter-sample harmonization or GRN analysis.

Here, we introduce Gene Expression Decomposition and Integration (GEDI), a framework for multi-sample, multi-condition single-cell analysis. GEDI incorporates various single-cell analysis steps within a unified Bayesian framework that includes data integration across samples/conditions, data imputation and denoising, cluster-free DGE analysis, as well as pathway and GRN activity analysis. GEDI is competitive with other top-performing integration tools, while being uniquely capable of deconvolving the effects of multiple technical and/or biological sources of sample-to-sample variability. This ability enables a natural and efficient approach for cluster-free DGE analysis by identifying the transcriptomic vector field associated with sample-level variables. Furthermore, by incorporating information about gene sets, it identifies axes of heterogeneity that are aligned with prior biological knowledge, thus enabling single-cell projection of pathway and/or GRN activities as well as their direction of change (gradients). Finally, GEDI is the first single-cell analysis framework to expand all these concepts from the gene expression space to the analysis of ratio-based modalities, including the analysis of the latent spaces of alternative splicing and RNA stability.

Results

The GEDI framework

We formulate multi-sample scRNA-seq analysis as the identification of sample-specific, invertible decoder functions such that the decoder function of each sample can reconstruct the expected expression profile of each cell from a (low-dimensional) representation of its biological state (Fig. 1a). Subsequently, correspondences between cells of different samples can be established based on the similarity of their biological states, as given by the inverse of the sample-specific decoders (encoders), enabling horizontal integration of single-cell data across different samples. We further constrain the sample-specific decoders to be from the same family of functions, while allowing for sample-specific parameterizations (Fig. 1b).
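The decoder/encoder idea can be illustrated with a minimal sketch, assuming (for illustration only) that each sample's decoder is an invertible affine map between a two-dimensional state and a two-gene profile. In real data the state is low-dimensional and the profile high-dimensional, so invertibility would hold only on the manifold. All parameter values and function names here are hypothetical and are not taken from the GEDI implementation.

```python
# Illustrative only: two samples share a biological state b but have
# different (hypothetical) affine decoders psi_i(b) = o_i + Z_i b.

def decode(o, Z, b):
    """Map a 2-D biological state b to an expected expression profile."""
    return [o[0] + Z[0][0] * b[0] + Z[0][1] * b[1],
            o[1] + Z[1][0] * b[0] + Z[1][1] * b[1]]

def encode(o, Z, y):
    """Invert the decoder: recover the state b from an expression profile y."""
    det = Z[0][0] * Z[1][1] - Z[0][1] * Z[1][0]
    if abs(det) < 1e-12:
        raise ValueError("decoder is not invertible")
    d0, d1 = y[0] - o[0], y[1] - o[1]
    return [(Z[1][1] * d0 - Z[0][1] * d1) / det,
            (-Z[1][0] * d0 + Z[0][0] * d1) / det]

# Hypothetical sample-specific parameter sets (theta_1 and theta_2).
o1, Z1 = [1.0, 0.0], [[1.0, 0.2], [0.0, 1.0]]
o2, Z2 = [0.5, 0.8], [[0.9, 0.0], [0.3, 1.1]]

b = [0.7, -0.4]            # one biological state
y1 = decode(o1, Z1, b)     # its expected profile in sample 1
y2 = decode(o2, Z2, b)     # its expected profile in sample 2

# The profiles differ between samples, but each sample's encoder maps its
# profile back to the same state, which is what lets cells be matched.
b1 = encode(o1, Z1, y1)
b2 = encode(o2, Z2, y2)
```

In this toy setting, integration amounts to comparing `b1` and `b2` rather than `y1` and `y2`.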

Fig. 1: Overview of GEDI.

a Schematic representation of a two-sample single-cell analysis: cells from each sample are distributed near a unique manifold determined by the sample-specific decoder functions ψ1 and ψ2 (each dot represents one cell, with coordinates representing gene expression measurements). These invertible functions provide a mapping (represented by grey arrows) from the biological state of each cell (b) to the observed gene expression profile of the cell in each sample. b Using a parametric family of functions, the decoder functions can be defined with sample-specific parameters (θ). c The sample-specific parameter, θ, can be expressed as a probabilistic function of sample-level variable h (left). Thus, for any given sample characteristic, a distribution of decoder functions can be obtained (right). d The derivative of ψ with respect to h forms a vector field, representing the change in expression of each cell at the biological state b as h changes (differential expression). e GEDI learns a “reference” manifold in the form of a hyperplane or hyperellipsoid. Here, the manifold is represented by an ellipse, defined by its center o and principal axes (vectors z1 and z2). Red and blue shaded areas represent cell types and vector yn represents the expression profile of cell n. f Transformations of the center and principal axes can distort the reference manifold. Different samples can have different distortions, resulting in non-alignment of cells. g GEDI can model the distortions as a linear function of sample-level variables (represented by hi for each sample i). h The z vectors in the reference manifold can be modeled as probabilistic functions of prior information C. i Principal axes aligned to transcription factor (TF) regulons. The color gradient indicates the projected regulon activity of TF1. 
j The GEDI model can be fitted to different types of observations: when the expression of each gene g in each cell n is given (represented by yg,n; top), or when yg,n is latent and a probabilistic observation from yg,n is obtained (middle), or a pair of observations representing different event types whose relative proportion is of interest (bottom).

The decoder parameters can be optionally expressed as a probabilistic function of sample-level variables, resulting in a distribution of decoder functions for any given combination of sample characteristics (Fig. 1c). This formulation gives rise to a hierarchical generative model, in which a probabilistic function connects the characteristics of each sample to a distribution of decoder parameter sets. The parameter set of the sample is then drawn from this distribution, leading to a decoder function that connects the biological state of each cell to its expected gene expression profiles. This hierarchical model enables cluster-free differential gene expression analysis along the continuum of cell states, as we can examine how changes in sample-level variables impact the expected (mean) expression profile of any given biological cell state (Fig. 1d).
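As a worked illustration of how a sample-level variable yields a per-state differential expression vector, the sketch below assumes an affine decoder whose parameters depend linearly on a scalar variable h, and ignores the probabilistic (noise) part of the hierarchy. The parameter names (`o`, `do`, `Z`, `dZ`) and all values are hypothetical. Under these assumptions the derivative of the decoder with respect to h has a closed form, do + dZ b, which the code checks against a finite difference.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def decoder(b, h, o, do, Z, dZ):
    """Expected expression at state b for sample-level variable h,
    assuming parameters that are linear in h (an illustrative choice)."""
    Zh = [[z + h * dz for z, dz in zip(zr, dzr)] for zr, dzr in zip(Z, dZ)]
    oh = [c + h * d for c, d in zip(o, do)]
    return [c + s for c, s in zip(oh, matvec(Zh, b))]

def de_vector(b, do, dZ):
    """d psi / d h at state b: per-gene expression change per unit of h."""
    return [d + s for d, s in zip(do, matvec(dZ, b))]

# Hypothetical parameters: 3 genes, 2 latent dimensions.
o = [1.0, 2.0, 0.5]
do = [0.1, 0.0, -0.2]
Z = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dZ = [[0.2, 0.0], [0.0, -0.1], [0.0, 0.3]]

state_a = [0.5, -1.0]
state_b = [-1.0, 0.5]
de_a = de_vector(state_a, do, dZ)   # DE vector at one cell state
de_b = de_vector(state_b, do, dZ)   # a different state gives a different vector

# Finite difference between h = 1 and h = 0 (exact here, since psi is linear in h).
diff = [y1 - y0 for y1, y0 in zip(decoder(state_a, 1.0, o, do, Z, dZ),
                                  decoder(state_a, 0.0, o, do, Z, dZ))]
```

Because the vector depends on the state b, no clustering is needed to ask how expression changes with h at a particular cell state.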

We note that the decoder function of each sample effectively defines the manifold representing the observed cell expression profiles within that sample; therefore, this formulation holds the potential for extension to any parametric manifold learning approach. GEDI is a specific application of this general formulation, where the gene expression manifold of each sample is modeled as a hyperplane or hyperellipsoid, defined by a common (reference) set of principal axes (Fig. 1e) and sample-specific transformations of these axes (Fig. 1f). These sample-specific transformations can, in turn, be modeled as probabilistic functions of sample-level variables (Fig. 1g), enabling cluster-free analysis of the association between gene expression and sample characteristics. Optionally, the common coordinate frame can also be expressed as a probabilistic function of gene-level variables such as gene-set memberships (Fig. 1h), aligning the principal axes of the coordinate frame with prior biological knowledge and enabling the projection of pathway and regulatory network activities onto the cellular state space (Fig. 1i).

Finally, we can connect each point on the sample-specific manifolds to different types of observations using diverse data-generating distributions. This versatility enables gene expression analysis based on normalized or raw unique molecular identifier (UMI) counts, analysis of alternative splicing using counts of reads that support opposing splicing events, and analysis of RNA stability based on the UMI counts of spliced and unspliced transcripts (Fig. 1j). The GEDI model is fitted to these observation types using an expectation-maximization algorithm (see Supplementary Methods for details).
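To make the three observation types concrete, the following sketch writes one log-likelihood per type for a single latent value y. The text above specifies a binomial distribution only for paired counts; the Gaussian form for normalized values, the Poisson form with a log library-size offset for raw counts, and all function names are assumptions made for illustration, not a description of the GEDI implementation.

```python
import math

def gaussian_loglik(obs, y, sigma=1.0):
    """Normalized expression observed directly, with Gaussian noise around y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (obs - y) ** 2 / (2 * sigma ** 2)

def poisson_loglik(count, y, log_size=0.0):
    """Raw count with y as a latent log-rate; log_size is a hypothetical
    library-size offset."""
    log_rate = y + log_size
    return count * log_rate - math.exp(log_rate) - math.lgamma(count + 1)

def binomial_loglik(m_inc, m_exc, y):
    """Paired counts (e.g. inclusion vs. exclusion) with y as latent log-odds."""
    n = m_inc + m_exc
    log_p = -math.log1p(math.exp(-y))   # log sigmoid(y)
    log_q = -math.log1p(math.exp(y))    # log (1 - sigmoid(y))
    log_choose = math.lgamma(n + 1) - math.lgamma(m_inc + 1) - math.lgamma(m_exc + 1)
    return log_choose + m_inc * log_p + m_exc * log_q

# With 3 inclusion and 1 exclusion reads, the likelihood peaks at logit(3/4) = log(3).
peak = binomial_loglik(3, 1, math.log(3))
nearby = binomial_loglik(3, 1, math.log(3) + 0.5)
```

The point of the sketch is only that the same latent value y can be tied to different data types by swapping the likelihood term.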

GEDI captures different sources of sample-to-sample variability

To assess the ability of GEDI to capture sample-to-sample variability, we applied it to a dataset of peripheral blood mononuclear cells (PBMCs) from two donors15 profiled using different scRNA-seq technologies. We applied GEDI without including sample-level variable information, so the model was oblivious to the biological and technical characteristics of the samples. Examining the sample-specific transformations learned by GEDI revealed that, once the effect of technology is regressed out, they are more similar among samples that are from the same donor (Fig. 2a). Similarly, after regressing out the effect of donor, the sample-specific transformations cluster by the technology (Supplementary Fig. 1a). These results suggest that GEDI properly learns sample-specific parameters that capture different sources of inter-sample variability, including the biological differences between the two donors and the technical variability introduced by the use of different technologies. At the same time, intra-sample variability across the cells is preserved, as the projection of the cell state representations learned by GEDI shows clear separation of cell types without any obvious separation by sample (Fig. 2b).
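The post-hoc treatment of sample-specific parameters (removing the effect of one covariate, then ranking parameters by variance) can be sketched as follows. This is a toy illustration with made-up numbers: for a single categorical covariate, regressing it out is equivalent to subtracting the per-category mean. The function names are hypothetical, and the subsequent PCA and UMAP steps are omitted.

```python
from collections import defaultdict

def regress_out_category(rows, labels):
    """Subtract the per-category mean from every parameter (column).
    For one categorical covariate this equals the residual of an
    ordinary least-squares fit on one-hot indicators."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    means = {lab: [sum(col) / len(g) for col in zip(*g)]
             for lab, g in groups.items()}
    return [[x - m for x, m in zip(row, means[lab])]
            for row, lab in zip(rows, labels)]

def top_variable_columns(rows, k):
    """Indices of the k columns with the largest sample variance."""
    n = len(rows)
    scored = []
    for j, col in enumerate(zip(*rows)):
        mu = sum(col) / n
        scored.append((sum((x - mu) ** 2 for x in col) / (n - 1), j))
    return [j for _, j in sorted(scored, reverse=True)[:k]]

# Toy data: 4 samples x 3 parameters. Column 0 carries a technology effect,
# column 1 a donor effect, column 2 only small noise.
params = [[0.0, -1.0, 0.10],
          [0.0,  1.0, 0.12],
          [5.0, -1.0, 0.09],
          [5.0,  1.0, 0.11]]
technology = ["t1", "t1", "t2", "t2"]

before = top_variable_columns(params, 1)
residuals = regress_out_category(params, technology)
after = top_variable_columns(residuals, 1)
```

In this toy example the most variable parameter switches from the technology-driven column to the donor-driven column once technology is regressed out.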

Fig. 2: GEDI captures sample-to-sample variability.

a UMAP embedding of the sample-specific manifold distortions learned by GEDI for the PBMC dataset. Each sample was encoded using the set of sample-specific manifold parameters learned by GEDI (excluding sample-specific translation vectors Δoi), followed by regressing out the effect of technology from sample-specific parameters post-hoc, selection of the top 20 most variable parameters, PCA, and UMAP. Each dot represents one sample, labeled by donor (left) or single-cell technology (right). Only technologies with more than one sample are displayed. See Supplementary Fig. 1a for details and results when the effect of donor is regressed out, and Supplementary Fig. 2a for other choices of top variable features. b UMAP embedding of the cells in the PBMC dataset after integration with GEDI (K = 40). Each dot represents one cell, colored by the cell type labels from ref. 15 (left) or by sample (right). Also see Supplementary Fig. 1b-f. c Overall ranking score comparing the performance of various integration methods over a range of latent factors (K), applied to the PBMC, Pancreas and Tabula Muris datasets. The score reflects the ability to remove technical effects while preserving biological variability, similar to ref. 16 (see Methods and Supplementary Data 1 for details, and Supplementary Fig. 3 for additional comparisons). d PCA embedding of the sample-specific manifold distortions learned by GEDI for the COVID-19 dataset. Samples were first encoded using the sample-specific parameters, similar to (a), followed by regressing out the effect of cohort and selection of the top 20 most variable parameters for PCA. Each dot represents a sample, labeled by the disease group (left) or the cohort of origin (right). e Receiver operating characteristic (ROC) curves assessing the classification between COVID and control cases in the COVID-19 dataset. For the classification task, a Support Vector Machine (SVM) was trained using the top 20 most variable parameters learned by GEDI. 
Left: SVM was trained with data from cohort 2 and tested on cohort 1. Right: SVM was trained with cohort 1 data and tested on cohort 2. See also Supplementary Fig. 2b. Source data are provided as a Source Data file.

To quantitatively measure the ability of GEDI to separate intra-sample and inter-sample sources of variability, we compared it against a panel of existing single-cell integration methods using previously established metrics16,17 and three benchmarking references: the PBMC dataset described above, a pancreas dataset18,19 and the Tabula Muris20 dataset. Overall, we observed that GEDI was consistently among the top-performing methods, regardless of the number of the latent factors used for low-dimensional projection of data—an often arbitrary choice that affects the performance of most other methods (Fig. 2c and Supplementary Fig. 1b-f). These results suggest that the manifold transformations learned by GEDI explain most of the sample-to-sample variability while retaining the heterogeneity in the biological states of the cells.

Encouraged by the performance of GEDI on PBMC data, we applied it to a recent single-cell atlas of PBMCs that included healthy individuals as well as mild and severe COVID-19 cases from two separate cohorts21. Consistent with the results shown above, when modeling the gene expression manifold of the cells, GEDI learned sample-specific transformations that reflected the biological variability among samples, such as the COVID-19 status and its severity (Fig. 2d). In addition, we successfully trained support vector machine (SVM) models capable of predicting the disease status (COVID-19 vs. healthy) based on sample-specific transformations of the reference manifold. Cross-cohort validation analysis suggests that, when trained on cohort 2, the model perfectly distinguishes COVID-19 vs. healthy individuals in cohort 1; conversely, when the model is trained on cohort 1, it achieves an area under the receiver operating characteristic curve (AUROC) of 0.97 in cohort 2 (Fig. 2e). Interestingly, similar SVM models trained on pseudobulk-based features did not generalize well across cohorts (Supplementary Fig. 2b-c). Together, these results show that GEDI can capture most of the sample-to-sample variability present in multi-sample scRNA-seq datasets; this variability can then be directly traced back to sample characteristics (such as disease severity) owing to the parametric nature of GEDI’s modeling framework.
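The cross-cohort evaluation can be sketched as below. The AUROC is computed directly from its definition as the probability that a randomly chosen case scores higher than a randomly chosen control. The classifier is a simple difference-of-class-means linear scorer standing in for the SVM, which is a deliberate substitution to keep the example dependency-free. All feature values, labels, and function names are made up for illustration.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_direction(X, y):
    """Difference between class means (a stand-in for a trained SVM)."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu1 = mean([x for x, l in zip(X, y) if l == 1])
    mu0 = mean([x for x, l in zip(X, y) if l == 0])
    return [a - b for a, b in zip(mu1, mu0)]

def score(X, w):
    """Project each sample onto the learned direction."""
    return [sum(a * b for a, b in zip(x, w)) for x in X]

# Hypothetical sample-level features: train on one cohort, test on the other.
X_train = [[2.0, 1.0], [3.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
y_train = [1, 1, 0, 0]
X_test = [[2.5, 3.0], [0.2, -1.0], [1.5, 0.0], [1.0, 2.0]]
y_test = [1, 0, 0, 1]

w = fit_centroid_direction(X_train, y_train)
test_auroc = auroc(score(X_test, w), y_test)   # 3 of 4 pairs ranked correctly
```

Swapping the roles of the two cohorts gives the reverse direction of the validation.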

GEDI enables cluster-free differential expression analysis

To enable direct analysis of the relationship between sample-level variables and the gene expression space, we explicitly included them in the GEDI model, by expressing the sample-specific manifold transformations as probabilistic functions of sample-level variables. This model enables us to examine how the manifold and, therefore, the expression vector associated with any given cell state, changes with sample characteristics, providing a transcriptomic vector field for each sample-level variable (Fig. 3a and Supplementary Fig. 4). We applied this approach to the COVID-19 dataset, to obtain the transcriptomic vector field describing the differences between severe COVID-19 and healthy individuals across the PBMCs. Figure 3b, c provides a visual representation of the cell state space and the transcriptomic vector field of severe COVID-19 over that space. The largest vector magnitudes, corresponding to the cell states that show the largest overall gene expression shift, were observed in plasmablasts, HLA-DRlo S100Ahi monocytes, and neutrophils (Fig. 3d); the transcriptomic vector magnitudes observed in monocytes and neutrophils recapitulate the previously reported large cell state shifts in these cell types21.

Fig. 3: Incorporating sample-level variables in the GEDI model.

a Schematic of a transcriptomic vector field associated with a sample variable h. (b-c) UMAP embedding of the projection of the transcriptomic vector field of severe COVID-19 for cohort 1. b The position of each cell represents the manifold embedding for the control condition—the input for the UMAP was the low-dimensional projection of each cell on the manifold of the healthy group, as learned by GEDI (i.e., the reference manifold plus distortions associated with the control condition). The color indicates the cell type labels from the original study. See Supplementary Fig. 5a for cells colored by donor. c Each arrow represents a transcriptomic vector, showing the gene expression change that occurs from the control condition to the severe COVID-19 condition. Arrows are obtained by jointly embedding, in the UMAP space, the (extrapolated) gene expression profiles of each cell in the control and severe COVID-19 condition (see Supplementary Methods for details). d Boxplots showing the magnitude of the gene expression changes (transcriptomic vector magnitude), per cell type, between the control and severe COVID-19 conditions. e Comparison between the mean transcriptomic vector field per cell type, obtained from GEDI, and differential gene expression values (log fold-change) obtained from pseudo-bulk analysis. The heatmap shows the Pearson correlation values between the GEDI-based mean vectors (columns) and the pseudobulk-based DE vectors (rows) between cell type pairs, for the comparison of mild COVID-19 vs. control cases in cohort 1 (each element of the heatmap represents Pearson correlation across all genes). f Same as in (e) but showing reproducibility between cohort 1 (rows) and cohort 2 (columns) for the pseudo-bulk analysis. g Same as in e–f but showing reproducibility between cohort 1 (rows) and cohort 2 (columns) for GEDI. See also Supplementary Fig. 5b-d and Supplementary Fig. 6 for additional comparisons.

Calculation of the vector field provides a cluster-free approach for DGE analysis across the continuum of cell states. Nonetheless, it is also possible to perform a traditional cluster-based DGE analysis by summarizing the vector field for any given cell cluster using simple modifications. To showcase this, we calculated the mean COVID-19 transcriptomic vector representing the average shift in gene expression between mild COVID-19 and healthy individuals across all cells of each cell type. Comparison of these cell type-specific mean vectors to cell type-specific estimates from a pseudo-bulk DGE analysis revealed a high degree of agreement between the two approaches (Fig. 3e and Supplementary Fig. 5a-b). Interestingly, GEDI estimates showed improved reproducibility across cohorts compared to the pseudo-bulk approach (Fig. 3f, g and Supplementary Fig. 5c-d). Furthermore, we found that pseudo-bulk DGE estimates had a substantial correlation between cell types (Supplementary Fig. 6a), whereas GEDI estimates were highly cell type-specific (Supplementary Fig. 6b).
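Summarizing a per-cell vector field at the cluster level can be as simple as averaging the vectors of the cells assigned to each cluster. The sketch below assumes plain arithmetic averaging (the exact summarization used in the study is described in its Methods and may differ); the vectors, labels, and function name are hypothetical.

```python
from collections import defaultdict

def mean_vector_by_group(vectors, groups):
    """Average per-cell DE vectors within each group (e.g. cell type)."""
    sums = {}
    counts = defaultdict(int)
    for vec, g in zip(vectors, groups):
        if g not in sums:
            sums[g] = [0.0] * len(vec)
        sums[g] = [s + v for s, v in zip(sums[g], vec)]
        counts[g] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}

# Toy per-cell DE vectors over 3 genes, with hypothetical cell type labels.
cell_vectors = [[1.0, 0.0, -0.5],
                [0.5, 0.0, -0.5],
                [0.0, 2.0, 0.0]]
cell_types = ["monocyte", "monocyte", "B cell"]

mean_vectors = mean_vector_by_group(cell_vectors, cell_types)
```

The resulting per-cell-type vectors are the quantities that can then be compared with cluster-level estimates from another method.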

To systematically establish the performance of GEDI in cluster-free DGE analysis, we used GEDI to analyze a simulated cohort-level single-cell dataset, allowing us to compare GEDI’s inferences to a known ground truth for each individual cell. Our simulation framework, shown schematically in Fig. 4a, is based on a set of synthetic cellular archetypes22,23 whose expression vectors are determined by sample-level variables, along with additional sources of variation at the sample- and cell-level5,10. The parameters needed to simulate cells using this framework can be derived from a variety of sources; we decided to use a real scRNA-seq dataset (specifically, the COVID-19 dataset above) as the template to derive these parameters, in order to preserve characteristics such as gene-gene covariances (see Methods for details). We observed that single-cell DE vectors provided by GEDI correlate strongly with the ground-truth DE vector of each cell, with a slightly better performance when modeling the manifold as a hyperplane (median Pearson r = 0.4, Fig. 4b). In comparison, inferences made by LEMUR24, another recent method for cluster-free single-cell DE analysis, had significantly lower correlation with ground truth (Mann-Whitney U test, P < 10⁻¹⁵; median r = 0.14). At the level of each individual cell, we also defined a set of ground-truth “up-regulated” and “down-regulated” genes by thresholding the ground-truth DE values, and found that GEDI significantly outperforms LEMUR in the identification of up-regulated and down-regulated genes (median AUROC of 0.79 and 0.72 for identification of up-regulated genes by GEDI and LEMUR, and median AUROC of 0.78 and 0.55 for identification of down-regulated genes by GEDI and LEMUR, respectively; all comparisons are significant at P < 10⁻¹⁵; Fig. 4b). Another recent method, miloDE25, can also perform cluster-free differential analysis, albeit at the “neighborhood” level as opposed to single-cell level. To compare with miloDE, we collapsed the ground-truth DE vectors as well as GEDI’s and LEMUR’s inferences to the neighborhood level, by averaging across the cells of each neighborhood (with neighborhoods defined by miloDE). Figure 4c shows that GEDI outperforms both LEMUR and miloDE at the neighborhood level based on different metrics. Finally, we collapsed the ground truth DE vectors as well as GEDI’s and LEMUR’s inferences to the “cell type” level (see Methods), in order to compare to pseudobulk analysis results obtained by DESeq2. Again, we observed a better agreement between GEDI inferences and the ground truth compared to both LEMUR and DESeq2 (Fig. 4d). These results suggest that GEDI can effectively capture the differential expression of genes at the level of single cells, neighborhoods, and cell types.
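The per-cell evaluation can be sketched for a single cell as follows: the Pearson correlation between estimated and ground-truth DE values across genes, and the definition of ground-truth up-regulated and down-regulated gene sets by thresholding. The threshold of 0.3 on log2 DE values follows the legend of Fig. 4; whether the comparison is strict or inclusive is an assumption here, and the numbers are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")   # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)

def label_genes(truth, threshold=0.3):
    """Binary ground-truth labels: up if DE > threshold, down if DE < -threshold
    (strict inequalities are an assumption)."""
    up = [1 if t > threshold else 0 for t in truth]
    down = [1 if t < -threshold else 0 for t in truth]
    return up, down

# One simulated cell: ground-truth and estimated log2 DE values for 5 genes.
truth = [0.8, -0.5, 0.1, 0.0, -0.9]
estimate = [0.6, -0.2, 0.0, 0.1, -0.7]

r = pearson(estimate, truth)
up_labels, down_labels = label_genes(truth)
```

The binary labels serve as the positives when computing an AUROC for the estimated values (using the negated estimates for down-regulated genes).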

Fig. 4: Systematic comparison of cluster-free DGE methods using a simulated cohort-level single-cell dataset.

a Schematic representation of our framework to simulate cohort-level scRNA-seq data. We start by simulating the cell state manifold of each sample as a set of “archetypes” while constraining any cell state to some weighted average of those archetypes; thus, the cell states are confined within the polytope defined by the archetypes. In each sample, the gene expression vector of each archetype is determined by the sample-level variables (plus an additional variance that is not explained by sample characteristics), resulting in a ground-truth DE vector for each archetype and, by extension, any given cell state. Finally, each cell is drawn from a distribution centered on a given cell state in a given sample, followed by simulating UMI counts (see Methods for more details). b Single-cell-level performance on a simulated dataset. The simulation parameters were extracted from comparison of mild COVID-19 vs. control individuals (see Supplementary Methods). For each cell (n of cells = 86,549), the differential expression estimates were compared against the ground truth via Pearson correlation (top), or by assessing the classification of up-regulated (middle) or down-regulated (bottom) genes using AUROC. Sets of up-regulated and down-regulated genes were defined using a threshold of 0.3 on the log2 ground-truth DE values. For each metric, the violin plots and boxplots show the distribution across all cells; red triangle: mean, center line: median; box limits: upper and lower quartiles; whiskers: 1.5x the interquartile range. c Same as in b but at the neighborhood level (n of neighborhoods = 215). d Same as in b but at the cell type level (n of cell types = 20). See also Supplementary Fig. 7 for simulations based on comparison of severe COVID-19 vs. control.

Pathway and network activity projection with GEDI

GEDI can also incorporate prior biological knowledge, such as gene signatures, biological pathways, or GRNs into its model, by expressing the manifold principal axes as probabilistic functions of prior gene-set associations (gene signatures, pathways, and GRNs can be represented by a weighted gene-set association matrix, similar to previous work8). As a result, principal axes that can be expressed as a linear combination of one or several gene sets/signatures are deemed more likely by the model, encouraging their alignment with prior knowledge, and allowing the projection of the “activity” of known biological axes onto the cellular states (Fig. 1i). To assess the reliability of these projected activities, we examined GEDI’s ability to project cell type signatures across PBMCs. First, using DGE analysis of the PBMC benchmarking dataset, we generated cell-type signatures for each of the two donors (see Methods for details). Then, for each donor, we applied GEDI using the signatures from the other donor as prior biological knowledge. In both cases, the inferred activity of cell type signatures showed strong enrichment for the true cell type labels (Fig. 5a and Supplementary Fig. 8).
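The link between gene-set-aligned axes and projected activities can be illustrated with a small noise-free example. It assumes the principal axes Z are exactly a linear combination of gene sets, Z = C A, where C is a genes-by-sets membership matrix and A holds hypothetical coefficients; in the probabilistic model this would be only the prior expectation. Under that assumption the expression contribution Z b equals C (A b), so A b can be read as a per-cell gene-set activity. Names and values are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# C: 4 genes x 2 gene sets (genes 1-2 in set 1, genes 3-4 in set 2).
C = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0],
     [0.0, 1.0]]

# A: 2 gene sets x 3 latent axes (hypothetical coefficients).
A = [[1.0, 0.0, 0.5],
     [0.0, 2.0, -0.5]]

Z = matmul(C, A)                # axes that are fully explained by the gene sets
b = [0.2, -0.1, 0.4]            # latent state of one cell

activity = matvec(A, b)         # per-gene-set activity for this cell
expression = matvec(Z, b)       # per-gene contribution of the cell state
via_sets = matvec(C, activity)  # same quantity, routed through the gene sets
```

Here set 1 is active (0.4) and set 2 is repressed (-0.4) in the example cell, and the member genes inherit those values.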

Fig. 5: Modeling the manifold as a function of gene-level variables.

a Cell type signature projections obtained by GEDI are compared to the true labels in the PBMC dataset (donor 1). Heatmap shows AUROC values for differential enrichment of inferred cell type signatures from GEDI (rows) for each cell type (columns). See Methods for details. Also see Supplementary Fig. 8. b Examples showing single-cell projection of transcription factor (TF) regulon activities. Each UMAP shows the cells from the PBMC dataset, with the regulon activity of PAX5 (left) and TCF7 (right) shown using the color gradient. The UMAP embedding is identical to that shown in Fig. 2b. The cell type with the highest activity for each TF is highlighted. See Supplementary Fig. 9 and Supplementary Data 2 for cell type-specific activities of other TFs. c Schematic illustration of TF regulon activity gradient and its relationship to transcriptomic vector field associated with a sample variable. The color shows the regulon activity, with its gradient represented by the diagonal vector. The arrows within the square represent the vector field of sample-level variable h. The inset shows how the cosine similarity of a transcriptomic vector and the TF regulon activity vector can be obtained. d Comparison between the TF gradient vector field (rows) and the transcriptomic vector field of severe COVID-19 per cell type (columns). In each square, the color represents the mean cosine similarity of the COVID-19 vector and the TF gradient vector. The square size represents the mean CPM, per cell type, of the mRNA encoding each TF. Only TFs with Pearson correlation ≥0.25 between their inferred activity and their mRNA abundance are shown. e-g An example TF whose activity gradient correlates with the transcriptomic vector field of severe COVID-19 in monocytes. e UMAP representation of the transcriptomic vector field of severe COVID-19. The color shows the vector magnitude. (f) Gradient vector field of SPI1 activity. The color represents SPI1 activity. 
g The same UMAP as in (e-f), but the color represents the cell type labels. See Supplementary Fig. 12 for other examples. Source data are provided as a Source Data file.

Similarly, when a transcription factor (TF) regulatory network was used as prior biological knowledge (see Methods), we observed cell type-specific activity patterns for many TFs (Supplementary Fig. 9), including known lineage regulators such as PAX5 in B cells26 and TCF7 in CD4+ T cells27 (Fig. 5b). For 74 out of 89 TFs included in our regulatory network, the projected activity across cells correlated significantly with the decoded abundance of the mRNA encoding the TF (t-test for Pearson correlation, FDR < 0.001), further supporting GEDI’s inferences. The high correlation between the inferred activity of most TFs and their mRNA abundance can also be seen in the COVID-19 data, even when we stratify the cells by their cell type and by the disease condition of the donors (Supplementary Fig. 10). The ability of GEDI to infer TF activities is also supported by its performance on a dataset of single-cell TF perturbations28. As shown in Supplementary Fig. 11, for the TFs whose perturbation is associated with cell state shifts along the main axes of variation, GEDI-based activities are highly predictive of the TF perturbation status. For this subset of TFs, GEDI (hyperellipsoid) is in fact among the top performers compared to eight other methods we tested, with mean AUROC of 0.86 for distinguishing the cells in which a specific TF is perturbed from other cells. Given that GEDI only models the principal axes of variation as functions of the gene regulatory network, its ability to correctly model the TFs that cause cell state shifts along these axes is expected.

GEDI network activity projection also enables the calculation of a gradient vector for the activity of each TF, representing the direction of greatest increase in TF activity in the cell space (Fig. 5c). One can then compare the gradient vector of each TF to a given transcriptomic vector field, to examine whether in certain cellular states the transcriptomic vector field is aligned with the gradient vector of that TF (Fig. 5c). We used GEDI to infer the regulon activity of TFs in the COVID-19 dataset and, for each single cell, compared the activity gradient vector of each TF to the transcriptomic vector field of severe COVID-19. Interestingly, the transcriptomic vector field of severe COVID-19 correlated (or anti-correlated) strongly with the TF activity gradients in a cell type-specific pattern (Fig. 5d). In other words, in certain cell states, when we move from healthy to severe COVID-19, the direction of change in the gene expression coincides with the direction of greatest increase (or decrease) in the activity of specific TFs. For example, the activity gradient of a group of TFs showed high concordance with the transcriptomic vector field of severe COVID-19 in HLA-DRlo S100Ahi monocytes, including SPI1, CEBPA, and SP1 (Fig. 5e–g and Supplementary Fig. 12), suggesting that severe COVID-19 is accompanied by increased activity of these TFs in HLA-DRlo S100Ahi monocytes. Among these TFs, we observed strong up-regulation for SPI1 mRNA in monocytes in severe COVID-19 compared to healthy controls (pseudobulk DE and GEDI cluster-free DE analyses; Supplementary Fig. 13a), but not for the other two TFs. Nonetheless, for all three TFs, the direct targets whose expression was highly correlated with the expression of the TF were enriched in various immune-related pathways (Supplementary Fig. 13b).
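The comparison between an activity gradient and a transcriptomic vector reduces to a cosine similarity between two vectors at the same cell state. The sketch below assumes, for illustration, that both vectors live in the same low-dimensional state space and that TF activity is a linear function of the state; the activity function, its weights, and the example vectors are hypothetical. The gradient is obtained numerically so that the same code would work for a nonlinear activity function.

```python
import math

def cosine(u, v):
    """Cosine similarity; undefined (nan) if either vector has zero length."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return float("nan")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def numerical_gradient(f, b, eps=1e-6):
    """Central-difference gradient of a scalar function f at point b."""
    grad = []
    for k in range(len(b)):
        hi, lo = list(b), list(b)
        hi[k] += eps
        lo[k] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

# Hypothetical TF activity: linear in the 2-D cell state.
weights = [0.6, -0.8]

def tf_activity(b):
    return weights[0] * b[0] + weights[1] * b[1]

grad = numerical_gradient(tf_activity, [0.3, 0.1])

aligned = cosine(grad, [1.2, -1.6])     # condition shift along increasing activity
opposed = cosine(grad, [-0.6, 0.8])     # shift along decreasing activity
unrelated = cosine(grad, [0.8, 0.6])    # shift orthogonal to the gradient
```

Values near 1 or -1 correspond to the correlated and anti-correlated cases described above, and values near 0 to a condition effect unrelated to that TF.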

Modeling the latent space of RNA splicing and stability with GEDI

In contrast to the analysis of mRNA abundance, where for each cell and each feature a single quantity is recorded (e.g., the UMI count), analysis of many other biological processes requires working with the ratio of two quantities. For example, analysis of alternative splicing of cassette exons involves modeling the percent-spliced-in (PSI), representing the ratio between the abundances of isoforms in which the cassette exon is included vs. excluded11 (see Supplementary Fig. 14 for other examples). Such analysis is further complicated, especially in single-cell data, by the fact that the quantities whose ratio is of interest are latent, and instead some probabilistic observation is obtained (e.g., UMI counts of inclusion or exclusion isoforms). By using a hierarchical model in which the latent profile of each cell is connected to the observed data through a binomial data-generating distribution (Fig. 6a), GEDI extends the analyses described in the previous sections to paired quantities whose ratio is of interest.
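The benefit of treating the log-odds as latent can be seen in a single-feature sketch. It assumes a Gaussian prior on the latent log-odds centered on a value predicted by the rest of the model (`prior_mean`, hypothetical) and finds the posterior mode under the binomial likelihood. Because the log-posterior is concave, its gradient is strictly decreasing and the mode can be located by bisection. Function names and defaults are illustrative and are not GEDI's fitting procedure.

```python
import math

def sigmoid(y):
    """Numerically stable logistic function."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

def map_log_odds(m_inc, m_exc, prior_mean, prior_sd=1.0, iters=100):
    """Posterior mode of latent log-odds y given inclusion/exclusion counts
    and a Gaussian prior N(prior_mean, prior_sd^2), found by bisection on
    the gradient of the log-posterior."""
    n = m_inc + m_exc
    var = prior_sd ** 2

    def grad(y):
        return m_inc - n * sigmoid(y) - (y - prior_mean) / var

    # The gradient is positive at lo and negative at hi, so the root is bracketed.
    lo = prior_mean - n * var - 1.0
    hi = prior_mean + n * var + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

no_reads = map_log_odds(0, 0, prior_mean=1.5)       # falls back to the prior mean
one_read = map_log_odds(1, 0, prior_mean=0.0)       # finite, unlike logit(1/1)
deep = map_log_odds(900, 100, prior_mean=0.0)       # close to logit(0.9)
```

With no reads the estimate equals the model-based expectation, with a single read it moves only modestly, and with deep coverage it approaches the naive logit of the observed ratio.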

Fig. 6: Modeling the latent space of RNA splicing and stability with GEDI.

a Schematic representation of modeling exon inclusion levels. The log-odds of inclusion/exclusion of exon g in cell n (yg,n), which is unobserved (latent), is connected to the observed counts of reads that support the inclusion of that exon (mg,n) or its exclusion (m’g,n) through a binomial distribution; the distribution parameter pg,n represents the percent-spliced-in (PSI). GEDI models the latent space of exon inclusion/exclusion across cells and samples similar to Fig. 1. b UMAP embedding of the latent splicing space of mouse cortical cells after integration of data from two studies29,30. Labels represent the study of origin (left) or the cell type labels from the original study (right). c Alignment score for removal of the technical variation between the two datasets using different methods. GEDI was applied either assuming a binomial distribution (B) or normal distribution (N) for the input data. For all methods excluding GEDI-B, a naïve estimate of logit (PSI) was used as input (See Methods), whereas GEDI-B was directly fitted to inclusion/exclusion counts following the model shown in a. d An example cassette exon that is differentially spliced between neuronal subtypes. The UMAP is the same as in (b), but the color represents the log-odds of inclusion/exclusion for a cassette exon in Kctd17. See also Supplementary Fig. 15. e A similar model can be fitted to the counts of exonic and intronic reads, with the resulting latent space representing the log-ratio of pre-mRNA processing rate (β) to mRNA degradation rate (γ). Assuming invariability of processing rate, this ratio can be interpreted as mRNA stability at steady state12. f Modeling the mRNA stability manifold as a function of RNA-binding protein (RBP) regulons. The color shows cell type labels (left) or the projected regulon activity of QKI (right). See Supplementary Fig. 17 for other examples. Source data are provided as a Source Data file.

We applied GEDI to the analysis of exon inclusion levels in the mouse cortex using data from two previously published studies29,30. We observed that the latent splicing space learned by GEDI, which represents the lower-dimensional projection of the cells based on their (unobserved) cassette exon PSI values, preserved the cell type structure, while removing the study-specific effects (Fig. 6b). We then compared the ability of GEDI to integrate the latent splicing space of multiple samples against that of other integration methods—note that, to apply other methods, a naïve estimate of PSI needed to be calculated first, while GEDI could be directly fitted to inclusion/exclusion counts. We found that GEDI offered the best performance at removing technical variability (Fig. 6c). Furthermore, as part of its expectation-maximization algorithm for model fitting, GEDI calculates the expected value of the (latent) ratio of inclusion/exclusion events given the observed counts and the model parameters, effectively providing a denoised estimate of PSI. We found that GEDI-inferred PSI values recapitulated previously observed cell type-specific splicing events, e.g., the inclusion of exon 20 of Nrxn1 in GABAergic neurons and exon 2 of Ntpn in glutamatergic neurons31 (Supplementary Fig. 15a), as well as other differentially spliced exons that are enriched for neuron-related functions (Supplementary Fig. 15b-c). It also identified novel associations between cassette exons and neuronal subtypes, such as the glutamatergic neuron-specific inclusion of a cassette exon in Kctd17 (Fig. 6d and Supplementary Fig. 15d). This observation is consistent with simulations showing that GEDI can impute ground truth ratios from paired, sparse counts, while a naïve estimator provides ratios that are almost completely uncorrelated with the ground truth (Supplementary Fig. 16a-e). 
Using the naïve estimator as input for two existing single-cell imputation methods2,32 slightly improved its correlation with the ground truth, but GEDI substantially outperformed them in recovering the ground truth (Supplementary Fig. 16f-g).

Finally, we evaluated the ability of GEDI to model RNA stability based on the ratio of spliced and unspliced RNA, assuming that RNA stability is proportional to the spliced/unspliced transcript ratio at steady-state in the absence of changes in RNA processing rate12 (Fig. 6e). While these conditions are not fully met in every single cell, we reasoned that spliced/unspliced transcripts ratios are still informative of RNA stability in cells that are not in the middle of a differentiation trajectory (and, therefore, are closer to steady state). To test this hypothesis, we applied GEDI to analyze the spliced/unspliced ratio of RNAs at the single-cell level in a model of sensory neurogenesis33. We compared the log-ratio of spliced vs. unspliced transcripts, as imputed by GEDI, to RNA half-life measurements obtained from mouse embryonic stem cells (ESCs) and in vitro-differentiated terminal neurons (TNs)12. Despite the differences between the biological systems represented by these two datasets, we observed a Pearson correlation of 0.16 between GEDI inferences and differential mRNA half-life measurements (Supplementary Fig. 17a), compared to a Pearson correlation of 0.22 when bulk RNA-seq data from the same in vitro differentiation system is used34, or to a mean Pearson correlation of 0.002 for shuffled single-cell data (Supplementary Fig. 17b). We then used GEDI to analyze spliced/unspliced transcript ratios in human neurons, using a previously published dataset of human embryonic glutamatergic neurogenesis13. In this analysis, we modeled the spliced/unspliced manifold as a function of the regulatory networks of RNA binding proteins and miRNAs (see Methods for details). GEDI was able to recover cell type-specific activities of several known post-transcriptional regulators (Supplementary Fig. 17c), including a higher projected activity of well-characterized factors such as QKI in radial glia35 (Fig. 6f) and miR-124 in differentiated neurons36 (Supplementary Fig. 17d). 
Collectively, these results show that GEDI can successfully model the latent space of RNA splicing and stability at the single-cell level.

Discussion

GEDI is a specific formulation of a general framework for multi-sample single-cell analysis (Fig. 1a–d) in which a family of functions, parametrized by sample-specific factors, connects each biological cell state to its expected expression profile in each sample. The framework proposed here includes three key components. First, it requires the identification of invertible decoder functions that provide a map from gene expression space to the cell state manifold and vice versa, allowing the development of generative models of single-cell data. This requirement distinguishes this framework from the unsupervised manifold alignment problem37 and from the majority of existing single-cell data integration approaches, such as those based on correlation analysis3,38 or graph-based methods4, in which a given gene expression profile can be mapped to its cell state but not vice versa, or methods based on autoencoders39, in which the encoder and decoder functions are not necessarily the inverse of each other.

Secondly, in the hierarchical model proposed here, sample-specific manifolds are drawn from a distribution around a mean manifold, with the mean manifold optionally expressed as a function of sample covariates. This probabilistic modeling of the manifold, which was inspired by the methodology used by LIGER40 for inter-sample harmonization, separates our framework from LEMUR, another recent method for latent embedding regression24, in which the latent space is a deterministic function of sample covariates without accounting for variation among biological replicates. Previous work has shown that properly modeling sample-to-sample variability is key for unbiased differential expression analysis in single-cell data5,10, which may partially explain the superior performance of GEDI in our simulation-based benchmarking tests (Fig. 4). We note, however, that more extensive analyses are needed to better understand the effects of factors that may influence the performance of GEDI and other DE analysis methods, including the number of differentially expressed genes per cell, the magnitude of their differential expression, cell-cell and gene-gene correlation structures, inter- and intra-sample variances, the number of samples, the number of cells per sample, and the sequencing depth.

Third, the hierarchical framework proposed here models each cell as a sample drawn from a distribution around the manifold of each sample, followed by probabilistic sampling of the observed data from the latent gene expression profile of the cell. This hierarchical structure allows the same model to be generalized to different observation types by employing different data-generating distributions, and underlies the distinguishing feature of GEDI to not only model the latent space of mRNA expression, but also the stability and splicing latent spaces of single cells. We expect this functionality to be useful for analysis of other single-cell modalities that represent the ratio of two biological measurements, as summarized in Supplementary Fig. 14, enabling a range of analyses based on those modalities, including batch correction and cluster/cell type analysis (similar to Fig. 6a–c). This hierarchical model also provides a natural method for denoising and/or imputation of single-cell data: we treat the true profile of each cell as a latent variable, the expected value of which can be calculated conditional on the observed data and the maximum a posteriori estimate of the model parameters (examples can be found in Fig. 6d and Supplementary Fig. 15). As shown in Supplementary Fig. 16, our simulation results underline the unique ability of GEDI to impute the ratios of paired observations (such as spliced vs. unspliced mRNA abundances), but it remains to be tested whether GEDI’s imputations for gene-level expression values are also competitive with existing single-cell imputation methods.

In addition, our formulation enables direct modeling of the parameters of the decoder (and, therefore, the manifold defined by that decoder) as functions of prior knowledge in the form of gene sets. This direct incorporation of prior biological knowledge into the GEDI model, which was inspired by previous work on latent variable modeling in bulk expression data8, provides several unique advantages. First, GEDI penalizes principal axes that cannot be expressed as a linear combination of gene-level prior knowledge, facilitating the downstream interpretability of the resulting manifold axes. Secondly, gene set activities can be projected onto individual cells, thus providing a natural approach to perform single-cell gene set enrichment analysis. This flexible framework enables the study of different types of regulatory mechanisms depending on the observations modelled. For example, it can be employed for the study of transcriptional regulators if gene-level counts are given as input (e.g., Supplementary Fig. 9), or for the analysis of regulatory networks that modulate mRNA stability if paired spliced and unspliced transcript counts are provided (e.g., Supplementary Fig. 17c). Thirdly, the activity gradient of each gene set can be directly compared to any transcriptomic vector field, such as the vector field associated with a specific sample-level feature, ultimately associating the activity of known gene sets with gene expression shifts observed across conditions (Fig. 5d–g).

We envision several potential extensions of this work. First, as noted earlier, our framework can be extended to different parametric manifold approximation approaches (GEDI currently supports linear manifold learning, with the option to further restrict the manifold to a hyperellipsoid). Secondly, it provides a natural path for extension of its concepts to multi-modal mosaic integration40: by using a different family of decoder functions for each modality, we can obtain simultaneous mapping from any given biological state to the spaces of different modalities. The proposed generative model can also naturally handle missing data, simply by excluding missing observations from the calculation of the a posteriori probability. Third, this framework can be used not only for de novo modeling of sample-specific manifolds, but also for post-hoc analysis of the harmonized space obtained by other existing methods. To showcase this potential, we used the integrated space identified by Harmony41 for the samples in cohort 1 of the COVID-19 dataset, and asked GEDI to identify sample-specific transformations that can approximate Harmony’s integrated space, model those sample-specific transformations as a function of disease status, and calculate the transcriptomic vector field of mild COVID-19 relative to healthy controls, as shown in Supplementary Fig. 18. Fourth, the transcriptomic vector fields obtained by GEDI may have applications beyond clustering-free DE analysis. For example, earlier studies42 have shown the utility of transcriptomic vector fields in prediction of cell fate transitions if the vector field represents “velocity” (gene expression change as a function of time). Furthermore, as GEDI’s vector field extends beyond the regions of the manifold that are occupied by observed cells, it may provide an opportunity for counterfactual prediction in previously unobserved cell types, such as prediction of response to specific perturbations43,44. These potential abilities, however, currently remain untested.

Overall, the framework presented here unifies a range of concepts that are central to single-cell data analysis, including multi-sample data integration, cluster-free DGE analysis, imputation and denoising, pathway and GRN activity analysis using prior information, downstream model interpretation, and analysis of different modalities with distinct data-generating processes.

Methods

The GEDI framework

Consider a dataset with measurements for G genes/events in N cells. Let the column vector yn ∈ ℝ^G denote the expression values for G genes in cell n∈{1,…,N}. The vectors yn (n∈{1,…,N}) together form the matrix Y ∈ ℝ^(G×N), in which each column n can be considered an observation in a G-dimensional space. We further assume that these observations lie near a lower-dimensional manifold, so that some function ψ can reconstruct each yn from a lower-dimensional column vector bn ∈ ℝ^K (K < G):

$${{{{\rm{\psi }}}}}_{\theta }:{{\mathbb{R}}}^{K}\to {{\mathbb{R}}}^{G}$$
(1)
$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$
(2)

Here, θ is the set of parameters that define the manifold, and bn represents the embedding of the n’th observation on this manifold.

Furthermore, the N cells in the dataset may belong to different samples. Each sample i∈{1,…,Q} may have a different (distorted) manifold, defined by the parameter set θi = θr + Δθi. Therefore, when the cells are derived from multiple samples, each observation yn can be modeled as:

$$\theta={\theta }_{r}+\triangle {\theta }_{i\left(n\right)}$$
(3)
$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$
(4)

Here, Δθi(n) represents the difference between θr, the parameter set that defines a “reference” manifold, and θi(n), the parameter set that defines the manifold of the sample to which cell n belongs (we denote the sample from which the cell n is derived as i(n)). This formulation allows direct mapping between the cells (each defined by a constant embedding b) across multiple samples (through sample-specific manifold parameterization).

The general concept above can potentially be adapted to various parametric manifold learning methods; GEDI represents a specific case, in which the gene expression manifold is modeled as a K-dimensional hyperplane (with the option to further restrict the manifold shape, as described later). This is achieved by defining the function ψ as:

$$\theta=\left\{{{{\bf{o}}}},{{{\bf{Z}}}}\right\}$$
(5)
$${{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)={{{\bf{o}}}}{{{\boldsymbol{+}}}}{{{\bf{Z}}}}{{{{\bf{b}}}}}_{n}$$
(6)

In a multi-sample analysis, sample-specific parameter sets are defined as:

$${\theta }_{i}=\left\{{{{{\bf{o}}}}}_{r}+\triangle {{{{\bf{o}}}}}_{i},{{{{\bf{Z}}}}}_{r}+\triangle {{{{\bf{Z}}}}}_{i}\right\}$$
(7)
$$\Theta=\left\{{\theta }_{1},\ldots,{\theta }_{Q}\right\}$$
(8)

Here, the column vector or ∈ ℝ^G represents the origin point (center) on a reference hyperplane, and Δoi represents the sample-specific translation of the origin point. Each of the columns of the matrix Zr ∈ ℝ^(G×K) represents a vector that originates from point or and lies on the reference hyperplane. By default, GEDI restricts these K vectors to be orthogonal to each other, effectively forming the orthogonal axes of a coordinate frame in which the position of each point on the reference manifold can be specified as bn. ΔZi represents the sample-specific transformations of this coordinate frame (excluding translation, which is specified by Δoi).

Thus, GEDI approximates each observation yn as:

$${{{{\bf{y}}}}}_{n}\cong {{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}$$
(9)

More precisely, yn is modeled as an observation drawn from a spherical multivariate normal distribution whose mean is located on the manifold of sample i(n):

$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(10)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(11)

Note the addition of the term sn1G here, which serves as a cell-specific intercept; sn is a scalar (representing library size), and 1G = (1,…,1)T ∈ ℝ^G is a column vector of ones.
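As a concrete illustration of Eq. (11), the following minimal sketch evaluates the mean expression profile of one cell under the reference manifold and under a distorted sample-specific manifold. It is not the GEDI implementation; the function name `cell_mean` and all numerical values are hypothetical.

```python
def cell_mean(o_r, delta_o, Z_r, delta_Z, b, s):
    """Return mu_n = o_r + delta_o + (Z_r + delta_Z) b + s * 1_G as a list of length G."""
    K = len(b)
    mu = []
    for g in range(len(Z_r)):
        axis_term = sum((Z_r[g][k] + delta_Z[g][k]) * b[k] for k in range(K))
        mu.append(o_r[g] + delta_o[g] + axis_term + s)
    return mu

# Toy setting: G = 3 genes, K = 2 latent dimensions (hypothetical values).
o_r = [1.0, 0.0, -1.0]
Z_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.5, -0.5]  # the same cell state is evaluated in both manifolds

no_shift_o = [0.0, 0.0, 0.0]
no_shift_Z = [[0.0, 0.0] for _ in range(3)]
sample_shift_o = [0.5, 0.5, 0.5]
sample_shift_Z = [[0.1, 0.0], [0.0, 0.0], [0.0, -0.1]]

mu_reference = cell_mean(o_r, no_shift_o, Z_r, no_shift_Z, b, s=0.0)
mu_sample = cell_mean(o_r, sample_shift_o, Z_r, sample_shift_Z, b, s=0.0)
```

Because the embedding b is held fixed, the difference between `mu_sample` and `mu_reference` reflects only the sample-specific distortion (Δo and ΔZ), which is what allows cells to be mapped between samples.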

The column vectors bn (n∈{1,…,N}) together form the matrix B ∈ ℝ^(K×N), in which each column n can be considered the embedding of the cell n in the manifold. Since the scales of Zr + ΔZi and B are redundant (each column of Zr + ΔZi can be scaled by some constant c and the corresponding row of B can be scaled by c^−1 without changing the model likelihood), GEDI restricts B such that each row forms a unit vector:

$${{{\bf{B}}}}\in \left\{{{\mathbb{R}}}^{K\times N},\Big|,\,\forall k\in \left\{1,\ldots,K\right\}\,{{\sum}_{n=1}^{N}}{\left({b}_{k,n}\right)}^{2}=1\,\right\}$$
(12)
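The scale redundancy that motivates the constraint in Eq. (12) can be checked numerically. The sketch below (hypothetical helper names, toy values, rows of B assumed non-zero) rescales each row of B to unit norm and absorbs the scale into the corresponding column of Z, leaving the product ZB unchanged.

```python
import math

def normalize_rows(Z, B):
    """Give each row of B unit Euclidean norm and rescale the columns of Z to compensate."""
    K = len(B)
    norms = [math.sqrt(sum(v * v for v in B[k])) for k in range(K)]  # assumes no zero rows
    B_unit = [[v / norms[k] for v in B[k]] for k in range(K)]
    Z_scaled = [[Z[g][k] * norms[k] for k in range(K)] for g in range(len(Z))]
    return Z_scaled, B_unit

def matmul(Z, B):
    """Plain-Python product of a G x K matrix and a K x N matrix."""
    return [[sum(Z[g][k] * B[k][n] for k in range(len(B))) for n in range(len(B[0]))]
            for g in range(len(Z))]

Z = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]   # G = 3, K = 2
B = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]      # K = 2, N = 3
Z_scaled, B_unit = normalize_rows(Z, B)
```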

Other constraints may also be added to further limit the shape of the manifold. For example, B may be restricted to the points on an ellipsoid; in other words:

$$ {{{\bf{B}}}}\in \left\{{{\mathbb{R}}}^{K\times N},|,\,\forall k\in \left\{1,\ldots,K\right\}\,{{\sum}_{n=1}^{N}}{\left({b}_{k,n}\right)}^{2}=1\,\right\} \\ \qquad \cap \left\{{{\mathbb{R}}}^{K\times N},|,\exists {{{\bf{d}}}}{{{\boldsymbol{\in }}}}{{\mathbb{R}}}_{ > 0}^{K}\,s.t.\,\forall n\in \left\{1,\ldots,N\right\}\,{{\sum}_{k=1}^{K}}{\left({b}_{k,n}/{d}_{k}\right)}^{2}=1\right\}$$
(13)

Here, the vector d contains the lengths of the semi-axes of the ellipsoid, with dk representing the kth element of d.

Modeling the reference manifold as a function of gene-level variables

The reference manifold itself may be approximated as a function of gene-level prior knowledge, such as gene regulatory networks or pathway memberships, by expressing the parameter set θr as a function of gene-level variables. To this end, GEDI expresses Zr as a probabilistic function of C ∈ ℝ^(G×P), a matrix representing gene-level prior information:

$${{{{\bf{z}}}}}_{r,k}\left|{{{{\bf{a}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{C}}}}{{{{\bf{a}}}}}_{k},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$
(14)
$${{{{\bf{a}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{A}{{{\bf{I}}}}\right)$$
(15)

Here, zr,k ∈ ℝ^G is the k’th column of Zr, and ak ∈ ℝ^P is a column vector whose P elements represent the contribution of each of the P columns of C toward determining the direction of the axis vector zr,k. This formulation is similar to (and inspired by) that used by PLIER8 for pathway-level information extraction from gene expression data. SZ is a hyperparameter that determines the variance of zr,k relative to the model variance σ^2. Similarly, SA is a hyperparameter that determines the variance of the prior distribution of ak relative to the model variance σ^2 (see Supplementary Methods for the choice of SZ, SA, and other hyperparameters).
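To make the role of these two priors explicit, the negative log of their joint density can be written out. Up to an additive constant that does not depend on the axis vector or its coefficients, Eqs. (14) and (15) give:

$$-\log p\left({\mathbf{z}}_{r,k},{\mathbf{a}}_{k}\right)=\frac{1}{2{\sigma }^{2}{S}_{Z}}{\left\Vert {\mathbf{z}}_{r,k}-{\mathbf{C}}{\mathbf{a}}_{k}\right\Vert }^{2}+\frac{1}{2{\sigma }^{2}{S}_{A}}{\left\Vert {\mathbf{a}}_{k}\right\Vert }^{2}+{\rm{const}}$$

The first term penalizes the part of the axis vector that cannot be written as a linear combination of the columns of C, and the second term shrinks the coefficients toward zero. Since the coefficient vector appears only in these two terms, its conditional mode for a fixed axis vector has the familiar ridge-regression form:

$${\hat{\mathbf{a}}}_{k}={\left({\mathbf{C}}^{\rm{T}}{\mathbf{C}}+\frac{{S}_{Z}}{{S}_{A}}{\mathbf{I}}\right)}^{-1}{\mathbf{C}}^{\rm{T}}{\mathbf{z}}_{r,k}$$

This is a consequence of the stated priors only; the update rules actually used during model fitting are described in Supplementary Methods.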

The column vectors ak (k∈{1,…,K}) together form the matrix A ∈ ℝ^(P×K). For example, C may represent a (weighted) regulatory network connecting P transcription factors to G genes, in which case the element ap,k of the matrix A corresponds to the contribution of the network of the transcription factor p toward the axis vector zr,k. Abn can be considered as the projected “activity” of the P transcription factors in the cell n (Abn ∈ ℝ^P). In other words, αp(b) = ap,.b is the function that provides the projected activity of the transcription factor p at coordinate b in the Zr coordinate system, where ap,. is the p’th row of A. Consequently, the gradient of the activity of transcription factor p in the Zr coordinate system (∇αp) is ap,., which can be transformed to the gene expression coordinate system as Zr∇αp = Zrap,..
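The projection and gradient described above reduce to two matrix-vector products, sketched below in plain Python. The function names and toy matrices are hypothetical, and row p of A is treated as a length-K vector when it is multiplied by Zr.

```python
def project_activity(A, b):
    """Projected activity of each of the P regulators at embedding b, i.e. A b (length P)."""
    return [sum(a_pk * b_k for a_pk, b_k in zip(row, b)) for row in A]

def activity_gradient_in_gene_space(Z_r, A, p):
    """Gradient of regulator p's activity (row p of A) mapped to gene space (length G)."""
    return [sum(z_gk * a_pk for z_gk, a_pk in zip(z_row, A[p])) for z_row in Z_r]

A = [[2.0, 0.0], [1.0, -1.0]]                 # P = 2 regulators, K = 2 axes
Z_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # G = 3 genes, K = 2 axes
b = [0.5, -0.5]                               # embedding of one cell

activity = project_activity(A, b)
gradient = activity_gradient_in_gene_space(Z_r, A, p=1)
```

Because the activity function is linear in b, the gradient is the same at every point of the reference manifold; it only varies between cells once sample-specific distortions are taken into account.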

In the absence of gene-level prior information, the following prior is used for zr,k:

$${{{{\bf{z}}}}}_{r,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$
(16)

Modeling sample-specific distortions of the manifold as a function of sample-level variables

The difference between the manifold of each sample i and the reference manifold can be expressed using the difference of the parameter sets that define these manifolds, i.e., Δθi. In the case of GEDI, we have: Δθi = {Δoi, Δzi,1, …, Δzi,K}. Each of the components can in turn be expressed as a function of sample-level variables:

$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}\left|{{{{\bf{R}}}}}_{o}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{o}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$
(17)
$${{{{\bf{R}}}}}_{o}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{o}}{{{\bf{I}}}}\right)$$
(18)
$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}\left|{{{{\bf{R}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{k}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$
(19)
$${{{{\bf{R}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{k}}{{{\bf{I}}}}\right)$$
(20)

Here, hi ∈ ℝ^L is a column vector whose elements represent the values of L variables for sample i. Ro ∈ ℝ^(G×L) and Rk ∈ ℝ^(G×L) are matrices that represent the effects of the L variables on Δoi and Δzi,k, respectively. SΔoi and SΔZi are sample-specific hyperparameters that specify the variance of Δoi and Δzi,k relative to the model variance σ^2. Similarly, SRo and SRk are hyperparameters that determine the variance of the prior distributions of Ro and Rk relative to the model variance σ^2.

In the absence of sample-level variables, Δoi and Δzi,k are modeled using the following prior distributions:

$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$
(21)
$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$
(22)

Direct inference from count data

Consider the G×N count matrix M, with each element mg,n generated from a Poisson distribution with mean λg,n. GEDI models each λg,n as a latent variable drawn from a log-normal distribution, so that yn = (log λ1,n, …, log λG,n)T ∈ ℝ^G follows a spherical multivariate normal distribution as in the previous sections. Thus:

$${m}_{g,n} \sim {{{\rm{Pois}}}}\left({e}^{{y}_{g,n}}\right)$$
(23)
$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(24)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(25)
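The generative process in Eqs. (23) and (24) can be simulated directly for a single cell. The sketch below assumes that the mean vector from Eq. (25) has already been computed; the Poisson sampler is Knuth's multiplication algorithm, chosen here only because it needs no external libraries, and all names and values are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's algorithm; adequate for the small rates used in this toy example."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_counts(mu, sigma, rng):
    """Draw y_g ~ N(mu_g, sigma^2), then m_g ~ Pois(exp(y_g)), for each gene g."""
    counts = []
    for mu_g in mu:
        y_g = rng.gauss(mu_g, sigma)                        # latent log-rate
        counts.append(sample_poisson(math.exp(y_g), rng))   # observed count
    return counts

rng = random.Random(0)
mu = [0.0, 1.0, 2.0]   # hypothetical mean log-rates for three genes
counts = simulate_counts(mu, sigma=0.5, rng=rng)
```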

Inference from paired UMI counts

Consider the G×N count matrices M1 and M2, with each pair of elements m1,g,n and m2,g,n corresponding to success and failure counts, respectively, in mg,n = m1,g,n + m2,g,n Bernoulli trials with success probability pg,n. GEDI models yn = (logit p1,n, …, logit pG,n)T ∈ ℝ^G as a latent variable:

$${m}_{1,g,n} \sim {{{\rm{B}}}}\left({m}_{g,n},{{{\rm{S}}}}\left({y}_{g,n}\right)\right)$$
(26)
$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(27)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(28)

Here, S is the sigmoid function.
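To see why fitting directly to paired counts differs from using a naïve logit estimate, consider a single gene/event in a single cell. The sketch below contrasts the naïve estimate, which is undefined when either count is zero, with the mode of the posterior obtained by combining the binomial likelihood of Eq. (26) with a normal prior as in Eq. (27). This scalar Newton iteration is a simplification for illustration only (GEDI itself uses expectation-maximization over the full model), and the function names are hypothetical.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def naive_logit_psi(m1, m2):
    """Naive log-odds from raw counts; undefined (None) when either count is zero."""
    if m1 == 0 or m2 == 0:
        return None
    return math.log(m1 / m2)

def map_logit(m1, m2, mu, sigma2, n_iter=50):
    """Mode of log Binomial(m1 | m1 + m2, S(y)) + log N(y | mu, sigma2) via Newton steps."""
    y = mu
    m = m1 + m2
    for _ in range(n_iter):
        p = sigmoid(y)
        grad = m1 - m * p - (y - mu) / sigma2      # derivative of the log-posterior
        hess = -m * p * (1.0 - p) - 1.0 / sigma2   # always negative (concave objective)
        y -= grad / hess
    return y

sparse_estimate = map_logit(3, 0, mu=0.0, sigma2=1.0)    # naive estimate is undefined here
deep_estimate = map_logit(30, 10, mu=0.0, sigma2=1.0)    # lies between the prior mean and log(3)
```

With no observed counts the estimate equals the prior mean, and with increasing counts it moves toward the naïve log-odds, which is the qualitative behaviour expected of a denoised estimate.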

Obtaining maximum a posteriori estimates of model parameters

When the gene expression matrix Y is provided and no gene-level or sample-level prior information is available, GEDI uses a block coordinate descent algorithm to obtain maximum a posteriori estimates for Zr, ΔZi, the reference origin or, Δoi, and B. When Y is latent (i.e., when M or the pair M1/M2 is provided), Zr is latent (i.e., gene-level variables are provided), and/or ΔZi and Δoi are latent (i.e., sample-level variables are provided), GEDI uses expectation-maximization to obtain maximum a posteriori estimates (see Supplementary Methods for details).

Datasets and preprocessing

All datasets used in the article were downloaded using the accession codes provided in the original publications, from the publications’ online data repositories, or from public data repositories. Quality control and preprocessing steps were performed with the scuttle package45 (v 1.0.4). For further details, see Supplementary Methods.

Integration methods and benchmarks

We compared the integration performance of GEDI against several other methods, including Seurat46, LIGER40, Harmony41, BBKNN47, scVI39, Scanorama48, and CSS38, as well as PCA (no integration) as a baseline. We ran each method following the documentation obtained from available tutorials or paper methods. Unless specified, we used the default parameters established by each package. For further details, see Supplementary Methods.

We measured the ability of each method to remove technical variability while preserving biological variation associated with the cell types. To do this, we followed the approach established by previous integration benchmark efforts16,17, which used metrics that can be grouped into two broad categories: (a) removal of batch effects and (b) conservation of biological variance. Metrics from group (a) included alignment score (batch), iLISI, kBET, and ASW (batch), while group (b) metrics included alignment score (cell type), cLISI, ARI, NMI and ASW (cell type). For further details on each individual metric, see the Supplementary Methods section.

For each method, we defined an overall score that summarized performance across the multiple metrics, following the approach established by Luecken et al.16. Briefly, we first rescaled the output of every metric to range from 0 to 1, which ensures that each metric is equally weighted within a partial score and has the same discriminative power. The rescaling was done using the transformation y′ = [y − min(Y)] / [max(Y) − min(Y)]. Then, we defined the ‘Batch’ score and the ‘Bio’ score, representing the average for the metrics belonging to the groups (a) and (b) above, respectively. For a given integration method, we calculated the overall score as previously defined by Luecken et al.16, where a weighted average of the Bio and Batch scores was used, with a weight of 0.6 for the Bio score and 0.4 for the Batch score (integration metrics using different weights can be found in Supplementary Fig. 3). Additional details are found in Supplementary Methods.
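The scoring scheme described above can be sketched as follows. The example assumes that every metric has already been oriented so that higher values are better, and that a metric with identical values across methods is rescaled to 0; both are assumptions of this illustration, and the function names and metric values are hypothetical.

```python
def min_max_rescale(values):
    """Rescale one metric across methods to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # assumption: a constant metric cannot rank methods
    return [(v - lo) / (hi - lo) for v in values]

def group_score(metrics):
    """Average the rescaled metrics of one group; metrics maps name -> per-method values."""
    scaled = [min_max_rescale(v) for v in metrics.values()]
    n_methods = len(scaled[0])
    return [sum(col[i] for col in scaled) / len(scaled) for i in range(n_methods)]

def overall_scores(batch_metrics, bio_metrics, bio_weight=0.6):
    """Weighted average of the Bio and Batch scores (0.6 and 0.4 by default)."""
    batch = group_score(batch_metrics)
    bio = group_score(bio_metrics)
    return [bio_weight * bio_i + (1.0 - bio_weight) * batch_i
            for batch_i, bio_i in zip(batch, bio)]

# Three hypothetical methods scored on two metrics per group.
batch_metrics = {"iLISI": [1.0, 2.0, 3.0], "kBET": [0.2, 0.4, 0.3]}
bio_metrics = {"ARI": [0.9, 0.5, 0.7], "NMI": [0.8, 0.6, 0.8]}
scores = overall_scores(batch_metrics, bio_metrics)
```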

Differential expression analysis

We performed pseudobulk differential expression (DE) analysis on the COVID-19 data using DESeq249 (v.30.1). For each cell type and donor combination, we created a pseudobulk using the ‘aggregateAcrossCells’ function from scuttle45. Only cell types that were present in all three conditions (control, mild and severe COVID-19) were considered for DE analysis.

Cluster-free differential expression benchmark

We conducted a systematic analysis to evaluate the performance of GEDI in clustering-free DGE analysis. Our assessment included the comparison of GEDI to two clustering-free DGE methods: LEMUR24 and miloDE25, as well as a comparison to DESeq2-based pseudobulk analysis. First, we developed a generative model that simulates cohort-level scRNA-seq data while preserving characteristics such as gene-gene and cell-cell correlations observed in real data, as discussed in detail in Supplementary Methods. Briefly, for each sample i, we simulate K archetypes representing a set of vertices on the gene expression manifold. If we represent the expression of each gene g across the K archetypes using the column vector γg,i ∈ ℝ^K, then we have:

$${{{{\boldsymbol{\gamma }}}}}_{g,i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{X}}}}}_{g}{{{{\bf{h}}}}}_{i}+{{{{\bf{x}}}}}_{g}^{{\prime} }{{{{\bf{h}}}}}_{i}{{{{\bf{1}}}}}_{K},{{{{\boldsymbol{\Sigma }}}}}_{g}\right)$$
(29)

where Xg ∈ ℝ^(K×L) is a matrix of coefficients that shows, for each of the K archetypes, how the mean expression of gene g is determined by each of the L sample-level variables represented by the column vector hi. x′g ∈ ℝ^L is a row vector that represents the global effects of sample-level variables on the observed abundance of gene g, e.g., through affecting ambient RNA abundances or other artifacts50. Σg ∈ ℝ^(K×K) is a covariance matrix for gene g.

Each cell n is then modeled as a probabilistic function of the weighted average of the archetypes of its sample:

$${y}_{g,n}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{w}}}}}_{n}{{{\boldsymbol{\cdot }}}}{{{{\boldsymbol{\gamma }}}}}_{g,i\left(n\right)},{\sigma }_{g}^{2}\right)$$
(30)

where wn is a row vector of K non-negative weights connecting cell n to each of the K archetypes (Σk wn,k = 1), and σg^2 is a gene-specific variance. Thus, the ground-truth DE effect of the L sample-level variables on gene g in cell n can be derived as the row vector δn,g = wnXg.
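As a small worked example of the ground-truth DE effect, the sketch below computes the product of a cell's archetype weights and a gene's coefficient matrix for K = 3 archetypes and L = 2 sample-level variables. The function name and all values are hypothetical.

```python
def de_effect(w_n, X_g):
    """Ground-truth DE effect of the L sample-level variables on gene g in cell n: w_n X_g."""
    K, L = len(X_g), len(X_g[0])
    return [sum(w_n[k] * X_g[k][l] for k in range(K)) for l in range(L)]

w_n = [0.25, 0.75, 0.0]                      # archetype weights of one cell (sum to 1)
X_g = [[2.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # K x L coefficients for one gene
delta = de_effect(w_n, X_g)
```

Here the third archetype has large coefficients but zero weight in this cell, so it does not contribute to the cell-level effect.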

Finally, for each cell n, the UMI counts are simulated as:

$${p}_{g,n}=\frac{{c}^{{y}_{g,n}}}{{\sum }_{g=1}^{G}{c}^{{y}_{g,n}}}$$
(31)
$${{{{\bf{m}}}}}_{n} \sim {{{{\rm{Multinom}}}}}_{G}\left({M}_{n},{{{{\bf{p}}}}}_{n}\right)$$
(32)

where c is the logarithm base for the simulated log-scale profiles and Mn is the total UMI count for cell n. This framework allows us to simulate UMI counts starting from a given set of Xg, Σg, and σg for all g∈{1,…,G}, and wn and Mn for all n∈{1,…,N}; we call these the simulation parameters, which can be modified to obtain different simulated datasets with varying numbers of cells, cell-cell or gene-gene correlation structures, cellular diversities, or ground-truth DE effects. In this work, we used the COVID-19 cohort 1 dataset as a template to derive realistic simulation parameters, as described in Supplementary Methods, and generated a simulated dataset with the same number of cells and samples as those of the COVID-19 dataset. We then applied GEDI, LEMUR, miloDE, and DESeq2-based pseudobulk analysis to this dataset (see Supplementary Methods), and compared the differential expression estimates from each method to the ground truth DE vectors at the single-cell (GEDI and LEMUR), neighborhood (GEDI, LEMUR and miloDE) and cell-type levels (GEDI, LEMUR and DESeq2). For further details, see Supplementary Methods.
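The final sampling step in Eqs. (31) and (32) can be sketched for one cell as follows. The base c = 2 is an assumption made for this illustration (the value used in the study is not stated here), and the function names and the toy profile are hypothetical.

```python
import random

def expression_to_probs(y, base=2.0):
    """Eq. (31): p_g = base**y_g / sum over genes of base**y_g."""
    shift = max(y)   # subtracting the maximum leaves p unchanged and avoids overflow
    powered = [base ** (v - shift) for v in y]
    total = sum(powered)
    return [v / total for v in powered]

def sample_umi_counts(y, total_umis, rng, base=2.0):
    """Eq. (32): distribute total_umis across genes by multinomial sampling."""
    probs = expression_to_probs(y, base)
    counts = [0] * len(y)
    for g in rng.choices(range(len(y)), weights=probs, k=total_umis):
        counts[g] += 1
    return counts

rng = random.Random(0)
y_n = [0.0, 1.0, 2.0]    # simulated log-scale profile of one cell
probs = expression_to_probs(y_n)
counts = sample_umi_counts(y_n, total_umis=1000, rng=rng)
```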

Cell type signature analysis

For the generation of cell-type signatures in PBMC data, we performed DE analysis using DESeq2 for each donor. Our model compared the mean expression of a given cell type versus the average of other cell types, including the batch variable (different sequencing technologies) as a covariate in the model. We considered genes with FDR-adjusted p-value < 0.05 and log2 fold-change > 1 to define cell-type markers. For a given donor, we defined a binary matrix that contained the cell-type markers, which was used as input for applying GEDI on the other donor.
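The marker thresholds above can be turned into a binary gene-by-cell-type matrix along the following lines. This is a sketch only: the function name and the assumption that adjusted p-values and log2 fold-changes are already arranged as genes × cell types arrays are illustrative, not a description of the actual pipeline.

```python
import numpy as np

def marker_matrix(padj, log2fc, fdr=0.05, min_lfc=1.0):
    """Binary (genes x cell types) marker matrix from per-cell-type DE results.

    padj, log2fc : arrays of shape (genes, cell types), assumed to hold the
    FDR-adjusted p-values and log2 fold-changes of each cell type versus the rest.
    """
    padj = np.asarray(padj, dtype=float)
    log2fc = np.asarray(log2fc, dtype=float)
    # Comparisons involving NaN are False, so genes without an adjusted
    # p-value are never called markers.
    return ((padj < fdr) & (log2fc > min_lfc)).astype(int)
```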

To perform the enrichment of cell type signatures and TF activities, we applied the scoreMarkers function from scran51 (v.1.30.0) using the cell-type/TF activities inferred by GEDI.

Gene regulatory network analysis

For analysis of transcription factor (TF) networks (in PBMCs), we downloaded the human DoRothEA gene-regulatory network52 and restricted the TF-gene interactions to the high-confidence sets A, B, and C, as previously defined. We then refined this TF-gene interaction matrix, using the whole blood data from GTEx53, to obtain a generative gene regulatory model in which the expression profile of each gene is a function of the expression of the TFs that are linked to it. Specifically, the target gene expression matrix Y ∈ ℝ^{G×N} (which is TMM-normalized54 and converted to log-scale) is modeled as Y ~ CA, where A ∈ ℝ^{P×N} is the matrix of the expression of P transcription factors across N whole blood samples from GTEx, and C ∈ ℝ^{G×P} is a weighted regulatory network whose elements are restricted to have the same sign as the regulatory interactions obtained from DoRothEA (cg,p ≥ 0 for an activating interaction between transcription factor p and gene g, cg,p ≤ 0 for an inhibitory interaction, and cg,p = 0 for no interaction). C was further filtered to include only TFs (columns) that have at least 10 “substantial” regulatory interactions, where we define a substantial effect as having an absolute value > 0.1.
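The sign-constrained model Y ~ CA and the subsequent TF filter can be sketched as follows. The fitting procedure is not specified here, so the projected-gradient least-squares solver below is an assumption chosen for simplicity, and the function names and the sign matrix S (entries −1, 0, +1 encoding the DoRothEA interaction signs) are hypothetical.

```python
import numpy as np

def fit_sign_constrained(Y, A, S, n_iter=500):
    """Least-squares fit of Y ~ C @ A with the sign of C constrained by S.

    Y : (G, N) target gene expression; A : (P, N) TF expression;
    S : (G, P) with +1 (activating), -1 (inhibitory), 0 (no interaction).
    Uses projected gradient descent (an assumed solver, for illustration).
    """
    C = np.zeros(S.shape)
    AAt = A @ A.T
    YAt = Y @ A.T
    step = 1.0 / np.linalg.eigvalsh(AAt).max()   # 1 / Lipschitz constant
    for _ in range(n_iter):
        C = C - step * (C @ AAt - YAt)           # gradient of 0.5 * ||Y - C A||^2
        # Project onto the feasible set defined by the sign constraints.
        C = np.where(S > 0, np.maximum(C, 0.0),
                     np.where(S < 0, np.minimum(C, 0.0), 0.0))
    return C

def filter_tfs(C, min_targets=10, threshold=0.1):
    """Keep TFs (columns) with at least `min_targets` entries of |c| > threshold."""
    keep = (np.abs(C) > threshold).sum(axis=0) >= min_targets
    return C[:, keep], keep
```

The projection step guarantees that every iterate respects the sign pattern, so the returned C is always a valid weighted network even if the iteration is stopped early.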

For analysis of post-transcriptional regulatory networks, we defined a regulatory network of RNA binding proteins (RBPs) and microRNAs (miRNAs) based on their potential interactions with mRNA 3’ UTRs. For the RBPs, we downloaded the set of known motifs for all human RBPs from CISBP-RNA55, and filtered them to include only the non-redundant set of motifs described in ref. 56. Then, for each human gene, the transcript with the longest 3’ UTR was identified. Genes whose longest 3’ UTR was shorter than 650 nt were removed, and the first 650 nt of the 3’ UTR (i.e., the 650-nt region immediately downstream of the stop codon) of the remaining genes was used for motif scanning, using AffiMx57. The miRNA regulons were obtained from ref. 34.
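The 3’ UTR selection rule described above (longest UTR per gene, minimum length of 650 nt, first 650 nt retained) can be expressed compactly. The input layout, a dictionary mapping each gene to the 3’ UTR sequences of its transcripts, and the function name are assumptions made for this sketch; motif scanning itself is not shown.

```python
def select_utr_windows(utrs_by_gene, window=650):
    """Return {gene: first `window` nt of its longest 3' UTR}.

    utrs_by_gene : dict mapping gene -> list of 3' UTR sequences, each written
    5'->3' and starting immediately downstream of the stop codon (assumed layout).
    Genes whose longest 3' UTR is shorter than `window` are dropped.
    """
    windows = {}
    for gene, utrs in utrs_by_gene.items():
        if not utrs:
            continue
        longest = max(utrs, key=len)
        if len(longest) < window:
            continue
        windows[gene] = longest[:window]
    return windows
```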

Assessment of estimated TF activities

We evaluated the accuracy of GEDI at estimating changes in TF activity by analyzing a previously published single-cell perturbation dataset28. Our evaluation also included a comparison to decoupleR58 (v.2.8.0), an ensemble of computational frameworks to infer biological activities. The methods we included in our evaluation were “aucell”, “gsva”, “mlm”, “ora”, “ulm”, “viper”, “wmean”, and “wsum”. For wmean and wsum, we utilized the normalized outputs “norm_wmean” and “norm_wsum”.

The dataset consisted of two technical batches, which were split and analyzed separately. One batch was used to generate a gene signature for each TF and to measure the degree of association of each TF with the principal axes of heterogeneity of the data, while the other was used to estimate TF activities using the learned gene regulatory effects. We opted to learn the gene signature of each TF from the data (as opposed to using an external gene regulatory network) to ensure that our results purely reflected the performance of the activity inference methods without the confounding effect of the quality of the external gene network. At the same time, by using one batch for learning gene-TF associations and the other for evaluating activity inferences, we aimed to avoid circularity. These steps were repeated by swapping the batches. To generate the gene signature of each TF, we performed differential expression analysis for each TF between the cells with perturbed TF expression versus unperturbed cells, using the scoreMarkers function from scran. The mean standardized Cohen’s d log-fold-change was used to define the regulatory effect of a TF on each gene. To identify the TFs whose perturbation is associated with a significant change in the principal axes of heterogeneity of the single-cell RNA-seq data, we ran PCA on the normalized expression data and retrieved the first 40 principal components. Next, for each TF, we fitted a logistic regression model to predict the TF perturbation status of a cell using its principal component scores, and used a likelihood-ratio test to compare this model to a reduced model with only an intercept.
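The likelihood-ratio comparison between the full logistic model (intercept plus principal component scores) and the intercept-only model can be sketched in NumPy as below. The Newton (IRLS) fitting routine, the small ridge term added for numerical stability, and the function names are assumptions of this example rather than a description of the software actually used.

```python
import numpy as np

def max_loglik(X, y, n_iter=50, ridge=1e-8):
    """Maximised Bernoulli log-likelihood of a logistic model, fitted by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p)
        # Tiny ridge on the Hessian only (assumed, to keep the solve stable).
        H = X.T @ (X * W[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    eta = np.clip(X @ beta, -30, 30)
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def lr_test_statistic(pc_scores, perturbed):
    """LR statistic and degrees of freedom: full model (intercept + PCs) vs intercept only."""
    y = np.asarray(perturbed, dtype=float)
    intercept = np.ones((len(y), 1))
    full = np.hstack([intercept, np.asarray(pc_scores, dtype=float)])
    stat = 2.0 * (max_loglik(full, y) - max_loglik(intercept, y))
    df = full.shape[1] - 1
    return stat, df
```

Under the reduced model, the statistic is asymptotically chi-square distributed with `df` degrees of freedom (40 when the first 40 principal components are used), so a p-value can be obtained from the chi-square survival function, for example `scipy.stats.chi2.sf(stat, df)`.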

Single-cell TF activities were estimated for each method using the learned gene-regulatory signatures. For methods that only accept a gene list for each TF (aucell, fgsea, gsva and ora), we defined the downstream targets of each TF as the set of genes with log2 fold-change <–0.3 in TF-perturbed cells in comparison to unperturbed cells. For the methods from decoupleR, normalized expression values were used as input, while for GEDI we used raw counts.

Differential analysis of cell type-specific cassette exon splicing

To perform DE between GABAergic and Glutamatergic cells in the Tasic data, we applied limma59 (v.3.46) using the imputed value of the latent splicing matrix from GEDI, which represents the logit of PSI (percent-spliced-in). We applied the lmFit and eBayes functions to obtain DE estimates using the default parameters.

GSEA was performed using fgsea60 (v.1.16), with the following arguments: minSize=15, eps=0, and maxSize=500. For the analysis of cell-type exon inclusion events using the Tasic data, we used the M5 ontology gene sets from MSigDB61 (m5.all.v2022.1.Mm.symbols.gmt), using the t-statistics from limma as the gene ranks.

Sashimi plots were generated using sashimipy62 (v.0.0.6). For the Tasic dataset, we selected a subset of 100 cells for each cell type and dataset combination, generated a merged BAM file, and provided it as input to sashimipy.

Statistics and Reproducibility

Unless otherwise specified, cells were removed from the downloaded datasets based on standard quality control criteria (% of mitochondrial reads <20%; number of UMIs <1000; and number of detected genes <1000). No samples were collected in this study. No statistical method was used to predetermine sample size. Computational experiments and analyses are reproducible using the notebooks and code provided.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.