A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data

Madrigal, Ariel; Lu, Tianyuan; Soto, Larisa M.; Najafabadi, Hamed S.

doi:10.1038/s41467-024-50963-0

Article
Open access
Published: 03 August 2024

Volume 15, article number 6573, (2024)

Abstract

Single-cell analysis across multiple samples and conditions requires quantitative modeling of the interplay between the continuum of cell states and the technical and biological sources of sample-to-sample variability. We introduce GEDI, a generative model that identifies latent space variations in multi-sample, multi-condition single-cell datasets and attributes them to sample-level covariates. GEDI enables cross-sample cell state mapping on par with state-of-the-art integration methods, cluster-free differential gene expression analysis along the continuum of cell states, and machine learning-based prediction of sample characteristics from single-cell data. GEDI can also incorporate gene-level prior knowledge to infer pathway and regulatory network activities in single cells. Finally, GEDI extends all these concepts to previously unexplored modalities that require joint consideration of dual measurements, such as the joint analysis of exon inclusion/exclusion reads to model alternative cassette exon splicing, or spliced/unspliced reads to model the mRNA stability landscapes of single cells.

Introduction

Single-cell technologies have emerged as powerful methods for unraveling tissue cellular heterogeneity and studying molecular phenotypes at the resolution of individual cells. Despite their remarkable potential, extracting meaningful biological insights from single-cell readouts still poses major analytical challenges. Such challenges arise from the need to integrate multiple interrelated tasks, including data normalization¹, denoising², and inter-sample harmonization and/or identification of a shared low-dimensional space^3,4. These processing steps are intertwined with the analytical goals of comparing biological conditions⁵ or experimental perturbations⁶, extracting pathway-level activity metrics, or studying gene regulatory networks (GRNs)^7,8.

While existing methodologies have made significant strides in addressing these challenges individually, a model that can unify all these concepts into a single framework has remained elusive. For instance, the most common workflow for differential gene expression (DGE) analysis requires sequential application of normalization, inter-sample integration, and cell-type/cluster identification, to then perform a pseudo-bulk DGE analysis for each cell type across conditions of interest^5,9,10. In addition to being limited to the analysis of discrete cell clusters, this sequential approach ignores the interplay between inter-sample integration and DGE analysis: integration depends on gene expression shifts across conditions, and DGE identification depends on integration. This is also true for other downstream analyses, such as pathway and GRN activity estimation, that are mutually influenced by normalization, low-dimensional projection, and integration steps. For example, considering prior biological information, in the form of gene networks or functionally related gene sets, may lead to identification of interpretable latent factors in expression data and help deconvolve biological variability from technical noise⁸.

Furthermore, existing tools are primarily designed to perform such analyses on the gene expression space or other modalities in which the biological quantity of interest, e.g., mRNA abundance, is connected to a single type of observation, e.g., unique molecular identifier (UMI) counts. An array of biological processes, however, are better measured as the ratio of two quantities. For example, alternative cassette exon splicing is commonly quantified as the ratio of abundances of isoforms that include the exon vs. isoforms in which the exon is skipped¹¹. As another example, the ratio of spliced to unspliced transcript abundances is informative about the processing and/or decay rates of mRNAs^12,13. Analysis of such ratio-based modalities is particularly challenging when the observed data from the two opposing quantities are sparse (e.g., sparse UMI counts for each of the spliced and unspliced forms of a transcript). Few methods can perform dimensionality reduction on the latent space of ratio-based modalities¹⁴ and, to our knowledge, no method exists for their inter-sample harmonization or GRN analysis.

Here, we introduce Gene Expression Decomposition and Integration (GEDI), a framework for multi-sample, multi-condition single-cell analysis. GEDI incorporates various single-cell analysis steps within a unified Bayesian framework that includes data integration across samples/conditions, data imputation and denoising, cluster-free DGE analysis, as well as pathway and GRN activity analysis. GEDI is competitive with other top-performing integration tools, while uniquely capable of deconvolving the effects of multiple technical and/or biological sources of sample-to-sample variability. This ability enables a natural and efficient approach for cluster-free DGE analysis by identifying the transcriptomic vector field associated with sample-level variables. Furthermore, by incorporating information about gene sets, it identifies axes of heterogeneity that are aligned with prior biological knowledge, thus enabling single-cell projection of pathway and/or GRN activities as well as their direction of change (gradients). Finally, GEDI is the first single-cell analysis framework to expand all these concepts from the gene expression space to the analysis of ratio-based modalities, including the analysis of the latent spaces of alternative splicing and RNA stability.

Results

The GEDI framework

We formulate multi-sample scRNA-seq analysis as the identification of sample-specific, invertible decoder functions such that the decoder function of each sample can reconstruct the expected expression profile of each cell from a (low-dimensional) representation of its biological state (Fig. 1a). Subsequently, correspondences between cells of different samples can be established based on the similarity of their biological states, as given by the inverse of the sample-specific decoders (encoders), enabling horizontal integration of single-cell data across different samples. We further constrain the sample-specific decoders to be from the same family of functions, while allowing for sample-specific parameterizations (Fig. 1b).

The decoder parameters can be optionally expressed as a probabilistic function of sample-level variables, resulting in a distribution of decoder functions for any given combination of sample characteristics (Fig. 1c). This formulation gives rise to a hierarchical generative model, in which a probabilistic function connects the characteristics of each sample to a distribution of decoder parameter sets. The parameter set of the sample is then drawn from this distribution, leading to a decoder function that connects the biological state of each cell to its expected gene expression profiles. This hierarchical model enables cluster-free differential gene expression analysis along the continuum of cell states, as we can examine how changes in sample-level variables impact the expected (mean) expression profile of any given biological cell state (Fig. 1d).

We note that the decoder function of each sample effectively defines the manifold representing the observed cell expression profiles within that sample; therefore, this formulation holds the potential for extension to any parametric manifold learning approach. GEDI is a specific application of this general formulation, where the gene expression manifold of each sample is modeled as a hyperplane or hyperellipsoid, defined by a common (reference) set of principal axes (Fig. 1e) and sample-specific transformations of these axes (Fig. 1f). These sample-specific transformations can, in turn, be modeled as probabilistic functions of sample-level variables (Fig. 1g), enabling cluster-free analysis of the association between gene expression and sample characteristics. Optionally, the common coordinate frame can also be expressed as a probabilistic function of gene-level variables such as gene-set memberships (Fig. 1h), aligning the principal axes of the coordinate frame with prior biological knowledge and enabling the projection of pathway and regulatory network activities onto the cellular state space (Fig. 1i).

Finally, we can connect each point on the sample-specific manifolds to different types of observations using diverse data-generating distributions. This versatility enables gene expression analysis based on normalized or raw unique molecular identifier (UMI) counts, analysis of alternative splicing using counts of reads that support opposing splicing events, and analysis of RNA stability based on the UMI counts of spliced and unspliced transcripts (Fig. 1j). The GEDI model is fitted to these observation types using an expectation-maximization algorithm (see Supplementary Methods for details).
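The decoder/encoder idea described in this section can be illustrated with a toy linear example. The sketch below is not GEDI's implementation or API: the names `Z` (shared reference axes), `Q_a` and `Q_b` (sample-specific deformations of those axes), `decode`, and `encode` are hypothetical, the cell state is fixed at two dimensions, and noise, offsets, and the probabilistic priors of the actual model are omitted. It only shows that two samples can map the same cell state to different expected expression profiles, and that each sample's encoder recovers the shared state.

```python
# Minimal sketch of a linear ("hyperplane") decoder with sample-specific axes.
# All names here (Z, Q_a, Q_b, decode, encode) are hypothetical, not GEDI's API.

def add(A, B):
    """Element-wise sum of two matrices stored as lists of rows."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def decode(Z, Q, b):
    """Expected expression (one value per gene) for cell state b: (Z + Q) b."""
    return [sum(w * x for w, x in zip(row, b)) for row in add(Z, Q)]

def encode(Z, Q, y):
    """Invert decode by least squares, for a 2-dimensional cell state."""
    A = add(Z, Q)  # genes x 2
    # Normal equations: (A^T A) b = A^T y, solved in closed form for a 2x2 system.
    a11 = sum(r[0] * r[0] for r in A)
    a12 = sum(r[0] * r[1] for r in A)
    a22 = sum(r[1] * r[1] for r in A)
    c1 = sum(r[0] * yi for r, yi in zip(A, y))
    c2 = sum(r[1] * yi for r, yi in zip(A, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]

# Shared reference axes (3 genes x 2 latent dimensions) and two sample deformations.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q_a = [[0.1, 0.0], [0.0, -0.1], [0.0, 0.0]]
Q_b = [[-0.2, 0.1], [0.0, 0.2], [0.1, 0.0]]

b = [0.5, -1.0]                 # one biological cell state
y_a = decode(Z, Q_a, b)         # its expected profile in sample a
y_b = decode(Z, Q_b, b)         # its expected profile in sample b
b_from_a = encode(Z, Q_a, y_a)  # both encoders map back to the same state
b_from_b = encode(Z, Q_b, y_b)
```

In this toy setting, cells from different samples become comparable because they are matched in the state space (`b_from_a` and `b_from_b`), not in the expression space (`y_a` and `y_b`).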

GEDI captures different sources of sample-to-sample variability

To assess the ability of GEDI to capture sample-to-sample variability, we applied it to a dataset of peripheral blood mononuclear cells (PBMCs) from two donors¹⁵ profiled using different scRNA-seq technologies. We applied GEDI without including sample-level variable information, so the model was oblivious to the biological and technical characteristics of the samples. Examining the sample-specific transformations learned by GEDI revealed that, once the effect of technology is regressed out, they are more similar among samples that are from the same donor (Fig. 2a). Similarly, after regressing out the effect of donor, the sample-specific transformations cluster by the technology (Supplementary Fig. 1a). These results suggest that GEDI properly learns sample-specific parameters that capture different sources of inter-sample variability, including the biological differences between the two donors and the technical variability introduced by the use of different technologies. At the same time, intra-sample variability across the cells is preserved, as the projection of the cell state representations learned by GEDI shows clear separation of cell types without any obvious separation by sample (Fig. 2b).

**Fig. 2: GEDI captures sample-to-sample variability.**

To quantitatively measure the ability of GEDI to separate intra-sample and inter-sample sources of variability, we compared it against a panel of existing single-cell integration methods using previously established metrics^16,17 and three benchmarking references: the PBMC dataset described above, a pancreas dataset^18,19 and the Tabula Muris²⁰ dataset. Overall, we observed that GEDI was consistently among the top-performing methods, regardless of the number of latent factors used for low-dimensional projection of data, an often arbitrary choice that affects the performance of most other methods (Fig. 2c and Supplementary Fig. 1b-f). These results suggest that the manifold transformations learned by GEDI explain most of the sample-to-sample variability while retaining the heterogeneity in the biological states of the cells.

Encouraged by the performance of GEDI on PBMC data, we applied it to a recent single-cell atlas of PBMCs that included healthy individuals as well as mild and severe COVID-19 cases from two separate cohorts²¹. Consistent with the results shown above, when modeling the gene expression manifold of the cells, GEDI learned sample-specific transformations that reflected the biological variability among samples, such as the COVID-19 status and its severity (Fig. 2d). In addition, we successfully trained support vector machine (SVM) models capable of predicting the disease status (COVID-19 vs. healthy) based on sample-specific transformations of the reference manifold. Cross-cohort validation analysis suggests that, when trained on cohort 2, the model perfectly distinguishes COVID-19 vs. healthy individuals in cohort 1; conversely, when the model is trained on cohort 1, it achieves an area under the receiver operating characteristic curve (AUROC) of 0.97 in cohort 2 (Fig. 2e). Interestingly, similar SVM models trained on pseudobulk-based features did not generalize well across cohorts (Supplementary Fig. 2b-c). Together, these results show that GEDI can capture most of the sample-to-sample variability present in multi-sample scRNA-seq datasets; this variability can then be directly traced back to sample characteristics (such as disease severity) owing to the parametric nature of GEDI’s modelling framework.

GEDI enables cluster-free differential expression analysis

To enable direct analysis of the relationship between sample-level variables and the gene expression space, we explicitly included them in the GEDI model, by expressing the sample-specific manifold transformations as probabilistic functions of sample-level variables. This model enables us to examine how the manifold and, therefore, the expression vector associated with any given cell state, changes with sample characteristics, providing a transcriptomic vector field for each sample-level variable (Fig. 3a and Supplementary Fig. 4). We applied this approach to the COVID-19 dataset, to obtain the transcriptomic vector field describing the differences between severe COVID-19 and healthy individuals across the PBMCs. Figure 3b, c provides a visual representation of the cell state space and the transcriptomic vector field of severe COVID-19 over that space. The largest vector magnitudes, corresponding to the cell states that show the largest overall gene expression shift, were observed in plasmablasts, HLA-DR^lo S100A^hi monocytes, and neutrophils (Fig. 3d); the transcriptomic vector magnitudes observed in monocytes and neutrophils recapitulate the previously reported large cell state shifts in these cell types²¹.
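As a rough illustration of what a transcriptomic vector field is, the sketch below assumes a deliberately simplified decoder in which the expected expression of cell state `b` in a sample with covariate value `c` is `(Z + c * R) b`. The effect matrix `R`, the function names, and the strictly linear dependence on `c` are assumptions made for this example; they are not taken from GEDI's code. Under that assumption, the shift in expected expression at state `b` when the covariate changes by `delta` is `delta * R b`, which gives one vector per cell state.

```python
import math

# Hypothetical setup: expected expression at state b is (Z + c * R) b, so the
# effect of changing the sample-level covariate c by `delta` is delta * R b.

def vector_field(R, b, delta=1.0):
    """Per-gene shift in expected expression at cell state b."""
    return [delta * sum(r * x for r, x in zip(row, b)) for row in R]

def magnitude(v):
    """Euclidean norm, used here as the overall size of the expression shift."""
    return math.sqrt(sum(x * x for x in v))

# Toy effect matrix: 3 genes x 2 latent dimensions.
R = [[0.5, 0.0], [0.0, 0.0], [0.0, -1.0]]

# Two toy cell states; the labels are placeholders, not real cell types.
states = {"state_1": [1.0, 0.0], "state_2": [0.0, 2.0]}
shifts = {name: vector_field(R, b) for name, b in states.items()}
sizes = {name: magnitude(v) for name, v in shifts.items()}
```

Here `state_2` has the larger magnitude, the toy analogue of a cell state whose expression changes most with the covariate.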

**Fig. 3: Incorporating sample-level variables in the GEDI model.**

Calculation of the vector field provides a cluster-free approach for DGE analysis across the continuum of cell states. Nonetheless, it is also possible to perform a traditional cluster-based DGE analysis by summarizing the vector field for any given cell cluster using simple modifications. To showcase this, we calculated the mean COVID-19 transcriptomic vector representing the average shift in gene expression between mild COVID-19 and healthy individuals across all cells of each cell type. Comparison of these cell type-specific mean vectors to cell type-specific estimates from a pseudo-bulk DGE analysis revealed a high degree of agreement between the two approaches (Fig. 3e and Supplementary Fig. 5a-b). Interestingly, GEDI estimates showed improved reproducibility across cohorts compared to the pseudo-bulk approach (Fig. 3f, g and Supplementary Fig. 5c-d). Furthermore, we found that pseudo-bulk DGE estimates had a substantial correlation between cell types (Supplementary Fig. 6a), whereas GEDI estimates were highly cell type-specific (Supplementary Fig. 6b).
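The cluster-level summary described above amounts to averaging per-cell vectors within each cell type. A minimal sketch follows, with a hypothetical function name and toy values; the per-cell vectors are assumed to have been computed already.

```python
def mean_vector_by_type(vectors, labels):
    """Average per-cell differential expression vectors within each cell type."""
    sums, counts = {}, {}
    for v, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(v)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], v)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

# Toy input: three cells, two genes, two cell-type labels.
cell_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, -1.0]]
cell_types = ["monocyte", "monocyte", "B cell"]
type_means = mean_vector_by_type(cell_vectors, cell_types)
```

The resulting per-type vectors are the quantities that can be set against pseudo-bulk estimates gene by gene.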

To systematically establish the performance of GEDI in clustering-free DGE analysis, we used GEDI to analyze a simulated cohort-level single-cell dataset, allowing us to compare GEDI’s inferences to a known ground truth for each individual cell. Our simulation framework is schematically shown in Fig. 4a, which is based on a set of synthetic cellular archetypes^22,23 whose expression vectors are determined by sample-level variables, along with additional sources of variation at the sample- and cell-level^5,10. The parameters needed to simulate cells using this framework can be derived from a variety of sources; we decided to use a real scRNA-seq dataset (specifically, the COVID-19 dataset above) as the template to derive these parameters, in order to preserve characteristics such as gene-gene covariances (see Methods for details). We observed that single-cell DE vectors provided by GEDI correlate strongly with the ground-truth DE vector of each cell, with a slightly better performance when modeling the manifold as a hyperplane (median Pearson r = 0.4, Fig. 4b). In comparison, inferences made by LEMUR²⁴, another recent method for clustering-free single-cell DE analysis, had significantly lower correlation with ground truth (Mann-Whitney U P < 10^–15; median r = 0.14). At the level of each individual cell, we also defined a set of ground-truth “up-regulated” and “down-regulated” genes by thresholding the ground-truth DE values, and found that GEDI significantly outperforms LEMUR in the identification of up-regulated and down-regulated genes (median AUROC of 0.79 and 0.72 for identification of up-regulated genes by GEDI and LEMUR, and median AUROC of 0.78 and 0.55 for identification of down-regulated genes by GEDI and LEMUR, respectively; all comparisons are significant at P < 10^–15; Fig. 4b). Another recent method, miloDE²⁵, can also perform clustering-free differential analysis, albeit at the “neighborhood” level as opposed to single-cell level. To compare with miloDE, we collapsed the ground-truth DE vectors as well as GEDI’s and LEMUR’s inferences to the neighborhood level, by averaging across the cells of each neighborhood (with neighborhoods defined by miloDE). Figure 4c shows that GEDI outperforms both LEMUR and miloDE at the neighborhood level based on different metrics. Finally, we collapsed the ground truth DE vectors as well as GEDI’s and LEMUR’s inferences to the “cell type” level (see Methods), in order to compare to pseudobulk analysis results obtained by DESeq2. Again, we observed a better agreement between GEDI inferences and the ground truth compared to both LEMUR and DESeq2 (Fig. 4d). These results suggest that GEDI can effectively capture the differential expression of genes at the level of single cells, neighborhoods, and cell types.
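The two per-cell metrics used in this comparison, Pearson correlation with the ground-truth DE vector and AUROC for recovering thresholded up-regulated genes, can be computed as in the sketch below. The toy vectors and the threshold of 1.0 are arbitrary choices for illustration and do not reflect the thresholds or data used in the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def auroc(scores, labels):
    """Probability that a positive gene outranks a negative one (ties count half)."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy ground-truth and inferred DE values for six genes in a single cell.
truth = [2.0, 0.1, -1.5, 0.0, 1.2, -0.3]
inferred = [1.1, 0.3, -0.9, -0.1, 0.2, 0.4]

# "Up-regulated" genes defined by thresholding the ground truth (threshold is arbitrary).
is_up = [t > 1.0 for t in truth]

r = pearson(truth, inferred)
auc_up = auroc(inferred, is_up)  # 0.75 for these toy values
```

Repeating this for every cell and taking the median across cells gives summary statistics of the kind reported above.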

**Fig. 4: Systematic comparison of cluster-free DGE methods using a simulated cohort-level single-cell dataset.**

Pathway and network activity projection with GEDI

GEDI can also incorporate prior biological knowledge, such as gene signatures, biological pathways, or GRNs into its model, by expressing the manifold principal axes as probabilistic functions of prior gene-set associations (gene signatures, pathways, and GRNs can be represented by a weighted gene-set association matrix, similar to previous work⁸). As a result, principal axes that can be expressed as a linear combination of one or several gene sets/signatures are deemed more likely by the model, encouraging their alignment with prior knowledge, and allowing the projection of the “activity” of known biological axes onto the cellular states (Fig. 1i). To assess the reliability of these projected activities, we examined GEDI’s ability to project cell type signatures across PBMCs. First, using DGE analysis of the PBMC benchmarking dataset, we generated cell-type signatures for each of the two donors (see Methods for details). Then, for each donor, we applied GEDI using the signatures from the other donor as prior biological knowledge. In both cases, the inferred activity of cell type signatures showed strong enrichment for the true cell type labels (Fig. 5a and Supplementary Fig. 8).

**Fig. 5: Modeling the manifold as a function of gene-level variables.**

Similarly, when a transcription factor (TF) regulatory network was used as prior biological knowledge (see Methods), we observed cell type-specific activity patterns for many TFs (Supplementary Fig. 9), including known lineage regulators such as PAX5 in B cells²⁶ and TCF7 in CD4⁺ T cells²⁷ (Fig. 5b). For 74 out of 89 TFs included in our regulatory network, the projected activity across cells correlated significantly with the decoded abundance of the mRNA encoding the TF (t-test for Pearson correlation, FDR < 0.001), further supporting GEDI’s inferences. The high correlation between the inferred activity of most TFs and their mRNA abundance can also be seen in the COVID-19 data, even when we stratify the cells by their cell type and by the disease condition of the donors (Supplementary Fig. 10). The ability of GEDI to infer TF activities is also supported by its performance on a dataset of single-cell TF perturbations²⁸. As shown in Supplementary Fig. 11, for the TFs whose perturbation is associated with cell state shifts along the main axes of variation, GEDI-based activities are highly predictive of the TF perturbation status. For this subset of TFs, GEDI (hyperellipsoid) is in fact among the top performers compared to eight other methods we tested, with mean AUROC of 0.86 for distinguishing the cells in which a specific TF is perturbed from other cells. Given that GEDI only models the principal axes of variation as functions of the gene regulatory network, its behaviour in correctly modeling the TFs that cause cell state shifts along these axes is expected.

GEDI network activity projection also enables the calculation of a gradient vector for the activity of each TF, representing the direction of greatest increase in TF activity in the cell space (Fig. 5c). One can then compare the gradient vector of each TF to a given transcriptomic vector field, to examine whether in certain cellular states the transcriptomic vector field is aligned with the gradient vector of that TF (Fig. 5c). We used GEDI to infer the regulon activity of TFs in the COVID-19 dataset and, for each single cell, compared the activity gradient vector of each TF to the transcriptomic vector field of severe COVID-19. Interestingly, the transcriptomic vector field of severe COVID-19 correlated (or anti-correlated) strongly with the TF activity gradients in a cell type-specific pattern (Fig. 5d). In other words, in certain cell states, when we move from healthy to severe COVID-19, the direction of change in the gene expression coincides with the direction of greatest increase (or decrease) in the activity of specific TFs. For example, the activity gradient of a group of TFs showed high concordance with the transcriptomic vector field of severe COVID-19 in HLA-DR^lo S100A^hi monocytes, including SPI1, CEBPA, and SP1 (Fig. 5e–g and Supplementary Fig. 12), suggesting that severe COVID-19 is accompanied by increased activity of these TFs in HLA-DR^lo S100A^hi monocytes. Among these TFs, we observed strong up-regulation for SPI1 mRNA in monocytes in severe COVID-19 compared to healthy controls (pseudobulk DE and GEDI cluster-free DE analyses; Supplementary Fig. 13a), but not for the other two TFs. Nonetheless, for all three TFs, the direct targets whose expression was highly correlated with the expression of the TF were enriched in various immune-related pathways (Supplementary Fig. 13b).
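One simple way to quantify whether two vectors point in the same direction is cosine similarity. The sketch below uses it to compare a TF activity gradient with a transcriptomic vector at a single cell state. The choice of cosine similarity and the toy vectors are assumptions made for illustration; the text above does not specify which concordance measure the authors used.

```python
import math

def cosine(u, v):
    """Cosine similarity: +1 for aligned, -1 for opposed, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors at one cell state (hypothetical values).
tf_gradient = [1.0, 0.0]         # direction of greatest increase in TF activity
condition_shift = [2.0, 0.0]     # transcriptomic vector for the condition
opposite_shift = [-3.0, 0.0]     # a shift against the TF gradient
unrelated_shift = [0.0, 5.0]     # a shift orthogonal to the TF gradient

aligned = cosine(tf_gradient, condition_shift)
opposed = cosine(tf_gradient, opposite_shift)
orthogonal = cosine(tf_gradient, unrelated_shift)
```

A value near +1 at a given cell state would correspond to the case described above, where moving from healthy to disease coincides with increasing activity of the TF.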

Modeling the latent space of RNA splicing and stability with GEDI

In contrast to the analysis of mRNA abundance, where for each cell and each feature a single quantity is recorded (e.g., the UMI count), analysis of many other biological processes requires working with the ratio of two quantities. For example, analysis of alternative splicing of cassette exons involves modeling the percent-spliced-in (PSI), representing the ratio between the abundances of isoforms in which the cassette exon is included vs. excluded¹¹ (see Supplementary Fig. 14 for other examples). Such analysis is further complicated, especially in single-cell data, by the fact that the quantities whose ratio is of interest are latent, and instead some probabilistic observation is obtained (e.g., UMI counts of inclusion or exclusion isoforms). By using a hierarchical model in which the latent profile of each cell is connected to the observed data through a binomial data-generating distribution (Fig. 6a), GEDI extends the analyses described in the previous sections to paired quantities whose ratio is of interest.

We applied GEDI to the analysis of exon inclusion levels in the mouse cortex using data from two previously published studies^29,30. We observed that the latent splicing space learned by GEDI, which represents the lower-dimensional projection of the cells based on their (unobserved) cassette exon PSI values, preserved the cell type structure, while removing the study-specific effects (Fig. 6b). We then compared the ability of GEDI to integrate the latent splicing space of multiple samples against that of other integration methods (note that, to apply other methods, a naïve estimate of PSI needed to be calculated first, while GEDI could be directly fitted to inclusion/exclusion counts). We found that GEDI offered the best performance at removing technical variability (Fig. 6c). Furthermore, as part of its expectation-maximization algorithm for model fitting, GEDI calculates the expected value of the (latent) ratio of inclusion/exclusion events given the observed counts and the model parameters, effectively providing a denoised estimate of PSI. We found that GEDI-inferred PSI values recapitulated previously observed cell type-specific splicing events, e.g., the inclusion of exon 20 of Nrxn1 in GABAergic neurons and exon 2 of Ntpn in glutamatergic neurons³¹ (Supplementary Fig. 15a), as well as other differentially spliced exons that are enriched for neuron-related functions (Supplementary Fig. 15b-c). It also identified novel associations between cassette exons and neuronal subtypes, such as the glutamatergic neuron-specific inclusion of a cassette exon in Kctd17 (Fig. 6d and Supplementary Fig. 15d). This observation is consistent with simulations showing that GEDI can impute ground truth ratios from paired, sparse counts, while a naïve estimator provides ratios that are almost completely uncorrelated with the ground truth (Supplementary Fig. 16a-e). Using the naïve estimator as input for two existing single-cell imputation methods^2,32 slightly improved its correlation with the ground truth, but GEDI substantially outperformed them in recovering the ground truth (Supplementary Fig. 16f-g).
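To see why a naïve PSI estimate breaks down on sparse counts, the sketch below contrasts it with a textbook Beta-binomial posterior mean. This conjugate-prior estimator is a stand-in chosen for simplicity and is not GEDI's estimator, which conditions on the fitted low-dimensional model; the function names and the prior settings (`prior_mean`, `prior_strength`) are hypothetical.

```python
def naive_psi(inc, exc):
    """Naive PSI: inclusion reads over total reads; undefined when there are no reads."""
    total = inc + exc
    return inc / total if total > 0 else None

def posterior_mean_psi(inc, exc, prior_mean=0.5, prior_strength=2.0):
    """Posterior mean of PSI under a Beta prior and a binomial likelihood.

    Stand-in for a model-based estimate: with few reads the estimate stays
    near prior_mean; with many reads it approaches the naive ratio.
    """
    a = prior_mean * prior_strength + inc
    b = (1.0 - prior_mean) * prior_strength + exc
    return a / (a + b)

# A single inclusion read: the naive estimate jumps to 1.0,
# while the posterior mean moves only part of the way (2/3 here).
sparse_naive = naive_psi(1, 0)
sparse_shrunk = posterior_mean_psi(1, 0)

# No reads at all: the naive estimate is undefined, the posterior mean is the prior mean.
empty_naive = naive_psi(0, 0)
empty_shrunk = posterior_mean_psi(0, 0)

# Many reads: both estimators give nearly the same answer.
deep_naive = naive_psi(80, 20)
deep_shrunk = posterior_mean_psi(80, 20)
```

The same logic applies to any paired-count ratio: an estimate that borrows information from a model degrades gracefully as counts become sparse, whereas the raw ratio becomes extreme or undefined.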

Finally, we evaluated the ability of GEDI to model RNA stability based on the ratio of spliced and unspliced RNA, assuming that RNA stability is proportional to the spliced/unspliced transcript ratio at steady-state in the absence of changes in RNA processing rate¹² (Fig. 6e). While these conditions are not fully met in every single cell, we reasoned that spliced/unspliced transcript ratios are still informative of RNA stability in cells that are not in the middle of a differentiation trajectory (and, therefore, are closer to steady state). To test this hypothesis, we applied GEDI to analyze the spliced/unspliced ratio of RNAs at the single-cell level in a model of sensory neurogenesis³³. We compared the log-ratio of spliced vs. unspliced transcripts, as imputed by GEDI, to RNA half-life measurements obtained from mouse embryonic stem cells (ESCs) and in vitro-differentiated terminal neurons (TNs)¹². Despite the differences between the biological systems represented by these two datasets, we observed a Pearson correlation of 0.16 between GEDI inferences and differential mRNA half-life measurements (Supplementary Fig. 17a), compared to a Pearson correlation of 0.22 when bulk RNA-seq data from the same in vitro differentiation system is used³⁴, or to a mean Pearson correlation of 0.002 for shuffled single-cell data (Supplementary Fig. 17b). We then used GEDI to analyze spliced/unspliced transcript ratios in human neurons, using a previously published dataset of human embryonic glutamatergic neurogenesis¹³. In this analysis, we modeled the spliced/unspliced manifold as a function of the regulatory networks of RNA binding proteins and miRNAs (see Methods for details). GEDI was able to recover cell type-specific activities of several known post-transcriptional regulators (Supplementary Fig. 17c), including a higher projected activity of well-characterized factors such as QKI in radial glia³⁵ (Fig. 6f) and miR-124 in differentiated neurons³⁶ (Supplementary Fig. 17d). Collectively, these results show that GEDI can successfully model the latent space of RNA splicing and stability at the single-cell level.
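The proportionality between RNA stability and the spliced/unspliced ratio can be made explicit under a standard two-step kinetic model. This model is an assumption stated here for illustration, with symbols that do not appear in the text above: unspliced abundance $u$, spliced abundance $s$, transcription rate $\alpha$, processing (splicing) rate $\beta$, degradation rate $\gamma$, and half-life $t_{1/2}$.

```latex
\begin{align*}
\frac{du}{dt} &= \alpha - \beta u, \\
\frac{ds}{dt} &= \beta u - \gamma s, \\
\frac{ds}{dt} = 0 \;&\Longrightarrow\; \frac{s}{u} = \frac{\beta}{\gamma}
  = \frac{\beta \, t_{1/2}}{\ln 2}, \qquad t_{1/2} = \frac{\ln 2}{\gamma}.
\end{align*}
```

Taking logarithms gives $\log(s/u) = \log \beta + \log t_{1/2} - \log \ln 2$, so if the processing rate $\beta$ is unchanged between two conditions, the difference in $\log(s/u)$ equals the difference in $\log t_{1/2}$. This is why the log-ratio of spliced to unspliced transcripts can be compared against differential half-life measurements.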

Discussion

GEDI is a specific formulation of a general framework for multi-sample single-cell analysis (Fig. 1a–d) in which a family of functions, parametrized by sample-specific factors, connects each biological cell state to its expected expression profile in each sample. The framework proposed here includes three key components. First, it requires the identification of invertible decoder functions that provide a map from gene expression space to cell state manifold and vice versa, allowing the development of generative models of single-cell data. This requirement distinguishes this framework from the unsupervised manifold alignment problem³⁷ and the majority of existing single-cell data integration approaches, such as those based on correlation analysis^3,38 or graph-based methods⁴ in which a given gene expression profile can be mapped to its cell state but not vice versa, or methods based on autoencoders³⁹ in which the encoder and decoder functions are not necessarily the inverse of each other.

Secondly, in the hierarchical model proposed here, sample-specific manifolds are drawn from a distribution around a mean manifold, with the mean manifold optionally expressed as a function of sample covariates. This probabilistic modeling of the manifold, which was inspired by the methodology used by LIGER⁴⁰ for inter-sample harmonization, separates our framework from LEMUR, another recent method for latent embedding regression²⁴, in which the latent space is a deterministic function of sample covariates without accounting for variation among biological replicates. Previous work has shown that properly modeling sample-to-sample variability is key for unbiased differential expression analysis in single-cell data^5,10, which may partially explain the superior performance of GEDI in our simulation-based benchmarking tests (Fig. 4). We note, however, that more extensive analyses are needed to better understand the effects of factors that may influence the performance of GEDI and other DE analysis methods, including the number of differentially expressed genes per cell, the magnitude of their differential expression, cell-cell and gene-gene correlation structures, inter- and intra-sample variances, the number of samples, the number of cells per sample, and the sequencing depth.

Third, the hierarchical framework proposed here models each cell as a sample drawn from a distribution around the manifold of each sample, followed by probabilistic sampling of the observed data from the latent gene expression profile of the cell. This hierarchical structure allows the same model to be generalized to different observation types by employing different data-generating distributions, and underlies the distinguishing feature of GEDI to not only model the latent space of mRNA expression, but also the stability and splicing latent spaces of single cells. We expect this functionality to be useful for analysis of other single-cell modalities that represent the ratio of two biological measurements, as summarized in Supplementary Fig. 14, enabling a range of analyses based on those modalities, including batch correction and cluster/cell type analysis (similar to Fig. 6a–c). This hierarchical model also provides a natural method for denoising and/or imputation of single-cell data: we treat the true profile of each cell as a latent variable, the expected value of which can be calculated conditional on the observed data and the maximum a posteriori estimate of the model parameters (examples can be found in Fig. 6d and Supplementary Fig. 15). As shown in Supplementary Fig. 16, our simulation results underline the unique ability of GEDI to impute the ratios of paired observations (such as spliced vs. unspliced mRNA abundances), but it remains to be tested whether GEDI’s imputations for gene-level expression values are also competitive with existing single-cell imputation methods.

In addition, our formulation enables direct modeling of the parameters of the decoder (and, therefore, the manifold defined by that decoder) as functions of prior knowledge in the form of gene sets. This direct incorporation of prior biological knowledge into the GEDI model, which was inspired by previous work on latent variable modeling in bulk expression data⁸, provides several unique advantages. First, GEDI penalizes principal axes that cannot be expressed as a linear combination of gene-level prior knowledge, facilitating the downstream interpretability of the resulting manifold axes. Secondly, gene set activities can be projected onto individual cells, thus providing a natural approach to perform single-cell gene set enrichment analysis. This flexible framework enables the study of different types of regulatory mechanisms depending on the observations modelled. For example, it can be employed for the study of transcriptional regulators if gene-level counts are given as input (e.g., Supplementary Fig. 9), or for the analysis of regulatory networks that modulate mRNA stability if paired spliced and unspliced transcript counts are provided (e.g., Supplementary Fig. 17c). Thirdly, the activity gradient of each gene set can be directly compared to any transcriptomic vector field, such as the vector field associated with a specific sample-level feature, ultimately associating the activity of known gene sets with gene expression shifts observed across conditions (Fig. 5d–g).

We envision several potential extensions of this work. First, as noted earlier, our framework can be extended to different parametric manifold approximation approaches (GEDI currently supports linear manifold learning, with the option to further restrict the manifold to a hyperellipsoid). Secondly, it provides a natural path for extension of its concepts to multi-modal mosaic integration⁴⁰: by using a different family of decoder functions for each modality, we can obtain simultaneous mapping from any given biological state to the spaces of different modalities. The proposed generative model can also naturally handle missing data, simply by excluding missing observations from the calculation of the a posteriori probability. Third, this framework can be used not only for de novo modeling of sample-specific manifolds, but also for post-hoc analysis of the harmonized space obtained by other existing methods. To showcase this potential, we used the integrated space identified by Harmony⁴¹ for the samples in cohort 1 of the COVID-19 dataset, and asked GEDI to identify sample-specific transformations that can approximate Harmony’s integrated space, model those sample-specific transformations as a function of disease status, and calculate the transcriptomic vector field of mild COVID-19 relative to healthy controls, as shown in Supplementary Fig. 18. Fourth, the transcriptomic vector fields obtained by GEDI may have applications beyond clustering-free DE analysis. For example, earlier studies⁴² have shown the utility of transcriptomic vector fields in prediction of cell fate transitions if the vector field represents “velocity” (gene expression change as a function of time). Furthermore, as GEDI’s vector field extends beyond the regions of the manifold that are occupied by observed cells, it may provide an opportunity for counterfactual prediction in previously unobserved cell types, such as prediction of response to specific perturbations^43,44. These potential abilities, however, currently remain untested.

Overall, the framework presented here unifies a range of concepts that are central to single-cell data analysis, including multi-sample data integration, cluster-free DGE analysis, imputation and denoising, pathway and GRN activity analysis using prior information, downstream model interpretation, and analysis of different modalities with distinct data-generating processes.

Methods

The GEDI framework

Consider a dataset with measurements for G genes/events in N cells. Let the column vector y_n∈ℝ^G denote the expression values for G genes in cell n∈{1,…,N}. The vectors y_n (n∈{1,…,N}) together form the matrix Y∈ℝ^G×N, in which each column n can be considered an observation in a G-dimensional space. We further assume that these observations lie near a lower-dimensional manifold, so that some function ψ can reconstruct each y_n from a lower-dimensional column vector b_n∈ℝ^K (K < G):

$${{{{\rm{\psi }}}}}_{\theta }:{{\mathbb{R}}}^{K}\to {{\mathbb{R}}}^{G}$$

(1)

$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$

(2)

Here, θ is the set of parameters that define the manifold, and b_n represents the embedding of the n’th observation on this manifold.

Furthermore, the N cells in the dataset may belong to different samples. Each sample i∈{1,…,Q} may have a different (distorted) manifold, defined by the parameter set θ_i = θ_r + Δθ_i. Therefore, when the cells are derived from multiple samples, each observation y_n can be modeled as:

$$\theta={\theta }_{r}+\triangle {\theta }_{i\left(n\right)}$$

(3)

$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$

(4)

Here, Δθ_i(n) represents the difference between θ_r, the parameter set that defines a “reference” manifold, and θ_i(n), the parameter set that defines the manifold of the sample to which cell n belongs (we denote the sample from which the cell n is derived as i(n)). This formulation allows direct mapping between the cells (each defined by a constant embedding b) across multiple samples (through sample-specific manifold parameterization).

The general concept above can potentially be adapted to various parametric manifold learning methods; GEDI represents a specific case, in which the gene expression manifold is modeled as a K-dimensional hyperplane (with the option to further restrict the manifold shape, as described later). This is achieved by defining the function ψ as:

$$\theta=\left\{{{{\bf{o}}}},{{{\bf{Z}}}}\right\}$$

(5)

$${{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)={{{\bf{o}}}}{{{\boldsymbol{+}}}}{{{\bf{Z}}}}{{{{\bf{b}}}}}_{n}$$

(6)

In a multi-sample analysis, sample-specific parameter sets are defined as:

$${\theta }_{i}=\left\{{{{{\bf{o}}}}}_{r}+\triangle {{{{\bf{o}}}}}_{i},{{{{\bf{Z}}}}}_{r}+\triangle {{{{\bf{Z}}}}}_{i}\right\}$$

(7)

$$\Theta=\left\{{\theta }_{1},\ldots,{\theta }_{Q}\right\}$$

(8)

Here, the column vector o_r∈ℝ^G represents the origin point (center) on a reference hyperplane, and Δo_i represents the sample-specific translation of the origin point. Each of the columns of the matrix Z_r∈ℝ^G×K represents a vector that originates from point o_r and lies on the reference hyperplane. By default, GEDI restricts these K vectors to be orthogonal to each other, effectively forming the orthogonal axes of a coordinate frame in which the position of each point on the reference manifold can be specified as b_n. ΔZ_i represents the sample-specific transformations of this coordinate frame (excluding translation, which is specified by Δo_i).

Thus, GEDI approximates each observation y_n as:

$${{{{\bf{y}}}}}_{n}\cong {{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}$$

(9)

More precisely, y_n is modeled as an observation drawn from a spherical multivariate normal distribution whose mean is located on the manifold of sample i(n):

$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$

(10)

$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$

(11)

Note the addition of the term s_n1_G here, which serves as a cell-specific intercept; s_n is a scalar (representing library size), and 1_G is a column vector of ones, 1_G = (1,…,1)^T∈ℝ^G.
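As a minimal sketch of Eq. (11), the following plain-Python snippet evaluates the mean μ_n for a single cell. The function name `cell_mean` and the toy values are illustrative assumptions and are not part of the GEDI package.

```python
def cell_mean(o_r, delta_o, Z_r, delta_Z, b_n, s_n):
    """Evaluate mu_n = o_r + delta_o_i + (Z_r + delta_Z_i) b_n + s_n * 1_G.

    o_r, delta_o : lists of length G
    Z_r, delta_Z : G x K matrices stored as lists of rows
    b_n          : list of length K (embedding of cell n)
    s_n          : scalar cell-specific intercept
    """
    G, K = len(o_r), len(b_n)
    mu = []
    for g in range(G):
        # Row g of the sample-specific frame (Z_r + delta_Z_i) applied to b_n.
        proj = sum((Z_r[g][k] + delta_Z[g][k]) * b_n[k] for k in range(K))
        mu.append(o_r[g] + delta_o[g] + proj + s_n)
    return mu


# Hypothetical toy values: G = 3 genes, K = 2 latent dimensions.
o_r = [1.0, 0.0, -1.0]
delta_o = [0.5, 0.0, 0.0]
Z_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
delta_Z = [[0.0, 0.0], [0.0, 0.5], [0.0, 0.0]]
b_n = [2.0, -1.0]
mu_n = cell_mean(o_r, delta_o, Z_r, delta_Z, b_n, s_n=0.25)
print(mu_n)
```

Because b_n is shared across samples, evaluating the same embedding with a different sample's Δo_i and ΔZ_i gives the position of the same cell state on that sample's manifold.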

The column vectors b_{n∈{1,…,N}} together form the matrix B∈ℝ^K×N, in which each column n can be considered the embedding of the cell n in the manifold. Since the scales of Z_r + ΔZ_i and B are redundant (each column of Z_r + ΔZ_i can be scaled by some constant c and the corresponding row of B can be scaled by c^–1 without changing the model likelihood), GEDI restricts B such that each row forms a unit vector:

$$\mathbf{B}\in \left\{\mathbb{R}^{K\times N}\;\Big|\;\forall k\in \left\{1,\ldots,K\right\}:\ \sum_{n=1}^{N}\left(b_{k,n}\right)^{2}=1\right\}$$

(12)
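The scale redundancy and the unit-norm constraint of Eq. (12) can be illustrated with a short sketch: dividing row k of B by its norm c_k and multiplying column k of the coordinate frame by c_k leaves their product unchanged. The helper names `normalize_rows` and `matmul` are hypothetical, and this is not a description of how GEDI enforces the constraint during optimization.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of A (rows x inner) and B (inner x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalize_rows(Z, B):
    """Rescale so that each row of B has unit Euclidean norm while Z B is unchanged.

    Z : G x K matrix (list of rows); B : K x N matrix (list of rows).
    Assumes no row of B is all zeros.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in B]
    B_unit = [[v / c for v in row] for row, c in zip(B, norms)]
    # Column k of Z absorbs the scale c_k removed from row k of B.
    Z_scaled = [[z * c for z, c in zip(row, norms)] for row in Z]
    return Z_scaled, B_unit


# Hypothetical toy values: G = 3, K = 2, N = 3.
Z = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
B = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]
Z_s, B_u = normalize_rows(Z, B)
print(B_u)
```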

Other constraints may also be added to further limit the shape of the manifold. For example, B may be restricted to the points on an ellipsoid; in other words:

$$\begin{aligned}\mathbf{B}\in{} &\left\{\mathbb{R}^{K\times N}\;\Big|\;\forall k\in \left\{1,\ldots,K\right\}:\ \sum_{n=1}^{N}\left(b_{k,n}\right)^{2}=1\right\} \\ &\cap \left\{\mathbb{R}^{K\times N}\;\Big|\;\exists\,\mathbf{d}\in \mathbb{R}_{>0}^{K}\ \text{s.t.}\ \forall n\in \left\{1,\ldots,N\right\}:\ \sum_{k=1}^{K}\left(b_{k,n}/d_{k}\right)^{2}=1\right\}\end{aligned}$$

(13)

Here, the vector d contains the lengths of the semi-axes of the ellipsoid, with d_k representing the kth element of d.

Modeling the reference manifold as a function of gene-level variables

The reference manifold itself may be approximated as a function of gene-level prior knowledge, such as gene regulatory networks or pathway memberships, by expressing the parameter set θ_r as a function of gene-level variables. To this end, GEDI expresses Z_r as a probabilistic function of C∈ℝ^G×P, a matrix of gene-level prior information:

$${{{{\bf{z}}}}}_{r,k}\left|{{{{\bf{a}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{C}}}}{{{{\bf{a}}}}}_{k},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$

(14)

$${{{{\bf{a}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{A}{{{\bf{I}}}}\right)$$

(15)

Here, z_r,k∈ℝ^G is the k’th column of Z_r, and a_k∈ℝ^P is a column vector whose P elements represent the contribution of each of the P columns of C toward determining the direction of the axis vector z_r,k. This formulation is similar to (and inspired by) that used by PLIER⁸ for pathway-level information extraction from gene expression data. S_Z is a hyperparameter that determines the variance of z_r,k relative to the model variance σ². Similarly, S_A is a hyperparameter that determines the variance of the prior distribution of a_k relative to the model variance σ² (see Supplementary Methods for the choice of S_Z, S_A, and other hyperparameters).

The column vectors a_{k∈{1,…,K}} together form the matrix A∈ℝ^P×K. For example, C may represent a (weighted) regulatory network connecting P transcription factors to G genes, in which case the element a_p,k of the matrix A corresponds to the contribution of the network of the transcription factor p toward the axis vector z_r,k. Ab_n can be considered as the projected “activity” of the P transcription factors in the cell n (Ab_n∈ℝ^P). In other words, α_p(b)=a_p,.b is the function that provides the projected activity of the transcription factor p at coordinate b in the Z_r coordinate system, where a_p,. is the p’th row of A. Consequently, the gradient of the activity of transcription factor p in the Z_r coordinate system (∇α_p) is a_p,.^⊤, which can be transformed to the gene expression coordinate system as Z_r∇α_p= Z_ra_p,.^⊤.
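The two quantities defined in this paragraph, the projected activities Ab_n and the activity gradient Z_ra_p,.^⊤ in gene expression space, can be sketched as follows. The function names and toy matrices are illustrative assumptions, not GEDI's API.

```python
def activities(A, b):
    """Projected activity of each of the P gene sets at embedding b: A b (length P)."""
    return [sum(a_pk * b_k for a_pk, b_k in zip(row, b)) for row in A]


def activity_gradient_in_gene_space(Z_r, a_p):
    """Map the gradient a_p^T (length K) to gene-expression space: Z_r a_p^T (length G)."""
    return [sum(z_gk * a_k for z_gk, a_k in zip(row, a_p)) for row in Z_r]


# Hypothetical toy values: P = 2 gene sets, K = 2 axes, G = 3 genes.
A = [[0.5, 0.0], [-1.0, 2.0]]
Z_r = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_n = [2.0, 1.0]

act = activities(A, b_n)                             # activity of each gene set in cell n
grad = activity_gradient_in_gene_space(Z_r, A[1])    # gradient for the second gene set
print(act, grad)
```

Since α_p is linear in b, its gradient is the same at every point of the latent space; only its image under Z_r is needed to compare it with a vector field expressed in gene expression coordinates.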

In the absence of gene-level prior information, the following prior is used for z_r,k:

$${{{{\bf{z}}}}}_{r,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$

(16)

Modeling sample-specific distortions of the manifold as a function of sample-level variables

The difference between the manifold of each sample i and the reference manifold can be expressed using the difference of the parameter sets that define these manifolds, i.e., Δθ_i. In the case of GEDI, we have: Δθ_i ={Δo_i,Δz_i,1,…, Δz_i,K}. Each of the components can in turn be expressed as a function of sample-level variables:

$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}\left|{{{{\bf{R}}}}}_{o}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{o}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$

(17)

$${{{{\bf{R}}}}}_{o}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{o}}{{{\bf{I}}}}\right)$$

(18)

$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}\left|{{{{\bf{R}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{k}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$

(19)

$${{{{\bf{R}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{k}}{{{\bf{I}}}}\right)$$

(20)

Here, h_i∈ℝ^L is a column vector whose elements represent the values of L variables for sample i. R_o∈ℝ^G×L and R_k∈ℝ^G×L are matrices that represent the effects of the L variables on Δo_i and Δz_i,k, respectively. S_Δo_i and S_ΔZ_i are sample-specific hyperparameters that specify the variance of Δo_i and Δz_i,k relative to the model variance σ². Similarly, S_R_o and S_R_k are hyperparameters that determine the variance of the prior distributions of R_o and R_k relative to the model variance σ².
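To make the role of h_i concrete, the sketch below computes the prior mean R_oh_i of Eq. (17) for two hypothetical samples whose covariate vectors consist of an intercept and a binary condition indicator. The encoding of h_i, the function name `prior_mean_shift`, and the numbers are assumptions made for illustration only.

```python
def prior_mean_shift(R, h):
    """Mean of the Gaussian prior in Eqs. (17) and (19): R h.

    R : G x L matrix (list of rows); h : list of length L.
    """
    return [sum(r * x for r, x in zip(row, h)) for row in R]


# Hypothetical toy values: G = 3 genes, L = 2 covariates (intercept, condition).
R_o = [[0.1, 1.0], [0.0, -0.5], [0.2, 0.0]]
h_control = [1.0, 0.0]
h_case = [1.0, 1.0]

shift_control = prior_mean_shift(R_o, h_control)
shift_case = prior_mean_shift(R_o, h_case)

# The expected difference between the two conditions is R_o (h_case - h_control),
# which here equals the second column of R_o.
difference = [c - k for c, k in zip(shift_case, shift_control)]
print(difference)
```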

In the absence of sample-level variables, Δo_i and Δz_i,k are modeled using the following prior distributions:

$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$

(21)

$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$

(22)

Direct inference from count data

Consider the count matrix M∈ℤ^G×N, with each element m_g,n generated from a Poisson distribution with mean λ_g,n. GEDI models each λ_g,n as a latent variable drawn from a log-normal distribution, so that y_n = (logλ_1,n,…,logλ_G,n)^T∈ℝ^G follows a spherical multivariate normal distribution as in the previous sections. Thus:

$${m}_{g,n} \sim {{{\rm{Pois}}}}\left({e}^{{y}_{g,n}}\right)$$

(23)

$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$

(24)

$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$

(25)
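For a single gene and cell, Eqs. (23) and (24) combine a Poisson likelihood for the count with a Gaussian density for the latent log-rate. The sketch below writes out both terms; it is a plain illustration of the model and does not reproduce GEDI's expectation-maximization procedure, and the function names are hypothetical.

```python
import math


def poisson_loglik(m, y):
    """log P(m | y) for m ~ Pois(exp(y)), where y is the latent log-rate (Eq. 23)."""
    return m * y - math.exp(y) - math.lgamma(m + 1)


def gaussian_logpdf(y, mu, sigma2):
    """log N(y; mu, sigma2) for one coordinate of the spherical normal in Eq. (24)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)


def joint_logdensity(m, y, mu, sigma2):
    """log P(m | y) + log P(y | mu): the posterior of y given m, up to a constant."""
    return poisson_loglik(m, y) + gaussian_logpdf(y, mu, sigma2)


# Hypothetical example: observed count of 3 with a manifold mean of log(3).
print(joint_logdensity(m=3, y=math.log(3), mu=math.log(3), sigma2=1.0))
```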

Inference from paired UMI counts

Consider the count matrices M₁∈ℤ^G×N and M₂∈ℤ^G×N, with each pair of elements m_1,g,n and m_2,g,n corresponding to success and failure counts, respectively, in m_g,n = m_1,g,n + m_2,g,n Bernoulli trials with success probability p_g,n. GEDI models y_n = (logitp_1,n,…,logitp_G,n)^T∈ℝ^G as a latent variable:

$${m}_{1,g,n} \sim {{{\rm{B}}}}\left({m}_{g,n},{{{\rm{S}}}}\left({y}_{g,n}\right)\right)$$

(26)

$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$

(27)

$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$

(28)

Here, S is the sigmoid function.
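The binomial observation model of Eq. (26) can be sketched in the same way, with the success probability obtained by applying the sigmoid to the latent logit y. The function names are hypothetical, and the simple form used here is meant for moderate values of y (very large |y| would overflow `math.exp`).

```python
import math


def sigmoid(y):
    """S(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def binomial_loglik(m1, m2, y):
    """log P(m1 | m1 + m2, S(y)) for m1 ~ B(m1 + m2, S(y)) as in Eq. (26).

    m1, m2 : success and failure counts (e.g., inclusion and exclusion reads)
    y      : latent logit of the success probability
    """
    m = m1 + m2
    log_choose = math.lgamma(m + 1) - math.lgamma(m1 + 1) - math.lgamma(m2 + 1)
    log_p = -math.log1p(math.exp(-y))   # log S(y)
    log_q = -math.log1p(math.exp(y))    # log (1 - S(y))
    return log_choose + m1 * log_p + m2 * log_q


# Hypothetical example: 3 successes and 1 failure, evaluated at y = logit(3/4) = log(3).
print(binomial_loglik(3, 1, math.log(3)))
```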

Obtaining maximum a posteriori estimates of model parameters

When the gene expression matrix Y is provided and no gene-level or sample-level prior information is available, GEDI uses a block coordinate descent algorithm to obtain maximum a posteriori estimates for Z_r, ΔZ_i, o_r, Δo_i and B. When Y is latent (i.e., when M or the pair M₁/M₂ is provided), Z_r is latent (i.e., gene-level variables are provided), and/or ΔZ_i and Δo_i are latent (i.e., sample-level variables are provided), GEDI uses expectation-maximization to obtain maximum a posteriori estimates (see Supplementary Methods for details).

Datasets and preprocessing

All datasets used in the article were downloaded using the accession codes given in the original publications, from the publications’ online data repositories, or from public data repositories. Quality control and preprocessing steps were performed with the scuttle package⁴⁵ (v 1.0.4). For further details, see Supplementary Methods.

Integration methods and benchmarks

We compared the integration performance of GEDI against several other methods, including Seurat⁴⁶, LIGER⁴⁰, Harmony⁴¹, BBKNN⁴⁷, scVI³⁹, Scanorama⁴⁸, and CSS³⁸, as well as PCA (no integration) as a baseline. We ran each method following the documentation obtained from available tutorials or paper methods. Unless specified, we used the default parameters established by each package. For further details, see Supplementary Methods.

We measured the ability of each method to remove technical variability while preserving biological variation associated with the cell types. To do this, we followed the approach established by previous integration benchmark efforts^16,17, which used metrics that can be grouped into two broad categories: (a) removal of batch effects and (b) conservation of biological variance. Metrics from group (a) included alignment score (batch), iLISI, kBET, and ASW (batch), while group (b) metrics included alignment score (cell type), cLISI, ARI, NMI and ASW (cell type). For further details on each individual metric, see the Supplementary Methods section.

For each method, we defined an overall score that summarized the performance of the multiple metrics, following the approach established by Luecken et al.¹⁶. Briefly, we first rescaled the output of every metric to range from 0 to 1, which ensures that each metric is equally weighted within a partial score and has the same discriminative power. The rescaling was done using the transformation y′ = [y − min(Y)]/[max(Y) − min(Y)]. Then, we defined the ‘Batch’ score and the ‘Bio’ score, representing the average for the metrics belonging to the groups (a) and (b) above, respectively. For a given integration method, we calculated the overall score as previously defined by Luecken et al.¹⁶, where a weighted average of the two Bio and Batch scores was used, with a weight of 0.6 for the Bio score and 0.4 for the Batch score (integration metrics using different weights can be found in Supplementary Fig. 3). Additional details are found in Supplementary Methods.
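The scoring scheme described above can be sketched as follows, assuming that Y in the rescaling formula denotes the values of one metric across the compared methods. The function names, the handling of a constant metric, and the example numbers are assumptions for illustration.

```python
def min_max_rescale(values):
    """Rescale one metric across methods to [0, 1]: (y - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a metric that is constant across methods is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean(values):
    return sum(values) / len(values)


def overall_score(bio, batch, w_bio=0.6, w_batch=0.4):
    """Weighted average of the Bio and Batch scores (weights as stated in the text)."""
    return w_bio * bio + w_batch * batch


# Hypothetical raw values of one metric for three methods.
rescaled = min_max_rescale([2.0, 4.0, 3.0])
print(rescaled)

# Hypothetical rescaled metrics for a single method, grouped as (a) batch and (b) bio.
batch_score = mean([0.5, 0.5])
bio_score = mean([1.0, 1.0, 1.0])
print(overall_score(bio_score, batch_score))
```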

Differential expression analysis

We performed pseudobulk differential expression (DE) analysis on the COVID-19 data using DESeq2⁴⁹ (v.1.30.1). For each cell type and donor combination, we created a pseudobulk using the ‘aggregateAcrossCells’ function from scuttle⁴⁵. Only cell types that were present in all three conditions (control, mild and severe COVID-19) were considered for DE analysis.

Cluster-free differential expression benchmark

We conducted a systematic analysis to evaluate the performance of GEDI in clustering-free DGE analysis. Our assessment included the comparison of GEDI to two clustering-free DGE methods: LEMUR²⁴ and miloDE²⁵, as well as a comparison to DESeq2-based pseudobulk analysis. First, we developed a generative model that simulates cohort-level scRNA-seq data while preserving characteristics such as gene-gene and cell-cell correlations observed in real data, as discussed in detail in Supplementary Methods. Briefly, for each sample i, we simulate K archetypes representing a set of vertices on the gene expression manifold. If we represent the expression of each gene g across the K archetypes using the column vector γ_g,i∈ℝ^K, then we have:

$${{{{\boldsymbol{\gamma }}}}}_{g,i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{X}}}}}_{g}{{{{\bf{h}}}}}_{i}+{{{{\bf{x}}}}}_{g}^{{\prime} }{{{{\bf{h}}}}}_{i}{{{{\bf{1}}}}}_{K},{{{{\boldsymbol{\Sigma }}}}}_{g}\right)$$

(29)

where X_g is a matrix of coefficients (X_g∈ℝ^K×L) that shows, for each of the K archetypes, how the mean expression of gene g is determined by each of the L sample-level variables represented by the column vector h_i. x’_g∈ℝ^L is a row vector that represents the global effects of sample-level variables on the observed abundance of gene g, e.g., through affecting ambient RNA abundances or other artifacts⁵⁰. Σ_g∈ℝ^K×K is a covariance matrix for gene g.

Each cell n is then modeled as a probabilistic function of the weighted average of the archetypes of its sample:

$${y}_{g,n}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{w}}}}}_{n}{{{\boldsymbol{\cdot }}}}{{{{\boldsymbol{\gamma }}}}}_{g,i\left(n\right)},{\sigma }_{g}^{2}\right)$$

(30)

where w_n∈ℝ₊^K is a row vector of weights connecting cell n to each of the K archetypes (Σ_kw_n,k = 1), and σ²_g is a gene-specific variance. Thus, the ground-truth DE effect of the L sample-level variables on gene g in cell n can be derived as the row-vector δ_n,g=w_nX_g.
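The ground-truth DE effect δ_n,g = w_nX_g is a weighted average of the archetype-level coefficients. A minimal sketch, with a hypothetical function name and toy values:

```python
def de_effect(w_n, X_g):
    """delta_{n,g} = w_n X_g.

    w_n : archetype weights of cell n (length K, non-negative, summing to 1)
    X_g : K x L coefficient matrix of gene g (list of rows)
    Returns a list of length L, one effect per sample-level variable.
    """
    K, L = len(w_n), len(X_g[0])
    return [sum(w_n[k] * X_g[k][l] for k in range(K)) for l in range(L)]


# Hypothetical toy values: K = 2 archetypes, L = 2 sample-level variables.
w_n = [0.25, 0.75]
X_g = [[0.0, 2.0], [0.0, -2.0]]
delta = de_effect(w_n, X_g)
print(delta)
```

In this toy case the second variable up-regulates the gene in the first archetype and down-regulates it in the second, so a cell that sits closer to the second archetype has a negative net effect.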

Finally, for each cell n, the UMI counts are simulated as:

$${p}_{g,n}=\frac{{c}^{{y}_{g,n}}}{{\sum }_{g=1}^{G}{c}^{{y}_{g,n}}}$$

(31)

$${{{{\bf{m}}}}}_{n} \sim {{{{\rm{Multinom}}}}}_{G}\left({M}_{n},{{{{\bf{p}}}}}_{n}\right)$$

(32)

where c is the base of the logarithm for the simulated log-scale profiles and M_n is the total UMI count for cell n. This framework allows us to simulate UMI counts starting from a given set of X_g, Σ_g, and σ_g for all g∈{1,…,G}, and w_n and M_n for all n∈{1,…,N}; we call these the simulation parameters, which can be modified to obtain different simulated datasets with varying numbers of cells, cell-cell or gene-gene correlation structures, cellular diversities, or ground-truth DE effects. In this work, we used the COVID-19 cohort 1 dataset as a template to derive realistic simulation parameters, as described in Supplementary Methods, and generated a simulated dataset with the same number of cells and samples as those of the COVID-19 dataset. We then applied GEDI, LEMUR, miloDE, and DESeq2-based pseudobulk analysis to this dataset (see Supplementary Methods), and compared the differential expression estimates from each method to the ground truth DE vectors at the single-cell (GEDI and LEMUR), neighborhood (GEDI, LEMUR and miloDE) and cell-type levels (GEDI, LEMUR and DESeq2). For further details, see Supplementary Methods.
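The last step of the simulation, Eqs. (31) and (32), can be sketched with the Python standard library. The multinomial draw is obtained by sampling M_n categorical outcomes and tallying them. The function names, the default base c = e, and the fixed seed are assumptions for this illustration.

```python
import math
import random
from collections import Counter


def expression_to_probs(y_n, c=math.e):
    """Eq. (31): p_g = c**y_g / sum_g c**y_g, computed stably by subtracting max(y)."""
    y_max = max(y_n)
    weights = [c ** (y - y_max) for y in y_n]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_umi_counts(y_n, M_n, c=math.e, seed=0):
    """Eq. (32): draw m_n ~ Multinom_G(M_n, p_n) by tallying M_n categorical draws."""
    p = expression_to_probs(y_n, c)
    rng = random.Random(seed)
    draws = rng.choices(range(len(y_n)), weights=p, k=M_n)
    tally = Counter(draws)
    return [tally.get(g, 0) for g in range(len(y_n))]


# Hypothetical toy profile of G = 3 genes on a log2 scale.
y_n = [0.0, 1.0, 2.0]
p_n = expression_to_probs(y_n, c=2.0)
m_n = simulate_umi_counts(y_n, M_n=1000, c=2.0)
print(p_n, m_n)
```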

Cell type signature analysis

For the generation of cell-type signatures in PBMC data, we performed DE analysis using DESeq2 for each donor. Our model compared the mean expression of a given cell type versus the average of other cell types, including the batch variable (different sequencing technologies) as a covariate in the model. We considered genes with FDR-adjusted p-value < 0.05 and log₂ fold-change > 1 to define cell type markers. For a given donor, we defined a binary matrix that contained the cell-type markers, which was used as input for applying GEDI on the other donor.

To perform the enrichment of cell type signatures and TF activities, we applied the scoreMarkers function from scran⁵¹ (v.1.30.0) using the inferred cell type/TF activities by GEDI.

Gene regulatory network analysis

For analysis of transcription factor (TF) networks (in PBMCs), we downloaded the human DoRothEA gene-regulatory network⁵² and restricted the TF-gene interactions to the high-confidence sets A, B, and C, as previously defined. We then refined this TF-gene interaction matrix, using the whole blood data from GTEx⁵³, to obtain a generative gene regulatory model in which the expression profile of each gene is a function of the expression of the TFs that are linked to it. Specifically, the target gene expression matrix Y∈ℝ^G×N (which is TMM-normalized⁵⁴ and converted to log-scale) is modeled as Y ~ CA, where A∈ℝ^P×N is the matrix of the expression of P transcription factors across N whole blood samples from GTEx, and C∈ℝ^G×P is a weighted regulatory network whose elements are restricted to have the same sign as the regulatory interactions obtained from DoRothEA (c_g,p ≥ 0 for an activating interaction between transcription factor p and gene g, c_g,p ≤ 0 for an inhibitory interaction, and c_g,p = 0 for no interaction). C was further filtered to include only TFs (columns) that have at least 10 “substantial” regulatory interactions, where we define a substantial effect as having an absolute value > 0.1.
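The final filtering step, which keeps only TFs with at least 10 substantial interactions, can be sketched as below. The function name and the toy network are hypothetical; the thresholds follow the values stated in the text.

```python
def filter_tf_columns(C, min_targets=10, threshold=0.1):
    """Keep columns (TFs) of C with at least min_targets entries where |c_gp| > threshold.

    C : G x P weighted regulatory network stored as a list of rows.
    Returns the indices of the retained TFs and the filtered matrix.
    """
    P = len(C[0])
    keep = [
        p for p in range(P)
        if sum(1 for row in C if abs(row[p]) > threshold) >= min_targets
    ]
    return keep, [[row[p] for p in keep] for row in C]


# Hypothetical toy network: 12 genes, 2 TFs.
# TF 0 has 12 substantial weights; TF 1 has only 3 (the rest are 0.05).
C = [[0.5, 0.5 if g < 3 else 0.05] for g in range(12)]
kept, C_filtered = filter_tf_columns(C)
print(kept)
```

Taking the absolute value means that activating (positive) and inhibitory (negative) interactions both count toward the minimum number of targets.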

For analysis of post-transcriptional regulatory networks, we defined a regulatory network of RNA binding proteins (RBPs) and microRNAs (miRNAs) based on their potential interactions with mRNA 3’ UTRs. For the RBPs, we downloaded the set of known motifs for all human RBPs from CISBP-RNA⁵⁵, and filtered them to include only the non-redundant set of motifs described in ref. ⁵⁶. Then, for each human gene, the transcript with the longest 3’ UTR was identified. Genes whose longest 3’ UTR was shorter than 650nt were removed, and the first 650nt of the 3’ UTR (i.e., the 650nt region immediately downstream of the stop codon) of the remaining genes was used for motif scanning, using AffiMx⁵⁷. The miRNA regulons were obtained from ref. ³⁴.

Assessment of estimated TF activities

We evaluated the accuracy of GEDI at estimating changes in TF activity by analyzing a previously published single-cell perturbation dataset²⁸. Our evaluation also included a comparison to decoupleR⁵⁸ (v.2.8.0), an ensemble of computational frameworks to infer biological activities. The methods we included in our evaluation were “aucell”, “gsva”, “mlm”, “ora”, “ulm”, “viper”, “wmean”, and “wsum”. For wmean and wsum, we utilized the normalized outputs “norm_wmean” and “norm_wsum”.

The dataset consisted of two technical batches which were split and analyzed separately. One batch was used to generate a gene signature for each TF and also measure the degree of association of each TF with the principal axes of heterogeneity of the data, while the other was used to estimate TF activities using the learned gene regulatory effects. We opted to learn the gene signature of each TF from the data (as opposed to using an external gene regulatory network) to ensure that our results purely reflected the performance of the activity inference methods without the confounding effect of the quality of the external gene network. At the same time, by using one batch for learning gene-TF associations and the other for evaluating activity inferences, we aimed to avoid circularity. These steps were repeated by swapping the batches. To generate the gene signature of each TF, we performed differential expression analysis for each TF between the cells with perturbed TF expression versus unperturbed cells, using the scoreMarkers function from scran. The mean standardized Cohen’s d log-fold-change was used to define the regulatory effect of a TF on each gene. To identify the TFs whose perturbation is associated with a significant change in the principal axes of heterogeneity of the single-cell RNA-seq data, we ran PCA on the normalized expression data and retrieved the first 40 principal components. Next, for each TF, we fitted a logistic regression model to predict the TF perturbation status of a cell using its principal component scores, and used a likelihood-ratio test to compare this model to a reduced model with only an intercept.

Single-cell TF activities were estimated for each method using the learned gene-regulatory signatures. For methods that only accept a gene list for each TF (aucell, fgsea, gsva and ora), we defined the downstream targets of each TF as the set of genes with log₂ fold-change <–0.3 in TF-perturbed cells in comparison to unperturbed cells. For the methods from decoupleR, normalized expression values were used as input, while for GEDI we used raw counts.

Differential analysis of cell type-specific cassette exon splicing

To perform DE between GABAergic and Glutamatergic cells in the Tasic data, we applied limma⁵⁹ (v.3.46) using the imputed value of the latent splicing matrix from GEDI, which represents the logit of PSI (percent-spliced-in). We applied the lmFit and eBayes functions to obtain DE estimates using the default parameters.

GSEA analysis was performed using fgsea⁶⁰ (v.1.16), with the following arguments: minSize=15, eps=0, and maxSize=500. For the analysis of cell-type exon inclusion events using the Tasic data, we used the M5 ontology gene sets from MSigDB⁶¹ (m5.all.v2022.1.Mm.symbols.gmt), using the t-statistics from limma as the gene ranks.

Sashimi plots were generated using sashimipy⁶² (v 0.0.6). For the Tasic dataset, we selected a subset of 100 cells for each cell type and dataset combination, generated a merged BAM file, and provided it as input to sashimipy.

Statistics and Reproducibility

Unless specified, cells were removed from the downloaded datasets based on standard quality control criteria (% of mitochondrial reads <20%; number of UMIs <1000, and number of detected genes <1000). No samples were collected in this study. No statistical method was used to predetermine sample size. Computational experiments and analysis are reproducible using the notebooks and code provided.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets used in the article were downloaded using the accession codes given in the original publications or from their online data repositories. PBMC data is available at GEO accession number GSE132044. Pancreas data is available at ArrayExpress accession number E-MTAB-5061 and at GEO accession number GSE84133. The Tabula Muris Bone Marrow data is available from the publication repository [https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102]. The COVID-19 dataset is available at the European Genome-phenome Archive (EGA) under accession number EGAS00001004571. The Tasic dataset is available at GEO accession number GSE71585 and GEO accession number GSE115746. The Faure dataset is available at GEO accession number GSE150150. The LaManno dataset is available at the original author’s repository [http://pklab.med.harvard.edu/velocyto/hgForebrainGlut/]. The Genga dataset is available at Zenodo accession number 10.5281/zenodo.3564178 [https://zenodo.org/doi/10.5281/zenodo.3564178]. Preprocessed data, embeddings, and GEDI models can be accessed via Zenodo (DOIs: 10.5281/zenodo.8222039, 10.5281/zenodo.8222697, 10.5281/zenodo.11163741, and 10.5281/zenodo.11164776). A description of the Zenodo files is available in Supplementary Table 1. Source data are provided with this paper.

Code availability

GEDI⁶³ is available via GitHub at https://github.com/csglab/GEDI and via Zenodo (DOI: 10.5281/zenodo.12761204). Reproducible notebooks for the analyses presented in this work can be found at https://github.com/csglab/GEDI_manuscript.

References

Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 1–14 (2016).
Van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729.e727 (2018).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Crowell, H. L. et al. Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e1817 (2016).
Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
Mao, W., Zaslavsky, E., Hartmann, B. M., Sealfon, S. C. & Chikina, M. Pathway-level information extractor (PLIER) for gene expression data. Nat. Methods 16, 607–610 (2019).
Luecken, M. D. & Theis, F. J. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Squair, J. W. et al. Confronting false discoveries in single-cell differential expression. Nat. Commun. 12, 5692 (2021).
Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015 (2010).
Gaidatzis, D., Burger, L., Florescu, M. & Stadler, M. B. Analysis of intronic and exonic reads in RNA-seq data characterizes transcriptional and post-transcriptional regulation. Nat. Biotechnol. 33, 722–729 (2015).
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Benegas, G., Fischer, J. & Song, Y. S. Robust and annotation-free analysis of alternative splicing across diverse cell types in mice. eLife 11, e73520 (2022).
Ding, J. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 1–32 (2020).
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e344 (2016).
Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature 562, 367 (2018).
Schulte-Schrepping, J. et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell 182, 1419–1440.e1423 (2020).
Korem, Y. et al. Geometry of the gene expression space of individual cells. PLoS Comput. Biol. 11, e1004224 (2015).
Persad, S. et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat. Biotechnol. 41, 1746–1757 (2023).
Ahlmann-Eltze, C. & Huber, W. Analysis of multi-condition single-cell data with latent embedding multivariate regression. bioRxiv, 2023.03.06.531268 (2023).
Missarova, A., Dann, E., Rosen, L., Satija, R. & Marioni, J. Sensitive cluster-free differential expression testing. bioRxiv, 2023.03.08.531744 (2023).
Medvedovic, J., Ebert, A., Tagoh, H. & Busslinger, M. Pax5: a master regulator of B cell development and leukemogenesis. Adv. Immunol. 111, 179–206 (2011).
Escobar, G., Mangani, D. & Anderson, A. C. T cell factor 1: A master regulator of the T cell response in disease. Sci. Immunol. 5, eabb9726 (2020).
Genga, R. M. J. et al. Single-Cell RNA-Sequencing-Based CRISPRi Screening Resolves Molecular Drivers of Early Human Endoderm Development. Cell Rep. 27, 708–718.e710 (2019).
Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).
Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
Feng, H. et al. Complexity and graded regulation of neuronal cell-type–specific alternative splicing revealed by single-cell RNA sequencing. Proc. Natl Acad. Sci. 118, e2013056118 (2021).
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
Faure, L. et al. Single cell RNA sequencing identifies early diversity of sensory neurons forming via bi-potential intermediates. Nat. Commun. 11, 4175 (2020).
Perron, G. et al. Pan-cancer analysis of mRNA stability for decoding tumour post-transcriptional programs. Commun. Biol. 5, 851 (2022).
Neumann, D. P., Goodall, G. J. & Gregory, P. A. The Quaking RNA‐binding proteins as regulators of cell differentiation. Wiley Interdiscip. Rev.: RNA 13, e1724 (2022).
Cheng, L.-C., Pastrana, E., Tavazoie, M. & Doetsch, F. miR-124 regulates adult neurogenesis in the subventricular zone stem cell niche. Nat. Neurosci. 12, 399–408 (2009).
Liu, J., Huang, Y., Singh, R., Vert, J. P. & Noble, W. S. Jointly Embedding Multiple Single-Cell Omics Measurements. Algorithms Bioinform 143, 10 (2019).
He, Z., Brazovskaja, A., Ebert, S., Camp, J. G. & Treutlein, B. CSS: cluster similarity spectrum integration of single-cell genomics data. Genome Biol. 21, 224 (2020).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e1817 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Qiu, X. et al. Mapping transcriptomic vector fields of single cells. Cell 185, 690–711.e645 (2022).
Bunne, C. et al. Learning single-cell perturbation responses using neural optimal transport. Nat. Methods 20, 1759–1768 (2023).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
McCarthy, D. J., Campbell, K. R., Lun, A. T. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e3529 (2021).
Polanski, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 1–21 (2014).
Fleming, S. J. et al. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nat. Methods 20, 1323–1335 (2023).
Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5 (2016).
Garcia-Alonso, L., Holland, C. H., Ibrahim, M. M., Turei, D. & Saez-Rodriguez, J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res. 29, 1363–1375 (2019).
Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, 1–9 (2010).
Ray, D. et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499, 172–177 (2013).
Alkallas, R., Fish, L., Goodarzi, H. & Najafabadi, H. S. Inference of RNA decay rate from transcriptional profiling highlights the regulatory programs of Alzheimer’s disease. Nat. Commun. 8, 909 (2017).
Lambert, S. A., Albu, M., Hughes, T. R. & Najafabadi, H. S. Motif comparison based on similarity of binding affinity profiles. Bioinformatics 32, 3504–3506 (2016).
Badia, I. M. P. et al. decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinform Adv. 2, vbac016 (2022).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Korotkevich, G. et al. Fast gene set enrichment analysis. bioRxiv, 060012 (2021).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Zhang, Y., Zhou, R. & Wang, Y. Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data. bioRxiv, 2022.11.02.514803 (2022).
Madrigal, A. & Najafabadi, H. S. A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data. https://doi.org/10.5281/zenodo.12761204 (2024).

Acknowledgements

We thank Adrien Osakwe for his help with an earlier iteration of COVID-19 pseudobulk analysis. This work was supported by the Canadian Institutes of Health Research (CIHR) grant PJT-173317, New Frontiers in Research Fund grant NFRFE-2019-00975, and resource allocations from Digital Research Alliance of Canada to HSN. AM is supported by a doctoral training award from Fonds de Recherche du Québec Santé. HSN holds a CIHR Canada Research Chair.

Author information

Tianyuan Lu
Present address: Department of Population Health Sciences, University of Wisconsin-Madison, Madison, WI, 53726, USA

Authors and Affiliations

Department of Human Genetics, McGill University, Montreal, QC, H3A 0C7, Canada
Ariel Madrigal, Larisa M. Soto & Hamed S. Najafabadi
Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, H3A 0G1, Canada
Ariel Madrigal, Larisa M. Soto & Hamed S. Najafabadi
Lady Davis Institute for Medical Research, Montreal, QC, H3T 1E2, Canada
Tianyuan Lu
Department of Statistical Sciences, University of Toronto, Toronto, ON, M5S 1A1, Canada
Tianyuan Lu
McGill Centre for RNA Sciences, McGill University, Montreal, Canada
Hamed S. Najafabadi

Authors

Ariel Madrigal
Tianyuan Lu
Larisa M. Soto
Hamed S. Najafabadi

Contributions

Methodology and conceptualization: A.M. and H.S.N. Mathematical derivation: H.S.N., with contributions from T.L. Code implementation: A.M. and H.S.N. Analysis: A.M., with contributions from L.M.S. and H.S.N. Visualization: A.M. and H.S.N. Writing: A.M. and H.S.N. Review and editing: T.L. and L.M.S. Study supervision and direction: H.S.N.

Corresponding author

Correspondence to Hamed S. Najafabadi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Natalie Davidson, Xiuwei Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Madrigal, A., Lu, T., Soto, L.M. et al. A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data. Nat Commun 15, 6573 (2024). https://doi.org/10.1038/s41467-024-50963-0

Received: 03 September 2023
Accepted: 23 July 2024
Published: 03 August 2024
DOI: https://doi.org/10.1038/s41467-024-50963-0
Springer Nature Limited
