Background

The scRNA-seq technology has emerged as a popular and powerful tool for profiling the transcriptomic landscape of individual cells within complex and heterogeneous systems. It has been revolutionizing our ability to dissect and understand various aspects of biology at the single-cell level, including developmental biology [1] and gene regulation [2], and has opened new avenues for exploring developmental biology, gene regulation, tissue heterogeneity, disease mechanisms, and evolutionary dynamics.

The scRNA-seq study usually involves the integrative analysis of transcriptomic data of individual cells measured with different technologies and derived from multiple tissues [1] of individuals with different phenotypes [2,3,4] across statuses [5] and even species. Indeed, the integrative analysis of heterogeneous scRNA-seq data to identify cell types or states and to compare gene expression across biological conditions has tremendous potential to transform our understanding of complex and heterogeneous biological systems [6]. A successful example of this practice is that joint analysis of scRNA-seq data from multiple melanoma tumors identifies an immune resistance program in malignant cells, which predicts clinical responses to immunotherapy in melanoma patients [7]. However, the heterogeneous variation due to different sequencing experiments and different biological conditions, including donors, tissues, or phenotypes, makes the integrative analysis challenging.

Methods have been developed for the integrative analysis of scRNA-seq data [8,9,10,11,12,13,14]. Computational approaches, including BBKNN [8] and FastMNN[11], are proposed, which assume that the batch effect in scRNA-seq data is almost orthogonal to the biological variation [11]. Thus, their abilities to correct batch effects originating biologically are limited [15]. Seurat [12] matches cell states across biosamples by a shared correlation space defined by canonical correlation analysis. LIGER [6] employs nonnegative matrix factorization (NMF) to delineate shared and dataset-specific features of cells across biosamples. Harmony [9] integrates scRNA-seq data by projecting cells into a shared embedding. Scanorama [13] leverages the matches of cells with similar transcriptional profiles across biosamples to perform batch correction and integration. Most of the above methods focus on batch effect correction and cell identity annotation. They do not model the heterogeneous variation from different biological conditions and thus lack interpretability on how biological conditions affect the gene expression of cells. Moreover, it is also meaningful to model the effect of biological conditions on different cell subpopulations to elucidate the specific cell subpopulation and its related biological process that contributes to the disease pathogenesis.

To fill in the gap and address the above challenges, we develop a general and flexible statistical framework, scParser, which is based on an ensemble of matrix factorization and sparse representation learning. scParser directly models the variation from different biological conditions (e.g., donor, disease status, experimental time points) via gene modules, which bridge gene expression with the phenotype of interest and unveil the relevant biological processes together with their contributing genes. The rationale behind our modeling is that the biological conditions affect the activities of certain biological processes, which in turn affect gene expression; the gene modules in scParser are learnt adaptively from the data and encode the biological processes that are affected by the biological conditions. We also develop an extended version of scParser, which further models the interactive effects of biological conditions on different cell subpopulations and pinpoints the relevant biological processes within the specific cell subpopulation that may contribute to the disease pathogenesis. Empowered by gene modules, scParser can boost the signal of disease-associated genes. In addition, scParser can correct for batch effects and achieves favorable performance in cell clustering. To make scParser scalable to large-scale datasets, we incorporate a batch-fitting strategy. Importantly, the wide applicability of our proposed framework has been demonstrated via extensive applications to various biomedical studies.

Results

Overview of scParser

We propose a flexible computational method scParser (sparse representation learning for scalable single-cell RNA sequencing data analysis), presented in Fig. 1. Here we use donors and phenotypes (e.g., disease status) as an example of biological conditions. The RNA expression profiles of cells from different donors with different phenotypes are profiled (Fig. 1A). To facilitate interpretation, scParser models variation from multiple biological conditions (e.g., donor and phenotype) with matrix factorization and models cellular variation with sparse representation learning (Fig. 1B).

Fig. 1
figure 1

Overview of scParser. A The scRNA-seq data is obtained from biosamples from multiple biological conditions. Variations from biological conditions (e.g., donor and phenotypes) and technical artifacts may bring about batch effects in the scRNA-seq data. B scParser models the variation from donor and phenotype with matrix factorization and cellular variation with sparse representation learning. After cell population annotation, an extension of scParser enables revealing the effects of biological heterogeneous conditions (e.g., phenotypes) on different cell populations. C The output from scParser enables various interpretive data analyses, such as cell population annotation, connecting gene expression to biological conditions via biologically meaningful gene modules, revealing the heterogeneous effect of phenotypes on different cell populations via gene modules, and uncovering donors’ molecular characteristics

More specifically, the expression level of gene \(m\) for cell \(i\), \({z}_{im}\), which is obtained from donor \(j\) with phenotype \(t\), is modeled as

$${z}_{im}\approx {d}_{j}^{\intercal }{v}_{m}+{p}_{t}^{\intercal }{v}_{m}+{s}_{i}^{\intercal }{g}_{m},$$

where \({d}_{j},{p}_{t},{v}_{m}\) are vectors of length \({K}_{1}\), \({s}_{i},{g}_{m}\) are vectors of length \({K}_{2}\). The donor and phenotype information of samples is known. Here we name the above formulation as the vanilla scParser. The equation above has the following matrix form

$$Z\approx \left({X}^{D}D+{X}^{P}P\right)V+SG,$$
(1)

where \({V}^{{K}_{1}\times M}\) is the key component in scParser and represents \({K}_{1}\) metagenes/gene modules (here we use the two phrases metagenes and gene modules interchangeably), and \(M\) is the total number of genes. These gene modules in the matrix \(V\) summarize the expression patterns of thousands of genes to a few metagenes/gene modules (as \({K}_{1}\) is much smaller than \(M\)), which provides a high-level summary of the gene activities affected by the biological conditions; \({D}^{{N}_{1}\times {K}_{1}}\) and \({P}^{{N}_{2}\times {K}_{1}}\) are matrices for the latent representations of \({N}_{1}\) donors and \({N}_{2}\) phenotypes, respectively; \({D}^{{N}_{1}\times {K}_{1}}\) and \({P}^{{N}_{2}\times {K}_{1}}\) can also be interpreted as the expression level of the \({K}_{1}\) gene modules in the \({N}_{1}\) donors and \({N}_{2}\) phenotypes; \({S}^{N\times {K}_{2}},{G}^{{K}_{2}\times M}\) denote latent representations for \(N\) cells and \(M\) genes after modeling variation from donors and phenotypes; \({X}^{D},{X}^{P}\) are indicator matrices of the donor and phenotype labels for the cells, and they have \(N\) rows and \({N}_{1}\) and \({N}_{2}\) columns, respectively. The objective function for the equation with matrix representation is as follows:

$$\begin{array}{ll}\mathcal{L}\left(D,P,V,S,G\right)=& \frac{1}{2}{\left\| Z-\left({X}^{D}D+{X}^{P}P\right)V-SG\right\| }_{\text{F}}^{2}+\\ & \frac{1}{2}{\lambda }_{1}\left({\Vert D\Vert }_{\text{F}}^{2}+{\Vert P\Vert }_{\text{F}}^{2}+{\Vert V\Vert }_{\text{F}}^{2}\right)+\\ & {\lambda }_{2}\left[\frac{1}{2}\left(1-\alpha \right){\Vert S\Vert }_{\text{F}}^{2}+\alpha {\Vert S\Vert }_{1}\right]\\ \text{subject to}& {\Vert {G}_{k}\Vert }_{2}^{2}\le c,\forall k=1,\dots ,{K}_{2},\end{array}$$

where \({\lambda }_{1},\) \({\lambda }_{2},\alpha\) are tuning parameters, and \({\Vert \cdot \Vert }_{F}\) represents the Frobenius norm. \(c\) is a constant, restricting the scale of \({G}_{k}\), the \(k\) th row of \(G\). Alternating block coordinate descent (BCD) is utilized to optimize the objective.

The outputs from scParser provide informative inputs for various downstream analyses (Fig. 1C): cell populations/types can be identified with the cell latent representation matrix \(S\) from Eq. 1, the connection between gene expression and biological conditions can be uncovered via biologically meaningful gene modules using \(P\) and \(V\), the effect of phenotypes on gene expression can be revealed with \(P*V\), and the molecular characteristics of donors can be characterized with \(D\). In addition to the vanilla version of scParser, we also developed an extension of scParser that models the heterogeneous effects of biological conditions on cell populations (details in the “Methods” section) after cell identity annotation with \(S\). Notably, the extended scParser pinpoints the related biological processes in the specific cell subpopulations that may contribute to disease pathogenesis.

Datasets

In our applications, we applied scParser to the three scRNA-seq datasets from studies on the pancreas of type 2 diabetic and normal donors (T2D dataset) [2], human airway epithelium of donors with different smoking habits (Smoking dataset) [5], and peripheral blood cells of patients experiencing mild to severe COVID-19 infection (COVID-19 dataset) [3]. The number of cells in these datasets ranges from ~6000 to over 30,000. To demonstrate the scalability of scParser, we further applied scParser with the batch-fitting strategy to the immune dataset, offering scRNA-seq data of > 300,000 immune cells from 16 different tissues of 12 donors [1], and the GBM (glioblastoma) data covering > 200,000 human glioma, immune, and stromal cells from GBM patients [4]. In our applications, we first applied the vanilla scParser. With the cell populations annotated after fitting the vanilla scParser, we employed the extended scParser.

scParser connects gene expression to phenotypes through gene modules

The vanilla scParser captures variation from heterogeneous biological conditions in scRNA-seq data with biologically meaningful gene modules, which serve as a bridge to connect gene expression to phenotypes via path analysis. The path analysis empowered by gene modules gives us deeper insights into the underlying biological context/process where genes participate in disease pathology, which is usually not provided in standard differential expression (DE) analysis.

Let us use the T2D dataset from the study [2] for demonstration. The gene modules M1 and M2 connect the eight selected most variable genes with diabetes status (Fig. 2A). As revealed by the biological process (BP) enrichment analysis on top genes in gene modules M1 and M2, M1 encodes the BPs related to insulin and peptide secretion, and M2 encodes the BPs involved in the regulation of transcription in response to stress and response to unfolded protein.

Fig. 2
figure 2

Various downstream analysis with outputs from the vanilla scParser in our applications. A The path diagram links the expression of the eight selected most variable genes to diabetes status via two gene modules in the application to the T2D dataset. B Changes in the expression of the most variable (top 10) genes across control, mild, and severe COVID-19 from the analysis of the COVID-19 dataset are shown. C Changes in the expression of the most variable (top 10) genes across different types of GBM from the analysis of the GBM dataset are shown. The words “LGG”, “ndGBM,” and “rGBM” stand for low-grade gliomas, newly diagnosed GBM, and recurrent GBM, respectively. D The top 20 severe COVID-19-associated genes ranked by adjusted p-values are displayed. E The top 20 recurrent GBM-associated genes ranked by adjusted p-values are presented. F The up-regulated BPs enriched by genes in the upper 5% quantile of the differences in the adjusted expression profiles between mild COVID-19 and normal are shown. G The up-regulated BPs enriched by genes in the upper 5% quantile of the differences in the adjusted expression profiles between severe and mild COVID-19 are shown. H The up-regulated BPs enriched by genes in the upper 5% quantile of the difference in the adjusted expression profiles between the recurrent GBM and newly diagnosed GBM are shown

On the one hand, we observed that the expression of M1 is much higher in normal than in diabetes: 1.34 vs. 0.97 (Fig. 2A), suggesting that normal patients have a higher activity level of BPs in insulin and peptide secretion than diabetic patients. This observation is also consistent with the fact that type 2 diabetic patients can have impaired insulin secretion. Additionally, the genes that have large loading in M1 include INS (the loading is 1.14), CHGB (the loading is 0.93), and FXYD2 (the loading is 0.47). This suggests that these genes may affect type 2 diabetic patients through BPs in insulin and peptide secretion. In fact, the role of these genes in type 2 diabetes is supported by previous studies: INS encodes insulin [16], FXYD2 was suggested to be a novel target for diabetes [17], and loss of CHGB impairs glucose-stimulated insulin secretion [18].

On the other hand, we see that the expression of M2 is much higher in diabetes compared to normal: 1.10 vs. 0.62 (Fig. 2A), implying that the diabetic has a higher level of BPs in response to unfolded protein, compared with the normal. This is also supported by the finding that response to unfolded protein leads to endoplasmic reticulum (ER) stress [19], which contributes to insulin resistance [20]. Moreover, the loadings of FOS and JUN are high for M2 (0.58 and 0.56, respectively). This implies that the two genes (FOS and JUN) may contribute to type 2 diabetes via BPs in the regulation of transcription in response to stress and response to unfolded protein. The implication is also consistent with the finding that the AP-1 transcription factor, which is composed of c-Jun, encoded by gene JUN [21], and c-Fos, encoded by gene FOS [22], is necessary for the induction of the ER stress [23], which contributes to pancreatic β-cell loss and insulin resistance [24].

In brief, scParser summarizes the expression patterns of thousands of genes to a few metagenes/gene modules, which provides a high-level summary of the gene activities. These metagenes represent the genes that collectively carry out certain biological functions. Hence, the path analysis empowered by scParser via gene modules not only identifies the genes associated with the disease status but also unearths the relevant biological context/process through which the genes participate in the disease pathology. Notably, these gene modules are learned adaptively from the data.

scParser facilitates the identification of phenotype-associated genes and their enriched biological processes

The changes in gene expression across different phenotypes can be unveiled by \(P*V\), which is the product of the phenotype representation matrix \(P\) and the gene module matrix \(V\) in Eq. 1. We also name \(P*V\) as the adjusted expression profiles for phenotypes because the unwanted variation from donors is controlled in them. The product \(P*V\) together with the matrices \(P\) and \(V\) facilitates the identification of phenotype-associated genes.

We first demonstrate it with the application to the COVID-19 dataset. The top variable genes, which are identified by inspecting \(P*V\), show dynamic changes across severe COVID-19, mild COVID-19, and normal (Fig. 2B). Besides, we identified the genes associated with the disease status by examining the correlation between the rows of \(P\) and the columns of \(V\) [25]). The top identified genes, including HBB (Fig. 2B and D) and GRINA and FTL (Fig. 2D), are shown to be differentially expressed between severe COVID-19 cases and normal in independent studies [26,27,28]. Additionally, the high expression levels of HBB, HBA1, and HBA2 (Fig. 2B) in severe COVID-19 patients are supported by the evidence that the level of hemoglobin, whose subunits are encoded by HBB, HBA1, and HBA2, was found to be significantly higher in severe COVID-19 patients than in mild COVID-19 ones and normal controls [29].

To further verify the usefulness of the adjusted expression profiles for phenotypes \(P*V\), we conducted additional experiments with disease targets provided by Open Targets [30]. First, we extracted target genes for COVID-19 from Open Targets, where the associations between COVID-19 and target genes are indicated by scores (from 0 to 1). Then, we also computed the absolute expression difference of genes across phenotypes (mild COVID-19 vs. normal, and severe COVID-19 vs. mild COVID-19) using \(P*V\). Subsequently, we examined the correlation between the absolute expression difference of genes across phenotypes and the association scores provided by Open Targets.

We found that there is a positive correlation between them: p-values for a positive correlation between the absolute gene expression difference in mild COVID-19 vs. normal and the association score in Open Targets are 0.0005 for both Kendall and Spearman correlation, and p-values for a positive correlation between absolute gene expression difference in severe vs. mild COVID-19 and the association score are 6.47E−05 for Kendall correlation and 6.919E−05 for Spearman correlation, respectively. These findings suggest that the genes that have a bigger difference in expression levels across phenotypes tend to have a higher association score with the corresponding disease.

To complement path analysis by considering all gene modules, BP enrichment analysis can be performed on the list of genes that demonstrate the greatest difference across COVID-19 statuses with the adjusted expression profile \(P*V\), which controls for the variation of donors. The up-regulated BPs in cells from donors with mild COVID-19 compared to those from the normal include the viral life cycle, viral process, response to viral, and the BPs related to immune response, such as (myeloid) leukocyte and neutrophil chemotaxis (Fig. 2F), suggesting elevated viral activity and immune response in mild COVID-19 patients compared to the normal; in the comparison of cells from severe COVID-19 cases to those from mild COVID-19 ones, the BPs related to apoptotic signaling pathways are activated (Fig. 2G), consistent with the previous finding that T cell apoptosis characterizes severe COVID-19 [31].

We also implemented the analyses on the glioblastoma (GBM) dataset. There is a wider difference in gene expression between glioblastoma (GBM) and low-grade gliomas (LGG), compared to that between recurrent and newly diagnosed GBM (Fig. 2H). Also, some listed top genes (FTL [32], FTH1[33], GAPDH [34], and VIM [35]) (Fig. 2C and E) are shown to be associated with GBM. Moreover, recurrent GBM shows higher levels of BPs related to inflammation and immune response compared with newly diagnosed GBM (Fig. 2H). The result is also supported by Open Targets [30]: p-values for a positive correlation between absolute differences in gene expression between recurrent GBM and newly diagnosed GBM and the association score of genes for GBM are 0.0122 for Kendall correlation and 0.0123 for Spearman correlation, respectively.

scParser can also boost the signal of phenotype-associated genes by removing the variation of the less relevant gene modules (Additional file 1: section S1): after filtering less relevant gene modules, the absolute gene expression differences between disease statuses (computed with \(P*V\)) have a more significant and higher positive correlation with the association scores provided by Open Targets and that the BPs enrichment analysis also becomes statistically more significant. In addition, we implemented transcription factor (TF) enrichment analysis to identify potential drivers for the changes in gene expression across biological conditions in the T2D, Smoking, and GBM datasets, and most TFs identified in our analyses are supported by previous studies (Additional file 1: section 12).

scParser pinpoints the relevant biological processes in cell subpopulations that contribute to disease pathogenesis

Previously, we presented how biological conditions affect gene expression through gene modules in the vanilla scParser. However, it is important to note that the same biological condition may exert heterogeneous effects on different cell populations. Motivated by this, we develop the extended version of scParser, defined in Eq. 6 in the “Methods” section, which directly models the interactive effects of biological conditions on different cell subpopulations annotated with \(S\) in Eq. 4. The matrices \(W\) and \(V\) from Eq. 6 enable path analysis. We use the results from our application to the T2D dataset for demonstration. We first show that path analysis informs us of the difference in biological functions for three subpopulations of beta cells (B1, B2, and B3) in normal controls (Fig. 3A), and then demonstrate that path analysis can unveil the relevant biological processes that are affected in B1 beta cells in diabetic patients that may contribute to the disease pathogenesis (Fig. 3B).

Fig. 3
figure 3

Further exploratory analyses with cell population representations from scParser in the application to the T2D dataset. A The path diagram shows that scParser links the expression of 8 beta cell marker genes to the three beta cell populations (B1, B2, and B3) from the normal via two gene modules. B The path diagram shows that scParser links the expression of 8 beta cell marker genes to the beta cell population (B1) across diabetes status via two gene modules

More specifically, we focus on two gene modules M1 and M2. As revealed by enrichment analysis on top genes in M1 and M2, BPs encoded by M1 involve the secretion of hormones and peptides (not including insulin), but BPs encoded by M2 further include insulin secretion. We first looked at the three subpopulations of beta cells (B1, B2, and B3) in normal controls to study their difference. The subpopulation B3 has lower loadings in M1 and M2 (−2.31 and 0.23, respectively), compared to those for B1 (0.47 and 1.54, respectively) and B2 (1.31 and 0.37, respectively) (Fig. 3A), and the loading of M1 and M2 are also different between B1 and B2 (Fig. 3A). This evidence suggests that the beta cell populations from the normal show heterogeneity in biological functions: B1 and B2 (especially B1) may have much higher levels of insulin secretion (encoded in gene module M2) than B3.

To further investigate the role of the beta cell populations in diabetes pathogenesis, we next focus on comparing beta cells B1 in normal vs diabetes through path analysis (Fig. 3B) because B1 has a very high loading in M2 in the normal and it may play a key role in insulin secretion. We see that the expression of M2 for B1 is much higher in the normal than in the diabetic: 1.54 vs. 0.87 (Fig. 3B), suggesting that B1 from the diabetic has a lower level of BPs in insulin secretion, compared to that from the normal. This implies that B1 may be directly associated with diabetes pathogenesis. This implication is consistent with the fact that T2D is due to the dysfunction of beta cells [36].

Besides pinpointing the relevant biological context/process shown above, scParser quantifies the contribution of individual genes to the biological context/process (Fig. 3). As mentioned previously, the gene module M1 encodes BPs for the secretion of hormones and peptides (not including insulin), and M2 further encodes BPs in insulin secretion. The genes (RBP4, IAPP, PCSK1, and INS) with large loadings on M1 (Fig. 3) either encode peptide hormone or participate in peptide hormone secretion, as shown in previous studies [16, 37,38,39]. In addition, the loadings of INS and IAPP are high for M2, and the loading of RBP4 is negative for M2 (Fig. 3). The three genes (INS, IAPP, RBP4) were shown to associate with insulin secretion in previous studies [16, 39, 40], where RBP4 is negatively associated with insulin secretion [40], explaining its negative loading in M2. The findings imply that these genes may contribute to heterogeneity in the function of beta cells and participate in the contribution of the cell populations to the diabetes status through BPs involved in insulin, peptide, and hormone secretion.

In brief, the extended scParser reveals the key biological context/process in the cell subpopulation that contributes to the disease status. In addition, it sheds light on the involvement of genes in the key biological context/process. This unique advantage of scParser provides us with a much deeper understanding of the contribution of cell populations to disease pathology via gene modules.

scParser reveals the heterogeneous effects of biological conditions on different cell subpopulations or types

In the previous subsection, the heterogeneous effects of biological conditions on cell subpopulations have been demonstrated with two gene modules with the T2D dataset as an example. To complement the path analysis by considering all gene modules, we computed the adjusted expression profiles for cell subpopulations (types) under different biological conditions with \(W*V\), which is the product of the representation matrix for cell subpopulations (types) under different biological conditions \(W\) and the gene module matrix \(V\) in Eq. 6. The product of \(W*V\) reveals expression dynamics of genes in cell subpopulations (types) across biological conditions and identify the phenotype- or disease-critical cell populations by BP enrichment analysis.

Let us first demonstrate the results from our application to the T2D dataset. The top variable genes in the beta cell subpopulation B1, which are obtained by examining the rows of \(W*V\) corresponding to B1, show distinct differences across diabetes status (Fig. 2B). Some identified genes (MTRNR2L1 [41], CPE [42], RBP4 [43], FXYD2 [44, 45], and CHGB [18]) are shown to play an important role in beta cells and be involved in diabetes (Fig. 4A), suggesting the contributing role of beta cells in diabetes. Then, the BP enrichment analysis on the list of top variable genes for each cell subpopulation (type) across diabetes status was conducted to reveal the effects of biological conditions on the cell subpopulation (type). Here the list of top variable genes is obtained with the adjusted expression profile \(W*V\) corresponding to the cell subpopulation (type), which controls for the variation of donors. We observed that the levels of BPs in the secretion and regulation of insulin are lower in the two beta subpopulations (B1 and B2) from diabetic donors, compared to normal controls (Fig. 4C and D). However, this finding is not observed for the other three cell types (Additional file 1: Fig. S1A, S1B, and S1C) and the beta cell subpopulation B3 (Additional file 1: Fig. S1D). This implies that the two beta cell subpopulations are directly associated with diabetes.

Fig. 4
figure 4

scParser reveals the heterogeneous effects of biological conditions on different cell populations (types). A The figure shows the changes in expression of the most variable (top 10) genes for the beta cell subpopulation (B1) across diabetes status with the T2D dataset. B The figure shows the changes in expression of the most variable (top 10) genes for the T cell population across normal, mild COVID-19, and severe COVID-19 with the COVID-19 dataset. C The figure shows the down-regulated BPs enriched by genes in the lower 5% quantile of the difference in the adjusted expression profiles for B1 between the diabetic and normal conditions. D The figure shows the down-regulated BPs enriched by genes in the lower 5% quantile of the difference in the adjusted expression profiles for B2 between the diabetic and normal conditions. E The up-regulated BPs enriched by genes in the upper 5% quantile of the difference in the adjusted expression profiles for T cells between severe and mild COVID-19 are displayed. F The up-regulated BPs enriched by genes in the upper 5% quantile of the difference in the adjusted expression profiles for monocytes between severe and mild COVID-19 are displayed

Meanwhile, the average proportions of B1 and B2 are greater in normal controls (35.00% and 8.88%, respectively) than in diabetes (30.56% and 4.58%), and the average proportion of B3 is smaller in normal controls (5.29%) than in diabetes (8.99%). (Additional file 2: Table S1). This leads to the finding that diabetes is associated with a decrease in the subpopulations of beta cells (B1 and B2) that secrete insulin and an increase in the subpopulation of beta cells (B3) that have a low level of insulin secretion.

To verify the finding above, we first obtained a list of eight genes, which are shown to associate with insulin resistance, T2D, or insulin secretion, through literature search (RBP4 [46,47,48], NPY [49, 50], DLK1 [51,52,53,54], PCSK1 [55, 56], MAFA [57,58,59], SIX3 [60,61,62], PFKFB2 [63,64,65], TMEM37 [66]). Then, we conducted two-sided t-tests to compare the expression levels of these genes between normal controls and diabetes across all six cell populations, including the two beta cell populations (B1 and B2) (Additional file 2: Table S2). The p-values tend to be much smaller in B1 and B2 than in other cell populations (Additional file 2: Table S2). In brief, the additional experiment supports the conclusion above. This phenomenon is supported and systematically reviewed by the study [67], suggesting that beta cells are morphologically and functionally heterogeneous and that this is strongly associated with diabetes.

We also performed the above analysis to the COVID-19 datasets. The relationship between the three genes listed in Fig. 4B (HBB, HBA1, and HBA2) and severe COVID-19 has been discussed in our previous section, and two other identified genes (TNFAIP3 [68] and GNLY [69]) were reported to play an important role in COVID-19 (Fig. 4B). Further, we observed that the BPs related to the differentiation of T and lymphocyte cells are up-regulated in monocytes from severe COVID-19 donors, compared to monocytes from mild ones (Fig. 4F). More importantly, the BPs involved in response to viral are up-regulated in T cells in the comparison (Fig. 4E), and the levels of the BPs related to interleukin-2 production, which plays a key role in the human immune system, are also high in T cells, but not in monocytes (Fig. 4E and F). This may suggest that monocytes promote the differentiation of T cells and that T cells play a key role in the human immune response to severe COVID-19.

To further verify the association of monocytes and T cells in severe COVID-19 patients, we conducted the following analysis. Specifically, we first computed the absolute differences in gene expression between severe and mild COVID-19 (here \(W*V\) is used for the gene expression level for cell populations across COVID-19 status, which is adjusted for the variation of donors) for T cells and monocytes, and then computed the correlation between the absolute expression differences and the scores of genes associated with COVID-19 provided by Open Targets [30]. Our results show that there are statistically significant and positive relationships between the absolute expression difference for T cells (P Kendall = 0.0170 and P Spearman = 0.0192) as well as monocytes (P Kendall = 0.0091 and P Spearman = 0.0080) and the gene association scores for COVID-19, corroborating the role of monocytes and T cells in COVID-19. On the one hand, the important role of monocyte cells in T cell formation has been supported by the literature [70,71,72]: monocytes play an important role in the formation of memory T cells [72], and monocyte‐derived cell populations promote the differentiation of CD4+ T cells [71] and CD8+ T cells [70]. Additionally, the contributing role of monocytes in COVID-19 pathogenesis is supported by the literature [73, 74]. On the other hand, the relationship between severe COVID-19 and T cells is also supported by the literature [31, 75,76,77]: T cells play a crucial protective role in the human immune response to COVID-19 [75,76,77], and T cell apoptosis characterizes severe COVID-19 infection [31]. These findings support the conclusion above that monocyte cells may promote the differentiation of T cells and that T cells may play a key role in the immune response to severe COVID-19.

scParser reveals the molecular characteristics of donors

Here we demonstrate how scParser can help reveal the molecular characteristics of donors with the application to the GBM dataset. First, we directly plotted the donor representations of 11 newly diagnosed GBM donors (Additional file 1: Fig. S2). Donors 01, 06, and 10 show obvious differences in molecular characteristics compared with the other eight donors (Additional file 1: Fig. S2). For demonstration, we focus on discussing the molecular differences of these donors with metagenes 12, 13, and 15, and the biological meaning of the metagenes is revealed with enrichment analysis. A joint analysis of the figure and results from enrichment analysis suggest that donor 1 may suffer a poor progression of GBM, donor 6 may experience more severe hypoxia than other donors, and donor 10 has a stronger immune response and a lower level of gliogenesis and axonogenesis than other donors. Technical details of enrichment analysis and result interpretation are provided (Additional file 1: section S2). However, due to a lack of clinical information on these donors to validate our findings, further investigation is needed to explore our findings.

Comparative study of scParser on cell type annotation

In addition to the interpretability empowered by gene modules presented above, scParser also helps annotate cell types. To evaluate its ability in annotating cell populations, we compared it against eight methods in cell identity annotation, including Seurat V3 [12], LIGER [78], Harmony [9], BBKNN [8], ComBat [79], FastMNN [11], scINSIGHT [29], and Scanorama [13].

Practically, we implemented the clustering procedure in SCANPY [80] and Seurat [12] for all methods for a fair comparison, where we performed cluster analysis on the low-rank embeddings from each method using Louvain clustering [81]. We considered the number of cell types provided by the original data as the expected number of clusters. In practice, we adjusted the resolution until the expected number of clusters was attained. Then, we assessed the clustering performance of the methods with adjusted Rand index (ARI), adjusted mutual information (AMI), normalized mutual information (NMI), and homogeneity score. In the comparison, we considered the cell types annotated in original studies as true labels and ignored the unannotated observations. To clarify, the cell types provided in the original studies are annotated by differential analysis with marker genes. For visualization, we employed UMAP [82] to obtain two-dimensional embeddings and plotted them against annotated cell types.

Here we compared the performance of scParser against that of the eight methods on the COVID-19, T2D, and Smoking datasets (Fig. 5A) and the performance of batch scParser on cell type annotation on the Immune and GBM datasets (Fig. 5B). Details of the five datasets are provided in the “Datasets” section.

Fig. 5
figure 5

Comparative study of scParser against other eight methods in cell type annotation. A The clustering performance of scParser was compared to that of the other eight methods in terms of ARI, AMI, NMI, and Homogeneity score with the COVID-19, T2D, and Smoking datasets. The words “COVID-19”, “T2D”, and “Smoking” stand for the COVID-19, T2D, and Smoking datasets, respectively. B The clustering performance of batch scParser is compared to that of the other eight methods in terms of ARI, AMI, NMI, and Homogeneity score on the GBM and Immune datasets. The performance of Combat and scINSIGHT are not shown due to the scalability issue

In general, across the four different evaluation metrics, scParser achieved the start-of-the-art clustering performance (Fig. 5). In terms of the T2D dataset, it outperforms all other methods in the dataset in terms of ARI, NMI, and AMI (Fig. 5A). In the COVID-19 and Smoking datasets, scParser achieved cluster performances comparable to the state-of-the-art methods (Fig. 5A). Moreover, batch scParser outperforms all other methods in terms of ARI (Fig. 5B). Regarding the other three evaluation metrics, the performance of batch scParser is similar to that of other methods (Fig. 5B). In the application of scINSIGHT to the other four datasets except for the COVID-19 dataset, it failed to yield results in two days, so we stopped the program for time consideration.

In our visual comparison of scParser with the three methods (LIGER, Seurat, and Harmony) recommended for scRNA-seq data integration by [83], we found that four methods demonstrate a similar performance on the COVID-19 dataset (Additional file 1: Fig. S3A). In the analysis of the T2D dataset, Harmony and scParser outperform the other two methods, and scParser seems to perform better in separating A (alpha) and B (beta) cells (Additional file 1: Fig. S3B). Specifically, LIGER cannot well distinguish PP cells from B cells, and Seurat fails to distinguish PP cells from a subpopulation of A cells (Additional file 1: Fig. S3B). Moreover, quantitative measurement of these methods in separating A, B, and PP cells with silhouette scores also confirms our observation, with the silhouette scores equal to 0.26, 0.22, 0.23, and 0.14 for scParser, LIGER, Seurat, and Harmony, respectively. All methods perform similarly in the Smoking dataset (Additional file 1: Fig. S3C), and this is also indicated by Fig. 5A. Batch scParser achieved a similar performance in cell type annotation, compared with the other three methods (Additional file 1: Fig. S3D and S3E).

We noticed that the clustering performance for different methods varies across different datasets when different evaluation metrics are used (Fig. 5). Hence, for each evaluation metric above, we followed the idea from the previous study [84] to calculate the averaged ranking for each method across the five datasets as its overall performance (Additional file 1: section S3). The idea of the overall ranking follows the recommended practice for ranking methods in performance comparison from the study [85]. scParser demonstrates the highest average ranking across the five datasets for all four evaluation metrics (Additional file 1: section S3), suggesting that scParser has a superior overall performance over other methods in cell clustering.

In summary, scParser achieves state-of-the-art performance in cell type annotation in each application and demonstrates superior overall performance over other methods in cell clustering. The batch-fitting strategy improves the scalability of scParser without impairing its performance in cell-type annotation. Further experiments also show scParser demonstrates satisfactory performance in batch effect correction (Additional file 1: section S3).

Computational time and memory usage

scParser is a computationally efficient framework for interpretable studies of scRNA-seq datasets. The computational time for scParser with the whole data-fitting strategy on the COVID-19 data, the T2D dataset, and the Smoking dataset is ~20 min, ~1 h, and ~2.5 h, respectively, and the memory consumption of scParser for the applications is less than 5 G. For the batch scParser with 20 batches, it takes roughly 6 h and 12 G memory to complete the analyses with the GBM and Immune datasets.

Discussion

The interpretability and broad applicability of scParser in scRNA-seq data analysis have been demonstrated through comprehensive applications. Specifically, scParser was applied to three scRNA-seq datasets, and batch scParser was employed on two other scRNA-seq datasets of over 200,000 cells. The outputs from these applications enable various interpretative data analyses. Firstly, scParser enables connecting gene expression to diabetes status via biologically meaningful gene modules by path analysis and revealing changes in the expression of genes across COVID-19 infection outcomes and GBM stages. Moreover, enrichment analyses of the most variable genes across biological conditions reveal the effect of COVID-19 outcomes and different GBM stages on cells. Additionally, the ability of scParser to discern subtle changes in gene expression across adjacent time points with the iPSC dataset [86] has been demonstrated (Additional file 1: section S5). Secondly, it captures the heterogeneous effects of biological conditions on cell populations via biologically meaningful gene modules. Hence, scParser pinpoints the biological context in cell subpopulations that contribute to disease pathogenesis via the gene modules by path analysis. Moreover, putative phenotype- or disease-critical cell populations are identified by comparing the adjusted expression profiles (provided by scParser) for cell populations across phenotypes or disease status. Thirdly, donor representations from scParser characterize the molecular aspects of donors with GBM. Fourthly, scParser is computationally efficient in analyzing scRNA-seq data. Empirical studies show that scParser completes analyzing scRNA-seq data with different cell numbers and gene numbers in a reasonable time (Additional file 1: section S7). Additionally, scParser is robust to the sparsity due to dropout events in scRNA-seq data (Additional file 1: section S8). Lastly, scParser has a favorable performance in cell clustering and demonstrates a satisfactory performance on batch effect correction.

In brief, scParser has the following virtues: (1) scParser is a general, flexible, and scalable framework for scRNA-seq data analysis, in which the heterogeneous variation from biological conditions is modeled via biologically meaningful gene modules. In scParser, biological conditions can be defined upon study design and research question; (2) two fitting strategies offered by scParser make it tailored for analyzing scRNA-seq data of various sample sizes; (3) Empowered by gene modules, scParser can boost the signal of disease-associated genes; (4) the extended scParser models the heterogeneous effects of biological conditions on different cell populations, thus providing us with insights on the contributing role in disease pathology; (5) the output from scParser enables various interpretative analyses via biologically meaningful gene modules; and (6) scParser demonstrates unique properties, compared to the standard Seurat pipeline and scINSIGHT [29] (Additional file 1: section 15).

We noticed that \(D,P\) in the vanilla scParser are not identifiable. The issue can be alleviated by the following practices. On the one hand, the ridge penalty on \(D\) and \(P\) can relieve it. In practice, the hyperparameter \({\lambda }_{1}\) for the ridge penalty for \(P,D\) across our applications ranges from 10 to 50, suggesting that the elements of \(P,D\) are shrunk towards 0. Moreover, the Frobenius norm of \(D\) is larger than that of \(P\) (\({N}_{1}>{N}_{2}\)), so the ridge penalty pushes the elements of \(D\) closer to 0, compared to \(D\). On the other hand, we initialize \(D\) to 0 and always update \(P\) before the update for \(D\) in model fitting. Therefore, intuitively, the update for \(D\) is computed on the residuals of \({X}^{P}P\). We think that this will further alleviate the identifiability of \(D\) and \(P\).

The scRNA-seq technology has been widely used to understand gene regulation across heterogeneous biological systems at the single-cell level [1, 3,4,5, 7, 87]. However, most existing computational methods for scRNA-seq data analysis focus on data integration and do not model the variation from biological conditions. Therefore, their ability to understand the heterogeneous effects of biological conditions at the single-cell level is limited. scParser offers an efficient and scalable solution to fill the gap in scRNA-seq data analysis. With a rapid rise in the scale of scRNA-seq data, scParser further incorporates batch-fitting strategy to accommodate scRNA-seq data of arbitrary sample size. Its wide applicability in biomedical data analysis has been demonstrated through comprehensive applications. With these virtues, scParser will benefit researchers in the biomedical field. Further extensions include analysis of other single-cell omics.

Conclusions

This article proposed a computational framework, scParser, for integrative scRNA-seq data analysis. scParser captures variation from heterogeneous biological conditions with matrix factorization and decomposes the cellular variation with sparse representation learning simultaneously. The gene modules empowered by scParser connect gene expression to disease status via gene modules. The extended scParser models the heterogeneous effects of biological conditions on different cell populations, thus pinpointing the relevant biological context/processes through which cell subpopulations contribute to the disease pathogenesis and identifying the putative cell populations associated with biological conditions. The outputs from scParser provide informative inputs for various interpretative analyses. Its wide applicability in biomedical data analysis has been demonstrated through comprehensive applications.

Methods

Here we propose a novel statistical approach, scParser, to decompose cellular variations in scRNA-Seq samples with considering heterogeneous variations from multiple biological conditions (e.g., donor, tissue, phenotype) in low-rank latent spaces to facilitate downstream analysis. In scParser, we integrate matrix factorization to capture variation from biological conditions with sparse representation learning to obtain embeddings of cells. In the employment of sparse representation learning, we introduce elastic-net regularization to encourage sparsity on the low-rank embeddings of cells to facilitate cell clustering and norm constraint to ensure each dimension with equal scale.

The scParser model and its extension

For illustration, let \({Z}^{N\times M}\) denote the matrix of log-normalized scRNA-Seq expression levels of \(N\) samples of \(M\) genes. The \(N\) samples can originate from several biological conditions (e.g., individuals, phenotypes, tissues, disease phases, or different time points). Here we use donor and phenotype for demonstration. The samples come from \({N}_{1}\) donors with \({N}_{2}\) phenotypes. Thus, the expression level of gene \(m\) for cell \(i\) from donor \(j\) with phenotype \(t\), \({z}_{im}\), can be modeled as

$${z}_{im}\approx {d}_{j}^{\intercal }{v}_{m}+{p}_{t}^{\intercal }{v}_{m}+{s}_{i}^{\intercal }{g}_{m}$$
(2)

where \({d}_{j},{p}_{t},{v}_{m}\) are vectors of length \({K}_{1}\), \({s}_{i},{g}_{m}\) are vectors of length \({K}_{2}\). The donor and phenotype information of the samples is known. For simplicity, we name the formulation above as vanilla scParser. The rationale behind our modeling is that biological conditions (e.g., donor and phenotype) affect activities of the common transcription factors (TFs) or biological processes/signals, which in turn affect the expression of genes. In practice, we also observed that increasing model complexity by modeling variation from donor and phenotype in two separate latent spaces does not lead to explaining more variations. Therefore, we capture variation from donors and phenotypes in a shared low-rank latent space. An extra benefit of this practice is that it can reduce model complexity and improve optimization efficiency without impairing model interpretability. Intuitively, after modeling the heterogeneous biological variation across biosamples, we capture the cellular variation across biosamples in another low-rank latent space. This practice allows us to model the cellular variation more independently and also increases model flexibility, compared to restricting our modeling in only one latent space. In our implementation, we considered the option to restrict \({v}_{m}\) and \({g}_{m}\) to be the same. Thus, the last term on the right of Eq. 2 helps obtain the low-rank embeddings of cells. In scParser, biological conditions can be defined based on the research questions and study design, and the number of biological conditions can be further increased.

The objective function for Eq. 2 is formulated as

$$\left\{\begin{array}{ll}\mathcal{L}\left(d,p,v,s,g\right)=& \frac{1}{2}{\sum }_{i,m}{\left(z_{im}-d_{j}^{\intercal} v_{m}-{p}_{t}^{\intercal } v_{m}-{s}_{i}^{\intercal }{g}_{m}\right)}^{2}+\\ & \frac{1}{2}{\lambda }_{1}\left({\sum }_{j}{\Vert {d}_{j}\Vert }_{2}^{2} +{\sum }_{ t}{\Vert {d}_{t}\Vert }_{2}^{2}+{\sum }_{ m}{\Vert {v}_{m}\Vert }_{2}^{2}\right)+\\ & {\lambda }_{2}\big(\frac{1}{2}\left(1-\alpha \right){\sum }_{ i}{\Vert {s}_{i}\Vert }_{2}^{2}+\alpha {\sum }_{ i}{\left|{s}_{i}\right|}_{1}\big),\\ \text{subject to}& \sum_{m}{g}_{mk}^2\le c,\forall k=1,\dots ,{K}_{2}.\end{array}\right.$$
(3)

In the equation, \({g}_{mk}\) is the \(k\)-th element of \({g}_{m}\), and \(c\) is a constant (usually 1). We introduce the elastic net penalty on the cell representation \({s}_{i}\) to encourage sparsity to facilitate cell clustering. Moreover, the norm constraint is imposed to ensure the same scale of each latent dimension in decomposing cellular variation. Equation 3 can be represented with matrix operation as follows:

$$\left\{\begin{array}{ll}\mathcal{L}\left(D,P,V,S,G\right)=& \frac{1}{2}{\left\| Z-\left({X}^{D}D+{X}^{P}P\right)V-SG\right\|}_{\text{F}}^{2}+\\ & \frac{1}{2}{\lambda }_{1}\left({\Vert D\Vert }_{\text{F}}^{2}+{\Vert P\Vert }_{\text{F}}^{2}+{\Vert V\Vert }_{\text{F}}^{2}\right)+\\ & {\lambda }_{2}\left[\frac{1}{2}\left(1-\alpha \right){\Vert S\Vert }_{\text{F}}^{2}+\alpha {\Vert S\Vert }_{1}\right]\\ \text{subject to}& {\Vert {G}_{k}\Vert }_{2}^{2}\le c,\forall k=1,\dots ,{K}_{2},\end{array}\right.$$
(4)

where \({X}^{D},{X}^{P}\) are indicator matrices of \(N\) rows and \({N}_{1}\) and \({N}_{2}\) columns, respectively, which are the dummy variables for \(N\) samples. \({D}^{{N}_{1}\times {K}_{1}},{P}^{{N}_{2}\times {K}_{1}},{V}^{{K}_{1}\times M}\) are matrices of latent representations of \({N}_{1}\) donors, \({N}_{2}\) phenotypes, and \(M\) genes, respectively, and denote \({S}^{N\times {K}_{2}},{G}^{{K}_{2}\times M}\) latent representations for \(N\) cells and \(M\) genes after modeling variation from donor and phenotype. \(c\) is a constant, restricting the scale of \({G}_{k}\), the \(k\)-th row of matrix \(G\). Each row of \({V}^{K\times M}\) represents a gene module, encoding certain biological processes, where the genes can be up-regulated and down-regulated. For a given gene module, the positive loadings of the genes stand for the up-regulation of corresponding genes, and the negative loadings represent the down-regulation of corresponding genes.

The above formulation can be extended to explore the heterogeneous effects of biological conditions on different cell populations after cell type annotation with \(S\) from Eq. 4. To exploit this idea, we model the expression level of gene \(m\) in cell population \(k\), which comes from sample \(i\) obtained from donor \(j\) with phenotype \(t\), \({z}_{im}\) as

$$z_{im}\approx (d_j+w_{kt})^{\intercal} v_m,$$
(5)

where \({d}_{j}, {w}_{kt},{v}_{m}\) are vectors of length \(K\), and \({w}_{kt}\) is the latent representation for cell population \(k\) from donors with phenotype \(t\), which captures the interactive effect between phenotypes and cell populations while controlling variation from donors. In Eq. 5, the biological condition and cell populations can be defined based on research questions. The extension above provides a useful tool to explore the heterogeneous effects of biological conditions on different cell populations, which can be annotated with other methods. Throughout this study, we name the new formulation as extended scParser.

The objective function for Eq. 5 with matrix notation is defined as

$$\mathcal{L}\left(D,W,V\right)=\frac{1}{2}{\left\| Z-\left({X}^{D}D+{X}^{W}W\right)V\right\| }_{\text{F}}^{2}+\frac{1}{2}{\lambda }_{1}\left({\Vert D\Vert }_{\text{F}}^{2}+{\Vert W\Vert }_{\text{F}}^{2}+{\Vert V\Vert }_{\text{F}}^{2}\right).$$
(6)

Here \({X}^{W}\) and \({W}^{{(N}_{2}*{\text{N}}_{\text{c}})\times {K}_{1}}\) denote the indicator matrix of \(N\) rows and \({N}_{2}*{N}_{c}\) columns and latent representations for \({N}_{c}\) cell populations under \({N}_{2}\) biological conditions, respectively. Other notations are the same as in Eq. 4.

Model fitting

Alternating block coordinate descent (BCD) is employed to optimize Eqs. 4 and 6. Practically, each time we update a set of independent parameters with all other parameters fixed. In each iteration of BCD, we update each set of parameters of our model sequentially and repeat the process until the stopping criteria are met. Note that we always update \(P\) before the update for \(D\) in practice.

Optimize with the whole data strategy

With the objective function defined by Eqs. 4 and 6, we can easily derive closed forms for updating \({d}_{j},{p}_{t},{v}_{m}\). Specifically, we have the following update for \({v}_{m}\)

$${v}_{m}={\left[{U}^{\intercal }U+{\lambda }_{1}{\mathbb{I}}_{{K}_{1}}\right]}^{-1}{U}^{\intercal }\tilde{Z}_m.$$

For the objective function defined by Eq. 4, \(U={X}^{D}D+{X}^{P}P\) and \(\widetilde{Z}=Z-SG\); for the objective function is defined by Eq. 6, \(U={X}^{D}D+{X}^{W}W\) and \(\tilde{Z}\) is equal to \(Z\). Under both situations, \(\tilde{Z}_m\) is the \(m\)-th column of \(\tilde{Z}\). Similarly, the update for \({d}_{j}\) is

$${d}_{j}={\left[{N}_{j} \sum_{m}{v}_{m}{v}_{m}^{\intercal }+\lambda_1 {\mathbb{I}}_{{K}_{1}}\right]}^{-1}\sum_{i\in {B}_{j}}\sum_{m}{\tilde{z}}_{im}{v}_{m},$$

where \(\tilde{Z}=Z-SG-{X}^{P}PV\) and \(\tilde{Z}=Z-{X}^{W}WV\) for the objective function defined by Eq. 4 and Eq. 6, respectively, \({B}_{j}\) is the set of indices of cells from donor \(j\), and \({N}_{j}\) is the number of elements in \({B}_{j}\). Likewise, the update for \({p}_{t}\) is

$${p}_{t}={\left[{N}_{t}\sum_{m}{v}_{m}{v}_{m}^{\intercal }+\lambda_1 {\mathbb{I}}_{{K}_{1}}\right]}^{-1}\sum_{i\in {B}_{t}}\sum_{m}{\tilde{z}}_{im}{v}_{m},$$

where \(\tilde{Z}=Z-SG-{X}^{D}DV\) for the objective function defined by Eq. 4, \({B}_{t}\) is the set of indices of cells from donors with phenotype \(t\), and \({N}_{t}\) is the number of elements in \({B}_{t}\). Finally, the update for \({w}_{kt}\) is

$${w}_{kt}={\left[{N}_{kt}\sum_{m}{v}_{m}{v}_{m}^{\intercal }+\lambda_1 {\mathbb{I}}_{{K}_{1}}\right]}^{-1}\sum_{i\in {B}_{kt}}\sum_{m}{\tilde{z}}_{im}{v}_{m},$$

where \(\tilde{Z}=Z-{X}^{D}DV\) for the objective function defined by Eq. 6, \({B}_{kt}\) is the set of indices of cells of cell type \(k\) from donors with phenotype \(t\), and \({N}_{kt}\) is the number of elements in \({B}_{kt}\).

When optimizing Eq. 4 with respect to \(G\), the Lagrange dual proposed in the study [88] is employed. The Lagrange dual for our problem is of the following form

$$\mathcal{L}\left(G,\overrightarrow{\psi }\right)=\frac{1}{2}\text{tr}({G}^{\intercal }QG)-\text{tr}(WG)+\frac{1}{2}\text{tr}\left(\Psi G{G}^{\intercal }-c\Psi \right)+const.$$
(7)

Here \(W={\tilde{Z}}^{\intercal }S\), \(\tilde{Z}=Z-{X}^{D}DV-{X}^{P}PV\), \(Q={S}^{\intercal }S\), and \(\Psi\) is a \({K}_{2}\times {K}_{2}\) diagonal matrix with dual variables \(\psi\) expanding along its diagonal. By taking derivative with respect to \(G\), we have

$$G=(Q+\Psi {)}^{-1}{W}^{\intercal }.$$
(8)

Then, by substituting Eq. 8 into Eq. 7, we have the following dual for our problem

$$\mathcal{D}\left(\overrightarrow{\psi }\right)=\frac{1}{2}\text{tr}\left(-W(Q+\Psi )^{-1}{W}^{\intercal }-c\Psi\right)+const.$$

The gradient \(\nabla\) and Hessian \(H\) for \(\mathcal{D}\left(\overrightarrow{\psi }\right)\) with respect to \(\overrightarrow{\psi }\) can be derived as follows:

$${\nabla }_{i}=\frac{\partial \mathcal{D}\left(\overrightarrow{\psi }\right)}{\partial {\psi }_{i}}=\frac{1}{2} \left\| W{\left(Q+\Psi \right)}^{-1}{e}_{i}\right\|^{2}-\frac{1}{2}c,$$
$${H}_{ij}=\frac{{\partial }^{2}\mathcal{D}\left(\overrightarrow{\psi }\right)}{\partial {\psi }_{i}\partial {\psi }_{j}}=-\big({\left(Q+\Psi \right)}^{-1}{W}^{\intercal }W{\left(Q+\Psi \right)}^{-1}\big)_{i,j}\big({\left(Q+\Psi \right)}^{-1}\big)_{i,j}.$$

Newton’s method is used to optimize our dual problem with respect to \(\Psi\). Thus, the update for \(\Psi\) at iteration \(t\) can be written as

$${\Psi }^{\left(t\right)}={\Psi }^{\left(t-1\right)}-{\left({H}^{\left(t-1\right)}\right)}^{-1}{\nabla }^{\left(t-1\right)},$$
(9)

where \({\Psi }^{\left(t-1\right)},{H}^{\left(t-1\right)},{\nabla }^{\left(t-1\right)}\) are the diagonal matrix of \(\psi\), gradient, and Hessian matrix at iteration \(t-1\), respectively. In practice, we alternatively compute the updates for \(G,\Psi\) until the sum of the squared difference in \(\Psi\) between two consecutive iterations is less than a predefined threshold (\({10}^{-4}\) is used in our studies). Further details on the derivation of Lagrange dual are provided (Additional file 1: section S10).

When optimizing Eq. 4 with respect to \({s}_{i}\), the \(i\)-th row of \(S\), our objective function can be simplified to

$$\mathcal{L}\left({s}_{i}\right)=\frac{1}{2}{\left\| {{\tilde{Z}}_{i}-G}^{T}{s}_{i}\right\| }_{2}^{2}+\frac{1}{2}\lambda_2 \left(1-\alpha \right){\Vert {s}_{i}\Vert }_{2}^{2}+\lambda_2 \alpha {\left|{s}_{i}\right|}_{1}.$$

Here \(\tilde{Z}=Z-{X}^{D}DV-{X}^{P}PV\), and \({\tilde{Z}}_{i}\) is the vector of the \(i\)-th row of \(\tilde{Z}\). To solve the problem, random coordinate descent (RCD) with strong rules is employed, which was proposed in our previous study [89]. Since the subproblems defined for each row of \(S\) are independent, we update \({s}_{i}\) in parallel in practice.

Theoretically, we proved that the BCD proposed above guarantees to reduce our objective defined by Eq. 4 at each iteration and that it converges to the local optimum of the objective in a finite number of steps. In our empirical study on its convergence speed, scParser has a satisfactory convergence speed in analyzing scRNA-Seq data. Details on the proof and empirical study are provided (Additional file 1: section S6).

Optimize with the batch-fitting strategy

When the sample size of data is huge, optimizing scParser with the whole data is memory-demanding and computationally intensive. To relieve this issue, we propose a batch-fitting strategy to optimize scParser. Practically, we split the whole data into several batches and optimized our object function with one batch each time to lower the memory consumption and computation burden.

The key to perform batch optimization in scParser is to propose a surrogate that asymptotically converges to the solution to Eq. 4. As inspired by one previous study [90], we define the following surrogate for our objective

$$\left\{\begin{array}{ll}\mathcal{L}\left(V,G\right)= & \frac{1}{2k}\sum_{j=1}^{k}{\left\| {Z}_{{I}_{j}}-\left({X}^D_{{{I}_{j}}}{D}_{{I}_{j}}+{X}^P_{{{I}_{j}}}{P}_{{I}_{j}}\right)V-{S}_{{I}_{j}}G\right\| }_{F}^{2}+\\ & \frac{1}{2}{\lambda }_{1}\left[\frac{1}{k}\sum_{j=1}^{k}\left({\Vert {D}_{{I}_{j}}\Vert }_{\text{F}}^{2}+{\Vert {P}_{{I}_{j}}\Vert }_{F}^{2}\right)+{\Vert V\Vert }_{F}^{2}\right]+\\ & \frac{1}{k}\sum_{j=1}^{k}{\lambda }_{2}\left[\frac{1}{2}\left(1-\alpha \right){\Vert {S}_{{I}_{j}}\Vert }_{F}^{2}+\alpha {\Vert {S}_{{I}_{j}}\Vert }_{2}\right],\\ \text{subject to}& {\Vert {G}_{k}\Vert }_{2}^{2}\le c,\forall k=1,\dots ,{K}_{2}.\end{array}\right.$$
(10)

Here \(k\) is the number of batches, \({I}_{j}\) denotes the set of indices of cells from batch \(j\), and \({P}_{{I}_{j}},{D}_{{I}_{j}},{S}_{{I}_{j}}\) denote the latent representations for phenotypes, donors, and cells, respectively, that are obtained for batch \(j\). Technically, the surrogate for Eq. 6 is a subproblem of Eq. 10, so we focus on optimizing our objective defined by Eq. 10 and do not diverge to discuss the technical details in optimizing the surrogate for Eq. 6. Algorithm 1 below is proposed to optimize the surrogate defined by Eq. 10.

figure a

Algorithm 1 Batch scParser

In Algorithm 1, we noticed that \({A}_{k},{B}_{k},{E}_{k},{F}_{k}\) carry information for the same batch from all previous iterations. Actually, the information from early iterations is outdated. Mairal et al. [90] suggested removing the old information from the matrices to accelerate convergence. Owning to the design of our algorithm, we use the following equations to exploit this idea:

$$\left\{\begin{array}{ll}A_k\leftarrow&A_{k-1}-\left(X_{I_k}^DD^{\prime}_k+X_{I_k}^PP^{\prime}_k\right)^\intercal \left(X_{I_k}^DD^{\prime}_k+X_{I_k}^PP^{\prime}_k\right)\\ B_k\leftarrow&B_{k-1}- {{\tilde Z_{I_k}^{\prime^{\intercal}}}}\left(X_{I_k}^DD^{\prime}_k+X_{I_k}^PP^{\prime}_k\right)\\ E_k\leftarrow&E_{k-1}-{S^{\prime}_{I_k}}^\intercal S^{\prime}_{I_k}\\F_k\leftarrow&F_{k-1}-{Z^{\prime}_{I_k}}^\intercal S^{\prime}_{I_k}.\end{array}\right.$$

Here \(D^{\prime}_{k},P^{\prime}_{k},S^{\prime}_{k}\) are the matrices for donor, phenotype, and cell representations, respectively, for batch \(k\) at iteration \(t-1\). With a slight abuse of notation, \({\tilde{Z}^{\prime}}_{{I}_{k}}\) in the 2nd and 4th lines of the above equation is computed with the equations in lines 6 and 12 in Algorithm 1, respectively.

In practice, we carried out an additional experiment to show that scParser can handle the extreme case in which the data is divided by samples with another batch-fitting strategy, which has been incorporated into our software scParser (Additional file 1: section S4). In the experiment, the performance of batch scParser with random shuffling is slightly better compared to that of batch scParser with batch assigned according to the samples. Therefore, it is recommended to employ scParser with random batch assignment, as it is slightly more computationally efficient and less memory demanding than scParser with the by-sample batch assignment.

Initialization, hyperparameter tuning, and the stopping criteria

In scParser, all latent variables (\(P,V,S,G\)) except \(D\) are initiated from the normal distribution \(N\left(0,0.001\right)\), and \(D\) is initialized to a zero matrix. For initiating \({s}_{i}\) in Eq. 10 in optimization, we use the solution for \({s}_{i}\) from the previous iteration as a warm start.

For model selection in scParser, grid search is utilized to select hyperparameters \({\lambda }_{1},{\lambda }_{2},\alpha ,{K}_{1},{K}_{2}\). When the number of observations is huge (e.g., \(\ge\) 500,000), we randomly and evenly draw a small proportion (e.g., 0.1) of cells from scRNA-Seq samples as a dataset for model selection. Then, we randomly draw \(10\%\) of elements from the dataset as a test set. For each combination of candidate hyperparameters, we run alternating BCD several iterations (e.g., 20) and choose the one with the best performance in terms of root-mean-square error (RMSE) on the test set.

The robustness of scParser to the hyperparameters \({\lambda }_{1},{\lambda }_{2},\alpha\) has been demonstrated by additional experiments (Additional file 1: section S9). In the experiments, we observed that scParser is insensitive to \({\lambda }_{1}\) and favors \(\alpha =1\). Thus, we set \(\alpha\) equal to 1 to simplify model selection. Meanwhile, we found that our two hyperparameters \({K}_{1}\) and \({\lambda }_{1}\) are somewhat redundant since one can increase \({K}_{1}\) and \({\lambda }_{1}\) simultaneously without changing model complexity. Therefore, we set \({K}_{1}={K}_{2}\).

The detailed procedure for model selection is as follows. First, we set \({\lambda }_{1},{\lambda }_{2}\) to a small number (e.g., \(0.01\)) to avoid singularity in matrix inverse, \(\alpha\) is fixed to \(1\), and choose the ranks \({K}_{1}={K}_{2}\) from the sequence from 10 to 40 with step size 2 with grid search. After choosing the ranks, we define a broad parameter grid for \({\lambda }_{1},{\lambda }_{2}\) and perform a grid search. One may also refine the parameter grid based on the performance of the parameter grid on the test set. Finally, we select the parameters \({\lambda }_{1},{\lambda }_{2}\) with the best performance on the test set and run scParser with the selected parameters on the whole data matrix until the stopping criteria are met.

In our study, the stopping criteria are defined as

$$\frac{\left|{\mathcal{L}}_{i}\left(\cdot \right)-{\mathcal{L}}_{i-10}\left(\cdot \right)\right|}{{\mathcal{L}}_{i-10}\left(\cdot \right)}<\sigma ,$$

where \({\mathcal{L}}_{i}\left(\cdot \right)\) is the loss at iteration \(i\). To reduce computational burden, we calculated it every 10 iterations. \(\sigma\) is a predefined threshold and set to \({10}^{-7}\) in our experiments.

Cell clustering and cell type annotation

Practically, we implemented the clustering procedure in SCANPY [80] and Seurat [12] for all methods for a fair comparison. For scParser, we computed the neighborhood graph of cells using SCANPY [80] with the low-rank embeddings from scParser with the parameter n_neighbors set to 20 and default settings for other parameters and performed Louvain clustering [81] directly on the neighborhood graph. In the clustering, we considered the number of cell types provided by the original data to be the expected number of clusters. Note that the cell types provided by original studies are annotated by marker genes with differential expression analysis.

In searching for a desired number of clusters, we first defined two parameters, min_resolution (usually set to 0) and max_resolution (usually set to 1), and ran the Louvain clustering algorithm with the resolution parameter equal to max_resolution. If the number of clusters obtained equals the expected cluster number, we stop searching and report the clustering result for further analysis; otherwise, we update the resolution parameters according to the following strategy and perform Louvain clustering again with the updated max_resolution. If the number of clusters we obtain is greater than expected, we redefine the max_resolution to be (min_resolution+ max_resolution)/2; if the number of clusters we obtained is less than expected, we set the value of min_resolution to be max_resolution and double the value of max_resolution. We recursively adjust the resolution parameter of the Louvain clustering algorithm until the expected number of clusters is attained.

The cell types of clusters are annotated with the cell types provided by original studies. If there are multiple cell types for one cluster, the cell type of the cluster is determined by the major cell types. For cell type annotation visualization, we further embedded the graph in two dimensions using UMAP [82] and plotted the two-dimensional embeddings against cell types annotated by original studies.