Bayesian log-normal deconvolution for enhanced in silico microdissection of bulk gene expression data

Andrade Barbosa, Bárbara; van Asten, Saskia D.; Oh, Ji Won; Farina-Sarasqueta, Arantza; Verheij, Joanne; Dijk, Frederike; van Laarhoven, Hanneke W. M.; Ylstra, Bauke; Garcia Vallejo, Juan J.; van de Wiel, Mark A.; Kim, Yongsoo

doi:10.1038/s41467-021-26328-2

Article
Open access
Published: 20 October 2021

Volume 12, article number 6106, (2021)

Abstract

Deconvolution of bulk gene expression profiles into the cellular components is pivotal to portraying a tissue’s complex cellular make-up, such as the tumor microenvironment. However, the inherently variable nature of gene expression requires a comprehensive statistical model and reliable prior knowledge of individual cell types that can be obtained from single-cell RNA sequencing. We introduce BLADE (Bayesian Log-normAl DEconvolution), a unified Bayesian framework to estimate both cellular composition and gene expression profiles for each cell type. Unlike previous comprehensive statistical approaches, BLADE can handle > 20 types of cells due to the efficient variational inference. Through an intensive evaluation with > 700 simulated and real datasets, BLADE demonstrated enhanced robustness against gene expression variability and better completeness than conventional methods, in particular, to reconstruct gene expression profiles of each cell type. In summary, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems from standard bulk gene expression data.

Introduction

Over the past decade, gene expression profiling has been applied to elucidate the complexity of transcriptional regulation in diverse biological contexts, such as cancer¹,². Conventional gene expression profiling, either by RNA sequencing (RNA-seq) or microarrays, captures cumulative gene expression levels of many cells combined. Therefore, it is often referred to as bulk gene expression profiling to distinguish it from the recent single-cell gene expression profiling technologies³. In oncology, single-cell RNA sequencing (scRNA-seq) is employed to study cellular heterogeneity within a tumor, composed of malignant (tumor) and non-malignant cells⁴⁻¹⁰. However, scRNA-seq has severe limitations, including technical challenges such as drop-out¹¹,¹² and high cost, which hinder its application to large series and translation to clinical applications.

Several computational deconvolution methods have been developed to predict cellular composition from bulk RNA-seq data by employing a signature of pre-determined cell type-specific gene expression profiles. Initially, these signatures were constructed by sorting each cell type followed by gene expression profiling¹³, whereas recent methods such as CIBERSORTx¹⁴ and MuSiC¹⁵ employed scRNA-seq data for this purpose. Most approaches perform linear regression to reconstruct the bulk gene expression profiles using the gene expression signatures, where the regression coefficients correspond to the cellular composition. However, the standard regression approach does not account for variability in gene expression within the same cell type and may render biased results.
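The regression view of deconvolution described above can be illustrated with a minimal sketch. It assumes a noise-free bulk profile and a randomly generated signature matrix; the variable names and sizes are hypothetical and are not taken from any of the cited tools.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_types = 50, 3

# Hypothetical signature matrix: mean linear-scale expression per gene and cell type.
signature = rng.lognormal(mean=2.0, sigma=1.0, size=(n_genes, n_types))

# Assumed ground-truth composition of one bulk sample.
true_fractions = np.array([0.6, 0.3, 0.1])

# Noise-free bulk profile: fraction-weighted sum of the signature columns.
bulk = signature @ true_fractions

# Non-negative least squares: minimize ||signature @ w - bulk|| subject to w >= 0.
weights, residual_norm = nnls(signature, bulk)

# Rescale the coefficients to sum to one so they can be read as fractions.
estimated_fractions = weights / weights.sum()
```

In this idealized setting the coefficients recover the composition. The sketch contains no within-cell-type variability, which is the situation in which such regression estimates may become biased.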

To the best of our knowledge, no deconvolution method can adequately and efficiently account for the gene expression variability within the same cell type. Modeling gene expression variability is challenging specifically for deconvolution due to the incompatibility of the log-normalization¹⁶, which significantly stabilizes gene expression variability. Without the log-normalization (i.e., in linear-scale), gene expression data has a heavily skewed distribution, which is not adequately modeled by the standard linear regression approaches, such as non-negative least squares (NNLS) used in EPIC¹⁷. Currently, a few probabilistic deconvolution approaches take skewed variability into account. However, these methods handle only a restricted number of cell types due to optimization difficulties (e.g., three cell types in DeClust¹⁸ and Demix/DemixT¹⁹,²⁰).

Recently, several linear-regression deconvolution approaches have been introduced that consider gene expression variability. MuSiC is a variant of NNLS that prioritizes genes for deconvolution by their variability obtained from the multi-subject single-cell RNA-seq data. CIBERSORTx introduced a two-step approach to address variable cell-type-specific gene expression profiles across the samples: first estimate the cellular fractions (deconvolution) and then reconstruct gene expression per cell type in each sample (purification). However, the purification step of CIBERSORTx can handle only a subset of genes because of the underdetermination problem, where too many parameters need to be inferred. In terms of cellular fraction estimation, both MuSiC and CIBERSORTx outperformed the standard linear regression methods, though they are also linear regression approaches.

Here, we introduce BLADE (Bayesian Log-normAl DEconvolution), a Bayesian method that jointly performs deconvolution and purification in a single step, taking into account prior knowledge of cell type-specific gene expression profiles obtained from scRNA-seq data. BLADE adopts a Bayesian framework that integrates two signatures, the mean and the variability of gene expression per cell type, using a log-normal probability model. The unified probabilistic model for both deconvolution and purification of BLADE can leverage the prior knowledge for purification, which can remedy the underdetermination issue. Furthermore, we developed an efficient variational inference algorithm and show that it can handle at least 20 cell types. Through a comprehensive evaluation based on more than 700 simulated and real bulk gene expression data sets, we demonstrate a robust performance of BLADE regardless of gene expression variability. In particular, BLADE achieves high accuracy and completeness in gene expression purification, underpinning the power of the unified Bayesian framework for both tasks.

Results

Gene expression variability within a cell type

We first assessed gene expression variability within a cell type using publicly available Peripheral Blood Mononuclear Cell (PBMC) CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) data from 10x Genomics. Based on the integration and clustering analysis followed by phenotyping of 9439 cells, we identified fifteen immune cell types, among which nine are in common, with distinct cell-surface markers and gene expression profiles (Fig. 1a; see “Methods” and Supplementary Figs. S1–2). The size of cell populations ranges from 38 regulatory T cells (0.36%) to 2518 classical monocytes (24%). We then identified differentially expressed genes (DEGs) for each cell type. Subsequently, the standard deviation of gene expression levels per gene and per cell type was measured to assess gene expression variability among the same cell types. We identified high gene expression variability among the same cell populations, especially for DEGs without log-transformation (i.e., linear-scale; Fig. 1b, c). The variability further increased when cells from the two scRNA-seq datasets were combined, indicating the presence of more variability between individuals (Fig. 1d; P < 2.2 × 10⁻¹⁶ from a one-tailed paired t-test of within-sample and between-sample variability).

**Fig. 1: Overview of single-cell CITE-seq data from two PBMC samples.**

Modeling gene-expression variability by probabilistic distribution

To properly account for variation in gene expression, we examined multiple probability distributions. We evaluated the normal, negative binomial, and log-normal distributions to fit the expression level of each gene per cell type without log-normalization. The normal distribution is the standard variability model in many deconvolution algorithms, including CIBERSORTx¹⁴, EPIC¹⁷, and ABIS²¹, while the negative binomial distribution is frequently used for handling count data such as RNA-seq data²². Note that the Poisson distribution was also introduced for modeling count data²³,²⁴, but it is a special case of the negative binomial. The log-normal distribution is the distribution of a variable whose logarithm is normally distributed; it thus assumes that gene expression data is normally distributed on a log scale but not on a linear scale. To evaluate the performance of these probability distributions on gene expression variability, we assessed (1) the maximum likelihood of fitting gene expression profiles and (2) the difference between estimated and empirical modes (i.e., the most probable gene expression level; Fig. 2a–c). The log-normal distribution, in general, shows the best performance in per-gene maximum likelihood, followed by the negative binomial and normal distributions (Fig. 2a, c). In particular, we noted a biased fit of the normal distribution toward outlier observations, which led to low accuracy in identifying modes (Fig. 2b; see four example genes with a biased fit with normal distribution in Fig. 2d). In mode estimation, the log-normal and negative binomial distributions appear to be fairly competitive, where the log-normal had a somewhat worse median but a better third quartile (Fig. 2b).

**Fig. 2: Comparison of normal, negative binomial, and log-normal distribution in fitting linear-scale gene expression data.**

We further evaluated the performance of the log-normal and negative binomial distributions in the context of deconvolution. To this end, we constructed a generic statistical deconvolution method that can model gene expression profiles with various probabilistic assumptions given known cellular fractions. The method approximates the convolution of random variables with an arbitrary distribution using a probabilistic generating function, for which both negative binomial and log-normal random variables can be accurately approximated (see “Methods”, Supplementary Note 1, and Supplementary Fig. S3). Based on this method, we evaluated the performance of negative binomial and log-normal distribution in fitting the gene expression profiles per cell type using RNA-seq data from TCGA²⁵. First, we obtained TCGA RNA-seq data of mesothelioma (TCGA-MESO; n = 84) and sarcoma (TCGA-SARC; n = 256), from which we estimated the fraction of eight cell types using EPIC¹⁷, a deconvolution method previously applied to the TCGA. Then, we applied the flexible deconvolution method with two different probabilistic assumptions, log-normal and negative binomial, to estimate expression profiles per cell type of 200 random genes. In terms of log-likelihood and root mean square error (RMSE) measured per gene, log-normal and negative binomial deconvolutions performed equally well for most of the genes, except for a few genes (Fig. 2e, f). Cumulatively, we concluded that the log-normal distribution is an attractive probabilistic distribution to model the gene expression variability of each cell type.
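To make the notion of a convolution of cell-type-specific distributions concrete, the following sketch evaluates the density of a fraction-weighted sum of two log-normal variables on a grid. It uses a plain discretized convolution integral rather than the generating-function approximation described above, and the fractions and log-scale parameters are hypothetical.

```python
import numpy as np
from scipy import stats

# Assumed fractions and log-scale parameters for two cell types.
fractions = np.array([0.7, 0.3])
log_mean = np.array([1.0, 2.0])
log_sd = np.array([0.5, 0.8])

step = 0.01
grid = np.arange(0.0, 40.0, step)

# f_t * x_t is again log-normal, with its scale multiplied by f_t.
scaled_pdfs = [
    stats.lognorm.pdf(grid, s=log_sd[t], scale=fractions[t] * np.exp(log_mean[t]))
    for t in range(2)
]

# Discretized convolution integral: g_y(y) = int_0^y g_1(u) g_2(y - u) du.
bulk_pdf = np.convolve(scaled_pdfs[0], scaled_pdfs[1])[: grid.size] * step

# Sanity checks: the density integrates to about one and its mean is close to
# the analytic mean sum_t f_t * exp(mu_t + sigma_t^2 / 2).
total_mass = bulk_pdf.sum() * step
numeric_mean = (grid * bulk_pdf).sum() * step
analytic_mean = np.sum(fractions * np.exp(log_mean + log_sd**2 / 2))
```

A grid-based convolution like this becomes expensive as the number of cell types grows, which is one reason an approximation is attractive for more than a few components.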

Overview of BLADE: Bayesian Log-normal Deconvolution

We constructed a Bayesian Log-normal Deconvolution method, BLADE, by emulating bulk gene expression profiles through convolution of gene expression profiles per cell type (Fig. 3a). The bulk gene expression level of each gene $j$ in sample $i$ was modeled by $y_{ij}=\sum_{t}f_{i}^{t}x_{ij}^{t}+\epsilon_{ij}$. Here, the hidden variables $f_{i}^{t}$ and $x_{ij}^{t}$ denote the fraction of cell type $t$ for sample $i$ and the purified expression level of gene $j$ of cell type $t$ for sample $i$. These hidden variables $f_{i}^{t}$ and $x_{ij}^{t}$ are, respectively, endowed with the Dirichlet distribution and the log-normal distribution. To incorporate prior knowledge from scRNA-seq data, we take a hierarchical approach to model $x_{ij}^{t}$ by taking a conjugate prior of the log-normal distribution with hyperparameters $\mu_{0j}^{t},\kappa_{0j}^{t},\alpha_{0j}^{t}$, and $\beta_{0j}^{t}$ (Fig. 3b). The hyperparameters are chosen based on the mean and standard deviation of each gene per cell type from the scRNA-seq data. By inferring the hidden variables, we can jointly estimate the fraction of cell types, captured by $f_{i}^{t}$, and purified gene expression profiles of each cell type in each sample, captured by $x_{ij}^{t}$. For inference, we employed collapsed variational inference, which maximizes efficiency by integrating out, in advance, a subset of hidden variables with a conjugate prior. Furthermore, we employed the L-BFGS algorithm in conjunction with machine-code translated Python code for gradient and objective function calculations instead of native Python code. The compilation of native Python code by the Numba package²⁶ significantly accelerates gradient and objective functions that are executed thousands of times during the L-BFGS optimization (Supplementary Fig. S4). See “Methods” and Supplementary Note 2 for further details of the framework.
As a result, BLADE can handle many cell types (>20 cell types), unlike the previous log-normal-based deconvolution that can account for a maximum of three cell types²⁰.
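The forward direction of this model can be sketched by simulation: draw Dirichlet fractions, draw log-normal expression per cell type, and sum. This is only an illustration of the convolution step. It omits the hierarchical prior and the noise term, it is not the inference algorithm, and all sizes and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes, n_types = 4, 6, 3

# Hypothetical per-gene, per-cell-type log-scale mean and standard deviation,
# standing in for signatures derived from scRNA-seq data.
log_mean = rng.normal(2.0, 1.0, size=(n_genes, n_types))
log_sd = np.full((n_genes, n_types), 0.5)

# Cell-type fractions per sample: Dirichlet draws are non-negative and sum to one.
fractions = rng.dirichlet(np.ones(n_types), size=n_samples)  # (samples, types)

# Sample- and cell-type-specific expression, log-normal on the linear scale.
expression = rng.lognormal(log_mean, log_sd, size=(n_samples, n_genes, n_types))

# Bulk profile: y_ij = sum_t f_i^t * x_ij^t (noise term omitted in this sketch).
bulk = np.einsum("it,ijt->ij", fractions, expression)
```

Deconvolution and purification correspond to inverting this simulation: recovering `fractions` and `expression` from `bulk` and the signatures.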

Robustness of BLADE deconvolution against gene expression variability

We assessed the robustness of BLADE, CIBERSORTx, and non-negative least squares (NNLS) against gene expression variability by applying them to model-based simulation data. The simulation data were created to have diverse but controlled variability levels of gene expression profiles (standard deviation of 0.1–1.5) as well as different numbers of cell types (5–20 cell types), marker genes (100–1000 genes), and samples (5–100 samples; in total 700 training data sets). Note that NNLS is a type of constrained linear regression used in many deconvolution methods, including MuSiC¹⁵, EPIC¹⁷, TIMER²⁷, ABIS²¹, and also in the purification step of CIBERSORTx¹⁴. The simulation data variability levels were selected to recapitulate the observed range in the scRNA-seq data (up to standard deviation of 1.5 in log scale; Fig. 1b, c). In general, all three methods could accurately estimate cellular fractions given a high number of genes, a low number of cell types, and a low variability level. In contrast, the performance decreased when fewer genes were available and the number of cell types increased (Fig. 4a; Supplementary Figs. S5–7). However, BLADE was the most robust against gene expression variability. In particular, in the range of observed expression variability of DEGs in the PBMC scRNA-seq data (on average standard deviation of > 0.5; Fig. 1b), BLADE significantly outperformed CIBERSORTx and NNLS.

**Fig. 4: Performance evaluation of BLADE using simulation data with diverse settings.**

We then compared the performance of BLADE and CIBERSORTx in estimating gene expression profiles per cell type. In this comparison, NNLS is not included because it would be redundant: the purification step of CIBERSORTx is based on NNLS. There are two modes of purification in CIBERSORTx, both of which were compared with BLADE: (1) estimating the average profile per cell type across the samples (group mode purification), and (2) estimating the profile per cell type for each sample (high-resolution mode purification). For the data sets with low variability levels, both BLADE and CIBERSORTx accurately reconstructed gene expression profiles per cell type (Fig. 4b, c; Supplementary Figs. S8–9). However, unlike BLADE, the performance of CIBERSORTx decreased rapidly as the RNA expression variability within a cell type increased. Furthermore, CIBERSORTx often excludes genes for purification, especially in high-resolution mode, when: (1) the number of cell types is larger than or equal to the number of samples, and (2) the variability in gene expression is high (Fig. 4d; Supplementary Figs. S10, S11). BLADE could accurately estimate the gene expression profiles of each cell type in both group mode and high-resolution mode, regardless of the number of cell types and samples, without any filtering (Fig. 4b, c; Supplementary Figs. S8–9).

Application of BLADE to in silico mixture of PBMC scRNA-seq data

We constructed realistic bulk gene expression data by in silico mixing the scRNA-seq data from PBMC samples without any model assumption to further evaluate our method. To this end, we randomly sampled 100 cells 20 times from the 9439 cells from the two PBMC scRNA-seq data sets. We chose to use 100 cells because, as more cells are sampled, more of them are shared between samples, which makes the simulated bulk gene expression data lose variability between the samples. In order to make the simulation data as realistic as possible, a cumulative sum of raw counts of 100 cells was obtained, followed by a standard normalization. The resulting simulation data recapitulate the gene expression variability of 15 cell types (Fig. 5a; Supplementary Fig. S12). We constructed signature matrices that capture the true mean and the standard deviation of 1007 genes selected and measured using all of 9439 cells (top 200 DEGs with FDR < 0.2 per cell type, combined). We also generated three extra data sets with a coarser classification of the 15 cell types into four (level 1; 441 genes selected), eight (level 2; 604 genes), and 12 cell types (level 3; 880 genes) in the same manner to diversify the difficulty levels for deconvolution (see Supplementary Data 1 for the details of classifications); the full set of 15 cell types is referred to as level 4. Increasing the number of cell types often lowers the fraction of each cell type and the number of genes that can classify each cell type (Supplementary Figs. S13–14). In particular, the fraction of T cells in level 1 is 0.47 on average, which gets much lower for their subtypes in level 4 (0.01 and 0.094 on average for naive and memory CD8+ T cells; Supplementary Fig. S13). Furthermore, although more genes are selected in the higher levels, there are 25 unique DEGs for T cells (DEGs only identified for T cells) in level 1, whereas there are only 16 and 3 unique DEGs for naive and memory CD8+ T cells in level 4 (Supplementary Fig. S14).
Collectively, deconvolution gets more challenging as the number of cell types increases from level 1 to level 4.
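The in silico mixing procedure can be sketched as follows, with randomly generated counts and labels standing in for the scRNA-seq data. The matrix sizes, the Poisson counts, and the choice of counts-per-10,000 scaling as the "standard normalization" are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, n_types = 1000, 30, 4

# Hypothetical raw count matrix and cell-type labels standing in for scRNA-seq data.
counts = rng.poisson(2.0, size=(n_cells, n_genes))
labels = rng.integers(0, n_types, size=n_cells)

n_mixtures, cells_per_mixture = 20, 100
bulk = np.zeros((n_mixtures, n_genes))
true_fractions = np.zeros((n_mixtures, n_types))

for i in range(n_mixtures):
    # Draw 100 distinct cells for this pseudo-bulk sample.
    chosen = rng.choice(n_cells, size=cells_per_mixture, replace=False)
    # Bulk profile: cumulative sum of the raw counts of the chosen cells.
    bulk[i] = counts[chosen].sum(axis=0)
    # Ground-truth composition: share of each cell type among the chosen cells.
    true_fractions[i] = np.bincount(labels[chosen], minlength=n_types) / cells_per_mixture

# Library-size normalization to counts per 10,000 (an assumed choice).
normalized = bulk / bulk.sum(axis=1, keepdims=True) * 1e4
```

Because the composition of each mixture is known by construction, `true_fractions` can serve as the reference against which estimated fractions are scored.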

**Fig. 5: Performance evaluation of BLADE using simulated PBMC bulk RNA-seq data.**

Using the bulk PBMC data generated above, we evaluated BLADE, taking CIBERSORTx, NNLS, and MuSiC as the baselines. We used the same list of genes and signatures for the baseline methods for a fair comparison. In general, the accuracy of estimated cell type fractions gets lower as the number of cell types gets higher, as expected (Fig. 5b, see also Spearman correlation coefficients and RMSE in Supplementary Fig. S15). All algorithms reached a Pearson correlation coefficient of > 0.5 for almost all cell types at level 1, whereas many cell types failed to reach this performance as the number of cell types increased. Interestingly, the performance was sometimes higher in level 3 than level 2, especially for MuSiC, possibly because the advantage of having more genes overcomes the complexity due to the increased number of cell types (e.g., 880 genes in level 3, compared to 604 genes in level 2). At level 4, BLADE outperformed CIBERSORTx (P-value of 0.0087; a one-tailed paired t-test) and NNLS (P-value of 0.021; a one-tailed paired t-test) and performed comparably to MuSiC (P-value of 0.46; one-tailed paired t-test). The performance of the four methods is significantly correlated (P-value < 0.05 from Pearson correlation test), especially in pairs of MuSiC and BLADE (Pearson correlation coefficient = 0.82; P-value = 1.9e−04), and NNLS and CIBERSORTx (Pearson correlation coefficient = 0.87; P-value = 3.0e−05; Fig. 5c; Supplementary Fig. S16 for the comparison in the levels 1–3). Among the 15 cell types, plasmablasts, classical monocytes, and natural killer (NK) cells were the best predicted by all four methods, which commonly failed to predict the composition of regulatory T cells (Tregs), naive CD8+ T cells (NaiveCD8T), and central memory CD4+ T cells (CMCD4T). These cell types are commonly of low abundance (fraction of < 7% on average), and only a few unique DEGs were identified for each cell type (< 50 unique DEGs; Fig. 5d; see Supplementary Fig. S17 for other levels).
In contrast, we noted a decent predictive performance of all methods for the abundant cell types (> 10%) with a high number of DEGs (> 50 unique DEGs).
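The evaluation measures used here (per-cell-type Pearson and Spearman correlation, RMSE, and a one-tailed paired t-test between methods) can be computed along the following lines. The helper `evaluate_fractions` and the two simulated "methods" are hypothetical and serve only to show the calculation.

```python
import numpy as np
from scipy import stats

def evaluate_fractions(true, estimated):
    """Return per-cell-type (Pearson r, Spearman rho, RMSE) across samples."""
    results = []
    for t in range(true.shape[1]):
        pearson = stats.pearsonr(true[:, t], estimated[:, t])[0]
        spearman = stats.spearmanr(true[:, t], estimated[:, t])[0]
        rmse = np.sqrt(np.mean((true[:, t] - estimated[:, t]) ** 2))
        results.append((pearson, spearman, rmse))
    return np.array(results)

def perturb(fractions, noise_sd, rng):
    """Add noise to fractions and renormalize, mimicking an imperfect estimate."""
    noisy = np.clip(fractions + rng.normal(0.0, noise_sd, size=fractions.shape), 1e-6, None)
    return noisy / noisy.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
true = rng.dirichlet(np.ones(5), size=20)          # 20 samples, 5 cell types
metrics_a = evaluate_fractions(true, perturb(true, 0.02, rng))  # accurate method
metrics_b = evaluate_fractions(true, perturb(true, 0.20, rng))  # noisy method

# One-tailed paired t-test on per-cell-type Pearson coefficients:
# is method A better than method B across cell types?
t_stat, p_value = stats.ttest_rel(metrics_a[:, 0], metrics_b[:, 0], alternative="greater")
```

Pairing by cell type matters because some cell types are intrinsically harder to estimate than others, regardless of the method.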

BLADE significantly outperformed CIBERSORTx in estimating gene expression profiles per cell type in both group mode and high-resolution mode across all the levels (Fig. 5e, f and Supplementary Fig. S18). For group mode purification, CIBERSORTx performed comparably to BLADE at level 1, but its performance declined at the higher levels. Here, BLADE’s performance was near-perfect, as expected, since BLADE integrates cell-type-specific gene expression profiles for purification (Fig. 5e). CIBERSORTx did not estimate expression levels of most genes in high-resolution mode, and essentially no genes were purified for 11 cell types at level 4 (right panel of Fig. 5f; Supplementary Fig. S19). Furthermore, the expression profiles estimated by CIBERSORTx are in general less accurate than those of BLADE in all levels, except for a few cell types (e.g., central memory CD4+ T cells and naive CD4+ T cells at level 4; Fig. 5f). The performance of BLADE in high-resolution mode purification is consistently accurate (> 0.7 Pearson correlation coefficient) across all cell types in all levels (Supplementary Fig. S20). Cumulatively, Bayesian simultaneous deconvolution and in silico purification by BLADE significantly outperformed CIBERSORTx in reconstructing gene expression profiles per cell type.

Application of BLADE to standard bulk RNA-seq data with incomplete prior knowledge

We further challenged BLADE and other deconvolution algorithms using the standard bulk RNA-seq data of PBMC immune cell mixtures for which the composition of eight immune cell types was determined by flow cytometry²⁸ (Fig. 6a). Of these eight cell types, neutrophils were not identified in our PBMC scRNA-seq data. Furthermore, there are cells left undetermined by the flow cytometry analysis that still contributed to the bulk RNA-seq data. Therefore, there is only limited prior knowledge available on cell-type-specific gene expression profiles, which is the case for most applications of deconvolution. We applied BLADE and other baseline methods using the gene expression signatures consisting of 532 genes that can distinguish seven cell types derived in the same manner as in the previous section (see Supplementary Data 1 for the cell type classification). BLADE was able to reconstruct fractions of the seven cell types rather accurately, except for myeloid dendritic cells (mDC; Fig. 6b, c and Supplementary Fig. S21). We confirmed a low concordance of the mDC signature with the signature previously determined using a large number of RNA-seq data²⁸ (53 samples; Fig. 6d). In fact, the mDC signature has a higher correlation with previous B cell and monocyte signatures (Fig. 6e), which makes the signatures less informative and the deconvolution extra challenging. Other baseline methods estimated compositions of monocytes accurately, but they failed to do the same for the majority of the other cell types including mDC (Fig. 6b). In fact, they often failed to detect some cell types; in particular, Tregs were commonly missed (Fig. 6c). Instead, BLADE overestimated the fraction of Tregs and underestimated the fraction of CD8+ T cells by absorbing CD8+ T cell fractions into Tregs. However, BLADE was still able to rank samples accurately by their fractions. Cumulatively, BLADE was the most robust method for estimating cell type fractions when available prior knowledge was incomplete.

**Fig. 6: Performance evaluation of BLADE using PBMC bulk RNA-seq data with incomplete prior knowledge.**

Evaluation of BLADE for deconvolution of tumor RNA-seq data

We further evaluated our method using scRNA-seq data from tumor samples. First, we obtained scRNA-seq data for 35 pancreas samples (CRP000653; Genome Sequence Archive), of which 24 are tumors while the other 11 are normal. The scRNA-seq data contains 57,530 cells classified into 10 cell types²⁹ (Fig. 7a; Supplementary Fig. S22). For a fair evaluation of deconvolution algorithms, the 35 samples and their cells were split into auxiliary (six samples, of which four are tumors) and main samples (29 samples, of which 20 are tumors; Supplementary Fig. S23). From the auxiliary samples, we obtained the mean and standard deviation of 818 genes that can classify ten cell types reliably (top 100 DEGs with FDR < 0.1 per cell type). For the main samples, we generated bulk gene expression profiles by calculating a cumulative sum of the raw count of all cells, followed by the standard log-normalization. For predicting the fraction of 10 cell types, MuSiC performed the best, followed by BLADE and CIBERSORTx (Fig. 7b; see Spearman correlation coefficients and RMSE in Supplementary Fig. S24). Interestingly, the performance of BLADE correlates the most with MuSiC (Pearson correlation coefficient of 0.62; P-value of 0.056), whereas it is less so with CIBERSORTx (Pearson correlation coefficient of 0.39; P-value of 0.27) and NNLS (Pearson correlation coefficient of −0.18; P-value of 0.62; Fig. 7c). BLADE outperformed MuSiC for predicting the fraction of B cells but was worse for predicting endocrine cell fractions. Most cell types achieved high performance (> 0.5 of Pearson correlation coefficient) in all methods, except for B cells (in MuSiC and CIBERSORTx), T cells (in CIBERSORTx and NNLS), and stellate cells (in NNLS). These cell types are often less dominant (< 5%) and have a small number of DEGs (less than 40 unique DEGs; Fig. 7d).
For reconstructing cell-type-specific gene expression profiles, both BLADE and CIBERSORTx achieved high performance for all cell types (> 0.8 of Pearson correlation coefficient; > 0.9 mostly for BLADE; Fig. 7e, f). However, the purification by BLADE is without any filtering, unlike CIBERSORTx, which purified around 30% of genes per cell type on average in high-resolution mode (Fig. 7g). Cumulatively, BLADE is a reliable deconvolution method especially to reconstruct cell-type-specific gene expression profiles in the tumor context.

**Fig. 7: Performance evaluation of BLADE for deconvoluting tumor data.**

Discussion

One of the major challenges in the deconvolution of bulk RNA-seq data is the adequate and efficient handling of gene expression variability, mainly because stabilization of variability by log-normalization is inapplicable. Most of the previous algorithms implicitly or explicitly assumed a normal distribution, as otherwise the inference is highly challenging and limits the maximum number of cell types that can be handled (three cell types in Demix¹⁹). However, the normal distribution often renders a biased fit for gene expression variability (Fig. 2a–d), leading to a suboptimal outcome of deconvolution algorithms. Consequently, the performance of the standard regression technique, NNLS, was consistently inferior, especially when there is a realistic level of gene expression variability (Figs. 4–7).

CIBERSORTx and MuSiC are also linear-regression approaches that partially alleviate the issue by prioritizing genes for deconvolution. Support vector regression, the core algorithm of CIBERSORTx, depends on a subset of genes with high reconstruction errors. In contrast, MuSiC explicitly learns gene weights from the single-cell RNA-seq data and prioritizes genes with low variability, for which the normal distribution can fit accurately due to the low skewness. We noted a consistently superior performance for fraction estimations of these algorithms over NNLS (Figs. 4a, 5b, 6b, 7b). MuSiC outperformed BLADE in some cases, indicating that normal distribution-based deconvolution can also be accurate when genes are prioritized based on the gene expression variability. However, the performance of MuSiC compared unfavorably to that of BLADE when prior knowledge was incomplete (Fig. 6). Furthermore, the strategy of prioritizing genes reduces the completeness of the purification results (Figs. 4d, 5f, 7g). We observed a lower performance of linear regression-based purification by CIBERSORTx, particularly in high-resolution mode, which may be due to the inefficient variability model and a large number of variables to be estimated (Figs. 4b, c, 5e, f, 7e, f).

BLADE is a hierarchical Bayesian model that simultaneously performs deconvolution and estimation of gene expression profiles per cell type. The log-normal convolution model efficiently accounts for variability in gene expression and also for prior knowledge of gene expression profiles per cell type derived from scRNA-seq data (Fig. 3). Notably, thanks to the unified probabilistic model used in BLADE, the prior knowledge contributes to both deconvolution and gene expression purification. This prior knowledge significantly reduces the search space of solutions for both tasks, which leads to enhanced accuracy and completeness, especially for gene expression purification. The efficient variational inference of BLADE allowed it to handle many cell types while accurately modeling the gene expression variability. Furthermore, the hierarchical approach of BLADE makes it robust against the quality of prior knowledge, as demonstrated in Fig. 6. Finally, unlike MuSiC and CIBERSORTx, the Bayesian framework of BLADE provides the uncertainties of estimates, which may be valuable to evaluate the quality of the results and for further downstream analysis.

Enhanced in silico microdissection by BLADE opens up the possibility to molecularly characterize individual cell types in tissue based on the standard RNA-seq data. For instance, we demonstrated that BLADE could be applied to estimate each cell type’s gene expression profiles that make up the tumor microenvironment (TME). This allows us to characterize pathway activity in each immune cell type and possibly to recognize additional cell (sub-)types. Furthermore, BLADE can aid previously established gene expression subtypes (e.g., PDAC³⁰,³¹) by characterizing the subtypes with distinct TME profiles. Finally, the detailed profiling of the TME, particularly immune TME profiles, may lead to a clinically applicable biomarker strategy for immunotherapy based on the standard bulk gene expression profiling. In conclusion, BLADE is a powerful tool that can significantly contribute to unraveling cellular heterogeneity in complex biological systems.

Methods

PBMC single-cell RNA-seq data

Two public peripheral blood mononuclear cell (PBMC) CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) datasets of healthy donors were downloaded from the 10x Genomics datasets database [https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3] [https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3]. Genes and cells were filtered based on the following criteria: percentage of mitochondrial genes <10% and number of genes per cell between 200 and 4000. After the filtering, raw count data was normalized and scaled using SCTransform, which performs normalization and variance stabilization using regularized negative binomial regression. Dimensionality reduction was done using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). Following that, the k-nearest neighbors (knn) of each cell were determined using 25 dimensions of PCA. This knn graph was used to construct the Shared Nearest Neighbor (SNN) graph by calculating the neighborhood overlap (Jaccard index) between every cell and its 20 nearest neighbors. Cluster determination was done by SNN graph modularity optimization based on the Louvain algorithm with a resolution of 1. Cells were phenotyped separately in both datasets, using primarily cell surface markers and then gene expression levels in case of lack of usable cell surface markers (Supplementary Figs. S1–2). The two datasets were individually normalized, followed by selecting variable genes. The two datasets were then integrated and batch corrected using the common variable genes. The same analysis as described above was performed on the merged dataset, including PCA, SNN, and cluster determination³². Finally, the top 200 differentially expressed genes per cell type were identified using a two-sided Wilcoxon rank-sum test by taking a contrast between one cell type versus the rest with an FDR cutoff of 0.2.
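The cell-level filtering criteria can be expressed as a simple mask over a count matrix. The sketch below is not the pipeline used for the analysis; the function `qc_filter`, the toy matrix, and the treatment of the 200 and 4000 bounds as inclusive are assumptions.

```python
import numpy as np

def qc_filter(counts, is_mito, max_mito_pct=10.0, min_genes=200, max_genes=4000):
    """Return a boolean mask of cells (rows) that pass the quality filters."""
    total = counts.sum(axis=1)
    # Percentage of each cell's counts that come from mitochondrial genes.
    mito_pct = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    # Number of genes with at least one count in each cell.
    genes_detected = (counts > 0).sum(axis=1)
    return (
        (mito_pct < max_mito_pct)
        & (genes_detected >= min_genes)
        & (genes_detected <= max_genes)
    )

# Toy data: 3 cells x 500 genes, with gene 0 flagged as mitochondrial.
counts = np.zeros((3, 500), dtype=int)
counts[0, :300] = 1      # 300 genes detected, low mitochondrial share: kept
counts[1, :100] = 1      # only 100 genes detected: removed
counts[2, :300] = 1
counts[2, 0] = 400       # mitochondrial share above 10%: removed
is_mito = np.zeros(500, dtype=bool)
is_mito[0] = True

keep = qc_filter(counts, is_mito)
filtered = counts[keep]
```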

Comparison between log-normal, negative binomial, and normal distribution in fitting raw gene expression counts

To evaluate the log-normal, normal, and negative binomial distributions in fitting gene expression profiles, we retrieved raw counts per gene and per cell type and fit the three distribution types using the maximum-likelihood method available in the fitdistrplus R package. For each cell type, genes with a standard deviation lower than 0.5 were filtered out, as they are mostly not expressed in that cell type. Finally, the log-likelihoods of the optimized distributions were obtained per gene and per cell type for comparison. As an alternative measure, we also identified the mode (i.e., the peak of the probability distribution) of each optimized distribution and assessed its accuracy by comparing it to the mode of the empirical distribution for each gene and cell type pair.
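As a rough Python analogue of this comparison (the study itself used fitdistrplus in R), the sketch below fits a log-normal and a normal distribution by maximum likelihood with SciPy and compares their log-likelihoods on synthetic data. The negative binomial fit is omitted for brevity, the helper names are hypothetical, and the log-normal fit assumes strictly positive values, so zero counts would need separate handling.

```python
import numpy as np
from scipy import stats

def loglik_lognormal(x):
    """Maximized log-likelihood of a log-normal fit (location fixed at 0)."""
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    return stats.lognorm.logpdf(x, shape, loc=loc, scale=scale).sum()

def loglik_normal(x):
    """Maximized log-likelihood of a normal fit."""
    mu, sd = stats.norm.fit(x)
    return stats.norm.logpdf(x, mu, sd).sum()

# Synthetic expression values for one gene in one cell type (illustration only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.8, size=500)
ll_ln, ll_norm = loglik_lognormal(x), loglik_normal(x)
```

Because both models have two free parameters, the two log-likelihoods can be compared directly.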

A generic deconvolution method with known cellular composition

For a fair comparison of the log-normal and negative binomial distributions for deconvolution, we developed a simple, generic maximum-likelihood-based convolution model. Formally, it is assumed that there are $i=1,...,I$ samples in which $t=1,...,T$ cell types jointly contribute to the expression profiles of $j=1,...,J$ genes. For each sample $i$ and gene $j$, a bulk expression level is given, indicated by $y_{ij}$. Then, two hidden variables are introduced that jointly make up $y_{ij}$: (1) the expression level of the gene per cell type and sample, $x^{t}_{ij}$; and (2) the cellular composition for each cell type $t$, $f^{t}_{i}$, where $f^{t}_{i}\ge 0$ for all $t$ and $\sum_{t}f^{t}_{i}=1$. An important strength of this method is that it applies to any underlying parametric distribution for $x^{t}_{ij}$. $y_{ij}$ is a (weighted) convolution:

$${y}_{{ij}}=\mathop{\sum }\limits_{t=1}^{T}{f}^{t}_{i}{x}^{t}_{{ij}}$$

(1)

which implies, with ${{\hat{x}}^{t}}_{{ij}}={f}^{t}_{i}{x}^{t}_{{ij}}$,

$$g_{y_{ij}}(y)=\int_{u_{1}=0}^{y}\cdots \int_{u_{T-1}=0}^{y-\sum_{t=1}^{T-2}u_{t}}g_{\hat{x}^{1}_{ij}}(u_{1})\cdots g_{\hat{x}^{T-1}_{ij}}(u_{T-1})\,g_{\hat{x}^{T}_{ij}}\left(y-\sum_{t=1}^{T-1}u_{t}\right)du_{1}\cdots du_{T-1}.$$

(2)

By assuming that $x^{t}_{ij}$ follows a log-normal distribution (i.e., $x^{t}_{ij}\sim LN(\mu^{t}_{j},(\sigma^{t}_{j})^{2})$) and thus $\hat{x}^{t}_{ij}\sim LN(\mu^{t}_{j}+\log f^{t}_{i},(\sigma^{t}_{j})^{2})$, $y_{ij}$ is a convolution of $T$ log-normal random variables. The interest lies in estimating the parameters $\theta_{j}=(\mu_{j}^{t},\sigma_{j}^{t})$, $t=1,...,T$, by maximum likelihood.
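The scaling property used here can be checked numerically. The sketch below is an illustration with arbitrary example values, not part of the original analysis: it draws $x\sim LN(\mu,\sigma^{2})$, multiplies by a fixed fraction $f$, and confirms that $\log(fx)$ has mean $\mu+\log f$ and standard deviation $\sigma$. It then forms a bulk value as in Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(1)

# One cell type: scaling a log-normal variable by f shifts the log-scale mean by log(f)
mu, sigma, f = 1.0, 0.5, 0.3
x = rng.lognormal(mean=mu, sigma=sigma, size=200_000)
log_x_hat = np.log(f * x)
shift_ok = np.isclose(log_x_hat.mean(), mu + np.log(f), atol=0.01)
scale_ok = np.isclose(log_x_hat.std(), sigma, atol=0.01)

# Eq. (1) for one sample and one gene with T = 3 cell types (example values)
fractions = np.array([0.2, 0.5, 0.3])            # f_i^t, sums to 1
mus = np.array([0.5, 1.0, 2.0])                  # mu_j^t
x_t = rng.lognormal(mean=mus, sigma=0.5)         # x_ij^t, one draw per cell type
y = float(fractions @ x_t)                       # bulk expression y_ij
```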

While numerical evaluation of (2) may still be efficient for $T=2$²⁸, the extension to $T>2$, which requires a $(T-1)$-dimensional integral, is not straightforward. To this end, the log-normal density $g_{t}=g_{\hat{x}^{t}_{ij}}$ is approximated by a probability generating function (PGF). See Supplementary Note 1 for the details of the PGF approximation. The PGF-based approximation of $g_{t}$ showed higher accuracy than an alternative approximation method, the Fenton-Wilkinson (FW) approximation³³, which was also included as a benchmark (see Supplementary Note 1 and Supplementary Fig. S3).
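For orientation, the FW benchmark can be sketched in a few lines. The code below is the textbook moment-matching form of the Fenton-Wilkinson approximation for independent components, written for illustration; the function name is hypothetical and it does not reproduce the PGF approach described in Supplementary Note 1.

```python
import numpy as np

def fenton_wilkinson(mu, sigma):
    """Approximate a sum of independent log-normals LN(mu_t, sigma_t^2) by a
    single log-normal whose mean and variance match those of the sum."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Exact mean and variance of the sum of independent log-normal variables
    mean_sum = np.sum(np.exp(mu + sigma**2 / 2))
    var_sum = np.sum((np.exp(sigma**2) - 1.0) * np.exp(2 * mu + sigma**2))
    # Log-scale parameters of the matching log-normal
    sigma2_fw = np.log1p(var_sum / mean_sum**2)
    mu_fw = np.log(mean_sum) - sigma2_fw / 2
    return mu_fw, np.sqrt(sigma2_fw)
```

For the weighted components $\hat{x}^{t}_{ij}$, the log-scale means passed to the function would be $\mu^{t}_{j}+\log f^{t}_{i}$.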

Comparison of LN and NB based on the generic deconvolution technique

The aforementioned generic deconvolution was used to evaluate LN and NB for deconvolution. For this, two RNA-seq data sets were retrieved from The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/tcga/) using TCGAbiolinks³⁴. We considered all complete samples from the following tumor types: Mesothelioma (MESO³⁵, n = 84) and Sarcoma (SARC³⁶, n = 256). We retrieved the upper quartile normalized RSEM (RNASeq by expectation-maximization) TPM (transcript per million) gene expression values (R package curatedTCGAData), merged replicated measurements (R package MultiAssayExperiment), and extracted the sample definitions from the barcodes (R package TCGAutils). We retained genes with a mean count larger than or equal to 5. For visualizing results, 200 genes were sampled randomly from this set³⁷. The comparison procedure for LN and NB distributions is:

1. Apply a non-statistical method, EPIC¹⁷, to estimate cell type fractions for bulk RNA-seq data using cell type-specific reference signatures. It has been shown that EPIC provides a reliable estimate of cellular fractions of $T=8$ cell types³⁸, and it provides fractions that add up to 1.

2. Fix the cellular fractions and fit generic deconvolution models with $T=8$ LN or NB components using maximum likelihood.

3. Compare the maximum likelihood values of the LN and NB models for each of the $J$ genes.

The above procedure was done for 200 randomly selected genes with a mean count per million larger than or equal to 5 to exclude lowly expressed genes. Note that the comparison of the maximum likelihood values is fair, because the number of parameters used in the LN and NB components is the same, namely $2T=16$ per gene. As an alternative metric, we also measured the accuracy in reconstructing bulk gene expression levels based on deconvolution. Taking the actual and predicted bulk gene expression levels in the LN or NB deconvolution model, the root-mean-squared error (RMSE) was evaluated per gene and per model.

Hierarchical Bayesian model for convolution of log-normal variables (BLADE)

BLADE is a hierarchical Bayesian model for log-normal convolution that accounts for prior knowledge of per-cell-type gene expression profiles (see overview in Fig. 3a). Formally, we assume $y_{ij}=\sum_{t}f_{i}^{t}x_{ij}^{t}+\epsilon_{ij}$, where $\epsilon_{ij}$ is a log-normal error with mean parameter 0 and variance parameter $\gamma_{j}$. Then, $x^{t}_{ij}$ follows a log-normal distribution: $x^{t}_{ij}\sim LN(\mu^{t}_{j},\frac{1}{\lambda^{t}_{j}})$, where $\mu^{t}_{j}$ and $\lambda^{t}_{j}$ are the expected value and precision in log-scale. Note that the parameters $\mu^{t}_{j}$ and $\lambda^{t}_{j}$ are shared across the samples. To incorporate prior knowledge on gene expression profiles per cell type, a hierarchical Bayesian approach was taken: $\mu^{t}_{j}$ and $\lambda^{t}_{j}$ are endowed with normal-gamma priors with hyperparameters $\mu^{t}_{0j}$, $\kappa^{t}_{0j}$, $\alpha^{t}_{0j}$, and $\beta^{t}_{0j}$: $(\mu^{t}_{j},\lambda^{t}_{j})\sim NG(\mu^{t}_{0j},\kappa^{t}_{0j},\alpha^{t}_{0j},\beta^{t}_{0j})$. Note that the normal-gamma distribution is a conjugate prior of the log-normal distribution, based on which the marginal distribution of $x^{t}_{ij}$ given the hyperparameters $\mu^{t}_{0j}$, $\kappa^{t}_{0j}$, $\alpha^{t}_{0j}$, and $\beta^{t}_{0j}$ is analytically tractable. The other hidden variable, $f^{t}_{i}$, was endowed with a Dirichlet distribution: $(f^{1}_{i},...,f^{T}_{i})\sim D(\alpha^{1}_{i},...,\alpha^{T}_{i})$.
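To make the hierarchy concrete, the following sketch draws one data set from the priors described above. It is a forward simulation for illustration only, not the BLADE inference code. It assumes the standard normal-gamma parametrization ($\lambda\sim\mathrm{Gamma}(\alpha_{0},\text{rate}=\beta_{0})$, $\mu\mid\lambda\sim N(\mu_{0},1/(\kappa_{0}\lambda))$), leaves out the error term $\epsilon_{ij}$, and uses a hypothetical function name.

```python
import numpy as np

def sample_from_prior(mu0, kappa0, alpha0, beta0, alpha_dir, n_samples, rng):
    """mu0, kappa0, alpha0, beta0: arrays of shape (T, J); alpha_dir: shape (T,).

    Returns fractions f (I, T), per-cell-type expression x (I, T, J) and the
    noiseless bulk mixture y (I, J).
    """
    # Normal-gamma prior on (mu, lambda), one pair per cell type and gene
    lam = rng.gamma(shape=alpha0, scale=1.0 / beta0)
    mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * lam))
    # x_ij^t ~ LN(mu_j^t, 1 / lambda_j^t); mu and lambda are shared across samples
    x = rng.lognormal(mean=mu, sigma=1.0 / np.sqrt(lam), size=(n_samples,) + mu.shape)
    # Cellular fractions per sample from a Dirichlet prior
    f = rng.dirichlet(alpha_dir, size=n_samples)
    # Weighted sum over cell types (error term omitted in this sketch)
    y = np.einsum("it,itj->ij", f, x)
    return f, x, y
```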

For the inference, collapsed variational inference was employed to handle the analytically intractable posterior distribution of the hidden variables given the observed variables³⁹. In this framework, the random variables with a conjugate prior distribution, $\mu^{t}_{j}$ and $\lambda^{t}_{j}$, were integrated out, which allows us to find a fully Bayesian estimate of $x^{t}_{ij}$ instead of estimating the single most probable $\mu^{t}_{j}$ and $\lambda^{t}_{j}$³⁹. By defining the variational distribution for the hidden variables, $x^{t}_{ij}$ and $f_{i}^{t}$, the objective is to minimize the dissimilarity between the variational distribution and the posterior distribution, measured by the Kullback-Leibler divergence (see Supplementary Note 2 for the detailed derivation). The minimization was done with the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm available in the SciPy Python library, with the constraints $f_{i}^{t}\ge 0$ and $\sum_{t}f_{i}^{t}=1$. Numba-compiled objective function and gradients were used for acceleration.
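The text does not state how the simplex constraints are enforced within L-BFGS. One common approach, shown here purely as an assumption, is to optimize unconstrained variables and map them to fractions with a softmax. The toy objective below (squared error in log-scale against known per-cell-type profiles) stands in for the variational objective, gradients are left to SciPy's finite differences rather than Numba-compiled code, and `fit_fractions` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def fit_fractions(y, x_mean):
    """Estimate fractions f for one sample.

    y      : bulk profile, shape (J,)
    x_mean : per-cell-type expression profiles, shape (T, J)
    """
    n_types = x_mean.shape[0]

    def loss(z):
        f = softmax(z)  # f >= 0 and sum(f) == 1 by construction
        resid = np.log(f @ x_mean) - np.log(y)
        return np.sum(resid**2)

    res = minimize(loss, np.zeros(n_types), method="L-BFGS-B")
    return softmax(res.x)

# Noiseless toy example with known fractions
rng = np.random.default_rng(0)
x_mean = rng.lognormal(0.0, 1.0, size=(3, 50))
f_true = np.array([0.2, 0.5, 0.3])
f_hat = fit_fractions(f_true @ x_mean, x_mean)
```

The softmax map keeps every iterate on the simplex, so no explicit constraint handling is needed in the optimizer.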

Selection of hyperparameters based on the empirical-Bayes framework

BLADE has multiple hyperparameters for the hidden variables $x^{t}_{ij}$ and $f_{i}^{t}$, and also for the observed variable $y_{ij}$. For $f_{i}^{t}$, a hyperparameter $\alpha^{t}_{i}$ of the Dirichlet distribution needs to be set. A single user-defined value is assigned to $\alpha^{t}_{i}$ for all $t$, since we do not have prior information on cellular composition. For $y_{ij}$, we need to specify a precision for each gene, $\gamma_{j}$, for which we chose $\frac{1}{\mathbb{V}(\log y_{ij})s}$, where $s$ and $\mathbb{V}(\log y_{ij})$ are a user-defined scale factor and the variance in log-scale measured per gene, respectively. For the hyperparameters of $x^{t}_{ij}$, namely $\mu^{t}_{0j}$, $\kappa^{t}_{0j}$, $\alpha^{t}_{0j}$, and $\beta^{t}_{0j}$, we incorporated prior knowledge of gene expression profiles per cell type obtained from the scRNA-seq data. Given the log-normal likelihood and normal-gamma priors, the mean and variance of $x^{t}_{ij}$ in log-scale are $\mathbb{E}(\log x^{t}_{ij})=\mu^{t}_{0j}$ and $\mathbb{V}(\log x^{t}_{ij})=\frac{\beta^{t}_{0j}}{\alpha^{t}_{0j}}$, respectively. To make use of the prior knowledge, we obtained the sample estimates of $\mathbb{E}(\log x^{t}_{ij})$ and $\mathbb{V}(\log x^{t}_{ij})$ from the scRNA-seq data, denoted by $\mu^{t}_{j}$ and $(\sigma^{t}_{j})^{2}$. Then, we assigned $\mu^{t}_{0j}=\mu^{t}_{j}$, whereas $\alpha^{t}_{0j}$ is set by the user, followed by deriving $\beta^{t}_{0j}=\alpha^{t}_{0j}(\sigma^{t}_{j})^{2}$. Here, $\alpha^{t}_{0j}$ makes it possible to adapt to how much information the single-cell data carries for the bulk RNA-seq data. The other hyperparameter, $\kappa^{t}_{0j}$, is also user-defined and serves as a scale factor for the variance of $\mu^{t}_{j}$ (see also Supplementary Note 2).
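The assignments above amount to a few array operations. The sketch below is illustrative rather than taken from the BLADE code base: the function names are hypothetical, and the use of the unbiased sample variance (`ddof=1`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def expression_hyperparameters(log_x_ref, alpha0):
    """Derive mu0 and beta0 for one cell type.

    log_x_ref : log-scale reference expression, shape (cells, J)
    alpha0    : user-defined alpha_0j^t (scalar or array of shape (J,))
    """
    mu0 = log_x_ref.mean(axis=0)               # mu_0j^t = sample mean in log-scale
    sigma2 = log_x_ref.var(axis=0, ddof=1)     # (sigma_j^t)^2
    beta0 = alpha0 * sigma2                    # beta_0j^t = alpha_0j^t * (sigma_j^t)^2
    return mu0, beta0

def observation_precision(log_y, s):
    """gamma_j = 1 / (V(log y_ij) * s), with log_y of shape (I, J)."""
    return 1.0 / (log_y.var(axis=0, ddof=1) * s)
```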

An empirical Bayes approach was employed to select the best set of user-defined parameters⁴⁰. For each configuration of parameters, a maximum likelihood estimate of the variational parameters is obtained using a subset of samples. Then, the hyperparameter configuration with the highest likelihood is selected, followed by performing deconvolution using the entire data set. Only a subset of samples is used in the empirical Bayes step, not only to gain computational efficiency but also to avoid overfitting. Throughout the manuscript, we considered a total of 90 different parameter configurations that cover all possible combinations of $\alpha^{t}_{i}\in\{1,10\}$, $\alpha^{t}_{0j}\in\{0.1,0.5,1,5,10\}$, $\kappa^{t}_{0j}\in\{1,0.5,0.1\}$, and $s\in\{1,0.3,0.5\}$.
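The grid of 90 configurations can be enumerated directly from the four value sets. In the sketch below the dictionary keys and the `select_best` helper are illustrative names, and `score` stands for whatever likelihood is obtained by fitting the model on the sample subset.

```python
from itertools import product

# Candidate values for the user-defined hyperparameters, as listed in the text
grid = {
    "alpha": [1, 10],                 # Dirichlet parameter alpha_i^t
    "alpha0": [0.1, 0.5, 1, 5, 10],   # alpha_0j^t
    "kappa0": [1, 0.5, 0.1],          # kappa_0j^t
    "s": [1, 0.3, 0.5],               # scale factor for gamma_j
}

# All combinations: 2 * 5 * 3 * 3 = 90 configurations
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

def select_best(configs, score):
    """Return the configuration with the highest score (e.g. a likelihood
    obtained by fitting on a subset of samples)."""
    return max(configs, key=score)
```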

Construction of the simulation data with a controlled noise level

We constructed simulation data sets of bulk gene expression profiles with known cellular fractions, gene expression profiles per cell type, and a diverse number of cell types and samples. To this end, given a number of cell types and genes, we first randomly sample an expected gene expression level $\mu^{t}_{j}$ for gene $j$ and cell type $t$ from a normal distribution with 0 mean and standard deviation of 1.5: $\mu^{t}_{j}\sim N(0,2)$. Then, we sample gene expression levels per sample and per cell type, $x^{t}_{ij}$, from a log-normal distribution with mean $\mu^{t}_{j}$ and standard deviation $\sigma$ ($x^{t}_{ij}\sim LN(\mu^{t}_{j},\sigma)$), where $\sigma$ is the parameter that controls the variability in gene expression per cell type of each simulation data set. Fractions of cell types are sampled from a Dirichlet distribution with an uninformative prior: $(f^{1}_{i},...,f^{T}_{i})\sim D(\alpha^{1}_{i},...,\alpha^{T}_{i})$, where $\alpha_{i}^{t}=1$ for all $t$. Then, the bulk gene expression profiles are generated by $y_{ij}=\sum_{t}f_{i}^{t}x_{ij}^{t}$. We constructed a total of 700 training data sets with the following settings: (1) number of samples = [5,10,20,50,100]; (2) number of genes = [100,200,500,1000]; (3) number of cell types = [2,3,5,10,20]; and (4) level of variability in gene expression profiles per cell type: $\sigma$ = [0.1,0.2,0.5,0.75,1,1.25,1.5].
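A compact sketch of this simulation follows. It is an illustration under stated assumptions, not the original simulation script: the function name is hypothetical, and the spread of $\mu^{t}_{j}$ is exposed as the argument `mu_sd` (defaulting to the standard deviation of 1.5 given in the prose) because the prose and the formula above state it differently.

```python
import numpy as np
from itertools import product

def simulate_bulk(n_samples, n_genes, n_types, sigma, rng, mu_sd=1.5):
    """Simulate per-cell-type expression, fractions and bulk profiles."""
    # Expected log-scale expression per cell type and gene
    mu = rng.normal(0.0, mu_sd, size=(n_types, n_genes))
    # Per-sample, per-cell-type expression with controlled variability sigma
    x = rng.lognormal(mean=mu, sigma=sigma, size=(n_samples, n_types, n_genes))
    # Fractions from a Dirichlet distribution with all parameters equal to 1
    f = rng.dirichlet(np.ones(n_types), size=n_samples)
    # Bulk profile: weighted sum over cell types
    y = np.einsum("it,itj->ij", f, x)
    return mu, x, f, y

# The 5 * 4 * 5 * 7 = 700 settings listed in the text
settings = list(product(
    [5, 10, 20, 50, 100],                  # number of samples
    [100, 200, 500, 1000],                 # number of genes
    [2, 3, 5, 10, 20],                     # number of cell types
    [0.1, 0.2, 0.5, 0.75, 1, 1.25, 1.5],   # sigma
))
```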

Construction of PBMC simulation data

To construct realistic simulation data, 20 bulk gene expression data sets were generated by randomly sampling and merging a subset of 9439 cells from the two PBMC scRNA-seq datasets. For each sample, the cellular fraction was first sampled from a Dirichlet distribution. The actual fractions of the 15 cell types were used as the parameter of the Dirichlet distribution so that the sampled fraction is similar to the total fraction. The fraction was then converted into the count of each cell type, with the following constraints: (1) the total number of cells is 100, and (2) the minimum number of cells per type is one. Then, the given number of cells were sampled with replacement, followed by obtaining the raw counts per cell type as the cumulative sum of raw counts of the sampled cells. Up to three distinct cells per type were allowed to be sampled since otherwise, gene expression variability was over-stabilized due to the averaging. Finally, the simulated bulk raw counts were obtained by taking the cumulative sum of the raw counts per cell type among 15 cell types. The bulk gene expression data was log-normalized using the Seurat package³².
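The sampling scheme can be sketched as follows. The text does not specify how sampled fractions were converted into integer cell counts, so the sketch assumes one simple scheme that satisfies both constraints: every cell type receives one cell and the remaining cells are allocated by a multinomial draw. The function names are hypothetical.

```python
import numpy as np

def sample_cell_counts(type_fractions, total_cells, rng):
    """Sample per-type cell counts that sum to total_cells with at least one
    cell per type (assumed conversion scheme)."""
    alpha = np.asarray(type_fractions, dtype=float)
    f = rng.dirichlet(alpha)                         # sampled fractions
    n_types = len(alpha)
    extra = rng.multinomial(total_cells - n_types, f)
    return np.ones(n_types, dtype=int) + extra

def pseudo_bulk(counts_by_type, n_cells_per_type, rng, max_distinct=3):
    """Sum raw counts of cells sampled with replacement from a pool of at most
    max_distinct distinct cells per type.

    counts_by_type : list of (cells_t x genes) raw count arrays, one per type
    """
    bulk = 0
    for mat, n in zip(counts_by_type, n_cells_per_type):
        pool_size = min(max_distinct, mat.shape[0])
        pool = rng.choice(mat.shape[0], size=pool_size, replace=False)
        picked = rng.choice(pool, size=n, replace=True)
        bulk = bulk + mat[picked].sum(axis=0)
    return bulk
```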

Standard bulk RNA-seq data for PBMC immune cell mixtures

The raw counts of the RNA-seq data and the immune cell fractions determined by flow cytometry were obtained from the GEO database under accession GSE107572. The raw counts were log-normalized using the Seurat package³².
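For readers working outside R, the log-normalization can be written in a few lines. The sketch assumes Seurat's LogNormalize behaviour with its default scale factor of 10,000 (counts divided by the sample total, multiplied by the scale factor, then transformed with log1p); the settings actually used are not stated in the text, and the samples-by-genes orientation is chosen for convenience.

```python
import numpy as np

def log_normalize(counts, scale_factor=1e4):
    """Log-normalize a (samples x genes) raw count matrix."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Scale each sample to the same total, then apply log(1 + x)
    return np.log1p(counts / totals * scale_factor)
```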

Construction of PDAC evaluation data

PDAC single-cell RNA-seq data were obtained from the Genome Sequence Archive database under the accession code CRP000653²⁹. A total of 57,530 cells from 35 pancreas samples (11 normal pancreas and 24 PDAC samples) were previously classified into ten cell types. For auxiliary data, we selected 17,266 cells (30% of cells) from six samples, of which two are normal and four are PDAC samples with the most cells. The remaining 29 samples were used as the main data for evaluation. The signature genes were selected as the top 100 DEGs from each of the ten cell types (FDR < 0.1; 818 genes in total), followed by obtaining the mean and standard deviation from the reference data. Note that we used more stringent criteria to select DEGs than for the PBMC data, because a sufficient number of DEGs (>500 DEGs) still satisfies them. For the main data, a cumulative sum of the raw counts of all cells was obtained from each sample. The standard log-normalization was then applied to the raw counts. For the evaluation, the true cell type fractions and cell-type-specific gene expression profiles were obtained per main sample.

Systematic evaluation of BLADE and comparison against baseline methods

The original implementations of CIBERSORTx, NNLS, and MuSiC were obtained from https://cibersortx.stanford.edu/ (docker image), the SciPy Python library, and https://github.com/xuranw/MuSiC (R package), respectively. For all four methods, the same set of genes was consistently used for a fair comparison. For the simulation data sets with the controlled gene expression variability level, the true mean $\mu^{t}_{j}$ and variability $\sigma$ per cell type of all genes were retrieved. For the PBMC and PDAC bulk transcriptome data, averages and standard deviations of the union of DEGs of, respectively, 15 and 10 cell types were obtained from the scRNA-seq data. These DEGs were selected using an FDR cutoff of 0.2 for the PBMC data (1007 genes in total) and an FDR cutoff of 0.1 for the PDAC data (818 genes in total). CIBERSORTx and NNLS require average expression profiles per gene and cell type, and BLADE requires both the mean and standard deviation. MuSiC internally calculates the gene weights using the raw counts from scRNA-seq data, which were only available in the PBMC and PDAC evaluation data sets. The Pearson correlation coefficient, Spearman correlation coefficient, and root mean squared error (RMSE) were measured using the predicted and true fractions of each cell type across the samples to evaluate the deconvolution performance. Likewise, the Pearson correlation coefficient was measured between true and estimated gene expression profiles per cell type for group mode purification, and per cell type and per sample for the high-resolution mode purification. The performance evaluation for purification was done only for CIBERSORTx and BLADE, as NNLS and MuSiC only estimate cellular fractions.
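The three fraction-level metrics can be computed per cell type with SciPy, as in the sketch below. The function name and the returned layout are illustrative choices and not part of the BLADE package.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def fraction_metrics(f_true, f_pred):
    """Per-cell-type Pearson, Spearman and RMSE across samples.

    f_true, f_pred : arrays of shape (samples, cell types)
    """
    f_true = np.asarray(f_true, dtype=float)
    f_pred = np.asarray(f_pred, dtype=float)
    results = []
    for t in range(f_true.shape[1]):
        a, b = f_true[:, t], f_pred[:, t]
        results.append({
            "pearson": pearsonr(a, b)[0],
            "spearman": spearmanr(a, b)[0],
            "rmse": float(np.sqrt(np.mean((a - b) ** 2))),
        })
    return results
```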

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The data used in this study are from public sources. The two PBMC CITE-seq datasets of healthy donors were downloaded from the 10x Genomics datasets database [https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3] [https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3]. The TCGA data are retrievable using the TCGAbiolinks R package. The PBMC data are available from GEO under the accession code GSE107572. The single-cell RNA-seq data of the PDAC cohort are available from the Genome Sequence Archive under the accession code CRP000653.

Code availability

BLADE python software along with a user-friendly demo is available and maintained at https://github.com/tgac-vumc/BLADE⁴¹.

References

1. Angelova, M. et al. Characterization of the immunophenotypes and antigenomes of colorectal cancers reveals distinct tumor escape mechanisms and novel targets for immunotherapy. Genome Biol. 16, 1–17 (2015).
2. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
3. Zheng, Z. et al. Single-cell transcriptomic analysis. Compr. Physiol. 10, 767–783 (2020).
4. Pottier, C. et al. The importance of the tumor microenvironment in the therapeutic management of cancer. Expert Rev. Anticancer Ther. 15, 943–954 (2015).
5. Kumar, M. P. et al. Analysis of single-cell RNA-Seq identifies cell-cell communication associated with tumor characteristics. Cell Rep. 25, 1458–1468.e4 (2018).
6. Xiao, Z., Dai, Z. & Locasale, J. W. Metabolic landscape of the tumor microenvironment at single cell resolution. Nat. Commun. 10, 1–12 (2019).
7. Sathe, A. et al. Single-cell genomic characterization reveals the cellular reprogramming of the gastric tumor microenvironment. Clin. Cancer Res. 26, 2640–2653 (2020).
8. Xiao, Z., Locasale, J. W. & Dai, Z. Metabolism in the tumor microenvironment: insights from single-cell analysis. Oncoimmunology 9, 1726556 (2020).
9. Lee, H. W. et al. Single-cell RNA sequencing reveals the tumor microenvironment and facilitates strategic choices to circumvent treatment failure in a chemorefractory bladder cancer patient. Genome Med. 12, 1–21 (2020).
10. González-Silva, L., Quevedo, L. & Varela, I. Tumor functional heterogeneity unraveled by scRNA-seq technologies. Trends Cancer 6, 13–19 (2020).
11. Angerer, P. et al. Single cells make big data: New challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
12. Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21. https://doi.org/10.1186/s13059-020-1926-6 (2020).
13. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
14. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
15. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 10, 380 (2019).
16. Zhong, Y. & Liu, Z. Gene expression deconvolution in linear space. Nat. Methods 9, 8–9 (2012).
17. Racle, J., de Jonge, K., Baumgaertner, P., Speiser, D. E. & Gfeller, D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. Elife 6, 1–25 (2017).
18. Aoto, Y. et al. DEclust: a statistical approach for obtaining differential expression profiles of multiple conditions. PLoS ONE 12, 1–15 (2017).
19. Ahn, J. et al. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics 29, 1865–1871 (2013).
20. Wang, Z. et al. Transcriptome deconvolution of heterogeneous tumor samples with immune infiltration. iScience 9, 451–460 (2018).
21. Monaco, G. et al. RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 26, 1627–1640.e7 (2019).
22. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
23. Grün, D., Kester, L. & Van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
24. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
25. TCGA Research Network: https://www.cancer.gov/tcga.
26. Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. In Proc. Second Work. LLVM Compil. Infrastruct. HPC - LLVM ’15 1–6. https://doi.org/10.1145/2833157.2833162 (2015).
27. Li, T. et al. TIMER: a web server for comprehensive analysis of tumor-infiltrating immune cells. Cancer Res. 77, e108–e110 (2017).
28. Finotello, F. et al. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Med. 11, 34 (2019).
29. Peng, J. et al. Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 725–738 (2019).
30. Dijk, F. et al. Unsupervised class discovery in pancreatic ductal adenocarcinoma reveals cell-intrinsic mesenchymal features and high concordance between existing classification systems. Sci. Rep. 10, 1–12 (2020).
31. Bailey, P. et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 531, 47–52 (2016).
32. Stuart, T. et al. Comprehensive Integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
33. Fenton, L. F. The sum of log-normal probability distributions in scatter transmission systems. IRE Trans. Commun. Syst. 8, 57–67 (1960).
34. Colaprico, A. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 44, e71 (2016).
35. Hmeljak, J. et al. Integrative molecular characterization of malignant pleural mesothelioma. Cancer Discov. 8, 1549–1565 (2018).
36. Abeshouse, A. et al. Comprehensive and integrated genomic characterization of adult soft tissue sarcomas. Cell 171, 950–965.e28 (2017).
37. Rauschenberger, A., Ciocănea-Teodorescu, I., Jonker, M. A., Menezes, R. X. & van de Wiel, M. A. Sparse classification with paired covariates. Adv. Data Anal. Classif. 14, 571–588 (2020).
38. Sturm, G. et al. Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology. Bioinformatics 35, i436–i445 (2019).
39. Teh, Y. W., Newman, D. & Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv. Neural Inf. Process. Syst. 1353–1360. https://doi.org/10.7551/mitpress/7503.003.0174 (2007).
40. Carlin, B. P. & Louis, T. A. Empirical bayes: past, present and future. J. Am. Stat. Assoc. 95, 1286–1289 (2000).
41. Andrade Barbosa, B., van de Wiel, M. A. & Kim, Y. Bayesian log-normal deconvolution for enhanced in silico microdissection of bulk gene expression data. https://doi.org/10.5281/zenodo.5337105 (2021).

Acknowledgements

The authors thank Kai Ruan for his careful review of the derivation of the BLADE algorithm. This project was supported by stichting Cancer Center Amsterdam (CCA2019-9-62).

Author information

Authors and Affiliations

Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
Bárbara Andrade Barbosa, Saskia D. van Asten, Bauke Ylstra & Yongsoo Kim
Department of Molecular Cell Biology & Immunology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam Infection and Immunity Institute, Amsterdam, the Netherlands
Saskia D. van Asten & Juan J. Garcia Vallejo
Department of Anatomy, School of Medicine, Kyungpook National University, Daegu, South Korea
Ji Won Oh
Bio-Medical Research Institute, Kyungpook National University Hospital, Daegu, South Korea
Ji Won Oh
Department of Pathology, Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands
Arantza Farina-Sarasqueta, Joanne Verheij & Frederike Dijk
Department of Medical Oncology, Cancer Center Amsterdam, Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands
Hanneke W. M. van Laarhoven
Department of Epidemiology and Data Science, Amsterdam Public Health research institute, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
Mark A. van de Wiel

Contributions

Y.K. and M.W. conceived the ideas and designed the algorithm. Y.K. developed the python software. B.A.B. and S.D.A. analyzed PDAC single-cell RNAseq and PBMC CITE-seq data. B.A.B., S.D.A., and J.G.V. classified immune cell types in the CITE-seq data. Biological interpretation of the outcome was done by S.D.A., J.O., A.F.S., J.V., F.D., H.L., B.Y., and J.G.V. Evaluation of the algorithm performance was designed and performed by Y.K. and B.A.B. All authors discussed the results and contributed to the writing.

Corresponding authors

Correspondence to Mark A. van de Wiel or Yongsoo Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Andrade Barbosa, B., van Asten, S.D., Oh, J.W. et al. Bayesian log-normal deconvolution for enhanced in silico microdissection of bulk gene expression data. Nat Commun 12, 6106 (2021). https://doi.org/10.1038/s41467-021-26328-2

Received: 07 December 2020
Accepted: 27 September 2021
Published: 20 October 2021
DOI: https://doi.org/10.1038/s41467-021-26328-2
Springer Nature Limited
