Introduction

An important challenge in the analysis of gene expression data from complex tissue homogenates measured with RNA-sequencing (bulk RNA-seq) is to reconcile cellular heterogeneity, that is, the distinct gene expression profiles of the cell types present in the sample. A prime example is bulk RNA-seq data from human brain tissue, which contains two major categories of cell types, neurons and glia, each with distinct morphologies, cell sizes, and functions across brain regions and sub-regions [1,2,3]. Failing to account for biases driven by the molecular and biological characteristics of distinct cell types can lead to inaccurate cell type proportion estimates from deconvolution of complex tissues such as the brain [3].

Broadly, methods that computationally estimate cell proportions from bulk tissue “-omics” data, such as gene expression or DNA methylation (DNAm) data, are referred to as “deconvolution algorithms” [4, 5]. Deconvolution commonly uses three terms: (1) a cell type signatures reference matrix, called Z; (2) a matrix of convoluted (mixed) signals, Y; and (3) a vector of the proportions of cell types in Y, called P. Here, we focus on gene expression reference-based algorithms that predict P given Z and Y (Fig. 1). With these standard terms, deconvolution is often described mathematically by the equation \(Y = Z * P\), where the goal is to estimate the set of proportions \(P\) (i.e., where each \(p \in P\) satisfies \(0\le p\le 1\) and \(P\) sums to 1). Approaches for estimating \(P\) have been widely reviewed in the literature [6, 7] and are outside the scope of this review. Nonetheless, recent work has described important challenges (Fig. 2) for deconvolution with various tissues including blood, kidney, and pancreas [8, 9]. However, tissues with notably different cell sizes, total mRNA expression, and transcriptional activity levels, such as brain or immune cell populations, present additional challenges for deconvolution that have not been previously described. It is important to be able to accurately estimate the cell type proportions of these complex tissues, as cell composition has been shown to change with disease [10,11,12,13,14,15].
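To make the notation concrete, the following minimal sketch (in R, on simulated data) estimates \(P\) for a single bulk sample by non-negative least squares; the CRAN nnls solver is used here only as a stand-in for the many published estimators reviewed elsewhere [6, 7].

```r
# Minimal sketch: estimate P for one bulk sample from Z and Y with
# non-negative least squares (simulated data; 'nnls' is a stand-in solver).
library(nnls)

set.seed(1)
G <- 50; K <- 3                                      # G marker genes, K cell types
Z <- matrix(rexp(G * K), nrow = G,                   # cell type signature matrix Z
            dimnames = list(paste0("gene", 1:G),
                            c("neuron", "oligo", "astro")))
p_true <- c(0.6, 0.3, 0.1)                           # true cell type proportions P
y <- as.vector(Z %*% p_true + rnorm(G, sd = 0.05))   # one convoluted bulk sample Y

fit <- nnls(Z, y)                                    # solve min ||Z p - y||, p >= 0
p_hat <- fit$x / sum(fit$x)                          # rescale so proportions sum to 1
round(setNames(p_hat, colnames(Z)), 3)
```

In practice, published algorithms layer marker selection, weighting, and regularization on top of this basic regression.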

Fig. 1

Diagram of example deconvolution experiment using cell scale factors. A Heatmaps of gene expression: (i) the (y-axis) marker genes G by cell labels for each of (x-axis) neurons, oligodendrocytes, or astrocytes; (ii) the (y-axis) G marker genes by (x-axis) cell types (K). Expression value colors: blue = low, white = intermediate, red = high. (iii) Wedge diagram of (S) cell scale factors, where wedge size is the value and cartoons indicate each cell type. B (left-to-right) Heatmaps of bulk expression Y, marker expression Z, cell scale factors S, and cell type proportions P for either (top) scaled or (bottom) unscaled expression, where bar plot values show cell type proportions with colors as in panel C. C Scatterplot of example experiment results for multiple bulk samples Y, showing the (x-axis) true cell proportions and (y-axis) predicted cell proportions, where points are outcomes for a sample and cell type, and shapes show whether the cell scale factor transformation was applied. Plots were created using the ggplot2 v3.4.1 [16] and ComplexHeatmap v2.12.1 [17] software; data used to reproduce these plots are available from GitHub (Data Availability)

Fig. 2

Six challenges and opportunities to computationally deconvolve heterogeneous tissue with varying cell sizes using single-cell RNA-sequencing datasets. Direction of experimental process (middle arrow), experiment phases (orange labels), challenge number (red labels), challenge titles (gray panel titles), and depictions of key challenge concepts (box graphics)

In computational methods development, gold standard datasets are used to set baseline performance expectations and provide a well-characterized reference against which new outputs can be evaluated. For example, Sanger sequencing is used as a gold standard platform for validation of genetic sequencing data [18, 19]. Similarly, in deconvolution, independent or orthogonal measurements of cell composition from different platforms (Fig. 3) can be used to validate algorithm-based estimates from bulk tissue expression.

Fig. 3

Collecting an integrated dataset of orthogonal assays from the same tissue block across donors and tissues. The development and benchmarking of deconvolution algorithms can be improved with gold standard reference datasets. Gold standards are developed across donors and tissues on which multiple assays are performed on the same tissue block. For example, adjacent sections of a tissue block could be used for spatial transcriptomics, sc/snRNA-seq, bulk/homogenate RNA-seq, and single-molecule FISH (smFISH) to generate orthogonal cell type proportion and transcriptomic profile measurements. These assays generate data with distinct features (i.e., gene expression, cell size/shape, isoform diversity, etc.) that can also be incorporated into deconvolution models to improve accuracy

In this paper, we summarize a set of challenges for performing deconvolution in highly heterogeneous tissues, using human brain tissue as a motivating example. We also present a set of recommendations and future opportunities for addressing these challenges to more accurately estimate tissue cell composition and better understand human disease. This presents an opportunity to set a higher bar for biological discovery and publication practices, including increased computational reproducibility [9]. The ability to iteratively implement and optimize new methods and benchmark workflows in heterogeneous tissues will enable deconvolution tools to further our understanding of the role of changes in cell type composition in disease risk and progression.

Challenge 1: lack of orthogonal measurements to evaluate deconvolution results across samples, donors, platforms, and studies

Need for orthogonal measurements from matched tissue samples for bulk and single-cell data

When developing a deconvolution method, using matched bulk and single-cell/nucleus RNA-seq (sc/snRNA-seq) datasets from the same tissue samples (Fig. 3) enables controlling for potential variation beyond cell type variation that arises from unwanted factors [18,19,20,21], such as donor-to-donor variation [22, 23]. For example, confounding variation may come from factors relating to donor demographics (i.e., sex [20], genetics [21], diagnosis [22]), tissue dissection (i.e., tissue microenvironment representation [23, 24]), and/or sample quality (i.e., tissue pH, post-mortem interval, RNA quality [25,26,27]), where certain sources need to be evaluated in specific tissues, such as the expected proportions of white and gray matter in brain specimens [13]. Excess variation from such sources can make it challenging to accurately estimate the cell type reference matrix, Z, downstream [24, 25], leading to inaccurate estimates of cell composition, P. This concept is further supported by Wang et al. [28], who showed that using a sc/snRNA-seq reference dataset from source A to deconvolve an RNA-seq sample from source B can lead to inaccurate estimates of cell composition P for source B, where sources could be distinct donors or studies. Considering the potential limitations of using existing reference atlas datasets, specific sample sourcing schemes, such as generation of multiple orthogonal assays matched to the same tissue block, could alleviate some of these issues. Because orthogonal datasets generated from specimens gathered from the same tissue block (a.k.a. “source-matched” samples) are replicates with respect to important clinical and demographic factors, their greater utilization will limit the influence of excess variation from unwanted factors beyond what is possible with modeling strategies alone. We also advocate the use of these orthogonal matched assays in the development and benchmarking of new algorithms to better evaluate algorithm performance while controlling for unwanted confounds such as technical and biological variation.

Need for orthogonal measurements from health and disease samples

Deconvolution algorithms are commonly used to investigate whether changes in the cell composition of tissue samples are associated with a phenotype or outcome, such as in case–control study designs. This poses a potential generalizability challenge when algorithms (Table 1) are trained only on one type of tissue sample (e.g., healthy/control samples) and not on tissues with the observed phenotype or outcome (e.g., disease samples). It was previously shown [22] that differential expression (DE) between group conditions can limit the utility of a normal tissue reference for accurately deconvolving cell type abundances in a disease condition. Including multiple phenotypes can also help avoid algorithm overfitting, encourage the selection of better cell type markers, and boost the overall generalizability of findings. Ideally, cases should be matched to the reference samples on potentially confounding factors like subject demographics, tissue collection procedures, and specimen handling strategies.

Table 1 Deconvolution algorithms developed for bulk transcriptomics with sc/snRNA-seq reference datasets. The table includes the name and reference (column 1) along with the year published (column 2) and a description (column 3) of the algorithm. The primary tissues used in the publication associated with the algorithm are also provided (column 4)

Need for orthogonal measurements to form a reference atlas (Z) across multiple donors

A key experimental design consideration is the selection of the sc/snRNA-seq samples used to build a reference atlas (Z). For example, a reference atlas (Z) could contain data from multiple donors or only from tissue samples that have matched bulk and sc/snRNA-seq data. This decision depends on the specific research question, the statistical power to detect cell types [46], the availability of previously published data [5], and the cost of generating new data [47]. Multi-group references can mitigate the low reliability of cell type proportion estimates from a single sc/snRNA-seq sample [22]. As sc/snRNA-seq data are characteristically sparse, pooling cells across groups can further boost power to characterize rare, small, or less transcriptionally active cell types [46, 48].

Need for measurements of cell type composition from orthogonal platforms

The primary gold standard for evaluating the accuracy of estimated cell compositions from a deconvolution algorithm is an orthogonal cell type fraction measurement (Table 2) in the tissue sample, which should ideally be known with high accuracy and reliability. In multiple tissues including blood and brain, RNA-seq of fluorescence-activated cell sorted (FACS) populations [6, 49] and DNAm microarray data [3, 50] have been used as orthogonal measurements of “true” cell composition. Cell type proportion estimates based on relative yields from sc/snRNA-seq data are not likely to be reliable [6] because of dissociation bias [26] and incomplete representation of sequenced cells (i.e., only a subset of the sample is sequenced). This bias impacts the “true” cell composition yield in a cell type-specific manner [51], is not present in bulk RNA-seq data, and can explain systematic expression differences between bulk and sc/snRNA-seq data [32]. However, orthogonal cell type measures could be extracted from many different data types (Table 2), including microscopy images from molecular marker-based protocols such as single-molecule fluorescent in situ hybridization (smFISH) [3]. These image-based technologies offer an opportunity to characterize cell type proportions, as well as other size/shape measurements, directly from the tissue. Furthermore, these images can then be integrated with transcriptome-wide gene expression measurements based on emerging spatial transcriptomics technologies [52,53,54,55].

Table 2 Orthogonal cell type abundance measurements used for bulk transcriptomics deconvolution. The table describes the name (column 1) and a description (column 2) of the type of measurement, the type of assay used to capture the measurement (column 3), and example citations for these measurements (column 4)

Challenge 2: cell types vary in abundance, size, and total mRNA

Cell types exhibit a wide range in size and function within and across human tissues

Most eukaryotic cells are between 10 and 100 µm in diameter; for example, red blood cells are about 8 µm, skin cells about 30 µm, and neurons can extend processes up to 1 m long [64]. The brain is an excellent example of a tissue exhibiting a wide range of cell types with different sizes and morphologies [9, 65]. Within the brain, there is a diversity of cell types that fall into several broad categories, including neurons, glia, and vasculature-related cells. These cell types have distinct functions reflected by differences in morphology, physiology, cell body size, and molecular identity. For example, neurons are larger and more transcriptionally active than glial cells [2]. Vasculature-related cells, including endothelial cells, smooth muscle cells, and pericytes, comprise the building blocks of blood vessels and are also smaller in size than neurons [66]. These cell types have specific genetic programs that facilitate distinct functions [66]. For example, neurons (larger excitatory glutamatergic neurons and smaller inhibitory GABAergic neurons [67]) are bigger and less numerous than glial cells, a heterogeneous group of cells comprising oligodendrocytes (Oligo) (20–200 µm) [68], oligodendrocyte precursor cells (OPC) (50 µm) [69], microglia (15–30 µm) [70], and astrocytes (Astro) (40–60 µm) [71], which serve many roles, such as myelination, immune signaling, and physical and metabolic support. This extensive cell type diversity found in the brain, and other tissues, underscores the motivation for adjusting for differences in cell sizes prior to performing deconvolution (see data sources in Table 2).

Cell-type scale factor transformations can improve the performance of deconvolution algorithms

While bulk transcriptomics deconvolution commonly predicts cell type proportions from expression data, it has been noted that, in the absence of an adjustment for systematic differences in size and expression activity at the cell type level, this approach may instead quantify total mRNA content [3]. This adjustment, which we will call a “cell type scale factor transformation” (or cell scale factors for short), is used to transform the cell type reference matrix (Z) prior to deconvolution [3, 72]. Consider the standard mathematical formula \({Y}_{GxJ} = {Z}_{GxK} * {P}_{KxJ}\) with dimensions for \(G\) marker genes, \(J\) bulk sample(s), and \(K\) cell types (we drop the dimensions hereafter for brevity). Assume we have \({S}_{KxK}={I}_{KxK}*{s}_{K}\), where \(S\) is a diagonal matrix, \({I}_{KxK}\) is an identity matrix, and \({s}_{K}\) is a vector of scalars \(s_{1 \dots K}\) that quantify the size, such as the average number of mRNA molecules per cell, of each \({k}^{th}\) cell type; these are often experimentally derived (Supplemental Fig. 1). Then, the formulation to deconvolve \(Y\) with cell type scale factors \(S\) is \(Y= Z*S*P\), where we can define a new \({Z^{\prime}}_{GxK}={Z}_{GxK}*{S}_{KxK}\) and recover a formulation similar to the standard formula above: \(Y =Z^{\prime} *P\). It is worth noting that without this transformation, existing deconvolution methods implicitly assume that all cell types are the same size, whereas incorporating the transformation allows models to account for differences in cell size. Deconvolution accuracy improved when \(S\) was calculated from a tissue-matched independent reference [35], and even when cell size estimates came from a different organism [3].
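As an illustration, the following minimal sketch (simulated data; the scale factor values are illustrative only) shows how an unscaled reference recovers mRNA fractions, whereas the transformed reference \(Z^{\prime} = Z*S\) recovers the underlying cell proportions.

```r
# Minimal sketch of the cell scale factor transformation Y = Z * S * P
# (simulated data; scale factor values are illustrative only).
library(nnls)

set.seed(1)
G <- 50; K <- 3
Z <- matrix(rexp(G * K), nrow = G,
            dimnames = list(paste0("gene", 1:G), c("neuron", "oligo", "astro")))
s <- c(neuron = 3.0, oligo = 1.0, astro = 1.2)   # illustrative cell scale factors
S <- diag(s)                                     # diagonal matrix with s on the diagonal
p_true <- c(0.5, 0.35, 0.15)                     # true cell *proportions*

y <- as.vector(Z %*% S %*% p_true)               # bulk signal reflects total mRNA

# Unscaled reference: estimates track mRNA fractions rather than cell proportions
p_unscaled <- nnls(Z, y)$x
round(p_unscaled / sum(p_unscaled), 3)

# Scaled reference Z' = Z * S: estimates recover the true cell proportions
Z_scaled <- Z %*% S
p_scaled <- nnls(Z_scaled, y)$x
round(p_scaled / sum(p_scaled), 3)
```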

Cell size scaling was initially introduced for microarray-based expression data [72, 73] and later used for scRNA-seq data in multiple tissues [3, 35, 40]. Cell scale factors are also frequently used when generating sc/snRNA-seq-based data that resemble real bulk RNA-seq data via “pseudobulking”, that is, aggregating molecular profiles across sc/snRNA-seq data [74]. Reference atlas transformation using orthogonal and non-orthogonal cell scale factors reduced errors in deconvolution-based cell proportion predictions, possibly because estimates without this transformation quantify total RNA rather than cell proportions [3]. Cell scale factors may be estimated from either expression or expression-orthogonal data, such as sorted or purified populations of immune cells, and are used in existing deconvolution algorithms such as EPIC and ABIS [35, 40]. The algorithms MuSiC and MuSiC2 [22, 28] can use either expression-based or user-defined scale factors. Similar algorithms using variance-based marker weighting, such as SCDC [31], do not incorporate cell scale factors, but have been made compatible with outputs from algorithms that do (Table 1). Importantly, there are currently no standards for applying cell scale factors prior to deconvolution, and users may need to transform the reference atlas (Z) prior to calling certain algorithms. Further, many algorithms have not been extensively tested in complex tissues, such as brain, that show large differences in size and transcriptomic activity across cell types. Ultimately, more reliable cell scale factor estimation and standardized transformation procedures can facilitate future deconvolution research [3, 72].
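For readers unfamiliar with pseudobulking, a minimal sketch in base R is shown below, using simulated counts and hypothetical donor and cell type labels; real workflows typically operate on SingleCellExperiment or Seurat objects.

```r
# Minimal sketch of pseudobulking: summing sc/snRNA-seq counts within a donor
# (or within donor x cell type) to approximate a bulk RNA-seq profile.
# Counts and labels are simulated/hypothetical.
set.seed(1)
n_genes <- 100; n_cells <- 300
counts <- matrix(rpois(n_genes * n_cells, lambda = 2), nrow = n_genes,
                 dimnames = list(paste0("gene", 1:n_genes),
                                 paste0("cell", 1:n_cells)))
donor     <- sample(c("donor1", "donor2"), n_cells, replace = TRUE)
cell_type <- sample(c("neuron", "oligo", "astro"), n_cells, replace = TRUE)

# Pseudobulk per donor (genes x donors)
pseudobulk_donor <- t(rowsum(t(counts), group = donor))

# Pseudobulk per donor x cell type, e.g., to build a donor-specific reference Z
groups <- paste(donor, cell_type, sep = ".")
pseudobulk_group <- t(rowsum(t(counts), group = groups))
dim(pseudobulk_group)   # genes x (donor x cell type combinations)
```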

Different approaches to obtain cell scale factors can influence cell composition estimates

There are several approaches to estimating cell scale factors for use in deconvolution. Expression-orthogonal cell size estimates can come from, for example, fluorescent in situ hybridization (FISH) or immunohistochemistry (IHC) [3, 32, 67] (Table 3). Image processing software such as ImageJ/FIJI [75] and HALO (Indica Labs) can provide cell body or nucleus measurements, including diameter, area, and perimeter, among other size features (Table 4). However, cell segmentation presents a key obstacle limiting the accuracy of imaging-based approaches, especially for cells with complex morphologies [76]. Expression-based cell size estimates are commonly calculated from total mRNA counts, often referred to as the “library size factor” [77], which is typically unique to each cell but can also be computed separately for each cell type (Table 1). However, these estimates may be confounded by either the total sequenced RNA or genes with outlying high expression [35]. For this reason, the total number of expressed genes may be a good alternative that is robust to this type of confounding. Cell scale factors from sc/snRNA-seq data are further subject to bias from tissue dissociation, cell compartment isolation, and other factors that have cell type-specific impacts [22, 28, 31]. Another consideration is when the cell scale factor transformation is applied, as published deconvolution algorithms apply scale factors either before [28] or after [72] prediction of cell type proportions. Applying the cell scale factor transformation to the reference atlas (Z) may prevent quantification of total RNA rather than cell proportions [3]. In summary, cell scale factor transformations can improve bulk transcriptomics deconvolution across multiple species, tissues, and sequencing platforms.
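A minimal sketch of expression-based cell scale factor estimation, using simulated counts and hypothetical cell type labels, is shown below; it contrasts the per-cell-type mean library size with the mean number of detected genes as candidate scale factors.

```r
# Minimal sketch: expression-based cell scale factors from sc/snRNA-seq counts
# (simulated data; cell type labels are hypothetical). Two candidate estimates
# per cell type: mean library size and mean number of detected genes.
set.seed(1)
counts <- matrix(rpois(100 * 300, lambda = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:300)))
cell_type <- sample(c("neuron", "oligo", "astro"), 300, replace = TRUE)

lib_size <- colSums(counts)        # total counts per cell ("library size")
n_genes  <- colSums(counts > 0)    # detected genes per cell, less sensitive to outlier genes

s_libsize <- tapply(lib_size, cell_type, mean)
s_ngenes  <- tapply(n_genes,  cell_type, mean)

# Express scale factors relative to the smallest cell type
round(s_libsize / min(s_libsize), 2)
round(s_ngenes  / min(s_ngenes),  2)
```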

Table 3 Experimental data platforms to estimate cell sizes and calculate cell size scaling factors to adjust for systematic differences in size and transcriptomic activity between cell types. The table contains the type of experimental data (column 1), the metric used for cell size (column 2), a set of standards (gold, silver, and bronze) introduced by Dietrich et al. [20] (column 3), the format in which the data are captured (column 4), example data analysis challenges when using these data (column 5), and whether the experimental data are orthogonal to sc/snRNA-seq (column 6)
Table 4 Cell scale factor estimates from the literature, with a focus on deconvolution studies that use sequencing references. Values for blood cell types are from the SimBu R package (v1.2.0), and values for brain cell types are from Table 1 in [3]. The scale factor value (column 3) can be used in existing deconvolution algorithms, leading to less biased estimates of cell composition

Challenge 3: protocol bias for tissue processing impacts reference atlas (Z)

Acquisition of data with single-nucleus (sn) versus single-cell (sc) RNA-seq protocols

Similar to donor-to-donor variation leading to unwanted confounds in estimating the cell type reference matrix \(Z\), sampling RNA from different cellular compartments can also introduce unwanted variation. For example, experimenters can perform “single cell” sequencing by isolating either whole cells (containing both nuclear and cytoplasmic RNA, often performed from fresh tissue) or just the nuclear compartment (containing only nuclear RNA, performed from frozen tissue). While it has been demonstrated that nuclear RNA is representative of RNA from the whole cell [83, 84], there can be substantial differences for certain transcripts, thereby introducing variability into the data contained in \(Z\). In the human brain, the majority of studies are conducted on fresh frozen post-mortem tissue rather than fresh tissue. When post-mortem brain tissues are flash frozen during the preservation process, cells are lysed, prohibiting the molecular profiling of whole single cells using scRNA-seq approaches. Instead, only nuclei are accessible for profiling using snRNA-seq approaches. While the nuclear transcriptome can serve as a proxy for the whole cell transcriptome [85,86,87], nuclear transcripts include more intron-containing pre-mRNA and may not include transcripts locally expressed in cytoplasmic compartments, such as neuronal axons and dendrites, or transcripts rapidly exported out of the nucleus [2]. On the other hand, compared to whole cells, nuclei are less sensitive to mechanical/enzymatic tissue dissociation procedures, which may artificially impact gene expression [26], and are suitable for multi-omic profiling such as combined RNA-seq and ATAC-seq from the same nucleus [88]. In fact, dissociation protocol differences can help explain variation in the average number of nuclei per donor observed across brain snRNA-seq reference datasets [12]. While prior work showed only a small impact from cell compartment DE between bulk and snRNA-seq data, accounting for this slightly improves deconvolution accuracy [30]. However, new computational methods are being developed to remove these protocol-specific biases [28].

Tissue preparation protocols can impact the diversity and quality of cells profiled during sc/snRNA-seq

Cell type-specific associations between dissociation treatment and gene expression have been observed in sc/snRNA-seq data across multiple tissues and species [26]. Expression patterns may further be influenced by the specific cell/nucleus isolation protocol utilized [26, 89]. There are several approaches for isolating nuclei from frozen tissues and removing debris from homogenization steps. While some studies employ a centrifugation-based approach with gradients of sucrose or iodixanol to purify nuclei from debris [90, 91], others use fluorescence-activated nuclear sorting (FANS) to label and mechanically isolate single nuclei [92, 93]. FANS also allows for enrichment of distinct cell types by implementing an immunolabeling procedure for populations of interest prior to sorting. There are pros and cons to each of these nuclei preparation approaches. FANS gating strategies may bias towards certain cell sizes and influence the final population of profiled cells. In the brain, recent work highlighted advantages of sorting approaches that remove non-nuclear ambient RNA contaminating glial cell populations [94]. Ultimately, tissue dissociation protocols can drive variation within and between sc/snRNA-seq cell populations.

Choice of sc/snRNA-seq platforms can impact reference gene expression profiles

There are several sequencing platform technologies for generating sc/snRNA-seq reference profiles. While these have been previously reviewed [47, 81], it is important to note that the different sample preparations and chemistries required for each of these platforms impact the downstream gene expression data. For example, the widely used single-cell gene expression platform from 10x Genomics is a droplet-based approach offering a 3′ or 5′ assay for up to 10,000 nuclei/cells in a single pooled reaction [95]. While the 10x Genomics platform allows profiling a large number of cells in a single experiment, a major limitation is the sparsity of the data and restriction of coverage to one end of the transcript. This is in contrast to approaches such as SMART-seq [96] from Takara, which offers full-length transcriptome analysis, but requires isolation of nuclei into individual tubes for separate reactions, thereby often resulting in fewer total cells profiled. Other technologies are rapidly becoming available for sc/snRNA-seq approaches, and each of these can introduce different biases into reference data. Importantly, recently published deconvolution algorithms use data transformation strategies to adjust for these biases [28, 32].

Potential differences in library preparation strategies for bulk RNA-seq and sc/snRNA-seq data

Library preparation is a crucial protocol step impacting RNA profiles in RNA-seq data. Sources of library preparation bias in RNA-seq expression have been well documented and include library prep base composition [97], fragmentation bias [98], 3′ direction bias [99], and lack of template DNA [100]. The two most popular library preparation strategies are ribosomal RNA (rRNA) depletion [101, 102], where rRNA is removed and the remaining RNA sequenced, and polyA enrichment [103], where polyA mRNA is isolated and sequenced. The former strategy can isolate a more diverse RNA population, including immature and alternatively spliced mRNAs lacking polyA tails, as well as non-protein-coding RNAs [104, 105]. This difference may drive protocol bias that needs to be accounted for [106]. Library preparation strategies may differ between the bulk and sc/snRNA-seq data used for deconvolution. While polyA enrichment was initially common for bulk RNA-seq, many newly available datasets now use rRNA depletion. By contrast, with the accessibility and popularity of the sc/sn droplet-based technologies [95], many reference atlases (Z) are based on polyA enrichment. Further, marker genes may not be consistently expressed across different library preparation conditions, which can reduce deconvolution accuracy. A recent benchmark showed library size normalization was crucial for RNA-seq deconvolution [107]. Computational tools [108,109,110,111] for correcting library preparation biases include specialized tools for particular bias types, such as DNA sequence-specific bias correction [112, 113]. As newer deconvolution algorithms accept large marker gene sets, systematic RNA population differences between library preparation strategies likely need to be accounted for, warranting further investigation.
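As a simple illustration of library size normalization, the sketch below computes counts per million (CPM) for simulated bulk counts; in practice the same normalization should be applied consistently to the bulk data and the reference, and more sophisticated normalizations are available in packages such as edgeR.

```r
# Minimal sketch: library size normalization to counts per million (CPM)
# for simulated bulk counts; apply the same normalization to the reference.
set.seed(1)
bulk_counts <- matrix(rpois(100 * 4, lambda = 20), nrow = 100,
                      dimnames = list(paste0("gene", 1:100), paste0("sample", 1:4)))

cpm <- t(t(bulk_counts) / colSums(bulk_counts)) * 1e6
colSums(cpm)   # every column now sums to one million
```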

Assay-specific biases between bulk and sc/snRNA-seq data

Systematic differences between bulk RNA-seq and sc/snRNA-seq assays can increase errors and reduce the utility of estimated cell type abundances from deconvolution algorithms. Assay-specific bias is more generally defined than library preparation bias and may arise from differences in sample processing protocol (e.g., cDNA synthesis, PCR amplification, UMI versus full-length transcript), sequencing platform (e.g., short- versus long-read, droplet- or microfluidics-based), and cell compartment isolation (e.g., whole cell, only cytoplasm, or only nucleus) [25, 114]. Different sequencing technologies also show varying transcript length bias, which increases the power to detect highly expressed long transcripts over lowly expressed short transcripts [115, 116]. This bias can impact the genes and pathways identified from DE analyses [117, 118]. While the use of unique molecular identifier (UMI) protocols [116, 119] may reduce the extent of transcript length bias in sc/snRNA-seq data relative to bulk, it may persist due to internal priming, a type of off-target polyA primer binding [120]. Furthermore, unlike bulk RNA-seq datasets, sc/snRNA-seq data are highly sensitive to both cDNA synthesis and PCR protocols [25]. Great improvements to both protocols have been made in recent years [121, 122]. Finally, bulk and sc/snRNA-seq data show distinct distributional properties that may impact downstream analyses and the utility of simulation approaches [74, 123]. Dispersion, or the extent to which expression variances exceed their means, is among the most important of these [124]. Bulk RNA-seq expression may show less dispersion, and thus may be modeled using either a Poisson or negative binomial [125] distribution, while expression sparsity and heterogeneity in sc/snRNA-seq data increase dispersion and often motivate the use of the negative binomial distribution [126, 127]. Orthogonal sample collection protocols can limit bias between bulk and sc/snRNA-seq data (Challenge 1). Computational methods are available that have been specifically tested across assays [113, 118, 128], data modalities [129], and preparation protocols [130]. Finally, new analyses of existing datasets can reveal new sources of assay bias, such as cell stress from hypoxia [131].
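To illustrate dispersion concretely, the following sketch simulates Poisson and negative binomial counts with the same mean and compares their variance-to-mean ratios (all parameter values are illustrative).

```r
# Minimal sketch: variance-to-mean ratios for simulated Poisson and negative
# binomial counts with the same mean (parameter values are illustrative).
set.seed(1)
mu <- 5
pois_counts <- rpois(1e4, lambda = mu)
nb_counts   <- rnbinom(1e4, mu = mu, size = 0.5)     # small 'size' = high dispersion

c(poisson = var(pois_counts) / mean(pois_counts),    # close to 1
  neg_binomial = var(nb_counts) / mean(nb_counts))   # well above 1 (var = mu + mu^2/size)
```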

Differences in detectability of rare cell types across batches and assays

Because cell type detection from sc/snRNA-seq data is confounded by low expression levels, downsampling sc/snRNA-seq profiles to comparable library sizes is often performed prior to downstream analyses [132]. Recently introduced normalization strategies can further increase the reliability of rare cell type quantification [18], and similar approaches are already being applied to newer spatially resolved single-cell datasets [133]. This may be especially useful for complex heterogeneous tissues like brain, where previously noted protocol biases limit the amount of available reference data for rare cell types [9]. In general, uncommon or rare cell types do not have a large impact on abundant cell type predictions unless there is high expression collinearity between gene markers of rare and abundant cell types [8]. In the human brain, deconvolution accuracy decreased substantially with the exclusion of neurons, but not of less common glial cell types [30]. Importantly, the lower limit for reliable cell type proportion predictions was found to vary across deconvolution algorithms [7]. Computational tools for rare cell type identification have been developed for rare immune cell populations [134], myeloid progenitor cells [135], and rare brain cells [136]. These use a variety of statistical modeling techniques, including latent variable models, such as scLVM [134], and dampened weighted least squares (DWLS) [33], which outperformed several other methods in a recent benchmark [8].
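A minimal sketch of downsampling to comparable library sizes by binomial thinning is shown below (simulated counts; dedicated implementations exist in Bioconductor packages such as DropletUtils).

```r
# Minimal sketch: downsampling cells to comparable library sizes by binomial
# thinning (simulated counts; dedicated implementations exist, e.g., in the
# DropletUtils Bioconductor package).
set.seed(1)
lambda_per_cell <- sample(1:10, 50, replace = TRUE)          # uneven sequencing depth
counts <- sapply(lambda_per_cell, function(l) rpois(100, lambda = l))

lib_size    <- colSums(counts)
target_size <- min(lib_size)               # thin every cell down to the minimum depth

downsampled <- sapply(seq_len(ncol(counts)), function(j) {
  rbinom(nrow(counts), size = counts[, j], prob = target_size / lib_size[j])
})
summary(colSums(downsampled))              # library sizes are now roughly comparable
```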

Challenge 4: standardization of cell type annotation and marker selection strategies

Standard brain cell type definitions and nomenclature are complex and emerging

As new cell type-specific molecular and functional datasets rapidly come online, our understanding and definition of cell type diversity is evolving. In the context of the brain, key factors impacting our understanding of distinct cell populations [137] include (1) discovery and improved molecular characterization of functionally distinct cell types in brain regions and subregions, (2) new insights into how physiology and connectivity impact neuronal identity, and (3) an improved understanding of how cells change during development and aging. Anatomical and spatial position also influences cell type gene expression. For example, while virtually all excitatory populations in the cortex are glutamatergic pyramidal neurons, they show strong molecular and morphological differences across cortical layers [27] and further differences from glutamatergic populations in other brain regions such as the hippocampus and amygdala [92]. This underscores the necessity for a common cell type nomenclature to organize cell type labels and pair these with key contextual features like tissue microenvironment [137, 138]. Reviews of cell type label management strategies highlight challenges with reconciling types and subtypes [139] as well as with tracing cell identities to anatomic and spatial regions (Fig. 1 in [129]). Specialized computational tools [140, 141] automate cell type label assignments across data sources, including scType [142], scAnno [143], scReClassify [144], and the neural network-based tool NeuCA [145]. Further, as new data emerge and cell type nomenclature evolves, reference datasets will likely need to be revisited and modified accordingly to ensure their utility.

Cell-type resolution should be experimentally driven

Given that cell type definitions can be complex and defined at multiple resolutions (i.e., as either broad cell classes or as fine subpopulations), the resolution for a given deconvolution analysis needs to be experimentally motivated. That is, the ideal cell type resolution may differ depending on the biological question under investigation. For certain applications, such as distinguishing the contribution of two adjacent brain regions to a given bulk RNA-seq sample, relatively coarse definitions of neurons and glial cells may be adequate. For other applications, such as understanding the contribution of different neuronal cell types to differential gene expression between healthy and disease samples, fine-resolution cell types may be required. An important first step for deconvolution is deciding the appropriate cell type resolution to address the underlying biological question. Prior work in human blood utilized an optimization procedure to identify the 17 optimal blood and immune cell types for deconvolution from 29 total candidate cell types [35]. In the human brain, it was found that the definition of the reference atlas (Z) is more important than the choice of deconvolution algorithm, and accordingly, the target cell types should have expression data of sufficient quality to select the best possible marker genes [30]. Cell type resolutions can be systematically set and benchmarked to understand deconvolution algorithm performance and generalizability [6]. Higher resolutions typically show worse performance [8], while the exclusion of abundant cell types often has a more detrimental impact than the exclusion of rare cell types [30, 35]. Algorithms such as BayesPrism [29] were designed to handle multiple cell subtypes implicitly and automatically without requiring the user to specify \(K\) dimensions for each resolution, and these could be preferred when robust cell type markers are lacking or cell type identity is particularly uncertain or heterogeneous.

Cell type definitions should be based on robust and identifiable expression data

One of the key conditions of a successful deconvolution experiment is that the cell types of interest are identifiable in the sample type(s) of interest. For a cell type to be identifiable, it should be sufficiently abundant and have clear gene markers. Gene markers should have sufficient expression to be distinguishable from the background (i.e., relatively high expression and sufficient read depth), as well as from other cell types of interest (i.e., sufficient DE from other cell types, with other cell types ideally having no or very low expression) [7]. While reference-free deconvolution algorithms [23, 24, 146] do not rely on specific reference marker genes to the same degree as reference-based algorithms, the suitability of available expression data for performing deconvolution with high accuracy is a key issue across algorithm types and needs to be carefully considered.

Even with appropriate cell type definitions and evidence from expression data, the issue of defining the total set of cell types (K) to predict in a sample presents its own challenge. If the cell types in the reference do not reflect the cell types in the bulk or pseudobulk sample, deconvolution accuracy can suffer [8]. Given a set of more than two well-defined cell type labels, it is also reasonable to ask whether we should deconvolve all cell types together, or whether similar cell types should be binned prior to attempting deconvolution. For example, suppose an expression dataset contains cells with the Excit, Inhib, Oligo, and Astro cell type labels. From these, we could define several alternative groupings, each with its own reference atlas: (1) K = 2, neuronal (i.e., excitatory and inhibitory) and non-neuronal (i.e., Oligo and Astro); (2) K = 3, Excit, Inhib, and non-neuronal; or (3) K = 4, Excit, Inhib, Oligo, and Astro. Recent deconvolution studies have advanced our understanding of how cell type label definitions impact deconvolution outcomes. In both blood [35] and brain [30], iterative assessments may lead to the effective quantification of relatively specific cell types and the exclusion of others. Efforts to bin and evaluate cell type definitions should be considered alongside strategies to identify the cell type-specific gene markers for the reference. Marker identification methods may be based on differential expression tests, such as the Wilcoxon rank-sum test, or on clustering, to name a few [147].
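The sketch below illustrates grouping (1), collapsing the four fine labels into neuronal and non-neuronal bins and averaging the corresponding signature columns; the label map and simulated signatures are illustrative only.

```r
# Minimal sketch: collapsing fine cell type labels into broader bins and
# aggregating the signature matrix accordingly (labels and values illustrative).
set.seed(1)
fine_types <- c("Excit", "Inhib", "Oligo", "Astro")
Z_fine <- matrix(rexp(50 * 4), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), fine_types))

# Grouping (1) from the text: K = 2, neuronal vs. non-neuronal
bin_map <- c(Excit = "neuronal", Inhib = "neuronal",
             Oligo = "non_neuronal", Astro = "non_neuronal")

# Average the fine-type signatures within each bin to form the coarser reference
Z_broad <- sapply(unique(bin_map), function(b) {
  rowMeans(Z_fine[, names(bin_map)[bin_map == b], drop = FALSE])
})
colnames(Z_broad)   # "neuronal" "non_neuronal"
```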

Expression markers of disease may confound signature atlas reliability

A further consideration for bulk deconvolution methods is heterogeneity introduced by disease state that may influence marker gene expression. As many algorithms are intended for use in bulk tissue samples from disease states, it is important to understand how illness may uniquely impact cell types and their expression of marker genes. Gene expression atlases [148, 149] can be used to identify cell type marker genes. Development of these atlases has paralleled technological advances and has led to the generation of the first whole-organism atlases, collecting more assays matched to an individual organism than ever before [148, 150]. However, many new tissue- [148, 149, 151] and condition-specific [11, 152, 153] atlases have been generated from sc/snRNA-seq data that may be limited by differential expression across conditions [13, 14, 57]. For example, in samples from individuals with Alzheimer’s disease (AD) relative to neurotypical control subjects, neurons show marker gene repression, while glial cells generally show up-regulation of marker genes [11]. Changes in gene expression have also been reported for psychiatric disorders such as major depression, where prior work showed 16 cell types with altered expression, including excitatory neuron and OPC cell types [10]. Computational interfaces [154] and tools, including scGen [155], scVI [156], and scANVI [157], facilitate the use of single-cell RNA-seq references by managing unwanted variation and improving between-source marker performance and reliability. Further, new tools [129] and standards [137] facilitate the management of reference atlases from newer technologies measuring combined modalities, such as transcriptomes and proteomes from the same cell. Given that disease-specific differential expression [22, 153] can interfere with the effectiveness of cell type signature matrices, cell type marker genes selected for deconvolution should show equivalent expression between healthy and disease conditions. If expression is not equivalent between conditions, further adjustments to either the reference marker or bulk expression data may be necessary, and collection of orthogonal matched assays facilitates these efforts (Challenge 1, Fig. 3).

Challenge 5: reference atlases (Z) should be built on standardized and state-of-the-art computational tools and file formats

Standardized data-driven cell type labels can facilitate deconvolution advances

As discussed above, effective cell type definitions are crucial for deconvolution success. As more data come online (Fig. 4), there is an increasing need for uniform labeling of cell types [9] and careful documentation of study metadata, including cell type enrichment methods [158, 159]. For example, in the brain, anti-NeuN antibodies are commonly used to enrich neuronal cell populations during FANS [160]. Cataloging cell markers and the reagents used to select specific cell types will be important for standardizing data collection practices. On the data analysis side, sc/snRNA-seq cell type labels may be derived from clustering [35, 92, 154], reference-based tools [20, 161], or other analytical approaches [7, 78, 162]. In these cases, cell type labels could be indexed with a link to their originating annotation method. Further, hierarchical organization of cell type descriptors can facilitate insights into their molecular and physiological properties. Examples of this practice include term ontologies from the ENCODE project (https://www.encodeproject.org), the common cell type nomenclature [137], and the Human Cell Atlas (HCA) [163], and these can be leveraged for cell type marker selection [162]. Combining key analysis and definitional metadata with standardized cell type labels can encourage reproducibility and new analyses. An individual sc/snRNA-seq experiment can use manual [42, 56], computational [155,156,157, 164], or combined [165] approaches to assign cell type labels or ascertain their abundances (Table 2). For certain tissues [166] and conditions [152], consulting external references can narrow the cell type or subtype definition according to its expected properties [151]. Despite the availability of existing protocols, recommendations may disagree and be difficult to reconcile, and protocols may perform suboptimally in a particular experiment. The choice of marker genes impacts downstream analyses [143, 145, 155, 156]. For these reasons, more standard approaches [138] and flexible analysis strategies are needed.

Fig. 4

Summary of tissues by literature reference from the bulk transcriptomics deconvolution literature. A Dot and line plot of (x-axis) yearly (y-axis) cumulative references by (shape, line type, label) tissue, including (circle, solid line, “all_tissues” label) the combined set of all tissues, (triangle, short dashed line, “blood”) blood tissue, (square, middle dashed line, “brain”) brain tissue, (plus, long dashed line, “immune_cell”) immune cell populations, (box, dotted line, “pancreas”) pancreas tissue, and (asterisk, dotted line, “tumor”) tumor tissue. B Barplot showing (y-axis) the number of literature references (x-axis) per tissue, including (“all_tissues” label) the combined set of all tissues. Plots were created using the ggplot2 (v3.4.1) [16] software; data used to reproduce these plots are available from GitHub (Data Availability)

Expression data needs to be published using state-of-the-art data science formats

Publishing key datasets and results with essential documentation using standard data formats is an important part of reproducible computational research [167,168,169,170]. Generating, hosting, and distributing large volumes of transcriptomics data and metadata at scale and in a reproducible manner demands substantial time and resources [150, 171], and specialized technologies facilitate this effort [148, 172, 173]. While flat table files (e.g., files with .csv or .tsv extensions) are most common, many other data formats allow rapid and memory-efficient access [174, 175] to reduce the computing time and resources needed for access and analysis. Important examples include relational databases (e.g., structured query language [SQL]) and array-oriented formats such as hierarchical data format 5 (HDF5). For example, the active memory required to load expression counts from 77,604 cells and 36,601 genes is 9.32 MB as an HDF5 file accessed with the DelayedArray framework in Bioconductor, versus 22.72 GB as a standard in-memory matrix (Data Availability). Specialized data formats are compatible with increasingly used cloud servers and remote computing environments [176], and large-scale data efforts [5, 177] to comprehensively compile and analyze microarray and sequencing data from the Sequence Read Archive [172], Gene Expression Omnibus [173], ArrayExpress [178], and other public repositories have led to the publication of HDF5-based data compilations and Bioconductor packages. Further, many of the 246 [179] actively maintained Bioconductor software packages with the SingleCell descriptor use specialized data formats [77, 180]. These include the SummarizedExperiment format for most omics data types [180] and the SingleCellExperiment format for sc/snRNA-seq expression data [77, 181], which is being extended for use with image coordinate information from spatial transcriptomics experiments [27, 65, 182, 183]. Further, the Azimuth project [149] provides cell reference datasets for multiple tissues as SeuratObjects [48], another data class specialized for sc/snRNA-seq data. Newer data formats may be subject to updates that introduce errors or conflicts with other data classes, and resolving data class conflicts frequently demands a high degree of technical knowledge. This is one reason it is important to publish the versions of packages and object classes used, in case an older version needs to be used while a newer version is being updated. In summary, sequencing data may be published in a variety of formats to facilitate access, methods should include details like the versions of computational tools that were used, and researchers should be aware of the many available technologies for reproducible transcriptomics analyses and data sharing.
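As a small illustration of the memory savings from on-disk representations, the sketch below writes a simulated count matrix to HDF5 with the Bioconductor HDF5Array package and compares object sizes; the matrix dimensions here are arbitrary and much smaller than the real dataset cited above.

```r
# Minimal sketch: an HDF5-backed DelayedArray keeps counts on disk, so the
# in-memory footprint is small compared with a dense matrix. Dimensions are
# arbitrary. Requires the Bioconductor packages HDF5Array and DelayedArray.
library(HDF5Array)

set.seed(1)
counts <- matrix(rpois(5000 * 2000, lambda = 1), nrow = 5000)     # genes x cells
print(object.size(counts), units = "MB")                          # dense, fully in memory

h5_file   <- tempfile(fileext = ".h5")
counts_h5 <- writeHDF5Array(counts, filepath = h5_file, name = "counts")
print(object.size(counts_h5), units = "KB")                       # small on-disk stub

# Operations are delayed; data blocks are read from disk only when needed
head(colSums(counts_h5))
```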

Challenge 6: improving algorithm and signature atlas generalizability to new bulk tissue conditions

Cross-validation can limit algorithm overfitting and improve algorithm generalizability

Developers of new deconvolution algorithms and studies seeking to benchmark existing approaches must consider statistical power [184] and generalizability [185]. Here, power refers to the ability to detect cell type markers from DE analysis and to differentiate between significantly different cell type proportions [46], and generalizability refers to the replicability of the experiment [167, 186]. For example, an experiment showing good algorithm performance in terms of accurate cell composition estimates and reliable cross-group comparisons should ideally also perform well when analyzing additional data from an independent data source or new participant population. To encourage generalizability and reduce the chances of algorithm overfitting to training data, cross-validation should be performed whenever possible, even if sc/snRNA-seq reference data are only available from relatively few sources [186, 187]. As mentioned previously, subject and sample characteristics should further be balanced across experimental groups, as imbalances could bias the results or undercut their generalizability [13].
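The sketch below illustrates leave-one-donor-out cross-validation with a simple non-negative least squares deconvolution on simulated donor-level data; it is meant only to show the resampling structure, not a recommended estimator.

```r
# Minimal sketch: leave-one-donor-out cross-validation of a simple NNLS
# deconvolution on simulated donor-level data. This only illustrates the
# resampling structure; it is not a recommended estimator.
library(nnls)

set.seed(1)
G <- 50; K <- 3
donors <- paste0("donor", 1:4)
Z_true <- matrix(rexp(G * K), nrow = G)

# Donor-specific references (noisy copies of Z_true) and matching bulk profiles
Z_by_donor <- lapply(donors, function(d) Z_true * matrix(rexp(G * K), nrow = G))
names(Z_by_donor) <- donors
P_true <- sapply(donors, function(d) { p <- runif(K); p / sum(p) })
Y <- sapply(donors, function(d) as.vector(Z_by_donor[[d]] %*% P_true[, d]))

# Hold out one donor at a time; build the reference from the remaining donors
cv_error <- sapply(donors, function(held_out) {
  train <- setdiff(donors, held_out)
  Z_train <- Reduce(`+`, Z_by_donor[train]) / length(train)   # averaged reference
  p_hat <- nnls(Z_train, Y[, held_out])$x
  p_hat <- p_hat / sum(p_hat)
  mean(abs(p_hat - P_true[, held_out]))                       # mean absolute error
})
round(cv_error, 3)
```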

Developers should account for the tissues and conditions in which new algorithms will be applied

Deconvolution algorithms have varying performance across tissues and conditions, which we will call “domains,” and algorithms may be considered either general (e.g., good performance across domains) or domain-specific (e.g., good performance in a specific domain). Further, algorithm assumptions may vary depending on their intended domains of use. For example, algorithms developed with normal tissues often assume good markers are known for each cell type [7], but algorithms for bulk tumor deconvolution may assume no tumor cell type markers are available [36, 40, 72]. As algorithms are often developed in a single or constrained domain set and then benchmarked in new domains, certain programming practices can facilitate algorithm testing across domains. For example, functions for algorithms like EPIC [40] and MuSiC [28, 72] flexibly support either default or user-specified cell scale factors, which may encourage more standard application of these adjustments in deconvolution experiments. Deconvolution algorithms (Table 1) such as dtangle [34], SCDC [31], Bisque [32], and DWLS [33] that do not support user-specified cell scale factors could instead use a transformed or rescaled reference (Challenge 2). Ultimately, developers should carefully consider the scope and nature of the domain(s) in which an algorithm will be applied.

Deconvolution algorithms should be optimized for prediction across conditions of interest

Beyond understanding normal tissue expression dynamics, effective deconvolution can allow new hypothesis testing to elucidate relationships between cell types and disease mechanisms. Of particular interest in brain research is the prospect of studying significant changes in the abundances of neurons and/or glial cells between neurotypical samples and samples from neurodevelopmental, neuropsychiatric, and neurodegenerative disorders, including autism spectrum disorder (ASD), Parkinson’s disease (PD), and AD. Glia-specific inflammation in AD is detectable from snRNA-seq data, and further studies could reveal biomarker candidates and risk factors with utility for patient prognosis or diagnosis [14]. Microglial activation has been correlated with AD severity, illuminating mechanisms related to disease progression [32]. Total neuron proportion may decline in AD brains, reflecting neuronal death as a hallmark of AD; this trend was detectable in bulk tissue using multiple deconvolution methods [32]. Finally, accurate cell type quantification in case/control studies of bulk tissues revealed 29 novel differentially expressed genes in ASD that were independent of cell composition differences [30]. As new data and algorithms are published, more practical guidelines [6, 7] will be needed to match the most appropriate strategies to specific biological questions. Specific protocols may be consulted to effectively deconvolve specific cell types across conditions, such as deconvolution of reactive microglia in brain tissue from donors with neurological or neuropsychiatric conditions including Alzheimer’s disease or epilepsy [187]. Marker gene reference atlases across more conditions [11, 87, 152] could also be consulted and utilized, though new standards for systematic cross-condition atlas utilization may be needed to reconcile expression differences across conditions [13, 14, 57] (Challenge 4). Finally, a likelihood-based approach that utilizes confidence intervals for cell proportion predictions [34] could be extended to quantify prediction uncertainty across tissues and/or tissue donor conditions.

Future opportunities and recommendations

We wish to highlight several opportunities for bulk transcriptomics deconvolution in heterogeneous tissues, including the human brain. First, new reference datasets featuring multiple orthogonal assays from matched samples have huge potential to shape and inform new studies. Second, aggregation of published data into centralized repositories using standard data formats paired with structured and comprehensive metadata will increase the impact of new reference datasets and the reproducibility of analyses based on these reference data. Finally, mitigating biases and improving statistical rigor in sample collection, experimental design, and the training of new deconvolution methods should greatly improve the efficacy of new deconvolution algorithms and the benchmarking of existing and emerging algorithms. Applying a transformation to the reference atlas (Z) matrix using cell scale factors, such as those in Table 4, may reduce errors in deconvolution predictions by quantifying cell proportions rather than RNA amounts [3].

Researchers can take several steps to act on these opportunities. First, even studies with a small number of donors can improve their rigor by running technical replicates (i.e., multiple runs of the same assay) and biological replicates (i.e., multiple distinct samples or tissue blocks from the same donor). Further, deconvolution algorithms can be deployed as high-quality open-access software packages and made available in centralized curated repositories such as CRAN or Bioconductor [180]. Finally, new research efforts can utilize existing references to perform validation and inform the collection of new samples.

Conclusions

While the rapidly evolving future of transcriptomics is promising, it will be important to not only address existing experimental and computational challenges in this field, but also anticipate future challenges. Orthogonal assays are important for deconvolution experiments (Table 2, Supplemental Fig. 1), allow for biological variation to be systematically studied and modeled (Challenge 1), and are more readily managed and analyzed thanks to specialized open-access technologies (Challenge 5). We have drawn on our collective research experience to detail the key challenges of designing experiments with technical and biological replicates, effective use and integration of orthogonal assays, performance of data analyses to improve statistical rigor and generalizability of findings, and publication of datasets with comprehensive and structured metadata and methods with runnable and versioned code. Taking proactive steps to address these challenges will facilitate studies of increasing scale and complexity while encouraging greater reproducibility.