Background

Alzheimer’s disease (AD) is a neurodegenerative disease that affects ~ 5.7 million patients with annual cost of more than $230 billion in the US [1]. Effective disease-modifying drugs are still elusive despite the urgent need and increasing global burden [2, 3]. Pathologically, AD is marked by amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and gliosis in the affected brain regions. Transcriptome-wide expression levels have been analyzed from bulk brain tissue of hundreds of AD patients and neuropathologically normal controls [4,5,6,7,8] to discover genes and biological pathways that are perturbed in and/or lead to AD. Systems biology and bioinformatics analysis of these data have implicated altered pathways in AD including immune response [6] and myelin metabolism [4, 5]. However, a fundamental knowledge gap remains concerning whether disease-associated changes in brain gene expression are due to changes in cellular composition of the AD samples secondary to disease neuropathology, or due to changes in the intrinsic regulation/activity of genes in the central nervous system (CNS) cells. From a clinical perspective, it is difficult to target changes in cellular composition secondary to neuropathology, whereas intrinsic changes in gene expression that may drive AD progression are potentially “druggable”.

We expect that identifying cell-intrinsic differentially expressed genes (CI-DEGs) of individual CNS cell types will reveal novel insights into the genes and pathways that could potentially identify drug targets and lead to AD therapeutics. This approach circumvents the influence of cell-composition changes that can impact disease associated DEGs obtained from bulk tissue transcriptome analysis. RNA sequencing (RNAseq) studies from single cell, single nucleus or purified adult human CNS cells [9,10,11] can be used to identify CI-DEGs. Even though such single cell-type RNAseq data can effectively serve as a reference to annotate bulk-tissue transcriptome data [4], such approaches remain costly, cumbersome and limited in sample sizes. On the other hand, there exist large-scale bulk brain RNAseq datasets [5, 8, 12], which can be leveraged to identify CI-DEGs through analytic deconvolution of bulk tissue expression into signals of individual cell types by using cell proportions or their proxies [13, 14].

In this study, we describe the design and application of a bioinformatics pipeline that uses cell type marker genes to estimate cell proportion [15, 16] to deconvolute bulk tissue transcriptome data using three computational approaches [13, 14, 17] and to subsequently identify CI-DEGs. We applied our pipeline to the analysis of three post-mortem brain datasets, one from dorsolateral prefrontal cortex (DLPFC) [8] and two from temporal cortex (TCX) [4, 12, 18] regions, comprised of 685 unique samples. Consensus CI-DEGs common to both TCX and DLPFC regions were analyzed for enrichment of gene ontology (GO) terms. We compared the results of consensus CI-DEGs to consensus bulk-DEGs. In addition, for the DLPFC [8] dataset that had both bulk and single nucleus RNAseq [19] (snRNAseq) data, we compared the CI-DEGs from the computational deconvolution to CI-DEGs obtained from snRNAseq [19].

To our knowledge, this is the first study to detect consensus CI-DEGs and their enriched gene ontology (GO) terms from multiple brain regions using multiple computational deconvolution algorithms for AD and control RNAseq samples. Our study illustrates the cell proportion landscape of AD and control brain regions assessed in three independent RNAseq studies [4, 7, 8, 12]. We identify consensus CI-DEGs many of which are not observed in bulk-DEG analysis and characterize their cell-type specificity. GO terms that are enriched for CI-DEGs implicate cell intrinsic transcriptional alterations that may influence AD, rather than be a result of cell-proportion changes in this disease. These CI-DEGs and their biological pathways may serve as refined molecular targets for therapeutic discoveries and disease modeling in AD. Our study also demonstrates that detection of CI-DEGs through computational deconvolution methods is promising while some challenges remain.

Results

Cellular composition in three brain cohorts from two brain regions

We analyzed three cohorts each consisting of post-mortem brains from AD and control subjects (Table S1), namely the Rush Religious Orders Study and Memory and Aging Project dorsolateral prefrontal cortex (DLPFC) [7, 8], Mayo Clinic temporal cortex (TCX-Mayo) [4, 12], and Mount Sinai VA Medical Center Brain Bank temporal cortex (TCX-MSBB) [18]. We generated the TCX-Mayo RNAseq dataset, and downloaded DLPFC and TCX-MSBB RNAseq datasets from the AMP-AD knowledge portal on Synapse (www.synapse.org).

Cell proportions (Table S2) were estimated for DLPFC, TCX-Mayo and TCX-MSBB datasets independently using the digital sorting algorithm (DSA) method [16] and the top 100 marker genes (Table S3) obtained from R package BRETIGEA [15] for each of the following cell types – neuron, oligodendrocyte, microglia, oligodendrocyte progenitor cell (OPC), astrocyte and endothelial cell.

An inspection about the pairwise correlation between marker genes (Fig. 1a) revealed that markers of OPC have poor median pairwise Pearson correlation values of 0.12 in DLPFC, 0.11 in TCX-Mayo and 0.06 in TCX-MSBB respectively, whereas among the other five cell types neuronal markers have the highest median correlation (0.68 in DLPFC, 0.78 in TCX-Mayo and 0.67 in TCX-MSBB), and microglia markers have the lowest correlation (0.37 in DLPFC, 0.42 in TCX-Mayo and 0.44 in TCX-MSBB). In addition, a computer simulation study (Fig. S1) demonstrated that the estimated proportions of OPC were not robust upon using different selection of marker genes. Therefore, we did not include OPC in downstream analyses in this study.

Fig. 1
figure 1

a) Pearson correlation between marker gene expressions in six cell types. Marker genes are from literature. b) Estimated cell proportions in DLFPC, TCX-Mayo and TCX-MSBB datasets in five cell types. Red asterisk indicates differences between cell proportions in AD and control groups at nominal p value 0.05 from Wilcoxon rank sum test

In all three datasets, neuronal cell proportion estimates were significantly lower in AD compared to controls (Fig. 1b). The magnitude of this decrease was the greatest for TCX-Mayo (AD mean proportion = 28.0%, Control = 35.7%; ratio of AD:control cell proportions = 0.78), followed by TCX-MSSM (AD = 42.3%, control = 49.3%; ratio = 0.87) and DLPFC (AD = 42.4%, control = 47.4%; ratio = 0.89). The estimated proportions of microglia were significantly higher in AD vs. controls for all datasets, with higher magnitude in TCX-Mayo (AD:control ratio = 1.19) and TCX-MSBB (AD:control ratio = 1.19) than for DLPFC (AD:control ratio = 1.06). The estimated proportions of astrocytes and endothelial cells were significantly higher in AD vs. controls for DLPFC and TCX-Mayo datasets, although the magnitude was greater in TCX-Mayo (1.40 and 1.30 respectively) than in DLPFC (1.07 and 1.14 respectively) for both cell types. Oligodendrocyte proportion is significantly higher in AD in DLPFC with AD:control ratio 1.14 and TCX-MSBB with AD:control ratio 1.27, although remains unchanged in TCX-Mayo with the ratio 0.94. Collectively, these findings demonstrate that the proportions of CNS cell types are different in post-mortem AD vs. control brains for most cell types. Although these proportional changes with AD are mostly consistent across the different studies, their extent varies across brain regions, with TCX tending towards higher magnitude of neuronal loss and microglia proliferation than DLPFC. It needs to be emphasized that the cell proportion changes estimated here are relative values, rather than absolute cell proportion changes between ADs and controls.

Differential expression analyses

In this study, three computational approaches were applied to identify cell intrinsic differential expression in individual cell types (CI-DEGs, Table S4-S6), namely CellCODE [14], PSEA [13] and our method WLC. Differentially expressed genes from bulk brain tissue (bulk-DEGs) were identified through linear regression without adjusting for cellular composition (Table S7). For the DLPFC, TCX-Mayo and TCX-MSBB datasets, we obtained bulk-DEGs and CI-DEGs from the three computer algorithms for neuronal, oligodendrocytic, microglial, astrocytic and endothelial cell types respectively.

We compared bulk-DEGs across the three datasets (Fig. 2a, top panel). Similarly, CI-DEGs from CellCODE, PSEA and WLC are compared across datasets (Fig. 2a, lower panels), such that CI-DEGs shared between datasets are required to be consistent in the designated cell type. All DEGs are identified at nominal p-value cutoff 0.05 and shared CI-DEGs have the same direction of change in the compared datasets. The ratio of overlap between any two datasets over all DEGs, i.e. the number in overlapping areas of the Venn diagram over the total number (Fig. 2a, top panel), is 30.0% or 1711/5697 in up-regulated bulk-DEGs, and 34.8% or 2214/6371 in down-regulated bulk-DEGs. This ratio of overlap in bulk-DEGs is much higher than that in CI-DEGs (2.7, 4.7 and 10.0% in up-regulated genes from CellCODE, PSEA and WLC respectively; 3.1, 6.8 and 9.3% in down-regulated genes from CellCODE, PSEA and WLC respectively).

Fig. 2
figure 2

a) Overlap across three independent RNAseq datasets of bulk-DEGs (upper panel) and CI-DEGs (lower panels) from three computational approaches. b) Consensus CI-DEGs between DLPFC and TCX brain regions, which consist of consensus CI-DEGs between DLPFC and TCX-Mayo, or between DLPFC and TCX-MSBB. c) Overlap between consensus CI-DEGs and consensus bulk-DEGs, per cell type. The p-values of overlap are from hypergeometric tests

Consensus CI-DEGs between DLPFC and TCX

To obtain the consensus CI-DEGs that are shared between DLPFC and TCX brain regions, we selected those CI-DEGs that are detected in “DLPFC and TCX-Mayo” or in “DLPFC and TCX-MSBB” under any of the three algorithms (Fig. S2). We combined all such genes, which collectively comprised the consensus CI-DEGs for each cell type (Fig. 2b). Similarly, consensus bulk-DEGs were the combined set of bulk-DEGs shared between “DLPFC and TCX-Mayo” or “DLPFC and TCX-MSBB”.

Most consensus CI-DEGs are from neuronal cells (N = 559), followed by oligodendrocytes (N = 260), whereas microglia contributed the least number (N = 101). The majority (65.5% or 366/559) of neuronal CI-DEGs is down-regulated in AD, and the majority (66.0% or 140/212) of endothelial CI-DEGs is up-regulated in AD, with other cell types lying in between. Some of these CI-DEGs are also among the 1000 marker genes of the corresponding cell type from BRETIGEA [15]; 14.7% or 82/559 of neuronal CI-DEGs are also neuronal markers, 25.4% or 66/260 of oligodendrocyte CI-DEGs are also oligodendrocyte markers, and other cell types lie in between.

With regards to consensus bulk-DEGs (Fig. S3), 28.2% of them (885/3135) are cell type markers; 10.4% neuronal markers, 5.6% oligodendrocyte, 3.4% microglia, 4.8% astrocyte and 4.0% endothelial markers. The above observations indicate that computational deconvolution algorithms could identify CI-DEGs for both marker genes and non-marker genes. Importantly, the proportion of non-marker CI-DEGs is greater than that in bulk-DEGs. This suggests that compared to bulk-DEGs, CI-DEGs may be capturing a greater proportion of expression changes that are not due to mere cell population changes.

We also compared the consensus bulk-DEGs with consensus CI-DEGs (Fig. S4). We determined that only a small fraction (15.0% or 29/193) of the up-regulated neuronal CI-DEGs was also present in up bulk-DEGs although the overlap is still significant (Fig. 2c). In comparison, most of the up-regulated CI-DEGs of the other four cell types were included in up bulk-DEGs. On the other hand, most (84.2% or 308/366) of the down-regulated neuronal CI-DEGs were also present in down bulk-DEGs, whereas most of the down-regulated CI-DEGs of the other four cell types were absent from this group. Since bulk-DEGs did not adjust for neuronal loss and gliosis in AD (Fig. 1b), its ability to identify up-regulated neuronal genes and down-regulated glial genes is likely to be compromised. For the same reason, bulk-DEGs may have a false inflation of detecting down-regulated neuronal and up-regulated glial genes.

Enriched GO terms of consensus CI-DEGs between DLPFC and TCX

To identify pathways implicated by CI-DEGs that are robust across brain regions, we performed Gene Ontology (GO) enrichment analysis [20, 21] for the consensus CI-DEGs, assessing separately those that are up vs. down in AD subjects (Table S8-S17). Figure 3 illustrates the top two enriched GO terms by enrichment p-values, after filtering out terms that encompass less than four CI-DEGs or are cellular compartments.

Fig. 3
figure 3

Top two enriched GO terms in up (red) or down-regulated (blue) consensus CI-DEGs between DLPFC and TCX regions, per cell type

Consensus CI-DEGs revealed biological pathways that are perturbed in AD in specific brain cell types. Some of these pathways have previously been implicated in AD and others are novel. Down-regulated neuronal CI-DEGs were enriched in neuropeptide hormone activity (GO:0005184) and hormone activity (GO:0005179) pathways, which include VGF (a.k.a. neuroendocrine regulatory peptide 1) [22] and corticotropin releasing hormone (CRH) [23] (Table S13). Consensus up-regulated neuronal CI-DEGs were significantly enriched in potassium channel activity (GO:0005267) and regulation of ion transport (GO:0043269) pathways (Table S8). The latter GO term encompasses most of the genes from the former, and also includes other genes involved in neuronal functions such as the glutamate ionotropic receptor NMDA type subunit 1, GRIN1 [24] and SYT1, which encodes the synaptic vesicle protein, synaptotagmin [25].

Many of the most significant GO terms are related to key functions of the respective cell types for the glial CI-DEGs, as well. The top enriched pathway of down-regulated CI-DEGs in oligodendrocytes is myelination (GO:0042552), including myelin basic protein (MBP) [4], plasmolipin (PLLP) [4, 5], myelin and lymphocyte protein (MAL), and myelin-associated glycoprotein (MAG) [26] (Table S14). Up-regulated CI-DEGs of oligodendrocytes are enriched in ceramide biosynthetic process (GO:0046513) including ceramide synthase 4 (CERS4) and UDP glycosyltransferase 8 (UGT8) [5] (Table S9). Ceramide is a constituent of sphingomyelin, a sphingolipid which is particularly found in the myelin sheath; and also a multi-functional signaling molecule [27, 28]. Hence, both the down-regulated and the up-regulated oligodendroglial consensus CI-DEGs highlight different components of the myelin biology that are perturbed in AD.

Similarly, microglial, astrocytic and endothelial CI-DEGs also highlight processes pertinent to the functions of these cell types. Microglial up-regulated CI-DEGs are enriched in inflammatory response (GO:000695) and leukocyte activation (GO:0002696), which includes complement C3a receptor 1 (C3AR1) [29], interleukin 18 (IL18) [30, 31] and CCAAT enhancer binding protein alpha (CEBPA) [32] genes (Table S10).

Astrocytes, a cell type that plays a critical role in maintaining brain energy dynamics [33] and metabolism [34], show enrichment of oxidoreductase activity (GO:0016491) and drug metabolic process (GO:0017144) in down-regulated CI-DEGs which includes genes glutathione S-transferase mu 2 (GSTM2) [35] and thioredoxin2 (TXN2) [36] (Table S16). Astrocytic up-regulated consensus CI-DEGs are enriched for cell-cell junction assembly (GO:0007043) process (Table S11), including the astrocytic gap junction protein connexin43 (GJA1) [37], which was identified as a key regulator associated with AD related outcomes. The other top GO process for astrocytic up-regulated consensus CI-DEGs is adenylate cyclase-inhibiting G protein-coupled receptor signaling pathway (GO:0007193), which harbors adenylate cyclase 8 (ADCY8), involved in memory functions [38].

Finally, endothelial cells, which are crucial in maintaining blood-brain barrier integrity [39, 40], show enrichment of up-regulated DEGs in cytoskeleton organization (GO:0007010) and actin filament-based process (GO:0030029) (Table S12).

Importantly, some CI-DEGs highlight protein translation as a top perturbed biological pathway. Down-regulated microglial consensus CI-DEGs show enrichment in processes involved in protein translation (GO:0006614 and GO:0006613), which include ribosomal protein encoding genes [41,42,43] (Table S15). Similarly, down-regulated endothelial consensus CI-DEGs also harbor ribosomal protein encoding genes, with enrichment in protein translation related GO processes (GO:0006413 and GO:0006613) (Table S17).

Comparison of CI-DEGs from computational deconvolution vs. snRNAseq

We determined the extent to which each of the three computational deconvolution algorithms could detect CI-DEGs from bulk tissue by comparison of their results with those obtained in a published snRNAseq study [19]. The ROSMAP dataset utilized in our study has both bulk RNAseq from DLPFC (bulk-DLPFC) as well as snRNAseq (snDLPFC) in a subset of its participants [19]. We compared the bulk-DLPFC data deconvoluted with three different algorithms with the published snDLPFC [19] data. Endothelial CI-DEGs were not available from the snRNAseq study, therefore overlap of results could be assessed only for four cell types.

We tested the overlap between the top CI-DEGs for each cell type obtained from deconvoluted bulk-DLPFC and those from snDLPFC ranked by their p values (Fig. 4a). We evaluated the overlap for a range of top CI-DEGs up to top 1000 genes. Overlap for CI-DEGs that are either up (Fig. 4a, upper panel) or down (Fig. 4 a, lower panel) in AD were assessed separately. Hence, overlapping genes had both similar ranks and direction of effect in both deconvoluted bulk-DLPFC and snDLPFC analyses. We established the significance of overlap using simulations for a range of top ranked genes (N = 200, 600 and 1000) (Table S18).

Fig. 4
figure 4

Comparison of CI-DEGs from computational deconvolution with CI-DEGs from snRNAseq on DLPFC dataset. a. Upper panel: number of overlapping genes (y-axis) between the top N (x-axis) up-regulated genes in snDLPFC and top N up-regulated genes from bulk-DLPFC deconvoluted with PSEA, WLC and CellCODE, respectively. Lower panel: number of overlapping genes (y-axis) between the top N (x-axis) down-regulated genes in snDLPFC and top N down-regulated genes from deconvoluted bulk-DLPFC. b. Venn diagram of CI-DEGs from computational deconvolution methods and those from snRNAseq. Overlap is evaluated for both bulk-DLPFC and snDLPFC CI-DEGs detected at nominal p value ≤0.05

Neuronal CI-DEGs retained their significance of overlap across all comparisons and for all algorithms, except for the top 1000 up-regulated neuronal CI-DEGs deconvoluted with PSEA. Microglial CI-DEGs had the least numbers of significant overlap for their top ranked genes. Astrocytic and oligodendrocytic top ranked CI-DEGs had significance of overlap between the neuronal and microglial results (Table S18). These findings are reflective of the abundance of these cell types, with the most abundant neurons having the most overlap for the top ranked CI-DEGs between deconvoluted bulk-DLPFC and snDLPFC.

Amongst these comparisons, we determined that the significance for overlap was best for all algorithms for the top ranked 600 genes. Using WLC deconvoluted results, the overlap for the top 600 CI-DEGs from bulk-DLPFC and snDNPFC are statistically significant for all eight comparisons (Fig. 4a). For the top 600 genes, overlap with CellCODE results is significant for all except down-regulated oligodendrocyte and up-regulated astrocyte CI-DEGs. For PSEA, none of the microglia CI-DEGs had significant overlap. PSEA results for the top 600 genes were otherwise significant for all but up-regulated oligodendrocyte and down-regulated astrocyte genes.

We also performed a comparison of CI-DEGs identified at nominal significance (p-value < 0.05) with each algorithm from bulk-DLPFC to nominally significant snDLPFC results (Fig. 4b, Table S19). As with the above comparison, genes that are either up or down in both deconvoluted bulk-DLPFC and snDLPFC data were analyzed separately for each cell type.

Not surprisingly, down-regulated neuronal CI-DEGs have the greatest overlap (537/3732 or 14.4% for WLC, 292/3213 or 9.1% for CellCODE, 415/3516 or 11.8% for PSEA). These overlaps are significant for all three algorithms (Table S19). Down-regulated CI-DEGs in microglia show the least proportion of overlap (9/723 or 1.2% for WLC, 4/609 or 0.66% for CellCODE, 16/820 or 2.0% for PSEA) (empirical p-value > 0.05). Significant overlap detected with WLC (all but down-regulated microglia) and PSEA (all but microglial results and up-regulated oligodendrocytes) were similar, whereas CellCODE results had significant overlaps only for the neuronal CI-DEGs (Table S19).

Discussion

There is an increasing number of large scale RNAseq-based transcriptome datasets generated in bulk tissue for many diseases, including brain tissue from patients with AD, other neurodegenerative diseases and controls [5,6,7,8, 12]. Comparison of such transcriptome data from patient and control individuals has been instrumental in the identification of genes and co-expression networks that are altered in and may therefore be risk factors for these diseases [4,5,6,7, 44]. The discovery that many of these transcriptional networks harbor genes with disease risk variants provides support for the utility of this bulk-transcriptome approach in deciphering molecules and pathways that are risk factors for these conditions. Nevertheless, there is clear evidence for abundant transcriptome alterations in bulk tissue from affected organs of patients with disease, which appears to be due to changes in cellular composition of that tissue as a consequence of the disease processes [4, 45]. Given this, there is a strong need to detect cell-intrinsic transcriptional alterations to be able to distinguish gene expression changes that are upstream of and may therefore be causative factors for disease from those that are merely a result of cell proportion changes that occur due to disease pathology. The discovery of molecules and pathways that are upstream of and risk factors for disease pathology is of paramount importance for development of targeted therapies. This information can also aid in the identification of refined disease biomarkers reflective of disease-causative expression alterations in these conditions. Detection of cell-specific transcriptional changes can also help develop more accurate disease models harboring these cellular alterations. Further, discovery of cell-specific transcriptional alterations in disease may uncover expression changes, particularly in less abundant cell types, which may be missed by the analysis of bulk transcriptome. Thus, there is a growing effort to identify cell-specific expression alterations in human diseases [9,10,11, 14, 15, 17, 46,47,48].

There are two general approaches to decipher cell-specific transcriptional changes in AD. One approach is to conduct single nucleus (snRNAseq), single cell (scRNAseq) or purified cell RNAseq experimentally, followed by data analyses. The alternative approach is to design relatively complex bioinformatics pipelines to decipher cell-intrinsic information of individual cell types from bulk tissue microarray or RNAseq data. The first approach is limited due to significantly higher cost and experimental challenges. Additionally, these approaches may have the drawback that the procedures of dissociating cells break cell-to-cell communication and thus may not reflect the true expression signal in vivo. The alternative bioinformatics approach to decipher cell-specific transcriptome alterations from bulk tissue has the potential to avoid the above weaknesses of the experimental approach. Furthermore, a bioinformatics approach can make use of the large amount of existing RNAseq data [4,5,6,7,8, 12] from fresh or frozen bulk brain tissues with minimum added cost, and may better reflect the true situation in brain where different cells are in communication.

In this study, applying three different algorithms [13, 14] including our own novel approach, we estimated cell-intrinsic gene expression for deconvoluted cell types from three large bulk RNAseq datasets [4, 8, 12, 18] from two brain regions. We identified consensus cell-intrinsic transcriptional alterations (CI-DEGs) in AD, which are conserved across cohorts and brain regions. We also performed an in-depth comparison of these CI-DEGs with bulk brain RNAseq data obtained from the same datasets, collectively comprised of 685 unique brain samples. To our knowledge, this is the first study to detect CI-DEGs and their enriched gene ontology (GO) terms in computationally deconvoluted large-scale RNAseq data from AD and control brain samples. Additionally, we conducted a detailed comparison of CI-DEGs deconvoluted from bulk-DLPFC data with the three algorithms and those obtained from snRNAseq [19] (snDLPFC) of a subset of the samples from the same cohort [7, 8].

The main findings from our study can be summarized as follows: 1) The direction of change in cellular proportions in AD is consistent across two brain regions and three datasets for most cell types, although the magnitude of change seems to vary. Our findings revealed greater neuronal loss and microgliosis in TCX compared to DLPFC. 2) We identified CI-DEGs and bulk tissue DEGs (bulk-DEGs) independently in two TCX and one DLPFC datasets. The overlap in bulk-DEGs across datasets is greater than that for CI-DEGs. 3) We performed an in-depth comparison of the consensus CI-DEGs, common to both TCX and DLPFC against the consensus bulk-DEGs detected in these same datasets. We identified significant overlap between consensus CI-DEGs and consensus bulk-DEGs. The extent of overlap between consensus bulk-DEGs and consensus CI-DEGs was greatest for down-regulated neuronal genes (p = 9.7E-206). This was followed by up-regulated non-neuronal genes (p ranging from 1.6E-77 for endothelia to 1.1E-17 for microglia). 4) Despite the statistically significant overlap between consensus bulk vs. CI-DEGs, the majority of the consensus CI-DEGs for up-regulated neuronal and down-regulated non-neuronal genes were not detected in bulk tissue. This finding highlights the potential ability of computational deconvolution approach to identify CI-DEGs that may be missed in bulk-DEGs especially for genes that are not moving in the direction of cell proportion changes. 5) We identified GO-terms enriched for consensus CI-DEGs, and detected processes that have previously been implicated in AD as well as novel ones. 6) Using an snRNAseq [19] dataset as comparison, we assessed the performance of our CI-DEG detection algorithm (WLC), and the published CellCODE [14] and PSEA [13] approaches. We determined that WLC had comparable or superior performance in the detection CI-DEGs that had significant overlap with snDLPFC results.

Our findings highlight the consistency and reproducibility of our findings across two different brain regions from three different studies conducted separately. We identified similar directions of change in AD:Control cell proportions in TCX and DLPFC. As expected from known AD neuropathology, neuronal populations are significantly lower, and microglial populations are significantly higher in AD vs. control brains in all datasets. Consistent with this pattern of reproducibility, we also found significant overlap of consensus CI-DEGs detected in TCX and DLPFC for all cell types and for both directions of change, i.e. up or down in AD, with consensus bulk-DEGs.

Using consensus CI-DEGs, we identified GO terms, which include processes and genes that have previously been implicated in AD, thereby providing further validation of our approach. Detailed discussion of all of the pathways identified with the consensus CI-DEGs is beyond the scope of this study. Instead, we herein highlight some of the CI-DEG enriched pathways.

Down-regulated neuronal CI-DEGs were enriched in neuropeptide hormone activity (GO:0005184) pathway. These terms include VGF (a.k.a. neuroendocrine regulatory peptide 1), which is selectively expressed in some neurons and shown to be reduced in AD and Parkinson’s disease [22]. Corticotropin releasing hormone (CRH), which is a neuronally expressed peptide that mediates stress in the hypothalamic-pituitary-adrenal axis, is also a member of the same GO term. CRH has been implicated in both adverse outcomes related to AD pathology in model systems and epidemiology studies, as well as having an important role in learning and memory [23]. Neuronal reduction of CRH may either be a potentially neuro-protective response in AD brains or lead to further negative impact on memory. Although the biological implications of our finding remain to be uncovered, our results are aligned with the potential importance of the neuroendocrine system in AD.

Interestingly, CI-DEGs also implicate biological processes that are enriched for neuronal-DEGs that are up in AD, despite reductions in neuronal cell populations in this condition for both TCX and DLPFC. This suggests that our cell type specific transcriptome deconvolution successfully captures transcript changes that occur in the direction opposite to that of cell-population changes, and that may therefore be missed in bulk-DEG approaches.

Indeed, the significant GO terms potassium channel activity (GO:0005267) and regulation of ion transport (GO:0043269) harbor many potassium channels, which are up in AD for neuronal CI-DEGs in both TCX and DLPFC, but do not have consistent results in bulk-DEGs from the same cohorts. These findings highlight the potential utility of cell-specific transcript deconvolution approaches in reducing noise from cell population changes, thereby resulting in consistent detection of true signal. Notably, potassium channels have been reported to be up in AD brains and mouse models of AD [49,50,51], leading to their suggestion as a potential therapeutic avenue in this condition.

Notably, the computational approach we describe can be extended to cellular sub-types. To exemplify this, we utilized published snRNAseq [19] data to identify marker genes of excitatory and inhibitory neurons. Using those and the previously applied markers for the glial cells, we ran DSA algorithm [3] to estimate cell proportion (Fig. S6). We applied PSEA, WLC and CellCODE to identify consensus CI-DEGs in the same fashion as described (Table S20, Fig. S7). As expected, both excitatory and inhibitory neuron proportions are significantly lower in AD samples. Interestingly, the up-regulated excitatory neuronal consensus CI-DEGs are enriched in cation channel activity, which is not observed in inhibitory neurons (Tables S21-S23). These results highlight the ability of this approach to refine transcriptional alterations from bulk brain data to cellular sub-types.

Some consensus CI-DEGs point to AD-related perturbations of key cellular functions for the specific cell type. An example of this is consensus oligodendrocyte CI-DEGs. The down-regulated oligodendrocyte CI-DEGs are enriched for the myelination GO term (GO:0042552) and those that are up in this cell type are enriched in ceramide biosynthetic process (GO:0046513).

Down-regulated oligodendrocyte CI-DEGs include myelin basic protein (MBP) [4], plasmolipin (PLLP) [4, 5], myelin and lymphocyte protein (MAL), and myelin-associated glycoprotein (MAG) [26], even though bulk-DEGs for these genes did not show consistent changes. We [4] and others [5] demonstrated lower levels in AD of co-expression networks of genes implicated in myelination, which is consistent with the present findings from oligodendrocyte CI-DEGs.

Ceramide dysregulation has been implicated in AD [52, 53]. Increased ceramide species were observed in AD and other neuropathological disorders compared to controls [53], and the activation of the neutral sphingomyelinase–ceramide pathway induces oligodendrocyte death [54]. Our present observation from oligodendrocyte CI-DEGs are consistent with these findings.

Another potential utility of the computational deconvolution approach is its complementarity to snRNAseq data. While the latter is able to provide transcriptional data at a cellular level, it can miss information from low abundance cell populations. Indeed, in a recent snRNAseq [19] study of AD and control brain samples, only 121 endothelial cells were identified out of a total of 70,634 cells; and consequently no endothelial DEGs were reported. In contrast, our computational approach identified 140 genes that are up-regulated in endothelial cells in AD and 72 genes that are down-regulated. We further validated the endothelial expression of these genes by confirming their expression in sorted endothelial cells [9]. The significantly enriched GO terms include cytoskeleton organization and translational initiation, for these genes up- and down-regulated, respectively, in endothelia (Fig. 3, Tables S12, S17). These findings highlight the ability of the computational deconvolution approach to provide biological insights that could be missed by snRNAseq approaches.

Despite the biological insights gained from computationally deconvoluted CI-DEGs, they also have some shortcomings. Compared to bulk-DEGs, CI-DEGs between different datasets show less degree of overlap, regardless of the deconvolution algorithm utilized. This highlights the challenge in the field for the ultimate goal of minimizing detection of transcriptional perturbations due to cell proportion changes while maximizing discovery of those that lead to disease. Put differently, CI-DEGs may enhance discovery of true cell-specific transcriptional changes but this may come at the expense of increased false negative findings. In contrast, bulk-DEGs may capture a larger number of perturbed transcripts, but some may be merely due to cell population differences between diseased and healthy tissue. Ultimately, findings from both approaches may provide the greatest utility in detecting true positives.

Another analytic caveat is that cell proportion estimation approaches, including those used in this study yield relative values rather than the absolute levels of cell proportions. Detection of absolute levels in a typical bulk RNAseq experiment will be challenging if not impossible due to different mRNA amounts per cell type and the library preparation protocols (e.g. TruSeq® RNA Sample Preparation v2 Guide) which require similar amounts of cDNA from each sample to be sent to the sequencer. This leads to a normalization which would preclude the detection of the absolute cell proportions, as illustrated in an example in Fig. S8. Despite these challenges, our findings reveal that analytic deconvolution of bulk RNAseq can detect cell-specific transcriptional changes.

Comparison of CI-DEGs deconvoluted from bulk-DLPFC and those detected using an orthogonal approach of snRNAseq [19] from the same cohort (ROSMAP) demonstrates the ability of these computational approaches to identify true cell-specific expression changes in AD. Using our in-house WLC algorithm, there was significant overlap with the results from snRNAseq and CI-DEGs of most cell types (Fig. 4, Tables S18-S19). CellCODE and PSEA also identified significant overlaps, but to a lesser extent, especially for rarer cell types such as microglia. Hence, our WLC algorithm demonstrates at least comparable performance to these two algorithms [13, 14]. This is also supported by the higher degree of overlap among different cohorts for CI-DEGs detected by WLC than the other two algorithms (Fig. 2a). Due to the challenges in deconvoluting noisy data from human series, different computational approaches may be utilized and combined, and that calls for a more devoted effort in developing such algorithms.

In summary, using three distinct computational approaches, we deconvoluted brain bulk-RNAseq data from three large and independent cohorts [8, 12, 18]. We detected cell population changes that are observed consistently across cohorts, and congruent with the known disease pathology. Although there is significant overlap between consensus CI-DEGs and consensus bulk-DEGs, there are more unique CI-DEGs that change in the direction opposite to that of cell population changes. This suggests that CI-DEGs may have utility in detecting disease-related transcriptional changes above and beyond those due to cell proportion changes. Consensus CI-DEGs identify GO terms, including those for hormone activity, myelin biology and channel activity. The enriched CI-DEGs include genes previously implicated in AD or neurodegeneration, such as VGF, CRH, MOBP and MBP, and other novel genes.

Methods

Analysis datasets

We generated the TCX-Mayo data, which consists of temporal cortex RNAseq measurement of 80 AD patients and 28 controls diagnosed according to neuropathologic criteria [4, 12]. RNAseq data were processed and quality control (QC) was conducted as described [4, 12]. ROSMAP DLPFC [7, 8] and TCX-MSBB [18] datasets were downloaded from the AMP-AD Knowledge Portal on Synapse (syn8691134 and syn8691099). We further filtered out non-Caucasian samples and those that had incongruent sex based on provided covariate vs. transcriptome data. All samples were classified as AD or control based on neuropathological data. All TCX-Mayo AD samples had Braak neurofibrillary tangle (NFT) score ≥ 4. TCX-Mayo controls had Braak score ≤ 3 and were without any neurodegenerative disease diagnoses. TCX-MSBB AD samples had Braak ≥4 and CERAD ≥2; and controls had Braak ≤3 and CERAD ≤1. DLPFC AD samples form ROSMAP had Braak score ≥ 4 and CERAD neuritic plaque score ≤ 2. ROSMAP control samples had Braak ≤3 and CERAD neuritic plaque score ≥ 3.

It should be noted that the CERAD [55] neuritic plaque score as applied by the ROSMAP study is defined such that high CERAD indicates lower neuritic plaque burden and decreased probability of AD. In the MSBB study, higher CERAD indicates higher plaque burden.

Raw RNA read counts were normalized using conditional quantile normalization (CQN) method [56] implemented in R cqn package, as previously described [4]. This normalization takes into consideration library size, gene length and GC content. It also performs a log2 transformation so that the resulting distribution for each gene is Gaussian-like. We also determined covariates that contributed significantly to the variation of gene expression in these RNAseq cohorts (Fig. S5) for adjustment in the analyses.

Cell proportion estimation

Digital sorting algorithm (DSA) [16] was applied to estimate cell proportions through R DSA package, function DSA. For each cell type, DSA first computes the average marker gene expression in the analysis dataset, the purpose of which is to construct a variable that better reflects cell proportion variation among subjects. To reduce the effect of outlier expression that is occasionally seen in RNAseq data, we modified the original DSA so that the median instead of mean expression was computed.

CI-DEG analysis for individual cell types

In this study we identified CI-DEGs from deconvoluted bulk RNAseq data using three different algorithms, namely PSEA [13], CellCODE [14] and our in-house algorithm WLC. All analyses adjusted for the following variables: Sex, RIN, age at death and batch for DLPFC and TCX-MSBB datasets, and sex, RIN and age at death for TCX-Mayo dataset (Fig. S5).

PSEA [13] applies model selection procedures to select cell type(s) that should be included in baseline (control) or AD condition, and then estimate differential expression in specific cell types (CI-DEGs). We used the R package PSEA to obtain CI-DEG results of PSEA approach, through functions em_quantvg (to generate candidate models) and lmfitst (to fit all candidate models and pick the best one). Expression values used in PSEA are in linear scale (non log-transformed).

CellCODE [14] assesses overall gene expression difference between AD and control groups with adjustment of relative cell proportion, followed by assigning the cell type that correlates best with the difference (CI-DEGs). R package CellCODE was used to obtain DEG results of CellCODE approach, through functions getAllSPVs (to construct surrogate variable using marker genes through singular value decomposition) and lm.coef (to estimate difference between AD and control groups). Strictly speaking, CellCODE measures the overall differential expression rather than CI-DEGs but identifies the cell type that is most correlated with the observed difference between AD and control using cellPopT function. However, for simplicity, we refer to the DEGs from CellCODE as CI-DEGs in this study. Expression values used in CellCODE are in log scale.

Our in-house method WLC, described in the method section, applies weighted linear regression with constraints and model selection procedures, which also estimates differential expression in specific cell types (CI-DEGs). It guarantees the fitted relative gene expression is non-negative. By weighing the expression, it smooths out the extreme data points. The procedures of this algorithm is illustrated by the following high level pseudo code.

Assume cell type 1,2,3,4,5 are neu, oli, mic, ast and end respectively.

Fit Eq.(1) to identify a set of cell type T in which the gene is significantly expressed.

If the set T is not empty:

Fit Eq.(2) to identify a set of cell type Φ ⊆ T in which the gene is differentiallly expressed

Let Ω = {each cell type in T, Φ, T }

For each element Θ ∊ Ω.

Fit Eq.(3) with adjustment for other covariates Ck(1 ≤ k ≤ m)

Keep Akaike information criterion (AIC) of this model fitting

Identify Θ best that gives the best AIC.

Use the estimated values from the best model and obtain p-value from F-test

$$ {y}_i\sim {a}_0+\sum \limits_{t\in \left\{1,2,3,4,5\right\}}{a}_t{x}_{i,t}+\in, \kern0.5em s.t.{a}_t\ge 0\left(1\le t\le 5\right) $$
(1)
$$ {y}_i\sim {\beta}_0+\sum \limits_{t\in T}{\beta}_t{x}_{i,t}+{d}_i\sum \limits_{teT}\beta {{}_t{}^{\Delta}x}_{i,t}+\in, \kern0.5em s.t.{\beta}_t\ge 0,{\beta}_t+\beta {}_t{}^{\Delta}\ge 0\left(\mathrm{t}\in \mathrm{T}\right) $$
(2)
$$ {y}_i\sim {\gamma}_0+\sum \limits_{t\in T}{\gamma}_t{x}_{i,t}+{d}_i\sum \limits_{t\in \Theta}{\gamma}_t^{\Delta }{x}_{i,t}+\sum \limits_{k=1}^m{\lambda}_k{C}_{i,k}+\epsilon, \mathrm{s}.\mathrm{t}.{\gamma}_t\ge 0\left(t\in T\right),{\gamma}_t+{\gamma}_t^{\Delta }\ge 0\ \left(t\in \varTheta \right) $$
(3)

In Eqs.1, 2 and 3, yi is the observed expression of a gene in subject i; xi, t is the median marker gene expression of cell type t in subject i; Ci, k is covarite k in subject i. In Eq.1, αt is the overall relative expression in cell type t. In Eq.2, βt is relative expression at the baseline condition in cell type t; di = 0 if subject i is in control group, and di = 1 if subject i is in AD group; therefore, \( {\upbeta}_{\mathrm{t}}^{\Delta } \) is the difference of relative gene expression between baseline condition and AD condition in cell type t. Of note, due to the biological meaning these coefficents, they must satisfy constraints such that αt, βt,\( {\beta}_t+{\beta}_t^{\Delta } \) be non-negative. In addition, yi is in linear scale rather than in log scale [57], and these non-log-transformed expression values tend to have extreme data points that need to be weighted down. Based on the above considerations, Eq. 1 was fitted by weighted least square linear regression with contraints, which is implemented in lsei function in R package limSolve. The weight of each observation (wi) is determined by formulae Eqs. 3 and 4 below. Intuitively, if the expression of a gene in a sample is extremely distant from the median expression of all samples in the same diagnosis group, the weight of that sample is smaller than 1 for that gene.

$$ {w}_i=1/\left(1+\frac{\mid {y}_i-\mathrm{median}\left({y}_{AD}\right)\mid }{1+\mathrm{median}{\left({y}_{AD}\right)}^2}\right)\ \mathrm{sample}\ i\ \mathrm{is}\ \mathrm{AD} $$
(4)
$$ {w}_i=1/\left(1+\frac{\mid {y}_i-\mathrm{median}\left({y}_{control}\right)\mid }{1+\mathrm{median}{\left({y}_{control}\right)}^2}\right)\ \mathrm{sample}\ i\ \mathrm{is}\ \mathrm{control} $$
(5)

GO enrichment analysis

Using genes included in the CI-DEG analysis as background genes, p-value for GO term enrichment with consensus CI-DEGs was calculated by “enrichmentAnalysis” function from WGCNA R package [35].

Comparison of CI-DEGs from computational deconvolution vs. snRNAseq

To determine if computational bulk-tissue RNAseq could reveal true CI-DEGs, we downloaded and utilized a published snRNAseq study [19] from frozen DLPFC tissues (snDLPFC) which compared gene expression from 24 individuals with AD-pathology to that from 24 individuals without AD-pathology in six cell types - excitatory neurons, inhibitory neurons, oligodendrocyte, microglia, oligodendrocyte precursor cells and astrocytes. These snDLPFC samples are from the ROSMAP project, of which we analyzed 474 bulk-DLPFC RNAseq data in our current study. Twenty-four (9 AD cases, 15 controls) of the 48 individuals in snDLPFC study are also included in the bulk-DLPFC dataset. Hence both the snDLPFC and bulk-DLPFC are from the same cohort with significant overlap. Three deconvolution methods were included in this comparison – CellCODE [14], PSEA [13] and WLC, our in-house method.

Two types of comparisons were made between the deconvoluted bulk-DLPFC and snDLPFC results. In the first comparison, we ranked the genes by their p-values of differential expression between AD and control subjects, per cell type. We compared the top N up- or down- genes from the snDLPFC study to those identified by each deconvolution algorithm, per cell type. Genes common to both the deconvoluted bulk-DLPFC and snDLPFC were counted and plotted.

In the second set of comparisons, CI-DEGs identified from deconvolution methods at nominal significance (p-value < 0.05) were compared to those identified in the snRNAseq data (p-value < 0.05).

To assess if the observed overlap from each set of analyses is significant with regards to overlap between random selections, we obtained empirical p values from computer simulations described below.

Empirical p-value for the number of overlapping genes

The empirical p-values for the number of overlapping genes between snDLPFC and bulk-DLPFC was obtained using a computer simulation as follows. (A) Let Ssn stand for all genes in snDLFC and Sbulk for all genes in bulk-DLFC; (B) randomly assign up-regulation on each Ssn gene at probability 0.5, and on each Sbulk gene at probability 0.5; (C) randomly pick N genes from up-genes of Ssn , randomly pick N genes from up-genes of Sbulk, and count the number of overlapping genes; (D) Steps B-C were repeated 10,000 times, and the numbers of overlaps (M1, M2, …, M10000) were obtained; (E) Let M be the number of observed overlapping genes, and the empirical p-value is (1 + number of occurrences that Mi >= M)/10,001.

Identification of excitatory and inhibitory neuronal markers

Using a published human brain snRNAseq dataset [19], we identified excitatory and inhibitory neuronal markers using Seurat R package FindMarker function. Excitatory neuronal markers are those that a) are detected in > = 70% of excitatory neurons, b) have average normalized expression in excitatory neurons that are > = 4.5X of that in each of the other cell types (i.e. inhibitory neurons, oligodendrocytes, microglia, astrocytes and OPC), and c) have rank sum test p-value < 0.05 in the comparison of expression levels in excitatory neurons and each of the other cell types. Inhibitory neuronal markers were similarly identified except that we required a less stringent detection limit of > = 50% of inhibitory neurons as the > = 70% threshold yielded too few genes. Excitatory neuronal markers were further refined by requiring their presence in the neuronal markers from BRETIGEA [15]. A total of 53 excitatory and 62 inhibitory neuronal markers were identified (Table S20). Excitatory and inhibitory CI-DEGs were identified as described above.

Conclusions

This study demonstrates the utility of our analytic approach in deciphering cell-specific transcriptional alterations using bulk tissue in a complex disease, provides a comprehensive comparison of our pipeline to existing ones, identifies patterns of cell proportions in AD and control samples across brain regions, discovers novel CI-DEGs with replication across independent cohorts and highlights biological processes with cell-specific expression changes in AD. These findings are expected to refine discovery of molecular therapeutic targets, biomarkers that reflect cellular transcriptional alterations in AD and accelerate generation of more accurate disease models.