Background

Huntington’s Disease (HD) is a devastating neurodegenerative disease caused by an expanded trinucleotide CAG repeat in the HTT gene. The striatum, comprising the caudate nucleus (CAU) and putamen, is the primary affected brain region in HD where as many as 90% of neurons are lost in late stage disease. Although other brain regions, such as the cerebellum and cerebral cortex show the hallmarks of HTT protein intranuclear inclusions, they are relatively free of neurodegeneration [1, 2]. While studying the striatum directly in post mortem HD brains is preferable, the lack of neurons in these highly degenerated tissues makes interpretation difficult. CAU samples from post-mortem human brains of asymptomatic HD gene positive (HD+) individuals, who died before evidence of significant degeneration has occurred, avoid this difficulty but are extremely rare.

Previously, we performed unbiased transcriptomic analysis with high throughput sequencing (mRNA-Seq) in pre-frontal cortex Brodmannn Area 9 (BA9) of twenty HD and forty-nine non-neurological control brain samples[3]. Neuroinflammation and developmental pathways were implicated by the differentially expressed (DE) genes from this study, and there was evidence that every major resident brain cell type (i.e. both neurons and glia) is implicated in HD pathogenesis. However, since all of these individuals were symptomatic and at an advanced stage of disease at the time of death, it was unclear which aspects of the gene expression signature were causes and which were consequences of disease. Examining gene expression from brain tissue of asymptomatic HD+ individuals provides an opportunity to address this key question, as gene expression changes that are present prior to evidence of symptoms and neurodegeneration offer an opportunity to gain insight into initiating disease processes. Furthermore, comparing gene expression changes in BA9 and CAU of the same individuals enables examination of how the changes in a relatively unaffected tissue (BA9) reflect those observed in the primarily affected brain region (CAU).

The Myers lab has obtained brain tissue from BA9 of three asymptomatic HD+ individuals, as well as CAU from two of these same individuals from the McLean Brain Tissue Resource Center (BTRC). These tissues and age and sex matched controls were subjected to mRNA sequencing to assess genome wide alterations in gene expression. The HD+ expression dataset was then compared with our previous HD mRNA-Seq datasets [4], as well as BA9 and CAU mRNA-Seq samples from the Genotype-Tissue Expression (GTEx) database. The goals of this study were to 1) identify DE genes in the CAU prior to clinical onset and neurodegeneration, 2) compare DE genes between BA9 and CAU in HD+ individuals to identify region-specific and common expression patterns, and 3) identify genes involved in the early vs late disease process.

Methods

Human subjects

The individuals in this study are exempt as defined by the Boston University School of Medicine Institutional Review Board, due to the fact that all analyses were conducted on postmortem brain tissue.

Differential expression contrasts

Five different pair-wise contrasts were performed as described in Fig. 1 and Table 1.

Fig. 1
figure 1

Brain region contrasts performed

Table 1 Sample sizes for contrasts performed

This manuscript refers to specific analyses by the corresponding numbers in Fig. 1. Analysis (1) compares BA9 for symptomatic HD and neurologically normal controls. Analysis (2) compares HD+ BA9 with C BA9, identifying DE genes likely implicated in the early disease process. Analysis (3) compares HD+ BA9 with HD+ CAU, identifying DE genes caused either by disease or due to differing brain region. Analysis (4) compares HD+ CAU with C CAU, identifying DE genes implicated by the active HD disease process. Analysis (5) compares GTEx BA9 with GTEx CAU, identifying DE genes caused by difference in brain region, to assist in identifying DE genes identified in analysis (3) that are not simply a consequence of different brain region.

Sample processing

26 symptomatic HD and 49 control BA9 mRNA-Seq libraries were used as previously described [4], and three additional control BA9 samples age-matched for the HD+ BA9 samples presented here were generated. Paired BA9 and CAU tissues from two asymptomatic HD gene positive individuals, one additional asymptomatic HD gene positive BA9 sample, and two CAU samples from neurologically normal controls were extracted and processed to generate mRNA-Seq libraries following the procedure previously described [4]. Table 2 contains a summary of the datasets used in this study, and statistics for the new samples reported in this study are found in Table 3.

Table 2 Sample sizes for each class
Table 3 Sample statistics

Quality control and mRNA abundance estimation

mRNA-Seq libraries were subject to quality control and analysis using a custom pipeline. All sequencing libraries were quality- and adapter-trimmed with trimmomatic [5], and then assessed to be of high quality using fastqc [6] and MultiQC [7]. Trimmed reads were analyzed with salmon [8] to obtain mRNA abundance estimates using the GENCODE v26 gene annotation [9]. Abundance estimates from all samples were concatenated into a single matrix and normalized with the DESeq2 normalization method [10]. The normalized expression matrix was investigated for outlier samples using PCA, where no outliers were found (Additional file 2). Due to the different numbers of samples, genes in each analysis were filtered using different strategies. For analyses 1, 2 and 5, genes with more than 50% zero counts within each group was filtered out. For analyses 3 and 4 genes with more than 2 zeros and genes with more than 4 zeros was filtered out, respectively. Therefore, the genes detected for each analysis were different as seen in Table 1.

GTEx Analysis of BA9 vs CAU

Post mortem human brain samples from BA9 and CAU brains available from the GTEx project [11] were downloaded and processed as above. After processing the samples through the quality control pipeline described above, 56 samples were removed due to differences in per base Sequence Content, over representation of sequences, or discrepancies in read length, leaving a total of 90 BA9 samples, 102 CAU samples for analysis. These 192 samples were used to form the basis of a contrast between HD+ BA9 and HD+ CAU samples.

Differential expression analysis

Five differential expression contrasts were conducted in this study as described in the analysis matrix of Table 1 and Fig. 1. Differential expression statistics for all five analyses were assessed using DESeq2 [10], modeling counts as a function of either disease status or brain region, adjusting for age at death and sex. Differentially expressed genes were considered significant if they had FDR <0.05.

Gene set enrichment

Gene set enrichment analysis for all DE gene lists was performed using the fgsea [12] R package in bioconductor [13] and MSigDB C2 Canonical Pathway database version 6.2 [14, 15]. GSEA statistics were calculated using each gene list sorted by descending log2 fold change, and significance was assessed for gene sets at FDR <0.05.

Random forest predictive model to classify HD+

A random forest of decision trees was used to classify the HD+ CAU and BA9 samples as either HD BA9 or C BA9. A decision tree is a predictive model that iteratively bifurcates a set of labeled samples by identifying features (e.g. genes) that have predictive power when partitioning samples by a fixed threshold. The decision tree algorithm is a machine learning technique that is used to identify features and their levels that best partition a sample set according to given labels. For example, if gene A is expressed between 10 and 20 in one set of samples and between 30 and 40 in another, samples with an expression value less than 25 are likely to belong to one class, while samples with an expression value greater than 25 will belong to the other. If a single gene cannot perfectly divide samples into their labels, additional genes are chosen in a hierarchical fashion until samples with different labels are perfectly partitioned. Once a decision tree has been trained, it may be used to classify previously unobserved samples into the labels used in training.

Individual decision trees trained with all samples are often over-fit, so a randomization technique called random forests are used with decision trees to identify robust predictive features. Random forests perform bootstrap sampling on samples and random selection of features to build a large number of decision trees, where each is a different predictive model with different sets of features. After training, unobserved samples are applied to each decision tree in the forest and the predicted label of each is recorded and reported. The agreement of predicted labels across all trees in the forest is an indication of the predictive power of the overall dataset. A random forest where all trees classify a new sample to have the same label indicates a perfect classification. A random forest predicting a sample to be of either class with equal frequency has no predictive power.

We trained a random forest of decision trees using the HD BA9 vs C BA9 normalized counts matrix to arrive at a predictive model of genes that well classify the samples. The random forest was trained using cross-validation, where the samples are divided into training and test sets. Decision trees in the forest are built using the training samples and their predictive accuracy is assessed on the test set. In this way, cross validation enables assessing the robustness of a classifier and avoids over-fit predictive models. Each random forest contained 20,000 trees, 250 genes, 75% training sets were created with ratios of HD and Control samples which mirrored ratios present in the dataset. Prediction accuracy was assesses as the mean true positive predictions divided by the number of trees across all samples in the test set. 1000 random forests were trained in this way, and the average and standard deviation of true positive rates were recorded for each. See Table 4 for statistics on cross validation prediction accuracy.

Table 4 Accuracy results

HD+ CAU and HD+ BA9 samples were applied to the random forests trained above. First, random forests built with the top 250 DE genes from (1) ranked by significance were used to predict the HD+ samples as either HD or C. We then built random forests with 250 randomly selected genes from (1), irrespective of significance, and performed classification of the HD+ samples. Finally, we built random forests with permuted sample labels and randomly selected genes to assess the basal predictive power under a null dataset. The results of these randomized random forest classifiers is included in Table 8.

t-statistic analysis of DESeq2 log fold changes

We sought to identify genes that show different response between HD+ CAU and HD+ BA9 (3 vs 5) and early response genes from HD+ BA9 vs HD BA9 (1 vs 2). Although an interaction model estimating the difference in effect size between BA9 and CAU by HD status is a more conventional approach, the small number of samples in the HD+ group (i.e. 2 HD+ CAU vs 3 HD+ BA9) prohibits reliable fitting in this case. Therefore, we utilized a t-statistic methodology to quantify the difference between DESeq2 log2 fold change estimates while taking the uncertainty those estimates into account. DESeq2 implements a negative binomial generalized linear model, whose estimated coefficients are normally distributed. DESeq2 also reports the standard error of its log2 fold change estimates, enabling the calculation of a t-statistic corresponding to the confidence-adjusted difference in log2 fold change. Specifically, we calculate a t-statistic assuming both unequal sample sizes and unequal variance:

$$t = \frac{X_{1} - X_{2}}{s_{\Delta}} $$

where

$$s_{\Delta} = \sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}. $$

Here, X1 and X2 are the log2 fold change estimates from each comparison (e.g. (3) vs (5)), and s1 and s2 are the corresponding standard deviation estimates (i.e. standard error multiplied by the square root of the number of samples) as reported by DESeq2. n1 and n2 are the number of samples total used for each analysis (e.g. for (3) vs (5), n1=2+3=5 and n2=90+102=192, see Table 1). When assessing significance, the degrees of freedom is calculated using the Welch-Satterthwaite equation:

$$\texttt{d.f.} = \frac{\left (\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}} \right)^{2}} {\frac{(s_{1}^{2}/n_{1})^{2}}{n_{1}-1} + \frac{(s_{2}^{2}/n_{2})^{2}}{n_{2}-1}}. $$

Comparison of DE gene lists and enriched gene sets

t-statistics were calculated as described above for (3) vs (5) and (1) vs (2), where positive t corresponded to an increased log2 fold change in HD+ CAU over HD+ BA9 and HD+ BA9 over HD BA9, respectively. For (3) vs (5), genes were then ranked by descending t-statistic and analyzed for gene set enrichment with fgsea [12]. These GSEA results were then combined with those calculated for (1), (2), and (4). Significantly enriched gene sets at FDR <0.05 were manually curated into 10 high level functional categories: Angiogenesis/Blood Brain Barrier (BBB), Apoptosis, Cell Cycle/Development, Cytoskeleton/Extracellular Matrix (ECM), Immune Response/Cancer, Metabolism, Neuron System, Protein Folding/Other, Signaling, and Transcription/Translation. Each gene set was also categorized into so-called Agreement Classes, an ordinal scale representing how specific the gene set is to HD+ CAU, as described in the “Results” section. Combination of GSEA results, curation of gene sets, calculation of agreement classes, and plots from Fig. 6 were made using python, jupyter lab, pandas, and matplotlib python libraries.

Gene sets within each functional category were also cast as a graph, where each node is a gene set, and edges between nodes were drawn if the gene sets shared more than 25% of their leading edge genes. Graph analysis was performed using python, networkx, and matplotlib. All analysis and figure code for this project are available at bitbucket.org/bubfnexus/asymptomatic_hd_mrnaseq .

Results

HD BA9, HD+ BA9, and HD+ CAU show concordant DE genes

Figure 2 contains differentially expressed (DE) gene metrics for analyses (1), (2), and (4). In Fig. 2a, we see that the fold change distribution is similar between all three analyses, where more genes have increased expression overall than decreased and that this is particularly evident in the HD+ versus control analyses. The overlap of DE genes at FDR <0.05 in Fig. 2b shows that analyses (1) and (4) are more similar to each other than to (2). Figure 2c depicts the similarity in log 2 fold change (L2FC) for the HD+ versus C in BA9 with HD versus C in BA9 (top figure) and the HD+ versus C in CAU with HD versus C in BA9 (bottom figure) for DE genes at p<0.05 in both groups. These two figures show the extent of similarity of L2FC across these different contrasts. It is interesting to note that the symptomatic HD BA9 expression profile is well correlated with the HD+ versus C in CAU (Spearman ρ=0.55) and consequently the HD BA9 appears to be a good model for early disease effects in HD. This concordance is particularly remarkable when considering that the numbers of samples in the HD+ vs C analyses are extremely small.

Fig. 2
figure 2

DE Genes from (1), (2), and (4). a Histograms of log2 fold changes b 3-way Venn diagram of significant DE genes c Scatter plots of log2 fold change in DE genes in (1) vs (2) and (1) vs (4) with Spearman ρ

The overlapping DE genes in Fig. 2b provide insight into both common gene signatures across brain regions and disease state as well as those unique to individual conditions. Table 5 contains the DE statistics for the 19 genes found in the intersection of analyses (1), (2), and (4). These genes are perturbed across the entire disease course, from the HD+ BA9, which is the least affected tissue, to the most severely degenerated HD BA9 samples. All of these genes implicate the neuroinflammatory and neuroimmune responses, and seven of the 19 genes (BAG3, HSPA6, HSPB1, SERPINH1, DNAJB1, HSPA1A, HSPA1B) have direct roles in the heat shock response. As expected, the genes from Table 5 are highly enriched for unfolded protein binding, molecular chaperones and focal adhesion, heat shock response, apoptosis, and response to oxidative stress.

Table 5 Common response genes in HD BA9, HD+ BA9, and HD+ CAU, corresponds to middle intersection of Venn diagram in Fig. 2b

Figure 3 contains normalized counts distributions for each sample group for the 19 common DE genes from Fig. 2b. From left to right in each plot are counts from GTEx BA9, GTEx CAU, C BA9, C CAU, HD BA9, HD+ BA9, and HD+ CAU sample groups. Since there are so few C CAU samples in this study (i.e. only 2), we include the GTEx CAU counts (102 samples) to illustrate that our C CAU counts are well within the expected range for these genes. In every case except ANGPT2, the mean expression level increases from HD BA9 to HD+ BA9 to HD+ CAU. This increase in expression is particularly large for HSPA6, which shows a 256 fold abundance increase in HD+ CAU vs C CAU. Since the HD+ BA9 samples are the least affected tissues of the three disease groups, it is interesting to note that the asymptomatic HD+ BA9 samples show higher expression than the symptomatic HD BA9 samples overall for these genes.

Fig. 3
figure 3

Boxplot of 19 common response genes in all analyses, corresponds to middle intersection of Venn diagram in Fig. 2b. Base mean columns are the mean normalized counts from all samples. The first two boxes correspond to GTEx BA9 and CAU, respectively, followed by the C BA9 and C CAU from this study. The last three in each plot depict HD BA9, HD+ BA9, and HD+ CAU, respectively

The 26 genes that are uniquely DE in (4) from Fig. 2b (in green segment) appear in Table 6. These genes show only weak functional enrichment for extracellular space compartment, and plasma membrane, but we make two remarkable observations. First, several genes are consistent with the heat shock and inflammatory response observed in the common DE genes and in (1) more broadly, including HSPH1, CCL19, and CX3CR1. Second, four of the genes are readthrough transcripts (RPS10-NUDT3, UBE2F-SCLY, RPL17-C18orf32, and RP5-850E9.3) that originate from different chromosomes.

Table 6 Unique response genes in HD+ CAU, corresponds to only the green area diagram in Fig. 2b

HD+ vs C CAU enriched pathways are a subset of those in HD BA9

We next performed gene set enrichment analysis on each DE gene list to identify associated biological functions. Figure 4 contains the result of gene set enrichment analysis from analyses (1), (2), and (4) using the GSEA [14] algorithm as implemented in the fgsea R package [12] against the MSigDB C2 Canonical Pathway gene set database[14, 15]. Analysis (1), which has the most power to detect DE genes identifies 195 significantly enriched pathways at FDR <0.05. All 13 of the pathways identified in the HD+ CAU versus C CAU are among these 195 of (1), and eleven of these are also seen in the HD+ BA9 versus C BA9. The substantial overlap of the enriched pathways suggests that the most highly perturbed pathways in the prodromal phase of disease expression are also detected in late stage HD BA9. Only seven pathways, seen in the HD+ BA9 versus C BA9 are not also seen in (1). Table 7 lists the 16 gene sets that are significantly enriched in either both (1) and (4) (9 gene sets) or are unique to (2) (7 gene sets). Consistent with our previous work [4], the enriched pathways heavily implicate an increase in neuroimmune and neuroinflammatory response, an increase in transcriptional activity, and a decrease in neuron-related pathways.

Fig. 4
figure 4

GSEA results for DE genes from analyses (1), (2), and (4). Figure shows overlap of significantly enriched MSigDB C2 Canonical Pathway gene sets at padj<0.05 irrespective of direction of effect. Selected gene sets are included in Table 7 and the full results are in Additional file 4

Table 7 Significantly enriched pathways in intersection of (1) and (4) or unique to (2) from Fig. 4

HD BA9 DE genes perfectly predict disease state in HD+ CAU

Figure 5 shows the normalized counts from all HD, HD+, and C samples for the top 200 genes found to be DE in analysis (1) as a heatmap. A distinctive result from our previous HD work [4] was that a set of homeotic genes, most notably the HOX gene clusters, were selectively increased in HD compared with C. By inspection, HD+ CAU appears to demonstrate similar homeotic gene expression to HD, suggesting that the disease process in asymptomatic caudate does indeed resemble symptomatic cortex in these samples. HD+ BA9 expression in these genes is less pronounced and more closely resembles C samples, further supporting the hypothesis that the effect of disease on BA9 is reduced in HD+ individuals. The results suggest that HD+ CAU is more similar to HD BA9, and HD+ BA9 is more similar to C.

Fig. 5
figure 5

Clustered heatmap of normalized counts of top 200 genes from (1). Row clusters were created manually by inspection guided by clustered dendrogram, and enriched biological pathways (or the genes themselves) for the genes in each cluster are listed as indicated. Color label: HD+ CAU - orange, HD+ BA9 - yellow, HD BA9 - red, C BA9 - blue, C CAU - light blue

We sought to perform a more unbiased analysis to better quantify the similarity of the HD+ samples to either HD or C by training a random forest decision tree classifier on the HD and C samples. Briefly, a decision tree classifier identifies key features (in our case these are genes) that partition labeled samples (here either HD or C) into like groups using a threshold cutoff for each gene. A decision tree built using a dataset can then be used to predict the class of new samples that were not used to build the tree. To avoid overfitting, the random forest algorithm generates many different decision trees by randomly sampling samples and genes with replacement many times. When applied to a new sample, the output of a random forest decision tree classifier is the number of trees that predicted the sample to have each label. A random forest where all trees classify a new sample to have the same label indicates a perfect classification. A random forest predicting a sample to be of either class with equal frequency has no predictive power. See the “Methods” section for more details on the random forest decision tree algorithm.

After creating the random forest based on the top 250 significant genes in (1), the forest was used to predict the sample type of each HD+ BA9 and CAU sample. The results of the classifier are in Table 8. Several aspects from the random forest results are of note. First, the random forest perfectly classified both the HD+ CAU samples as symptomatic HD BA9, supporting the intuition built from the heatmap in Fig. 5 (Table 8(a)). Second, the HD+ BA9 samples were evenly split between being predicted as HD BA9 and C BA9 (Table 8(a)). This suggests that there are some genes in the HD+ BA9 samples that resemble symptomatic HD BA9, and others that more closely resemble control BA9. We will explore this difference in greater detail in the last section. Third, there is high prediction consistency for HD+ CAU even when choosing 250 genes randomly from (1), and a greater agreement in classifying HD+ BA9 as C BA9 (Table 8(b)). These results suggest that the DE signal for HD+ CAU and in HD BA9 is strong and genome wide, and are consistent with the hypothesis that HD+ BA9 represents a less severe form of the same response as in HD+ CAU and HD BA9. Last, when the model is fully randomized (i.e. random genes and shuffled labels from (1), Table 8(c)), classification consistency is essentially random, consistent with our expectation of the model. Taken together, this unbiased classification analysis supports the hypothesis that changes in BA9 after symptoms have appeared are reflected in the asymptomatic HD+ caudate and, to a lesser degree, in HD+ BA9.

Table 8 Random forest decision tree classifications of HD+ using genes from (1)

Gene expression patterns and pathways unique to HD+ CAU

Understanding the factors that cause the caudate to degenerate first in HD is critical in understanding the HD disease process. Due to the small number of HD+ and C CAU samples (2 and 3, respectively), the DE statistics for this direct comparison is likely to be highly influenced by noise, as evidenced by the small number of enriched gene sets in this comparison seen in Fig. 4. In addition, directly comparing HD+ CAU and HD+ BA9 may reveal differences between brain regions that are not prominent when comparing the results of corresponding pairwise comparisons. We therefore devised a statistical strategy to identify genes and pathways that are as robust and specific to HD+ CAU as possible. We therefore devised a statistical strategy to identify genes and pathways that are as robust and specific to HD+ CAU as possible.

Since disease status and brain region are convolved in the DE genes identified in (3), we sought to identify genes that differ between CAU and BA9 due to the disease process, and not due to differences in brain region. To accomplish this, the DE results from (3) and (5) were compared by computing a t-statistic of the difference in log2 fold change estimates and their standard errors reported by DESeq2 (see “Methods” section). In essence, this statistical procedure quantifies the difference in log2 fold change of genes when comparing HD+ CAU versus HD+ BA9 while de-emphasizing genes that are different due to differences in brain region. The resulting statistics allow genes to be ranked by the degree of relevance to the disease process in HD+ CAU. Table 9 contains the top 10 genes ranked by descending absolute value of the t-statistic to illustrate this strategy. For example, CFAP157 is increased 19.6 (24.3) fold in GTEx CAU over GTEx BA9, but is decreased by 1.09 (2−0.13) fold in HD+ CAU over HD+ BA9, resulting in an difference in fold change of -4.43 (i.e. −0.13−4.3=−4.43). TVP23C-CDRT4, another readthrough transcript, is essentially unchanged in GTEx CAU compared with GTEx BA9, but is increased 5.85 fold in HD+ CAU over HD+ BA9 (2.55−(−0.02)=2.57).

Table 9 Top 10 genes that show different effect sizes (L2FC) between (3) and (5)

The resulting t-statistics from this analysis induced a ranking of genes that were then subjected to gene set enrichment analysis against the MsigDB C2 Canonical Pathway gene set database. The analysis identified 405 significantly enriched gene sets at FDR <0.05, and all but one of these gene sets were positively enriched, indicating that genes increased in HD+ CAU relative to HD+ BA9 have strong functional coherence (full fgsea results in Additional file 3). These results were combined with the enriched gene sets from (1), (2), and (4) and subsequently divided into so-called Agreement Classes based on the pattern of significance across all four analyses. The Agreement Class is an ordinal indicator for the degree of HD+ CAU-specificity as follows. CAU Unique gene sets are only seen in HD+ CAU relative to HD+ BA9 (i.e. (3) vs (5)). CAU Enhanced are enriched gene sets in HD BA9 vs C BA9 (1) as well as in either HD+ CAU vs C CAU (4) or HD+ CAU relative to HD+ BA9 ((3) vs (5)). Finally, BA9 Unique only show enrichment in HD BA9 vs C BA9 (1). To aid in interpretation, the gene sets were manually curated into 10 high level functional categories: Angiogenesis/Blood Brain Barrier (BBB), Apoptosis, Cell Cycle/Development, Cytoskeleton/Extracellular Matrix (ECM), Immune Response/Cancer, Metabolism, Neuron System, Protein Folding/Other, Signaling, and Transcription/Translation. To illustrate these ideas, a heatmap of the enriched gene sets related to the Neuron System is in Fig. 6a.

Fig. 6
figure 6

a Enriched gene sets related to the Neuron System. The first four columns plot Normalized Enrichment Score (NES) of GSEA analyses from (1), (2), (4), and (3) vs (5), respectively, where red and blue correspond to positive and negative NES scores, respectively. The fifth column indicates the Agreement Class of each gene set, assigned according to HD+ CAU-specificity. Cells with white dots indicate that gene set is significantly enriched in the corresponding analysis. b Overlap of significantly enriched gene sets regardless of category. The gene sets enriched in (4) are a subset of those in (1), and thus are not listed. c Distribution of gene sets by agreement class divided into ten high level functional categories, showing that some functions are more selectively enriched in HD+ CAU relative to BA9 than others. d Graph-based representation of the Cell Cycle/Development gene sets from C. Each node is a gene set, and nodes with connected edges share more than 25% of their leading edge genes, thus representing the same expression signal

As seen in Fig. 6b, 306 out of 405 significantly enriched gene sets are unique to CAU relative to BA9. The distribution of these unique gene sets varies by biological process (Fig. 6c), where processes related to Cell Cycle/Development, Metabolism, Neuron System, and Protein Folding/Other show the greatest proportion of CAU-unique gene sets. This is in contrast to Angiogenesis/BBB, where most of the gene sets are seen in both CAU relative to BA9, and in BA9 independently. Gene sets related to Apoptosis, Cytoskeleton/ECM, Immune Response/Cancer, Signaling, and Transcription/Translation have a mixture of CAU unique, CAU enhanced, and BA9 specific gene sets. When we examine the enriched gene sets from Cell Cycle/Development more closely using a graph-based representation (Fig. 6d), we observe that there are two distinct groups of genes enriched separately in CAU vs BA9 and BA9 itself. In particular, BA9 is enriched for a set of genes relating to meiosis, whereas the CAU unique processes involve mitosis. Heatmaps and graph representations of all other categories are included in Additional file 1. Overall, this comparison of enriched gene sets in HD+ CAU relative to BA9 with the other two brain regions identifies the common and different cellular processes that are active in different brain regions.

Comparing DE gene lists identifies early vs late responding genes

In the random forest analysis discussed above, we noted that the HD+ BA9 samples were classified either as HD BA9 or C BA9 with approximately equal frequency. This suggests that there are some genes with an expression pattern that resembles HD BA9 and some that are yet unaffected in asymptomatic HD+ BA9. Thus, the genes that are consistent between HD+ BA9 and HD BA9 are genes that may form an early response in HD, whereas the genes whose expression differs from HD BA9 might still be intact and only respond later in the disease. We sought to identify which genes were early vs late responders by applying our t-statistic strategy comparing log2 fold changes between analyses (1) and (2). Table 10 contains results from the t-statistic based analysis of (1) and (2).

Table 10 Genes partitioned by significance within analyses (1) and (2) and fold change difference between these analyses

Of particular note are the 215 genes that are DE in (2) and have fold changes different from (1). These are the genes that may reflect early disease processes not identifiable in symptomatic individuals post mortem. We extracted these 215 genes and plotted their log fold changes to examine the relationship between groups as depicted in Fig. 7. Genes in quadrants I and III of the figure are genes that show the same direction of effect (i.e. up or down) but have a different size of effect. Genes in quadrants II and IV are genes that show differential behavior, and are thus potentially unique responses early in the disease process that are not observed in symptomatic HD BA9.

Fig. 7
figure 7

Scatter plot of fold changes from 215 early response genes from (1) and (2)

The statistics from the genes in quadrants II and IV, as well as the 4 additional genes detected in (2) that were filtered out of (1) due to low counts are listed in Tables 11 and 12. Most of the genes that are down regulated in (2) with respect to (1) are ribosomal protein genes that are essentially absent from HD+ BA9. The two genes that are massively increased in (2) but decreased in (1) are NPAS4, Neuronal PAS Domain Protein 4, and NEAT1_2.

Table 11 Putative early response genes in HD+ BA9 from Fig. 5 quadrant II
Table 12 Putative early response genes in HD+ BA9 from Fig. 5 quadrant IV

Discussion

To the authors knowledge, this is the first genome-wide transcriptome analysis of post-mortem asymptomatic HD+ BA9 and CAU. It is also the first systematic comparison of post-mortem symptomatic (HD) BA9 with asymptomatic (HD+) BA9 and CAU gene expression. Differential expression (DE) analysis identified many genes that show altered abundance between diseased and control tissue across brain regions, and there is a high degree of concordance in the direction of effect for these genes. The genes that are commonly DE in HD BA9, HD+ BA9, and HD+ CAU are strongly enriched for heat shock response, while the DE genes specific to HD+ CAU contain some heat shock elements and read-through transcripts. Gene set enrichment results show a high degree of agreement between these three analyses.

The analysis comparing HD+ CAU to HD+ BA9, and filtered using GTEx data for genes specific to brain region, identified a strikingly large number of significantly enriched gene sets that suggest processes related to Cell Cycle/Development, Metabolism, Neuron System, and Protein Folding are the most uniquely perturbed in the HD+ CAU disease process. Overall this analysis suggests that while a large proportion of disease processes are shared between CAU and BA9, there are distinct and important sets of genes perturbed in each brain region related to the disease process. Nonetheless, when the HD+ samples are classified using a random forest classifier built using the symptomatic BA9 samples, there is complete consensus that HD+ CAU most closely resembles HD BA9, while HD+ BA9 resembles aspects of both diseased and control brain. The homeotic and inflammatory gene signatures appear to be equally present in the HD+ CAU and HD BA9, suggesting a similar process affects the cellular milieu in both brain regions.

Finally, we identified key genes that appear to be early responders to the disease process by comparing HD+ BA9 and HD BA9. HD+ BA9 appears to be the least affected brain region of the three studied here; therefore, genes that show different behavior in these regions are likely to be part of an early response that is lost as the disease process progresses. The two genes in particular that are increased in HD+ BA9 relative to HD BA9 are NPAS4, which as been implicated in the cortex of mouse models of HD [16] and NEAT1, which has been shown to be associated with neuronal hyperactive state [17]. A second group of poorly annotated but consistently expressed genes seems to be uniquely expressed in HD BA9 and may be evidence of severe transcriptional dysregulation previously observed in this tissue[3, 4].

The authors caution that, due to small sample size, the results from HD+ BA9 and CAU should be best interpreted as trends warranting further study. Nonetheless, despite the small HD+ sample size, the consistency between the HD+ and HD BA9 results supports the robustness of these findings. Not only do the overall effect size and enriched pathway signatures agree to a great extent, many of the biological processes implicated are well supported in the literature. Immune response has been heavily implicated in HD and neurodegenerative disease in general[3, 4, 1822], and the broad agreement between the diseased tissues across brain regions in this study lends support to the role of inflammation in the prodromal HD brain. Of particular note is the common heat shock response observed in the common DE genes in all comparisons with control. The heat shock system is primarily responsible for maintaining proteostasis and protein conformation during times of stress, and has been directly implicated in both animal [23] and in vitro [24] models of Huntington’s disease. The fact that expression of key heat shock genes appears to be perturbed across the entire disease course is strong evidence of the important role these proteins play in disease.

The genes found to be perturbed across the entire disease course provide an opportunity to develop prognostic tests for HD progression. It is interesting to note that the mRNA abundance of the commonly perturbed genes is often increased in HD+ BA9 relative to HD BA9, suggesting that protein levels of these genes might be usefully measured in a longitudinal setting to track disease progression. HSPA6 in particular shows a massive induction in HD+, and is an attractive first target for follow-up in a clinical setting. However, this gene is but one of thousands implicated by this analysis, and the authors urge readers to avoid the temptation to focus too specifically on any one gene reported here. A robust clinical test measuring disease progression will likely take the form of a panel of key inflammatory and possibly developmental genes measured longitudinally from a peripheral source such as blood or cerebrospinal fluid.

The differences revealed between HD+ CAU and HD+ BA9 may offer insight into why the striatum is uniquely vulnerable to neurodegeneration. The enriched functional categories that are the most specific to HD+ CAU include Metabolism and Cell Cycle and Development. Interestingly, when the cell cycle gene sets are examined closely (Fig. 6d), we observe that the gene sets uniquely enriched in HD+ CAU are related to mitosis, while the smaller number enriched in BA9 involve meiosis. The striatum, unlike the cortex, has a resident population of neuroblasts that enables neurogenesis in the adult human brain[25]. A recent hypothesis has proposed that these neuroblasts are impaired in HD, resulting in a lack of repleneshing neurons over time and eventual destruction of tissue [26]. The unique presence of increased mitotic gene expression, paired with the observation that many neuronal pathways are also increased in HD+ CAU compared with HD+ BA9, is strong evidence that neurogenesis is indeed active in this region prior to symptom onset. However, it still remains to be shown why these specific neurons degenerate in the first place, and why this neurogenesis ceases over time. The enrichment of meiosis in BA9 is curious, and does not lend an immediate interpretation. One possibile explanation is that the same signals that trigger neurogenesis in CAU are also present in BA9, but that cortical neurons lack neurogenic capabilities and ’misfire’ in response to the developmental signals. An intriguing feature of the HD BA9 samples is the expression of homeotic and developmental genes, which might be a consequence of a neuron that is trying to regenerate but cannot.

Given the extreme rarity of HD+ CAU samples, it is difficult to conceive of validation experiments to test these findings given our current disease models. The combined complexity of the central nervous and immune systems makes accurate models of human HD challenging to devise, since there is clear involvement and interaction of major players in both of these systems. Given the immense complexity of both the human HD+ and HD phenotype, the authors also urge caution in drawing strong mechanistic lines between any small number of genes reported here and the disease process. With the exception of the few putative early responder genes identified, it is essentially impossible to separate the causes and effects of neurodegeneration from this inherently observational study. Only a few of the many findings in this study have been discussed in this manuscript, and much greater insight may likely be gained from further examination by those specializing in different aspects of the biology implicated here. These results are therefore put forward as a source of hypothesis and inspiration for new models and avenues of research.

Conclusion

These results provide clear evidence that the caudate nucleus is strongly affected by the HD disease process prior to the appearance of any symptoms. The common gene expression patterns observed in asymptomatic HD+ CAU compared with symptomatic HD BA9 supports the hypothesis that many aspects of the disease process are shared between brain regions, but that the CAU experiences these changes at a much earlier stage of disease than BA9. Of particular importance is the set of heat shock related genes that is commonly perturbed across all HD samples. Despite these similarities, a much broader set of gene expression changes is observed specifically in HD+ CAU, implicating unique disease processes that may help explain why this brain region is particularly sensitive to degeneration. Several pieces of evidence suggest that neurogenic signals are present in both CAU and BA9, but that the response to these signals differentiates between region. The nature and significance of this putative neurogenic signal remains to be shown, and should be investigated further along with the other CAU-specific gene expression patterns found by this study.