Introduction

Substantial clinical and pathological variability has been reported in patients carrying an expanded repeat in the C9orf72-SMCR8 complex subunit (C9orf72) [58], which leads to frontotemporal dementia (FTD) and amyotrophic lateral sclerosis (ALS) [14, 50]. While FTD is the second most frequent cause of dementia in the presenile group, ALS is the most common form of motor neuron disease (MND). Intriguingly, there is considerable clinical, genetic, and pathological overlap between FTD and ALS. In fact, up to 40% of FTD patients demonstrate motor neuron involvement [7, 44]. Similarly, up to 50% of ALS patients have cognitive impairment and 15% fulfill the FTD criteria [17, 46]. Mutations in several genes appear to be specific for either FTD or ALS (e.g., superoxide dismutase 1 [SOD1]); however, most have been detected in both diseases, like the repeat expansion in C9orf72. Furthermore, TAR DNA-binding protein 43 (TDP-43) inclusions can be observed in approximately 50% of FTD patients and more than 90% of ALS patients [43, 44]. Given this overlap, FTD and ALS are thought to represent a disease spectrum.

The repeat expansion in C9orf72 accounts for about 30% of familial cases and 5–10% of sporadic cases [41, 58]. Proposed disease mechanisms include a reduction in C9orf72 expression [14], the aggregation of flawed RNA transcripts in the nucleus of cells (RNA foci) [14], and the formation of repetitive proteins aberrantly translated from the expansion (dipeptide repeat [DPR] proteins) [4, 42]. The C9orf72 protein itself is known to interact with endosomes and functions in vesicle trafficking [18, 56].

Thus far, a limited number of studies have been performed to investigate the expression pattern of C9orf72-linked diseases. We have, for instance, profiled brain tissue of C9orf72 expansion carriers using expression arrays, which uncovered an upregulation of transthyretin and homeobox genes [19]. In an RNA sequencing study, we also examined differential expression, alternative splicing, and alternative polyadenylation in ALS patients harboring a C9orf72 expansion [47]. We detected widespread transcriptome changes in the cerebellum, particularly of RNA-processing events [47]. Furthermore, we observed elevated levels of repetitive elements (e.g., long interspersed nuclear elements [LINEs]) in patients with a C9orf72 repeat expansion [48]. Several other studies also investigated expression patterns distinctive of an expanded repeat in C9orf72 by examination of laser-captured motor neurons, lymphoblastoid cell lines, fibroblast and induced pluripotent stem cell (iPSC) lines, iPSC-derived motor neuron cultures, and/or postmortem motor cortex tissue from C9orf72 expansion carriers [11, 16, 30, 52, 54].

Despite these efforts, the majority of the clinico-pathological variability remains unexplained in C9orf72 expansion carriers. As such, we have performed an in-depth RNA sequencing study on frontal cortex tissue from a well-characterized cohort. We evaluated individuals who received a pathological diagnosis of frontotemporal lobar degeneration (FTLD) with or without MND as well as control subjects stored at the Mayo Clinic Florida Brain Bank (n = 102). In addition to differential expression and co-expression analyses, we used various analytical approaches within the group of C9orf72 expansion carriers to identify genes associated with clinical and pathological features of C9orf72-related diseases. Our findings provide additional evidence for the involvement of vesicle-mediated transport and reveal several potential modifiers of C9orf72-linked diseases.

Materials and methods

Subjects

Subjects were selected for whom frozen brain tissue was available in our Mayo Clinic Florida Brain Bank (n = 102; Table 1). Frontal cortex tissue was collected from the middle frontal gyrus at the level of the nucleus accumbens. We included C9orf72 expansion carriers (n = 34) pathologically diagnosed with FTLD characterized by TDP-43 inclusions (FTLD-TDP) in the presence or absence of MND, patients with FTLD-TDP or FTLD/MND without known mutations (type A or B; n = 44), and control subjects without neurological diseases (n = 24). Our C9orf72 expansion carriers had a median age at death of 69 years (interquartile range [IQR]: 62–76), a median RNA integrity number (RIN) of 8.9 (IQR: 8.4–9.5), and 35% were female. For patients without a repeat expansion, the median age at death was 78 years (IQR: 68–83), their median RIN value was 9.6 (IQR: 9.1–9.8), and 50% were female. The median age at death of control subjects was 87 years (IQR: 78–89) with a median RIN value of 9.1 (IQR: 8.8–9.6), and 67% were female. Of note, in previous studies, we had already obtained the expansion size, RNA foci burden, and DPR protein levels for the majority of our expansion carriers [13, 21, 57]. Methylation levels of the C9orf72 promoter were determined using 100 ng of DNA as input material with a quantitative methylation-sensitive restriction enzyme-based assay, as described elsewhere [40, 51].

Table 1 Subject characteristics

RNA sequencing

Total RNA was extracted from frozen brain tissue using the RNeasy Plus Mini Kit (Qiagen). RNA quality and quantity were determined with a 2100 Bioanalyzer Instrument (Agilent) using the RNA Nano Chip (Agilent); only samples with a RIN value above 7.0 were included. Libraries were made using the TruSeq RNA Library Prep Kit (Illumina; v2) and sequenced at 10 samples/lane as paired-end 101 base-pair reads on a HiSeq 4000 (Illumina) at Mayo Clinic’s Genome Analysis Core. Subsequently, raw sequencing reads were aligned to the human reference genome (GRCh38) with Spliced Transcripts Alignment to a Reference (STAR; v2.5.2b) [15]. After alignment, library quality was assessed using RSeQC (v3.0.0) [60], and gene-level expression was quantified using the Subread package (v1.5.1) [37]. All analyses described below were performed in R (R Core Team; v3.5.3).

Differential expression analysis

We used conditional quantile normalization (CQN) to account for differences in gene counts, gene lengths, and GC content, resulting in comparable quantile-by-quantile distributions across samples [24, 49]. Genes were kept if their maximum normalized and log2-transformed reads per kb per million (RPKM) values were above zero (n = 24,092). Using linear regression models, source of variation (SOV) analysis was then performed to determine how much variation was explained by the disease group (C9orf72 expansion carriers, non-expansion carriers, and controls) as well as by potential confounders (RIN, sex, age at death, plate, and gene counts). We also assessed the effects of differences in cellular composition between individuals using surrogate markers for five major cell types: neurons (enolase 2 [ENO2]), microglia (CD68 molecule [CD68]), astrocytes (glial fibrillary acidic protein [GFAP]), oligodendrocytes (oligodendrocyte transcription factor 2 [OLIG2]), and endothelial cells (CD34 molecule [CD34]) [1, 12, 23]. Based on our SOV analysis, variables with a mean F-statistic above 1.25 were selected. Differential expression analysis was performed using two separate linear regression models: one model included RIN, sex, age at death, plate, and disease group, while the other model also included our five surrogate markers for the major cell types. Fold-changes were determined and p-values were adjusted for multiple testing using a false discovery rate (FDR) procedure [5]. Genes with an FDR below 5% were considered statistically significant (FDR < 0.05). To examine whether significantly differentially expressed genes were enriched for biological processes and pathways, enrichment analysis was performed using the anRichment package [33] and gene sets from the molecular signatures database (MSigDB; v6.2) [39]. For visualization purposes, Venn diagrams were generated with the VennDiagram package [10]. Moreover, heat maps were made with the ComplexHeatmap package [22] and the flashClust package [35], utilizing the Euclidean distance and average method.
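The multiple-testing step can be illustrated with a small sketch. It assumes that the FDR procedure cited above is the Benjamini-Hochberg step-up procedure, which the text does not state explicitly; the function name and the p-values are hypothetical and serve only as an illustration, not as the code used in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the FDR procedure in the text is Benjamini-Hochberg.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-value increases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene p-values from a regression model.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(p_vals)
# Genes called significant at an FDR below 5%.
significant = [i for i, q in enumerate(fdr) if q < 0.05]
```

In this toy example, only the first two genes pass the 5% threshold, although four have a nominal p-value below 0.05.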

Co-expression analysis

In addition to the gene-level analyses described in the previous section, we performed module-level analyses to identify the building blocks of biological systems, revealing relevant information about the system’s structure and dynamics as well as the function of certain proteins [61]. As such, we employed weighted gene co-expression network analysis (WGCNA) to find modules comprised of highly correlated genes that go up or down together [34], using residual expression values adjusted for aforementioned potential confounders as input (both with and without surrogate markers). Separate analyses were performed for each pairwise comparison, creating signed hybrid networks and using the biweight midcorrelation (bicor) method. To achieve a scale-free topology, we selected a power appropriate for each comparison, ranging between 4 and 14. A dynamic tree cutting method was used with a minimum module size of 30 and a merge height varying from 0.25 to 0.35, depending on the comparison. Modules generated using these settings were represented by their first principal component (module eigengene) and a unique color. For every gene, we calculated correlations between expression levels and each module’s eigengene value (module membership). Modules that differed significantly between disease groups were further investigated using enrichment analyses and displayed with heat maps, using methods identical to those described above. Additionally, network visualization was performed for top protein-coding genes belonging to modules of interest with a relatively high module membership (> 0.6), utilizing the force-directed yFiles Organic Layout and Organic Edge Router algorithms in Cytoscape (v3.7.1) [55]. In these network plots, the connectivity of each gene was represented by the size of its node, the module to which it has been assigned by its color, and the strength of the correlation by the thickness of its edges.
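To make the network construction more concrete, the sketch below builds a signed hybrid adjacency for three hypothetical genes. In a signed hybrid network, positive correlations are raised to the soft-thresholding power and negative correlations are set to zero. For simplicity, the sketch uses the Pearson correlation rather than the bicor method used in this study, and all gene names and values are made up.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation (a stand-in here for the bicor method)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def signed_hybrid_adjacency(expr, power):
    """Adjacency = cor ** power for positive correlations, 0 otherwise."""
    genes = list(expr)
    adj = {}
    for g1 in genes:
        for g2 in genes:
            r = pearson(expr[g1], expr[g2])
            adj[(g1, g2)] = r ** power if r > 0 else 0.0
    return adj


# Hypothetical residual expression values across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
# A power of 6 lies within the 4 to 14 range mentioned above.
adj = signed_hybrid_adjacency(expr, power=6)
```

Here geneA and geneB are strongly connected, whereas the anti-correlated pair geneA and geneC receives an adjacency of zero.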

Clinico-pathological association analysis

To find associations with clinical and pathological features of the disease in patients carrying an expanded C9orf72 repeat (n = 34), we obtained residuals from linear regression models with expression levels as outcome to account for potential confounders (RIN, sex, and plate, either with or without surrogate markers). First, we performed analyses to examine individual genes, starting with linear regression models. We investigated associations with age at onset and age at death, adjusting for disease subgroup (FTLD or FTLD/MND). Subsequently, we assessed associations with C9orf72 expansion size, RNA foci burden (mean percentage of cells with sense or antisense RNA foci), DPR protein levels (total poly[GP]), and methylation of the C9orf72 promoter, while adjusting for disease subgroup and age at death. Next, we performed a logistic regression analysis to compare expression levels between patients with predominant FTLD and those diagnosed with both FTLD and MND, adjusting for age at death. We then ran Cox proportional hazards regression models, including disease subgroup and age at death as potential confounders. Hazard ratios (HRs) and 95% confidence intervals (CIs) were estimated; deaths of any cause were utilized as our survival endpoint. Three approaches were used for our survival analysis to assess expression levels: comparing the top 50% to the bottom 50% as a dichotomous categorical variable, ranking expression levels from low to high, and examining them as a continuous variable. Notably, all models were adjusted for multiple testing using an FDR procedure [5]; an FDR below 5% was considered statistically significant (FDR < 0.05).
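The dichotomous approach to survival (top 50% versus bottom 50%) can be sketched as follows. This is a descriptive illustration only: it splits patients at the median expression level and reports the median survival per group, without fitting a Cox model or adjusting for confounders. It also assumes that every patient has an observed death, so censoring is ignored. All values and names are hypothetical.

```python
from statistics import median


def split_by_median_expression(expression, survival_years):
    """Split survival times into bottom-50% and top-50% expression groups."""
    cutoff = median(expression)
    # Patients at or below the median expression form the bottom group.
    low = [s for e, s in zip(expression, survival_years) if e <= cutoff]
    high = [s for e, s in zip(expression, survival_years) if e > cutoff]
    return low, high


# Hypothetical residual expression values and survival after onset (years).
expression = [-1.2, -0.7, -0.3, 0.1, 0.4, 0.9]
survival = [3.0, 4.5, 5.0, 7.5, 9.0, 12.0]

low, high = split_by_median_expression(expression, survival)
median_low, median_high = median(low), median(high)
```

A hazard ratio would still require a regression model that includes disease subgroup and age at death, as described above.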

Second, we evaluated combinations of genes found to be nominally significant in our single-gene analysis (P < 0.05). To examine the sensitivity of our results, we opted to use two machine learning methods, namely Least Absolute Shrinkage and Selection Operator (LASSO) regression and random forest. LASSO regression was performed with the glmnet package [20]. The most parsimonious model was selected, using leave-one-out cross-validation, an alpha of one, and a lambda within one standard error from the model with the lowest cross-validation error (mean squared error, classification error, or partial-likelihood deviance). This approach was employed using models appropriate for the nature of the given response variable, including age at onset, age at death, expansion size, RNA foci burden, poly(GP) DPR levels, C9orf72 promoter methylation, disease subgroup, and survival after onset. We then used the randomForest package [38], which implements Breiman’s random forest algorithm [6]. We tuned the number of trees in the forest (1000 to 30,000), the number of features considered at each split (2 to 98), and the size of terminal nodes (2 to 10). Subsequently, we created a random forest regressor (age at onset, age at death, C9orf72 expansion size, RNA foci levels, DPR proteins, and promoter methylation) or classifier (disease subgroup). We extracted the out-of-bag error rate as well as information about the importance of each gene (variable importance), as represented by its permuted effect on the error rate (e.g., mean squared error or accuracy), while other genes remained unchanged [38].
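The selection of the most parsimonious LASSO model can be illustrated with a sketch of the one-standard-error rule. It assumes that "a lambda within one standard error" refers to the largest lambda whose cross-validation error does not exceed the minimum error plus its standard error. The function name and the cross-validation values below are hypothetical.

```python
def select_lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, lambda chosen by the 1-SE rule)."""
    # Index of the model with the lowest mean cross-validation error.
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    # All lambdas whose error lies within one standard error of the minimum.
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    # A larger lambda means stronger shrinkage, hence a sparser model.
    return lambdas[best], max(eligible)


# Hypothetical cross-validation results along a lambda path.
lambdas = [1.0, 0.5, 0.25, 0.125, 0.0625]
cv_mean = [2.40, 1.55, 1.30, 1.22, 1.27]
cv_se = [0.30, 0.25, 0.20, 0.18, 0.19]

lambda_min, lambda_1se = select_lambda_1se(lambdas, cv_mean, cv_se)
```

In this example, the rule prefers a lambda of 0.25 over the error-minimizing value of 0.125, trading a small increase in error for a model with fewer genes.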

Validation experiments and analysis

We validated RNA expression levels of the top candidate genes in C9orf72 expansion carriers from our RNA sequencing cohort (n = 34). Reverse transcription was performed using 250 ng of RNA as template with the SuperScript III Kit (Invitrogen) and an equal ratio of Random Hexamers and Oligo dT primers. The following expression assays (TaqMan) were performed: vascular endothelial growth factor A (VEGFA; Hs00900055_m1), cyclin dependent kinase like 1 (CDKL1; Hs01012519_m1), eukaryotic elongation factor 2 kinase (EEF2K; Hs00179434_m1), and small G protein signaling modulator 3 (SGSM3; Hs00924186_g1). As markers, ENO2 (Hs00157360_m1) and GFAP (Hs00909233_m1) were selected. To obtain relative expression levels for each patient, the median of replicates was taken, the geometric mean of the two markers was calculated, and a calibrator on every plate was used for normalization, utilizing the ΔΔCt method. Subsequently, the correlation between these relative expression levels and residuals from our RNA sequencing analysis was calculated using a Spearman’s test of correlation.
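The normalization steps can be sketched as a small calculation. The sketch makes several assumptions that are not specified in the text: the geometric mean is taken over the median Ct values of the two markers, the calibrator is summarized by a single ΔCt value, and amplification efficiency is 100% (a factor of 2 per cycle). All Ct values and names are hypothetical.

```python
from statistics import geometric_mean, median


def relative_expression(target_cts, ref_cts_by_marker, calibrator_delta_ct):
    """Relative expression via the delta-delta-Ct method (2 ** -ddCt)."""
    # Median of the technical replicates for the target gene.
    target_ct = median(target_cts)
    # Assumption: geometric mean over the median Ct of each reference marker.
    ref_ct = geometric_mean([median(cts) for cts in ref_cts_by_marker.values()])
    delta_ct = target_ct - ref_ct
    # Normalize against the calibrator run on the same plate.
    delta_delta_ct = delta_ct - calibrator_delta_ct
    # Assumption: perfect doubling per cycle.
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct replicates for one patient.
sample = relative_expression(
    target_cts=[26.1, 26.0, 25.9],
    ref_cts_by_marker={"ENO2": [21.0, 21.0, 21.1], "GFAP": [21.0, 20.9, 21.0]},
    calibrator_delta_ct=6.0,
)
```

With these numbers, the patient's ΔCt is one cycle lower than that of the calibrator, which corresponds to roughly two-fold higher relative expression.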

Results

Top differentially expressed gene is C9orf72

We performed RNA sequencing on carriers of a C9orf72 repeat expansion (n = 34), FTLD and FTLD/MND patients without this expansion (n = 44), and control subjects without any neurological disease (n = 24; Table 1). When adjusting for cell-type-specific markers, 6706 genes were significantly different between these groups. Without adjustment, 11,770 genes were differentially expressed. Importantly, the top gene was C9orf72 itself, both with (FDR = 1.41E-14) and without (FDR = 8.69E-08) adjustment for cell-type-specific markers (Table 2; Fig. 1a, b). Hereafter, we specifically compared patients with a C9orf72 expansion to patients without this expansion or to controls. For simplicity, we focused on results that accounted for differences in cellular composition. In total, we detected 4443 differentially expressed genes when comparing expansion carriers to patients without this expansion and 2334 genes when comparing them to controls (Fig. 1c). Heat maps demonstrated that most patients with an expanded repeat clustered together (Fig. 2), especially when comparing them to controls. Of the differentially expressed genes, 1460 overlapped (Fig. 1c, d), including C9orf72 itself. The RNA expression levels of C9orf72 were roughly two-fold lower in expansion carriers than in non-expansion carriers (FDR = 6.04E-06) or control subjects (FDR = 1.08E-05; Table 3). We further investigated overlapping genes using enrichment analyses, which indicated that these genes might be enriched for processes involved in endocytosis (FDR = 0.02; Table 4).

Table 2 Differential Expression (All Groups)
Fig. 1

a After adjustment for five major cell types (neurons, microglia, astrocytes, oligodendrocytes, and endothelial cells), expression levels of C9orf72 are shown for all disease groups: patients with a C9orf72 repeat expansion (C9Plus), patients without this expansion (C9Minus), and control subjects (Control). b Without adjustment for five cell types, the expression levels of C9orf72 are displayed for C9Plus, C9Minus, and Control. Importantly, in both graphs, C9orf72 levels are lower in C9Plus than in C9Minus or Control. For each box plot, the median is represented by a solid black line, and each box spans the interquartile range (IQR; 25th percentile to 75th percentile). c In total, 4443 differentially expressed genes are detected when comparing C9Plus to C9Minus. The comparison between C9Plus and Control results in 2334 differentially expressed genes. As displayed in the Venn diagram, 1460 differentially expressed genes overlap. d All overlapping genes go in the same direction (lower left quadrant and upper right quadrant)

Fig. 2

a When comparing patients with a C9orf72 repeat expansion to those without this expansion (C9Plus vs. C9Minus), a heat map is displayed. b A heat map is shown when comparing expansion carriers to control subjects (C9Plus vs. Control). In these heat maps, high expression levels are shown in red and low levels in blue. Both heat maps indicate that most expansion carriers cluster together (purple). Of note, for visualization purposes, only the top differentially expressed genes are displayed (false discovery rate [FDR] < 0.001)

Table 3 Differential Expression (Specific Comparisons)
Table 4 Enrichment Analysis (Overlapping Genes)

Co-expression analysis reveals relevant modules involved in processes like vesicular transport

Next, we performed module-level analyses using WGCNA. When comparing patients with an expanded C9orf72 repeat to those without this repeat, we identified 22 modules. Visualization of the module-trait relationships (Fig. 3a) revealed that the strongest relationships were dependent on the presence or absence of a C9orf72 repeat expansion (disease group). In fact, we only detected significant correlations with the disease group, resulting in the identification of 11 modules of interest. None of these modules demonstrated a significant correlation with potential confounders, such as cellular composition, RIN, age at death, sex, or plate (Fig. 3a). Enrichment analysis of these 11 modules (Table 5) showed that they were involved in protein folding (black), RNA splicing (blue), metabolic processes (yellow), Golgi vesicle transport (green), GABAergic interneuron differentiation (greenyellow), synaptic signaling (turquoise), etc. Given the potential function of the C9orf72 protein, we visualized the green module (Fig. 4a); most expansion carriers appeared to have lower module eigengene values for this module than disease controls. In addition to Golgi vesicle transport (FDR = 1.33E-06), the green module was also significantly enriched for related processes, such as endoplasmic reticulum to Golgi vesicle-mediated transport (FDR = 1.97E-05), vacuolar transport (FDR = 9.91E-05), vesicle-mediated transport (FDR = 0.002), and lysosomes (FDR = 0.002). This is in agreement with the cellular components that appeared to be involved, including vacuolar part (FDR = 4.31E-10), endoplasmic reticulum part (FDR = 2.88E-09), endoplasmic reticulum (FDR = 2.34E-08), vacuole (FDR = 8.41E-08), and vacuolar membrane (FDR = 6.53E-07). A gene network, which displayed top genes from significant modules, demonstrated that members of the green module (e.g., charged multivesicular body protein 2B [CHMP2B]) clustered together with genes belonging to the yellow module, most importantly C9orf72 (Fig. 5a).

Fig. 3

a Module-trait relationships are presented for patients with an expanded C9orf72 repeat and patients without this repeat (C9Plus vs. C9Minus). b For patients with an expansion and control subjects (C9Plus vs. Control), module-trait relationships are plotted. These plots are generated with weighted gene co-expression network analysis (WGCNA) to find groups of genes that go up (red) or down (blue) together. A unique color has been assigned to each of these groups, also called a module. Correlations and p-values are shown for variables of interest, including disease group (C9Plus, C9Minus, and/or Control; arrow), neurons, microglia, astrocytes, oligodendrocytes, endothelial cells, RNA integrity number (RIN), age at death, sex, and plate. The strongest correlations (brightest colors) are observed for the disease group. Notably, both module-trait relationship plots are based on residuals obtained after adjustment for cell-type-specific markers

Table 5 Enrichment Analysis (C9Plus vs. C9Minus)
Fig. 4

a One specific group of genes is visualized in a heat map: the green module. b A heat map is displayed for the yellow module. High expression levels are shown in red and low levels in blue. Below every heat map, the first principal component of a given module (module eigengene) is displayed for each sample. Most C9orf72 expansion carriers (C9Plus) appear to have relatively low levels as compared to patients without this expansion (C9Minus) or to control subjects (Control)

Fig. 5

a For patients harboring a C9orf72 repeat expansion and those without this expansion (C9Plus vs. C9Minus; module membership > 0.6 and significance < 1.0E-06), a gene network is displayed. b A gene network is visualized when examining expansion carriers and controls (C9Plus vs. Control; module membership > 0.6 and significance < 2.5E-05). In these network plots, the connectivity of each gene is represented by the size of its node, the module to which it has been assigned by its color, and the strength of the correlation by the thickness of its edges; the C9orf72 gene is denoted by an arrow. Of note, the plots in this figure have been generated after adjustment for cell-type-specific markers

The comparison between expansion carriers and controls resulted in 25 modules. Although we adjusted for cell-type-specific markers and other potential confounders, we still observed weak correlations with those variables (Fig. 3b), which may, for instance, reflect differences in cellular composition between affected and unaffected frontal cortices. Nevertheless, the disease group displayed the strongest correlations and was significantly associated with 11 modules. An enrichment was seen for processes like GABAergic interneuron differentiation (paleturquoise), synaptic signaling (turquoise), metabolic processes (yellow), Golgi vesicle transport (green), oxidative phosphorylation (orange), protein folding (midnightblue), and cell death (steelblue; Table 6). The C9orf72 gene was assigned to the yellow module, which we visualized (Fig. 4b); in general, expansion carriers seemed to have decreased module eigengene values for the yellow module when compared to control subjects. The yellow module was enriched for various processes, including small-molecule metabolic processes (FDR = 2.10E-13), organic-acid catabolic processes (FDR = 1.39E-11), small-molecule catabolic processes (FDR = 1.15E-10), organic-acid metabolic processes (FDR = 6.24E-08), and oxidation reduction processes (FDR = 8.71E-07). The top cellular components were the mitochondrial matrix (FDR = 2.59E-10), mitochondrion (FDR = 2.18E-09), and mitochondrial part (FDR = 2.27E-09). Our gene network with top genes from significant modules highlighted genes belonging to the yellow module (Fig. 5b), such as small integral membrane protein 14 (SMIM14), pyrroline-5-carboxylate reductase 2 (PYCR2), 5′-nucleotidase domain containing 1 (NT5DC1), S100 calcium binding protein B (S100B), and dynactin subunit 6 (DCTN6).

Table 6 Enrichment Analysis (C9Plus vs. Control)

Of note, without adjustment for cell-type-specific markers, the strongest relationships were no longer observed for the disease group, but for our surrogate markers (Additional file 1: Figure S1). As an example, neurons were highly correlated with the turquoise module, when comparing C9orf72 expansion carriers to patients without this expansion (correlation: 0.82; Additional file 1: Figure S1a) or to control subjects (correlation: 0.83; Additional file 1: Figure S1b). Enrichment analysis confirmed that the turquoise module was enriched for synaptic signaling (FDR = 1.30E-53 and FDR = 2.09E-44, respectively). Similarly, microglia were strongly correlated with the grey60 module, demonstrating a correlation of 0.87 for both comparisons, while being enriched for the immune response (FDR = 8.23E-62 and FDR = 1.51E-63, respectively). The importance of our adjustment for cell-type-specific markers was further substantiated by a cluster dendrogram (Additional file 1: Figure S2); branches in this dendrogram correspond to the modules we identified. After adjustment for cellular composition (Additional file 1: Figure S2a), the turquoise module was relatively small and seemed more closely related to the disease group than to our neuronal marker. Without this adjustment, however, the turquoise module was much larger and resembled the pattern of our neuronal marker (Additional file 1: Figure S2b). Importantly, without adjustment for surrogate markers, the green module involved in vesicular transport and the yellow module that contains C9orf72 still correlated with the disease group (Additional file 1: Figure S1 and S3), but findings were less prominent than those obtained after adjustment.

Machine learning uncovers clinico-pathological associations

We then performed an exploratory analysis aimed at discovering clinico-pathological associations, restricting our cohort to FTLD and FTLD/MND patients harboring an expanded C9orf72 repeat (n = 34). Three types of models were used with residuals adjusted for cell-type-specific markers as input: linear regression models, logistic regression models, and Cox proportional hazards regression models. Our single-gene analysis did not reveal individual genes that remained significant after adjustment for multiple testing (not shown). Nonetheless, when analyzing all nominally significant genes, machine learning did point to interesting candidates, which were consistently associated with a given outcome using multiple methods and which were biologically relevant.

The most parsimonious models generated by LASSO regression contained up to 13 genes, depending on the variable studied (Table 7). When focusing on age at onset as response variable, for instance, only one gene was found: VEGFA (Fig. 6a). Importantly, this gene was the 10th gene based on our random forest analysis (Fig. 7a), and additionally, it was the 6th gene in our single-gene analysis (P = 9.17E-05). One of the four genes selected by LASSO regression that seemed associated with C9orf72 expansion size was CDKL1 (Fig. 6b). This gene was listed as the 19th gene in the random forest analysis (Fig. 7b) and the top gene in the single-gene analysis (P = 5.28E-05). Another interesting gene identified by LASSO regression was EEF2K, which appeared to be associated with the level of poly(GP) proteins (Fig. 6c). This gene was also the 3rd most important variable according to a random forest algorithm (Fig. 7c) and the 6th gene according to the single-gene analysis (P = 9.69E-04). Without adjustment for surrogate markers, similar trends were observed for VEGFA (P = 9.47E-04), CDKL1 (P = 0.01), and EEF2K (P = 0.002; Additional file 1: Figure S4a-c).

Table 7 LASSO Regression
Fig. 6

a-d Associations are displayed for patients carrying a C9orf72 repeat expansion. a The first plot shows an association between VEGFA and age at onset. b An association between CDKL1 and C9orf72 expansion size is shown in the second plot. c The third plot displays an association between EEF2K and poly(GP) dipeptide repeat (DPR) protein levels. In these three plots, the solid blue line denotes the linear regression line, while each individual is represented by a solid dark grey circle. d The last plot indicates that patients with higher SGSM3 levels demonstrate prolonged survival after onset, when comparing the bottom 50% (solid salmon line) to the top 50% (solid turquoise line). These plots have been created using residuals adjusted for differences in cellular composition

Fig. 7

a-c The importance of genes is visualized in three plots based on a random forest analysis. For continuous variables (age at onset, C9orf72 expansion size, and poly[GP] levels), the importance is defined as an increase in mean squared error. The blue gradient represents the importance of each gene, from very important (light) to less important (dark). Arrows point at genes of interest, namely VEGFA, CDKL1, and EEF2K (Table 7 and Fig. 6)

In the survival after onset model, LASSO regression identified two genes, one of which was SGSM3, the top hit of our single-gene analysis (P = 1.31E-05; Table 7). In patients belonging to the bottom 50% of SGSM3 expression levels, the median survival after onset was 4.8 years (IQR: 3.0–6.8) versus 8.6 years in the top 50% (IQR: 7.5–12.1; Fig. 6d). This difference resulted in an HR of 0.10 (95% CI: 0.04–0.28). We were able to confirm these findings when analyzing expression levels based on rank, listing SGSM3 as the 3rd gene (P = 6.03E-04). Likewise, when treating expression levels as a continuous variable, SGSM3 was the 13th gene on the list (P = 0.001). Although much less pronounced, this trend with survival after onset was also observed without adjustment for cell-type-specific markers (P = 0.02; Additional file 1: Figure S4d). Together, our findings suggest that lower levels of SGSM3 might be associated with shortened survival after onset in C9orf72 expansion carriers. Notably, of our four genes of interest, SGSM3 was the only gene that was significantly differentially expressed between disease groups (FDR = 0.03), demonstrating elevated levels in patients carrying an expanded C9orf72 repeat (Additional file 1: Figure S5).

We then used TaqMan expression assays for the four top candidate genes to validate the expression results from our RNA sequencing experiment in C9orf72 expansion carriers. When using residuals unadjusted for cellular composition, a significant correlation between our expression assays and RNA sequencing data was found for VEGFA (P = 4.17E-05, correlation: 0.68), CDKL1 (P = 0.003, correlation: 0.55), EEF2K (P = 0.03, correlation: 0.40), and SGSM3 (P = 0.03, correlation: 0.40; Additional file 1: Figure S6b, d, f, h). Similar correlations were obtained when using residuals adjusted for our five surrogate markers (Additional file 1: Figure S6a, c, e, g).

Discussion

In this study, we characterized the expression pattern of C9orf72-related diseases in an affected brain region: the frontal cortex. We examined FTLD and FTLD/MND patients with or without a C9orf72 repeat expansion as well as control subjects (n = 102). Differential expression analysis identified C9orf72 as the top gene; it was approximately 50% reduced in C9orf72 expansion carriers. Importantly, differentially expressed genes were enriched for endocytosis (FDR = 0.02). Without adjustment for cell-type-specific markers, our co-expression analysis revealed modules influenced by neuronal loss (turquoise) and inflammation (grey60). Usage of surrogate markers resulted in the discovery of additional modules that correlated with the disease group, including modules enriched for protein folding, RNA processing, metabolic processes, and vesicle-mediated transport. The C9orf72 gene itself was assigned to a module involved in metabolism (yellow) and clustered with genes belonging to a module that plays a role in vesicular transport (green). To identify potential disease modifiers, we then focused on the subset of individuals with an expanded repeat in C9orf72 (n = 34). We used various analytical approaches, including LASSO regression and random forest, which pointed to promising candidates. In addition to VEGFA, for instance, we detected CDKL1, EEF2K, and SGSM3. Taken together, our RNA sequencing study uncovered that vital processes, such as vesicle transport, are affected by the presence of a repeat expansion in C9orf72. Furthermore, the modifiers identified in this study may represent biomarkers and/or therapeutic targets, which are in great demand.

Although the C9orf72 protein has been studied extensively since the discovery of a repeat expansion in the C9orf72 gene [14, 50], little is known about its function. It has been suggested that C9orf72 is a member of a superfamily called differentially expressed in normal and neoplasia (DENN) [36, 65], which contains GDP/GTP exchange factors (GEFs) that activate regulators of membrane trafficking known as Rab-GTPases. The C9orf72 protein has already been shown to co-localize with Rab-GTPases involved in endosomal transport [18]. Additionally, C9orf72 was found to form a complex with another DENN protein (SMCR8), serving as a GEF for specific Rab-GTPases [2, 53, 62, 64]. Furthermore, the C9orf72 protein appears to play a role in lysosomal biogenesis in addition to vesicle trafficking [56]. The presence of the C9orf72 repeat expansion seems to cause defects in vesicle trafficking and dysfunctional trans-Golgi network phenotypes, which can be reversed by overexpression of C9orf72 or antisense oligonucleotides targeting the expanded repeat [3]. Interestingly, modulation of vesicle trafficking may even rescue neurodegeneration in induced motor neurons from C9orf72 expansion carriers [56].

Our study, in which we compared the expression pattern of C9orf72 expansion carriers to (disease) controls, uncovered C9orf72 as the top hit of our differential expression analysis. This aligns with one of our previous studies where we detected reduced levels of C9orf72 transcripts in expansion carriers and where we observed clinico-pathological associations with specific transcript variants [59]. It was reassuring to see that differentially expressed genes were enriched for endocytosis, especially given the potential role of the C9orf72 protein in vesicular transport. These findings were further substantiated by the fact that our co-expression analysis revealed a module that was enriched for Golgi vesicle transport as well as endoplasmic reticulum to Golgi vesicle-mediated transport, vacuolar transport, vesicle-mediated transport, and lysosomes. Our RNA sequencing study, therefore, provides additional evidence that the presence of a C9orf72 repeat expansion might disrupt vesicle trafficking, a crucial process. Interestingly, we also discovered a promising modifier of survival after onset that is involved in vesicle transport: SGSM3. Our findings indicate that low expression levels of SGSM3 could be detrimental in C9orf72 expansion carriers, while high levels might have protective effects. The SGSM3 protein interacts with Ras-related protein Rab-8A [63], a small Rab-GTPase that is also regulated by the C9orf72-SMCR8 complex [53]. Consequently, one could postulate that higher levels of SGSM3 might counteract some of the harmful effects associated with an expanded repeat in C9orf72. In fact, a recent yeast screen demonstrated that msb3, the yeast ortholog of SGSM3, modifies the toxicity of one of the DPR proteins: poly(GR) [9]; other potential mechanisms seem worthy of exploration.

Another interesting candidate we identified, VEGFA, appeared to be associated with the age at which disease symptoms occur. Our findings suggest that higher expression levels of this gene are associated with a delayed age at onset (P = 9.17E-05, coefficient: 7.36). Because age at onset and age at death are strongly correlated, one could speculate that VEGFA levels might simply increase as an individual ages. Our single-gene analysis, however, revealed a stronger association with age at onset than with age at death (P = 0.003, coefficient: 5.81). The VEGFA protein belongs to the vascular endothelial growth factor (VEGF) family and is thought to have neurotrophic effects [28, 29]. Remarkably, reduced expression of Vegfa has been shown to cause an ALS-like phenotype in mice [45]. At the same time, treatment with Vegfa might protect motor neurons against ischemic death [32]. Additionally, genetic variants in VEGFA may render individuals more vulnerable to the development of ALS [31, 32]. Notably, neither an association with survival after onset (P = 0.26) nor a significant difference between disease subgroups (FTLD versus FTLD/MND; P = 0.75) was observed in our C9orf72 expansion carriers, but the association we detected with age at onset is consistent with a protective role for VEGFA.

In addition to SGSM3 and VEGFA, we also found associations with CDKL1 and EEF2K. CDKL1 was associated with the size of C9orf72 expansions: higher levels were observed in individuals with longer expansions. This gene is a member of the cyclin-dependent kinase family and appears to control the length of neuronal cilia [8]. At the moment, how CDKL1 might relate to C9orf72 expansion size remains elusive. Expression levels of EEF2K were associated with the amount of poly(GP); higher EEF2K levels were seen in expansion carriers with lower poly(GP) levels. EEF2K is a regulator of protein synthesis and synaptic plasticity that has already been studied in Alzheimer’s disease and Parkinson’s disease, where it may affect the toxicity of amyloid-β and α-synuclein [25, 26, 27]. Given that it functions in protein synthesis and has previously been implicated in other neurodegenerative diseases, EEF2K is an interesting candidate. Of note, for simplicity, we focused on four disease modifiers in this manuscript; however, our study also hints at the involvement of other genes (e.g., Table 7), which might be worth pursuing.

It should be noted that, although we performed RNA sequencing on a precious collection of well-characterized individuals for whom autopsy tissue was available, the actual number of samples included in our study is limited. This mainly affects the clinico-pathological association analyses performed in the subset of individuals carrying an expanded C9orf72 repeat; these analyses, therefore, should be considered exploratory in nature. Additionally, we would like to stress that patients included in this study were generally younger than control subjects. Although we adjusted our models for age at death, we realize that this age difference may have influenced our findings. Another limitation that should be mentioned is that we performed RNA sequencing on bulk tissue from the frontal cortex instead of on single nuclei. Because expression levels are cell-type dependent, we included five genes in our models as surrogate markers [1, 12, 23]. Evidently, this approach is not perfect, but it enabled us to (partially) account for various degrees of neuronal loss, inflammation, and gliosis seen in patients with FTLD and/or MND. When taking the cost of single-nucleus RNA sequencing into consideration, our bulk tissue analysis with adjustment for cellular composition seems to provide a cost-effective alternative that can yield significant results. Future studies could further investigate expression levels of interesting candidates in specific cell types (e.g., using purified cell populations) to elucidate which cells are most relevant for a given gene and appear to drive the detected associations. They could also clarify whether changes on the protein level mirror changes on the RNA level.
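
To make the idea of adjusting for cellular composition concrete, the sketch below regresses the expression of one gene on an intercept plus a set of covariate columns (for example, age at death and surrogate marker genes) and returns the residuals. It is a simplified illustration under stated assumptions rather than the model used in this study: it uses ordinary least squares, the function names (`solve_linear_system`, `ols_residuals`) are hypothetical, and the full covariate set of the actual models is not reproduced here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear covariates?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_residuals(y, covariates):
    """Residuals of y after least-squares regression on the covariates.

    y          -- expression values of one gene, one per sample
    covariates -- list of columns, each with one value per sample
                  (e.g., age at death, surrogate marker expression)
    """
    n = len(y)
    if any(len(col) != n for col in covariates):
        raise ValueError("every covariate needs one value per sample")
    # Design matrix: intercept followed by the covariate columns.
    design = [[1.0] + [col[i] for col in covariates] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(design[k][i] * design[k][j] for k in range(n))
            for j in range(p)] for i in range(p)]
    xty = [sum(design[k][i] * y[k] for k in range(n)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    return [y[k] - sum(beta[j] * design[k][j] for j in range(p))
            for k in range(n)]
```

Calling `ols_residuals` once without and once with the marker columns would give "unadjusted" and "adjusted" residuals in the sense used above. Any expression signal that merely tracks a marker gene is removed in the adjusted version, which is why this kind of adjustment can only partially account for differences in cellular composition.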

Conclusions

To conclude, in this study, we have used a combination of conventional analyses and machine learning to capture the RNA signature of C9orf72-linked diseases. Our powerful approach highlights the disruptive effects of a repeat expansion in C9orf72, particularly on vesicular transport. Furthermore, we have discovered promising candidate modifiers that were consistently associated with relevant disease features and that may serve as urgently needed biomarkers and/or point to new treatment strategies.