Identification of key genes and pathways for Alzheimer’s disease via combined analysis of genome-wide expression profiling in the hippocampus

In this study, combined analysis of expression profiling in the hippocampus of 76 patients with Alzheimer’s disease (AD) and 40 healthy controls was performed. The effects of covariates (including age, gender, postmortem interval, and batch effect) were controlled, and differentially expressed genes (DEGs) were identified using a linear mixed-effects model. To explore the biological processes, functional pathway enrichment and protein–protein interaction (PPI) network analyses were performed on the DEGs. The extended genes with PPI to the DEGs were obtained. Finally, the DEGs and the extended genes were ranked using the convergent functional genomics method. Eighty DEGs with q < 0.1, including 67 downregulated and 13 upregulated genes, were identified. In the pathway enrichment analysis, the 80 DEGs were significantly enriched in one Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, GABAergic synapses, and 22 Gene Ontology terms. These genes were mainly involved in neuron, synaptic signaling and transmission, and vesicle metabolism. These processes are all linked to the pathological features of AD, demonstrating that the GABAergic system, neurons, and synaptic function might be affected in AD. In the PPI network, 180 extended genes were obtained, and the hub gene occupied in the most central position was CDC42. After prioritizing the candidate genes, 12 genes, including five DEGs (ITGB5, RPH3A, GNAS, THY1, and SEPT6) and seven extended genes (JUN, GDI1, GNAI2, NEK6, UBE2D3, CDC42EP4, and ERCC3), were found highly relevant to the progression of AD and recognized as promising biomarkers for its early diagnosis.


INTRODUCTION
Alzheimer's disease (AD) is an age-related neurodegenerative disease caused by central nervous system disorders. It accounts for 50%-75% of dementia patients. The common symptoms of AD are progressive deterioration of memory and cognitive decline, including degenerated learning, recall accuracy, and problem solving and changes in personality and behavior (Rosenberg et al. 2015). Many studies show that AD is a polygenic disease influenced by several susceptibility Mengsi Wu and Kechi Fang have contributed equally to this work.
genes with a small effect (van Cauwenberghe et al. 2016). However, the specific pathogenesis of AD remains unclear, and no effective treatment and prevention measures are still available.
To explore the molecular changes underlying AD, a number of genome-wide expression profiling experiments were performed on the postmortem brain tissues of AD patients (Blair et al. 2013;Blalock et al. 2004;Cooper-Knock et al. 2012;Liang et al. 2008a, b;Wang et al. 2016b). The hippocampus plays a critical role in memory and learning and is one of the earliest regions to be affected in AD patients (Mak et al. 2017;Weiner et al. 2017). Dysregulated genes and molecular pathways have been identified in a series of gene expression studies in the hippocampus of AD patients Wang et al. 2016b). However, the findings in different studies have heterogeneity and low reproducibility, which are partly attributable to the different array types; small sample size; diverse analysis procedures in different cohorts; and other confounding factors, such as postmortem interval (PMI), age, and gender. To solve these issues, several studies sought to consolidate the knowledge of transcriptomic abnormalities via a combined analysis (Hu et al. 2015;Li et al. 2015). However, these studies (Hu et al. 2015;Li et al. 2015) had several limitations, including the following: (1) covariates, such as age, gender, PMI, and batch effect, were not considered when modeling; (2) compared with the combined-sample reanalysis of the individual-level data, the combined reanalysis of the summary statistics from multiple studies was relatively underpowered (Hess et al. 2016); and (3) new microarray-based gene expression studies of AD were conducted in the past two years.
Therefore, in this study, microarray-based transcriptomic studies in the hippocampus of AD patients were strictly screened, and only the datasets with detailed sample information and raw probe-level data generated from similar Affymetrix platforms were retained. A combined analysis of individual-level biological data from selected microarray studies was conducted for statistical modeling with proper correction for covariates and variances among studies. The differentially expressed genes (DEGs) in the hippocampus of AD patients and the age-matched healthy controls were best identified, thereby providing biological clues for the interpretation of the pathogenic mechanism of AD. Further validations of the DEGs were performed to test the robustness of these findings. Next, pathway enrichment and protein-protein interaction (PPI) network analyses for the DEGs were performed to explore the biological processes and interactions of the dysregulated genes, helping to elucidate the biological underpinnings of AD. Furthermore, gene prioritization was conducted to discover more promising genes for subsequent experimental replication and identification of biomarkers from the large amount of candidate genes. The findings of the present study may contribute to characterizing intrinsic molecular processes underlying AD and implicating promising biomarkers for AD.

RESULTS
DEGs identified in the hippocampus of AD patients and age-matched controls For our combined analysis, data from 116 samples, composed of 40 healthy controls and 76 AD cases, were obtained after quality control. Eight sample data were removed. After normalization, the expression matrices for each dataset were merged, and the combined gene expression matrix consisted of 116 samples and 22,277 probe sets. Detailed information of each dataset is shown in Table 1.
For the variables we considered, a significant difference was observed in gender between the AD cases and controls (p-value = 0.01647). Although age and PMI did not show statistical significance, these factors were still taken into account (Supplementary Table S1). After mixed-effect linear modeling, we identified 82 dysregulated probe sets with q-value \ 0.1, in which 69 probe sets were downregulated, and 13 probe sets were upregulated. These probe sets mapped to 80 DEGs (67 downregulated genes and 13 upregulated genes) in the hippocampus of AD patients and healthy controls (Table 2). Two downregulated genes mapped by more than one probe set (CDC42 and IGF1) implied higher confidence in the results of their expression changes.

Robustness and sensitivity of the DEGs
Jackknife cross-validation was used to validate the robustness of the findings. Each leave-out iteration resulted in a new list of DEGs (q-value \ 0.10), which was subsequently compared with DEGs obtained from the combined analysis (Supplementary Table S2). Thirty-two DEGs (30 downregulated genes and two upregulated genes) were cross-validated by the jackknife method (Table 2).

Biological classification and pathway enrichment analysis of the DEGs
By using DAVID, the 80 DEGs were significantly enriched in one KEGG pathway, GABAergic synapse, and 22 Gene Ontology (GO) terms had Benjamin-corrected p-value \ 0.05 (Table 3). For the 80 DEGs identified between the AD patients and healthy controls, the significantly enriched KEGG pathway was ''GABAergic synapses'' pathway (Benjamin-corrected p-value = 0.039882). The DEGs involved in ''GABAergic synapses'' pathway were all downregulated, suggesting that the function of GABAergic synapses may be impaired in the pathogenesis of AD.

PPI network of the DEGs
As shown in Fig. 1, a PPI network composed of 250 nodes and 497 edges was obtained. Among the 250 nodes, 70 DEGs (ten upregulated genes and 60 downregulated genes) and 180 extended genes interacting with the DEGs were observed. Notably, genes greater than ten degrees were 14 DEGs (CDC42, RBL1, GNAS, CKMT1B, CKMT1A, AMPH, ACVR1B, CNR1, SEPT6, GAD1, PDIA2, MOB4, PRC1, and ACVR2A) and one extended gene (UBC). The DEG CDC42 occupies the most central position in the network because it has the highest degree.

Ranking of the DEGs and the extended genes
To identify more reliable candidate genes from a large number of AD-related genes for subsequent experimental validation and identification of biomarkers, a prioritized list of the DEGs and the extended genes was generated using a convergent functional genomics (CFG) method (Supplementary Table S3). For the 260 candidate genes (80 DEGs and 180 extended genes), 156 genes were listed as highly AD-related candidate genes when received at least two lines of AD-related evidence (CFG score [ 1, Fig. 2). Among the 156 highly AD-relevant genes, 82 genes (25 DEGs and 57 extended genes) showed early expression alteration in the hippocampus of 2-month-old AD mice compared with agematched wild-type mice, implicating them as potential upstream regulators in AD, and 25 genes (eight DEGs and 17 extended genes) were supported by blood gene expression evidence, implicating them as potential blood biomarkers (Supplementary Table S3). In addition, 12 AD-relevant candidate genes, including five DEGs (ITGB5, RPH3A, GNAS, THY1, and SEPT6) and seven extended genes (JUN, GDI1, GNAI2, NEK6, UBE2D3, CDC42EP4, and ERCC3), exhibited early expression alteration in the hippocampus of 2-monthold AD mice and had blood gene expression evidence (Supplementary Table S3). The expression of these genes changed before the emergence of AD pathology and might be detected in the blood, suggesting that they

DISCUSSION
In this study, we investigated the mRNA expression changes in the hippocampus that were consistent across up to three independent cohorts of subjects to illustrate the pathogenesis of AD. Eighty DEGs were identified in the combined analysis, and 31 of them were validated by at least seven leave-one-out tests and confirmed by the results of the AlzBase database. The validations of the DEGs demonstrate the reliability of the results to a certain extent. Pathway enrichment analysis was performed to interpret the function of these DEGs. KEGG pathway analysis for the 80 DEGs suggested that five downregulated genes were significantly enriched in one KEGG pathway ''GABAergic synapses'' (Benjamin-corrected pvalue = 0.039882), indicating that the GABAergic synapse pathway might be impaired in AD patients. GABAergic synapses exert an inhibitory effect on the nervous system. Downregulated GABAergic synapses are intimately coupled with the loss of GABAergic inhibition (Kuzirian and Paradis 2011). A close linkage was observed between GABAergic neurotransmission and various aspects of AD pathology, including Ab toxicity (Bell et al. 2006), tau hyper-phosphorylation (Nilsen et al. 2013), and apoE4 effect (Li et al. 2009(Li et al. , 2016. Significantly lower levels of GABA inhibitory neurotransmitter (*33%) were observed in the AD cases, indicating deficient synaptic function and neuronal transmission in AD (Gueli and Taibi 2013). Furthermore, animal experiments of an AD model illustrated that impaired hippocampal neurogenesis in AD mice may be mediated by the dysfunction of GABAergic signaling or an imbalance between excitatory and inhibitory synapse (Li et al. 2009;Sun et al. 2009). Hence, the pathway of GABAergic synapses is important not only for the function of the hippocampus but also for the pathogenesis of AD. GO analysis indicated that the 80 DEGs were mainly involved in the neuron, secretory vesicle, synaptic signaling, synaptic transmission, cell junction, and synaptic vesicle metabolism. The impairment of neuronal and synaptic functions has long been considered an important pathologic characteristic in neurodegenerative diseases, and decreased synaptic activity is also considered to be the most relevant pathological feature of cognitive impairment in AD (Marttinen et al. 2015). These results demonstrate that GABAergic system, neurons, and synaptic function might be affected in the pathogenesis of AD.
To explore the protein interactions of the 80 DEGs, a PPI network extending from the DEGs was constructed, in which 180 extended genes interacting with DEGs were obtained. In the PPI network with 250 nodes and 497 edges, the 15 genes with greater than ten degrees (including CDC42, RBL1, GNAS, CKMT1B, CKMT1A, AMPH, ACVR1B, CNR1, SEPT6, GAD1, PDIA2, MOB4, PRC1, ACVR2A, and UBC) were selected, which were mainly involved in TGF-beta and Rap1 signaling pathway. Among these genes, the validated DEG CDC42 was the hub gene with the highest degree. This gene codes the protein cell division cycle 42; it is a member of the Rho guanosine triphosphatase (GTPase) family and plays an important role in cell morphology, proliferation, cell migration, and cell progression. CDC42 is reported to play a critical role in striatal neuron growth in the gene expression profiling analysis of Parkinson's disease (Gao et al. 2013). In addition, CDC42 is dysregulated in a transcriptomic meta-analysis between AD and type 2 diabetes mellitus (Mirza et al. 2014). Therefore, although CDC42 has not been reported in original articles, it may play a crucial role in the pathogenesis of AD. Furthermore, to identify key genes for early diagnosis and treatment of AD, we prioritized the DEGs and the extended genes by using the CFG method. The results suggested that 12 highly AD-relevant genes, including five DEGs (ITGB5, RPH3A, GNAS, THY1, and SEPT6) and seven extended genes (JUN, GDI1, GNAI2, NEK6, UBE2D3, CDC42EP4, and ERCC3), might be promising for evaluating early diagnostic biomarkers in AD. Among them, three genes, including two validated DEGs (ITGB5, Symbols Gene name p-value q-value Jackknife crossvalidation AlzBase 220614_s_at ADGB Androglobin 1.12 9 10 -4 5.69 9 10 -2 N N ''Jackknife cross-validation'' denotes DEGs validated by seven leave-one-out tests (Y) or not (N), ''AlzBase'' denotes DEGs identified in AlzBase database (Y) or not (N)  RPH3A) and one extended gene (JUN), had the highest CFG scores. ITGB5 encodes a beta subunit of integrin, which participates in cell adhesion and cell-surfacemediated signaling. It not only supports tumorigenesis but also enhances tumor growth (Reynolds et al. 2002). Moreover, studies have shown that dysregulated ITGB5 gene is correlated with diabetic nephropathy (Wang et al. 2016c) and might play an important role in the progression of AD. RPH3A is a small G-protein that acts in the exocytosis of neurotransmitters and hormones and is involved in neurotransmitter release and synaptic vesicle traffic. Previous studies demonstrate that the expression level of RPH3A is downregulated in different brain regions in a transgenic mouse model of Huntington disease and may be correlated with the symptoms of the neurodegenerative disorder (Smith et alet al. 2007). Although JUN was not dysregulated in the combined analysis of genome-wide expression profiling in the hippocampus, it was found to closely interact with eight genes in the PPI network, including the hub gene CDC42. JUN encodes a virus-like protein, which regulates gene expression in response to cell stimulation by interacting directly with target DNA sequences. Several studies have shown that the transcription factor JUN is essential for neuronal microtubule assembly and apoptosis (Nateri et al. 2004) and plays a very key regulatory role in the unfolded protein response in acute myeloid leukemia, which can serve as a promising therapeutic target in this disease (Zhou et al. 2017). Recently, PPI analysis of AD and non-alcoholic fatty liver disease (NAFLD) has revealed that JUN is one of the hubbottleneck proteins in the PPI network and is an important target for both AD and NAFLD (Karbalaei et al. 2018;Paquet et al. 2017).
Our study makes the statistical improvement to directly combine raw probe-level data from different studies and control several confounding factors to identify a number of best-estimated DEGs between AD patients and age-matched controls. Based on these DEGs, we further elaborated the associated biological pathways and potential biomarkers, shedding new light on the interpretation of the pathogenic mechanisms and early diagnosis underlying AD. However, given that all functional evidences were obtained via bioinformatics analysis, future independent validation studies and essential functional assays are necessary for consolidating the current conclusions and characterizing the putative impact of the candidate genes in AD.

Dataset selection
Microarray-based gene expression data of subjects with AD were obtained from Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo) and ArrayExpress (https:// www.ebi.ac.uk/arrayexpress). The eligible studies were searched with keywords ''Alzheimer's disease'' and we set organism as Homo sapiens, and array type as ''Expression profiling by array'' in GEO or ''Transcription profiling by array'' in ArrayExpress. Raw probe-level data (CEL files) that focused on gene expression profiling in the hippocampus of a cohort of neuropathological healthy subjects and a cohort of AD subjects were collected. Information on covariates, including age, gender, PMI, and batch effect, was required for this study. To avoid deviation among the different microarray platforms, only data generated from two similar Affymetrix platforms, HGU 133a (Human Genome U133A) and HGU 133p 2.0 (Human Genome U133 Plus 2.0), were finally used. Ultimately, after removing the duplicate individuals, raw data from three independent datasets were retained.

Data quality control and pre-processing
To reduce the bias due to different analytical methods, each individual dataset was reprocessed and normalized independently using the R Bioconductor affy package with the default settings for robust multi-array average (RMA) normalization (Irizarry et al. 2003). All data were background-adjusted, normalized, and logtransformed. The microarrays were assessed for data quality using the SimpleAffy package (Wilson and Miller 2005). The scale factor, average background, percent present, the 3 0 /5 0 intensity ratio of GAPDH, and the 3 0 /5 0 intensity ratio of beta-actin provided by SimpleAffy were all evaluated to determine the quality of the RNA samples and their subsequent labeling and hybridization. The default values were selected according to the recommendations of Affymetrix, SimpleAffy, and Larsson and Sandberg (Larsson and Sandberg 2006). Samples beyond the default values were suspected as unqualified and were subsequently identified by RLE and NUSE boxplot using the affyPLM package (Bolstad et al. 2005). If samples in the RLE and NUSE plots were far from 1, they were removed. With regard to the probe sets, only the intersection of probe sets from the two Affymetrix platforms was utilized. Ultimately, data from

Statistical modeling
The expression and sample characteristics from each study were merged, and the gene expression value for each probe was calculated using a standard linear mixed-effects model. Statistical modeling was conducted by the lme4 package in R (Bates et al. 2015). In the combined analysis, disease, age, and PMI were used as fixed effects, whereas gender and batch effect were used as random effects (Wang et al. 2016a (Storey and Tibshirani 2003), and a more permissive q-value \ 0.10 was used to gain more transcripts for comparison and further analysis. Then, the AnnotationDbi package (http:// www.bioconductor.org/packages/release/bioc/html/ AnnotationDbi.html) was utilized to annotate the probe sets with gene symbols, EntrenzIDs, and gene names, thereby obtaining DEGs between the AD patients and age-matched healthy controls.

Validation of the identified DEGs
To test the robustness of the findings, a jackknife crossvalidation was used to conduct the ''leave-one-out'' test.
The procedure was that the data from the total samples, the cases or the control samples in each dataset, were sequentially removed, and the same analysis procedure was applied for the remaining data. Next, the results were compared with the findings of the combined analysis to explore the overlapping genes. This step helped to identify the most important genes that were not dependent on a single study, as well as each study's contribution to the final results.
To assess the sensitivity of DEGs, the findings were contrasted with the results of the AlzBase database (http://alz.big.ac.cn/). AlzBase is an integrative database for dysregulated genes in AD pathogenesis that are discovered from studies with animal models or neuronal cell lines (Bai et al. 2016). This database collects the frequency of the dysregulated genes compiled from published studies of the AD brain transcriptome, which may have a high priority to be pursued further. Our signatures were compared with the genes from the AlzBase database for a better understanding of our results.

Functional and network analysis
To identify the functional categories and biological processes in the hippocampus, we performed pathway analyses using DAVID (version 6.8) (da Huang et al. 2009). The KEGG pathways and GO terms (including cellular component, biological process, and molecular function) were selected in the enrichment analysis, and after a Benjamin multiple test correction, the cutoff of the significance was set to q \ 0.05. The gene interaction network among these DEGs was constructed from the largest PPI database, InWeb_InBioMap (version 2016_09_12, https://www.intomics.com/inbio/map/ #downloads) (Li et al. 2017). With DEGs as the seed genes, the extended genes were introduced on the basis of the experimentally validated interaction in the InWeb_InBioMap database. Each extended node gene was required to have at least two direct interactions with the DEGs. The DEGs and the extended genes were mapped into the PPI network to explore the molecular mechanism of AD. The biological graph visualization tool Cytoscape (version 3.5.1, http://www.cytoscape. org/) software was used to visualize the PPI networks. In the PPI network, the number of genes directly linked to a node was defined as the degrees of the node, and the node with the higher degree (degree [ 10) was defined as the hub gene.

Gene prioritization of the candidate genes
AlzData (http://www.alzdata.org/) integrates five lines of evidence associated with AD, including GWAS, PPI, brain expressional quantitative trait loci, expression correlation with AD pathology in AD mice, and early alteration in 2-month-old AD mouse brain, to prioritize candidate genes for further characterization using a CFG method (Xu et al. 2018). In addition to the evidence collected in the AlzData database, we explored the expression pattern of the blood of AD patients. Dysregulated genes between the blood of AD patients and aged-matched healthy elderly controls were also collected from the results of publications (Chen et al. 2011;Fehlbaum-Beurdeley et al. 2010;Maes, et al. 2007;Sood et al. 2015). The DEGs and the extended genes obtained from the PPI network were scored based on the evidence collected from the AlzData database (CFG score). One point would be assigned if the gene was supported by the above-mentioned evidence. Otherwise, the gene would be assigned zero points. The CFG score of each gene ranged from 0 to 6 points; the higher the score, the more promising the gene is.