Background

Alzheimer’s disease (AD) is the most common form of dementia, characterized by a gradual loss of cognition and memory. AD is expected to affect around 78 million older adults by 2030 [1]. Diagnosis of probable or possible AD is based on symptom presentation and neuropsychological testing according to NINCDS-ADRDA (National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Diseases Association) criteria [2]. Although neuroimaging and biofluid-based biomarkers exist for diagnosis in living patients, they are invasive, not widely accessible and not amenable for definitive diagnosis [3]. On the other hand, blood tests are less invasive, facilitate screening and early diagnosis, and confer significant cost benefits [3]. Therefore, blood-based biomarker discovery has become a key area of clinical research in AD [4].

Disease-related genes are commonly identified using genome-wide association studies (GWAS) [5], in which several million single nucleotide polymorphisms (SNPs) are tested for association with a specific trait or disease. While APOE ε4 has been established as a significant risk factor for AD susceptibility, AD is considered a complex trait in which a combination of genetic and environmental factors influences disease pathology [5]. Several genes, including CR1, PICALM, ABCA7, HLA-DRB5/HLA-DRB1, and SLC24A4/RIN3 have been identified as AD susceptibility loci by GWAS, and are associated with inflammation, immune response, lipid metabolism and intracellular trafficking [5]. Nevertheless, AD-associated variants (except APOE ε4) have small effect sizes and are not sufficient to explain a large component of genetic heritability, which is estimated to contribute up to 80% of the observed disease phenotype [6].

Machine learning (ML) methods have greater power to capture interactions between single variants and genes, which may explain heritability to a greater extent. For instance, Segura et al. [7] showed that tree-based methods prioritized SNPs located in the genes PVRL2, TOMM40, APOE, and APOC1. Genomic profiles showed interactions between specific SNPs in both the UK BioBank and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) datasets. The telescopic ML-based GWAS strategy proposed by Squillario et al. [8] integrated results at the SNP, gene, and pathway levels of information. The authors identified TOMM40 and GRM7 as strongly associated with APOE ε4 status at the SNP, gene and pathway levels, thus providing support for cumulative polygenic susceptibility to AD. Given that AD pathology is associated with systemic changes reflected in other parts of the body [9,10,11], we hypothesize that ML models are better suited to identify blood-based markers for AD.

Furthermore, recently published artificial intelligence (AI) models have focussed on interpretability, such as the deep learning framework proposed by Qiu et al. [12], which provides a disease probability map as an intermediate output after training a fully convolutional network on magnetic resonance imaging (MRI) data. The disease probability map is fed into a multilayer perceptron (MLP) for binary classification along with clinical information such as age, gender and mini-mental state examination (MMSE) scores. Deep learning models in cancer have incorporated biological information in the form of pathway constraints from KEGG (Kyoto Encyclopedia of Genes and Genomes) [13] and Reactome [14] to increase prediction performance and enhance interpretability. Moreover, several studies [15,16,17,18] have employed SHapley Additive exPlanations (SHAP) [19] to understand model prediction and derive insights into disease markers and pathology.

Therefore, in our study, we present c-Diadem, a constrained dual-input AI model for Alzheimer’s disease which incorporates KEGG pathway constraints to accurately predict disease status. We used genomic and expression data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [20] for training, validation, and testing. We further explain and interpret the model using SHAP to identify novel, potential genetic markers of AD.

Methods

We have summarized our data preprocessing, model development and interpretation steps in a flow diagram in Fig. 1. The datasets and methods have been described in detail in the following sections.

Fig. 1 Flow diagram showing an overview of data preprocessing, model development and interpretation

Alzheimer’s Disease Neuroimaging Initiative (ADNI)

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. For up-to-date information, see www.adni-info.org. In addition to MRI and PET neuroimaging of patients at regular intervals, ADNI has collected and analyzed whole blood samples for genotyping and gene expression analysis. Table 1 provides a summary of the genotyping data provided by ADNI. Blood gene expression profiling was conducted using Affymetrix Human Genome U219 Array for 744 samples in the ADNI2 and ADNI-GO (ADNI-Grand Opportunity) phases [20].

Table 1 ADNI genotyping data summary

Data preprocessing

Genotyping data for ADNI1, ADNI2/ADNIGO and ADNI3 are available for 1877 participants in total. Of these 1877 participants, 626 from the ADNIGO/ADNI2 phase also provided whole blood samples for the gene expression assay. Thus, these 626 patients were selected for our study and their genotyping data were used as the genomic data source for our model. All SNPs have been identified using the hg18 build and therefore, the input data does not contain any missing information. SNP data for the selected patients were downloaded in PLINK binary format, consisting of .bed, .bim and .fam files. The .bed file is the primary representation of genotype calls of biallelic variants. The .bim file accompanies the .bed file and provides extended variant information, i.e., SNP IDs, base-pair coordinates, and the minor and major alleles. The .fam file provides sample information, including parent IDs and phenotype. After minor allele frequency (MAF) filtering, we utilized the .bim file to identify the minor allele and encoded SNPs using additive representation, i.e., the number of copies of the minor allele (0 = homozygous for the major allele, 1 = heterozygous, 2 = homozygous for the minor allele). The dbSNP ID was used as the unique identifier for SNPs. Then, coding SNPs (which occur in coding sequences, or CDS) were mapped to their corresponding gene loci. Input values represent the aggregate additive value of all SNPs mapped to the coding sequences of individual genes. The gene expression data merged with the SNP data matrix represents the model inputs. For the development of a constrained model, KEGG pathway data was used to populate the constraints matrix.
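The additive encoding and gene-level aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the SNP IDs, alleles, genotypes and SNP-to-gene map are hypothetical, and the helper names `encode_additive` and `aggregate_by_gene` are introduced here only for the example.

```python
from collections import defaultdict

def encode_additive(genotype, minor_allele):
    """Count copies of the minor allele in a two-letter genotype (0, 1 or 2)."""
    return sum(1 for allele in genotype if allele == minor_allele)

def aggregate_by_gene(sample_genotypes, minor_alleles, snp_to_gene):
    """Sum the additive codes of all coding SNPs mapped to each gene."""
    gene_scores = defaultdict(int)
    for snp_id, genotype in sample_genotypes.items():
        gene = snp_to_gene.get(snp_id)
        if gene is None:  # SNP not mapped to a coding sequence: skipped
            continue
        gene_scores[gene] += encode_additive(genotype, minor_alleles[snp_id])
    return dict(gene_scores)

# Hypothetical toy data (not taken from ADNI)
minor_alleles = {"rs0001": "A", "rs0002": "T", "rs0003": "G", "rs0004": "C"}
snp_to_gene = {"rs0001": "GENE_A", "rs0002": "GENE_A", "rs0003": "GENE_B"}
sample = {"rs0001": "AG", "rs0002": "TT", "rs0003": "CC", "rs0004": "CC"}

gene_input = aggregate_by_gene(sample, minor_alleles, snp_to_gene)
print(gene_input)
```

In this toy case, GENE_A receives the sum of one heterozygous and one minor-homozygous SNP, while the unmapped SNP does not contribute to any gene.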

We applied a train/test/validation split of 56–30–14%. The dataset consists of 212 CN, 317 MCI and 97 AD samples, so the classes are imbalanced. We therefore balanced the data using target stratification and the Synthetic Minority Oversampling Technique (SMOTE). SMOTE boosted model performance compared to other oversampling techniques such as ADASYN (adaptive synthetic), SVMSMOTE (support vector machines SMOTE) and borderline SMOTE (oversampling limited to borderline cases). Moreover, SMOTE has been used previously in predicting Alzheimer’s disease using MRI images [21]. Therefore, SMOTE was the preferred strategy for addressing data imbalance in our study. Finally, sample labels were converted to 2 × 1 vectors using one-hot encoding.
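The core idea of SMOTE, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be illustrated as follows. This is a simplified sketch, not the implementation used in the study (the source does not state which library was used); the function name `smote_sketch`, the value of `k`, the seed and the toy data are all assumptions.

```python
import random

def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Nearest neighbours of `base` within the minority class (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority class
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_synthetic=3)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two real minority samples, oversampling of this kind is normally applied to the training split only, so that the validation and test sets contain real samples.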

AI model design

We used Python v3.8 with the Functional API of keras v2.4.3 to design and build the constrained and unconstrained neural networks. The constrained model topology is shown in Fig. 2a. The input data for our model includes genotyping data for 5188 mapped genes and gene expression data for 19,403 genes. The SNP data and gene expression data were provided as separate inputs. The edges between the genotyping input layer and the pathway layer were constrained using prior biological information from KEGG pathway datasets [22]. KEGG data from the Molecular Signatures database (MSigDB) [23] was used to create constraints encoded as a binary weights matrix which sets all non-existent connections between genes and pathways to zero. Therefore, the edges from the input genes to unrelated pathways were set to 0 and remained constant during training. The genes were mapped to 186 KEGG pathways, which form the second layer of the constrained model. Then, the pathway features were transformed using a 1D convolutional layer (kernel size 12) before being concatenated with the gene expression input. The concatenated data was passed through a batch normalization layer and three hidden layers. The output layer contains two nodes representing the binary outcomes of CN and MCI/AD. Softmax activation was used to convert the output into the respective CN and MCI/AD probabilities, with the higher probability used for classification of disease status. The hyperparameters are summarised in Table 2.
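The pathway constraint can be pictured as a binary gene-by-pathway mask that is multiplied element-wise with the layer weights. The sketch below shows this idea in plain Python on a toy example; it is not the Keras layer used in the study, and the gene names, pathway names, weights and the helper names `build_mask` and `masked_dense` are hypothetical.

```python
def build_mask(genes, pathways, membership):
    """Binary genes x pathways matrix: 1 if the gene belongs to the pathway."""
    return [[1 if g in membership[p] else 0 for p in pathways] for g in genes]

def masked_dense(x, weights, mask):
    """Pathway activations x @ (weights * mask), without bias or activation."""
    n_genes, n_pathways = len(mask), len(mask[0])
    return [
        sum(x[i] * weights[i][j] * mask[i][j] for i in range(n_genes))
        for j in range(n_pathways)
    ]

# Hypothetical toy example: 3 genes, 2 pathways
genes = ["GENE_A", "GENE_B", "GENE_C"]
pathways = ["PATHWAY_1", "PATHWAY_2"]
membership = {"PATHWAY_1": {"GENE_A", "GENE_B"}, "PATHWAY_2": {"GENE_C"}}

mask = build_mask(genes, pathways, membership)
weights = [[0.5, 0.9], [-1.0, 0.3], [0.7, 2.0]]  # arbitrary weights
x = [3, 0, 1]                                     # gene-level SNP inputs

pathway_activations = masked_dense(x, weights, mask)
print(mask)
print(pathway_activations)
```

Because the mask multiplies the weights in every forward pass, a masked connection contributes nothing to the pathway node and receives no gradient, which is one way to keep gene-to-unrelated-pathway edges at zero throughout training.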

Table 2 Model hyperparameters

Performance evaluation

The predictive performance of the model was evaluated on the validation and test sets using area under the curve (AUC), accuracy, precision, recall and F1 score. The best-performing weights were chosen using a callback on validation loss. The AUC, accuracy, precision, and recall were calculated for each epoch, for both validation and training. The formulae for the performance metrics are provided below.

$$\begin{array}{c}Accuracy=\frac{TP+TN}{TP+FN+TN+FP}\\Precision=\frac{TP}{TP+FP}\\Recall=\frac{TP}{TP+FN}\\F1\ score=2\cdot\frac{Precision\cdot Recall}{Precision+Recall}\end{array}$$

Here, true positives (TP) refer to the number of correctly predicted MCI/AD cases. True negatives (TN) refer to the number of correctly predicted CN cases. False positives (FP) refer to the CN cases incorrectly predicted as MCI/AD. False negatives (FN) refer to MCI/AD cases incorrectly predicted as CN.
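As a worked example, the metrics can be computed directly from the four confusion-matrix counts using their standard definitions. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function name `classification_metrics` is introduced for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, with MCI/AD as the positive class
metrics = classification_metrics(tp=80, tn=40, fp=20, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The guards against zero denominators are a design choice for the sketch; they return 0.0 when a metric is undefined, for example when the model predicts no positive cases.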

Model interpretation with SHAP

Model interpretation is essential to gain user trust and overcome the ‘black box’ reputation of deep learning models. Lundberg and Lee [19] proposed SHAP values as a unified measure of feature importance, computed using game theory. SHAP scores were computed using the Python shap package (v0.39.0) to identify genes which could potentially be considered biomarkers for the diagnosis of prodromal and advanced Alzheimer’s disease. The top twenty selected genes were then analyzed using STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) [24] to identify enriched pathways and their role in Alzheimer’s disease.
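To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a three-feature toy model by enumerating all feature coalitions. It illustrates the quantity that SHAP approximates and is not the shap package's algorithm; the toy model, the all-zero baseline and the function name `shapley_values` are assumptions made for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        return model([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy model with an interaction between features 0 and 1
def toy_model(v):
    return 2.0 * v[0] + v[0] * v[1] + 3.0 * v[2]

x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the model output for `x` and for the baseline, and the interaction term is shared equally between the two interacting features. Exact enumeration grows exponentially with the number of features, which is why approximations are needed for models with thousands of inputs.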

Statistical analysis

Differences in clinical features between CN and MCI/AD subjects were analyzed using statistical tests for significance. We performed the analysis of variance (ANOVA) test on the age of onset and years of education. We used the Chi-square test for differences in the proportion of male and female participants and the presence of the APOE ε4 allele. Student’s t-tests were utilized for the neuropsychological test scores, namely the MMSE (Mini-Mental State Examination) and CDRSB (Clinical Dementia Rating – Sum of Boxes) scores.
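For a 2 × 2 comparison such as carrier status by diagnostic group, the Chi-square statistic can be computed by hand as shown below. This is a sketch of the Pearson test without continuity correction; the source does not say whether a correction was applied, and the counts in the table are hypothetical rather than taken from the cohort.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: rows = two groups, columns = carrier / non-carrier
table = [[20, 30], [30, 20]]
chi2, p_value = chi_square_2x2(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The closed-form p-value holds only for one degree of freedom, which is the case for any 2 × 2 table.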

Results

We developed a constrained deep learning binary classifier which was trained and tested on genotyping and gene expression data from 626 ADNI participants, with a train-test-validation split of 56–30–14%. The data was imbalanced, with 212 CN participants and 414 patients with MCI/AD. Therefore, we applied target stratification and SMOTE to balance the training dataset. The inputs for our model include 5188 genes mapped from SNPs and gene expression data for 19,403 genes.

Clinical cohort characteristics

The clinical characteristics of the CN and MCI/AD groups are summarized in Table 3. The CN and MCI/AD groups differ significantly in terms of mean age (p = 0.001) and proportion of female participants (p = 0.015). Both groups have similar years of education. We also observe a higher proportion of MCI/AD patients with the APOE ε4 allele compared to CN subjects (p < 0.001). As expected, MCI/AD patients show significantly worse performance in neuropsychological tests such as CDRSB (p < 0.001) and MMSE (p < 0.001).

Table 3 Clinical cohort characteristics

Model development

The constrained deep learning model is a parsimonious feed-forward neural network with 14 layers, which uses SNPs and microarray data from ADNI. The topology of the model is depicted in Fig. 2a. The model was used to predict whether a given patient was cognitively normal (CN) or was in the AD spectrum (either the prodromal stage of MCI or advanced AD). The SNP inputs were constrained using KEGG pathway information before concatenation with gene expression data and passed through dense, hidden layers.

Fig. 2 Model topology and performance. a Model topology for the unconstrained and constrained models. Pathway constraints are applied to the SNP input data in the constrained model. b Receiver-operating characteristic (ROC) curves show increased performance (area under the curve (AUC) = 0.70) for the constrained model. TPR and FPR stand for True Positive Rate and False Positive Rate, respectively

The model was trained with an early stopping callback that monitored validation loss to prevent overfitting. The maximum number of epochs was set at 80 with a default batch size of 32. For comparison, we also developed an unconstrained network with 9 layers, including an embedding layer of size (186, 3), depicted in Fig. 2a.
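The logic of early stopping on validation loss can be sketched independently of any framework. The example below is an illustration only: the `patience` value and the validation-loss curve are assumptions (the source gives only the maximum of 80 epochs), and `early_stopping_epochs` is a hypothetical helper rather than a Keras function.

```python
def early_stopping_epochs(val_losses, max_epochs=80, patience=5):
    """Return (best_epoch, last_epoch) for a per-epoch validation-loss curve."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # In a real training loop the model weights would be saved here
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; the saved best weights are kept
    return best_epoch, last_epoch

# Hypothetical validation-loss curve
losses = [0.70, 0.66, 0.64, 0.65, 0.66, 0.67, 0.63, 0.68]
result = early_stopping_epochs(losses, patience=3)
print(result)
```

With a patience of 3, training in this toy curve stops before the later, lower loss is reached, which shows the trade-off involved in choosing the patience value.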

Model performance

We evaluated the ability of the constrained model to classify the patients as MCI/AD or CN. Our model, c-Diadem, achieved an accuracy of 0.69 and an AUC of 0.70 on the test dataset. The inclusion of KEGG pathway constraints increased model accuracy and AUC, as shown in Fig. 2b. The performance metrics of our model are compared with other classifiers in Table 4. Our model shows a higher F1 score (0.69) than the SNP-only deep model (F1 score = 0.53). The F1 score is the harmonic mean of precision and recall (sensitivity). With SNP data alone, our model shows an accuracy of 0.64 and an AUC of 0.67 (F1 score = 0.64). The accuracy and AUC of the model improve with the inclusion of gene expression data. Therefore, combining SNP and gene expression inputs helps our model outperform previous models (accuracy = 0.69, AUC = 0.70, F1 score = 0.69). An AUC of 0.70 or more indicates a moderate-to-high predictive ability for models of dementia risk [25].
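The AUC reported here has a simple probabilistic reading: it is the probability that a randomly chosen MCI/AD case receives a higher predicted score than a randomly chosen CN case. The sketch below computes it by comparing every positive-negative pair; the labels and scores are hypothetical, and `roc_auc` is a helper written for this example rather than a library call.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = MCI/AD, 0 = CN) and predicted probabilities
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a cohort of this size; rank-based formulations give the same value more efficiently.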

Table 4 c-Diadem performance metrics compared with current models

Feature importance using SHAP

To determine the relative importance of genetic features (genotyping and gene expression data), we computed SHAP values in the constrained model for the respective inputs. Figure 3 provides SHAP values for the top twenty features which have the highest impact on model classification. Our results show that SNPs in PRKCZ, PLCB1 and ITPR2 are considered important for prediction of disease status (both MCI and AD). On the other hand, the expression of HLA-DQB1, EIF1AY, HLA-DQA1, and ZFP57 has more predictive value compared to the expression of other genes. The genes selected by SHAP may be considered as potential biomarkers.

Fig. 3 SHAP results based on ADNI (a) genotyping and (b) gene expression data, computed using the constrained model

We analyzed the interaction network formed by SHAP-identified genes using the STRING database. From a total of 40 genes, 34 genes were used to construct the interaction network. Six genes did not have a representative protein and so were not included in the network (LOC100653194, EIF1AY, HLA-DRB4, ENSG00000206239, MGC39372, LOC100505946). The resulting network had significantly more edges than expected (protein–protein interaction (PPI) enrichment p-value = 0.00247), which indicates biological relationships among the genes. The network was further clustered into subnetworks using k-means clustering. The subnetwork with the most significant enrichment (p = 5.56 × 10⁻⁶) was associated with pathways involved in the release of Ca2+ ions into the cytosol (p = 0.0016), as well as pathways commonly associated with long-term depression (false discovery rate (FDR) = 3.77 × 10⁻⁵) and salivary secretion (FDR = 4.55 × 10⁻⁶). The results of the pathway enrichment analysis are summarized in Table 5. The interaction network with clusters (colored red, green and blue) is represented in Fig. 4.

Table 5 Pathway enrichment for gene markers by SHAP

Fig. 4 STRING network showing colored clusters

Discussion

Clinical diagnosis through blood sampling would be preferable to cerebrospinal fluid (CSF) sampling, as it is less invasive for patients and amenable for longitudinal monitoring. However, there are several challenges to testing for conventional biomarkers in blood samples. One major issue in blood sampling as a diagnostic tool is the high background of plasma proteins such as albumin and immunoglobulin against which extremely low levels of amyloid-β and tau proteins would need to be quantitated. The low levels of these proteins may be further subject to metabolization and clearing by physiological processes [29]. Moreover, α-synuclein is highly expressed by red blood cells, precluding its use as a blood-based biomarker [30]. Thus, we have implemented a constrained AI model to probe genomic and gene expression data from ADNI for alternative blood-based biomarkers.

Recent studies indicate that the development of Alzheimer’s disease is associated with systemic changes in the neuronal environment reflected in other parts of the body. Studies in peripheral blood mononuclear cells (PBMCs) from amnestic MCI and AD patients show differential expression of senescence markers, such as cell cycle blockade (p16 and p53), DNA damage response (γH2AX) and proinflammatory IL-6/-8 mRNA levels [9]. Garfias et al. [10] have reported significantly higher levels of activated lymphocytes in AD patients. Moreover, a gene expression analysis of PBMC samples in the AddNeuroMed cohort [31] identified differentially expressed genes (DEGs) significantly enriched in pathways related to T cell and neutrophil activation in immune response, lymphocyte differentiation, protein serine/threonine kinase activity, GTPase and DNA transcription factor binding [11, 32].

Our findings indicate that the number of SNPs located in the ITPR1, PLCB1, DMD, ITPR2 and RYR3 genes is indicative of the dysregulation of processes related to Ca2+ sequestration and release. Increased calcium loads in the cytosol can lead to the formation of mitochondrial pore complexes and, consequently, apoptosis. Increased Ca2+ influx and mitochondrial sequestration are observed in CA1 (cornu ammonis 1) neurons in the hippocampus, which are said to be selectively vulnerable to neurodegeneration in Alzheimer’s disease [33]. The Ca2+ influx is facilitated by NR2B overexpression and lower levels of calcium-binding proteins (CBPs), among other factors [34]. Using our model, we found markers of a key event (Ca2+ homeostasis dysregulation) in data extracted from patient blood samples. We also observe that pathways related to long-term depression are enriched in our study. Depression is a known risk factor for cognitive decline [35] and up to 50% of AD patients are known to suffer from depression [36]. Moreover, among the selected genes, HLA-DQA1 has been reported previously as a risk factor in late-onset Alzheimer’s disease using GWAS [37] and differential expression analysis [38].

It is important to note that our model is limited by the pathway constraints derived from our dataset of choice, KEGG. The genotyping input data of our model has been populated by coding SNPs. Coding SNPs were utilized as they are easily mapped to genes and their respective pathways. Nonetheless, we believe our model’s performance can be enhanced by the inclusion of non-coding SNPs as well as other types of -omics data. We also recognize that the potential genetic biomarkers identified by our study require experimental validation. We hope to explore biomarkers at different stages of Alzheimer’s by refining and training the model to distinguish between MCI and AD patients in a future study. Refining the model and addressing its limitations to enhance performance will help strengthen the validity of our findings.

Conclusion

The development of alternative biomarkers in Alzheimer’s is foundational for developing less-invasive diagnostics as well as breakthroughs in drug development. Therefore, we have developed a constrained, explainable deep learning model incorporating biological information to predict the disease status of patients. Further, we used SHAP to identify potential biomarkers, which are associated with pathways known to be dysregulated in Alzheimer’s. Our method can thus be used to drive research in drug development and diagnostics for Alzheimer’s disease and other dementias.