Introduction

Alzheimer’s Disease (AD) is a neurodegenerative disease [1] and the most frequent cause of dementia. As the number of dementia patients increases continuously, AD is a globally growing health problem [2]. Currently, there is no causal therapy to cure AD [1]. To recruit and monitor subjects for therapy studies, it is important to identify patients at risk to develop AD early and to develop preclinical markers. Subjects with cognitive impairments that do not interfere with everyday activities are considered as having Mild Cognitive Impairment (MCI) due to AD [3]. The risk to develop AD is increased for subjects with MCI in comparison to cognitively normal controls (CN). However, not all subjects with MCI prospectively convert to AD. One possibility for early AD detection is to find patterns distinguishing between progressive MCI subjects (pMCI) who will develop AD and subjects with stable MCI (sMCI).

Multiple Machine Learning (ML) workflows were implemented for this differentiation. Some used models like Decision Trees (DTs) or Logistic Regression (LR), which were interpretable by design. However, black-box models like eXtreme Gradient Boosting (XGBoost) [4], Random Forests (RFs) [5], or Convolutional Neural Networks (CNNs) [6] often outperform those models. Black-box models are designed to identify highly complex associations and are challenging to interpret. Thus, the risk of learning spurious decision functions caused by patterns occurring in the training dataset is increased for black-box models [7].

This research is an extended version of earlier work [8] and thus expands the previously developed ML workflow. The previously developed workflow enabled the interpretation of black-box models based on model-agnostic Shapley values. Shapley values give individual explanations for the prediction of each subject and visualize complex relationships between features and model predictions. In this research, the previous experiments are expanded using three AD datasets and three adjusted feature sets. In addition to the previously trained tree-based models, Support Vector Machines (SVMs) [9] and LR models were implemented and explained. In this work, Shapley-based explanations were compared to classical feature importance methods, absolute log odd’s ratios, and permutation importance.

In comparison to previous work [8], an improvement of the classification results for ADNI and AIBL was achieved for the differentiation between Cognitive Normal (CN) controls and MCI subjects as well as for MCI vs. AD classification and models trained without cognitive test scores and validated for AIBL. Additionally, the ADNI and AIBL results achieved for sMCI vs. pMCI classification, trained with cognitive test scores, outperformed previous work.

This article is structured as follows: In “Related Work”, related work is described. Section “Materials and Methods” introduces the datasets and methods used to implement the ML workflow and the details of the experiments. Section “Results” elaborates on the experimental results. Those results are discussed including the mentioning of limitations in “Discussion”. Finally, “Conclusion” concludes the overall work.

Related Work

Interpretable ML was developed to explain black-box models [10]. As the heterogeneous etiology of AD is not completely understood yet, interpretability is important and enables the validation of the biological plausibility of ML models. Recently, some studies have used interpretable ML in AD detection.

For example, Long Short-Term Memory- (LSTM-) [11] based Recurrent Neural Networks (RNN) [12] were trained to classify CN vs. MCI subjects in [13]. The experiments included multiple techniques to fuse socio-demographic and genetic data with Magnetic Resonance Imaging (MRI) scans. The resulting models were evaluated for two AD datasets—the AD subset [14] of the Heinz Nixdorf Risk Factors Evaluation of Coronary Calcification and Lifestyle (RECALL) (HNR) [15] (61 MCI and 59 CN) and 624 subjects (397 MCI, 227 CN) of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [16] study phase 1. To visually explain individual model decisions, Gradient-weighted Class Activation Mapping (Grad-CAM) [17] was used. A focus on biologically plausible regions was observed.

Four heatmap visualization methods—sensitivity analysis [18], guided backpropagation [19], occlusion [20], and brain area occlusion inspired by [21]—were compared for 3D-CNNs in [22]. The CNN models were trained using 969 MRI scans of 344 ADNI subjects (151 CN, 193 AD). However, it was unclear whether the described workflow ensured independent training and test sets using multiple scans per subject [23]. Thus, the Cross-Validation (CV) accuracy of \(77\,\%\pm 6\,\%\) might be affected by data leakage. All heatmaps focused on AD-related anatomical brain areas.

An interpretable deep learning model consisting of a Generative Adversarial Network [24] to extend the training dataset, a regression network to generate feature vectors from adjacent visits, and a classification model was introduced in [25]. First, the regression model iteratively estimated the feature vector at the following visit. The resulting feature vector was used as input for the classification model, which predicted the final diagnosis. To classify 101 pMCI vs. 115 sMCI ADNI subjects, longitudinal volumetric MRI features were used. The model outperformed SVMs and artificial neural networks.

A new interpretable model based on distinct weighted rules was introduced in [26] and evaluated for 151 subjects (97 AD and 54 CN) of the ADNI cohort. The framework is called Sparse High-order Interaction Model with Rejection option (SHIMR) and consists of two hierarchical stages. In the first stage, the interpretable model was trained using plasma features. The data of subjects with an unclear prediction in this stage were propagated to the second stage. In this stage, an SVM [9] was trained using invasive Cerebrospinal Fluid (CSF) markers. The evaluation included both CV and an independent test set. The described model reached an Area Under the Receiver-Operating characteristics Curve (AUROC) of 0.81 for the test set.

SHapley Additive exPlanations (SHAP) [27] were used in [28] to explain differences in models trained using coreset selection methods. The idea was to determine coresets of subjects with the most informative data. RF and XGBoost models were trained on these coresets to avoid overfitting and improve ML models. The results of Data Shapley [29] coreset selection were compared to Leave-One-Out [30] selection and random exclusion. All models were trained and validated for the ADNI dataset (400 sMCI, 319 pMCI) and externally validated for a subset of the AIBL dataset (16 sMCI, 12 pMCI). SHAP summary plots showed that models trained for both the entire training set and the coreset learned biologically plausible associations.

To examine the predictive influence of \(\upbeta\)-amyloid plaques, tau tangles, and neurodegeneration during the disease progression, RF feature importance was used in [31]. The experimental data included 405 ADNI subjects (148 CN, 147 MCI, 110 AD). \(\upbeta\)-amyloid Positron Emission Tomography (PET) detected \(\upbeta\)-amyloid plaques, invasive CSF features surrogated tau tangles, and MRI and Fluorodeoxyglucose (FDG) PET scans were used to determine neurodegeneration. The experimental results showed that models trained to classify the early AD stages preferred features representing tau tangles and \(\upbeta\)-amyloid plaques. Models trained to predict later stages favored surrogates for neurodegeneration. SHAP [27] and Gradient Tree Boosting (GTB) [32] reproduced those observations. The RF and the entire feature set reached accuracies of 73.17 % (CN vs. MCI), 71.01 % (MCI vs. AD), and 90.34 % (CN vs. AD).

SHAP values were also used in [33] to explain population-based and individual predictions of XGBoost models and RFs. Models were trained using socio-demographic and lifestyle factors to predict the patient’s risk to develop AD based on medical history. Transfer learning applied information extracted from the Survey of Health, Ageing, and Retirement in Europe (SHARE) [34] (80,699 CN, 4,157 AD) to the PREVENT cohort [35] (109 subjects with high risk to develop AD, 364 subjects with low risk). The PREVENT cohort was younger than the SHARE cohort. The models support the hypothesis that age is the most important risk factor in AD detection. Consistent with previous research [36], among other factors, less education, physical inactivity, diabetes, and infrequent social contact were identified as potential risk factors.

A two stage-based classification workflow that used SHAP values to interpret RFs was developed in [37]. In the first stage, CN vs. MCI vs. AD classification was performed. The second stage implemented the differentiation of sMCI and pMCI subjects. The models were based on multiple modalities including MRI, PET, CSF biomarkers, cognitive tests, medical history, genetics, and many more. The RFs were trained and tested using 1,048 subjects (294 CN, 254 sMCI, 232 pMCI, and 268 AD) of the ADNI dataset. For CN vs. MCI vs. AD classification, the model almost exclusively selected cognitive test scores as the most important features. The model learned bad cognitive test results increased the risk of AD and MCI. The most important features for sMCI vs. pMCI classification also were cognitive test scores followed by PET and MRI features. Bad cognitive test scores, small MRI volumes, and small PET uptakes were associated with disease progression (Table 1).

Table 1 Summary of the related work

Materials and Methods

The ML workflow, implemented using the programming language Python v3.6.9 [38], is shown in Fig. 1. It enables the interpretation of black-box models trained to detect early AD. In the following, the workflow and the methods used for implementation are elucidated.

Fig. 1
figure 1

Implemented ML workflow. Volumetric features were extracted for one baseline (BL) MRI scan per subject. The ADNI dataset was randomly split into an 80 % training and 20 % test set. The most important MRI features were selected using forward feature selection, and those were concatenated with socio-demographic features, number of ApolipoproteinE\(\epsilon\)4 (ApoE\(\epsilon\)4) alleles, and cognitive test scores. Bayesian optimization implemented hyperparameter-tuning. Black-box RFs, XGBoost models, LR models, as well as polynomial and radial SVMs, were trained and validated. Shapley values were calculated for black-box model interpretation. An evaluation was performed for the independent ADNI test set and for the external AIBL and OASIS datasets

Datasets

Data used in the preparation of this article were obtained from the ADNI [16], the AIBL [39], and the OASIS [40] cohorts.

ADNIFootnote 1 was launched in 2003 as a public-private partnership. The primary goal of ADNI is to test whether a combination of biomarkers can measure the progression of MCI and AD. Those biomarkers include serial MRI, PET, biological markers, as well as clinical and neuropsychological assessments. The ongoing ADNI cohort recruited subjects from more than 60 sites in the United States and Canada and consists of four phases (ADNI-1, ADNI-2, ADNIGO, and ADNI-3). The subjects were assigned to three diagnostic groups. CNs have no problems with memory loss. Subjects with AD meet the criteria for probable AD defined by the National Institute of Neurological and Communicative Disorders and Stroke–Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [41]. The diagnostic criteria of ADNI were explained in [16]. The dataset was downloaded on 27 Jul 2020 and initially included 2,250 subjects.

Table 2 ADNI demographics at BL. The mean (\(\bar{x}\)) and standard deviation (\(\sigma\)) are given for all continuous variables

AIBLFootnote 2 is the largest AD study in Australia and was launched in 2006. AIBL aims to discover biomarkers, cognitive test results, and lifestyle factors associated with AD. As AIBL focuses on early AD stages, most of the subjects are CN. The MCI subjects of AIBL met the criteria described in [42], AD diagnoses following the NINCDS-ADRDA criteria [41] for probable AD. The diagnostic criteria of AIBL were described in [39]. Approximately half of the CN subjects recruited in AIBL show memory complaints [39]. AIBL data version 3.3.0 was downloaded on 19 Sep 2019 and originally included 858 subjects.

Table 3 AIBL demographics at BL. The mean (\(\bar{x}\)) and standard deviation (\(\sigma\)) are given for all continuous variables

The aim of the Open Access Series of Imaging Studies (OASIS) 3Footnote 3 [40] dataset is to investigate the effects of healthy ageing and AD. The subjects of OASIS-3 were recruited from several ongoing studies in the Washington University Knight Alzheimer Disease Research CenterFootnote 4. The longitudinal dataset included MRI scans, fMRI scans, Amyloid- and FDG-PET scans, neuropsychological test results, and clinical data for 1,098 subjects. OASIS focuses on the preclinical stage of AD. All OASIS subjects had a Clinical Dementia Rating (CDR) less than or equal to 1. The OASIS dataset provides multiple target values. In this research, CN subjects had normal cognition and absence of MCI or AD diagnosis, MCI subjects had amnestic MCI with memory impairment, and AD diagnosis follows the NINCDS-ADRDA criteria [41] for probable AD.

Table 4 OASIS-3 demographics at the first visit with MRI scan and diagnosis

Subject Selection

For the ADNI dataset, all subjects with an MRI scan at the baseline visit were included. 521 subjects who have no MRI scan at the baseline visit were excluded, 29 subjects failed the MRI feature extraction described in “MRI Feature Extraction”. The demographics of the resulting 1,700 subjects are summarized in Table 2.

The 853 MCI subjects were divided into two groups. The sMCI subjects had a stable MCI diagnosis at all follow-up visits and the pMCI subjects converted to a stable AD diagnosis at any visit. 38 subjects with no follow-up visits and 96 subjects who reverted to CN or MCI were excluded from this separation, resulting in 400 sMCI and 319 pMCI subjects.

For AIBL, the same exclusion criteria were applied. Therefore, 170 subjects had no MRI scan at the baseline visit, and the baseline MRI scans of 76 subjects failed for the MRI feature extraction pipeline described in “MRI Feature Extraction”. The demographics of the resulting 612 subjects are summarized in Table 3. Similar to the ADNI dataset, the 95 MCI subjects were divided into two groups. In this step, 60 subjects with no follow-up visits and 7 subjects who reverted to CN or MCI were excluded from this separation, resulting in 16 sMCI and 12 pMCI subjects.

The exclusion criteria were similarly applied for the OASIS-3 dataset, which originally included 1,098 subjects. For 983 subjects, a diagnosis of CN, MCI, or AD was assigned for at least one visit. The MRI feature extraction pipeline failed for all MRI scans of five subjects, and no MRI scan was successfully matched to a diagnosis with a tolerance of 365 days for 57 subjects. In contrast to the ADNI and AIBL datasets, which exclusively included baseline visits, the first visit with an MRI scan and a diagnosis was used for OASIS. The demographics of the remaining 921 subjects are summarized in Table 4.

The number of subjects with MCI as baseline diagnosis is 19. This number was decreased if subjects without follow-up diagnoses were excluded. Thus, no experiments were executed to separate sMCI and pMCI subjects in OASIS-3. For reproducible research, the supplementary material contains lists with the subject and MRI IDs and the diagnoses for all datasets.

MRI Scan Selection

From the ADNI dataset, T1-weighted MRI scans recorded at the baseline visit were included. The acquisition parameters differ between scanners. During the ADNI-1 study phase, scans were recorded using a field strength of 1.5 T. In the remaining study phases, MRI scans with a field strength of 3.0 T were recorded.

From the AIBL dataset, T1-weighted MRI scans following the protocol of the ADNI 3D T1-weighted sequences were included. All AIBL scans had a resolution of \(1\times 1\times 1.2\) mm.

For the OASIS-3 dataset, T1-weighted MRI scans, recorded on three scanners, were included. The field strengths of those scanners are 1.5 T and 3.0 T [40].

MRI Feature Extraction

Using FreeSurfer v6.0 [43], volumetric features were extracted for each MRI scan. These include the volumes of 34 cortical areas per hemisphere of the Desikan–Killiany atlas [44], 34 subcortical areas [45], and the estimated Total Intracranial Volume (eTIV). As recommended for volumes in [46], the volumetric features were normalized by eTIV. This results in 103 MRI volumes, which were split into 49 features of the left hemisphere, 49 features of the right hemisphere, and five additional unpaired segmentations (3rd ventricle, 4th ventricle, brain stem, CSF, eTIV).

After the normalization, for paired volumes, the sum (described in Eq. 1), the difference (described in Eq. 2), and the ratio (described in Eq. 3) of both hemispheres are calculated to investigate symmetry and to decrease feature interactions. This results in 152 MRI features (49 sums, 49 differences, 49 ratios, and 5 unpaired features). Brain asymmetry was previously associated with AD [47,48,49,50]. Equation 2 shows that differences were calculated by subtracting the right from the left volume similar to [48], where the cortical thickness was used instead of volumetric features

$$\begin{aligned} sum_{ROI}= & {} \frac{vol_{ROI}^{left}}{eTIV}+\frac{vol_{ROI}^{right}}{eTIV} \end{aligned}$$
(1)
$$\begin{aligned} diff_{ROI}= & {} \frac{vol_{ROI}^{left}}{eTIV}-\frac{vol_{ROI}^{right}}{eTIV} \end{aligned}$$
(2)
$$\begin{aligned} ratio_{ROI}= & {} \frac{vol_{ROI}^{left}}{vol_{ROI}^{right}}. \end{aligned}$$
(3)

Manual Feature Preselection

Three feature sets were investigated in the experiments. The manual feature selection aims to choose less-invasive, accessible examination techniques which were able to detect early signs of AD. Feature set 1 (FS-1) includes all MRI features, and socio-demographic features including age, gender, and years of education. However, the years of education are only available for the ADNI dataset. Feature set 2 (FS-2) expands FS-1 by the number of ApoE\(\epsilon\)4 alleles, a genetic risk factor associated with AD, which can be obtained from blood samples or via less-invasive swab tests from the inside surface of the cheek. Feature set 3 (FS-3) extended FS-2 by three cognitive tests including the score of the Mini-Mental State Examination (MMSCORE) and two logical tests to evaluate the long-term (Logical memory, delayed—LDELTOTAL) and the short-term memory (Logical memory, immediate—LIMMTOTAL). The CDR was strongly associated with AD diagnosis and was not included in the experiments.

Dataset Splitting

The ADNI dataset was split on the subject level into two distinct subsets. The training set included 80 % of the data and the test set consisted of the remaining 20 %. The splitting was executed within each diagnostic group to ensure similar distributions. The AIBL and OASIS datasets were used as external test sets. None of the AIBL and OASIS subjects was used in the training or model selection process.

Feature Selection

Initially, 152 MRI features were extracted from the MRI scans. Those features are reduced to focus the ML models on the most important features. For this reason, feature forward selection was implemented. In comparison to feature selection methods like RF feature importance, this method avoids correlated features in the dataset [51]. Forward selection is a greedy procedure that iteratively identifies the best new feature until no improvement was reached. The training dataset was split into an 80 % training set and 20 % validation dataset. The training dataset was used to train the ML model used for classification with default hyperparameters on the feature set, and the validation dataset was used to calculate the validation accuracy for this feature set. The selected MRI features were expanded using the features described in “Manual Feature Preselection”.

Hyperparameter-Tuning

To tune the hyperparameters of the ML models, Bayesian optimization [52] was implemented using the Python package scikit-optimize v0.8.1 [53]. Bayesian optimization maps the dependency of the hyperparameters and the model performance using a Gaussian Process. Initially, ten nearly random hyperparameter combinations were selected by a Latin Hypercube Design (LHD) [54]. Bayesian optimization with LHD initialization was successfully used in previous research [55] to optimize the parameters for early AD detection. Each parameter was split into ten equidistant intervals and one sample was randomly chosen per interval. This results in ten samples per parameter, which were randomly matched.

A stratified \(10\times 10\)-fold CV [56] was applied to the training dataset to estimate the model accuracy for an independent test set. Stratified \(10\times 10\)-fold CV was implemented by splitting each diagnostic group of the training dataset into ten distinct folds using the Python package scikit-learn v0.23.2 [57]. Ten iterations were performed, each with a different fold used as a validation dataset (10 %). The training dataset included the remaining 9 folds (90 %). With shuffled data in each run, this procedure was repeated ten times. The ML model was initially evaluated for ten LHD combinations.

To predict the average CV accuracy for the initial parameter combinations, the Gaussian process was fitted. Afterward, an optimization selected the next promising parameter combination. As an acquisition function, the Lower Confidence Bounds (Eq. 4) was used. In this equation, \(\hat{\mu }_\Theta\) is the Gaussian Process estimation of the CV accuracy and \(\hat{\Sigma }_\Theta\) is the covariance at parameter combination \(\Theta\)

$$\begin{aligned} LCB(\Theta )=\hat{\mu }_\Theta -\hat{\Sigma }_\Theta . \end{aligned}$$
(4)

The hyperparameter combination selected in the previous step was again evaluated using CV. Afterward, to refine the Gaussian Process and to determine the following combination, the respective tuple of hyperparameter and mean CV accuracy was added to the Gaussian Process. The procedure was repeated 25 times. The best hyperparameter combination was chosen to train the final model.

Model Training

During hyperparameter training and final model generation, XGBoost models, RFs, radial SVMs, polynomial SVMs, DTs, and LR models were trained. The preprocessing pipeline included centering, scaling, and median imputation. The entire preprocessing pipeline was implemented within the CV to avoid over-optimistic performance estimations [58]. The parameters were calculated for the CV training set and reused for the test and external datasets. The preprocessing was implemented using the Python package scikit-learn v0.23.2 [57].

Ensemble-based black-box XGBoost [4] models follow the idea of gradient boosting models [32]. It means that the combination of multiple weak classifiers results in a strong, joint classifier. By learning the gradients of the previous classifier, gradient boosting fulfills this assumption. The final prediction consisted of the sum of weak classifier predictions. XGBoost is distributed as an open-source software library, and the main advantages are scalability, parallelization, and distributed execution. The hyperparameters and intervals used during Bayesian optimization are summarized in Table 5. The hyperparameter n_estimators sets the number of boosting iterations, learning_rate was the learning rate that preferences weak classifiers at early iterations, the minimum loss reduction required to split a node is defined by gamma, the hyperparameter max_depth sets the maximum depth of an individual tree, and the minimum number of observations in a child node was denoted as min_child_weight, subsample and colsample_bytree set the proportion of randomly subsampled training instances and features per iteration. The Python package xgboost v1.2.0 [59] implemented the XGBoost algorithm.

RF [5] training was implemented using the Python package scikit-learn v0.23.2 [57]. The RF algorithm is based on multiple DTs. Each DT was trained using randomly chosen features and subjects. Those subjects were selected using bootstrap sampling [60] on the training dataset. RF inference was computed by summarizing the individual DTs using a majority voting. The RF hyperparameters are summarized in Table 5. n_estimators sets the number of DTs, each split used a random subset of max_features features, and the hyperparameter min_samples_leaf describes the minimum number of samples in a leaf node.

Support Vector Machines (SVMs) [9] were implemented using the Python package scikit-learn v0.23.2 [57]. SVMs separate two classes using a decision boundary which was referred to as an n-dimensional hyperplane. Here, n is the number of features. To increase the robustness of the hyperplane for unknown observations, SVMs select the hyperplane with the largest distance from the observations. For this reason, the distance between the hyperplane and the observations was maximized using the hinge loss function [61]. The support vectors describe the observations closest to the hyperplane. Removing support vectors from the dataset directly influences the hyperplane. The cost parameter C enables SVMs to avoid overfitting, the higher C, the less complex an SVM is. Kernel functions help to model complex interactions. In this research, a polynomial and a radial kernel were implemented. The degree hyperparameter of the polynomial kernel controls the degree of the kernel, and high values lead to more complex hyperplanes. The gamma hyperparameter constraints the influence, and a single observation has on the hyperplane. If gamma = scale, \(\frac{1}{\# features\cdot \sigma }\) was used as a value of gamma, if gamma = auto, a value of \(\frac{1}{\# features}\) was used. The SVM hyperparameters and their ranges are summarized in Table 5.

Table 5 Hyperparameters and intervals used to train the ML models

In contrast to the black-box models, DTs [62] and LR models were selected as simple and interpretable comparison models. DTs were implemented using the Python package scikit-learn v0.23.2 [57]. A DT consists of successively learned decision rules of the form \(x \le t\) for numerical or \(x\in t\) for categorical features t is a threshold or a subset of values. The next decision rule was selected by the splitter which ranked all possible rules using a criterion. Decision rules were iteratively expanded until a maximum depth of max_depth was met or a percentage min_samples_split of samples were in a split.

LR [63] is a Generalized Linear Model (GLM) with a logistic link function. This link function allows the processing of binomial output variables. The logistic model function is given in Eq. 5. The model predicts the probability \(P(Y=1\vert X=x,\Theta )\) of observation x with given parameters \(\Theta\) being in the positive class \(Y=1\). The LR algorithm was implemented using the Python package scikit-learn v0.23.2 [57]

$$\begin{aligned} P(Y=1\vert X=x,\Theta )=\frac{1}{1+\exp (x\cdot \Theta ).} \end{aligned}$$
(5)

Model Interpretation with Shapley Values

There are multiple methods to interpret ML models. An overview can be found in [10]. For example, DTs and LR models are interpretable by design. However, black-box models often outperform those interpretable models but the interpretation of black-box models is complicated. In this research, model-agnostic Shapley values were used. Shapley values are local models, which explain the predictions of individual observations and thus enable high clinical benefit and high adaption to the black-box model.

Shapley values [64] are affiliated with coalition game theory and aim to decompose the prediction of a subject into the contributions of each feature. For this aim, Shapley values are based on the additive linear explanation model shown in Eq. 6. For a subject x, the model prediction f(x) is decomposed into the feature contributions \(\Phi _j\), a simplified representation of the feature values \(x'\), and the average model prediction \(\Phi _0\). A binned binary feature representation was used for tabular data, with N being the number of simplified features

$$\begin{aligned} f(x)=\Phi _0+\sum _{j=1}^{N}\Phi _j x'_j. \end{aligned}$$
(6)

The idea of using Shapley values to explain black-box ML models is to fairly decompose the contribution of each feature for the subject’s prediction. Due to this fairness, the sum of all Shapley values is equal to the difference between the average model prediction and the probability prediction of a subject. Equation 7 shows that Shapley values are defined as the average, weighted contribution, a simplified feature has in all subsets. For the exact calculation of a Shapley value \(\Phi _i\) for a given subject and feature i, it is required to determine the contribution of this feature for all subsets S of the entire feature set F. The investigation of each subset S requires the retraining and evaluation of the black-box model \(f_{S}(S)\). With the help of the model performances trained with (\(f_{S\cup i}(S\cup i)\)) and without (\(f_S(S)\)) the feature at interest i, their differences were calculated. The weighted average difference across subsets builds the Shapley value. The weighting depends on the relative number of features \(\vert S\vert\) in subset S. High weights were assigned to subsets with few and many features. In this way, the estimation of the main individual effects and the total effects are supported

$$\begin{aligned} \Phi _i=\sum _{S\subseteq F\setminus \{i\}} \frac{\vert S\vert ! (\vert F\vert -\vert S\vert -1)!}{\vert F\vert !}\bigl (f_{S\cup \{i\}}(S\cup i)-f_S(S)\bigr ). \end{aligned}$$
(7)

However, the number of subsets increases exponentially with the number of input features, leading to high computational effort for the exact calculation of Shapley values. Kernel SHapley Additive exPlanations (SHAP) [65] avoid time-consuming repeated training and evaluation by estimating Shapley values. This algorithm is based on Local Interpretable Model-agnostic Explanations (LIME) [66] and was implemented using the Python package shap v0.38.1 [27]. A new dataset containing variants of the observation at interest is created by permuting selected features. An additive linear model (Eqn. 8) with \(x'\) is a simplified representation of the black-box input features and \(g(x')\) is the explanation model was fitted to the generated dataset

$$\begin{aligned} g(x')=\Phi _0+\sum _{i=1}^{M}\Phi _i\cdot x'_i. \end{aligned}$$
(8)

The weights \(\Phi _i\) of the explanation model estimates the SHAP values for each subject and each feature. For tabular data, the simplified features are binned binary feature representations that represent if the original feature value or a permutation was used.

SHAP force plots [67] explain the model prediction of individual subjects using Shapley values. Features with positive Shapley values show strong positive effects on the prediction and small negative Shapley values represent small negative effects. SHAP force plots can be found in Fig. 15.

SHAP summary plots [67] summarize the explanations for the entire training dataset. Each point visualizes the feature value of a subject and the associated Shapley value. The color of a point depends on the subject’s feature value. On the vertical axis, the features are ordered by the mean absolute Shapley values. The plots were limited to the top ten features. SHAP summary plots can be found in Figs. 2, 5, 6, and 8.

There are some reasons, including out-of distribution sampling during Shapley value approximation and not taking into account feature correlation, why Shapley values should be used with caution for black-box model interpretability [68]. Therefore, it is important to compare Shapley value results with other ML explanation methods, or to reduce or consolidate correlated features [69]. In this work, forward selection was implemented to reduce the number of correlated features in the dataset, Shapley values were compared to classical feature importance measurements (“Classification Model”), and correlated features are consolidated to aspects.

Evaluation

The models were evaluated for the ADNI test set and the external AIBL and OASIS datasets. The performance was measured using accuracy (ACC) (Eq. 9), balanced accuracy (BACC) (Eq. 10), F1-Score (F1) (Eq. 11), and Matthews correlation coefficient (MCC) (Eq. 12). Table 6 visualizes the contingency table used for the calculation of those scores. Providing multiple scores for evaluation increased the comparability to other research. In comparison to accuracy, which focuses on correctly classified cases, the F1-Score focuses on incorrectly classified cases. The macro-averaging F1-Score was calculated to address both, the diseased and the healthy subject classification. Balanced accuracy is based on both, sensitivity and specificity and thus is suitable to evaluate imbalanced class problems. The MCC returns a value between 0 and 1, and is also suitable to handle imbalanced datasets

Table 6 Contingency table for the classification between patients and controls
$$\begin{aligned} ACC= & {} \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(9)
$$\begin{aligned} BACC= & {} \frac{\frac{TP}{TP+FN}+\frac{TN}{TN+FP}}{2} \end{aligned}$$
(10)
$$\begin{aligned} F1= & {} \frac{TP}{TP+\frac{1}{2}(FP+FN)} \end{aligned}$$
(11)
$$\begin{aligned} MCC= & {} \frac{TP\cdot TN -FP \cdot FN}{\sqrt{(TP+FP)\cdot (TP+FN)\cdot (TN+FP)\cdot (TN+FN)}}. \end{aligned}$$
(12)

Additionally, the Area under the Receiver-Operating Curve (AUROC), which models the relationship between the True-Positive Rate (TPR—Eq. 13) and the False-Positive Rate (FPR—Eq. 14) for different classification thresholds was computed for all models. AUROC is suitable to investigate classification tasks with imbalanced datasets

$$\begin{aligned} TPR= & {} \frac{TP}{TP+FN} \end{aligned}$$
(13)
$$\begin{aligned} FPR= & {} \frac{FP}{TN+FP}. \end{aligned}$$
(14)

Results

In the following, the experimental results are presented. The MRI features selected using forward selection and the performances achieved for CN vs. AD, CN vs. MCI, MCI vs. AD, and sMCI vs. pMCI classification were given. SHAP summary plots compared the models trained using different feature sets, validation datasets, and classification models. The results of SHAP summary plots are compared to natural RF- and XGBoost-based feature importance scores and permutation importance scores. The influence of feature interactions for Shapley values is investigated and SHAP force plots explain individual model predictions of interesting subjects.

Feature Selection

The MRI features selected during forward selection for CN vs. AD classification and different ML methods used as base classifiers are summarized in Table 7. In this research, feature forward selection was used to reduce the number of MRI features and the influence of correlated features.

Table 7 Features selected by forward selection using different ML methods as base classifiers for CN vs. AD classification

For the CN vs. AD detection task, the RF, and the polynomial SVM chose five features, the XGBoost, the DT, and the radial SVM chose four features and the LR chose six features. Overall, the six methods chose 16 different features. The most important feature for the RF, the XGBoost, the DT, and the radial SVM was the sum of the left and right amygdalae. For the polynomial SVM and the LR, the most important feature was the sum of the entorhinal cortices. Both features were previously associated with AD detection [70,71,72,73]. Previous research also shows that most of the selected features are associated with atrophy patterns in AD [74]. All methods also selected at least one difference or ratio of the left and right cortical or subcortical areas. Those features describe the asymmetry of both hemispheres. Brain asymmetry measurements were associated with AD [47,48,49,50] and were also successfully applied for ML models in this field [75].

The rankings of the forward selection for CN vs. MCI detection and different base classifiers are given in Table 8.

Table 8 Features selected by forward selection using different ML methods as base classifiers for CN vs. MCI classification

For CN vs. MCI detection, the RF, XGBoost, and LR base classifiers chose four features, the DT chose three features, the polynomial SVM chose eight features, and the radial SVM chose six features. Overall, the six ML methods chose 25 different features. Thus, in comparison to the CN vs. AD classification, the ML models show less agreement about the selected features. Consequently, the feature which was selected first in the forward selection process differed in five out of six methods. For the RF and the DT, the sum of the insular cortices was selected, the XGBoost classifier chose the ratio of the inferior parietal lobule, the polynomial SVM selected the sum of the lingual gyri, the SVM with the radial kernel chose the sum of the temporal pole volumes and the LR selected the sum of the left and right middle temporal gyri. Those features were previously associated with AD progression [70,71,72,73,74, 76,77,78]. Similar to the CN vs. AD classification, all models selected at least one feature describing the asymmetry of the cortical and subcortical brain regions.

The forward feature selection results of the six ML models for MCI vs. AD classification are summarized in Table 9. Four of the six models, namely RF, XGBoost, SVM poly, and LR selected five features. The DT chose six different features and the radial SVM selected two MRI features. Overall, the six methods selected 22 unique features.

The most important features were the sum of the left and right hippocampi for the RF and DT model, the difference of the lateral ventricles for the XGBoost model, the sum of the inferior temporal gyri for both SVMs, and the sum of the entorhinal cortex volumes for the LR. Those features were previously associated with AD detection [70,71,72,73, 79, 80].

Table 9 Features selected by forward selection using different ML methods as base classifiers for MCI vs. AD classification

The results of the forward selection for the sMCI vs. pMCI classification task are summarized in Table 10. Five features were selected by the RF model, the XGBoost model chose six features, the DT selected only one feature, both SVMs chose four features, and the LR selected three features. Overall, the six methods picked 19 unique features. Three methods, namely the RF, the DT, and the SVM with the radial kernel selected the sum of the left and right amygdalae as the most important feature. The forward selection with the XGBoost base model first picked the sum of the hippocampi. The polynomial SVM selected the sum of the left and right precuneus and the LR chose the sum of the inferior temporal gyri. Those features were previously associated with AD detection [70,71,72,73, 80, 81].

Table 10 Features selected by forward selection using different ML methods as base classifiers for sMCI vs. pMCI classification

Classification Tasks

In the following, the classification performances achieved for the four classification tasks are elaborated. The hyperparameters which reached the best accuracies during CV and which were thus used during training of the final models are summarized in Table 11.

Table 11 Hyperparameters tuned for CN vs. AD, CN vs. MCI, MCI vs. AD, and sMCI vs. pMCI classification. Hyperparameters: LR: C; penalty, DT: criterion; max depth; min samples split; splitter, RF: max features; min samples leaf; n estimators, XGBoost: colsample bytree; gamma; learning rate; max depth; min child weight; n estimators; subsample, SVM poly: C; degree; gamma, SVM radial: C; gamma

CN vs. AD

The results achieved for CN vs. AD classification are summarized in Table 12. The no information rates were 60.36 % for the independent ADNI test set, 86.27 % for AIBL, and 78.05 % for OASIS. CN was the most frequent class in all datasets.

The best accuracy during CV of 99.68 %  ±  0.74 was achieved for the DT trained with feature selection and FS-3. This model also reached a perfect classification for the ADNI test set. All models trained for the CN vs. AD task reached accuracies higher than the no information rate for the ADNI dataset. The best AIBL accuracy of 95.94 % was achieved for the XGBoost model trained for FS-3 and with feature selection. This model also reached the best F1-Score (91.48 %) and the best MCC (0.830) for the AIBL dataset. The best balanced accuracies of 95.45 % for the AIBL dataset were reached for both SVMs trained with feature selection and FS-3. The LR model trained with feature selection for FS-3 reached the best AIBL AUROC of 99.55 %. Overall, two models achieved AIBL accuracies smaller than the no information rate of 86.27 %. Those models were the DTs trained with feature selection for FS-1 and FS-2.

The best OASIS accuracy of 90.58 % was achieved for the polynomial SVM which was trained with feature selection and FS-3. For OASIS, four models achieved accuracies worse than the no information rate of 78.05 %. Three of those models were trained on FS-1 and with feature selection, namely, the RF, the DT, and the SVM. The last model reaching an accuracy worse than the no information rate was the DT trained with FS-2 and feature selection.

Table 12 CV and test results for CN vs. AD classification

CN vs. MCI

The results achieved for CN vs. MCI classification are summarized in Table 13. The no information rate for this task was 62.64 % for the ADNI test set, 82.44 % for AIBL, and 97.39 % for OASIS. MCI was the most frequent class in the ADNI dataset, whereas, for AIBL and OASIS, CN was.

The results achieved for CN vs. MCI classification were worse than those for the CN vs. AD task. The best accuracy during CV of \(90.21\,\%\pm 2.72\) was achieved for the XGBoost model trained for FS-3 and without feature selection. The best accuracy for the ADNI test set was 91.58 % reached for two models. Both models, the radial SVM and the XGBoost model, were trained for FS-3 and with feature selection. The latter model also reached the best ADNI balanced accuracy and ADNI F1-Score. Overall, none of the models reached an ADNI accuracy worse than the no information rate of 62.51 %.

The results achieved for AIBL and OASIS were worse than the ADNI results. The best AIBL accuracy was 68.95 % achieved for two DTs trained with forward feature selection for FS-1 and FS-2. These models also reached the best AIBL balanced accuracies, AIBL F1-Scores, and AIBL MCCs.

The best OASIS accuracy of 55.05 % was reached for the DT trained without feature selection for FS-3. For the CN vs. MCI classification, all models achieved accuracies worse than the no information rates for OASIS and AIBL.

Table 13 CV and test results for CN vs. MCI classification

MCI vs. AD

The MCI vs. AD classification results are summarized in Table 14. The no information rate was 71.85 % for the ADNI test set, 57.23 % for AIBL, and 91.24 % for OASIS. The most frequent class was MCI for ADNI and AIBL as well as AD for OASIS.

The best CV accuracy of 89.39 %  ±  2.99 was achieved for the RF trained without feature selection and FS-3. For the independent ADNI test set, the best accuracy was 88.66 %, reached by the RF and LR models trained with feature selection and FS-3. The first of those models also reached the best ADNI AUROC of 95.50 %, whereas the second model achieved the best ADNI F1-Score (85.50 %), and ADNI MCC (0.712). None of the models reached an ADNI accuracy worse than the no information rate. However, the DT trained without feature selection for FS-1 as well as the XGBoost and the DT both trained for FS-2 and with feature selection exactly achieved the no information rate of 71.85 % for the independent ADNI test set. The two DT models mentioned predicted the MCI class for all subjects and thus represented random classifiers.

The best AIBL accuracy was 84.94 % reached by the RF model trained without feature selection and FS-3. This model also reached the best AIBL balanced accuracy (83.64 %), AIBL F1-Score (84.24 %), and AIBL MCC (0.693). Except for the previously mentioned random classifiers, all models outperformed the no information rate for the AIBL dataset which was 57.23 %. The performances for OASIS were worse than those achieved for ADNI and AIBL. The best OASIS accuracy of 57.14 % was achieved for the radial SVM trained without feature selection and FS-3. The random classifiers achieved a worse accuracy of 8.76 % for OASIS. These accuracies correspond to the ratio of MCI subjects in the dataset. MCI was the most frequent class for the ADNI dataset and the rarest class for OASIS. All models achieved OASIS accuracies worse than the no information rate of 91.24 %.

Table 14 CV and test results for MCI vs. AD classification

sMCI vs. pMCI

The results reached for sMCI vs. pMCI classification with no information rates of 55.56 % for the ADNI test set and 57.14 % for AIBL are summarized in Table 15. As previously mentioned, due to the lack of available data, OASIS was not used for this comparison.

The best CV accuracy of \(70.75\,\%\,\pm \,5.94\) was achieved for the RF trained with feature selection and FS-3. For the independent ADNI test set, the radial SVM which was trained with forward feature selection for FS-3 reached the best accuracy (75.00 %), balanced accuracy (74.38 %), F1-Score (74.51 %), and MCC (0.491). The best AUROC of 80.14 % was reached for the XGBoost model trained with forward feature selection for FS-3. None of the models achieved worse results than the ADNI no information rate of 55.56 %. The best AIBL accuracy was 82.14 % reached for the radial SVM which was trained for FS-3, and without feature selection. Five models achieved AIBL accuracies worse than the no information rate. Those models were all DTs trained with FS-1 and FS-3, and the LR trained with feature selection and FS-3.

Table 15 CV and test results for sMCI vs. pMCI classification

Feature Sets

As can be seen in Table 12, for CN vs. AD classification, all models achieved the best scores using FS-3. Thus, adding cognitive test results to the dataset improved the overall classification results. The SHAP summary plots for the polynomial SVMs trained with feature selection for all three feature sets are shown in Fig. 2. SHAP summary plots [67] explain the predictions for the subjects of the entire ADNI, AIBL, and OASIS datasets. Each point plots a Shapley value for a subject and a feature and is colored depending on the feature value. The vertical axis represented the features, ordered by the mean absolute Shapley values and their distribution. The higher a Shapley value is, the more the feature expression increases the probability the model classifies the subject as an AD subject. All SHAP summary plots were limited to the top ten features.

Following the biological processes of AD, small brain volumes [70, 82,83,84], large ventricular volumes [85, 86], the presence of ApoE\(\epsilon\)4 alleles [87,88,89], and bad performances in cognitive test scores were expected to have a pathogenic effect on the disease progression. The left hemisphere was expected to be more affected by atrophy than the right [75] one. However, some investigations for MCI subjects also showed the right hippocampus was more affected [90]. This asymmetry occurs primarily in the hippocampi and amygdalae [91, 92]. For FS-1 and FS-2, the most important feature was the sum of the left and right amygdalae. Consistently with the previously mentioned atrophy patterns [70, 82,83,84], small volumes of the amygdalae (colored in blue) increased the probability the model classifies a subject as an AD subject. Large amygdala volumes (colored in red) were associated with CN subjects. The second most important feature for both models was the sum of the left and right entorhinal cortex. The model, trained using FS-1, learned that large volumes (colored in red) of the amygdalae, the entorhinal cortices, and the inferior parietal lobules had protective effects on the disease progression (negative Shapley values). Those associations correspond to previous research [70, 82,83,84]. The model additionally learned that a large difference between the left and right cortex volume (colored in red) was associated with CN. Large differences were reached if the volume of the left hemisphere was larger than the right one. The same observation applies to the ratio of the left and right paracentral lobules. Considering the socio-demographic features show that the model learned, young age (colored in blue) was associated with disease progression. However, the summary of the ADNI dataset in Table 2 shows the mean age of the CN group was younger than the AD group. Additionally, the model learned that females (colored in blue) and subjects with low education more likely develop AD.

FS-2 added the number of ApoE\(\epsilon\)4 alleles to FS-1. The additional feature was the third most important feature of this model. The model learned that a large number of ApoE\(\epsilon\)4 alleles (no ApoE\(\epsilon\)4 alleles are colored in blue, one ApoE\(\epsilon\)4 allele is colored in purple, two ApoE\(\epsilon\)4 alleles are colored in red) led to an increased risk. Previous research also identified ApoE\(\epsilon\)4 as an AD risk factor [87,88,89]. Biologically plausible associations [70, 82,83,84] were noted for the summed volumes of the left and right amygdalae, the entorhinal cortices, and the inferior parietal lobules. The ratio of the left and right paracentral lobule volumes showed an increased risk of AD if the left hemisphere was smaller than the right one. The same applies to the differences in cortex volumes.

FS-3 added the results of three cognitive tests to FS-2. The LDELTOTAL cognitive test score achieved the best feature importance for this model, followed by the MMSCORE. The LIMMTOTAL score achieved the sixth-best feature importance. For all cognitive test scores, the model associated good scores (colored in red) with healthy subjects. The third most important feature was the summed volume of the entorhinal cortices. Consistently with AD atrophy, small volumes (colored in blue) were associated with AD progression. The same applied to the sum of the inferior parietal lobules and the amygdalae. Similar to the FS-1 and FS-2 models, the FS-3 model also learned young age (colored in blue) was associated with AD, although the mean age of the ADNI-CN group was younger than the mean age of the ADNI-AD group.

Fig. 2
figure 2

SHAP summary plots of the polynomial SVM trained with feature selection for CN vs. AD. The plots visualize the Shapley values of n=2,266 subjects from the ADNI, AIBL, and OASIS datasets and the ten most important model features. Each subject is represented by a dot. The colors decode the subject’s feature expression. High feature expressions are colored in red and small expressions in blue. The model learned that features with high Shapley values increased the patient’s AD risk. Each plot shows a model trained on a different feature set

As can be seen in Table 13, for the CN vs. MCI classification, FS-3 outperformed FS-1 and FS-2 for the ADNI and AIBL performance scores. The best accuracies for OASIS were reached for FS-3, whereas FS-2 models outperform those models for F1-Score, balanced accuracy, AUROC, and MCC.

For the MCI vs. AD task, which is summarized in Table 14, the same applied to all ADNI and AIBL scores. For the OASIS dataset, the best accuracy and F1-Score were reached by FS-1 and the best balanced accuracy, AUROC, and MCC for FS-3.

The results for the sMCI vs. pMCI classification are shown in Table 15. Those results show that FS-3 outperformed FS-1 and FS-2 for all ADNI scores. For the AIBL dataset, the best accuracy, balanced accuracy, F1-Score, and MCC were achieved for FS-2, whereas the best AUROC was reached for FS-3.

To indicate whether the differences in ADNI test accuracies between the three feature sets are statistically significant, a Friedman test [93] (– value < 0.05) was executed. For this investigation, the results of Tables 7, 8, 9, and 10 are summarized, resulting in 48 observations per feature set (six different models, two feature selection methods, and four tasks). The p–value of \(2.2\cdot 10^{-16}\) indicated statistically significant differences between the feature sets. To identify, which feature sets differed from each other, a pairwise Wilcoxon signed-rank test (p – value < 0.05) with Bonferroni adjustment was executed. A summary of the results is given in Table 16. The results FS-3 achieved significantly differed from FS-1 and FS-2. The FS-1 and FS-2 results showed no statistically significant differences.

Table 16 p–values of the pairwise Wilcoxon signed-rank test (p – value < 0.05) with Bonferroni adjustment to compare the differences in ADNI test accuracies between the three feature sets

Reproducibility

In this work, all models were trained using the ADNI dataset. Data from AIBL and OASIS subjects were used to test model reproducibility.

For all classification tasks, most models achieved worse results for AIBL and OASIS in comparison to the independent ADNI test set. The AIBL accuracies are plotted against the ADNI accuracies for all previously described models in Fig. 3. Overall, the AIBL accuracies were worse than those achieved for ADNI. The CN vs. AD classification models achieved the best accuracies for ADNI and AIBL. The worst AIBL accuracies were achieved for CN vs. MCI classification, where all models reached AIBL accuracies worse than the no information rate. For the remaining classification tasks, most models reached AIBL accuracies better than the no information rate. For the sMCI vs. pMCI classification, no correlation between ADNI and AIBL accuracies was observed.

Fig. 3
figure 3

Plot showing the accuracies achieved for the independent ADNI test set and the AIBL dataset for all models described in Tables 12, 13, 14, and 15. The no information rates for all classification tasks are visualized as horizontal lines for AIBL and as vertical lines for ADNI

In Fig. 4, the OASIS accuracies of all previously described models are plotted against their ADNI accuracies. Similar to the AIBL results, the overall OASIS accuracies were worse than those achieved for ADNI. The best results for OASIS were achieved for CN vs. AD classification. Those models mainly reached accuracies better than the no information rate. The OASIS no information rates for the remaining classification tasks were larger than 90 % and all classification models trained for the ADNI dataset performed worse. However, the most frequent classes in OASIS and ADNI differed from each other for those classification tasks. For the OASIS dataset, the worst accuracy was achieved for MCI vs. AD classification. Reasons for the worse OASIS performances were, for example, differing MRI protocols and differences in the subject selection process.

Fig. 4
figure 4

Plot showing the accuracies achieved for the independent ADNI test set and the OASIS dataset for all models described in Tables 12, 1314 and 15. The no information rates for all classification tasks are visualized as horizontal lines for OASIS and as vertical lines for ADNI

To compare the model predictions for the three datasets, SHAP summary plots were visualized for the individual datasets in Fig. 5. Those plots show the Shapley values for the RF trained with feature selection and FS-3, which was trained for CN vs. AD classification. For all three datasets, the three most important features were the cognitive test scores, and bad scores were associated with disease progression. Those cognitive test scores were followed by the volumetric features, of the amygdalae, medial orbitofrontal cortices, and pars triangularis, as well as the AGE, the number of APOE\(\epsilon\)4 alleles, and the number of education years in slightly differing orders. For all volumetric features, biologically plausible associations [70, 82,83,84] were learned. The number of education years was not available in AIBL and OASIS, and those scores are therefore colored in grey.

Fig. 5
figure 5

SHAP summary plots of the RF trained with FS-3 and feature selection for CN vs. AD classification. The plots visualize the Shapley values of subjects from the ADNI, AIBL, and OASIS datasets and the ten most important model features. Each subject is represented by a dot. The colors decode the feature values of the subject. High feature values are colored in red, whereas small feature values are colored in blue. The model learned that features with high Shapley values increased the patient’s risk to develop AD. Each plot shows the results on a different dataset

SHAP summary plots for the RF trained with feature selection for CN vs. MCI classification based on FS-1 are shown in Fig. 6. The figure contains subplots for all three datasets. Overall, the Shapley values for this model were asymmetric. The positive Shapley values show stronger amplitudes than the negative ones. One explanation for this behavior might be that the MCI class was more frequent in the ADNI training dataset. For the ADNI and AIBL dataset, the most important feature was the sum of the inferior parietal lobules followed by the age and gender. The model learned that small brain volumes, young age, and male gender increased the risk to develop MCI. The volumetric observations correspond to previous research [70, 82,83,84]. The volume of the inferior parietal lobules was the second most important feature for the OASIS dataset. Age was the most important feature for OASIS and the second most important feature for ADNI and AIBL. The model learned young age was associated with disease progression. It can be noted in Table 2 that the mean age of CN subjects is older than the mean age of MCI subjects in the ADNI dataset but not in the AIBL (Table 3) and OASIS (Table 4) datasets. The differences observed in the datasets might cause problems in model reproducibility. The feature representing the years of education was in the fifth position for the ADNI dataset. That information was not available in OASIS and AIBL and was thus colored in grey. Consistently, this feature was the least important one for both datasets. Overall, the ranking of the feature importance differed for all models.

Fig. 6
figure 6

SHAP summary plots of the RF trained with FS-1 and feature selection for CN vs. MCI classification. The plots visualize the Shapley values of subjects from the ADNI, AIBL, and OASIS datasets and the ten most important model features. Each subject is represented by a dot. The colors decode the feature values of the subject. High feature values are colored in red whereas small feature values are colored in blue. The model learned that features with high Shapley values increased the patient’s risk to develop MCI. Each plot shows the results on a different dataset

Classification Model

In this research, six ML models were trained to compare their results to each other. A line plot of the accuracies achieved for the independent ADNI test set dependently on the classification task and the ML model is shown in Fig. 7. For the sMCI vs. pMCI classification, it can be seen that the performance variance is smaller for RF and XGBoost models in comparison to the remaining ML models. In addition, the polynomial SVMs achieved worse results for this classification task. Overall, the DT models were often outperformed by RF and XGBoost classifiers. The LR models outperformed the DTs in many cases, except for the CN vs. MCI classification. Overall, no ML model outperformed the remaining models.

Fig. 7
figure 7

Line plot showing the accuracies achieved for the independent ADNI test set dependently on the classification tasks and the ML model. The plot includes all 216 models described in Tables 12, 1314 and 15. For each ML classifier, 36 models were included

To indicate whether the differences in ADNI test accuracies between the ML methods are statistically significant, a Friedman test (p – value < 0.05) was executed. For this investigation, the results of Tables 7, 8, 9, and 10 are summarized, resulting in 24 observations per feature set (three feature sets, two feature selection methods, and four tasks). The p – value of 0.006 indicated statistically significant differences between the ML models. A pairwise Wilcoxon signed-rank test (p – value < 0.05) with Bonferroni adjustment was executed, to identify, which model performances differed from each other. However, the results, summarized in Table 17, show that there are no statistically significant differences between ML models.

Table 17 p–values of the pairwise Wilcoxon signed-rank test (p – value < 0.05) with Bonferroni adjustment to compare the differences in ADNI test accuracies between the six ML models

To visualize ML model differences, Fig. 8 shows SHAP summary plots for all six models. All models were trained using FS-3 with feature selection to distinguish between sMCI and pMCI subjects. The feature selection results in slightly different features within all models. Overall, the Shapley values had the largest deviance for the DT and the SVM with a polynomial kernel, followed by the LR model and the RF. The most important feature for all models except for the RF and the radial SVM was the LDELTOTAL cognitive test score. For this test score, all models associated small feature values (colored in blue) with disease progression. For the radial SVM and the RF, LDELTOTAL was the second most important feature. The most important feature in the RF model was the sum of the left and right amygdalae. The model learned that large brain volumes decreased the patient’s risk to develop AD. This observation is biologically plausible [70, 82,83,84]. The sum of the amygdala volumes was the third most important feature in the DT and the radial SVM. The number of ApoE\(\epsilon\)4 alleles was the most important feature for the radial SVM. The model learned that ApoE\(\epsilon\)4 is an AD risk factor, and the presence of ApoE\(\epsilon\)4 alleles is associated with AD progression. The number of ApoE\(\epsilon\)4 alleles is the second most important feature for the DT, and the LR, the third most important feature for the XGBoost model and the polynomial SVM, and the fourth most important feature for the RF. In this comparison, all models except for the DT and the polynomial SVM had at least one asymmetry feature within its top ten features. The decision tree only depended on three features, namely the LDELTOTAL cognitive test score, the number of ApoE\(\epsilon\)4 alleles, and the hippocampus volume. Most associations, the models learned, were biologically plausible. The radial RF showed two features with a biologically implausible association [70, 82,83,84]. The model learned that high volumes of the lateral occipital sulci, as well as a high number of education years, are associated with disease progression. Those features are ranked as the ninth and tenth important features in this model. Surprisingly, the association of the education feature was also learned for the SVMs and the LR. For the polynomial SVM, the summed volumes of the rostral anterior cingulate cortices show a biologically implausible [70, 82,83,84] association. Overall, biological plausibility should only be expected for high-performing models.

Fig. 8
figure 8

SHAP summary plots for sMCI vs. pMCI classification trained using FS-3, feature selection, and multiple ML models. The Shapley values of n=747 ADNI and AIBL subjects and the ten most important model features are shown. Each dot represents a subject. The colors decode the feature expressions. High expressions are colored in red and small ones in blue. The model learned that features with high Shapley values increased the patient’s AD risk

As a comparison, Fig. 9 visualizes the natural feature importance for the RF and XGBoost models, and the log odd’s ratios for the LR model (ordered by the absolute log odd’s ratio), and Fig. 10 shows the permutation importance of all models. The most important features of all natural feature importance plots and all permutation importance plots correspond to the SHAP summary plots.

Fig. 9
figure 9

Natural feature importance plots of the RF, XGBoost, and DT models and log odd’s ratios for LR models trained to distinguish between sMCI and pMCI subjects using FS-3 and feature selection. Each plot shows a different classification model

Fig. 10
figure 10

Permutation importance plots of all six ML models trained to distinguish between sMCI and pMCI subjects using FS-3 and feature selection. Each plot shows a different classification model

The Kendall’s tau rank correlation [94] between feature rankings for all SHAP models, natural XGBoost and RF feature importances, absolute log odd’s ratios of the LR model, and permutation importance of all models is shown in Fig. 11. Due to the forward feature selection, the different models are trained on slightly different MRI features and the correlation was calculated for pairwise complete observations. However, the socio-demographic data, the number of ApoE\(\epsilon\)4 alleles, and the cognitive test scores were used to train all models. As the features within a specific model are identical, first, the SHAP values, the permutation importance, and the feature importances are compared for each individual model. The SHAP values of the RF model and the permutation importance of the RF have a correlation coefficient of 0.82. The natural feature importance of the RF is only moderately correlated to the permutation importance of the RF (0.45) and weakly correlated to the RF SHAP values (0.35). The XGBoost SHAP values showed a very strong correlation of 0.82 to the natural XGBoost feature importance and a moderate correlation of 0.41 to the XGBoost permutation importance. The DT selected three features in all methods leading to a perfect correlation between the DT SHAP values, and the permutation importance as well as a very strong correlation of 0.89 for the DT SHAP values and the natural feature importances. The SHAP values of the polynomial SVM showed a strong correlation to the permutation importance (0.67) of the same model. A moderate correlation of 0.53 was reached for the SHAP values of the radial SVM and the permutation importance of the same model. The SHAP LR values are strongly correlated (0.73) to the permutation importances and very strong correlated to the log odds (0.96).

As previously mentioned, the features within the different ML models differed which makes the comparison of inter- and intra-model correlations difficult. Considering the inter-model correlations, a perfect correlation of 1 was reached between the SHAP values of the RF and the SHAP values of the XGBoost model, as well as the permutation importance of the LR and the SHAP values of the polynomial SVM.

Fig. 11
figure 11

Plot showing Kendall’s tau correlation between feature importances of all SHAP models, permutation importance of all models, and natural XGBoost and RF feature importances, as well as log odd’s ratios of the LR models for FS-3, feature selection, sMCI vs. pMCI classification and the ADNI and AIBL datasets (n=747)

The execution times of the different ML models, the SHAP algorithm, and the permutation importance calculation are summarized in Table 18. All experiments were executed on an NVIDIA® DGX-1Footnote 5 supercomputer. The execution environment was an NVIDIA®-optimizedFootnote 6 DockerFootnote 7 [95] container, running a DeepoFootnote 8 image. The results showed that, except for the RF model, all models were trained in less than 1 s. The mean training time during CV was 5.66 s for the RF. The RF model was trained using twelve features. The long training time of the RF model was also reflected in the SHAP algorithm and the permutation importance. To compute the SHAP values of one subject, the RF model consumes 47.79 s, whereas the radial SVM which achieved the second slowest time requires only 13.51 s per subject. This results in an execution time of approximately 10 h to calculate the RF SHAP values of the entire dataset. The execution time for permutation importance was approximately 1 h for the RF model and 8 min for the XGBoost model reaching the second-longest execution time. Overall, it has to be mentioned that SHAP value calculation is a time-intensive process. However, the times presented can only be used as an orientation, and optimization is possible by for example clustering the background subjects of the SHAP algorithm. In this work, the samples of the entire training dataset were used as background subjects. The SHAP execution time depended on the number of features in a dataset, the time needed for model inference, the number of background subjects, and the number of subjects that should be explained.

Table 18 Execution times of the different ML models and explainability methods for FS-3, feature selection, sMCI vs. pMCI classification, and the ADNI and AIBL datasets (n=747)

Feature Dependency and Shapley Values

As feature correlations reduce the validity of explainability methods [96, 97], the previously explained SHAP summary plots are all generated using feature selection to avoid strong feature correlations in the dataset. Feature correlations can also make explainability more difficult [97] and may lead to biologically implausible explanations. The original dataset without feature selection contains many correlated features. To compare the explanations for such a dataset [69] developed a method to consolidate correlated features to aspects and compute permutation and SHAP importances for those aspects. First, correlated features of the entire training dataset are identified using Spearman rank correlation coefficients. Hierarchical agglomerative clustering [97] was used to create a dendrogram. In this work, a threshold of \(H=0.5\) determining the least correlated features in a group filtered the resulting aspects from the dendrogram. The permutation and SHAP importances are computed by jointly permuting all features in an aspect. This work uses the python package dalex v1.4.1 [98] for implementation.

The resulting aspects computed for sMCI vs. pMCI classification and FS-3 without feature selection are shown in Table 19. The 161 features of FS-3 are consolidated to 79 aspects. Of those aspects, 14 included an individual feature. Of the remaining 65 aspects, nine included more than two features. As was expected, the differences and ratios of the same region are often correlated. At least one pair of ratio and difference for the same region was included within 49 aspects. Aspect_34 included four regions within the medial temporal lobe. Previous research showed, that those regions are important for the detection of AD progression [91, 92]. Aspect_30 consolidated three ventricular regions. Previous research found that ventricular enlargement was associated with AD progression [85, 86]. Aspect_46 included the cognitive test scores LIMMTOTAL and LDELTOTAL, and aspect_45 included the eTIV and the gender.

Table 19 Aspects extracted with hierarchical agglomerative clustering for Spearman rank correlation and a threshold \(H=0.5\) for sMCI vs. pMCI classification and FS-3 without feature selection

Using those aspects, the SHAP importances visualized in Fig. 12 were computed for the sMCI vs. pMCI classification without feature selection and for all ML models. The most important aspect for the RF, the XGBoost, and the DT was aspect_34, which consolidated the entorhinal cortices, the parahippocampal gyri, the amygdalae, and the hippocampi. Those brain areas were associated with AD in previous research [91, 92]. Aspect_34 also was the second most important aspect of the polynomial SVM and the third most important aspect in the LR and radial SVM. The most important aspect for the LR was aspect_27 which consolidated volumes of the fusiform, the inferior temporal, and the middle temporal gyri. This aspect also reached the second rank for the RF, radial SVM, and XGBoost models as well as the third place for the polynomial SVM. Previous research [91, 99] showed that those regions are affected in early AD stages. Aspect_46 consolidated the LDELTOTAL and LIMMTOTAL cognitive test scores, and was the most important aspect for both SVMs. This aspect also achieved the third rank for the XGBoost, RF and DT models, and the second rank for the LR. Overall, the most important aspects of the different models seem to be similar for the ML models.

Fig. 12
figure 12

SHAP aspect importance plots for all six ML models trained to distinguish between sMCI and pMCI subjects using FS-3 and no feature selection. Each plot shows a different classification model

Figure 13 shows the aspect permutation importance plots of the previously described models. The most important aspects chosen by SHAP importance and permutation importance matched for the DT, and the radial SVM. The most important aspect of the RF and XGBoost models was aspect_46 which included the cognitive test scores LDELTOTAL and LIMMTOTAL. This aspect reached the third rank using the SHAP method for both models. For the LR, aspect_46 was also identified as the most important aspect. This aspect reached second place using the SHAP explanations. The highest permutation importance of the SVM was reached for aspect_34. This aspect reached second place using the SHAP method.

Fig. 13
figure 13

Permutation aspect importance plots for all six ML models trained to distinguish between sMCI and pMCI subjects using FS-3 and no feature selection. Each plot shows a different classification model

To investigate the correlation of the feature rankings, between the different methods, Kendall’s tau rank correlation between the SHAP feature ranking and the permutation method is visualized in Fig. 14. A very strong correlation of 1.00 was observed between the permutation importance and the SHAP values of the DT. A moderate correlation of 0.53 was observed between the SHAP values and the permutation importance of the RF. The XGBoost SHAP rankings also showed a moderate correlation of 0.58 to the permutation importance of the same method. The SHAP values of the polynomial SVM are weakly correlated (0.36) to the model’s permutation importance. A very weak correlation of 0.07 was observed between the SHAP values of the radial SVM and their permutation importance rankings. The LR SHAP values are strongly correlated (0.74) to the permutation importance measurements of this model.

The inter-model correlations of the SHAP values showed a moderate correlation of 0.52 between the polynomial SVM and the radial SVM, as well as a strong correlation (0.65) between the SHAP values of the LR and the radial SVM SHAP values. The SHAP values of the RF was moderately correlated to the XGBoost SHAP values (0.52) and the SHAP values of the polynomial SVM. The SHAP values of the DT and LR model showed a very weak correlation (0.18). Overall, the SHAP values of the DT showed only weak correlations to the remaining ML models. The highest correlation of 0.39 was found for the SHAP values of the XGBoost model.

Fig. 14
figure 14

Plot showing Kendall’s tau correlation between aspect importances of all SHAP models and permutation importance for FS-3, no feature selection, sMCI vs. pMCI classification, and the ADNI and AIBL datasets (n=747)

Explanations of Individual Predictions

To investigate explanations for individual model predictions, Fig. 15 shows SHAP waterfall plots of four ADNI subjects. Those plots visualize the predictions for the RF trained with FS-3, and feature selection for sMCI vs. pMCI classification. SHAP waterfall plots explain the difference between the average model prediction value (E[f(X)]) and the subject’s model prediction based on Shapley values. In all plots, the individual prediction was the probability of the subject being classified as pMCI. Features with a pathogenic expression are shown as red and protective expressions as blue arrows. The model prediction for the subject with PTID 027_S_1387 is explained in Fig. 15a. This is a subject from the ADNI test set and had a diagnosis of pMCI. The model prediction of this subject was 0.735. As this value was higher than 0.50, the subject was correctly classified as a pMCI subject. The most important feature with a pathogenic effect was the volume of the inferior parietal lobules. A relatively small normalized feature value of 0.237 was observed. The Shapley value of this feature was 0.12, and thus, this feature expression increased the model prediction by 0.12. The LDELTOTAL cognitive test score reached a feature value of 3, which was a relatively bad test performance and thus increased the subject’s risk to develop AD. Surprisingly, the relatively old age of 85.6 years decreased the patient’s risk to develop AD by 0.03.

The model prediction for an sMCI subject (PTID: 037_S_4146) is demonstrated in Fig. 15b. This subject was sampled from the ADNI test set and reached a model prediction value of 0.149. The subject had a moderate-to-large volume of amygdalae which decreased the subject’s risk of prospectively developing AD by − 0.15. The subject also has two ApoE\(\epsilon\)4 alleles, and as the presence of ApoE\(\epsilon\)4 alleles is a risk factor for AD, this increased the patient’s risk. Additionally, the relatively high LDELTOTAL cognitive test score of 9 had a protective effect.

SHAP waterfall plots of subjects not included in the sMCI vs. pMCI dataset, because those pMCI subjects reverted to MCI at a later visit (explained in “Subject Selection”) are visualized in Fig. 15c, d. The prediction of the subject with PTID 036_S_4430 is visualized in Fig. 15c. This MCI subject converted to AD 5.54 months after the baseline visit, but reverted to MCI 12.00 months after the baseline, and again converted to AD 23.64 months after the baseline visit. The last diagnosis for this subject was recorded after 83.54 months. The subject reached a model prediction of 0.707 and was thus classified as a pMCI subject. Additionally, the patient had a relatively small LIMMTOTAL cognitive test score of 2. The model learned that this poor test score increased the patient’s risk to develop AD by 0.09. Additionally, the subject had relatively small volumes of the amygdalae, which additionally decreased the patient’s risk to develop AD in the future. The AD risk of this patient was decreased by 0.03, because the subject has no ApoE\(\epsilon\)4 alleles.

The SHAP force plot for the subject with PTID 128_S_0135 is visualized in Fig. 15d. This MCI subject converted to AD 54.52 months after the baseline visit, reverted to MCI after 71.74 months and again converted to AD 83.90 months after the baseline visit which was also the last diagnosis available. However, in contrast to Fig. 15c, the subject reached a small model prediction value of 0.283 and was therefore classified as an sMCI subject. The most important factor decreasing the patient’s risk was the absence of ApoE\(\epsilon\)4 alleles. This factor decreased the model prediction by 0.06. Additionally, the LDELTOTAL cognitive test score of 8, which was relatively large, had a protective effect. The relatively small normalized volume of the lateral occipital sulci decreased the patient’s risk by 0.03. One reason for the classification score might be that the conversion to AD was relatively late for this subject.

Fig. 15
figure 15

SHAP waterfall plots for interesting ADNI subjects to explain the prediction of the RF trained with feature selection and FS-3 for sMCI vs. pMCI classification. Figure 15a, b shows subjects of the ADNI test set, and Fig. 15c, d explains model predictions of two subjects not included in the sMCI vs. pMCI dataset, because those pMCI subjects reverted to MCI at a later visit. The arrow length indicates a Shapley value of the feature expression. Pathogenic feature expressions are shown as red and protective expressions as blue arrows. All volumetric features were scaled to a range between 0 and 1. Mean prediction value (base value): 0.445

Discussion

In comparison to previous research [8], which exclusively trained tree-based models, this work trained several RFs, XGBoost models, DTs, SVMs, and LR models to detect different stages of AD. All models were trained using the ADNI dataset and validated using independent test sets of the ADNI, AIBL, and OASIS cohorts. Bayesian optimization optimized for the best hyperparameters of the models. During this stage, CV was used to estimate the performance for independent test sets. The models were trained using three feature sets. The MRI features included summed volumes, differences, and ratios of predefined brain structures to investigate asymmetry structures associated with different AD stages. Forward feature selection was implemented to focus the models on the most important features and simultaneously avoid correlated features in the datasets. The performances of the different ML models as well as the different feature sets are compared to each other using Friedman tests and pairwise Wilcoxon signed-rank tests with Bonferroni adjustment. SHAP summary plots were used to visualize and interpret those models. The resulting Shapley values were compared to permutation importance of all models as well as natural feature importances of the RF and XGBoost models and to log odd’s ratios of the LR models. As correlated features reduce the validity of explainability methods like permutation importance and SHAP [96], those were also calculated consolidating correlated features to aspects. SHAP force plots investigated individual predictions of interesting subjects.

The experimental results showed that the forward feature selection chose brain regions that were previously associated with AD progression [70,71,72,73, 79, 80] for all classification tasks and models. The performances achieved for models trained with forward feature selection did not outperform the models trained on the entire feature set.

The pairwise Wilcoxon signed-rank tests with Bonferroni adjustment showed that the results of models trained with FS-3, which included cognitive test results, outperformed those models trained for FS-1 and FS-2 for all classification tasks and the ADNI test set. The improvements for FS-3 models in comparison to FS-1 and FS-2 models were smaller for sMCI vs. pMCI than for the baseline classification tasks. The SHAP summary plots of all feature sets mainly showed biologically plausible associations and the most important features for the CN vs. AD classification using FS-3 and the polynomial SVM were the cognitive test scores LDELTOTAL and MMSCORE.

The results for the AIBL and OASIS test sets showed less clear advantages of FS-3. Reasons for this were, among others, differences in the subject recruitment process, leading to differences in socio-demographics and differing MRI protocols across studies. However, the CN vs. AD models were successfully transferred to AIBL and OASIS by mostly achieving classification accuracies better than the no information rate. Additionally, the models trained for MCI vs. AD classification and sMCI vs. pMCI classification were successfully transferred to the AIBL dataset. For CN vs. MCI classification, poor results worse than the no information rate were achieved for the AIBL and OASIS datasets. However, the SHAP summary plots of those models mainly showed biologically plausible results. It was observed that age was a highly important feature in some of those models, which might cause problems transferring those models to datasets with differing demographic distributions. Poor results were also achieved for the MCI vs. AD classification and the OASIS dataset.

Some of the black-box models outperformed the simple and interpretable DTs. However, the pairwise Wilcoxon signed-rank tests with Bonferroni adjustment (\(p-value<0.05\)) showed no significant differences. No model stood out among the black-box models. However, different ML models learned different associations which mostly were biologically plausible. The SHAP summary plots were compared to the permutation importance of all models, and natural RF and XGBoost feature importances, as well as absolute log odds of the LR and agreed for many features. The feature rankings of all models were compared to each other using Kendall’s rank correlation and showed moderate-to-strong correlation.

Because feature dependency structures reduce the validity of SHAP values and permutation importances, those feature importances were also computed by consolidating correlated features using aspects. The results show that the models depended on biologically plausible features. However, the feature rankings of the SHAP values and the permutation importances showed a weaker correlation than those calculated for the models with forward-selected features.

Individual predictions, which are important in clinical practice, were interpreted using SHAP waterfall plots.

Limitations

The approach proposed in this article had several limitations. First, both external datasets had a clear focus on CN subjects and were thus imbalanced which makes the interpretation of model generalizability hard. The external validation of the sMCI vs. pMCI classification, which was medically more interesting than the baseline diagnoses, was based only on 28 AIBL subjects and no OASIS subjects. Future investigations should include more AD datasets knowing those cohorts differ in inclusion criteria. Possible cohorts might be the AD subset [14] of the HNR [15] or a subset of the National Alzheimer’s Coordinating Center [100]. In this context, instead of diagnoses, different biomarkers should be addressed as endpoints. Another idea to increase the number of subjects in the datasets is to relax the exclusion criteria by also including subjects that reverted to MCI or CN, and use follow-up scans of subjects where the baseline scan failed for the MRI feature extraction pipeline. Due to the availability of data in the cohorts, and minimal invasive recording, only MRI, socio-demographics, the number of ApoE\(\epsilon\)4 alleles, and cognitive test scores were included in the investigations. However, PET scans and biomarkers have high medical relevance and should thus be considered in future investigations.

Although in comparison to previous research [8], the number of ML models was already increased, prospectively deep learning models like CNNs, which can automatically extract locally textural features from MRI scans should be investigated. However, currently, there is no consensus on whether those methods can improve AD detection. Much previous work in this area suffered from data leakage [23] or investigated the less challenging discrimination between AD and CN. The Bayesian optimization used for hyperparameter-tuning is a sequential method. Future work should therefore investigate the use of more effective parallelized methods such as presented in [101].

Conclusion

This work extended a workflow [8] to explain ML black-box models trained to distinguish multiple AD stages using Shapley values. The differentiation of sMCI and pMCI subjects is of medical interest to recruit and monitor subjects for therapy studies. The approach was based on non-invasive features, including MRI volumes, socio-demographic data, the number of ApoE\(\epsilon\)4 alleles, and cognitive test results. Volumetric features were extracted from the MRI scans using the FreeSurfer pipeline. The sum, difference, and ratio of the volumes of both hemispheres were calculated to investigate the brain asymmetry in multiple AD stages. Shapley sampling values were calculated to visualize the local feature associations of black-box RFs, XGBoost models, and SVMs. The experiments mainly showed biologically plausible associations and improved results for models including cognitive test scores. Those improvements were smaller for sMCI vs. pMCI classification.

For the investigation of model reproducibility, all models were trained for the ADNI dataset and validated for the external AIBL and OASIS cohorts. The ADNI models achieved reasonable results for AIBL and CN vs. AD, MCI vs. AD, and sMCI vs. pMCI classification. For the OASIS test set, reasonable results were only reached for CN vs. AD classification.

Some of the performances of the black-box models outperformed the simple and interpretable DTs. None of the black-box models achieved outstanding results. SHAP summary plots were used to visualize the associations, the model learned between the features, and the AD diagnosis. The most important features of those plots were previously associated with AD progression. Additionally, those plots showed biologically plausible associations for most of the important features in all classification tasks.

SHAP force plots investigated individual model predictions. The comparison between SHAP values, natural and permutation feature importance showed moderate-to-strong correlations.

The investigation of the feature dependency structure and consolidating correlated features during the computation feature importance computation for sMCI vs. pMCI classification showed that those models depended on features that were previously associated with AD.

This work outperformed previous work [8] for the ADNI and AIBL classification results and CN vs. MCI classification and the AIBL results in MCI vs. AD classification for models trained without cognitive test scores. Additionally, the ADNI and AIBL results achieved for sMCI vs. pMCI classification trained with cognitive test scores outperformed the results of previous work.