Machine Learning Workflow to Explain Black-Box Models for Early Alzheimer’s Disease Classification Evaluated for Multiple Datasets

Hard-to-interpret black-box Machine Learning (ML) was often used for early Alzheimer’s Disease (AD) detection. To interpret eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM) black-box models, a workflow based on Shapley values was developed. All models were trained on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and evaluated for an independent ADNI test set, as well as the external Australian Imaging and Lifestyle flagship study of Ageing (AIBL), and Open Access Series of Imaging Studies (OASIS) datasets. Shapley values were compared to intuitively interpretable Decision Trees (DTs), and Logistic Regression (LR), as well as natural and permutation feature importances. To avoid the reduction of the explanation validity caused by correlated features, forward selection and aspect consolidation were implemented. Some black-box models outperformed DTs and LR. The forward-selected features correspond to brain areas previously associated with AD. Shapley values identified biologically plausible associations with moderate-to-strong correlations with feature importances. The most important RF features to predict AD conversion were the volume of the amygdalae and a cognitive test score. Good cognitive test performances and large brain volumes decreased the AD risk. The models trained using cognitive test scores significantly outperformed brain volumetric models (p<0.05\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p<0.05$$\end{document}). Cognitive Normal (CN) vs. AD models were successfully transferred to external datasets. In comparison to previous work, improved performances for ADNI and AIBL were achieved for CN vs. Mild Cognitive Impairment (MCI) classification using brain volumes. The Shapley values and the feature importances showed moderate-to-strong correlations.


Introduction
Alzheimer's Disease (AD) is a neurodegenerative disease [1] and the most frequent cause of dementia. As the number of dementia patients increases continuously, AD is a globally growing health problem [2]. Currently, there is no causal therapy to cure AD [1]. To recruit and monitor subjects for therapy studies, it is important to identify patients at risk to develop AD early and to develop preclinical markers. Subjects with cognitive impairments that do not interfere with everyday activities are considered as having Mild Cognitive Impairment (MCI) due to AD [3]. The risk to develop AD is increased for subjects with MCI in comparison to cognitively normal controls (CN). However, not all subjects with MCI prospectively convert to AD. One possibility for early AD detection is to find patterns distinguishing between progressive MCI subjects (pMCI) who will develop AD and subjects with stable MCI (sMCI).
Multiple Machine Learning (ML) workflows were implemented for this differentiation. Some used models like Decision Trees (DTs) or Logistic Regression (LR), which were interpretable by design. However, black-box models like eXtreme Gradient Boosting (XGBoost) [4], Random Forests (RFs) [5], or Convolutional Neural Networks (CNNs) [6] often outperform those models. Black-box models are designed to identify highly complex associations and are challenging to interpret. Thus, the risk of learning spurious decision functions caused by patterns occurring in the training dataset is increased for black-box models [7].
This research is an extended version of earlier work [8] and thus expands the previously developed ML workflow. The previously developed workflow enabled the interpretation of black-box models based on model-agnostic Shapley values. Shapley values give individual explanations for the prediction of each subject and visualize complex relationships between features and model predictions. In this research, the previous experiments are expanded by using three AD datasets and three adjusted feature sets. In addition to the previously trained tree-based models, Support Vector Machines (SVMs) [9] and LR models were implemented and explained. In this work, Shapley-based explanations were compared to classical feature importance methods, absolute log odd's ratios, and permutation importance.
In comparison to previous work [8], an improvement of the classification results for ADNI and AIBL was achieved for the differentiation between Cognitive Normal (CN) controls and MCI subjects and MCI vs. AD classification for models trained without cognitive test scores and validated for AIBL. Additionally, the ADNI and AIBL results achieved for sMCI vs. pMCI classification, trained with cognitive test scores, outperformed previous work.
This article is structured as follows: In Section 2 related work is described. Section 3 introduces the datasets and methods used to implement the ML workflow and the details of the experiments. Section 4 elaborates on the experimental results. Those results are discussed including the mentioning of limitations in Section 5. Finally, Section 6 concludes the overall work.

Related Work
Interpretable ML was developed to explain black-box models [10]. As the heterogeneous etiology of AD is not completely understood yet, interpretability is important and enables the validation of the biological plausibility of ML models. Recently, some studies have used interpretable ML in AD detection.
For example, Long Short-Term Memory-(LSTM-) [11] based Recurrent Neural Networks (RNN) [12] were trained to classify CN vs. MCI subjects in [13]. The experiments included multiple techniques to fuse sociodemographic and genetic data with Magnetic Resonance Imaging (MRI) scans. The resulting models were evaluated for two AD datasets -the AD subset [14] of the Heinz Nixdorf Risk Factors Evaluation of Coronary Calcification and Lifestyle (RECALL) (HNR) [15] (61 MCI and 59 CN) and 624 subjects (397 MCI, 227 CN) of the Alzheimer's Disease Neuroimaging Initiative (ADNI) [16] study phase 1. To visually explain individual model decisions, Gradient-weighted Class Activation Mapping (Grad-CAM) [17] was used. A focus on biologically plausible regions was observed.
Four heatmap visualization methods -sensitivity analysis [18], guided backpropagation [19], occlusion [20], and brain area occlusion inspired by [21] -were compared for 3D-CNNs in [22]. The CNN models were trained using 969 MRI scans of 344 ADNI subjects (151 CN, 193 AD). However, it was unclear whether the described workflow ensured independent training and test sets by using multiple scans per subject [23]. Thus, the Cross-Validation (CV) accuracy of 77 % ± 6 % might be affected by data leakage. All heatmaps focused on AD-related anatomical brain areas.  Fig. 1: Implemented ML workflow. Volumetric features were extracted for one baseline (BL) MRI scan per subject. The ADNI dataset was randomly split into an 80 % training and 20 % test set. The most important MRI features were selected using forward feature selection, those were concatenated with sociodemographic features, number of ApolipoproteinE 4 (ApoE 4) alleles, and cognitive test scores. Bayesian Optimization implemented hyperparameter tuning. Black-box RFs, XGBoost models, LR models, as well as polynomial and radial SVMs, were trained and validated. Shapley values were calculated for black-box model interpretation. An evaluation was performed for the independent ADNI test set and for the external AIBL and OASIS datasets. (80,699 CN,4,157 AD) to the PREVENT cohort [35] (109 subjects with high risk to develop AD, 364 subjects with low risk). The PREVENT cohort was younger than the SHARE cohort. The models support the hypothesis that age is the most important risk factor in AD detection. Consistent with previous research [36], amongst other factors, less education, physical inactivity, diabetes, and infrequent social contact were identified as potential risk factors.
A two stage-based classification workflow that used SHAP values to interpret RFs was developed in [37]. In the first stage, CN vs. MCI vs. AD classification was performed. The second stage implemented the differentiation of sMCI and pMCI subjects. The models were based on multiple modalities including MRI, PET, CSF biomarkers, cognitive tests, medical history, genetics, and many more. The RFs were trained and tested using 1,048 subjects (294 CN, 254 sMCI, 232 pMCI, and 268 AD) of the ADNI dataset. For CN vs. MCI vs. AD classification, the model almost exclusively selected cognitive test scores as the most important features. The model learned bad cognitive test results increased the risk of AD and MCI. The most important features for sMCI vs. pMCI classification also were cognitive test scores followed by PET and MRI features. Bad cognitive test scores, small MRI volumes, and small PET uptakes were associated with disease progression.

Materials and Methods
The ML workflow, implemented using the programming language Python v3.6.9 [38] is shown in Figure 1. It enables the interpretation of black-box models trained to detect early AD. In the following, the workflow and the methods used for implementation are elucidated.

Datasets
Data used in the preparation of this article were obtained from the ADNI [16], the AIBL [39], and the OASIS [40] cohorts. ADNI (https://adni.loni.usc.edu, Accessed: 2022-05-01) was launched in 2003 as a public-private partnership. The primary goal of ADNI is to test whether a combination of biomarkers can measure the progression of MCI and AD. Those biomarkers include serial MRI, PET, biological markers, as well as clinical and neuropsychological assessments. The ongoing ADNI cohort recruited subjects from more than 60 sites in the United States and Canada and consists of four phases (ADNI-1, ADNI-2, ADNIGO, and ADNI-3). The subjects were assigned to three diagnostic groups. CNs have no problems with memory loss. Subjects with AD meet the criteria for probable AD defined by the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) [41]. The diagnostic criteria of ADNI were explained in [16]. The dataset was downloaded on 27 Jul 2020 and initially included 2,250 subjects. AIBL (https://aibl.csiro.au/, Accessed: 2022-05-01) is the largest AD study in Australia and was launched in 2006. AIBL aims to discover biomarkers, cognitive test results, and lifestyle factors associated with AD. As AIBL focuses on early AD stages, most of the subjects are CN. The MCI subjects of AIBL met the criteria described in [42], AD diagnoses following the NINCDS-ADRDA criteria [41] for probable AD. The diagnostic criteria of AIBL were described in [39]. Approximately half of the CN subjects recruited in AIBL show memory complaints [39]. AIBL data version 3.3.0 was downloaded on 19 Sep 2019 and originally included 858 subjects.

Subject Selection
For the ADNI dataset, all subjects with an MRI scan at the baseline visit were included. 521 subjects who have no MRI scan at the baseline visit were excluded, 29 subjects failed the MRI feature extraction described in Section 3.2. The demographics of the resulting 1,700 subjects are summarized in Table  2.
The 853 MCI subjects were divided into two groups. The sMCI subjects had a stable MCI diagnosis at all follow-up visits and the pMCI subjects converted to a stable AD diagnosis at any visit. 38 subjects with no follow-up visits and 96 subjects who reverted to CN or MCI were excluded from this separation, resulting in 400 sMCI and 319 pMCI subjects.
For AIBL, the same exclusion criteria were applied. Therefore, 170 subjects had no MRI scan at the baseline visit, the baseline MRI scans of 76 subjects failed for the MRI feature extraction pipeline described in Section 3.2. The demographics of the resulting 612 subjects are summarized in Table 3. Similar to the ADNI dataset, the 95 MCI subjects were divided into two groups. In this step, 60 subjects with no follow-up visits and 7 subjects who reverted to CN or MCI were excluded from this separation, resulting in 16 sMCI and 12 pMCI subjects.
The exclusion criteria were similarly applied for the OASIS-3 dataset, which originally included 1,098 subjects. For 983 subjects, a diagnosis of CN, MCI, or AD was assigned for at least one visit. The MRI feature extraction pipeline failed for all MRI scans of five subjects, no MRI scan was successfully matched to a diagnosis with a tolerance of 365 days for 57 subjects. In contrast to the ADNI and AIBL datasets, which exclusively included baseline visits, the first visit with an MRI scan and a diagnosis was used for OASIS. The demographics of the remaining 921 subjects are summarized in Table 4.
The number of subjects with MCI as baseline diagnosis is 19. This number was decreased if subjects without follow-up diagnoses were excluded. Thus, no experiments were executed to separate sMCI and pMCI subjects in OASIS-3. For reproducible research, the supplementary material contains lists with the subject and MRI IDs and the diagnoses for all datasets.

MRI Scan Selection
From the ADNI dataset, T1-weighted MRI scans recorded at the baseline visit were included. The acquisition parameters differ between scanners. During the ADNI-1 study phase, scans were recorded using a field strength of 1.5 T. In the remaining study phases, MRI scans with a field strength of 3.0 T were recorded.
From the AIBL dataset, T1-weighted MRI scans following the protocol of the ADNI 3D T1-weighted sequences were included. All AIBL scans had a resolution of 1 × 1 × 1.2 mm.
For the OASIS-3 dataset, T1-weighted MRI scans, recorded on three scanners, were included. The field strengths of those scanners are 1.5 T and 3.0 T [40].

MRI Feature Extraction
Using FreeSurfer v6.0 [43], volumetric features were extracted for each MRI scan. These include the volumes of 34 cortical areas per hemisphere of the Desikan-Killiany atlas [44], 34 subcortical areas [45], and the estimated Total Intracranial Volume (eTIV). As recommended for volumes in [46], the volumetric features were normalized by eTIV. This results in 103 MRI volumes, which were split into 49 features of the left hemisphere, 49 features of the right hemisphere, and five additional unpaired segmentations (3rd ventricle, 4th ventricle, brain stem, CSF, eTIV).
After the normalization, for paired volumes, the sum (described in Equation 1), the difference (described in Equation 2), and the ratio (described in Equation 3) of both hemispheres are calculated to investigate symmetry and to decrease feature interactions. This results in 152 MRI features (49 sums, 49 differences, 49 ratios, and five unpaired features). Brain asymmetry was previously associated with AD [47][48][49][50]. Equation 2 shows that differences were calculated by subtracting the right from the left volume similar to [48], where the cortical thickness was used instead of volumetric features.

Manual Feature Preselection
Three feature sets were investigated in the experiments. The manual feature selection aims to choose less invasive, accessible examination techniques which were able to detect early signs of AD. Feature set 1 (FS-1) includes all MRI features, and sociodemographic features including age, gender, and years of education. However, the years of education are only available for the ADNI dataset. Feature set 2 (FS-2) expands FS-1 by the number of ApoE 4 alleles, a genetic risk factor associated with AD, which can be obtained from blood samples or via less invasive swab tests from the inside surface of the cheek. Feature set 3 (FS-3) extended FS-2 by three cognitive tests including the score of the Mini-Mental State Examination (MMSCORE) and two logical tests to evaluate the long-term (Logical memory, delayed -LDELTOTAL) and the short-term memory (Logical memory, immediate -LIMMTOTAL). The CDR was strongly associated with AD diagnosis and was not included in the experiments.

Dataset Splitting
The ADNI dataset was split on the subject level into two distinct subsets. The training set included 80 % of the data and the test set consisted of the remaining 20 %. The splitting was executed within each diagnostic group to ensure similar distributions. The AIBL and OASIS datasets were used as external test sets. None of the AIBL and OASIS subjects was used in the training or model selection process.

Feature Selection
Initially, 152 MRI features were extracted from the MRI scans. Those features are reduced to focus the ML models on the most important features. For this reason, feature forward selection was implemented. In comparison to feature selection methods like RF feature importance, this method avoids correlated features in the dataset [51]. Forward selection is a greedy procedure that iteratively identifies the best new feature until no improvement was reached. The training dataset was split into an 80 % training set and 20 % validation dataset. The training dataset was used to train the ML model used for classification with default hyperparameters on the feature set, and the validation dataset was used to calculate the validation accuracy for this feature set. The selected MRI features were expanded using the features described in Section 3.3.

Hyperparameter Tuning
To tune the hyperparameters of the ML models, Bayesian optimization [52] was implemented using the Python package scikit-optimize v0.8.1 [53]. Bayesian optimization maps the dependency of the hyperparameters and the model performance using a Gaussian Process. Initially, ten nearly random hyperparameter combinations were selected by a Latin Hypercube Design (LHD) [54]. Bayesian optimization with LHD initialization was successfully used in previous research [55] to optimize the parameters for early AD detection. Each parameter was split into ten equidistant intervals and one sample was randomly chosen per interval. This results in ten samples per parameter, which were randomly matched. A stratified 10 × 10-fold CV [56] was applied to the training dataset to estimate the model accuracy for an independent test set. Stratified 10 × 10fold CV was implemented by splitting each diagnostic group of the training dataset into ten distinct folds using the Python package scikit-learn v0.23.2 [57]. Ten iterations were performed, each with a different fold used as a validation dataset (10 %). The training dataset included the remaining nine folds (90 %). With shuffled data in each run, this procedure was repeated ten times. The ML model was initially evaluated for ten LHD combinations.
To predict the average CV accuracy for the initial parameter combinations, the Gaussian Process was fitted. Afterward, an optimization selected the next promising parameter combination. As an acquisition function, the Lower Confidence Bounds (equation 4) was used. In this equation,μ Θ is the Gaussian Process estimation of the CV accuracy andΣ Θ is the covariance at parameter combination Θ.
The hyperparameter combination selected in the previous step was again evaluated using CV. Afterward, to refine the Gaussian Process and to determine the following combination, the respective tuple of hyperparameter and mean CV accuracy was added to the Gaussian Process. The procedure was repeated 25 times. The best hyperparameter combination was chosen to train the final model.

Model Training
During hyperparameter training and final model generation, XGBoost models, RFs, radial SVMs, polynomial SVMs, DTs, and LR models were trained. The preprocessing pipeline included centering, scaling, and median imputation. The entire preprocessing pipeline was implemented within the CV to avoid over-optimistic performance estimations [58]. The parameters were calculated for the CV training set and reused for the test and external datasets. The preprocessing was implemented using the Python package scikit-learn v0.23.2 [57].
Ensemble-based black-box XGBoost [4] models, follow the idea of gradient boosting models [32]. It means the combination of multiple weak classifiers results in a strong, joint classifier. By learning the gradients of the previous classifier, Gradient boosting fulfills this assumption. The final prediction consisted of the sum of weak classifier predictions. XGBoost is distributed as an open-source software library and the main advantages are scalability, parallelization, and distributed execution. The hyperparameters and intervals used during Bayesian optimization are summarized in Table 5. The hyperparameter n estimators sets the number of boosting iterations, learning rate was the learning rate that preferences weak classifiers at early iterations, the minimum loss reduction required to split a node is defined by gamma, the hyperparameter max depth sets the maximum depth of an individual tree, and the minimum number of observations in a child node was denoted as min child weight, subsample and colsample bytree set the proportion of randomly subsampled training instances and features per iteration. The Python package xgboost v1.2.0 [59] implemented the XGBoost algorithm.
RF [5] training was implemented using the Python package scikit-learn v0.23.2 [57]. The RF algorithm is based on multiple DTs. Each DT was trained using randomly chosen features and subjects. Those subjects were selected using bootstrap sampling [60] on the training dataset. RF inference was computed by summarizing the individual DTs using a majority voting. The RF hyperparameters are summarized in Table 5. n estimators sets the number of DTs, each split used a random subset of max features features, the hyperparameter min samples leaf describes the minimum number of samples in a leaf node. Support Vector Machines (SVMs) [9] were implemented using the Python package scikit-learn v0.23.2 [57]. SVMs separate two classes using a decision boundary which was referred to as an n-dimensional hyperplane. Here, n is the number of features. To increase the robustness of the hyperplane for unknown observations, SVMs select the hyperplane with the largest distance from the observations. For this reason, the distance between the hyperplane and the observations was maximized using the hinge loss function [61]. The support vectors describe the observations closest to the hyperplane. Removing support vectors from the dataset directly influences the hyperplane. The cost parameter C enables SVMs to avoid overfitting, the higher C, the less complex an SVM is.
Kernel functions help to model complex interactions. In this research, a polynomial and a radial kernel were implemented. The degree hyperparameter of the polynomial kernel controls the degree of the kernel, high values lead to more complex hyperplanes. The gamma hyperparameter constraints the influence, a single observation has on the hyperplane. If gamma = scale, 1 #f eatures·σ was used as a value of gamma, if gamma = auto, a value of 1 #f eatures was used. The SVM hyperparameters and their ranges are summarized in Table 5.
In contrast to the black-box models, DTs [62] and LR models were selected as simple and interpretable comparison models. DTs were implemented using the Python package scikit-learn v0.23.2 [57]. A DT consists of successively learned decision rules of the form x ≤ t for numerical or x ∈ t for categorical features t is a threshold or a subset of values. The next decision rule was selected by the splitter which ranked all possible rules using a criterion. Decision rules were iteratively expanded until a maximum depth of max depth was met or a percentage min samples split of samples were in a split.
LR [63] is a Generalized Linear Model (GLM) with a logistic link function. This link function allows the processing of binomial output variables. The logistic model function is given in equation 5. The model predicts the probability P (Y = 1|X = x, Θ) of observation x with given parameters Θ being in the positive class Y = 1. The LR algorithm was implemented using the Python package scikit-learn v0.23.2 [57].

Model Interpretation with Shapley Values
There are multiple methods to interpret ML models. An overview can be found in [10]. For example, DTs and LR models are interpretable by design. However, black-box models often outperform those interpretable models but the interpretation of black-box models is complicated. In this research, model-agnostic The idea of using Shapley values to explain black-box ML models is, to fairly decompose the contribution of each feature for the subject's prediction. Due to this fairness, the sum of all Shapley values is equal to the difference between the average model prediction and the probability prediction of a subject. Equation 7 shows Shapley values are defined as the average, weighted contribution, a simplified feature has in all subsets. For the exact calculation of a Shapley value Φ i for a given subject and feature i, it is required to determine the contribution of this feature for all subsets S of the entire feature set F . The investigation of each subset S requires the retraining and evaluation of the black-box model f S (S). With the help of the model performances trained with (f S∪i (S ∪ i)) and without (f S (S)) the feature at interest i, their differences were calculated. The weighted average difference across subsets builds the Shapley value. The weighting depends on the relative number of features |S| in subset S. High weights were assigned to subsets with few and many features. In this way, the estimation of the main individual effects and the total effects are supported.
However, the number of subsets increases exponentially with the number of input features, leading to high computational effort for the exact calculation of Shapley values. Kernel SHapley Additive exPlanations (SHAP) [65] avoid timeconsuming repeated training and evaluation by estimating Shapley values. This algorithm is based on Local Interpretable Model-agnostic Explanations (LIME) [66] and was implemented using the Python package shap v0.38.1 [27].
A new dataset containing variants of the observation at interest is created by permuting selected features. An additive linear model (Eqn. 8) with x is a simplified representation of the black-box input features and g(x ) is the explanation model was fitted to the generated dataset.
The weights Φ i of the explanation model estimates the SHAP values for each subject and each feature. For tabular data, the simplified features are binned binary feature representations that represent if the original feature value or a permutation was used. SHAP force plots [67]  There are some reasons, including out-of distribution sampling during Shapley value approximation and not taking into account feature correlation, why Shapley values should be used with caution for black-box model interpretability [68]. Therefore, it is important to compare Shapley value results with other ML explanation methods, or to reduce or consolidate correlated features [69]. In this work, forward selection was implemented to reduce the number of correlated features in the dataset, Shapley values were compared to classical feature importance measurements (Section 4.5), and correlated features are consolidated to aspects.

Evaluation
The models were evaluated for the ADNI test set and the external AIBL and OASIS datasets. The performance was measured using accuracy (ACC) (equation 9), balanced accuracy (BACC) (equation 10), F1-Score (F1) (equation 11), and Matthews correlation coefficient (MCC) (equation 12). Table 6 visualizes the contingency table used for the calculation of those scores. Providing multiple scores for evaluation increased the comparability to other research. In comparison to accuracy, which focuses on correctly classified cases, the F1-Score focuses on incorrectly classified cases. The macro averaging F1-Score was calculated to address both, the diseased and the healthy subject classification. Balanced accuracy is based on both, sensitivity and specificity and thus is suitable to evaluate imbalanced class problems. The MCC returns a value between 0 and 1 and is also suitable to handle imbalanced datasets.
Additionally, the Area under the Receiver Operating Curve (AUROC), which models the relationship between the True Positive Rate (TPR -equation 13) and the False Positive Rate (FPR -equation 14) for different classification thresholds was computed for all models. AUROC is suitable to investigate classification tasks with imbalanced datasets.

Results
In the following, the experimental results are presented. The MRI features selected using forward selection and the performances achieved for CN vs. AD, CN vs. MCI, MCI vs. AD, and sMCI vs. pMCI classification were given. SHAP summary plots compared the models trained using different feature sets, validation datasets, and classification models. The results of SHAP summary plots are compared to natural RF-and XGBoost-based feature importance scores and permutation importance scores. The influence of feature interactions for Shapley values is investigated and SHAP force plots explain individual model predictions of interesting subjects.

Feature Selection
The MRI features selected during forward selection for CN vs. AD classification and different ML methods used as base classifiers are summarized in Table 7. In this research, feature forward selection was used to reduce the number of MRI features and the influence of correlated features.
For the CN vs. AD detection task, the RF, and the polynomial SVM chose five features, the XGBoost, the DT, and the radial SVM chose four features and the LR chose six features. Overall, the six methods chose 16 different features. The most important feature for the RF, the XGBoost, the DT and the radial SVM was the sum of the left and right amygdalae. For the polynomial SVM and the LR, the most important feature was the sum of the entorhinal cortices. Both features were previously associated with AD detection [70][71][72][73]. Previous research also shows, that most of the selected features are associated with atrophy patterns in AD [74]. All methods also selected at least one difference or ratio of the left and right cortical or subcortical areas. Those features describe the asymmetry of both hemispheres. Brain asymmetry measurements were associated with AD [47][48][49][50] and were also successfully applied for ML models in this field [75].
The rankings of the forward selection for CN vs. MCI detection and different base classifiers are given in Table 8. For CN vs. MCI detection, the RF, XGBoost, and LR base classifiers chose four features, the DT chose three features, the polynomial SVM chose eight features, and the radial SVM chose six features. Overall, the six ML methods chose 25 different features. Thus, in comparison to the CN vs. AD classification, the ML models show less agreement about the selected features. Consequently, the feature which was selected first in the forward selection process differed in five out of six methods. For the RF and the DT, the sum of the insular cortices was selected, the XGBoost classifier chose the ratio of the inferior parietal lobule, the polynomial SVM selected the sum of the lingual gyri, the SVM with the radial kernel chose the sum of the temporal pole volumes and the LR selected the sum of the left and right middle temporal gyri. Those features were previously associated with AD progression [70][71][72][73][74][76][77][78]. Similar to the CN vs. AD classification, all models selected at least one feature describing the asymmetry of the cortical and subcortical brain regions.
The forward feature selection results of the six ML models for MCI vs. AD classification are summarized in Table 9. Four of the six models, namely RF, XGBoost, SVM poly, and LR selected five features. The DT chose six different features and the radial SVM selected two MRI features. Overall, the six methods selected 22 unique features.
The most important features were the sum of the left and right hippocampi for the RF and DT model, the difference of the lateral ventricles for the XGBoost model, the sum of the inferior temporal gyri for both SVMs, and the sum of the entorhinal cortex volumes for the LR. Those features were previously associated with AD detection [70-73, 79, 80].
The results of the forward selection for the sMCI vs. pMCI classification task are summarized in Table 10. Five features were selected by the RF model,

Classification Tasks
In the following the classification performances achieved for the four classification tasks are elaborated. The hyperparameters which reached the best accuracies during CV and which were thus used during training of the final models are summarized in Table 11.

CN vs. AD
The results achieved for CN vs. AD classification are summarized in Table 12. The no information rates were 60.36 % for the independent ADNI test set, 86.27 % for AIBL, and 78.05 % for OASIS. CN was the most frequent class in all datasets. The best accuracy during CV of 99.68 % ± 0.74 was achieved for the DT trained with feature selection and FS-3. This model also reached a perfect classification for the ADNI test set. All models trained for the CN vs. AD task reached accuracies higher than the no information rate for the ADNI dataset. The best AIBL accuracy of 95.94 % was achieved for the XGBoost model trained for FS-3 and with feature selection. This model also reached the best F1-Score (91.48 %) and the best MCC (0.830) for the AIBL dataset. The best balanced accuracies of 95.45 % for the AIBL dataset were reached for both SVMs trained with feature selection and FS-3. The LR model trained with feature selection for FS-3 reached the best AIBL AUROC of 99.55 %. Overall, two models achieved AIBL accuracies smaller than the no information rate of 86.27 %. Those models were the DTs trained with feature selection for FS-1 and FS-2.
The best OASIS accuracy of 90.58 % was achieved for the polynomial SVM which was trained with feature selection and FS-3. For OASIS, four models achieved accuracies worse than the no information rate of 78.05 %. Three of those models were trained on FS-1 and with feature selection, namely, the RF, the DT, and the SVM. The last model reaching an accuracy worse than the no information rate was the DT trained with FS-2 and feature selection.

CN vs. MCI
The results achieved for CN vs. MCI classification are summarized in Table  13. The no information rate for this task was 62.64 % for the ADNI test set, 82.44 % for AIBL, and 97.39 % for OASIS. MCI was the most frequent class in the ADNI dataset, whereas, for AIBL and OASIS, CN was.
The results achieved for CN vs. MCI classification were worse than those for the CN vs. AD task. The best accuracy during CV of 90.21 % ± 2.72 was achieved for the XGBoost model trained for FS-3 and without feature selection. The best accuracy for the ADNI test set was 91.58 % reached for two models. Both models, the radial SVM and the XGBoost model were trained for FS-3 and with feature selection. The latter model also reached the best ADNI balanced accuracy and ADNI F1-Score. Overall, none of the models reached an ADNI accuracy worse than the no information rate of 62.51 %. The results achieved for AIBL and OASIS were worse than the ADNI results. The best AIBL accuracy was 68.95 % achieved for two DTs trained with forward feature selection for FS-1 and FS-2. These models also reached the best AIBL balanced accuracies, AIBL F1-Scores, and AIBL MCCs.
The best OASIS accuracy of 55.05 % was reached for the DT trained without feature selection for FS-3. For the CN vs. MCI classification, all models achieved accuracies worse than the no information rates for OASIS and AIBL.

MCI vs. AD
The MCI vs. AD classification results are summarized in Table 14. The no information rate was 71.85 % for the ADNI test set, 57.23 % for AIBL, and 91.24 % for OASIS. The most frequent class was MCI for ADNI and AIBL as well as AD for OASIS.
The best CV accuracy of 89.39 % ± 2.99 was achieved for the RF trained without feature selection and FS-3. For the independent ADNI test set, the best accuracy was 88.66 %, reached by the RF and LR models trained with feature selection and FS-3. The first of those models also reached the best ADNI AUROC of 95.50 %, whereas the second model achieved the best ADNI F1-Score (85.50 %), and ADNI MCC (0.712). None of the models reached an  ADNI accuracy worse than the no information rate. However, the DT trained without feature selection for FS-1 as well as the XGBoost and the DT both trained for FS-2 and with feature selection exactly achieved the no information rate of 71.85 % for the independent ADNI test set. The two DT models mentioned predicted the MCI class for all subjects and thus represented random classifiers. The best AIBL accuracy was 84.94 % reached by the RF model trained without feature selection and FS-3. This model also reached the best AIBL balanced accuracy (83.64 %), AIBL F1-Score (84.24 %), and AIBL MCC (0.693). Except for the previously mentioned random classifiers, all models outperformed the no information rate for the AIBL dataset which was 57.23 %. The performances for OASIS were worse than those achieved for ADNI and AIBL. The best OASIS accuracy of 57.14 % was achieved for the radial SVM trained without feature selection and FS-3. The random classifiers achieved a worse accuracy of 8.76 % for OASIS. These accuracies correspond to the ratio of MCI subjects in the dataset. MCI was the most frequent class for the ADNI dataset and the rarest class for OASIS. All models achieved OASIS accuracies worse than the no information rate of 91.24 %.

sMCI vs. pMCI
The results reached for sMCI vs. pMCI classification with no information rates of 55.56 % for the ADNI test set and 57.14 % for AIBL are summarized in Table 15. As previously mentioned, due to the lack of available data, OASIS was not used for this comparison. The best CV accuracy of 70.75 % ± 5.94 was achieved for the RF trained with feature selection and FS-3. For the independent ADNI test set, the radial SVM which was trained with forward feature selection for FS-3 reached the best accuracy (75.00 %), balanced accuracy (74.38 %), F1-Score (74.51 %), and MCC (0.491). The best AUROC of 80.14 % was reached for the XGBoost model trained with forward feature selection for FS-3. None of the models achieved worse results than the ADNI no information rate of 55.56 %. The best AIBL accuracy was 82.14 % reached for the radial SVM which was trained for FS-3, and without feature selection. Five models achieved AIBL accuracies worse than the no information rate. Those models were all DTs trained with FS-1 and FS-3, and the LR trained with feature selection and FS-3.

Feature Sets
As can be seen in Table 12, for CN vs. AD classification, all models achieved the best scores using FS-3. Thus, adding cognitive test results to the dataset improved the overall classification results. The SHAP summary plots for the polynomial SVMs trained with feature selection for all three feature sets are shown in Figure 2. SHAP summary plots [67] explain the predictions for the subjects of the entire ADNI, AIBL, and OASIS datasets. Each point plots a Shapley value for a subject and a feature and is colored depending on the feature value. The vertical axis represented the features, ordered by the mean absolute Shapley values and their distribution. The higher a Shapley value is, the more the feature expression increases the probability the model classifies the subject as an AD subject. All SHAP summary plots were limited to the top ten features. Following the biological processes of AD, small brain volumes [70,[82][83][84], large ventricular volumes [85,86], the presence of ApoE 4 alleles [87][88][89], and bad performances in cognitive test scores were expected to have a pathogenic effect on the disease progression. The left hemisphere was expected to be more affected by atrophy than the right [75] one. However, some investigations for MCI subjects also showed the right hippocampus was more affected [90]. This asymmetry occurs primarily in the hippocampi and amygdalae [91,92]. For FS-1 and FS-2, the most important feature was the sum of the left and right amygdalae. Consistently with the previously mentioned atrophy patterns [70,[82][83][84], small volumes of the amygdalae (colored in blue) increased the probability the model classifies a subject as an AD subject. Large amygdala volumes (colored in red) were associated with CN subjects. The second most important feature for both models was the sum of the left and right entorhinal cortex. The model, trained using FS-1, learned that large volumes (colored in red) of the amygdalae, the entorhinal cortices, and the inferior parietal lobules had protective effects on the disease progression (negative Shapley values). Those assocications correspond to previous research [70,[82][83][84]. The model additionally learned that a large difference between the left and right cortex volume (colored in red) was associated with CN. Large differences were reached if the volume of the left hemisphere was larger than the right one. The same observation applies to the ratio of the left and right paracentral lobules. Considering the sociodemographic features show, that the model learned, young age (colored in blue) was associated with disease progression. However, the summary of the ADNI dataset in Table 2 shows the mean age of the CN group was younger than the AD group. Additionally, the model learned that females (colored in blue) and subjects with low education more likely develop AD. FS-2 added the number of ApoE 4 alleles to FS-1. The additional feature was the third most important feature of this model. The model learned that a large number of ApoE 4 alleles (no ApoE 4 alleles are colored in blue, one ApoE 4 allele is colored in purple, two ApoE 4 alleles are colored in red) led to an increased risk. Previous research also identified ApoE 4 as an AD risk factor [87][88][89]. Biologically plausible associations [70,[82][83][84] were noted for the summed volumes of the left and right amygdalae, the entorhinal cortices, and the inferior parietal lobules. The ratio of the left and right paracentral lobule volumes showed an increased risk of AD if the left hemisphere was smaller than the right one. The same applies to the differences in cortex volumes.
FS-3 added the results of three cognitive tests to FS-2. The LDELTOTAL cognitive test score achieved the best feature importance for this model, followed by the MMSCORE. The LIMMTOTAL score achieved the sixth-best feature importance. For all cognitive test scores, the model associated good scores (colored in red) with healthy subjects. The third most important feature was the summed volume of the entorhinal cortices. Consistently with AD atrophy, small volumes (colored in blue) were associated with AD progression. The same applied to the sum of the inferior parietal lobules and the amygdalae. Similar to the FS-1 and FS-2 models, the FS-3 model also learned young age (colored in blue) was associated with AD, although, the mean age of the ADNI-CN group was younger than the mean age of the ADNI-AD group.
As can be seen in Table 13, for the CN vs. MCI classification, FS-3 outperformed FS-1 and FS-2 for the ADNI and AIBL performance scores. The best accuracies for OASIS were reached for FS-3, whereas FS-2 models outperform those models for F1-Score, balanced accuracy, AUROC, and MCC.
For the MCI vs. AD task, which is summarized in Table 14, the same applied to all ADNI and AIBL scores. For the OASIS dataset, the best accuracy and F1-Score were reached by FS-1 and the best balanced accuracy, AUROC, and MCC for FS-3.
The results for the sMCI vs. pMCI classification are shown in Table 15. Those results show that FS-3 outperformed FS-1 and FS-2 for all ADNI scores. For the AIBL dataset, the best accuracy, balanced accuracy, F1-Score, and MCC were achieved for FS-2, whereas the best AUROC was reached for FS-3.
To indicate whether the differences in ADNI test accuracies between the three feature sets are statistically significant, a Friedman test [93] (p − value < 0.05) was executed. For this investigation, the results of Table 7, Table 8, Table  9, and Table 10 are summarized, resulting in 48 observations per feature set (six different models, two feature selection methods, and four tasks). The pvalue of 2.2 · 10 −16 indicated statistically significant differences between the feature sets. To identify, which feature sets differed from each other, a pairwise Wilcoxon signed rank test (p − value < 0.05) with Bonferroni adjustment was executed. A summary of the results is given in Table

Reproducibility
In this work, all models were trained using the ADNI dataset. Data from AIBL and OASIS subjects were used to test model reproducibility. For all classification tasks, most models achieved worse results for AIBL and OASIS in comparison to the independent ADNI test set. The AIBL accuracies are plotted against the ADNI accuracies for all previously described models in Figure 3. Overall, the AIBL accuracies were worse than those achieved for ADNI. The CN vs. AD classification models achieved the best accuracies for ADNI and AIBL. The worst AIBL accuracies were achieved for CN vs. MCI classification, where all models reached AIBL accuracies worse than the no information rate. For the remaining classification tasks, most models reached  Fig. 3: Plot showing the accuracies achieved for the independent ADNI test set and the AIBL dataset for all models described in Table 12, Table 13,  Table 14, and Table 15. The no information rates for all classification tasks are visualized as horizontal lines for AIBL and as vertical lines for ADNI AIBL accuracies better than the no information rate. For the sMCI vs. pMCI classification, no correlation between ADNI and AIBL accuracies was observed.
In Figure 4, the OASIS accuracies of all previously described models are plotted against their ADNI accuracies. Similar to the AIBL results, the overall OASIS accuracies were worse than those achieved for ADNI. The best results for OASIS were achieved for CN vs. AD classification. Those models mainly reached accuracies better than the no information rate. The OASIS no information rates for the remaining classification tasks were larger than 90 % and all classification models trained for the ADNI dataset performed worse. However, the most frequent classes in OASIS and ADNI differed from each other for those classification tasks. For the OASIS dataset, the worst accuracy was achieved for MCI vs. AD classification. Reasons for the worse OASIS performances were, for example, differing MRI protocols and differences in the subject selection process.
To compare the model predictions for the three datasets, SHAP summary plots were visualized for the individual datasets in Figure 5. Those Fig. 4: Plot showing the accuracies achieved for the independent ADNI test set and the OASIS dataset for all models described in Table 12, Table 13, Table 14, and Table 15. The no information rates for all classification tasks are visualized as horizontal lines for OASIS and as vertical lines for ADNI important features were the cognitive test scores, and bad scores were associated with disease progression. Those cognitive test scores were followed by the volumetric features, of the amygdalae, medial orbitofrontal cortices, and pars triangularis, as well as the AGE, the number of APOE 4 alleles, and the number of education years in slightly differing orders. For all volumetric features, biologically plausible associations [70,[82][83][84] were learned. The number of education years was not available in AIBL and OASIS, those scores are therefore colored in grey.
SHAP summary plots for the RF trained with feature selection for CN vs. MCI classification based on FS-1 are shown in Figure 6. The figure contains subplots for all three datasets. Overall, the Shapley values for this model were asymmetric. The positive Shapley values show stronger amplitudes than the negative ones. One explanation for this behavior might be, that the MCI class was more frequent in the ADNI training dataset. For the ADNI and AIBL dataset, the most important feature was the sum of the inferior parietal lobules followed by the age and gender. The model learned that small brain volumes, young age, and male gender increased the risk to develop MCI. The volumetric observations correspond to previous research [70,[82][83][84]. The volume of the inferior parietal lobules was the second most important feature for the OASIS dataset. Age was the most important feature for OASIS and the second most important feature for ADNI and AIBL. The model learned young age was associated with disease progression. It can be noted in Table 2 that the mean age of CN subjects is older than the mean age of MCI subjets in the ADNI dataset but not in the AIBL (Table 3) and OASIS (Table 4) datasets. The differences observed in the datasets might cause problems in model reproducibility. The feature representing the years of education was in the fifth position for the ADNI dataset. That information was not available in OASIS and AIBL and was thus colored in grey. Consistently, this feature was the least important one for both datasets. Overall, the ranking of the feature importance differed for all models.

Classification Model
In this research, six ML models were trained to compare their results to each other. A line plot of the accuracies achieved for the independent ADNI test set dependently on the classification task and the ML model is shown in Figure  7. For the sMCI vs. pMCI classification, it can be seen, that the performance variance is smaller for RF and XGBoost models in comparison to the remaining ML models. In addition, the polynomial SVMs achieved worse results for this classification task. Overall, the DT models were often outperformed by RF and XGBoost classifiers. The LR models outperformed the DTs in many cases, except for the CN vs. MCI classification. Overall, no ML model outperformed the remaining models.
To indicate whether the differences in ADNI test accuracies between the ML methods are statistically significant, a Friedman test (p − value < 0.05) was executed. For this investigation, the results of Table 7, Table 8, Table  9, and Table 10 Table 12, Table 13, Table 14, and Table  15. For each ML classifier, 36 models were included To visualize ML model differences, Figure 8 shows SHAP summary plots for all six models. All models were trained using FS-3 with feature selection to distinguish between sMCI and pMCI subjects. The feature selection results in slightly different features within all models. Overall, the Shapley values had the largest deviance for the DT and the SVM with a polynomial kernel, followed by the LR model and the RF. The most important feature for all models except for the RF and the radial SVM was the LDELTOTAL cognitive test score. For this test score, all models associated small feature values (colored in blue) with disease progression. For the radial SVM and the RF, LDEL-TOTAL was the second most important feature. The most important feature in the RF model was the sum of the left and right amygdalae. The model learned, that large brain volumes decreased the patient's risk to develop AD. This observation is biologically plausible [70,[82][83][84]. The sum of the amygdala volumes was the third most important feature in the DT and the radial SVM. The number of ApoE 4 alleles was the most important feature for the radial SVM. The model learned that ApoE 4 is an AD risk factor, and the presence of ApoE 4 alleles is associated with AD progression. The number of ApoE 4 alleles is the second most important feature for the DT, and the LR, the third most important feature for the XGBoost model and the polynomial SVM, and the fourth most important feature for the RF. In this comparison, all models except for the DT and the polynomial SVM had at least one asymmetry feature within its top ten features. The decision tree only depended on three features, namely the LDELTOTAL cognitive test score, the number of ApoE 4 alleles, and the hippocampus volume. Most associations, the models learned, were biologically plausible. The radial RF showed two features with a biologically implausible association [70,[82][83][84]. The model learned, that high volumes of the lateral occipital sulci, as well as a high number of education years, are associated with disease progression. Those features are ranked as the ninth and tenth important features in this model. Surprisingly the association of the education feature was also learned for the SVMs and the LR. For the polynomial SVM, the summed volumes of the rostral anterior cingulate cortices show a biologically implausible [70,[82][83][84] association. Overall, biological plausibility should only be expected for high-performing models.
As a comparison, Figure 9 visualizes the natural feature importance for the RF and XGBoost models, and the log odd's ratios for the LR model (ordered by the absolute log odd's ratio), and Figure 10 shows the permutation importance of all models. The most important features of all natural feature importance plots and all permutation importance plots correspond to the SHAP summary plots.
The Kendall's tau rank correlation [94] between feature rankings for all SHAP models, natural XGBoost and RF feature importances, absolute log odd's ratios of the LR model, and permutation importance of all models is shown in Figure 11. Due to the forward feature selection, the different models are trained on slightly different MRI features and the correlation was calculated for pairwise complete observations. However, the sociodemographic data, the number of ApoE 4 alleles, and the cognitive test scores were used to train all models. As the features within a specific model are identical, firstly the SHAP values, the permutation importance, and the feature importances are  The execution times of the different ML models, the SHAP algorithm, and the permutation importance calculation are summarized in Table 18. All experiments were executed on an NVIDIA ® DGX-1 1 supercomputer. The execution environment was an NVIDIA ® -optimized 2 Docker 3 [95] container, running a Deepo 4 image. The results showed, that except for the RF model all models were trained in less than 1 s. The mean training time during CV was 5.66 s for the RF. The RF model was trained using twelve features. The long training time of the RF model was also reflected in the SHAP algorithm and the permutation importance. To compute the SHAP values of one subject, the RF model consumes 47. Overall, it has to be mentioned, that SHAP value calculation is a time-intensive process. However, the times presented can only be used as an orientation, and optimation is possible by for example clustering the background subjects of the SHAP algorithm. In this work, the samples of the entire training dataset were used as background subjects. The SHAP execution time depended on the number of features in a dataset, the time needed for model inference, the number of background subjects and the number of subjects that should be explained.

Feature Dependency and Shapley Values
As feature correlations reduce the validity of explainability methods [96,97], the previously explained SHAP summary plots are all generated using feature selection to avoid strong feature correlations in the dataset. Feature correlations can also make explainability more difficult [97] and may lead to biologically implausible explanations. The original dataset without feature selection contains many correlated features. To compare the explanations for such a dataset [69] developed a method to consolidate correlated features to aspects and compute permutation and SHAP importances for those aspects. First, correlated features of the entire training dataset are identified using Spearman rank correlation coefficients. Hierarchical agglomerative clustering [97] was used to create a dendrogram. In this work, a threshold of H = 0.5 determining the least correlated features in a group, filtered the resulting aspects from the dendrogram. The permutation and SHAP importances are computed by jointly permuting all features in an aspect. This work uses the python package dalex v1.4.1 [98] for implementation. The resulting aspects computed for sMCI vs. pMCI classification and FS-3 without feature selection, are shown in Table 19. The 161 features of FS-3 are consolidated to 79 aspects. Of those aspects, 14 included an individual feature. Of the remaining 65 aspects, nine included more than two features. As was expected, the differences and ratios of the same region are often correlated. At least one pair of ratio and difference for the same region was included within 49 aspects. Aspect 34 included four regions within the medial temporal lobe. Previous research showed, that those regions are important for the detection of AD progression [91,92]. Aspect 30 consolidated three ventricular regions. Previous research found that ventricular enlargement was associated with AD progression [85,86]. Aspect 46 included the cognitive test scores LIMMTOTAL and LDELTOTAL, and aspect 45 included the eTIV and the gender. Using those aspects, the SHAP importances visualized in Figure 12 were computed for the sMCI vs. pMCI classification without feature selection and for all ML models. The most important aspect for the RF, the XGBoost, and the DT was aspect 34, which consolidated the entorhinal cortices, the parahippocampal gyri, the amygdalae, and the hippocampi. Those brain areas were associated with AD in previous research [91,92]. Aspect 34 also was the second most important aspect of the polynomial SVM and the third most important aspect in the LR and radial SVM. The most important aspect for the LR was aspect 27 which consolidated volumes of the fusiform, the inferior temporal, and the middle temporal gyri. This aspect also reached the second rank for the RF, radial SVM, and XGBoost models as well as the third place for the polynomial SVM. Previous research [91,99] showed that those regions are affected in early AD stages. Aspect 46 consolidated the LDELTOTAL and LIMMTOTAL cognitive test scores and was the most important aspect for both SVMs. This aspect also achieved the third rank for the XGBoost, RF and DT models, and the second rank for the LR. Overall, the most important aspects of the different models seem to be similar for the ML models. Figure 13 shows the aspect permutation importance plots of the previously described models. The most important aspects chosen by SHAP importance and permutation importance matched for the DT, and the radial SVM. The most important aspect of the RF and XGBoost models was aspect 46 which included the cognitive test scores LDELTOTAL and LIMMTOTAL. This aspect reached the third rank using the SHAP method for both models. For the LR, aspect 46 was also identified as the most important aspect. This aspect reached second place using the SHAP explanations. The highest permutation importance of the SVM was reached for aspect 34. This aspect reached second place using the SHAP method.
To investigate the correlation of the feature rankings, between the different methods, Kendall's tau rank correlation between the SHAP feature ranking and the permutation method is visualized in Figure 14. A very strong correlation of 1.00 was observed between the permutation importance and the SHAP values of the DT. A moderate correlation of 0.53 was observed between the SHAP values and the permutation importance of the RF. The XGBoost SHAP rankings also showed a moderate correlation of 0. 58

Explanations of Individual Predictions
To investigate explanations for individual model predictions, Figure 15 shows SHAP waterfall plots of four ADNI subjects. Those plots visualize the predictions for the RF trained with FS-3, and feature selection for sMCI vs.   Figure 15a. This is a subject from the ADNI test set and had a diagnosis of pMCI. The model prediction of this subject was 0.735. As this value was higher than 0.50, the subject was correctly classified as a pMCI subject. The most important feature with a pathogenic effect was the volume of the inferior parietal lobules. A relatively small normalized feature value of 0.237 was observed. The Shapley value of this feature was 0.12 and thus, this feature expression increased the model prediction by 0.12. The LDELTOTAL cognitive test score reached a feature value of 3, which was a relatively bad test performance and thus increased the subject's risk to develop AD. Surprisingly, the relatively old age of 85.6 years decreased the patient's risk to develop AD by 0.03. The model prediction for an sMCI subject (PTID: 037 S 4146) is demonstrated in Figure 15b. This subject was sampled from the ADNI test set and reached a model prediction value of 0.149. The subject had a moderate to a large volume of amygdalae which decreased the subject's risk of prospectively developing AD by -0.15. The subject also has two ApoE 4 alleles and as the presence of ApoE 4 alleles is a risk factor for AD, this increased the patient's risk. Additionally, the relatively high LDELTOTAL cognitive test score of 9 had a protective effect.
SHAP waterfall plots of subjects not included in the sMCI vs. pMCI dataset because those pMCI subjects reverted to MCI at a later visit (explained in Section 3.1.1) are visualized in Figure 15c and Figure 15d. The prediction of the subject with PTID 036 S 4430 is visualized in Figure 15c. This MCI subject converted to AD 5.54 months after the baseline visit, but reverted to MCI 12.00 months after the baseline, and again converted to AD 23.64 months after the baseline visit. The last diagnosis for this subject was recorded after 83.54 months. The subject reached a model prediction of 0.707 and was thus classified as a pMCI subject. Additionally, the patient had a relatively small LIMMTOTAL cognitive test score of 2. The model learned that this poor test score increased the patient's risk to develop AD by 0.09. Additionally, the subject had relatively small volumes of the amygdalae, which additionally decreased the patient's risk to develop AD in the future. The AD risk of this patient was decreased by 0.03 because the subject has no ApoE 4 alleles.
The SHAP force plot for the subject with PTID 128 S 0135 is visualized in Figure 15d. This MCI subject converted to AD 54.52 months after the baseline visit, reverted to MCI after 71.74 months and again converted to AD 83.90 months after the baseline visit which was also the last diagnosis available. However, in contrast to Figure 15c, the subject reached a small model prediction value of 0.283 and was therefore classified as an sMCI subject. The most important factor decreasing the patient's risk was the absence of ApoE 4 alleles. This factor decreased the model prediction by 0.06. Additionally, the LDELTOTAL cognitive test score of 8, which was relatively large had a protective effect. The relatively small normalized volume of the lateral occipital sulci decreased the patient's risk by 0.03. One reason for the classification score might be, that the conversion to AD was relatively late for this subject.

Discussion
In comparison to previous research [8], which exclusively trained tree-based models, this work trained several RFs, XGBoost models, DTs, SVMs, and LR models to detect different stages of AD. All models were trained using the ADNI dataset and validated using independent test sets of the ADNI, AIBL, and OASIS cohorts. Bayesian Optimization optimized for the best hyperparameters of the models. During this stage, CV was used to estimate the performance for independent test sets. The models were trained using three feature sets. The MRI features included summed volumes, differences, and ratios of predefined brain structures to investigate asymmetry structures associated with different AD stages. Forward feature selection was implemented to focus the models on the most important features and simultaneously avoid correlated features in the datasets. The performances of the different ML models as well as the different feature sets are compared to each other using Friedman tests and pairwise Wilcoxon signed rank tests with Bonferroni adjustment. SHAP summary plots were used to visualize and interpret those models. The resulting Shapley values were compared to permutation importance of all models as well as natural feature importances of the RF and XGBoost models and to log odd's ratios of the LR models. As correlated features reduce the validity of explainability methods like permutation importance and SHAP [96], those were also calculated consolidating correlated features to aspects. SHAP force plots investigated individual predictions of interesting subjects.
The experimental results showed the forward feature selection chose brain regions that were previously associated with AD progression [70-73, 79, 80] for all classification tasks and models. The performances achieved for models trained with forward feature selection did not outperform the models trained on the entire feature set. The pairwise Wilcoxon signed rank tests with Bonferroni adjustment showed, that the results of models trained with FS-3, which included cognitive test results, outperformed those models trained for FS-1 and FS-2 for all classification tasks and the ADNI test set. The improvements for FS-3 models in comparison to FS-1 and FS-2 models were smaller for sMCI vs. pMCI than for the baseline classification tasks. The SHAP summary plots of all feature sets mainly showed biologically plausible associations and the most important features for the CN vs. AD classification using FS-3 and the polynomial SVM were the cognitive test scores LDELTOTAL and MMSCORE.
The results for the AIBL and OASIS test sets showed less clear advantages of FS-3. Reasons for this were, among others, differences in the subject recruitment process, leading to differences in sociodemographics and differing MRI protocols across studies. However, the CN vs. AD models were successfully transferred to AIBL and OASIS by mostly achieving classification accuracies better than the no information rate. Additionally, the models trained for MCI vs. AD classification and sMCI vs. pMCI classification were successfully transferred to the AIBL dataset. For CN vs. MCI classification, poor results worse than the no information rate were achieved for the AIBL and OASIS datasets. However, the SHAP summary plots of those models mainly showed biologically plausible results. It was observed that age was a highly important feature in some of those models, which might cause problems transferring those models to datasets with differing demographic distributions. Poor results were also achieved for the MCI vs. AD classification and the OASIS dataset.
Some of the black-box models outperformed the simple and interpretable DTs. However, the pairwise Wilcoxon signed rank tests with Bonferroni adjustment (p − value < 0.05) showed no significant differences. No model stood out among the black-box models. However, different ML models learned different associations which mostly were biologically plausible. The SHAP summary plots were compared to the permutation importance of all models, and natural RF and XGBoost feature importances, as well as absolute log odds of the LR and agreed for many features. The feature rankings of all models were compared to each other using Kendall's rank correlation.
Because feature dependency structures reduce the validity of SHAP values and permutation importances, those feature importances were also computed by consolidating correlated features using aspects. The results show that the models depended on biologically plausible features. However, the feature rankings of the SHAP values and the permutation importances showed a weaker correlation than those calculated for the models with forward selected features.
Individual predictions, which are important in clinical practice, were interpreted using SHAP waterfall plots.

Limitations
The approach proposed in this article had several limitations. First, both external datasets had a clear focus on CN subjects and were thus imbalanced which makes the interpretation of model generalizability hard. The external validation of the sMCI vs. pMCI classification, which was medically more interesting than the baseline diagnoses, was based only on 28 AIBL subjects and no OASIS subjects. Future investigations should include more AD datasets knowing those cohorts differ in inclusion criteria. Possible cohorts might be the AD subset [14] of the HNR [15] or a subset of the National Alzheimer's Coordinating Center [100]. In this context, instead of diagnoses, different biomarkers should be addressed as endpoints. Another idea to increase the number of subjects in the datasets is, to relax the exclusion criteria by also including subjects that reverted to MCI or CN, and use follow-up scans of subjects where the baseline scan failed for the MRI feature extraction pipeline. Due to the availability of data in the cohorts, and minimal invasive recording, only MRI, sociodemographics, the number of ApoE 4 alleles, and cognitive test scores were included in the investigations. However, PET scans and biomarkers have high medical relevance and should thus be considered in future investigations.
Although in comparison to previous research [8], the number of ML models was already increased, prospectively deep learning models like CNNs, which can automatically extract locally textural features from MRI scans should be investigated. However, currently, there is no consensus on whether those methods can improve AD detection. Much previous work in this area suffered from data leakage [23] or investigated the less challenging discrimination between AD and CN. The Bayesian optimization used for hyperparameter-tuning is a sequential method. Future work should therefore investigate the use of more effective parallelized methods such as presented in [101].

Conclusion
This work extended a workflow [8] to explain ML black-box models trained to distinguish multiple AD stages using Shapley values. The differentiation of sMCI and pMCI subjects is of medical interest to recruit and monitor subjects for therapy studies. The approach was based on non-invasive features including MRI volumes, sociodemographic data, the number of ApoE 4 alleles, and cognitive test results. Volumetric features were extracted from the MRI scans using the FreeSurfer pipeline. The sum, difference, and ratio of the volumes of both hemispheres were calculated to investigate the brain asymmetry in multiple AD stages. Shapley sampling values were calculated to visualize the local feature associations of black-box RFs, XGBoost models, and SVMs. The experiments mainly showed biologically plausible associations and improved results for models including cognitive test scores. Those improvements were smaller for sMCI vs. pMCI classification.
For the investigation of model reproducibility, all models were trained for the ADNI dataset and validated for the external AIBL and OASIS cohorts. The ADNI models achieved reasonable results for AIBL and CN vs. AD, MCI vs. AD, and sMCI vs. pMCI classification. For the OASIS test set, reasonable results were only reached for CN vs. AD classification.
Some of the performances of the black-box models outperformed the simple and interpretable DTs. None of the black-box models achieved outstanding results. SHAP summary plots were used to visualize the associations, the model learned between the features and the AD diagnosis. The most important features of those plots were previously associated with AD progression. Additionally, those plots showed biologically plausible associations for most of the important features in all classification tasks.
SHAP force plots investigated individual model predictions. The comparison between SHAP values and natural RF and XGBoost feature importances showed high correlations of the SHAP importances, the permutation importance, and the natural feature importances for all models.
The investigation of the feature dependency structure and consolidating correlated features during the computation feature importance computation for sMCI vs. pMCI classification showed, that those models depended on features that were previously associated with AD.
This work outperformed previous work [8] for the ADNI and AIBL classification results and CN vs. MCI classification and the AIBL results in MCI vs. AD classification for models trained without cognitive test scores. Additionally, the ADNI and AIBL results achieved for sMCI vs. pMCI classification trained with cognitive test scores outperformed the results of previous work.