Alzheimer’s Disease (AD) is an incurable, life-altering neurological disorder affecting the elderly population and causing considerable hardship for patients [1, 2]. According to the latest World Alzheimer Report [3], about 55 million clinically diagnosed AD patients live worldwide, with cases estimated to rise to 139 million by 2050 [3]. The report also notes that a staggering 75% of cases go undiagnosed for various reasons. AD patients endure many difficulties, such as memory loss, behavioural disturbances, and vision and mobility complications that interfere with daily routine tasks [1, 4, 5]. These burdens eventually interfere with a patient’s ability to lead a self-reliant personal and social life and cause numerous tribulations for caregiving family members [2, 6].

Of late, Artificial Intelligence (AI) techniques involving machine learning (ML) and deep learning (DL) algorithms have contributed to diverse application domains including: anomaly detection [7,8,9], biosignal and image analysis [10,11,12,13,14,15,16,17,18,19,20,21], neurodevelopmental disorder assessment and classification focusing on autism [22,23,24,25,26,27,28,29,30,31,32], neurological disorder detection and management [2, 33,34,35,36,37,38,39,40,41,42,43,44], supporting the detection and management of the COVID-19 pandemic [45,46,47,48,49,50,51,52], elderly monitoring and care [53,54,55,56], cyber security and trust management [57,58,59,60,61,62], various disease diagnosis [63,64,65,66,67,68,69], smart healthcare service delivery [70,71,72], text and social media mining [73,74,75,76], personalised learning [77,78,79,80], earthquake detection [81, 82], etc.

Some of these methods have significantly boosted the clinical diagnosis of AD, enabling incredibly accurate, fast, and efficient assessment from complex medical data (2D or 3D MRI, PET, CT, etc.) [83,84,85]. This success can be attributed to several factors, such as algorithmic advancements and the availability of powerful GPUs supported by a spectrum of open-source computational tools [84, 85]. AI-driven AD prediction is based on the premise that systems can identify stages of dementia by learning patterns from the input data, so that optimal decisions can be made with minimal human intervention [86, 87]. Contemporary ML and DL algorithms for AD detection have achieved highly admirable results across a range of evaluation metrics [34, 85].

However, medical practitioners largely regard these AI models as blackbox models because justifiable reasons (explanations) for the predictions they deliver cannot be derived, leading to ambiguity [88]. The high opacity of these modern AI techniques often makes it difficult even for skilled medical experts to comprehend the solutions [89]. As a result, decision-makers face a trade-off between accuracy and trustworthiness [90]. Consequently, stakeholders and policymakers often prefer responsible and reliable decision-making over merely accurate decision-making. This lack of explainability keeps the medical field reluctant to deploy AI-driven computer-aided diagnosis (CAD), despite the proven accuracy demonstrated in the recent literature [91].

In the last decade, several ML and DL algorithms have achieved breakthrough results in various AI-based decision-making tasks, such as disease prognosis and prediction [92], drug discovery and development [93], solid-state materials science [94] and machine fault analysis [95]. Further applications of DL are found in biomedicine [96], biology [96], and speech recognition, synthesis and audio processing [97]. At times, these performances have surpassed human-level accuracy.

Such blackbox models often leave users with unresolved questions such as "Why did you predict/classify that as class x and not class y?", "When will you succeed or fail?", "How can the wrong feature selection be corrected?", "Which dominant features are you using to train the model?", "Can I rely on the prediction you gave?" and so on [89] (see Fig. 1). In contrast, Explainable AI (XAI) models can deliver reassuring outcomes to the user, such as "I know why you classified that as class x and not as class y", "I know how to rectify the wrong feature selection", "I can rely on the prediction you gave" and the like. Hence, XAI is crucial if AI-based CAD systems are to be reliably deployed.

Fig. 1 A typical AI system (top) and an explainable AI system (bottom)

XAI and interpretable AI are interchangeable terms that refer to an emerging sub-field of AI [98]. XAI comprises a set of features that interpret how a blackbox model is constructed and how it performs predictions, enabling humans to trust and use the system effectively [99, 100]. A model needs to be explainable in order to justify its output, make the functioning of blackbox models transparent, generate new knowledge for smarter decision-making that improves model performance, and increase users’ trust in the model’s results [98]. XAI aims to produce methods and tools that make an AI system’s decisions, recommendations, or guidance understandable to decision-makers. For instance, in an AI system for classifying MRI images in early AD diagnosis, XAI can explain the model’s working and synthesise the influential factors considered for prediction [101]. Furthermore, if the model generates an adverse result, XAI’s interpretability allows the errors to be identified and rectified [102]. Explanation and interpretation of the model’s output are therefore required to bring fidelity, trust and adoption in clinical applications [88, 100, 102]. Stakeholders’ trust at every level is needed to leverage these AI solutions maximally, which is possible only through XAI. In addition to providing advanced insights into AI solutions, XAI can also open up new opportunities. For instance, involving a human in the decision is a typical medical scenario, where AI solutions and human expertise work in tandem to tackle complex situations that neither can resolve satisfactorily on its own [103]. Figure 2 shows a general outlook of translating a blackbox model into an explainable model.

Fig. 2 A general outlook of explaining models

There is often a trade-off between model accuracy and explainability [104]. Linear regression and decision tree (DT) models are intuitive, inherently interpretable, and easy to validate and understand, even for a novice in AI, which increases trust in such models. However, to solve a complex problem, ML algorithms may derive a non-linear model that yields good results but compromises on explainability. For instance, a Convolutional Neural Network (CNN) often performs best but is least explainable [105]. Figure 3 shows that ML models with high explainability are less accurate but more intuitive to humans; as model complexity and performance increase, yielding more accurate results, explainability decreases. In healthcare systems, predictive accuracy is the most important measure of clinical validation. From the patient’s perspective, there is more trust in the clinician and less tolerance for the machine, which naturally raises the importance of explainability and motivates making more complex models and functions explainable.

Fig. 3 Accuracy - Interpretability Trade-off

In the last couple of years, XAI has gained paramount importance in the AI community, not only because AI models are used in high-stakes decisions but also because regulators hold companies accountable for the decisions made by their AI models. The field has grown exponentially in a short time span and has the potential to transform how AI is perceived and deployed in the coming years. Several diverse fields have embraced the explainable component of AI, prioritising trustworthiness over accuracy. XAI has been applied in drug discovery [106, 107], industrial applications [108, 109], gaming [110, 111], neurological disorders [112,113,114], neuroscience [115, 116] and recommender systems [117, 118]. This tremendous growth has led to several XAI-based review articles in the healthcare domain in the past years.

The interpretability of ML algorithms was the subject of a comprehensive survey by Tjoa and Guan [119], who categorised the findings into different groups and examined these categories from the perspective of applications in the medical field. The authors in [89] surveyed the progress of XAI in healthcare applications and introduced XAI solutions that leverage the fusion of multi-centric data with different modalities; the results were analysed and subsequently validated in two real clinical scenarios. Loh et al. [99] presented a detailed review of healthcare areas deserving more attention from XAI, considering three major types of healthcare data: clinical, textual, and high-dimensional data. Nazar et al. [91] discussed XAI from the perspective of Human-Computer Interaction (HCI) models, focusing on the use of AI, HCI, and XAI in healthcare. There have also been several noteworthy applications of XAI in AD classification [120,121,122,123,124,125,126,127,128,129,130,131,132,133,134]. However, an exclusive systematic review of XAI for AD classification that covers the various XAI frameworks and the blackbox models used within them is yet to be produced by the research community.

A comprehensive XAI review also involves crucial components such as the blackbox models considered for interpretation, XAI frameworks, XAI methods, and the various output forms of interpretation, together with open-source XAI tools, their implementation platforms, and the associated datasets. Addressing only the blackboxes interpreted in AD studies would provide partial coverage of this spectrum. Hence, a full understanding of the XAI landscape for AD is essential for any worthwhile research in the future. The novelty of this review article is that it covers the entire XAI spectrum in the interpretation of blackbox models used for AD detection. To the best of our knowledge, this is among the first attempts to review XAI models in the context of AD diagnosis. The nomenclature used in this article is listed in Table 1.

Table 1 Nomenclature

The primary contribution of this work can be outlined as follows:

  1. A systematic review adhering to the guidelines proposed by both Kitchenham [135] and PRISMA [136].

  2. Formulation of essential research questions (RQs) covering the entire spectrum of XAI for AD classification.

  3. A collection of different XAI techniques, with their GitHub links, used to interpret blackbox models applied for AD detection.

  4. A survey of XAI methods for AD classification reported in the last ten years, with critical analysis of findings, results, abilities, and limitations.

  5. Identification of the XAI models’ strengths for AD detection to ensure their reliability and trustworthiness for adoption by clinicians.

  6. A focused discussion on current XAI research, its benefits, limitations, and challenges, along with future directions.

These findings will help the research community fill various research gaps and instigate new models that assist clinicians in understanding the reasoning of an AI system.

The rest of this article is structured as follows: "Concepts and Background" provides necessary concepts and background in XAI. The data synthesis needed for the systematic review is detailed in "Search Strategy". The findings for the research questions are discussed in "Results and Discussions", and concluding remarks are drawn in "Conclusion".

Concepts and Background

This section provides a succinct background on different XAI methods in general (XAI Methods). It also provides a brief overview of popular XAI frameworks used in AD prediction (XAI Frameworks). The primary aim of this section is to provide a comprehensive background helpful for the discussions in later sections.

XAI Methods

The XAI methods can broadly be classified into four categories [98], as shown in Fig. 4, based on: i) scope of explanations, ii) stages of implementation, iii) applicability to models, and iv) forms of explanation.

Fig. 4 Classification of XAI Methods

XAI Methods Based On Scope

The explainability scope is the extent of the explanations generated by an XAI method, which can interpret either an entire model or a specific instance of the model based on input test data. Accordingly, the explainability of a model can be either local or global. A global method explains the whole model by considering the entire inferential data set of the model. It gives a general perspective of the relationship of the model with all input instances. The popular DT algorithm can be intrinsically global in nature because the decision-making for all input data instances can be easily explained by visualising the model in tree form.

On the other hand, the goal of a local method is to explain only a few instances of test data to the user. Local explanations help understand why certain decisions were made and can increase user confidence in specific examples. In the case of DTs, a local explanation can correspond to a single branch in the tree. It is worth noting that combining local inferences made across different input instances can yield global insights into the model. A pictorial comparison of local and global explanations is shown in Fig. 5.
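To make this distinction concrete, the following minimal Python sketch (illustrative only, not drawn from the reviewed studies; the data are synthetic and the feature names are hypothetical) contrasts a global explanation of a DT, namely the full set of learned rules, with a local explanation, namely the single root-to-leaf path traversed by one instance.

```python
# Global vs. local explanation of a decision tree with scikit-learn (illustrative data).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["MMSE", "age", "hippocampal_volume", "CDR"]  # hypothetical names

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Global explanation: the full rule set covers every possible input instance.
print(export_text(tree, feature_names=feature_names))

# Local explanation: the single root-to-leaf path followed by one test instance.
node_indicator = tree.decision_path(X[:1])   # sparse indicator of visited nodes
print("Nodes visited for this instance:", node_indicator.indices)
```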

Fig. 5 XAI Explanation: Local vs. Global

XAI Methods based on Stages of Implementation

An XAI method can generate explanations for a model either during or after the training of the model (see Fig. 6). Based on these two approaches, XAI methods are further classified as ante-hoc and post-hoc [98]. Ante-hoc is a Latin term that literally translates to before-this. Hence, the goal of ante-hoc XAI methods is to provide explainability before the beginning of model training. Such ante-hoc methods are transparent and self-explanatory and make the model explainable naturally while maintaining optimal accuracy [98]. ML algorithms that are ante-hoc in nature include linear regression, DT, and Bayesian models. These models are also referred to as white-box or glass-box models.

Fig. 6 XAI Explanation: Ante-hoc vs. Post-hoc

The Latin term post-hoc translates to after-this. Such methods aim to provide explainability after a model has been trained. An external explainable model, called a surrogate model, is used to provide explanations for a trained blackbox model. Generally, support vector machines (SVM) and CNNs are models whose inference mechanisms remain completely unknown to users, which necessitates post-hoc methods. Gradient-weighted Class Activation Mapping (GradCAM) [137], Layer-wise Relevance Propagation (LRP) [138] and Local Interpretable Model Agnostic Explanations (LIME) [100] are some common examples of XAI frameworks that can be applied on surrogate models for generating explanations.
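As an illustration of the post-hoc surrogate idea, the following minimal Python sketch (an assumption for illustration only, with synthetic data and an SVM standing in for the blackbox) fits an interpretable DT to the predictions of an already-trained model and reports how faithfully the surrogate mimics it.

```python
# Post-hoc global surrogate: an interpretable tree mimics a trained blackbox (illustrative).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

blackbox = SVC().fit(X, y)                 # opaque model whose internals stay untouched
y_blackbox = blackbox.predict(X)           # labels produced by the blackbox

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_blackbox)

# Fidelity: how closely the surrogate reproduces the blackbox's behaviour.
print("fidelity:", accuracy_score(y_blackbox, surrogate.predict(X)))
print(export_text(surrogate))
```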

XAI Methods based on Applicability of Models

Applicability of models refers to whether an explainable method is restricted to a particular model or can be applied to any model as a post-process. The former is called model-specific, and the latter is a post-hoc approach called model-agnostic. Model-specific is an intrinsic approach where explainability is integrated into the model architecture and is not transferable to any other model architecture. For instance, interpreting a neural network model’s weights or activation values is specific to that neural network learning approach. Model-specific approaches for deep neural networks work by traversing the path of a CNN in reverse order, highlighting the specific regions of the input image that contributed most to the decision. Guided backpropagation [139] and LRP [138] are examples of model-specific approaches.

On the other hand, model-agnostic methods do not consider any internal components such as weights or activation values and can be used with any learning approach. They extract explanations by perturbing or mutating the input data and observing how sensitive the performance is compared to the original data. In other words, by perturbing the input or the weights of important features, we can measure how much they influence the model performance. This provides valuable insights into the localised region of input data that underwent perturbation. Other methods used include Occlusion Sensitivity Analysis [134], GradCAM [137] and Feature Importance [126]. Some popular model-agnostic approaches are LIME [100] and SHapley Additive exPlanations (SHAP) [140]. Figure 7 depicts a few key differences between these two types of XAI methods.
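The perturbation idea can be illustrated with permutation feature importance (a minimal sketch on synthetic data, not taken from the reviewed studies): each feature is shuffled in turn and the resulting drop in test performance is recorded as its influence.

```python
# Permutation feature importance: a simple perturbation-based, model-agnostic explanation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and record the average drop in test accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: mean importance {importance:.3f}")
```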

Fig. 7 XAI Explanation: Model Specific vs. Model Agnostic

XAI Methods Based on Forms of Explanation

Classification models for images can differ substantially from those that classify text, categorical, or temporal data such as speech. Therefore, the input format (numerical, visual, textual or temporal) of a model can play a vital role in framing different forms of explanation for XAI methods. The interpretations of predictions can take many forms and depend on the end users’ needs and concerns. Four forms of explanation are commonly used to interpret a prediction: numerical, visual, rule-based, and textual [98].

The numerical explanations generated by models are usually a measure of how much the input variables contribute to the model’s outcome. They are represented in numerical formats such as values, vectors of numbers, or matrices. A numerical explanation can also be a probability measure assigned to a neural network layer. Visual explanations are the most common way to explain the functioning of a model graphically. For example, heatmaps can highlight important areas of an image that are influential for the decision. A visual explanation can be easily understood by novice end users of the AI model. Textual explanations are precise and specific and are generally used for individual predictions. They are not a commonly used form of explanation due to their high computational complexity, requiring natural language processing (NLP). However, they are easily understandable to humans and are mostly generated for local scope. Rule-based explanations are simple and basic forms of explanation that are more structured than visual and textual explanations. They are intuitive to humans and can be used to explain the prediction of models using IF-THEN rules or trees with AND/OR operators [98]. This type of explanation is mainly utilised for ante-hoc methods and interprets models with global scope. A detailed discussion of these forms of explanation appears in "XAI Frameworks for AD Detection" and "Benefits of using XAI Methods for AD Detection" in the context of AD.

XAI Frameworks

Local Interpretable Model-Agnostic Explanations

LIME is an open-source tool used to generate explanations for a single instance rather than the entire dataset, hence the term local. LIME provides explanations by perturbing the model’s input data, fitting a surrogate model, observing the changes in prediction, and selecting the most significant features [100]. Because LIME is model-agnostic, it is applied after the model has been trained for prediction and can be used with any blackbox model. For blackbox explanations, LIME can interpret image classifications and explain text-based models and tabular datasets in textual, numeric or visual form (for more details, see section XAI Frameworks for AD Detection).
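A minimal usage sketch of the open-source lime package on tabular data is given below (illustrative only; the data are synthetic and the feature and class names are hypothetical): the explainer perturbs a single instance and fits a local surrogate to rank the most influential features.

```python
# Local explanation of one tabular prediction with LIME (illustrative data and names).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ["MMSE", "age", "hippocampal_volume", "CDR"]   # hypothetical names
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["HC", "AD"], mode="classification"
)
# Explain a single instance (local scope): top features with signed weights.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```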

SHapley Additive exPlanations

SHAP is an XAI technique that assigns a weight, called the Shapley value, to each feature of a trained model [140]. The contribution of each feature is assessed over all possible weighted input combinations, following properties such as efficiency, symmetry, zero contribution for features with no effect, and additivity of a feature’s contributions across sub-parts. SHAP shows performance consistency and provides good accuracy for predictions in the local scope. In the reviewed studies, SHAP was commonly used in conjunction with numerical data to provide a visual analysis of blackbox models (see Figs. 15, 16, and 26).
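A minimal sketch using a recent version of the open-source shap package is given below (synthetic data, illustrative only): Shapley values are estimated for the positive-class probability of a tree ensemble and then summarised globally (beeswarm plot) and locally (waterfall plot).

```python
# Shapley-value explanations with SHAP: global summary and one local explanation.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Explain the positive-class probability with a model-agnostic explainer.
predict_pos = lambda data: model.predict_proba(data)[:, 1]
explainer = shap.Explainer(predict_pos, X[:100])   # background data for the masker
shap_values = explainer(X[:50])                    # Shapley values for 50 instances

shap.plots.beeswarm(shap_values)                   # global view of feature influence
shap.plots.waterfall(shap_values[0])               # local view of a single prediction
```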

Gradient-weighted Class Activation Mapping

GradCAM is a technique used to make CNN models more transparent by identifying the regions of an input image that are important for a prediction [146]. GradCAM uses the gradient information flowing into the final convolutional layer of the CNN to produce a localisation map representing the crucial regions in an image. This is achieved by assigning an importance value to each neuron for a particular decision. The final output of GradCAM is therefore a coarse heatmap that highlights the important regions supporting the prediction and its explanation (see Fig. 19).
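The mechanism can be sketched in a few lines of PyTorch (an illustrative minimal implementation, not code from the reviewed studies; the ResNet backbone, target layer and random input are assumptions): gradients of the predicted class score are averaged over each feature map of the last convolutional block and used to weight those maps into a coarse heatmap.

```python
# Minimal GradCAM: gradient-weighted feature maps of the last convolutional block.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()        # stand-in CNN classifier
target_layer = model.layer4[-1]              # last convolutional block

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed MRI slice
score = model(img)[0].max()                  # score of the predicted class
model.zero_grad()
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # channel-wise gradient average
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum + ReLU
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalised coarse heatmap
```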

Layer-wise Relevance Propagation

LRP is another tool that, like GradCAM, generates a heatmap highlighting regions of an image [138]. LRP is used with CNNs whose inputs can be images or videos. LRP assigns relevance scores to the neurons of the last layer of a CNN for a specific output and then propagates these scores in reverse until the input layer, computing a score for every activation unit (neuron) in each layer. From the final relevance scores, LRP generates a heatmap as an explanation that can be used to identify the regions most influential in the prediction (see Figs. 17 and 18).
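The redistribution step can be illustrated with the widely used LRP epsilon rule for a single fully connected layer (a minimal NumPy sketch with toy values, not code from the reviewed studies): relevance from the upper layer is shared among the lower-layer neurons in proportion to their contribution to each pre-activation.

```python
# LRP epsilon rule for one fully connected layer (toy example).
import numpy as np

def lrp_epsilon(a, W, b, R_upper, eps=1e-6):
    """a: lower-layer activations (n,), W: weights (n, m), b: bias (m,),
    R_upper: relevance of the upper layer (m,). Returns relevance of the lower layer (n,)."""
    z = a @ W + b                                   # pre-activations of the upper layer
    z = z + eps * np.where(z >= 0, 1.0, -1.0)       # stabiliser avoids division by zero
    s = R_upper / z                                 # relevance per unit of pre-activation
    return a * (W @ s)                              # redistribute back to the lower layer

a = np.array([1.0, 0.5, 2.0])                       # toy activations
W = np.random.randn(3, 2)
b = np.zeros(2)
R_upper = np.array([0.7, 0.3])                      # relevance arriving from above
R_lower = lrp_epsilon(a, W, b, R_upper)
print(R_lower, "total =", R_lower.sum())            # total relevance is approximately conserved
```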

Individual Conditional Expectations

ICE is an extension of the Partial Dependence Plot (PDP) that is used to produce visual explanations for blackbox models [145]. A PDP is obtained by plotting the average predicted outcome of a model while varying the value of one feature of interest and keeping the other feature values constant. ICE disaggregates the PDP by creating an individual plot for each instance of the blackbox model’s predictions, altering the value of the feature of interest and leaving the other feature values unchanged [126]. The outcome can be visualised as a line plot, i.e. a set of points for an instance formed by the altered feature values and the respective predictions (see Fig. 20).
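A minimal sketch using scikit-learn’s built-in inspection utilities is shown below (synthetic data, illustrative only): kind="individual" draws one ICE curve per instance for the chosen feature, while kind="both" overlays the averaged PDP.

```python
# ICE curves (one line per instance) with the PDP average overlaid.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Vary feature 0 for every instance while all other feature values are held constant.
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```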

Occlusion Sensitivity Analysis

For an image used to predict AD, it is necessary to explain or identify the areas of the image that contribute to the AD classification. OSA is a technique initially proposed by Zeiler and Fergus [144]. In this technique, portions of the input image are occluded (hidden) with a grey or black patch, and the variations in the output probability of the occluded image are observed [147]; these sensitivity values form a heatmap. The most critical region, when occluded, has the highest impact, producing the lowest probability. An occlusion sensitivity map is therefore used to locate the important patches of the image responsible for AD.
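The procedure can be sketched as a simple sliding-patch loop in PyTorch (an illustrative assumption with a generic backbone and a random input, not code from the reviewed studies): the drop in the predicted class probability is recorded for each occluded position.

```python
# Occlusion sensitivity: probability drop when a grey patch slides over the input.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()               # stand-in classifier
img = torch.randn(1, 3, 224, 224)                   # stand-in preprocessed MRI slice
target = model(img).argmax(dim=1).item()            # class whose probability we track
base_prob = F.softmax(model(img), dim=1)[0, target].item()

patch, stride = 32, 32
heatmap = torch.zeros(224 // stride, 224 // stride)
with torch.no_grad():
    for i in range(0, 224, stride):
        for j in range(0, 224, stride):
            occluded = img.clone()
            occluded[:, :, i:i + patch, j:j + patch] = 0.5        # grey patch
            prob = F.softmax(model(occluded), dim=1)[0, target].item()
            heatmap[i // stride, j // stride] = base_prob - prob  # probability drop
# Large heatmap values mark regions whose occlusion most reduces the model's confidence.
```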

Saliency Map

The SM is another important explanation concept in deep learning, first introduced by Simonyan et al. [148]. Unlike an occlusion map, where portions of the input image are hidden with a patch to create a heatmap, in SM the contribution of each pixel of an AD-classified image is assessed individually. The resulting heatmap is checked for probability variations, where a low probability indicates that the corresponding pixel plays a vital role in the AD classification [149]. The output heatmap produced by the saliency technique therefore contains all the pixels in the image that are important for explaining the disease.
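In the gradient-based formulation of Simonyan et al., each pixel’s saliency is the magnitude of the gradient of the class score with respect to that pixel. The following minimal PyTorch sketch (backbone and input are illustrative assumptions) computes such a map.

```python
# Gradient saliency map: per-pixel influence of the input on the predicted class score.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()                     # stand-in CNN classifier
img = torch.randn(1, 3, 224, 224, requires_grad=True)     # stand-in preprocessed slice

score = model(img)[0].max()     # score of the predicted class
score.backward()                # gradients of the score w.r.t. every input pixel

# Saliency: maximum absolute gradient over the colour channels, shape (224, 224).
saliency = img.grad.abs().max(dim=1)[0].squeeze()
```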

Table 2 provides XAI framework categorisation based on scope, applicability, implementation and interpretable forms for some popular XAI tools used in the literature. The table also provides links to GitHub repositories for rapid reproducibility.

Table 2 Popular XAI Frameworks in the AD Classification

Search Strategy

This section presents the overall steps involved in searching for and identifying the relevant papers needed to conduct this systematic review. Figure 8 depicts the stages involved.

Fig. 8 Sequence of Steps in Search Strategy to Identify Relevant Papers

This review aims to investigate research articles that use XAI in the diagnosis or early detection of AD and thereby interpret the reasons for the classification. To locate contributions and summarise the results, published articles on artificial intelligence and its associated fields are examined.

The prime aim of this review is to identify research gaps that would instigate XAI-based research for AD detection.

We adopted concrete guidelines proposed by PRISMA [136] and Kitchenham [135] to retrieve relevant papers for this systematic review. The overall process can be outlined below:

  • Formulating the research questions.

  • Framing search strings.

  • Identifying the digital libraries and conducting searches.

  • Choosing the relevant inclusion/exclusion criteria and filtering the papers based on their relevance to the study topics.

  • Extracting necessary information from the selected articles.

  • Investigating the research questions through critical analyses, providing a thorough study of the existing methods, their benefits, limitations and future needs.

Research Questions

The main goal behind framing research questions is to lay out a well-defined plan for retrieving papers exclusively from the focused areas of consideration. This way, the reader can comprehend the disseminated knowledge more thoroughly. Table 3 lists the research questions addressed in this paper.

Table 3 Research Questions

Identification of Articles

One of the challenging tasks in a comprehensive and inclusive systematic review is selecting appropriate search strings. For this work, the search strings were carefully chosen so that they were neither too generic, which would return irrelevant papers, nor too narrow, which would risk missing relevant articles [136]. After several trials of combinations and permutations of relevant keywords, we arrived at the search strings shown in Table 4.

Table 4 The Search Strings

Research papers were selected from widely accessed databases, as shown in Table 5. Apart from these, we have also considered some books and other online resources that satisfied our research questions.

Table 5 The databases considered

Screening of Articles

A consolidated output of the individual searches produced a total of 1551 publication records (ACM Digital Library = 206, IEEEXplore = 147, SpringerLink = 158, PubMed = 780, ScienceDirect = 260). We decided to include all research articles from the past decade until June 2023. The records were then pruned of duplicates and of all records published before 2012.

In the following task, we screened the identified articles using the inclusion-exclusion criteria shown in Table 6. To begin with, we examined and marked all duplicate records within each database search collection. The de-duplicated records from each collection were then combined into a single collection, and any remaining duplicates were removed. This task resulted in 928 unique records. We then examined publication titles and abstracts and removed pilot studies, editorials, non-journal articles, conference proceedings, books, posters, and studies published before 2012. This process reduced the number of articles to 73 eligible records.

Table 6 Inclusion-Exclusion Criteria

Furthermore, we excluded inaccessible records, studies that presented only discussion without evidence of model performance and results, and studies that did not relate to the previously framed research questions. The resulting records were then screened for articles relevant to the research questions. Additionally, understanding the accuracy, specificity, sensitivity, and AUROC metrics of ML or DL models is crucial for this review; we therefore excluded studies that did not report model performance. Overall, this step resulted in 37 credible research articles from quality journals in accordance with our framed RQs. Figure 9 illustrates the steps taken in the process, Fig. 10 depicts the sources, and Fig. 11 shows the year-wise statistics of the articles considered in our study. These numbers make it clear that scientific research on XAI for AD is limited and has grown rapidly only in the last few years. To our knowledge, this review is unique, as no existing review articles were found exclusively on XAI for AD.

Fig. 9 PRISMA chart showing the identification, screening and inclusion of articles

Fig. 10 Sources of Articles Considered in the Systematic Review

Fig. 11 Year-wise statistics of XAI papers for AD Classification

Results and Discussions

In this section, we present our findings by extensively reviewing the 37 articles through the RQs shown in Table 3.

XAI-based AI Systems for AD Research

This section addresses RQ1: What AI systems are available for AD research that incorporate XAI?

The focus on AI in disease diagnosis and treatment began in the early 1970s [84] and has gained significant momentum over the years. Research on AI-based AD prediction did not involve XAI until the last decade [150]. Although the time has not yet come for computers to replace doctors, XAI has recently been incorporated into AI-based AD prediction due to a growing demand for transparency and explainability in healthcare and medical practice.

Several studies on AI-based AD detection incorporating XAI have been identified (see Table 7). Many studies have used numeric datasets to train AI models and obtained explainable results. El-Sappagh et al. [122] developed and utilised a multi-layered, multi-model system for accurate and explainable AD diagnosis. Lombardi et al. [151] present a robust framework for classification between Healthy Control (HC), Mild Cognitive Impairment (MCI), and AD and interpret the predictions with XAI methods. Xu and Yan [152] propose a reliable multi-class classification model supported by XAI methods to explain the predictions accurately. A computational approach called Systems Metabolomics utilising Interpretable Learning and Evolution (SMILE) is proposed by Sha et al. [153]; it involves supervised metabolomics data analysis and uses an XAI method to learn and identify the most informative metabolites for understanding and diagnosing disease development and progression. Hammond et al. [154] analyse beta-amyloid, tau, and neurodegenerative biomarkers for AD classification and additionally use XAI methods to identify the biomarker most influential in AD detection; their study used numeric data for subjects of different categories (HC, MCI, or AD) selected from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Bloch and Friedrich [123] state that the diverse causes of AD can lead to inconsistencies in disease patterns, scan acquisition protocols, and MRI preprocessing errors, resulting in improper ML classification; their study investigates whether selecting the most informative participants from the ADNI and Australian Imaging Biomarker and Lifestyle (AIBL) cohorts can enhance ML classification using an automatic and fair data valuation method based on XAI techniques.

Table 7 Summary of AI for AD research that incorporate XAI

Hernandez et al. [155] compare the performance of the best three models from ‘The Alzheimer’s Disease Prediction of Longitudinal Evolution’ (TADPOLE) challenge with respect to prediction and interpretability within a common XAI framework. Based on interpretable machine learning, Lai et al. [156] investigate Endoplasmic Reticulum (ER) stress-related gene function in AD patients and identify six feature-rich genes (RNF5, UBAC2, DNAJC10, RNF103, DDX3X, and NGLY1) that enable accurate prediction of AD progression; an XAI method can then illustrate which feature-rich gene influences the prediction output of an ML model. The data are drawn from a gene expression dataset with numeric measures for genes. Chun et al. [126] aim to improve the power to predict progression from amnestic MCI to AD using an interpretable ML algorithm; this study uses numeric neuropsychological and apolipoprotein test data. Sidulova et al. [157] propose a novel approach for classifying Electroencephalogram (EEG) signals to provide early AD diagnosis; the XAI method used in the study provides quantitative features that help arrive at the prediction using EEG recordings obtained from individuals with probable AD, MCI, and HC.

Many of the research articles have utilised datasets from ADNI, OASIS, and Kaggle to train AI-based AD detection models with MRI as the input data. From Table 7, the articles [101, 121, 130, 132, 158, 159] propose deep neural network classifiers for prediction and classification between HC, MCI and AD; all of these use datasets from ADNI and choose MRI as input. Based on the prominence and severity of dementia in the available MRI, Jain et al. [160] offer a DCGAN-based Augmentation and Classification (D-BAC) model strategy to identify and categorise dementia into several categories, with the MRI scan datasets collected from Kaggle. Shad et al. [128] experimented with neural network models for early AD detection by employing classification approaches on a hybrid dataset from Kaggle and OASIS. Bloch and Friedrich [161] propose a machine learning workflow to train and interpret different blackbox models and to compare their performance; all models were trained and evaluated on the ADNI, AIBL and OASIS datasets. Deep learning models were created by Ruengchaijatuporn et al. [124] to classify MCI and AD utilising tasks such as the traditional clock-drawing, cube-copying, and trail-making tests; multiple drawing-task images are used as input and have been shown to significantly improve classification performance between HC and AD. By combining an interpretable graph neural network with the dataset collected from ADNI, Kim et al. [129] bridge the gap between efficiently integrating longitudinal neuroimaging data and biologically meaningful structure and knowledge to develop precise and understandable systems. García-Gutierrez et al. [162] present a Python-based computational tool to deal with the data obtained during clinical diagnosis; the tool integrates data processing, the design of predictive models and XAI features for explainability. Yang et al. [150] developed three approaches for generating visual explanations from 3D CNNs for AD classification, all of which identify brain regions important for AD diagnosis; for all approaches, the models were trained with brain MRI scans from the ADNI database.

Some of the articles have used hybrid datasets as input for the training of AI models. Kamal et al. [127] have used images and gene expression to classify AD and also explained the results. Another article by Ilias et al. [131] has used speech recordings and associated transcripts from the ADReSS Challenge dataset to detect AD.

The Sankey diagram in Fig. 12 reveals a significant involvement of XAI methods for AD detection with ML techniques. A large number of DL classifiers, such as CNN and VGG16, are frequently utilised for classifying AD with subsequent explanations. Figures 11 and 12 also establish that more research falls within the area of ML, which utilises RF, XGBoost, SVM and many other classifiers. It is also worth noting that each research article under ML uses multiple classifiers, whereas articles under DL use very few classifiers. Therefore, Fig. 12 shows comprehensive coverage of ML-based studies.

Fig. 12 Sankey diagram of various AI models that incorporate XAI for AD detection

XAI Methods for Interpreting Blackbox Models to Detect AD

This section addresses RQ2: What different XAI methods are used for blackbox interpretability to detect AD?

This research question is devised to find the number and type of XAI methods currently available for blackbox interpretability in AD detection. It provides essential details on whether the methods used in AD detection studies are local/global, post-hoc/ante-hoc, and model-agnostic/model-specific. While finding answers to this RQ, the following additional questions were raised:

  1. Why is the scope of some explanations both local and global, local only, or global only?

  2. Why are some blackbox models, Random Forest (RF) for instance, present in different XAI method categories?

  3. Why is CNN considered in both model-agnostic and model-specific contexts?

  4. Why would an XAI method be considered post-hoc and model-agnostic simultaneously?

  5. Why would an XAI method be considered ante-hoc and model-specific simultaneously?

We answer these questions below to provide enhanced clarity for addressing RQ2.

  (a) Why are some explanations local and global, local only, or global only?

    Some XAI methods behave in either a local or a global interpretable format. However, it is the prerogative of the researcher to use those methods to interpret globally by aggregating the local explanations. Therefore, there is no fixed rule that a model can only be local or global. For instance, the XAI framework SHAP is predominantly used for local interpretation, but it can also be used to interpret a global population. Similarly, LIME, which is a local explainer method, can also be used for global understanding by aggregating local explanations.

  (b) Why are some blackbox models, RF for instance, present in different XAI method categories?

    The blackbox models LGBM and XGBoost are tree-based models where classification in LGBM is done branch-wise, while in XGBoost it is done level-wise. Therefore, the scope of explainability for LGBM can be local as a branch-wise classifier, and the local results can subsequently be aggregated to establish global explanations. Since XGBoost is classified level-wise, a local description can be achieved at each tree level, and the final path to the last level can be considered a global explanation. Hence LGBM and XGBoost, though blackbox models, can be explained both locally and globally. The case is similar for RF, which is a collection of DTs to each of which the above description applies, either locally or globally. Therefore, the same blackbox model appears in different XAI method categories depending on the nature of the explanations provided in the respective study.

  (c) Why is CNN considered in both model-agnostic and model-specific contexts?

    Contrary to the widespread understanding that CNNs are considered only for model-agnostic interpretation, some CNNs can be interpreted in a model-specific manner. For instance, a model-agnostic approach can explain the prediction of a CNN model without touching the internal layers (e.g., kernel SHAP). On the other hand, a model-specific approach can apply perturbations to each layer of a CNN and back-propagate to the input to obtain feature-rich values for better explainability (e.g., LRP) [105].

  (d) Why would an XAI method be considered post-hoc and model-agnostic simultaneously?

    Post-hoc methods are primarily applied to blackbox models whose inner workings remain untouched. The prediction thus obtained must then undergo an XAI method to produce explainability; this setting can be termed both post-hoc and model-agnostic. Hence, some blackbox models are explained in a manner that is both post-hoc and model-agnostic.

  (e) Why would an XAI method be considered ante-hoc and model-specific simultaneously?

    An ante-hoc model is one whose essential training details are inherently available. To derive explainability from an ante-hoc model, either a model-specific or a model-agnostic XAI method can be employed. However, a model-specific XAI method necessitates the inner working details in order to integrate explainability during the training of an ante-hoc model.

With this backdrop, we now answer the main RQ. The XAI-based AD papers in our study are broadly classified under three main categories:

  1. Local, Global, Post-hoc, Model-agnostic

  2. Local, Post-hoc, Model-agnostic

  3. Global, Post-hoc, Model-agnostic

Local, Global, Post-hoc, Model-agnostic

El-Sappagh et al. [122] have proposed multimodal prediction and detection of AD in two stages. In the first stage, the model performs a multi-class classification for the early diagnosis of AD. A binary classification model follows in the second stage to detect possible MCI-to-AD progression. The authors utilised 11 different modalities: PET, MRI, cognitive scores, genetic data, demographic data, and other clinical data. The classification was done using the RF algorithm, resulting in overall F1-scores of 93.94% and 87.09% in the two classification stages. The authors used explainers based on DTs and fuzzy rules, providing complementary justifications for every prediction made in each stage.

The authors in [125] provided local and global interpretation of the conversion between dementia and MCI using the XGBoost algorithm. They utilised multimodal data (personal information, gene expression, PET and MRI measures, cognitive scores) in a four-way classification of the stages of AD progression. The model achieved an accuracy of 84.0%. The interpretation methods provided insights into which data modalities were influential in each stage of AD progression.

Yang et al. [150] provided visual explanations for deep 3D CNNs in AD classification. The authors utilised brain MRI scans from the ADNI dataset for AD vs. HC classification using ResNet and VGG16 architectures. They also proposed a variant of the ResNet architecture called ResNetGAP, in which a Global Average Pooling (GAP) layer replaces the conventional Maxpool layer of the original ResNet architecture. The approach yielded an overall accuracy of 76.6% and an AUC of 0.863. Regarding interpretability, the authors produced visual explanations for AD prediction using the network weight maps from the three different network architectures. The visual interpretation highlighted the cerebral cortex, lateral ventricle, and hippocampus regions in the 2D slices of the brain MRI.

In another study [133], the authors used the RF and XGBoost algorithms for HC vs. AD classification using socio-demographic data, medical history, and lifestyle parameters (daily activity and smoking). The study developed an ensemble-based ML model to predict AD and explained the predictions in both local and global contexts. The study also included a feature importance analysis and ranked the dominant features influential in AD; the top 7 features considered by both classifiers (RF and XGBoost) in AD prediction were the same. The feature importance analysis also identified less-suspected risk factors driving the risk of AD.

Local, Post-hoc, Model-agnostic

Ilias and Askounis [131] have proposed transformer-based models for the identification of Dementia using voice transcripts as data modality. This is the only work found in the review that uses voice with associated text as data modality in Dementia identification and subsequent interpretation. The study involves the identification of Dementia in the first stage (HC vs. AD), followed by identifying the severity of Dementia in the second stage. The authors have employed Bidirectional Encoder Representations from Transformers (BERT). To distinguish between the languages used by AD patients and non-AD patients, word symbols were colour-coded for interpretation.

Another study [157] identifies MCI and AD patients (3-way classification, HC vs. MCI vs. AD) using 90-second recordings of resting-state EEG. The study compares the classification performance of three classifiers: SVM, ANN, and CNN. The explainable component of this study aimed to highlight the brain region most indicative of the onset or progression of MCI/AD.

Lombardi et al. [151] used multimodal data for AD vs. MCI vs. HC classification with the RF classifier. Clinical and neurophysiological indices were used to train the RF classifier for AD classification. The authors explored the capability of various neurophysiological data to predict different degrees of cognitive impairment. The dominant features used by the classifier for prediction were enriched with explanation values.

The authors in [153] used a metabolomic dataset to identify the key metabolites and their interactions associated with AD, with SVM and RF as classifiers. Model interpretation was provided by ranking the metabolite features most significant to the prediction based on gene importance. The authors claimed that the study provided explanations that could give additional background on the metabolomic underpinnings of AD.

Rieke et al. [134] have used 3D CNN in the binary classification task (AD vs. HC) using structural MRI scans of the brain. The authors emphasised the importance of applying different visualisation methods for identifying various brain regions. For instance, a particular visualisation method could highlight the temporal lobe, whereas other techniques could focus on cortical areas. Such details obtained from different visualisation methods help find the distribution of relevant patterns which could vary across patients.

Global, Post-hoc, Model-agnostic

Bloch and Friedrich [161] used Shapley values to interpret XGBoost, SVM, and RF blackbox models on the ADNI dataset. The study considered MRI volume features, cognitive test results, sociodemographic data, and Apolipoprotein alleles. The Shapley values were employed to visualise the feature associations in the blackbox classification. The models were trained individually using separate data modalities. The examination found a biological correlation and enhanced results when the models included cognitive test results.

Ruengchaijatuporn et al. [124] proposed a two-way classification for differentiating MCI vs. HC patients using VGG16 and a custom CNN architecture incorporating a self-attention mechanism (Conv-Att). The authors used digital drawings (clock, cube, and trail making) collected from HC and MCI patients to train the models. Interpretability for the VGG16 model was provided using GradCAM, while the custom CNN has a built-in self-attention mechanism that offers visual cues. Clinical experts validated the visualisations produced by GradCAM and the Conv-Att model using a simple rating mechanism, and the authors concluded that the heatmap produced by the Conv-Att model was better aligned with the experts’ clinical experience. However, a serious limitation of this study is that no biomarker was considered.

Another study [101] utilised the CNN architecture using 18F-FDG PET images to classify AD vs. MCI vs. HC. The model achieved AUC scores of 0.81, 0.63, and 0.77 for HC, MCI, and AD cases. For explanations, heatmaps were generated and registered with the Talairach atlas (3-dimensional coordinate system of the human brain), indicating each voxel’s importance for the final classification decision.

Table 8 provides a complete summary of the XAI methods used for blackbox interpretability in AD detection. The chart in Fig. 13 provides a clear perspective on the research using different XAI methods. In particular, we find that studies conforming to the concepts of Local (Ll), Post-hoc (Ph), and Model-agnostic (Ma) make up the majority of the volume. In this context, most of these studies have concentrated on classifiers under DL, the majority of which deal with CNNs. Furthermore, a subsidiary part of the research focuses on Global, Post-hoc, and Model-agnostic methods, where classification techniques are widely used within the framework of ML approaches. Cumulatively, the model-agnostic approach covers 31 out of 37 studies considered in our review.

Table 8 Summary of XAI methods used for blackbox interpretability to detect AD
Fig. 13 Sankey diagram of XAI methods for different classifiers used in AD Detection. Legends: Ll – Local; Gl – Global; Ph – Post-hoc; Ah – Ante-hoc; Ma – Model-Agnostic; Ms – Model-Specific

XAI Frameworks for AD Detection

This section addresses RQ3: What XAI frameworks are available in the literature which are used in AD detection?

This RQ aims to identify the XAI frameworks and techniques used in the studies to interpret AI-based AD classification. The discussion in this area will encourage researchers, developers, and subject-matter experts to comprehend the inner workings of a machine learning model. Machines with embedded explainability, especially in healthcare, can significantly reduce the time medical professionals spend on recurrent patient studies, allowing them to concentrate on interpreting disease diagnoses. Many XAI frameworks exist to help tackle the problem of blackbox models, whose predictions are highly accurate while their inner workings are hidden. We have identified popular XAI frameworks such as LIME, SHAP, and GradCAM, among many others, that are extensively used for AD and are of interest to RQ3. The methods are organised as follows: Tables 9, 10, 11, and 12 list the studies that use LIME, SHAP, LRP, and GradCAM, respectively. Table 13 lists studies that use a combination of explainable methods, for instance LIME and SHAP, where one algorithm gives a local explanation and the other a global one.

LIME is a popular method for producing simple, human-interpretable explanations of predictive models. The studies in Table 9 use LIME to interpret predictions from a wide range of ML/DL classifiers, including CNN, SpinalNet, kNN, XGBoost, SVM, and the transformer-based model BERT. In these studies, the classifiers have used datasets of various types, including MRI, gene expressions, EEG signals, and linguistic or textual data. Kamal et al. [127] propose a four-way classification between mild dementia, moderate dementia, no dementia, and very mild dementia using MRI scans and gene expression. The authors use LIME to obtain local explanations of AD classifications from MRI with CNN and from gene expressions with kNN and XGBoost. LIME proved instrumental in identifying and ranking, based on probability values, the significant sets of features responsible for an AD patient. Figure 14 illustrates how LIME selects the most critical genes from the gene expression data and interprets the predicted genes that are critically responsible for an AD patient. In Fig. 14, the genes are ranked based on the probability values of the prediction and separated into AD and non-AD categories. LIME allows users to understand which features contribute positively and negatively to the prediction. Though trust is not inherently quantified, confidence in the explanation can be gauged from the probability values. For instance, LIME interprets OR8B8 and ATPV1G1 as the most significant genes for AD and HTR1F and OR6B2 as of lower significance.

Fig. 14 LIME Explanation. Reproduced with permission from [127]

Ilias and Askounis [131] undertake a thorough linguistic analysis of a medical transcript dataset using the transfer learning model BERT with a co-attention mechanism to classify between control and dementia patients. According to LIME, personal pronouns, interjections, adverbs, and past-tense verbs are all used by AD patients, whereas healthy controls employ present-tense verbs, nouns, and determiners. The studies in Table 9 show how LIME creates local explanations for any machine learning classifier by constructing a trainable interpretable model on the data that recognises differences in classification performance.

SHAP is model-agnostic and utilises a game-theoretic approach to explain the output of any machine learning model. In this review, SHAP was found to be another frequently used XAI framework. The papers in Table 10 use SHAP to explain classifications from machine learning models including RF, XGBoost, SVM, and Logistic Regression (LR). Most studies use datasets including demographic data, Apolipoprotein measures, Mini-Mental State Examinations, Clinical Dementia Ratings, and other volumetric measurements of MRI and PET scans. The fundamental principle of SHAP is to determine the Shapley value for each feature of the sample to be explained; each Shapley value reflects the influence of the corresponding feature on the prediction. Sample features with high and low Shapley values can be obtained, and both sets of features are visualised with Shapley plots to understand their impact on a specific sample for a given prediction. Bloch and Friedrich [123] propose a study that compares classification using RF and XGBoost on volumetric measurements of MRI scans of control and dementia patients. Shapley values are obtained for the features of both classifiers and ranked accordingly; the effect of the attributes on AD prediction is then displayed using Shapley plots, namely force plots and summary plots. Figure 15 is an example of a force plot showing the features that had the most influence on the model’s prediction for a single observation. Figure 16 is an example of a SHAP summary plot used to show the contribution of all features over all instances. Similar approaches are taken in the studies by Bogdanovic et al. [125] and Danso et al. [133]. SHAP force and summary plots are not explicitly quantified to show trustworthiness; however, the visual representations offer quantifiable insights based on the magnitude of feature importance.

Fig. 15 SHAP Explanation – Force Plot. Reproduced with permission from [122]

Fig. 16 SHAP Explanation – Summary Plot. Reproduced with permission from [122]

Table 11 lists XAI studies that use LRP, a model-specific interpretation tool. Complex deep neural networks with video or image inputs can now be explained using LRP [138]. The prediction is transmitted back through the neural network using local propagation rules. Böhle et al. [121] visualise, using LRP, the decisions made by a CNN on AD-related MRI data. LRP creates a heatmap that explains the significance of each voxel contributing to a specific classification. The study also computes the sum of all layer-wise relevance metrics of the MRI, which helps to identify critical areas of the image. Based on the trained CNN, individual classification decisions for AD and HC are explained using LRP.

Pohl et al. [132] propose LRP with multiple rules, also known as composite LRP; by contrast, uniform LRP uses a single rule for interpretation. Both uniform and composite LRP are used in this study to compare the evaluation measures quantitatively. Figure 17 shows a comparison of interpretations for AD classification, termed positive evidence, between uniform LRP and composite LRP. The study shows that the composite LRP rule, compared with the uniform rule, gives a more focused visualisation of only the relevant brain regions for positive AD by filtering out the least relevant ones. The advantage of composite LRP is visualised in the last column of Fig. 17, where a predominant relevance can be observed in the heatmap. Additionally, Fig. 18 shows that composite LRP also proves beneficial in the visualisation of non-AD outcomes (negative evidence). The figure shows negative visualisations for both classes, HC and AD; in Fig. 18 the last column visualises the positive contribution to the HC class (shown in red) and the negative contribution to the AD class (shown in blue). As a result, the LRP studies in Table 11 have a good chance of helping doctors by outlining the neural network decisions used to diagnose AD and other disorders from structural MRI data.

Fig. 17 LRP Explanation. Reproduced with permission from [132]

Fig. 18 LRP Explanation. Reproduced with permission from [132]

GradCAM is a model-agnostic XAI tool used by the studies in Table 12. GradCAM is typically used to produce visual explanations of the key input regions for predictions, increasing the transparency of CNN-based models. Using the gradient of the localised classification score with respect to the features selected by the network, this technique can identify the areas of the image that are most crucial for the prediction [166, 167]. Combining the localised scores creates a high-resolution, class-discriminative visualisation. Ruengchaijatuporn et al. [124] use images of bedside tasks such as clock-drawing, cube-copying and trail-making tests to classify between HC and AD patients with a deep neural network. To improve interpretation, convolutional self-attention and class probability output as a soft label are applied together with the GradCAM tool to visualise the essential input regions for the model. The authors also compare the outputs of their CNN with VGG16, explaining the visuals using GradCAM. Figure 19 is an example of the visual explanation obtained from the multi-input VGG16 model with GradCAM and from the authors’ proposed (Conv-self-attention, soft label) model for an AD test sample. The last column in Fig. 19 depicts the crucial regions of interest for classification as AD compared to the HC column when used with GradCAM. Jain et al. [160] construct a heatmap emphasising the characteristics discovered from the input MRI scan for each layer of the CNN model using GradCAM and then combine them for a final interpretation. Though GradCAM provides qualitative visualisations and does not offer quantitative metrics for trust, it can complement trust assessment by offering visual insights into the model’s attention and localisation. Additionally, Zhang et al. [130] employ GradCAM to produce heatmaps, or visual explanations, from a 3D ResAttNet (Residual Attention Network) that emphasise the features discovered from the input MRI scans at each layer. GradCAM can be used with multimodal inputs without architectural changes or retraining and provides visual explanations that discriminate between classes, which inspires trust in humans, particularly in the healthcare domain.

Fig. 19
figure 19

GradCAM Explanation. Reproduced with permission from [124]
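As a minimal illustration of the mechanism GradCAM relies on, the sketch below computes a class-activation heatmap for a tiny stand-in PyTorch CNN using forward and backward hooks. The architecture, the target layer, the random input, and the assumption that class index 1 denotes AD are all placeholders rather than the models used in [124, 130, 160].

```python
import torch
import torch.nn.functional as F
from torch import nn

# Tiny stand-in CNN; in the reviewed studies this would be the trained AD classifier.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),   # model[2] is the target convolutional layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
)
model.eval()
target_layer = model[2]

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 64, 64, requires_grad=True)   # stand-in for an MRI slice
score = model(x)[0, 1]                               # class score for "AD" (index 1 assumed)
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # GAP of gradients per channel
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalise to [0, 1] for overlay
print(cam.shape)   # (1, 1, 64, 64) heatmap to be overlaid on the input slice
```

The normalised map is then overlaid on the input slice, which is how heatmaps like those in Fig. 19 are typically rendered.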

The review also identified several other XAI frameworks, such as GNNExplainer (GNNE), ICE, OCA, and SM (see Table 13). Without requiring any change to the underlying GNN architecture, GNNE is a model-agnostic method for delivering trustworthy justifications of predictions made by any Graph Neural Network (GNN)-based ML task [129]. The explanation pinpoints a subgraph structure and a selection of node attributes that are essential to the GNN's prediction for a specific instance in a local scope [168]. GNNE can also produce global explanations for a whole group of instances. By combining longitudinal neuroimaging and biologically significant data, Kim et al. [129] offer an interpretable GNN model for AD prediction. GNNE is used to find the significant nodes that contribute to the prediction; the tool identifies a subgraph structure and a subset of node attributes crucial to it. The ability to highlight relevant graph structures and interpretations and the capacity to gain insight into faulty GNNs are two features that make GNNE useful.
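The core of GNNE is an optimisation over a soft edge mask that preserves the model's prediction for the explained node while penalising mask size. The sketch below reproduces that idea for a deliberately tiny, hand-rolled message-passing model in PyTorch; the graph, the model, and the loss weights are illustrative assumptions, and the sketch omits the node-feature mask and entropy terms of the full GNNExplainer objective used in [129].

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, c = 8, 5, 2                                    # nodes, feature dim, classes
X = torch.randn(n, d)
A = (torch.rand(n, n) < 0.3).float()
A = ((A + A.T) > 0).float()
A.fill_diagonal_(1.0)                                # undirected graph with self-loops

class TinyGNN(nn.Module):                            # one round of mean-aggregation message passing
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(d, c)
    def forward(self, X, A):
        H = A @ X / (A.sum(1, keepdim=True) + 1e-8)  # average neighbour features
        return self.lin(H)

model = TinyGNN()                                    # assume weights come from a trained graph model
node = 3
with torch.no_grad():
    target = model(X, A)[node].argmax()

# GNNExplainer-style edge mask: optimise a soft mask that keeps the prediction for
# `node` while encouraging sparsity.
mask_logits = nn.Parameter(torch.zeros_like(A))
opt = torch.optim.Adam([mask_logits], lr=0.05)
for _ in range(200):
    M = torch.sigmoid(mask_logits) * A               # masked adjacency
    logits = model(X, M)[node]
    loss = nn.functional.cross_entropy(logits[None], target[None]) \
           + 0.05 * torch.sigmoid(mask_logits)[A > 0].mean()        # sparsity penalty
    opt.zero_grad(); loss.backward(); opt.step()

edge_importance = (torch.sigmoid(mask_logits) * A).detach()
print(edge_importance[node])   # edges most responsible for node 3's prediction
```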

Chun et al. [126] provide a local explanation for the prediction of conversion from amnestic MCI (aMCI) to dementia or AD using ICE and SHAP for each patient. XGBoost showed the best predictive performance in their study. ICE plots show, for each individual instance, how the prediction changes as the value of a feature of interest is varied while all other features are held constant. Figure 20 shows ICE plots of eight important features - Age, Controlled Oral Word Association (COWAT), Education, Mini-Mental State Examination (MMSE), Rey-Osterrieth Complex Figure Test (RCFT) with delayed recall, RCFT with copy time, Clinical Dementia Rating-Sum of Boxes (CDR-SOB), and Seoul Verbal Learning Test (SVLT) - for six patients numbered 1 to 6 in different colours. For instance, for the feature Age, a line plot is drawn for each patient by varying the Age value while keeping the values of the other features constant [126].

Fig. 20
figure 20

ICE Explanation. Reproduced with permission from [126]
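For tabular models of the kind used in [126], ICE curves can be produced directly with scikit-learn, as in the sketch below. The synthetic data, the gradient-boosting classifier standing in for XGBoost, and the chosen feature indices are placeholders for the clinical features listed above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

# Synthetic stand-in for the clinical feature table; purely illustrative.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# kind="individual" draws one ICE curve per patient: the feature of interest is varied
# across its range while every other feature keeps its observed value.
PartialDependenceDisplay.from_estimator(
    clf, X[:6], features=[0, 3], kind="individual",
    feature_names=[f"feature_{i}" for i in range(8)],
)
plt.show()
```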

Bordin et al. [165] use the occlusion sensitivity method to reveal the relevance of white matter hyperintensity lesions for the classification. Occlusion sensitivity analysis offers a simple way to understand which elements of an image are most crucial to a deep network’s classification. By occluding a patch of the input image and comparing the resulting output with the original prediction, the method determines the model's sensitivity to occlusion in different image regions; the occluded patch is important for classification if the output changes significantly. Using this technique, the authors identified the brain areas that contribute most to the classification. Occlusion sensitivity thus helps build a high-level understanding of the image attributes a network uses to produce a specific classification and sheds light on why a network might misclassify an image. Rieke et al. [134] also use occlusion sensitivity to visualise heatmaps for HC versus AD classification. Figure 21 shows occlusion heatmaps for AD and HC, where the red areas indicate regions important to the classification decision.

Fig. 21
figure 21

Occlusion Sensitivity Mapping. Reproduced with permission from [134]
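The procedure is simple enough to sketch directly: occlude a patch, re-score, and record the drop in the target-class probability. The helper below is a generic NumPy version with a dummy scoring function; the patch size, stride, the zero-valued (black) patch, and the 2D input are assumptions, whereas [134, 165] apply the idea to full 3D volumes with their trained CNNs.

```python
import numpy as np

def occlusion_sensitivity(predict_fn, image, target_class, patch=8, stride=8):
    """Slide a black patch over the image and record the drop in the target-class score."""
    h, w = image.shape
    base = predict_fn(image[None])[0, target_class]
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0.0       # black patch, as in [165]
            heatmap[i, j] = base - predict_fn(occluded[None])[0, target_class]
    return heatmap                                          # large values = important regions

# Usage sketch with a dummy scorer; in practice predict_fn wraps the trained CNN's softmax.
dummy_predict = lambda batch: np.stack([[img.mean(), 1 - img.mean()] for img in batch])
hm = occlusion_sensitivity(dummy_predict, np.random.rand(64, 64), target_class=0)
print(hm.shape)
```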

The Saliency Map (SM) is another XAI tool in which the brightness of each image voxel represents that voxel’s saliency. SMs are also called heat maps; they highlight those regions of the image that most strongly influence the predicted class [169]. Volumetric 18F-Fluorodeoxyglucose (FDG) PET scans were used by De Santi et al. [101] to train a CNN that performs a multiclass classification task (HC, MCI, AD) and to explain it using two post-hoc explanation strategies, SM and LRP. While maintaining a constant overall relevance across all layers, the authors used LRP to break down the output of the network into individual contributions of input neurons. The authors then created a heat map for each input image using SM to show the significance of each voxel for the classification. Figure 22 is an example of an SM plot showing the average saliency in each brain region. SM measures the sensitivity of the output to changes in the input image.

Fig. 22
figure 22

Saliency Mapping. Reproduced with permission from [101]
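A vanilla saliency map is simply the absolute gradient of the winning class score with respect to the input voxels. The sketch below shows this for a stand-in PyTorch model on a placeholder 3D volume; the architecture and volume size are illustrative, not the FDG-PET CNN of [101].

```python
import torch
from torch import nn

# Stand-in 3-class classifier; in [101] this is a CNN trained on FDG-PET volumes (HC / MCI / AD).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 32, 3))
model.eval()

volume = torch.randn(1, 1, 32, 32, 32, requires_grad=True)   # placeholder PET volume
scores = model(volume)
scores[0, scores.argmax()].backward()                         # gradient of the winning class

saliency = volume.grad.abs().squeeze()      # |d(class score)/d(voxel)| = voxel saliency
saliency = saliency / saliency.max()        # normalise for display as a heat map
print(saliency.shape)                        # (32, 32, 32), one value per voxel
```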

From this RQ we understand that a wide range of inputs, such as visual features and volumetric measurements from CT, MRI, and PET scans as well as clinical data, has been used to train ML and DL models. From Fig. 23, it is evident that SHAP occupies a predominant position in interpreting AD diagnosis. It is also to be noted that SHAP is employed only with ML techniques. As can be seen from the figure, LIME, DT, GradCAM, and other XAI tools have been used in many other research studies. Furthermore, several XAI frameworks identified in the review help reduce model bias, increase confidence in the system, and narrow the gap with the healthcare domain. The RQ also reveals many limitations, including a lack of ground rules for explanations, data imbalance, the non-availability of a comprehensive dataset, and the lack of involvement of healthcare professionals. Section "Limitations, Challenges, Needs, and Prospects of XAI in AD Detection" for RQ5 elaborates on the future needs and limitations of AI-based AD detection with XAI.

Fig. 23
figure 23

Sankey diagram showing the various popular XAI frameworks applied to blackbox models

Table 9 Studies incorporating LIME framework for explaining model predictions
Table 10 Studies incorporating SHAP framework for explaining model predictions
Table 11 Studies incorporating LRP framework for explaining model predictions
Table 12 Studies incorporating GradCAM framework for explaining model predictions
Table 13 Studies incorporating a combination of XAI frameworks for explaining model predictions

Benefits of using XAI Methods for AD Detection 

This section addresses the RQ4: What are the proven benefits of using XAI in AD detection and healthcare in general?

In this review, studies have reported several benefits of using the concept of XAI in AI-based AD detection. Most studies have tried to report model accuracy, fairness, and transparency. They have highlighted the importance of XAI in fostering confidence and trust when using AI models for prediction, particularly in the medical field. Independent studies have shown benefits, demonstrating a responsible approach to developing AI with XAI. In this section, we categorise the benefits from the selected studies based on the four forms of explanation - numeric, rule-based, textual, and visual. This classification will help researchers decide which form of explanation to seek based on the available data modality. While most studies using XAI tools produced explanations in visual form, only a few provide interpretations as textual, rule-based, or numeric outcomes.

Textual

Dementia detection from speech transcripts with the transformer-based network BERT by Ilias et al. [131] produces promising classification results. The authors illustrate how LIME explains the classification of transcripts into dementia and non-dementia patients. Figure 24 shows text tokens coloured according to the label (AD or HC) they support, with the colour intensity indicating the importance of each token for the final transcript classification.

Fig. 24
figure 24

LIME Explanation – Textual Explanation. Reproduced with permission from [131]
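The token-level colouring in Fig. 24 corresponds to the weights LIME assigns to individual words. A minimal sketch with the lime library is shown below; the classifier_fn here is a hypothetical placeholder using a toy scoring rule, whereas in [131] it would wrap the fine-tuned BERT model.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

# classifier_fn must map a list of raw transcripts to class probabilities.
# This placeholder scores longer texts as more "AD-like"; purely illustrative.
def classifier_fn(texts):
    p_ad = np.array([min(len(t.split()), 10) / 10 for t in texts])
    return np.column_stack([1 - p_ad, p_ad])     # fake [HC, AD] probabilities

explainer = LimeTextExplainer(class_names=["HC", "AD"])
exp = explainer.explain_instance(
    "uh I see a boy taking cookies from the jar",   # example transcript snippet
    classifier_fn, num_features=6, num_samples=500,
)
print(exp.as_list())   # (token, weight) pairs; the sign shows which class a token supports
```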

Numeric

In another study, Salih et al. [159] develop a proxy to check the stability of explanations given the chosen XAI method, classifier, and available data. The authors use Principal Component Analysis (PCA) to verify the stability of the identified predictors under the chosen explainer by quantifying (in numeric form) the informative predictors. In this study, comparing the predictor rankings obtained with SHAP against the PCA proxy, which produces uncorrelated variables, yields a stable ranking for most classifiers. Figure 25 shows the correlation scores of the different models for the identified features. Given the widespread use of XAI in sensitive applications, including long-term mortality prognosis, admission to critical care units, and extubation failure, the results are beneficial to the medical community.

Fig. 25
figure 25

Correlation Score – Numeric Explanation. Reproduced with permission from [159]
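One plausible way to read this proxy idea is to compare a SHAP-based feature ranking against a ranking built from PCA loadings and quantify their agreement with a rank correlation. The sketch below does this on synthetic data; the models, the loading-weighting scheme, and the correlation measure are assumptions for illustration and not the exact procedure of [159].

```python
import numpy as np
import shap
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Ranking 1: mean absolute SHAP value per feature for a tree-based classifier.
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
sv = shap.TreeExplainer(clf).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv              # older shap versions return a per-class list
shap_rank = np.abs(sv).mean(axis=0)

# Ranking 2: a PCA-based proxy that weights each feature by its loadings on the
# leading uncorrelated components (one plausible reading of the proxy in [159]).
pca = PCA(n_components=5).fit(X)
pca_rank = (np.abs(pca.components_) * pca.explained_variance_ratio_[:, None]).sum(axis=0)

rho, _ = spearmanr(shap_rank, pca_rank)
print(f"Spearman correlation between the two rankings: {rho:.2f}")
```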

Rule-based

We found two articles in our review that obtain explanations in the form of rules. One study integrates the Internet of Things (IoT) and AI agents to remotely monitor seniors’ health status: Khodabandehloo et al. [163] offer a novel HealthXAI system that employs a DT regression method to aid the early identification of cognitive decline, giving caregivers high-level numerical scores reporting inappropriate behaviours together with natural-language explanations of the forecasts. The decision rules predict the value of the target variable and interpret it with a natural-language description as either HC or AD. The suggested strategy addresses the problem of continuous remote monitoring of elderly individuals to aid the early identification of cognitive decline and to better assist clinicians in reaching a diagnosis. In another study, García-Gutierrez et al. [162] propose a diagnostic tool based on a DT that provides a simple, unambiguous set of decision rules, giving clinicians insight into the pathophysiology of AD and behavioural Frontotemporal Dementia (bvFTD). This paper is beneficial for early detection and diagnosis in the medical field because it outlines all the steps needed to evaluate the datasets, including data preparation, feature selection using an evolutionary approach, and the creation of a model for the diseases discussed in the paper.
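Rule-based explanations of this kind can be extracted directly from a fitted tree. The sketch below uses scikit-learn's export_text on synthetic data; the feature names are hypothetical stand-ins for the clinical variables used in [162, 163].

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in cognitive/clinical features; the reviewed studies train DTs on their own clinical data.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)
feature_names = ["MMSE", "age", "education", "memory_score", "social_activity"]  # hypothetical

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text turns the fitted tree into the kind of plain if-then rules clinicians can read.
print(export_text(tree, feature_names=feature_names))
```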

Visual

The models in the studies that use LRP as an AI explainer include CNNs and 3D CNNs. LRP provides visual explanations as heat maps of the brain areas significant for identifying brain atrophy. The significant features recognised for interpretation by LRP in these studies include the hippocampus, entorhinal cortex, and amygdala. Böhle et al. [121] discuss using LRP alongside guided backpropagation to discover heat maps of the relevant significant features. Pohl et al. [132] state that they discovered similar significant features using composite LRP, which combines several propagation rules. The authors also identify that damage to the left and right temporal lobes causes problems with verbal semantic memory and visual memory, respectively. Both studies offer these findings for the benefit of clinicians and radiologists in diagnosis and in building trust in the system.

Several studies use the GradCAM XAI tool for visual explanation of the predictions of a DL model. Ruengchaijatuporn et al. [124] use GradCAM to visually explain predictions from a VGG16 deep learning model. The DL model takes three types of neuropsychological test inputs: clock drawing, cube-copying, and trail-making. However, the authors show that a CNN model with self-attention works more effectively than VGG16 when explained with GradCAM. The heat maps proved beneficial to experts with clinical experience and were rated far superior to those of the baseline model. The authors also claim the model yields better classification performance and interpretability, benefiting the domain community. In another study, by Jain et al. [160], GradCAM was used to show heat maps for a four-way AD classification predicted with the help of a Generative Adversarial Network (GAN). Differently coloured heatmaps obtained from the system help inform predictions of the early onset and severity of dementia. The system has proved beneficial in accurately distinguishing between different classes and making appropriate early predictions. The research community benefits from the authors’ use of a GAN to create a newly balanced dataset and their attention to the serious issue of unbalanced datasets. The coloured heat maps in the article, which show the characteristics of various stages of dementia, would aid medical professionals in making judgements. By proposing an easy-to-understand 3D Residual Attention Deep Neural Network (3D ResAttNet), Zhang et al. [130] have developed an innovative computer-aided technique for the early diagnosis of AD. The authors assert that the 3D ResAttNet, explained with GradCAM, enhances diagnostic performance and the interpretability of MRI by capturing local, global, and spatial information. The study offers an entire end-to-end learning system for automated disease diagnosis. Furthermore, the suggested approach’s explainable process can identify and emphasise the role that crucial brain regions such as the hippocampus, lateral ventricle, and most of the cortex play in transparent decision-making. Another study, by Yang et al. [150], used different 3D CNNs for classification together with several explainers, including 3D GradCAM. Medical experts can benefit from the heat maps because they demonstrate how important the lateral ventricle and most cortical regions are for classifying AD.

Most visualisation techniques consider only the last convolutional layer, which extracts global features of pathological abnormalities, and overlook small-scale structures and subtle discrepancies. The research by Yu et al. [158] used the High-Resolution Activation Mapping (HAM) approach, which creates high-resolution visual explanations that take into account values from the last convolutional layer as well as intermediate features. Compared with previous efforts, the resulting high-quality heatmaps display better discriminative localisation of brain anomalies. The authors validated the model’s effectiveness with good diagnostic accuracy and insightful explanations, which supports its fidelity in clinical applications.

Bordin et al. [165] created heatmaps using the occlusion sensitivity method by occluding a section of the input image with a black patch. The brain regions contributing to the model's classification decision were easily discernible from fluctuations in the output probabilities. The authors identified and reinforced the relevance of white matter hyperintensity as a neuroimaging biomarker for dementia. One study used LRP to decompose the network's output score for input 18F-FDG PET images into individual contributions while maintaining the conservation principle, producing a heat map. Using a saliency map, the study also generates a voxel-wise heat map for each contribution. In their work, De Santi et al. [101] establish that the colour distribution of the saliency maps, in contrast to LRP, shows greater variation among the classes. This study demonstrates that the saliency map regarded the frontal-temporal region of the brain as vital for classifying all the classes, whereas the occipital lobe mattered most in LRP. In both studies, these findings demonstrate clinical relevance and, in the long run, support increased trust in and use of AI models.

The study proposed by Kim et al. [129] uses an interpretable Graph Neural Network (GNN) for classifying AD and MCI. Using GNNExplainer, the proposed model’s predictions are explained in light of the actual and predicted labels for HC, MCI, and AD. GNNExplainer visualises the important nodes, with high regions of interest representing a high contribution to the classification. The authors found that GNNExplainer gives encouraging, interpretable results. The explainer can also capture the predictors' neuro-anatomical contributions, giving more biological interpretation to better understand AD progression. The authors also find their GNN model beneficial, as it outperforms competing models (i.e., DNN, SVM) in prediction accuracy.

Several articles considered in the review use LIME to visualise explanations of AD predictions. Kamal et al. [127] used LIME to discover the critical genes responsible for AD. The authors found the genes OR8B8 and ATP6V1G1 to be very important for AD, while HTR1F and OR6B2 were discovered to be important characteristics of HCs. Shad et al. [128] and Sidulova et al. [157] use LIME to visualise their models' predictions, with coloured areas denoting the image regions that prompt the models to make a given classification.
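For image inputs, the coloured regions described above correspond to the superpixels LIME selects as most influential. The sketch below shows the typical lime_image workflow with a placeholder image and a hypothetical classifier_fn; in [128, 157] this function would wrap the trained CNN.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

# classifier_fn maps a batch of RGB images to class probabilities; this toy stand-in
# scores brighter images as more "AD-like" and is purely illustrative.
def classifier_fn(images):
    scores = images.mean(axis=(1, 2, 3))
    return np.stack([1 - scores, scores], axis=1)    # fake [HC, AD] probabilities

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    np.random.rand(96, 96, 3),          # placeholder for an MRI slice rendered as RGB
    classifier_fn, top_labels=1, num_samples=200,
)
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False,
)
overlay = mark_boundaries(img, mask)     # highlights the superpixels that drove the prediction
print(overlay.shape)
```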

The RF model has been used in numerous studies to classify AD ([122, 123, 151, 154]), with SHAP depicting the explanations using force, summary, and violin plots (see Figs. 15, 16 and 26, respectively). According to the study in [122], the output decision is supported by several complementary, credible, and visible justifications. Additionally, the model achieves a favourable accuracy-interpretability trade-off, producing accurate outcomes with high interpretability; the proposed model is accurate and understandable, according to the authors. In [123], the authors display SHAP force plots that can explain individual model predictions, a form of explanation well suited to clinical practice. The model displays the most significant learned features, which show an acceptable relationship. The absolute value of each SHAP score reflects how much each attribute contributes to the final prognosis, as shown by the authors in [151]. The internal workings of the RF classifier trained on cognitive and clinical data are explained by SHAP, demonstrating a potential connection between feature-relevance patterns and diagnosis. In a different study [154], the authors demonstrate models with high prediction accuracy because they merge many DTs into a single global prediction. The study was repeated using the SHAP method, which returned feature rankings that agreed with those from the RF. The study used AD biomarkers strong enough to predict HC, LMCI, and AD correctly and ranked the biomarkers according to their significance. The paper also shows that the amyloid beta (A), tau (T), and neurodegenerative (N) biomarkers have different importance in predicting dementia: amyloid beta and tau status throughout disease progression play a more significant role in predicting early cognitive impairment, while glucose consumption is more significant in predicting future cognitive impairment. The study incorporates biomarkers from all arms of the A/T/N framework into a single integrated analysis, utilising RF to categorise dementia status and rank biomarker characteristics in order of relative importance.

Fig. 26
figure 26

SHAP Explanation – Violin Plot. Reproduced with permission from [155]

XGBoost and RF are used in the study by Bogdanovic et al. [125] for AD classification and interpreted with SHAP. The classification model proves beneficial in delivering precise and valid prediction results. The SHAP force plots for both models indicate that the clinical dementia rating scale has the highest impact. Gender and apolipoprotein, as seen from the SHAP force plots, have the most negligible impact and are not decisive factors for an AD diagnosis [125]. On the other hand, the study also reveals that mini-mental state examination values mainly impact healthy subjects, while age influences the LMCI class. Danso et al. [133] also discuss the benefits of the tree-based approach and how it can detail the decisions made for individual forecasts. The research created a machine-learning model with multiple classifiers to predict AD at both global and local levels. Traits such as education, hypertension, hearing loss, smoking, obesity, depression, physical inactivity, diabetes, and infrequent social interaction were highlighted as potential modifiable risk factors in the report and were among the best-ranked features of the predictive model.

In their study, Hernandez et al. [155] compare XGBoost, RF, and SVM models to understand how to quantify each feature’s contribution and achieve the best accuracy. With the help of SHAP violin plots, the study identifies the best models that use information coherent with clinical knowledge. Figure 26 illustrates a violin plot showing the important features of the XGBoost classifier over the complete test set. In Fig. 26 the feature values of the test samples are shown with a colour code, which helps to see whether a specific feature value favours high or low probabilities predicted by the model; blue hues indicate low values, while red hues indicate high values on the colour scale. The authors employ similar reasoning to demonstrate the utility of the features in distinguishing between the classes.
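The violin and force plots discussed here follow a standard shap workflow, sketched below on synthetic data. The gradient-boosting stand-in, the hypothetical feature names, and the handling of the explainer output shape are assumptions rather than the exact setups of [122, 123, 155].

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for a cognitive/clinical feature table; data and feature names are illustrative.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
names = ["MMSE", "CDR", "age", "education", "ApoE4", "hippocampal_vol"]  # hypothetical
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(clf)
sv = explainer.shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv          # older shap versions return a per-class list

# Violin-style summary of global feature importance (cf. Fig. 26)...
shap.summary_plot(sv, X, feature_names=names, plot_type="violin")

# ...and a force plot explaining a single patient's prediction.
base = float(np.ravel(explainer.expected_value)[0])
shap.force_plot(base, sv[0], X[0], feature_names=names, matplotlib=True)
```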

Lai et al. [156] use AdaBoost, LR, LGBM, DT, XGBoost, RF, kNN, Naive Bayes, and SVM learning models along with SHAP to generate force plots illustrating the profiles of affected patients and normal subjects. In this study, the authors found six genes that could accurately predict AD progression and used SHAP to explain the decision-making process of the model. The study offers fresh perspectives on the role of ER stress-related genes in AD heterogeneity and on the development of new immunotherapy targets for AD patients. The work of Chun et al. [126] uses RF, SVM, and XGBoost learning models. The study is significant because it shows that the Interpretable Machine Learning (IML) method can calculate the individual probability of dementia conversion for each MCI patient. This study’s fundamental finding is that the IML suite, consisting of ICE, SHAP, and BreakDown plots, enables the interpretation of variables crucial to each patient’s conversion to dementia. The authors affirm that a model using any of these IML techniques enables prediction of patients’ conversion from amnestic MCI to dementia.

The study of Xu et al. [152] combines learning models and procedures, including multiple random seeds, nested cross-validation, SVM-SMOTE, and RF, for a three-way AD classification. Using SHAP, the paper identifies RAVLT-perc-forgetting as an important feature, and an explanation force plot is obtained for every instance. The force plot of each test instance is rotated ninety degrees, and the rotated instances are then stacked horizontally, producing a SHAP summary plot [152]. The plots consequently provide the doctors with an understanding of how and why the model makes its judgements. SHAP is used by Salih et al. [159] and Bloch et al. [161] to determine the order of informative predictors in test data. ML models and their relationships were also visualised and analysed using SHAP summary plots. SHAP force plots examined the individual forecasts of chosen individuals, and the summary plots of those models primarily displayed biologically plausible outcomes. Moderate to strong correlations were found when comparing native and permutation feature importances with the SHAP values.

To summarise, the LIME explainer interprets the transcripts classified by BERT by highlighting textual tokens, SHAP was used to produce explanations in numeric form, DTs produced rule-based ante-hoc interpretations, and other explainers such as LRP and GradCAM supported AD diagnosis by visualising heatmaps of the significant features. Of the 37 research articles, 28 resorted to visual representation, one article each used numeric and textual forms, and two explained predictions using rule-based techniques. This research question helped us bring to light the different forms of explanation for AD prediction and will be of significant use for future research.

Limitations, Challenges, Needs, and Prospects of XAI in AD Detection

This section addresses the RQ5: What are the limitations, challenges, needs, and prospects of XAI in AD detection in general?

In the last few years, several studies have been proposed that use the XAI concept to better explain AI systems’ decisions. Easy access to several XAI frameworks with readily available source code, together with the availability of high-performance computers, has enabled effortless integration of these explainers into standalone AI systems. Unsurprisingly, these efforts have several limitations despite the promising results demonstrated by independent studies. Here, we list several limitations and research gaps in XAI-based AD detection with the intention of instigating further research in this field.

1. XAI researchers often resort to self-intuition to determine what constitutes a good explanation, without validating it with a professional from the medical domain [175]. Also, the derived AI explanations are mainly data-driven, without domain experts’ input. Delivering maximum benefit to stakeholders necessitates the concurrent involvement of medical and AI experts in ascertaining the interpretability delivered by the XAI framework. None of the papers considered in our study addresses this aspect.

2. One of the significant drawbacks of XAI-based AD diagnosis is the absence of ground truth data [176]. Several neuroimaging and clinical biomarker datasets exist for AD, but none provide ground truth to validate the explainability elicited by XAI models. For instance, in the case of visual explainers (GradCAM, LRP, SM, etc.), heatmaps are often visually assessed. Heatmaps highlight voxels based on classifier decisions without stating underlying atrophy or shape differences in brain regions. This dilutes the heatmap interpretation to a mere indication of where the trained model sees the evidence. Sometimes, heatmaps and the presence of actual biomarkers may be uncorrelated in the case of a poorly trained classifier. Hence, there is a need for rationalising visual assessments in the case of explainers with visual outputs through appropriate ground truth [121].

3. Furthermore, the influence of XAI explanations varies dramatically when delivered to people with different levels of domain expertise [121]. When people observe explanations that contradict their own intuition, confusion arises and the counterintuitive relationships delivered by the XAI system are questioned. Such situations lead to further doubts about the correctness of the model even when it delivers a valid explanation [121]. The only way to circumvent this is to have ground truth against which the explanation can be objectively validated without challenging the decisions made by the XAI system.

4. Confidence measures are crucial in computer-aided diagnosis, where a wrong prediction is almost always life-threatening. When the system cannot deliver a confident prediction, it must warrant manual intervention to arrive at an appropriate decision. Hence, XAI methods must also incorporate a confidence score to identify situations in which the classifier is incorrect before providing explanations; otherwise, the end user may develop false trust in the system [177]. It is therefore vital to evaluate not only whether an explanation is intuitive to the user but also whether it helps arrive at an optimal decision [177].

5. Some papers used multiple XAI frameworks for enhanced explainability. This may be good from an academic standpoint but contributes additional opaqueness in real-world use. For instance, the LIME and SHAP frameworks were used jointly in one study [120], yet the feature rankings derived by the individual frameworks did not correlate with each other: the Mini-Mental State Examination (MMSE) contributes most strongly to SHAP, whereas the normalised Whole Brain Volume (nWBV) dominates the LIME features [120]. In yet another study [161], SHAP was used with other methods to validate the interpretability; again, a weak correlation was found between the feature rankings of the SHAP values and the other models. Such scenarios lead to ambiguity in the explanations delivered by the models, resulting in a loss of clinicians’ trust in them.

6. Another significant lapse in almost all the studies we considered is the limited use of medical datasets or the non-availability of a comprehensive benchmark dataset that exhibits variations representative of real-world scenarios [178]. This impedes testing of the models on an extensive dataset, which is crucial for determining their actual robustness [131, 155, 156, 179]. Hence, most studies in the literature ended up with subjective claims but exhibited subpar performance due to generalisability issues when tested on a different dataset [178]. A closely related issue is class imbalance [123, 165]: ML and DL algorithms predict dominant classes more accurately than classes with inadequate samples, and most studies had limited AD samples compared to the HC or MCI cohorts [123, 125, 133, 151, 157, 161, 165]. Only a balanced dataset allows meaningful insights to be drawn. XAI-based AI techniques for AD diagnosis will become genuinely influential only if research effort is directed towards creating such a comprehensive, balanced benchmark dataset.

7. Although some studies utilised multimodal data (clinical, sociodemographic, MRI features, neuroimages, etc.) to predict AD [122, 127], explanations were derived for a single modality only. This may be due to the absence of correlation among the interpretations obtained from different modalities (see point 5 above). Hence, having medical experts in the loop and deriving interpretations for every modality used is the way forward.

8. Even though some studies used XAI tools in AD prediction, they did not consider disease biomarkers such as MRI volumetry, cortical thickness, etc., which correlate well with dementia [124, 126].

9. Most studies have not indicated the factors (hyperparameter values, the train-test split proportion, data preprocessing, etc.) affecting model accuracy and the explainability subsequently derived [125].

10. Another major concern that adds to the reluctance of medical experts to rely on AI solutions is the inability of AI to consider the history of anomalies that contributed to cognitive decline. The lack of real-world labelled datasets of individuals collected over a long period of time is a genuine limitation for any medical field, not just AI-based AD diagnosis [163].

11. Even though researchers applied single or multiple XAI frameworks in AD prediction, sometimes there was no specific correlation between the AI prediction and the associated brain region [157].

We have seen numerous studies proposed to explain AD prognosis and diagnosis using several XAI frameworks. Although these studies have greatly facilitated clinical fidelity in the associated predictions, this RQ made us realise that, owing to the aforementioned limitations and challenges, we are still far from using XAI-based AD systems in real clinical settings. In future, AI practitioners must thoroughly address these needs by keeping medical experts in the loop in order to deliver genuine trustworthiness to the medical community for AI-driven AD diagnosis.

Conclusion

Explainable Artificial Intelligence has gained tremendous importance over the last several years due to scientific demands and regulatory compliance. Researchers are exploring different XAI frameworks that characterise model accuracy, rationality, and clarity in AI-assisted decision-making, which is indispensable in healthcare. XAI aids in creating synergistic environments in which predictions such as long-term mortality and extubation failure can be explained efficiently. Hence, promoting wider dissemination of XAI concepts, backgrounds, and techniques to the research community is crucial.

Towards this aim, and to serve as a reference source, this article presents a systematic review of the application of XAI models and frameworks to multimodal AD data. We have reviewed articles on XAI for AD diagnosis from the last decade. The study included 37 research articles thoroughly reviewed through carefully framed RQs. The RQs highlighted different XAI-based studies adopted for AD diagnosis and unveiled various ML and DL models that have embraced XAI frameworks to instil transparency and fidelity in AI predictions. The study also reveals several benefits, limitations, and future avenues for clinical diagnosis. We understand it is too early to comment on reducing the gap between the medical and AI domains to zero. Nevertheless, such reviews reveal the benefits and limitations to the research community so that the trade-off between accuracy and explainability in AI solutions can be resolved to an acceptable level of fidelity. This review will help many healthcare domains explore and leverage the true capabilities of AI in fostering fidelity in clinical decision support systems.