1 Introduction

Alzheimer’s disease is a slow-developing neurodegenerative condition with symptoms including personality and behavior changes, memory problems, and dementia [1]. The disease’s origins are unknown, and there is no reliable diagnosis or disease-modifying therapy. Treatment is expensive, burdening society and patients’ families [2, 3].

The prevalence of AD and its impact on individuals and society are significant concerns. AD is the most common cause of dementia, affecting around 41 million people worldwide, a number projected to increase to 152 million by 2050 due to the aging population and the lack of effective treatments [4]. AD can seriously impair a person’s cognitive and functional abilities, leading to difficulties in performing daily activities and maintaining social relationships. The impact of AD on family caregivers, the dementia workforce, and society is also substantial. The disease takes a toll on caregivers, who often provide unpaid care for their loved ones with dementia and face physical, emotional, and financial burdens. The dementia workforce, including healthcare professionals, social workers, and other support providers, is also challenged by the increasing demand and complexity of Alzheimer’s care. The costs associated with Alzheimer’s care, including medical expenses and lost productivity, are significant and can have a ripple effect on society as a whole. According to the Alzheimer’s Association, the total payments for healthcare, long-term care, and hospice for people with Alzheimer’s and other dementias in the United States were estimated at $355 billion in 2021, and this number could rise to $1.1 trillion by 2050 if medical breakthroughs are not developed to slow or cure the disease [5, 6].

Recent advances in AD research have focused on discovering new targets for drug candidates, identifying molecular subtypes of AD, and developing new biomarkers that may serve as the basis for precision medicine approaches to treatment development [7]. For example, researchers have identified ATP6V1A and VGF as key regulatory genes in neural networks disrupted in AD and have shown that modulating their expression can improve neural function in cellular and animal models. Researchers have also used machine learning and network analysis to classify Alzheimer’s patients into six molecular subtypes based on gene expression profiles; these subtypes have distinct clinical and pathological features [8, 9]. In addition, new biomarkers that can detect early AD have been discovered, including blood-based markers of amyloid-beta and tau proteins and retinal markers of neurodegeneration [10]. These discoveries have implications for improving the diagnosis, prognosis, and treatment of AD, as well as for designing more efficient clinical trials. However, many challenges and gaps remain in AD research, including the lack of treatments that alter the course of AD, the limited understanding of environmental and lifestyle factors that influence its risk and progression, and the need for more diverse and representative patient populations and study cohorts.

Machine learning (ML) algorithms must be implemented with an appropriate architectural design and well-defined preprocessing steps. Classification studies using ML typically involve four steps: feature extraction, feature selection, dimensionality reduction, and feature-based algorithm selection [11]. These procedures can be time-consuming, require specialized knowledge and multiple optimization stages, and their reproducibility has been an issue. For example, in the feature selection procedure, AD-related features from different neuroimaging modalities are chosen to derive more informative combinatorial measures. These features may include mean subcortical volumes, gray matter densities, cortical thickness, brain glucose metabolism, and cerebral amyloid accumulation in regions of interest (ROIs), such as the hippocampus.
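As a minimal illustration of this four-step pipeline, the sketch below chains feature selection, dimensionality reduction, and a classifier in scikit-learn. The feature matrix and labels are synthetic stand-ins for precomputed ROI measures, and the component choices (ANOVA selection, PCA, an RBF SVM) are illustrative rather than prescribed by any particular study.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Stand-in for step 1 (feature extraction): in practice X would hold
# ROI measures (volumes, cortical thickness, ...) extracted from scans.
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),  # step 2: feature selection
    ("reduce", PCA(n_components=10)),          # step 3: dimensionality reduction
    ("classify", SVC(kernel="rbf")),           # step 4: feature-based classifier
])

# Keeping all steps inside the cross-validated pipeline avoids leaking
# information between folds, which helps the reproducibility issue noted above.
print(cross_val_score(pipeline, X, y, cv=5).mean())
```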

The disease has significant impacts on public health, society, and the economy, including mortality, morbidity, and economic and social burdens [12]. Despite significant efforts, diagnosing AD remains a complex task, with a substantial percentage of cases going undiagnosed globally. These diagnostic challenges highlight the need for more accurate and earlier diagnostic methods. In this context, DL has emerged as a promising approach for the accurate and early diagnosis of AD [13]. By leveraging deep neural networks, researchers have made strides in developing more effective diagnostic tools and in categorizing the stages of the disease, offering hope for improved patient outcomes. Integrating DL techniques into Alzheimer’s research therefore represents a significant step towards addressing the diagnostic challenges associated with the disease (Fig. 1) [14, 15].

Fig. 1
figure 1

The number of publications on AD diagnosis using ML and DL from 2018 to 2023. The figure shows the increasing interest and research activity in both fields, based on the results of a web search using the keywords “Alzheimer’s disease diagnosis”, “machine learning”, and “deep learning”. The figure also shows that DL publications are growing faster than ML publications, indicating that DL is a more novel and promising approach for AD diagnosis

This review comprehensively explores the use of deep learning models, particularly convolutional neural networks (CNNs) and Vision Transformers (ViTs), for classifying AD from medical images. It covers an introduction to the disease, the potential of deep learning in diagnosis, and related work on CNN and ViT applications, and it discusses the strengths and limitations of combining both models. Emphasis is placed on deep learning and image classification concepts relevant to AD diagnosis.

2 Background

Neurological diseases encompass a wide range of conditions that impact the nervous system, including the brain, spinal cord, and nerves, governing functions like movement, sensation, and cognition [16]. These diseases, such as Alzheimer’s disease, Parkinson’s disease, multiple sclerosis, epilepsy, and migraine, can arise from genetic, environmental, and lifestyle factors [17] (Fig. 2). Symptoms vary based on the specific condition and affected area of the nervous system, ranging from headaches to modifications in movement, sensation, and cognitive function. Diagnosis is intricate and may involve physical exams, medical history reviews, imaging tests, and specialized diagnostic procedures. Treatment options differ depending on the disease and can include medication, physical therapy, surgery, or other interventions [18].

Fig. 2
figure 2

Diagnosis of neurological disorders using deep learning (Google Scholar, 2023)

Managing neurological diseases is crucial for improving quality of life; hence, early detection and treatment play a vital role in symptom management and better outcomes [19] (Fig. 3). Ongoing research and advancements in diagnostics and treatments are enhancing our understanding and management of these complex conditions, such as AD, which necessitate early and accurate diagnosis for optimal patient care [20].

Fig. 3
figure 3

AD prevalence and age distribution

2.1 Alzheimer’s Disease Datasets

Several publicly available datasets of AD images exist, with the two most commonly used being the AD Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies (OASIS) (Tables 1, 2).

  • Alzheimer’s Disease Neuroimaging Initiative (ADNI): ADNI is a large-scale longitudinal multimodal study that aims to develop biomarkers to track the progression of Alzheimer’s disease (AD) and evaluate potential therapeutics. The dataset includes imaging data from multiple sources, such as magnetic resonance imaging (MRI), positron emission tomography (PET), and other imaging modalities. ADNI offers both structural and functional MRI data, along with amyloid PET and fluoro-deoxyglucose PET (FDG-PET) scans. The dataset contains imaging data from both AD patients and healthy controls across multiple phases of the study, starting from ADNI1 in 2004 up through ADNI3. It is made freely available at https://www.kaggle.com/datasets/madhucharan/alzheimersdisease5classdatasetadni

Table 1 Description of the Alzheimer’s disease image dataset: ADNI
Table 2 Description of the Alzheimer’s disease image dataset: OASIS

2.1.1 Preprocessing Steps Applied to ADNI and OASIS Datasets

  • Intensity normalization is a crucial step in image processing, especially for neuroimaging data and medical images with varying pixel intensities. It adjusts pixel intensity values to improve contrast and visual interpretation. Linear normalization transforms original grayscale images into a desired intensity range, ensuring consistency and enhancing image processing quality [21, 22] (see the sketch following this list).

  • Skull stripping is a crucial process in Alzheimer’s data images, eliminating non-brain signals for neuroimage processing tasks. Techniques include traditional and deep learning methods, with automated brain extraction methods used for consistency and precision. Skull stripping is essential for brain tissue segmentation and volumetric analysis [23].

  • Motion correction is a crucial stage in the processing of Alzheimer’s disease images, repositioning and reorienting brain magnetic resonance images to account for motion during imaging. It is essential for neuroimage processing tasks like tissue segmentation and volumetric analysis, and it removes motion, eddy-current, and echo-planar imaging artifacts from raw diffusion-weighted imaging (DWI) images [24].

  • Spatial smoothing in the context of Alzheimer’s data images involves a filtering process based on segmented tissue images from previous steps. This technique often uses the Gaussian kernel function in normalized space to reduce noise and enhance the quality of the images. Spatial smoothing is commonly applied to various imaging data, including Amyloid PET scans, to decrease noise levels and improve the accuracy of clinical trial results.

  • Registration is the process of aligning images into a common coordinate system, crucial for neuroimaging applications like AD diagnostic classification. It allows accurate analysis and interpretation of data, especially in studies using ML techniques to extract features for classification [25].
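As a minimal sketch of two of these steps, linear intensity normalization and Gaussian spatial smoothing, the following code uses nibabel and SciPy on a synthetic volume; the volume shape, smoothing sigma, and output file name are illustrative assumptions, and production pipelines (e.g., FSL or SPM) implement these steps with many more options.

```python
import numpy as np
import nibabel as nib
from scipy.ndimage import gaussian_filter

# Hypothetical input: for a real scan, replace with
# nib.load("subject_T1.nii.gz").get_fdata(). A random volume keeps
# the sketch self-contained.
data = np.random.rand(91, 109, 91) * 4000.0

# Linear (min-max) intensity normalization to [0, 1].
lo, hi = data.min(), data.max()
normalized = (data - lo) / (hi - lo + 1e-8)

# Spatial smoothing with a Gaussian kernel (sigma given in voxels).
smoothed = gaussian_filter(normalized, sigma=1.0)

# Save the result as a NIfTI volume (identity affine for this sketch).
nib.save(nib.Nifti1Image(smoothed, np.eye(4)), "subject_T1_preproc.nii.gz")
```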

2.1.2 Image Classification

Medical images such as magnetic resonance imaging (MRI), computed tomography (CT), or positron emission tomography (PET) scans can be examined using image classification tasks to assist with AD diagnosis. These images provide valuable information about the structure and functioning of the brain that can be used to identify AD-related alterations.

Magnetic Resonance Imaging: MRI modality

Fig. 4
figure 4

MRI modality

MRI is a medical imaging technique that uses magnetic fields and radio waves to produce detailed images of body structures, including the brain, spinal cord, joints, and abdomen [26]. It is widely used in AD research due to its excellent anatomical detail and high diagnostic accuracy, which are correlated with underlying pathology (Fig. 4) [27].

Positron Emission Tomography: PET modality

Fig. 5
figure 5

PET modality

PET imaging uses radioactive material to create 3D body images, aiding in cancer diagnosis and treatment monitoring, as well as in assessing brain function and cardiovascular health. Its advantages include detecting biochemical changes that indicate disease and monitoring the effectiveness of treatments such as chemotherapy (Fig. 5) [28, 29].

Image classification can provide insights into brain structure and function, revealing underlying pathology and disease progression. It aids early detection and diagnosis of AD, improving patient outcomes and enabling better treatments and disease management through the analysis of vast medical images.

2.2 Deep Learning

Deep learning is a branch of artificial intelligence that uses multi-layered neural networks to learn from large amounts of data and perform complex tasks [30,31,32]. CNN and ViT are two types of DL models [33] that can be used for image classification tasks [34, 35], such as Alzheimer’s disease (AD) classification. CNN stands for Convolutional Neural Network, which is a neural network that uses convolutional layers to extract features from images. CNNs have been widely used for medical image analysis and have proven to be effective in AD classification tasks, as they can learn hierarchical patterns from brain imaging data, such as MRI and PET images. ViT stands for Vision Transformer, which is a neural network that uses transformer layers to encode patches of images as sequences of tokens. ViT models represent an input image as a series of image patches, similar to the series of word embeddings used in text, and directly predict class labels for the image. ViT models have shown exceptional performance in image recognition tasks when trained on enough data, and offer a promising alternative to traditional CNNs in handling the complexities of brain imaging data [36, 37]. Therefore, both CNN and ViT are well-suited for AD classification [38], depending on the specific problem and the available data [15].

2.2.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of DL algorithm commonly used for image classification, object detection, and image segmentation [39,40,41]. They are based on the concept of convolution, which involves sliding a small window, called a kernel or filter, over an input image to extract features [42].

Convolutional, pooling, and fully connected layers are the typical components of a CNN [43]. Convolutional layers learn filters that recognize particular aspects of the input image, such as edges or corners. Pooling layers downsample the feature maps produced by the convolutional layers, reducing their spatial dimensions and making the network more computationally efficient. The fully connected layers use the high-level features discovered by the convolutional layers to make predictions and perform the final classification (Fig. 6) [44, 45].

Fig. 6
figure 6

Convolutional Neural Networks architecture
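To make the three layer types concrete, here is a minimal PyTorch sketch of a small CNN for single-channel brain slices; the layer sizes, input resolution, and four output classes (e.g., dementia stages) are illustrative assumptions, not a configuration drawn from the reviewed studies.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=4):  # e.g., ND, VMD, MD, MOD stages
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: learned filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer maps high-level features to class scores.
        self.classifier = nn.Linear(32 * 32 * 32, num_classes)

    def forward(self, x):  # x: (batch, 1, 128, 128) grayscale slices
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SimpleCNN()(torch.randn(8, 1, 128, 128))  # -> shape (8, 4)
```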

CNNs can have multiple convolutional, activation, and pooling layers to learn more complex features and improve classification accuracy. Additionally, CNNs can be pre-trained on large datasets such as ImageNet to improve performance on smaller datasets with limited training data. Table 3 below shows various pretrained models based on CNN architectures applied to the ImageNet dataset.

Table 3 Pretrained CNN models applied on ImageNet validation dataset

2.2.2 Vision Transformers

A recent development in DL architecture called Vision Transformers (ViT) has demonstrated outstanding performance in image classification challenges [46, 47]. The transformer model, initially created for Natural Language Processing (NLP) tasks, serves as the foundation for the architecture of ViT [48]. A transformer model transforms the input sequence using a number of self-attention layers that capture the interdependencies between the sequence’s various components. As a result, the model can learn representations of the input that are better suited for downstream tasks such as classification.

Attention in Transformer The Transformer model relies heavily on the attention mechanism, which comprises three distinct types of attention modules: self-attention, masked attention, and cross-sequence attention [49]. These modules are crucial to the Transformer’s ability to process and understand sequential data. Self-attention allows the model to focus on different parts of the input sequence, masked attention prevents the model from attending to certain parts of the input sequence, and cross-sequence attention enables the model to establish relationships between different sequences [50].

  • Self-attention: In 2017, Vaswani et al. introduced the Transformer, whose attention mechanism later became widely used in visual learning and understanding. The self-attention mechanism plays a crucial role in this approach, enabling the model to capture internal relationships among data or features without relying heavily on external information. By computing the interdependence between various patches of an image, self-attention is capable of resolving the issue of long-range dependencies.

    To apply self-attention to an image X, we can use a particular model that involves transforming the input to obtain queries (Q), keys (K), and values (V). This transformation can be expressed mathematically using a standard formulation for Q, K, and V, as shown in Equation (1).

    $$\begin{aligned} K= W^K X, Q = W^Q X, V = W^V X \end{aligned}$$
    (1)
Fig. 7
figure 7

Comparison of Scaled Dot-Product Attention and Multi-Head Attention

Scaled Dot-Product Attention involves calculating the dot product of a query vector and a group of key vectors, scaling the output by the square root of the key vector dimension, and using a softmax function to compute a weighted sum of the value vectors. In contrast, Multi-Head Attention executes Scaled Dot-Product Attention several times using distinct learned parameters, enabling the model to focus on multiple aspects of the input concurrently (Fig. 7). We may represent the scaled dot-product attention as Eq. (2):

$$\begin{aligned} Attention(Q, K, V ) = softmax\left( \frac{QK^T}{ \sqrt{d_K} }\right) V \end{aligned}$$
(2)

where \((\frac{QK^T}{ \sqrt{d_K} })\) is referred to as the attention matrix.

Fig. 8
figure 8

DETR’s transformer Architecture [51]

Multi-head attention: it can be expressed as Eq. (3).

$$\begin{aligned} MultiHead(Q, K, V ) = Concat(head_1,..., head_h)W^O \end{aligned}$$
(3)

where \(head_i\) = Attention(\({W_i}^Q\) X, \({W_i}^K\) X, \({W_i}^V\) X)
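The following PyTorch sketch implements Eqs. (1)–(3) directly: per-head projections of the input, scaled dot-product attention, and concatenation with an output projection. The tensor shapes and random weights are illustrative; practical code would use learned nn.Linear layers or nn.MultiheadAttention.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    # Eq. (2): softmax(QK^T / sqrt(d_k)) V
    attn = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Eq. (1): project X into Q, K, V; split into heads; attend per head;
    # then concatenate and project with W_o as in Eq. (3).
    B, T, D = X.shape
    d_h = D // num_heads
    def split(t):  # (B, T, D) -> (B, heads, T, d_h)
        return t.view(B, T, num_heads, d_h).transpose(1, 2)
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)
    return heads.transpose(1, 2).reshape(B, T, D) @ W_o

X = torch.randn(2, 16, 64)                      # (batch, tokens, dim)
W = [torch.randn(64, 64) for _ in range(4)]     # toy Q, K, V, O projections
out = multi_head_attention(X, *W, num_heads=8)  # -> shape (2, 16, 64)
```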

ViT modifies the transformer design to process and classify images. The input image is first divided into smaller patches, which are linearly projected into a sequence of embeddings. These embeddings are then passed through a number of transformer layers, allowing the model to pick up on the image’s global characteristics. The network is completed by a straightforward classification head that takes the final transformer output and maps it to the output classes.
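A minimal sketch of this front end follows, assuming a 224×224 single-channel input, 16×16 patches, and a 768-dimensional embedding (sizes borrowed from ViT-Base as assumptions); a strided convolution performs the patch split and linear projection in one step.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=1, dim=768):
        super().__init__()
        n = (img_size // patch) ** 2  # number of patches
        # Strided convolution = patchify + linear projection in one op.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # positional embedding

    def forward(self, x):  # x: (B, 1, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, n, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos    # (B, n+1, dim)

seq = PatchEmbedding()(torch.randn(4, 1, 224, 224))  # -> shape (4, 197, 768)
# `seq` would then pass through transformer layers and a classification head.
```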

The popularity of transformers in NLP has spread to the CV research community, where there have been numerous attempts to modify transformers for use with vision-related problems. The most prominent of these models are detection transformer (DETR) [51], ViT [52], data-efficient image transformer (DeiT) [53], and Swin-Transformer [54]. Transformer-based models for vision have been developed at an unparalleled rate [55].

  • DETR: proposed by Carion et al., is an end-to-end detection model that uses a transformer encoder to model image features, generate object queries, and assign labels to objects (Fig. 8).

  • ViT: Dosovitskiy et al. proposed ViT, an image classification model that uses the Transformer architecture. It converts input images into patches, encodes spatial information, and outputs the learned patch embeddings. The model classifies images using a multi-layer perceptron and can also incorporate feature maps from CNNs for relational mapping (Fig. 9).

Fig. 9
figure 9

ViT’s transformer Architecture

  • DeiT: Touvron et al. [53] proposed DeiT to address ViT’s large training-data requirements, using a knowledge distillation framework in which a CNN teacher guides the transformer student, easing training and compensating for the transformer’s weaker inductive biases (Fig. 10).

Fig. 10
figure 10

DeiT’s transformer Architecture [56]

  • Swin-Transformer: To reduce the cost of computing attention for high-resolution images and to cope with varying scales of visual units in scene understanding tasks (such as segmentation), Liu et al. proposed the Swin Transformer. They introduced window self-attention to reduce computational complexity and used shifted window attention to model the relationships between windows. In addition, they connected these attention blocks to patch-merging blocks, which merge adjacent patches and create a hierarchical representation that handles variations in the scale of visual units (Fig. 11).

Fig. 11
figure 11

Swin-transformer architecture

2.2.3 CNN vs. Vision Transformer

Table 4 compares CNNs and ViTs across key aspects such as architecture, feature extraction, application, performance, advantages, and disadvantages, highlighting their differences and respective strengths in handling image data. While CNNs excel at capturing local patterns, ViTs leverage self-attention mechanisms to understand global context, offering unique advantages and trade-offs in image processing tasks [57].

Table 4 CNN vs. Vision Transformer

3 Survey Methodology

The study investigated the detection of AD through extensive searches across renowned scientific databases, including Springer, IEEE Xplore, ScienceDirect, MDPI, Elsevier, Wiley, Taylor & Francis, and ACM libraries, using relevant keywords. Additionally, articles indexed in Clarivate Web of Science and Scopus databases were consulted to validate findings, resulting in the review of 61 articles (Fig. 12).

Fig. 12
figure 12

Survey methodology: Databases and Keywords

Table 5 Limitations of reviewed survey articles on AD diagnosis (2023)

4 Related Works

In this literature review, we focus on the use of DL with the most popular architectures to diagnose AD from image data. We selected studies that met the following criteria: (1) they used the ADNI and OASIS datasets; (2) they applied DL techniques, such as CNNs, ViTs, or hybrid models, to classify or predict AD; (3) they reported performance metrics, such as accuracy, precision, recall, or F1-score, for their models; and (4) they were published in peer-reviewed journals or conferences between 2019 and 2024. The main research questions that we aim to answer in this review are: (1) What are the advantages and disadvantages of using CNNs or ViTs for AD diagnosis from MRI data? (2) What are the current challenges and limitations of applying DL to AD diagnosis from image data? (3) What are the possible directions and opportunities for future research in this field?

To illustrate the limitations of the existing survey articles on AD diagnosis, we present Table 5, which summarizes the main gaps and shortcomings of the four most recent and relevant reviews in the field.

4.1 Metrics

Various evaluation metrics such as accuracy, precision, recall, F1 score, and AUC-ROC score are commonly employed to assess the performance of DL models in Alzheimer’s image classification. These metrics provide insight into a model’s ability to correctly identify AD from medical images, facilitating the optimization of algorithms for improved diagnostic accuracy. By using these evaluation metrics, researchers can ensure the reliability and effectiveness of their models in clinical settings (Tables 6, 7) [64].

Table 6 Confusion matrix for image classification
Table 7 Summary of evaluation metrics used in the relevant works in this review
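As a small illustration, the snippet below computes these metrics with scikit-learn on hypothetical binary AD vs. NC predictions; the label and score vectors are made up for demonstration.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (1 = AD, 0 = NC)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]   # thresholded model outputs
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted AD probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```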

4.2 Alzheimer’s Disease Diagnosis Using CNN

Al Shehri [65] proposed a DL-based solution for the diagnosis and classification of AD using brain MRI images. The study used two CNN architectures, DenseNet-169 and ResNet-50, to classify the images into four stages of AD: Non-Dementia, Very Mild Dementia, Mild Dementia, and Moderate Dementia. The study reports that DenseNet-169 outperformed ResNet-50 in both training and testing accuracy, achieving 97.7% and 83.82%, respectively. The study claims that the proposed model is usable for real-time analysis and classification of AD, and it contributes to the field of computer-aided diagnosis by applying DL techniques to a challenging and important problem.

Kanghan Oh et al. [66] employed a volumetric CNN to extract features from the images and classify them into four categories: normal control (NC), progressive mild cognitive impairment (pMCI), stable mild cognitive impairment (sMCI), and Alzheimer’s disease (AD). The study used convolutional autoencoder (CAE)-based unsupervised learning for the AD vs. NC classification task, and supervised transfer learning for the other three tasks. The study also applied a gradient-based visualization method to identify the most important brain regions for the classification. The study reports that the proposed method achieved high accuracies of 86.60% and 73.95% for the AD and pMCI classification tasks, respectively, outperforming other network models. It also shows that the temporal and parietal lobes are the key regions for the diagnosis of AD and pMCI, contributing a robust and interpretable deep learning solution for computer-aided AD diagnosis.

Rajendiran et al. [67] examined the effectiveness of different deep CNNs, including AlexNet, VGG-16 Net, ResNet-50, and GoogLeNet models, for classifying the early stages of AD using the OASIS dataset. They compared their results with other existing methods, such as support vector machine (SVM), k-nearest neighbor (KNN), and random forest (RF). They reported that the GoogLeNet model achieved the highest performance among all the methods, with an accuracy of 97.54%, a precision of 97.67%, a recall of 97.54%, and an F1-score of 97.55%.

Sarang Sharma et al. [68] proposed a hybrid AI model, organized in two fundamental phases, that combines transfer learning (TL) and permutation-based ML voting classifiers. Two TL-based models, DenseNet-121 and DenseNet-201, are used in the first phase for feature extraction, and three different ML classifiers, SVM, Naive Bayes, and XGBoost, are used in the second phase for classification. The voting mechanism is permuted in order to evaluate the final classifier results. They performed the AD vs. NC classification task and reported that their method achieved an accuracy of 91.75%, a specificity of 96.5%, and an F1-score of 90.25%.

Using 6400 labeled MRI scans from two classes, Tayyaba Shahwar et al. [69] proposed a hybrid classical-quantum neural network to identify AD. A classical neural network extracts high-dimensional features, whose informative feature vectors are then embedded in a quantum processor. ResNet34 extracts a 512-dimensional feature vector from each image, which is fed into a quantum variational circuit (QVC) that produces a four-dimensional feature vector used to form precise decision boundaries.

The Adam optimizer, using first- and second-order gradients, exploits an adaptive learning rate for each parameter. Additionally, several quantum simulators (PennyLane, qiskit.aer, and qiskit.basicaer) are used for the detection of demented and non-demented images to validate the model. The learning rate is set at \(10^{-4}\) with an optimized quantum depth of six layers.

Sheela et al. [70] created a model that uses a Gaussian filter to remove unwanted noise, Otsu thresholding to segment the images, Prewitt edge detection to find edges, the gray-level co-occurrence matrix (GLCM) to extract features, fuzzy c-means (FCM) clustering to group the images, and a CNN for the final classification. They reported that their model achieved an accuracy of 90.25% for the classification of AD vs. NC (normal control).

Ghazal et al. [71] proposed a system for AD detection using transfer learning for multi-class classification of brain magnetic resonance imaging (MRI), classifying images into four categories: mild dementia (MD), moderate dementia (MOD), non-demented (ND), and very mild dementia (VMD). They reported that their system achieved an accuracy of 91.70% for the four-class classification.

Nancy et al. [72] presented an artificial neural network combining feed-forward and back-propagation. The input layer passes values to the hidden layers, and each hidden layer extracts signals from the previous layer using its own weights and bias function. These values reach the output layer, where an activation function produces the output. A threshold value is then used to determine whether each output is above or below the cutoff.

Using convolutional classifiers and Group Grey Wolf Optimization-based features, Shankar et al. [73] proposed a method for detecting AD.

Sisodia et al. [74] proposed deep transfer learning models applied to MRI (magnetic resonance imaging) images to detect the various stages of AD: "Very-Mild-Demented," "Mild-Demented," "Moderate-Demented," and "No-Demented." By applying data preprocessing and augmentation techniques, the model is able to identify the appropriate AD class. The early stages of AD are then classified and predicted using additional deep transfer learning models (ResNet50, VGG19, Xception, DenseNet201, and EfficientNetB7).

Bhagat et al. [75] identified various stages of AD using transfer learning with pre-trained health data classification models such as MobileNet.

A deep learning-based method that can predict MCI, early MCI (EMCI), late MCI (LMCI), and AD was proposed by Odusami et al. [76]. For testing, the 138-subject Alzheimer’s Disease Neuroimaging Initiative (ADNI) fMRI dataset was used.

Ebrahimi et al. [77] used a CNN called ResNet-18 that had been trained on an ImageNet dataset. The temporal convolutional network (TCN) and various recurrent neural network types were the sequence-based models used. For AD detection, a number of deep sequence-based models and configurations were used.

In [78], they put forth AlzheimerNet, a refined convolutional neural network (CNN) classifier that can distinguish between the Normal Control (NC) category and all five stages of AD. The MRI scan dataset from the ADNI database is acquired for the proposed model’s training and testing. They used the CLAHE image enhancement technique to get the raw data ready for analysis. The unbalanced nature of the dataset was corrected using data augmentation, and the resulting dataset contained 60,000 image data across the 6 classes.

In [79], they specifically used images that are segmented by the brain’s gray matter (GM) to illustrate how transfer learning and convolutional neural network (CNN) customization can be used to segment and classify magnetic resonance imaging (MRI) of AD. They applied transfer learning after using a pre-trained DL model as their base model.

For the classification task, [80] used three pre-trained networks, GoogLeNet, AlexNet, and ResNet-18, which were trained and tested using 6000 images gathered from the ADNI database.

In [81], they employed an entirely CNN-based architecture to detect AD using brain MRI images from the ADNI dataset. The proposed process is divided into two main components: the first is feature extraction, and the second is classification of Alzheimer’s disease into stages.

In [82], they proposed an AlexNet-quantum transfer learning technique that uses a magnetic resonance imaging (MRI) dataset to diagnose neurodegenerative illnesses. The hybrid model is created by using a traditional pre-trained AlexNet model to extract a meaningful feature vector from high-dimensional data and then feeding this vector to a quantum variational circuit (QVC). The classical pre-trained AlexNet model extracts 4096 features from the MRI dataset, and this vector is then input to the quantum circuit. To solve the binary classification task for a brain illness, the QVC generates a 4-dimensional vector and converts it into a 2-dimensional vector through a fully connected layer at the end.

Balaji et al. [83] presented a hybrid deep learning approach for early AD detection. Magnetic resonance imaging (MRI), positron emission tomography (PET), and conventional neuropsychological test results are combined in a strategy for early AD identification using multimodal imaging and a convolutional neural network with the Long Short-Term Memory algorithm. The proposed methodology updates the learning weights, and accuracy is increased via the Adam optimizer.

In [84], a CNN was proposed to discriminate between patients with AD and those who had not yet received an AD diagnosis. The CNN’s size, number of layers, and number of convolutional filters were adjusted and optimized based on the Matthews correlation coefficient (MCC).

Fig. 13 presents a summary of the research conducted on AD diagnosis using CNN architectures, allowing a comparison of the performance of the various models.

Fig. 13
figure 13

Alzheimer’s disease diagnosis using CNN architectures

4.3 Alzheimer’s Disease Diagnosis Using ViT

In [85], they used resting-state functional magnetic resonance imaging (rs-fMRI) and structural magnetic resonance imaging (sMRI) data, extensively preprocessed by their pipeline, to separate healthy adults, those with mild cognitive impairment, and those with Alzheimer’s within the same age group (\(>75\) years). They then introduced an optimized vision transformer architecture to predict group membership.

In [86], they set out to research an MCI-to-AD prediction approach that combines structural magnetic resonance imaging (sMRI) and Vision Transformers (ViT). It was possible to predict the development of MCI participants into AD using the 598 MCI subjects in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Their study has three main goals: (i) to propose an MRI-based Vision Transformers approach for categorizing MCI-AD progression, (ii) to assess the effectiveness of various ViT architectures in order to select the best one, and (iii) to identify the brain region that has the greatest influence on the prediction of the deep learning approach to MCI progression.

The main contribution of the study lies in applying a ViT model to classify AD using 18F-Florbetaben (FBB) brain images [87]. The ViT model, which builds on the transformer architecture introduced in ’Attention Is All You Need,’ offers a novel approach to image classification by leveraging attention mechanisms. The architecture involves creating image patches, generating patch embeddings, and adding class tokens and positional embeddings, followed by linear projection for dimensionality reduction. By using ViT with FBB PET images, the study aims to enhance the accuracy of Alzheimer’s disease classification, potentially revolutionizing early detection and treatment strategies for this debilitating condition.

[88] proposed a ViT approach for the detection of AD, which stands out for its remarkable efficacy in accurate diagnosis of this neurodegenerative condition. This architecture involves converting input MRI images into a sequence of tokens suitable for the transformer model. The process includes tokenization, embedding each pixel into a token vector, incorporating positional encoding to capture spatial information, and integrating the positional encoding with the token embeddings to create the input sequence for the transformer. By leveraging this methodology, the study significantly advances the understanding and detection of AD through sophisticated artificial cognition techniques.

Table 8 summarizes and compares the different approaches used on each dataset. Accuracy serves as the common performance metric across all models, facilitating a direct comparison of their effectiveness.

Table 8 Alzheimer’s disease diagnosis using ViT

4.4 Alzheimer’s Disease Diagnosis Using Hybrid Models

The main contribution of the Dual-3DM-AD model is its innovative approach to the accurate and early diagnosis of AD by integrating MRI and PET image scans [89]. The model enhances image quality through preprocessing steps such as noise reduction and skull stripping, followed by 3D image conversion. It employs a Mixed-transformer with Furthered U-Net architecture for semantic segmentation, reducing complexity. Additionally, a multi-scale feature extraction module extracts relevant features from segmented images, which are aggregated using a densely connected feature aggregator module to leverage information from both modalities. A multi-head attention mechanism aids in feature dimensionality reduction, leading to multi-class Alzheimer’s diagnosis through a softmax layer. These architectural components collectively enhance the accuracy and effectiveness of the model in AD diagnosis.

The significant contribution of Mahim et al.’s [90] research lies in the creation of a novel hybrid deep learning model. This model combines a Vision Transformer (ViT) with a Gated Recurrent Unit (GRU) to enhance the detection and classification of Alzheimer’s Disease (AD) from brain MRI images. The ViT-GRU architecture effectively addresses challenges related to class imbalances and computational limitations in clinical settings. By capturing spatial and temporal information from diverse datasets, the model demonstrates improved generalizability across different populations and imaging protocols, thereby expanding its utility in AD diagnosis. Additionally, the incorporation of eXplainable Artificial Intelligence (XAI) techniques, such as LIME, SHAP, and Attention maps, enhances the interpretability of the model’s predictions, shedding light on the features influencing AD classification decisions.

In [91], they proposed a novel approach that combines the Wide Residual Squeeze-and-Excitation Network with the Cross ViT. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies (OASIS) provided the MRI data. They also proposed a new data augmentation method based on a self-attention progressive generative adversarial network to overcome the data’s limitations.

A network model (WS-AMN) that combines weak supervision and an attention mechanism was proposed by [92]. The base model is a weakly supervised data augmentation network in which the augmentation is guided by attention maps produced through weakly supervised learning. The residual network contains an attention module with a channel domain and a spatial domain, which concentrate on the distinct channels and spatial regions of the images, respectively. The location information reduces the influence of irrelevant features and enhances the corresponding related features.

In [93], they proposed a 3D medical image classifier using multiplane and multislice transformer (M3T) networks to classify Alzheimer’s disease (AD) in 3D MRI images. The proposed network synergistically combines a 3D-CNN, a 2D-CNN, and transformers to achieve accurate AD classification. The 3D-CNN performs native 3D representation learning, while the 2D-CNN leverages weights pre-trained on large 2D databases for 2D representation learning. The CNNs, with their inductive biases, efficiently extract local information about AD-related abnormalities in regional brain areas. Transformer networks are then used to capture attentional relationships between the multiplane (axial, coronal, sagittal) and multislice images produced by the CNNs. Moreover, because the transformer lacks such inductive biases, the model can learn abnormalities that spread over a wider area of the brain.

In [94], they proposed two improved ResNet algorithms that introduce the Contextual Transformer (CoT) module, group convolution, and channel shuffling mechanisms into the traditional ResNet residual block. The CoT module replaces the 3\(\times \)3 convolution of the residual block and improves its feature extraction power. A channel shuffling mechanism reorganizes and refines the feature maps of different input groups, enabling communication between feature maps in different groups. They selected images of 503 subjects from the ADNI database, including 116 healthy controls (HC), 187 subjects with mild cognitive impairment (MCI), and 200 subjects with AD; after preprocessing and slicing the data, 10,060 slices were obtained, and the three groups (AD, MCI, and HC) were classified using the improved algorithm.

In [95], they used an established pipeline that included registration, slicing, and classification steps. Their contribution is a comparison of three currently promising deep convolutional models (ResNet, DenseNet, and EfficientNet) and two transformer-based architectures (MAE and DeiT) for mapping input images to clinical diagnoses. To allow a fair comparison, experiments were performed on two public datasets (ADNI and OASIS), using multiple benchmarks obtained by varying the number of slices per subject extracted from the available 3D voxels. The experiments showed that very deep ResNet and DenseNet models outperform the shallower ResNet and VGG versions tested in the literature. They also found that the transformer architectures, especially DeiT, provided the best classification results and were more robust to the noise added by increasing the number of slices.

In order to classify neurodegenerative diseases (NDs), [96] created ViT models using the AD Neuroimaging Initiative (ADNI) dataset. Additionally, they used attention maps and the Automated Anatomical Labeling (AAL) brain atlas to create explainable models.

In [97], they proposed a new method for Alzheimer’s disease image classification based on an external attention mechanism. An external attention module is added after the fourth convolutional block in a fully convolutional network model. At the same time, a double normalization method combining Softmax and the L1 norm is introduced to improve classification performance and yield richer feature information for disease probability maps. The softmax activation function increases the neural network’s fit to the training set and introduces nonlinearity, making the network more flexible. The L1 norm prevents the attention map from being dominated by particularly large (or particularly small) eigenvalues. Experiments in their study used 550 3D MRI images with 5-fold cross-validation.

In [98], they proposed a densely connected CNN with per-connection attention mechanisms to learn multi-level features of brain MR images for AD classification. They used densely connected neural networks to extract multi-scale features from preprocessed images, applied a connection-oriented attention mechanism to combine connections between features in different layers, and hierarchically transformed the MR images into more compact, high-level features. Furthermore, they extended the convolution operation to 3D to capture the spatial information of MRI. Features extracted from each 3D convolutional layer were merged with features from all previous layers at different levels of attention and finally used for classification. Their method was evaluated using baseline MRI of 968 subjects from the ADNI database, including (1) AD patients and healthy subjects, (2) MCI converters and healthy subjects, and (3) MCI converters and non-converters.

A novel diagnostic model for AD based on multiview slice attention and 3D convolutional neural networks (3D-CNN) was presented in [99]. The approach uses multiple subnetworks to extract local slice-level features in various dimensions. A slice-level attention mechanism highlights specific 2D slices and eliminates redundant features. To capture global structural changes at the subject level, a 3D CNN is employed. All 2D and 3D features are then merged to create a more sophisticated representation. The model was evaluated on 1,451 subjects from the ADNI-1 and ADNI-2 datasets.

In [100], they proposed a novel attention-based 3D multiscale CNN model (AMSNet) that can better capture and integrate multiple spatial features of AD with a concise structure. AMSNet also generalizes to other AD-related classification tasks, such as the three-way classification (AD-MCI-CN). Their results demonstrate the feasibility and efficiency of the proposed multi-scale spatial feature integration and attention mechanism used in AMSNet for AD classification.

Wang et al. [101] first explore the use of attention mechanisms and suggest a VGG-inspired network (VIN) as the backbone. Convolutional block attention modules were integrated into the VIN backbone to form their proposed design, the Alzheimer’s Disease VGG-Inspired Attention Network (ADVIAN). To prevent overfitting, 18-way data augmentation is also employed, and ten runs of 10-fold cross-validation are completed to report unbiased performance.

In their research [102], they investigate an attention-based convolutional neural network (CNN) model for Alzheimer’s disease diagnosis and classification. Their study introduced attention mechanisms at the regional and feature levels, fusing information from brain MRI images across several levels to determine the correlation between slices. By building a spatio-temporal graph CNN with dual attention mechanisms, the network model was made more sensitive to salient channel features while reducing the impact of noise features.

An ADGNET-based weakly supervised deep learning system was used in [103] to classify Alzheimer’s disease. It consists of a backbone network with an attention mechanism and a task network for concurrent image classification and image reconstruction, enabling the recognition and classification of Alzheimer’s disease with few annotated images. The backbone network is based on the ResNet-50 architecture, modified to add an attention mechanism. One branch of the task network handles image classification, while the other handles image reconstruction.

Liu [104] developed a new method called MPC-STANet for Alzheimer’s disease recognition, based on multiple phantom convolutions and a spatial transformation attention mechanism. MPC-STANet’s network design is an improved version of ResNet50, a deep neural network architecture. The process employs several phantom convolutions to extract features from brain images and then applies a spatial transformation attention mechanism to highlight the elements most important for identifying Alzheimer’s disease. This method has demonstrated promising results in accurately diagnosing Alzheimer’s disease.

Wang et al. [105] created a deep learning method called IGnet to automatically classify AD cases using genetic sequencing and magnetic resonance imaging (MRI) data. The proposed method combines computer vision (CV) and natural language processing (NLP) approaches: a deep three-dimensional convolutional network (3D CNN) handles the three-dimensional MRI input, while a Transformer encoder handles the genetic sequence input. The approach has been tested on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (Table 9).

Table 9 Alzheimer’s disease diagnosis using hybrid models

5 Discussion

This section focuses on the significance of early diagnosis in the treatment and management of Alzheimer’s disease. Notably, image classification has seen remarkable advancements through deep learning algorithms, such as convolutional neural networks (CNNs) and Vision Transformers (ViTs). By combining these algorithms, we can improve the accuracy of classification and gain a deeper understanding of the visual data linked to Alzheimer’s.

5.1 Strengths of CNNs for Alzheimer’s Image Classification

CNNs have emerged as powerful tools for classifying AD using medical imaging data. They possess the ability to automatically extract relevant features from raw images, are robust to spatial transformations, and use parameters efficiently, making them well-suited for this task. Furthermore, by employing transfer learning with pretrained CNNs, performance can be improved even when working with smaller datasets. Deeper layers within CNNs capture hierarchical representations, aiding in the identification of complex brain abnormalities that are associated with AD.

5.1.1 CNN Configuration Hyperparameters

When designing a CNN for Alzheimer’s image classification, it is essential to consider several hyperparameters. These choices are influenced by the inherent strengths of CNNs discussed earlier, which guide the configuration decisions (Table 10):

Fig. 14 presents a summary of hyperparameter configurations used in CNN architectures for Alzheimer’s disease image classification. It includes information such as the number of layers, pooling size, learning rate, batch size, dropout rate, optimizer, weight decay, activation function, number of epochs, and data splitting. These hyperparameters play a crucial role in the performance and optimization of CNN models for AD classification.

Table 10 CNN configuration hyperparameters
Fig. 14
figure 14

Hyperparameter configurations in CNN architectures used for AD image classification

5.2 Strengths of ViT for Alzheimer’s Image Classification

ViTs offer a new approach to global context modeling, enhancing feature representation and providing a holistic perspective on pathological changes. They can also significantly reduce GPU memory usage, making them attractive for Alzheimer’s disease diagnosis, and in some settings they outperform traditional CNNs.

5.2.1 ViT Configuration Hyperparameters

There are several parameters or variables that affect the performance of the ViT model for AD image classification (Table 11). Here are some of the most important:

Table 11 ViT model parameters

Fig. 15 illustrates a summary of hyperparameter configurations used in ViT architectures for Alzheimer’s disease image classification. These hyperparameters are pivotal in optimizing the performance of ViT models for Alzheimer’s disease classification tasks.

Fig. 15
figure 15

Hyperparameter configurations in ViT architectures used for AD image classification

5.3 Benefits of CNNs and Vision Transformer Models fusion for Alzheimer’s Image Classification

The fusion of CNNs and ViT models in image classification tasks, particularly Alzheimer’s image classification, holds significant promise. This approach combines the strengths of both architectures, leveraging the feature extraction capabilities of CNNs and the self-attention mechanisms of ViT models to enhance classification accuracy and performance. While ViT-based models have not yet been extensively applied to Alzheimer’s disease classification, owing to initial challenges in achieving impressive results in medical imaging tasks, recent advancements indicate the potential benefits of integrating ViT models with CNNs for improved classification outcomes.

To tune hyperparameters effectively for a CNN-ViT fusion in Alzheimer’s image classification, researchers should adopt a structured approach. This involves defining a search space by setting ranges for each hyperparameter according to prior knowledge and the specific model’s requirements. Key hyperparameters such as the learning rate, batch size, optimizer choice, and regularization parameters need careful selection because of their significant impact on model performance. Search methods such as grid search, random search, or Bayesian optimization can then efficiently explore the hyperparameter space to identify optimal values for the fusion model (a random-search sketch follows below).

Through systematic experimentation with various hyperparameter configurations, evaluated using metrics such as accuracy, precision, recall, and F1-score, researchers can assess the influence of each hyperparameter on classification performance and thereby enhance the accuracy and overall performance of CNN-ViT fusion models for Alzheimer’s image classification [106] (Table 12).
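A minimal random-search sketch over such a space is shown below; the hyperparameter ranges are illustrative, and train_and_evaluate is a hypothetical stand-in for training the fusion model and returning a validation score.

```python
import random

# Illustrative search space for the fusion model's key hyperparameters.
search_space = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [16, 32, 64],
    "optimizer": ["adam", "adamw", "sgd"],
    "weight_decay": [0.0, 1e-4, 1e-2],
}

def train_and_evaluate(config):
    # Hypothetical stand-in: train the CNN-ViT fusion model with `config`
    # and return validation accuracy. A random score keeps the sketch runnable.
    return random.random()

best_config, best_acc = None, 0.0
for _ in range(20):  # 20 random trials
    config = {key: random.choice(values) for key, values in search_space.items()}
    acc = train_and_evaluate(config)
    if acc > best_acc:
        best_config, best_acc = config, acc
print(best_config, best_acc)
```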

Table 12 Benefits of CNNs and vision transformer models fusion

5.4 DL Challenges in Alzheimer’s Disease Imaging

When delving into DL challenges specific to Alzheimer’s disease imaging, it is imperative to address the complexities of merging CNNs and ViT models. The fusion of these architectures presents significant hurdles, notably heightened computational complexity due to their distinct computational demands. While CNNs are renowned for their spatial hierarchy in image processing, ViT models excel in capturing long-range dependencies through self-attention mechanisms. This integration can lead to a more computationally intensive model, necessitating additional resources for training and inference.

Moreover, the intricate process of harmonizing features extracted by CNNs with tokenized embeddings of ViT models poses a notable challenge in achieving seamless information flow while maintaining model efficiency and interpretability. Striking a balance between the strengths of both architectures without introducing redundancies requires meticulous design and optimization.

Addressing these challenges effectively involves optimizing the fusion process, exploring efficient methods to combine CNN and ViT features, and potentially leveraging techniques like knowledge distillation or architecture pruning to alleviate computational overhead. By acknowledging and actively working to overcome these limitations, researchers can advance the field of hybrid CNN-ViT models towards more practical and scalable solutions for complex image analysis tasks [107, 108].

  1.

    Limited data: The scarcity of image databases for the disease can be attributed to many issues, such as the complexity and variability of AD, the difficulty and expense of obtaining and curating high-quality medical images, and strict privacy and confidentiality requirements [109]. Limited data can severely restrict the efficacy and applicability of ML models, resulting in overfitting and poor performance on new data [110]. Collaboration between researchers, physicians, and data-sharing platforms is crucial to addressing these issues, as it facilitates the acquisition, annotation, and exchange of high-quality imaging data related to Alzheimer’s disease. This may aid in the creation of stronger, more realistic models that more fully represent the intricacy of the illness.

    Potential solutions for addressing the challenges in Alzheimer’s disease imaging data include:

    • Transfer learning: Employing well-trained models from similar tasks and fine-tuning only the final layers (often just one or two) on the limited original data [111] (see the sketch after this list).

    • Data augmentation is the process of adding more images to the training set by mirroring, rotating, and translating them. However, in some situations, such as with bioinformatics data, caution must be exercised while using this technique [112,113,114].

    • Simulated data: If the problem is well understood, building simulators based on the physical process can help expand the amount of simulated data in the training set [112].
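As a sketch of the transfer-learning option above, the following PyTorch/torchvision code loads an ImageNet-pretrained ResNet-18, freezes the backbone, and fine-tunes only a new final layer for a binary AD vs. NC task; the backbone choice, class count, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for param in model.parameters():
    param.requires_grad = False  # freeze the pretrained backbone

# Replace the head with a new, trainable layer for AD vs. NC.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are optimized during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```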

  2.

    Data pre-processing is a crucial step in the development of deep learning models for accurate and efficient analysis of brain imaging data [115]. Proper architectural design and pre-processing steps are essential to ensure that the data is in a format that the model can accept and process effectively. Some of the key steps and considerations in data pre-processing for DL models include:

    • Data cleaning: This involves removing inconsistencies, missing values, and noise from the data to ensure data quality and consistency.

    • Data transformation: This step involves converting the data into a suitable format for the model, such as resizing images or normalizing data.

    • Data reduction: This step involves dimensionality reduction techniques to reduce the number of features while preserving the essential characteristics of the data, which can help improve the performance of deep learning models.

    • Data normalization: This step involves standardizing the data by subtracting the mean and dividing by the standard deviation, which can help improve the convergence of the model and reduce overfitting (a sketch of the transformation and normalization steps follows this list).
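A brief torchvision sketch of the transformation and normalization steps is given below; the target resolution and the mean/std values are placeholders that should be computed from the actual training set.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                   # transformation: resize
    transforms.Grayscale(num_output_channels=1),     # single-channel MRI slices
    transforms.ToTensor(),                           # scale pixels to [0, 1]
    transforms.Normalize(mean=[0.18], std=[0.24]),   # z-score normalization
])
# `preprocess` would be applied to each PIL image before feeding the model.
```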

  3.

    Class imbalance: Class imbalance in Alzheimer’s disease datasets can pose a challenge for ML models, leading to difficulties in differentiating between classes and biased predictions [116]. To address this issue, various techniques can be employed to rebalance the dataset and enhance the model’s ability to learn from all classes. Some of these techniques include:

    • Oversampling and undersampling: These methods rebalance the distribution of images among classes: oversampling creates additional copies of examples from underrepresented classes, whereas undersampling selects fewer examples from overrepresented classes.

    • Weighted cross-entropy loss: A weighted cross-entropy loss function assigns larger weights to the loss from underrepresented classes, pushing the model to learn more accurate representations for those classes even when they contain few examples [117].

    • Building models for hierarchical levels: Because biological systems often have a hierarchical label space, models can be built at each hierarchical level, which may help mitigate the problem of unbalanced data in DL models.
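
    The sketch below illustrates the first two techniques in PyTorch; the class counts and labels are placeholders rather than values from any dataset discussed here.

    ```python
    # Hypothetical sketch: rebalancing via loss weights and oversampling (PyTorch).
    import torch
    import torch.nn as nn
    from torch.utils.data import WeightedRandomSampler

    class_counts = torch.tensor([640.0, 210.0, 150.0])  # assumed CN/MCI/AD counts

    # Weighted cross-entropy: rarer classes contribute more to the loss.
    class_weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    # Oversampling: draw samples with probability inversely related to class size.
    labels = torch.randint(0, 3, (1000,))               # placeholder labels
    sampler = WeightedRandomSampler(weights=class_weights[labels],
                                    num_samples=len(labels), replacement=True)
    # Passing `sampler=sampler` to a DataLoader then rebalances each mini-batch
    # without physically duplicating the underlying images.
    ```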

  4. Variability in images: Images of Alzheimer’s disease can vary in quality, resolution, and modality (e.g., MRI, PET). As a result, it may be difficult for the model to learn features that are robust to these variations.

    • [27]: This article addresses the difficulty of differentiating between asymptomatic and non-Alzheimer’s mild cognitive impairment (MCI) groups and emphasizes the role of brain imaging modalities in the diagnosis of Alzheimer’s disease.

    • [118]: This study emphasizes the need for medical image processing and analysis techniques to support Alzheimer’s disease diagnosis, and it covers ML and DL approaches for extracting image features that can improve diagnostic accuracy.

    • [119]: This work focuses on multimodal DL models for the early detection of Alzheimer’s disease, emphasizing the importance of combining several imaging modalities, such as MRI and PET, together with genetic data, to provide a more thorough assessment of the illness.

  5. Interpretability: DL models are frequently referred to as "black boxes," which makes it challenging to understand how a model generates its predictions; this is a serious concern in medical applications, where interpretability is crucial. Researchers have been exploring potential solutions to this issue (a minimal saliency-map sketch follows the references below). Some relevant references include:

    • [120]: This study introduces an interpretable neural network model that can assist in the noninvasive and cost-effective diagnosis of Alzheimer’s disease, highlighting the potential of such models to improve patient outcomes.

    • [121]: This study reviews the current state of interpretable ML for dementia, including Alzheimer’s disease, and highlights the need to involve clinicians in future research to validate explanation techniques and draw firm conclusions about the pathology of dementia-related diseases.

    • [122]: This work reports a DL pipeline that identifies distinct neuropathologies in Alzheimer’s disease and highlights the construction of interpretable models suited to neuropathology, which is vital for understanding the disease.
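
    As a minimal illustration of one widely used family of techniques (not the specific methods of the studies above), the sketch below computes a gradient-based saliency map that highlights which input pixels most influence a prediction; the untrained backbone and random input are placeholders for a trained AD classifier and a real scan.

    ```python
    # Hypothetical sketch: gradient-based saliency for a CNN classifier (PyTorch).
    import torch
    from torchvision import models

    model = models.resnet18(weights=None)   # placeholder for a trained classifier
    model.eval()

    image = torch.randn(1, 3, 224, 224, requires_grad=True)  # placeholder scan
    scores = model(image)
    scores[0, scores.argmax()].backward()   # gradient of the top class score

    # Large-magnitude gradients mark pixels with the strongest influence on the
    # prediction; overlaying the map on the scan shows where the model "looked".
    saliency = image.grad.abs().max(dim=1).values.squeeze()  # (224, 224) map
    ```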

  6. Ethical considerations: The classification of Alzheimer’s images raises ethical questions about patient privacy and potential misuse of the technology [123]. These concerns must be taken into account when creating and deploying such models [124, 125]. Researchers have been exploring potential solutions to these ethical challenges. Some relevant references include:

    • [126]: This article covers the ethical issues of transparency, fairness, privacy, and data protection that arise from the use of AI in speech-based Alzheimer’s detection. The authors also offer potential solutions for addressing these issues.

    • [127]: This paper covers the ethical ramifications of using DL models to diagnose Alzheimer’s disease in people who do not yet exhibit symptoms. The authors stress the importance of data protection, privacy, and informed consent in the development and application of these models.

    • [128]: This paper addresses the ethical dilemmas raised by the application of AI and digital biomarkers to Alzheimer’s disease diagnosis and therapy. The authors stress the importance of transparency, equity, and privacy in the development and deployment of these technologies.

  7. Clinical relevance: Beyond accurately classifying Alzheimer’s disease, it is crucial to consider how the model’s predictions can be applied to improve patient outcomes. Collaboration between researchers, clinicians, and patients is required to guarantee that the model is used in an ethically sound and clinically meaningful manner.

The benefits and drawbacks of using DL models for Alzheimer’s disease diagnosis and classification have practical implications for their application in real-world scenarios. On the one hand, DL models can offer advantages such as high accuracy, automation, scalability, and generalization, which can facilitate the early detection and intervention of Alzheimer’s disease, reduce the workload and bias of human experts, and enable the analysis of large and heterogeneous data sources. On the other hand, DL models also face challenges such as data quality and availability, interpretability and explainability, ethical and social issues, and technical and computational limitations, which can hinder their adoption and validation in clinical settings, affect their trustworthiness and reliability, and raise concerns about their impact on patients’ privacy, autonomy, and well-being. Therefore, it is important to address these challenges and balance these trade-offs when developing and deploying DL models for AD, and to ensure that they are aligned with the best practices and standards of the medical domain.

  • The accuracy and automation of DL models could reduce the need for costly and invasive diagnostic techniques, such as lumbar punctures or PET scans, which are currently used to confirm the diagnosis of AD [129].

  • Another practical implication of the scalability and generalization of DL models is that they could enable the integration and analysis of multimodal data, such as MRI, PET, and cognitive tests, which can provide a more comprehensive and personalized assessment of the disease progression and prognosis [130].

  • Moreover, a practical challenge of the interpretability and explainability of DL models is that they often act as black boxes: their decisions and predictions are not transparent or understandable to human experts or patients, which can undermine trust and acceptance.

5.4.1 Illustrative Case Studies of Alzheimer’s Disease

Case studies offer valuable insights into AD, providing examples that illustrate different aspects of the disease. For instance, one case study examines a man with a rare resilience to AD, another focuses on a 63-year-old woman with early-onset AD, and a third describes a 37-year-old male patient, also with early-onset AD. Fictitious AD case studies have also been incorporated into teaching materials.

  • Rare Resilience to Alzheimer’s Disease: An NIH-funded case study examined a man who, despite carrying a rare genetic form of early-onset Alzheimer’s disease, showed little difficulty with verbal learning or language. Given his unfavorable genetic odds, the study sought to explain how he avoided dementia for so long (https://www.nia.nih.gov/news/case-study-unlocks-clues-rare-resilience-alzheimers-disease).

  • Early-Onset Alzheimer’s Disease: A case report documented the clinical history, autopsy results, and brain histology of a 63-year-old woman with early-onset Alzheimer’s disease. The study emphasized the substantial effects of Alzheimer’s on patients’ brains as well as the genetic abnormalities linked to early-onset disease [131].

  • Distinct Striatal Amyloid Retention: Another case study concerned a 37-year-old male patient with early-onset AD. The article described his progressive cognitive impairment, the diagnosis established with beta-amyloid imaging, and the limitations of the case study, such as the absence of genetic testing for the patient [132].

These case studies offer valuable insights into the diverse manifestations of Alzheimer’s disease, including rare resilience, early-onset cases, and the use of imaging techniques for diagnosis. By examining specific examples, researchers and healthcare professionals can gain a deeper understanding of the disease and explore potential opportunities for intervention and treatment.

6 Conclusion

In this review, we have provided a comprehensive overview of the use of deep learning models, especially Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), for classifying Alzheimer’s disease (AD) from medical images. The review highlights the potential of deep learning models to improve the accuracy of current diagnostic methods and showcases the promise of both CNN and ViT models in accurately classifying AD from medical images. We have discussed the challenges and considerations for AD classification using deep learning, compared the strengths and weaknesses of each approach, and emphasized their potential in improving AD diagnosis. The discussion section critically analyzes the advantages and limitations of combining CNN and ViT models, shedding light on their potential implications for AD diagnosis and treatment.

Our review aligns with the current trend of employing deep learning methods, which have shown superior performance in handling complex three-dimensional data and have achieved great success in AD diagnosis. The insights provided in this article contribute to the understanding of the current state and potential directions for future research in the field of deep learning-based AD classification from medical images. Throughout the review, the focus is on deep learning and image classification concepts related to the specific problem of AD classification.

In conclusion, this review provides an insightful overview of the current state and potential future directions in using deep learning models for Alzheimer’s disease (AD) classification. It carefully considers the challenges, including dataset biases, interpretability of models, scalability issues, and robustness to variations in imaging modalities, offering a balanced perspective. By acknowledging these challenges, the review aims to foster informed decision-making and guide future research effectively. Additionally, it underscores the importance of addressing ethical considerations and enhancing trust among stakeholders. Through recommendations for improving interpretability, robustness, and fairness of deep learning models for AD classification, the review aims to contribute to the advancement of AD research and clinical practice.