1 Introduction

Alzheimer's disease (AD) is a severe, non-communicable neurological condition characterised by the progressive loss of brain tissue and nerve cells. This cell death is associated with TAU proteins, which normally stabilise the microtubules; their dysfunction results in improper guidance or blockage of molecules and essential nutrients to the axons, causing dementia. Dementia affects a person's thinking, behaviour, and other social abilities. Broadly, dementia caused by AD is graded into four stages: non-demented, very mild, mild, and moderate. As AD progresses from very mild to moderate, its impact on functions such as the heart muscles and the respiratory system becomes severe and can even lead to death. The incidence of AD has been increasing significantly across the world over the past few decades. The WHO [1] recently reported that, of the 55 million people identified with dementia, 60% to 70% suffer from AD. Several studies have been carried out to diagnose AD using different biomarkers such as blood samples, MRI scans, CT, or PET reports. Although there is currently no cure for AD, early-stage treatment with the right drug can lessen the disease's severity. Therefore, detecting AD early is crucial to improving patients' quality of life.

The vast majority of currently used techniques for diagnosing AD rely on time-consuming manual evaluation by medical professionals. Computer vision based on machine learning is therefore being developed to assist doctors in the diagnosis process. Many recent studies use machine learning methods, particularly deep learning strategies, to offer a variety of alternatives for AD identification from MRI scans. Most state-of-the-art models follow either transfer-learning-based [2,3,4,5,6,7] or non-transfer-learning-based approaches [8,9,10,11,12,13,14,15,16,17,18]. VGG-16, AlexNet, ResNet, and GoogLeNet are some of the models frequently used in transfer-learning techniques for AD detection. In non-transfer-learning approaches, on the other hand, authors have built a number of models on deep learning architectures such as CNNs, autoencoders, GANs, RNNs, and LSTMs. The main objective of all the existing techniques is to exploit high-level spatial features to improve the accuracy of AD detection. However, these techniques pay no attention to where the important information lies within each of the high-level spatial feature maps extracted by the deep models. Although this drawback can be avoided by using a Convolution Spatial Attention Block (C-SAB), that technique still struggles to produce a strong spatial attention map (SAM): the existing solution relies only on max- and average-pool values to obtain the SAM, which can be improved by adding other features. In addition, there is also a lack of techniques that exploit scale-invariant features for AD detection from MRI scans. Thus, the shortcomings can be summarised as:

  • Lack of attention to high or low-level spatial features.

  • The existing spatial block attention mechanism suffers from limited representation as far as the generation of SAM is concerned.

  • The literature also lacks scale-invariant feature extraction for the diagnosis of AD.

Therefore, this study presents a unique deep-learning solution for AD diagnosis that addresses the aforementioned shortcomings through multiscale feature modelling using an improved spatial attention-guided depth-separable CNN. The proposed model improves the Convolution Spatial Attention Block (C-SAB) by introducing an Improved Spatial Attention Block (I-SAB). The I-SAB is designed to enhance the spatial attention map by extracting several feature cues from the input and employing a stack of depth-separable CNN layers with a skip connection. The proposed I-SAB is plugged into different layers of a multilayer depth-separable CNN to obtain multiscale enhanced spatial attention maps. These maps are used to obtain spatially guided multiscale features. All such features are fused and concatenated into multiscale, scale-invariant features, which are fed into a multilayer perceptron for classifying the multiple classes of dementia caused by AD. The contributions can be summarised as follows:

  • An I-SAB is designed to enhance the spatial attention map.

  • Multilayers of depth separable CNN are designed to exploit multiscale features.

  • The I-SABs are plugged into different layers of the backbone to extract enhanced spatial attention-guided multiscale features, which are used to predict AD using a multilayer neural network.

The remainder of the article is structured as follows: Sect. 2 discusses the literature review; Sect. 3 describes the proposed method and model; Sect. 4 illustrates dataset information and performance measures; Sect. 5 discusses experimental setup and results analysis; Sect. 6 illustrates the ablation study; and Sect. 7 presents the conclusion.

2 Literature Study

Machine learning approaches are used overwhelmingly for AD detection. Current research trends also show the development of various deep learning models, not only for AD detection but also in other research domains such as text classification using NLP [19], object detection [20], and breast cancer detection [21]. These models can be broadly categorised into two groups: transfer learning-based models and non-transfer learning-based models. The taxonomy of AD detection is illustrated in Fig. 1 and discussed in the following subsections.

Fig. 1
figure 1

Taxonomy of deep learning techniques for AD detection

2.1 Transfer Learning-Based AD Detection

Many of the existing transfer learning-based techniques reuse the weights of pre-trained baseline models such as VGG16, ResNet, GoogLeNet, AlexNet, and Inception. To note a few, Shahwar et al. [5] extracted 512-dimensional feature vectors from ResNet34 and fed them to a quantum variational circuit for AD detection. In a study by Naz et al. [3], the authors used frozen features of 11 pre-trained baseline models for the detection of AD and identified that VGG-16 outperforms the other models. Ghazal et al. [7] utilised AlexNet to exploit object-level features for determining the presence of AD. Deepa et al. [2] applied an arithmetic optimisation algorithm to a fully connected neural network for AD detection; the inputs to the neural network are features from a pre-trained VGG-16, which yielded better detection accuracy. Knowledge transferred from deep models trained on a similar research domain tends to produce similar feature maps and better results. However, while the main objective of the aforementioned approaches is to extract features from MRI images, they rely on models pre-trained on samples other than medical images, which can produce vague results.

To overcome such limitations, fine-tuning these models is of the utmost importance. Chui et al. [6] focused on extracting transfer learning-based features by fine-tuning the hyperparameters of a CNN-based network for AD detection; however, the detection accuracy still needs improvement. Shamrat et al. [4] fine-tuned baseline pre-trained architectures such as ResNet50, MobileNetV2, VGG16, AlexNet, and InceptionV3 and found that, with an accuracy of 96.32%, InceptionV3 performs better than the other baseline models. Nevertheless, the lack of enhanced attention-guided feature modelling and the limited performance remain major shortcomings of these baseline models.

2.2 Non-Transfer Learning-Based AD Detection

These models use a variety of architectures built with deep learning methods such as generative models, LSTMs (RNNs), and CNNs. Owing to their remarkable ability to extract spatial semantic characteristics from MRI images, CNN-based architectures have been researched extensively. A multilayer CNN topology was proposed by Murugan et al. [9], which achieved significantly better accuracy on a balanced dataset. A unified model using three deep networks was proposed by Orouskhani et al. [18], with an integrated VGG-16 structure serving as the model's backbone. Houria et al. [14] designed multiple convolution layers for AD prediction. Bandyopadhyay et al. [11] designed an artificial neural network with multiple layers of perceptrons for AD detection; however, its detection accuracy needs improvement. Recently, Lahmiri et al. [16] proposed a hybrid model in which a CNN is used for feature extraction followed by a KNN with Bayesian optimisation for classifying AD stages, although the validation dataset is limited in sample size. In contrast, Abbas et al. [8] addressed the need to locate discriminant landmark areas for AD diagnosis and improved the performance through a Jacobian-domain CNN. Sequential models have also been designed for AD detection. Hajamohideen et al. [13] used a deep CNN with a Siamese foundation to detect AD and optimised the network using the triplet loss function. An LSTM-based two-stage deep learning algorithm was proposed by El-Sappagh et al. [15], which greatly improved the detection accuracy. Lee et al. [10], on the other hand, suggested a multimodal deep learning architecture employing a recurrent neural network to predict AD from several biomarkers. However, such sequential models are intended for time series or sequential data, whereas the available datasets consist of static MRI brain images.

In the literature, autoencoders have also been employed, in addition to multilayer CNN structures, to enhance accuracy by utilising fine-grained abstract information. Ansingkar et al. [17] utilised a capsule encoder network and optimised it using a hybrid equilibrium method for the diagnosis of AD. Shi et al. [12] proposed a multiple-loss autoencoder constrained by a GAN. Among the aforementioned deep learning techniques, the CNN-based approach has been the most widely used because of its capacity to exploit fine-grained contextual features. However, currently available CNN-based models fall short because they do not efficiently take advantage of enhanced spatial attention-guided multiscale features. The proposed model fills this gap, as described in the following section.

3 Proposed Method and Model

Multiscale feature modelling and feature attention mechanisms play an important role in solving critical, non-linear classification problems by exploiting fine-grained features. One way to extract multiscale features is to fuse the multilayer features of a CNN architecture, whereas feature attention can be provided by a spatial attention block. However, the following concerns arise:

  • Exploiting multiscale features using CNN would result in more computation complexity due to the increase in convolution operations.

  • Use of multiscale features without specifying "where the important features are", would result in performance degradation.

  • The current spatial attention block mechanism needs to be changed to improve the quality of spatial attention feature maps.

The proposed model addresses the above concerns. The proposed method and model are described under the following headings:

  • Network overview.

  • Efficient feature modelling using multilayers of depth-wise separable CNN.

  • Backgrounds of spatial attention mechanisms.

  • Improved spatial attention mechanism.

  • Improved spatial attention guided multiscale feature modelling.

  • Classification of AD stages and optimisation.

3.1 Network Overview

Figure 2 presents a detailed graphical illustration of the proposed model's design. The entire architecture is built on multilayers of depth-wise separable CNN (M-DSC). The M-DSC acts as the backbone of the model and takes MRI scans as input. The backbone is built with 10 depth-wise separable convolution (DSC) layers and 4 MaxPooling2D layers. All the pooling layers have a kernel shape of \(\left(2\times 2\right)\), and each is placed after a pair of DSC layers. The kernel shapes of the DSC layers, from the first to the tenth, are: \(\left(7\times 7\right)\), \(\left( {4 \times 4} \right),\left( {3 \times 3} \right),\left( {3 \times 3} \right),\left( {3 \times 3} \right),\left( {3 \times 3} \right),\left( {3 \times 3} \right),\left( {3 \times 3} \right),\left( {2 \times 2} \right),{\text{and}}\;\;\left( {2 \times 2} \right)\). The depth multipliers of the first and the remaining DSC layers are set to 3 and 2, respectively.
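To make the backbone description concrete, the following is a minimal Keras sketch of the M-DSC backbone under the settings stated above. The filter counts, the padding mode, and the exact positions of the four pooling layers are not specified in the text, so the values marked as assumed are placeholders rather than the authors' configuration.

```python
# Hedged sketch of the M-DSC backbone (Sect. 3.1 / Fig. 2). Filter counts,
# padding, and pooling positions marked "assumed" are not from the paper.
import tensorflow as tf
from tensorflow.keras import layers

def build_mdsc_backbone(input_shape=(176, 208, 1)):   # MRI resolution from Sect. 4.1
    kernels = [(7, 7), (4, 4), (3, 3), (3, 3), (3, 3),
               (3, 3), (3, 3), (3, 3), (2, 2), (2, 2)]
    inputs = layers.Input(shape=input_shape)
    x = inputs
    taps = []                                          # even-numbered DSC outputs feed the I-SABs
    for i, k in enumerate(kernels, start=1):
        x = layers.SeparableConv2D(
            filters=32,                                # assumed; not stated in the paper
            kernel_size=k,
            depth_multiplier=3 if i == 1 else 2,       # 3 for the first layer, 2 for the rest
            padding="same",                            # assumed
            activation="relu")(x)
        if i % 2 == 0:
            taps.append(x)                             # tapped for the I-SABs (Sect. 3.5)
            if i < 10:                                 # assumed: 4 pools after layers 2, 4, 6, 8
                x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return inputs, taps

inputs, taps = build_mdsc_backbone()
```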

Fig. 2
figure 2

Details of proposed model

The feature maps of every even-numbered depth separable convolution layer are inputted to a proposed Improved Spatial Attention Block (I-SAB), as shown in Fig. 2. There are five I-SABs plugged into the backbone; each one takes feature maps of even-numbered, depth-wise separable convolution layers. The insights of I-SAB are illustrated in Fig. 4.

The I-SAB contains three parallel pooling operations: max, average, and min pooling. The features of these pooling operations are concatenated and then merged by a Conv2D layer with sigmoidal activation. The activated feature map is then given to a Feature Map Enhancement Module (FMEM) to obtain an enhanced spatial attention map. The FMEM includes two sigmoid-activated DSC layers with a kernel size of \(\left(2\times 2\right)\) and a depth multiplier of 2.

The input of the FMEM is elementwise multiplied with the sigmoidal output of the second DSC layer via a skip connection (as illustrated in Fig. 4), and then a merging layer (sigmoidal Conv2D with kernel \(\left(1\times 1\right)\)) is applied. This merging layer's output is referred to as an "enhanced spatial attention map." The enhanced spatial attention map of each I-SAB is multiplied by the input of the I-SAB to get enhanced spatially attentive features, which are fused by a depth-wise separable CNN with kernel \(\left(1\times 1\right)\), followed by a ReLU activation layer.

All the fused features corresponding to the five I-SABs are flattened and concatenated to obtain improved multiscale features that are scale-invariant in nature. These features are densely connected to 256 ReLU neurons, followed by an output layer containing 4 neurons, one each for very-mild demented (VMD, class-0), mild demented (MD, class-1), moderate demented (MoD, class-2), and non-demented (ND, class-3).

3.2 Efficient Feature Modelling Using Depth-wise Separable CNN

The depth-wise separable convolution (DSC) layer is a specific type of convolutional layer used in building Convolutional Neural Networks (CNNs). Its primary goal is to retain the effectiveness of standard convolution while reducing the number of parameters needed to perform the convolution operations.

The traditional CNN architecture carries out the convolution operation by moving a kernel, also referred to as a filter, over the input volume and computing a dot product between the kernel and its spatially aligned region in the input. Mathematically, this convolution operation can be written as follows:

$$F_{i + 1} = AF\;\left( {F_{i} \otimes k_{i} } \right).$$
(1)

Here, \({F}_{i}\) is the input feature at the ith layer of a convolutional neural network, which has \(\left[H\times W\right]\) and \(C\) as its spatial and channel dimensions, respectively; it contains real-valued numbers, so \({F}_{i}\in {\mathbb{R}}^{H\times W\times C}\). In Eq. 1, the kernels at the ith layer are represented as \({k}_{i}\), consisting of \(N\) kernels of size \(\left[h\times w\right]\), so \({k}_{i}\in {\mathbb{R}}^{h\times w\times N}\). The function \(AF\left(.\right)\) is the activation function, and the symbol ⨂ represents the convolution operation. \({F}_{i+1}\) is the output of the convolution operation, which becomes the input of the \({\left(i+1\right)}^{th}\) layer, with \({F}_{i+1}\in {\mathbb{R}}^{H\times W\times N}\).

A DSC layer, on the other hand, splits the ordinary convolution operation into depth-wise and point-wise convolutions. Here's how it works:

  • Depth-wise convolution: here, a convolutional kernel is applied separately to each channel of the input volume. Instead of using a single kernel to compute the dot product across all channels, a separate kernel is used for each channel. This means that if the input volume (at the ith layer) has C input channels, we have C separate convolutional kernels \({k}_{{i}_{1}}, {k}_{{i}_{2}}, \dots , {k}_{{i}_{C}}\), where each \({k}_{{i}_{p}}\in {\mathbb{R}}^{h\times w\times 1}\). Since the input feature at the ith layer, \({F}_{i}\), is composed of C channels, let each channel feature be represented by \({F}_{{i}_{p}}\), where p ranges from 1 to C. Let the features obtained by the depth-wise convolution be represented by \({DF}_{i}\), which can be written mathematically as:

    $$DF_{i} = Concate\;\left[ {F_{{i_{1} }} \otimes k_{{i_{1} }} ,F_{{i_{2} }} \otimes k_{{i_{2} }} , \ldots ,F_{{i_{C} }} \otimes k_{{i_{C} }} } \right] \in {\mathbb{R}}^{H \times W \times C}$$
    (2)

Here, the \({\text{Concate}}\) operation denotes the concatenation of the results of the channel-wise convolution operations. Note that the concatenation is done along the channel dimension.

  • Point-wise convolution: after the depth-wise convolution, a 1 × 1 convolution (also known as point-wise convolution) is applied to the output of the depth-wise convolution, i.e., \({{\text{DF}}}_{{\text{i}}}\). The 1 × 1 convolution operates on the output channels of the depth-wise convolution, combining information across channels. Let each point-wise convolution be represented by a point kernel \({{\text{pk}}}_{{{\text{i}}}_{{\text{p}}}}\in {\mathbb{R}}^{1\times 1\times {\text{C}}}\), where p ranges from 1 to M, i.e., there are \({\text{M}}\) point-wise convolutions. Thus, the final feature at the ith layer is obtained as:

    $${\text{F}}_{i + 1} = {\text{Concate}}\;\left[ {{\text{DF}}_{{\text{i}}} \otimes {\text{pk}}_{{{\text{i}}_{1} }} ,{\text{DF}}_{{\text{i}}} \otimes {\text{pk}}_{{{\text{i}}_{2} }} \ldots ,{\text{DF}}_{{\text{i}}} \otimes {\text{pk}}_{{{\text{i}}_{{\text{M}}} }} } \right] \in {\mathbb{R}}^{{{\text{H}} \times {\text{W}} \times {\text{M}}}}$$
    (3)

The DSC effectively reduces the number of parameters and computations compared to standard convolution: for a kernel of size \(\left[h\times w\right]\), the cost per spatial position drops from \({\text{O}}\left(h\times w\times {\text{C}}\times {\text{M}}\right)\) for the standard convolution to \({\text{O}}\left(h\times w\times {\text{C}}+{\text{C}}\times {\text{M}}\right)\) for the depth-wise separable convolution, where C and M denote the input and output channel dimensions and the cost is incurred at each of the \({\text{H}}\times {\text{W}}\) positions. Taking advantage of this variant of convolution, we were motivated to design a computationally efficient multilayer depth-wise separable convolution network as the backbone of the proposed model for fine-grained feature modelling. The details of the backbone network are illustrated in Sect. 3.1.
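The saving can be checked directly in Keras by counting the parameters of a standard convolution and its depth-wise separable counterpart on the same input; the layer sizes below are illustrative only.

```python
# Parameter comparison for a standard vs. depth-wise separable convolution.
# Layer sizes are illustrative, not the proposed model's configuration.
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(64, 64, 32))                       # H x W x C, illustrative
std = tf.keras.Model(inp, layers.Conv2D(64, (3, 3))(inp))
dsc = tf.keras.Model(inp, layers.SeparableConv2D(64, (3, 3))(inp))

# Standard conv:  h*w*C*M + M bias     = 3*3*32*64 + 64        = 18,496 parameters
# Separable conv: h*w*C + C*M + M bias = 3*3*32 + 32*64 + 64   =  2,400 parameters
print(std.count_params(), dsc.count_params())
```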

3.3 Backgrounds of Spatial Attention Mechanism

The multiscale features can be extracted by fusing the multilayer features of any backbone network [24]. However, before fusing the multiscale features, it is better to highlight and identify where the important features are in several feature maps. This can be done by using Convolution Spatial Attention Block (C-SAB), as described by Woo et al. [22]. The architectural details of C-SAB are given in Fig. 3. The procedures to obtain a spatial attention map are as follows:

  • Both max and average pooling are applied along the channel axis of the depth-wise separable convolution features \({{\text{F}}}_{{\text{i}}}\); let these pooled features be represented by \({{\text{F}}}_{{\text{M}}}\in {\mathbb{R}}^{{\text{H}}\times {\text{W}}\times 1}\) and \({{\text{F}}}_{{\text{A}}}\in {\mathbb{R}}^{{\text{H}}\times {\text{W}}\times 1}\), respectively.

  • These features are fused or concatenated to form a tensor \({[{\text{F}}}_{{\text{M}}};{{\text{F}}}_{{\text{A}}}]\).

  • Then a standard convolution layer with sigmoid activation is applied to the fused features, producing an attention map of shape \(\left[{\text{H}}\times {\text{W}}\right]\) by computing the spatial attention as \({{\text{M}}}_{{\text{s}}}=\updelta \left({{\text{Conv}}}^{7\times 7}\left(\left[{{\text{F}}}_{{\text{M}}};{{\text{F}}}_{{\text{A}}}\right]\right)\right)\). Here, \(\updelta\) and \({{\text{Conv}}}^{7\times 7}\) denote the sigmoid function and a convolution operation with a filter size of 7 × 7, respectively, and \({M}_{{\text{s}}}\in {\mathbb{R}}^{{\text{H}}\times {\text{W}}}\).

Fig. 3
figure 3

Details of convolution spatial attention block (C-SAB)[22, 23]

Such an approach to obtaining attention maps has been applied in many research domains, but not, so far, to AD detection. In addition, the C-SAB can be improved to obtain a better spatial attention map by infusing more features into the map. Section 3.4 describes such an improved spatial attention mechanism.
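For reference, a minimal sketch of the C-SAB procedure summarised above (channel-wise max and average pooling, concatenation, and a 7 × 7 sigmoid convolution) is given below. It follows the description of Woo et al. [22] as presented here and is not the authors' code.

```python
# Minimal sketch of the C-SAB of Woo et al. [22] as summarised in Sect. 3.3.
import tensorflow as tf
from tensorflow.keras import layers

def csab(feature_map):
    f_max = tf.reduce_max(feature_map, axis=-1, keepdims=True)   # F_M: H x W x 1
    f_avg = tf.reduce_mean(feature_map, axis=-1, keepdims=True)  # F_A: H x W x 1
    fused = layers.Concatenate(axis=-1)([f_max, f_avg])          # [F_M; F_A]
    m_s = layers.Conv2D(1, (7, 7), padding="same",
                        activation="sigmoid")(fused)             # spatial attention map M_s
    return m_s
```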

3.4 Improved Spatial Attention Mechanism for Multiscale Features

The basis for proposing an improved spatial attention mechanism for multiscale features lies in two key observations about the basic C-SAB structure.

  • First, the spatial attention map \({M}_{{\text{s}}}\) depends solely on the maximum- and average-pool information from the input feature maps. However, minimum-pool properties can also play an important role in providing more discriminant features, so they should not be discarded. Accordingly, three pooling features are obtained from the input feature maps and concatenated, giving the fused tensor \({[{\text{F}}}_{{\text{M}}};{{\text{F}}}_{{\text{A}}};{{\text{F}}}_{{\text{Min}}}]\), where \({{\text{F}}}_{{\text{M}}}, {{\text{F}}}_{{\text{A}}},\mathrm{ and }{{\text{F}}}_{{\text{Min}}}\) are the max-, average-, and min-pool features.

  • Second, the attention map obtained from basic C-SAB can also be improved further by processing and exploiting fine-grained features through multilayers of depth-wise separable convolution layers.

Thus, addressing the above points by improving the spatial attention mechanism enhances the model's ability to exploit and focus on relevant spatial features across the different feature scales of the proposed backbone. These objectives are fulfilled by doing two things:

  • First, improve the spatial attention mechanism, and

  • Second, plug this mechanism into several layers of the backbone to focus on the relevant spatial features.

The following explains the former point, and the latter is explained in Sect. 3.5.

An improved spatial attention block (I-SAB) is designed that accumulates the aforementioned important features. The details of the I-SAB are shown in Fig. 4. The input feature maps are used to extract the maximum, minimum, and average pooling features, which are combined into a fused tensor and passed to a sigmoidal Conv2D layer. This layer contains one filter and produces a spatially attentive feature map \(\in {\mathbb{R}}^{{\varvec{H}}\times {\varvec{W}}}\). This feature map is then given to a Feature Map Enhancement Module (FMEM) to obtain an improved spatial attention map. The FMEM contains two consecutive sigmoidal depth-wise separable convolution layers with depth multipliers of two. These two layers produce a fine-grained spatial attention map from their input. The input to the FMEM is multiplied with these fine-grained features through a skip connection, as shown in Fig. 4. The enhanced features are then merged by a sigmoidal Conv2D layer to obtain the improved spatial attention map (ISAM). Mathematically, these operations can be represented as follows:

Fig. 4
figure 4

Detail architecture of I-SAB

$$ISAM= \delta \left({Conv}^{1\times 1}\left[\begin{array}{c}\delta \left({Conv}^{1\times 1}\left(Concate\left[{F}_{M},{F}_{A},{F}_{Min}\right]\right)\right).\\ \delta \left({{D}_{Conv}}^{1\times 1}\left(\delta \left({{D}_{Conv}}^{1\times 1}\left(\delta \left({Conv}^{1\times 1}\left(Concate\left[{F}_{M},{F}_{A},{F}_{Min}\right]\right)\right)\right)\right)\right)\right)\end{array}\right]\right)$$
(4)
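A minimal sketch of the I-SAB, following Fig. 4 and Eq. 4, is given below. The 2 × 2 FMEM kernels follow the architectural description in Sect. 3.1, and the single-filter choice for each layer is an assumption made to keep the attention map single-channel; this is an illustration of the block, not the authors' implementation.

```python
# Hedged sketch of the proposed I-SAB: max/average/min channel pooling,
# a sigmoidal merge, the FMEM (two sigmoid-activated depth-wise separable
# layers, depth multiplier 2), a multiplicative skip connection, and a
# final sigmoidal merge producing the ISAM. One filter per layer is assumed.
import tensorflow as tf
from tensorflow.keras import layers

def isab(feature_map):
    f_max = tf.reduce_max(feature_map, axis=-1, keepdims=True)    # F_M
    f_avg = tf.reduce_mean(feature_map, axis=-1, keepdims=True)   # F_A
    f_min = tf.reduce_min(feature_map, axis=-1, keepdims=True)    # F_Min
    fused = layers.Concatenate(axis=-1)([f_max, f_avg, f_min])
    sam = layers.Conv2D(1, (1, 1), activation="sigmoid")(fused)   # spatially attentive map

    # Feature Map Enhancement Module (FMEM)
    x = layers.SeparableConv2D(1, (2, 2), depth_multiplier=2, padding="same",
                               activation="sigmoid")(sam)
    x = layers.SeparableConv2D(1, (2, 2), depth_multiplier=2, padding="same",
                               activation="sigmoid")(x)
    enhanced = layers.Multiply()([sam, x])                         # skip connection
    isam = layers.Conv2D(1, (1, 1), activation="sigmoid")(enhanced)
    return isam                                                    # improved spatial attention map
```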

3.5 Improved Spatial Attention Guided Multiscale Feature Modelling

Five I-SAB modules are plugged into the network in parallel; they take the feature maps of the 2nd, 4th, 6th, 8th, and 10th depth-wise separable convolution layers, respectively. As shown in Fig. 2, through a skip connection, the feature maps of each even-numbered DSC layer are element-wise multiplied with the output of the corresponding I-SAB to form the improved spatially attentive feature maps (ISAF) of that layer of the backbone. The ISAFs are then merged through a depth-wise separable convolution with a kernel shape of \(1\times 1\) and a depth multiplier of 1. All the merged, enhanced multilayer features are flattened and fused to form attentive multiscale, or scale-invariant, features. These features are given to a multilayer neural network (MNN) for AD stage classification. The MNN consists of one hidden layer of 256 ReLU neurons and an output layer of 4 neurons with SoftMax activation. ReLU serves as the activation function for the backbone layers.
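The following sketch assembles the attention-guided multiscale fusion and classification head described above, reusing the backbone taps and the isab() helper from the earlier sketches; the single merge filter is an assumption, as the paper does not state a filter count for the 1 × 1 depth-wise separable merge.

```python
# Hedged sketch of the improved spatial attention-guided multiscale fusion
# and the MNN head (Sect. 3.5); `inputs`, `taps`, and `isab` come from the
# earlier backbone and I-SAB sketches.
from tensorflow.keras import layers, Model

branches = []
for feat in taps:                                    # outputs of DSC layers 2, 4, 6, 8, 10
    attn = isab(feat)                                # improved spatial attention map (H x W x 1)
    isaf = layers.Multiply()([feat, attn])           # improved spatially attentive features
    merged = layers.SeparableConv2D(1, (1, 1), depth_multiplier=1,
                                    activation="relu")(isaf)   # merge ISAF; 1 filter assumed
    branches.append(layers.Flatten()(merged))

multiscale = layers.Concatenate()(branches)          # scale-invariant multiscale features
hidden = layers.Dense(256, activation="relu")(multiscale)
outputs = layers.Dense(4, activation="softmax")(hidden)        # VMD, MD, MoD, ND
model = Model(inputs, outputs)
```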

3.6 Classification of AD Stages and Optimization

The proposed model has four neurons in the output layer, corresponding to the Non-Demented, Very-Mild-Demented, Mild-Demented, and Moderate-Demented classes, and the activation function of the output layer is SoftMax. Let \(X=\left[{x}_{1},{x}_{2},\dots ,{x}_{N}\right]\) be the set of predicted output scores for the N samples, and let \(S=\left[{s}_{1},{s}_{2},\dots ,{s}_{N}\right]\) represent the corresponding ground-truth labels. Each \(x_{i}\) and \(s_{i}\), for \(i=1,2,\dots ,N\), is a one-hot vector containing the predicted and ground-truth scores for the four classes, defined as \(x_{i} = \left[ {x_{{i_{1} }} ,x_{{i_{2} }} ,x_{{i_{3} }} ,x_{{i_{4} }} } \right]\) and \(s_{i} = \left[ {s_{{i_{1} }} ,s_{{i_{2} }} ,s_{{i_{3} }} ,s_{{i_{4} }} } \right]\), respectively. Let \(\varnothing\) denote the set of all trainable parameters of the proposed network. The model is trained by minimising the categorical cross-entropy loss between the predicted and ground-truth scores of mini-batches of samples. The categorical cross-entropy loss of the \({t}^{th}\) batch of samples \(\left( {t = 1\;{\text{to}}\;\left\lceil {\frac{N}{b}} \right\rceil ,\;{\text{where}}\;b\;{\text{is}}\;{\text{the}}\;{\text{batch}}\;{\text{size}}} \right)\) is calculated using the following equation:

$${\text{Loss}}_{t}=\frac{1}{b}\sum_{i=1}^{b}\left[-\sum_{j=1}^{4}{s}_{{i}_{j}}\log {x}_{{i}_{j}}\right]$$
(5)

For a given set of network parameters \(\varnothing\), we minimise the loss of the \({t}^{th}\) batch \(\left( {i.e.,\;\mathop {argmin}\limits_{\varnothing } \;Loss_{t} } \right)\) using the Adaptive Moment (Adam) optimizer [25].
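As a worked illustration of Eq. 5, the short example below computes the batch loss for an assumed mini-batch of b = 2 samples with four classes; the score values are illustrative only.

```python
# Worked example of the categorical cross-entropy of Eq. 5 for b = 2 samples.
import numpy as np

x = np.array([[0.7, 0.1, 0.1, 0.1],    # predicted SoftMax scores x_i (illustrative)
              [0.2, 0.6, 0.1, 0.1]])
s = np.array([[1, 0, 0, 0],            # one-hot ground-truth labels s_i
              [0, 1, 0, 0]])

loss_t = np.mean(-np.sum(s * np.log(x), axis=1))
print(loss_t)                           # (-log 0.7 - log 0.6) / 2 ≈ 0.4338
```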

4 Dataset, Experimental Setup and Performance Measures

4.1 Stats of Datasets

We used two publicly accessible datasets: the Alzheimer's disease dataset published on Kaggle [26] and OASIS-1 [27]. The AD dataset [26] consists of MRI scans of patients in four dementia classes: very mild, mild, moderate, and non-demented. The resolution of the MRI scans is \(\left[176\times 208\right]\). The dataset is described in Table 1. The training/testing split for this dataset is adopted from the work of Murugan et al. [9], in which 10% of the randomly chosen samples are used for testing.
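A hedged sketch of loading this dataset with the 90/10 split described above is given below; the directory path and layout (one sub-folder per class) are assumptions of the example, not details taken from the paper.

```python
# Hedged sketch of loading the Kaggle AD dataset [26] with a 90/10 split.
# The directory path/layout is assumed, not taken from the paper.
import tensorflow as tf

IMG_SIZE = (176, 208)                      # MRI resolution reported in Sect. 4.1

common = dict(
    labels="inferred",
    label_mode="categorical",              # one-hot labels for categorical cross-entropy
    color_mode="grayscale",                # assumed single-channel MRI input
    image_size=IMG_SIZE,
    validation_split=0.1,                  # 10% held out for testing, following [9]
    seed=42,
    batch_size=32)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "alzheimer_dataset/", subset="training", **common)     # assumed path
test_ds = tf.keras.utils.image_dataset_from_directory(
    "alzheimer_dataset/", subset="validation", **common)
```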

Table 1 Distributions of data samples across several categories

The OASIS dataset contains samples from 436 subjects, but only 399 samples can currently be downloaded from OASIS-1. The sample distribution of the OASIS-1 dataset is shown in Table 1. Because the samples of this dataset are very limited, we used two-fold cross-validation for training and testing, adopting the protocol of Chui et al. [6]. Figure 5 shows some samples from both the AD [26] and OASIS datasets.

Fig. 5
figure 5

Examples of MRI scans for patients with a. very mild dementia, b. mild dementia, c. moderate dementia, and d. no dementia

4.2 Experimental Setup

The proposed model is implemented in Python using Keras layers with TensorFlow as the backend, and the code is executed on the Colab platform. The hyperparameters, namely the kernel (L2) regularisation parameter, learning rate, and batch size, are set to 0.01, 0.01, and 32, respectively. To prevent overfitting, the model adopts early stopping and L2 regularisation; the patience of early stopping is set to 10 epochs, and the maximum number of training epochs is 1000.
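The following sketch wires these settings together for the model sketched in Sect. 3. The L2 kernel regularisation of 0.01 is assumed to be set on the individual layers when the model is built, and `model`, `train_ds`, and `test_ds` refer to the earlier sketches; reusing the held-out split as validation data here is for illustration only.

```python
# Hedged sketch of the training configuration of Sect. 4.2: Adam (lr = 0.01),
# categorical cross-entropy (Eq. 5), early stopping (patience 10), max 1000 epochs.
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,
                                              restore_best_weights=True)

history = model.fit(train_ds,                 # batch size 32 fixed when the dataset was built
                    validation_data=test_ds,  # illustrative; a separate validation split is cleaner
                    epochs=1000,
                    callbacks=[early_stop])
```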

4.3 Performance Measures

Accuracy, precision, recall, and F1-Score are the performance metrics utilised in this article. The following are the detailed descriptions of these metrics.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(6)
$$Precision=\frac{TP}{TP+FP}$$
(7)
$$Recall=\frac{TP}{TP+FN}$$
(8)
$$F1-Score=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(9)

Here, the terms TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. These measures can be obtained from the confusion matrix, which is given in Fig. 6.
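A small example of computing these measures from predicted and ground-truth class indices is shown below; macro averaging over the four classes is an assumption of the example, since the averaging scheme is not stated here.

```python
# Hedged example of computing the metrics in Eqs. 6-9 with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([0, 1, 2, 3, 0, 1])     # illustrative ground-truth class indices
y_pred = np.array([0, 1, 2, 3, 0, 2])     # illustrative predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-Score :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```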

Fig. 6
figure 6

Confusion matrix

5 Analysis of Results

The results of the proposed model on the two datasets are analysed below.

5.1 For AD Dataset

On the AD dataset, the proposed framework achieves an accuracy of 96.25%, with a recall, precision, and F1-score of 96.36%, 96.71%, and 96.52%, respectively. The confusion-matrix heat map of the predictions on this dataset is given in Fig. 7. The performance of the proposed model is best for diagnosing MoD samples and weakest for detecting MD samples. The AUC of the proposed model is 99.29%. Additionally, Table 2 compares the proposed model's performance with current state-of-the-art methods, including recent deep models such as Deep ConvNet [28], DCNN-VGG19 [29], Inception-V4 [30], ADDTLA [7], DEMNET [9], and Landmark Feature Modelling [31]. All these models exploit spatial features without providing any attention mechanism over the spatial features. With an accuracy of 91.70%, the ADDTLA, which adopts a transfer learning approach, takes second place in Table 2. Deep ConvNet [28], DEMNET [9], Landmark Feature Modelling [31], DCNN-VGG19 [29], and Inception-V4 [30] achieve accuracies of 90.4%, 85.00%, 79.02%, 77.66%, and 73.75%, respectively, on this dataset. The proposed model, with its improved spatial attention mechanism and enhanced multiscale feature modelling, obtains an accuracy of 96.25% and tops Table 2. It also outperforms the most recent models in terms of recall, precision, and F1-score. Hence, the proposed model addresses the identified challenges very well and performs strongly on the AD dataset available on Kaggle.

Fig. 7
figure 7

The proposed model's confusion matrix for the AD Dataset

Table 2 Comparative quantitative analysis of results for the dataset available on Kaggle

5.2 For the OASIS Dataset

The proposed model's accuracy on the OASIS dataset is 99.75%, with a 0.25% error rate, 99.63% precision, 99.91% recall, and a 99.77% F1-score. Figure 8 depicts the confusion matrix of the predicted diagnosis labels, and the ROC curves are illustrated in Fig. 9. The proposed model performs well in predicting all the test labels; only one non-demented (ND) sample is misclassified. In addition, the ROC of the proposed model on this dataset is shown in Fig. 10, in which the AUC score is 99.99%. The proposed model's performance is also contrasted with current state-of-the-art methods developed on the OASIS dataset, as shown in Table 3. These models include Deep Net [32], Ensemble Hybrid Deep Net [33], CNN + Optimal KNN with BO [16], Ensemble-Deep CNN [34], Conv-TL [6], and ANN [11]. The DeepNet takes second place in Table 3 with an accuracy of 99.68% by exploiting deep spatial features from its input, whereas the GBM-ResNet-50 achieves an accuracy of 98.99% and takes third place. Ensemble approaches using deep learning have shown promising results; for example, the Ensemble Hybrid Deep Net [33] and Ensemble-Deep CNN [34] achieve accuracies of 95.23% and 93.18%, respectively. The multilayer perceptron-based ANN [11] achieves 92.00% accuracy on this dataset. The transfer learning-based approaches CNN + Optimal KNN [16] and Conv-TL [6] achieve accuracies of 94.96% and 93.80%, respectively. Gupta et al. [35] achieved 74.90% accuracy on the OASIS dataset, the lowest performance in Table 3. All these methods exploit spatial features without emphasising the important spatial features for AD detection. The proposed model, with its improved spatial attention mechanism, tops Table 3 with an accuracy of 99.75%. Thus, the proposed model addresses the identified research gaps and achieves better performance than the other methods.

Fig. 8
figure 8

On the OASIS dataset, the proposed model's confusion matrix

Fig. 9
figure 9

ROC curves for the AD dataset

Fig. 10
figure 10

ROC curves for the OASIS dataset

Table 3 Quantitative-based comparative results analysis on the OASIS dataset

6 Ablation Study and Generalisation Test

This section covers the ablation study and the generalisation test of the proposed model.

6.1 Ablation Study

Apart from the results analysis, this paper also conducts an ablation study on the various components (and combinations of components) of the proposed model. The main aim is to show the contribution of each component. For this, the following model variants are defined:

  • Model-1: Proposed model with SAB instead of I-SAB. In this model, the proposed I-SAB is replaced by the conventional SAB [22], and the rest of the model remains the same. The purpose is to understand and analyse the behaviour of the conventional SAB.

  • Model-2: Proposed model with four scales only. This model does not contain the first I-SAB.

  • Model-3: Proposed model with three scales. It does not contain the first two I-SABs.

  • Model-4: Proposed model with two scales. It does not contain the first three I-SABs.

  • Model-5: Proposed model with one scale. It only contains the last I-SAB.

  • Model-6: Proposed model without I-SAB and multilayer feature modelling. It contains only the backbone of the proposed model.

6.1.1 Ablation Study on the AD Dataset

Table 4 presents the quantitative findings for the six models stated above on the AD dataset. Performance measures such as accuracy, error rate, precision, recall, and F1-score are used for the comparison. The confusion matrices of these models are shown in Figs. 11, 12, 13, 14, 15 and 16. The impact of the SAB (Model-1) is analysed first: it reaches an accuracy of 94.06%, which is lower than that of the proposed model. Similarly, it is necessary to observe the contribution of multiscale features to AD detection. This is addressed through Models 2 to 5, whose accuracies are listed in Table 4; performance decreases gradually as fewer scales are included. In addition, the impact of the backbone alone is observed through Model 6, which has an accuracy of 92.03%. Thus, combining all these components is essential to achieve the best performance.

Table 4 Ablation investigation using the AD dataset available on Kaggle
Fig. 11
figure 11

Model-1's confusion matrix on the AD dataset

Fig. 12
figure 12

Model-2's confusion matrix on the AD dataset

Fig. 13
figure 13

Model-3's confusion matrix on the AD dataset

Fig. 14
figure 14

Model-4's confusion matrix on the AD dataset

Fig. 15
figure 15

Model-5's confusion matrix on the AD dataset

Fig. 16
figure 16

Model-6's confusion matrix on the AD dataset

6.1.2 Study of Ablation on the OASIS Dataset

The results of the ablation study and the quantitative comparison of the six models on the OASIS dataset are given in Table 5. The backbone of the proposed model alone (i.e., Model-6) has an accuracy of 94.25%, which is far lower than that of the full proposed model. The model with the SAB (Model-1) achieves an accuracy of 97.75% on the OASIS dataset, whereas the proposed model reaches 99.75%, showing that the improved spatial attention block significantly improves the detection accuracy. The multiscale variants, namely 4-Scale (Model-2), 3-Scale (Model-3), 2-Scale (Model-4), and 1-Scale (Model-5), achieve accuracies of 99.00%, 97.25%, 96.00%, and 95.25%, respectively. This demonstrates that the model's performance declines as the number of scales is reduced. The confusion matrices of all six models are shown in Figs. 17, 18, 19, 20, 21 and 22. Finally, we can conclude from this study that the proposed model, by using multiscale improved spatial attention features, performs well in all aspects of AD detection on the OASIS dataset.

Table 5 Study of ablations using the OASIS dataset
Fig. 17
figure 17

Model-1's confusion matrix on the OASIS dataset

Fig. 18
figure 18

Model-2's confusion matrix on the OASIS dataset

Fig. 19
figure 19

Model-3's confusion matrix on the OASIS dataset

Fig. 20
figure 20

Model-4's confusion matrix on the OASIS dataset

Fig. 21
figure 21

Model-5's confusion matrix on the OASIS dataset

Fig. 22
figure 22

Model-6's confusion matrix on the OASIS dataset

6.2 Generalisation Test

A cross-dataset generalisation test is also conducted to show the behaviour of the model under domain shift. In this test, the proposed model is trained on the AD dataset and tested on the OASIS dataset, where it obtains an accuracy of 83.95%. The confusion matrix of this test is shown in Fig. 23.

Fig. 23
figure 23

Confusion matrix of generalizability test

The confusion matrix shows that the proposed model fails to classify the moderately demented class. The prediction accuracies for the VMD and MD samples are nearly the same, at 71.20% and 69.20%, respectively, while the performance on ND samples is high, at 88.20%. The main reason could be the limited availability of MoD, MD, and VMD samples. Nevertheless, the proposed model still achieves a reasonable accuracy of 83.95% in this generalisation test. Future work will focus on improving these results by proposing an advanced domain generalisation model.

7 Conclusion

An innovative deep learning-based technique for the diagnosis of AD has been presented in this article. The proposed model's framework is built from multiple depth-wise separable convolution layers; the depth-wise separable CNN is preferred over a conventional CNN to take advantage of its lower computational cost. The model exploits improved spatial attention-guided multiscale spatial features for AD detection. The conventional spatial attention mechanism is limited in producing a strong attention map, a limitation addressed by the proposed improved spatial attention block (I-SAB), whose details are given in Sect. 3. Multiple I-SABs are plugged into multiple layers of the backbone (illustrated in Fig. 2) to provide improved spatially attentive multilayer features. These features are fused and given to a multilayer perceptron for disease classification. The behaviour of the model is demonstrated through experiments on two publicly available AD datasets: the AD dataset available on Kaggle and the OASIS dataset. The proposed model achieves accuracies of 99.75% and 96.25% on the OASIS and Kaggle datasets, respectively, outperforming the existing models. An ablation study is also carried out: six sub-models are generated from the proposed model, and their quantitative results are given in Tables 4 and 5. It is clear from this study that the proposed model outperforms the variant with the traditional SAB, and that the fusion of multiscale features is also important for obtaining better accuracy. Additionally, a generalisation test was carried out by training the model on the Kaggle dataset and testing it on the OASIS dataset, which yields an accuracy of 83.95%. Thus, the proposed model performs well in all aspects, but the generalisation accuracy still needs improvement, which will be the scope of our future research.