1 Introduction

Approximately 70% of all dementia cases worldwide are caused by Alzheimer’s disease (AD), a progressive neurodegenerative illness. In its early stages, which include mild cognitive impairment (MCI), the condition produces few noticeable symptoms. Despite extensive research, no cure has yet been discovered [1]. On average, people aged 65 and older live 4 to 8 years after being diagnosed with AD, although some live up to 20 years with the disease. This extended duration significantly impacts public health, as a considerable part of that period is spent in a state of dependence and disability [2]. It is therefore imperative to find more precise and reliable means of diagnosing AD to minimize its impact.

AD has 3 stages: (1) pre-clinical AD, the asymptomatic period between the initial brain lesions and the appearance of the first symptoms; (2) MCI, the pre-dementia state, in which individuals have cognitive deficits greater than those that naturally emerge with age but do not meet the criteria for an AD diagnosis; (3) dementia due to AD (or simply AD in this study), characterized by severe symptoms.

Dementia due to AD progresses through 3 phases. In the mild phase, the individual remains functional in several areas but, for safety reasons, may need help with certain activities. The moderate phase is marked by difficulty in communicating and performing routine tasks. In the advanced phase, individuals require 24-hour care as damage emerges in the areas of the brain responsible for movement [2].

The diagnosis of the disease can be performed in numerous ways. Usually, the main risk factors are assessed through physical examinations and the medical history of the individual and their family. Combined with neurological and cognitive exams, this makes it possible to rule out other causes of dementia and evaluate the stage of AD. The most common cognitive test is the Mini-Mental State Examination (MMSE), whose scores range from 0 to 30: higher scores indicate higher cognitive function, while lower scores indicate more severe cases of dementia [3]. Additionally, other methods are employed to identify both neurodegeneration and amyloid deposition, such as Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Electroencephalography (EEG), and Cerebrospinal Fluid (CSF) analysis [1].

Imaging techniques are used as non-invasive means of AD diagnosis. Current imaging modalities focus on the identification of amyloid deposition or neurodegeneration; e.g., structural MRI allows the measurement of atrophy and tissue changes [4]. MRI-based atrophy measurements are considered valid markers of disease state and progression, since atrophy seems to be an inevitable and intrinsic feature of progressive neurodegeneration. Moreover, changes in structural measures, such as ventricular enlargement and the volumes of the hippocampus, entorhinal cortex, whole brain, and temporal lobe, can be associated with changes in cognitive performance [5]. Atrophy progression assessed by MRI is widely used as an efficacy and safety outcome measure in clinical trials. Nonetheless, among all the MRI markers, hippocampal atrophy is considered the best established and validated for AD [6, 7].

Regarding state-of-the-art MRI studies on AD diagnosis, Ruiz et al. [8] proposed an automated computer-aided diagnosis (CAD) system that extracts features from regions of interest (ROI) in MRI. Several machine learning classifiers were evaluated, with VAF-FS, Random Forest (RF), and XGBoost best suiting the problem, achieving accuracies of 85.86% for Healthy Controls (CN) vs AD, 71.92% for CN vs MCI, and 68.92% for MCI vs AD.

Thapa et al. [9] combined neuropsychological testing with MRI. The best-performing machine learning classifier was a Support Vector Machine (SVM) fed with left and right hippocampal volumes and MMSE scores, achieving discrimination accuracies of 99.2% for CN vs AD, 78.5% for CN vs MCI, and 91.3% for MCI vs AD.

Hon and Khan [10] used the entropy of MRI images to characterize AD activity. Two convolutional neural network (CNN) architectures were used (VGG and Inception), reaching a discrimination accuracy of 96.5% for CN vs AD. Amini et al. [11] used functional MRI (fMRI) images and extracted the average and standard deviation of cortical thickness, cortical parcel volume, white matter, and surface area. These features fed both machine learning and CNN algorithms; the proposed CNN obtained a discrimination accuracy of 96.7% for CN vs AD.

Al-Khuzaie et al. [12] fed a proposed CNN with 2D MRI slices, achieving a discrimination accuracy of 99.3% for CN vs AD. Liu et al. [13] extracted hippocampal features from MRI images and classified them with a 3D densely connected CNN (DenseNet 3D), obtaining discrimination accuracies of 88.9% for CN vs AD and 76.2% for CN vs MCI. Qiu et al. [14] fed a fully convolutional network with AD probability maps derived from MRI, obtaining a discrimination accuracy of 87.0% for CN vs AD.

Vaithinathan and Parthiban [15] extracted ROI-based texture measures from MRI images and classified them with several algorithms, such as RF, linear SVM, and k-nearest neighbors (KNN). The discrimination accuracies achieved were 87.39% for CN vs AD, 64.74% for CN vs MCI, 63.41% for MCI vs AD, and 66.38% for converter MCI (cMCI) vs stable MCI (sMCI). Kang et al. [16] designed a multi-slice ensemble learning approach to obtain spatial features for training CNN models, achieving accuracies of 90.36%, 77.19%, and 72.36% for AD vs CN, AD vs MCI, and MCI vs CN, respectively. Ebrahimi et al. [17] applied several deep sequence-based CNN models, reaching 91.78% accuracy for AD vs CN.

In this sense, the main purpose of the present work is to develop an artificial intelligence system that enables the detection of AD in its MCI and dementia (AD) stages using sMRI texture features. The paper is structured as follows: Sect. 2 describes the MRI database used; Sect. 3 focuses on the image processing methodology and the classification process; Sect. 4 discusses the obtained results; lastly, Sect. 5 concludes the work.

2 Materials

The data used in this work come from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership with the aim of testing whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease.

Regarding the MRI scans, the total acquisition time was about 45 min per subject and session. Each exam undergoes quality control; scans affected by, for example, subject motion or poor anatomic coverage are considered unusable. The database, released in February 2021, consists of 89 subjects scanned longitudinally at 3T with a 3-year follow-up: 24 healthy control subjects, 44 MCI patients, and 21 AD patients (patients diagnosed with dementia due to AD). The demographic data of the 3 groups are summarized in Table 1.

Table 1 Database demographic data overview

3 Methods

The proposed methodology is divided into 3 main steps: (1) preprocessing, (2) wavelet decomposition and feature extraction, and (3) feature selection and classification. Figure 1 summarizes the methodology implementation steps.

Fig. 1 Image processing methodology workflow

3.1 Preprocessing

The dataset was loaded into the FreeSurfer 7.1.1 software (freely available online at https://surfer.nmr.mgh.harvard.edu/) to decompose each subject's 3D data into 2D slices along 3 anatomical planes, namely, coronal, sagittal, and axial, and then to execute the skull stripping process on the 2D slice MR images. An example of skull stripping is illustrated in Fig. 2; a sketch of the slicing step follows the figure.

Fig. 2 Example of the skull stripping process
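For illustration, the following is a minimal Python sketch of the slicing step under stated assumptions: the file name, the use of nibabel to read the FreeSurfer volume, the axis-to-plane mapping, and the choice of 9 equally spaced central slices per plane (the selection criterion for the 9 slices used later in Sect. 3.3 is not prescribed here) are all illustrative.

```python
# Hypothetical sketch: slice a skull-stripped FreeSurfer volume into
# 2D images along the three anatomical planes.
import nibabel as nib
import numpy as np

img = nib.load("brainmask.mgz")      # skull-stripped volume (assumed name)
vol = np.asarray(img.get_fdata())    # 3D intensity array

def central_slices(volume, axis, n=9):
    """Return n equally spaced 2D slices around the centre of `axis`."""
    size = volume.shape[axis]
    idx = np.linspace(size // 4, 3 * size // 4, n, dtype=int)
    return [np.take(volume, i, axis=axis) for i in idx]

# The axis-to-plane mapping depends on the volume orientation and
# should be verified per dataset; it is assumed here.
sagittal = central_slices(vol, axis=0)
coronal = central_slices(vol, axis=1)
axial = central_slices(vol, axis=2)
```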

The resulting 2D slice images were loaded into the Matlab\(^{\circledR }\) 2019b software. These images were first filtered with a \(3 \times 3\) median filter to remove noise [18]. Subsequently, the imadjust function was applied to stretch the image intensity values to the full intensity scale according to [19]

$$P_{adj}(m,n) = B + \dfrac{P(m,n) - L}{H - L}* (T - B),$$
(1)

where \(P(m,n)\) is the input image, \(P_{adj}(m,n)\) is the output image, m and n are the image pixel indices, H and L are the maximum and minimum pixel levels in the original image, and \(T=255\) and \(B=0\) are the maximum and minimum pixel levels in the desired image.
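A minimal Python sketch of this preprocessing step, assuming a NumPy/SciPy environment instead of Matlab, is given below. Note that Eq. (1) is implemented literally, whereas Matlab's imadjust by default also saturates 1% of the extreme intensities.

```python
# Minimal sketch of the preprocessing stage: 3x3 median filtering
# followed by the intensity stretch of Eq. (1).
import numpy as np
from scipy.ndimage import median_filter

def preprocess(slice2d, T=255, B=0):
    """Denoise a 2D slice and stretch its intensities to [B, T]."""
    filtered = median_filter(slice2d, size=3).astype(float)  # 3x3 kernel
    L, H = filtered.min(), filtered.max()  # original min/max pixel levels
    return B + (filtered - L) / (H - L) * (T - B)
```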

3.2 Wavelet Decomposition

The discrete wavelet transform (DWT) was chosen to describe the input images because it maintains higher resolution at low-frequency bands [20]. It is obtained by constraining the scale (s) and translation (\(\tau\)) parameters to a discrete lattice with \(s=2^{-m}\) and \(\tau = n \cdot 2^{-m}\), where m and n are integers. Hence, for a discrete-time signal f(n), the wavelet decomposition on I octaves is given by

$$f(n) = \sum _{i=1}^{I}\sum _{k \in Z} c_{i,k}\, g[n-2^{i}k] + \sum _{k \in Z} d_{I,k}\, h_{I}[n-2^{I}k]$$
(2)

where \(c_{i,k}\) and \(d_{I,k}\) correspond to the detail coefficients (at each level i) and the approximation coefficients (at the coarsest level I), respectively [21, 22]. These coefficients are given by

$$c_{i,k} = \sum _{n} f(n)\, G^{*}_{i}[n-2^{i}k]$$
(3)
$$d_{I,k} = \sum _{n} f(n)\, H^{*}_{I}[n-2^{I}k]$$
(4)

The parameters i and k indicate the wavelet scale and translation factors, respectively. Moreover, consistently with Eq. (2), \(G_{i}\) denotes the coefficients of the high-pass (wavelet) filter and \(H_{I}\) the coefficients of the low-pass (scaling) filter. Each wavelet type and family defines these filters differently [21, 23].

Since images are two-dimensional, the DWT is applied to images both vertically and horizontally. The result is four sub-images (subbands) with half the width and height of the original: one is a decimated (low-resolution) copy of the image (LL), and the 3 remaining contain the horizontal (HL), vertical (LH), and diagonal (HH) details. At each subsequent decomposition step, the LL subband is replaced by four smaller subbands, so the total number of subbands increases by 3 (see Fig. 3).

In this work, for all participants, in each plane, every image was decomposed by the DWT up to level 2, thereby producing 8 images, as illustrated in Fig. 3 and sketched in the code below.

Fig. 3 Image wavelet decomposition
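As an illustration, a minimal Python sketch of the two-level decomposition using PyWavelets follows. Two assumptions are made: that the 8 images comprise the 4 level-1 subbands plus the 4 subbands obtained by further decomposing LL (consistent with Fig. 3), and that one of the reverse biorthogonal wavelets selected in Sect. 3.4 is used.

```python
# Minimal sketch of the two-level 2D DWT producing the 8 sub-band
# images per slice: 4 from level 1 plus 4 from decomposing the
# level-1 approximation (LL) sub-band.
import pywt

def dwt_two_levels(slice2d, wavelet="rbio1.1"):
    """Return the 8 sub-band images of the two-level decomposition."""
    ll1, (h1, v1, d1) = pywt.dwt2(slice2d, wavelet)  # approximation + h/v/d details
    ll2, (h2, v2, d2) = pywt.dwt2(ll1, wavelet)      # decompose LL again
    return [ll1, h1, v1, d1, ll2, h2, v2, d2]
```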

3.3 Feature Extraction

For each of the 89 study participants, 243 images were used for feature extraction: the 27 original plane images (9 images from each of the 3 planes) and the 8 images resulting from the DWT decomposition of each plane image. From each image, 9 texture features were extracted: contrast, correlation, energy, homogeneity, entropy, line and column variances, and line and column standard deviations. Therefore, for each possible mother wavelet used in the DWT decomposition, 2187 features (729 per plane) were computed for each study participant.

The features were computed from the gray level co-occurrence matrix (GLCM), a statistical method that considers the spatial relationship of pixels and is employed to describe the texture of an image [24]. Each element \(\{i,j\}\) of the GLCM, \(P_{i,j}\), represents the frequency with which a pixel with gray level i is spatially related to a pixel with gray level j [24]. The formula and description of the features are summarized in Table 2, where

$$\begin{aligned} \mu _i = \sum ^{N}_{i=1}\sum ^{N}_{j=1} i P_{i,j} \end{aligned}$$
(5)

and

$$\begin{aligned} \mu _j = \sum ^{N}_{i=1}\sum ^{N}_{j=1} j P_{i,j} \end{aligned}$$
(6)

are mean values of the GLCM.
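For illustration, a minimal Python sketch of the per-image feature computation using scikit-image follows; the GLCM offset (distance 1, angle 0) and the use of a uint8 image with 256 gray levels are illustrative assumptions not fixed by the text above.

```python
# Minimal sketch of GLCM texture feature extraction for one image.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(img_uint8, levels=256):
    glcm = graycomatrix(img_uint8, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    P = glcm[:, :, 0, 0]                       # normalized co-occurrence matrix
    i, j = np.indices(P.shape)
    mu_i, mu_j = (i * P).sum(), (j * P).sum()  # Eqs. (5) and (6)
    var_i = ((i - mu_i) ** 2 * P).sum()        # line variance
    var_j = ((j - mu_j) ** 2 * P).sum()        # column variance
    entropy = -(P[P > 0] * np.log2(P[P > 0])).sum()
    return {
        "contrast":    graycoprops(glcm, "contrast")[0, 0],
        "correlation": graycoprops(glcm, "correlation")[0, 0],
        "energy":      graycoprops(glcm, "energy")[0, 0],
        "homogeneity": graycoprops(glcm, "homogeneity")[0, 0],
        "entropy":     entropy,
        "line_var":    var_i, "col_var": var_j,
        "line_std":    np.sqrt(var_i), "col_std": np.sqrt(var_j),
    }
```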

Table 2 Features overview description

For each of the 3 planes (coronal, sagittal, and axial) of each study participant, each feature was averaged over the 9 original images and the 72 images resulting from their DWT decompositions. This yields 9 average features (1 value per feature) for each plane of each study participant. These average features were used in the mother wavelet and feature selection processes, described below, to improve the classification results; the averaging per plane was applied to decrease the data dimensionality and consequently the execution time of these selection processes. A sketch of this bookkeeping follows.
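A minimal sketch of the averaging step for one participant and one plane; the feature matrix is a placeholder:

```python
# Per-plane averaging: each of the 9 features is averaged over the
# 81 images of a plane (9 original slices + 9 slices x 8 sub-bands).
import numpy as np

plane_feats = np.random.rand(81, 9)   # placeholder: 81 images x 9 features
avg_feats = plane_feats.mean(axis=0)  # 9 average features for this plane
```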

3.4 Wavelet Selection Process

The extracted features were used for binary classification within the pairs CN vs MCI, AD vs MCI, and CN vs AD, and for the multi-class classification All vs All. All classifications were performed using the information of each of the 3 planes (coronal, sagittal, and axial) separately and also using the information of the 3 planes together.

Since the values of each feature depend on the mother wavelet used in the DWT decomposition, a search was performed to find the five wavelets that yield the features with the greatest discriminant capacity across all study group pairs (CN vs MCI, AD vs MCI, CN vs AD, and All vs All) and all study planes (coronal, sagittal, axial, and 3 planes). The evaluated wavelet families were Haar, Daubechies (Db), Symlets (sym), Coiflets (Coif), Biorthogonal (Bior), Reverse biorthogonal (rbio), Meyer, and Fejer-Korovkin (fk). The average features were used for this purpose.

The average values of each feature were separated for each combination of study group pair, study plane, wavelet, feature, and subband (or full-band). Each combination that uses only 1 plane leads to 1 value per study participant, whereas each combination that uses the 3 planes together leads to 3 values per study participant. Within each combination, including all study participants, the average values were normalized using the z-score [25] and then submitted to the Kruskal-Wallis (KW) test [26]. The KW test was used to determine whether the null hypothesis that the data of the study groups come from the same distribution can be rejected: p-values lower than 0.05 indicate a significant difference between the distributions, and the null hypothesis is rejected [26]. It is worth mentioning that, for the multi-class study group All vs All, the p-values were corrected by the Bonferroni method [27].
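A minimal Python sketch of this screening step for one feature/wavelet/subband combination, assuming SciPy and placeholder data; the number of Bonferroni tests shown is illustrative:

```python
# Z-score one average feature across all 89 participants, then test
# whether the study groups share the same distribution (Kruskal-Wallis).
import numpy as np
from scipy import stats

values = np.random.rand(89)  # placeholder: one average feature, 89 subjects
groups = np.repeat(["CN", "MCI", "AD"], [24, 44, 21])

z = stats.zscore(values)                        # normalization step
samples = [z[groups == g] for g in ("CN", "MCI", "AD")]
_, p = stats.kruskal(*samples)                  # KW test p-value

n_tests = 3                                     # assumed number of comparisons
p_corrected = min(p * n_tests, 1.0)             # Bonferroni correction
significant = p_corrected < 0.05
```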

Figure 4 shows the 15 cases with the highest number of average features that reject the null hypothesis, together with the corresponding wavelets. The five wavelets with the highest numbers of significant features were Biorthogonal 1.1, Reverse Biorthogonal 1.1, Reverse Biorthogonal 1.3, Reverse Biorthogonal 1.5, and Reverse Biorthogonal 3.1. These wavelets were chosen for the feature selection and classification steps.

Fig. 4 Best performances in the Kruskal-Wallis test and the corresponding wavelets

3.5 Feature Selection and Classification

As mentioned earlier, classification within each study group pair (CN vs MCI, AD vs MCI, CN vs AD, and All vs All) was carried out for each study plane (coronal, sagittal, axial, and 3 planes). To improve the execution time and the classification results, for each combination of study group pair and study plane, a search was carried out to find the features, computed with the five selected wavelets, that result in the highest classification accuracy. Once again, the average features were used for selection purposes.

The non-normalized average values of each feature were separated for each combination of study group pair and study plane. Each combination initially had 369 features (9 features \(\times\) 8 images resulting from the DWT decomposition \(\times\) 5 wavelets + 9 features \(\times\) 1 original plane image) for each plane of each study participant included in the study group pair. Within each combination, including all study participants belonging to the corresponding study group pair, the average values were normalized using the z-score [25]. The normalized average values of all features were then applied as inputs to a cascade of an F-score algorithm [28] and a classical machine learning (cML) algorithm to select, according to the maximum classification accuracy, the best set of features. The F-score algorithm individually assesses and rates the features based on their F-score; the features with an F-score above the average are chosen as the relevant ones [28].
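A minimal Python sketch of the F-score ranking, following the classical two-class definition of [28]; the data, labels, and group sizes are placeholders:

```python
# Rank features by F-score and keep those scoring above the average.
import numpy as np

def f_scores(X, y):
    """F-score of each column of X for binary labels y (0/1)."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

X = np.random.rand(45, 369)               # e.g. CN+AD subjects x 369 features
y = np.r_[np.zeros(24), np.ones(21)]      # 0 = CN, 1 = AD
scores = f_scores(X, y)
selected = np.where(scores > scores.mean())[0]  # above-average features
```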

The number of features selected by the F-score algorithm ranged from 2 to 9 in unit steps and from 10 to all in steps of 5. The cML algorithms were different configurations of decision trees, discriminant analysis, naive Bayes, support vector machines (SVM), k-nearest neighbors (KNN), and ensembles. In addition to the cML algorithms, a convolutional neural network (CNN) was also applied. For each combination of study group pair and study plane, the CNN was fed with the sets of selected features that, used as inputs to the cML algorithms, led to the best classification result. The classifiers and their configurations are described in Table 3. In all cases, in order to verify the generalization capacity of the classifiers, a leave-one-out cross-validation procedure was used, a well-known process that allows the whole dataset to be used for testing without leakage between the train and test sets [29].
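For illustration, a minimal Python sketch of the leave-one-out evaluation of one cML configuration on a selected feature set; the quadratic-kernel SVM mirrors one of the configurations mentioned in Sect. 4, but the data and hyperparameters are placeholder assumptions:

```python
# Leave-one-out cross-validation of one classifier configuration.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

X = np.random.rand(45, 35)            # placeholder: selected features
y = np.r_[np.zeros(24), np.ones(21)]  # 0 = CN, 1 = AD

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="poly", degree=2)     # quadratic SVM (assumed settings)
    clf.fit(X[train_idx], y[train_idx])    # train on all subjects but one
    correct += clf.predict(X[test_idx])[0] == y[test_idx][0]

accuracy = correct / len(y)                # fraction of held-out hits
```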

Table 3 Used classifiers and optimal parameters

4 Results and Discussion

For each combination of study group pair and study plane, the highest classification accuracy achieved using the cML algorithms, and the corresponding number of selected features (ft), are shown in Table 4. The classification accuracy achieved employing the CNN, and the corresponding number of selected features (ft) and study plane, are shown in Table 5.

Table 4 Classical machine learning classification per plane
Table 5 Summary of the DL classification results

Scrutiny of Table 4 reveals that, for the study group pair CN vs AD, the highest classification accuracy achieved with the cML algorithms was 93.3%, obtained both with 35 features from the sagittal plane and with 115 features selected from the 3 planes, in both cases with the bagged trees classifier. The lowest classification accuracy achieved with the cML algorithms was 77.8%, using the axial plane. For this study group pair, as indicated in Table 5, the highest classification accuracy achieved with the CNN was 82.2%, using the 115 features selected from the 3 planes.

For the pair AD vs MCI, Table 4 shows that the highest classification accuracy achieved with the cML algorithms was 87.7%, using 80, 95, and 140 features from the coronal plane and the quadratic SVM classifier. The lowest classification accuracy achieved with the cML algorithms was 78.5%, using the sagittal plane. For this study group pair, as indicated in Table 5, the highest classification accuracy achieved with the CNN was 75.4%, using the 95 features selected from the coronal plane.

Regarding the pair CN vs MCI, Table 4 shows that the highest classification accuracy achieved with the cML algorithms was 88.2%, using 30, 40, 60, 65, 70, 75, 80, 85, and 90 features selected from the coronal plane and the fine KNN classifier. The lowest classification accuracy achieved with the cML algorithms was 78.5%, using the sagittal plane. For this study group pair, as indicated in Table 5, the highest classification accuracy achieved with the CNN was 83.8%.

Concerning the study group pair All vs All, as indicated in Table 4, the highest classification accuracy achieved with the cML algorithms was 75.3%, using 80, 95, 105, and 115 features selected from the coronal plane and the subspace KNN classifier. The lowest classification accuracy achieved with the cML algorithms was 65.2%, using the sagittal plane. Table 5 shows that, for this study group pair, the highest classification accuracy achieved with the CNN was 64%, using 80, 85, and 95 features selected from the coronal plane. The lowest classification results were obtained for this study group pair, indicating that the multi-class classification is the one in which the extracted features and the ML algorithms have the most difficulty in discriminating between the groups.

Analyzing the results, the CNN did not obtain classification accuracies higher than the cML algorithms in any of the four study group pairs. In fact, except for the pair CN vs MCI, the best result achieved with the CNN is worse than the worst result achieved with the cML algorithms. The overall poor performance of the CNN may be due to a non-optimal selection of the features applied to its inputs, since the features were selected by the F-score algorithm combined with the cML algorithms, not with the CNN.

The only results above 90% were obtained for the pair CN vs AD. This overall high performance was expected because CN and AD are the groups with the greatest anatomical differences in the brain [30]. The 88.2% accuracy achieved for the pair CN vs MCI, although lower, is particularly important because, given the lack of a cure for Alzheimer’s disease, early detection plays a key role in medical intervention to reduce brain damage, preserve daily functioning for longer, and give the patient time to plan for the future.

Among the study planes, the coronal plane yielded the best overall classification accuracies. This result is supported by previous studies [31, 32] and can be justified by the fact that the coronal plane offers a clearer view of 3 of the brain structures most relevant to AD, namely, the cerebral cortex, the ventricles, and the hippocampus. Consequently, the coronal plane appears to allow the best visualization of the differences in the various anatomical regions of the 3 groups studied.

It is worth noting that the results presented and discussed above were obtained by using all study participants in the wavelet and feature selection. Although easily found in the literature, this is not the most rigorous way to select features because it introduces a risk of overfitting. The selection was performed this way due to the small size of the database, and the risk was mitigated by the leave-one-out cross-validation employed in the performance evaluation.

A comparison between the classification results obtained in the present work and those found in the literature also using the ADNI image database is presented in Table 6. Not all state-of-the-art methods performed the three binary classifications carried out in the present work, with most focusing on the pair CN vs AD; more importantly, only three of the state-of-the-art methods carried out the multi-class classification All vs All.

Table 6 Comparison with previous works with ADNI database

For the pair CN vs MCI, crucial for early detection, the sMRI-based method proposed in the present work outperformed the methods developed in Ruiz et al. [8], Lebedev et al. [33], Zhang et al. [34], Thapa et al. [9], and Liu et al. [35] by 21, 19, 16, 14, and 14%, respectively. However, the 88% accuracy achieved in the present work is 1% lower than that obtained in Lee et al. [36].

In the AD vs MCI case, the 88% achieved in the present work is 19, 18, 16, and 12% higher than that obtained in Ruiz et al. [8], Lebedev et al. [33], Lee et al. [36], and Zhang et al. [34], respectively, but 3% lower than that obtained in Thapa et al. [9]; all of these are sMRI-based methods.

For the pair CN vs AD, compared only with sMRI-based methods, the 93% achieved in the present work is 14, 10, 7, and 7% higher than that obtained in Lebedev et al. [33], Qiu et al. [14], Zhang et al. [34], and Ruiz et al. [8], respectively, but 6% lower than that obtained in Thapa et al. [9]. Regarding the multi-class classification All vs All, the proposed method stands out for achieving the highest accuracy, outperforming the methods developed in Lebedev et al. [33], Zhang et al. [34], and Lee et al. [36] by 34, 23, and 4%, respectively.

Compared with diagnostic methods based on imaging techniques other than sMRI, the proposed method outperformed the methods developed in Liu et al. [35] and Cheng et al. [37] by 2 and 1%, respectively, but is surpassed by 4% by the fMRI-based method developed in Amini et al. [11]. Although the above comparisons are evidence of the proposed method’s ability to discriminate the different stages of AD, they should be analyzed carefully, since different works may use different numbers of subjects, or the same number but different subjects, even when the database is the same.

In addition to the ADNI database, the sMRI-based method developed in Qiu et al. [14] was also originally evaluated on other image databases, and these results are summarized in Table 7. For the pair CN vs AD, the classification accuracy achieved by applying the proposed method to the ADNI database also outscores those obtained by applying the method developed in [14] to the AIBL, FHS, and NACC databases. Besides the different features computed from the images, a factor that may contribute to the better overall performance of the proposed method is the feature selection, a procedure not performed in [14]. Although enriching, these comparisons must be analyzed carefully because different image databases were employed in the studies.

Table 7 Comparison with previous imaging works with different databases

A comparison between the classification results obtained in the present work and those found in the literature using signal- and biomarker-based techniques is summarized in Table 8.

Table 8 Comparison with non-imaging works

The proposed sMRI-based method did not present the best performance in any of the analyzed study group pairs. For the pair CN vs MCI, it outperformed the method developed in [38] by 11% but is surpassed by the method introduced in [39] by 10%.

In the MCI vs AD case, the proposed method outscored the methods developed in [40], [41], and [38] by 10, 9, and 5%, respectively, but is outperformed by the method elaborated in [39] by 6%. For CN vs AD, the proposed method outperformed both the methods developed in [40] and [41] by 10% but is outscored by the method presented in [38] by 2%. In the multi-class All vs All case, the proposed method did not outperform the EEG-based methods developed in [38] and [39], being surpassed by 21% by the latter.

5 Conclusion

Alzheimer’s disease is one of the most prevalent neurodegenerative diseases, affecting millions of people worldwide. This work aimed to discriminate between the CN, MCI, and AD groups using sMRI. A set of co-occurrence matrix texture measures (contrast, correlation, energy, homogeneity, entropy, variance, and standard deviation) was extracted from a two-level DWT decomposition of sMRI images. The discriminant capacity of the measures was analyzed, and the most discriminant ones were selected as features to feed classical machine learning algorithms and a CNN. The classical algorithms achieved the following classification accuracies: 93.3% for AD vs CN, 87.7% for AD vs MCI, 88.2% for CN vs MCI, and 75.3% for All vs All. The CNN achieved 82.2% for AD vs CN, 75.4% for AD vs MCI, 83.8% for CN vs MCI, and 64% for All vs All. For the All vs All comparison, the proposed method outperformed the highest classification accuracy of the state-of-the-art sMRI-based methods by 4%.

The accuracies achieved for AD vs CN, AD vs MCI, and CN vs MCI indicate that the evaluated measures have a great ability to discriminate within these binary groups. However, despite surpassing the state-of-the-art, additional research should be conducted to improve the accuracy of the challenging multi-class classification All vs All. Despite the promising results, the database size was a limitation of the present study because all study participants had to be used for the wavelet and feature selection tasks. In future work, the approach should be evaluated on a larger sMRI database that can be divided into training and testing subsets.