1 Introduction

Alzheimer’s disease (AD) is a progressive, degenerative brain disease and the leading cause of dementia in older people. It causes cognitive decline that eventually leads to the inability to carry out daily activities [1]. AD not only reduces the quality of life of patients, but it also places additional stress on caregivers. The accumulation of the amyloid peptide is linked to AD, and the symptoms often begin with minor memory loss before progressing to other brain dysfunctions [2]. Since there is no cure for AD, early detection in the prodromal stage, i.e., mild cognitive impairment (MCI), is vital. Early MCI (EMCI) is the initial stage of cognitive impairment within MCI [3]. Early diagnosis of EMCI may prevent it from progressing to AD [3]. Studies have stressed the relevance of diagnosing MCI patients by identifying the differences between EMCI and late MCI (LMCI) groups [4,5,6]. Neuroimaging has become a crucial diagnostic tool for AD due to the rapid advancement of neuroimaging technologies [7, 8]. Non-invasive techniques such as MRI and PET are routinely employed to record brain tissue features [9, 10]. By evaluating brain images captured with PET and MRI, volumetric reduction (atrophy) in parts of the brain can be used as an essential biomarker for AD [11, 12].

PET imaging is an important functional technology that allows clinicians to study activity in the human brain swiftly and precisely, with the potential for the early detection of AD [13, 14]. PET images obtained using radioactive 18-fluorodeoxyglucose (FDG) provide sensitive estimates of the glucose metabolic rate in the brain [15], which can be used to trace the progression of the disease from normal cognition (NC) to AD. FDG-PET is particularly useful when it is difficult to distinguish physiological from pathological alterations in the anatomy. The volume of brain structures diminishes with age (particularly in the elderly) [16], making it difficult to identify whether a person’s brain is in a normal or diseased state using MRI alone. PET can detect the AD status of people more effectively in these cases. For example, Ozsahin et al. utilized PET data for the automated classification of AD groups [17]. The authors of [18] predicted the risk of AD with a deep learning model by extracting FDG-PET image features. Jo et al. established a deep learning-based system for the categorization of AD that recognizes the morphological phenotypes of tau deposition in tau PET images [19]. Liu et al. used a multiscale deep neural network to learn the patterns of metabolic changes due to AD pathology by analyzing PET images [20].

Because of its capacity to show distinct atrophy patterns in the brain and its high resolution for soft tissue, structural MRI (sMRI) is useful across the spectrum of AD [21]. For many years, structural information about the brain has been widely employed for the early detection and diagnosis of AD [22, 23], due to its universality in clinical practice and convenience of examination [24]. MRI has confirmed the pattern of AD progression seen in postmortem brain tissue research [25]. The temporal and parietal lobes, as well as sections of the frontal cortex and cingulate gyrus, degenerate as AD progresses, resulting in extensive atrophy of the affected regions [26]. Structural alterations in the patient’s brain can be observed with MRI. Taheri et al. used gray matter (GM) images extracted from sMRI with a CNN architecture for the diagnosis and classification of the CN, EMCI, and LMCI groups [27]. Mehmood et al. applied tissue segmentation to sMRI to extract the GM tissue, and VGG layer-wise transfer learning was used to distinguish between EMCI and LMCI patients [28]. Yue et al. employed a Deep Convolutional Neural Network (DCNN) on sMRI to extract the most useful spatial features of GM, further segmented into ninety regions, for LMCI vs. EMCI classification [29]. Liu et al. extracted structural and functional features for distinguishing EMCI subjects from LMCI subjects [30]. Wee et al. used a spectral graph-CNN-based system for the early detection of AD that used sMRI cortical thickness and its underlying geometric information [31]. Sheng et al. combined sMRI features and genetic features for six binary classifications (HC vs. AD, HC vs. EMCI, HC vs. LMCI, EMCI vs. LMCI, EMCI vs. AD, and LMCI vs. AD) [32]. Jiang et al. utilized the volumetric features of sMRI data to train a VGG16 CNN with transfer learning for the classification of EMCI vs. NC [24].

PET imaging can capture brain metabolism characteristics to aid in the detection of lesions, whereas structural MRI can reflect changes in brain structure [33]. Iaccarino et al. assessed gray matter reduction in the early MCI stage as well as FDG-PET metabolic connectivity; the results showed that multimodal data provide a clinically important analysis [34]. Researchers have therefore proposed multimodal input based on MRI and PET images to improve classification accuracy. Forouzannezhad et al. developed a Deep Neural Network (DNN) model with three hidden layers to obtain the relevant information from MRI and PET data for the classification of the AD groups [35]. The model classified six binary groups; their findings revealed that the sensitivity of the EMCI vs. AD classification is higher than the specificity for the combined MRI + PET modality. Hao et al. extracted MRI and PET features with consistent metric constraints by computing pairwise similarity measures for the PET and MRI modalities, and the extracted features were used as input to an SVM for classification [36]. The model could successfully retain the features’ structural information, with higher sensitivity than specificity in the LMCI vs. EMCI task.

The idea of multimodal data fusion for medical diagnostics is not new [37, 38]. Khan et al. used it to fuse various modalities of brain MRI images (T1, T2, T1CE, and FLAIR) for brain tumor recognition [39], and Muzammil et al. used it to fuse computed tomography (CT) and MRI of the brain [40]. Maqsood et al. proposed a multimodal image fusion framework based on multiscale image matting and evaluated it on brain MRI and CT images [41]. Guo et al. proposed to fuse structural images, such as CT and MRI, with functional images, such as PET and single-photon emission computed tomography (SPECT) [42]. Zhang et al. extracted characteristics from MRI and PET data using a deep multimodal fusion network based on the attention mechanism, in which irrelevant information was suppressed [43]. At inference time, complementary information from MRI and PET features can be learned; even if a specific modality is absent, the single input and the related complementary information obtained from the pretrained model can still be used to forecast AD. Shao et al. proposed a feature correlation and feature structure fusion approach with a Support Vector Machine (SVM) [44]. The classification results showed that the model improved greatly in LMCI vs. EMCI classification when compared with other state-of-the-art methods. The authors suggested the need to further improve the binary classification of their model.

Recently, hybrid methods based on the combination of deep learning and heuristic or nature-inspired optimization methods have been proposed to enhance brain MRI image classification for AD diagnostics [45]. Pradhan et al. proposed a hybridized Salp Swarm Algorithm-based Extreme Learning Machine (ELM), in which the algorithm is used to optimize the ELM model for MRI classification [46]. Raghavaiah et al. used an Enhanced Squirrel Search Algorithm to select the optimal weight parameters of a deep neural network (DNN) architecture for AD stage classification [47]. In our previous work, the ResNet18 pretrained model was utilized for binary classification of AD using MRI from ADNI, proving its effectiveness in EMCI vs. AD and LMCI vs. AD with validation accuracies (VA) of 99.99% and 99.95%, respectively [48], while it was able to achieve 98.86% accuracy, 98.94% precision, and 98.89% recall in multiclass classification [49]. Odusami et al. utilized a ResD hybrid technique based on ResNet18 and DenseNet121, in which the data from the two pretrained models are mixed for classification [50]. Experiments reveal that the suggested hybrid ResD model achieved 99.61% (macro) precision. This inspired us to design an Agitated ResNet18 model using multimodal input images for the early detection of AD. The first convolution layer of ResNet18 is changed into an agitated layer, which is added to the main residual layer. This model takes advantage of the data extracted in the channel dimension and combines it with the original features at multiple scales.

In summary, there are two approaches for fusing PET and MRI images. The first is to use a multimodal deep neural network (DNN) that takes both modalities as input and outputs a diagnosis. The DNN can be trained using a dataset of subjects with AD and healthy controls; the network learns to extract features from both modalities and uses them to distinguish between the two groups. The second approach is to use a DNN to extract features from each modality separately, and then fuse the features using a fusion layer. This approach can be useful when the two modalities provide complementary information and the features from each modality are not directly comparable (a conceptual sketch of this variant is given below). This paper offers three significant contributions: the concatenation-based fusion of MRI and PET images, the in-3-channel ResNet18 model for the AD classification task, and experimental validation of the proposed methodology on images from the ADNI database. The experiments demonstrate that the use of multimodal features extracted from the channel dimension and deep supervision can improve the performance of the AD classification model.
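To make the distinction concrete, the sketch below illustrates the second, feature-level approach in PyTorch: each modality passes through its own encoder and the resulting feature vectors are joined by a fusion layer before classification. This is only a conceptual sketch for contrast with the input-level concatenation adopted in this paper; the class name, encoder structure, and layer sizes are illustrative assumptions, not part of the proposed model.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Hypothetical feature-level fusion: separate encoders, then a fusion layer."""

    def __init__(self, feat_dim=128, num_classes=2):
        super().__init__()
        self.mri_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.pet_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.fusion = nn.Linear(2 * feat_dim, num_classes)  # fuses the two feature vectors

    def forward(self, mri, pet):
        fused = torch.cat([self.mri_encoder(mri), self.pet_encoder(pet)], dim=1)
        return self.fusion(fused)
```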

The novelty of this research paper further lies in its contribution to the field of Alzheimer’s disease diagnosis using a combination of MRI and PET images. The paper presents a novel early feature fusion framework that concatenates PET and MRI images and trains a modified ResNet18 deep learning architecture on the combined dataset. The combination of MRI and PET images has been widely studied for the diagnosis of Alzheimer’s disease. However, early fusion, which combines the images at an earlier stage in the analysis process, remains relevant and important to explore further, even if it is not entirely new, in order to advance our understanding of AD and develop more effective diagnostic and treatment strategies. By combining the anatomical information from MRI with the functional information from PET, early fusion can provide a more accurate and reliable diagnosis of AD. The 3-in-channel approach is used to learn the most descriptive features of the fused images, leading to an improved binary classification of Alzheimer’s disease. Additionally, the paper provides an XAI model to explain the results, adding interpretability to the deep learning-based diagnosis. The experimental results on the ADNI database show promising accuracy and demonstrate the effectiveness of the proposed approach. To guide our research, we formulate the following Research Questions (RQ):

RQ1

How can the combination of MRI and PET images be used to improve the diagnosis of Alzheimer’s disease?

RQ2

What is the effectiveness of the proposed concatenation-based feature fusion framework for fusing MRI and PET images in the diagnosis of Alzheimer’s disease?

RQ3

How does the modified Resnet18 deep learning architecture perform in the classification of Alzheimer’s disease using fused MRI and PET images?

RQ4

Can the results of the deep learning-based diagnosis of Alzheimer’s disease be explained using the proposed Explainable Artificial Intelligence (XAI) model?

RQ5

How does the proposed approach compare with existing methods for diagnosing Alzheimer’s disease using MRI and PET images?

The remainder of the paper is organized as follows. Section 2 describes the dataset and the steps of our methodology, including data preprocessing, image denoising, intensity normalization, and the proposed modification of the ResNet18 neural architecture. Section 3 presents the results of the experiments. Section 4 discusses the results, while Sect. 5 compares the proposed model with previous studies. Finally, Sect. 6 presents the conclusions.

2 Materials and Methods

The overall architecture of our proposed model consists of two steps, namely, data preprocessing and classification with the in-3-channel ResNet18 model.

2.1 Materials

The data used in this study were collected from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We obtained spatially normalized MRI images and whole-brain PET images processed with the ADNI “Coreg, Avg, Standardized Image and Voxel Size” pipeline. Spatial normalization of MRI images involves aligning different brain images to a common reference space, which allows for meaningful comparisons between groups. Co-registration of PET images with MRI images is important because it allows for accurate localization of PET signals within specific brain regions. A total of 412 MRI and 412 PET scans were included in this study; all subjects received both imaging examinations, and each modality contains EMCI and LMCI groups. The middle slices of both MRI and PET, ranging from slice number 144 to slice number 153, were extracted for this study, and the Clinical Dementia Rating (CDR) was used to determine the cognitive status of each patient. The data distribution is provided in Table 1.
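For illustration, the snippet below sketches how the middle axial slices (144–153) could be extracted from a preprocessed NIfTI volume using nibabel; the file name and slice axis are assumptions for the example, not ADNI specifics.

```python
import nibabel as nib
import numpy as np

def extract_middle_slices(nifti_path, start=144, stop=153, axis=2):
    """Return a stack of 2D slices taken along the given axis of a 3D volume."""
    volume = nib.load(nifti_path).get_fdata()           # 3D array (H, W, D)
    index = [slice(None)] * volume.ndim
    slices = []
    for k in range(start, stop + 1):
        index[axis] = k
        slices.append(np.asarray(volume[tuple(index)], dtype=np.float32))
    return np.stack(slices)                              # shape: (10, H, W)

# Example (the path is a placeholder):
# mri_slices = extract_middle_slices("subject_001_mri.nii.gz")
```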

Table 1 Statistical data of MRI and PET from ADNI

2.2 Preprocessing Steps

To reduce the learning difficulty and enhance the proposed model performance on multimodal data, we utilized data processing steps consisting of noise removal and intensity normalization. Preprocessing is necessary to further improve the image quality.

2.2.1 Removal of Noise

Most MRI and PET images are noisy and typically include regions of low contrast. The original images are rotated by 90 degrees, and a mask of voxels with intensity values greater than ten is generated from the original images to form both the background mask and the brain mask. The generated mask is used to perform segmentation. Morphological dilation is further applied to the segmented images to perform non-linear operations related to the morphology of features in the images, such as boundaries and skeletons; dilation enlarges bright regions and shrinks dark regions. Figure 1 depicts the generated brain mask and the clean MRI and PET images.
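A minimal sketch of this masking step, assuming a 2D slice stored as a NumPy array, is given below; the threshold of ten follows the text, while the number of dilation iterations is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def clean_slice(image, threshold=10, dilation_iterations=2):
    """Build brain/background masks from an intensity threshold and clean the slice."""
    brain_mask = image > threshold                       # voxels brighter than the threshold
    background_mask = ~brain_mask
    # Dilation enlarges bright regions and shrinks dark regions around the boundary.
    brain_mask = ndimage.binary_dilation(brain_mask, iterations=dilation_iterations)
    cleaned = np.where(brain_mask, image, 0.0)           # zero out the background
    return cleaned, brain_mask, background_mask

# rotated = np.rot90(raw_slice)                          # 90-degree rotation (see text)
# clean, brain_mask, bg_mask = clean_slice(rotated)
```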

Fig. 1

Data preprocessing steps: a original image, b background mask, c generated brain mask, and d clean MRI image after noise removal

2.2.2 Intensity Normalization

In image processing applications concerning MR images, intensity normalization is a crucial preprocessing step. Because of the use of diverse equipment, MR images have an inconsistent intensity scale across (and within) facilities and scanners, pulse sequences, scan settings, and the environments in which the machines are located. Fuzzy C-means is used to find a mask for the white matter on the original MRI; the corresponding brain mask is shown in Fig. 1. A white matter mask for the image is created from the brain mask and the segmentation obtained from the morphological dilation. The white matter mask then serves as an input again: it is used to find an approximate mean of the white matter intensity in the target contrast and move it to a standard value. Figure 2 shows the white matter mask and the fuzzy C-means normalized MRI.
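The following is a hedged sketch of the white matter mask step, assuming the scikit-fuzzy package for the three-class fuzzy C-means clustering; selecting the brightest cluster as white matter reflects typical T1-weighted contrast and is our assumption, as the paper does not state how the class is chosen.

```python
import numpy as np
import skfuzzy as fuzz

def fuzzy_wm_mask(image, brain_mask, n_classes=3):
    """Cluster brain intensities with fuzzy C-means and return a white matter mask."""
    intensities = image[brain_mask].reshape(1, -1)        # shape (1, N) as cmeans expects
    cntr, u, *_ = fuzz.cluster.cmeans(intensities, c=n_classes, m=2.0,
                                      error=1e-5, maxiter=100, seed=0)
    wm_class = int(np.argmax(cntr[:, 0]))                  # brightest class ~ WM on T1 (assumption)
    labels = np.argmax(u, axis=0)                          # hard labels from fuzzy memberships
    wm_mask = np.zeros(image.shape, dtype=bool)
    wm_mask[brain_mask] = labels == wm_class
    return wm_mask

# wm_mask = fuzzy_wm_mask(clean, brain_mask)
```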

Fig. 2

Comparison of MRI image masks: a matrix mask, b white matter mask, and c fuzzy c-means normalized image

The overall process of the preprocessing technique is shown in Fig. 3.

Fig. 3

Workflow diagram of PET/MRI image preprocessing technique

2.3 Proposed in-3-Channel Resnet18 Model

After the completion of noise removal, fuzzy C-means normalization is applied to all the clean data for the segmentation of gray matter, so that the clean image is normalized to the mean of the tissue, as demonstrated in Fig. 4. The step is described as follows:

Fig. 4

Framework of the proposed model

Let \(T \subset B\) be the tissue mask for the image \(I\), where \(T\) is the set of indices corresponding to the locations of tissue in the image \(I\). Then the tissue mean is given in Eq. (1), and the segmentation-based normalized image is given in Eq. (2):

$$\mu = \frac{1}{\left| T \right|}\sum_{t \in T} I\left(t\right)$$
(1)
$${I}_{seg}\left(x\right)= \frac{c \cdot I\left(x\right)}{\mu }$$
(2)

where \(c \in {\mathbb{R}}_{>0}\) is a constant.
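Given a tissue mask, Eqs. (1) and (2) reduce to a few lines of code; the sketch below assumes the mask is a Boolean NumPy array, such as the white matter mask from the previous step.

```python
import numpy as np

def segmentation_normalize(image, tissue_mask, c=1.0):
    """Eq. (1): mean intensity over tissue voxels; Eq. (2): rescale the whole image."""
    mu = image[tissue_mask].mean()
    return c * image / mu

# normalized = segmentation_normalize(clean, wm_mask, c=1.0)
```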

In this study, three-class fuzzy C-means is used to obtain a segmentation of the tissue over the brain mask \(B\) for the T1-MRI or PET, and we arbitrarily set c = 1. Early fusion is performed on the normalized MRI and PET data by simple concatenation. The first convolution layer of ResNet18 is changed using in_channel = 3, out_channel = 64, kernel_size = (3, 3), stride = (1, 1), padding = (1, 1), and bias = True. The classification method is then designed using the extracted features to distinguish the EMCI subjects from the LMCI subjects.
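A minimal PyTorch sketch of these two steps follows: the normalized MRI and PET slices are concatenated into a three-channel input, and ResNet18’s first convolution is replaced with the stated 3 × 3 configuration. How the third channel is filled is not specified in the text, so duplicating the MRI slice here is purely an assumption, as is reusing ImageNet-pretrained weights for the remaining layers.

```python
import torch
import torch.nn as nn
from torchvision import models

def fuse_mri_pet(mri, pet):
    """mri, pet: tensors of shape (H, W); returns a (3, H, W) fused input (channel order illustrative)."""
    return torch.stack([mri, pet, mri], dim=0)

def build_in3_resnet18(num_classes=2):
    model = models.resnet18(pretrained=True)  # on newer torchvision: weights="IMAGENET1K_V1"
    # First convolution changed as described: in=3, out=64, 3x3 kernel, stride 1, padding 1, bias.
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1),
                            padding=(1, 1), bias=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # EMCI vs. LMCI head
    return model

# fused = fuse_mri_pet(mri_slice, pet_slice).unsqueeze(0)    # add a batch dimension
# logits = build_in3_resnet18()(fused)
```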

The classification workflow for the concatenated MRI and PET data is as follows. First, the model for the classification of AD classes is imported, and the MRI and PET data are input. Early fusion of the two neuroimaging modalities is performed by direct concatenation. Using a holdout of 80%, the fused data are divided into training and validation sets for model training and validation. If the optimal result is achieved, the model is further tested on new data to obtain the classification result; otherwise, the hyperparameters of the model are updated to reach an optimal result, as shown in Fig. 5.

Fig. 5

Workflow diagram of the classification process

To extract meaningful and key information from the multimodal data, we introduce the training algorithm, which minimizes the cross-entropy loss and updates the hyperparameters. Stochastic gradient descent is utilized to optimize the parameters of the proposed multimodal model. The pseudocode of the learning algorithm for AD classification is shown in Algorithm 1. The parameters of ResNet18 are initialized and the learning rate is set to η. A mini-batch of fused input data is sampled from the training set for network model training.

Assuming there are Q classes, the cross-entropy loss for a batch of R samples can be represented as follows:

$${Y}_{c}= -\frac{1}{R}\sum _{i=1}^{R}\sum _{j=1}^{Q} {v}_{j}^{i}\,\log\left({S}_{j}^{i}\right)$$
(3)

\({v}_{j}^{i}\) is the label of the ith sample for class j, and \({S}_{j}^{i}\) is the corresponding softmax probability. Applying gradient descent to the loss function gradually updates the parameters, and the multimodal network is evaluated on the validation set at a frequency of F. The best model is obtained after training for Fmax iterations with optimal hyperparameters.

Algorithm 1 Learning algorithm for AD classification
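Since the pseudocode is given in Algorithm 1, the following is only a schematic PyTorch rendering of the same procedure, assuming DataLoaders of fused images and labels; the hyperparameters mirror those reported in Sect. 2.4, and the evaluation frequency F is taken to be one epoch for simplicity.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=1e-4, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                   # Eq. (3) averaged over the batch
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0.1)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # mini-batch of fused data
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)               # softmax + cross-entropy
            loss.backward()                             # gradient of Eq. (3)
            optimizer.step()                            # SGD parameter update
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:                     # validation at frequency F = 1 epoch
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        print(f"epoch {epoch + 1}: validation accuracy {correct / total:.4f}")
```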

2.4 Experimental Setup

In this study, we designed the EMCI vs. LMCI binary classification task. In the first stage, noise removal and normalization were applied to each image. A train-validation split was used to assess the trained model’s generalization: the ADNI data were divided in the ratio 70%:30%, with 70% for training and the remaining 30% for validation. The effect of fuzzy C-means normalization and of the modification of the first convolutional layer was ascertained by training the MRI and PET data separately on an unmodified ResNet18, and then the fused data on ResNet18. Test samples were extracted from separate subjects outside of the training and validation sets. To reduce overfitting, we applied data augmentation with rotations of up to 15 degrees. The proposed model was implemented using the open-source library PyTorch and trained on an Nvidia TU116 (GeForce GTX 1660) GPU for ten epochs. The GPU architecture is highly efficient for training and deploying deep CNNs [51]. The optimizer used is stochastic gradient descent (SGD) with a learning rate of 0.0001, momentum of 0.9, and weight decay of 0.1, and the loss function used is cross-entropy. If the accuracy on the validation dataset did not improve after 5 epochs, or the loss on the validation dataset did not decrease within ten epochs, the learning rate was changed. To further reduce overfitting, ResNet18’s last layer was modified with a dropout of 0.5 and the number of epochs was increased. The standard accuracy measure was used to assess the proposed model’s performance.
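A hedged configuration sketch for this setup is shown below: the 15-degree rotation augmentation, the dropout of 0.5 before the final layer, and a plateau-based learning-rate change. ReduceLROnPlateau is our assumption for how “the learning rate was changed”; the paper does not name a specific scheduler.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Rotation augmentation of up to 15 degrees, as described in the text.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])

def add_dropout_head(model, num_classes=2, p=0.5):
    """Replace ResNet18's last layer with dropout (0.5) followed by a linear classifier."""
    model.fc = nn.Sequential(nn.Dropout(p=p),
                             nn.Linear(model.fc.in_features, num_classes))
    return model

# Optimizer as in the training-loop sketch; learning-rate change on a validation plateau:
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=5)
# scheduler.step(validation_accuracy)   # called once per epoch
```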

3 Results

We evaluated the performance of the proposed in-3-channel ResNet18 model on the binary AD diagnostic task by examining the impact of intensity normalization, as well as the effect of altering ResNet18’s first convolutional layer. Furthermore, we compared the results of our proposed model with existing techniques. Table 2 reports the training and validation accuracies of unmodified ResNet18 on original and intensity-normalized MRI data at the epochs yielding the best results, along with the corresponding training and validation accuracies of the proposed model on original and normalized MRI data.

Table 2 Training accuracy (TA) and Validation accuracy (VA) of the proposed model with or without intensity normalization, with or without change in the first convolution layer on MRI Data

Figure 6 shows the confusion matrix of the in-3-channel model for EMCI vs. LMCI classification with normalized data at 10 epochs, where label zero represents EMCI and label one represents LMCI.
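For reference, the confusion matrix in Fig. 6 can be produced from the model’s test-set predictions as sketched below, assuming scikit-learn is available; label 0 denotes EMCI and label 1 denotes LMCI.

```python
import torch
from sklearn.metrics import confusion_matrix

def test_confusion_matrix(model, test_loader, device="cuda"):
    """Collect predictions on the test set and return the 2x2 confusion matrix."""
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for x, y in test_loader:
            y_pred.extend(model(x.to(device)).argmax(dim=1).cpu().tolist())
            y_true.extend(y.tolist())
    return confusion_matrix(y_true, y_pred, labels=[0, 1])   # rows: true, cols: predicted
```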

Fig. 6

The proposed model’s confusion matrix on test data (normalized data)

4 Discussion

The early diagnosis of Alzheimer’s disease (AD) is crucial for the effective management of the condition. Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) are two imaging modalities that have been widely used in AD research. While MRI provides detailed structural information about the brain, PET allows for the assessment of metabolic and functional changes associated with AD. Deep learning models can fuse information from both modalities, learn complex relationships between the imaging data, and improve diagnostic accuracy compared to using either modality alone.

The results of the proposed model are presented in Tables 2, 3 and 4. The VA of the in-3-channel model on MRI data increased by 0.32% as the number of epochs increased from 5 to 10, as depicted in Table 2. For PET data, there was little improvement with the increase in the number of epochs used in the training and validation phases, as shown in Table 3. An appreciable increase is seen in the training accuracy and VA of the proposed model on the fused MRI and PET data, as shown in Table 4. Figure 6 represents the total number of correct and wrong classifications for both classes.

Unlike the existing approaches for extracting discriminative features from multimodal MRI and PET data, we propose a novel in-3-channel model by modifying the first convolution layer of the ResNet18 architecture to have three input channels. Each channel learns the representation of the combination of different modalities by utilizing the greatest number of available samples. The key benefit of this fusion is that it allows us to train our model with more samples, which improves the classification performance. The combination of MRI and PET images improves the diagnosis of Alzheimer’s disease by providing a more comprehensive view of the brain and its functioning. MRI measures the decrease in brain volume and can identify abnormalities in the mesial temporal cortex and other regions of the brain, while PET measures the decrease in glucose concentration in the temporoparietal association cortex. By combining these data, a more accurate diagnosis of Alzheimer’s disease can be made, because the two modalities provide complementary information about the brain, and their combination can lead to a more robust and reliable diagnosis. While several previous methods in the literature used separate feature selection methods based on features such as cortical thickness, shape, and regional volume [21, 30], our proposed model automatically learns discriminative features from multimodality data in an end-to-end way. The best-performing algorithm in [27] gave a sensitivity of 81.2%, a specificity of 66.9%, and an accuracy of 72.5%.

The results depicted in Table 3 show that the VA decreased from 96.70 to 94.10% when the number of epochs was increased while analyzing the original (non-normalized) PET images with the proposed method. One possible reason for the decrease in VA is that the learning rate may be too high or too low for PET data, as this decrease did not occur for the MRI data or the fused data. Our proposed model achieved an accuracy of 73.90% on the test data from the ADNI database. As a result, the use of early fusion improved diagnostic accuracy by considering the complex relationships between the imaging data. The proposed concatenation-based feature fusion framework is effective in the diagnosis of Alzheimer’s disease using fused MRI and PET images. The framework concatenates the two modalities and trains a deep learning architecture on the combined dataset. The 3-in-channel approach is used to learn the most descriptive features of the fused images, leading to improved accuracy in the binary classification of Alzheimer’s disease. The experimental results on the ADNI database show that the proposed framework achieves a classification accuracy of 73.90%, demonstrating its effectiveness in the diagnosis of Alzheimer’s disease.

Table 3 Accuracy with or without intensity normalization, with or without change in first convolution layer on PET Data
Table 4 Accuracy with or without intensity normalization, on concatenated MRI and PET Data

5 Comparison of Proposed Model with Existing Studies

This section addresses RQ3, RQ4, and RQ5. The modified ResNet18 deep learning architecture performs well in the classification of Alzheimer’s disease using fused MRI and PET images. The 3-in-channel approach allows the architecture to learn the most descriptive features of the fused images, leading to improved accuracy in the binary classification task. The experimental results on the ADNI database show that the modified ResNet18 architecture achieved a classification accuracy of 73.90%, as shown in Table 5, demonstrating its effectiveness in the diagnosis of Alzheimer’s disease using fused MRI and PET images. Three previous studies [34, 35, 44] used Deep Neural Network (DNN) or Support Vector Machine (SVM) models to classify EMCI vs. LMCI and reported varying levels of accuracy, specificity, and sensitivity. The proposed model uses ResNet18 (3-in-channel) and achieves a higher level of specificity compared to the previous models, but lower levels of accuracy and sensitivity. The proposed model’s novelty lies in the use of ResNet18 (3-in-channel) and its ability to achieve high specificity, which may have implications for the EMCI classification of individuals.

The results of the deep learning-based diagnosis of Alzheimer’s disease can be explained using the proposed Explainable Artificial Intelligence (XAI) model. The XAI model allows for the interpretation of the results of the deep learning-based diagnosis, making the results more transparent and understandable. This can be particularly useful for clinicians who may not have experience with deep learning models and want to understand why a certain diagnosis was made. The proposed approach compares favorably with existing methods for diagnosing Alzheimer’s disease using MRI and PET images. The concatenation-based feature fusion framework and the modified ResNet18 deep learning architecture provide a more comprehensive view of the brain and its functioning by combining MRI and PET images. The experimental results on the ADNI database show that the proposed approach achieved higher specificity than existing methods, demonstrating its effectiveness in the diagnosis of Alzheimer’s disease. Additionally, the proposed XAI model provides interpretability to the deep learning-based diagnosis, making the results more transparent and understandable. Table 5 compares the proposed model with other existing works in the classification of AD.

Table 5 Comparison of Proposed Model with Existing Methods

6 Conclusion

This paper proposed a methodology for combining neuroimaging data from PET and MRI images to make an early diagnosis of AD. We introduced a novel 3-channel phase feature learning model for early fusion that concatenates and integrates MRI and PET neuroimaging data simultaneously for the early diagnosis of AD. Our proposed model can learn latent representations of the multimodal data even in the presence of heterogeneous data; hence, the proposed model partially solves the issue of heterogeneity between the MRI and PET data. This 3-channel phase feature learning makes the maximum number of samples available during training based on multimodality data; thus, more imaging modalities could be added to the model. We achieved improved classification performance over existing techniques. Fusing MRI and PET images using deep learning models with additional preprocessing of the data is an important and relevant approach for the early diagnosis of AD. These models have the potential to improve diagnostic accuracy and can help to identify imaging biomarkers that are associated with the disease. The results showed that the use of intensity normalization and early fusion techniques significantly improved the classification accuracy of AD. The accuracy improvement is attributed to the better alignment of the image intensities and the integration of complementary information from both modalities. Further research is needed to optimize the use of these models in clinical practice by effectively fine-tuning the hyperparameters. Furthermore, the use of data-fused deep learning models can help to identify imaging biomarkers that are associated with AD, which could aid in the development of new therapies for the disease.