Early detection of Alzheimer’s disease based on the state-of-the-art deep learning approach: a comprehensive survey

Alzheimer’s disease (AD) is a brain disorder that causes a loss of function in a person’s daily activities. Owing to the rapid growth in the number of Alzheimer’s patients and the lack of accurate diagnostic tools, early detection and classification of Alzheimer’s disease remain open research areas. Detecting Alzheimer’s disease accurately and effectively is a goal of many researchers who aim to limit or overcome the progression of the disease. The main objective of the current survey is to present a comprehensive evaluation and analysis of the most recent studies on early AD detection and classification using state-of-the-art deep learning approaches. The article gives a simplified explanation of the system stages, such as imaging, preprocessing, learning, and classification. It addresses the broad categories of structural, functional, and molecular imaging in AD. The modalities covered are magnetic resonance imaging (MRI; both structural and functional) and positron emission tomography (PET; for assessment of both cerebral metabolism and amyloid). It reviews the preprocessing techniques used to enhance image quality. Additionally, the most common deep learning techniques used in the classification process are discussed. Although deep learning applied to preprocessed images has achieved high performance compared with other techniques, some challenges remain. The survey therefore also reviews challenges in image preprocessing and classification across several articles, covering what each article introduces, the techniques it uses, and how it addresses these problems.


Introduction
Alzheimer's disease (AD) is a neurological disease that disrupts brain function and slowly destroys brain cells, leading to memory loss and instability in a person's life [122]. AD pathogenesis is thought to be caused by the overproduction of amyloid-β (Aβ) and the hyperphosphorylation of tau protein. This results in the accumulation of Aβ plaques and tau neurofibrillary tangles, which disrupt nucleocytoplasmic transport in neurons and lead to cell death, which in turn causes loss of memory and learning [112]. Physicians diagnose patients against many criteria, of which imaging is an essential part. The common symptoms are (1) loss of motor function, (2) speaking difficulties, and (3) memory problems [12].
With economic development and the advent of computer and medical information processing technologies, doctors require fast and accurate ways to diagnose and detect the disease in order to help patients and save their lives. With traditional diagnostic approaches, patients pass through several stages before they are diagnosed, so the diagnosis can come late, when patients have already reached an advanced stage [109]. Early diagnosis of AD is therefore very important: it helps patients take precautions, helps clinicians assess the risk of AD progression, makes patients aware of the seriousness of the disease, and encourages them to take preventive steps such as lifestyle changes and drugs [87].
Researchers want to find a simple and accurate approach to identify Alzheimer's disease before symptoms appear. Early detection of AD aims to discover the symptoms before the disease reaches a high-risk stage. AD has several stages; the one in which the disease appears in its prodromal form is mild cognitive impairment (MCI). MCI is a stage of loss of memory or other cognitive abilities (such as language or visual/spatial perception) in people who can still perform most of their daily tasks independently [39].
Recently, many researchers have taken an interest in this field, aiming to improve the quality of patients' lives and to discover drugs by tracking the pathological processes related to the stages of AD [112]. Because AD is a progressive disease, it has several stages: cognitively normal (CN), mild cognitive impairment (MCI), late mild cognitive impairment (LMCI), and Alzheimer's disease (AD). Various neuroimaging technologies help researchers in classification; the most common one used to image brain tissue is magnetic resonance imaging (MRI) [6]. Deep learning is the most common technique, and among the best-performing ones, for the diagnosis and classification of disease when a large amount of input data is available [6].
Many surveys published recently review histopathological image analysis, including its history and detailed information on general artificial intelligence techniques [29,52,54,55,136,139]; their main limitation is the lack of a focus on Alzheimer's disease [29,52,54,63,152]. Accordingly, this survey presents image analysis from an Alzheimer's disease point of view.
The main objective of the current survey is to provide a comprehensive overview of the state-of-the-art image analysis and artificial intelligence techniques, specifically for histopathology images in AD, and their challenges. This survey focuses on 159 state-of-the-art related studies, where 110 papers concentrate mainly on Alzheimer's disease. Figure 1 depicts the corresponding statistical distribution of the studies used in the current survey.

Paper contributions
In summary, the current survey presents a comprehensive evaluation and analysis of the most recent studies on early AD detection and classification using state-of-the-art deep learning approaches. It also presents preprocessing techniques that can enhance image quality and improve the performance of the classification process. Moreover, it highlights the challenges reported in related studies and how they were overcome. In evaluating and analyzing the existing studies, several common trends and gaps have been identified.
The contributions of the current survey are summarized as follows:
(1) Introducing the process and stages of Alzheimer's disease diagnosis, from acquiring the data to classifying the disease.
(2) Presenting the types and modalities of brain imaging, such as MRI, PET, and CT, with their comparative advantages and disadvantages.
(3) Summarizing the most important basics and background related to the survey, with a focus on deep learning as one of the most important recent trends for improving these systems.
(4) Categorizing the most important and most recent research challenges.
(5) Comparing various articles and presenting the contribution of each, its most significant features, and the advantages (and disadvantages) of the solution it uses to solve a specific problem.
(6) Identifying the main open research points in this area.

Paper organization
The current survey is organized as follows. Section 2 presents related work on the diagnosis of AD. Section 3 introduces an overview of ML and deep learning (DL), with definitions and challenges. Section 4 gives an overview of the diagnosis of Alzheimer's disease and highlights the various methods used. Section 5 explores the most important problems and challenges associated with the early detection of AD and gives concluding remarks on them. We introduce some future possibilities in Section 6 and present our limitations in Section 7. Finally, the survey is concluded in Section 8.

Related work
Owing to the importance of early diagnosis of AD, many researchers have worked in this field. This section presents the most important research in the field. Table 1 compares different studies that diagnose AD with different DL techniques. Jain et al. [66] proposed a transfer learning approach for classifying MRI images. They used the PE SE CTL mathematical model to differentiate between three classes (AD, MCI, CN). First, they collected data from the ADNI dataset and preprocessed it with FreeSurfer (PE) to eliminate unnecessary information from the MRI images. The preprocessing comprised five steps: motion correction, non-uniform intensity normalization, Talairach transform computation, intensity normalization, and skull stripping. After preprocessing, they selected the most informative slices based on entropy (SE). Lastly, they used the pre-trained VGG16 model and transfer learning to build a classification model (CTL).
Ding et al. [35] introduced a CNN architecture based on an Inception v3 network trained on 90% of the ADNI data and tested on the remaining 10%. Fluorine 18 fluorodeoxyglucose PET images, acquired from the ADNI dataset, were processed using the grid method. An Otsu threshold was applied to detect brain voxels. The Adam optimizer was used with a learning rate of 0.0001 and a batch size of 8 to train the model. The model was trained on 90% of the dataset (1921 imaging studies), which includes three classes (AD, MCI, and no disease). The proposed architecture achieved 82% specificity and 100% sensitivity.
In Chitradevi et al. [28], several optimization algorithms (Genetic Algorithm, Particle Swarm Optimization, Grey Wolf Optimization, and Cuckoo Search) were used to segment the brain into sub-regions such as the hippocampus, white matter, and gray matter. The authors of [94] proposed three models and compared them to determine which achieves the highest accuracy. In the first model, images are preprocessed, handcrafted features are extracted, and classification is performed with a support vector machine, k-nearest neighbor, and Random Forest. In the second model, a CNN is trained from scratch on the preprocessed dataset. In the third model, AlexNet is used to extract deep features, which are fed to a support vector machine, k-nearest neighbor, and Random Forest to determine the best classifier. Of the three models, the deep-feature-based model with a support vector machine classifier achieved the best accuracy: the support vector machine reached 99.21%, compared with 57.32% for k-nearest neighbor and 93.97% for Random Forest.
Kundaram et al. [79] acquired data from the ADNI dataset and preprocessed the images by rescaling them to 255. CNN models were used for training and classification. Images were classified into three classes (AD, MCI, and NC), and 9540 images were used to train the model. The CNN model consists of three convolutional layers, three max-pooling layers, and four ReLU activation layers. They compared different optimizers, namely Adam, SGD, Adagrad, Nadam, Adadelta, and RMSprop. Within the proposed framework, Adagrad achieved the best accuracy with the lowest loss. The proposed model achieved 98.57% accuracy on the ADNI dataset.
In Table 2, we compare recent surveys of AD with DL against our survey. A description and the limitations of each survey are presented to show the differences and similarities between the other surveys and ours.

Machine Learning and Deep Learning overview
Machine learning (ML) is a branch of artificial intelligence that has become widespread and valuable in the last two decades. One definition of ML is that it is the semi-automated extraction of knowledge from data [15]. ML uses data to feed an algorithm that can learn the relationship between the input and the output. When the machine finishes learning, it can predict the value or the class of a new data point [64].
Deep learning is a subset of ML, as shown in Fig. 2. ML refers to algorithms that can learn without being explicitly programmed. Artificial intelligence is the technique of getting machines to work and behave like humans. In DL, the learning phase is carried out by an artificial neural network [14], an architecture in which layers are stacked on top of each other. There are different DL types such as convolutional neural networks (CNN), recurrent neural networks (RNN), and autoencoders [58].
Although the performance of ML models has progressively improved on many tasks, they still need guidance (e.g., from human experts) to solve some problems. Developers must adjust the architecture or algorithm if incorrect predictions are returned. A DL model, on the other hand, is adaptive and learns from the features automatically, whether the prediction is correct or not [89].
The major differences between ML and DL are summarized in Table 3, which compares them in terms of data size, training time, and interpretability. DL is a specific category (i.e., branch) of ML. In ML, relevant features are extracted manually from the input data. The extracted features are then used to update the model parameters, which supports correct prediction (i.e., classification) [146]. This does not apply to DL, where relevant features are extracted automatically from the data. In addition, DL implements end-to-end learning, in which the data and the task to be performed are passed to the network [26]. As mentioned, the learning process works automatically from the features, and hence no manual modifications (i.e., enhancements) are required. ML provides several approaches and models to choose from depending on the application, the size of the data being processed, and the type of problem to be solved [69]. To train the model, a high-performance DL application requires a very large amount of data (i.e., thousands of records) [51]. Over the last few years, DL has found extensive applications in image recognition [101,148], speech recognition [33,97], medicine and pharmacy [84,144], and natural language processing [100,155]. A large number of DL methods have been proposed recently [10], and these methods can be broadly classified into many algorithms that will be discussed in Section 3.

An overview on diagnosis of Alzheimer's disease
Diagnosis (i.e., classification) is an important area in computer science, as reflected by the number of recently published articles [43,151]. The AD diagnosis process passes through several stages to detect and classify the disease, as shown in Fig. 3.
From Fig. 3, the first stage is the data acquisition process for gathering the dataset required for diagnosis. The second stage is preprocessing the dataset to enhance the dataset quality and improve the performance of the classification task. The third stage is the splitting stage. The dataset can be split into training, testing, and validation subsets. The last stage is a learning system with proper and specific techniques to extract the features, learn from the data, update the parameters, and classify the disease to a specific class. The following subsections discuss these stages in detail.
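The splitting stage can be illustrated with a minimal sketch. The 70/15/15 proportions, the per-class (stratified) shuffling, and the function name `stratified_split` are assumptions chosen for illustration; the surveyed studies use their own splits.

```python
import random
from collections import defaultdict

def stratified_split(labels, train=0.7, val=0.15, seed=0):
    """Return (train_idx, val_idx, test_idx), keeping class proportions.

    The 70/15/15 default is an assumption, not a value from the survey.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]   # remainder goes to testing
    return train_idx, val_idx, test_idx

# Toy label list standing in for an image dataset with three classes.
labels = ["AD"] * 20 + ["MCI"] * 20 + ["CN"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the class ratios similar across the three subsets, which matters when the classes are imbalanced.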

Data acquisition stage
The first stage is the acquisition of raw data. Images describe inner aspects of the body and can be captured with different modalities or techniques. In this stage, we collect neuroimages, which may rely on different physical principles. Neuroimaging modalities include (1) Magnetic Resonance Imaging (MRI), (2) Positron Emission Tomography (PET), (3) functional Magnetic Resonance Imaging (fMRI), and (4) Computed Tomography (CT) [138]. The choice among these modalities depends on the researcher's preference, the task, and the model used. Datasets can be acquired from different organizations such as hospitals, clinical centers, radiology centers, and online websites [102,149]. Normally, AD can be diagnosed with three different methods, as depicted in Fig. 4: (1) memory tests through history and discussion, mental status tests, and neuropsychological tests, (2) laboratory tests, and (3) brain imaging scans [21]. The role of neuroimaging in AD research and practice has been revolutionized in recent years. Diagnostically, imaging has moved from a minor exclusionary role to a central position [72]. In research, imaging is helping to address numerous scientific questions. At the same time, the potential of brain imaging has expanded rapidly with new modalities and innovative ways of acquiring and analyzing images [141]. The specific modalities covered are magnetic resonance imaging (MRI; both structural and functional) and positron emission tomography (PET; for assessment of both cerebral metabolism and amyloid).
Imaging Module and Types
These modalities have different strengths and limitations and, as a result, have different and often complementary roles and scope [108]. Although additional data are required, imaging is beginning to offer prognostic information at this early preclinical phase. The need for an earlier and more definite diagnosis will only grow as disease-modifying therapies are identified [40]. This will be particularly true if, as expected, these therapies work best (or only) when introduced at the preclinical stage. Table 4 summarizes the different brain imaging modalities in AD [73].
Table 4 compares the advantages and disadvantages of the neuroimaging modalities and shows open research areas for three modalities. The type of brain scan depends on the type of disease; the options include CT, MRI, and PET imaging. Structural imaging, such as CT and MRI, provides information about the shape and volume of the brain, whereas functional imaging shows the activity of the brain and how its cells work.

Preprocessing techniques
Datasets, especially images, may contain noise and distortions. Radiography noise is generally caused by changes in the sensitivity of the detector, diminished illumination of the object (i.e., low contrast), photographic limitations, and spontaneous variations in the radiation signal [111]. It is therefore essential to preprocess the data to enhance its quality or to optimize its geometric and intensity patterns [107]. Preprocessing lets researchers concentrate on a specific part of the brain and highlights the most vital information required in the classification process. Many preprocessing techniques exist, and the most common ones are depicted in Fig. 5. Figure 5 divides AD diagnosis into two main steps: first, the preprocessing techniques applied to the images; second, the techniques used for learning and classifying the disease. The choice among these techniques depends on the problem the researcher wants to solve and on the type of input data.

Intensity normalization
Intensity normalization is an essential preprocessing technique that maps the intensities of all image pixels to a reference scale [133]. In general, data collected from different sources, or from the same source at different points in time, may not have identical intensity ranges [120]. For example, normalization can calibrate the pixel intensities to the normal distribution, as depicted in Fig. 6 [115].

Table 4 (excerpt): brain imaging modalities in AD
Functional MRI. Advantages: a noninvasive technique that delivers an indirect measure of neuronal activity, inferred from changes in blood-oxygen-level-dependent (BOLD) contrast; it has documented the organization of the brain into multiple large-scale brain networks and can determine the whole network of brain areas engaged when a person performs specific tasks [19,157]. Limitations: the BOLD fMRI response is known to be variable across subjects, and very few studies have examined the reproducibility of fMRI activation in older [...]; the responsiveness of the blood supply to the electrical signals that define neural transmission is poor [4].
FDG PET. Advantages: widely accepted as a valid biomarker of overall brain metabolism, to which ionic gradient maintenance for synaptic activity is the principal contributor. Limitations: it requires intravenous access and involves exposure to radioactivity, although at levels well below the significant known risk.
Amyloid PET. Advantages: allows the determination of brain Aβ content to move from the pathology laboratory into the clinic; amyloid imaging can detect cerebral β-amyloidosis and appears specific for this type of amyloid pathology, giving negative signals in pathologically confirmed cases of prion amyloid. Limitations: cost-effectiveness and availability constrain the widespread use of amyloid PET [104].
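The intensity normalization described at the start of this subsection can be sketched as follows. Z-score normalization (zero mean, unit standard deviation) is assumed here as one common way to bring scans onto a shared reference scale; the survey does not fix a specific formula, and the function name and toy scans are illustrative.

```python
import statistics

def zscore_normalize(image):
    """Map pixel intensities to zero mean and unit standard deviation.

    Z-scoring is an assumed choice of reference scale for illustration.
    """
    pixels = [p for row in image for p in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:                      # constant image: nothing to scale
        return [[0.0 for _ in row] for row in image]
    return [[(p - mean) / std for p in row] for row in image]

# Two toy "scans" of the same anatomy with different intensity ranges,
# as might come from two scanners.
scan_a = [[10, 20], [30, 40]]
scan_b = [[200, 300], [400, 500]]

norm_a = zscore_normalize(scan_a)
norm_b = zscore_normalize(scan_b)
```

After normalization the two toy scans have matching values, because they differ only by a linear change of intensity scale.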

Contrast enhancement
Contrast is the difference between the highest and lowest pixel intensities in an image, and contrast enhancement stretches this difference as shown in Eq. 1. It improves the quality of images and increases the contrast at borders, which helps to differentiate between organs. It also improves the brightness of the image by expanding the range of pixel values [80].
where fmin is the minimum pixel value, fmax is the maximum pixel value, f(x, y) is the value of each pixel in the image, and g(x, y) is the enhanced pixel value after contrast enhancement is applied [117]. Figure 7 shows a sample axial brain MRI image with high contrast.
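A minimal sketch of contrast enhancement using the symbols defined above. Eq. 1 is not reproduced in this text, so the code assumes the common min-max stretching form g(x, y) = (f(x, y) - fmin) / (fmax - fmin) scaled to an 8-bit output range; both the form and the 255 upper bound are assumptions.

```python
def stretch_contrast(image, out_max=255):
    """Min-max contrast stretching (assumed form of the enhancement).

    g(x, y) = (f(x, y) - f_min) / (f_max - f_min) * out_max
    """
    f_min = min(min(row) for row in image)
    f_max = max(max(row) for row in image)
    if f_max == f_min:                # flat image: no contrast to stretch
        return [[0 for _ in row] for row in image]
    scale = out_max / (f_max - f_min)
    return [[round((p - f_min) * scale) for p in row] for row in image]

# Toy low-contrast image whose values occupy only the range 100..150.
low_contrast = [[100, 110], [120, 150]]
enhanced = stretch_contrast(low_contrast)
```

The output uses the full 0 to 255 range, so neighboring intensities that were 10 units apart become 51 units apart.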

Denoising process
Median filter
The median filter is a technique used to reduce noise without blurring edges [110]. It is especially appropriate for enhancing MRI images.
The median filter identifies noisy pixels by comparing each pixel in the image with its neighboring pixels [117]. A filter (i.e., kernel) of a specific size is passed over each pixel in the image, and the pixel value is replaced with the median of its neighborhood. The median is determined by sorting the values of the surrounding pixels and then replacing the pixel with the middle value [127]. Figure 8 shows the effect of applying a median filter of size (3 * 3) to a sample image.
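The procedure above can be sketched in a few lines. Border pixels are handled here by repeating the nearest edge pixel, which is an assumption; the text does not say how borders are treated.

```python
import statistics

def median_filter(image, size=3):
    """Replace each pixel with the median of its size x size neighborhood.

    Borders use edge replication (an assumption for illustration).
    """
    pad = size // 2
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-pad, pad + 1)
                for dx in range(-pad, pad + 1)
            ]
            out[y][x] = statistics.median(window)
    return out

# Toy image with a single "salt" noise pixel (255) on a uniform background.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
clean = median_filter(noisy)
```

Because the outlier is never the middle value of any sorted window, it is removed entirely rather than smeared over its neighbors.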
Gaussian filter
Gaussian filtering is a technique that denoises the image and is performed by choosing the size of the mask, as shown in Eq. 2 [159].
where σ is the standard deviation, which defines the shape of the Gaussian, μ is the mean value, and x is the input value. Figure 9 shows the effect of applying the Gaussian filter to a sample image with a mask size of (3 * 3).
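A minimal sketch of Gaussian filtering with a (3 * 3) mask. Eq. 2 is not reproduced in this text, so the code assumes the standard one-dimensional Gaussian density in σ, μ, and x, builds the two-dimensional mask as a normalized outer product, and replicates edge pixels at the border; all three choices are assumptions.

```python
import math

def gaussian(x, mu=0.0, sigma=1.0):
    """Standard Gaussian density (assumed form of the equation)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gaussian_kernel(size=3, sigma=1.0):
    """Build a size x size mask whose weights sum to 1."""
    half = size // 2
    g = [gaussian(i, 0.0, sigma) for i in range(-half, half + 1)]
    kernel = [[a * b for b in g] for a in g]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def gaussian_filter(image, size=3, sigma=1.0):
    """Weighted average of each neighborhood, with edge replication."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + half][dx + half] * image[yy][xx]
            out[y][x] = acc
    return out

kernel = gaussian_kernel()
smoothed = gaussian_filter([[0, 0, 0], [0, 90, 0], [0, 0, 0]])
```

Unlike the median filter, the Gaussian filter spreads an isolated bright pixel over its neighbors, which is why it tends to blur edges more.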

Brain extraction and skull stripping
Brain extraction, or skull stripping, is a preprocessing technique that removes non-brain tissues such as the eyes, neck, and skull. It segments these using the dark space between the skull and the brain that is occupied by cerebrospinal fluid (CSF). Many tools for brain extraction are presented in [75], such as the ROBEX brain extraction algorithm, which uses machine learning. DL can also be used, but it is more computationally intensive and needs specific hardware to run. Khademi et al. [75] used a Random Forest classifier for brain extraction. They aimed to find a binary segmentation mask for the brain. After obtaining the mask, they multiplied it with the original image, as shown in Fig. 10 [126].
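The final masking step can be sketched as an element-wise multiplication. The mask below is hand-written for illustration; in practice it would come from a segmentation method such as the classifier described above, and the function name is hypothetical.

```python
def apply_brain_mask(image, mask):
    """Multiply an image element-wise by a binary mask (1 = brain, 0 = other)."""
    return [
        [pixel * keep for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]

# Toy slice: bright skull ring (200) around brain tissue (80, 90).
slice_2d = [
    [200, 200, 200, 200],
    [200, 80, 90, 200],
    [200, 200, 200, 200],
]
# Hypothetical mask; a real one is produced by a brain extraction tool.
brain_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
brain_only = apply_brain_mask(slice_2d, brain_mask)
```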

Data augmentation
Data augmentation techniques help the model avoid overfitting [126]. Overfitting occurs when the validation error increases while the training error decreases; to build the best model, the validation error must keep decreasing along with the training error [5]. After collecting the data, data augmentation can be applied to increase the diversity of the images in each class. Techniques include cropping, shifting, shearing, scaling, and zooming [95]. Table 5 presents a literature review of preprocessing techniques and the methodologies used to diagnose AD. It also presents the advantages and effectiveness of preprocessing the images and shows the performance of the trained models.
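A few of these augmentation operations can be sketched on a toy image. The helper names, the 50% flip probability, and the one-pixel shift range are illustrative assumptions; deep learning libraries provide more complete versions of these transforms.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift right by `pixels` columns (0 <= pixels <= width), padding with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng):
    """Apply a random flip and a small random shift (assumed policy)."""
    out = image
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return shift_right(out, rng.randint(0, 1))

rng = random.Random(42)
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
augmented = [augment(image, rng) for _ in range(4)]
```

Each call produces a same-sized variant of the original image, so the augmented samples can be added to the training set without changing the network input shape.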

Classification
DL, as mentioned earlier, is a sub-field of ML [24]. DL is more effective than traditional ML approaches because it extracts features automatically [37]. Also, DL performs "end-to-end learning", where raw data and tasks are provided to the network [17]. Most researchers have relied on Convolutional Neural Network approaches for detecting Alzheimer's disease from MRI images rather than other DL techniques such as Recurrent Neural Networks (as shown in Fig. 11(b)), Deep Neural Networks (as shown in Fig. 11(a)), Autoencoders (as shown in Fig. 11(c)), and Deep Belief Networks (as shown in Fig. 11(d)) [8,62].

Deep neural network (DNN)
A DNN, as shown in Fig. 11(a), has an input layer, an output layer, and one or more hidden layers [50]. It is distinguished by its ability to handle complicated problems, capture the relationship between input and output data, and model complex non-linear relationships [42]. It is a supervised learning technique and is used in various areas of research to explore previously unknown patterns in the inputs [68]. It requires a large amount of training data to extract the features of the labeled images [98].

Convolution neural network (CNN)
A CNN is one of the most successful neural network techniques for image classification and recognition. As shown in Fig. 12, a CNN is composed of several convolution layers, pooling layers, activation layers, fully connected layers, and a classifier layer. The convolution layer is the essential layer that extracts feature maps by passing a learned filter (or kernel) of a specific size over the input image [42]. It is followed by an activation function that decides whether a neuron should be activated. The activation function applies a nonlinear transformation to its input, which makes the network capable of learning and performing more complex tasks [2]. There are numerous types of activation functions, such as sigmoid, tanh, and ReLU [123]. Pooling layers reduce the dimensionality while keeping the most important features, so they can be considered down-scalers [41]. A fully connected layer connects every neuron in the previous layer to all neurons in the current layer. Finally, the classifier layer selects the class (i.e., label) with the highest probability [7].
An important property of CNNs is that they can handle large datasets and achieve high performance in classification tasks [53]. From the transfer learning point of view [140], several CNN architectures are available, including LeNet and the ImageNet-trained VGGNet, GoogLeNet, ResNet, and AlexNet [118]. With a pre-trained CNN model, the developer can reuse the parameters (i.e., weights) of that model and transfer them to the new task [128,150].
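The layer sequence described above (convolution, activation, pooling, fully connected, classifier) can be traced with a minimal forward pass. The kernel and the dense weights below are hand-picked illustrative values, not learned parameters, and the class names are used only as labels for the three outputs.

```python
import math

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)] for y in range(out_h)]

def relu(fmap):
    """Keep positive activations, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Down-scale by keeping the maximum of each size x size block."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

def dense(vector, weights, biases):
    """Fully connected layer: one weight row per output neuron."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn scores into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5x5 image with a bright vertical band.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # illustrative, not learned

features = max_pool(relu(conv2d(image, edge_kernel)))
flat = [v for row in features for v in row]
weights = [[0.5], [-0.5], [0.1]]                     # illustrative, not learned
biases = [0.0, 0.0, 0.0]
probs = softmax(dense(flat, weights, biases))
predicted = ["AD", "MCI", "CN"][probs.index(max(probs))]
```

In a trained CNN the kernel and dense weights are learned from labeled data, and many kernels are applied in parallel to produce multiple feature maps.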

Recurrent neural network (RNN)
RNNs are used in sequence or time-series problems [91]. Their most important advantage is the use of memory and a hidden state. Figure 11(b) shows a sample RNN architecture with an input, a hidden, and an output layer. The hidden state is effective at remembering the relevant information about the sequence [106]. Another distinguishing characteristic of RNNs is that, unlike feedforward networks, they share the same parameters within each layer of the network [65]. Feedforward networks have different parameters for each node, which results in a large number of parameters [114].
RNNs do not have a large number of layers and are not as deep as CNNs or DNNs [88]. However, they are difficult to train and suffer from vanishing or exploding gradients, which limits their application to modeling long activity sequences and temporal dependencies in sensor data [81]. The common applications of RNNs are natural language processing [154], speech recognition [57], and language translation [27].
Long short-term memory (LSTM) and gated recurrent units (GRUs) are common RNN architectures [36]. The main purpose of the LSTM is to preserve the error signal as it propagates through the different layers and time steps [96]. Its hidden layer contains cells with three gates: an input gate, an output gate, and a forget gate. These gates are responsible for storing information and regulating its flow to predict the output of the network [114]. The cell helps the model decide what to store and when to read and update the information through the gates [98]. GRUs use a hidden state and have two gates: a reset gate and an update gate. These control what and how much information is retained [114]. On many tasks, GRUs are reported to perform better than LSTM [36,42].
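The role of the two GRU gates can be sketched with a single recurrent step. The weights below are arbitrary illustrative values, the parameter names are hypothetical, and the update rule uses the convention in which the update gate weights the candidate state; some references swap the roles of z and 1 - z.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(W_row, x)) +
            sum(u * hi for u, hi in zip(U_row, h)) + bi
            for W_row, U_row, bi in zip(W, U, b)]

def gru_step(x, h, p):
    """One GRU time step; returns the new hidden state."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h, p["bz"])]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h, p["br"])]  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated_h, p["bh"])]
    # Blend old state and candidate; z controls how much is overwritten.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

# Illustrative parameters: input size 1, hidden size 2 (not learned values).
params = {
    "Wz": [[0.5], [-0.3]], "Uz": [[0.1, 0.0], [0.0, 0.1]], "bz": [0.0, 0.0],
    "Wr": [[0.2], [0.4]],  "Ur": [[0.0, 0.1], [0.1, 0.0]], "br": [0.0, 0.0],
    "Wh": [[1.0], [-1.0]], "Uh": [[0.3, 0.0], [0.0, 0.3]], "bh": [0.0, 0.0],
}

h = [0.0, 0.0]
for x_t in [[1.0], [0.5], [-1.0]]:   # the same parameters are reused at every step
    h = gru_step(x_t, h, params)
```

Reusing `params` at every time step is the parameter sharing that distinguishes recurrent networks from feedforward ones.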

Autoencoder (AE)
An AE is an unsupervised learning technique that takes an unlabeled dataset and compresses it into an encoded feature representation. It is used for dimensionality reduction and consists of two major parts, an encoder and a decoder, as shown in Fig. 11(c) [98]. The encoder converts the input data to a code (i.e., compressed data), and the decoder then rebuilds from the code an output that resembles the input [105]. Layers in the encoder may be dense layers or convolution layers. The number of layers in the encoder is typically equal to the number of layers in the decoder. The encoder reduces the dimensions of the data, whereas the decoder increases them [103]. The middle layer, called the bottleneck layer, holds the compressed representation of the input data [48].
AEs come in different types that improve performance: (1) denoising AE, (2) sparse AE, and (3) contractive AE [1]. One problem AEs face is that simply copying the input layer to the hidden layer leads to inefficient extraction of meaningful features, even though the input can be retrieved at the output layer [134]. The denoising AE solves this problem by corrupting the inputs, which the AE must then reconstruct (i.e., denoise) [34]. This helps the model recognize features from noisy input and hence classify it, because the model cannot copy the input to the output without learning features of the data.
The sparse AE adds constraints that reduce the number of hidden nodes and limit the nodes that are activated [135]. When the average activation of the hidden nodes is close to zero, only a few nodes in the hidden layer are active and the others are inactive [147]. The sparse AE learns features by imposing a penalty on the hidden layer. There are two ways to impose the sparsity constraint: (1) L1 regularization, which is added to the cost function and helps prevent overfitting [93], and (2) a KL-divergence constraint, which is applied to all hidden nodes to enforce a low average activation value [11]. The main purpose of the contractive AE is to learn a robust representation that extracts useful information and is less sensitive to small variations in the data [124].
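The two sparsity penalties can be written out as short functions. The target activation rho = 0.05 and the L1 weight are assumed example values, and the KL term uses the usual Bernoulli form applied to each hidden node's mean activation; the surveyed papers may use different settings.

```python
import math

def kl_sparsity_penalty(mean_activations, rho=0.05):
    """Sum over hidden nodes of KL(rho || rho_hat) for Bernoulli variables.

    `mean_activations` holds each node's average activation in (0, 1);
    rho = 0.05 is an assumed target value.
    """
    eps = 1e-12
    total = 0.0
    for rho_hat in mean_activations:
        rho_hat = min(max(rho_hat, eps), 1 - eps)   # avoid log(0)
        total += (rho * math.log(rho / rho_hat)
                  + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return total

def l1_penalty(activations, weight=1e-3):
    """L1 term added to the cost function; `weight` is an assumed value."""
    return weight * sum(abs(a) for a in activations)

on_target = kl_sparsity_penalty([0.05, 0.05])   # nodes already sparse
too_active = kl_sparsity_penalty([0.5, 0.5])    # nodes active half the time
```

The penalty is zero when every node's mean activation equals the target and grows as nodes become more active, which pushes most hidden nodes toward inactivity during training.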

Deep belief network (DBN)
A DBN is a supervised learning technique that can link the unsupervised features extracted from the stacked layers [70]. It is a generative graphical model constructed from a stack of Restricted Boltzmann Machines (RBMs), which can extract features and reconstruct the input. A DBN has undirected connections between its top two layers, as depicted in Fig. 11(d). A DBN uses RBMs for weight initialization, which helps the model overcome the overfitting problem [70].
The DBN was created to analyze the apparent distribution between the input and the hidden layers in such a way that the lower-layer nodes are connected directly and the upper-layer nodes are connected indirectly [92]. This model is helpful in tasks that require feature extraction, involve biological data, and have classes that are not linearly separable [13].
Finally, the authors summarize the DL branches including the different models graphically in Fig. 13.

Research challenges
In this section, we present research challenges, which can be divided into two categories: (1) those related to the data and (2) those related to the classification problem. The important challenges in each category are highlighted in the following subsections. The challenges are summarized in Fig. 14 and discussed in detail in the following subsections.

Availability of large datasets
To achieve the best results, DL techniques require a large amount of data for the training process. The unavailability of data is a major challenge, as it is difficult to acquire data from hospitals and clinical centers because of patient privacy. A set of online medical datasets is available to researchers, for example ADNI (Alzheimer's Disease Neuroimaging Initiative), OASIS (Open Access Series of Imaging Studies), COBRE (Center for Biomedical Research Excellence), and FBIRN (Function Biomedical Informatics Research Network). To overcome the small size and limited diversity of the available datasets, data augmentation is used to increase the number of images without acquiring new ones, by applying transformations such as flipping, padding, and rotation. Table 6 summarizes the methods used in different studies to address the issue of limited datasets.
In Table 6, we compare how different studies overcome the limited availability of data. We compare their advantages, disadvantages, image modalities, datasets, and the methods they used to overcome this problem.
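The augmentation operations named in this subsection (flipping, padding, and rotation) can be sketched with NumPy as follows. This is only an illustration on a single 2-D slice; the function name `augment`, the zero padding, and the restriction to a 90-degree rotation are assumptions made for the example. Rotation by an arbitrary angle would additionally require interpolation, which is not shown here.

```python
import numpy as np


def augment(image, pad=2):
    """Return simple geometric variants of a 2-D image slice."""
    return {
        "flip_lr": np.fliplr(image),   # mirror left-right
        "flip_ud": np.flipud(image),   # mirror top-bottom
        "rot90": np.rot90(image),      # rotate 90 degrees counterclockwise
        "padded": np.pad(image, pad, mode="constant", constant_values=0),
    }


# Hypothetical 3x4 "slice" used only to show the resulting shapes.
slice_2d = np.arange(12).reshape(3, 4)
variants = augment(slice_2d, pad=1)
for name, img in variants.items():
    print(name, img.shape)
```

Each variant is a transformed copy of an existing image, so the number of training images grows while no new scan is acquired.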

Alzheimer disease datasets
Large datasets are required for training, testing, and validation in order to train systems and to compare the performance of different architectures on different datasets. ADNI is a multicenter study that aims to develop imaging, clinical, and genetic biomarkers for tracking the progression of the disease and for detecting AD at an early (pre-dementia) stage. MIRIAD is a database of volumetric MRI brain scans of 46 Alzheimer's sufferers and 23 healthy elderly people. It includes a total of 708 scans and should be of particular interest for work on longitudinal biomarkers and image analysis. It is also an open-access dataset. Table 7 presents the available AD datasets with their links, the number of available images, the number of classes, and whether each dataset is open access to researchers.

Overcoming data imbalance problem
Data imbalance is one of the problems that researchers face when they solve a problem using DL. It can be described as an unequal distribution of examples across classes. For example, in an AD dataset, the number of images with the disease may be larger than the number of images without it, which causes an imbalance in the dataset [71].
There are different suggestions for solving this problem, as shown in Table 8. Under-sampling and over-sampling are two common techniques for mitigating the data imbalance problem. Under-sampling randomly deletes data from the classes that have more samples in order to balance the classes [113]. It helps in obtaining an equal number of samples per class and a fast training time. Despite the simplicity of this approach, there is a high probability that the deleted data contain useful information. Over-sampling means increasing the amount of data by copying existing samples. It is applied to the minority class, which has fewer samples than the other classes, so that its size increases until the classes are balanced. Overfitting is the major issue that occurs with over-sampling [113].
In Table 8, we compare how different studies overcome the problem of data imbalance. We compare their advantages, disadvantages, image modalities, datasets, and the methods they used to give the different classes the same number of images.
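Random under-sampling and over-sampling, as described above, can be sketched in a few lines of standard-library Python. The function names, the class labels "AD" and "CN", and the sample identifiers are hypothetical and serve only to illustrate the two strategies.

```python
import random
from collections import defaultdict


def group_by_label(samples, labels):
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_undersample(samples, labels, seed=0):
    """Randomly delete samples until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(g) for g in groups.values())
    balanced = []
    for label, group in groups.items():
        balanced.extend((s, label) for s in rng.sample(group, target))
    return balanced


def random_oversample(samples, labels, seed=0):
    """Randomly copy samples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(g) for g in groups.values())
    balanced = []
    for label, group in groups.items():
        copies = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((s, label) for s in group + copies)
    return balanced


# Hypothetical imbalanced set: 7 "AD" image ids and 3 "CN" image ids.
ids = list(range(10))
labels = ["AD"] * 7 + ["CN"] * 3
print(len(random_undersample(ids, labels)))  # 3 per class
print(len(random_oversample(ids, labels)))   # 7 per class
```

The sketch makes the trade-off visible: under-sampling discards four of the seven "AD" samples, whereas over-sampling repeats "CN" samples, which is the source of the overfitting risk mentioned above.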

Multimodality images in classification
There are different scan types for neuroimaging, such as magnetic resonance imaging (MRI), positron emission tomography (PET), functional magnetic resonance imaging (fMRI), and computed tomography (CT). Training a model on these different modalities is another challenge. Learning from heterogeneous data may reduce performance because the inputs differ in nature and are combined [82]. To solve this problem, each modality is first learned separately by multiple hidden layers to extract features. In the second stage, the features from the last hidden layer of each modality are combined, and a model is trained on the combined features to classify the labels. Combining features from different modalities achieves higher performance than training a model with a single modality. Each neuroimaging modality can offer new and different details about the disease, which makes classification more effective [18]. Table 9 summarizes the methods that different papers used to combine different image modalities.
In Table 9, we compare different studies that used different image modalities in the learning model. We compare their advantages, disadvantages, image modalities, datasets, the classifiers they used, and the methods they used to overcome this problem.
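The two-stage scheme described above (separate hidden layers per modality, followed by a classifier on the combined last-layer features) can be sketched as a forward pass in NumPy. The weights below are random and untrained, so the example shows only the data flow. The input sizes, the layer widths, the three output classes, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def init_mlp(sizes):
    """Random (untrained) weights for a stack of fully connected layers."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    for W, b in params:
        x = relu(x @ W + b)
    return x  # activations of the last hidden layer


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fused_predict(mri_net, pet_net, head, x_mri, x_pet):
    # Stage 1: each modality is processed by its own hidden layers.
    f_mri = forward(mri_net, x_mri)
    f_pet = forward(pet_net, x_pet)
    # Stage 2: last-layer features are concatenated and classified.
    fused = np.concatenate([f_mri, f_pet], axis=1)
    W, b = head
    return softmax(fused @ W + b)


# Hypothetical sizes: 64 MRI features, 32 PET features, 3 classes.
mri_net = init_mlp([64, 16, 8])
pet_net = init_mlp([32, 16, 8])
head = (0.1 * rng.standard_normal((16, 3)), np.zeros(3))
probs = fused_predict(mri_net, pet_net, head,
                      rng.random((5, 64)), rng.random((5, 32)))
print(probs.shape)
```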

Collecting necessary information
MRI images may contain information that is unnecessary for the diagnosis of AD. This information increases the processing and training time and the computational cost, and it makes training of the classification model less efficient [90]. The solution to this challenge is to preprocess the data before training, as shown in Table 10, which summarizes some of the related articles that address this problem. It is worth mentioning that FreeSurfer is a freely available tool for image processing tasks such as skull stripping and segmentation of the parts of the brain that are essential for the diagnosis of Alzheimer's disease [22].
In Table 10, we compare how different studies collect the important information in images and discard the rest. We compare their advantages, disadvantages, image modalities, datasets, and the methods they used to overcome this problem.

Neuroimages noise manipulation
Adversarial noise may be present in neuroimages, and it reduces the performance of the classification process. To minimize noise, preprocessing techniques such as those mentioned earlier in Fig. 5 help remove the noise from images, as shown in Table 11, which summarizes the different methods used to remove noise from neuroimages. The Gaussian filter, the median filter, and many other filters that remove noise from images improve the quality of training of the classification model.
In Table 11, we compare different studies whose datasets or images contain noise. We compare their advantages, disadvantages, image modalities, datasets, the classifiers they used, and the methods they used to overcome neuroimaging noise.
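To illustrate the median and Gaussian filters mentioned above, the following sketch implements both on a 2-D slice with NumPy only. The window sizes, the value of sigma, and the edge-replicating padding are assumptions chosen for the example; the window size is assumed to be odd.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))


def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()  # weights sum to 1


def gaussian_filter(image, size=5, sigma=1.0):
    """Replace each pixel by a Gaussian-weighted mean of its neighborhood."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.einsum("ijkl,kl->ij", windows, gaussian_kernel(size, sigma))


# Hypothetical flat slice with a single impulse ("salt") pixel.
noisy = np.full((5, 5), 10.0)
noisy[2, 2] = 255.0
print(median_filter(noisy)[2, 2])    # impulse removed
print(gaussian_filter(noisy)[2, 2])  # impulse attenuated, not removed
```

The median filter removes an isolated impulse completely, whereas the Gaussian filter only spreads and attenuates it, which is why the two filters are usually chosen for different kinds of noise.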

Overfitting problem
Overfitting happens when the model is trained on data that contain noise and many irrelevant details. An insufficient amount of training data is another cause of overfitting; in this case, the amount of data needs to be increased to solve the problem [59].
To overcome this problem, dropout can be used to drop units (i.e., hidden and visible) in a neural network [130]. This means removing the units temporarily from the network, thereby training a random sample of neurons rather than all the neurons in the network, which can improve the learning of the hidden layers. As shown in Table 12, some articles used this method, and others solved the problem in other ways. Increasing the number of images also enhances classification and solves the problem of insufficient samples.
In Table 12, we compare different studies whose architectures suffer from the overfitting problem. We compare their advantages, the classifiers they used, the methods they used to overcome overfitting, and the performance of each study.
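The temporary removal of units by dropout, as discussed in this subsection, can be sketched as follows. The example uses the common "inverted dropout" formulation, in which the kept activations are rescaled during training so that no change is needed at inference time. This choice, the function name, and the drop rate are assumptions for illustration and are not tied to any of the reviewed studies.

```python
import numpy as np


def dropout(activations, rate=0.5, training=True, rng=None):
    """Randomly zero a fraction `rate` of the units during training."""
    if not training or rate == 0.0:
        return activations  # all units are used at inference time
    if rng is None:
        rng = np.random.default_rng()
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep  # True for kept units
    # Rescale kept units so the expected activation is unchanged.
    return activations * mask / keep


hidden = np.ones((2, 8))
print(dropout(hidden, rate=0.5, rng=np.random.default_rng(0)))
print(dropout(hidden, training=False))
```

A new random mask is drawn at every training step, so each step trains a different random subset of the neurons.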

Hybrid approach
A hybrid model combines more than one approach or technique to achieve high performance or to enhance the training and classification process [99]. Examples of hybrid methods are a feature selector combined with a CNN model, a feature selector combined with a pre-trained model or transfer learning, and a hyperparameter optimizer combined with any DL model. As shown in Table 13, some of the articles combined more than one learning method or classification technique to enhance the overall classification of the disease.
Ramzan et al. [112] (fMRI, ADNI) used the FSL-BET toolbox to remove non-brain tissue and the FSL-MCFLIRT toolbox for motion correction of the images; this improved the quality of classification but increased the processing time. Jain et al. [67] (sMRI, ADNI) used the FreeSurfer tool to remove unnecessary details and for intensity normalization, which improved the training of the classification model. Preprocessing increases the sensitivity of the analysis. However, if the spatial smoothing is larger than the activated region, the main signal is lost, which may reduce the effectiveness of classification.
In Table 13, we compare different studies that used a hybrid approach. We compare the methods they used to combine more than one model, the performance of each model, the type of image, the datasets and classifiers they used, and our comments on their methods.
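The hybrid pattern of a feature selector followed by a separate model, as described in this subsection, can be sketched as follows. To keep the example self-contained, a simple two-class separation score is used as the feature selector and a nearest-centroid classifier stands in for the CNN or pre-trained model; both choices, as well as all names and the synthetic data, are assumptions made purely for illustration.

```python
import numpy as np


def select_top_k(X, y, k):
    """Rank features by a two-class separation score and keep the top k."""
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (
        a.std(axis=0) + b.std(axis=0) + 1e-8)
    return np.argsort(score)[::-1][:k]


class NearestCentroid:
    """Stand-in classifier: assign each sample to the closest class mean."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack(
            [X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dists = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


# Synthetic data: 10 features, only feature 3 separates the two classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 3] += 6.0

selected = select_top_k(X, y, k=2)          # stage 1: feature selection
model = NearestCentroid().fit(X[:, selected], y)  # stage 2: classification
print(selected, (model.predict(X[:, selected]) == y).mean())
```

In practice, the feature selector should be fitted on the training split only; selecting features on the full dataset, as done here for brevity, would leak information into the evaluation.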

Black box challenges
One of the trending issues is the black box. Neural networks, which can be thought of as black boxes that convert input into output, are often used in machine learning methods [49]. Although the math used to construct a neural network is straightforward, how the output is reached is exceedingly complicated. ML algorithms take data as input, identify patterns, and build a predictive model, but understanding how the model works is an issue [85]. Although DL has been highly successful, achieving performance close to that of humans in classification and prediction, DL models operate as black boxes [49]. They do not offer a specific reason or explanation for choosing a specific feature in the training process, for why this achieves high or low performance, or for how the relations in the training data are reflected in the feature selection, as shown in Table 14 [25].
In Table 14, we compare different studies on black box challenges. We compare their models, the performance of each model, the type of image, the datasets and classifiers they used, their contributions, and our comments on their methods.

Future directions
A review of the most recent literature on the early diagnosis of Alzheimer's disease leads to the following conclusion: to improve the overall accuracy of computer-aided diagnosis, the following points must be taken into account:
• One of the challenges is collecting balanced and sufficient brain data related to Alzheimer's disease [60,119,125,131].
• Most of the recently deployed methods and techniques are related to DL, including deep sparse multi-task learning [131], stacked auto-encoders [137], and sparse regression models [132], each of which attempts to overcome the aforementioned challenges. In [119], a deep architecture was proposed that uses hierarchical sparse multi-task learning to remove features so that redundant information is not loaded.
• Deep learning segmentation (e.g., U-Nets) can be inserted into the process to isolate only the region of interest.
• Combining the two conceptually different methods of sparse regression and DL to diagnose AD can be effective [125]. The manifold-based learning method is also one of the promising techniques.
• Data augmentation and scaling techniques can help to improve on the overall state-of-the-art performance.

Limitations
All studies have both strengths and weaknesses, and ours has several limitations. First, we mentioned only the most common preprocessing techniques (intensity normalization, contrast enhancement, denoising, brain extraction, and data augmentation) that are used with neuroimaging. Second, we discussed only five DL techniques (DNN, ANN, CNN, AE, and DBN); although there are many approaches, we covered the ones most commonly used in the diagnosis of AD. Third, ML is not discussed in as much detail as DL. Finally, we mentioned only four datasets rather than all of them, and the current study covers ten years of studies.

Conclusions and future work
AD is a cumulative neurological disorder that is the most common form of late dementia. AD causes nerve cell death and tissue loss in the brain, resulting in a substantial decrease in brain volume over time and impairment of most of its functions. In this paper, we started with the main differences between traditional ML and DL, followed by the stages of AD diagnosis. Diagnosing AD requires preprocessing the images to enhance the quality of learning, so we presented some of the preprocessing techniques used with images. We also presented the DL methods that are most common in the classification process, such as CNN, RNN, DNN, AE, and DBN. Despite the importance of classifying the disease using DL, there are challenges in dealing with the datasets, so we reviewed the literature for every challenge and presented the suggested solutions to these problems. The novelty of the current survey can be summarized as follows: (1) it introduces different preprocessing techniques applied to neuroimaging, (2) it combines preprocessing techniques with the most common DL methods in one survey, and (3) it compares different state-of-the-art research with respect to the challenges in dealing with the dataset and the classification stage. In future studies, to classify AD with a proper dataset, we can (1) use abstract CNN models, (2) apply transfer learning only, (3) apply transfer learning together with abstract CNN models, (4) use a feature selector to select features separately and then apply CNN models, (5) use a feature selector to select features separately and then apply transfer learning, (6) compare the performance of the two models mentioned in points 4 and 5, and (7) use a hyperparameter optimizer with one of the models, such as CNN or transfer learning. Table 15 presents the abbreviations used in the current survey with the corresponding definitions, sorted in alphabetical order.

Table of Abbreviations
Acknowledgments The authors would like to express appreciation to Hossam Magdy Balaha, who assisted in research work, and to Mansoura University for supporting the authors with the dataset collection.
Authors agreement We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
Corresponding author We understand that the declared corresponding author is the sole contact for the editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions, and final approval of proofs. We confirm that we have provided a current, correct email address that is accessible by the Corresponding Author.
Intellectual property We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, concerning intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
Funding Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Data availability We confirm that no datasets were used, generated, or analyzed during the current study.

Declarations
Conflict of interest The authors declare that they have no competing interests nor conflict of interests for the current study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.