1 Introduction

Cognitive disorders have a significant impact on an individual's daily life because they affect a range of cognitive functions, and Alzheimer's disease (AD) is one of the most widely known of these disorders. AD and other cognitive disorders are commonly diagnosed, and their management guided, with the help of medical imaging. Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Computed Tomography (CT) allow medical professionals to identify neurological changes linked to these disorders by offering comprehensive visual depictions of brain structures and functions. In this section, cognitive disorders are discussed in general, followed by an analysis of deep learning methods for diagnosing them. Next, AD and the different AI approaches used to diagnose it are presented. The proposed methodology is then summarized, followed by the motivation behind the research, and finally the main contributions of this study are highlighted.

1.1 Cognitive disorders

Cognitive impairments [32] refer to difficulties or limitations in cognitive function, including memory, attention, perception, language, or problem-solving abilities. These impairments can affect a person's daily life and activities and can range from mild to severe. Common causes of cognitive impairments include brain injury, stroke, neurodegenerative disorders such as AD, and certain medical conditions such as HIV/AIDS or hypothyroidism [73]. There are various types of cognitive impairment, including Mild Cognitive Impairment (MCI) [34], dementia, and AD [13]. MCI is a condition in which a person experiences mild cognitive decline beyond what would normally be expected for their age. Dementia is a more severe form of cognitive decline that affects multiple cognitive domains and interferes with a person's ability to carry out daily activities [70].

Computer-Aided Diagnosis (CAD) techniques have been developed to help detect and diagnose cognitive impairments. These techniques involve the use of computer algorithms to analyze various types of data such as brain scans, medical records, and cognitive tests. For example, MRI can be used to identify structural changes in the brain that are indicative of cognitive impairment [22]. Deep learning algorithms can be trained on these images to help identify patterns and predict the likelihood of cognitive impairment [21, 31]. Other CAD techniques include cognitive screening tests such as the Montreal Cognitive Assessment (MoCA) [44, 45, 66] or the Mini-Mental State Examination (MMSE) [20, 55]. These tests are designed to assess various cognitive domains and can be administered in a clinical setting or remotely using computer-based assessments [94]. Ongoing research is being conducted on cognitive disorders such as AD, Parkinson’s disease, schizophrenia, and depression to comprehend their underlying mechanisms and find effective treatments. Studies on AD are focused on detecting the disease early through biomarkers and developing potential therapies to slow or halt its progression [17, 72, 78, 88].

1.2 Alzheimer disease

One of the most prevalent cognitive disorders is AD, which is also the most common cause of dementia, accounting for about 70% of dementia cases worldwide. AD was first described by Dr. Alois Alzheimer in 1906 [37, 63]. It is typified by a progressive loss of memory and other cognitive abilities [70] and is a leading cause of dementia in elderly individuals [48]. Over time, this progressive neurological disorder causes an irreversible decline in cognitive function. The symptoms and attributes of AD include memory loss, language difficulties, disorientation (patients forget where they are or how they got there and have difficulty recognizing people they know), poor judgment, mood swings, loss of initiative, and changes in personality [7, 9,10,11,12, 16, 53, 75].

Proper classification of AD plays a crucial role in comprehending the disease, as it enables early diagnosis and prediction of patient outcomes and facilitates informed decision-making regarding treatment. Deep learning has emerged as a promising approach for AD classification, particularly with regard to brain imaging data such as MRI or PET scans [29, 57]. To diagnose the disease, a medical evaluation including patient history, mental state examination, physical and neurobiological tests, as well as non-invasive brain imaging techniques such as structural and functional MRI, is used [8, 92]. The process of diagnosing AD typically involves gathering a patient's medical history, assessing their clinical symptoms, and observing their behavior [33, 62, 77].

MRI scans, in particular, provide important information about AD when analyzed with deep learning and machine learning techniques. Machine learning algorithms use features extracted from MRI images to distinguish between healthy individuals and those with AD [64, 93]. These features include a range of parameters, such as surface area, cortical thickness, and brain volume, which serve as important markers of AD pathology [60]. Furthermore, deep learning algorithms provide an advanced method for interpreting the brain networks seen in MRI scans [81]. These algorithms identify changes linked to AD by examining patterns of connectivity between brain regions [97]; notably, research has shown that reduced connectivity between different parts of the brain is a hallmark of AD [59, 67]. AD detection through MRI analysis has also shown notable success with computer vision techniques, as MRI's high-resolution imaging makes it possible to identify structural anomalies indicative of AD [3]. These developments highlight the importance of medical imaging, especially MRI, in improving our understanding and diagnosis of AD.

1.3 Introduction to proposed approach

Here, the Convolutional Neural Network (CNN), Vision Transformer (ViT), and Compact Convolutional Transformer (CCT) models are discussed. The methods employed, namely patch encoding and tokenization, are then described, followed by an introduction to and general discussion of the suggested methodology. In Sect. 3, the complete working methodology of the proposed architecture is presented in detail.

1.3.1 CNN vs ViT vs CCT

The ViT is a machine learning model for image classification that applies a transformer-based architecture to patches of the image. It was first introduced in the paper "An Image is Worth \(16\times 16\) Words", presented at the ICLR 2021 conference by Dosovitskiy and colleagues [24]. The model is pre-trained on large image datasets such as ImageNet-21k and ImageNet [56] and employs an attention mechanism that allows it to assign varying levels of importance to different parts of the input data. The ViT model is composed of multiple self-attention layers, similar to those used in natural language processing, which hold great potential for use across various data modalities. Figure 1a illustrates the vision transformer.

ViT can outperform CNNs while using fewer computational resources. However, due to its weaker inductive bias, it requires more data augmentation or regularization when trained on smaller datasets. ViT represents image inputs as a sequence of image patches and requires a significant amount of data to achieve optimal performance. The mathematical formulation of kernel convolution in a CNN is described by Eq. 1.

$$G[m,n]=(f*h)[m,n]=\sum _{j}^{}\sum _{k}^{}h[j,k]\,f[m-j, n-k]$$
(1)

where f represents the input image and h denotes the kernel. The column and row indices of the result matrix are denoted by n and m, respectively. Unlike CNNs, which operate on pixel arrays, ViT splits images into visual patches and employs a self-attention layer that embeds information globally across the overall image. The mathematical representation of self-attention is given in Eq. 2. ViT can also learn to encode the relative locations of the image patches, thereby reconstructing the image structure. Furthermore, ViT has a multi-head attention layer that concatenates all outputs in the appropriate dimensions, and the attention heads learn global and local dependencies in an image. The aspects that mainly differentiate ViT from CNNs are patching and self-attention.

$$Attention(Q,K,V)=Softmax\left(\frac{QK^{T}}{\sqrt{D_{h}}}\right).V$$
(2)
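For illustration, a minimal NumPy sketch of the scaled dot-product attention in Eq. 2 is given below; the sequence length and head dimension are arbitrary choices for this sketch, not values from this work.

```python
# Minimal sketch of scaled dot-product attention (Eq. 2).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_h = Q.shape[-1]                          # head dimension D_h
    scores = Q @ K.T / np.sqrt(d_h)            # Q K^T / sqrt(D_h)
    return softmax(scores, axis=-1) @ V        # attention-weighted sum of values

# Example: 16 tokens with a 64-dimensional head (illustrative sizes only)
rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 64)); K = rng.normal(size=(16, 64)); V = rng.normal(size=(16, 64))
out = attention(Q, K, V)                       # -> shape (16, 64)
```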

On the other hand, CCT is a deep learning architecture that combines the strengths of both CNNs and transformers [38]. It aims to efficiently capture both local and long-range dependencies in the input data. For instance, CNN layers can be used to extract local features from the input image, followed by transformer self-attention layers that model long-range dependencies between the extracted features. Equations 3 and 4 present the mathematical formulation of the tokenization and encoding processes in CCTs, with the transformer encoder represented as f.

$$x_{o} = MaxPool(ReLU(Conv2d(x)))$$
(3)
$$x_{L} = f(x_{o}) \in R^{b \times n \times d}$$
(4)
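A hedged Keras sketch of the convolutional tokenization in Eq. 3 is shown below: a Conv2d, ReLU, and MaxPool stage whose feature map is reshaped into a token sequence that a transformer encoder (f in Eq. 4) can consume. The filter count, kernel size, and resulting number of tokens are assumptions for illustration, not the settings used in this work.

```python
# Illustrative convolutional tokenizer (Eqs. 3-4): Conv2d -> ReLU -> MaxPool -> token sequence.
import tensorflow as tf
from tensorflow.keras import layers

def conv_tokenizer(x, embed_dim=64):
    x = layers.Conv2D(embed_dim, kernel_size=3, padding="same")(x)   # Conv2d
    x = layers.ReLU()(x)                                             # ReLU
    x = layers.MaxPool2D(pool_size=2)(x)                             # MaxPool
    # Flatten the spatial grid into a sequence of n tokens of dimension d (x_o in R^{b x n x d})
    _, h, w, d = x.shape
    return layers.Reshape((h * w, d))(x)

inputs = tf.keras.Input(shape=(128, 128, 3))
tokens = conv_tokenizer(inputs)   # -> (None, 64 * 64, 64) for these illustrative sizes
```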

The visual diagram of CCT is shown in Fig. 1c. The architecture of a CCT consists of two main components: a convolutional encoder and a transformer encoder. The convolutional encoder is responsible for extracting spatial features from the input image or video, while the transformer encoder processes the encoded features and generates the output. Self-attention layers have also been used in computer vision, but they are computationally expensive and require a large number of parameters, making them challenging to deploy on resource-constrained devices. CCTs address these limitations by using self-attention layers in a compact manner: instead of applying self-attention to the entire input feature map, CCTs apply it to a smaller set of features, reducing the computational cost. This is achieved by adding self-attention layers after a few convolutional layers. The self-attention layers enable the model to learn long-range dependencies between features, while the convolutional layers capture local spatial relationships.

Fig. 1
figure 1

ViT vs CVT vs CCT

In this work, a novel transformer-based model is developed that combines the beneficial features of CCT and ViT. The parallel coupled encoding strategy (PCES) is a novel method that combines the tokenization process from CCT with the patch encoding mechanism from ViT. Here, images are tokenized and patched simultaneously in different encoding modules, and their outputs are fed into transformer and self-attention layers. The two representations learned from the transformer layers are combined in a single process known as mutual latent fusion (MLF), which fuses the two kinds of information. For classification, a multilayer perceptron is employed. Section 3 offers a more detailed discussion of the proposed methodology.

The aim of utilizing a novel transformer for the classification of AD and cognitive disorders is to enhance the precision of diagnosis and promote the understanding of cognitive disorders. Medical imaging data presents challenges due to its complex spatial relationships and long-range dependencies [84], and transformer applications in imaging have become a hot research topic in recent years. Through their self-attention mechanism, transformers address these difficulties by producing detailed features and capturing intricate patterns that combine long-range dependencies and global context. Traditional diagnostic methods are not always reliable, and by employing sophisticated deep learning techniques such as transformers, it is possible to detect patterns and characteristics in medical imaging data that may not be visible to the naked eye. This can potentially result in earlier and more accurate diagnoses, allowing for more efficient treatment and care of patients. In addition, as the world's population continues to age, there is a growing need for more effective tools for diagnosing AD and cognitive disorders.

1.4 Contribution

This study involves the development of a deep transformer architecture for the classification of different stages or types of AD and other cognitive disorders in 2D MRI image data. The main contributions of this paper are as follows:

  • A novel computer-aided diagnosis system is proposed for AD and cognitive impairments, which can support medical professionals in decision-making and enable quick and efficient treatment.

  • A deep learning-based model called BiViT is introduced to detect AD and cognitive disorders in 2D MRI imaging data. The model makes use of PCES and MLF, resulting in a significant improvement in the accuracy of the results.

  • The study proposes a new PCES technique that uses two types of encoding to process the data, leading to improved representations that further enhance the model's accuracy.

The structure of this paper is as follows: Sect. 2 provides an overview of the relevant literature on the subject at hand. The methodologies employed in the current study are described in Sect. 3. Furthermore, Sects. 4 and 5 present the results and discussion, and Sect. 6 concludes the paper.

2 Literature review

This section presents a literature review on cognitive disorders, primarily related to AD, with a focus on the essential role that deep learning plays in the recognition and classification of cognitive disorders from imaging data. AD is a severe neurological condition that leads to progressive damage to brain cells, causing permanent memory loss and dementia [51]. Early detection of AD can help control its progression, and hence there is a need for an autonomous system that can classify the medical condition into different stages. In recent times, machine learning and deep learning techniques have been successfully applied to many medical problems, including AD detection [30]. Deep learning has been used in many studies to classify cognitive disorders, mainly AD, using imaging data such as MRI or PET scans as well as clinical data [15]. Some studies have found that deep learning models can achieve high accuracy in classifying AD, particularly when using imaging data. In this regard, CNNs, ViTs, and autoencoders, along with other deep learning architectures, have been used to extract the essential features of MRI scans and categorize them into healthy or disease groups. In this section, the different methods used for AD detection are analyzed in relation to their pros and cons.

Transfer learning is an important option when the amount of training data is very low. Ghazal and Issa [36] aim to detect AD from brain MRI by classifying images into four stages (healthy, mild demented, moderately demented, and severely demented) using transfer learning. The work utilizes a transfer learning-based AlexNet model to characterize the disease at an early stage with high accuracy. The simulation results demonstrated that the proposed system can achieve an accuracy of 91.70%, making it an effective tool for early detection of AD. Merits of the transfer learning-based model include fast training, re-usability, and reduced data requirements. However, fine-tuning AlexNet, trained on the ImageNet dataset [23], on a medical image dataset can be questionable. Moreover, there is a risk of overfitting on the new data, especially if the new dataset differs significantly from the training data. To sum up, the problem of domain discrepancy exists in this research and can be considered its main drawback. In another study, a transfer learning-based ResNet50 is used to detect AD and determine its stage from brain images [99]. The approach is developed as a hybrid of ResNet50 with other CNN architectures such as AlexNet, DenseNet201, and VGG16 [52, 91, 98]. The study demonstrated that the proposed hybrid model achieved an accuracy of 90%, outperforming the individual CNN architectures. Hence, the hybrid model showed promising results in diagnosing AD and performed better than other CNN architectures reported in the literature. However, it has certain drawbacks, such as small receptive fields, a lack of long-term dependencies, and the absence of attention mechanisms. As the proposed transfer learning-based CNN model is similar to the work in [36], it has similar limitations ranging from domain mismatch to overfitting. The second main problem with CNN-based models is that, unlike ViT, they do not use an attention mechanism. The attention mechanism is a key component in image recognition, used to locate the regions of the image that are important. Other shortcomings associated with CNN-based models are limited contextual knowledge, fixed-size inputs, a lack of global attention, and a large number of parameters.

Mild Cognitive Impairment (MCI) is an intermediate condition between healthy individuals and AD. Taheri Gorji and Kaabouch [87] conducted a study on the significance of early detection of MCI using MRI. The study employed a CNN to classify MRIs of 600 individuals into healthy, Early MCI (EMCI), or Late MCI (LMCI) classes. The CNN, with an efficient architecture, discriminated between the healthy group and the two types of MCI groups, achieving an overall classification accuracy of 94.54%. The advantages include improved accuracy in MCI classification and the potential for early intervention, however, the model’s performance may vary depending on the dataset used and may require further validation.

Ensemble learning is a machine learning technique in which the predictions of multiple models are combined to boost overall performance. Unlike conventional models, it exploits the heterogeneity among stand-alone models to achieve lower error and enhanced stability. One substantial advantage is its capability to achieve better predictions by combining the strengths of different models. Ensembling involves training several simple models separately and then combining them to obtain better-fitting and more generalizable results, especially when the individual models are dissimilar. This approach also enhances the ability to handle noisy data and outliers. Kang et al. [47] present a CNN-based ensemble learning approach for AD classification from MRI data. They ensemble a GAN discriminator, VGG19, and ResNet50, and use majority voting to fuse the outcomes. The proposed model achieved an excellent performance, with an accuracy of 92%. Ensemble learning-based networks are usually more robust and generalize better than stand-alone models. However, ensemble learning is computationally expensive because of the need to train and maintain multiple models. Moreover, if the base models are not well trained or are too similar, ensembling adds little over a single model, and the ensemble is prone to overfitting if the base learners are too complex or the ensemble has too many components. To summarize, this approach can achieve better results, but issues including limited contextual knowledge, lack of global attention, domain mismatch, and a large number of parameters hinder its practical utilization.

There are currently no biomarkers known to be highly accurate in diagnosing AD in its early phases, making early identification of AD a difficult task in medical practice. Moreover, AD is an incurable disease, and a high failure rate has been observed in clinical trials of AD treatments. To help slow down the progression of AD, researchers are striving to find ways of detecting it early. With a focus on neuroimaging and mostly academic articles released since 2016, [65] review the most recent state-of-the-art research on machine learning approaches used for the detection and classification of AD. Various machine learning techniques, including Support Vector Machines (SVM), random forests, CNNs, K-means, and others, have been employed for the detection and classification of AD. The review indicates that there is no single best approach, but deep learning techniques, such as CNNs, appear to be promising for the diagnosis of AD. Similar research by [6] shows that, among K-Nearest Neighbor (KNN), SVM, Decision Tree (DT), Linear Discriminant Analysis (LDA), Random Forest (RF), and CNN algorithms, the CNN performs best for classifying AD using imaging data together with additional measures derived from MRI, such as average cortical thickness, the standard deviation of cortical thickness, the volume of cortical parcellation, white matter, and surface area. However, it has the limitation of lacking global attention and contextual understanding.

In contrast to supervised learning, whose models are trained using labeled data, unsupervised learning is essential because it can discover structures and patterns hidden in unlabeled data. One of the most widely used unsupervised approaches for image data is the autoencoder. The encoder and decoder are the main components of an autoencoder: the former produces an encoded representation of the input in a low-dimensional space, and the latter reconstructs the original image from this representation. Leveraging this property, [96] employed a Stacked AutoEncoder (SAE) to extract features from MRI data, and an SVM was then used to classify AD based on those features. They found that the deep autoencoder extracted useful features that improved the accuracy of the SVM, resulting in an accuracy of 89%. SAEs offer several advantages, such as hierarchical representation learning and the ability to model non-linear transformations. However, they also come with disadvantages, including the potential for overfitting and the computational complexity of training deep models. SAEs also pose challenges in interpretation due to the complexity of their learned representations, making it hard to interpret model decisions. Additionally, they are data-hungry, needing substantial data to learn meaningful representations, which can limit their effectiveness with small or unrepresentative datasets.

Recently, researchers have explored innovative approaches to improve the accuracy and efficiency of AD diagnosis using advanced technologies such as deep learning and neuroimaging. One such approach, proposed by [74] from Imperial College London, focuses on utilizing imaging data to differentiate AD from MCI and Normal Control (NC). Their method leverages an autoencoder and a 3D CNN architecture, achieving an impressive accuracy of 95.39% in distinguishing AD from NC individuals. Additionally, a 2D CNN design was developed that yields comparable accuracy. Likewise, [61] designed a diagnostic approach for AD using multi-modal neuroimaging data. This method utilizes a novel zero-masking technique that preserves all the information contained within the data. An SAE is used to extract high-level features, which are subsequently fed into an SVM for multi-modal and multi-class MRI/PET data classification. The study reported an accuracy of 86.86%, demonstrating the potential of the employed method for early detection of AD. Such cutting-edge techniques reinforce how significant advanced technology is to progress in the diagnosis and treatment of AD. The unsupervised autoencoder approach is useful for learning non-linear features, but it faces issues such as limited model generalizability and interpretability, as well as overfitting. SAEs can be complex and thus hard to interpret, and this complexity may hamper their adoption in clinics, where interpretability is very important.

Over the past decades, different deep learning approaches in medical imaging have shown promising results. Drewitt [25] explores a ViT approach for classifying AD in MRI images, compares it with other deep learning-based networks, and points out limitations as well as future prospects of this approach. The performance metrics include accuracy and F1-score, with the model attaining an accuracy of 87.5% and a loss of 0.34 in AD classification. This indicates that the proposed model could help physicians diagnose AD and provide remedial treatment accordingly, which can ultimately decrease the mortality rate associated with the disease. In another study [43], a ViT is trained on natural images to exploit the large-scale data available in computer vision. The pre-trained ViT model is then transferred to the brain imaging task, where relatively few but high-quality public samples are available, achieving an accuracy of 96.8%. This indicates the model's significant scalability, which can be an improvement over traditional neural networks.

The growing importance of early AD diagnosis parallels the aging global population. A study by [100] introduced a novel approach using the SMIL-DeiT model for AD classification across three categories: AD, MCI, and NC. The proposed model is inspired by ViT and preceded by pre-training through DINO, a self-supervised task. The developed architecture is applied to the ADNI dataset and measured by several metrics, including precision, recall, accuracy, and F1-score. The proposed method achieved an accuracy of 93.2%, exceeding the transformer-based (90.1%) and CNN-based (90.8%) models. Self-supervised pre-training methods such as DINO typically require substantial data volumes to achieve meaningful representation learning, which can be particularly challenging in medical imaging, where datasets are frequently constrained in size.

Several studies have proposed innovative approaches for the early diagnosis of AD using deep learning techniques applied to medical imaging data, particularly MRI scans. Sethi et al. [79] introduced a CNN-SVM model that combines the feature extraction capabilities of a CNN with the classification abilities of an SVM. This model achieved relative improvements in accuracy ranging from 0.85 to 3.4% on different datasets, with an impressive accuracy of 86.2% on the OASIS dataset. The model has shown its potential to be very accurate in diagnosing this specific condition and also works well with complicated datasets. These advantages can be crucial for diagnosing AD at an early stage and motivate further research in this field. On the other hand, the efficiency of the model may be affected by longer training times and the dependence on large datasets, which could limit its practical applicability in certain real-life contexts. Similarly, [80] developed a CNN classifier named AlzheimerNet, which can identify all stages of AD and the NC class from MRI scans. This model achieved a remarkable test accuracy of 98.67% and outperformed five other pre-trained models. An ablation study demonstrated the model's superior performance and its ability to outperform traditional methods for classifying AD stages from MRI scans. The advantages of AlzheimerNet include its high accuracy and robustness to noise, but it may be computationally expensive and require a large amount of data for training. Because these algorithms are mainly based on CNNs, challenges may arise in capturing the global context and long-range dependencies of the data.

In conclusion, the literature highlights the effectiveness of deep learning techniques in classifying AD stages from 2D MRI images, with ongoing research focusing on improving accuracy and exploring new approaches such as ViTs and transfer learning models. Without doubt, early detection of AD and MCI is essential for timely intervention and improved patient outcomes. However, the majority of methods employ CNN-based models that come with certain limitations, such as the inability to capture long-term dependencies and the absence of an attention mechanism. Therefore, it is paramount to develop a more robust system that can cope with these issues.

3 Proposed Bi-Vision transformer (BiViT)

This section elaborates on the research methodology and the constituent components of the proposed BiViT algorithm. In this research, a novel BiViT architecture is developed incorporating a parallel coupled encoding strategy (PCES) and mutual latent fusion (MLF). First, the proposed methodology is discussed below, followed by each of the novel aspects, namely PCES and MLF. It is worth mentioning that the present study focuses on the classification of various categories of AD stages and cognitive disorder stages. The methodology comprises six steps: data augmentation, preprocessing, patch encoding, CC tokenization, the self-attention mechanism, and MLF. The methodology is illustrated visually in Fig. 2.

Fig. 2
figure 2

Proposed methodology

The BiViT model incorporates PCES and MLF, which enable it to capture local and global contextual information. Overall, the proposed methodology, consisting of preprocessing, data augmentation, and the BiViT model, provides a robust framework for computer-aided diagnosis of AD and other cognitive disorders. The model's performance is evaluated in terms of training, testing, and validation accuracy, demonstrating its efficacy in AD diagnosis and cognitive disorder classification. The mathematical description of the Bi-Vision transformer is given in the following paragraphs. We start with the input images \(X_n\) and the associated labels Y. Enhanced versions of the data, denoted \(X_n'\), are obtained by applying data augmentation techniques. Because the available computational resources are limited, the method then normalizes and downsizes the data: the images are resized to \(128 \times 128\). Following the augmentation phase, the minimum and maximum values in each instance are represented by \(X_{min}^{'}\) and \(X_{max}^{'}\). After the data augmentation, resizing, and scaling procedures are finished, the resulting data instance is denoted \(X_n^{''}\), as Eq. 5 illustrates.

$$X_n^{''} = \frac{X_n^{'}-X_{min}^{'}}{X_{max}^{'}-X_{min}^{'}}$$
(5)

Once the initial preprocessing is completed, the subsequent steps involve two parallel stages: patching and tokenization. As part of the patching process, the entire image is divided into 256 smaller patches, each measuring \(8 \times 8 \times 3\), which are then projected into a lower-dimensional space. Additionally, positional embeddings are added to the projected patch embeddings. The mathematical formulation for patching on the data batch can be found in Eq. 6, where \(x_{n}^{''N}\), with N = 1, 2, 3, ..., up to the number of patches, represents the patches of a single instance, and E denotes the learnable embeddings.

$$Z_{p}=[X_{class};\textrm{x}_{n}^{''1}E;\textrm{x}_{n}^{''2}E....\textrm{x}_{n}^{''N}E]+E_{pos}$$
(6)
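A hedged Keras sketch of this patching step is given below: a \(128 \times 128 \times 3\) image is split into 256 non-overlapping \(8 \times 8 \times 3\) patches, each projected by a learnable embedding E, with positional embeddings \(E_{pos}\) added as in Eq. 6. The projection dimension is an assumption, and the class token \(X_{class}\) is omitted for brevity.

```python
# Illustrative patch extraction and encoding (Eq. 6, without the class token).
import tensorflow as tf
from tensorflow.keras import layers

patch_size, num_patches, proj_dim = 8, 256, 64    # 128 / 8 = 16 -> 16 * 16 = 256 patches

class PatchEncoder(layers.Layer):
    def __init__(self):
        super().__init__()
        self.projection = layers.Dense(proj_dim)                  # learnable embedding E
        self.position = layers.Embedding(num_patches, proj_dim)   # positional embedding E_pos

    def call(self, images):
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, patch_size, patch_size, 1],
            strides=[1, patch_size, patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID")                                       # (b, 16, 16, 8*8*3)
        patches = tf.reshape(patches, (-1, num_patches, patch_size * patch_size * 3))
        positions = tf.range(num_patches)
        return self.projection(patches) + self.position(positions)  # Z_p
```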

In contrast to patching, the tokenization approach utilizes convolutional and pooling transformations. Convolutional layers process the data, extracting spatial information through convolutions and downsampling operations. These local features are then presented as tokens and passed into the transformer for additional processing. After the convolutional operations, the tokenization process produces patches of size \(16 \times 16 \times 3\), for a total of 64 patches. The tokenization process converts the image data into smaller tokens, as depicted in Eqs. 7 and 8: pooling and convolutional layers are first applied globally to the entire image, and the resulting feature map is then turned into tokens.

$$F_{n} = \text {MaxPool}(\text {ReLU}(\text {Conv2d}(X_n^{''})))$$
(7)
$$Z_{t} = [x_{class};\textrm{F}_{n}^{''1}E;\textrm{F}_{n}^{''2}E....\textrm{F}_{n}^{''N}E]+E_{pos}$$
(8)

The reason for using \(8\times 8\times 3\) patches in the patching module and \(16\times 16\times 3\) patches in the tokenization module is to effectively capture and represent the spatial and channel information present in the images. Different sizes are used to obtain both detailed and global context features, as larger token sizes help capture a wider context, while smaller patch sizes enable the capture of more specific information.

Next, self-attention is applied to \(Z_{p}\) and \(Z_{t}\) to capture dependencies between the patches and tokens at the local and global level. The \(Z_{p}\) attention block takes (q, k, v) as (patches, patches, tokens), while the \(Z_{t}\) block takes (q, k, v) as (tokens, tokens, patches). This enhances the information provided to both attention mechanisms, which is better for learning local and global subtle features. With the help of self-attention, the model is able to comprehend the spatial dependencies and contextual information present in the image by learning the relationships between patches and tokens. The mathematical formulations for the self-attention mechanism applied to \(Z_{p}\) and \(Z_{t}\) can be found in Eqs. 9 and 10.

$$Attention(Z_{p},Z_{p},Z_{t})=Softmax\left(\frac{Z_{p}Z_{p}^{T}}{\sqrt{D_{h}}}\right)\cdot Z_{t}$$
(9)

The matrix product \(Z_{p}Z_{p}^{T}\) is replaced with the covariance matrix, represented by \(\Phi\). To normalize the dot-product attention scores in self-attention, \(\sqrt{D_h}\) is used as a scaling factor, which helps sustain gradient stability during training.

$$Attention(Z_{t},Z_{t},Z_{p})=Softmax\left(\frac{Z_{t}Z_{t}^{T}}{\sqrt{D_{h}}}\right)\cdot Z_{p}$$
(10)

The matrix product \(Z_{t}Z_{t}^{T}\) is replaced with the covariance matrix, represented by \(\Psi\).
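A minimal NumPy sketch of the coupled attention in Eqs. 9 and 10 is given below. For the matrix products to be well defined, the patch sequence \(Z_{p}\) and the token sequence \(Z_{t}\) are assumed here to share the same length and dimension; how the two sequences are aligned in the full model is not reproduced in this sketch.

```python
# Illustrative coupled attention between the patching and tokenization branches (Eqs. 9-10).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_attention(Z_p, Z_t):
    d_h = Z_p.shape[-1]
    attn_p = softmax(Z_p @ Z_p.T / np.sqrt(d_h)) @ Z_t   # Eq. 9: (q, k) from patches, v from tokens
    attn_t = softmax(Z_t @ Z_t.T / np.sqrt(d_h)) @ Z_p   # Eq. 10: (q, k) from tokens, v from patches
    return attn_p, attn_t

rng = np.random.default_rng(1)
Z_p, Z_t = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))   # illustrative shapes
out_p, out_t = coupled_attention(Z_p, Z_t)                        # both (64, 64)
```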

Stochastic depth is another idea derived from CCT. Stochastic depth refers to randomly skipping or dropping network layers during training. It is a regularization technique designed to enhance the performance and generalizability of deep neural networks. The application of stochastic depth to Eq. 10 can be observed in Eqs. 11-13. The drop probability \(\vartheta\) is the main argument for stochastic depth, while the keep probability is represented as \(\xi = 1-\vartheta\), as described in Eq. 11.

$$\xi =1-\vartheta$$
(11)

Equation 12 gives the vector \({\mathcal {U}}_{\delta }\), obtained from a random distribution.

$${\mathcal {U}}_{\delta }= \left\lfloor \xi + {\mathcal {U}}_{\Theta } \right\rfloor \in {\mathcal {Z}}$$
(12)

The vector \({\mathcal {U}}_{\Theta }\) in Eq. 12 is drawn from a uniform distribution between 0 and 1. After adding \(\xi\), the floor function is applied to the resulting vector values to convert them into integers within the domain \({\mathcal {Z}}\). The resulting output, obtained after applying stochastic depth, is illustrated in Eq. 13.

$${\mathcal {D}}_{sd}=\frac{{\mathcal {U}}_{\delta }}{{\xi }}. Softmax\left(\frac{\Psi }{\sqrt{D_{h}}}\right)\cdot Z_{p}.$$
(13)
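A hedged sketch of the stochastic-depth step in Eqs. 11-13 is shown below: with drop probability \(\vartheta\), a binary mask is drawn by flooring the keep probability plus uniform noise, and the surviving samples are rescaled by \(1/\xi\). The per-sample masking shape is an assumption for illustration.

```python
# Illustrative stochastic depth (Eqs. 11-13) applied to a branch output.
import numpy as np

def stochastic_depth(x, drop_prob=0.1, training=True, rng=np.random.default_rng()):
    keep_prob = 1.0 - drop_prob                                # xi = 1 - theta (Eq. 11)
    if not training or drop_prob == 0.0:
        return x
    noise = rng.uniform(0.0, 1.0, size=(x.shape[0], 1, 1))     # U_Theta, one draw per sample
    mask = np.floor(keep_prob + noise)                         # U_delta in {0, 1} (Eq. 12)
    return mask / keep_prob * x                                # rescaled, masked output (Eq. 13)

branch = np.ones((8, 64, 64))                                  # e.g. an attention-branch output
out = stochastic_depth(branch, drop_prob=0.1)
```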

The dense transformation and concatenation operations, represented by \(\lambda _{\theta }\), are applied to the output of self-attention (in the case of patching), as described in Eq. 14.

$$\lambda _{\theta }=\lambda _{\theta }\left( Softmax(\frac{\Phi }{\sqrt{D_{h}}})\cdot Z_{t} \right)$$
(14)

For tokenization, the same transformations are applied to the output of self-attention, as depicted in Eq. 15.

$$\lambda _{\Omega }=\lambda _{\Omega }\left( \frac{{\mathcal {U}}_{\delta }}{{\xi }}. Softmax(\frac{\Psi }{\sqrt{D_{h}}})\cdot Z_{p}.\right)$$
(15)

Finally, attention weights are computed, and the representations \(\lambda _{\theta }\) and \(\Omega _{\theta }\) are multiplied by their respective weight matrices (as depicted in Eqs. 16 and 17).

$$\lambda _{w}= \sigma (\lambda _{\theta }.\omega _1)_i^T. \lambda _{\theta }$$
(16)

In Eq. 16, the softmax function \(\sigma\) is applied to the product of \(\lambda _{\theta }\) and the trainable weight matrix \(\omega _1\), and the transposed result \(\sigma (\lambda _{\theta }\cdot \omega _1)_i^T\) weights the representation \(\lambda _{\theta }\). Similarly, in the tokenization branch, the same process is performed as described in Eq. 17.

$$\Omega _{w}= \sigma (\Omega _{\theta }. \omega _2)_i^T. \Omega _{\theta }$$
(17)

Finally, we incorporate a fusion function, denoted as \({\mathcal {F}}_{\partial }\), to combine the weighted representations obtained. This fusion function combines the information from both representations. In our specific case, we use concatenation as the fusion method, which is illustrated in Eq. 18.

$${\mathcal {F}}_{\partial }= {\mathcal {F}}_{\partial }\left( \lambda _{w}, \Omega _{w} \right)$$
(18)

The fusion function, denoted as \({\mathcal {F}}_{\partial }\), takes two arguments, \(\lambda _{w}\) and \(\Omega _{w}\), as depicted in Eq. 19.

$${\mathcal {F}}_{\partial }= {\mathcal {F}}_{\partial }\left( \sigma (\lambda _{\theta }.\omega _1)_i^T. \lambda _{\theta } ,\; \sigma (\Omega _{\theta }. \omega _2)_i^T. \Omega _{\theta } \right)$$
(19)

Equation 19 represents the final representation that is input to the classifier, which in our case is a softmax classifier due to the multi-class classification task. The process starting from self-attention and continuing up to Eq. 19 is repeated n times for efficient representation learning. This entire process is often referred to as transformer encoding, which encodes patches/tokens into meaningful representations. The PCES and MLF processes are covered in the following sections, but first it is important to understand the tokenization and patching processes; a detailed discussion of how tokenization and patching operate on images is given in Sect. 3.1.
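For illustration, a minimal NumPy sketch of the weighting and fusion steps in Eqs. 16-19 is given below: each branch output is pooled by a softmax over a learned projection (\(\omega _1\), \(\omega _2\)), and the two weighted representations are fused by concatenation. All dimensions are assumptions for this sketch.

```python
# Illustrative attention-weighted pooling (Eqs. 16-17) and concatenation-based fusion (Eqs. 18-19).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_representation(lam, omega):
    # sigma(lam . omega)^T . lam : softmax weights over the sequence, then a weighted pooling
    weights = softmax(lam @ omega, axis=0)         # (n, 1) weights over the n tokens/patches
    return weights.T @ lam                         # (1, d) pooled representation

rng = np.random.default_rng(2)
lam_theta, omega_1 = rng.normal(size=(64, 128)), rng.normal(size=(128, 1))
omega_theta, omega_2 = rng.normal(size=(64, 128)), rng.normal(size=(128, 1))

lam_w   = weighted_representation(lam_theta, omega_1)    # Eq. 16
omega_w = weighted_representation(omega_theta, omega_2)  # Eq. 17
fused   = np.concatenate([lam_w, omega_w], axis=-1)      # Eq. 18: fusion by concatenation
```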

3.1 Patch encoding and tokenization in transformers

Patch encoding and tokenization are both important techniques used in computer vision tasks because they allow input images to be divided into smaller, more manageable parts that can be more efficiently processed by neural networks. This is particularly crucial when dealing with large, high-resolution images, which can be computationally expensive to process using traditional methods. The patch encoding and tokenization processes are illustrated in Figs. 3 and 4, and are explained below.

3.1.1 Patch encoding

To use a transformer model to process image data, patch encoding is performed where the image is divided into fixed-size patches that are flattened into vectors [24]. These patch vectors are then sent to the transformer model which utilizes the self-attention mechanism to extract visual features and classify the image. Patch encoding is more flexible and efficient for processing high-resolution images as can be seen in Fig. 3.

Fig. 3
figure 3

Process of patch encoding in ViT

3.1.2 Image tokenization

In CCT, the process of converting image patches into learnable representations called 'image tokens' using a trainable CNN is known as image tokenization [38]. These image tokens are subsequently fed into a transformer encoder to perform tasks like image classification or object detection. Unlike traditional CNNs, this approach eliminates the need for fully connected layers and pooling layers, which enhances the efficiency and flexibility of image processing. The visual representation of the tokenization process can be observed in Fig. 4.

Fig. 4
figure 4

Image tokenization process in CCT

3.2 Parallel coupled encoding strategy (PCES)

In the transformer encoding phase, PCES is implemented to process images using both patch encoding, as shown in Fig. 3, and the CC tokenizer, illustrated in Fig. 4. In the context of a CCT, the CC tokenizer refers to the process of breaking down an image or video into smaller, manageable pieces, or "tokens", that can be processed by the CCT's convolutional encoder. The CC tokenizer divides the input image or video into a fixed-size grid of non-overlapping patches, each of which is represented as a set of pixels. These patches are converted into tokens and passed through the convolutional encoder to extract spatial features, and the resulting tokens are then passed to the transformer encoder to generate the output. The CC tokenizer thus acts as a pre-processing step that allows the CCT to process images and videos more efficiently, as it reduces the dimensionality of the input data by breaking it down into smaller, more manageable pieces.

Patch encoding refers to the process of breaking down an image into smaller, manageable pieces, or "patches", that can be processed by the ViT’s transformer-based architecture. The patch encoding is responsible for dividing the input image into a fixed-size grid of non-overlapping patches, each of which is then represented as a set of pixels. These patches are then flattened and passed through a linear layer to obtain a feature vector. These feature vectors are then used as the input to the transformer-based architecture, where the self-attention mechanism is applied to learn the relationships between the patches and generate the output. The patching process in ViT is mathematically described by Eqs. 6, 9, and 14.

The patch encoding step allows the ViT to process images more efficiently by reducing the dimensionality of the input data, breaking it down into smaller, more manageable pieces, and it also allows the model to learn the relationships between the patches, which is useful for image classification and other tasks. Here, the two concepts are combined and interconnected in a transformer encoder. In simple transformers, a single encoding strategy is used; in the proposed model, however, two strategies are applied in parallel, hence the name PCES. The two mechanisms run in parallel while information is continuously exchanged between the two encodings, as shown in Fig. 5.

Fig. 5
figure 5

PCES in transformer

3.3 Mutual latent fusion (MLF)

Following the patch encoding process in ViT, the self-attention mechanism is applied by the model to capture the connections between the patches and produce the ultimate output. The self-attention mechanism in ViT represents each patch with a key, query, and value vector and calculates the dot product between the query and key vectors for all patches [49]. The dot product values are then used to compute a weight for each patch, which determines the importance of the patch for the final output. These weights are used to compute a weighted representation of the patches by taking a linear combination of the value vectors.

After applying the self-attention mechanism in ViT, the weighted representation is produced and it includes the crucial information from the patches, which is further processed by a feedforward neural network to generate the final prediction; thus, the weighted representation in ViT is the result of the encoding and attention mechanism, consisting of the combined patch information weighted by the attention mechanism.

To combine the latent representations generated from both the CC tokenizer and the patch encoding process, a fusion mechanism is used in which the representations are merged to create the final output of the model. Equation 19 provides the mathematical form of the combined MLF process.

In neural networks, feature fusion refers to the process of combining information from multiple layers or multiple channels of the network to form a more comprehensive and robust feature representation [85]. To combine the weighted representations of the encoded information, the MLF concept is utilized, followed by passing the features to the classifier to predict the instance's actual label, as depicted in Fig. 6.

Fig. 6
figure 6

MLF in transformer

4 Experiments and results

This section presents a comparison between the BiViT algorithm and various transfer learning algorithms including DenseNet121, ResNet50, VGG16, VGG19, Inception-ResNet V2, Inception V3, EfficientNet B0, EfficientNet B1, ResNet-101, Xception, MobileNet, and ResNet-152. Additionally, unsupervised deep learning techniques such as convolutional autoencoders, variational autoencoders, and sparse autoencoders were utilized for classifying abnormalities in instances with a primary focus on AD. The subsequent sections discuss the performance of different deep learning techniques in comparison to the proposed BiViT algorithm.

This section follows a structured outline beginning with Sects. 4.1.1 and 4.1.2, which describe the datasets used in the research. In Sect. 4.2, the preprocessing and augmentation techniques utilized in this study for the classification task are outlined. Sections 4.3 and 4.5.1 detail the use of various transfer learning algorithms, both with and without augmentation, for classifying AD, and then compare the performance of all algorithms. The results of this analysis are presented in Tables 3 and 4. Table 5 summarizes the class-wise performance of the top four models out of all trained models. Additionally, Sect. 4.5.2 describes the performance of different unsupervised deep learning techniques, such as autoencoders, in extracting features from images and classifying instances based on abnormalities. The results of unsupervised learning-based classification of AD stages are presented in Table 6. Section 4.6 focuses on the same transfer learning algorithms and deep autoencoders for the classification of cognitive disorders other than AD. The outcomes of this analysis are presented in Tables 7, 8, 9, and 10.

Table 1 Hyper parameter setting and information about the proposed BiViT

The proposed approach is applied to classify cognitive disorders and AD. The model is trained for up to 200 epochs using MRI images, with a total of 2,569,014 trainable parameters. Table 1 provides a summary of the hyperparameters (learning rate, batch size, number of epochs etc.) used in this research work. Various metrics, including F1 score, accuracy, recall, precision, and AUC, are employed for evaluating the performance of the BiViT model.

4.1 Dataset

The research conducted uses two publicly available MRI datasets. The first dataset is utilized for AD stage classification and consists of four classes: mild demented, very mild demented, non-demented, and moderate demented. The second dataset is used for detecting various cognitive impairments and consists of five types of disorders: AD, CN, EMCI, LMCI, and MCI [35, 83]. The first dataset with four classes is used to classify the stage of AD, which is a progressive brain disorder affecting memory, thinking, and behavior. The classification process can help with early diagnosis and treatment of the disease. The four classes in the dataset represent different stages of the disease, with very mild demented and mild demented indicating earlier stages, moderate demented indicating a later stage, and non-demented representing individuals without dementia. The stages of cognitive decline range from mild to severe, with the severity increasing as the disease progresses; the later stage of cognitive decline is often referred to as moderate to severe dementia [27]. Figure 7 illustrates the hierarchical structure of both datasets.

Fig. 7
figure 7

Visual hierarchy of data

The second dataset is used to detect various cognitive disorders, including AD, and is divided into five types of disorders. The CN class represents individuals without any cognitive impairment, while the other four classes represent different stages of cognitive decline. MCI is a condition in which an individual experiences mild cognitive decline beyond what is expected for their age but is still able to perform their daily activities. EMCI refers to the early stages of cognitive decline [42], and LMCI refers to a more advanced stage of cognitive decline [58]. Finally, AD is a progressive neurodegenerative disorder that affects memory and behavior and is the most common cause of dementia in older adults. Early detection of cognitive disorders can help with timely interventions to slow down the progression of the disease and improve the quality of life of affected individuals.

Table 2 presents the details of the training and testing samples utilized in the experiments of this study. In the following subsections, a brief overview of each dataset is provided.

Table 2 Alzheimer disease and cognitive disorders data distribution

4.1.1 Alzheimer disease data

The dataset used in this research was collected from several websites, hospitals, and public repositories [26]. The dataset consists of preprocessed MRI images that were resized to \(128 \times 128\) pixels. It comprises four classes of images, categorized by the severity of dementia. In total, the dataset contains 6400 MRI images after the augmentation process [95]. Figure 8 depicts MRI samples from all classes.

Fig. 8
figure 8

Alzheimer MRI Data Samples

The first class in the dataset is mild demented, which contains 896 images. This class represents patients who have mild symptoms of dementia. The second class is moderate demented, which contains 64 images. This class represents patients who have moderate symptoms of dementia. The third class is non demented, which contains 3200 images. This class represents patients who do not have dementia. The fourth and final class is very mild demented, which contains 2240 images. This class represents patients who have very mild symptoms of dementia. The distribution of MRI data for AD can be seen in Fig. 9.

Fig. 9
figure 9

Distribution of Alzheimer disease dataset

4.1.2 Cognitive disorders data

The dataset used in this study comprises five different stages of cognitive disorders, which have been divided into two directories for training and testing. The stages included in the dataset are EMCI, LMCI, MCI, AD, and CN. Specifically, there are 204 samples of EMCI, 61 samples of LMCI, 198 samples of MCI, 145 samples of AD, and 493 samples of CN, for a total of 1101 samples [18]. The dataset was sourced from the ADNI website, which was created as a collaborative effort to accelerate Alzheimer's research; the ADNI website is duly credited as the source of the dataset [2]. The MRI data samples for cognitive disorders are illustrated in Fig. 10.

Fig. 10
figure 10

Cognitive disorders MRI samples

Figure 11 illustrates the distribution of data across different stages of cognitive decline.

Fig. 11
figure 11

Distribution of cognitive disorders dataset

4.2 Data preprocessing

To make the images suitable for the model, the preprocessing step aims to enhance their quality, thereby improving the performance of the model. The primary objective of preprocessing is to enhance important image features for further processing. Resizing standardizes the size and resolution of the images, while scaling modifies the pixel intensity values to enhance contrast and detail visibility, so that the pixels take values between 0 and 1. The images were resized to \(128 \times 128\) dimensions and subsequently normalized using Eq. 20 such that the pixel values ranged between 0 and 1. Let I represent the input image of size (\(m \times n\)) and \(I_{norm}\) the normalized MRI; the normalization is given as follows.

$$I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}}$$
(20)
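A minimal sketch of the resizing and min-max normalization of Eq. 20 is shown below, assuming images are available as H x W x C arrays; the exact image-loading pipeline may differ, and the small epsilon is an assumption added to guard against flat images.

```python
# Illustrative resizing to 128 x 128 and min-max normalization to [0, 1] (Eq. 20).
import numpy as np
import tensorflow as tf

def preprocess(image):
    image = tf.image.resize(image, (128, 128)).numpy()   # standardize size and resolution
    i_min, i_max = image.min(), image.max()
    return (image - i_min) / (i_max - i_min + 1e-8)      # scale pixel values to [0, 1]
```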

Data augmentation is incorporated into the methodology to further enhance the learning efficacy of the deep learning models once the preprocessing phase is finished. These transformations diversify the training dataset, which can result in a more resilient model with an improved ability to generalize to new data. Data augmentation improves the generalization ability of the deep learning model by exposing it to a wider variety of images, thereby reducing overfitting and improving the model's ability to learn from limited data. For the augmentation process, two techniques are employed: random crop and random flip. Random crop selects a random portion of the image while preserving its aspect ratio; this introduces variability in the training data by focusing on different regions of the image. Random flip horizontally flips the image; this helps the model learn to recognize objects and patterns in different orientations, further enhancing its ability to generalize and perform well on unseen data. By combining these two augmentations, a more diverse and robust training dataset is created, facilitating better learning and improved performance of the model.
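A hedged Keras sketch of these two augmentations is shown below: a random crop (re-resized to the model input size) followed by a random horizontal flip. The crop size is an assumption for illustration, not the value used in this work.

```python
# Illustrative random-crop and random-flip augmentation pipeline.
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomCrop(112, 112),        # take a random square crop of the image
    layers.Resizing(128, 128),          # restore the 128 x 128 model input size
    layers.RandomFlip("horizontal"),    # random horizontal flip
])

# augmented = augment(images, training=True)  # augmentation is active only during training
```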

4.3 Explanation of transfer learning architectures and unsupervised autoencoders

In this section, various transfer learning algorithms are explored for use in the comparative analysis. The aim is to assess the performance of the proposed model against different transfer learning models, including DenseNet-121, ResNet-50, VGG19, VGG16, Inception ResNet-v2, Inception-v3, EfficientNet-B0, EfficientNet-B1, Xception, MobileNet, and ResNet-152.

DenseNet-121 is a CNN-based architecture renowned for its dense connectivity pattern: within this architecture, each layer is directly connected to every other layer in a feed-forward manner, making it highly parameter-efficient [41]. ResNet-50, on the other hand, is a variant of the Residual Neural Network (ResNet) architecture, developed specifically to tackle the vanishing gradient problem in deep networks. Its skip connections ensure that gradient information is preserved during training, thereby enabling more effective learning, as described in Eq. 21.

$$x_u = x_u\left( \left[ \begin{matrix} 1 & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{matrix} \right] + \frac{\partial {\mathcal {F}}}{\partial x} \right)$$
(21)
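For illustration, a minimal Keras sketch of a residual (skip) connection is given below: the identity path in Eq. 21 corresponds to adding the block input back to the block output, which keeps gradients flowing even through very deep stacks. The filter count is an assumption, and the sketch assumes the input already has the same number of channels as the block output.

```python
# Illustrative residual block: output = ReLU(x + F(x)), matching the identity path in Eq. 21.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                                      # identity (skip) path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                  # residual function F(x)
    y = layers.Add()([shortcut, y])                                   # x + F(x)
    return layers.ReLU()(y)
```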

ResNet-152, part of the ResNet family, is a CNN architecture introduced in 2015 by [39] for image recognition. By incorporating skip connections, ResNet-152 simplifies the training and optimization of deep neural networks.

VGG19 and VGG16 are part of a distinct category of transfer learning models, originating from the Visual Geometry Group (VGG) architecture, renowned for its simplicity and uniformity. These models consist of several layers of small-sized convolutional filters, followed by max-pooling layers. VGG19 has 19 layers, whereas VGG16 has 16 layers [71]. Inception ResNet-v2, another transfer learning model, combines the strengths of both Inception and ResNet architectures, resulting in a powerful hybrid model. The presence of residual connections ensures efficient gradient propagation during training [86].

The EfficientNet family comprises models aimed at achieving an optimal balance between model size and accuracy. EfficientNet-B0 serves as the baseline model within this family and leverages compound scaling to optimize depth, width, and resolution simultaneously. On the other hand, EfficientNet-B1 is designed to be slightly larger and more accurate than B0. It adopts the same compound scaling technique but with greater model size and complexity when compared to B0 [90].

Xception, which stands for Extreme Inception, is a CNN architecture inspired by the Inception model. Xception employs depthwise separable convolutions, effectively reducing the number of parameters and computational burden when compared to conventional convolutions [19]. Finally, we have MobileNet, a family of lightweight CNN architectures, specially designed for mobile and embedded devices. These models leverage depthwise separable convolutions to decrease computational demands and model size without compromising on reasonable accuracy [40, 76].

Next, the various types of autoencoders employed in this study for the comparative analysis are discussed. A Convolutional Autoencoder (CAE) is a specialized type of autoencoder that employs convolutional layers in its encoder and decoder rather than fully connected layers. It is specifically designed for handling image data and excels in capturing spatial patterns and features from images. By using convolutional layers, the CAE can effectively reduce the dimensionality of the input data while retaining crucial features [101]. The Variational Autoencoder (VAE) is a probabilistic variant of the conventional autoencoder. Unlike a standard autoencoder, which merely learns an encoded representation of the input data, a VAE also models the underlying probability distribution of the encoded data. This unique characteristic enables VAEs to generate new data samples by sampling from the learned probability distribution [50].

The sparse autoencoder (SPAE) is a type of autoencoder that enforces sparsity constraints during its training process. These constraints encourage the autoencoder to utilize only a limited number of neurons in its hidden layer to represent the input data. As a result, the autoencoder produces a more concise and efficient representation of the data, offering benefits such as reducing overfitting and enhancing generalization in certain situations [68].
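One common way to impose such a sparsity constraint is an L1 activity penalty on the hidden code, as in the minimal sketch below; the layer sizes and penalty weight are assumptions, and the SPAE used in this study may employ a different sparsity formulation (e.g., a KL-divergence penalty).

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# An L1 activity penalty on the hidden code pushes most activations toward
# zero, so only a few hidden units are used to represent each input.
inputs = tf.keras.Input(shape=(784,))                      # assumed flattened image
code = layers.Dense(
    64, activation="relu",
    activity_regularizer=regularizers.l1(1e-4),            # sparsity constraint
)(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)    # reconstruction

sparse_ae = tf.keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")
```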

Undercomplete and overcomplete autoencoders are two variations of the traditional autoencoder. An undercomplete autoencoder has a hidden layer with fewer neurons than the input layer, which results in a compressed representation of the input data. On the contrary, an overcomplete autoencoder has a hidden layer with more neurons than the input layer, leading to a redundant representation of the data. This allows the autoencoder to learn multiple representations of the same data, offering more flexibility but also increasing the risk of overfitting [14, 89].
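The distinction can be expressed simply by the size of the bottleneck relative to the input, as in this small sketch (the dimensions are illustrative only):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_autoencoder(input_dim, code_dim):
    """Fully connected autoencoder whose behaviour is set by the code size."""
    inputs = tf.keras.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu")(inputs)
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)
    return tf.keras.Model(inputs, outputs)

undercomplete = dense_autoencoder(input_dim=784, code_dim=32)    # code smaller than input
overcomplete  = dense_autoencoder(input_dim=784, code_dim=1024)  # code larger than input
```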

4.4 Experiments setting and hyper-parameter configuration

In the following section, a comparative analysis assessing the performance of the proposed model against state-of-the-art transfer learning models and unsupervised autoencoders is presented. Here, the experiment setting and hyperparameter configuration used during the experiments and comparisons are discussed. First, details about the experiment setting are provided, followed by a comprehensive discussion of the hyperparameter configuration of all the other algorithms trained for comparative purposes. The hyperparameter configuration of the proposed BiViT is presented at the beginning of Sect. 4.

The experiment setting remains consistent across all models, including the proposed BiViT model. A standardized process of loading, preprocessing, and splitting the data into training and testing sets is followed. Subsequently, all algorithms are trained for the same number of epochs (200), using a uniform learning rate (0.001), optimizer (Adam), and batch size of 32. For the transfer learning models, the backbones are initialized with "imagenet" weights, and training is applied only to the multi-class classifier layer added after the flatten layer. The learning rate is kept at its default value, as it has shown satisfactory performance. The number of epochs is set to 200 to allow sufficient time for observing trends in the accuracy and loss measures, which gives an opportunity to analyze how performance evolves over time. Additionally, since the dataset is extensive due to augmentation, a batch size of 32 is used to facilitate efficient and smooth execution of the experiments.
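A minimal sketch of this training setup is given below, using DenseNet-121 as an example backbone; the input shape, number of classes, and loss function are assumptions not stated explicitly in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4               # assumed number of AD classes
IMG_SHAPE = (176, 208, 3)     # assumed input size

# Pre-trained backbone with "imagenet" weights; only the classifier
# head after the flatten layer is trained, as described in the text.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=IMG_SHAPE)
base.trainable = False

model = tf.keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),   # multi-class classifier layer
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# history = model.fit(x_train, y_train, epochs=200, batch_size=32,
#                     validation_data=(x_test, y_test))
```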

For the unsupervised autoencoders, a 5-layer convolutional encoder and a 4-layer convolutional decoder are used. Throughout all layers, the filter size remains consistent at (\(3\times 3\)). The latent space has a dimension of 2, which is used for classifying the disease. The autoencoders are trained for 10 epochs with a learning rate of 0.001, a batch size of 32, and the Mean Squared Error (MSE) as the loss function.
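The following sketch reflects this configuration (five 3×3 convolutional encoder layers, four 3×3 transposed-convolution decoder layers, a 2-D latent space, Adam with a learning rate of 0.001, and an MSE loss); the image size, filter counts, and stride pattern are assumptions made to obtain a consistent architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SHAPE = (128, 128, 1)   # assumed grayscale slice size

# Encoder: five 3x3 convolutional layers ending in a 2-D latent code.
inputs = tf.keras.Input(shape=IMG_SHAPE)
x = inputs
for filters, stride in [(16, 1), (32, 2), (64, 2), (128, 2), (256, 2)]:
    x = layers.Conv2D(filters, (3, 3), strides=stride, padding="same", activation="relu")(x)
latent = layers.Dense(2, name="latent")(layers.Flatten()(x))    # 2-D latent space

# Decoder: four 3x3 transposed convolutions back to the input resolution.
x = layers.Dense(8 * 8 * 256, activation="relu")(latent)
x = layers.Reshape((8, 8, 256))(x)
for filters in (128, 64, 32):
    x = layers.Conv2DTranspose(filters, (3, 3), strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2DTranspose(IMG_SHAPE[-1], (3, 3), strides=2, padding="same",
                                 activation="sigmoid")(x)

cae = tf.keras.Model(inputs, outputs)
cae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# cae.fit(x_train, x_train, epochs=10, batch_size=32)
```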

4.5 Comparative analysis on Alzheimer disease diagnosis

In this section and the following subsections, a comparative analysis of different transfer learning algorithms for AD classification is performed. Transfer learning is a technique in which a pre-trained neural network is used as a starting point for a new task. The pre-trained network has learned general features from a large dataset, and these features can be reused for the new task, saving time and computational resources. Two tables are created for the experiment: one reporting results with data augmentation and the other without. Data augmentation artificially generates new data by applying changes to existing samples, and the two tables allow the models' performance on augmented versus non-augmented data to be compared.

Data augmentation is a technique in which new training samples are generated by applying transformations such as rotations, flips, and scaling to the original images. This can increase the diversity of the training data and improve the performance of the model. A third table describes the class-wise prediction performance of the best-performing models, providing additional insight into how each algorithm performs for the different classes of AD. Multiple algorithms are trained and the results are reported. Each algorithm has a different architecture and a varying number of layers. The performance of these algorithms is compared in terms of accuracy, recall, precision, F1-score, and AUC.
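An illustrative augmentation pipeline covering the transformation types mentioned above is sketched below using Keras preprocessing layers; the specific rotation and zoom ranges are assumptions, not the exact settings used for the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation pipeline with the transformation types named in
# the text (flips, rotations, scaling); the exact ranges are assumptions.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),     # small random rotations
    layers.RandomZoom(0.1),          # random scaling
])

# Applying the pipeline to a batch of images during training:
# augmented = augment(images, training=True)
```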

Section 4.5.2 discusses the usage of various deep unsupervised learning algorithms, also known as autoencoders, to classify instances with abnormality. These autoencoders include the CAE, VAE, SPAE, and undercomplete and overcomplete CAEs. The performance of these autoencoders is evaluated based on their ability to extract relevant features from the input images and classify instances with abnormality.

4.5.1 Alzheimer detection with transfer learning algorithms

In this section, a comparative analysis between different transfer learning algorithms for classifying AD is presented, with the results reported in Table 3 in terms of accuracy, recall, precision, F1-score, and AUC. It is important to note that the results in Table 3 are obtained by applying data augmentation techniques to the Alzheimer data, which means that the algorithms are trained on augmented data. The models are then tested on the test data to evaluate their performance. The visual performance of the proposed BiViT algorithm (with data augmentation) after each epoch is illustrated in Fig. 12.

Table 3 The table provides a summary of the results obtained by various transfer learning algorithms for AD diagnosis with different data augmentation techniques

The performance of the different algorithms in terms of precision, recall, F1-score, accuracy, and AUC is compared in Table 3. From these results, it can be concluded that the proposed BiViT model outperforms the competing algorithms on all performance measures. Furthermore, it can be observed that the precision is low for all algorithms whereas the recall is high. This suggests that the algorithms are able to detect most of the positive cases, but at the cost of many false positives. The BiViT model, however, achieves good results in terms of both recall and F1-score.

Fig. 12 Performance of proposed BiViT with data augmentation technique

In contrast, the results illustrated in Table 4 reveal that all transfer learning algorithms perform remarkably well even without the use of data augmentation techniques. In fact, these results are significantly superior to those achieved with data augmentation (Table 3). The progress of the proposed BiViT algorithm after each epoch, without data augmentation, is depicted in Fig. 13. It can be observed that the proposed BiViT model outperforms all other transfer learning models, with the highest accuracy and recall scores. Based on these results, it can be concluded that data augmentation techniques may not always be necessary to achieve good performance in image classification tasks, and that the proposed BiViT model shows promising results in this regard. Moreover, Table 5 shows the class-wise comparative analysis of the four best-performing algorithms.

Table 4 The summary table presents the results achieved by different transfer learning methods for diagnosing AD, without the use of any data augmentation techniques
Fig. 13 Performance of proposed BiViT model without data augmentation

Table 5 Class-wise performance of best 4 models in AD classification case

4.5.2 Alzheimer detection with deep unsupervised learning

Deep autoencoders are a type of unsupervised learning technique that can extract features by modeling the latent manifold of the data. These features can then be used to classify instances with a simple classifier. In this study, different autoencoders, namely the CAE, VAE, SPAE, and undercomplete and overcomplete AEs, are used for AD classification, and the results are presented in Table 6. Each of these autoencoders is trained and tested on the dataset to classify instances with abnormality. The results show the performance of each autoencoder in terms of different evaluation metrics, such as accuracy, recall, and F1-score.
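As a hypothetical sketch of this two-stage procedure, the snippet below reuses the bottleneck of the CAE sketch from Sect. 4.4 as a fixed feature extractor and fits a simple classifier on the 2-D latent codes; the random arrays merely stand in for real MRI data and labels.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

# 'cae' and the layer name "latent" refer to the CAE sketch in Sect. 4.4;
# the random arrays below only stand in for real MRI data and labels.
encoder = tf.keras.Model(cae.input, cae.get_layer("latent").output)

x_train = np.random.rand(64, 128, 128, 1).astype("float32")
y_train = np.random.randint(0, 4, size=64)
x_test = np.random.rand(16, 128, 128, 1).astype("float32")
y_test = np.random.randint(0, 4, size=16)

z_train = encoder.predict(x_train)               # 2-D latent features
z_test = encoder.predict(x_test)

clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
print("test accuracy:", clf.score(z_test, y_test))
```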

Table 6 The table presents a comparison between various autoencoders (an unsupervised learning technique) for diagnosing AD without using data augmentation

The results show that autoencoders perform well and can compete with transfer learning algorithms in AD classification. In fact, some of the autoencoders even outperform the transfer learning algorithms. Moreover, the SPAE performs best among all the autoencoders, which indicates that it extracts the most informative features from the images and classifies AD instances more accurately than the other autoencoders. The performance of the SPAE, CAE, and VAE can be observed in Figs. 14, 15, and 16, while Figs. 17 and 18 display the performance of the overcomplete and undercomplete autoencoders.

Fig. 14 Latent space of sparse autoencoder

Fig. 15 Latent space of convolutional autoencoder

Fig. 16 Latent space of variational autoencoder

Fig. 17 Latent space of overcomplete autoencoder

Fig. 18 Latent space of undercomplete autoencoder

4.6 Comparative analysis on cognitive diseases diagnosis

In this section and its subsequent subsections (Sects. 4.6.1 and 4.6.2), a comparison is made among different transfer learning techniques for the purpose of classifying cognitive disorders. Two tables are presented, one with data augmentation (Table 7) and one without (Table 8); they are used to compare the performance of the models on augmented versus non-augmented data. Additionally, Table 9 provides the class-wise performance of the four best-performing algorithms. These sections also present a comparison of various autoencoders for cognitive disorder classification, illustrated in Table 10.

4.6.1 Transfer learning for cognitive disorder diagnosis

Different transfer learning techniques, including the proposed BiViT model, are employed to classify cognitive disorders. However, due to the limited and imbalanced dataset, none of the algorithms performs up to the mark, even after utilizing data augmentation methods. Table 7 presents the results of the various algorithms trained on the cognitive disease dataset. It can be seen that all the algorithms have an accuracy within the range of 35–45%. Although the recall score is high, the precision and F1-score are low. Nevertheless, the proposed BiViT model performs comparatively well in terms of accuracy and F1-score. Table 8 provides the performance of the same set of algorithms as in Table 7, but trained without data augmentation.

Table 7 The table presents the results of various transfer learning methods employed for the detection of cognitive disorders (with the use of data augmentation)
Table 8 The table presents the results of various transfer learning methods employed for the detection of cognitive disorders (without data augmentation)

Table 8 compares the performance of the different transfer learning algorithms without data augmentation techniques. Although the range of accuracy remains similar, i.e., 40–45%, the precision and F1-score are consistently low. Even the proposed BiViT model performs poorly in this scenario, which is primarily attributed to the limited and imbalanced nature of the dataset. It is expected that, with an increase in data quantity, the performance of all algorithms may improve in the future. Table 9 demonstrates the class-specific performance of the four best-performing algorithms for cognitive disorder classification.

Table 9 Class-wise performance of best 4 models for cognitive disorders detection

4.6.2 Deep unsupervised learning for cognitive disorder diagnosis

Table 10 presents the performance of the various autoencoders used for the classification of different cognitive impairments. From these statistics, it can be concluded that the performance of autoencoder-based models for cognitive disorder classification is not satisfactory. This could be due to the limited nature of the dataset, which may not provide enough examples for the autoencoders to generalize well. However, the SPAE performs relatively well and is competitive with the transfer learning models.

Table 10 The comparison table presents various autoencoder models (an unsupervised learning technique) and their performance evaluation for the detection of cognitive disorders (without data augmentation)

4.7 Comparison with state-of-the-art approaches

There are various models available in the literature for classifying AD and cognitive disorders. This subsection aims to compare the proposed BiViT model with other state-of-the-art models in the literature, presented in Table 11.

Table 11 The presented table illustrates a comparison between the proposed BiViT model and other state-of-the-art models available in the existing literature

5 Discussion

This section discusses the research findings as well as their limitations, applications, and future directions. Section 4.5 presents the outcomes of the proposed BiViT, the autoencoders, and the transfer learning models when applied to the AD dataset. Table 3 shows that BiViT outperforms all other transfer learning models with 93% accuracy when data augmentation techniques are applied to the AD data. However, the results obtained without augmentation are better than those obtained with data augmentation (Table 4).

When augmentation techniques are used on medical image datasets, they might not always produce significant improvements, for several reasons. First, medical images frequently contain intricate anatomical structures and subtle characteristics, which make it difficult to augment them without introducing distortions or unrealistic variations. Additionally, medical image datasets are often smaller and more specialized than general image datasets, making it challenging to find augmentation strategies that capture the variability within the dataset without introducing biases or artifacts. Table 5 presents the class-wise performance and clearly shows that, when applied to the AD dataset, BiViT and MobileNet yield the best performance metrics.

Although MobileNet has demonstrated its effectiveness in AD classification, the transformer-based BiViT model presents a strong alternative. While CNNs form the foundation of MobileNet and are widely recognized for their efficiency, they are inherently limited in their ability to capture long-range dependencies and maintain spatial relationships across images. Transformers, on the other hand, excel in these areas, using self-attention mechanisms to identify complex spatial patterns that are essential for precise diagnosis. Furthermore, because of their intricate hierarchical architectures, CNNs can be difficult to interpret, which makes it hard to comprehend model decisions, a crucial consideration in medical applications. Transformers, by contrast, provide improved interpretability and transparency in decision-making thanks to their attention mechanisms.

The results of the autoencoder-based models are then displayed in Table 6, which demonstrates how well these models work on the AD dataset. The autoencoders attain high accuracy at the expense of precision and F1-score. The same transfer learning algorithms, autoencoders, and the proposed BiViT are then applied to the cognitive disorder data, with and without augmentation, in Sect. 4.6. All models, including the proposed BiViT, perform worst in both the augmentation and non-augmentation scenarios. As discussed in the sections above, this is caused by the very small amount of data and its uneven distribution.

Finally, when compared to state-of-the-art architectures from the literature, the proposed model emerges as a strong contender with competitive performance metrics in AD classification. The model is positioned as an AD detection system that operates in real time and can be easily integrated with the Internet of Things (IoT), making it adaptable to a range of healthcare environments. There are many uses for the BiViT model, especially in hospitals where accurate and timely diagnosis is critical. It is important, however, to recognize some inherent limitations of the proposed BiViT model. Its training is primarily based on a single AD dataset, which means that it needs to be adjusted for use in different hospital settings. This emphasizes how crucial it is to increase the size of the dataset in order to improve model performance, since larger datasets make it easier to extract more useful and subtle features. Although the model performs well on the current dataset, because it was trained on a small amount of data and requires domain-specific knowledge, its generalizability to other datasets may require careful fine-tuning. Given these challenges, researchers can now work on improving the functionality and applicability of this model in a range of clinical settings.

6 Conclusion

The early-stage detection of AD and other cognitive declines is crucial for the patient’s health. To address this, a computer-aided diagnosis system is proposed that utilizes 2D-MRI images to detect different cognitive disorders, including AD, by incorporating the PCES and MLF mechanisms. In contrast to the simple ViT, which utilizes only a single patching mechanism, the proposed BiViT employs two parallel coupled encoding mechanisms, namely simple patch encoding and image tokenization, to encode information more efficiently. A comparative analysis is presented using two different datasets of Alzheimer disease and cognitive disorders. Various state-of-the-art models from the literature, including transfer learning and autoencoder-based models, are included in the study to compete with the proposed algorithm. The accuracy, recall, precision, F1-score, and AUC for the AD classification task are reported as 96.38%, 97.87%, 88.28%, 92.84%, and 99.47%, respectively. For cognitive disorders, the proposed BiViT algorithm achieved an accuracy, recall, precision, F1-score, and AUC of 44.94%, 68.69%, 33.29%, 45.38%, and 72.62%, respectively. This means that the proposed algorithm performed well in the case of AD classification but had lower performance for cognitive disorders due to limited and imbalanced data. Therefore, to improve the model’s performance, incorporating more diverse and representative datasets is recommended. Additionally, integrating techniques for explainability and interpretability would enhance transparency in the model’s decision-making process.