1 Introduction

Apples are among the most widely cultivated and consumed fruits worldwide, owing to their high nutritional and therapeutic value. The antioxidant effects of apples, attributed to their high fiber and phytochemical content, help protect cellular DNA from oxidative damage, which can lead to cancer [1]. Moreover, these compounds hinder the proliferation of new cancer cells and reduce the spread of existing ones [1]. Apples are rich in vitamin C, sodium, potassium, fiber, phosphorus, calcium, and iron. Larsson et al. [2] have shown that eating more apples can help lower the risk of stroke. In addition to their nutritional benefits, apples play a vital role in the economies of agrarian countries by contributing to employment, export revenue, and local livelihoods. In Pakistan, apple is the fourth largest fruit crop in terms of production; in 2020, apple production in Pakistan exceeded 0.67 million tons.

However, apple plants are prone to various diseases that reduce both the quality and the quantity of the fruit produced, and thereby its nutritional and therapeutic value. These diseases are caused by fungi, viruses, nematodes, and bacteria, among others. Diseases such as scab, complex, rust, and frog eye leaf spot hinder the sound development of the apple industry and significantly reduce apple production, leading to substantial setbacks for the country's agricultural sector. Among these, scab, which is caused by the ascomycete Venturia inaequalis [3], is the most significant disease. Identifying the leaf area affected by scab is difficult, and the disease can inflict more damage on the plant than any other. Scab first appears on the leaf as yellow spots; in later stages, the fruit becomes cracked and malformed, rendering it unusable. Identifying scab at an early stage is important for halting its spread and protecting the apple harvest from deterioration.

Rust causes economic losses in several ways: it not only seriously damages the apple tree but also reduces the size of the fruit. Severe leaf infections and defoliation make trees vulnerable to winter injury. Cedar-apple rust predominantly targets the leaves and fruit of apple and crabapple trees. Frog eye leaf spot, caused by a fungal pathogen, is a prevalent disease of apple trees [4]. It can lead to fruit infections that severely affect the harvest, and its initial symptoms are tiny, purplish spots on the leaves. It is therefore essential to diagnose and treat apple diseases accurately and in a timely manner to ensure a healthy and productive harvest. Figure 1 shows samples of apple leaf diseases.

Fig. 1 Samples of apple leaf diseases

The timely identification and diagnosis of these diseases is essential to prevent economic losses and preserve the nutritional quality of apples. Diagnosing the correct disease as soon as it appears and taking appropriate precautionary measures can help farmers save both the apple harvest and the environment. Therefore, an automated solution is imperative for timely leaf disease detection. Convolutional neural networks (CNNs) have become the preferred choice for vision applications, leading to the development of high-performance models with extensive connections and sophisticated forms of convolution. The recent success of CNN models is largely due to stacking a large number of layers and training very deep networks. CNNs have demonstrated significant success in agriculture, particularly in plant leaf disease detection. Several studies [5,6,7,8] employed deep ResNet-like architectures with certain modifications for the detection and classification of plant leaf diseases. Owing to their substantial model size and intricate network structures, these models achieved remarkable accuracy. However, deploying them on mobile devices has remained a major challenge. Subsequent works [9,10,11,12,13] focused on developing computationally efficient models that could be deployed on resource-limited devices. Despite their resource efficiency, these models did not achieve the desired accuracy.

Recently, transformer architectures have attracted attention because of their superior performance and ability to model long-range dependencies. Equipped with multi-head self-attention (MHSA), vision transformers (ViTs) have achieved strong results in image classification [14], video classification [15,16,17], semantic segmentation [18], object detection [19, 20], video object segmentation [21], and 3D object detection [22]. However, owing to their large number of parameters and floating-point operations (flops), ViTs may not be suitable for real-world applications, especially when they must be deployed on lightweight devices. Therefore, given the storage and computational constraints of embedded devices, it is imperative to reduce the number of parameters and flops without compromising the performance of the model. In this research, we propose a lightweight hybrid vision transformer named AppViT for the detection and classification of apple leaf diseases. The primary contributions of this research are as follows:

  • We propose a novel lightweight network for classifying apple leaves into four disease types, as well as healthy leaves.

  • Through extensive experiments, we show that our model outperforms state-of-the-art CNN architectures in terms of the accuracy-efficiency trade-off on lightweight devices.

  • Our lightweight model paves the way for future extensions that identify and classify diseases in various fruits and vegetables, offering a versatile solution for agricultural applications beyond apple leaf diseases.

2 Related work

Previously, computer vision researchers used various machine learning algorithms for leaf disease detection, and these algorithms achieved some success in this domain. Popular choices included support vector machines (SVM) (Wetterich et al. [23], Jiang et al. [24], Sethy et al. [25]), random forest (Wojtowicz et al. [26]), filter segmentation (Kamath et al. [27]), K-means clustering (Tian et al. [5]), k-nearest neighbors (KNN) (Chaudhary et al. [28]), and other image processing methods (Nosratabadi et al. [29], Tavoosi et al. [30], Asghar et al. [31]). However, these machine learning algorithms have limited scope and have not reached the expected performance level.

The use of deep learning algorithms in the agriculture industry, especially in the detection of plant leaf disease, has shown promising results. Since their emergence, CNNs have been employed by Fuentes et al. [32], Liu et al. [33], and Zhang et al. [34] for the classification and identification of different leaf diseases. The ability of CNNs to extract local features from neighboring pixels enabled them to achieve better results than traditional machine learning approaches. Hossain et al. [35] designed a novel CNN for the classification of rice leaf diseases. In the experiment, the authors used a new data set consisting of 4199 images of leaf disease. The training accuracy of the designed model was 99.78% and its validation accuracy was 97.35%. The model was then tested on images of rice leaf disease and achieved 97.82% accuracy.

Jiang et al. [36] proposed a method combining the features of VGG and InceptionNet for the recognition of apple leaf disease. The model achieved an accuracy of 97.14% on the ALDD data set, which contains 26,377 images of five different diseases. Li and Rai [6] carried out apple leaf disease detection using the ResNet-18 and ResNet-34 architectures. Owing to their large size and huge number of parameters, the models achieved accuracies of 99% and 97%. Although these models demonstrated impressive accuracy, they were not feasible for deployment on lightweight devices. Another successful study in the apple leaf disease detection domain was carried out by Zhong et al. [37]. The authors trained DenseNet-121 on a limited data set that included only three diseases, segmented into five categories based on disease severity. Their trained model, with around 8 million parameters, achieved an accuracy of 93.1%. Hossain et al. [39] suggested a framework based on a gradient boosting classifier for the classification of plant leaf disease. In the methodology, the authors employed adaptive centroid segmentation using the optimal k value, and features were extracted using a modified histogram-based local ternary pattern. The suggested method achieved 98.51%. In [38], the authors proposed a lightweight MobileNet-based apple leaf disease detection method. The authors collected 334 images of apple leaves affected by two types of disease: Alternaria leaf blotch and rust. Although the model is compatible with lightweight devices, its accuracy of 73.50% indicates the difficulty of balancing model size and accuracy for efficient disease detection on resource-constrained platforms.

Hossain et al. [39] presented a deep CNN model based on depthwise separable convolution for the classification of plant leaf diseases. The authors presented three novel CNNs and compared them in terms of model size, accuracy, and computational power. The designed models achieved a highest accuracy of 99.55%. The limitation of this work was overfitting. Islam et al. [40] suggested an automated deep learning-based web application for the classification of apple leaf diseases. The authors employed pre-trained CNNs including VGG16, VGG19, and ResNet50. The experiments were carried out on 1000 images from the PlantVillage data set and achieved a highest accuracy of 96.15%. The limitation of the suggested method was the small amount of data used for training. Paul et al. [41] presented a real-time application based on CNNs for the classification of tomato leaf disease. The authors designed a customized CNN and used VGG16 and VGG18 with transfer learning for the classification phase. For the experiments, they selected ten tomato diseases and one healthy class, and achieved 95.00% accuracy. The limitation of the presented framework was overfitting, because cross-validation was not implemented. Yao et al. [42] presented a deep learning framework based on a multi-prediction approach for the identification of plant leaf diseases. The authors developed a customized CNN named generalized stacking multi-output CNN and used the PlantVillage, plant leaves, and PlantDoc data sets as benchmarks. In a comprehensive comparison with state-of-the-art models, the designed model achieved the highest accuracy of 96.51%. Andrishia et al. [43] designed a new capsule network for the detection and classification of Vitis vinifera leaf diseases. The designed model achieved 98.7% accuracy. The limitations of the presented method were its highly complex architecture and long training time.

Deep learning models have demonstrated significant achievements and have been widely adopted for plant leaf disease detection; however, deploying these models on lightweight devices continues to be a hurdle. Additionally, training lightweight models for resource-constrained devices degrades their performance and makes them unfit for real-time environments. This constraint poses challenges for on-the-spot, real-time applications, which are increasingly vital in modern agriculture. As farmers and agriculturalists move towards more tech-integrated solutions, there is a pressing need for models that can operate seamlessly on portable devices without sacrificing accuracy. This research gap motivates us to design a lightweight deep learning model for apple leaf disease detection that not only outperforms state-of-the-art CNN architectures but is also compatible with lightweight devices.

3 Methodology

The proposed apple leaf disease classification framework is presented in this section. The framework comprises a novel hybrid vision transformer for the classification of apple leaf diseases. As shown in Figure 2, the apple leaf data set is first divided into training and testing sets, and the training data are then augmented. After augmentation, a lightweight hybrid vision transformer named AppViT is designed and trained on the augmented data. Several state-of-the-art models are then trained on the same augmented data. Finally, a comprehensive comparison is conducted on the basis of the number of flops, the number of parameters, and parameter memory.

Fig. 2 Proposed framework for the classification and comparative analysis of the AppViT model with state-of-the-art models

3.1 Dataset collection and augmentation

The data set used in this work, Plant Pathology 2021 - FGVC8 [44], is publicly available on Kaggle. It comprises 16,093 images, which were taken with smartphones and a Canon Rebel T5i DSLR (Canon Inc., Japan) at different stages of disease, with diverse backgrounds and under natural conditions. All images were acquired in an unsprayed apple orchard at Cornell AgriTech, the New York State Agricultural Experiment Station, in New York State, USA [44]. The images have been uniformly cropped to 256 × 256 pixels. The data set consists of five classes: scab, healthy, frog eye leaf spot, complex, and rust. The data distribution is shown in Figure 3.

Fig. 3 Analysis of apple leaf diseases in the Plant Pathology 2021 dataset

Insufficient data is one of the biggest challenges in implementing deep learning models. Diversifying the data set with a larger number of images enables models to learn more robust features and mitigates the risk of overfitting. In this work, the data set is imbalanced, with the scab class containing the highest number of images and the complex class the lowest. To address this imbalance, we applied data augmentation techniques, including random rotation, random flipping, random translation, brightness adjustment, random zoom, and random contrast modification. These augmentations increased both the size of the data set and the diversity of its images. Table 1 describes the data set after augmentation, which contains 44,507 images compared to 16,093 images before augmentation.
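As a rough illustration of how augmentation can be targeted at under-represented classes, the sketch below computes how many extra images each class would need in order to match the largest class. The function name `augmentation_plan` and the class counts are hypothetical; they are not the distribution reported in Figure 3 or Table 1, and the actual balancing procedure used for AppViT may differ.

```python
import math


def augmentation_plan(class_counts, target=None):
    """Return, per class, how many augmented images are needed to reach `target`.

    If `target` is None, the size of the largest class is used.
    """
    if target is None:
        target = max(class_counts.values())
    plan = {}
    for name, count in class_counts.items():
        deficit = max(target - count, 0)
        # Number of augmented variants to generate from each original image.
        copies = math.ceil(deficit / count) if count else 0
        plan[name] = {"extra_images": deficit, "copies_per_image": copies}
    return plan


# Hypothetical toy counts, NOT the paper's class distribution.
toy_counts = {"scab": 500, "healthy": 450, "rust": 200, "complex": 100}
plan = augmentation_plan(toy_counts)
for name, entry in plan.items():
    print(name, entry)
```

Each extra image would then be produced by applying one or more of the random transformations listed above (rotation, flipping, translation, brightness, zoom, contrast) to an original image of that class.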

Table 1 Augmented apple leaf disease dataset

3.2 Overview of vision transformer

Before delving into our model, we give a concise overview of ViTs and how they can be optimized for lightweight devices. The transformer architecture, originally introduced in [45], has had great success in natural language processing (NLP), which is attributed to the attention mechanism. An attention mechanism can be understood as a method that maps a query and a collection of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with the weight for each value determined by how well the query matches the corresponding key. Owing to their ability to model long-range dependencies and learn intricate features, transformer architectures became a natural choice for researchers working in the language field. Figure 4 shows the scaled dot-product attention mechanism.

Fig. 4 Scaled dot-product attention

In vision, the transformer architecture was first introduced in [14]. Since then, it has been widely used for image classification [14], video classification [15,16,17], semantic segmentation [18], object detection [19, 20], video object segmentation [21], and 3D object detection [22]. A 2D image x ∈ ℝ^(H×W×C) fed into a ViT is first split into a flattened sequence of patches x_p ∈ ℝ^(N×(P²·C)), where H and W are the height and width of the input image and C is its number of channels [14]. The number of patches obtained after patchification is N = HW/P², where (P, P) is the resolution of each image patch [14]. Before position embeddings are added to the patch embeddings, a special class token (cls) is prepended to the visual tokens. The resulting sequence of embedding vectors is fed as input to the transformer encoder. The vision transformer encoder is a stack of layers, each incorporating multi-head self-attention (MHSA) and a feed-forward network (FFN). Figure 5 presents a detailed overview of the working of vision transformers (ViTs).

Fig. 5 ViT overview
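To make the patchification arithmetic concrete, the following minimal sketch computes the sequence length N = HW/P² and the flattened patch length P²·C. The patch size P = 16 is an assumption chosen for illustration; the patch size used inside AppViT is not stated in this section.

```python
def patch_sequence_shape(height, width, channels, patch_size):
    """Return (N, P*P*C): the number of patches and the flattened patch length."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    n_patches = (height * width) // (patch_size ** 2)  # N = HW / P^2
    patch_dim = patch_size * patch_size * channels     # P^2 * C
    return n_patches, patch_dim


# 256 x 256 RGB input (the image size used in this paper), assumed P = 16.
n, dim = patch_sequence_shape(256, 256, 3, 16)
tokens_with_cls = n + 1  # the class token adds one element to the sequence
print(n, dim, tokens_with_cls)  # 256 768 257
```

Because self-attention compares every token with every other token, halving the patch size quadruples N and increases the size of the attention map sixteen-fold, which is why the token count matters for lightweight deployment.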

3.2.1 Multi-head self-attention

The multi-head self-attention (MHSA) mechanism allows the network to model long-range dependencies and learn complex features. MHSA allows the model to learn inter-patch representations without inductive bias. As an extension of single-head self-attention, MHSA employs distinct projection matrices for every attention head. Specifically, the input tokens, denoted x_t, are first mapped to queries (Q), keys (K), and values (V) using projection matrices. That is, Q = x_t W_Q, K = x_t W_K, and V = x_t W_V, where W_Q, W_K, and W_V are the projection matrices for the query, key, and value, respectively, each of dimension D × D, as shown in Figure 6. Self-attention is then computed as shown in Eq. 1:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^{T}}{\sqrt{D}} \right) V $$
(1)

Fig. 6 Multi-head self-attention
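Eq. 1 can be sketched in a few lines of dependency-free Python. This is a single-head illustration using nested lists as matrices, not the implementation used for AppViT; the helper names (`matmul`, `transpose`, `softmax`, `attention`) are introduced here for illustration only, and the scaling factor is taken to be the dimension of the query and key vectors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D)) V."""
    d = len(q[0])                     # dimension of each query/key vector
    scores = matmul(q, transpose(k))  # one row of scores per query
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# A zero query matches all keys equally, so the output is the mean of the values.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)
print(weights)  # [[0.5, 0.5]]
print(out)      # [[2.0, 3.0]]
```

In MHSA, this computation is repeated once per head with separate projection matrices, and the per-head outputs are concatenated.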

3.3 Proposed AppViT

In this work, we propose a hybrid model that is not only lightweight (suitable for deployment on mobile devices) but also performs strongly in the identification and classification of apple leaf diseases. The proposed model combines convolution blocks with multi-head self-attention modules in a hierarchical structure, which allows the network to capture long-range dependencies and learn complex features. The AppViT model starts with a convolution block, followed by stacked ViT blocks. First, the input image is passed through a convolution layer with a kernel size of 3 × 3 and stride 2. After this input layer, the model consists of three sections, each comprising a convolution block (CB) followed by successive ViT blocks. The convolution block helps the network encode local features and reduces the size of the attention maps. It consists of alternating convolution and batch normalization (BN) layers and uses the Swish activation function. The primary purpose of the CBs is to down-sample the activation maps. The ViT blocks stacked after each CB allow the network to model global dependencies and contextual information in the feature maps. At the end of the network, a global average pooling (GAP) layer and a softmax classifier are applied. After being designed, AppViT is trained on the selected dataset. The architecture of AppViT is presented in Fig. 7.
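The effect of the convolutional down-sampling on the attention workload can be estimated with a short sketch. It assumes 'same' padding and a stride of 2 in every convolution block; the text states the stride only for the 3 × 3 stem layer, so the per-stage resolutions below are illustrative rather than the actual AppViT configuration. The Swish activation mentioned above is included for reference.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def downsampled_size(size, stride=2):
    """Output side length of a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def stage_resolutions(input_size=256, n_stages=3):
    """Feature-map side length after the stem and after each of the three sections."""
    side = downsampled_size(input_size)  # 3x3 stem convolution, stride 2
    sides = [side]
    for _ in range(n_stages):
        side = downsampled_size(side)    # assumption: each CB halves the resolution
        sides.append(side)
    return sides


sides = stage_resolutions()
tokens = [s * s for s in sides]  # tokens seen by attention if every pixel is a token
print(sides)   # [128, 64, 32, 16]
print(tokens)  # [16384, 4096, 1024, 256]
```

Under these assumptions, each convolution block cuts the number of tokens by a factor of four, so the attention map, whose size grows with the square of the token count, shrinks by a factor of sixteen per stage.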

Fig. 7 Architecture of the proposed AppViT

4 Results and discussions

In this section, the experimental results are discussed. AppViT is trained and evaluated on the Plant Pathology 2021 - FGVC8 dataset [44]. The dataset is split 70:30: 70% of the data is used for training the model, and the remaining 30% is divided equally, with 15% used for validation during training and 15% for testing. The proposed model is trained from scratch, and 10-fold cross-validation is used during training to improve the generalization of AppViT. The model is trained for 65 epochs with the AdamW optimizer and a cosine learning rate scheduler, which starts at a learning rate of 0.002 and gradually decreases it to 0.000125 over the 65 epochs. The batch size is set to 32. All experiments are implemented in TensorFlow v2.11 on a personal desktop with 32 GB of RAM, a 500 GB SSD, and an NVIDIA Tesla P100 graphics card. The results are reported using standard performance metrics: accuracy, precision, recall, and F1-score. These metrics are defined in Eqs. 2–5:

$$ \text{Accuracy} = \frac{\text{No. of correctly classified observations}}{\text{Total no. of observations}} $$
(2)
$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$
(3)
$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$
(4)
$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
(5)
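The cosine learning rate schedule described in the training setup can be written out explicitly. The sketch below uses the standard cosine annealing formula with the reported start and end values (0.002 and 0.000125 over 65 epochs); updating the rate once per epoch and the function name `cosine_lr` are assumptions made for illustration, since the update granularity is not stated.

```python
import math


def cosine_lr(epoch, total_epochs=65, lr_start=0.002, lr_end=0.000125):
    """Cosine-annealed learning rate at a given epoch (0 <= epoch <= total_epochs)."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return lr_end + (lr_start - lr_end) * cosine


# Learning rate at the start, one quarter, half-way, and the end of training.
for epoch in (0, 16, 32, 65):
    print(epoch, cosine_lr(epoch))
```

The rate falls slowly at first, fastest around the middle of training, and flattens out again as it approaches the final value.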

4.1 Comparative results of the proposed AppViT

In addition, this research considers how parameter memory, number of parameters, and flops relate to the accuracy of the model. The number of parameters and the number of flops are two common metrics of a model's computational complexity. Flops refers to the number of floating-point operations, such as additions, subtractions, multiplications, and divisions performed on floating-point numbers. These operations are integral to the mathematical processes of machine learning, including matrix multiplications, activation functions, and gradient computations. Flops therefore measure the computational cost of a model by estimating the total arithmetic it requires. The number of parameters of a deep learning model is its number of weights and biases. These parameters determine the model's capacity, affecting its ability to capture complex data patterns. A model with more parameters consumes more memory, both for storage and during training and inference, but an increase in parameters does not always translate into better performance. Finding the right balance is therefore essential. For specific use cases, especially on mobile or edge devices, it is crucial to create models that are resource-efficient yet retain high performance.
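The link between parameter count and parameter memory can be approximated directly. The sketch below assumes 32-bit floating-point weights (4 bytes each) and reports memory in units of 1024² bytes; both are assumptions, as the precision and unit convention are not stated in the text.

```python
def parameter_memory_mb(n_params, bytes_per_param=4):
    """Approximate memory needed to store the weights, in MB (1 MB = 1024**2 bytes)."""
    return n_params * bytes_per_param / (1024 ** 2)


# Rounded parameter counts quoted in this paper for AppViT and VGGNet.
print(round(parameter_memory_mb(1_300_000), 2))    # 4.96
print(round(parameter_memory_mb(138_000_000), 2))  # 526.43
```

These estimates are close to the 4.99 MB and 525 MB quoted in Section 4.1; small differences are expected because the parameter counts used here are rounded.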

Table 2 presents a comparison of the proposed AppViT model with several state-of-the-art models in terms of accuracy, number of parameters, flops, and parameter memory (MB). The proposed AppViT achieves 96.4% accuracy with 0.644 billion flops, 1.3 million parameters, and 4.99 MB of parameter memory, whereas VGGNet requires 525 MB of memory and 138 million parameters, with 15.3 billion flops. Based on these figures, VGGNet has the highest complexity and requires considerable computation time. ResNet-152, ResNet-50, and Inception-V3 also have high computational and memory requirements. EfficientNet-B3 and EfficientNet-B4 fall into the medium category owing to their smaller numbers of parameters and flops, which are 23.8 M, 12 M, and 5.7 B, 1.8 B, respectively. Among all the listed models, the proposed model is both effective and computationally efficient.

Table 2 Comparison of AppViT with state-of-the-art CNN architectures

4.2 Comparison in terms of flops operations

First, we compare AppViT with ResNet models, namely ResNet-152 and ResNet-50. Compared to ResNet-152, AppViT achieves 16.4% better top-1 accuracy and is markedly more efficient, using 97.8% fewer parameters and 94.35% fewer flops. Compared to ResNet-50, AppViT obtains 11.2% higher accuracy with 94.92% fewer parameters and 83.9% fewer flops. Compared to Inception-V3, AppViT obtains 16.1% better accuracy with 88.73% fewer flops and 94.54% fewer parameters. Furthermore, AppViT achieves 13.4% higher accuracy than VGGNet while having 99.06% fewer parameters and 95.79% fewer flops. Figure 8 shows the training and validation curves for accuracy and loss. In terms of parameter memory, the proposed AppViT exhibits reductions of 99.05%, 97.81%, and 94.89% compared to VGGNet, ResNet-152, and ResNet-50, respectively, as shown in Figure 9.
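The percentage reductions quoted in this section follow from a simple relative-difference calculation. The sketch below reproduces the VGGNet comparison from the figures quoted in Section 4.1 (138 M parameters, 15.3 B flops, and 525 MB for VGGNet; 1.3 M, 0.644 B, and 4.99 MB for AppViT); the function name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(baseline, proposed):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return (1.0 - proposed / baseline) * 100.0


# AppViT versus VGGNet, using the values quoted in Section 4.1.
params = percent_reduction(138.0, 1.3)    # millions of parameters
flops = percent_reduction(15.3, 0.644)    # billions of flops
memory = percent_reduction(525.0, 4.99)   # parameter memory in MB

print(round(params, 2), round(flops, 2), round(memory, 2))  # 99.06 95.79 99.05
```

The same function applies to any pair of models in Table 2, which makes it easy to cross-check the remaining comparisons.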

Fig. 8 Training and validation curves for accuracy and loss

Fig. 9 Comparison of model flops

4.3 Comparison in terms of parameters

We also compare our model with lightweight and efficient CNNs, AFD-Net, and MSO-ResNet. Compared to EfficientNet-B3 and EfficientNet-B4, our model achieves 4.3% and 5% better top-1 accuracy, respectively. In comparison to EfficientNet-B3, our model reduces flops, number of parameters, and parameter memory (MB) by 64.22%, 89.17%, and 89.12%, respectively. In comparison to EfficientNet-B4, our model has 84.67% fewer flops, 93.16% fewer parameters, and 93.10% less parameter memory. Our model also outperforms AFD-Net and MSO-ResNet in terms of both accuracy and efficiency: it achieves 3.8% higher accuracy than AFD-Net and 0.7% higher accuracy than MSO-ResNet. Compared to MSO-ResNet, our model has 66.41% fewer parameters and 34.42% fewer flops. Compared to AFD-Net, our model has 95.39% fewer parameters, as shown in Figure 10.

Fig. 10 Comparison in terms of no. of parameters

4.4 Comparison in terms of classification parameters

In this section, the classification results of AppViT on the selected dataset are presented. After training, the test data are used to evaluate the proposed AppViT model. Table 3 compares the classification results of the proposed AppViT with those of state-of-the-art models. The AppViT model achieves the highest F1-score, 0.963, with a precision of 0.967 and a recall of 0.959. Among the listed models, MSO-ResNet achieves the second-highest F1-score, 0.957, with a precision of 0.957 and a recall of 0.958. The F1-score of the proposed AppViT model is therefore approximately 0.6 percentage points higher than that of MSO-ResNet.
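The relationship between the reported precision, recall, and F1-score can be checked with the definitions in Eqs. 3–5. The sketch below is a minimal illustration; the true-positive, false-positive, and false-negative counts in the second example are hypothetical, and how per-class scores were averaged in Table 3 is not stated here.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision and recall reported for AppViT above.
print(round(f1_score(0.967, 0.959), 3))  # 0.963

# Hypothetical counts for a single class, for illustration only.
p = precision(tp=90, fp=10)  # 0.9
r = recall(tp=90, fn=30)     # 0.75
print(round(f1_score(p, r), 3))  # 0.818
```

The first result agrees with the F1-score of 0.963 reported for AppViT.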

Table 3 Comparison of AppViT with state-of-the-art CNN architectures in precision, recall, and f1 score

5 Discussion and comparison with SOTA

A comprehensive comparison is presented in this section. After designing and training the proposed AppViT model on the selected dataset, we compare it with different state-of-the-art methods. The model combining VGG and InceptionNet [36] achieved 96.14% on the ALDD dataset; DenseNet-121 [37] and ROI-aware DCNN [46] achieved 93.1% and 84.3% accuracy on customized datasets; and AFD-Net [47] and Modified ResNet [8] achieved 92.3% and 95.7% accuracy, respectively. The proposed method achieved the highest accuracy among the listed methods, 96.4%. The proposed AppViT combines convolutional blocks (CBs) with self-attention layers, and the CBs follow a residual architecture. Convolutional layers capture local spatial information, whereas the self-attention mechanism captures global and prominent information from the image. Using both convolution and self-attention makes the proposed model a hybrid, which improves performance while reducing computational cost. The comparison of the proposed model with the state-of-the-art methods is shown in Table 4.

Table 4 Comparison with the state-of-the-art methods

6 Conclusions

Deep learning-based methods are widely used in agricultural sciences, especially for plant leaf disease detection; however, many of these models are computationally inefficient and incompatible with lightweight devices. To address this gap, we propose AppViT in this study. Combining the characteristics of CNNs and ViTs, AppViT is an efficient apple leaf disease detection and classification model that can not only detect and classify apple leaf diseases with high accuracy but also be deployed on lightweight devices. To validate the effectiveness of our model, we compare its results with state-of-the-art CNN architectures. Compared to the ResNet family, VGGNet, and Inception-V3, our model shows a significant increase in accuracy, precision, recall, and F1-score despite being resource-efficient and compatible with lightweight devices. We also compare our model with lightweight CNN models, namely EfficientNets. The proposed AppViT model outperforms EfficientNet-B3 and EfficientNet-B4 despite having fewer parameters and floating-point operations (flops). These experiments verify both the effectiveness and the efficiency of our model. The detection system that we have developed can be a very useful tool for farmers and apple growers, aiding them in the diagnosis, quantification, and follow-up of infections. This study focuses on four different apple leaf diseases, each with its own symptoms and impact on the overall health of the apple tree. For a more comprehensive analysis, further classes of apple leaf diseases can be investigated. A significant challenge during this research was the lack of data, particularly from real-world scenarios, which made it difficult to assess the full impact of the diseases on yield and fruit quality.

In the future, we aim to incorporate data on additional apple diseases, potentially working with local orchards and agricultural institutes to gather more holistic information. Moreover, such lightweight models, with their efficiency and accuracy, can be adapted for detecting and classifying diseases in other fruits and vegetables. This adaptability offers promise for creating a more resilient agricultural ecosystem. As global climate patterns change, early detection and mitigation of diseases will become even more vital. Future research could be extended to the early-stage detection of plant leaf diseases, which could significantly reduce crop losses and help ensure food security for growing populations.