Efficient identification and classification of apple leaf diseases using lightweight vision transformer (ViT)

Ullah, Wasi; Javed, Kashif; Khan, Muhammad Attique; Alghayadh, Faisal Yousef; Bhatt, Mohammed Wasim; Al Naimi, Imad Saud; Ofori, Isaac

doi:10.1007/s43621-024-00307-1


Research
Open access
Published: 18 June 2024

Volume 5, article number 116, (2024)

Discover Sustainability


Wasi Ullah¹,
Kashif Javed¹,
Muhammad Attique Khan²,
Faisal Yousef Alghayadh³,
Mohammed Wasim Bhatt⁴,
Imad Saud Al Naimi⁵ &
…
Isaac Ofori⁶

Abstract

The timely diagnosis and identification of apple leaf diseases is essential to prevent the spread of diseases and ensure the sound development of the apple industry. Convolutional neural networks (CNNs) have achieved phenomenal success in leaf disease detection, which can greatly benefit the agriculture industry. However, their large size and intricate design continue to pose a challenge when deploying these models on lightweight devices. Although several successful models (e.g., EfficientNets and MobileNets) have been designed for resource-constrained devices, they have not achieved strong results in leaf disease detection tasks and leave a performance gap. This gap has motivated us to develop an apple leaf disease detection model that can not only be deployed on lightweight devices but also outperform existing models. In this work, we propose AppViT, a hybrid vision model that combines convolution blocks and multi-head self-attention to compete with the best-performing models. Specifically, we begin with convolution blocks that reduce the size of the feature maps and help the model encode local features progressively. Then, we stack ViT blocks in combination with convolution blocks, allowing the network to capture non-local dependencies and spatial patterns. With these designs and a hierarchical structure, AppViT demonstrates excellent performance in apple leaf disease detection. Specifically, it achieves 96.38% accuracy on Plant Pathology 2021 - FGVC8 with about 1.3 million parameters, which is 11.3% and 4.3% more accurate than ResNet-50 and EfficientNet-B3, respectively. The precision, recall, and F score of our proposed model on Plant Pathology 2021 - FGVC8 are 0.967, 0.959, and 0.963, respectively.


1 Introduction

Apples are among the most widely cultivated and consumed fruits worldwide, owing to their high nutritional and remedial value. The antioxidant effects of apples, which stem from their large amounts of fiber and phytochemicals, help protect cellular DNA from oxidative harm, which can lead to cancer [1]. Moreover, these chemicals hinder the proliferation of new cancer cells and reduce the spread of existing ones [1]. Apples are rich in vitamin C, sodium, potassium, fiber, phosphorus, calcium, and iron. Larsson et al. [2] have shown that eating more apples can help lower the risk of stroke. In addition to their significant nutritional benefits, apples also play a vital role in the economies of agrarian countries by contributing to employment, export revenue, and local livelihoods. In Pakistan, apple is the fourth largest fruit crop in terms of production. In 2020, apple production in Pakistan exceeded 0.67 million tons. However, apple plants are prone to various diseases that can affect the quality and quantity of the apples produced, as well as their nutritional and therapeutic value. These diseases range from fungal infections and viruses to nematodes and bacteria, among others. Diseases such as scab, complex, rust, and frog eye leaf spot inhibit the sound development of the apple industry and significantly reduce apple production, leading to substantial setbacks for the country's agricultural sector. Among these, scab, which is caused by the ascomycete Venturia inaequalis [3], is the most significant disease. Identifying the area of the leaf affected by scab is quite difficult, and this disease can inflict more damage on the plant than any other disease. Scab first appears on the leaf as yellow spots. In later stages, it leaves the fruit ugly, cracked, and malformed, rendering it unusable. Identifying scab in its early stages is important for preventing the growth of this disease and protecting the apple harvest from deterioration. Rust causes economic losses in several ways. It not only causes serious damage to the apple tree but also leads to a reduction in the size of the fruit. Intense leaf infections and defoliation make trees vulnerable to winter injury. Cedar-apple rust predominantly targets the leaves and fruit of apple and crabapple trees. Frog eye leaf spot, caused by a fungal pathogen, is a quite prevalent disease in apple trees [4]. It can lead to fruit infections, severely affecting the harvest of the apple crop. Tiny, purplish spots appearing on the leaves are the initial symptoms of frog eye leaf spot. Therefore, it is essential to accurately diagnose and treat apple diseases in a timely manner to ensure a healthy and productive harvest. Figure 1 shows samples of apple leaf diseases.

The timely identification and diagnosis of these diseases is essential to prevent economic losses and preserve the nutritional quality of apples. Diagnosing the correct disease as soon as it appears and taking appropriate precautionary measures can help farmers save both the apple harvest and the environment. Therefore, an automated solution is imperative for timely leaf disease detection. CNNs became the natural choice for vision applications, leading to the development of high-performance models with extensive connections and sophisticated forms of convolution. The recent success of CNN models was largely due to the stacking of a large number of layers and the training of very deep networks. CNNs have demonstrated significant success in the agriculture industry, particularly in plant leaf disease detection tasks. Several works [5,6,7,8] employed deep ResNet-like architectures with certain modifications for the detection and classification of plant leaf diseases. Owing to their substantial model size and intricate network structures, these models achieved remarkable success. However, the deployment of these models on mobile devices has remained the biggest challenge for researchers. Successive works [9,10,11,12,13] focused on developing models that were computationally efficient and could be deployed on resource-limited devices. Despite their resource efficiency, these models did not achieve the desired results. Recently, transformer architectures have attracted attention because of their superior performance and ability to model long-range dependencies. Equipped with multi-head self-attention (MHSA), vision transformers (ViTs) have achieved valuable results in image classification [14], video classification [15,16,17], semantic segmentation [18], object detection [19, 20], video object segmentation [21], and 3D object detection [22]. However, because of their large number of parameters and floating-point operations (flops), ViTs may not be suitable for real-world applications, especially when they need to be deployed on lightweight devices. Therefore, considering the constraints on storage and computational capacity of embedded devices, it is imperative to reduce the number of parameters and flops of the model without compromising its performance. In this research, we propose a lightweight hybrid vision transformer named AppViT for the detection and classification of apple leaf diseases. The primary contributions of this research are as follows:

We propose a novel lightweight apple leaf disease detection network that classifies apple leaves into four disease classes and a healthy class.
Through extensive experiments, we show that our model outperforms state-of-the-art CNN architectures in terms of the accuracy-efficiency trade-off on lightweight devices.
Additionally, our lightweight model paves the way for future extensions to identify and classify diseases in various fruits and vegetables, offering a versatile solution for agricultural applications beyond apple leaf diseases.

2 Related work

Previously, computer vision researchers used different machine learning algorithms for leaf disease detection tasks, and these algorithms achieved considerable success in this domain. Popular algorithms included support vector machines (SVM) (Wetterich et al. [23], Jiang et al. [24], Sethy et al. [25]), random forest (Wojtowicz et al. [26]), filter segmentation (Kamath et al. [27]), K-means clustering (Tian et al. [5]), k-nearest neighbors (KNN) (Chaudhary et al. [28]), and some other image processing methods (Nosratabadi et al. [29], Tavoosi et al. [30], Asghar et al. [31]). However, these machine learning algorithms have limited scope and have not reached the expected performance level.

The use of deep learning algorithms in the agriculture industry, especially in the detection of plant leaf disease, has shown promising results. Fuentes et al. [32], Liu et al. [33], and Zhang et al. [34] employed CNNs for the classification and identification of different leaf diseases. Compared to traditional machine learning approaches, CNNs demonstrated significant results. The ability of CNNs to extract local features from neighboring pixels enabled them to achieve favorable results compared to machine learning algorithms. Hossain et al. [35] designed a novel CNN for the classification of rice leaf diseases. In the experiment, the authors utilized a new data set consisting of 4199 images of leaf disease. The training precision of the designed model was 99.78% and its validation accuracy was 97.35%. The effectiveness of the designed model was tested on images of rice leaf disease, where it achieved 97.82% accuracy. Jiang et al. [36] proposed a method combining the features of VGG and InceptionNet for the recognition of apple leaf disease. The model achieved an accuracy of 97.14% on an ALDD data set containing 26,377 images of five different diseases. Li and Rai [6] carried out apple leaf disease detection using ResNet-18 and ResNet-34 architectures. Owing to their large size and huge number of parameters, the models achieved accuracies of 99% and 97%. Although these models demonstrated impressive accuracy, they were not feasible for deployment on lightweight devices. Another successful study in the apple leaf disease detection domain was carried out by Zhong et al. [37]. The authors trained DenseNet-121 on a limited data set that included only three diseases, segmented into five categories based on disease severity. Their trained model, with around 8 million parameters, achieved an accuracy of 93.1%. Hossain et al. [39] suggested a framework based on a gradient boosting classifier for the classification of plant leaf disease. In the methodology, the authors employed adaptive centroid segmentation using the optimal k value, and features were extracted using a modified histogram-based local ternary pattern. The suggested method achieved 98.51%. In [38], the authors proposed a lightweight MobileNet-based apple leaf disease detection method. The authors collected 334 images of apple leaves affected by two types of disease: Alternaria leaf blotch and rust. Although the model is compatible with lightweight devices, its accuracy of 73.50% raises concerns about its performance and indicates the difficulty of balancing model size and accuracy for efficient disease detection on resource-constrained platforms. Hossain et al. [39] presented a deep CNN model based on depthwise separable convolutions for the classification of plant leaf diseases. In the methodology, the authors presented three novel CNNs. The designed models were compared in terms of model size, accuracy, and computational power, and achieved a highest accuracy of 99.55%. The limitation of this work was overfitting. Islam et al. [40] suggested an automated deep learning-based web application for the classification of apple leaf diseases. The authors employed pre-trained CNNs including VGG16, VGG19, and ResNet50. The experiments were carried out on 1000 images from the PlantVillage data set and achieved a highest accuracy of 96.15%. The limitation of the suggested method was the small amount of data used for training. Paul et al. [41] presented a real-time application based on CNNs for the classification of tomato leaf disease. The authors designed a customized CNN and utilized VGG16 and VGG18 with transfer learning for the classification phase. For the experimental process, they selected ten tomato diseases and one healthy class. They achieved 95.00% accuracy. The limitation of the presented framework was overfitting, because cross-validation was not implemented. Yao et al. [42] presented a deep learning framework based on a multi-prediction approach for the identification of plant leaf diseases. The authors developed a customized CNN named generalized stacking multi-output CNN. They used the PlantVillage, plant leaves, and PlantDoc data sets as benchmarks. A comprehensive comparison was conducted with state-of-the-art models, and the designed model achieved the highest accuracy of 96.51%. Andrishia et al. [43] suggested a capsule network for the classification of Vitis vinifera leaves. The authors designed a new capsule network for the detection and classification of Vitis vinifera disease. The designed model achieved 98.7% accuracy. The limitation of the presented method was its highly complex architecture and long training time.

Deep learning models have demonstrated significant achievement and have been widely adopted for the tasks of plant leaf disease detection; however, the deployment of these models on lightweight devices continues to be a hurdle. Additionally, training lightweight models for resource-constrained devices results in performance degradation of these models and makes them unfit for real-time environments. This constraint not only affects performance but also poses challenges for on-spot, real-time applications, which are increasingly vital in modern agriculture. As farmers and agriculturalists move towards more tech-integrated solutions, there is a pressing need for models that can operate seamlessly on portable devices without sacrificing accuracy. Therefore, this research gap motivates us to design a lightweight deep learning model for apple leaf disease detection that not only outperforms state-of-the-art CNN architectures but is also compatible with lightweight devices.

3 Methodology

The proposed apple leaf disease classification framework is presented in this section. The framework comprises a novel hybrid vision transformer for the classification of apple leaf diseases. Figure 2 presents the proposed framework. Initially, the apple leaf data set is divided into training and testing sets. The training data are then augmented. After augmentation, a lightweight hybrid vision transformer named AppViT is designed and trained on the augmented data. Several state-of-the-art models are also trained on the augmented data. Finally, a comprehensive comparison is conducted on the basis of the number of flops, the number of parameters, and memory.

3.1 Dataset collection and augmentation

The data set used in this work, Plant Pathology 2021 - FGVC8, is publicly available on Kaggle [44]. The data set comprises 16,093 images. These images were taken with smartphones and a Canon Rebel T5i DSLR (Canon Inc., Japan) at different stages of disease, with diverse backgrounds and under natural conditions. All images were acquired in an unsprayed apple orchard located at Cornell AgriTech, the New York State Agricultural Experiment Station, in New York State, USA [44]. The images have been uniformly cropped to 256 × 256 pixels. The data set consists of five classes: scab, healthy, frog eye leaf spot, complex, and rust. The data distribution is shown in Figure 3.

Insufficient data is one of the biggest challenges in implementing deep learning models. Diversifying the data set with a larger number of images enables models to learn more robust features and mitigates the risk of overfitting. In this work, the data set is imbalanced, with the scab class containing the highest number of images and the complex class the lowest. To address this imbalance, we applied data augmentation techniques, including random rotation, random flipping, random translation, brightness adjustment, random zoom, and random contrast modification. These augmentations not only increased the size of the data set but also enhanced the quality of the images. Table 1 describes the data set after augmentation. After augmentation, the data set contains 44,507 images, compared to 16,093 images before augmentation.

Table 1 Augmented apple leaf disease dataset


3.2 Overview of vision transformer

Before delving into our model, we give a concise overview of ViTs and how they can be optimized for lightweight devices. The transformer architecture, originally introduced in [45], achieved great success in natural language processing (NLP). The success of transformer models is attributed to the attention mechanism. An attention mechanism can be understood as a method that maps a query and a collection of key-value pairs to an output. In this setup, the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with the weight for each value determined by how well the query matches the respective key. Owing to their ability to model long-range dependencies and learn intricate features, transformer architectures became a natural choice for researchers working in the language field. Figure 4 shows the scaled dot-product attention mechanism.

In vision, the transformer architecture was first introduced by [14]. Since then, it has been widely used for image classification [14], video classification [15,16,17], semantic segmentation [18], object detection [19, 20], video object segmentation [21], and 3D object detection [22]. A 2D image x ∈ R^{H×W×C} fed into a ViT is first split into a flattened sequence of patches x_p ∈ R^{N×(P²·C)}, where H and W are the height and width of the input image, and C is its number of channels [14]. The number of patches obtained after patchification is N = HW/P², where (P, P) is the resolution of each image patch [14]. Before the position embedding is added to the patch embedding, a special class token (cls) is prepended to the visual tokens. The resulting sequence of embedding vectors is fed as input to the transformer encoder. The vision transformer encoder is a stack of layers, each incorporating multi-head self-attention (MHSA) and a feed-forward network (FFN). Figure 5 presents a detailed overview of the working of vision transformers (ViT).
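The patch arithmetic above can be illustrated with a minimal sketch. It is not the authors' code: the function names `patch_sequence_shape` and `patchify` are hypothetical, the image is a plain nested list rather than a tensor, and the patch size P = 16 is an assumption, since the text does not state which patch size is used.

```python
def patch_sequence_shape(height, width, channels, patch_size):
    """Return (N, P*P*C): the number of patches and the flattened length of each patch."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    num_patches = (height * width) // (patch_size ** 2)  # N = HW / P^2
    patch_dim = patch_size * patch_size * channels       # P^2 * C
    return num_patches, patch_dim


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into a list of flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append the C channel values of this pixel
            patches.append(flat)
    return patches


# 256 x 256 RGB input (the image size of the data set) with an ASSUMED P = 16.
print(patch_sequence_shape(256, 256, 3, 16))  # (256, 768)
```

With these assumed values, the encoder would receive a sequence of 256 patch vectors of length 768 (plus the class token); a smaller P yields more, shorter patches.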

3.2.1 Multi-head self-attention

The multi-head self-attention (MHSA) mechanism allows the network to model long-range dependencies and learn complex features. MHSA allows the model to learn inter-patch representations without inductive bias. As an extension of single-head self-attention, MHSA employs distinct projection matrices for every attention head. Specifically, the input tokens, represented as x_t, are first mapped to queries (Q), keys (K), and values (V) using projection matrices. That is, Q = x_tW_Q, K = x_tW_K, and V = x_tW_V, where W_Q, W_K, and W_V are the projection matrices for the query, key, and value, respectively, each of dimension D×D, as shown in Figure 6. Following this, self-attention is computed as shown in Eq. 1:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^{T}}{\sqrt{D}} \right)V $$

(1)
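As an illustration of Eq. 1, the following is a minimal single-head sketch in plain Python. It is not the AppViT implementation: the helper names (`softmax`, `matmul`, `attention`) are hypothetical, matrices are lists of rows, and the projections by W_Q, W_K, and W_V are assumed to have been applied already.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D)) V."""
    d = len(q[0])                             # D, the token dimension
    k_t = [list(col) for col in zip(*k)]      # K^T
    scores = matmul(q, k_t)                   # one row of scores per query
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Two tokens of dimension 2; each output row is a weighted mix of the value rows.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(q, k, v)
print(weights)
print(output)
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors. Because the score matrix has one entry per pair of tokens, its size grows quadratically with the number of tokens.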

3.3 Proposed AppViT

In this work, we propose a hybrid model that is not only lightweight (suitable for deployment on mobile devices) but also exhibits strong performance in the identification and classification of apple leaf diseases. The proposed model combines convolution blocks with a multi-head self-attention module in a hierarchical structure that allows the network to capture long-range dependencies and learn complex features. The proposed AppViT model starts with a convolution block, with ViT blocks stacked afterward. First, the input image is passed through a convolution layer with a kernel size of 3 × 3 and a stride of 2. After the input layer, the model consists of three sections, each comprising a convolution block followed by successive ViT blocks. The convolution block (CB) helps the network encode local features and reduces the size of the attention maps. It consists of alternating convolution and batch normalization (BN) layers and uses the Swish activation function. The primary purpose of the CBs is to down-sample the activation maps. Afterward, successive ViT blocks are stacked, allowing the network to model global dependencies and contextual information in the feature maps. At the end of the network, a Global Average Pooling (GAP) layer and a softmax classifier are applied. After it is designed, AppViT is trained on the selected dataset. The architecture of AppViT is presented in Fig. 7.
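To make the effect of down-sampling on attention cost concrete, the sketch below traces the feature-map side length through the stem and three sections. Only the 3 × 3, stride-2 stem and the 256 × 256 input come from the text. That each convolution block halves the resolution, that padding is 1, and the function names `conv_output_size` and `token_counts` are assumptions made for illustration; the actual AppViT strides and widths may differ.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution, using the standard floor formula."""
    return (size + 2 * padding - kernel) // stride + 1


def token_counts(input_size=256, num_stages=3):
    """Trace (stage, side, tokens, attention-map entries) for each section."""
    side = conv_output_size(input_size)  # 3x3, stride-2 stem described in the text
    trace = []
    for stage in range(1, num_stages + 1):
        side = conv_output_size(side)    # ASSUMPTION: each CB halves the resolution
        tokens = side * side             # one token per spatial position
        trace.append((stage, side, tokens, tokens * tokens))
    return trace


for stage, side, tokens, attn_entries in token_counts():
    print(f"section {stage}: {side}x{side} map, {tokens} tokens, "
          f"{attn_entries} attention-map entries per head")
```

Under these assumptions, each halving of the feature map divides the number of tokens by 4 and the size of each attention map by 16, which is consistent with the stated purpose of the convolution blocks.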

4 Results and discussions

In this section, the experimental results are discussed. AppViT is trained and evaluated on the Plant Pathology 2021 - FGVC8 dataset [44]. The dataset is split 70:30: 70% of the data is used for training, and the remaining 30% is divided equally, with 15% used for validation during training and 15% for testing. The proposed model is trained from scratch. 10-fold cross-validation is used during training to improve the generalization of the proposed AppViT. The model is trained for 65 epochs with the AdamW optimizer and a cosine learning rate scheduler. The scheduler starts with a learning rate of 0.002 and gradually decreases it to 0.000125 over the 65 epochs. The batch size is set to 32. All experiments are implemented in TensorFlow 2.11 on a personal desktop configured with 32 GB of RAM, 500 SSD, and an NVIDIA Tesla P100 graphics card. The results are reported using standard performance metrics: accuracy, precision, recall, and F-score. These metrics are defined in Eqs. 2–5:

$$ \text{Accuracy} = \frac{\text{No. of correctly classified observations}}{\text{Total no. of observations}} $$

(2)

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

(3)

$$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$

(4)

$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

(5)

(5)
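Eqs. 2–5 can be sketched in a few lines of Python. The function names and the example counts below are illustrative assumptions rather than values from the experiments; for a five-class problem, the counts would be taken per class in a one-vs-rest fashion and then averaged, and the averaging scheme is not specified in the text.

```python
def f1_score(precision, recall):
    """Eq. 5: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Eqs. 2-5 computed from the confusion counts of one class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts for a single class, for illustration only.
print(classification_metrics(tp=90, fp=10, fn=5, tn=95))

# Consistency check on the precision and recall reported in the abstract.
print(round(f1_score(0.967, 0.959), 3))  # 0.963
```

The last line shows that the reported F score of 0.963 follows from the reported precision (0.967) and recall (0.959) under Eq. 5.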

4.1 Comparative results of the proposed AppViT

In addition, this research considers how parameter memory, number of parameters, and flops relate to the accuracy of the model. The number of parameters and the number of flops are two common metrics used to quantify the computational complexity of a model. Flops denotes the number of floating-point operations, such as additions, subtractions, multiplications, and divisions, performed on floating-point numbers. These operations are integral to the mathematical processes of machine learning, including matrix multiplications, activation functions, and gradient computations. Flops therefore measure the computational expense of a model or of one of its operations, and they give a useful estimate of the total arithmetic work required. The number of parameters of a deep learning model is the number of its weights and biases. These parameters determine the capacity of the model and affect its ability to capture complex data patterns. A model with more parameters consumes more memory, both for storage and during training and prediction, yet an increase in parameters does not always translate into better performance. Therefore, finding the right balance is essential. For specific use cases, especially on mobile or edge devices, it is crucial to create models that are resource-efficient yet retain high performance.
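A minimal sketch of how these quantities can be estimated for a single convolution layer is given below. The function names are hypothetical, the 16 output channels of the example layer are an assumption (the channel widths of AppViT are not given in the text), and parameter memory is computed assuming 32-bit weights with 1 MB = 1024² bytes. Conventions for flops also differ: some tools count one multiply-accumulate (MAC) as one operation and others as two, so only MACs are counted here.

```python
def conv2d_params(in_channels, out_channels, kernel, bias=True):
    """Weights (k*k*C_in*C_out) plus one optional bias per output channel."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv2d_macs(in_channels, out_channels, kernel, out_height, out_width):
    """Multiply-accumulate operations needed to produce the whole output map."""
    return out_height * out_width * kernel * kernel * in_channels * out_channels


def parameter_memory_mb(num_params, bytes_per_param=4):
    """Parameter memory in MB, ASSUMING float32 weights and 1 MB = 1024**2 bytes."""
    return num_params * bytes_per_param / (1024 ** 2)


# Hypothetical stem: 3x3 convolution, 3 -> 16 channels, 128 x 128 output map.
print(conv2d_params(3, 16, 3))           # 448
print(conv2d_macs(3, 16, 3, 128, 128))   # 7077888

# Roughly 1.3 million parameters stored as float32.
print(round(parameter_memory_mb(1.3e6), 2))  # 4.96
```

Under the float32 assumption, about 1.3 million parameters occupy roughly 5 MB, which is in line with the memory figure reported for AppViT, although the text does not state how parameter memory was measured.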

Table 2 presents a comparison of the proposed AppViT model with several state-of-the-art models. The comparison is based on accuracy, flops, number of parameters, and parameter memory (MB). From this table, it is observed that the proposed AppViT achieves 96.4% accuracy with 0.644 billion flops, 1.3 million parameters, and 4.99 MB of memory, whereas VGGNet requires 525 MB of memory and 138 million parameters, with 15.3 billion flops. Based on these figures, VGGNet has the highest complexity and requires substantial computational time. The other models, ResNet-152, ResNet-50, and Inception-V3, also require high computation and memory. The EfficientNet-B3 and EfficientNet-B4 models fall into the medium category due to their smaller number of parameters and flops, which are 23.8 M, 12 M, and 5.7 B, 1.8 B, respectively. Among all the listed models, the proposed model is both effective and computationally efficient on these measures.

Table 2 Comparison of AppViT with state-of-the-art CNN architectures

4.2 Comparison in terms of flops operations

First, we compare AppViT with ResNet models, namely ResNet-152 and ResNet-50. Compared to ResNet-152, AppViT achieves 16.4% better top-1 accuracy and is remarkably more efficient, using 97.8% fewer parameters and 94.35% fewer flops. Compared to ResNet-50, AppViT obtains 11.2% higher accuracy with 94.92% fewer parameters and 83.9% fewer flops. Compared to Inception-V3, AppViT obtains 16.1% better accuracy with 88.73% fewer flops and 94.54% fewer parameters. Furthermore, AppViT achieves 13.4% higher accuracy than VGGNet while having 99.06% fewer parameters and 95.79% fewer flops. Figure 8 shows the training and validation curves for accuracy and loss. In terms of parameter memory, the proposed AppViT exhibits reductions of 99.05%, 97.81%, and 94.89% compared to VGGNet, ResNet-152, and ResNet-50, respectively, as shown in Figure 9.
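The relative reductions quoted in this comparison follow from a simple formula, sketched below. The function name `percent_reduction` is hypothetical; the input values are the AppViT and VGGNet figures reported in Sect. 4.1 (4.99 MB vs. 525 MB, 0.644 B vs. 15.3 B flops, 1.3 M vs. 138 M parameters).

```python
def percent_reduction(baseline, ours):
    """Relative saving of `ours` with respect to `baseline`, in percent."""
    return (baseline - ours) / baseline * 100.0


# AppViT versus VGGNet, using the figures reported in Sect. 4.1.
print(round(percent_reduction(525, 4.99), 2))    # parameter memory: 99.05
print(round(percent_reduction(15.3, 0.644), 2))  # flops: 95.79
print(round(percent_reduction(138, 1.3), 2))     # parameters: 99.06
```

The memory and flops values reproduce the percentages stated for VGGNet, and the same calculation can be applied to the other baselines once their Table 2 entries are available.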

4.3 Comparison in terms of parameters

We also compare our model with lightweight and efficient CNNs (EfficientNets), AFD-Net, and MSO-ResNet. Compared to EfficientNet-B3 and EfficientNet-B4, our model achieves 4.3% and 5% better top-1 accuracy, respectively. In terms of flops, number of parameters, and parameter memory (MB), our model shows reductions of 64.22%, 89.17%, and 89.12%, respectively, in comparison to EfficientNet-B3. Compared to EfficientNet-B4, our model has 84.67% fewer flops, 93.16% fewer parameters, and 93.10% less parameter memory (MB). When compared to AFD-Net and MSO-ResNet, our model outperforms both in terms of accuracy and efficiency. Our model achieves 3.8% higher accuracy than AFD-Net and 0.7% higher accuracy than MSO-ResNet. Compared to MSO-ResNet, our model has 66.41% fewer parameters and 34.42% fewer flops. Compared to AFD-Net, our model has 95.39% fewer parameters, as shown in Figure 10.

4.4 Comparison in terms of classification parameters

In this section, the classification results of the proposed AppViT on the selected dataset are presented. After training, the test data are used to evaluate the proposed AppViT model. Table 3 presents a comparative analysis of the classification results of the proposed AppViT and state-of-the-art models. From this table, it is clearly observed that the AppViT model achieved the highest F1 score, 0.963, with precision and recall of 0.967 and 0.959, respectively. Among the listed models, MSO-ResNet achieved the second highest F1 score, 0.957, with a precision of 0.957 and a recall of 0.958. The F1 score of the proposed AppViT model is thus about 0.6 percentage points higher than that of MSO-ResNet.

Table 3 Comparison of AppViT with state-of-the-art CNN architectures in precision, recall, and f1 score


5 Discussion and comparison with SOTA

This section presents a comprehensive comparison. After designing and training the proposed AppViT model on the selected dataset, we compare it with different state-of-the-art methods. The combination of VGG and InceptionNet [36] achieved 96.14% on the ALDD dataset; DenseNet-121 [37] and ROI-aware DCNN [46] achieved 93.1% and 84.3% accuracy on customized datasets; and AFD-Net [47] and Modified ResNet [8] achieved 92.3% and 95.7% accuracy, respectively. The proposed method achieved the highest accuracy among the listed methods, 96.4%. The proposed AppViT combines convolutional blocks (CBs) and self-attention layers. The CBs follow a residual architecture. Convolutional layers capture spatial information, and the self-attention mechanism captures global and prominent information from the image. Using both convolutional and self-attention layers makes the proposed model a hybrid, which improves performance while reducing computational cost. The comparison of the proposed model with the state-of-the-art methods is shown in Table 4.

Table 4 Comparison with the state-of-the-art methods


6 Conclusions

Deep learning-based methods are widely used in agricultural sciences, especially for plant leaf disease detection; however, these models are computationally inefficient and incompatible with lightweight devices. To address this gap, we propose AppViT in this study. Combining the characteristics of CNNs and ViTs, AppViT is an efficient apple leaf disease detection and classification model that can not only detect and classify apple leaf diseases with high accuracy but can also be deployed on lightweight devices. To validate the effectiveness of our model, we compare its results with state-of-the-art CNN architectures. Compared to the ResNet family, VGGNet, and Inception-V3, our model shows a significant increase in accuracy, precision, recall, and F-score despite being lightweight and compatible with resource-constrained devices. We also compare the performance of our model with lightweight CNN models, namely EfficientNets. The proposed AppViT model outperforms EfficientNet-B3 and EfficientNet-B4 despite having fewer parameters and floating-point operations (flops). These experiments verify both the effectiveness and efficiency of our model. The detection system that we have developed can be a very useful tool for farmers and apple growers to aid them in the diagnosis, quantification, and follow-up of infections. This study focuses on four different apple leaf diseases, each with its unique symptoms and impact on the overall health of the apple tree. For a more comprehensive analysis, further classes of apple leaf diseases can be investigated. A significant challenge during this research was the lack of data, particularly in real-world scenarios, making it difficult to assess the full impact of the disease on yield and fruit quality.
In the future, we aim to incorporate data on additional apple diseases, potentially working with local orchards and agricultural institutes to gather more holistic information. Moreover, such lightweight models, with their efficiency and accuracy, can be adapted for detecting and classifying diseases in other fruits and vegetables. This adaptability offers promise for creating a more resilient agricultural ecosystem. As global climate patterns change, early detection and mitigation of diseases will become even more vital. There is potential to expand future research to early-stage detection of plant leaf diseases, and by doing so, we can significantly reduce crop losses, ensuring food security for growing populations.
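
As a consistency check on the metrics mentioned above, the F-score can be recomputed from the precision (0.967) and recall (0.959) reported in the abstract. The sketch below assumes the reported F-score is the balanced F1 score, i.e. the harmonic mean of precision and recall; the function name `f_score` and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Values reported in the abstract for Plant Pathology 2021 (FGVC8).
reported_precision = 0.967
reported_recall = 0.959

print(round(f_score(reported_precision, reported_recall), 3))  # prints 0.963
```

Under that assumption, the recomputed value agrees with the reported F-score of 0.963.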

Data availability

Data will be available on request. Plant Reproducibility: The data used are publicly available at: https://data.mendeley.com/datasets/tywbtsjrjv/1 and are cited as [48].

References

1. WCRF International. Diet, nutrition, physical activity and cancer: a global perspective: a summary of the third expert report. World Cancer Research Fund International; 2018.
2. Larsson SC, Virtamo J, Wolk A. Total and specific fruit and vegetable consumption and risk of stroke: a prospective study. Atherosclerosis. 2013;227(1):147–52.
3. MacHardy WE. Apple scab: biology, epidemiology, and management. St. Paul: APS Press; 1996.
4. Mishra AM, Harnal S, Gautam V, Tiwari R, Upadhyay S. Weed density estimation in soya bean crop using deep convolutional neural networks in smart agriculture. J Plant Dis Prot. 2022;129(3):593–604. https://doi.org/10.1007/s41348-022-00595-7.
5. Tian K, Li J, Zeng J, Evans A, Zhang L. Segmentation of tomato leaf images based on adaptive clustering number of K-means algorithm. Comput Electron Agric. 2019;165:104962.
6. Li X, Rai L. Apple leaf disease identification and classification using resnet models. In 2020 IEEE 3rd International Conference on Electronic Information and Communication Technology (ICEICT). IEEE. 2020; 738–742.
7. Pandian AJ, Rajalakshmi N, Arulkumaran G. An improved deep residual convolutional neural network for plant leaf disease detection. Comput Intell Neurosci. 2022. https://doi.org/10.1155/2022/5102290.
8. Yu H, et al. Apple leaf disease recognition method with improved residual network. Multimed Tools Appl. 2022;81(6):7759–82.
9. Khan AI, Quadri S, Banday S, Shah JL. Deep diagnosis: A real-time apple leaf disease detection system based on deep learning. Comput Electron Agric. 2022;198:107093.
10. Hanh BT, Van Manh H, Nguyen N-V. Enhancing the performance of transferred efficientnet models in leaf image-based plant disease classification. J Plant Dis Prot. 2022;129(3):623–34.
11. Gao F, Sa J, Wang Z, Zhao Z. Cassava disease detection method based on EfficientNet. In 2021 7th International Conference on Systems and Informatics (ICSAI). IEEE. 2021; 1–6.
12. Atila Ü, Uçar M, Akyol K, Uçar E. Plant leaf disease classification using EfficientNet deep learning model. Ecol Inform. 2021;61:101182.
13. Li L, Zhang S, Wang B. Apple leaf disease identification with a small and imbalanced dataset based on lightweight convolutional networks. Sensors. 2021;22(1):173.
14. Dosovitskiy A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
15. Fan H, et al. Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 2021; 6824–6835.
16. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 2021; 6836–6846.
17. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? ICML. 2021;2(3):4.
18. Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision. 2021; 12179–12188.
19. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In European conference on computer vision. Springer. 2020; 213–229.
20. Li Y, Mao H, Girshick R, He K. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision. Springer. 2022; 280–296.
21. Duke B, Ahmed A, Wolf C, Aarabi P, Taylor GW. Sstvos: Sparse spatiotemporal transformers for video object segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021; 5912–5921.
22. Misra I, Girdhar R, Joulin A. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021; 2906–2917.
23. Wetterich CB, de Oliveira Neves RF, Belasque J, Marcassa LG. Detection of citrus canker and Huanglongbing using fluorescence imaging spectroscopy and support vector machine technique. Appl Opt. 2016;55(2):400–7.
24. Jiang F, Lu Y, Chen Y, Cai D, Li G. Image recognition of four rice leaf diseases based on deep learning and support vector machine. Comput Electron Agric. 2020;179:105824.
25. Sethy PK, Barpanda NK, Rath AK, Behera SK. Deep feature based rice leaf disease identification using support vector machine. Comput Electron Agric. 2020;175:105527.
26. Wójtowicz A, Piekarczyk J, Czernecki B, Ratajkiewicz H. A random forest model for the classification of wheat and rye leaf rust symptoms based on pure spectra at leaf scale. J Photochem Photobiol, B. 2021;223:112278.
27. Kamath R, Balachandra M, Prabhu S. Crop and weed discrimination using Laws’ texture masks. Int J Agric Biol Eng. 2020;13(1):191–7.
28. Chaudhary A, Thakur R, Kolhe S, Kamal R. A particle swarm optimization based ensemble for vegetable crop disease recognition. Comput Electron Agric. 2020;178:105747.
29. Nosratabadi S, Ardabili S, Lakner Z, Mako C, Mosavi A. Prediction of food production using machine learning algorithms of multilayer perceptron and ANFIS. Agriculture. 2021;11(5):408.
30. Tavoosi J, Zhang C, Mohammadzadeh A, Mobayen S, Mosavi AH. Medical image interpolation using recurrent type-2 fuzzy neural network. Front Neuroinform. 2021;15:667375.
31. Asghar MZ, Ullah I, Shamshirband S, Kundi FM, Habib A. Fuzzy-based sentiment analysis system for analyzing student feedback and satisfaction. 2019.
32. Fuentes A, Yoon S, Kim SC, Park DS. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors. 2017;17(9):2022.
33. Liu B, Zhang Y, He D, Li Y. Identification of apple leaf diseases based on deep convolutional neural networks. Symmetry. 2017;10(1):11.
34. Zhang K, Wu Q, Liu A, Meng X. Can deep learning identify tomato leaf disease? Adv Multimed. 2018. https://doi.org/10.1155/2018/6710865.
35. Hossain SMM, et al. Rice leaf diseases recognition using convolutional neural networks. In Advanced Data Mining and Applications: 16th International Conference, ADMA 2020, Foshan, China, November 12–14, 2020, Proceedings 16. Springer. 2020; 299–314.
36. Jiang P, Chen Y, Liu B, He D, Liang C. Real-time detection of apple leaf diseases using deep learning approach based on improved convolutional neural networks. IEEE Access. 2019;7:59069–80.
37. Zhong Y, Zhao M. Research on deep learning in apple leaf disease recognition. Comput Electron Agric. 2020;168:105146.
38. Nagaraju M, Chawla P, Upadhyay S, Tiwari R. Convolution network model based leaf disease detection using augmentation techniques. Expert Syst. 2021. https://doi.org/10.1111/exsy.12885.
39. Hossain SMM, Deb K, Dhar PK, Koshiba T. Plant leaf disease recognition using depth-wise separable convolution-based models. Symmetry. 2021;13(3):511.
40. Kaur P, Harnal S, Tiwari R, Upadhyay S, Bhatia S, Mashat A, Alabdali AM. Recognition of leaf disease using hybrid convolutional neural network by applying feature reduction. Sensors. 2022;22(2):575. https://doi.org/10.3390/s22020575.
41. Paul SG, et al. A real-time application-based convolutional neural network approach for tomato leaf disease classification. Array. 2023;19:100313.
42. Yao J, Tran SN, Garg S, Sawyer S. Deep learning for plant identification and disease classification from leaf images: multi-prediction approaches. ACM Comput Surv. 2024;56(6):1–37.
43. Andrushia AD, Neebha TM, Patricia AT, Sagayam KM, Pramanik S. Capsule network-based disease classification for Vitis Vinifera leaves. Neural Comput Appl. 2024;36(2):757–72.
44. Thapa R, Zhang K, Snavely N, Belongie S, Khan A. The Plant Pathology Challenge 2020 data set to classify foliar disease of apples. Appl Plant Sci. 2020;8(9):e11390.
45. Vaswani A, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30:1.
46. Yu H-J, Son C-H. Apple leaf disease identification through region-of-interest-aware deep convolutional neural network. arXiv preprint arXiv:1903.10356, 2019.
47. Yadav A, Thakur U, Saxena R, Pal V, Bhateja V, Lin JC-W. AFD-Net: Apple Foliar Disease multi classification using deep learning on plant pathology dataset. Plant Soil. 2022;477(1):595–611.
48. Arun Pandian J, Geetharamani G. Data for: identification of plant leaf diseases using a 9-layer deep convolutional neural network. Mendeley Data V1. 2019. https://doi.org/10.17632/tywbtsjrjv.1.

Author information

Authors and Affiliations

Department of Robotics and Artificial Intelligence, SMME, National University of Sciences and Technology (NUST), Islamabad, Pakistan
Wasi Ullah & Kashif Javed
Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
Muhammad Attique Khan
Computer Science and Information Systems Department, College of Applied Sciences, AlMaarefa University, Riyadh, Saudi Arabia
Faisal Yousef Alghayadh
Model Institute of Engineering and Technology, Jammu, J&K, India
Mohammed Wasim Bhatt
National University of Science and Technology, Muscat, Oman
Imad Saud Al Naimi
University of Mines and Technology, Tarkwa, Ghana
Isaac Ofori

Contributions

W.U., K.J. and M.A.K. wrote the manuscript. F.Y.A. and M.W.B. prepared figures and did formal analysis. I.S.A.N. and I.O. validated the research. All authors reviewed the manuscript.

Corresponding author

Correspondence to Isaac Ofori.

Ethics declarations

Ethics approval and consent to participate

This research does not involve any kind of human or animal participation.

Competing interests

The authors declare that they have no competing interests and agree with the contents of the manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ullah, W., Javed, K., Khan, M.A. et al. Efficient identification and classification of apple leaf diseases using lightweight vision transformer (ViT). Discov Sustain 5, 116 (2024). https://doi.org/10.1007/s43621-024-00307-1

Received: 31 October 2023
Accepted: 05 June 2024
Published: 18 June 2024
DOI: https://doi.org/10.1007/s43621-024-00307-1
