1 Introduction

Agriculture remains a cornerstone of the global economy, providing food and raw materials for billions of people. The identification of diseases in plant leaves is crucial for agricultural research as it directly impacts crop yield, quality, and overall food security. Leaf diseases, in particular, pose significant threats to plant health, leading to substantial losses in agricultural productivity. If these diseases are identified promptly and precisely, early measures can be taken to reduce their impact on crops. In addition, efficient disease control techniques can support the agriculture industry and provide a consistent food supply for a growing population.

The major challenge in agricultural disease management is to develop robust and accurate models capable of identifying diseases across various crops using minimal training data. The imbalance in datasets often leads to models that are overfitted to well-represented classes, resulting in poor generalization to new or less-represented ones. This problem is particularly relevant to plant disease categorization, as datasets may not contain enough instances of certain plant species or illnesses.

In this research, we utilized the Plant Village dataset, which includes a comprehensive collection of images of both healthy and diseased leaves from various plant species, including tomato, potato, and bell pepper as shown in Fig. 1. The goal was to simulate real-world scenarios where models trained on one crop may be applied to identify diseases in another. By focusing on how well models generalize from one type of plant to another, we aim to address the challenge of dataset imbalance and improve the flexibility and effectiveness of plant disease detection.

Fig. 1 Zero-shot learning: source and target domain classes

Traditional methods of disease identification typically involve visual inspection by experts or laboratory testing. While these methods can be effective, they are often time-consuming, laborious, and require significant expertise, which makes them difficult to scale across large agricultural regions and the wide variety of plant species and disease types. In response to these limitations, machine learning algorithms such as K-Nearest Neighbors (KNN) (Vaishnnave et al. 2019; Gurunathan et al. 2023; Hossain et al. 2019), k-means clustering (Rohilla and Rai 2022), and Support Vector Machines (SVMs) (Es-saady et al. 2016; Zhao et al. 2010; Chanda et al. 2021) have emerged as viable alternatives for disease classification. Random Forest classifiers (Tandekar and Dongre 2023; Saha et al. 2021) are also highly effective for improved classification. These algorithms leverage computational techniques to analyze data extracted from plant samples, such as images or spectral data, to automatically detect and classify diseases.

Convolutional Neural Networks (CNNs), a type of deep learning method, have significantly advanced plant disease identification by analyzing leaf images for detection and classification (Harshavardhan et al. 2023; Shivaprasad and Wadhawan 2023). By training on large labeled datasets, CNNs can accurately identify various leaf diseases across different plant species and environmental conditions. The method makes use of CNNs’ ability to identify complex patterns and features, enabling highly accurate disease classification.

CNNs can automate the disease identification process, resulting in quick and accurate diagnoses and reducing the need for expert intervention (Thotad et al. 2023; Singh et al. 2023; Lakshmanarao et al. 2021). Early detection enables prompt intervention, which helps minimize crop loss and stop the spread of diseases. As a result, CNN-based real-time monitoring systems can scan crops continuously and notify farmers, leading to more sustainable and efficient agricultural practices.

Transfer learning has become a key method for plant disease identification, using models pre-trained on large datasets to improve performance with limited data. This approach involves taking knowledge from models trained on large-scale datasets, such as ImageNet, and adapting them to the specific task of plant disease detection (Kollem et al. 2024; Chellapandi et al. 2021; Pavan et al. 2023; Gopi and Kondaveeti 2023). It allows accurate disease identification with a much smaller number of labeled images, drastically reducing the time and resources required to build effective models and bypassing the difficult, time-consuming process of collecting and annotating large datasets.

In recent years advancements in machine learning and computer vision have introduced innovative solutions to this challenge. Zero-Shot Learning (ZSL), a transfer learning technique that enables models to identify previously unseen classes without direct training data, is one potential approach. ZSL learns from seen classes and predicts unseen classes by leveraging semantic information, such as descriptions or attributes, to infer and generalize knowledge from known classes to new, unseen classes (Zhang and Zhang 2024; Lv et al. 2020; Zhen et al. 2023; Wang et al. 2018; Fu et al. 2018; Frome et al. 2013). This capability makes ZSL particularly well-suited for leaf disease classification, where new disease variants or previously unrecorded diseases frequently emerge.

Additionally, Few-Shot Learning (FSL) is another technique that has gained attention. FSL aims to train models using a very small number of examples per class, addressing situations where data is scarce or expensive to obtain. FSL (Wang and Wang 2021; Lin et al. 2022) leverages prior knowledge from related tasks or domains to generalize well from minimal examples, often using methods like meta-learning, where models learn to adapt quickly to new tasks, or transfer learning, where knowledge from a pre-trained model is fine-tuned on the few available examples. Both ZSL (Balavani et al. 2023; Singh and Sanodiya 2023; Fang et al. 2024) and FSL (Dedhia et al. 2022; Kumar et al. 2023; Nuthalapati and Tunga 2021) offer valuable solutions in scenarios where traditional machine learning methods struggle due to the lack of sufficient labeled data.

Embeddings refer to the process of mapping data from different domains into a meaningful feature space. In transfer learning, embeddings represent semantic attributes as dense vectors, facilitating the model’s ability to generalize across classes by mapping similar concepts closer together in the vector space. This approach helps models infer unseen classes by leveraging semantic similarities. Existing methods often use embeddings to encode semantic attributes, but they may not fully capture the complex relationships between attributes and visual features. Our approach improves on this by concatenating semantic attributes with feature vectors, leading to richer and more descriptive representations. This method enhances generalization and accuracy where traditional embedding methods fall short.

By incorporating Zero-Shot Learning (ZSL) into our framework, we address the issue of dataset imbalance and improve the model’s ability to generalize across various plant species and disease types. For instance, when the model is trained on healthy tomato leaves, it learns broad characteristics of leaf health, such as texture, color, and shape, that extend beyond a specific plant species. This enables the model to predict the health of potato leaves even without direct exposure to them during training. Similarly, the model’s knowledge of late blight on tomatoes can be generalized to detect similar symptoms in other plants, like potatoes. ZSL leverages semantic attributes to transfer knowledge from seen to unseen classes, effectively mitigating the impact of dataset imbalance and enhancing disease management across different plant species.

This research paper explores the application of Zero-Shot Learning in the context of leaf disease classification. By combining semantic attributes, which are pieces of information, with features for classification, we investigate how ZSL can be used to enhance the accuracy and efficiency of disease detection in plants, even when dealing with limited or no labeled data for certain diseases. ZSL presents a viable substitute for conventional supervised learning techniques, with the potential to revolutionize the way in which farmers and agricultural experts manage plant health.

In this work, we aim to address the following key objectives:

1. Develop a Zero-Shot Learning (ZSL) framework for plant leaf disease identification that leverages both semantic attributes and visual features.

2. Enhance the generalization capabilities of the ZSL model to accurately classify unseen plant leaf diseases by integrating and optimizing the use of semantic and visual features, utilizing different pre-trained models to extract robust feature representations.

3. Investigate the impact of various pre-trained models on the accuracy and generalization ability of the ZSL approach, identifying the most effective models for plant disease classification tasks.

2 Related work

Here, we examine related works on plant disease classification, specifically focusing on methods that align with our proposed framework. We explore various methodologies for feature extraction, emphasizing those designed to address the limitations of disease classification.

2.1 Traditional machine learning approaches

For instance, Garg et al. (2022) utilized a K-Nearest Neighbors (KNN) classifier for the automatic detection of plant diseases, achieving notable accuracy. The KNN classifier, a straightforward yet powerful machine learning algorithm, operates by comparing new, unseen data points to the most similar points in the training dataset, identified as the ’k’ nearest neighbors. This method is effective for plant disease identification because of its ease of use and its capacity to handle multi-class classification problems.

Support Vector Machines (SVMs) have also proven to perform exceptionally well in automatic disease detection in plant leaves, demonstrating robustness and high accuracy across various datasets and disease types. Ramanathan et al. (2023) employed a novel approach using Butterfly Optimization (BO) for feature extraction in conjunction with SVMs. Kumar et al. (2020) proposed a framework that combines k-means segmentation with multi-class SVM-based classification in four stages, achieving an accuracy of 95.7%.

2.1.1 Deep learning approaches

Deep learning has emerged as a pivotal technology in the field of plant disease detection and diagnosis. Its ability to automatically extract and learn features from large datasets makes it particularly well-suited for analyzing complex patterns in plant health. Sarkar et al. (2023) suggested a strategy that uses DenseNets with SVMs to combine the advantages of deep learning and traditional machine learning methods. This hybrid model is a promising approach for plant disease detection and classification since it increases both accuracy and generalization.

In Belmir et al. (2023), Belmir et al. proposed a deep CNN model for plant disease classification using the PlantVillage dataset. This model achieved a training accuracy of 98.01% and a test accuracy of 94.33%, showing its efficiency in identifying and categorizing leaf diseases early on, which is essential for maintaining agricultural production. Joseph et al. (2023) proposed fusing a deep CNN with Local Binary Pattern (LBP) techniques for enhanced image analysis and feature extraction, and also added a predictive analysis component that flags possible infections. Militante et al. (2019) applied deep learning models and achieved an accuracy of 96.5%.

Showrav et al. (2022) proposed a two-stage classification scheme for plant disease detection, addressing the issue of species-specific symptoms. First, it detects the plant species; second, it uses efficient CNN architectures such as EfficientNetB3 and NASNetLarge with transfer learning to identify diseases specific to those species. Tested on the Plant Village and IPM datasets, this strategy performs more effectively than traditional single-stage techniques. Bakshi and Goel (2023) performed a computational analysis of multi-class plant disease diagnosis using a logistic regression classifier on three models, ResNet-50, DenseNet-161, and Inception-V3, of which ResNet-50 performed best with an accuracy of 96.9%.

Guowei Dai et al. proposed a novel deep learning model (PPLC) (Dai et al. 2023) that integrates dilated convolutions, multi-head attention, and Global Average Pooling (GAP) layers. Additionally, the CBAM attention mechanism is incorporated into the middle layer to enhance the model’s information representation. This model achieved an accuracy of 99.702% and an F1 score of 98.442%. In Pal and Kumar (2023), the authors propose the AgriDet framework, combining the Inception-Visual Geometry Group Network (INC-VGGN) and Kohonen-based deep learning for plant disease detection. The framework addresses issues like occlusion and overfitting, utilizing advanced image pre-processing and a pre-trained INC-VGGN model, achieving superior accuracy and sensitivity.

Dai et al. (2024) proposed the DFN-PSAN network, which combines the YOLOv5 backbone with pyramidal squeezed attention (PSA) for effective plant disease classification in natural field environments. DFN-PSAN achieves over 95.27% accuracy and F1-score, while the PSA mechanism reduces model parameters by 26%. Additionally, t-SNE with SHAP interpretable methods enhances the transparency of the model’s attention features.

2.1.2 Transfer learning and domain adaptation approaches

Transfer learning has proven instrumental in leveraging pre-existing knowledge from one domain to enhance learning and performance in another. Degadwala et al. (2023) proposed a pioneering approach that uses transfer learning with state-of-the-art CNNs for the classification of hop plant diseases, highlighting how pre-trained models can improve the efficiency and accuracy of agricultural disease detection. Tumpa and Halder (2023) compared six pretrained models, including VGG16, ResNet50, DenseNet121, MobileNetV2, Xception, and InceptionV3; in this research, the Xception model outperformed the others with an accuracy of 98.41% and a loss of 0.079.

Tunio et al. (2021) proposed a hybrid deep CNN transfer learning approach using rice plant images, developing a deep learning model on the Rice Leaf Dataset from a secondary source and achieving an accuracy of 90.8%. Rani and Gowrishankar (2023) used the Agri-ImageNet dataset to evaluate 38 deep transfer learning models; among these, EfficientNetV2B2 and EfficientNetV2B3 achieved the highest accuracies, with 90.3% and 92.4% respectively. Anand et al. (2023) applied transfer learning to four distinct models, AlexNet, VGG16, MobileNetV2, and InceptionV3, with InceptionV3 outperforming the others with an accuracy of 0.92 and a precision of 0.84.

Rahim et al. (2023) leveraged CNNs with transfer learning, using pretrained models such as VGG16, Inception-V3, VGG19, and ResNet-50, alongside traditional methods including KNN, SVM, AdaBoost, Decision Tree, and Random Forest. In this work, VGG-19 achieved an accuracy of 98% and Random Forest 96.6%, exceeding the other models.

Table 1 Comparison work on different methods with various models

Table 1 provides a comparison of various methods and models used in plant disease classification, highlighting their datasets, models, testing on unseen classes, and the integration of semantic attributes. A comparative analysis is conducted using models like VGG16, ResNet50, EfficientNet, AlexNet, Inception v3, YOLO v3, YOLO v4, and MobileNetv3. Our proposed method stands out by uniquely integrating semantic attributes, which enhances the model’s ability to generalize and perform accurately on unseen classes.

2.1.3 Zero-shot transfer learning approaches

Romera-Paredes and Torr (2015) introduced a streamlined zero-shot learning method using a single-line implementation. This approach involves training with a signature matrix to derive a V matrix, enabling inference during testing using semantic attributes. In the paper (Han et al. 2024), Han et al. introduced a dual relation mining network featuring a dual attention block designed for visual semantic relationship extraction. Additionally, they employed Semantic Interaction Transformer (SIT) for enhancing attribute representations across images, thereby improving generalization capabilities.

However, semantic attribute models do not capture the intricate manner in which humans perceive and recognize elements within images. To address this limitation, researchers have proposed using human gaze data as auxiliary information for zero-shot image classification. Gaze estimation predicts where humans direct their attention in an image, offering valuable insights for attribute localization. Liu et al. (2021) introduced a gaze estimation framework consisting of three modules: Attention Module, Attribute Localization, and Attention Transition. In this framework, task-dependent attention is learned using the goal-oriented GEM, while global image features are concurrently optimized through the regression of local attribute features. Karessli et al. (2017) introduced three gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG), and Gaze Features with Sequence (GFS). That paper also presents a key equation pertaining to the Structured Joint Embedding (SJE) model for zero-shot learning.

Han et al. (2021) proposed a contrastive embedding (CE) approach for their hybrid GZSL framework, which leverages both class-wise and instance-wise supervision. Traditional ZSL approaches often encounter domain bias issues, where generated features for unseen classes may not accurately reflect their true distributions. To address this, Liu et al. (2022) proposed a VAE-based framework called Joint Attentive Region Embedding with Enhanced Semantics (AREES). This framework is designed to improve zero-shot recognition by simultaneously optimizing feature extraction and feature generation, enhancing the alignment between visual and semantic features.

Using graph embeddings, Naeem et al. (2021) suggested compositional zero-shot learning with the goal of identifying previously unknown combinations of states and objects based on visual primitives observed during training. The method makes use of the relationships that exist between states, objects, and their compositions inside a graph structure to help transfer information from seen compositions to unseen ones. Liu et al. (2021) introduced an iterative co-training framework comprising two distinct base ZSL models and an exchanging module. Additionally, it features a semantic-guided OOD detector designed to identify the most likely unseen-class samples before class-level classification, addressing the bias problem in GZSL.

3 Methodology

This methodology leverages the strengths of both state-of-the-art CNN models and expert-defined semantic attributes to enhance the accuracy and interpretability of plant disease classification. By combining high-level features with domain-specific knowledge, our approach aims to provide a robust and reliable solution for early disease detection in plants.

Table 2 State of the Art CNN Models

3.1 Feature extraction using pretrained models

In this work, we used different pretrained models for feature extraction. Each pretrained model, chosen for its high performance on related tasks, is fine-tuned to suit the specific needs of plant disease classification (Fig. 2). The steps involved are:

Fig. 2 Our proposed architecture

1. Model selection: In the field of plant disease classification, where obtaining a large amount of labeled data can be challenging, transfer learning with pre-trained Convolutional Neural Networks (CNNs) is vital. Utilizing models pre-trained on extensive datasets like ImageNet allows for more effective classification, even with limited data. To cover a range of computing and deployment requirements, we have chosen a wide range of models. For resource-constrained environments, lightweight models like MobileNetV2 and ShuffleNetV2 are ideal because of their efficiency and low computational demands. In contrast, when computational resources are abundant, more complex models like VGG19, ResNet50, and Vision Transformer (ViT-B16) excel by capturing intricate patterns in the data, leading to enhanced classification accuracy. Additionally, models like DenseNet121, EfficientNet, and GoogleNet offer a balanced approach for scenarios with moderate resource constraints, combining efficiency with strong performance. By utilizing a diverse range of pre-trained models, from lightweight to complex, we ensure a flexible and comprehensive approach to plant disease classification. This allows us to compare their performance and identify the best model for various resource scenarios, ensuring high accuracy and effectiveness. A detailed description of every selected model is provided in Table 2.

2. Image preprocessing: The plant images were preprocessed to match the input requirements of the selected pre-trained models. This involved resizing the images to the specific dimensions required by each model, normalizing pixel values, and applying any necessary augmentations to enhance the dataset’s diversity and robustness.

3. Feature extraction: The pre-processed images were fed into the selected pre-trained models, and features were extracted from the final pooling or fully connected layers, depending on the architecture. These features, which capture high-level representations of the images, served as the initial input for our classification model. Specifically, the last fully connected layers, which are used for classification in the pre-trained models, were removed to obtain these features.
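The following is a minimal PyTorch sketch of the preprocessing and feature extraction steps, assuming a ResNet18 backbone with standard ImageNet normalization; the function and variable names are illustrative rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing (resize, tensor conversion, normalization)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load a pretrained backbone and drop its final fully connected layer
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()          # keep the 512-d pooled features
backbone.eval()

def extract_features(image_path: str) -> torch.Tensor:
    """Return the 512-dimensional feature vector for a single leaf image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)   # shape: (512,)
```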

3.2 Incorporation of semantic attributes

1. Semantic attribute definition: Semantic attributes relevant to plant diseases were defined in collaboration with domain experts. These attributes included characteristics such as leaf color, shape, texture, presence of spots, and other visually distinguishable symptoms of diseases.

2. Selection of semantic attributes: The semantic attributes for plant disease classification were selected based on their biological significance and their direct correlation with visible symptoms of plant diseases. Each attribute provides insight into the health of the plant and the nature of the disease. For instance, attributes like elliptical shape and lanceolate shape help identify specific types of infections by showing particular patterns on the plant leaves. Green color and its deviations (e.g., yellowing, brown color) indicate plant health and potential infections. Discoloration and deformation degree signal early stress or disease progression. Concentric rings and rust texture are signs of specific fungal or bacterial infections, while smaller and larger spots indicate the stage of infection. Analyzing these attributes helps in accurate disease diagnosis and intervention. Table 3 details the significance of each selected attribute.

3. Annotation process: The Plant Village dataset was annotated with these semantic attributes. Each plant image was manually labeled with the corresponding attributes by trained annotators, ensuring high-quality and accurate annotations. This includes detailed annotations based on a set of binary semantic attributes: elliptical shape, lanceolate shape, ovate shape, green color, brown color, discoloration, deformation degree, size reduction, concentric rings, rust texture, smaller spots, larger spots, and yellowing. For instance, images of a Tomato Late Blight leaf and a Potato Early Blight leaf might be annotated as shown in Fig. 3.

4. Attribute vector creation: For each image, a vector of semantic attributes was created. This vector represented the presence or absence (or degree) of each defined attribute, resulting in a comprehensive attribute representation for each image.

Table 3 Detailed Significance of selected Semantic Attributes
Fig. 3 Each attribute is represented as either 0 (absent) or 1 (present), providing a detailed characterization of plant leaf features and conditions in the dataset

3.3 Feature concatenation

The feature vectors extracted from the pre-trained models were merged with the semantic attribute vectors. This combined vector incorporated both the deep learning features and the human-defined semantic attributes, offering a more comprehensive representation of each image. The concatenated feature vectors served as input to the Combined Classifier, a neural network model specifically designed for this task which contains two fully connected layers.
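A minimal sketch of this combined classifier is given below, assuming 512-dimensional ResNet18 features and the 13 binary attributes listed in Sect. 3.2; the hidden-layer size and the example tensors are placeholders, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

# Attribute order follows the 13 binary attributes defined in Sect. 3.2
ATTRIBUTE_NAMES = [
    "elliptical_shape", "lanceolate_shape", "ovate_shape", "green_color",
    "brown_color", "discoloration", "deformation_degree", "size_reduction",
    "concentric_rings", "rust_texture", "smaller_spots", "larger_spots", "yellowing",
]

class CombinedClassifier(nn.Module):
    """Two fully connected layers on top of [CNN features ; semantic attributes]."""
    def __init__(self, feature_dim=512, attr_dim=13, hidden_dim=256, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + attr_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, features: torch.Tensor, attributes: torch.Tensor) -> torch.Tensor:
        combined = torch.cat([features, attributes], dim=1)   # concatenate along the feature axis
        return self.net(combined)

# Usage: a batch of 8 feature vectors paired with their binary attribute vectors
features = torch.randn(8, 512)
attributes = torch.randint(0, 2, (8, 13)).float()
logits = CombinedClassifier()(features, attributes)           # shape: (8, 3)
```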

3.4 Formulation of loss function

Here, we are using cross-entropy loss to quantify the dissimilarity between the predicted class probabilities and the actual class labels. Given N as the number of samples and C as the number of classes, the loss is computed as:

$$\begin{aligned} \mathcal {L}(\textbf{y}, \mathbf {\hat{y}}) = -\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{C} y_{ij} \log (\hat{y}_{ij}) \end{aligned}$$
(1)

where \(y_{ij}\) is the actual class label for sample i and class j (1 if the sample belongs to class j, 0 otherwise), and \(\hat{y}_{ij}\) is the predicted probability that sample i belongs to class j.

The cross-entropy loss function calculates the negative log likelihood of the true labels given the predicted probabilities. A lower cross-entropy loss indicates that the predicted probabilities are closer to the true labels, meaning the model is performing better.

3.5 Training the model

In this section, we detail the implementation of our framework using Stochastic Gradient Descent (SGD) for training. The objective is to optimize the neural network’s weights by minimizing the loss function, which measures the disparity between predicted and actual class labels. This process involves updating the parameters \(\Theta\) iteratively via backpropagation.

$$\begin{aligned} \Theta ^{t+1} = \Theta ^{t} - \eta \frac{\partial \mathcal {L}}{\partial \Theta ^{t}} \end{aligned}$$
(2)

where \(\eta\) represents the learning rate.
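A minimal training-loop sketch implementing this update with the cross-entropy loss of Eq. (1) is shown below; the data loader, device, and hyperparameters are placeholders, and the optimizer choice is illustrative (the experiments in Sect. 4 also report the use of the Adam optimizer).

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=0.001, device="cuda"):
    """Sketch of training: each step applies the update of Eq. (2) to the parameters."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()                      # cross-entropy loss of Eq. (1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for features, attributes, labels in loader:        # loader yields (features, attributes, labels)
            features = features.to(device)
            attributes = attributes.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(features, attributes)
            loss = criterion(logits, labels)
            loss.backward()                                # backpropagate dL/dTheta
            optimizer.step()                               # Theta <- Theta - lr * dL/dTheta
    return model
```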

4 Experiments

4.1 Setup

Our experiment utilized the PyTorch framework, employing pretrained models from the PyTorch library for effective feature extraction and streamlined training. The experiments were conducted using a system equipped with NVIDIA GPUs and CUDA acceleration, ensuring sufficient computational power and memory to effectively manage the complexities of our models and the large datasets involved in our research.

4.1.1 Dataset

For this experiment, we utilized a subset of the Plant Village dataset, which contains around 20,600 images categorized into 15 classes. These classes encompass a variety of plant leaves, including those from potato, tomato, and pepper bell, covering both diseased and healthy specimens, as summarized in Table 4.

Table 4 Summary of the selected subset from the Plant village dataset

4.1.2 Baseline method

The baseline method for this work involves extracting features from a pretrained model and classifying them without concatenating semantic attributes. In this method, we used two data splits, one with three classes and one with four. In the first split, we trained on tomato leaves and tested on potato leaves with the classes Healthy, Late Blight, and Early Blight. In the second split, we additionally included the Bacterial Spot class of pepper bell for training and evaluated accordingly.

4.1.3 Implementation details

The experiment was implemented using frameworks and libraries such as PyTorch. The pretrained models used in the experiments are VGG16, ResNet50, ViT-B/16, VGG19, ResNet18, AlexNet, GoogLeNet, DenseNet121, EfficientNet-B0, MNASNet1.0, MobileNetV2, and ShuffleNetV2 x0.5.

During the implementation, we concatenated semantic attributes along with features extracted by the pre-trained models to help the model learn more effectively. This integration aimed to leverage both the rich feature representations from the pre-trained models and the additional contextual information provided by the semantic attributes. By doing so, we sought to enhance the model’s performance and accuracy in classifying the unseen potato classes.

We conducted experiments using two different data splits: one with three classes and the other with four classes. For the first data split, which included three classes (healthy, early blight, and late blight), we trained the model on tomato leaves and tested it on potato leaves, and vice versa: training on potato leaves and testing on tomato leaves. We repeated the experiment with a second data split that included four classes by adding another class: bacterial spot.
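Assuming the Plant Village images are organized in ImageFolder-style class directories, a source/target split of this kind can be sketched as follows; the folder names and dataset path are hypothetical and must be adapted to the actual dataset layout.

```python
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# Hypothetical Plant Village folder names; adjust to the actual directory structure
SOURCE = ["Tomato_healthy", "Tomato_Early_blight", "Tomato_Late_blight"]
TARGET = ["Potato_healthy", "Potato_Early_blight", "Potato_Late_blight"]

def domain_subset(root, class_folders, transform):
    """ImageFolder restricted to one domain, relabeled 0..K-1 in the given order."""
    ds = datasets.ImageFolder(root, transform=transform)
    remap = {ds.class_to_idx[c]: new for new, c in enumerate(class_folders)}
    ds.samples = [(path, remap[t]) for path, t in ds.samples if t in remap]
    return ds

# Seen (source) classes for training, unseen (target) classes for testing;
# matching positions in SOURCE/TARGET share a label (healthy, early blight, late blight)
train_ds = domain_subset("plantvillage/", SOURCE, tfm)
test_ds = domain_subset("plantvillage/", TARGET, tfm)
```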

4.1.4 Parameters

The training was conducted using the Adam optimizer with a learning rate of 0.001, a batch size of 32, and over 100 epochs. The primary evaluation metric was classification accuracy on the unseen classes, supplemented by cross-entropy loss.

For semantic information, we defined a 13-dimensional attribute vector for each class, with example attribute vectors including [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] for Tomato Healthy and [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1] for Potato Late Blight. We also reduced the number of semantic attributes to evaluate the model’s performance with different attribute sets.

In addition to evaluating with the full set of attributes, we also performed experiments with reduced attribute sets of 9 and 7 attributes. This approach helps us assess how varying the number of semantic attributes affects the model’s performance. By comparing results with different attribute configurations, we can better understand the trade-offs between model complexity and accuracy, and determine the optimal number of attributes for balancing semantic richness and computational efficiency.
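A hypothetical illustration of such attribute reduction is sketched below, dropping columns of the class-attribute matrix (for example an attribute shared by nearly all classes, such as green color); the dropped indices are illustrative and do not correspond to the exact 9- and 7-attribute sets used in the experiments.

```python
import numpy as np

# 13-dimensional class-attribute matrix (rows: classes, columns: attributes)
class_attributes = np.array([
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # Tomato Healthy (example vector from Sect. 4.1.4)
    [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1],   # Potato Late Blight (example vector from Sect. 4.1.4)
])

# Hypothetical indices of attributes to drop (e.g. green color, shared by nearly all classes)
drop = [0, 3, 7, 8]
reduced = np.delete(class_attributes, drop, axis=1)   # 13 -> 9 attributes
print(reduced.shape)                                   # (2, 9)
```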

4.2 Experimental results

In this section, we present the results of our framework (Zero-Shot Transfer Learning), using various pre-trained models to compare their performance with and without the inclusion of semantic attributes (SA). Our objective is to evaluate the impact of integrating semantic attributes on classification accuracy and other relevant metrics.

For the first experiment, we used three classes: Healthy, Early Blight, and Late Blight. We trained the models on Tomato leaves and tested them on Potato leaves, and then reversed the process by training on Potato leaves and testing on Tomato leaves. The results of both experiments are presented in Table 5.

Table 5 Accuracies of state-of-the-art models in the ’Tomato vs Potato’ and ’Potato vs Tomato’ experiments with 3 classes: healthy, early blight, late blight
Table 6 Accuracies of state-of-the-art models in the ’Tomato vs Potato & Pepper Bell’ and ’Potato & Pepper Bell vs Tomato’ experiments with 4 classes: healthy, early blight, late blight and bacterial spot

For the second experiment, we employed four classes: Healthy, Early Blight, Late Blight, and Bacterial Spot. We initially trained the models on Tomato leaves and tested them on Potato and Pepper Bell leaves. Subsequently, we reversed the experiment by training on Potato and Pepper Bell leaves and testing on Tomato leaves. Both results are summarized in Table 6.

4.3 Experimental analysis

In our work, we conducted experiments across different source and target domains, performing classifications with and without semantic attributes (SA) across various numbers of classes. The results clearly demonstrate that incorporating semantic attributes consistently enhances model performance.

Table 5 illustrates the model performance for three-class classifications (Tomato vs Potato and Potato vs Tomato) when semantic attributes are included. In the table, “SA” refers to Semantic Attributes and “w/o SA” refers to without Semantic Attributes.

For instance, over all epochs, models such as ResNet18 and ResNet50 show significant accuracy increases when utilizing semantic attributes in both the Tomato vs Potato experiment and the reverse scenario. At 100 epochs, ResNet18 achieves an accuracy of 99.95% with SA compared to 62.73% without SA, and GoogleNet reaches 99.81% with SA, up from 68.14% without SA, for Tomato vs Potato classification. These results suggest that semantic attributes significantly enhance the model’s ability to differentiate between the two plant species, leading to higher classification accuracy. This trend is observed across most models, with notable improvements in ResNet50, ViT-B16, and EfficientNet-B0. These findings suggest that deeper and more complex neural networks benefit from the addition of semantic features.

In the more complex four-class scenario (Tomato, Potato, and Pepper Bell), the inclusion of semantic attributes continued to enhance model performance as illustrated in Table 6. For example, the ViT-B16 model showed a marked improvement in accuracy with SA, achieving 69.45% at 100 epochs for the Tomato vs. Potato, Pepper Bell classification, up from 49.01% without SA. In the four-class classification, ResNet50 reached 68.11% accuracy with SA at 100 epochs compared to 45.04% without SA. Importantly, semantic attributes also greatly improved the performance of models like EfficientNet-B0 and MobileNet-V2, demonstrating their value in enhancing model adaptability and generalization.

The significant difference in effectiveness between the three-class and four-class scenarios can be attributed to several factors. Adding Pepper Bell to the classification introduces additional complexity, as the model must now generalize across three plant species rather than two. This increased complexity makes it more challenging for models to accurately classify all classes. Semantic attributes enhance performance by providing extra contextual information, but the improvement can be less pronounced due to the increased number of categories. Deeper models like ResNet18, ResNet50, and ViT-B16 benefit more from semantic attributes due to their ability to utilize additional features effectively. Class imbalance issues in the four-class scenario also impact performance, though semantic attributes help mitigate these effects.

To further understand the impact of semantic attributes on the model’s learned features, we visualized the feature space using t-SNE plots for the source domain, as shown in Fig. 4. The left plot illustrates the feature distribution before training, where there is significant overlap among the three classes, indicating poor initial separation. In contrast, the right plot shows the feature distribution after training with semantic attributes. It is evident that the classes are more distinctly clustered, with well-defined boundaries between them. This demonstrates that the incorporation of semantic attributes significantly enhances the model’s ability to learn discriminative features, leading to better class separation and reduced confusion between categories.

Fig. 4 Feature space visualization in the source domain using t-SNE: before and after training with semantic attributes

The clearer clustering of classes in the t-SNE plot after training indicates that the model effectively utilizes semantic information to distinguish between different plant species, aligning with our observations of improved classification performance in the source domain.
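The t-SNE projections in Fig. 4 can be reproduced with a sketch along the following lines, assuming a matrix of extracted feature vectors and their class labels as NumPy arrays; the perplexity and other settings are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project high-dimensional feature vectors to 2-D and color points by class."""
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for c in np.unique(labels):
        mask = labels == c
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=str(c))
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_tsne(source_features, source_labels, "After training with semantic attributes")
```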

Fig. 5 Confusion matrices illustrating ResNet18 model performance with and without semantic attributes

The confusion matrices for these experiments further reinforce these observations. Confusion matrices reveal that models with semantic attributes exhibit fewer misclassifications, with the number of false positives and false negatives significantly reduced compared to models without semantic attributes. This indicates that semantic attributes provide valuable context that improves the model’s ability to distinguish between plant species, leading to a clearer separation of classes and a reduction in classification errors. Figure 5 illustrates the difference between confusion matrices with and without Semantic Attributes (SA) for the ResNet-18 model when classifying three classes (Table 7).
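The confusion matrices of Fig. 5 can be computed from the model predictions as sketched below; the model and loader names are placeholders for the trained combined classifier and the target-domain test loader.

```python
import torch
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

@torch.no_grad()
def predict(model, loader, device="cuda"):
    """Collect true and predicted labels over the test loader."""
    model.eval()
    trues, preds = [], []
    for features, attributes, labels in loader:
        logits = model(features.to(device), attributes.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        trues.extend(labels.tolist())
    return trues, preds

# trues, preds = predict(combined_model, test_loader)
# ConfusionMatrixDisplay(confusion_matrix(trues, preds)).plot()
```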

Table 7 Comparison of proposed approach with existing state-of-the-art methods

In the three-class classification experiments, our proposed method, which combines ResNet18 with Semantic Attributes (SA), achieves an outstanding accuracy of 99.81%, significantly outperforming all other existing methods. This result is notably higher than the next best-performing method, Inceptionv3 with Transfer Learning, which achieved an accuracy of 88.62% as reported by Morellos et al. (2022). Other methods, such as EfficientNet with Few-Shot Learning and MobileNet-V2 with Convolutional Attention (CA), recorded accuracies of 83.22% and 68.56%, respectively.

The intuition behind the proposed approach lies in its ability to enhance disease detection by integrating semantic attributes with image features from advanced models like CNNs and Vision Transformers (ViTs). By combining detailed image data with rich descriptive information about the diseases, this approach provides a more comprehensive understanding, enabling accurate classification of both known and unseen diseases.

In contrast, existing state-of-the-art methods utilize different strategies. Inceptionv3 with Transfer Learning leverages pre-trained models to utilize extensive learned features for robust classification. EfficientNet with Few-Shot Learning adapts to new diseases from limited examples, balancing model efficiency with minimal data. MobileNet-V2 with CA refines features using a lightweight model and contextual attention. While these methods excel in their specific areas, they do not integrate descriptive disease information as effectively as our approach, which results in superior accuracy.

Despite these advances, practical scenarios can still present challenges for zero-shot learning (ZSL). To address generalization issues, incorporating additional techniques such as data augmentation, class weighting, and resampling can be beneficial. These methods can help refine the model’s ability to generalize and classify unseen plant diseases with greater accuracy and reliability.

4.4 Convergence analysis

To obtain a comprehensive understanding of the convergence patterns displayed by various models across different scenarios, we utilized detailed visualizations, as shown in Fig. 6. These visualizations depict the training and testing accuracies over time. Specifically, the graphs present the number of epochs on the X-axis, while the Y-axis shows both the training and testing accuracies. By analyzing these graphs, we can observe how top models perform and converge under various conditions, providing valuable insights into their trends and behaviour.

The analysis of the provided graphs reveals distinct convergence patterns and trends for the ResNet18 and ViT-B16 models, comparing their performance using 13 semantic attributes. For ResNet18, as shown in graph (a), both the training and test accuracies converge quickly, with the training accuracy reaching 100% and the test accuracy stabilizing around 95% within the first 10 epochs. The loss for ResNet18 also decreases rapidly, approaching zero early in the training process and remaining stable thereafter, as depicted in graph (c). This indicates that ResNet18 fits the training data exceptionally well, given the near-perfect training accuracy and minimal loss.

The ViT-B16 model demonstrates a more variable convergence pattern. The training accuracy quickly reaches 100%, but the test accuracy fluctuates between 85% and 90%, with noticeable drops around epochs 60 and 80, as shown in graph (b). The training loss also decreases rapidly but exhibits significant spikes, suggesting instability during training, as seen in graph (d). These fluctuations in test accuracy and loss suggest that ViT-B16 responds dynamically to different training conditions. This highlights the potential for enhancing the model’s performance through careful tuning of parameters such as learning rate schedules and batch sizes.

Fig. 6 Line graphs of accuracy (a, b) and loss (c, d) of ResNet-18 and ViT-B16 over epochs

From Fig. 7, which compares test accuracy for 13 vs 9 vs 7 attributes, it is evident that models generally perform better with a moderate number of semantic attributes. In particular, test accuracy is highest with 9 attributes and lowest with 7 attributes for most models. For instance, models such as VGG16, ResNet50, and GoogLeNet exhibit a noticeable increase in accuracy when using 9 attributes compared to 13 or 7 attributes. This improvement suggests that removing certain common attributes, such as green color, which is present in nearly all healthy leaves, helps the models focus on more distinguishing features. As a result, models achieve better generalization and produce more accurate predictions. This trend is particularly evident in models like VGG16, DenseNet121, and ShuffleNet-v2, where the difference in accuracy between using 9 attributes and 13 attributes is quite pronounced, highlighting the importance of selecting the right attributes to improve model performance.

Fig. 7 Illustration of test accuracies of all models for three classes with different numbers of semantic attributes

The above graph provides a comparative analysis of test accuracy for different models when evaluated using 13, 9, and 7 semantic attributes.

In conclusion, the analysis indicates that selecting an optimal number of semantic attributes, such as 9, significantly enhances the performance of various deep learning models in plant disease classification. Models exhibit higher accuracy with 9 attributes compared to 13 or 7 attributes. This improvement is due to the removal of attributes that are common across all classes, such as the green color, which helps in reducing noise and redundancy. The experiment underscores the significance of selecting a balanced set of attributes to achieve the best performance in plant disease classification tasks, allowing models to extract more relevant features and make better decisions.

5 Conclusion and future works

Our paper introduces a novel approach that leverages semantic attributes by concatenating them with the features extracted by the model for classification. Adding semantic attributes to the feature set gives the classifier access to more contextual information. Through the integration of semantic features, we observed an increase in performance and accuracy. Our approach uses different pretrained models, such as ResNet, GoogLeNet, and ViT-B16, ensuring a comprehensive evaluation across architectures. Our work demonstrates that this strategy effectively enhances model performance across a variety of trials and model designs. Additionally, we provide a detailed analysis of the impact of semantic attributes on classification, highlighting their significance for model performance.

While this method of using semantic attributes provides good results, there are several other approaches for future exploration. One potential direction is integrating semantic attributes with multimodal data, combining visual and textual features to enhance model performance. Incorporating self-attention layers into pretrained models could help capture dependencies at different levels and enhance feature extraction. Semantic attributes could also be applied within few-shot learning settings. Our approach can additionally be supplemented with various data augmentation methods to generate more data and make the model robust in practical scenarios.

Advanced data augmentation methods should also be explored, as techniques like rotation, scaling, and color adjustments can produce more diverse training samples and enhance model robustness. Additionally, integrating Generative Adversarial Networks (GANs) could further improve performance by generating synthetic samples to enrich the dataset. These approaches can complement the success of semantic attributes, leading to more effective solutions in plant disease classification and precision agriculture.