Introduction

Agriculture, a significant driver of the global economy, is the primary provider of food, income, revenue, and employment. Human societies have been able to produce enough food for the current and growing population by using advanced technology in the agricultural sector [1]. However, depending on the season or environmental factors, plants are affected by pests and by diseases caused by nematodes, fungi, viruses, protozoa, and bacteria [2, 3]. These severely affect plant health, structure, quality, and yield, and ultimately the economy. One of the most complex tasks in plant protection is the timely identification of plant symptoms, pests, and diseases [4]. The traditional approach in underdeveloped and developing nations is visual inspection by humans, which is inaccurate, tedious, and time-consuming. Furthermore, smart agricultural gadgets are costly, and interpreting their classification and detection results on large farms requires agronomists and specialists, which is expensive [5].

Employing intelligent technologies capable of automatically detecting plant pests and diseases presents a promising approach to reducing total expenses in agriculture [6]. Academia and industry have therefore used transfer learning (TL) and convolutional neural networks (CNNs) in the agricultural sector, for instance, for plant leaf, fruit, and disease classification, among other applications [7]. However, deep learning (DL) demands a large number of parameters, which increases training time and makes deployment on small devices complex and impractical [8]. Furthermore, properly extracting relevant characteristics from a given dataset is vital to the performance of CNN-based models; for example, several studies used the widely adopted PlantVillage dataset, which covers various plant species and diseases across distinct categories [9, 10].

There has been a growing emphasis on rapid plant disease identification and classification using TL architectures. The complexity and number of parameters of a TL model are determined by the level of model sophistication and the number of filters used [11]. Although TL methods often require advanced image processing techniques, they have simplified the procedure and made it more time-efficient, especially when the model has no starting weights [12]. In addition, TL models require minimal computational resources compared to traditional learning approaches. However, implementing these models on small devices with limited resources can be challenging, which is also a limitation of the traditional learning approach [13].

Several studies demonstrate that some current models are developed using TL to attain better results than other well-developed DL architectures that rely on powerful computing equipment, such as graphics processing units (GPUs) and servers [14]. Because of the high cost, it is not practical to use advanced equipment with GPUs in the agricultural field, since traditional farmers cannot afford it. Therefore, there is a need for applications with fewer parameters and lower computation and power consumption [15].

A survey on adopting computer vision and soft computing methods for disease identification and classification from plant leaves was conducted. It demonstrated that computer vision techniques enhance plant growth, increasing productivity, quality, and economic value [16]. They are critical in medical, defense, agriculture, remote sensing, and business analysis applications. Digital image processing methods simulate human visual capabilities, providing automatic monitoring, disease management, and water management [17].

Another proposed system used a neural network to segment mango leaves for disease. It involved real-time images, preprocessing, feature extraction, training, and extraction of diseased regions. The system achieved high-level accuracy for anthracnose disease segmentation, with an average Specificity of 0.9115 and Sensitivity of 0.9086. The system demonstrated an intuitive and user-friendly interface and is being developed for precision agriculture [18]. Similarly, a hybrid Fuzzy Competitive Learning-based Counter Propagation Network (FCPN) was proposed for image segmentation of natural scene images. Fuzzy Competitive Learning (FCL) was used to train the instar layer of FCPN, whereas Grossberg learning was used to train the Outstar layer. The region-developing method was utilized for seed point selection, clustering, and estimating the number of crop seeds. The FCPN method produced a lower convergence ratio and greater precision than alternative methods [19].

Pattern recognition and machine vision are indispensable for the resolution of complex problems. Combining conventional and optimization methods, like Nature-Inspired Algorithms (NIA) or Bio-inspired methods, can enhance precision and decrease computational time. One such application is image segmentation, for which the Bacterial Foraging Optimization Algorithm (BFOA) is a promising method [20]. The efficiency of the BFO-ANN method was demonstrated through comparison with other approaches. IPM was developed using an automated Radial Basis Function Neural Network (RBFNN) system to detect plant diseases. The system uses leaf images from the IPM agriculture database repository. The RBFNN achieves higher segmentation accuracy than other methods, making it a promising solution for detecting diseases in plants with biotic elements [21].

While a considerable number of studies have proposed plant disease classification and detection models, these studies have notable deficiencies [4, 15, 17, 20], including training on limited datasets, which leads to overfitting and poor generalization to diverse environments. Training models under controlled backgrounds and environmental conditions, rather than in natural settings, reduces their accuracy and robustness and makes them impractical in the natural environment. Computation-related issues, for example, overfitting and difficulties in accurately extracting fine features during training, have reduced the efficiency and usefulness of DL models in the identification and classification of plant diseases. The conventional laboratory diagnosis of plant diseases is expensive, laborious, and time-intensive, which restricts its feasibility for prompt prevention in agriculture. Several current models, like [22, 23], lack resilience and are difficult to apply to diverse plant diseases since most are trained on a single crop only. Moreover, early classification and detection models were developed under restrictive image constraints, such as color images captured in a controlled environment where the background and the foreground are put in binary format [24, 25]. Consequently, the approaches employed in many earlier experiments are unsuitable for deployment in real-world smart agricultural systems, whose images have varying natural backdrops.

Such shortcomings leave a gap: there is no robust and generalized model, trained on a large dataset of images without background restrictions that better reflect the ground truth, that detects and classifies plant diseases and is practical to implement on small devices. This study employs transfer learning with pre-trained CNNs to improve performance and address data scarcity; fine-tuning and deep feature extraction on current cutting-edge CNNs are used to handle background complexity. Moreover, it tackles computational issues by introducing two models, namely early fusion and lead voting ensemble, that incorporate several pre-trained CNNs; these models help reduce overfitting and improve feature extraction.

This study proposes two CNN-based plant disease detection (PDD) and classification architectures with considerably reduced parameters. The TL uses nine comparative models: EfficientNetB7, NASNetMobile, ConvNeXtSmall, DenseNet201, DenseNet101, ResNet50, GoogleNet, ResNet18, and AlexNet. These architectures employ numerous convolutions with varying filter sizes, resulting in superior feature extraction. We have turned to residual connections to address early disease detection problems. We opted for depth-wise separable convolution over conventional convolution because it reduces computational complexity, model size, and the number of parameters without compromising performance. This study uses real-time PlantVillage images containing natural image traits. This research therefore makes the following contributions:

  • We propose plant disease detection (PDDNet) CNN architectures that integrate the top six common CNNs to extract significant features and perform better. These models are summarized as follows. The arithmetic average ensemble (PDDNet-AAE) integrates the outputs of the fine-tuned networks. The ensemble feature extraction model (PDDNet-AE) uses the early average fusion method: deep features collected from multiple DNNs are combined, and an LR classifier is then trained on the combined features. The lead voting ensemble (PDDNet-LVE) operates on class labels.

  • The study uses a logistic regression (LR) classifier to assess the performance of the proposed models against their counterparts (the nine pretrained CNNs) that were used to extract deep features. Ultimately, the class label that received the most (lead) votes was taken as the system's decision.

  • The suggested architecture needs minimal parameters and is faster than traditional ML models tested on DenseNet201, DenseNet101, ResNet50, GoogleNet, ResNet18, AlexNet, EfficientNet, NASNet, and ConvNet.

  • Since the PlantVillage dataset is the most significant plant disease dataset currently publicly available, it was used to assess the proposed approaches. PDDNet-LVE outperformed the other current network models.

  • The proposed models achieved 96.74% and 97.79% accuracy on the early fusion and lead (majority) voting methods for this plant disease detection and classification, respectively. The CNN-LR combination of PDDNet-AE and PDDNet-LVE outperformed the simple averages of CNNs and has demonstrated improved results.

The remainder of this paper is arranged into four sections following the introduction. Section "Related Literature" reviews the literature on plant pest and disease classification and detection using TL models, covering the classification techniques used, the type of crop studied, and the reported accuracy. Section "Material and Methods" describes the study materials and methodology, including the PlantVillage dataset and the plant diseases within it. Section "Obtained Results and Discussion" presents the results and related discussion, covering performance evaluations of the classifiers, the sampled deep learning models, PDDNet-AE, PDDNet-EA, and PDDNet-LVE, and a comparison with state-of-the-art models. Finally, Section "Conclusion" presents the research conclusion and future research directions.

Related literature

The section discusses some DL methods for plant pests and disease detection and classification. Traditional ML approaches are based on creating features and segmentation, and DL techniques are based on learning from data in its raw form.

Pre-trained CNNs such as GoogleNet and AlexNet could classify twenty-six pests and diseases across fourteen plant species [7]; GoogleNet obtained 99.34% accuracy. AlexNet, GoogleNet, VGG, feat, and AlexNetOWTBn could recognize 58 leaf diseases [9]. A nine-layer deep convolutional network was used for plant disease detection and achieved 96.46% accuracy [26]. Similarly, AlexNet's fully connected layer was combined with GoogLeNet's inception layer to classify four apple leaf diseases, with a reported average accuracy of 97.62% [27]. For model optimization, InceptionV3, VGG19, VGG16, and ResNet were used to detect tomato leaf disease and obtained 93.70% field accuracy and 99.60% laboratory accuracy [28]. VGG16 identified eggplant diseases using the support vector machine (SVM) classifier; the red, green, and blue (RGB), YCbCr, and HSV color spaces were tested for robustness, with RGB reaching 99.4% [29]. A fully improved pretrained DL model for plant disease classification achieved 99.75% accuracy.

The authors of [1] demonstrated that an SVM classifier could classify rice leaf diseases from the features of eleven deep CNN models, obtaining an average of 98.38% using ResNet50 with SVM. The authors of [30] identified ten diseases of four plant species using six pretrained TL architectures; VGG16 correctly classified 90% of the test dataset. Using InceptionV3 transfer learning, the authors of [31] identified three cassava plant diseases and two types of pest damage, and detected six cassava illnesses using mobile devices. In [32], the study determined that a 50-layer residual neural network, using ReLU activation and batch normalization following convolution and pooling, can detect three wheat diseases. Using real-time field images from Germany, they reported 96% accuracy. SqueezeNet 227.6MB and SequenzeNet 2.9MB obtained four tea leaf diseases after being tested Cifar ten fast CNN model depthwise separable convolution [33].

A well-trained VGG model identified and classified rice and agricultural diseases [8]. Two inception layers replaced VGGNet's fully connected layers, yielding 80.38% for corn and 92% for rice. Singh and Misra [34] detailed how soft computing methods and image segmentation aid in identifying and classifying plant pests and diseases in widely grown plants such as Malus domestica (apple), Zea mays (maize), and the genus Vitis (grape), using pre-trained CNNs like the VGG16 model and metaheuristic-inspired algorithms like the genetic algorithm.

A gray level co-occurrence matrix (GLCM) with a mobile client-server structure was used for leaf disease detection and classification through the Gabor wavelet transform (GWT). In the mobile disease diagnosis system, feature vectors represent disease regions at multiple resolutions and directions. The mobile client preprocesses leaf photos, segments the affected leaf sections, and sends them to the Pathology Server, lowering transmission costs. The Server extracts GWT–GLCM features and classifies them using K-Nearest Neighbors. Results were displayed via short message service, with 93% accuracy under ideal conditions [35]. Table 1 summarizes the conventional methods, datasets, and the reported accuracy corresponding to those methods. In summary, most of these studies proceed in three primary stages:

  • Plant pests and disease image segmentation is based on applying techniques like mathematical morphology, edge detection, color transformation, and pattern classification.

  • Detection of plant pests and diseases using traditional ML techniques.

  • Representative feature extraction from the segmented images that were obtained utilizing approaches that were based on color, texture, and shape.

Table 1 Plant pest and disease literature according to conventional techniques (Note: BPNN, Back Propagation Neural Network; SVM, Support Vector Machine; PNN, Probabilistic Neural Networks)

The presented models specified in Table 1 are classification algorithms that utilized minimal datasets to differentiate between a limited number of species. Some studies used datasets from apple, Solanum lycopersicum, R. groenlandicum, and maize plants, and most of the reported accuracy ranged from 84% to 97%. Several plant disease detection studies have employed DL as demonstrated. These systems, datasets, and outcomes are demonstrated in Table 2. Most of these experiments included deep network fine-tuning and pretrained CNN feature extraction. To illustrate this, Sabrol and Satish based their study on the tomato disease classification; they used TL to extract features from the images, for example, shape, texture, color, and features with constrained image appearance, and reported a 94% accuracy [40]. The algorithms described in the literature utilize varied datasets and categorize two to four plant species; hence, they cannot be compared directly.

Table 2 Plant pest and disease literature according to the conventional techniques

Material and methods

This section entails the background of the deep learning techniques, the PlantVillage dataset, and the proposed methodology.

Deep learning techniques

Deep learning has been applied extensively in several areas; in plant disease detection and classification, it has been widely used through pretrained deep networks [73,74,75,76]. In this study, we use nine cutting-edge pretrained networks for deep feature extraction so that our classification model has starting training weights. Table 3 presents the nine pretrained deep CNNs (namely, EfficientNetB7, NASNet, ConvNet, DenseNet201, DenseNet101, ResNet50, GoogleNet, ResNet18, and AlexNet), showing their distinct characteristics in size, accuracy, parameters, depth, and GPU requirements.

Table 3 Employed deep network characteristics in this experiment

PlantVillage dataset

A considerable number of plant pest and disease datasets are publicly available, including strawberry [79], rice [80], the NLB dataset for maize, Turkey-PlantDataset [81], apple, AES-CD9214, and PlantVillage, among others. Among the available datasets, we chose PlantVillage since it has several plant species and over thirty categories, covering almost all plant characteristics found in the other datasets. The other datasets examined focus on a single crop, which narrows the classification base, and contain far fewer plant leaf images than PlantVillage. Using pretrained CNNs on a large dataset like PlantVillage assumes proper deep feature extraction. DenseNet models are comparable to ResNet models, except that each layer receives information from all preceding layers. Each DenseNet layer feeds forward, as demonstrated earlier [82]. This study employs DL with six models to extract features to categorize plant diseases. These CNNs have been previously trained and are proficient at extracting features and training deep networks. The approach stands out because using the LR classifier as a substitute for the output layers of these CNNs is more precise.

The PlantVillage dataset was developed to provide effective methods for identifying 38 distinct plant disease classes. It comprises 61,486 plant images in three versions: color, gray-scaled, and segmented. However, we consider 15 categories containing 54,303 PlantVillage images for this experiment. The study used the 15-category subset since it is more evenly distributed across classes than the full 38 categories. Uneven data distribution can lead to class imbalance, where some classes have significantly fewer samples than others. This significantly degrades the performance of ML models when predicting the underrepresented classes. Notably, the source of this dataset (https://plantvillage.psu.edu) no longer exists. However, other open-source platforms, including Kaggle and GitHub, have the datasets available as linked.

Deep features were extracted using nine different pretrained CNNs to make the dataset more diverse and show a wide range of details. During this process, numerous modifications were applied across the three color channels. These enhancements included gamma correction, principal component analysis, noise injection, and color augmentation, together with scaling, rotation, and image flipping of the RGB images. Figure 1 presents image samples from the PlantVillage plant disease species.

Fig. 1

PlantVillage leaf image samples selected from the plant dataset considered in this study. (Legend: D1) Pepper bell bacterial spot, D2) Potato early blight, D3) Potato late blight, D4) Tomato bacterial spot, D5) Tomato early blight, D6) Tomato leaf mold, D7) Tomato Septoria leaf spot, D8) Spider mites Two-spotted spider mite, D9) Tomato target spot, D10) Tomato Yellow Leaf Curl Virus, D11) Tomato mosaic virus, D12) Apple scab, D13) Grape black rot, D14) Orange Huanglongbing (Citrus_greening), and D15) Squash powdery mildew
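Three of the augmentations listed above (image flipping, rotation, and gamma correction) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, which was run in MATLAB; the function names, the tiny 2 x 2 grayscale image, and the gamma value are assumptions chosen only for illustration.

```python
# Illustrative sketch only: a grayscale image is a nested list of 0-255 values.

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def gamma_correct(img, gamma):
    """Apply out = 255 * (in / 255) ** gamma to every pixel."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]

# Hypothetical 2 x 2 image used only to exercise the functions.
image = [[0, 64],
         [128, 255]]

flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
brightened = gamma_correct(image, 0.5)  # gamma < 1 brightens mid-tones
```

Black (0) and white (255) pixels are fixed points of the gamma mapping, so only the mid-tones change.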

Methodology

To tackle the challenge of plant disease identification and classification, we adopt feature extraction and fine-tuning from among the existing TL approaches, which include intermediate layers, fine-tuning, and feature extraction. The selected pre-trained CNNs are used as feature extractors: the output of the last convolutional layer is used as a feature vector for the new task. Then, the CNNs are fine-tuned on the new dataset. The weights of the lower layers are frozen, and only the weights of the upper layers are updated. TL can save resources, as the model does not need to be trained from scratch.
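The freezing step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the MATLAB implementation used in the study: the `Layer` class, the layer names, the weights, and the gradients are all hypothetical, and a plain gradient step stands in for the real optimizer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    name: str
    weights: List[float]
    trainable: bool = True

def freeze_lower_layers(layers, n_frozen):
    """Mark the first n_frozen layers as frozen; the rest stay trainable."""
    for i, layer in enumerate(layers):
        layer.trainable = i >= n_frozen

def gradient_step(layers, grads, lr=0.01):
    """Update only the trainable layers; frozen layers keep pretrained weights."""
    for layer, g in zip(layers, grads):
        if not layer.trainable:
            continue
        layer.weights = [w - lr * gi for w, gi in zip(layer.weights, g)]

# Hypothetical three-layer network with "pretrained" weights.
layers = [
    Layer("conv1", [0.5, -0.2]),
    Layer("conv2", [0.1, 0.3]),
    Layer("fc", [0.0, 0.0]),
]
freeze_lower_layers(layers, n_frozen=2)
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # made-up gradients
gradient_step(layers, grads, lr=0.1)
```

After the step, `conv1` and `conv2` are unchanged and only `fc` has moved, which is the behavior described for the lower and upper layers.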

Therefore, we apply TL to the nine most recent pretrained deep networks, namely DenseNet201, DenseNet101, ResNet50, ResNet18, GoogleNet, AlexNet, EfficientNetB7, NASNetMobile, and ConvNeXtSmall, for feature extraction to aid the classification process. The LR classifier then evaluates performance at the individual model level, utilizing the weights obtained from these networks. A comparison is then made among the arithmetic average ensemble (AAE), early fusion (EA), and lead voting ensemble (LVE), commonly referred to as majority voting. Finally, the LR classifier replaces a shallow network block for fusion in the PDDNet technique and the final layers of the deep neural networks in the PDDNet-LVE method.

The image input size differs depending on the selected pretrained deep network architecture, as the second-to-last column of Table 3 illustrates. For example, AlexNet and DenseNet201 require inputs of 227 x 227 x 3 and 224 x 224 x 3, respectively, at the input layer. Furthermore, because diverse CNNs were selected for these experiments, the initial and subsequent convolutional layers use different kernels; for instance, in DenseNet201 all convolutional layers use 3x3 kernels, whereas in ResNet101 the initial convolutional layer uses a 7x7 kernel and all subsequent convolutional layers use 3x3 kernels.

In ResNet50, the initial convolutional layer uses a 7x7 kernel, and all subsequent convolutional layers use 3x3 kernels. In GoogleNet, the initial convolutional layer uses 7x7 kernels, most of the subsequent convolutional layers use 1x1 kernels, and a few layers use 3x3 kernels. In AlexNet, the initial convolutional layer uses an 11x11 kernel, and the subsequent convolutional layers use 3x3 kernels. Lastly, in ResNet18, the initial convolutional layer uses a 7x7 kernel, and all subsequent convolutional layers use 3x3 kernels, as used during the experiment.

The proposed model approaches were executed using the MATLAB 2022b DL toolbox. The PlantVillage dataset was divided into training, validation, and testing sets. Adaptive Moment Estimation (Adam) is applied as the optimizer since it is a stochastic optimization method suited to ML and TL. The recursive nature of the method enables the efficient solving of noisy linear systems and the estimation of extreme values of functions that are only accessible through noisy observations. Adam combines the benefits of stochastic gradient descent with momentum, the adaptive gradient algorithm, and root mean square propagation. In addition, batch sizes varied from 10 to 100 in steps of 10, and training saturated at 10 epochs. The selected networks were configured with a gradient threshold of 1, and the learning rate ranged between 0.001 and 0.1.
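To make the optimizer description concrete, the following Python sketch applies the standard Adam update to a one-dimensional quadratic. It is an illustration under stated assumptions rather than the MATLAB training configuration: the objective, the step count, the learning rate of 0.05, and the commonly used defaults `beta1 = 0.9`, `beta2 = 0.999`, and `eps = 1e-8` are assumptions, and the gradient threshold is modeled here as simple value clipping.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000, grad_threshold=1.0):
    """Minimize a scalar function with Adam, given its gradient function."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # Clip the gradient to [-threshold, threshold] (assumed clipping style).
        g = max(-grad_threshold, min(grad_threshold, g))
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum term)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (RMS term)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The first-moment estimate plays the role of momentum and the second-moment estimate rescales the step per parameter, which is the combination the paragraph above attributes to Adam.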

PDDNet‑AAE

In this method, we experimented with an arithmetic average ensemble based on late fusion. Initially, TL was applied to the architectures DenseNet201, DenseNet101, ResNet50, ResNet18, GoogleNet, AlexNet, EfficientNetB7, NASNetMobile, and ConvNeXtSmall. The focal contribution here is to replace the last three layers of these CNNs with new layer definitions: the fully connected layer (designed to learn features from the images), the softmax layer (the normalized exponential function, which converts real numbers into a probability distribution over outcomes), and the classification layer (which follows the softmax layer and assigns inputs to mutually exclusive classes via the cross-entropy function). After the fine-tuning procedure, the effectiveness of every pretrained TL model was analyzed using the data prepared for testing. Finally, the results of the PDDNet-AAE ensemble were obtained by averaging the outputs of the fine-tuned networks.
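The averaging step can be sketched as follows. This is a minimal illustration, not the study's MATLAB code: the function name and the softmax scores for three models and three classes are hypothetical.

```python
def average_ensemble(score_lists):
    """Average per-class scores across models and return (averages, argmax)."""
    n_models = len(score_lists)
    n_classes = len(score_lists[0])
    avg = [sum(scores[c] for scores in score_lists) / n_models
           for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: avg[c])
    return avg, predicted

# Hypothetical softmax outputs of three fine-tuned networks for one image.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
avg_scores, predicted_class = average_ensemble(scores)
```

Because each model's scores sum to one, the averaged scores also sum to one, so the ensemble output can still be read as a probability distribution over classes.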

PDDNet‑EA

For early fusion, deep features are produced from the fully connected layers of numerous deep networks and concatenated using the methodology presented (Section "PDDNet‑AAE"), and the LR classifier is then trained on the concatenated features. Figure 2 gives an overview of the method's flow diagram.

Fig. 2

General overview of the PDDNet-EA model
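The early fusion idea, concatenating per-network features and training a logistic regression classifier on the result, can be sketched in plain Python. Everything concrete here is an assumption for illustration: the two "networks", their feature values, the binary labels, and the small gradient-descent trainer, which stands in for the LR classifier used in the study.

```python
import math

def early_fusion(feature_sets):
    """Concatenate the feature vectors of each sample across networks."""
    return [sum((list(f) for f in per_sample), [])
            for per_sample in zip(*feature_sets)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Binary logistic regression trained with per-sample gradient steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical deep features from two networks for four leaf images.
net_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
net_b = [[0.7], [0.9], [0.2], [0.1]]
labels = [1, 1, 0, 0]  # 1 = diseased, 0 = healthy (made-up labels)

fused = early_fusion([net_a, net_b])
w, b = train_logistic_regression(fused, labels)
predictions = [predict(w, b, x) for x in fused]
```

A multi-class setting such as the 15 PlantVillage categories would need one such classifier per class or a softmax formulation; the binary case is kept here for brevity.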

Following the flow shown in Fig. 4, the classifier is trained on the deep features aggregated from numerous pretrained networks. Additionally, we employed various combinations of the six defined networks to determine the class label with the proposed PDDNet model. Notably, these pretrained networks were utilized in this ensemble.

PDDNet‑LVE

We started by extracting deep features from the fully connected layers of these architectures. Then, the final three layers of the previously trained deep network architectures were replaced with the LR classifier. The deep features extracted from every architecture were used to train the classifier. Finally, lead voting by majority (LVE) was applied to the predicted labels for all classes within the PlantVillage dataset. The class label receiving the most votes served as the final decision of the method (LVE), as depicted in Fig. 3.

Fig. 3

General overview of the PDDNet-LVE model
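The voting step can be sketched with Python's standard library. The model names and predicted labels are hypothetical, and the tie-breaking rule (the label seen first wins, which is how `Counter.most_common` orders equal counts) is an assumption, since the text does not state how ties are resolved.

```python
from collections import Counter

def lead_vote(votes):
    """Return the most frequently predicted label among the votes."""
    return Counter(votes).most_common(1)[0][0]

def lead_voting_ensemble(predictions_per_model):
    """Combine per-model label predictions sample by sample."""
    return [lead_vote(sample_votes)
            for sample_votes in zip(*predictions_per_model)]

# Hypothetical class labels predicted by three classifiers for three images.
predictions = [
    ["D1", "D4", "D7"],  # classifier on network A features
    ["D1", "D5", "D7"],  # classifier on network B features
    ["D2", "D4", "D7"],  # classifier on network C features
]
final_labels = lead_voting_ensemble(predictions)
```

Unlike score averaging, this scheme uses only the predicted labels, so a single overconfident network cannot outweigh the others.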

Obtained results and discussion

This section presents the results and the corresponding discussion of the proposed models, which integrate an ensemble LR classifier that uses deep features with averages of the CNN models. The proposed models are based on deep feature extraction, and we tested three approaches, namely AAE, EA, and LVE, employing pretrained networks.

We used the PlantVillage dataset to test the approach described in subsection "Methodology". This dataset includes color, gray-scaled, and segmented image categories, encompassing healthy and unhealthy plant leaf species collected and utilized in their natural ecological setting. Table 4 lists the dataset classes by disease name. Table 5 shows the plant types and the number of image samples used in the training and testing phases of this research; the computer and simulation parameters are presented in Table 6.

Table 4 PlantVillage dataset based on the disease names
Table 5 PlantVillage dataset based on plant type
Table 6 Accuracy performance for every class using different classifiers (for plant class identifier, consider Fig. 1 for details) with the proposed model

We discuss the results and performance assessments in detail in the following subsections. The experiments were conducted using the MATLAB 2022b simulator on an NVIDIA GeForce RTX 2070 with DirectX runtime version 12.0.

Performance evaluation of classifiers

Five standard classifiers, namely K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), LR, and Naive Bayes (NB), are often used in deep learning methodologies to evaluate pretrained CNNs. Table 7 illustrates the testing accuracy for every class using different classifiers. Moreover, model performance is further assessed in terms of F1-score, accuracy, recall, and precision, computed from False Positives (FP), False Negatives (FN), True Negatives (TN), and True Positives (TP).

Table 7 Performance assessment of PDDNet‑EA and PDDNet‑LVE models

TP represents the number of instances correctly predicted as positive by the model, that is, cases where the model predicts the positive class correctly. TN represents the number of instances correctly predicted as negative, that is, cases where the model predicts the negative class correctly. FP represents the number of instances incorrectly predicted as positive: the model predicted the positive class when the actual class was negative. Finally, FN denotes the number of instances incorrectly predicted as negative: the model predicted the negative class when the actual class was positive.
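For a multi-class problem, these four counts are usually obtained per class by treating one class as positive and all others as negative. The sketch below assumes that one-versus-rest convention; the function name and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN for one class treated as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical ground-truth and predicted labels for six images.
y_true = ["D2", "D2", "D3", "D3", "D2", "D3"]
y_pred = ["D2", "D3", "D3", "D2", "D2", "D3"]
tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive="D2")
```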

Accuracy

Accuracy is the proportion of correct predictions out of the total number of data points (T). In the scientific literature, it is also referred to as recognition rate, correctness, or success rate, and it is expressed in Eq. 1.

$$\mathrm{Accuracy }=({\text{TN}}+{\text{TP}})/{\text{T}}$$
(1)

Precision

Precision is the proportion of true positive samples among all samples predicted to be positive, calculated as presented in Eq. 2.

$$\mathrm{Precision }={\text{TP}}/({\text{TP}}+{\text{FP}})$$
(2)

Sensitivity and recall

The term "sensitivity" or "recall" refers to the proportion of correctly anticipated positives to the total number of actual positive results (Eq. 3).

$$\mathrm{Recall }={\text{TP}}/({\text{FN}}+{\text{TP}})$$
(3)

F1-score

The F1-score refers to the harmonic mean of precision and sensitivity (recall), expressed in Eq. 4.

$${\text{F}}1 =2*\left({\text{Recall}}*{\text{Precision}}\right)/\left({\text{Recall}}+{\text{Precision}}\right)$$
(4)
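Eqs. 1 to 4 translate directly into code. The sketch below is illustrative: the counts passed in are made-up numbers, and returning 0.0 when a denominator is zero is an assumption, since the equations leave that case undefined.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn                                  # T in Eq. 1
    accuracy = (tn + tp) / total                               # Eq. 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0           # Eq. 2
    recall = tp / (fn + tp) if (fn + tp) else 0.0              # Eq. 3
    denom = recall + precision
    f1 = 2 * (recall * precision) / denom if denom else 0.0    # Eq. 4
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one disease class.
metrics = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```

With these counts, accuracy is 175/200 and precision is 90/100, while the F1-score equals 2TP / (2TP + FP + FN) = 180/205, which is an equivalent form of Eq. 4.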

Based on the testing accuracies presented in Table 6, LR obtained 93.88% on average, NB 70.98%, KNN 84.93%, SVM 91.55%, and RF 78.43%. We therefore selected LR for the experiments. The results also lead to the conclusion that increasing the data size markedly improves accuracy. Table 7 presents the precision and recall values and F1 scores. Finally, Table 8 illustrates the accuracy scores obtained with different batch sizes in the LR classifier.

Table 8 Accuracy scores obtained with different batch sizes in the LR classifier

Performance evaluation on deep learning

We fine-tuned the previously trained CNN models using the DL methodology to evaluate these networks. Fine-tuning was accomplished by replacing the last three layers of each deep CNN with new layers for our plant disease detection and classification problem, as described earlier. We examined the fine-tuning accuracy to observe the effect of TL on the overall performance of the counterparts. After hyperparameter fine-tuning, we set the minimum batch size to sixteen, the maximum number of epochs to 10, and the weight decay to 0.0001, and the learning rate ranged from 0.001 to 0.01. Mini-Batch Stochastic Gradient Descent (MB-SGD) was applied as the learning optimization approach for the deep neural networks. In total, 5000 iterations were completed during training, and the obtained accuracies are presented in Table 9. Bold figures in all tables denote the best-performing model.

Table 9 Fine-tuned accuracy scores of pretrained networks in percentages

According to Table 9, DenseNet201 achieved the highest accuracy among the pretrained models fine-tuned with transfer learning, at 93.48%, while AlexNet achieved the lowest, at 86.93%. Both results can be compared with those attained using transfer learning on the DenseNet201 architecture. It is further observed that accuracy improves as model complexity increases. Building on these results, the last layer of each model was replaced with the LR classifier, which was fed with deep features extracted from the pretrained CNNs; the resulting accuracies are presented in Table 10.

Table 10 Obtained accuracy after replacing the last layer with a logistic regression (LR) classifier

The LR classifier was configured with quadratic and cubic kernel functions, tenfold cross-validation, and the "one versus all" strategy, a configuration that proved to be the most effective. According to Table 11, the DenseNet201 model achieved an accuracy of 94.86% when detecting plant diseases. This was the highest accuracy attained after several rounds of fine-tuning. Notably, the results in Table 10 improve on those in Table 9, demonstrating that using LR as the last layer is advantageous. We therefore use LR with deep features from the other pretrained models in the remaining experiments.

Table 11 Accuracy of the final fusion centered on fine-tuned DNs (refer to Fig. 3 for disease numbering)

Performance evaluation on the PDDNet‑AAE model

To evaluate the proposed model, the above-mentioned pretrained CNNs are combined by averaging their scores for each class, as demonstrated earlier in [64]. The accuracy was calculated using score-based fusion of the best-performing deep CNNs, as Table 11 demonstrates. Weighted by the class distribution, the average accuracy was 93.7%.
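The score averaging and the class-weighted accuracy described above can be sketched as follows. This is an illustrative outline under simple assumptions, not the authors' implementation: each network is assumed to output one score per class, and all function names and numbers are hypothetical.

```python
def average_scores(per_network_scores):
    """Element-wise mean of the class-score vectors of several networks."""
    n = len(per_network_scores)
    return [sum(column) / n for column in zip(*per_network_scores)]


def predict_class(per_network_scores):
    """Index of the class with the highest averaged score."""
    fused = average_scores(per_network_scores)
    return max(range(len(fused)), key=fused.__getitem__)


def weighted_accuracy(per_class_accuracy, class_counts):
    """Average of per-class accuracies weighted by class size."""
    total = sum(class_counts)
    return sum(a * c for a, c in zip(per_class_accuracy, class_counts)) / total


# Hypothetical scores from three networks over three classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
]
fused = average_scores(scores)
label = predict_class(scores)
acc = weighted_accuracy([0.9, 1.0], [30, 10])
print(fused, label, acc)
```

In this toy case the first network alone would choose class 0, but the averaged scores favor class 1, which shows how score-level fusion can overrule a single network.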

Performance evaluation on the PDDNet‑EA model

The hypothesized early-fusion CNN-LR model was developed by combining the deep features gathered from the CNNs (as Fig. 2 demonstrated). Testing several combinations of the six selected CNNs gave the outcomes in Table 12, reported in its subsequent columns as the average accuracy and the standard deviation of those scores. For example, based on Table 13, the PDDNet-AAE model's maximum accuracy score was 96.79% using the DenseNet201, ResNet101, AlexNet, ResNet50, and GoogleNet networks. Adding a pretrained ResNet18 alongside ResNet50 and ResNet101 is therefore not productive, as most combinations achieve their best results without it.
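A minimal sketch of the early-fusion idea follows, assuming that fusion means concatenating the deep-feature vectors of the selected networks into one vector before passing it to the LR classifier. The feature values and dimensions are hypothetical, the helper names are illustrative, and the network list uses the six names mentioned in this section.

```python
from itertools import combinations

# The six networks named in this section.
NETWORKS = ["AlexNet", "GoogleNet", "ResNet18",
            "ResNet50", "ResNet101", "DenseNet201"]


def early_fusion(features_by_network, selected):
    """Concatenate the feature vectors of the selected networks, in order."""
    fused = []
    for name in selected:
        fused.extend(features_by_network[name])
    return fused


def candidate_subsets(networks, min_size=2):
    """Yield every combination of networks with at least min_size members."""
    for size in range(min_size, len(networks) + 1):
        yield from combinations(networks, size)


# Toy feature vectors for one image (real deep features are far longer).
features = {
    "AlexNet": [0.1, 0.2],
    "ResNet50": [0.3],
    "DenseNet201": [0.4, 0.5, 0.6],
}
fused_vector = early_fusion(features, ["AlexNet", "DenseNet201"])
n_subsets = sum(1 for _ in candidate_subsets(NETWORKS))
print(fused_vector, n_subsets)
```

With six networks there are 57 combinations of two or more, which is small enough to evaluate exhaustively when comparing mean accuracy and standard deviation across combinations.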

Table 12 The PDDNet-EA and PDDNet-LVE model results (σ = standard deviation)
Table 13 Comparison of proposed network models on the accuracy scores with pretrained networks

Performance evaluation on the PDDNet‑LVE model

The PDDNet-LVE results are based on the lead (majority) vote over the class labels predicted by the LR classifier with deep features, as presented in Fig. 3 and the last column of Table 12. The maximum accuracy of the PDDNet-LVE model was attained with a combination of AlexNet, DenseNet201, ResNet50, ResNet101, and GoogleNet. This combination yielded accuracy scores of 96.94% and 97.79% for the EA and LVE models, respectively. These findings are consistent with those in Table 13, which shows that the best outcomes were achieved with all CNNs.
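Lead (majority) voting over per-network class labels can be sketched as below. The tie-breaking rule, which picks the tied label that appears first in the input, is an assumption made for this illustration, as the text does not specify one. The function name and labels are hypothetical.

```python
from collections import Counter


def lead_vote(labels):
    """Return the label predicted by the most classifiers.
    Ties go to the tied label that appears first (assumed rule)."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


# Hypothetical predictions from five CNN + LR pipelines for one leaf image.
votes = ["Late Blight", "Early Blight", "Late Blight",
         "Leaf Mold", "Late Blight"]
winner = lead_vote(votes)
print(winner)
```

Using an odd number of voters, such as the five-network combination reported above, reduces the chance of ties but does not eliminate them in a multi-class setting.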

Comparison with cutting-edge models

As demonstrated earlier, CNNs have been widely used in object classification, object recognition, and, most recently, object detection. Since the most common pretrained deep networks were considered, DenseNet201, ResNet101, ResNet50, GoogleNet, ResNet18, AlexNet, EfficientNetB7, NASNetMobile, and ConvNet were compared, based on documented accuracy, with the most recently published results on plant disease classification [83]. Table 13 reports the accuracy of the models used in the experiments, and Table 14 lists recently proposed models that use some or all of the pretrained models considered in this study.

Table 14 Comparison of proposed network models on the accuracy scores with pretrained networks

The study considers the tomato class, with 16,703 plant images obtained from the PlantVillage dataset: 1,591 healthy leaves, 373 Mosaic Virus, 3,209 Yellow Leaf Curl Virus, 1,404 Target Spot, 1,676 Spider Mites (Two-Spotted Spider Mite), 1,771 Septoria Leaf Spot, 952 Leaf Mold, 1,909 Late Blight, 1,000 Early Blight, and 2,127 Bacterial Spot images, as presented within the dataset. The classification results after 10 epochs are shown in Fig. 4 for two of the considered pretrained models, namely ResNet101 and DenseNet201. Figure 5 presents a confusion matrix obtained after replacing the first and second modified layers (i.e., a fully connected and a softmax layer) of EfficientNet and ConvNet. Figure 6 presents a confusion matrix of the two proposed models. Note: 1 through 10 on the horizontal axis denote the ten tomato leaf image categories.

Fig. 4 Tomato leaves (PlantVillage) classification results of the best-performing selected CNNs

Fig. 5 Tomato leaves (PlantVillage) classification results after layer replacement

Fig. 6 Tomato leaves (PlantVillage) classification results of the proposed models

Conclusion

In this research, early fusion and lead voting ensembles were introduced, combining nine pretrained CNNs fine-tuned for deep feature extraction. Using TL and 15 classes of the PlantVillage dataset, the models outperformed the individual CNNs in plant and disease detection, with 96.74% and 97.79% accuracy. These models are robust and generalizable, providing practical solutions that improve the accuracy and effectiveness of plant disease detection and classification, and thereby support better agricultural practices and sustainable food production as the population grows. The findings emphasize the significance of advanced technology in mitigating concerns associated with plant disease classification and detection.

Future research will focus on resolving issues related to real-time data collection and on creating a multi-object deep learning model capable of identifying plant diseases from a cluster of leaves rather than a single leaf, alongside comparative statistical analysis. Moreover, we are striving to implement a mobile application or web-enabled service that uses the trained model derived from this research to support the wider plant disease research community and benefit the agricultural sector. Finally, to move toward more lightweight disease classification, model quantization and object localization networks are critical for better spotting the leaves of a species against a complex background using the trending vision transformers.