1 Introduction

In the global economy, agriculture is critical, and population growth together with the COVID-19 pandemic has placed the agricultural system under increasing strain. After wheat and rice, potato is currently the third most important food crop in the world; regarded by over a billion people as a primary staple, its global production exceeds 300 million metric tons each year (Oppenheim et al. 2019). In addition to being a considerable source of calories for humanity, potatoes are also widely used as industrial raw materials. However, the potato crop is susceptible to infection by diverse diseases. Early detection and diagnosis help suppress epidemics of potato plant diseases, whereas the traditional approach of visual observation requires constant supervision of plants. It is inefficient, subjective, labor-intensive, and cannot be applied at scale (Marino et al. 2019a; Al-Hiary et al. 2011). Hence, there is a great need, and significant practical importance, in seeking a simple, quick, and effective tool for automatically recognizing potato plant diseases.

In the recent literature, new methods of plant disease identification have been proposed alongside the rapid advancement of digital cameras and computational capacity. Increasing attention has been paid to the research and application of machine learning (ML) and image processing techniques, which are becoming attractive alternatives for the continuous monitoring of plant diseases (Chen et al. 2021a). For instance, by integrating image processing and ML techniques, Islam et al. (2017) introduced a potato disease recognition model and successfully identified over 300 images with a 95% recognition accuracy. Using a hyperspectral imaging technique, Ji et al. (2019) recognized bruised potatoes through the discrete wavelet transform and attained a highest recognition accuracy of 99.82% for damaged potatoes. Gassoumi et al. (2000) recommended an artificial neural network (ANN) based method for identifying insect pests in cotton ecosystems; their method was implemented with good stability and achieved 90% accuracy. Using ANN, random forest (RF), and support vector machine (SVM) methods, Patil et al. (2017) carried out a comparative analysis for identifying potato disease images. In their experiments, the RF realized an accuracy of 79%, the SVM achieved 84%, and the ANN gained the highest accuracy of 92%. Although impressive results are reported in the literature, conventional ML methods also have some demerits, such as dependence on hand-crafted features, complicated image processing procedures, and low robustness. Recently, a novel ML technology named deep learning (DL), especially the convolutional neural network (CNN), has been introduced to address the most challenging tasks associated with image identification and classification (Junde et al. 2021; Pattnaik et al. 2020; Cristin et al. 2020; Shrivastava et al. 2019). It has also been applied in the field of plant disease recognition.
For example, using 2250 potato leaf images from the PlantVillage dataset, Al-Amin et al. (2019) trained a CNN model to identify different potato diseases, realizing a best recognition accuracy of 98.33%. By applying transfer learning (TL), Islam et al. (2019) recognized three categories of potato leaf images, including 1000 late blight leaves, 1000 early blight leaves, and 152 normal leaves. Their experimental results revealed that TL outperformed the compared methods, reaching a 99.43% test accuracy with a 4:1 ratio for splitting the training and test sets. Marino et al. (2019b) designed a CNN model to locate regions of potato defects, realized through a heat-map output; they performed the classification of potato defects and achieved an average F1-score of 0.94. Besides, applying an ensemble CNN model, Nanni et al. (2020) performed the detection of plant insect pests and attained an advanced accuracy of 92.43%. Based on the squeeze-and-excitation (SE) MobileNet, Chen et al. (2021b) proposed a method to identify paddy diseases and attained an average identification accuracy of 99.78% for recognizing paddy disease types on a publicly accessible dataset. Generally speaking, there are two varieties of DL methods for crop disease recognition: strongly-supervised and weakly-supervised approaches. The strongly-supervised method primarily adopts object detection techniques based on extensive manually-annotated information such as coordinate data, bounding boxes, and key points of the target objects. Undoubtedly, obtaining such a volume of annotation data for training models is tedious and labor-intensive. The alternative is a weakly-supervised scheme, which requires only the label information of images, e.g., plant disease images of the same disease type are stored in the same folder, and no detailed annotation information is needed.
Therefore, more and more research has focused on fine-grained image classification using a weakly-supervised learning strategy, which is also adopted in our work. On another front, despite reasonably good findings reported in the literature, deep CNN (DCNN) based methods need a great number of annotated samples to train the model, which poses a challenging problem for DCNN models. In particular, the large size of DCNN models also limits the deployment on portable devices of plant disease identification models that can run offline in practical applications. As a consequence, this study puts forward a lightweight network architecture for recognizing potato diseases. The pre-trained MobileNet-V2 was chosen as the base feature extractor of the network, and to enhance the learning of minute plant lesion characteristics, we modified the classical MobileNet-V2 architecture. The atrous convolution along with the SPP module was incorporated into the network, followed by a hybrid attention module comprising sequential channel-wise and spatial attention mechanisms. In this manner, the inter-channel dependencies and spatial point features are captured, thereby improving the accuracy of the model. Overall, our work makes the following specific contributions:

  • A lightweight MobS_Net model was proposed for recognizing potato plant diseases with an accuracy of 97.73%, outperforming other state-of-the-art methods.

  • The traditional MobileNet-V2 was modified and the atrous SPP was incorporated into the pre-trained network, followed by a hybrid attention mechanism to improve the capability of feature extraction.

  • We enhanced the Focal-Loss (FL) function so that it can address multi-class classification tasks. To alleviate the data-imbalance problem and make the model pay more attention to positive samples, we used the enhanced Focal-Loss (EFL) function in place of the traditional Cross-Entropy one.

The remainder of this paper is organized as follows. Section 2 introduces the materials used and the proposed approach, and discusses the methodology in detail. Section 3 is dedicated to the experimental analysis, in which extensive experiments are performed along with an ablation study. Section 4 concludes the paper with a summary and specific recommendations for future work.

2 Materials and methods

2.1 Materials

We collected the materials from diverse sources. Many images are derived from the open-access dataset of the 2018 AI Challenger Contest (www.challenger.ai, AI dataset), a broad collection of plant leaf images used for testing ML algorithms for plant disease identification. It is essential to emphasize that the potato leaf images of this dataset are sourced from the PlantVillage repository, where the samples were taken under controlled background and brightness conditions. This potato image dataset contains 3276 potato leaf images categorized into five classes: early blight fungus serious, early blight fungus general, late blight fungus serious, late blight fungus general, and healthy. In other words, each disease type includes two severity levels (serious and general), plus a healthy category. Additionally, the remaining images come from a local dataset collected under real field conditions with complicated backgrounds and uneven lighting intensities. This local dataset comprises 812 images in total: 363 healthy samples, 207 early blight samples, 109 late blight samples, and 133 potato virus disease samples. Note that these images vary widely: some have little background noise, while others have high noise interference. Potato plant images belonging to the same disease type are stored in the same folder; only the category information is labeled per folder, and detailed annotation data are not required by weakly-supervised methods. In this way, the sample dataset was formed and employed for the potato disease recognition experiments. Figure 1 shows some sample images, and the details of these samples are summarized in Table 1.
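The folder-per-class labeling scheme described above can be sketched in a few lines. The helper below is purely illustrative (the name `load_weak_labels` and the `.jpg` suffix are our assumptions, not part of the original pipeline); it collects image paths with the enclosing folder name as the only label, exactly the weak supervision the dataset requires:

```python
from pathlib import Path

def load_weak_labels(root):
    """Collect (image_path, class_name) pairs from a folder-per-class layout.
    Only the folder name serves as the label; no per-image annotations
    are needed, matching the weakly-supervised scheme described above."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for img in sorted(class_dir.glob("*.jpg")):
                pairs.append((str(img), class_dir.name))
    return pairs
```

The resulting list of pairs can be fed directly to any generator that yields (image, class index) batches.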

Fig. 1
figure 1

The samples of plant disease images

Table 1 The details of the dataset

2.2 Related work

2.2.1 MobileNet-V2

MobileNets, which can be deployed on portable devices for image recognition and classification, are a series of lightweight networks built on the depth-wise separable convolution (DSC) and a streamlined structure (Sifre and Mallat 2014). DSC splits a standard convolution into a depth-wise convolution (DC) and a point-wise convolution (PC). DC performs the convolution on each channel of the input maps with a single filter per channel, and PC performs a 1 × 1 convolution on the output of DC. DC and PC are calculated as in Eqs. (1) and (2), respectively.

$$DC(\theta ,x)_{(i,j)} = \sum\limits_{w = 0}^{W} {\sum\limits_{h = 0}^{H} {\theta_{(w,h)} \cdot x_{(i + w,j + h)} } }$$
(1)
$$PC(\theta ,x)_{(i,j)} = \sum\limits_{k = 0}^{K} {\theta_{k} \cdot x_{k,(i,j)} } ,$$
(2)

where H and W stand for the height and width of the convolution kernel, θ denotes the filter weights, and (i, j) indexes the position of input x. In this way, DC does not change the number of channels, and PC combines the outputs of DC, as expressed in Eq. (3).

$$DSC(\theta_{p} ,\theta_{d} ,y)_{(i,j)} = PC_{(i,j)} (\theta_{p} ,DC_{(i,j)} (\theta_{d} ,y))$$
(3)

Consequently, the output of the DSC can be obtained. In addition to DSC, MobileNets place batch normalization (BN) after the convolution layers to alleviate the vanishing-gradient problem in the back-propagation (BP) procedure. On this basis, MobileNet-V2 (Sandler et al. 2018) introduces the linear bottleneck framework along with the inverted residual block to further address the risk of vanishing gradients, attaining improvements over V1.
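To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch of a depth-wise separable convolution (valid padding, stride 1). The function names are ours, and a production model would use an optimized library implementation:

```python
import numpy as np

def depthwise_conv(x, theta):
    """Depth-wise convolution (Eq. 1): one H x W filter per input channel.
    x: (rows, cols, C) input; theta: (H, W, C) filters."""
    H, W, C = theta.shape
    out_r, out_c = x.shape[0] - H + 1, x.shape[1] - W + 1
    out = np.zeros((out_r, out_c, C))
    for c in range(C):
        for i in range(out_r):
            for j in range(out_c):
                out[i, j, c] = np.sum(theta[:, :, c] * x[i:i + H, j:j + W, c])
    return out

def pointwise_conv(x, theta):
    """Point-wise 1 x 1 convolution (Eq. 2): mixes the C input channels.
    x: (rows, cols, C); theta: (C, K) producing K output channels."""
    return np.tensordot(x, theta, axes=([2], [0]))

def separable_conv(x, theta_d, theta_p):
    """Depth-wise separable convolution (Eq. 3): PC applied to the DC output."""
    return pointwise_conv(depthwise_conv(x, theta_d), theta_p)
```

For an H × W kernel, C input channels, and K output channels, DSC needs H·W·C + C·K weights instead of the H·W·C·K of a standard convolution, which is the source of the MobileNet parameter savings.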

2.2.2 Atrous spatial pyramid pooling

Spatial pyramid pooling (SPP) (He et al. 2015a) is a pooling method that maps local characteristics to spaces of diverse dimensions and merges them. Besides generating fixed-size feature vectors, SPP lets the CNN architecture accept image inputs of different dimensions and extract multi-scale feature information of plant diseases or pests. The SPP module accepts features extracted from the backbone network and performs convolution at multiple scales to extract global contextual information. Suppose w and h represent the width and height of an input feature map; then, for a G × G SPP grid, the convolution kernel size f = fh = fw can be computed as fh = ⌈h/G⌉ and fw = ⌈w/G⌉, where ⌈·⌉ denotes the ceiling operation. However, because of growing parameters and computational loads, the convolution kernel cannot be made arbitrarily large, and the normal convolution has the demerit that the spatial resolution of the feature map is halved at each step. Therefore, atrous SPP was designed to alleviate this problem, with the atrous rate hyper-parameter set to r = 2. Compared with traditional SPP, atrous SPP enlarges the receptive field of the convolution while keeping the computational cost unchanged (Fig. 2).

Fig. 2
figure 2

The structure of Atrous SPP
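The two quantities that drive the atrous SPP design, the grid-dependent kernel size fh = ⌈h/G⌉ and the enlarged receptive field of a dilated kernel, can be sketched as follows. These are illustrative helpers; the effective-kernel formula k + (k − 1)(r − 1) is the standard one for dilated convolution and our assumption here:

```python
import math

def spp_kernel_size(h, w, G):
    """Kernel sizes for a G x G SPP grid: f_h = ceil(h/G), f_w = ceil(w/G)."""
    return math.ceil(h / G), math.ceil(w / G)

def effective_kernel(k, r):
    """Effective receptive field of a k x k atrous convolution with rate r:
    k_eff = k + (k - 1) * (r - 1); same weight count as the k x k kernel."""
    return k + (k - 1) * (r - 1)
```

For example, a 3 × 3 kernel with r = 2 covers a 5 × 5 field while keeping the cost of 9 weights, which is the trade-off exploited in Fig. 2.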

2.2.3 Channel-wise and spatial attention

Similar to the human visual attention mechanism, attention modules in deep learning help the model focus on useful features while suppressing irrelevant information. For this purpose, researchers have introduced many attention mechanisms, which can be classified as channel-wise attention (e.g., the SE block) (Hu et al. 2018), spatial attention (Wang et al. 2019), temporal attention (Woo et al. 2018), and so on. Among them, channel-wise attention excels at capturing the desired objects in multi-scale feature maps, while spatial attention is effective at locating the object regions in feature maps. In this study, unlike the single attention networks (channel-wise or spatial) used in recent research, we incorporated a hybrid attention mechanism combining the merits of both channel-wise and spatial attention into the plant disease identification model.

Suppose a feature map f ∈ RW×H×C is input into the attention module. The channel-wise attention first shrinks the feature map using global average pooling (GAP) to form a statistic z; an excitation operation is then executed to capture channel dependencies using the information accumulated in the shrinking phase, as calculated in Eq. (4).

$$s = F_{ex} \left( {W,z} \right) = \sigma \left( {g\left( {W,z} \right)} \right) = \sigma \left( {W_{2} \delta \left( {W_{1} z} \right)} \right),$$
(4)

where σ symbolizes the Sigmoid function, δ is the ReLU function (He et al. 2015b), W1 ∈ R(C/r)×C, W2 ∈ RC×(C/r), and r is the reduction-ratio hyper-parameter. In particular, W1 and W2 are realized by two fully connected layers around the non-linearity, and the output of the channel-wise attention module is obtained by rescaling uc with the activations s:

$$\tilde{x}_{c} = F_{scale} \left( {s_{c} ,u_{c} } \right) = u_{c} \cdot s_{c}$$
(5)

where \(\left[ {\tilde{x}_{1} ,\tilde{x}_{2} , \cdots ,\tilde{x}_{c} } \right]\) constitutes the output \(\tilde{X}\). Subsequently, the spatial attention module pools the input feature map to obtain the spatial attention map, as expressed in

$$F_{S} \left( {\tilde{X}} \right) = sigmoid\left( {c^{7 \times 7} \left( {\left[ {GMP\left( {\tilde{X}} \right);GAP\left( {\tilde{X}} \right)} \right]} \right)} \right),$$
(6)

where GMP and GAP denote global maximum pooling and global average pooling, respectively, and c7×7 denotes a 7 × 7 convolution. After that, the channel-wise attention (CWA) and spatial attention (SPA) are combined in a sequential cascade in our network, as written by

$$F_{att} = CWA(f) + SPA(f) = F_{C} (f) * f + f * F_{s} (f)$$
(7)

In Eq. (7), f denotes the input feature map, and * symbolizes a dot product operation.
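A minimal NumPy sketch of Eqs. (4)–(7) follows. For brevity, the 7 × 7 convolution of Eq. (6) is replaced by a fixed per-pixel weighting of the GMP and GAP maps (`w_max` and `w_avg` are illustrative stand-ins for the learned kernel), so this is a simplified sketch rather than the exact module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, W1, W2):
    """SE-style channel attention (Eqs. 4-5).
    f: (H, W, C); W1: (C//r, C); W2: (C, C//r)."""
    z = f.mean(axis=(0, 1))                      # squeeze: GAP -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # excitation: sigma(W2 ReLU(W1 z))
    return f * s                                 # rescale each channel (Eq. 5)

def spatial_attention(f, w_max=0.5, w_avg=0.5):
    """Spatial attention in the spirit of Eq. 6; the 7x7 convolution over
    [GMP; GAP] is simplified to a fixed weighting of the two pooled maps."""
    m = sigmoid(w_max * f.max(axis=2) + w_avg * f.mean(axis=2))  # (H, W)
    return f * m[:, :, None]

def hybrid_attention(f, W1, W2):
    """Eq. 7: sum of the channel-refined and spatially-refined features."""
    return channel_attention(f, W1, W2) + spatial_attention(f)
```

Note that the channel branch scales whole feature maps while the spatial branch scales individual pixel positions, which is why the two are complementary.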

2.3 Proposed approach

2.3.1 MobS_Net

As is well known, DL-based models usually involve a great number of parameters, occupy large volumes, and require substantial memory for training. Therefore, they are ill-suited for deployment in mobile phone applications because of the limited storage and computation capability of smartphones. In this study, we select a lightweight CNN as the backbone, choosing MobileNet-V2 as the backbone feature extractor of our plant disease recognition model. To improve the learning of minute plant disease features, we altered the classical MobileNet-V2 architecture via fine-tuning conducted with transfer learning. The atrous convolution along with the SPP module was embedded into the pre-trained network, followed by a hybrid attention mechanism containing channel-wise and spatial attention submodules to efficiently extract high-dimensional features of plant disease images.

More specifically, the atrous convolution layer designed in this study consists of 512 convolution kernels of size 3 × 3 with the atrous rate set to r = 2, which enlarges the convolutional receptive field. Then, following a BN layer that alleviates the vanishing-gradient problem, the SPP module is integrated into the network to generate fixed-size feature vectors and to let the CNN accept input images of different sizes, thereby extracting multi-scale image features efficiently. In addition, a hybrid attention mechanism that cascades channel-wise and spatial attention is embedded into the network, enabling it to infer the interdependence between channels and the importance of spatial points for intermediate features. Finally, the fully connected (FC) layer is replaced by a GAP layer, and a new FC Softmax layer with the actual number of classes is appended as the classification layer of the network. The newly formed network, which we term MobS_Net, is used to execute the task of potato disease recognition. It is noteworthy that the initial parameters of the network were initialized following He et al. (2015b).

Figure 3 portrays the network architecture of the proposed MobS_Net, where MobileNet-V2 pre-trained on ImageNet serves as the bottom convolution layers and the atrous SPP module is incorporated into the network for multi-scale feature extraction. In addition, a hybrid attention module comprising channel-wise and spatial attention is introduced to maximize the reuse of inter-channel relations and infer the importance of spatial points, thereby recalibrating the channel-wise and spatial features. In this manner, we aim to realize a trade-off between memory requirements and recognition accuracy, i.e., the model volume is compressed while the accuracy is improved as much as possible. Table 2 summarizes the relevant parameters of MobS_Net.

Fig. 3
figure 3

The structure of MobS_Net

Table 2 The major parameters of MobS_Net

2.3.2 Loss function

Generally speaking, the Cross-Entropy (CE) loss function is frequently employed in CNN models, and the formula of CE loss can be expressed by

$$L\left( {p_{k} } \right) = - \sum\limits_{k = 1}^{C} {y_{k} \log \left( {p_{k} } \right)} ,$$
(8)

where C signifies the number of classes, yk is an indicator variable (yk = 1 if k equals the true class of the sample, otherwise yk = 0), and pk denotes the predicted probability that the observed sample belongs to class k. Because the CE loss weights positive and negative samples equally, Lin et al. (2017) proposed the FL function to alleviate this sample-imbalance issue. The FL function is presented in Eq. (9).

$$FL\left( {p_{k} } \right) = - \left( {1 - p_{k} } \right)^{\gamma } \theta_{k} \log \left( {p_{k} } \right),$$
(9)

where γ is the modulating-factor hyper-parameter, and θk is the weighting factor for the class. It is worth pointing out that the classical FL function was developed for binary classification tasks in object detection. However, plant disease recognition is a multi-class task, so we modified the FL function and employed the enhanced FL (EFL) in place of the traditional CE function in the plant disease recognition model, as expressed in Eq. (10).

$$EFL\left( {p_{k} } \right) = - \sum\limits_{k = 1}^{C} {\theta_{k} \left( {1 - p\left( {k\left| x \right.} \right)} \right)^{\gamma } y_{k} \log \left( {p\left( k \right)} \right)}$$
(10)
$$\theta_{k} = {{count\left( x \right)} \mathord{\left/ {\vphantom {{count\left( x \right)} {count\left( {x \in k} \right)}}} \right. \kern-\nulldelimiterspace} {count\left( {x \in k} \right)}}$$
(11)
$$y_{k} = \left\{ \begin{gathered} 1, \, k = actual\_class \hfill \\ 0, \, k \ne actual\_class \hfill \\ \end{gathered} \right.,$$
(12)

where x denotes a sample, count(x) the total number of samples, and count(x ∈ k) the number of samples belonging to class k.
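Under these definitions, the EFL of Eqs. (10)–(12) can be sketched in NumPy as follows. The choice γ = 2 is a common default and our assumption, not a value fixed by the text:

```python
import numpy as np

def efl_loss(probs, labels, class_counts, gamma=2.0):
    """Enhanced Focal Loss for multi-class recognition (Eqs. 10-12).
    probs: (N, C) softmax outputs; labels: (N,) integer true classes;
    class_counts: (C,) per-class sample counts for the weights of Eq. 11."""
    n = len(labels)
    theta = class_counts.sum() / class_counts   # theta_k = count(x) / count(x in k)
    p_true = probs[np.arange(n), labels]        # p(k|x) of the true class (y_k = 1)
    losses = -theta[labels] * (1.0 - p_true) ** gamma * np.log(p_true)
    return losses.mean()
```

Note how a rare class receives a large θk through Eq. (11), and a well-classified sample (p close to 1) is down-weighted by the (1 − p)^γ factor, which is exactly the behavior the text motivates.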

3 Experimental analysis and results

3.1 Experimental setup

Apart from image pre-processing performed in Photoshop, the main algorithms were implemented in Python 3.6 with frequently-used libraries such as OpenCV3, TensorFlow, and Keras, accelerated by GPU. The experimental hardware comprised a GeForce RTX 2080 graphics card, 64 GB RAM, and an E5-2620V4 CPU.

3.2 Model training

As mentioned in Sect. 2.1, the potato leaf disease images are utilized in our experiments. Considering the imbalanced samples and the limited number of sample images, we utilized data augmentation to enrich the dataset. Commonly-used augmentation methods such as color jittering, random rotation, flipping, translation, and other geometric transformations were executed to augment the dataset. Color jittering alters the contrast, saturation, and brightness with a random adjustment variable in (0, 3.1); the rotation range is [0, 360°], the translation range is ±20%, and the scale varies from 0.9 to 1.1. Apart from preserving some original images to assess the model, the remaining sample images were randomly assigned to the validation and training sets at a ratio of 1:4. Besides, to compare the proposed approach with other advanced methods, five influential DCNNs, namely Xception, VGGNet-19, DenseNet-121, ResNet-50, and MobileNet-V2, were chosen as benchmarks in the comparison experiments. Using transfer learning (TL), the original classification layers of the networks were truncated, and new fully connected layers with Softmax activation were appended for classification, with the class number set to the actual number of potato plant disease types.
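The augmentation pipeline above can be sketched with NumPy alone. Arbitrary-angle rotation and rescaling would normally be delegated to an image library such as OpenCV, so rotation is restricted here to 90° steps to keep the illustration self-contained; the parameter ranges follow the text where stated, and the brightness range is a simplified stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """One random augmentation pass over an H x W x 3 image in [0, 1].
    A NumPy-only simplification of the pipeline described in the text."""
    if rng.random() < 0.5:                        # random horizontal flip
        img = img[:, ::-1]
    img = np.rot90(img, k=rng.integers(0, 4))     # rotation in 90-degree steps
    shift = int(img.shape[1] * rng.uniform(-0.2, 0.2))
    img = np.roll(img, shift, axis=1)             # translation within +/-20%
    img = img * rng.uniform(0.7, 1.3)             # simple brightness jitter
    return np.clip(img, 0.0, 1.0)
```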

With this method, the various DCNN models were built and their weights were initialized with parameters pre-trained on ImageNet (Russakovsky et al. 2015). The training hyper-parameters were set to a learning rate of 1 × 10−3, a mini-batch size of 64, 30 epochs, and the stochastic gradient descent (SGD) optimizer. Standard metrics were adopted to evaluate the network: Accuracy (Acc.), Sensitivity (Recall, Rec.), Specificity (Spe.), false positive rate (FPR), and F1-score (F1). The formulas of these evaluation metrics are expressed in Eqs. (13)–(17).

$$Acc. = \frac{TN + TP}{{TN + FN + TP + FP}}$$
(13)
$$Rec. = \frac{TP}{{TP + FN}}$$
(14)
$$Spe. = \frac{TN}{{TN + FP}}$$
(15)
$$FPR = \frac{FP}{{TN + FP}}$$
(16)
$$F1 = \frac{2TP}{{2TP + FP + FN}},$$
(17)

where TP is the number of correctly recognized positive samples, FN the number of positive samples mistakenly recognized as negative, FP the number of negative samples wrongly identified as positive, and TN the number of correctly recognized negative samples. Table 3 summarizes the training results, and the performance of the various methods is depicted in Fig. 4.
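Eqs. (13)–(17) translate directly into code; the helper below computes the per-class metrics from confusion-matrix counts:

```python
def binary_metrics(tp, tn, fp, fn):
    """Per-class evaluation metrics of Eqs. (13)-(17)
    from confusion-matrix counts."""
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),   # Eq. 13
        "Rec": tp / (tp + fn),                    # Eq. 14, sensitivity/recall
        "Spe": tn / (tn + fp),                    # Eq. 15
        "FPR": fp / (tn + fp),                    # Eq. 16
        "F1":  2 * tp / (2 * tp + fp + fn),       # Eq. 17
    }
```

In the multi-class setting, each class is treated as the positive class in turn and the per-class values are averaged, as in Table 5.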

Table 3 The accuracy of diverse methods
Fig. 4
figure 4

The training performance of the models

From Table 3, it can be seen that the proposed approach delivers superior performance relative to the other advanced methods. After training for 10 and 30 epochs, the proposed MobS_Net attains training accuracies of 98.63% and 99.87%, respectively. In particular, after 30 epochs the proposed approach realizes a validation accuracy of 94.57%, the best of all the algorithms. The crucial explanation for the efficiency of the proposed method is that the atrous SPP coupled with the hybrid attention mechanism is embedded into the network, which strengthens multi-scale feature extraction and maximizes the reuse of inter-channel relations and spatial point characteristics. Moreover, TL and the enhanced Focal Loss function are applied in model training, which lets the network obtain the optimum weights and alleviates the data-imbalance problem, thereby improving performance. By comparison, the other methods are single networks and do not attain ideal performance, even though TL and fine-tuning are used in their training. Additionally, the running time of the proposed method is 6.30 min, which is competitive among all the compared methods.

3.3 Ablation study

We carried out an ablation study of our model, analyzing the efficacy of the atrous SPP and hybrid attention modules on the test dataset of potato disease images. In the first ablation experiment, we separately removed the atrous SPP and hybrid attention modules from the network to investigate the training performance. We observed a minor decrease in the ablated models: the validation accuracy dropped to 92.76% (a decrease of 1.80%) when removing atrous SPP and to 93.67% (a decrease of 0.90%) when removing hybrid attention. Although the ablated models still outperform the baseline, they suffer a decline compared with the full MobS_Net architecture. This experiment indicates that both modules contribute to the performance gain of the proposed approach, with the removal of atrous SPP having the larger impact on accuracy. In the second ablation experiment, we evaluated the effect of the optimized loss function on potato disease recognition using the potato leaf image dataset. To do so, we replaced the enhanced Focal Loss function with the standard Cross-Entropy (CE) loss and observed a decrease in the results: the validation accuracy dropped to 92.41% (a decrease of 2.16%). This demonstrates that the enhanced Focal Loss (EFL) function delivers somewhat better results than the CE loss in our potato plant disease identification model. Table 4 summarizes the comparison results of the ablation experiments.

Table 4 The comparison results of ablation experiments

3.4 Recognition results

Using the trained MobS_Net model, we further performed the identification of potato plant diseases on new unseen samples (the test set), where unseen samples are potato leaf images never used during training or validation. Figure 5 depicts the identification results as a Receiver Operating Characteristic (ROC) curve and a confusion matrix. The related metrics are summarized in Table 5.

Fig. 5
figure 5

The visualization of the tested results

Table 5 The metrics assessment for the recognition results

As seen in Fig. 5a, the curves of most classes are close to the upper left corner of the figure, which indicates satisfactory operating points of the ROC curve. It can also be observed from the confusion matrix (Fig. 5b) that MobS_Net accurately identified most of the sample images. The samples in the healthy potato and Early_Blight categories were all properly identified by the proposed method. For the category of early blight fungus general, 18 of 29 samples were correctly recognized. Also, 67 of 73 early blight fungus serious samples were properly recognized by the proposed method, with the accuracy rate reaching 98.05%. In summary, a total of 319 of 413 test images were properly identified, and the average recognition Acc. attains 97.33%. The average Rec. and Spe. also reached no less than 91.99% and 98.39% respectively, as presented in Table 5.

Furthermore, a comparative analysis of our experimental results against some of the latest literature is summarized in Table 6, where most of the experimental materials are sourced from the potato leaf images of the PlantVillage dataset. As mentioned in Sect. 2.1, the public dataset we tested also comes from the PlantVillage repository, the same materials used by the other methods. In addition, we identified some local potato disease images with cluttered backgrounds and uneven illumination intensity, which undoubtedly increases the difficulty of potato disease recognition. Nevertheless, our method achieved competitive performance. The comparison results demonstrate the validity of the proposed approach relative to other state-of-the-art methods.

Table 6 Comparison with recent work

On the other hand, there are also several misidentified samples, such as 7 samples of "early blight fungus general" incorrectly recognized as "early blight fungus serious". Despite these individual misidentifications, most of the sample images were accurately recognized, and the errors primarily concern disease severity rather than disease type. This suggests that the proposed MobS_Net has a solid ability to recognize potato plant diseases. Figure 6 presents samples of recognized potato disease types.

Fig. 6
figure 6

The samples of recognized potato disease types

As shown in Fig. 6, the samples in the top row are the raw potato disease images, those in the middle row show the disease regions visualized with the classification activation map (CAM) technique, and those in the bottom row are the images recognized by the proposed method. It can be observed from Fig. 6 that the recognized classes of most samples agree with their actual disease types. For example, the real disease type of Fig. 6a is early serious, and this sample is accurately identified by the proposed method with a probability of 0.8513. Likewise, the sample of Fig. 6b is properly identified by MobS_Net with a high probability of 0.9052. The other samples, such as Fig. 6d and e, are also properly identified. Despite this impressive performance, there are also some misidentified instances, including the sample of Fig. 6c, which belongs to the category "potato early general" but is wrongly classified as "potato early serious". A certain ambiguity in the severity division of plant disease images may cause this issue. Additionally, irregular lighting intensities, which affect the feature extraction of plant disease images, can also lead to misidentification. Though individual samples were incorrectly recognized, most samples were correctly identified by the proposed approach, and the misidentifications primarily concern the severity level rather than the disease type. Moreover, the predicted probability of a misidentification is also relatively low, such as the 0.3609 of the sample in Fig. 6c. Consequently, on the basis of the experimental findings, it can be concluded that the proposed approach successfully performs the identification of potato plant diseases and may also be applied to other domains.
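For a network ending in GAP followed by a Softmax classifier, as in MobS_Net, the CAM used in Fig. 6 can be sketched as the class-weighted sum of the final convolutional feature maps. This is a minimal illustration; the exact visualization pipeline used for Fig. 6 is not specified in the text:

```python
import numpy as np

def class_activation_map(features, class_weights, k):
    """Class activation map for class k: weighted sum of the final
    convolutional feature maps, where the weights are the GAP-to-Softmax
    connections of class k. features: (H, W, C); class_weights: (C, n_classes)."""
    cam = features @ class_weights[:, k]              # (H, W) evidence map
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    return cam / cam.max() if cam.max() > 0 else cam  # normalize to [0, 1]
```

The normalized map is then upsampled to the input resolution and overlaid on the leaf image to highlight the lesion regions, as in the middle row of Fig. 6.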

4 Conclusions

Various plant diseases can have a disastrous impact on crop growth and food security. To guarantee an adequate food supply, the timely and effective identification of plant diseases is of great practical significance. The latest developments in DL have delivered an impressive alternative to traditional manual approaches for the automatic identification of plant diseases. Among DL methods, DCNNs are the most popular because they can extract image features automatically and perform the classification. However, owing to their great number of parameters and large volumes, classical DCNNs are not suitable for deployment in portable device applications, and they require a large number of annotated images for training, which is undoubtedly a challenging problem. To this end, this study proposes a novel lightweight network architecture named MobS_Net and uses transfer learning to implement the recognition of potato plant diseases. The pre-trained MobileNet-V2 was chosen as the backbone network of the model, and to enhance the learning of minute plant lesion characteristics, we altered the classical MobileNet-V2 architecture by incorporating the atrous convolution along with the SPP module into the network. Further, a hybrid attention module containing channel-wise and spatial attention submodules in sequence was embedded into the network to capture inter-channel dependencies and the significance of spatial points. Experimental findings demonstrate the effectiveness of the proposed method. In future work, we plan to deploy the model on mobile devices to monitor a broader range of crop disease information. Moreover, we would like to transplant the model to other domains such as online failure detection, computer-aided diagnosis, and virtual defect assessment.