An Efficient Plant Disease Recognition System Using Hybrid Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs) for Smart IoT Applications in Agriculture

In recent times, the Internet of Things (IoT) and deep learning models (DLMs) can be combined to build smart agriculture systems that locate the diseased part of a leaf on farmland efficiently. Convolutional neural networks (CNNs) have achieved state-of-the-art results in many aspects of human life, including the farming sector. Semantic image segmentation is considered a core problem in computer vision. Despite tremendous progress in applications, most semantic image segmentation algorithms fail to produce sufficiently accurate results because they lack sensitivity to fine details, have difficulty assessing the global similarity of image pixels, or both. Post-processing refinement methods, a critical means of correcting these underlying flaws, rely largely on Conditional Random Fields (CRFs). Plant disease prediction therefore plays an important role in the early notification of disease, alleviating its effects and supporting disease-forecasting research in the smart farming arena. Hence, this work proposes an efficient IoT-based plant disease recognition system that uses semantic segmentation methods such as FCN-8s, CED-Net, SegNet, DeepLabv3, and U-Net together with the CRF method to localize diseased parts of crop leaves. These networks are evaluated and compared with state-of-the-art networks. The experimental results are reported in terms of F1-score, sensitivity, and intersection over union (IoU). The proposed system with SegNet and CRFs yields better results than the other methods. The superiority and effectiveness of this refinement method, as well as its range of applicability, are confirmed through experiments.


Introduction
Crops are continually vulnerable to a full range of biotic factors such as pathogens (fungi, insects, bugs, bacteria, etc.), as well as to abiotic factors such as lack of irrigation, heat, cold, and salinity [1]. It is essential to note that crop diseases induced by microorganisms impair the normal state of the plant and modify its vital functions, which can lead to significant crop losses and act as a limiting factor on productivity [2]. Several measures can be enacted to stave off the spread of diseases on farms. In this context, Integrated Pest and Disease Management decreases the likelihood of crop losses and reduces the need for insecticides. Effective pest and disease management is important not just for diagnosis but also for measuring plant stress; both tasks are equally essential for plant pathology [3].
Consequently, the concept of Smart Farming appeared. Smart Farming stands for the implementation of modern Information and Communication Technologies (ICT) in agriculture, leading to what can be called the Third Green Revolution. After the plant breeding and genetics revolutions, this Third Green Revolution is taking over the agricultural world, relying on the combined application of ICT solutions such as precision equipment, sensors and actuators, the Internet of Things (IoT), geopositioning systems, Big Data, unmanned aerial vehicles (UAVs, drones), robotics, and so on. From the farmer's point of view, Smart Farming should offer added value in the form of better decision-making and more efficient operations and management.
Many standard techniques are used to support early detection of disease and thereby achieve high production [4]. However, the environmental pollution and crop contamination caused by these methods have serious effects on human health. With the advent of advanced technologies, several attempts have been made to use artificial intelligence to help farmers accurately detect the diseases affecting their production and the severity of symptoms. Computer-Aided Diagnostic Systems (CADS) give any farmer with a smartphone feasible and cheap access to specialist knowledge. The major difficulty for these self-sufficient platforms is detecting the exact position of the diseased part of the leaf [5], which is among the significant applications of deep learning in the agriculture sector.
In this context, the agricultural industry is enthusiastically adopting Artificial Intelligence (AI) to overcome difficulties such as a shrinking workforce and rising demand. During peak periods, farmers are required to hire specialist agricultural workers with farming experience to carry out various tasks, including sowing crops, picking fruits, removing weeds, and harvesting. Lately, many of these tasks are executed by machines, and disease detection is a significant computer vision application that helps farmers with this duty. Leaf disease image segmentation splits the image into a disease image that includes only disease pixels, and it directly affects the accuracy of feature extraction and disease diagnosis. Analysis of leaf disease image samples shows that healthy leaves are usually pure green, while the diseased area is usually yellow, white, or brown, i.e., not green. Consequently, leaf segmentation can rely on disease color features to split off the diseased area successfully. Following this idea, an algorithm is assembled for disease image segmentation: by applying mathematical morphology to the leaf image, pixels of the healthy area are set to black, and pixels of the diseased area are left unchanged. Highly sophisticated diseased leaf selection techniques are needed for practical applications such as pixel-level segmentation and the marking of different types of organisms in an image, and they are widely employed in areas such as space, the military, intelligent command, multimedia, and therapy.
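To make the color-based idea concrete, the following minimal OpenCV sketch blacks out the healthy green pixels and keeps the diseased pixels unchanged; the HSV thresholds and kernel size are illustrative assumptions, not values taken from this paper:

```python
import cv2
import numpy as np

def segment_disease(bgr_image):
    # Convert to HSV so that "green" can be thresholded robustly.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Illustrative range for healthy green tissue (assumed values).
    green = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
    disease_mask = cv2.bitwise_not(green)
    # Morphological opening removes small noisy speckles.
    kernel = np.ones((5, 5), np.uint8)
    disease_mask = cv2.morphologyEx(disease_mask, cv2.MORPH_OPEN, kernel)
    # Healthy pixels become black; diseased pixels are unchanged.
    return cv2.bitwise_and(bgr_image, bgr_image, mask=disease_mask)
```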
In recent years, machine learning technology focused on deep learning has gained attention. Self-driving cars embed deep learning processes that require the algorithm to detect and learn from images fed in as raw data. Early computer vision applications only needed the detection of essential elements such as edges (lines and curves) or gradients; understanding an image at the pixel level came about only with the coining of full-pixel semantic segmentation, which groups together the parts of an image that fall into the same object of interest. Semantic segmentation is the computer vision problem of taking raw data (images) as input and transforming it into a mask with the areas of interest highlighted. Many use the term full-pixel semantic segmentation, where each pixel in an image is assigned a class depending on which object of interest it belongs to. Semantic segmentation thus solves a problem earlier methods could not: it identifies what is present in the image and where, by finding all the pixels that belong to each object.
The majority of common learning methods for semantic image segmentation depend on the Fully Convolutional Network (FCN), which significantly enhanced segmentation accuracy and is the foundation of this research area. Even so, CNN architectures, with their inherent invariance to spatial shifts, still present the following problems: (1) up-sampling methods are not sensitive to fine details in images, even when systems are designed for them, so the up-sampled outputs remain smooth and blurry; (2) the correlation between pixels is not considered, and the spatial regularization step is neglected; the prevailing pixel-wise labeling approach leaves the label map lacking spatial consistency. These problems limit the applicability of semantic image segmentation algorithms to some extent, so investigators rely on post-processing methods such as the CRF, which merges the structured modeling capability of CRFs with the feature extraction power of CNNs to enhance the labeling results and produce more satisfying masks. The contributions of this work are summarized as follows:

• Develop a system for expertly identifying plant leaf diseases in smart farming systems, which can be used as an early trigger for recognition with the help of deep learning methods.

• Investigate disease identification based on semantic segmentation with deep learning and the CRF approach to efficiently demonstrate the detection and boundary recovery of plant leaf diseases.

• Suggest an IoT-based plant disease recognition system that can assist in the development of a new low-cost smart system using a portable mobile phone, image processing, and CNN-based transfer learning models for smart agriculture applications in the context of smart cities around the world.

• Present a systematic evaluation of the proposed system under various experimental conditions. The obtained results show the efficiency of the system in assisting farmers with the early identification of plant diseases in an energetic manner.
The remainder of this paper is organized as follows: related work is introduced in Sect. 2, while the architecture of the CNN models is presented in Sect. 3. The proposed prediction model for detecting the diseased part of the leaf using a hybrid CNN with CRF is described in Sect. 4, while the high-level IoT-based smart agriculture system is presented in Sect. 5. Section 6 presents the experiments and results analysis used to evaluate the proposed system, while the conclusion and future work are presented in Sect. 7.

Related Work
Several earlier studies addressed various leaf diseases, and various implementations have previously applied deep learning approaches to the identification, segmentation, and classification of diseases in several crops. The author in [6] principally adopts summary and induction methods for deep learning. First, it presents the general development and present situation of deep learning. Second, it describes the structural principles, characteristics, and some traditional deep learning models, such as the stacked autoencoder, deep belief network, deep Boltzmann machine, and convolutional neural network. Third, it introduces the most recent developments and applications of deep learning in many fields, such as computer vision, natural language processing, and medical applications. Finally, it discusses open problems and future research directions for deep learning. In [7], the author provides an Android app capable of segmenting and identifying several leaf lesion diseases and computing the intensity of stress caused by biotic agents in coffee leaves using a CNN. The drawbacks of this work are that networks with lower computational cost could be designed and that different coffee varieties could be classified using different feature sets; its benefit is an Android app that helps both experts and agriculturalists detect and measure biotic stresses using images of coffee leaves captured with a smartphone. In [8], frameworks based on different convolutional neural networks (CNNs) automate disease detection and pest recognition. Field photos captured with smartphones include coffee tree leaves. First, a Mask R-CNN instance segmentation model is applied; second, the PSPNet and UNet semantic segmentation algorithms are applied; and finally, a ResNet network is used for classification. More precise algorithms could be designed, and the authors neglect the treatment of missing parts in the ground truth of the datasets, but they do consider the comparison between instance and semantic segmentation.
In [9], a semantic segmentation technique based on a cascaded encoder-decoder system is proposed to separate weeds from plants. Current semantic segmentation networks for weeds and crops are very deep, with billions of parameters requiring long training periods. To overcome these limitations, the authors suggest training small networks in cascade to obtain coarse-to-fine predictions, which are then combined to produce the final result; however, they used only two networks to train and test their work. In [10], a network is developed that is capable of assessing diseases in images of wheat leaves acquired with mobile phones; the authors should consider additional evaluation metrics to demonstrate the performance of their work, and the remaining inaccuracy was the result of inappropriate diagnoses. In [11], deep residual neural network-based systems are used for the identification of several crop diseases from images acquired in the field via smartphone, trained using deep learning methods. Although augmentation is considered a great tool for increasing dataset size, in some cases augmentation may affect model accuracy; the accuracy of these models, up to 96%, is based on traditional deep learning algorithms.
In [12], the author proposed a mobile app for crop disease detection using fuzzy inference systems; one advantage of this system is its easy-to-use interface, but the authors did not consider time complexity. The author in [13] develops a system for the quantification and classification of leaf miner and rust on coffee leaves. The system applies a threshold-based method to determine the identification and severity of the respective symptoms individually.
The authors in [14] suggested a deep neural network-based analysis, moving from a single-task to a multi-task network able to assess four types of biotic stress and estimate their severity; the symptom rating accuracy was greater than 97%. One advantage of this system is its use of more than one architecture, such as AlexNet, GoogLeNet, VGG16, ResNet50, ResNet50*, and MobileNetV2. One constraint of this system is the limited coverage of the dataset, which includes only the major biotic stresses that impact coffee trees. The authors in [15] suggested a CNN-based system for identifying and classifying weeds as either grass or broadleaf; the superpixel segmentation algorithm (SLIC) was employed to construct a robust image dataset covering different locations and heights of image acquisition, and images were categorized using a model trained with the Caffe software. Without depending on hand-crafted feature extractors, this approach achieved above 98% accuracy with ConvNets in identifying broadleaf and grass weeds with respect to soil and soybean.
In related work, [16] introduced a semantic segmentation method to categorize two types of weeds in paddy fields, namely sedges and broadleaved weeds. Three semantic segmentation models, SegNet, the Pyramid Scene Parsing Network (PSPNet), and UNet, were employed for the segmentation of the paddy crop and the two types of weeds, and the authors suggest this can be exploited to recommend appropriate herbicides to agriculturalists. The authors could enlarge the dataset and use additional semantic segmentation algorithms to study performance. The annotation of the images appears to be inadequate, and the authors should therefore consider addressing this to improve the model's performance.
In [5], during stem identification, each local window offers details about the position of trunk or non-trunk areas. The development of an automated diagnostic system for detecting tomato diseases based on deep neural networks and real-world data demonstrates that agriculturalists cannot provide information beyond the obvious symptoms of their crops; consequently, the sophisticated but complex methods proposed in the literature do not consider implementation and usability from the farmers' point of view. In [17], location information on weeds for site-specific weed management was collected in a dataset, and an encoder-decoder network for semantic segmentation (using transfer learning) obtained an average accuracy of up to 92.7%. Therefore, this work aims to develop an intelligent prediction technique using CNNs and CRFs in smart agriculture applications to classify and predict plant diseases efficiently for smart farming systems.
The author in [18] focuses on adaptive path planning for accurate semantic segmentation using UAVs. They present an online planning algorithm that adapts UAV paths to obtain high-resolution semantic segmentation in areas with fine details as they are detected in incoming images. This facilitates close inspection at low altitude only where required, without wasting energy on exhaustive mapping at maximum image resolution. A key attribute is a new accuracy model for deep learning-based architectures that captures the correlation between UAV altitude and semantic segmentation accuracy, validated using real-world data.
The author in [19] deals with weed treatment. To accomplish this, a convolutional neural network called the Reduced Residual U-Net using Depth-wise separable Convolution (RRUDC) network is introduced. A Residual Depth-wise separable Convolution Block (RDCB) is created as the operational unit in both the contracting and expanding paths, with a residual connection merged inside every RDCB unit. This network employs semantic segmentation to analyze crop field images pixel-wise. To decrease the parameter count, a depth-wise separable convolution technique is used, which reduces the number of parameters generated by the model to roughly a 1/9 ratio with a very negligible drop in accuracy. The model is trained using the Crop Weed Field Image Dataset (CWFID), and the trained model is then pruned to diminish the model size. A multi-crop weed segmentation network partitions the weeds present among the different kinds of crops on farmland. A comparison of some existing schemes, highlighting their aims, proposed solutions, pros, and cons, is given in Table 1.
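The roughly 1/9 parameter reduction can be verified with a short Keras calculation; the channel counts below are illustrative assumptions, not the RRUDC configuration:

```python
from tensorflow.keras import layers, Input, Model

inp = Input((64, 64, 64))
standard = Model(inp, layers.Conv2D(64, 3, use_bias=False)(inp))
separable = Model(inp, layers.SeparableConv2D(64, 3, use_bias=False)(inp))

# Standard 3x3 conv: 3*3*64*64 = 36,864 weights.
# Depthwise separable: 3*3*64 (depthwise) + 64*64 (pointwise) = 4,672,
# i.e., roughly a 1/8 to 1/9 reduction, matching the ratio cited above.
print(standard.count_params(), separable.count_params())
```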

CNN Network Architecture
In this work, we exploited the most popular semantic segmentation architectures in the literature: CED-Net, SegNet, FCN-8s, U-Net, and DeepLabv3. The architecture of U-Net constitutes an FCN and can be separated into two major phases. The first is a contracting path, which is in charge of downsampling and follows the CNN architecture without the fully connected layers, i.e., an input image has its depth increased and spatial dimensions reduced through convolution, ReLU, and max-pooling operations. Then an expansive path is applied, which performs the opposite process: upsampling and feature map convolutions, whose outputs are concatenated with the corresponding feature maps from the contracting path.

UNet Model
To build a model that can generate a mask matching the real one, we use a well-known deep learning structure called the U-Net model; U-Nets are models that excel at segmenting images but are computationally demanding [20]. Just as the name indicates, the structure looks like the letter U. The network consists of two primary parts: the convolutional encoding and decoding units. Basic convolution operations followed by ReLU activation are implemented in both parts of the network. For downsampling in the encoding unit, 2 × 2 max-pooling operations are executed. In the decoding step, transposed convolution (up-convolution, or de-convolution) operations are performed to up-sample the feature maps. The very first version of U-Net cropped and copied feature maps from the encoding unit to the decoding unit. This architecture consists of three sections: the contraction, the bottleneck, and the expansion. The contraction section is made of many contraction blocks. Each block takes leaf images and applies two 3 × 3 convolution layers followed by a 2 × 2 max-pooling. The number of feature maps or kernels doubles after each block, so that the architecture can learn complex structures successfully. The bottommost layer mediates between the contraction and expansion sections; it uses two 3 × 3 CNN layers followed by a 2 × 2 up-convolution layer. Like the contraction section, the expansion section also consists of several expansion blocks. Each block passes the leaf image through two 3 × 3 CNN layers followed by a 2 × 2 up-sampling layer.
Also, after each block the number of feature maps used by the convolutional layers is halved to maintain symmetry. Each time, however, the input is also appended with the feature maps of the corresponding contraction block. This ensures that the features learned while contracting the image are used to reconstruct it. The number of expansion blocks is the same as the number of contraction blocks. After that, the resultant mapping passes through another 3 × 3 CNN layer whose number of feature maps equals the number of desired segments, as shown in Fig. 1. The first image shows downsampling, and the second image shows up-sampling, as shown in Fig. 2.

Fig. 1 The architecture of the UNet model
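As a minimal sketch of the U-Net structure just described, the contracting path, bottleneck, skip connections, and expansive path can be written in Keras as follows; the input size, depths, and class count are illustrative assumptions, shallower than a full U-Net:

```python
from tensorflow.keras import layers, Model, Input

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in each U-Net block.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 3), n_classes=3):
    inputs = Input(input_shape)
    # Contracting path: feature maps double after each block.
    skips, x = [], inputs
    for f in (64, 128, 256):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 512)  # bottleneck
    # Expansive path: upsample and concatenate the encoder features.
    for f, skip in zip((256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)
```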

SegNet
SegNet has an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer. The encoder network contains 13 convolutional layers, matching the first 13 convolutional layers of the VGG16 network designed for object classification. The fully connected layers are discarded in favor of retaining higher resolution feature maps at the deepest encoder output; this also decreases the number of parameters in the SegNet encoder network significantly. Each encoder layer has a corresponding decoder layer, so the decoder network also has 13 layers. The basic trainable segmentation structure thus consists of an encoder network and a matching decoder network followed by a pixel-wise classification layer. The decoder network maps the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. SegNet's innovation lies in the way the decoder upsamples its lower resolution input feature maps: the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear up-sampling. Each encoder stage performs convolution with a filter bank to produce a set of feature maps; these are then batch normalized, and an element-wise rectified linear non-linearity (ReLU), max(0, x), is applied. Finally, max-pooling with a 2 × 2 window and stride 2 (non-overlapping windows) is performed, and the resulting output is sub-sampled by a factor of 2. Max-pooling is employed to achieve translation invariance over small spatial shifts in the input image [20].
The SegNet decoding approach is shown in Fig. 3. The upsampled feature maps are convolved with a trainable decoder filter bank to produce dense feature maps; a batch normalization step is then applied to each of these maps, and the high-dimensional feature representation at the output of the final decoder is fed to the pixel-wise classification layer.
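A minimal sketch of SegNet's index-based up-sampling is given below, assuming TensorFlow's max_pool_with_argmax in the encoder; the unpooling helper is an illustrative reconstruction, not the authors' implementation:

```python
import numpy as np
import tensorflow as tf

def encoder_pool(x):
    # SegNet encoder: 2x2 non-overlapping max-pooling that also records
    # the argmax (pooling) indices for the matching decoder stage.
    return tf.nn.max_pool_with_argmax(
        x, ksize=2, strides=2, padding="SAME", include_batch_in_index=True)

def decoder_unpool(pooled, indices, output_shape):
    # SegNet decoder: non-linear up-sampling that scatters each pooled
    # value back to the location memorized by the encoder; all other
    # locations stay zero and are densified by the decoder convolutions.
    flat_size = int(np.prod(output_shape))
    flat = tf.scatter_nd(tf.reshape(indices, [-1, 1]),
                         tf.reshape(pooled, [-1]),
                         [flat_size])
    return tf.reshape(flat, output_shape)

# Minimal round trip on a random feature map.
x = tf.random.normal([1, 8, 8, 16])
pooled, idx = encoder_pool(x)
upsampled = decoder_unpool(pooled, idx, [1, 8, 8, 16])
```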

FCN-8s
FCNs can be utilized to produce pixel-wise predictions for semantic segmentation tasks. There are three FCN architectures, FCN-32s, FCN-16s, and FCN-8s, which are all built on top of a VGG16 model, with transposed convolutions used to up-sample the layers back to the original image height and width. The architecture of the VGG16 network is shown in Fig. 5. In the fully convolutional version, the two fully connected layers ('fc1' and 'fc2') are converted to convolutional layers. The semantic segmentation networks we have implemented in this paper are built on top of this fully convolutional VGG16 network. The network contains a set of max-pooling and convolution layers to detect pixel-wise class labels and forecast the mask, enabling accurate localization. The FCN-based semantic segmentation process has been explored for leaf segmentation: during training, the model extracts a semantic feature map of the input image. These semantic features are obtained from the training images and used to build learned semantic features. The model permits dense prediction on arbitrarily sized images. VGG16 is used as a backbone that generates features; instead of classification scores like other deep learning-based object detection models, it outputs a spatial segmentation map [20].
To build the FCN-32s network, a 1 × 1 convolution is applied to the output of the second convolutionalized fully connected layer of the VGG16, and then a transposed convolution with a stride of 32 × 32 upsamples the output to the original input image height and width. In the FCN-16s, the output from the second fully connected layer is up-sampled using a transposed convolution with stride 2, and a skip connection combines the output of the upsampled layer with the block 4 pooling layer. After the skip connection, the output of the model is another transposed convolution, this time with a stride of 16 × 16. The FCN-8s follows the same architecture as the FCN-16s, except that an additional skip connection also incorporates the output of the block 3 pooling layer; the output of this model is a transposed convolution with a stride of 8 × 8. Figure 6 shows some ground truths against prediction samples [20].
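A minimal Keras sketch of the FCN-8s skip-connection wiring on a VGG16 backbone follows; the layer names are those of keras.applications.VGG16, while the class count and input size are assumptions:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_fcn8s(input_shape=(224, 224, 3), n_classes=3):
    vgg = VGG16(include_top=False, input_shape=input_shape)
    pool3 = vgg.get_layer("block3_pool").output  # 1/8 resolution
    pool4 = vgg.get_layer("block4_pool").output  # 1/16 resolution
    pool5 = vgg.get_layer("block5_pool").output  # 1/32 resolution

    # "Convolutionalized" fully connected layers.
    x = layers.Conv2D(4096, 7, padding="same", activation="relu")(pool5)
    x = layers.Conv2D(4096, 1, activation="relu")(x)
    score5 = layers.Conv2D(n_classes, 1)(x)

    # First skip: upsample x2 and fuse with pool4 scores.
    up4 = layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(score5)
    fuse4 = layers.Add()([up4, layers.Conv2D(n_classes, 1)(pool4)])

    # Second skip: upsample x2 and fuse with pool3 scores.
    up3 = layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(fuse4)
    fuse3 = layers.Add()([up3, layers.Conv2D(n_classes, 1)(pool3)])

    # Final x8 upsampling back to the input resolution.
    out = layers.Conv2DTranspose(n_classes, 16, strides=8,
                                 padding="same", activation="softmax")(fuse3)
    return Model(vgg.input, out)
```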

DeepLabv3+
The DeepLabv3+ model has an encoder-decoder architecture, as illustrated in Fig. 7. The encoder consists of a trained convolutional neural network (CNN) that encodes feature maps of the input image; the decoder up-samples and reconstructs the output using the important information extracted by the encoder. It includes an atrous separable convolution for each input channel, along with a point-wise convolution. To deal with multi-scale images, DeepLab utilizes multiple parallel pooling layers, also known as Spatial Pyramid Pooling (SPP), which splits the feature maps extracted from the convolutional layers into a fixed number of spatial bins regardless of input image size. DeepLabv3+ extends DeepLabv3 by adding this encoder-decoder structure.
The encoder module captures multi-scale contextual information by applying dilated convolution at multiple scales, while the decoder module refines the segmentation output along object boundaries [20]. Figure 7 shows the architecture of DeepLabv3+. Figure 8 shows some of the ground truth against prediction samples with the DeepLabv3+ model.
In dilated convolution, we can keep the stride constant but obtain a larger field of view without raising the number of parameters or the amount of computation. In addition, it yields larger output feature maps, which is helpful for semantic segmentation. The motivation for using dilated spatial pyramid pooling is that, as the sampling rate becomes larger, the number of valid filter weights (i.e., weights applied to the valid feature region rather than to padded zeros) becomes smaller.
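A short sketch of an ASPP-style block built from parallel dilated convolutions is shown below; the filter counts and dilation rates are illustrative assumptions in the spirit of DeepLab, not the exact configuration used here:

```python
from tensorflow.keras import layers

def aspp_block(x, filters=256, rates=(6, 12, 18)):
    # Atrous Spatial Pyramid Pooling: parallel dilated convolutions
    # enlarge the field of view at several scales without adding
    # parameters beyond those of a standard 3x3 convolution.
    branches = [layers.Conv2D(filters, 1, padding="same")(x)]
    for r in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=r)(x))
    merged = layers.Concatenate()(branches)
    # 1x1 projection fuses the multi-scale branches.
    return layers.Conv2D(filters, 1, padding="same")(merged)
```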

CED-Net
The network structure is shown in Fig. 9. CED-Net consists of four small encoder-decoder networks divided into two levels, and the encoder-decoder networks of each level are trained independently for leaf segmentation. The predictions of Model-1 and Model-3 are up-sampled, concatenated with the corresponding input image, and used as inputs by Model-2 and Model-4, respectively. Two cascaded networks (Model-1, Model-2) are thus trained for disease part prediction, and two (Model-3, Model-4) for prediction of the remainder of the leaf, giving four small networks in total (Fig. 10). Figure 11 shows some of the ground truth against prediction samples with the CED-Net model. Instead of a big encoder-decoder network with millions of parameters, we can implement the same system with small networks in cascaded form. The detailed construction of a single encoder-decoder network is displayed in Fig. 3. The input to this small network is an RGB image, while the target is a binary mask with the same dimensions as the input [20].
This network is like U-Net, but instead of going very deep, we restricted the maximum number of feature maps to 256. In the encoder, the number of feature maps is increased through {16, 32, 64, 128} while the spatial dimensions decrease; 2 × 2 max-pooling [23] with stride 2 subsamples the feature maps by a factor of 2. In the bottleneck, the maximum number of feature maps is set to 256. In the decoder, the bottleneck feature maps are decreased through {128, 64, 32, 16} while their spatial dimensions increase by a factor of 2 through bilinear interpolation. At each stage of the decoder, the up-sampled feature maps are concatenated with the corresponding feature maps of the encoder, indicated by the horizontal arrows in Fig. 9. A rectified linear unit (ReLU) is employed as the activation function for each convolutional layer of the encoder, bottleneck, and decoder, while the output layer uses a sigmoid activation to produce the binary mask. The Model-1 and Model-3 encoder-decoder networks have 1,352,881 parameters, whereas Model-2 and Model-4 have 1,353,025 parameters [20].
This gain in the number of parameters occurs because the input dimensions for Level 2 (Model-2, Model-4) are 896 × 896 × 4 rather than the 896 × 896 × 3 of Level 1 (Model-1, Model-3): concatenating the up-sampled Level-1 predictions with the input images increases the Level-2 input channels by 1. The outputs of Level 2 are combined by stringing together their predictions, as displayed in Fig. 2, and the final output is then mapped onto the input images; to distinguish disease and health, background pixels are kept the same as in the input image. For each target, network training is performed in two stages. In the first stage, the Level-1 models (Model-1 and Model-3) are trained separately to produce coarse outputs. The Level-2 models (Model-2 and Model-4) are trained in the second stage using the Level-1 predictions, concatenated with the input image, as input. All four models are trained using Adaptive Moment Estimation (Adam) optimization with β1 = 0.9, β2 = 0.99, and learning rate 0.01.
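The two-stage cascade can be outlined as follows; build_small_unet, images, and masks are hypothetical placeholders standing in for the encoder-decoder constructor and the training data, so this is a sketch of the training flow rather than the authors' code:

```python
import tensorflow as tf

# Hypothetical constructor for the small encoder-decoder network
# described above (e.g., a variant of the earlier U-Net sketch).
model1 = build_small_unet(input_shape=(896, 896, 3))  # Level 1
model2 = build_small_unet(input_shape=(896, 896, 4))  # Level 2

adam = dict(learning_rate=0.01, beta_1=0.9, beta_2=0.99)

# Stage 1: train the Level-1 model alone to produce coarse masks.
model1.compile(optimizer=tf.keras.optimizers.Adam(**adam),
               loss="binary_crossentropy")
model1.fit(images, masks)

# Stage 2: concatenate the up-sampled Level-1 prediction with the RGB
# input (896x896x3 -> 896x896x4) and train the Level-2 model on it.
coarse = tf.image.resize(model1.predict(images), (896, 896))
level2_inputs = tf.concat([tf.cast(images, tf.float32), coarse], axis=-1)
model2.compile(optimizer=tf.keras.optimizers.Adam(**adam),
               loss="binary_crossentropy")
model2.fit(level2_inputs, masks)
```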

Fully Connected Conditional Random Fields (CRFs)
The naive way to approach a dense prediction task like semantic segmentation is to classify each pixel separately using features generated from the image. However, such independent pixel-wise classification usually produces unsatisfactory outputs that are inconsistent with the visual features of the image: to predict each pixel's label, a local classifier uses only a small spatial context, which causes noisy predictions. Better results can be obtained by recognizing that we are predicting a structured output and by modeling the problem to incorporate our prior knowledge about good pixel-wise predictions. The CRF takes two inputs: the original image and the predicted probabilities for each pixel. For the fully connected CRF, an efficient inference algorithm exists when the pairwise edge potentials are defined by a linear combination of Gaussian kernels in an arbitrary feature space; it considers the surrounding pixels when assigning a class to a particular pixel, which yields better semantic segmentation results [21]. Each pixel n is associated with a finite set of its potential states L, modeled by a random variable X_n; these finite states are the labels that can be assigned to every pixel (i.e., L = {health part, disease part, background}). Each state has an associated unary cost, as in Eq. (1), that has to be paid to assign label w to pixel n, given the input image I. This unary cost is generally obtained from the classifier, such as the CNN output, and by itself allows only independent per-pixel predictions, where X_n denotes the label of pixel n in the given image and I is the input image.
To model the interrelationship between pixels, pairwise costs are introduced: the pairwise cost, as in Eq. (2), is paid for assigning the pair of labels w and v to pixels n and m, respectively. A common pairwise cost in semantic segmentation is the Potts model (originating from statistical mechanics), where the cost is zero when two adjacent pixels have the same label and k (a real, positive scalar) when they have different labels; this cost encourages nearby pixels to take the same label and is based on our prior knowledge that objects are generally continuous [21]. A CRF can be illustrated as a graph (V, E), where V is the set of nodes corresponding to the image pixels and E is the set of edges linking the node pairs for which a pairwise cost is defined; inference is easier in this case, and the model parameters are easy to learn. Inference in the CRF means finding a labeling w such that the total of the unary and pairwise costs, also called the energy, is minimized: the problem of maximizing the conditional probability becomes an energy minimization problem. In computer vision problems, the CRF is often written with

(1) Unary cost: U_n(X_n = w | I),

(2) Pairwise cost: P_nm(X_n = w, X_m = v | I).

The objective of this system is to develop an intelligent prediction technique using CNNs and CRFs in smart agriculture applications to classify plant diseases effectively. This system delivers a model for predicting plant disease. The complete planned system consists of the following phases:

• Phase 1: Gather images of plant leaves from the farm using available IoT data acquisition sensors such as cameras.

• Phase 2: Apply image pre-processing methods to the dataset to enhance images, suppressing unwanted distortions or heightening important image features before applying the CNN models.

• Phase 3: Forecast plant disease using the CNN models.

• Phase 4: Evaluate the system performance.

Figure 13 shows the suggested IoT system for plant disease recognition. In this system, an IoT-based wireless network of cameras or smartphones can be used to gather images inside a smart farm. The gathered plant images are then forwarded to an IoT-based gateway such as a Raspberry Pi. This gateway sits between the wireless camera network and the analytics server hosted in the cloud, which performs disease prediction for better decision-making in smart farming systems. The controller sends the gathered images to the respective channel periodically via an IoT communication protocol such as CoAP or MQTT. Finally, mobile or web applications can be used to communicate with the cloud-based analytics server to obtain the decision on plant disease categories using the proposed hybrid CNN with CRF prediction model, as shown in Fig. 12. The prediction results are then sent to the management team in the smart farm to take suitable action.
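As a sketch of the fully connected CRF refinement step, the widely used pydensecrf implementation of the dense CRF can post-process the CNN's soft-max output; the kernel parameters below are illustrative assumptions, not tuned values from this paper:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, softmax_probs, n_iters=5):
    # image: HxWx3 uint8 RGB; softmax_probs: CxHxW per-pixel class
    # probabilities from the CNN (C = health / disease / background).
    c, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color prefer the
    # same label (a Potts-style penalty applies otherwise).
    d.addPairwiseBilateral(sxy=60, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)
```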

Evaluation Metrics
The performance of the networks is measured and compared using several evaluation metrics: Jaccard similarity (JS)/intersection over union (IoU), Dice coefficient/F1-score, mean intersection over union, and sensitivity/recall. These metrics are estimated from the false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN) obtained from the confusion matrix of the predictions against the ground truth. The equations for recall and IoU are [8]:

Recall = TP / (TP + FN),

IoU = TP / (TP + FP + FN).

The F1-score is calculated from the harmonic mean of recall and precision, as shown in Eq. (8) [24]:

F1-score = 2 × (Precision × Recall) / (Precision + Recall), with Precision = TP / (TP + FP).

The mean intersection over union is described by the following equation [25]:

MIoU = (1/k) Σ_i [ P_ii / (Σ_j P_ij + Σ_j P_ji − P_ii) ],

where P_ij counts the samples belonging to class i that are predicted to be of class j, i.e., P_ij corresponds to the confusion matrix values produced from the classification output, and k denotes the number of classes in the problem. The major difference between semantic segmentation and classification metrics is that P_ii is now counted over pixels rather than samples. Figure 14 shows samples of the dataset.
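A small NumPy sketch of these metrics, computed per class from the confusion matrix as defined above (three classes assumed: health, disease, background):

```python
import numpy as np

def segmentation_metrics(pred, gt, n_classes=3):
    # pred, gt: integer label maps of identical shape; the metrics
    # follow the recall, IoU, F1, and MIoU equations given above.
    cm = np.zeros((n_classes, n_classes), dtype=np.float64)
    for i in range(n_classes):
        for j in range(n_classes):
            cm[i, j] = np.sum((gt == i) & (pred == j))
    eps = 1e-9  # guards against empty classes
    per_class = {}
    for i in range(n_classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp  # predicted i but labelled otherwise
        fn = cm[i, :].sum() - tp  # labelled i but predicted otherwise
        recall = tp / (tp + fn + eps)
        precision = tp / (tp + fp + eps)
        per_class[i] = dict(
            iou=tp / (tp + fp + fn + eps),
            recall=recall,
            f1=2 * precision * recall / (precision + recall + eps))
    miou = np.mean([m["iou"] for m in per_class.values()])
    return miou, per_class
```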

Experimental Results and Discussion
All experiments described above were implemented on a PC with a Tesla P100 GPU and 27 GB of RAM. We used the Keras framework with a TensorFlow backend. For our experiments, a ratio of 80%/10%/10% was used for training, validation, and testing, respectively. The weights of the networks are adapted so that the most relevant discriminative features of this set of images are learned; to make training more efficient, a transfer learning approach was applied. Training was carried out by initializing each network with weights pre-trained on the ImageNet dataset and then retraining the networks end to end; the hyperparameters used in the experiments are given in Table 2. Semantic segmentation tries to label the different elements of an image as semantically meaningful objects and to sort each object into one of the pre-determined classes. There is a semantic relationship between the diseased part and the healthy part in agronomic images. In this semantic segmentation problem, there are multiple semantic relationships within an image: not only does every single object have a semantic identity, it also has subsets of diseased and healthy parts. In this research, semantic segmentation was used in conjunction with convolutional neural networks to split off the diseased part of the leaf image; superpixel and K-means algorithms were also used for the segmentation process to allow a comparison between methods. Agronomic images with a high occurrence of diseases were utilized. In these images, the areas of interest appeared in almost every location, so semantic segmentation was chosen as a technique that allows clean detection of irregularly shaped objects.
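A sketch of the 80%/10%/10% split using scikit-learn; images and masks are hypothetical arrays holding the full dataset:

```python
from sklearn.model_selection import train_test_split

# First carve off 20%, then split it evenly into validation and test.
x_train, x_rest, y_train, y_rest = train_test_split(
    images, masks, test_size=0.2, random_state=42)   # 80% train
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42)  # 10% / 10%
```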
Visually analyzing the segmentation results, the images on which the networks showed the largest segmentation problems were those of rust-infected leaves, which is expected behavior, because rust symptoms sometimes show yellow shades that are very similar to the color of leaf tissue. Table 3 shows the test results obtained by the networks on the segmentation dataset before applying the CRF. We conclude that SegNet performs better than the FCN-8s, DeepLabv3+, CED-Net, and UNet models. The pooling in SegNet is applied to the whole image and then to increasingly smaller areas. This enables the network to capture information about the pixel of interest not only at the global level but also at various object levels; consequently, it learns more contextual relations and therefore performs well when there are objects at multiple scales.
Semantic segmentation is a more suitable technique for real-time in-field leaf disease segmentation problems. The SegNet model is capable of detecting both the labels and the pattern of the leaf to some extent.
For quantitative estimation, the standard MIoU (mean intersection over union) is utilized to display the improvement effects. As shown in Table 3, the proposed method of enhancing semantic segmentation results was confirmed after being tested on 49 images from the dataset. Note that the ground truth of these datasets can miss some edges, which affects the determination of the exact location of the disease and can consequently be misleading when judging the performance of the algorithms.
One of the most used CRF-based optimizations applied to interactive image segmentation is based on the pairwise potentials of nearest neighbors. Here, the improvement obtained by the suggested method is compared with the CRF-based post-processing method. The intuitive improvement of the post-processing is visible in the semantic segmentation results, but target-area errors resulting from misclassification are still not corrected, as reported in Table 4. The shortcoming of the proposed scheme is that the system needs further performance improvements and should be applied to more datasets in the agriculture field.

Conclusion and Future Work
This paper presented an efficient IoT-based plant disease recognition system using semantic segmentation methods, such as SegNet, CED-Net, U-Net, DeepLabv3, and FCN-8s, with a post-processing enhancement method (CRF) to localize diseased parts of crop leaves. In this work, we introduced a theoretically simple post-processing method within a unified framework to improve segmentation results and to address problems inherent in current algorithms, such as lack of detail identification, spatial inconsistency, and missing global interrelationships of information, by combining semantic segmentation with the post-processing enhancement method (CRF). Future work will augment the size of the dataset and study the performance of several additional semantic segmentation networks. The results show that SegNet performed better than UNet, FCN-8s, DeepLabv3+, and CED-Net. The result is adequate, with an MIoU of around 79%, which can be increased by augmenting the dataset; overall, the results suggest that deep learning-based semantic segmentation of leaf disease can be used to detect specific diseased parts of crop leaves and thus move toward no loss of food production. We aim to develop the proposed system further in the future and plan to utilize other techniques, such as genetic and fuzzy methods, for improving the system's performance [27][28][29][30].