Deep Learning Object Detection for Image Analysis of Cherry Fruit Fly (Rhagoletis cerasi L.) on Yellow Sticky Traps

Insect populations appear with a high spatial, temporal and type-specific diversity in orchards. One of the many monitoring tools for pest management is the manual assessment of sticky traps. However, this type of assessment is laborious and time-consuming so that only a few locations can be controlled in an orchard. The aim of this study is to test state-of-the art object detection algorithms from deep learning to automatically detect cherry fruit flies (Rhagoletis cerasi), a common insect pest in cherry plantations, within images from yellow sticky traps. An image annotation database was built with images taken from yellow sticky traps with more than 1600 annotated cherry fruit flies. For better handling in the computational algorithms, the images were augmented to smaller ones by the known image preparation methods “flipping” and “cropping” before performing the deep learning. Five deep learning image recognition models were tested including Faster Region-based Convolutional Neural Network (R-CNN) with two different methods of pretraining, Single Shot Detector (SSD), RetinaNet, and You Only Look Once version 5 (YOLOv5). R‑CNN and RetinaNet models outperformed other ones with a detection average precision of 0.9. The results indicate that deep learning can act as an integral component of an automated system for high-throughput assessment of pest insects in orchards. Therefore, this can reduce the time for repetitive and laborious trap assessment but also increase the observed amount of sticky traps


Introduction
Agricultural plants are often threatened by pests such as specific insects, making it difficult to produce high-quality food. Different pest control methods are available based on cultivation standards. Once the species of a pest is identified, many methods can be improved by adapting them. In order to keep track of that, one can lay out yellow sticky traps that are subsequently evaluated by experts. Additionally, changing climatic conditions are occurring all over the world, owing mostly to the phenomena of climate change that affect temporal and spatial patterns of precipitation and temperature causing significant effects on crop-pest interactions (Heeb et al. 2019). Losses due to pests are not only economic (40% of the world's food supply is destroyed by pests) but also decreasing food security (IPPC Secretariat 2021). Responding appropriately to usually growing healthy food demands under these potentially threatening situations, necessitates the use of innovative pest detection techniques and technologies more than ever (Saleem et al. 2021;Böckmann et al. 2021). One of the main priorities of any pest management solution is to find suitable methods and models in order to detect insects better (Böckmann et al. 2021). Nowadays, scientific and technological advances, particularly in image processing techniques and computer vision technology have enabled the application of new tools for describing areas affected by insects and also facilitating pest management to increase yields in the context of precision horticulture (Zude-Sasse et al. 2016;Cardim Ferreira Lima et al. 2020). The combination of these techniques and data-driven computing tools such as machine learning (ML), more specifically deep learning (DL) approaches, have led to increased accuracy in translating unstructured image data to practical information for the end-user. In recent years ML data driven methods have been more and more employed in insect detection and monitoring systems (Cardim Ferreira Lima et al. 2020;Jiang et al. 2008) and, more specifically, deep learning methods have been employed in pest detection studies (Wenyong et al. 2021). Wang (2022) indicated that the complexity of data preprocessing in traditional methods of artificial intelligence and machine learning models is high, therefore he proposed an improved deep learning model namely, AlexNet, for detecting crop diseases and insect pests. Thenmozhi et al. (2021) applied convolutional neural network (CNN) for two different insect datasets with 24 classes. The results showed more than 90% accuracy for classification using a CNN model. Kuzuhara et al. (2020) studied the application of region based convolutional neural Networks (R-CNN) and "You Only Look Once" v3 (YOLOv3) for insect pest detection. They concluded that deep learning models need a large dataset to optimize parameters during the training stage and consequently proposed data augmentation methods to overcome the lack of data problem. These studies showed that the deep learning approach is one of the most suitable data driven models for image processing and object detection studies for insect pest detection.
The insect detection image processing of installed sticky traps using DL data driven methods can be entwined with two main challenges. On the one hand, many parameters including different types of insects, appropriate annotation methods, image resolution, etc., should be considered when using trap images for object detection. On the other hand, the selection of DL methods that are suitable for processing images needs to be properly tested and adapted. This can be difficult because of the magnitude of available ML methods for object detection. In this context, the objective of this study focuses on the application of different object detection DL methods, namely, faster R-CNN, single-shot detector (SSD), RetinaNet, and YOLOv5, for insect detection on sticky trap images using two high resolution data sets, i.e. a single-class-cherry fruit fly (Rhagoletis cerasi)and a multi-class data set, taken from insect traps on cherry orchards located in eastern Germany.

Study Area and Data
The cherry orchard is located in the Leibniz Institute of Agricultural Engineering and Bio-economy (ATB) research site in Marquardt, eastern Germany; the longitude and latitude of the field center is 52°28 0 01 00 N 12°57 0 27 00 E (Fig. 1). The data consist of two high-resolution sets of images that display insects on hanging yellow sticky traps (Fig. 2) in cherry orchards placed at 0.74 to 2.17 m over the ground in two consecutive years (2020 and 2021). The most serious pest of cherries is the European cherry fruit fly (Rhagoletis cerasi), which appears in orchard between mid-May and mid-June (Böhm 1949). The datasets consist of 140 and 850 images for 2020 and 2021, respectively. In 2020, within the images, 47 different classes were detected and labeled, whereas in 2021, within the images, the whole area of each trap was categorized in two classes, namely, RHAGCE for European cherry fruit fly and non-RHAGCE. Since the target pest is the European cherry fruit fly, all deep learning object detection models in the present study have been trained and validated based on the 2021 labeling method.

Insect Annotation
Many insects find bright yellow to be very attractive and European cherry fruit fly (Rhagoletis cerasi) is not an exemption (Lu et al. 2019). As Fig. 2 shows, the images have been labeled in a way that each trapped insect has a class label and a bounding box. For the annotation, the computer vision annotation tool (CVAT) was applied. Using bounding boxes, the CVAT software enabled the insect individuals to be quickly annotated within the images using bounding boxes specified with left-down (x_min, y_min), right-up (x_max, y_max) points. The images with annotated classes in CVAT were stored as a comma-separated values (csv) and converted to the pattern analysis, statistical modeling and computational (PASCAL) visual object classes (VOC) model format (Everingham et al. 2015). In order to make the data persistent and easier to employ, all information in each individual image was further converted to the Extensible Markup Language (XML) format by PASCAL-VOC (Everingham et al. 2007), which is composed of several attributes, such as filename, size and objects that consists of a label and a bounding box.

Data Augmentation
In order to increase the amount of data, the augmentation techniques including flipping and cropping was used.
Flipping is a frequent technique in computer vision that leads to a significant performance improvement (Shorten and Khoshgoftaar 2019). It aids in preventing a model from learning a certain order of pixels used to create an object. By shifting one's point of view, one can gain a more general and fine-tuned "understanding" of an object. Cropping is used to reduce the size of the input. This can boost performance; for example, a batch of a smaller size can be processed faster. While the insects are just a small part of the photo, the insects may occupy a bigger area in the cropped picture. As a result, the model may extract different features than the original sized input. Two methods of cropping were applied. First, grid cropping created patches from a picture of equal size. Second, sliding window cropping created images of a fixed size by moving a window over the image and cutting it out at every step. The latter The resolution of image has not been changed (3248-4032 pixels × 1960-3024 pixels) b The cropped parts has the resolution of (300 pixels × 300 pixels) Fig. 3 a A visual representation that shows how machine learning (ML) and deep learning (DL) are regarded within the context of artificial intelligence (AI); b Architecture of a DL model (Chollet 2017) allows the identical object to appear in a different area of the cropped image (See Table 1). Due to less complexity, i.e., detection of one class only, the single-class problem should be "easier" for the model to learn. Therefore, exploration with grid cropping methods are evaluated on the 2021 data set for feasibility reasons prototypically. The reason is that a small object can be harder to detect in a large image. When cropping the images into sub images, the objects will take up more space when occurring. Also, detecting on smaller images may have a performance boost during train and test time and can be elaborated on in future research.

Deep Learning Models
As shown in Fig. 3, deep learning (DL) is a subfield of artificial intelligence (AI) and machine learning (ML). The term "deep" in this approach refers to a layered structure in the learning process, and it is not always associated with a deeper understanding of this problem-solving methodology. The sample deep learning model is composed of an input layer, multiple hidden layers and one output layer. Each layer has multiple fully connected nodes. The layer names are self-explanatory, as the input layer handles input data and the output layer delivers output values, for example probabilities for a class in a classification problem. The hidden layers lay between them and do the calculations between the layers. As a result, it is also known as "hierarchical representations learning" and "layered representations learning." (Chollet 2017). Although the idea of DL was first introduced in 1993, it was not really applied for more than a decade because of the lack of data and high-performance computing hardware as well as popularity of other ML models at the time such as support vector machines (SVM). However, Many labeled datasets have been collected since 2006 and, moreover, significant progress in computers and data training methods have been made in the recent years that removed obstacles to apply DL methods (Fig. 4). In fact, more recently it has become a rather hot topic in image processing, specifically the CNN models (Chollet 2017;Hatt et al. 2019). In order to apply deep learning to detect cherry fruit flies, five different models were applied in this study: i) Faster-RCNN model pretrained using a residual neural network (ResNet); ii) Faster-RCNN model pretrained using Mo-bileNet; iii) single-shot detector (SSD); iv) RetinaNet (onestage deep CNN); and v) YOLOv5. Faster R-CNN models are a modified version of R-CNN (Girshick 2015). The model is known as a two-staged approach because it extracts region proposals at the first stage and computes CNN features on latter proposals at second stage.
As shown in Fig. 5, given the input images as three-dimensional matrices (RGB), the model extracts region suggestions for each input image using the selective search technique, which recursively merges comparable neighboring region pairings into bigger ones using a similarity mea-sure. Then, using affine image warping, each proposal is transformed to a fixed size that serves as an input layer for the CNN. Then the CNN generates a feature vector from each area suggestion. Finally, each class is predicted using a separate linear support vector machine (SVM) model. Fast-RCNN is a form of R-CNN that uses a neural network instead of SVM and modifies the feature computation. A further development of both models is called Faster-RCNN, which is a single and unified network for object detection. On the PASCAL VOC 2007 data set, relative to R-CNN, the Faster-RCNN model takes a shorter time (about 0.2 s) for image proposal and detection (Liu et al. 2016). SSD is a single-stage object detection model proposed by Liu et al. (2016). Similar to Faster R-CNN, SSD uses an offset prediction of the default boxes and its confi- dence values but at different scales. RetinaNet is a singlestage object detection model based on the focal loss introduced by Lin et al. (2020) that also uses anchor boxes. You Only Look Once (YOLO) proposed by Redmon et al. (2016) is the last single-stage object detection model that divides images into grids at the first step. In a grid, each cell detects objects within itself. The latest version (YOLOv5) has been used in present study. The general workflow of all applied models has been shown in Fig. 6. It includes data preparation and annotation, classification made by a deep learning algorithm and iterative evaluation of the results. In order to give a broad overview of different learning rate behaviors, an extensive grid search is applied for the learning rates [10 -2 , 10 -3 , ..., 10 -7 , 10 -11 and 10 -12 ] onto each model for the 2020 and 2021 data set. They are used along with a stochastic gradient descent optimizer and a momentum of 0.9 (Ruder 2016). The learning rate hyper-parameters define how much the weights are adjusted during the backpropagation process of the neural network according to the loss of stochastic gradient descent. The training process is combined with a learning rate scheduler that is applied after 15 episodes. It multiplies the learning rate by 0.1 similar to Liu et al. (2016). That should make a network find a more exact solution and fine-tune its parameters in theory.
As shown in Fig. 6, the model evaluation and tuning plays an important role in the DL model. This stage includes the loss function in the training stage and precision/recall metrics in the validation and test stages. The loss functions-also called error function-calculate how far a model output deviates from its ground truth. It is composed of the classification loss (Lcls) and the bounding box regression loss (Lreg) (Lin et al. 2020).

Backbones and Pre-training
Backbones and pre-training in object detection models have been used for improvement in DL model results. A backbone describes a certain method for feature extraction of images in any CNN model like MobileNet (Howard et al. 2017), ResNet (He et al. 2016) or Visual Geometry Group-16 (VGG-16) network (Krizhevsky et al. 2017). A pretrained backbone can use its knowledge of extracting certain features out of the data for a new problem and accelerates training, especially when the new data set is similar to the one used during pre-training. A pre-trained backbone network has been applied in this study in order to improve the performance of DL models.

Classification Model Evaluation Metrics
In the case of a classification issue with two output classes (RHAGCE, non-RHAGCE), the prediction model's output is a probability that decides, which class the output is allocated to. There are four possible results of a classification prediction model, namely, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). This list of possible results defines an important metric for the model, called the confusion matrix (Tharwat 2021;Bradley 1997) (Table 2).
Precision and recall are the measures used to evaluate the model based on these four possible outputs. Precision [0, 1] quantifies how accurate the model is when it produces Here, TP means that the predicted label correctly estimates the ground truth label, TN denotes that the predicted label correctly estimates the absence of the ground truth label, FP means that the predicted label wrongly estimates a ground truth label, if it is present and FN indicates that the predicted label estimates a ground truth label, if it is absent.
In order to find the degree of overlap between predicted bounding box around the target insect individual and its associated annotated bounding boxes, the object detection methods use the intersection over union (IoU) metric, which is calculated by Eq. 3 (Ren et al. 2015).
IoU and precision-recall measures are used to compute average precision (AP), which is the most popular evaluation metrics for object detection models. It is defined as the area under the precision-recall curve (AUC-PR) evaluated at α threshold of IoU is equal to AP@α. The latter parameter has been calculated in the VOC2007 challenge (Everingham et al. 2015) by the following equation: where pint is the interpolated average precision and interpolated average precision (pint). For a class and a set of precision at a certain recall level (r), one sums the maximum precision at each rank level (e r), i.e., at every recall rank. This measure approximates the AUC-PR (Everingham et al. 2015). The main idea of using AP is to reduce the impact of the "wiggles" in the precision-recall curve. This parameter is shown in the form of AP@α, which means that the AP precision is defined at α threshold of IoU (Ren et al. 2015). In the present study, AP@0.5 and AP@0.75 was calculated for all DL models. If the number of classes (C) exceeded one, the AP will be the average value for all classes or mAP: AP i ; C = 1; 2; ::: Since all models run by an efficient computational system, frame per seconds (fps) was used to describe the performance of the model. FPS determines how quickly the object detection DL model processes images and generates the model output.

Insect Annotation Results
There are a total number of 4905 annotations for 140 images of the 2020 data set, more than 90% of the annotations belong to the 7 upper classes in terms of occurrence (at least 100 times). By eliminating the first two classes, "NOTHIN" and "BACKGR", the most frequent class was RHAGCE or European cherry fruit fly, which was observed in 60% of the sticky trap's images. Regarding the observation results in 2020, it was decided that in the 2021 images only the RHAGCE class will be considered as a target pest for cherry orchards (See Table 3). The annotation results clearly showed that the size of both datasets is limited, ergo, data augmentation techniques are needed to increase the performance and results of the DL models by creating additional and diverse instances for training datasets. Table 4 shows the results of data classification after applying augmentation techniques.

Deep Learning Models Evaluation
For the training, validating and testing of the DL models, the augmented 2021 data set was randomly divided into three sets, wherefore the length of the training and testing data sets are 977 and 108 images, respectively. The validation data consisted of the remaining 118 images. All training, testing and validating images were resized to 1000 × 1000 pixels to have identical image sizes. The training results of all models showed that after 30 episodes (even earlier) the amount of loss function stopped changing, ergo, the calculated training weights after 30 episodes have been applied for validating and testing stages. Fig. 7 shows the loss development graphs of some DL models. Each subgraph shows a different model and its learning rate. The summed loss is the sum of the loss for the bounding box regression and classification, respectively. Based on the model validation results, the appropriate learning rates for insect detection on the 2021 data set, were chosen to be 10 -2 , 10 -3 and 10 -5 with an AP of 0.9 (Fig. 8). After finding the appropriate learning rates, the model performance was measured on the test data set, where the goal was to find out the performance on the unseen data. For the learning rates, the results from training were limited to the best values, i.e. (10 -2 , ..., 10 -5 ). The models that were dropped due to a loss of "NaN" were also ignored during training because there are no weights available for testing. For testing, the weights after training for 30 episodes have been used. The average precision was calculated for the IoU thresholds 0.5 (AP@0.5) and 0.75 (AP@0.75) ( Table 5).
Regarding the training, validating and test results, the Faster R-CNN-MobileNet and RetinaNet models with learning rates of 0.01 and 0.001 were suitable DL algorithms for detecting European cherry fruit flies in yellow sticky traps images.
As Fig. 9 shows, the DL model detected European cherry fruit flies on sticky traps after 30 epochs, properly. The impact of the sliding window cropping (SW) data augmentation technique on performance of the DL model has been evaluated by cropping the original size images into the 300 × 300 pixels patches and predictions for all patches merged and mapped to the original size of the image. In this regard, a RetinaNet was trained with the learning rate of 0.001. The results show a recall and precision values of 1 and 0.9 on the validation data, respectively (see Fig. 10). Although the performance of the model improved, training the SW model is very slow and the performance of the computing system degrades noticeably. Moreover, the trained model on SW cropped data computes 0.34 frame per second (fps) on average, which is significantly smaller than DL models before applying the latter data augmentation method i.e. 11 to 16 fps.
Regarding the results of the 2021 DL models, Faster R-CNN and RetinaNet models have been chosen for multi- class object detection for the 2020 data set. After data augmentation, 732 images splitting as 588/71/73 images for training/validation/test stages were selected. For this data set only classes that occurred over 100 times (except "nothing" and "background") were considered. In training and validation stages, the Faster R-CNN ResNet model with learning rate, mAP@0.5 and fps of 0.01, 0.5 and 1.35, respectively, outperformed the other models. Moreover, the prediction results for Rhagoletis cerasi, Drosophila, Muscidae and Formicidae classes are shown in Table 6. The precision recall curves (Fig. 11) of Faster R-CNN ResNet model with learning rate of 0.01 for multi-class object detection clearly showed that the AP@0.5 for "RHAGCE" class is considerably higher than other individual classes. It is related to the fact that the DL models have been already trained on the same class using the 2021 data set.
The single-class data set had an output performance of around 9-16 frames per second (fps), which means that the model takes 111 millisecond (ms) to 62.4 ms on average for computing one input image. For on-site detection, this can be considered as a decent value providing almost realtime detection. However, when considering image-streaming of 30 fps, which is a typical frame rate for videos, the models are not fast enough to be used for what one can consider "live" detection with sufficient performance. Taking a closer look at the loss values over the episodes for the 2021 model, one notices that a smaller loss does not always correspond to better results. This is why it is also very important to consider metrics like the AP or plot the predicted Fig. 9 Results of the Faster R-CNN-MobileNet with learning rate of 0.01 for two test (prediction) images. Ground truth and predictions with confidences over 0.5 are green bordered. Predictions that have a confidence value below 0.5 are purple bordered predictions. As Kuzuhara et al. (2020) have already shown, increasing the dataset using augmentation methods could increase the model's performance. By adding more artificial data, the data set will get better balanced. In the present study, five different deep learning models were applied to detect cherry fruit flies on yellow sticky traps images and, consequently, choose the better models for multi-class insect detection in the study area. This is a complementary of Böckmann et al. (2021) study published in the literature. As indicated in Thenmozhi et al. (2021) and Wenyong et al. (2021), DL can be valuable modeling tools to help increase our understanding of the complex insect detection processes in pest management solution linked to the agricultural management practices.

Conclusions
This study shows different methods and approaches for applying computer vision to the problem of insect detection. The best deep learning (DL) model (Faster R-CNN) reach an average precision of around 0.88 for the single class data set. The speed performance for the single-class data set is around 9-16 frames per second during test time. Since the final goal is to develop a system that can automatically detect the type and number of insect pests on sticky traps, the results can be considered as a step towards real-time detection. The multi-class data set was limited to its most occurring classes and achieved a mean average precision of around 0.51. The best class (cherry fruit fly) in the multiclass object detection model achieved the average precision of 0.82. It can be concluded that the imbalance in the data   set causes performance differences between classes. This lack of proportion in datasets can be fixed by adding artificial data for the underrepresented classes in future studies.
The main achievement of current study is to successfully translate results of DL models into useful information for horticultural management. Furthermore, the results show possibilities of improving the model performance by increasing the dataset using sliding window data augmentation for single class. This method is time-consuming and only achieves around 0.34 fps which is far away from real-time. However, this method gives a different view on the problem and suggests tweaks like parallelization and merging of bounding boxes to increase the performance. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4. 0/.