1 Introduction

A general goal of vegetable and fruit growers is to achieve high crop yield. Since crop yields are strongly affected by insect pests that damage crops, farmers apply insecticides at scheduled times without taking the size of the pest population into consideration [1]. Spraying is the main control strategy against several insect pests; for example, approximately 70% of insecticide treatments in apple orchards are applied against the codling moth [2]. Instead of periodical spraying, a more optimized solution would be to use insecticides only when the insect pest population exceeds the economic threshold. To realize such a spraying strategy, an accurate pest population forecast is necessary. Such a forecast has not only significant environmental benefits (e.g. a smaller amount of insecticides) but also economic ones (e.g. saving money, manpower, etc.), because growers can apply insecticides at the right time to defend their crops.

To acquire quantitative information for pest density prediction, different types of traps can be used, such as light and pheromone-based traps [3, 4]. In the case of pheromone traps (the focus of this article), the pheromone substance attracts male insects to the trap, and when a pest enters the trap it remains stuck on the sticky paper. The sticky papers are then periodically changed and inspected by an expert who counts the number of insects found on them. This type of manual or “conventional” insect monitoring has several well-known disadvantages (e.g. it requires a skilled person, it is time consuming and expensive, etc.) which have been mentioned in several articles [5,6,7]. In addition, manual insect counting does not provide continuous feedback, so the insect pest population monitoring has a low temporal resolution. However, temporal resolution matters because if the number of caught insects cannot be obtained in time, quick intervention is impossible [8].

Due to the drawbacks of manual insect counting, researchers and their industrial partners have turned towards smart solutions. Recently, several embedded system-based automated traps, or Internet of Things (IoT) systems (edge devices plus the server side), have been developed with the support of machine learning for insect counting [9, 10]. Some of them provide real-time data while others provide off-line data for more precise treatments and interventions. In this article we will refer to the image capture device inside the trap as the sensing device.

Beyond the remote sensing devices, an accurate insect-counting method is also needed. Since insect counting can be seen as a special object detection problem, researchers realized that state-of-the-art one- or two-stage deep object detectors could be used efficiently for this task [6, 11, 12]. Zhong et al. [1] were among the first researchers to apply the You Only Look Once (YOLO) model, in a two-stage detection pipeline. In their work the first version of YOLO played an object proposal role while the object classifier was a Support Vector Machine (SVM). The reason for this unusual model pairing was the relatively small dataset available to them. With this approach, they measured 92.5% counting accuracy (correctly detected objects per all objects) on their test images. Later, Hong et al. [13] investigated the accuracy and inference time of several deep object detectors, including the Faster Region-based Convolutional Neural Network (R-CNN) and the Single Shot Multibox Detector (SSD), on manually collected sticky trap images. Beyond the trap images, the authors added 168 photos to their dataset to increase the number and type of negative samples in the “unknown” class. Not surprisingly, their investigation showed that the Faster R-CNN model had the highest mean average precision (mAP, 90.25%) and the longest decision time, while the SSD detector was the fastest, but its mAP was only 76.86%. The authors also mentioned that the object detector needs to be updated with remote sensed trap images, which include various environmental light effects, to be robust. Li et al. [14] compared Faster R-CNN, Mask R-CNN and YOLOv5 on some selected insect categories of the Baidu AI insect detection and IP102 datasets. Their experimental results showed that YOLOv5 is the recommended model in the case of the Baidu AI insect detection dataset because its accuracy was above 99%, while Faster R-CNN and Mask R-CNN reached approximately 98%. They explained this result by the homogeneous background of the Baidu AI images.

Even though the earlier results are very encouraging, in most cases the images were not remote sensed trap images. For illustration, Fig. 1 shows a manually taken sticky paper image and a remote sensed trap image where the differences are clearly visible.

Fig. 1

Manually taken sticky paper trap image (left) and remote sensed trap image (right)

The lack of remote sensed trap images is a general issue in the field of insect pest counting [10, 15]. For example, Li et al. [14] noted that the weakness of their model lies in its limited adaptability, because the images used for model training are significantly different from trap images taken by a remote sensing device. Cardoso et al. [16] claimed that there are significant differences between trap images taken manually in a controlled environment and remote sensed images. To obtain a bigger training dataset, Diller et al. [17] added manually taken images of sticky papers to their remote sensed images. Their laboratory environment consisted of the trap’s house, a camera and a yellow sticky board populated with insects. With this approach, several hundred images were generated under controlled conditions, with a significant contrast between the background and the target insects.

In order to artificially increase the training set size, researchers turned to data augmentation techniques. Data augmentation has been an important component of the learning chain, especially in cases where the available training dataset is limited, as in insect pest detection. Year by year, more data augmentation approaches appear in practice, and many of them have also been used in pest detection related articles [18]. Several articles can be mentioned where the authors applied image rotation, scaling, flipping, translation, brightness adjustment, and noise pollution to the training images [1, 4, 17, 19, 20]. The authors of the review [21] categorized data augmentation techniques into five classes: geometric transformations, noise injection, color space transformations, oversampling, and Generative Adversarial Network-based augmentation. Since moths are caught in different poses, the insect counter model needs to be rotation invariant; moreover, to handle size differences, the model also needs to be scale invariant. Geometric augmentation methods help to handle the pose discrepancies of caught insects, while photometric augmentation such as brightness and contrast adjustment helps to handle the texture differences of insects. Beyond the above-mentioned techniques, there are additional augmentation strategies such as random erase, mixup or mosaic augmentation that are also used in modern object detectors [22]. From the insect detection point of view, mosaic augmentation is especially useful because it helps to better handle the well-known “small object detection problem”. Its idea is to compose a new training image from four other images in specific ratios, as sketched below.
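To make the idea concrete, a heavily simplified mosaic composition is sketched below in Python/NumPy. It only stitches four equally sized quadrants and ignores the random split ratios and bounding-box remapping that real implementations (e.g. in YOLOv5) perform; all names are illustrative.

```python
import cv2
import numpy as np

def simple_mosaic(images, out_size=640):
    """Compose one training image from four source images (2 x 2 mosaic).

    Simplified sketch: each source image fills one fixed quadrant.
    A full implementation would pick a random split point and remap
    the bounding boxes of the source images accordingly.
    """
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    positions = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (top, left) in zip(images, positions):
        canvas[top:top + half, left:left + half] = cv2.resize(img, (half, half))
    return canvas
```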

The above-mentioned data augmentation techniques address scale invariance, texture differences, and the “small object detection” problem. However, they do not handle well the significant quality and illumination discrepancies between manually taken (high-resolution) and remote sensed trap images. To “attenuate” those differences we introduce three new data augmentation approaches, namely gamma correction, image compression, and bilateral image filtering-based data augmentation. The efficiency of the proposed augmentation approaches is demonstrated on the YOLOv5 and Faster R-CNN (with ResNet50 backbone) models, which are trained on manually taken high resolution trap images and tested on remote sensed trap images. The experimental results clearly show that the performance of both models improves significantly if the proposed augmentation techniques are also used for data enrichment. The YOLOv5 showed a spectacular improvement: the gamma correction-based augmentation increased its mean average precision (mAP) from 0.887 to 0.934 while its counting error decreased from 3.29 to 1.07.

2 Materials and methods

2.1 Object detector

Automated insect pest counting can be seen as an object detection problem, which is a prominent subject in computer vision. Most object detection problems involve detecting visual object categories like faces, humans, vehicles, etc. For those tasks, several detector algorithms are available, which belong to three categories: traditional computer vision-based methods like Viola-Jones [23], two-stage deep learning-based methods like the Fast and Faster R-CNN [24, 25], and single-stage deep learning-based methods like the members of the YOLO model family. Since the appearance of R-CNN (in 2014), CNN-based object detection has evolved at an unprecedented rate [26], and in the following years several single- and multi-stage deep object detector models have been developed. At present, the latest YOLO models are considered the state of the art due to their fast inference and accurate localization capabilities. These facts motivated us to use the Faster R-CNN and the YOLOv5 object detectors as reference models in this work.

YOLOv5 was released on GitHub by Glenn Jocher (Ultralytics) in 2020. It offers several object detector architectures, including the YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large) and YOLOv5x (extra-large) models, all of which are pre-trained on the Microsoft COCO dataset [27]. The main difference between the models is the number of convolutional layers. Since insect pest counting takes place in the sensing device in some systems, the inference time (due to the battery lifetime) and the computational resource requirement (e.g. available physical memory) are critical factors. Therefore, we have chosen YOLOv5s as the object detector, which is the smallest member of the YOLOv5 family apart from the nano architecture.
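As a quick illustration (not the training setup used in this work), the COCO-pretrained YOLOv5s weights can be loaded through PyTorch Hub and applied to a single image; the file name below is hypothetical.

```python
import torch

# Load the COCO-pretrained small variant from the Ultralytics repository.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on a single (hypothetical) trap image and print a summary.
results = model('trap_image.jpg')
results.print()
```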

2.2 Data augmentation

In machine learning, it is a generally accepted fact that the growth of (relevant) data contributes to an increase in the validation performance of the model. In addition, deep models demand a huge amount of data to fine-tune their weights. Therefore, data augmentation, or in other words artificial data enrichment, is an important component of the learning chain because it helps to extend the available training dataset. Its importance increases even more in cases where the available training dataset is limited, as in insect pest detection.

Remote sensed trap images acquired in the field are affected by a wide variety of illumination conditions due to the day-cycle light, weather conditions, and landscape elements that cause shadows [28]. In order to attenuate the illumination differences between the manually taken high resolution images and the remote sensed images we used gamma correction (also called the power-law transform). Gamma correction transforms the input image pixelwise according to formula (1), where f(x,y) is the scaled input pixel value (in the range 0 to 1) at coordinate (x, y) and c is the gain; both c and γ are positive numbers.

$$f^{*}(x,y)=c\,f(x,y)^{\gamma }$$
(1)
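As an illustration, formula (1) can be implemented in a few lines; the Python/NumPy sketch below, including the function name, is our own and assumes an 8-bit input image.

```python
import numpy as np

def gamma_correct(image, gamma, gain=1.0):
    """Power-law transform of Eq. (1) applied to an 8-bit image."""
    scaled = image.astype(np.float32) / 255.0      # f(x, y) in [0, 1]
    transformed = gain * np.power(scaled, gamma)   # c * f(x, y)^gamma
    return np.clip(transformed * 255.0, 0, 255).astype(np.uint8)
```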

Remote image capturing is also affected by oscillations due to the wind, which may result in worse image quality due to motion blur. In addition, there is a significant image quality difference between remote sensed and manually taken images due to the different spectral sensitivity, field of view, focusing, etc. of the cameras. Models that use high quality images for training tend to achieve worse results since they cannot deal with such variability [15]. To compensate for the image quality difference, we introduced bit-plane slicing and image smoothing as two additional augmentation techniques. In the RGB color representation the value of each channel is stored in 8 bits. These 8 bits can be considered as eight 1-bit planes, where the lower-order planes (least significant bit positions) carry the subtle intensity details of the image. Decomposing an image into bit planes is useful for investigating the importance of each bit position. Discarding the least significant bit (LSB) positions of the original representation can be seen as image compression, where we keep the “main features” of the image while the fine details are removed. Generally, removing the content of the last k LSB positions does not degrade the appearance of the original image significantly.
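A minimal sketch of this bit-plane-based “compression” is shown below (our own illustration, assuming 8-bit NumPy images): the k least significant planes are simply zeroed out with a bit mask.

```python
import numpy as np

def drop_lsb_planes(image, k):
    """Zero out the k least significant bit planes of an 8-bit image."""
    mask = np.uint8((0xFF << k) & 0xFF)   # e.g. k = 3 -> 0b11111000
    return np.bitwise_and(image, mask)
```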

To mimic the possible motion blur, the manually taken images have been smoothed. Blurring removes fine details from the original image, which can be beneficial in the case of high-resolution images because too much detail may lead to over-segmentation [15]. The degree of blurring is determined by the size of the kernel and the values inside it. However, a simple averaging or Gaussian kernel strongly damages edges. Therefore, we applied bilateral filtering, where the kernel takes into consideration not just the spatial but also the intensity relationship between neighboring pixels. Both relationships are modelled by Gaussian distributions, and changing their standard deviations controls the smoothing effect of the filter. More information about bilateral filtering can be found in [29].

2.3 Evaluation metrics

In insect pest detection and counting, an important question is how to measure the accuracy of the algorithm. In 2016, automated insect counting with deep object detectors was a relatively new research field and there was no standard protocol for the evaluation of insect counting algorithms [30]. Therefore, researchers adopted metrics from other fields of computer vision such as pedestrian detection. In computer vision, a generally accepted performance metric of object detectors is the mAP. It is the average of the average precision (AP) metric across all classes in the dataset, where AP is equal to the area under the precision-recall curve.

To construct the precision-recall curve, we need information about the true and false detections (proposed bounding boxes). The correctness of a detection can be judged with the Intersection-over-Union (IoU). IoU is the ratio of the overlapping area between the ground truth bounding box and the predicted bounding box to the area of their union. In most studies, if the IoU value is equal to or higher than 0.5, the proposed bounding box is considered a true positive; otherwise it is a false positive [19, 28, 31]. In this paper we also used mAP as the primary metric, with a 0.5 IoU threshold, to analyze the performance change of the models.
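For reference, a straightforward IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2) could look like the following sketch (illustrative, not taken from the evaluation code of this work).

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A proposed box counts as a true positive when iou(pred, gt) >= 0.5.
```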

Beyond mAP, other metrics are also available that better describe the counting method’s performance [10]. In this work, we used the simple error function (2) over all N test images, where cp contains the predicted numbers of caught insects on the test images while cr contains the real numbers of insects (ground truth boxes). Formula (2) can be seen as the average counting error of the algorithm. Although it can be positively biased when false negative and false positive detections cancel each other out, and it does not give any information about the localization accuracy of the insect counter model, it is a much clearer indicator for the final user.

$$e\left(\mathbf{c}^{p},\mathbf{c}^{r}\right)=\frac{1}{N}\sum_{i=1}^{N}\left|c_{i}^{p}-c_{i}^{r}\right|$$
(2)
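Formula (2) translates directly into code; the sketch below is illustrative and assumes the per-image counts have already been collected.

```python
def counting_error(pred_counts, true_counts):
    """Average absolute counting error over N test images, as in Eq. (2)."""
    assert len(pred_counts) == len(true_counts)
    n = len(true_counts)
    return sum(abs(p - r) for p, r in zip(pred_counts, true_counts)) / n
```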

3 Results and discussion

3.1 Model settings

In this work, we used the freely available Faster R-CNN (two-stage) and YOLOv5s (single-stage) object detector models to investigate the efficiency of the proposed data augmentation methods. Both have been trained with the stochastic gradient descent (SGD) method, where the minibatch size was 16 (aligned to the available GPU memory), the momentum was 0.9 and the weight decay was 0.0001. The initial learning rate was set to 0.001. As a stopping condition, we applied “no improvement in 20 epochs”. The maximum number of epochs was set to 500 in the case of YOLO and 100 in the case of Faster R-CNN. The experiments were executed on a notebook with an AMD Ryzen 9 5900HX 3.3 GHz processor, 24 GB physical memory and an Nvidia GeForce RTX 3060 GPU.
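For clarity, the optimizer settings above correspond to the following PyTorch configuration (a minimal sketch; the placeholder model stands for either detector network, whose construction is not shown here).

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)   # placeholder; stands for either detector network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,             # initial learning rate
    momentum=0.9,
    weight_decay=0.0001,
)
# Early stopping: training ends after 20 epochs without improvement,
# with at most 500 (YOLOv5s) or 100 (Faster R-CNN) epochs in total.
```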

3.2 Manually taken and remote sensed trap images

The core of our training dataset was 175 manually taken, high resolution trap images. These images were acquired by the Eközig Company, Debrecen, Hungary. The image capture circumstances varied: a few of the images were taken indoors while the others were taken in the field. The capture device was also not uniform; in most cases it was a smartphone camera, but professional cameras were also used.

The number of remote sensed test images is 36. In this case, the data capture framework was the same for all images: they were taken with a particular sensing device dedicated to remote insect pest monitoring. It consists of a plug-in board, a Raspberry Pi Zero W, and a Raspberry Pi Camera v2. A sample image of the sensing device can be seen in Fig. 2. In brief, the sensing device turns on every day during the observation period at 10:00, when there is sufficient light and a lower daily temperature. At start-up, the controller software synchronizes the real-time clock (RTC) and generates a timestamp from it; setting the time accurately is important for tracking the date of capture. After that, the Raspberry Pi controller unit takes an image of the sticky paper (located at the bottom of the trap house), which can be transmitted to the server side through the Long-Term Evolution (LTE) network. At the end, the controller software sets the next “wake up” time and starts the shutdown process. Additional details about the sensing device can be found in [32].
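The daily capture step could be reproduced with a few lines on the Raspberry Pi; the sketch below is a simplified illustration (the use of the picamera library, the file path and the file naming are our assumptions), and the RTC synchronization, LTE upload and scheduled shutdown of the actual controller software are omitted.

```python
from datetime import datetime
from picamera import PiCamera

# Timestamp derived from the (already synchronized) system clock.
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

camera = PiCamera()
camera.capture(f'/home/pi/captures/trap_{timestamp}.jpg')   # hypothetical path
camera.close()
```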

Fig. 2

Front (left) and back (right) sides of the remote sensing device

3.3 Data augmentation with the proposed methods

First, the original 175 manually taken training images were augmented with gamma correction. In the case of γ < 1, the mapping function (1) transforms a narrow range of small intensities into a wider range of output intensities; the effect is the opposite if γ > 1. Since we need both, the γ values range from 0.4 to 2.8 with a step size of 0.2, excluding 1.0. The visualization of the mapping functions can be seen in Fig. 3.

Fig. 3

Mapping functions of gamma correction

Gamma correction operates on a single channel, so it can be applied independently to the different channels of a colour image. Since the RGB colour histograms of the trap images showed similar intensity distributions in all channels, we applied the same mapping function to each channel. The effect of gamma correction on a sample trap image can be seen in Fig. 4. The gamma correction-based data augmentation produced 2100 new training samples from the initial 175 trap images.
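The arithmetic behind the 2100 samples is easy to verify; the short sketch below (illustrative) enumerates the γ values used.

```python
import numpy as np

# gamma = 0.4 ... 2.8 in steps of 0.2, excluding 1.0 (the identity mapping):
gammas = [round(g, 1) for g in np.arange(0.4, 2.81, 0.2) if abs(g - 1.0) > 1e-6]
print(len(gammas))          # 12 gamma values
print(len(gammas) * 175)    # 2100 augmented training samples
```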

Fig. 4

The effect of gamma correction on a sample image

In the second step, the bit-plane-based data augmentation was performed on the original 175 trap images. As can be seen in Fig. 5, even the two most significant bit positions alone carry a large amount of information about the content of the image. When additional bit positions are kept, the difference between the original and the bit-reduced image is almost invisible to the human eye. With this type of augmentation, 1050 new training samples were generated from the original trap images.

Fig. 5

Bit reduced versions of a sample image

The last data augmentation was performed with bilateral filtering. The bilateral kernel has three parameters: the kernel size, the standard deviation of the spatial distribution (σs), and the standard deviation of the intensity distribution (σc). The values of those parameters are application dependent. For simplicity, we tried to minimize the number of parameter combinations. Based on our preliminary investigations, the kernel size was fixed at 31 × 31 pixels while σs ∈ {5, 10} and σc ∈ {50, 100, 150}. The effect of bilateral filtering on a sample image can be seen in Fig. 6.
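The six parameter combinations can be generated as in the sketch below (our own illustration using OpenCV; the input file name is hypothetical).

```python
import itertools
import cv2

image = cv2.imread('sticky_trap.jpg')   # hypothetical input image

# Fixed 31 x 31 neighbourhood, two spatial sigmas, three intensity sigmas
# -> 6 smoothed variants per original training image.
for sigma_s, sigma_c in itertools.product((5, 10), (50, 100, 150)):
    smoothed = cv2.bilateralFilter(image, 31, sigma_c, sigma_s)
    # arguments: source, neighbourhood diameter d, sigmaColor, sigmaSpace
    cv2.imwrite(f'aug_bilateral_s{sigma_s}_c{sigma_c}.jpg', smoothed)
```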

Fig. 6

Bilateral filtering of a trap image with three different filter parameters

The effects of the above-described data augmentation methods on the YOLOv5’s and Faster R-CNN’s mAP and counting error (2) are summarized in Tables 1 and 2, respectively. As references, both models were also trained on the original manually taken trap images without the proposed augmentation methods. It is worth mentioning that the models’ training chains already incorporate geometric and photometric augmentations. When one of the proposed data augmentations was used (in addition to the original trap images), the newly generated images were also part of the training set.

Table 1 The effect of the proposed data augmentation methods on the YOLOv5’s performance
Table 2 The effect of the proposed data augmentation methods on the Faster R-CNN’s performance

The results in Tables 1 and 2 show that all proposed image augmentation techniques increase the models’ mAP value and decrease (or do not change) their counting error. Out of the three augmentation techniques, the gamma correction-based one was the most efficient for both models. Although the improvement tendency for each metric was similar for both models, the magnitude of the improvement was more spectacular for YOLOv5. The gamma correction-based augmentation increased the YOLOv5’s mAP from 0.887 to 0.934 and reduced its counting error from 3.29 to 1.07. This is a huge improvement because it means that the model’s average insect count prediction differs from the true insect count by approximately one. A visual demonstration of this high detection accuracy can be seen in Fig. 7, where all caught insects have been correctly localized and recognized in both images.

Fig. 7

Detected insects with the trained YOLOv5s on remote sensed trap images

The combined use of the augmentation approaches brought another interesting result. Even though the training set was bigger after the combined use of the bilateral and gamma augmentations, the mAP and the counting error of the models degraded compared to the solo use of the gamma correction-based augmentation. In the case of YOLOv5, the mAP was 0.919 when all three augmentation approaches were applied, which is noticeably smaller than the mAP of the model with the gamma augmentation only. On the other hand, the counting error was only 1.0, which is the smallest among all augmentation approaches.

Although an increased training set makes the model more resistant to overfitting, our results showed that involving more and more data augmentation methods does not guarantee performance improvement. In the case of excessive augmentation, the original training set becomes only a small portion of the augmented set, where a huge number of images do not contain useful information but may add extra noise. Finally, we can also observe a negative correlation between mAP and counting error (Fig. 8).

Fig. 8

Scatter plot of mAP and counting error

4 Conclusion

In this paper we proposed three data augmentation methods to increase the training dataset and to attenuate the quality difference between manually taken high resolution and remote sensed insect trap images. To demonstrate that the proposed data augmentation approaches bring further performance improvement in addition to the well-known (e.g. geometric, mosaic, etc.) augmentation techniques, the YOLOv5s and Faster R-CNN object detector models have been used. The change in the models’ performance was measured with the mAP and the average counting error metrics. The experimental results on our trap images showed that each proposed data augmentation method increased the mAP and decreased the counting error. The most efficient augmentation approach was the gamma correction-based one, which increased the mAP of the YOLOv5s model from 0.887 to 0.934 while it decreased the counting error from 3.29 to 1.07; the counting error dropped to roughly a third of its original value, which is a huge improvement. The highest mAP values were 0.934 and 0.893 with YOLOv5 and Faster R-CNN, respectively. In other similar works, the authors achieved 0.886 mAP with Faster R-CNN (ResNet50) [13] or reported 0.762 and 0.812 mAP with Faster R-CNN and YOLOv5(s) [16]. Although the datasets in those articles were not the same as ours, this comparison also indicates the high localization capability that can be achieved with the proposed augmentation approaches.

Surprisingly, the combined usage of the three proposed augmentation approaches did not bring significant improvement in either mAP or counting error compared to the solo gamma-based augmentation. This observation raises questions because a generally accepted rule of thumb says that increasing the dataset size increases the generalization capability of the model; however, as our results show, this is not always the case. In our opinion, the efficiency of the combined application of data augmentation techniques depends on the type of the problem and on the ratio of the original and the augmented data sizes. Accordingly, a small subset of the data augmentation techniques may achieve a higher performance improvement than using many (relevant) augmentation techniques. Unfortunately, the optimal combination of data augmentation techniques is not known in this research field; further investigations are necessary to get a clearer picture.