Introduction

Over the past 50 years, the global population has grown at an unprecedented rate, making it a significant challenge to ensure food security through increased yields of major cereals such as wheat, rice, and maize1. Wheat plays a critical role as a staple of the human diet and a primary feed source for domesticated animals worldwide. Given ongoing urbanization and shifting consumption patterns, wheat production is predicted to need to increase by 60% by 2050. Researchers have recently devoted growing attention to wheat growth monitoring, health assessment, and plant breeding. In the breeding process, the number of wheat heads per unit area is a key trait that directly determines yield potential2. However, accurately counting wheat heads in the field is a labor-intensive and time-consuming task that still relies on manual observation, so accurate detection and automated counting of wheat heads with new technologies has become crucial. With rapid advances in deep learning and computer vision, image detection based on artificial neural networks shows tremendous potential as a fast, accurate, and cost-effective solution for wheat head detection and counting.

The emergence of smartphones and unmanned aerial vehicles (UAVs) equipped with affordable digital cameras has made in-field images readily available. Consequently, several large, well-annotated datasets for wheat head detection and yield estimation have been proposed, and deep learning methods now offer an alternative to traditional manual measurement. Using the Global Wheat datasets, two worldwide competitions have been held: the Global Wheat Head Detection 2020 challenge hosted on Kaggle and the Global Wheat Challenge 2021 hosted on AIcrowd. These competitions attracted 2245 and 563 teams, respectively, and produced more accurate and robust algorithms. Most of the solutions rely on common object detection methods that annotate each wheat head with a horizontal bounding box. Models developed for wheat head detection fall into one-stage and two-stage detectors. One-stage detectors, such as YOLO (You Only Look Once)3 and SSD (Single Shot MultiBox Detector)4, directly predict the bounding boxes and class probabilities of objects in a single pass over the input image; they tend to be faster and better suited to real-time applications, but may sacrifice accuracy on small and overlapping objects. Two-stage detectors, such as R-CNN and Faster R-CNN, first generate a set of region proposals, then refine them with bounding box regression and classify them using a convolutional neural network (CNN). Two-stage detectors typically achieve higher accuracy by leveraging more complex architectures and multi-stage processing pipelines, but are slower in comparison.

With the active contributions of competition participants, significant progress has been made in wheat head detection using algorithms based on horizontal bounding boxes. However, there are still challenges in precise object representation and robust detection of wheat heads. These challenges are illustrated in Fig. 1 and arise due to multiple factors.

1. Orientation: Wheat heads tend to grow towards sunlight or in a direction influenced by phototropism or natural wind, and the downwash generated by UAVs during image acquisition can also change their orientation. Accurately detecting wheat heads therefore requires oriented object detection algorithms that account for this orientation.

2. Aspect ratio and overlap: Wheat heads have a distinctive long, narrow shape, giving them a high aspect ratio, and they often grow in dense clusters that overlap one another. Figure 1b shows the same wheat heads annotated with horizontal and with oriented bounding boxes. Detection methods based on horizontal bounding boxes may struggle to locate oriented wheat heads accurately, including more background regions and representing elongated wheat heads imprecisely.

3. Variations: Wheat varieties worldwide vary in shape, size, and color; illumination conditions such as shadows, uneven lighting, or varying intensities can make traditional detection methods less reliable; and the appearance of wheat heads changes with maturity stage. All of these factors complicate accurate detection and counting.

Figure 1

The challenging wheat scenarios due to diverse factors: (a) orientation caused by phototropism or wind; (b) box location and overlap, with wheat targets labeled by red horizontal and blue oriented boxes, respectively; (c) variations in variety, illumination, and maturity.

To address these challenges, it is crucial to explore alternative object detection methods that handle the complexities of wheat head detection. Notably, wheat heads in field images, like ships and vehicles in aerial images, are small, densely packed, rotated objects with large aspect ratios. Drawing inspiration from oriented object detection in aerial images5, we advocate the use of oriented bounding boxes for wheat head detection. This approach represents objects more precisely by incorporating orientation information into the detection process, facilitating accurate identification and localization of wheat heads. With oriented detection, farmers and researchers can make better-informed decisions on crop management, yield estimation, and resource allocation.

Research has demonstrated that improving the size, diversity, and quality of a dataset is more effective than increasing the complexity and depth of the network6. Since wheat head detection extends beyond any single region or country, a model must be able to identify wheat heads across varied environments. In this study, we re-annotate wheat images from the GWHD dataset7 with oriented bounding boxes. This allows precise localization of the wheat heads while excluding unnecessary background and unrelated objects that might impede accurate identification.

The main contributions are summarized in the following points:

1. A large-scale rotated wheat head dataset, RGWHD, is introduced, with images manually annotated using oriented bounding boxes (OBBs). RGWHD provides a benchmark for the detection of small, dense, rotated wheat heads.

2. We propose a novel wheat head detection model, the Oriented Feature Pyramid Network (OFPN). It uses ResNeXt as the backbone for feature extraction and a Path-aggregation and Balanced Feature Pyramid Network (PBFPN) for effective fusion of multi-scale features. OFPN also employs the Soft-NMS algorithm to filter and refine proposal bounding boxes, improving detection of overlapping wheat heads.

3. We conduct extensive experiments comparing OFPN with six state-of-the-art rotated object detection networks. The results demonstrate the superiority of OFPN in rotated object detection accuracy, wheat category recognition, and wheat head counting.

Related work

Object detection based on horizontal bounding boxes

Object detection is a fundamental task in computer vision that involves identifying and locating objects of interest within images. In addition to recognizing the presence of objects, accurate localization is achieved by marking their boundaries with horizontal bounding boxes. Object detection algorithms can be broadly classified into two types: one-stage8 and two-stage9. Two-stage algorithms, such as Faster R-CNN10, first generate a set of object candidates, called object proposals, using a dedicated proposal generator such as the Region Proposal Network (RPN)11, and then perform classification and bounding box regression. One-stage algorithms such as YOLO and SSD skip the proposal generation step and directly use a convolutional neural network to extract features for object classification and bounding box regression. Although one-stage algorithms are faster, they typically exhibit lower accuracy than two-stage algorithms.

Convolutional neural networks (CNNs) are deep learning models designed for image recognition and analysis, and their application has revolutionized precision agriculture. Researchers have successfully used CNNs to automatically count and monitor wheat heads, a process crucial for estimating crop yield and making informed decisions about irrigation, fertilization, and harvesting12. Lu proposed TasselNet13, a deep convolutional neural network for accurately counting maize tassels in unconstrained field environments. TasselNet uses a local-counts regression architecture to address challenges such as in-field variations, achieving excellent adaptability and high precision in maize tassel counting.

Fares Fourati14 developed a robust model combining the Faster R-CNN and EfficientDet architectures, emphasizing the proposed final architectures and leveraging semi-supervised learning techniques to improve earlier object detection models. Submitted to the Global Wheat Challenge on GWHD, this approach ranked in the top 6% of the competition. To address the labor-intensive data collection in wheat breeding, S. Khaki15 proposed WheatNet, a lightweight model built on a truncated MobileNetV2 with point-level annotations; it is robust and accurate in counting and localizing wheat heads across different environmental conditions.

M. Hasan16 proposed a region-based convolutional neural network to accurately detect, count, and analyze wheat spikes for yield estimation under three fertilizer treatments. Their approach was tested on SPIKE, an annotated wheat dataset comprising ten wheat varieties with images captured by high-definition RGB cameras mounted on a land-based imaging platform. Wen17 used the GWHD dataset and introduced a novel wheat head detector, SpikeRetinaNet, which achieved outstanding detection performance.

Based on the GWHD dataset, the WheatLFANet model proposed by Ye, J.18 operates efficiently on low-end devices while maintaining high accuracy and utility. Jun, S.19 proposed WHCnet, which uses Augmented Feature Pyramid Networks (AugFPN) to aggregate feature information, cascaded Intersection over Union (IoU) thresholds to remove negative samples and improve training, and a novel detection-pipeline counting method to count wheat sheaves from a top view in the field. Zhou, Q.20 proposed the NWSwin Transformer to extract multiscale features, together with a Wheat Intersection over Union loss incorporating Euclidean distance, area overlap, and aspect ratio, leading to better detection accuracy. Wang, Y.21 introduced the convolutional block attention module (CBAM) into the backbone network so that the model attends more to wheat ear regions, improving detection results.

However, the solutions above are primarily based on horizontal bounding boxes, limiting their capability and robustness in detecting small, dense wheat heads with varying orientations. More importantly, they do not fuse feature information between non-adjacent feature maps, so some feature information is lost.

Object detection based on oriented bounding boxes

In contrast to detection methods using horizontal bounding boxes, object detection techniques based on oriented bounding boxes offer advantages in terms of improved accuracy and better representation. Oriented object detection has numerous applications in computer vision and image analysis, proving valuable in various scenarios such as text detection and recognition, object detection in aerial imagery, autonomous driving, and medical imaging.

Oriented object detection algorithms can be categorized into anchor-based methods, which use a number of anchors with fixed scales and aspect ratios, and anchor-free methods, which are based on points. The anchor-based approach is used in the following models. Y. Jiang22 developed R2CNN for text detection, an innovative model based on Faster R-CNN that extracts features using different pooled sizes while simultaneously predicting the text score, the axis-aligned box, and the inclined minimum-area box. X. Xie23 proposed Oriented R-CNN, whose novel oriented RPN rapidly generates high-quality oriented proposals while maintaining high detection accuracy and efficiency comparable to one-stage oriented detectors.

The models mentioned below are all based on anchor-free boxes. G. Cheng24 designed the Anchor-free Oriented Proposal Generator (AOPG), which generates oriented proposals instead of using sliding fixed-shape anchors on images. X. Wang25 developed PP-YOLOE-R, an efficient anchor-free rotated object detector. It incorporates several tricks to improve detection accuracy, including ProbIoU26 and a decoupled angle prediction head.

Zhonghua Li27 introduced FCOSR, an innovative rotated target detector that builds upon FCOS28 and utilizes a 2-dimensional Gaussian distribution to enable rapid and accurate prediction of objects. Drawing inspiration from object detection methods employing oriented bounding boxes, we anticipate that this approach will effectively address challenging aspects of wheat head detection, including the handling of oriented wheat heads and the overlap between predicted bounding boxes.

Datasets

The available public datasets for wheat head detection include GWHD, SPIKE, ACID29, UWHD30, and WED31. The GWHD dataset is a comprehensive collection of well-annotated wheat head images compiled by nine research institutions across seven countries; it serves as a valuable resource for training robust models to estimate the location and density of wheat heads across seven categories, and can be accessed at https://www.kaggle.com/competitions/global-wheat-detection/data. The SPIKE dataset comprises 335 images captured at three distinct growth stages, covering ten wheat varieties. The UWHD dataset consists of 550 images captured by a drone at an altitude of 10 m. The ACID dataset consists of 520 images taken under controlled greenhouse conditions, featuring 4158 wheat heads labeled with point annotations. The WED dataset contains 236 high-resolution images with 30,729 wheat heads, from which the WEDU32 dataset with more accurate labeling information was derived. For the specifics of each dataset, see Table 1.

Table 1 Specifics of public wheat head datasets and the proposed RGWHD: most existing public wheat datasets are labeled with horizontal bounding boxes, whereas the proposed RGWHD dataset is labeled with oriented bounding boxes.

As the majority of existing datasets are labeled with horizontal bounding boxes, we constructed a new dataset, RGWHD, based on GWHD to address the challenges posed by oriented wheat heads. We randomly selected 100 images from each of the seven categories within the GWHD dataset, for a total of 700 images, making RGWHD comparable in size to most existing wheat head datasets. Given the complexity of image annotation, we relabeled this sample of GWHD images and left the remaining images for future research using weakly supervised learning. The roLabelImg annotation tool was used to annotate each image with oriented bounding boxes described by five parameters (x, y, w, h, θ), where x and y are the coordinates of the center point, w and h are the width and height of the wheat head, and θ is the rotation angle of the bounding box. The resulting oriented-bounding-box-labeled RGWHD was randomly divided into training, validation, and test sets in a ratio of 7:1:2 (i.e., 8:2 between the combined training-validation data and the test data). The number of wheat ears in each split is shown in Table 2.

Table 2 Details of training-test set division.

The dataset proposed in this study, RGWHD, represents the wheat head more accurately by using tighter bounding boxes. To quantify this difference, we compared the average area occupied by horizontal and oriented bounding boxes across all RGWHD images. The notable contrast in area between the two types of bounding boxes is presented in Table 3, while Fig. 2 visually illustrates the horizontal and oriented annotations. The average area is calculated with the following formula, providing a standardized measure for comparison.

$$A = \frac{T_{a}}{N}$$
(1)

where \(A\) represents the average area (in pixels) of each bounding box, \(N\) is the number of bounding boxes, and \(T_{a}\) is the total area of all the bounding boxes in the image.
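As a concrete illustration of Eq. (1), the minimal sketch below computes the average area for a list of oriented boxes; since rotation does not change a rectangle's area, each box contributes w × h pixels. The function and sample values are illustrative, not part of the RGWHD tooling.

```python
def average_box_area(boxes):
    """Average area (Eq. 1) of oriented boxes given as (x, y, w, h, theta).

    The rotation angle does not affect the area, so each box
    contributes w * h pixels to the total area T_a.
    """
    if not boxes:
        return 0.0
    total_area = sum(w * h for _, _, w, h, _ in boxes)
    return total_area / len(boxes)

# Example: two hypothetical annotated wheat heads in one image
print(average_box_area([(100, 120, 80, 20, 0.3), (300, 210, 90, 25, -0.8)]))
```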

Figure 2

Visualization of the difference in area between horizontal and oriented bounding boxes: (a) original image, (b) annotated with horizontal bounding boxes, (c) annotated with oriented bounding boxes, (d) annotated with both horizontal and oriented bounding boxes.

The statistical results presented in Table 3 demonstrate a significant reduction in the average area occupied by the oriented bounding box (OBB) compared to the area occupied by the horizontal bounding box (HBB) across the seven wheat categories. Among these categories, the average relative proportion of area reduction reaches 49.51%, with arvalis_2 showing the highest proportion of area reduction at 62.63%. These findings highlight the effectiveness of using oriented bounding boxes in mitigating the challenges posed by overlapping objects and achieving a more accurate representation of the wheat head.

Table 3 Average area comparison of horizontal and oriented bounding boxes for wheat heads in RGWHD: quantitative comparison of wheat-ear areas between oriented and horizontal boxes under identical conditions.

Methods

Network architecture

In light of the challenging field conditions for wheat head detection, this study builds on the two-stage, high-precision Faster R-CNN algorithm. Figure 3 depicts the Oriented Feature Pyramid Network (OFPN) built upon Faster R-CNN. OFPN comprises three interconnected components for detecting oriented wheat heads in an image: a feature extraction network, a region proposal network, and a detection network. The feature extraction network uses a convolutional neural network (the backbone) to produce multi-scale features from the input image; to improve the representation of small, densely packed wheat heads, a feature fusion module called PBFPN merges features from different levels. The region proposal network generates oriented proposals with a regression branch, while a classification branch determines whether each proposal represents a foreground object. The detection network applies an oriented R-CNN head for the final classification and refinement of proposal positions.

Figure 3

Architecture of the Oriented Feature Pyramid Network (OFPN): Leveraging ResNeXt as the backbone, PBFPN as the feature fusion module, and Soft-NMS for proposal bounding box filtering.

Orientation box definition

In practice, an oriented bounding box can be viewed as a horizontal bounding box rotated clockwise or counterclockwise around its center; along with the center coordinates and side lengths, a rotation angle represents the extent of rotation. Detecting rotated objects can therefore be cast as parametric regression of oriented bounding boxes. There are two main representations: the five-parameter method, which uses an explicit rotation angle, and the eight-parameter method, which uses the coordinates of the four vertices of a quadrilateral as implicit rotation parameters. This study adopts the five-parameter method with the long-side definition, as illustrated in Fig. 4. The center coordinates of the oriented bounding box are denoted x and y, and the width and height are w and h, with w greater than h. The rotation angle \(\theta\) is measured between the long side (width) and the x-axis, with clockwise as the positive direction, over the range \([-\pi/2, \pi/2)\).
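The long-side convention can be enforced programmatically. Below is a minimal Python sketch of how such a normalization could look (our illustration, not code from the paper): it swaps the sides when the height exceeds the width and wraps the angle into \([-\pi/2, \pi/2)\).

```python
import math

def to_long_side(x, y, w, h, theta):
    """Normalize an oriented box to the long-side definition used here:
    w >= h, with theta measured from the x-axis to the long side
    (clockwise positive) and constrained to [-pi/2, pi/2)."""
    if w < h:
        # Swap sides; the reference edge rotates by 90 degrees.
        w, h = h, w
        theta += math.pi / 2
    # Wrap the angle into the half-open interval [-pi/2, pi/2).
    theta = (theta + math.pi / 2) % math.pi - math.pi / 2
    return x, y, w, h, theta

# Example: a "tall" box is rewritten in long-side form
print(to_long_side(50, 50, 20, 80, 0.0))  # -> (50, 50, 80, 20, -pi/2)
```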

Figure 4

Oriented bounding box representation with the long-side definition: the long side is used as the width of the bounding box.

Backbone

In this study, ResNeXt was chosen as the backbone for its powerful and efficient design33. ResNeXt introduced innovative techniques such as group convolution and the cardinality block. Group convolution applies independent kernels to each group of input channels, enabling parallel computation and reducing computational complexity. The cardinality block aggregates a set of transformations with the same topology, increasing the number of parallel pathways within each block; these parallel pathways allow ResNeXt to learn more diverse and complex features. The key difference between ResNet and ResNeXt lies in the structure of their repeatable residual blocks, as illustrated in Table 4.
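For illustration, here is a minimal PyTorch sketch of a ResNeXt-style bottleneck with cardinality 32; the channel widths are illustrative assumptions and do not reproduce the exact blocks of Table 4.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Minimal ResNeXt bottleneck: the 3x3 convolution is split into
    32 parallel groups (cardinality = 32), as in Table 4."""

    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            # Grouped 3x3 convolution: independent kernels per group.
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection around the grouped bottleneck.
        return self.relu(x + self.block(x))

# x = torch.randn(1, 256, 56, 56); ResNeXtBlock()(x) keeps the same shape.
```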

Table 4 Block structure of ResNet and ResNeXt (cardinality = 32). ResNeXt incorporates a group convolution with a group size of 32 in the repeatable residual block structure.

Multi-scale feature fusion balanced pyramid (PBFPN)

Feature pyramid networks (FPN)34 have become a common module in object detection for their ability to detect objects at various scales. By fusing feature maps from different scales, the model can gather more information on the object's position and semantics, thereby significantly enhancing detection accuracy.

For wheat head detection, the uneven distribution and overlap of wheat heads within an image make it difficult to extract and represent wheat head features, which in turn degrades the accuracy and performance of object detection models. To tackle this issue, we propose PBFPN, an enhanced FPN based on the Path Aggregation Network (PANet) and the Balanced Feature Pyramid (BFP).

First, PBFPN constructs two parallel network pathways to capture multi-scale features. The top-down pathway gradually upsamples the feature maps extracted by the backbone from high level to low level. The bottom-up pathway aggregates features from higher to lower resolution, combining its own features with the corresponding features from the top-down pathway through lateral connections. Before the final level of the bottom-up pathway, we employ an Atrous Spatial Pyramid Pooling (ASPP) module to expand the receptive field with different dilation rates. This enables the network to better understand wheat heads at various scales, improving detection accuracy and robustness; a sketch of such a module follows.
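The following minimal PyTorch sketch shows one way an ASPP module can be built; the dilation rates and the 1x1 projection are illustrative assumptions, as the text above does not specify them.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling: parallel 3x3
    convolutions with different dilation rates enlarge the receptive
    field at a single feature level (rates here are illustrative)."""

    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # padding = dilation keeps the spatial size unchanged.
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                          bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        # Fuse the concatenated branches back to the input width.
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```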

Next, we introduce the Balanced Feature Pyramid (BFP) module to address the difficulty of fusing features across non-adjacent levels. BFP takes the multi-level features produced by PANet as inputs, uses interpolation and max-pooling to rescale the feature maps of the different levels to a single medium size, integrates them, and finally rescales the integrated features back through the reverse process to strengthen the original features. The integration is given by:

$$C = \frac{1}{L}\sum_{l = l_{\min}}^{l_{\max}} C_{l}$$
(2)

where \(C_{l}\) denotes the feature map of the lth level (rescaled to the medium resolution) and \(L\) denotes the total number of feature levels. This module balances the contribution of the multi-level features generated by PANet, improving the detection of objects of different sizes and scales. In summary, we propose a more efficient feature pyramid network named PBFPN, as illustrated in Fig. 5; a minimal sketch of the balancing step is given below.
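The sketch below illustrates the rescale-integrate-redistribute idea of Eq. (2) in PyTorch. For brevity it uses interpolation in both directions, whereas the module described above uses max-pooling for downscaling; the residual redistribution is our illustrative reading, not the paper's code.

```python
import torch.nn.functional as F

def balanced_features(features):
    """Balanced Feature Pyramid step (Eq. 2): resize all levels to the
    resolution of a middle level, average them, then redistribute the
    integrated map back to every level as a residual enhancement."""
    mid = len(features) // 2                  # index of the medium level
    size = features[mid].shape[-2:]
    resized = [F.interpolate(f, size=size, mode='nearest') for f in features]
    integrated = sum(resized) / len(resized)  # C = (1/L) * sum_l C_l
    return [f + F.interpolate(integrated, size=f.shape[-2:], mode='nearest')
            for f in features]
```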

Figure 5

Structure of PBFPN: an ASPP module is incorporated into PANet and combined with the BFP module to enhance object features across multiple scales.

Soft-NMS

Accurate detection of wheat heads in the presence of overlap is a significant challenge, particularly when selecting proposal bounding boxes. In dense scenes, overlap, cluttered backgrounds, and variations in wheat appearance can lead to multiple bounding boxes for the same wheat head. Traditionally, the Non-Maximum Suppression (NMS) algorithm selects the bounding box with the highest confidence score and suppresses all other overlapping proposals, but this can discard potentially valid bounding boxes. To address this limitation, the Soft-NMS algorithm proposed by Bodla35 assigns lower scores to overlapping bounding boxes instead of eliminating them outright. By reducing rather than zeroing the scores of overlapping proposals, Soft-NMS retains more bounding boxes, giving better coverage of objects and lowering the risk of discarding valuable detections. The degree of suppression grows with the Intersection over Union (IoU) between a proposal and the highest-scoring box, allowing a more nuanced selection process.

Here, \(B\) is the set of initial detection boxes, \(S\) the set of their scores, and \(D\) the set of final detection boxes; \(N_{t}\) is the IoU threshold, and \(M\) denotes the prediction box with the highest score among all prediction boxes.

The NMS algorithm re-scores the neighbor proposals of the highest-scoring detection \(M\) with the following function:

$$s_{i} = \begin{cases} s_{i}, & iou(M, b_{i}) < N_{t} \\ 0, & iou(M, b_{i}) \ge N_{t} \end{cases}$$
(3)

Soft-NMS instead decays the confidence scores of overlapping boxes, either linearly (Eq. 4) or with a Gaussian penalty (Eq. 5):

$$s_{i} = \begin{cases} s_{i}, & iou(M, b_{i}) < N_{t} \\ s_{i}\left(1 - iou(M, b_{i})\right), & iou(M, b_{i}) \ge N_{t} \end{cases}$$
(4)
$$s_{i} = s_{i}\, e^{-\frac{iou(M, b_{i})^{2}}{\sigma}}, \quad \forall b_{i} \notin D$$
(5)

These equations show that the penalty depends on the IoU of the prediction bounding boxes: when the overlap is below the threshold, the penalty remains inactive; when it exceeds the threshold, the penalty diminishes the confidence score of the corresponding prediction bounding box. A sketch of the Gaussian variant follows.
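The minimal sketch below implements the Gaussian variant (Eq. 5); the oriented-box IoU is delegated to an external `iou_fn`, and `sigma` and `score_thr` are illustrative hyper-parameters rather than the paper's settings.

```python
import numpy as np

def soft_nms_gaussian(boxes, scores, iou_fn, sigma=0.5, score_thr=0.05):
    """Gaussian Soft-NMS sketch (Eq. 5). `boxes` holds oriented boxes and
    `iou_fn(a, b)` returns the IoU of two boxes."""
    scores = np.asarray(scores, dtype=float).copy()
    order = list(range(len(scores)))
    keep = []
    while order:
        m = max(order, key=lambda i: scores[i])  # highest-scoring box M
        order.remove(m)
        keep.append(m)
        for i in order:
            # Decay instead of discarding: the penalty grows with IoU.
            scores[i] *= np.exp(-iou_fn(boxes[m], boxes[i]) ** 2 / sigma)
        order = [i for i in order if scores[i] > score_thr]
    return keep, scores
```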

Loss

During the RPN phase, positive and negative proposals are labeled 1 and 0, respectively. A proposal is positive if either (i) its anchor has an Intersection over Union (IoU) above 0.7 with any ground-truth bounding box, or (ii) its anchor has the highest IoU with a ground-truth bounding box and that IoU is greater than 0.3. Negative proposals are anchors with an IoU below 0.3 with every ground-truth bounding box. Samples that are neither positive nor negative are ignored during training. In the second stage, for each Region of Interest (RoI), a proposal is considered positive if its IoU with the ground-truth bounding box is greater than 0.5, and negative if it is less than 0.5. The multi-task loss function for each proposal is then defined as:

$$F(p, t, v, v^{*}) = \frac{1}{N}\sum_{n=1}^{N} F_{cls}\left(p_{n}, t_{n}\right) + \lambda\, \frac{1}{N}\sum_{n=1}^{N} F_{reg}\left(v_{n}, v_{n}^{*}\right)$$
(6)

where \(N\) is the number of proposal bounding boxes and \(F_{cls}\) is the cross-entropy loss for the classification task; \(p_{n}\) is the prediction probability produced by the softmax function, indicating whether the anchor is a wheat head or background, and \(t_{n}\) is the anchor's class label, taking the value 0 or 1.

$$F_{cls} \left( {p_{n} ,t_{n} } \right) = - \log p_{n}$$
(7)

\(F_{reg}\) is the \(smooth_{L_1}\) loss for the localization regression task; the trade-off between the two terms is controlled by the balancing parameter \(\lambda\).

$$F_{reg}\left(v_{n}, v_{n}^{*}\right) = \sum_{i \in \{x, y, w, h, \theta\}} smooth_{L_{1}}\left(v_{i} - v_{i}^{*}\right)$$
(8)
$$smooth_{L_{1}}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
(9)

where \(v = (v_{x}, v_{y}, v_{w}, v_{h}, v_{\theta})\) is the ground-truth bounding box regression tuple containing the center coordinates, the width and height, and the rotation angle, and \(v^{*} = (v_{x}^{*}, v_{y}^{*}, v_{w}^{*}, v_{h}^{*}, v_{\theta}^{*})\) is the predicted tuple for the target.
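The sketch below expresses Eqs. (6)-(9) in PyTorch under the assumption of a two-class head (wheat head vs. background); the tensor shapes and the use of `F.cross_entropy` are our reading for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """Eq. (9): quadratic for |x| < 1, linear otherwise."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multi_task_loss(p, t, v, v_star, lam=1.0):
    """Eq. (6): cross-entropy over class logits p (N, 2) with labels
    t (N,), plus smooth-L1 (Eq. 8) over the (x, y, w, h, theta)
    regression tuples v, v_star (N, 5), averaged over N proposals."""
    cls = F.cross_entropy(p, t)                     # mean of -log p_n
    reg = smooth_l1(v - v_star).sum(dim=-1).mean()  # Eq. (8) per proposal
    return cls + lam * reg
```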

Experiments

We conducted comprehensive experiments on the RGWHD dataset to evaluate the performance of the proposed model against state-of-the-art (SOTA) models. Note that the training images have a resolution of 1024 × 1024 pixels.

Experimental setup

All experiments were conducted with the PyTorch deep learning framework and Python 3.7. All models were trained for 30 epochs, with the initial learning rate set to 0.005 and decayed at epoch 8. The models were trained and tested on a desktop computer with an Intel Core i7-8700 CPU @ 3.20 GHz × 6, 24 GB of RAM, and a GeForce GTX 1080Ti GPU, running Ubuntu 18.04.5 LTS.
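The stated schedule maps directly onto PyTorch's SGD and MultiStepLR, as sketched below; the stand-in model, momentum, weight decay, and decay factor (0.1, MultiStepLR's default) are our assumptions, since they are not reported above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the OFPN network

# Initial learning rate 0.005, as stated; momentum/weight decay assumed.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)
# Multiply the learning rate by 0.1 (the default gamma) at epoch 8.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8])

for epoch in range(30):  # 30 epochs, as stated
    # train_one_epoch(model, optimizer)  # hypothetical training step
    scheduler.step()
```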

Evaluation metrics

In this study, we evaluated the model’s accuracy in terms of the mean Average Precision (mAP) metric, which combines four key measures: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). True Positive (TP) indicates the correct prediction of a wheat head. False Positive (FP) refers to the incorrect prediction of a wheat head that is not present in the image. False Negative (FN) represents the missed prediction of an actual wheat head. True Negative (TN) represents the correct prediction of an object that is not present in the image.

The aforementioned measures are widely utilized in the computation of evaluation metrics, including precision, recall, Average Precision (AP), and mAP, which offer valuable insights into the performance of object detection models.

Intersection over Union (IoU) is a crucial metric that quantifies the overlap between the predicted bounding box and the ground-truth bounding box; it is the ratio of their intersection area to their union area. In object detection tasks, a specific IoU threshold is commonly defined to determine TPs and FPs.

$$IoU = \frac{{S_{ \cap } }}{{S_{ \cup } }}$$
(10)
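For oriented boxes, the intersection and union areas in Eq. (10) are polygon areas. One possible way to compute them is with the third-party shapely library, as in the sketch below (our illustration, not the implementation used in this paper):

```python
import math
from shapely.geometry import Polygon

def obb_polygon(x, y, w, h, theta):
    """Corner polygon of an oriented box (x, y, w, h, theta)."""
    dx, dy = w / 2, h / 2
    c, s = math.cos(theta), math.sin(theta)
    corners = [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]
    # Rotate each corner by theta and translate to the center (x, y).
    return Polygon([(x + c * px - s * py, y + s * px + c * py)
                    for px, py in corners])

def rotated_iou(box_a, box_b):
    """Eq. (10) for oriented boxes: intersection over union of areas."""
    pa, pb = obb_polygon(*box_a), obb_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter)
```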

Precision: the percentage of true positives detected by the model among all the objects predicted.

$$P = \frac{TP}{{TP + FP}}$$
(11)

Recall: the percentage of true positives detected by the model among all the ground truth objects.

$$R = \frac{TP}{{TP + FN}}$$
(12)

AP (average precision): the average precision at different recall levels. It is computed by integrating the precision values along the Precision-Recall (PR) curve.

$$AP = \int_{0}^{1} P\left( R \right)\, dR$$
(13)

mAP: the average of AP across all categories.

$$mAP = \frac{{\sum\nolimits_{i = 1}^{N} {AP_{i} } }}{N}$$
(14)
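The sketch below shows how AP (Eq. 13) can be computed from ranked detections by all-point interpolation of the precision-recall curve; matching detections to ground truth at a chosen IoU threshold (producing `is_tp`) is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Sketch of Eq. (13): sort detections by confidence, accumulate
    precision (Eq. 11) and recall (Eq. 12), and integrate the
    precision-recall curve with all-point interpolation."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / num_gt            # Eq. (12)
    precision = tp / (tp + fp)      # Eq. (11)
    # Make precision monotonically decreasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    deltas = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(deltas * precision))
```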

Ablation study

We conducted ablation experiments to assess the impact of each component of the proposed OFPN: ResNeXt, PBFPN, and Soft-NMS. The base model was a Rotated Faster R-CNN with a ResNet50 backbone, and various configurations were trained and validated on the RGWHD dataset. The experimental results demonstrate the effectiveness of combining the three modules: Table 5 shows a notable 4.14% improvement in detection accuracy (mAP) over the base model.

Table 5 Ablation study results of the proposed model OFPN on the RGWHD dataset, where bold numbers represent the best result.

Wheat heads detection results

We conducted experiments on the RGWHD dataset and evaluated the performance of the Oriented Feature Pyramid Network (OFPN) in terms of mAP. For each category, the dataset is split into training, validation, and test sets in a ratio of 7:1:2. Figure 6 visualizes the prediction results for the seven wheat categories. The proposed model accurately predicts the oriented bounding boxes for the wheat heads in each test image, while also reporting the total count of detected heads and the classification probability of each detection.

Figure 6

Wheat heads detection results with OFPN on RGWHD: (a) arvalis_1, (b) arvalis_2, (c) arvalis_3, (d) ethz_1, (e) inrae_1, (f) rres_1, (g) usask_1.

The RGWHD images are 1024 × 1024 pixels, which makes the bounding boxes in Fig. 6 difficult to discern at full size. To demonstrate the oriented detection results in more detail, we evaluated two similar R-CNN-based models; zoomed details of the same image are shown in Fig. 7. Using the same IoU threshold and bounding box score settings, the proposed OFPN detected 5 more small and overlapping wheat heads than Faster R-CNN.

Figure 7

Enlarged detail of the same prediction result image: (a) Faster R-CNN, (b) OFPN.

We conducted a comparative analysis of our Oriented Feature Pyramid Network (OFPN) against six state-of-the-art oriented object detection models on the RGWHD dataset: the one-stage detectors Rotated RetinaNet36, R3Det37, S2ANet38, and Rotated FCOS, and the two-stage detectors Faster R-CNN-OBB10 and Oriented R-CNN23. The results, presented in Table 6 and Fig. 8, show that the two-stage models outperform the one-stage models. Among the seven wheat categories, the varieties arvalis_2, arvalis_3, and usask_1 are the most challenging for all compared models, yet our model performs best on them. For the remaining wheat varieties, OFPN achieves accuracy nearly equivalent to the best model, Faster R-CNN. Furthermore, Fig. 9 shows the validation mAP of all comparative models during training.

Table 6 Comparison results with state-of-the-art oriented object detection models on the RGWHD dataset, where bold numbers represent the best result.
Figure 8

Comparison of predictions from different models: green boxes enclose targets missed by the model.

Figure 9

mAP during training for all comparative models.

Classification results

To assess the model's object recognition capability, we compared the proposed OFPN with the high-performing Faster R-CNN using their confusion matrices. Figure 10 shows the prediction probabilities for the different wheat categories in the test set, with rows representing actual classes and columns representing predicted classes. The analysis of Fig. 10 reveals that Faster R-CNN tends to misclassify the arvalis_2 and arvalis_3 categories, whereas OFPN significantly improves classification for all wheat categories. These results indicate that our proposed model effectively extracts aggregated features from the images, mitigating the influence of factors such as appearance on wheat head classification.

Figure 10

Comparison of the classification results in confusion matrix: (a) Faster R-CNN, (b) OFPN.

Wheat head counting

To evaluate the counting performance of wheat head detection models, we established a threshold of 0.7 for the confidence score of prediction proposals across all comparative models. Among the 140 images in the test set, a total of 5074 ground truth bounding boxes were identified. Table 7 presents the results, indicating that OFPN achieved a wheat head counting accuracy of 93.97%, surpassing Faster R-CNN by 3.13%. It is worth noting that both Faster R-CNN and OFPN exhibited subpar counting performance for wheat arvalis_2. This can be attributed to the challenging lighting conditions under which the arvalis_2 images were captured, making it difficult to distinguish wheat heads from the background.

Table 7 Wheat head counting results on the RGWHD dataset, the bold fonts indicate the best results.

Conclusion

Wheat head detection and counting are significant for various purposes, including visual object detection, wheat yield estimation, and planting management. Like ships in aerial images, wheat heads often appear with arbitrary orientations in field images. Motivated by oriented object detection on the DOTA dataset, we propose a new Rotated Global Wheat Head Dataset (RGWHD) and present an Oriented Feature Pyramid Network (OFPN) for detecting small, dense wheat heads. OFPN enhances the representation of small wheat heads through a multi-scale feature fusion network, PBFPN, and handles dense wheat heads with Soft-NMS, which assigns lower scores to overlapping boxes. OFPN performs well in oriented wheat head detection, category recognition, and wheat head counting. Considering the extensive distribution and diverse varieties of wheat, as well as the challenges of dataset collection, our future research will focus on wheat detection models under weakly supervised conditions.