GLE-Net: A Global and Local Ensemble Network for Aerial Object Detection

Recent advances in camera-equipped drone applications have increased the demand for deep-learning-based visual object detection algorithms for aerial images. A single deep learning model has several limitations in accuracy. Inspired by the fact that ensemble learning can significantly improve the generalization ability of models in machine learning, we introduce a novel integration strategy that combines the inference results of two different methods without non-maximum suppression. In this paper, a global and local ensemble network (GLE-Net) is proposed to increase the quality of predictions by considering global weights for different models and adjusting local weights for bounding boxes. Specifically, the global module assigns different weights to the models. In the local module, we group the bounding boxes that correspond to the same object into a cluster. Each cluster generates a final predicted box, and the highest score in the cluster is assigned as the score of that box. Experiments on the VisDrone2019 benchmark show promising performance of GLE-Net compared with the baseline networks.


Introduction
Object detection in aerial images has become a challenging and active field in computer vision. Aerial object detection has been applied successfully in many areas, e.g., disaster assistance, military, and agriculture. With the advancement of aerial photography techniques and equipment (e.g., unmanned aerial vehicles and satellites) for capturing high-resolution aerial images, researchers have devised many object detection algorithms based on deep learning. Natural images often capture smaller visual fields and larger object sizes, whereas aerial images generally capture objects at lower resolution and smaller scale. Aerial images cover a wide area and contain many tiny, densely distributed objects. Although many object detectors have achieved advanced performance on natural images, they do not attain satisfactory detection results on aerial images.
Aerial images pose three special challenges: (1) aerial datasets consist mostly of high-resolution images; (2) objects typically have small scales relative to the images; and (3) objects are not uniformly distributed in large scenes. Therefore, it is difficult for a general-purpose detector to detect objects in aerial images effectively, and even the most recent works that focus on aerial images (e.g., CAD-Net [1] and R2CNN [2]) cannot reach the level that state-of-the-art object detection methods achieve on natural images. To address these challenges, this paper makes the following contributions:
• We propose a global and local ensemble network (GLE-Net) to integrate the inference results of multiple state-of-the-art detectors for object detection in aerial images.
• We design an effective plug-and-play module to fuse the predicted classification and box regression information of several detectors.
• Our method achieves better performance than the baseline models on the aerial image dataset VisDrone2019 [7].
The rest of the paper is organized as follows: Sect. 2 introduces the related work about generic and aerial image object detection algorithms. Section 3 describes our proposed method in detail. Section 4 introduces datasets and experimental results. Section 5 is a summary of the paper.

Related Work
Object detection has received considerable attention over the last two decades. In this section, the work most relevant to ours is summarized under two subcategories: (1) Generic Object Detection and (2) Aerial Image Object Detection.

Generic Object Detection
With the rapid development of deep neural networks, the performance of object detection has greatly improved. State-of-the-art object detection methods can be broadly classified into two categories, namely one-stage and two-stage methods. Representative one-stage detectors include YOLO (an acronym for You Only Look Once) [4, 8-11], the single-shot detector (SSD) [12], RetinaNet [13], and CenterNet [5, 6]. These methods can perform nearly real-time detection, need no proposal generation procedure, and conduct object detection directly on images. The YOLO family achieves state-of-the-art performance by integrating bounding box prediction and subsequent feature resampling in a single stage. RetinaNet [13] alleviates the foreground-background class imbalance problem with Focal Loss. RefineDet [14] introduces a module to refine anchor boxes. CornerNet [15] eliminates anchor boxes and detects an object as a pair of keypoints (the top-left corner and bottom-right corner) of a bounding box. In CenterNet [6], an object is detected as one center keypoint and two keypoints of a bounding box, which contains the center location and other attributes of an object (e.g., size).
In contrast, the most representative two-stage detectors are the R-CNN series and its variants. R-CNN [16-18] is one of the earliest effective methods to adopt a deep convolutional neural network (CNN) for object detection; it replaces the traditional hand-crafted feature extraction process with CNN-based feature learning and improves the accuracy of object detection. Two-stage detectors work in two steps. The first stage focuses on generating a series of candidate region proposals that may contain objects. In the second stage, feature maps are extracted by region-of-interest (ROI) pooling from each proposal for the classification and localization tasks. Fast R-CNN [17] generates region proposals on the feature map rather than on the original input images, which improves detection efficiency by a large margin. Faster R-CNN [16] introduces a region proposal network (RPN) to generate region proposals from the convolutional neural network and achieves end-to-end object detection. R-FCN [19] uses the fully convolutional network ResNet in place of VGG to improve feature extraction and classification. Cascade R-CNN [20] connects multiple detection stages sequentially, which increases the number of high-IoU samples and allows the detector to perform well. You Only Look One-level Feature (YOLOF) [21] proposes a dilated encoder and uniform matching to optimize detection. In CE-FPN [22], inspired by sub-pixel convolution, the authors propose a sub-pixel skip fusion method that performs both channel enhancement and up-sampling. BorderDet [23] proposes an efficient Border-Align operation that extracts border features from the extreme points of the border to enhance the point feature. CvT-ASSD [24] modifies the transformer backbone by adding convolutional token embedding and convolutional projection to the transformer encoder block, together with a multi-stage network design based on convolutions, while maintaining a certain computational efficiency.
Fig. 1 Example detection results of Yolov5, CenterNet, and our proposed method (GLE-Net). The red boxes represent undetected objects. Yolov5 misses many people and bicycles. CenterNet can detect these small objects but fails to distinguish between people and bicycles

Aerial Image Object Detection
Following the publication of several large-scale annotated datasets for object detection in aerial images, such as DOTA [25], VisDrone [7], and DIOR [26], many researchers have attempted to transfer detectors designed for natural images to aerial image object detection. RICNN [27] adds a method for learning a rotation-invariant neural network model on top of the existing R-CNN architecture, which is used for multi-class object detection with arbitrary orientations. ROI Transformer [28] designs a rotated ROI (RROI) learner to transform a horizontal ROI into a rotated ROI. In addition, based on the RROIs, this network proposes RPS-ROI-Align to extract rotation-invariant features. In LEVIR [29], the authors propose a new adaptive updating method for object detection inference in aerial images under a small-object prior. DFL-CNN [30] proposes a double focal loss convolutional neural network framework for aerial vehicle detection. The Context-Aware Detection Network (CAD-Net) [1] learns global and local contexts of objects by capturing their correlations with the global scene and with the local neighboring objects or features, respectively. The rotational region CNN (R2CNN) [2] modifies Faster R-CNN to extract pooled features of bounding boxes with different pooled sizes and then detects arbitrarily oriented objects. The small, cluttered, and rotated object detector (SCRDet) [31] fuses multi-layer features with effective anchor sampling and adds a supervised pixel attention network and a channel attention network for small object detection. Furthermore, SCRDet++ [32] devises an instance-level denoising module in the feature map for robust detection. The feature-merged single-shot detector (FMSSD) [33] aggregates context information both across multiple scales and within same-scale feature maps.
SyNet [34] combines multi-stage and single-stage methods for high-resolution aerial images to decrease the false-negative rate of the multi-stage method and increase the probability of proposals in the single-stage method. The multi-head rotated object detector (MRDet) [35] proposes arbitrary-oriented region proposals transformed from horizontal anchors to enhance the original RPN and obtain accurate bounding boxes. R3Det [36] introduces an end-to-end refined one-stage rotation detector using a progressive regression approach from coarse to fine granularity.

The Proposed Method
In this section, we describe the overall structure of our proposed method, shown in Fig. 2, and then explain the global and local ensemble network in detail. We propose a global and local ensemble network to improve object detection accuracy. Specifically, in the global module, we first use a collection strategy to combine all detection results from multiple models and set up a dictionary T to store these results. Next, the candidate bounding boxes in the dictionary T are grouped by object category and saved in a list L. The bounding box with the highest confidence score is then selected as the top priority box, which is matched against the remaining bounding boxes. The predictions whose IoU with the top priority box is greater than 0.50 form a subset M. According to our initial observations, the prediction boxes from different models should be assigned different weights; the specific formula is given in Eq. (9). In the local module, we normalize the confidence scores of all bounding boxes in the subset M to obtain a new score for each bounding box, denoted as $s_j$. Finally, a new optimized predicted box is calculated as defined by Eq. (11).

Backbone
In this work, we use three Bottleneck-CSPs to generate proposals and extract foreground object features from multiple layers using ROI-Align [37]. The backbone network uses local cross-layer fusion to reduce excessive memory consumption. In addition, we apply standard data augmentation techniques that have proven effective for object detection, such as flipping, rotation, and mosaic. Specifically, mosaic is a data augmentation approach that mixes four training images, which allows objects to be detected outside their normal context and improves detection accuracy.
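To illustrate how mosaic mixes four training images, the following is a minimal sketch rather than the augmentation actually used in training. It assumes four equally sized grayscale images stored as nested lists and simply tiles them into a 2 × 2 canvas; practical implementations also apply random scaling and cropping, which are omitted here. The function name `mosaic_2x2` and the corner-format boxes are hypothetical choices made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]                  # grayscale image as rows of pixel values
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def mosaic_2x2(images: List[Image], boxes: List[List[Box]]) -> Tuple[Image, List[Box]]:
    """Tile four equally sized images into a 2x2 canvas and shift their boxes.

    Simplifying assumption: no random scaling or cropping is applied.
    """
    if len(images) != 4 or len(boxes) != 4:
        raise ValueError("mosaic needs exactly four images and four box lists")
    h, w = len(images[0]), len(images[0][0])
    if any(len(img) != h or any(len(row) != w for row in img) for img in images):
        raise ValueError("this simplified sketch assumes equally sized images")

    # Top half: images 0 and 1 side by side; bottom half: images 2 and 3.
    canvas = [images[0][r] + images[1][r] for r in range(h)]
    canvas += [images[2][r] + images[3][r] for r in range(h)]

    # Offset of each tile's top-left corner inside the canvas.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    merged: List[Box] = []
    for (dx, dy), tile_boxes in zip(offsets, boxes):
        for x1, y1, x2, y2 in tile_boxes:
            merged.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged
```

The returned canvas is twice as wide and twice as tall as each input, and every annotation keeps its size while moving with its tile, so objects appear next to content from three other scenes.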

Detection Head
To reduce the computational cost effectively, we apply Bottleneck-CSPs to obtain the object feature maps in the backbone network. Next, we use a top-down approach to collect feature maps from different stages. The spatial features and contextual features of the neck are then extracted and fused to realize accurate object detection. Following the setting of the Yolo framework, at each scale we predict three bounding boxes for each of the class-specific feature maps. In addition, this model performs classification and bounding box regression jointly.

Loss Function
In this part, we introduce the loss function of the first model, which is composed of three losses and formally defined in Eq. (1). The object confidence loss $L_{obj}$ is a binary cross-entropy loss and the classification loss $L_{cls}$ is a soft-max cross-entropy loss, as shown in Eqs. (2) and (3). The bounding box loss $L_{box}$ is a smooth L1 loss, as shown in Eq. (4); the indicator functions appear in these expressions. Moreover, GIoU is an optimization of IoU that focuses not only on the overlapping area but also on the non-overlapping areas. Therefore, the above approaches can improve performance on object detection benchmarks.
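As a hedged illustration of how GIoU extends IoU to non-overlapping regions, the sketch below computes the standard generalized IoU of two axis-aligned boxes. The corner format (x_min, y_min, x_max, y_max) and the function name `giou` are assumptions made for this example; it is not the loss implementation used in the first model.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    if union <= 0:
        raise ValueError("boxes must have positive area")
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # Penalize the part of the enclosing box covered by neither input.
    return iou - (enclose - union) / enclose
```

Unlike plain IoU, which is zero for every pair of disjoint boxes, this value keeps decreasing toward -1 as the boxes move apart, so it still carries information when the prediction and the target do not overlap.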

Backbone
CenterNet is available with four backbone architectures: ResNet-18 [38], ResNet-101 [38], DLA-34 [39], and Hourglass-104 [40]. In our experiments, we use CenterNet with a deep layer aggregation (DLA-34) backbone as the other ensemble model for detecting objects in aerial images, where 34 denotes the 34 convolutional layers. This backbone is an image classification network with hierarchical skip connections; we utilize the fully convolutional upsampling version of DLA for dense prediction, which uses iterative deep aggregation to increase feature map resolution symmetrically. In addition, we use deformable convolution in the skip connections from the lower layers to the output layer, so that more object features in the aerial image can be preserved.

Detection Head
The object detection task can be treated as a keypoint estimation problem. CenterNet uses the center point of an object's bounding box to locate the object. It uses keypoint estimation to find center points and regresses all other object properties (e.g., size, orientation, 3D location, and pose). Moreover, CenterNet extracts a single center point per object without the need for grouping or post-processing, which reduces negative bounding boxes. The detection head of Model II predicts a heatmap, an embedding, and an offset for each object. The heatmap identifies corner locations at the resolution of the feature map, the embedding is applied to distinguish which corners belong to the same object, and the offset is used to slightly adjust the locations of objects on the heatmap.
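The idea of reading object centers directly off a heatmap can be sketched as follows. This is an illustrative example only: it keeps the cells that are local maxima of their 3 × 3 neighbourhood, and the score threshold of 0.3, the limit of 100 peaks, and the function name `heatmap_peaks` are assumptions made here rather than settings taken from Model II.

```python
def heatmap_peaks(heatmap, threshold=0.3, top_k=100):
    """Return (score, x, y) for cells that are 3x3 local maxima of a 2D heatmap."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            value = heatmap[y][x]
            if value < threshold:
                continue
            # Neighbourhood clipped at the borders; it includes the cell itself.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            if value >= max(neighbours):
                peaks.append((value, x, y))
    peaks.sort(reverse=True)  # highest score first
    return peaks[:top_k]
```

Each returned peak would then be combined with the regressed size and offset at the same location to form a bounding box, which is why no grouping step is needed.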

Loss Function
The overall loss function of the second network is defined in Eq. (5):

$$L = L_{hm} + \lambda_s L_S + \lambda_o L_O \tag{5}$$

where $L_{hm}$, $L_S$, and $L_O$ are the heatmap loss, the size loss, and the offset loss of the prediction box, respectively. The hyperparameters $\lambda_s$ and $\lambda_o$ control the tradeoff, and we set $\lambda_s = 0.1$ and $\lambda_o = 0.1$. In addition, $L_{hm}$ is similar to focal loss, and $L_S$ and $L_O$ are both mean absolute error (L1) losses. The specific expressions of these three losses are defined in Eqs. (6)-(8), where N represents the number of keypoints in the image and the predicted offset is the value output by the backbone.

Global and Local Ensemble Module
In this part, we have bounding box predictions for the same image from N different models. In this paper, we choose Yolov5 and CenterNet as our baseline models. The characteristics and structure of these networks have been introduced above in Sects. 3.1 and 3.2. The global and local ensemble module includes two parts, namely the global section and the local section. The global section works in the following steps:
(i) Traverse all the prediction results of each model. According to the image ID, store the predicted bounding boxes in the dictionary T, and then return this dictionary. Specifically, the image ID is the key of the dictionary, and each value is composed of the model ID, the coordinates of the box, the score, and the category.
(ii) Traverse each image ID in the dictionary T. All predicted objects of an image are classified according to the dataset categories. Objects belonging to the same category are stored in a list $L_i$ (i = 1, 2, 3, …, n), where i denotes the index of the category.
(iii) Within each category, sort the list in descending order of the classification confidence scores C.
(iv) Select the predicted bounding box with the largest confidence score in each list $L_i$ (i = 1, 2, 3, …, n) as the top priority candidate box, denoted as $B^{1}_{L_i}$.
(v) Declare a new subset M for the matched boxes. Use $B^{1}_{L_i}$ to iterate through the remaining predicted boxes in $L_i$ and find the matching boxes. The subset M is composed of the top priority candidate box $B^{1}_{L_i}$ and the remaining prediction boxes whose intersection over union (IoU) with it is greater than 0.50. The IoU is formally defined as $\mathrm{IoU} = \frac{\mathrm{area}(b_p \cap b_g)}{\mathrm{area}(b_p \cup b_g)}$, where $b_p$ and $b_g$ represent the predicted and ground-truth bounding boxes, respectively.
(vi) According to the original detection results of the different models, predictions of the same object from different models are given different weights. We propose a formulation (Eq. (9)) for learning the weight of the prediction boxes.
where $box_i$ denotes the ith prediction bounding box, $m_1$ and $m_2$ represent the first model and the second model, $s_i$ (i = 1, 2, 3, …) indicates the confidence of the ith bounding box, and $\mathbb{1}$ denotes the indicator function. Three hyper-parameters, indexed 1, 2, and 3 in Eq. (9), balance the weight of each box; we set them to 0.9, 1.1, and 1.3, respectively.
In addition, the local section is calculated as follows:
(vii) Normalize the confidence scores of all predicted boxes in the subset M to obtain the new confidence scores of the respective prediction boxes, denoted as $ws_i$ (i = 1, 2, 3, …, n). The normalization formula is given in Eq. (10).
(viii) Calculate an optimized box using Eq. (11), where $(x_{new}, y_{new}, w_{new}, h_{new})$ and $(x_k, y_k, w_k, h_k)$ (k = 1, 2, 3, …) represent the center-point coordinates, width, and height of the optimized box and of the original prediction boxes, respectively. The index k runs over the candidate boxes included in the list $L_i$, and $ws_i$ represents the new score of each prediction box calculated in step (vii). The score of the optimized box is set to the highest confidence score in the subset.
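Steps (i) to (viii) can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the exact forms of Eqs. (9) to (11) are not reproduced here, so the global weight is modelled as a hypothetical per-model multiplier (`model_weights`), the normalization divides each weighted score by the sum over the subset M, and the optimized box is the resulting weighted average. Repeating the matching until each list is empty is also an assumption. Boxes are taken as (center_x, center_y, width, height), and all names are chosen for this example.

```python
from collections import defaultdict


def iou_cxcywh(a, b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster, model_weights):
    """Steps (vi)-(viii): weight, normalize, and average one subset M."""
    # Assumed stand-in for Eq. (9): scale each score by a per-model weight.
    weighted = [p["score"] * model_weights.get(p["model"], 1.0) for p in cluster]
    total = sum(weighted)
    # Assumed normalization: weights sum to one (uniform if all scores are zero).
    ws = [v / total for v in weighted] if total > 0 else [1.0 / len(cluster)] * len(cluster)
    box = tuple(sum(w * p["box"][k] for w, p in zip(ws, cluster)) for k in range(4))
    # The fused box keeps the highest original confidence in the subset.
    return {"box": box, "score": max(p["score"] for p in cluster),
            "category": cluster[0]["category"]}


def ensemble(predictions, model_weights, iou_thr=0.5):
    """Steps (i)-(v): build T, split by category, sort, and match by IoU."""
    table = defaultdict(list)                      # dictionary T keyed by image ID
    for p in predictions:
        table[p["image_id"]].append(p)

    results = {}
    for image_id, preds in table.items():
        by_category = defaultdict(list)            # lists L_i, one per category
        for p in preds:
            by_category[p["category"]].append(p)

        fused = []
        for cat_preds in by_category.values():
            remaining = sorted(cat_preds, key=lambda p: p["score"], reverse=True)
            while remaining:                       # assumption: repeat until empty
                top, rest = remaining[0], remaining[1:]
                matched = [p for p in rest if iou_cxcywh(top["box"], p["box"]) > iou_thr]
                remaining = [p for p in rest if iou_cxcywh(top["box"], p["box"]) <= iou_thr]
                fused.append(fuse_cluster([top] + matched, model_weights))
        results[image_id] = fused
    return results
```

For example, two "car" predictions centered at x = 10 and x = 11 with scores 0.6 and 0.4 and equal model weights overlap with an IoU of 0.6, so they form one subset and are fused into a single box centered at x = 10.4 that keeps the score 0.6.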

Aerial Image Dataset
VisDrone2019 [7] is a large-scale visual object detection benchmark collected by Tianjin University. The VisDrone2019 DET [7] dataset for aerial object detection consists of 6471 aerial images for training and 548 validation images used for testing, which were taken by camera-equipped unmanned aerial vehicles. The annotated dataset contains 10 object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. Image sizes range from 540 × 960 to 2000 × 1500 pixels, and the objects have various shapes and scales. Since aerial image detection remains challenging because of class imbalance and object-image size mismatch, this dataset is used in this work to validate our proposed method.

Evaluation Metrics
The evaluation standard adopted in this paper is the mean average precision (mAP) of MS COCO [41], which is used to evaluate the performance of our method relative to other benchmarks. We computed three average precision metrics: AP50, AP75, and mAP. AP50 and AP75 are computed over all object categories and count a predicted bounding box as a true positive when the intersection over union (IoU) between the predicted and the ground-truth bounding box is larger than 0.5 and 0.75, respectively. The mAP, which takes a value between 0 and 1, is the average over the 10 IoU thresholds in the range [0.5, 0.95] with a step size of 0.05.
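The averaging over IoU thresholds can be made concrete with a short sketch. It assumes that the average precision at each threshold has already been computed (and averaged over categories) elsewhere; the function names and the dictionary layout are hypothetical and serve only to show how AP50, AP75, and mAP relate.

```python
def coco_iou_thresholds():
    """The 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_per_threshold):
    """Average precomputed AP values over the 10 IoU thresholds.

    ap_per_threshold maps each threshold from coco_iou_thresholds() to the
    AP obtained at that threshold (a value between 0 and 1).
    """
    thresholds = coco_iou_thresholds()
    missing = [t for t in thresholds if t not in ap_per_threshold]
    if missing:
        raise ValueError(f"missing AP values for thresholds: {missing}")
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
```

With this layout, AP50 and AP75 are simply the entries stored under the keys 0.5 and 0.75, while mAP summarizes all ten entries, so it rewards detectors whose boxes remain accurate at stricter overlap requirements.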

Experimental Details
We use Bottleneck-CSPs and DLA-34 as the backbones of our detection structure, and both have been pre-trained on ImageNet [42]. Our proposed framework is shown in Fig. 2. In the training and testing stages, the input images are resized to 608 × 608. In the training phase, we trained the models for 300 epochs with a batch size of 6 and a learning rate of 0.001. We implemented the proposed method in PyTorch 1.5.0 on top of Yolov5 and CenterNet. The models were trained on one server with an NVIDIA GeForce RTX 2080Ti GPU. In this experiment, we modified the number of Yolov5 outputs so that only 100 boxes were selected as candidate boxes for each object, which is consistent with the number of candidate boxes produced by CenterNet.

Analysis of Comparison Results
To demonstrate the effectiveness of GLE-Net, we compared our model with state-of-the-art (SOTA) detection methods on the VisDrone2019 validation set in Table 1. We use CenterNet and Yolov5 in our proposed method. Compared to CornerNet, Yolov3, RefineDet512, Cascade RCNN, and Faster RCNN, GLE-Net achieves the best performance of 23.1% mAP with the global and local ensemble strategy. Compared to the original Yolov5 with the same backbone, GLE-Net improves the mAP by 6.2 percentage points (from 16.9 to 23.1%). GLE-Net also outperforms the original CenterNet with the DLA-34 backbone by 1.0 percentage point (from 22.1 to 23.1%). Specifically, in terms of AP75, GLE-Net improves by nearly 8 percentage points over the original CenterNet and by 0.9 percentage points over Yolov5, which indicates the flexibility and robustness of GLE-Net at higher IoU thresholds. However, its AP50 does not surpass that of FRCNN + FPN. A possible reason is that FRCNN + FPN is a two-stage algorithm, and two-stage algorithms have been shown to outperform one-stage algorithms in most cases. Nevertheless, the AP50 of our proposed method exceeds that of the one-stage methods (CenterNet and Yolov5).
GLE-Net increases the accuracy of the regressed boxes using global and local ensemble modules. The idea of ensembling is also used in object detection competitions and machine learning methods, such as the IEEE Global Road Damage Detection Challenge. Therefore, to further improve the performance of aerial object detection, we also use global and local ensemble modules. Table 2 shows that GLE-Net improves the mAP from 16.9% and 22.1% to 23.1%, especially for categories with small objects. Specifically, the improvements for car, pedestrian, van, truck, and bus are 9.3%, 6.1%, 5.1%, 9.5%, and 13.2%, respectively. To verify the effectiveness of our method, we also conducted a further set of experiments. Figure 3 depicts the precision-recall curves of Yolov5, CenterNet, and GLE-Net on the VisDrone2019 dataset. The anchor-based method (Yolov5, green curve) is significantly better than the anchor-free method (CenterNet, red curve). In terms of the precision-recall curves, our GLE-Net also performs better than both methods. Specifically, the detection results of GLE-Net are better in the four categories of pedestrian, car, truck, and bus.

Experimental Results on VisDrone
Aerial images often contain small, dense objects in some regions. As shown in Fig. 4, on a straight road, objects near the aerial camera appear larger and distant objects appear smaller. In addition, for some objects, the edges of the foreground features differ little from the background features, and the boundaries are blurry in night scenes. As shown in Fig. 4a, Yolov5 adopts a multi-scale prediction strategy, so it can fuse image features carrying edge information at different scales to obtain better details. Therefore, even when both the foreground and the background are blurred, Yolov5 can still detect edge objects. Unfortunately, the multi-scale features extracted by Yolov5 form low-resolution maps, which leads to much lower accuracy, with many dense and tiny objects missed. In contrast, CenterNet applies DLA-34 as the backbone network. Its characteristic is that all input and output feature maps have the same large spatial resolution, so they can express more small-object features. Thus, CenterNet performs better than Yolov5 on overlapping small objects. However, CenterNet does not perform well when the foreground and background are not easily distinguishable. GLE-Net, by contrast, successfully extracts both the small objects and the edge objects. These results indicate that the global and local ensemble strategy of this paper combines the advantages of the two models and can effectively reduce the missed detection rate.
Aerial images taken by unmanned aerial vehicles contain many dense small objects, most of which overlap or are unevenly distributed. In addition, objects have different aspect ratios at different heights and angles in aerial images. From the overall results in Fig. 5, the proposed GLE-Net is superior to Yolov5 and CenterNet. The reason is that the global and local ensemble approach adds an appropriate weight to the detected objects, thereby reducing the missed detection rate and enhancing the detection accuracy.
Fig. 3 Precision-recall curves of Yolov5, CenterNet, and GLE-Net on the VisDrone2019 dataset. The yellow lines represent GLE-Net, the red lines represent CenterNet, and the green lines denote Yolov5
Fig. 4 Visualization of the three detectors for the night scene in aerial images. The red boxes denote undetected objects, and the green boxes denote detected objects
Visualization results for two closely spaced objects in aerial images are shown in Fig. 6. In the first row, a car and a bicycle are close together and the car partially obscures the rear of the bicycle, so only the front of the bicycle can be seen. Yolov5 detects the car but not the bicycle behind it. A possible reason is that Yolov5 cannot identify incomplete objects, which leads to missed detections. CenterNet behaves in the opposite way: the corners of the bicycle and the car overlap, so the two objects may be identified as one object and the car is not detected. Our algorithm combines the advantages of the two methods to detect both objects. The advantages of our approach are also shown in the second row. Some features of people and bicycles are weakened under strong light. Yolov5 cannot extract the relevant features of the people, so the people are not detected. In CenterNet, the corners of the candidate boxes of the people and the bicycles lie close together, which leads to incorrect recognition of the two objects. Our proposed method, however, uses the advantages of the two methods to obtain the final detection result.
Figure 7 shows a few sample images from the VisDrone2019 dataset and the corresponding detections of the baseline models, Yolov5 and CenterNet (in the first column), and of the proposed GLE-Net (in the second column). Qualitative results show that our proposed GLE-Net recovers objects missed in different situations and thereby increases the detection precision over both algorithms. As Fig. 7 shows, the SOTA generic detection technique Yolov5 tends to miss detections under different road scenarios, such as people in the first two examples and cars in the third example. In addition, the anchor-free method CenterNet misses detections because of inaccurate corner matching.
By comparison, the proposed GLE-Net is capable of correctly detecting those objects under various adverse scenarios, as illustrated in the third column of Fig. 7. The outstanding detection performance is largely attributed to the inclusion of the global and local contexts (as described in Sect. 3) within the proposed GLE-Net.

Inference Time and Parameters
In this part, we compare the inference speed and parameters of the two baselines, as shown in Table 3. The parameters of Yolov5 and CenterNet occupy 89 MB and 74.99 MB, respectively, so the two models do not differ much in size. Their inference speeds, however, differ. The reason is that the backbone network used by CenterNet is DLA-34 [39], whose multi-layer aggregation spans the entire network and therefore increases the inference time of the model.

Ablation Study
In this subsection, to demonstrate the effectiveness of the global and local ensemble network and its plug-and-play nature, we adopt three networks to evaluate our proposed strategy. As shown in Table 4, the mAPs of the three baseline methods are 16.9%, 20.9%, and 22.1%, respectively. In this experiment, we applied our strategy by combining them in pairs. The integrated models all improve to a certain extent. Specifically, the ensemble of Yolov4 and Yolov5 improves the mAP from 20.9 to 23.9%. Yolov4 + CenterNet and Yolov5 + CenterNet improve the mAP over the baselines by 5% and 6.2%. Integrating the prediction results of Yolov4, Yolov5, and CenterNet achieves an mAP of 24.2%, an increase of 7.3, 3.3, and 2.1 percentage points over the 16.9%, 20.9%, and 22.1% baselines, respectively. Experimental results show that our idea of constructing ensemble predictions is effective.
Fig. 7 Qualitative results of Yolov5, CenterNet, and GLE-Net in aerial images. The red boxes denote undetected objects, and the green boxes denote detected objects
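The gains of the three-model ensemble follow directly from the reported mAP values, as the short check below shows. It uses only the figures quoted in this subsection; the variable names are chosen for this example, and gains are expressed in percentage points.

```python
# mAP values (in %) quoted in the ablation study.
baseline_maps = [16.9, 20.9, 22.1]   # the three single-model baselines
ensemble_map = 24.2                  # ensemble of all three models

# Gain of the three-model ensemble over each baseline, in percentage points.
gains = [round(ensemble_map - m, 1) for m in baseline_maps]

for m, g in zip(baseline_maps, gains):
    print(f"baseline {m:.1f}% -> ensemble {ensemble_map:.1f}%: +{g:.1f} points")
```

The largest gain is obtained over the weakest baseline, and even the strongest single model improves by about two points.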

Conclusions
In this paper, we have presented a global and local ensemble network for object detection in aerial images. Considering the advantages and disadvantages of two state-of-the-art object detection models (Yolov5 and CenterNet), we added an ensemble module with global and local object features. The module fuses regression and classification information from different models and is independent of the underlying algorithm, so it can serve as an efficient plug-and-play network to improve the detection accuracy of an arbitrary model. The experimental results on the VisDrone2019 dataset demonstrate that our proposed method achieves competitive results.