Introduction

Vehicle detection from unmanned aerial vehicle (UAV) images is a key technology in many fields, such as search and rescue [1], surveillance [2], military [3], and transportation [4,5,6], and has practical research significance and wide application value. However, accurate and fast vehicle detection remains a challenging problem due to many issues, such as, but not limited to, small-sized vehicles, low-resolution vehicles, partially occluded vehicles, vehicle scale diversity, limited datasets, and information imbalance across feature scales.

Because of the powerful representation ability of convolutional neural networks (CNNs), object detection [7,8,9,10,11] has made significant breakthroughs on ground-level images, and vehicle detection in UAV images has also been continuously improved. As with object detection in ground-level images, vehicle detectors for UAV images can be divided into two categories: two-stage vehicle detectors and single-stage vehicle detectors. Built on two-stage detection networks such as Fast R-CNN [12] and Faster R-CNN [8], the two-stage vehicle detectors [13,14,15] introduce high-level contextual semantic information to enhance the feature representation of vehicles. These detectors can ensure high accuracy, but are not suitable for real-time applications. Built on single-stage detection networks such as SSD [9] and the YOLO series [16,17,18,19,20,21], the single-stage vehicle detectors use a top-down architecture [22, 23] to introduce contextual information, which enhances the feature representation of vehicles. These detectors can guarantee both high accuracy and real-time performance.

The detectors above only consider introducing high-level semantic information into shallow features. However, information is still lost in the deeper features, resulting in an information imbalance that is particularly unfavorable for small-vehicle detection. Wang et al. [24, 25] showed that shallow-level detailed and spatial information is crucial for accurate target localization, especially for small-target detection.

To reduce the imbalance caused by the lack of spatial information in deep features, we propose a shallow-level feature information guidance part. This part passes mid-/low-level information to supply detailed and spatial information to deeper features. In this part, an image pyramid is introduced to supplement spatial information at each feature prediction layer of the backbone network. We therefore design a feature transform module (FTM) to transform the image pyramid into mid-/low-level feature information, which preserves more detailed and spatial features for deep layers. The FTM can be understood as a shallow light-weight network trained from scratch, which can reduce the gap between classification and localization. At the same time, it contains only simple convolution and batch normalization layers, and thus does not consume much training time.

Meanwhile, in the shallow-level feature information guidance part, we use a residual product fusion method to implement feature fusion, which guides more mid-/low-level spatial information to be embedded into the backbone network to enhance features for UAV vehicles. Furthermore, to effectively suppress unnecessary shallow background information in the fused features, we design a light-weight attention module (LAM) that makes the network focus more on small-sized vehicles. The LAM can be understood as a spatial attention mechanism, which enhances the discriminability and robustness of features by filtering important information on the feature maps.

In the shallow-level feature information guidance part, we use the FTM to obtain more detailed and spatial features, which are then added to the deep prediction features through the residual product fusion module and the LAM. This reduces the information imbalance problem of lack of spatial information in deep features, and enables better detection features to be learned.

Apart from this, we use the top-down architecture of the standard RefineDet [26] to introduce contextual semantic information into shallow features, which is called the deep-level semantic information guidance part. This part guides the backbone network to enhance contextual information for small-sized vehicles and reduces the imbalance caused by the lack of high-level semantic information in shallow layers. Meanwhile, a feature enhancement module (FEM) is proposed to suppress redundant features and improve the discriminability of small-sized vehicles.

The whole structure combining the shallow-level feature information guidance part and the deep-level semantic information guidance part is called a bi-directional information guidance network (BDIG-Net). The BDIG-Net not only integrates high-level semantic information that is conducive to classification into shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into deep features. Therefore, the proposed BDIG-Net ensures that both mid-/low-level spatial information and high-level semantic information are abundant in each feature prediction layer, reducing the problem of information imbalance.

In summary, we make the following main contributions:

  1. A bi-directional information guidance network (BDIG-Net) for UAV vehicle detection is proposed, which ensures that each prediction layer has rich mid-/low-level spatial information and high-level semantic information, and reduces the problem of information imbalance.

  2. In the shallow-level guidance part, a feature transform module (FTM) is proposed to obtain abundant mid-/low-level feature information, which guides the BDIG-Net to enhance detailed and spatial features for small-sized vehicles. Besides, a light-weight attention module (LAM) is used to reduce unnecessary shallow background information, making the network focus more on small-sized vehicles. This part reduces the imbalance caused by the lack of spatial information in deep features.

  3. In the deep-level guidance part, a feature enhancement module (FEM) is designed to suppress redundant features and improve the discriminability of small-sized vehicles. This part reduces the imbalance caused by the lack of high-level semantic information in shallow layers.

  4. Our method achieves state-of-the-art performance on both datasets: 92.9% mean average precision (mAP) on the XDUAV dataset and 91.1% mAP on the Stanford Drone dataset. The proposed method can process 50 frames per second on a single NVIDIA 1080Ti GPU. Code is available at https://github.com/03100076/BDIG.

Related work

Some classical classification algorithms have been successfully applied in various fields [44, 45]. In recent years, convolutional neural networks (CNNs) have made breakthroughs in classification tasks, especially in image processing. Among them, UAV vehicle detectors [66, 67] based on CNNs have attracted much attention from researchers. These detectors can be grouped into two-stage and single-stage vehicle detectors.

Two-stage UAV vehicle detector

The two-stage vehicle detectors [14, 15, 27,28,29,30,31] based on Fast R-CNN [12] and Faster R-CNN [8] enhance feature representation by introducing contextual information [32,33,34,35,36], which improves detection performance. Xu et al. [13] use Faster R-CNN to improve vehicle detection accuracy in low-altitude UAV imagery, but because feature extraction becomes difficult, the method does not extend to higher altitudes or multiple vehicle categories. Sommer et al. [27] extend the UAV vehicle detection task to multiple categories. Zhang et al. [37] realize dense and small vehicle detection with Cascade R-CNN [38] in UAV vision. Huang et al. [39] utilize an improved Cascade R-CNN to add superclass detection on top of the original one, and then fuse the regression confidence and modify the loss function to enhance the detection capability for targets. However, the two-stage UAV vehicle detection methods suffer from enormous model complexity and speed limitations.

Single-stage UAV vehicle detector

To satisfy real-time detection requirements, single-stage vehicle detectors have been proposed that achieve performance comparable to the two-stage detectors. Tang et al. [40], Radovic et al. [41], and Ringwald et al. [43] use improved versions of SSD [9], YOLOv1 [16], and YOLOv2 [17], respectively, to achieve real-time vehicle detection and tracking in traffic monitoring images. To further enhance the features of weak and small-sized vehicles, Zhang et al. [46] construct an improved 16-layer YOLOv3 network to achieve efficient and accurate vehicle detection. With the continuous updates of the YOLO series, Tan et al. [47] propose accurate and lightweight UAV detectors based on YOLOv4 [19], using dilated convolution and an ultra-lightweight subspace attention mechanism to enhance multi-scale feature representation and improve target detection performance. Based on YOLOv5, Deng et al. [48] and Zhan et al. [49] employ different feature enhancement methods to achieve accurate and fast UAV vehicle detection. ShuffleDet [50] uses deformable and inception modules to achieve real-time UAV vehicle detection. These real-time single-stage vehicle detectors only consider passing high-level semantic information in the convolutional neural network to provide contextual information for shallow features, and do not consider passing shallow-level information to deeper features. However, shallow-level detailed and spatial information is also crucial for accurate object localization, especially for small-object detection.

Information imbalance for UAV vehicle detector

There is an information imbalance problem across feature scales in CNNs. Shallow features with weak semantics carry spatial information that is conducive to precise localization. On the contrary, deep features with strong semantics are beneficial for classification but lack detailed information. Several studies address this information imbalance at the feature level. The classical feature pyramid network (FPN) transmits high-level semantic information to shallow features, which reduces the imbalance to some degree. PA-Net [51] shortens the information propagation path between low-level and high-level features by adding a bottom-up path. Libra R-CNN [52] employs a balanced feature pyramid to reduce feature-level imbalance. IPG-Net [53] introduces an image pyramid to address the problem. Regarding the information imbalance in UAV vehicle detectors, both single-stage and two-stage vehicle detectors only consider passing high-level semantic information to shallow features, and do not consider passing shallow-level information to deeper features. Small-sized vehicle detection needs not only high-level semantic information that can distinguish vehicles from other categories, but also mid-/low-level information that can accurately describe vehicles. Inspired by the above discussion, we propose a bi-directional information guidance network to reduce the imbalance, which ensures that each prediction layer has abundant mid-/low-level feature information and high-level semantic information to improve detection performance.

Proposed method

Baseline and motivation

In this paper, we use the RefineDet framework [26] as our baseline, because the network not only has the real-time advantage of single-stage detection algorithms (e.g., SSD), but also the high-accuracy advantage of two-stage detection algorithms (e.g., Faster R-CNN). The standard RefineDet adopts a VGG-16 architecture [54] as the backbone network, converts fc6 and fc7 of VGG-16 into convolution layers conv_fc6 and conv_fc7 by subsampling their parameters, and then adds two extra convolution layers conv6_1 and conv6_2 to the end of the truncated VGG-16. Meanwhile, the standard RefineDet adopts the top-down architecture of the classical feature pyramid network (FPN) to achieve feature fusion, providing contextual information for shallow features. The standard RefineDet utilizes the different prediction feature layers conv4_3, conv5_3, conv_fc7 and conv6_2 to complete multi-scale object classification and localization.

For the UAV vehicle detection task, most targets are small and weak. On the basis of the standard RefineDet network, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. Meanwhile, because the deeper feature layers introduce overly large receptive fields that hurt small-vehicle detection performance, we remove the deeper layers conv6_1 and conv6_2. Therefore, we ultimately use conv3_3, conv4_3, conv5_3 and conv_fc7 as multi-scale feature prediction layers, and the subsequent experimental results prove the effectiveness of this prediction-layer selection; a sketch of the resulting backbone is given below. Furthermore, since the size and aspect-ratio distributions of targets differ across UAV vehicle datasets, it is necessary to set suitable anchors. We reset the anchors according to the distribution and the effective receptive field of vehicles in different convolutional layers to improve the recall rate of vehicles.
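As an illustration of this configuration, the following PyTorch-style sketch (the paper's implementation is in Caffe) shows one way to expose the four prediction layers from a truncated VGG-16, with conv_fc6/conv_fc7 obtained by an SSD-style conversion of fc6/fc7; the layer indices, dilation setting, and channel widths are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
from torchvision.models import vgg16

class TruncatedVGGBackbone(nn.Module):
    """Truncated VGG-16 exposing conv3_3, conv4_3, conv5_3 and conv_fc7 (conv6_1/conv6_2 removed)."""
    def __init__(self):
        super().__init__()
        feats = vgg16().features
        self.to_conv3_3 = feats[:16]     # conv1_1 ... conv3_3 + ReLU
        self.to_conv4_3 = feats[16:23]   # pool3, conv4_1 ... conv4_3 + ReLU
        self.to_conv5_3 = feats[23:30]   # pool4, conv5_1 ... conv5_3 + ReLU
        self.pool5 = nn.MaxPool2d(3, stride=1, padding=1)
        # fc6/fc7 converted to convolution layers (SSD-style; dilation value assumed)
        self.conv_fc6 = nn.Conv2d(512, 1024, 3, padding=6, dilation=6)
        self.conv_fc7 = nn.Conv2d(1024, 1024, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        c3 = self.to_conv3_3(x)
        c4 = self.to_conv4_3(c3)
        c5 = self.to_conv5_3(c4)
        fc7 = self.relu(self.conv_fc7(self.relu(self.conv_fc6(self.pool5(c5)))))
        return c3, c4, c5, fc7           # the four multi-scale prediction layers
```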

Overall architecture

The proposed overall architecture for UAV vehicle detection is shown in Fig. 1. There are two main parts in the BDIG-Net: the shallow-level feature information guidance part and the deep-level semantic information guidance part.

Fig. 1
figure 1

The overall architecture of the BDIG-Net, including shallow-level feature information guidance part and deep-level semantic information guidance part. The shallow-level feature information guidance part realizes the mid-/low-level information passing to supply detailed and spatial information for vehicles. The deep-level semantic information guidance part realizes the high-level semantic information passing to supply contextual information for vehicles

The shallow-level feature information guidance part passes mid-/low-level information to supply detailed and spatial information for vehicles. This part is mainly composed of a feature transform module (FTM), a feature fusion module (FFM), and a light-weight attention module (LAM). In this part, we first use down-sampling to obtain a series of images of different resolutions that form the image pyramid, and then design the feature transform module (FTM) to extract features from these images. The extracted features preserve more mid-/low-level feature information, which guides the backbone network to enhance detailed and spatial features in the deep layers. Meanwhile, we use the feature fusion module (FFM) to better integrate the mid-/low-level feature information provided by the FTM into the backbone network. Finally, the light-weight attention module (LAM) reduces unnecessary shallow background information in the fused features, making the network focus more on small-sized vehicles.

The deep-level semantic information guidance part realizes the high-level semantic information passing to supply contextual information for vehicles. The part uses the top-down architecture of the standard RefineDet to fuse deeper semantic information, which can guide the backbone network to enhance contextual information for shallow features. In this part, we design a feature enhancement module (FEM) to suppress redundant features and improve the discriminability of small-sized vehicles.

The shallow-level and deep-level information guidance parts together form the bi-directional information guidance network, which provides both mid-/low-level detailed information and high-level semantic information and enhances the discriminative features of small-sized vehicles.

Feature transform module

The feature transform module (FTM), as shown in Fig. 2, mainly performs a feature transform on the input images from the image pyramid, in order to obtain mid-/low-level spatial and detailed information for UAV vehicles. The FTM is a shallow light-weight network trained from scratch, which does not consume much training time. The module consists of a 3\(\,\times \,\)3 convolutional layer with a BN layer [55], a 1\(\,\times \,\)1 convolutional layer with a BN layer, a multi-channel dilated convolutional layer [56], and a concatenation layer. Compared with a general convolution, a dilated convolution adds a parameter named the dilation rate r to enlarge the receptive field while maintaining the image resolution. Different dilation rates r introduce different receptive fields and thus different features. In this work, the multi-channel parallel dilated convolutional layer uses different receptive fields to provide more abundant mid-/low-level feature information for UAV vehicles.

Fig. 2
figure 2

The structure diagram for feature transform module

Each input image from the image pyramid first passes through a 3\(\,\times \,\)3 convolutional layer and a 1\(\,\times \,\)1 convolutional layer to obtain the feature \(s_n\). The feature \(s_n\) is then fed into multiple channels, each of which performs a dilated convolution with a different dilation rate r. The three-channel case is shown in Fig. 2: three 3\(\,\times \,\)3 convolutional features with dilation rates of 1, 2, and 3 are concatenated, and the channel dimension is then adjusted through a 1\(\,\times \,\)1 convolutional layer. In this way, the convolutions are integrated to concatenate features of different receptive fields, the fusion of different features is achieved, and the mid-/low-level feature information \(F_n\) for UAV vehicles is obtained. This process can be written as:

$$\begin{aligned} \begin{aligned} F = Cat(D_{3,1}(s), D_{3,2}(s), D_{3,3}(s)) \end{aligned} \end{aligned}$$
(1)

where Cat is the concatenation operation and \(D_{k,r}(s)\) is the dilated convolution; k is the size of the convolution kernel (we set k to 3 in this paper), r is the dilation rate, and s is the input feature of the multi-channel dilated convolution. F is the output feature of the multi-channel dilated convolution, that is, the extracted mid-/low-level feature information.
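To make the data flow of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the FTM (the original implementation is in Caffe). The channel widths, the use of ReLU, the trailing 1\(\,\times \,\)1 integration convolution, and feeding the full feature \(s_n\) to every parallel branch are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FTM(nn.Module):
    """Feature transform module: 3x3 conv + BN, 1x1 conv + BN, parallel dilated
    convolutions with different dilation rates, and channel concatenation."""
    def __init__(self, in_ch=3, mid_ch=64, out_ch=256, rates=(1, 2, 3)):
        super().__init__()
        self.stem = nn.Sequential(                       # produces the feature s_n
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        )
        # D_{3,r}(s): 3x3 dilated convolutions with dilation rates r
        self.branches = nn.ModuleList(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.integrate = nn.Conv2d(mid_ch * len(rates), out_ch, 1)  # 1x1 channel adjustment

    def forward(self, img):
        s = self.stem(img)
        f = torch.cat([branch(s) for branch in self.branches], dim=1)  # Eq. (1)
        return self.integrate(f)          # mid-/low-level feature information F_n
```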

Feature fusion module

The idea of the FFM is to first transform the two types of features and then fuse them to augment the features for small-sized vehicle detection. We formulate the FFM as follows:

$$\begin{aligned} \begin{aligned} Y_n = \alpha ({G_n(k_0)}, {F_n(h_n)}), n\in [1, N] \end{aligned} \end{aligned}$$
(2)

where \(Y_n\) is the output feature of the FFM at level n. \(G_n(\cdot )\) and \(F_n(\cdot )\) correspond to the outputs of the backbone network and the FTM, respectively. \(\alpha (\cdot )\) is the fusion function of the FFM, which takes different variants in the shallow-level and deep-level information guidance parts.

In the shallow-level feature information guidance part, \(F_n(\cdot )\) and \(h_n\) are the output and the input of the FTM at level n, respectively. \(G_n(\cdot )\) and \(k_0\) are the output at level n and the input of the backbone network, respectively. \(\alpha (\cdot )\) is a residual product fusion method, as shown in Fig. 3a, which performs an element-wise product between \(F_n(\cdot )\) and \(G_n(\cdot )\). The result of this operation is then added to \(G_n(\cdot )\) in a residual form. The corresponding formula is as follows:

$$\begin{aligned} \begin{aligned} Y_n = W \cdot (((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n)) + G_n) \end{aligned} \end{aligned}$$
(3)

where \(W_p\) and \(W_k\) are 1\(\,\times \,\)1 convolutional layers, W is a 3\(\,\times \,\)3 convolutional layer followed by a BN layer, and \(CT(\cdot )\) is a channel-dimension transform that aligns the channel dimensions of the fused features. \(((W_k \cdot {CT(F_n)}) \otimes (W_p \cdot G_n))\) can be considered the information lost in the backbone feature \(G_n(\cdot )\), and this lost information is then added back to the backbone network to enhance the features of UAV vehicles.
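A hedged sketch of the residual product fusion of Eq. (3) is given below; CT(·) is approximated here by a 1\(\,\times \,\)1 convolution, and \(F_n\) and \(G_n\) are assumed to share the same spatial resolution.

```python
import torch.nn as nn

class ResidualProductFusion(nn.Module):
    """Residual product fusion of Eq. (3): (W_k·CT(F_n)) ⊗ (W_p·G_n) + G_n, followed by W."""
    def __init__(self, ftm_ch, backbone_ch):
        super().__init__()
        self.ct  = nn.Conv2d(ftm_ch, backbone_ch, 1)        # channel-dimension transform CT(.)
        self.w_k = nn.Conv2d(backbone_ch, backbone_ch, 1)   # W_k
        self.w_p = nn.Conv2d(backbone_ch, backbone_ch, 1)   # W_p
        self.w   = nn.Sequential(                           # W: 3x3 conv + BN
            nn.Conv2d(backbone_ch, backbone_ch, 3, padding=1),
            nn.BatchNorm2d(backbone_ch),
        )

    def forward(self, F_n, G_n):
        lost = self.w_k(self.ct(F_n)) * self.w_p(G_n)  # the "lost" spatial information
        return self.w(lost + G_n)                      # residual addition, Eq. (3)
```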

Fig. 3
figure 3

Two feature fusion modules. a Shows the residual product fusion method. b Shows the element-wise sum fusion method

In the deep-level semantic information guidance part, the fusion of \(F_n(\cdot )\) and \(G_n(\cdot )\) is the same as in the standard feature pyramid network (FPN). Therefore, \(\alpha (\cdot )\) is an element-wise sum fusion method, as shown in Fig. 3b, which performs an element-wise sum of \(F_n(\cdot )\) and \(G_n(\cdot )\). In this part, \(F_n(\cdot )\) is a high-level semantic feature, which differs from the output of the FTM. The transmission of \(F_n(\cdot )\) provides contextual information to enhance the discriminative features of vehicles. This fusion method requires the two features to have the same dimension, and the corresponding formula is as follows:

$$\begin{aligned} \begin{aligned} Y_n = W \cdot ((W_k \cdot {CT(F_n)}) + G_n) \end{aligned} \end{aligned}$$
(4)

where the explanation of all parameters is the same as the above formulas.
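For comparison, a minimal sketch of the element-wise sum fusion of Eq. (4) used in the deep-level guidance part is shown below, under the same assumptions as the previous sketch.

```python
import torch.nn as nn

class ElementwiseSumFusion(nn.Module):
    """FPN-style element-wise sum fusion of Eq. (4)."""
    def __init__(self, top_ch, backbone_ch):
        super().__init__()
        self.ct_wk = nn.Conv2d(top_ch, backbone_ch, 1)   # W_k · CT(.): channel alignment
        self.w = nn.Sequential(                          # W: 3x3 conv + BN
            nn.Conv2d(backbone_ch, backbone_ch, 3, padding=1),
            nn.BatchNorm2d(backbone_ch),
        )

    def forward(self, F_n, G_n):
        return self.w(self.ct_wk(F_n) + G_n)             # Eq. (4)
```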

Light-weight attention module

During the passing of mid-/low-level feature information, the fused features are contaminated by irrelevant background information. Following the residual attention module [58] for image classification, attention mechanisms can suppress irrelevant background during the forward propagation of mid-/low-level information. However, the large number of parameters in such attention mechanisms increases the complexity of the model. Therefore, we design a light-weight attention module (LAM), which is essentially a spatial attention mechanism, as shown in Fig. 4.

Fig. 4
figure 4

The structure diagram for the light-weight attention module

The proposed LAM is a spatial attention mechanism that mainly comprises a mask branch (top branch) and a trunk branch (bottom branch). In the mask branch, we use a light-weight hourglass structure that performs down-sampling and up-sampling to obtain attention feature maps. A max pooling layer and three convolutional layers perform the down-sampling phase. A convolutional layer, a bilinear interpolation operation and a sigmoid function perform the up-sampling phase to obtain the attention feature maps. In the trunk branch, we use a convolutional layer to obtain the output. Finally, the trunk-branch output and the attention features are fused in an element-wise product manner to obtain the enhanced features. The LAM is an attention module with few parameters, which can reduce the unnecessary background information introduced by the shallow-level feature guidance part, focus attention on small-sized vehicles, and enhance the discriminative features of vehicles in shallow layers.
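The following is an illustrative sketch of the LAM under the description above; the kernel sizes, channel widths, and activation choices are assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAM(nn.Module):
    """Light-weight spatial attention: hourglass mask branch × trunk branch."""
    def __init__(self, ch):
        super().__init__()
        # Mask branch, down-sampling phase: max pooling + three convolutions
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.up_conv = nn.Conv2d(ch, ch, 3, padding=1)   # convolution in the up-sampling phase
        self.trunk   = nn.Conv2d(ch, ch, 3, padding=1)   # trunk branch

    def forward(self, x):
        mask = self.up_conv(self.down(x))
        # Bilinear interpolation back to the input size, then sigmoid attention map
        mask = torch.sigmoid(F.interpolate(mask, size=x.shape[-2:],
                                           mode="bilinear", align_corners=False))
        return self.trunk(x) * mask                      # element-wise product fusion
```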

Feature enhancement module

Almost all UAV vehicle detection networks are built on general object detection networks. When these general networks are applied to small datasets, they generate a large number of redundant features for UAV vehicles, which reduces the discriminability of vehicle features. For the UAV vehicle detection task in this work, we design a feature enhancement module (FEM) that suppresses the redundant features and improves the discriminability of small-sized vehicles. The FEM is essentially a channel attention mechanism, as shown in Fig. 5: it quantifies the importance of each convolutional kernel in a feature layer, obtaining a one-dimensional vector [57] of the importance of all convolutional kernels in that layer. This one-dimensional vector is then used to adjust the feature map of each channel. The FEM increases the difference between vehicle features and redundant features, making the vehicle features more discriminative.

Fig. 5
figure 5

The structure diagram for the feature enhancement module

The input of the FEM is the feature maps of the network structure layers. In this work, we use global average pooling to obtain the response value of each channel feature map. The formula for the FEM is as follows:

$$\begin{aligned} \begin{aligned} z_i = F_{global}(X) = {\frac{1}{H \times W}} {\sum \limits _{m = 1}^H}{\sum \limits _{n = 1}^W}{x_i(m,n)} \end{aligned} \end{aligned}$$
(5)

where \(F_{global}\) represents global average pooling, H and W are the height and the width of feature map X respectively, i is the index of channels, and \(x_i(m,n)\) is the value of each point on the channel-i feature map. That is, all pixel values of the channel-i feature map are summed and averaged to obtain the response of channel i. The output is a vector of dimension C (C = 256). To avoid the vector amplitude being too large, we normalize it with the \(L_2\) function, as follows:

$$\begin{aligned} \begin{aligned} s_i = F_{L_2}(Z) = \frac{z_i}{\Vert z \Vert _2} = \frac{z_i}{\sqrt{{\sum \limits _{i = 1}^C}{z_i^2}}} \end{aligned} \end{aligned}$$
(6)

Finally, the normalized vector is used to scale the overall amplitude of the feature maps channel by channel. The enhanced feature map \({\tilde{x}}_i\) is obtained by multiplying the original feature map \(x_i\) by the weight \(s_i\), which makes the vehicle features and redundant features more distinguishable. The corresponding formula is as follows:

$$\begin{aligned} \begin{aligned} {\tilde{x}}_i = F_{scale}(s_i, x_i) = s_i \cdot x_i \end{aligned} \end{aligned}$$
(7)

where \(\cdot \) refers to multiplying weight vectors with feature maps to scale the overall amplitude of the feature maps channel by channel.
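Equations (5)–(7) can be summarized in a few lines; the sketch below is a minimal functional version (the eps term is added only for numerical stability and is an assumption).

```python
import torch

def feature_enhancement(x, eps=1e-12):
    """FEM of Eqs. (5)-(7): global average pooling, L2 normalization, channel scaling.
    x: feature maps of shape (B, C, H, W)."""
    z = x.mean(dim=(2, 3))                                # Eq. (5): channel responses z_i
    s = z / (z.norm(p=2, dim=1, keepdim=True) + eps)      # Eq. (6): L2-normalized weights s_i
    return x * s.unsqueeze(-1).unsqueeze(-1)              # Eq. (7): channel-wise re-scaling
```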

Experimental results and analysis

Datasets and implementation details

XDUAV Dataset. The dataset [42] contains a large number of truncated, occluded, and multi-angle small vehicles. The vehicle categories are car, bus, truck, tanker, motor and bicycle. The whole dataset contains 4344 images, with 3475 images for training and 869 images for testing.

The Stanford Drone Dataset. The dataset [60] contains annotated videos of pedestrians, bikers, skateboarders, cars, buses, and golf carts. In this work, we choose 3 categories of vehicles (i.e., car, bus and golf cart) as experimental data. The whole dataset contains 4331 images with 3500 images for training and 831 images for testing.

Loss Function. During training, a multi-task loss combining classification and regression is minimized, where the classification loss is the cross-entropy function and the regression loss is the SmoothL1 function. The total loss of the network is defined as:

$$\begin{aligned} \begin{aligned} L(p_i,t_i)= {\frac{1}{N_{pos}}}{\sum _i L_{cls}(p_i,c_i^*)}+{\frac{1}{N_{pos}}}{\sum _i{c_i^*}{L_{reg}(t_i,t_i^*)}}. \end{aligned} \end{aligned}$$
(8)

where i is the index of an anchor, \({{N}_{pos}}\) is the number of positive samples, and the label \(c_{i}^{*}\) is 1 if anchor i is positive and 0 otherwise. \(p_i\) and \(t_i\) are the predicted category and location of anchor i, respectively, and \(t_i^*\) is the ground-truth location and size of i. \(L_{cls}\) is the classification loss and \(L_{reg}\) is the regression loss.
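A minimal sketch of the multi-task loss of Eq. (8) is given below; it omits anchor matching and hard negative mining, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_preds, labels, box_targets):
    """Eq. (8): cross-entropy classification loss + SmoothL1 regression loss,
    both normalized by the number of positive anchors N_pos.
    cls_logits: (N, num_classes); box_preds, box_targets: (N, 4); labels: (N,), 0 = background."""
    pos_mask = labels > 0                                   # c_i^* = 1 for positive anchors
    num_pos = pos_mask.sum().clamp(min=1).float()

    loss_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / num_pos
    loss_reg = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask],
                                reduction="sum") / num_pos  # only positives contribute
    return loss_cls + loss_reg
```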

Performance Metric. To evaluate the results of the proposed detector on the two datasets, we use the typical PASCAL VOC metrics: average precision (AP) for a single category, mean average precision (mAP) over all categories, and detection speed (frames per second, FPS). The corresponding formulas are as follows:

$$\begin{aligned} \begin{aligned} Precision=&\frac{TP}{TP+FP},\quad Recall=\frac{TP}{TP+FN}\\ AP=&\int _{0}^{1}{P(R)\,\mathrm {d}R}\\ mAP=&{\frac{1}{n}}{\sum _{i=1}^{n} AP_i} \end{aligned} \end{aligned}$$
(9)

where TP (true positives) denotes the number of correctly detected vehicles, FP (false positives) denotes the number of background regions incorrectly detected as vehicles, and FN (false negatives) denotes the number of vehicles that are missed. P(R) is the precision at recall R, n is the number of categories, and \(AP_i\) is the average precision of category i. The mAP measures the overall performance of the proposed detector in correctly classifying and locating all UAV vehicles; the higher the mAP, the better the network performance.
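For reference, the sketch below computes AP as the area under the precision–recall curve (all-point interpolation) and mAP as the mean over categories; the PASCAL VOC evaluation additionally matches detections to ground truth with an IoU threshold, which is omitted here.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (all-point interpolation).
    recall, precision: arrays computed over the ranked detections of one category."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_category):
    return float(np.mean(ap_per_category))            # mAP = (1/n) * sum_i AP_i
```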

Training Implementation Details. All experiments are implemented with the open-source deep learning framework Caffe [61]. We use the VGG model pre-trained on ILSVRC [62] for parameter initialization. The network is optimized by stochastic gradient descent (SGD) with back-propagation; the weight decay is set to 0.0005 and the momentum to 0.9. The "step" strategy is adopted to adjust the learning rate. The maximum number of iterations is set to 120k with an initial learning rate of 0.001, which is reduced by a factor of 10 at iterations 80k and 100k. We train the model with batch size 16 and test it with batch size 1 on an NVIDIA GTX-1080Ti GPU with CUDA 8.0 and cuDNN 7.0.

Ablation study

In this work, we perform ablative analysis with the XDUAV dataset to verify the effectiveness of the proposed bi-directional information guidance network.

Baseline analysis

Considering that most vehicles in UAV images are small and weak, we make some adjustments to the baseline of the standard RefineDet. Because the deeper feature layers introduce overly large receptive fields that hurt small-vehicle detection performance, we remove the deeper layers conv6_1 and conv6_2. Meanwhile, we add a prediction layer conv3_3 to predict the relatively smaller vehicles. The modified baseline is named Modifying-R.

The standard RefineDet is mainly trained and tested on the PASCAL VOC2007 dataset, whose object distribution is completely different from that of the vehicle datasets. Therefore, the anchor settings need to be changed accordingly. Based on the overall distributions of the XDUAV dataset and the Stanford Drone dataset, the anchor settings are shown in Table 1. For the Conv3_3 and Conv_fc7 layers, we set only a single anchor each, which avoids the convergence problem caused by parameter redundancy during training and improves the detection performance of vehicles. This setting model is referred to as Setting-R.

Table 1 The setting of aspect ratio and scale for anchors in different prediction layers

We make comparative experiments in different settings, which demonstrate the effectiveness of the baseline settings as shown in Table 2. Furthermore, subsequent experiments will be analyzed in detail based on the Modifying-R and Setting-R model.

Table 2 Performance comparison of different baseline settings

The shallow-level feature information guidance part analysis

  • The importance of mid-/low-level feature information guidance

Small-sized vehicle detection needs not only high-level semantic information that can distinguish vehicles from other categories, but also mid-/low-level information that can accurately describe vehicles. In this section, we demonstrate the importance of the mid-/low-level information guidance for vehicle detection. We embed mid-/low-level feature information into the prediction layers conv3_3, conv4_3, conv5_3 and conv_fc7 of the backbone network and perform comparative experiments, as shown in Table 3.

Table 3 The ablation study on the embedding mid-/low-level information into different prediction layers

The experimental results indicate that embedding mid-/low-level information into the conv3_3 and conv4_3 layers has little effect, because these two layers are themselves shallow features of the backbone. Introducing additional parameters there actually brings feature redundancy, burdening the network and decreasing detection performance. Adding appropriate mid-/low-level information to the conv5_3 and conv_fc7 layers, in contrast, compensates for the missing vehicle information, which is conducive to the precise localization of small-sized vehicles.

Therefore, the shallow-level feature information guidance part is very important, which ensures that each prediction layer has abundant mid-/low-level feature information to improve the detection performance.

Note that the detection results shown in Table 3 are obtained using the basic convolution operation in the feature transform module (FTM). The basic operation sequentially passes inputs through a 3\(\,\times \,\)3 convolutional layer, a 1\(\,\times \,\)1 convolutional layer, and a 3\(\,\times \,\)3 convolutional layer to obtain mid-/low-level features, without the involvement of multi-channel dilated convolutions.

  • The effectiveness of feature transform module

The shallow-level feature information guides the backbone network to enhance detailed and spatial features for small-sized vehicles. We use the FTM to transform the image pyramid into the mid-/low-level feature information. In order to demonstrate the effectiveness of the FTM, we use a single convolution layer and multi-channel (including two and three channels) dilated convolution layers for verification, as shown in Table 4.

Table 4 The ablation study on the multi-rate dilated convolution in the feature transform module

The mAP of the two-channel dilated convolutional layers with \(r = 1, 3\) and \(r = 2, 3\) is 0.2% and 0.1% higher, respectively, than that of a single convolution layer. The reason is that the receptive field of a convolution with a large dilation rate becomes larger, so the contour information of vehicles can be extracted. The concatenated convolutional features with different dilation rates are therefore more abundant, and the feature responses are stronger. Compared with the two-channel dilated convolutional layers with \(r = 1, 3\) and \(r = 2, 3\), the three-channel dilated convolutional layers with \(r = 1, 2, 3\) bring more diverse receptive fields. In this way, more information from different ranges can be obtained, leading to an increase in mAP. However, considering that most vehicles in UAV images are weak and small, a large receptive field may bring more background interference. Therefore, the detection performance of the three-channel dilated convolution is not as good as that of the two-channel dilated convolution with \(r = 1, 2\): the features extracted by this two-channel convolution are more delicate and retain richer details, which is more conducive to small-vehicle detection.

Therefore, for the FTM, we adopt the two-channel dilated convolution with \(r = 1, 2\) to obtain information at different receptive fields and enrich the mid-/low-level features of vehicles. Furthermore, dilated convolutional layers require a padding operation on the inputs, which can increase the computational complexity of the network. The testing time per image for the different dilated convolutions is shown in the last column of Table 4, indicating that the FTM does not affect the real-time performance.

  • The significance of light-weight attention module

In the mid-/low-level information guidance part, unnecessary background information is brought into the backbone network during the passing of mid-/low-level information, which affects the detection performance of vehicles. Inspired by the residual attention module (RAM) [59], we design a light-weight attention module (LAM) to suppress this irrelevant background information. The configuration in line 1 of Table 5 uses only the residual product fusion method to fuse the output of the FTM with the corresponding backbone features and embeds the fused information directly into the backbone network without any attention module; “Without” in line 1 of Table 5 means that no spatial attention module was used in the experiment. Due to the limited features of small vehicles and the influence of irrelevant background information, the network’s attention to small vehicles is insufficient in this configuration. As shown in lines 2 and 3 of Table 5, introducing attention modules greatly improves vehicle detection performance. Although the RAM improves detection performance, it also introduces a large number of parameters, increasing the complexity of the network and slowing down detection. The comparison of the experimental results indicates the effectiveness of the LAM: especially for smaller vehicles such as bicycles, the AP is increased by 1.7%. The LAM is thus able to locate the regions of smaller vehicles and invest more attention in those regions to obtain more detailed information, while ignoring irrelevant information. The LAM can quickly filter out high-value information from limited attention resources without affecting real-time detection performance.

Table 5 The ablation study on the light-weight attention module (LAM)
Table 6 The effect of the feature enhancement module (FEM) in the deep-level semantic information guidance part

The deep-level semantic information guidance part analysis

In the deep-level semantic information guidance part, the element-wise sum fusion method is used to implement the feature fusion module (FFM), which guides high-level contextual information to enhance features for small-sized vehicles. Meanwhile, a feature enhancement module (FEM) is proposed to suppress the redundant features and improve the discriminability of small-sized vehicles.

  • The necessity of feature enhancement module

As shown in Table 6, the comparison of experimental results indicates the necessity of the FEM; “Without” in line 1 of Table 6 means that no channel attention module was used in the experiment. The light-weight attention module (LAM) is essentially a spatial attention mechanism, while the FEM is a channel attention mechanism. We first employ the LAM to filter out irrelevant background information, and then use the FEM to improve the discriminability of small-sized vehicles. These two modules enhance the feature extraction ability of the BDIG-Net from the spatial and channel perspectives, respectively, and thus play complementary roles. This complementarity not only benefits classification performance, but also greatly improves the localization prediction of vehicles.

Overall performance

XDUAV dataset analysis

Owing to the feature transform module (FTM) and the light-weight attention module (LAM) in the shallow-level feature information guidance part, and the feature enhancement module (FEM) in the deep-level semantic information guidance part, the proposed BDIG-Net achieves 92.9% mAP at 50 FPS on the XDUAV dataset. In this section, we compare several single-stage real-time methods and two-stage high-accuracy methods in terms of mAP and FPS, as shown in Table 7. The proposed method achieves the best performance while maintaining real-time detection. Note that all methods are trained under the same conditions, so the experimental results are credible. Figure 6 shows comparison results on the XDUAV dataset. Compared with the single-directional (deep-level semantic) information guidance network (Fig. 6b), the bi-directional information guidance network (BDIG-Net) (Fig. 6a) clearly reduces missed detections and redundant bounding boxes. Especially for vehicles with scale diversity and occlusion, the proposed detector shows good robustness for precise vehicle localization. These results demonstrate the effectiveness of the BDIG-Net on the XDUAV dataset.

Table 7 Detection results (%) of different methods for the XDUAV dataset

The stanford drone dataset analysis

To further demonstrate the practicality and robustness of the BDIG-Net detector for UAV vehicles, we evaluate its detection performance on the Stanford Drone dataset. We use the same ablation experiments as for the XDUAV dataset to analyze the shallow-level feature information guidance part and the deep-level semantic information guidance part, and the results on the Stanford Drone dataset also demonstrate the effectiveness of the proposed BDIG-Net. The overall results of the different detection methods are shown in Table 8. The proposed method achieves 91.1% mAP, which is superior to the other real-time vehicle detection methods. Figure 7 shows detection results of the proposed BDIG-Net detector on the Stanford Drone dataset in different scenarios.

Table 8 Detection results (%) of different methods for Stanford Drone dataset
Fig. 6
figure 6

Qualitative results comparing the proposed bi-directional information guidance network with the deep-level semantic information guidance network. a Detection results from the single-directional information guidance network and b detection results from the bi-directional information guidance network

Fig. 7
figure 7

Qualitative detection results in different challenging scenarios for Stanford Drone dataset

Conclusion

This paper introduces a bi-directional information guidance network (BDIG-Net) for multi-category vehicle detection, which achieves accurate and real-time detection in UAV images. Firstly, the BDIG-Net is divided into two parts: the shallow-level feature information guidance part and the deep-level semantic information guidance part. Secondly, in the shallow-level guidance part, we use the FTM to transform the image pyramid into mid-/low-level feature information. Meanwhile, a residual product fusion method is adopted to implement feature fusion, which guides mid-/low-level information to be embedded into the BDIG-Net. To reduce unnecessary shallow background information in the fused features, the LAM is designed to make the network focus more on small-sized vehicles. Thirdly, in the deep-level guidance part, we use the top-down architecture of the standard RefineDet to fuse deeper semantic information. The element-wise sum fusion method is used to implement feature fusion, which guides high-level contextual information to enhance the features of small-sized vehicles. Meanwhile, a feature enhancement module (FEM) is proposed to suppress redundant features and improve the discriminability of small-sized vehicles. The BDIG-Net not only integrates high-level semantic information that is conducive to classification into shallow features, but also, more importantly, integrates mid-/low-level information that is conducive to localization into deep features. To some extent, the BDIG-Net reduces the information imbalance across feature layers. The experimental results on two datasets demonstrate that the proposed method detects small-sized vehicles more accurately while achieving real-time detection. In future work, we plan to integrate inter-frame correlation and scene understanding into the UAV vehicle detection network, so as to infer and detect smaller and weaker vehicle targets.