
1 Introduction

Generic object detection has been extensively studied by the computer vision community over several decades [4, 6, 8, 10, 13, 16, 17, 22, 26, 37, 41, 42, 44, 45, 48] due to its appeal to both academic research explorations as well as commercial applications. Given an image of interest, the goal of object detection is to predict the locations of objects and classify them at the same time. The key challenge of the object detection task is to handle variations in object scale, pose, viewpoint and even part deformations when generating the bounding boxes for specific object categories.

Fig. 1.

Architecture of the Deep Regionlets detection framework. It consists of a region selection network (RSN) and a deep regionlet learning module. The region selection network performs non-rectangular region selection from the detection window proposal generated by the region proposal network. The deep regionlet learning module learns the regionlets through a spatial transformation and a gating network. The entire pipeline is end-to-end trainable. For better visualization, the region proposal network is not displayed here.

Numerous methods have been proposed based on hand-crafted features (e.g. HOG [10], LBP [1], SIFT [30]). These approaches usually involve an exhaustive search over possible locations, scales and aspect ratios of the object using the sliding window approach. In contrast, the regionlet-based detection framework of Wang et al. [45] has gained a lot of attention, as it provides the flexibility to deal with different scales and aspect ratios without performing an exhaustive search. It first introduced the concept of a regionlet by defining a three-level structural relationship: candidate bounding boxes (sliding windows), regions inside the bounding box and groups of regionlets (sub-regions inside each region). It operates by directly extracting features from regionlets in several selected regions within an arbitrary detection bounding box and performing (max) pooling among the regionlets. Such a feature extraction hierarchy is capable of dealing with variable aspect ratios and flexible feature sets, which leads to improved learning of robust feature representations for region-based object detection.

Recently, deep learning has achieved significant success on many computer vision tasks such as image classification [20, 24, 34], semantic segmentation [29] and object detection [16] using deep convolutional neural network (DCNN) architectures. Despite the excellent performance of deep learning-based detection frameworks, most network architectures [8, 28, 37] do not take advantage of successful conventional ideas such as the deformable part-based model (DPM) or regionlets. These ideas have been effective for modeling object deformation, sub-categories and multiple aspect ratios. Recent advances [9, 32, 33] have achieved promising results by combining the conventional DPM-based detection methodology with deep neural network architectures.

These observations motivate us to establish a bridge between deep convolutional neural networks and the conventional object detection schema. In this paper, we incorporate the conventional regionlet method into an end-to-end trainable deep learning framework. Although the regionlet methodology can handle arbitrary bounding boxes, several drawbacks arise when it is directly integrated into a deep learning framework. First, in [45], Wang et al. proposed to learn cascade object classifiers after hand-crafted feature extraction in each regionlet; end-to-end learning is not feasible in this framework. Second, regions in regionlet-based detection have to be rectangular, which does not effectively model objects whose deformations produce variable shapes. Moreover, both regions and regionlets are fixed after training is completed.

To this end, we propose a novel object detection framework named “Deep Regionlets” to integrate the deep learning framework into the traditional regionlet method [45]. The overall design of the proposed detection system is illustrated in Fig. 1. It consists of a region selection network (RSN) and a deep regionlet learning module. The region selection network performs non-rectangular region selection from the detection window proposal (RoI) to address the limitations of the traditional regionlet approach. We further design a deep regionlet learning module to learn the regionlets through a spatial transformation and a gating network. By using the proposed gating network, which is a soft regionlet selector, the resulting feature representation is more effective for detection. The entire pipeline is end-to-end trainable using only the input images and ground truth bounding boxes.

We conduct a detailed analysis of our approach to understand its merits and evaluate its performance. Extensive experiments on two detection benchmark datasets, PASCAL VOC [11] and Microsoft COCO [27] show that the proposed deep regionlet approach outperforms several competitors [8, 9, 32, 37]. Even without segmentation labels, we outperform state-of-the-art algorithms such as Mask R-CNN [18] and RetinaNet [26]. To summarize, we make the following contributions:

  • We propose a novel deep regionlet approach for object detection. Our work extends the traditional regionlet method to the deep learning framework. The system is trainable in an end-to-end manner.

  • We design the RSN, which first performs non-rectangular region selection within the detection bounding box generated from a detection window proposal. It provides more flexibility in modeling objects with variable shapes and deformable parts.

  • We propose a deep regionlet learning module, including feature transformation and a gating network. The gating network serves as a soft regionlet selector and lets the network focus on features that benefit detection performance.

  • We present empirical results on object detection benchmark datasets, demonstrating superior performance over the state-of-the-art.

2 Related Work

Many approaches have been proposed for object detection, including both traditional ones [13, 42, 45] and deep learning-based approaches [6, 8, 9, 16, 17, 19, 21, 28, 32, 35, 37, 41, 43, 48, 50,51,52]. Traditional approaches mainly use hand-crafted features to train object detectors under the sliding window paradigm. One of the earliest works [42] used boosted cascaded detectors for face detection, which led to its wide adoption. The Deformable Part Model (DPM) [12] introduced the concept of deformable part models to handle object deformations. Due to the rapid development of deep learning techniques [2, 5, 20, 24, 34, 40, 46, 47, 49], deep learning-based detectors have become dominant.

Deep learning-based detectors can be further categorized into single-stage detectors and two-stage detectors, depending on whether they have a proposal-driven mechanism. Single-stage detectors [14, 25, 26, 28, 35, 38, 48, 50] apply regular, dense sampling windows over object locations, scales and aspect ratios. By directly exploiting multiple layers within a deep CNN, single-stage detectors achieve high speed, but their accuracy is typically lower than that of two-stage detectors.

Two-stage detectors [8, 17, 37] involve two steps. They first generate a sparse set of candidate detection bounding box proposals using a Region Proposal Network (RPN). After the RPN filters out the majority of negative background boxes, the second stage classifies the proposals and performs bounding box regression to predict object categories and their corresponding locations. Two-stage detectors consistently achieve higher accuracy than single-stage detectors, and numerous extensions have been proposed [6, 7, 9, 18, 21, 32, 41]. Our method follows the two-stage detector architecture by taking advantage of the RPN without requiring dense sampling of object locations, scales and aspect ratios.

3 Our Approach

In this section, we first review the traditional regionlet-based detection methods and then present the overall design of the end-to-end trainable deep regionlet approach. Finally, we discuss in detail each module in the proposed end-to-end deep regionlet approach.

3.1 Traditional Regionlet-based Approach

A regionlet is a base feature extraction region defined proportionally to a window (i.e. a sliding window or a detection bounding box) at an arbitrary resolution (i.e. size and aspect ratio). Wang et al. [45] first introduced the concept of a regionlet, as illustrated in Fig. 2. It defines a three-level structure consisting of a detection bounding box, a number of regions inside the bounding box and a group of regionlets (sub-regions inside each region). In Fig. 2, the yellow box is a detection bounding box and R is a rectangular feature extraction region inside the bounding box. Furthermore, small sub-regions \(r_i\) (\(i = 1,\dots ,N\)), e.g. \(r_1\) and \(r_2\), are chosen within region R; we define them as a set of regionlets.

The difficulty posed by arbitrary detection bounding boxes is addressed by using the relative positions and sizes of regionlets and regions. However, in the traditional approach, the initialization of regionlets possesses randomness, and both regions (R) and regionlets (i.e. \(r_1\), \(r_2\)) are fixed after training. Moreover, it is based on hand-crafted features (e.g. HOG [10] or LBP [1]) extracted from each regionlet and hence is not end-to-end trainable. To this end, we propose the following deep regionlet-based approach to address these limitations.

Fig. 2.

Illustration of structural relationships among the detection bounding box, feature extraction regions and regionlets. The yellow box is a detection bounding box and R is a feature extraction region shown as a purple rectangle with filled dots inside the bounding box. Inside R, two small sub-regions denoted as \(r_1\) and \(r_2\) are the regionlets. (Color figure online)

3.2 System Architecture

Generally speaking, an object detection network performs a sequence of convolutional operations on an image of interest using a deep convolutional neural network. At some layer, the network bifurcates into two branches. One branch, the RPN, generates a set of candidate bounding boxes, while the other branch performs classification and regression by pooling the convolutional features inside the proposed bounding boxes generated by the region proposal network [8, 37]. Taking advantage of this detection network, we introduce the overall design of the proposed object detection framework, named “Deep Regionlets”, as illustrated in Fig. 1.

The general architecture consists of an RSN and a deep regionlet learning module. In particular, the RSN is used to predict the transformation parameters that choose regions given a candidate bounding box, which is generated by the region proposal network. The regionlets are further learned within each selected region defined by the region selection network. The system is designed to be trained in a fully end-to-end manner using only the input images and ground truth bounding boxes. The RSN and the regionlet learning module are learned simultaneously over each selected region given the detection window proposal.

Fig. 3.

(a) Example of initialization of one affine transformation parameter. The normalized affine transformation parameters \(\varTheta _0 = [\frac{1}{3}, 0, -\frac{2}{3}; 0, \frac{1}{3}, \frac{2}{3}]\) (\(\theta _i \in [-1, 1]\)) select the top-left region in the \(3\times 3\) evenly divided detection bounding box, shown as the purple rectangle. (b) Design of the gating network. f denotes the non-negative gate function. (Color figure online)

3.3 Region Selection Network

We design the RSN to have the following properties: (1) end-to-end trainable; (2) simple structure; (3) able to generate regions with arbitrary shapes. Keeping these in mind, we design the RSN to predict a set of affine transformation parameters. Because these affine transformations do not require the regions to be rectangular, we have more flexibility in modeling objects with arbitrary shapes and deformable parts.

Specifically, we design the RSN as a small neural network with three fully connected layers. The first two fully connected layers have an output size of 256, with ReLU activations. The last fully connected layer has an output size of six and predicts the set of affine transformation parameters \(\varTheta = [ \theta _1, \theta _2, \theta _3; \theta _4, \theta _5, \theta _6 ]\).
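To make the structure concrete, the following is a minimal PyTorch-style sketch of such a region selection head (the framework choice, module name and input feature dimension are illustrative assumptions, not specified in the text):

```python
import torch
import torch.nn as nn

class RegionSelectionHead(nn.Module):
    """Sketch of the RSN: three fully connected layers mapping RoI features
    to the six affine parameters Theta = [t1, t2, t3; t4, t5, t6]."""

    def __init__(self, in_dim=256 * 7 * 7):    # in_dim is an assumed RoI feature size
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 6)            # predicts the six affine parameters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, roi_feat):                # roi_feat: (num_rois, in_dim)
        x = self.relu(self.fc1(roi_feat))
        x = self.relu(self.fc2(x))
        theta = self.fc3(x)                     # (num_rois, 6), later viewed as (num_rois, 2, 3)
        return theta
```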

Note that the candidate detection bounding boxes proposed by the RPN have arbitrary sizes and aspect ratios. In order to address this difficulty, we use the relative positions and sizes of the selected region within a detection bounding box. The candidate bounding box generated by the RPN is defined by the top-left point (\(w_0, h_0\)), width w and height h of the box. We normalize the coordinates by the width w and height h of the box. As a result, we can use the normalized affine transformation parameters \(\varTheta = [ \theta _1, \theta _2, \theta _3; \theta _4, \theta _5, \theta _6 ]\) (\(\theta _i \in [-1, 1]\)) to evaluate one selected region within one candidate detection window at different sizes and aspect ratios, without scaling images to multiple resolutions or using multiple components to enumerate possible aspect ratios, as is done with anchors [14, 28, 37].

Initialization of Region Selection Network: Taking advantage of relative and normalized coordinates, we initialize the RSN by equally dividing the whole detection bounding box into several sub-regions, called cells, without any overlap among them. Figure 3(a) shows an example of the initialization of one affine transformation for a \(3\times 3\) division. The first cell, which is the top-left bin in the whole region (detection bounding box), can be selected by initializing the corresponding affine transformation parameters as \(\varTheta _0 = [\frac{1}{3}, 0, -\frac{2}{3}; 0, \frac{1}{3}, \frac{2}{3}]\). The other eight cells of the \(3 \times 3\) grid are initialized in a similar way.
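For illustration, the initial affine parameters for all cells of a \(k\times k\) grid can be generated as in the sketch below. We assume the coordinate convention of torch.nn.functional.affine_grid, where both axes are normalized to \([-1, 1]\) and y increases downward; under that convention the y-translation of the top-left cell has the opposite sign from the value quoted above, which follows the paper's own convention.

```python
import torch

def init_grid_affine(k: int) -> torch.Tensor:
    """Initial affine parameters for a k x k grid of non-overlapping cells,
    one (2, 3) matrix per cell, in normalized [-1, 1] coordinates."""
    thetas = torch.zeros(k * k, 2, 3)
    scale = 1.0 / k
    for row in range(k):
        for col in range(k):
            # center of cell (row, col) in normalized coordinates
            tx = -1.0 + (2 * col + 1) * scale
            ty = -1.0 + (2 * row + 1) * scale
            thetas[row * k + col] = torch.tensor([[scale, 0.0, tx],
                                                  [0.0, scale, ty]])
    return thetas

# e.g. init_grid_affine(3)[0] -> [[1/3, 0, -2/3], [0, 1/3, -2/3]]
```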

3.4 Deep Regionlet Learning

After regions are selected by the RSN, regionlets are learned from the selected regions defined by the normalized affine transformation parameters. Note that our motivation is to design a network that can be trained in a fully end-to-end manner using only the input images and ground truth bounding boxes. Therefore, both the region selection and the regionlet learning should be trainable within the CNN framework. Moreover, we would like the regionlets extracted from the selected regions to better represent objects with variable shapes and deformable parts.

As shown by the spatial transformer network [23], any parameterizable transformation, including translation, scaling, rotation, affine or even projective transformation, can be learned by a spatial transformer. In this section, we introduce our deep regionlet learning module, which learns the regionlets in the selected region defined by the affine transformation parameters.

More specifically, we aim to learn regionlets from one selected region defined by one affine transformation \(\varTheta \) to better match the shapes of objects. This is done with a selected region R from the RSN, transformation parameters \(\varTheta = [ \theta _1, \theta _2, \theta _3; \theta _4, \theta _5, \theta _6 ]\) and a set of feature maps \(Z = \{Z_{i}, i = 1,\dots ,n \}\). Without loss of generality, let \(Z_{i}\) be one of the n feature maps. A selected region R is of size \(w \times h\) with top-left corner \((w_0, h_0)\). Within the feature map \(Z_{i}\), we propose the following regionlet learning module.

Let s denote the source and t the target. We define \((x_p^s, y_p^s)\) as the spatial location in the original feature map \(Z_{i}\) and \((x_p^t, y_p^t)\) as the spatial location in the output feature map after the spatial transformation. \(U_{nm}^c\) is the value at location (n, m) in channel c of the input feature map. The total output feature map V is of size \(H \times W\). Let \(V(x_p^t, y_p^t, c| \varTheta ,R)\) be the output feature value at location (\(x_p^t, y_p^t\)) (\(x_p^t\in [0, H]\), \(y_p^t\in [0, W]\)) in channel c, which is computed as

$$\begin{aligned} V(x_p^t, y_p^t, c| \varTheta ,R) = \sum _{n}^{H} \sum _{m}^{W} U_{nm}^c \max (0, 1 - | x_p^s - m|) \max (0, 1 - | y_p^s - n|) \end{aligned}$$
(1)
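Eq. (1) is the standard bilinear sampling kernel of the spatial transformer [23]. In a modern framework it need not be implemented by hand; the sketch below realizes it with affine_grid/grid_sample. The alignment flag and the assumption that the features have already been cropped (or indexed) to the RoI are implementation choices, not taken from the text.

```python
import torch
import torch.nn.functional as F

def regionlet_sample(feat, theta, out_h, out_w):
    """Bilinearly sample an out_h x out_w grid of regionlet responses from a
    selected region, i.e. one realization of Eq. (1) via a spatial transformer.

    feat  : (N, C, H_in, W_in) feature maps of the RoI
    theta : (N, 2, 3) normalized affine parameters predicted by the RSN
    """
    grid = F.affine_grid(theta,
                         size=(feat.size(0), feat.size(1), out_h, out_w),
                         align_corners=False)       # maps target -> source coordinates
    v = F.grid_sample(feat, grid, mode='bilinear',
                      align_corners=False)          # bilinear kernel of Eq. (1)
    return v                                        # (N, C, out_h, out_w)
```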

Back Propagation Through Spatial Transform.

To allow back propagation of the loss through the regionlet learning module, we define the gradients with respect to both the feature maps and the region selection network. In the backward function of this layer, we compute the partial derivatives with respect to both the feature map variable \(U_{nm}^c\) and the affine transformation parameters \(\varTheta = [ \theta _1, \theta _2, \theta _3; \theta _4, \theta _5, \theta _6 ]\). Motivated by [23], the partial derivative with respect to the feature map is:

$$\begin{aligned} \frac{\partial V(x_p^t, y_p^t, c| \varTheta ,R)}{\partial U_{nm}^c} = \sum _{n}^{H} \sum _{m}^{W} \max (0, 1 - | x_p^s - m|) \times \max (0, 1 - | y_p^s - n|) \end{aligned}$$
(2)

Moreover, during back propagation, we need to compute the gradient with respect to each affine transformation parameter in \(\varTheta = [\theta _{1}, \theta _{2}, \theta _{3}; \theta _{4}, \theta _{5}, \theta _{6}]\). In this way, the region selection network can also be updated to adjust the selected region. Due to space limitations, we take \(\theta _{1}\) as an example; similar derivatives can be computed for the other parameters \(\theta _i\) (\(i = 2,\dots ,6\)).

$$\begin{aligned} \frac{\partial V(x_p^t, y_p^t, c| \varTheta ,R)}{\partial \theta _{1}} = x_p^t \sum _{n}^{H} \sum _{m}^{W} U_{nm}^c \max (0, 1 - | y_p^s - n|) \times {\left\{ \begin{array}{ll} 0 \text { if } |m - x_p^s| \ge 1 \\ 1 \text { if } m > x_p^s\\ -1 \text { if } m < x_p^s \end{array}\right. } \end{aligned}$$
(3)

It is worth noting that \((x_p^t, y_p^t)\) are normalized coordinates in the range \([-1, 1]\) so that they can be scaled with respect to w and h with the start position \((w_0, h_0)\).
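In practice, the gradients in Eqs. (2) and (3) do not have to be coded manually: automatic differentiation through the bilinear sampling operation produces them, so the detection loss back-propagates into the RSN parameters. A toy illustration, assuming the grid_sample-based sampling sketched above:

```python
import torch
import torch.nn.functional as F

# toy feature map and a learnable affine transform (the top-left 3x3 cell)
feat = torch.randn(1, 8, 14, 14)
theta = torch.tensor([[[1 / 3, 0.0, -2 / 3],
                       [0.0, 1 / 3, -2 / 3]]], requires_grad=True)

grid = F.affine_grid(theta, size=(1, 8, 4, 4), align_corners=False)
v = F.grid_sample(feat, grid, mode='bilinear', align_corners=False)

loss = v.sum()           # stand-in for the detection loss
loss.backward()
print(theta.grad.shape)  # (1, 2, 3): gradients w.r.t. all six affine parameters
```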

Gating Network. The gating network, which serves as a soft regionlet selector, is used to assign different weights to the regionlets and generate the regionlet feature representation. We design a simple gating network using a fully connected layer with sigmoid activation, shown in Fig. 3(b). The output values of the gating network lie in the range [0, 1]. Given the output feature maps \(V(x_p^t, y_p^t, c| \varTheta ,R)\) described above, we use a fully connected layer to generate the same number of outputs as the feature maps \(V(x_p^t, y_p^t, c| \varTheta ,R)\), followed by a sigmoid activation layer to generate the corresponding weights. The final feature representation is generated by the product of the feature maps \(V(x_p^t, y_p^t, c| \varTheta ,R)\) and their corresponding weights.
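A minimal sketch of the gating branch is given below; the flattened input it operates on and the per-element granularity of the gate are assumptions about the implementation:

```python
import torch
import torch.nn as nn

class RegionletGate(nn.Module):
    """Soft regionlet selector: a fully connected layer followed by a sigmoid
    produces one weight per regionlet response; the gated feature is the
    elementwise product of V and these weights."""

    def __init__(self, c, h, w):
        super().__init__()
        self.num = c * h * w
        self.fc = nn.Linear(self.num, self.num)

    def forward(self, v):                            # v: (N, C, H, W) regionlet responses
        n = v.size(0)
        g = torch.sigmoid(self.fc(v.reshape(n, -1)))  # weights in [0, 1]
        return v * g.reshape(v.shape)

# usage: gate = RegionletGate(256, 4, 4); out = gate(v) for v of shape (N, 256, 4, 4)
```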

Regionlet Pool Construction. Object deformations may occur at different scales. For instance, in person detection, deformation can be caused by different body parts. The same number of regionlets (of size \(H\times W\)) learned from a smaller selected region has a higher extraction density, which may lead to a non-compact regionlet representation. In order to learn a compact, efficient regionlet representation, we further perform a pooling (e.g. max/average) operation over the feature maps \(V(x_p^t, y_p^t, c| \varTheta ,R)\) of size (\(H \times W\)). We reap two benefits from the pool construction: (1) the regionlet representation is compact (small size); (2) regionlets learned from selected regions of different sizes represent those regions in the same efficient way, and thus handle object deformations at different scales.
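One way to realize this pooling is adaptive pooling to a fixed output size, so that regionlets sampled at different densities produce representations of the same dimensionality; the choice of max pooling in the sketch below is illustrative (average pooling is the other option mentioned in the text):

```python
import torch
import torch.nn.functional as F

def regionlet_pool(v, out_size=(4, 4)):
    """Pool the (gated) regionlet maps V down to a fixed, compact size so that
    regions sampled at different densities yield representations of the same
    dimensionality."""
    return F.adaptive_max_pool2d(v, out_size)

# v of shape (N, C, 6, 6) and (N, C, 5, 5) both map to (N, C, 4, 4)
```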

3.5 Relations to Recent Works

Our deep regionlet approach is related to some recent works in different aspects. We discuss both similarities and differences in detail in the supplementary material section.

4 Experiments

In this section, we present comprehensive experimental results of the proposed approach on two challenging benchmark datasets: PASCAL VOC [11] and MS-COCO [27]. There are in total 20 object categories in the PASCAL VOC [11] dataset. We follow the common settings used in [4, 8, 17, 37] to enable fair comparisons.

More specifically, we train our deep model on (1) VOC 2007 trainval and (2) the union of VOC 2007 trainval and 2012 trainval, and evaluate on the VOC 2007 test set. We also report results on the VOC 2012 test set, following the suggested settings in [4, 8, 17, 37]. In addition, we report results on the VOC 2007 test split for ablation studies. MS-COCO [27] contains 80 object categories. Following the official settings on the COCO website, we use the COCO 2017 trainval split (union of 135k images from the train split and 5k images from the val split) for training. We report the COCO-style average precision (AP) on the test-dev 2017 split, which requires evaluation on the MS-COCO server.

For the base network, we choose both VGG-16 [40] and ResNet-101 [20] to demonstrate that our approach generalizes regardless of which network backbone we use. The à trous algorithm [29, 31] is adopted in stage 5 of ResNet-101. Following the suggested settings in [8, 9], we also set the pooling size to 7 by changing the conv5 stage’s effective stride from 32 to 16 to increase the feature map resolution. In addition, the stride of the first convolution layer in the conv5 stage is changed from 2 to 1. Both backbone networks are initialized with pre-trained ImageNet [20, 24] models. In the following sections, we report the results of a series of ablation experiments to understand the behavior of the proposed deep regionlet approach. Furthermore, we present comparisons with state-of-the-art detectors [8, 9, 18, 25, 26, 37] on both the PASCAL VOC [11] and MS COCO [27] datasets.

4.1 Ablation Study

For a fair comparison, we adopt ResNet-101 as the backbone network for ablation studies. We train our model on the union set of VOC \(2007+2012\) trainval and evaluate on the VOC 2007 test set. The shorter side of the image is set to 600 pixels, as suggested in [8, 17, 37]. Training is performed for 60k iterations with an effective mini-batch size of 4 on 4 GPUs, where the learning rate is set to \(10^{-3}\) for the first 40k iterations and to \(10^{-4}\) for the remaining 20k iterations. First, we investigate each component of the proposed approach, (1) the RSN, (2) deep regionlet learning and (3) soft regionlet selection, by comparing it with several baselines:

  (1) Global RSN. The RSN selects only one global region, initialized as the identity transformation (i.e. \(\varTheta _0 = [1, 0, 0; 0, 1, 0]\)). This is equivalent to global regionlet learning within the RoI.

  (2) Offset-only RSN. We restrict the RSN to learn only the offsets by keeping \(\theta _1, \theta _2, \theta _4, \theta _5\) fixed during training. In this way, the region selection network only selects rectangular regions with offsets relative to the initialized regions. This baseline is similar to the deformable RoI pooling in [9] and [32].

  (3) Non-gating selection: deep regionlet without soft selection. No soft regionlet selection is performed after regionlet learning. In this case, each learned regionlet contributes equally to the final feature representation.

Table 1. Ablation study of each component in deep regionlet approach. Output size \(H\times W\) is set to \(4\times 4\) for all the baselines
Table 2. Results of ablation studies when the RSN selects different number of regions and regionlets are learned at different level of density.

Results are shown in Table 1. First, when the region selection network selects only one global region, the RSN reduces to a single localization network [23]. In this case, regionlets are extracted in a global manner. It is interesting to note that the network is still able to converge when the region selection network selects only one region, which differs from [8, 37]. However, the performance is extremely poor because no discriminative regionlets can be explicitly learned within the region. More importantly, when we compare our approach and the offset-only RSN with the global RSN, the results clearly demonstrate that the RSN is indispensable in the deep regionlet approach.

Moreover, the offset-only RSN can be viewed as similar to deformable RoI pooling in [9, 32]. These methods all learn the offset of a rectangular region with respect to its reference position, which leads to improvements over [37]. However, non-gating selection outperforms the offset-only RSN by \(2.8\%\) while selecting non-rectangular regions. The improvement demonstrates that non-rectangular region selection provides more flexibility around the original reference region and thus better models non-rectangular objects with sharp shapes and deformable parts. Last but not least, by using the gate function to perform soft regionlet selection, the performance is further improved by \(0.7\%\).

Next, we present ablation studies on the following questions to better understand the region selection network and the regionlet learning module: (1) How many regions should we learn using the region selection network? (2) How many regionlets should we learn in a selected region (i.e. what should the density \(H\times W\) be)?

How Many Regions Should We Learn Using the Region Selection Network? We investigate how the detection performance varies when different numbers of regions are selected by the region selection network. All the regions are initialized as described in Sect. 3.3 without any overlap between regions. Without loss of generality, we report results for \(4\ (2\times 2)\), \(9\ (3 \times 3)\) and \(16\ (4 \times 4)\) regions in Table 2. We observe that the mean AP increases when the number of selected regions is increased from \(4\ (2 \times 2)\) to \(9\ (3 \times 3)\) for a fixed number of regionlets, but saturates at \(16\ (4 \times 4)\) selected regions.

How Many Regionlets Should We Learn in One Selected Region? Next, we investigate how the detection performance varies when different numbers of regionlets are learned in one selected region by varying H and W. Without loss of generality, we set \(H = W\) and vary the H value from 2 to 6. In Table 2, we report results when the number of regionlets is set to \(4\ (2 \times 2)\), \(9\ (3 \times 3)\), \(16\ (4 \times 4)\), \(25\ (5 \times 5)\) and \(36\ (6 \times 6)\) before the regionlet pool construction.

First, we observe that increasing the number of regionlets from \(4\ (2\times 2)\) to \(25\ (5\times 5)\) results in improved performance. As more regionlets are learned from one region, more spatial and shape information about objects can be learned. The proposed approach achieves the best performance when regionlets are extracted at the \(16\ (4\times 4)\) or \(25\ (5 \times 5)\) density levels. It is also interesting to note that when the density increases from \(25\ (5 \times 5)\) to \(36\ (6 \times 6)\), the performance degrades slightly. When regionlets are learned at a very high density, some redundant spatial information that is not useful for detection may be learned, thus affecting the final detection decision. In all the remaining experiments, we use 16 regions selected by the RSN and set the output size \(H \times W = 4 \times 4\).

Table 3. Detection results on PASCAL VOC using VGG16 as backbone architecture. Training data: “07”: VOC2007 trainval, “07 + 12”: VOC 2007 and 2012 trainval. Ours\(^\S \) denotes applying the soft-NMS [4] in the test stage.
Table 4. Detection results on PASCAL VOC using ResNet-101 [20] as backbone architecture. Training data: union set of VOC 2007 and 2012 trainval. Ours\(^\S \) denotes applying the soft-NMS [4] in the test stage.
Table 5. Detection results on VOC2012 test set using training data “07++12”: 2007 trainvaltest and 2012 trainval. SSD\(^*\) denotes the new data augmentation. Ours\(^\S \) denotes applying the soft-NMS [4] in the test stage.

4.2 Experiments on PASCAL VOC

In this section, we compare our results with a traditional regionlet method [45] and several state-of-the-art deep learning-based object detectors as follows: Faster R-CNN [37], SSD [28], R-FCN [8], soft-NMS [4], DP-FCN [32] and D-F-RCNN/D-R-FCN [9].

We follow the standard settings in [4, 8, 9, 37] and report mean average precision (mAP) scores using IoU thresholds of 0.5 and 0.7. For the first experiment, when training on VOC 2007 trainval, we use a learning rate of \(10^{-3}\) for the first 40k iterations and then decrease it to \(10^{-4}\) for the remaining 20k iterations on a single GPU. Next, because of the additional training data, more iterations are needed when training on the union of VOC 2007 and VOC 2012 trainval, so we perform the same training process as described in Sect. 4.1. Moreover, we use 300 RoIs at the test stage with single-scale image testing and set the shorter side of the image to 600 pixels. For a fair comparison, we do not deploy multi-scale training/testing or online hard example mining (OHEM) [39], although it is shown in [4, 9] that such enhancements could further improve performance.

The results on the VOC 2007 test set using the VGG16 [40] backbone are shown in Table 3. We first compare with the traditional regionlet method [45] and several state-of-the-art object detectors [4, 28, 37] when training on the smaller dataset (VOC 2007 trainval). Next, we evaluate our method as we increase the training data (union set of VOC 2007 and 2012 trainval). With the power of deep CNNs, the deep regionlet approach significantly improves detection performance over the traditional regionlet method [45]. We also observe that more data always helps. Moreover, it is encouraging that soft-NMS [4], which is applied only at the test stage without any modification of the training stage, directly improves over [37] by \(1.1\%\). In summary, our method consistently outperforms all the compared methods, and the performance can be further improved by replacing NMS with soft-NMS [4].

Next, we change the network backbone from VGG16 [40] to ResNet-101 [20] and present corresponding results in Table 4. In addition, we also compare with D-F-RCNN/D-R-FCN [9] and DP-FCN [32].

First, compared to the performance in Table 3 using the VGG16 [40] network, the mAP is significantly increased by using a deeper network such as ResNet-101 [20]. Second, compared with DP-FCN [32] and the deformable RoI pooling in [9], we outperform these two methods by \(\mathbf {3.9\%}\) and \(\mathbf {2.7\%}\) respectively. This provides empirical support that our deep regionlet learning method can be treated as a generalization of the deformable RoI pooling in [9, 32], as discussed in Sect. 3.5. In addition, the results demonstrate that the non-rectangular region selection in our method provides additional capabilities, including scaling, shifting and rotation, for learning feature representations. In summary, our method achieves state-of-the-art performance on the object detection task when using ResNet-101 as the backbone network.

Results on the VOC 2012 test set are shown in Table 5. We follow the same settings as in [8, 14, 28, 32, 37] and train our model using VOC “07++12”: the VOC 2007 trainvaltest and 2012 trainval sets. It can be seen that our method outperforms all the competing methods. In particular, we outperform DP-FCN [32], which further demonstrates that our method generalizes [32].

4.3 Experiments on MS COCO

In this section, we evaluate the proposed deep regionlet approach on the MS COCO [27] dataset and compare with other state-of-the-art object detectors: Faster R-CNN [37], SSD [28], R-FCN [8], D-F-RCNN/D-R-FCN [9], Mask R-CNN [18], RetinaNet [26].

Table 6. Object detection results on MS COCO 2017 test-dev using ResNet-101 backbone. Training data: 2017 train and val set. SSD\(^*\) denotes the new data augmentation.

We adopt ResNet-101 as the backbone architecture of all the methods for a fair comparison. Following the settings in [8, 9, 18, 26], we set the shorter edge of the image to 800 pixels. Training is performed for 280k iterations with an effective mini-batch size of 8 on 8 GPUs. We first train the model with a learning rate of \(10^{-3}\) for the first 160k iterations, followed by learning rates of \(10^{-4}\) for the next 80k iterations and \(10^{-5}\) for the last 40k iterations. Five scales and three aspect ratios are deployed as anchors. We report results using either the released models or code from the original authors. Note that we deploy only single-scale image training without iterative bounding box averaging, although these enhancements could further boost performance (mmAP).
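The step schedule described above corresponds to a per-iteration learning-rate decay; a minimal sketch of one way to reproduce it is shown below (the optimizer choice and momentum value beyond the learning rate are assumptions):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # placeholder for model parameters
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

# decay 1e-3 -> 1e-4 at iteration 160k and -> 1e-5 at 240k, out of 280k total
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[160_000, 240_000], gamma=0.1)

for it in range(280_000):
    # ... forward / backward ...
    optimizer.step()
    scheduler.step()   # stepped per iteration, not per epoch
```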

Table 6 shows the results on the 2017 test-dev set, which contains 20,288 images. Compared with the baseline methods Faster R-CNN [37], R-FCN [8] and SSD [28], both D-F-RCNN/D-R-FCN [9] and our method provide significant improvements over [8, 28, 37] (+\(3.7\%\) and +\(8.5\%\)). Moreover, the proposed method outperforms D-F-RCNN/D-R-FCN [9] by a wide margin (\(\sim \) \(\mathbf {4\%}\)). This observation further supports the view that our deep regionlet learning module can be treated as a generalization of the deformable RoI pooling in [9, 32]. It is also noted that although most recent state-of-the-art object detectors such as Mask R-CNN [18] utilize multi-task training with segmentation labels, we still outperform Mask R-CNN [18] by \(1.1\%\). In addition, the focal loss in [26], which overcomes the obstacle caused by the imbalance of positive/negative samples, is complementary to our method; we believe it can be integrated into our method to further boost performance. In summary, compared with Mask R-CNN [18] and RetinaNet [26], our method achieves competitive performance over the state-of-the-art on MS COCO when using ResNet-101 as the backbone network.

5 Conclusion

In this paper, we present a novel deep regionlet-based approach for object detection. The proposed RSN can select non-rectangular regions within the detection bounding box, and hence objects with rigid shapes and deformable parts can be better modeled. We also design the deep regionlet learning module so that both the selected regions and the regionlets can be learned simultaneously. Moreover, the proposed system can be trained in a fully end-to-end manner without additional effort. Finally, we extensively evaluate our approach on two detection benchmarks, and the experimental results show competitive performance over the state-of-the-art.