SSD: Single Shot MultiBox Detector
Abstract
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for \(512 \times 512\) input, SSD achieves 76.9 % mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
Keywords
Real-time object detection · Convolutional neural network
1 Introduction
Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection, all based on Faster R-CNN [2], albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame, and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sect. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.
This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3 % on VOC2007 test, vs Faster R-CNN 7 FPS with mAP 73.2 % or YOLO 45 FPS with mAP 63.4 %). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf. [4, 5]), but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications—especially using multiple layers for prediction at different scales—we can achieve high accuracy using relatively low resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4 % mAP for YOLO to 74.3 % mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful. We summarize our contributions as follows:
- We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state of the art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
- To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
- These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs. accuracy trade-off.
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC, and are compared to a range of recent state-of-the-art approaches.
Fig. 1. SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. \(8 \times 8\) and \(4 \times 4\) in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories (\(c_1, c_2, \cdots , c_p\)). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).
2 The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sect. 2.1) and the associated training methodology (Sect. 2.2). Afterwards, Sect. 3 presents dataset-specific model details and experimental results.
2.1 Model
Fig. 2. A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a \(300 \times 300\) input size significantly outperforms its \(448 \times 448\) YOLO counterpart in accuracy on VOC2007 test while also improving the speed.
Multi-scale feature maps for detection. We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. Overfeat [4] and YOLO [5], which operate on a single scale feature map).
Convolutional predictors for detection. Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size \(m \times n\) with p channels, the basic element for predicting parameters of a potential detection is a \(3 \times 3 \times p\) small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the \(m \times n\) locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf. the architecture of YOLO [5], which uses an intermediate fully connected layer instead of a convolutional filter for this step).
Default boxes and aspect ratios. We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of \((c+4)k\) filters that are applied around each location in the feature map, yielding \((c+4)kmn\) outputs for a \(m\times n\) feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2]; however, we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.
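To illustrate the box tiling and the \((c+4)k\) outputs per location, the following minimal NumPy sketch enumerates default box centers for an \(m \times n\) feature map and counts the resulting prediction values. The function name, the chosen scale, and the aspect ratio set are illustrative placeholders rather than the exact SSD configuration.

```python
import numpy as np

def tile_default_boxes(m, n, scale, aspect_ratios):
    """Enumerate default boxes (cx, cy, w, h) in image-relative coordinates,
    one box per aspect ratio per feature map cell."""
    boxes = []
    for i in range(m):
        for j in range(n):
            cx, cy = (j + 0.5) / n, (i + 0.5) / m      # cell center
            for ar in aspect_ratios:
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)                              # shape (m*n*k, 4)

# Illustrative setting: an 8x8 map, k = 4 boxes per cell, c = 21 classes
# (20 VOC categories + background).
m = n = 8
aspect_ratios = [1.0, 2.0, 0.5, 3.0]                    # k = 4 (illustrative)
c, k = 21, len(aspect_ratios)

defaults = tile_default_boxes(m, n, scale=0.2, aspect_ratios=aspect_ratios)
print(defaults.shape)          # (256, 4) = (8 * 8 * 4, 4)

# (c + 4) * k filters per location -> (c + 4) * k * m * n outputs per map.
print((c + 4) * k * m * n)     # 6400
```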
2.2 Training
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO [5] and for the region proposal stage of Faster R-CNN [2] and MultiBox [7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.
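For reference, the overall training objective restated from the full paper is a weighted sum of the confidence and localization losses over the \(N\) matched default boxes: \(L(x, c, l, g) = \frac{1}{N}\bigl(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\bigr)\), where \(x\) denotes the matching indicators, \(c\) the class confidences, \(l\) the predicted boxes, and \(g\) the ground truth boxes; \(L_{loc}\) is the Smooth L1 localization loss and \(L_{conf}\) the softmax confidence loss mentioned in the Fig. 1 caption, \(\alpha\) weights the localization term, and the loss is set to 0 when \(N = 0\).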
Matching Strategy. During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box, we select from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best Jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with Jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
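A minimal NumPy sketch of this matching rule follows (function and variable names are our own; the 0.5 threshold is the one cited above): each ground truth first claims its best-overlapping default box, and any remaining default box whose Jaccard overlap with some ground truth exceeds the threshold is also marked positive.

```python
import numpy as np

def jaccard(boxes_a, boxes_b):
    """IoU between two sets of boxes given as (xmin, ymin, xmax, ymax)."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_defaults(defaults, gt_boxes, threshold=0.5):
    """Return, for each default box, the index of its matched ground truth
    box, or -1 if it stays negative (conflicts/ties ignored for simplicity)."""
    overlaps = jaccard(defaults, gt_boxes)            # (num_defaults, num_gt)
    matches = np.full(len(defaults), -1, dtype=int)
    # 1) each ground truth claims its best-overlapping default box
    best_default_per_gt = overlaps.argmax(axis=0)
    matches[best_default_per_gt] = np.arange(len(gt_boxes))
    # 2) any remaining default box with overlap > threshold is also positive
    best_gt_per_default = overlaps.argmax(axis=1)
    best_overlap = overlaps.max(axis=1)
    extra = (matches == -1) & (best_overlap > threshold)
    matches[extra] = best_gt_per_default[extra]
    return matches
```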
Choosing Scales and Aspect Ratios for Default Boxes. To handle different object scales, some methods [4, 9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10, 11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture finer details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (\(8 \times 8\) and \(4\times 4\)) which are used in the framework. In practice, we can use many more with small computational overhead.
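In the full paper the default-box scales are spaced regularly between a smallest and a largest scale across the prediction layers; the small sketch below illustrates that spacing. The 0.2 and 0.9 endpoints are the values reported in the paper for PASCAL VOC, while the exact per-layer assignment in the released models (e.g. for conv4_3) differs slightly and is not reproduced here. Box widths and heights per aspect ratio then follow \(w = s\sqrt{a}\), \(h = s/\sqrt{a}\), as in the tiling sketch above.

```python
def feature_map_scales(num_maps, s_min=0.2, s_max=0.9):
    """Regularly spaced default-box scales, one per prediction layer, from the
    finest feature map (small boxes) to the coarsest feature map (large boxes)."""
    step = (s_max - s_min) / (num_maps - 1)
    return [round(s_min + step * k, 2) for k in range(num_maps)]

print(feature_map_scales(6))   # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```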
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the \(4 \times 4\) feature map, but not to any default boxes in the \(8 \times 8\) feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.
Hard Negative Mining. After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them by the confidence loss of each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and more stable training.
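A sketch of this selection (NumPy, names hypothetical): negatives are ranked by their confidence loss and only the hardest ones are kept, up to three per positive.

```python
import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """Keep all positives and only the highest-loss negatives so that
    negatives:positives is at most neg_pos_ratio:1.
    conf_loss   : (num_defaults,) per-box confidence loss
    is_positive : (num_defaults,) boolean mask of matched default boxes
    Returns a boolean mask of boxes that contribute to the confidence loss."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, conf_loss)   # exclude positives
    hardest = np.argsort(-neg_loss)[:num_neg]               # top-loss negatives
    keep = is_positive.copy()
    keep[hardest] = True
    return keep
```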
Data Augmentation. To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:
- Use the entire original input image.
- Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between \(\frac{1}{2}\) and 2. We keep the overlapping part of a ground truth box if its center is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to a fixed size and is horizontally flipped with probability 0.5, in addition to applying some photometric distortions similar to those described in [13].
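A simplified sketch of this sampling procedure follows (pure NumPy). The bounded number of trials, the interpretation of patch "size" as a fraction of image area, and the fall-back to the full image are our assumptions rather than the released implementation; cropping the ground truth boxes, resizing, flipping, and photometric distortion would follow this step.

```python
import numpy as np

def overlap(patch, boxes):
    """Jaccard overlap between one patch and each ground truth box,
    all given as (xmin, ymin, xmax, ymax)."""
    lt = np.maximum(patch[:2], boxes[:, :2])
    rb = np.minimum(patch[2:], boxes[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (patch[2] - patch[0]) * (patch[3] - patch[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_p + area_b - inter)

def sample_patch(img_w, img_h, gt_boxes, rng=np.random.default_rng(), max_trials=50):
    """Pick one sampling option per image: the whole image, a patch with a
    minimum Jaccard overlap with some object, or a fully random patch."""
    options = [None, 0.1, 0.3, 0.5, 0.7, 0.9, -1.0]   # None: whole image; -1: no constraint
    min_overlap = options[rng.integers(len(options))]
    if min_overlap is None:
        return np.array([0.0, 0.0, img_w, img_h])
    for _ in range(max_trials):                        # bounded retries (assumed)
        area = rng.uniform(0.1, 1.0)                   # patch size in [0.1, 1] of the image
        ratio = rng.uniform(0.5, 2.0)                  # aspect ratio in [1/2, 2]
        w = img_w * np.sqrt(area * ratio)              # interpreting "size" as area
        h = img_h * np.sqrt(area / ratio)
        if w > img_w or h > img_h:
            continue
        x0 = rng.uniform(0.0, img_w - w)
        y0 = rng.uniform(0.0, img_h - h)
        patch = np.array([x0, y0, x0 + w, y0 + h])
        if min_overlap < 0 or overlap(patch, gt_boxes).max() >= min_overlap:
            return patch
    return np.array([0.0, 0.0, img_w, img_h])          # fall back to the full image
```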
3 Experimental Results
Base Network. Our experiments are all based on VGG16 [14], which is pre-trained on the ILSVRC CLS-LOC dataset [15]. Similar to DeepLab-LargeFOV [16], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from \(2 \times 2-s2\) to \(3\times 3-s1\), and use the atrous algorithm to fill the “holes”. We remove all the dropout layers and the fc8 layer. We fine-tune the resulting model using SGD with initial learning rate \(10^{-3}\), 0.9 momentum, 0.0005 weight decay, and batch size 32. The learning rate decay policy is slightly different for each dataset, and we will describe details later. The full training and testing code is built on Caffe [17] and is open source at https://github.com/weiliu89/caffe/tree/ssd.
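A rough sketch of the fc6/fc7 conversion and parameter subsampling is shown below, using random stand-in weights. The subsampling indices follow the public SSD implementation as we understand it and should be treated as an assumption rather than a specification.

```python
import numpy as np

# Stand-in for the pretrained VGG16 fully connected weights:
# fc6: 4096 outputs x (512 * 7 * 7) inputs, fc7: 4096 x 4096.
fc6_w = np.random.randn(4096, 512 * 7 * 7).astype(np.float32)
fc7_w = np.random.randn(4096, 4096).astype(np.float32)

# fc6 -> conv6: view the weights as (4096, 512, 7, 7) convolution filters,
# keep every 4th output filter (4096 -> 1024) and every 3rd spatial tap
# (7x7 -> 3x3); the resulting 3x3 filters are applied as an atrous (dilated)
# convolution so they still cover roughly the original 7x7 receptive field.
conv6_w = fc6_w.reshape(4096, 512, 7, 7)[::4, :, ::3, ::3]   # (1024, 512, 3, 3)

# fc7 -> conv7: a 1x1 convolution; subsample outputs and inputs by 4.
conv7_w = fc7_w[::4, ::4].reshape(1024, 1024, 1, 1)          # (1024, 1024, 1, 1)

print(conv6_w.shape, conv7_w.shape)
```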
3.1 PASCAL VOC2007
On this dataset, we compare against Fast R-CNN [6] and Faster R-CNN [2] on VOC2007 test (4952 images). All methods use the same pre-trained VGG16 network.
Table 1. PASCAL VOC2007 test detection results. Both Fast and Faster R-CNN use input images whose minimum dimension is 600. The two SSD models have exactly the same settings except that they have different input sizes (\(300 \times 300\) vs. \(512 \times 512\)). It is obvious that larger input size leads to better results, and more data always helps. Data: “07”: VOC2007 trainval, “07+12”: union of VOC2007 and VOC2012 trainval. “07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12.
Fig. 3. Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test using [19]. The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). The bottom row shows the distribution of top-ranked false positive types.
Fig. 4. Sensitivity and impact of different object characteristics on VOC2007 test set using [19]. The plot on the left shows the effects of BBox Area per category, and the right plot shows the effect of Aspect Ratio.
3.2 Model Analysis
Table 2. Effects of various design choices and components on SSD performance.
| SSD300 |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- |
| more data augmentation? |  | ✔ | ✔ | ✔ | ✔ |
| include {\(\frac{1}{2},2\)} box? | ✔ |  | ✔ | ✔ | ✔ |
| include {\(\frac{1}{3},3\)} box? | ✔ |  |  | ✔ | ✔ |
| use atrous? | ✔ | ✔ | ✔ |  | ✔ |
| VOC2007 test mAP | 65.5 | 71.6 | 73.7 | 74.4 | 74.3 |
Data Augmentation is Crucial. Fast and Faster R-CNN use the original image and the horizontal flip to train. We use a more extensive sampling strategy, similar to YOLO [5]. Table 2 shows that we can improve mAP by 8.8 % with this sampling strategy. We do not know how much our sampling strategy will benefit Fast and Faster R-CNN, but they are likely to benefit less because they use a feature pooling step during classification that is relatively robust to object translation by design.
More Default Box Shapes Are Better. As described in Sect. 2.2, by default we use 6 default boxes per location. If we remove the boxes with \(\frac{1}{3}\) and 3 aspect ratios, the performance drops by 0.6 %. By further removing the boxes with \(\frac{1}{2}\) and 2 aspect ratios, the performance drops another 2.1 %. Using a variety of default box shapes seems to make the task of predicting boxes easier for the network.
Table 3. Effects of multiple layers.
| conv4_3 | conv7 | conv8_2 | conv9_2 | conv10_2 | conv11_2 | mAP (use boundary boxes: Yes) | mAP (use boundary boxes: No) | # Boxes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 74.3 | 63.4 | 8732 |
| ✔ | ✔ | ✔ | ✔ | ✔ |  | 74.6 | 63.1 | 8764 |
| ✔ | ✔ | ✔ | ✔ |  |  | 73.8 | 68.4 | 8942 |
| ✔ | ✔ | ✔ |  |  |  | 70.7 | 69.2 | 9864 |
| ✔ | ✔ |  |  |  |  | 64.2 | 64.4 | 9025 |
|  | ✔ |  |  |  |  | 62.4 | 64.0 | 8664 |
Multiple Output Layers at Different Resolutions is Better. A major contribution of SSD is using default boxes of different scales on different output layers. To measure the advantage gained, we progressively remove layers and compare results. For a fair comparison, every time we remove a layer, we adjust the default box tiling to keep the total number of boxes similar to the original (8732). This is done by stacking more scales of boxes on remaining layers and adjusting scales of boxes if needed. We do not exhaustively optimize the tiling for each setting. Table 3 shows a decrease in accuracy with fewer layers, dropping monotonically from 74.3 to 62.4. When we stack boxes of multiple scales on a layer, many are on the image boundary and need to be handled carefully. We tried the strategy used in Faster R-CNN [2], ignoring boxes which are on the boundary. We observe some interesting trends. For example, it hurts the performance by a large margin if we use very coarse feature maps (e.g. conv11_2 (\(1 \times 1\)) or conv10_2 (\(3\times 3\))). The reason might be that we do not have enough large boxes to cover large objects after the pruning. When we use primarily finer resolution maps, the performance starts increasing again because even after pruning a sufficient number of large boxes remains. If we only use conv7 for prediction, the performance is the worst, reinforcing the message that it is critical to spread boxes of different scales over different layers.
3.3 PASCAL VOC2012
Table 4. PASCAL VOC2012 test detection results. Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is \(448 \times 448\). Data: “07++12”: union of VOC2007 trainval and test and VOC2012 trainval. “07++12+COCO”: first train on COCO trainval35k then fine-tune on 07++12.
3.4 COCO
To further validate the SSD framework, we trained our SSD300 and SSD512 architectures on the COCO dataset. Since objects in COCO tend to be smaller than PASCAL VOC, we use smaller default boxes for all layers. We follow the strategy mentioned in Sect. 2.2, but now our smallest default box has a scale of 0.15 instead of 0.2, and the scale of the default box on conv4_3 is 0.07 (e.g. 21 pixels for a \(300 \times 300\) image).
Table 5. COCO test-dev2015 detection results.
| Method | Data | mAP @0.5 | mAP @0.75 | mAP @[0.5:0.95] |
| --- | --- | --- | --- | --- |
| Fast R-CNN [6] | train | 35.9 | - | 19.7 |
| Fast R-CNN [21] | train | 39.9 | 20.5 | 19.4 |
| Faster R-CNN [2] | train | 42.1 | - | 21.5 |
| Faster R-CNN [2] | trainval | 42.7 | - | 21.9 |
| Faster R-CNN [22] | trainval | 45.3 | 24.2 | 23.5 |
| ION [21] | train | 42.0 | 23.0 | 23.0 |
| SSD300 | trainval35k | 41.2 | 23.2 | 23.4 |
| SSD512 | trainval35k | 46.4 | 26.7 | 27.7 |
3.5 Preliminary ILSVRC Results
We applied the same network architecture we used for COCO to the ILSVRC DET dataset [15]. We train an SSD300 model using the ILSVRC2014 DET train and val1 as used in [20]. We first train the model with a \(10^{-3}\) learning rate for 320k iterations, and then continue training for 80k iterations with \(10^{-4}\) and 40k iterations with \(10^{-5}\). We achieve 43.2 mAP on the val2 set [20]. This again validates that SSD is a general framework for high quality real-time detection.
3.6 Inference Time
Considering the large number of boxes generated from our method, it is essential to perform non-maximum suppression (NMS) efficiently during inference. By using a confidence threshold of 0.01, we can filter out most boxes. We then apply NMS with a Jaccard overlap threshold of 0.45 per class and keep the top 200 detections per image. This step costs about 1.7 ms per image for SSD300 and 20 VOC classes, which is close to the total time (2.4 ms) spent on all newly added layers.
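A minimal per-class NMS sketch matching the settings above (0.01 confidence threshold, 0.45 overlap, at most 200 detections kept) is shown below; in the paper the top-200 cap is applied per image across classes, while this simplified sketch applies it per class, and details such as tie-breaking are our own.

```python
import numpy as np

def nms_per_class(boxes, scores, conf_thresh=0.01, iou_thresh=0.45, top_k=200):
    """Greedy non-maximum suppression for one class.
    boxes: (N, 4) as (xmin, ymin, xmax, ymax); scores: (N,). Returns kept indices."""
    idx = np.where(scores > conf_thresh)[0]
    idx = idx[np.argsort(-scores[idx])]            # highest score first
    keep = []
    while idx.size and len(keep) < top_k:
        i, idx = idx[0], idx[1:]
        keep.append(i)
        lt = np.maximum(boxes[i, :2], boxes[idx, :2])
        rb = np.minimum(boxes[i, 2:], boxes[idx, 2:])
        wh = np.clip(rb - lt, 0, None)
        inter = wh[:, 0] * wh[:, 1]
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[idx, 2] - boxes[idx, 0]) * (boxes[idx, 3] - boxes[idx, 1])
        iou = inter / (area_i + area_o - inter)
        idx = idx[iou <= iou_thresh]               # drop boxes overlapping too much
    return keep
```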
Results on PASCAL VOC2007 test. SSD300 is the only real-time detection method that can achieve above 70 % mAP. By using a larger input image, SSD512 outperforms all methods on accuracy while maintaining a close to real-time speed. Using a test batch size of 8 improves the speed further.
4 Related Work
There are two established classes of methods for object detection in images, one based on sliding windows and the other based on region proposal classification. Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) [23] and Selective Search [1] – had comparable performance. However, after the dramatic improvement brought on by R-CNN [20], which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent.
The original R-CNN approach has been improved in a variety of ways. The first set of approaches improve the quality and speed of post-classification, since it requires the classification of thousands of image crops, which is expensive and time-consuming. SPPnet [9] speeds up the original R-CNN approach significantly. It introduces a spatial pyramid pooling layer that is more robust to region size and scale and allows the classification layers to reuse features computed over feature maps generated at several image resolutions. Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness.
The second set of approaches improve the quality of proposal generation using deep neural networks. In the most recent works like MultiBox [7, 8], the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. This further improves the detection accuracy but results in a somewhat complex setup, requiring the training of two neural networks with a dependency between them. Faster R-CNN [2] replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between fine-tuning shared convolutional layers and prediction layers for these two networks. This way region proposals are used to pool mid-level features and the final classification step is less expensive. Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate into other tasks.
Detection examples on COCO test-dev with SSD512 model. We show detections with scores higher than 0.6. Each color corresponds to an object category.
5 Conclusions
This paper introduces SSD, a fast single-shot object detector for multiple categories. A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible box shapes. We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. We build SSD models with at least an order of magnitude more box predictions, sampling location, scale, and aspect ratio, than existing methods [5, 7].
We demonstrate that given the same VGG-16 base architecture, SSD compares favorably to its state-of-the-art object detector counterparts in terms of both accuracy and speed. Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being \(3 \times \) faster. Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy.
Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video simultaneously.
Acknowledgment
This work was started as an internship project at Google and continued at UNC. We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. We also thank Philip Ammirato and Patrick Poirson for helpful comments. We thank NVIDIA for providing GPUs and acknowledge support from NSF 1452851, 1446631, 1526367, 1533771.
References
1. Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV 104, 154 (2013)
2. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
4. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: integrated recognition, localization and detection using convolutional networks. In: ICLR (2014)
5. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
6. Girshick, R.: Fast R-CNN. In: ICCV (2015)
7. Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: CVPR (2014)
8. Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv:1412.1441 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10578-9_23
10. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
11. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR (2015)
12. Liu, W., Rabinovich, A., Berg, A.C.: ParseNet: looking wider to see better. In: ICLR (2016)
13. Howard, A.G.: Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402 (2013)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
15. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV 115, 211 (2015)
16. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)
17. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM (2014)
18. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS (2010)
19. Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 340–353. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33712-3_25
20. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
21. Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: CVPR (2016)
22. COCO: Common Objects in Context (2016). http://mscoco.org/dataset/#detections-leaderboard. Accessed 25 July 2016
23. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008)