Deep Learning for Generic Object Detection: A Survey

Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.


Introduction
As a longstanding, fundamental and challenging problem in computer vision, object detection has been an active area of research for several decades. The goal of object detection is to determine whether there are any instances of objects from the given categories (such as humans, cars, bicycles, dogs and cats) in a given image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box [53,179]). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving more complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection has a wide range of applications in many areas of artificial intelligence and information technology, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.
Recently, deep learning techniques [81,116] have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided significant improvement for object detection, a problem which has attracted enormous attention in the last five years, even though it has been studied for decades by psychophysicists, neuroscientists, and engineers.
Object detection can be grouped into one of two types [69,240]: detection of specific instances and detection of specific categories. The first type aims at detecting instances of a particular object (such as Donald Trump's face, the Pentagon building, or my dog Penny), whereas the goal of the second type is to detect different instances of predefined object categories (for example humans, cars, bicycles, and dogs). Historically, much of the effort in the field of object detection has focused on the detection of a single category (such as faces and pedestrians) or a few specific categories. In contrast, in the past several years the research community has started moving towards the challenging goal of building general purpose object detection systems whose breadth of object detection ability rivals that of humans.
Fig. 2 Milestones of object detection and recognition, including feature representations [37,42,79,109,114,139,140,166,191,194,200,213,215], detection frameworks [56,65,183,209,213], and datasets [53,129,179]. The time period up to 2012 is dominated by handcrafted features. We see a turning point in 2012 with the development of DCNNs for image classification by Krizhevsky et al. [109]. Most listed methods are highly cited and won one of the major ICCV or CVPR prizes. See Section 2.3 for details.
In 2012, Krizhevsky et al. [109] proposed a Deep Convolutional Neural Network (DCNN) called AlexNet, which achieved record breaking image classification accuracy in the Large Scale Visual Recognition Challenge (ILSVRC) [179]. Since that time, the research focus in many computer vision application areas has been on deep learning methods. A great many approaches based on deep learning have sprung up in generic object detection [65,77,64,183,176] and tremendous progress has been achieved, yet we are unaware of comprehensive surveys of the subject during the past five years. Given this time of rapid evolution, the focus of this paper is specifically generic object detection by deep learning, in order to give a clearer panorama of the field.
The generic object detection problem itself is defined as follows: Given an arbitrary image, determine whether there are any instances of semantic objects from predefined categories and, if present, return the spatial location and extent of each. Object refers to a material thing that can be seen and touched. Although largely synonymous with object class detection, generic object detection places a greater emphasis on approaches aimed at detecting a broad range of natural categories, as opposed to object instances or specialized categories (e.g., faces, pedestrians, or cars). Generic object detection has received significant attention, as demonstrated by recent progress on object detection competitions such as the PASCAL VOC detection challenge from 2006 to 2012 [53,54], the ILSVRC large scale detection challenge since 2013 [179], and the MS COCO large scale detection challenge since 2015 [129]. The striking improvement in recent years is illustrated in Fig. 1.
There are few recent surveys focusing directly on the problem of generic object detection, except for the work by Zhang et al. [240], who conducted a survey on the topic of object class detection. However, the research reviewed in [69], [5] and [240] mostly precedes 2012, and therefore predates the more recent striking success of deep learning and related methods.
Deep learning allows computational models consisting of multiple hierarchical layers to learn highly complex, subtle, and abstract representations. In the past several years, deep learning has driven significant progress in a broad range of problems, such as visual recognition, object detection, speech recognition, natural language processing, medical image analysis, drug discovery and genomics. Among different types of deep neural networks, Deep Convolutional Neural Networks (DCNN) [115,109,116] have brought about breakthroughs in processing images, video, speech and audio. Given this time of rapid evolution, researchers have recently published surveys on different aspects of deep learning, including that of Bengio et al. [12], LeCun et al. [116], Litjens et al. [133], Gu et al. [71], and more recently in tutorials at ICCV and CVPR.
Although many deep learning based methods have been proposed for object detection, we are unaware of comprehensive surveys of the subject during the past five years, the focus of this survey. A thorough review and summarization of existing work is essential for further progress in object detection, particularly for researchers wishing to enter the field. Extensive work on CNNs for specific object detection, such as face detection [120,237,92], pedestrian detection [238,85], vehicle detection [247] and traffic sign detection [253], will not be included in our discussion.
Table row (survey listing): Deep Learning [116], 2015, Nature. An introduction to deep learning and its typical applications.

Categorization Methodology
The number of papers on generic object detection published since deep learning entered the field is breathtaking. So many, in fact, that compiling a comprehensive review of the state of the art already exceeds the scope of a paper like this one. It is necessary to establish some selection criteria, e.g., completeness of a paper and importance to the field. We have preferred to include top journal and conference papers. Due to limitations on space and our knowledge, we sincerely apologize to those authors whose works are not included in this paper. For surveys of efforts in related topics, readers are referred to the articles in Table 1. This survey mainly focuses on the major progress made in the last five years; but for completeness and better readability, some early related works are also included. We restrict ourselves to still pictures and leave video object detection as a separate topic.
The remainder of this paper is organized as follows. Related background, including the problem, key challenges and the progress made during the last two decades, is summarized in Section 2. We describe the milestone object detectors in Section 3. Fundamental subproblems and relevant issues involved in designing object detectors are presented in Section 4. A summarization of popular databases and state of the art performance is given in Section 5. We conclude the paper with a discussion of several promising directions in Section 6.

The Problem
Generic object detection (i.e., generic object category detection), also called object class detection [240] or object category detection, is defined as follows. Given an image, the goal of generic object detection is to determine whether or not there are instances of objects from many predefined categories and, if present, to return the spatial location and extent of each instance. It places greater emphasis on detecting a broad range of natural categories, as opposed to specific object category detection where only a narrower predefined category of interest (e.g., faces, pedestrians, or cars) may be present. Although thousands of objects occupy the visual world in which we live, currently the research community is primarily interested in the localization of highly structured objects (e.g., cars, faces, bicycles and airplanes) and articulated objects (e.g., humans, cows and horses) rather than unstructured scenes (such as sky, grass and cloud).
Typically, the spatial location and extent of an object can be defined coarsely using a bounding box, i.e., an axis-aligned rectangle tightly bounding the object [53,179], or more precisely using a pixel-wise segmentation mask or a closed boundary [180,129], as illustrated in Fig. 3. To the best of our knowledge, in the current literature, bounding boxes are more widely used for evaluating generic object detection algorithms [53,179], and this is the approach we adopt in this survey as well. However, the community is moving towards deep scene understanding (from image level object classification to single object localization, to generic object detection, and to pixel-wise object segmentation), hence it is anticipated that future challenges will be at the pixel level [129].
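To make the relation between the two location formats concrete, the following minimal sketch derives the tight axis-aligned bounding box from a binary segmentation mask. The function name `mask_to_box`, the nested-list mask format, and the inclusive `(x_min, y_min, x_max, y_max)` convention are illustrative assumptions, not conventions fixed by the benchmarks cited above.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) with inclusive pixel indices (assumed convention)
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest axis-aligned box around the nonzero pixels of a mask."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # no object pixels, hence no box
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
print(mask_to_box(mask))  # (1, 1, 3, 3)
```

The box discards the object's shape: here it covers nine pixels, of which only five belong to the object.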
There are many problems closely related to that of generic object detection (see footnote 1). The goal of object classification or object categorization (Fig. 3 (a)) is to assess the presence of objects from a given number of object classes in an image; i.e., assigning one or more object class labels to a given image, determining presence without the need of location. The additional requirement to locate the instances in an image makes detection a more challenging task than classification. The object recognition problem denotes the more general problem of finding and identifying objects of interest present in an image, subsuming the problems of object detection and object classification [53,179,156,5].
Footnote 1: To the best of our knowledge, there is no universal agreement in the literature on the definitions of various vision subtasks. Frequently encountered terms such as detection, localization, recognition, classification, categorization, verification and identification, annotation, labeling and understanding are often defined differently [5].

Fig. 4 (diagram labels only): An ideal detector combines high accuracy (localization accuracy and recognition accuracy) with high efficiency.
Fig. 5 Changes in imaged appearance of the same class with variations in imaging conditions (a-g). There is an astonishing variation in what is meant to be a single object class (h). In contrast, the four images in (i) appear very similar, but in fact are from four different object classes. Images from ImageNet [179] and MS COCO [129].
Generic object detection is closely related to semantic image segmentation (Fig. 3 (c)), which aims to assign each pixel in an image to a semantic class label. Object instance segmentation (Fig. 3 (d)) aims at distinguishing different instances of the same object class, while semantic segmentation does not distinguish different instances. Generic object detection also distinguishes different instances of the same object. Different from segmentation, object detection includes background regions in the bounding box that might be useful for analysis.

Main Challenges
Generic object detection aims at localizing and recognizing a broad range of natural object categories. The ideal goal of generic object detection is to develop general-purpose object detection algorithms achieving two competing goals: high quality/accuracy and high efficiency, as illustrated in Fig. 4. As illustrated in Fig. 5, high quality detection has to accurately localize and recognize objects in images or video frames, such that the large variety of object categories in the real world can be distinguished (i.e., high distinctiveness), and that object instances from the same category, subject to intraclass appearance variations, can be localized and recognized (i.e., high robustness). High efficiency requires the entire detection task to run at a sufficiently high frame rate with acceptable memory and storage usage. Despite several decades of research and significant progress, arguably the combined goals of accuracy and efficiency have not yet been met.

Accuracy related challenges
For accuracy, the challenge stems from 1) the vast range of intraclass variations and 2) the huge number of object categories.
We begin with intraclass variations, which can be divided into two types: intrinsic factors and imaging conditions. For the former, each object category can have many different object instances, possibly varying in one or more of color, texture, material, shape, and size, such as the "chair" category shown in Fig. 5 (h). Even in a more narrowly defined class, such as human or horse, object instances can appear in different poses, with nonrigid deformations and different clothes.
For the latter, the variations are caused by changes in imaging conditions and unconstrained environments which may have dramatic impacts on object appearance. In particular, different instances, or even the same instance, can be captured subject to a wide range of differences: different times, locations, weather conditions, cameras, backgrounds, illuminations, viewpoints, and viewing distances. All of these conditions produce significant variations in object appearance, such as illumination, pose, scale, occlusion, background clutter, shading, blur and motion, with examples illustrated in Fig. 5 (a-g). Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortions.
In addition to intraclass variations, the large number of object categories, on the order of 10^4 to 10^5, demands great discrimination power of the detector to distinguish between subtly different interclass variations, as illustrated in Fig. 5 (i). In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC [53], ILSVRC [179] and MS COCO [129] respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than the number that humans can recognize.

Efficiency related challenges
The exponentially increasing number of images calls for efficient and scalable detectors. The prevalence of social media networks and mobile/wearable devices has led to increasing demands for analyzing visual data. However, mobile/wearable devices have limited computational capabilities and storage space, in which case an efficient object detector is critical.
For efficiency, the challenges stem from the need to localize and recognize all object instances of a very large number of object categories, and the very large number of possible locations and scales within a single image, as shown by the example in Fig. 5 (c). A further challenge is that of scalability: A detector should be able to handle unseen objects, unknown situations, and rapidly increasing image data. For example, the scale of ILSVRC [179] is already imposing limits on the manual annotations that are feasible to obtain. As the number of images and the number of categories grow even larger, it may become impossible to annotate them manually, forcing algorithms to rely more on weakly supervised training data.

Progress in the Past Two Decades
Early research on object recognition was based on template matching techniques and simple part based models [57], focusing on specific objects whose spatial layouts are roughly rigid, such as faces. Before 1990 the leading paradigm of object recognition was based on geometric representations [149,169], with the focus later moving away from geometry and prior models towards the use of statistical classifiers (such as Neural Networks [178], SVM [159] and Adaboost [213,222]) based on appearance features [150,181]. This successful family of object detectors set the stage for most subsequent research in this field.
In the late 1990s and early 2000s object detection research made notable strides. The milestones of object detection in recent years are presented in Fig. 2, in which two main eras (SIFT vs. DCNN) are highlighted. The appearance features moved from global representations [151,197,205] to local representations that are invariant to changes in translation, scale, rotation, illumination, viewpoint and occlusion. Handcrafted local invariant features gained tremendous popularity, starting from the Scale Invariant Feature Transform (SIFT) feature [139], and the progress on various visual recognition tasks was based substantially on the use of local descriptors [145] such as Haar like features [213], SIFT [140], Shape Contexts [11], Histograms of Oriented Gradients (HOG) [42], Local Binary Patterns (LBP) [153], and covariance descriptors [206]. These local features are usually aggregated by simple concatenation or feature pooling encoders such as the influential and efficient Bag of Visual Words (BoW) approach introduced by Sivic and Zisserman [194] and Csurka et al. [37], Spatial Pyramid Matching (SPM) of BoW models [114], and Fisher Vectors [166].
For years, the multistage handtuned pipelines of handcrafted local descriptors and discriminative classifiers dominated a variety of domains in computer vision, including object detection, until the significant turning point in 2012 when Deep Convolutional Neural Networks (DCNN) [109] achieved their record breaking results in image classification. The successful application of DCNNs to image classification [109] transferred to object detection, resulting in the milestone Region based CNN (RCNN) detector of Girshick et al. [65]. Since then, the field of object detection has dramatically evolved and many deep learning based approaches have been developed, thanks in part to available GPU computing resources and the availability of large scale datasets and challenges such as ImageNet [44,179] and MS COCO [129]. With these new datasets, researchers can target more realistic and complex problems when detecting objects of hundreds of categories from images with large intraclass variations and interclass similarities [129,179].
The research community has started moving towards the challenging goal of building general purpose object detection systems whose ability to detect many object categories matches that of humans. This is a major challenge: according to cognitive scientists, human beings can identify around 3,000 entry level categories and 30,000 visual categories overall, and the number of categories distinguishable with domain expertise may be on the order of 10^5 [14]. Despite the remarkable progress of the past years, designing an accurate, robust, efficient detection and recognition system that approaches human-level performance on 10^4 to 10^5 categories is undoubtedly an open problem.

Frameworks
There has been steady progress in object feature representations and classifiers for recognition, as evidenced by the dramatic change from handcrafted features [213,42,55,76,212] to learned DCNN features [65,160,64,175,40].
In contrast, the basic "sliding window" strategy [42,56,55] for localization remains mainstream, although with some alternative endeavors in [113,209]. However, the number of windows is large and grows quadratically with the number of pixels, and the need to search over multiple scales and aspect ratios further increases the search space. The huge search space results in high computational complexity. Therefore, the design of efficient and effective detection frameworks plays a key role. Commonly adopted strategies include cascading, sharing feature computation, and reducing per-window computation.
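The quadratic growth of the window count can be checked with a short calculation. The sketch below counts every axis-aligned window with integer corners in a width × height image; the function name `count_all_windows` is hypothetical, and real sliding window detectors evaluate only a strided, multiscale subset of these windows.

```python
def count_all_windows(width: int, height: int) -> int:
    """Count every axis-aligned window with integer pixel corners in an image."""
    # A window is fixed by choosing left <= right among `width` columns
    # and top <= bottom among `height` rows.
    horizontal_extents = width * (width + 1) // 2
    vertical_extents = height * (height + 1) // 2
    return horizontal_extents * vertical_extents


for side in (32, 64, 128):
    pixels = side * side
    windows = count_all_windows(side, side)
    # The ratio windows / pixels^2 approaches 1/4 as the image grows,
    # i.e. the count is quadratic in the number of pixels.
    print(side, pixels, windows, round(windows / pixels ** 2, 4))
```

Doubling the side length multiplies the pixel count by 4 and the window count by roughly 16, which is why exhaustive search is impractical without the strategies listed above.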
In this section, we review the milestone detection frameworks that have appeared in generic object detection since deep learning entered the field, as listed in Fig. 6 and summarized in Table 10. Nearly all detectors proposed over the last several years are based on one of these milestone detectors, attempting to improve on one or more aspects. Broadly, these detectors can be organized into two main categories:
A. Two stage detection frameworks, which include a pre-processing step for region proposal, making the overall pipeline two stage.
B. One stage detection frameworks, or region proposal free frameworks, which do not separate out the detection proposal step, making the overall pipeline single stage.
Section 4 will build on the following by discussing fundamental subproblems involved in the detection framework in greater detail, including DCNN features, detection proposals, context modeling, bounding box regression and class imbalance handling.

Region Based (Two Stage Framework)
In a region based framework, category-independent region proposals are generated from an image, CNN [109] features are extracted from these regions, and then category-specific classifiers are used to determine the category labels of the proposals. As can be observed from Fig. 6, DetectorNet [198], OverFeat [183], MultiBox [52] and RCNN [65] independently and almost simultaneously proposed using CNNs for generic object detection.
RCNN: Inspired by the breakthrough image classification results obtained by CNNs and the success of selective search in region proposal for hand-crafted features [209], Girshick et al. were among the first to explore CNNs for generic object detection and developed RCNN [65,67], which integrates AlexNet [109] with the region proposal method selective search [209]. As illustrated in Fig. 7, training in an RCNN framework consists of a multistage pipeline.
SPPnet: During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires extracting CNN features from thousands of warped region proposals per image. Noticing these obvious disadvantages, He et al. [77] introduced traditional spatial pyramid pooling (SPP) [68,114] into CNN architectures. Since convolutional layers accept inputs of arbitrary size, the requirement of fixed-size images in CNNs is due only to the Fully Connected (FC) layers. He et al. exploited this fact and added an SPP layer on top of the last convolutional (CONV) layer to obtain fixed-length features for the FC layers. With SPPnet, RCNN obtains a significant speedup without sacrificing detection quality, because it only needs to run the convolutional layers once on the entire test image to generate fixed-length features for region proposals of arbitrary size. While SPPnet accelerates RCNN evaluation by orders of magnitude, it does not yield a comparable speedup in detector training. Moreover, finetuning in SPPnet [77] is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks.
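The fixed-length property of spatial pyramid pooling can be illustrated with a minimal single-channel sketch in plain Python. The pyramid levels (1, 2, 4), the use of max pooling, and the name `spp_max` are assumptions chosen for illustration; they are not claimed to match the exact configuration in [77].

```python
from typing import List, Sequence


def spp_max(feature_map: List[List[float]],
            levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool one channel into n x n bins per pyramid level and concatenate."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the bin start and ceil for the bin end, so no bin is empty.
            y0, y1 = (i * height) // n, -((-(i + 1) * height) // n)
            for j in range(n):
                x0, x1 = (j * width) // n, -((-(j + 1) * width) // n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled


small = [[float(x + y) for x in range(5)] for y in range(3)]
large = [[float(x * y) for x in range(13)] for y in range(9)]
# Both maps yield 1 + 4 + 16 = 21 values despite their different sizes.
print(len(spp_max(small)), len(spp_max(large)))
```

Because the output length depends only on the pyramid levels, the FC layers that follow can accept features pooled from regions of any size.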
Fast RCNN: Girshick [64] proposed Fast RCNN, which addresses some of the disadvantages of RCNN and SPPnet while improving on their detection speed and quality. As illustrated in Fig. 8, Fast RCNN enables end-to-end detector training (when ignoring the process of region proposal generation).
Faster RCNN [175,176]: Although Fast RCNN significantly sped up the detection process, it still relies on external region proposals, and region proposal computation is exposed as the new bottleneck in Fast RCNN. Recent work has shown that CNNs have a remarkable ability to localize objects in CONV layers [243,244,36,158,75], an ability which is weakened in the FC layers. Therefore, selective search can be replaced by a CNN for producing region proposals. The Faster RCNN framework of Ren et al. [175,176] introduced an efficient and accurate Region Proposal Network (RPN) for generating region proposals. It uses a single network to accomplish the tasks of the RPN for region proposal and Fast RCNN for region classification. In Faster RCNN, the RPN and Fast RCNN share a large number of convolutional layers. The features from the last shared convolutional layer are used for region proposal and region classification in separate branches. The RPN first initializes k n × n reference boxes (i.e., the so called anchors) of different scales and aspect ratios at each CONV feature map location. Each n × n anchor is mapped to a lower dimensional vector (such as 256 for ZF and 512 for VGG), which is fed into two sibling FC layers: an object category classification layer and a box regression layer. Different from Fast RCNN, the features used for regression in RPN have the same size. The RPN shares CONV features with Fast RCNN, thus enabling highly efficient region proposal computation. The RPN is, in fact, a kind of Fully Convolutional Network (FCN) [138,185]; Faster RCNN is thus a purely CNN based framework without using handcrafted features. For the very deep VGG16 model [191], Faster RCNN can test at 5 FPS (including all steps) on a GPU, while achieving state of the art object detection accuracy on PASCAL VOC 2007 using 300 proposals per image. The initial Faster RCNN in [175] contains several alternating training steps. This was then simplified by one step joint training in [176].
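As a rough illustration of how anchors tile a feature map, the sketch below places k = len(scales) × len(ratios) boxes at every feature map location. The stride of 16 pixels, the three scales, the three aspect ratios, and the name `make_anchors` are assumptions for illustration (they give k = 9); the text above does not fix these values.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def make_anchors(feat_h: int, feat_w: int, stride: int = 16,
                 scales: Sequence[float] = (128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Place k = len(scales) * len(ratios) anchors at every feature map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this cell's receptive field, mapped back to image coordinates.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the area at scale^2 while setting height / width = ratio.
                    w, h = scale / sqrt(ratio), scale * sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 * 9 = 54
```

The two sibling layers then score each anchor and regress offsets relative to it, so the anchor set itself involves no learned parameters.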
Concurrent with the development of Faster RCNN, Lenc and Vedaldi [117] challenged the role of region proposal generation methods such as selective search, studied the role of region proposal generation in CNN based detectors, and found that CNNs contain sufficient geometric information for accurate object detection in the CONV rather than FC layers. They demonstrated the possibility of building integrated, simpler, and faster object detectors that rely exclusively on CNNs, removing region proposal generation methods such as selective search.
Fig. 8 High level diagrams of the leading frameworks for generic object detection. The properties of these methods are summarized in Table 10.
RFCN (Region based Fully Convolutional Network): While Faster RCNN is an order of magnitude faster than Fast RCNN, the fact that the region-wise subnetwork still needs to be applied per RoI (several hundred RoIs per image) led Dai et al. [40] to propose the RFCN detector, which is fully convolutional (no hidden FC layers) with almost all computation shared over the entire image. As shown in Fig. 8, RFCN differs from Faster RCNN only in the RoI subnetwork. In Faster RCNN, the computation after the RoI pooling layer cannot be shared. A natural idea is to minimize the amount of computation that cannot be shared, hence Dai et al. [40] proposed to use all CONV layers to construct a shared RoI subnetwork, with RoI crops taken from the last layer of CONV features prior to prediction. However, Dai et al. [40] found that this naive design has considerably inferior detection accuracy, conjecturing that deeper CONV layers are more sensitive to category semantics and less sensitive to translation, whereas object detection needs localization representations that respect translation variance. Based on this observation, Dai et al. [40] constructed a set of position sensitive score maps by using a bank of specialized CONV layers as the FCN output, on top of which a position sensitive RoI pooling layer, different from the more standard RoI pooling in [64,175], is added. They showed that RFCN with ResNet101 [79] could achieve comparable accuracy to Faster RCNN, often at faster running times.
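The idea of position sensitive RoI pooling can be sketched in plain Python as follows: each of the k × k spatial bins of an RoI reads from its own group of score maps, and the bin responses are then averaged into one score per class. The channel layout `(bin_index * num_classes + class)`, the use of average pooling inside each bin, and the name `ps_roi_pool` are assumptions made for this illustration rather than details taken from [40].

```python
from math import ceil, floor
from typing import List, Tuple

ScoreMaps = List[List[List[float]]]  # indexed as [channel][y][x]


def ps_roi_pool(score_maps: ScoreMaps, roi: Tuple[int, int, int, int],
                k: int, num_classes: int) -> List[float]:
    """Pool bin (i, j) from its own channel group, then average the k*k bins."""
    x0, y0, x1, y1 = roi  # feature map coordinates, end-exclusive, inside the map
    bin_w, bin_h = (x1 - x0) / k, (y1 - y0) / k
    votes = [0.0] * num_classes
    for i in range(k):  # bin row
        ys = range(floor(y0 + i * bin_h), ceil(y0 + (i + 1) * bin_h))
        for j in range(k):  # bin column
            xs = range(floor(x0 + j * bin_w), ceil(x0 + (j + 1) * bin_w))
            for c in range(num_classes):
                # Assumed layout: one score map per (bin, class) pair.
                channel = score_maps[(i * k + j) * num_classes + c]
                cells = [channel[y][x] for y in ys for x in xs]
                votes[c] += sum(cells) / len(cells)
    return [vote / (k * k) for vote in votes]


# Toy input: k = 2, two classes, eight constant 4 x 4 score maps valued 0..7.
toy_maps = [[[float(ch)] * 4 for _ in range(4)] for ch in range(8)]
print(ps_roi_pool(toy_maps, (0, 0, 4, 4), k=2, num_classes=2))  # [3.0, 4.0]
```

No learned layer follows the pooling in this sketch, which reflects why the per-RoI cost is small compared with a region-wise subnetwork.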
Mask RCNN: Following the spirit of conceptual simplicity, efficiency, and flexibility, He et al. [80] proposed Mask RCNN to tackle pixel-wise object instance segmentation by extending Faster RCNN. Mask RCNN adopts the same two stage pipeline, with an identical first stage (RPN). In the second stage, in parallel to predicting the class and box offset, Mask RCNN adds a branch which outputs a binary mask for each RoI. The new branch is a Fully Convolutional Network (FCN) [138,185] on top of a CNN feature map. In order to avoid the misalignments caused by the original RoI pooling (RoIPool) layer, a RoIAlign layer was proposed to preserve the pixel level spatial correspondence. With a ResNeXt101-FPN backbone network [223,130], Mask RCNN achieved top results on COCO object instance segmentation and bounding box object detection. It is simple to train, generalizes well, and adds only a small overhead to Faster RCNN, running at 5 FPS [80].
Light Head RCNN: In order to further increase the detection speed of RFCN [40], Li et al. [128] proposed Light Head RCNN, making the head of the detection network as light as possible to reduce the RoI region-wise computation. In particular, Li et al. [128] applied a large kernel separable convolution to produce thin feature maps with a small number of channels, followed by a cheap RCNN subnetwork, leading to an excellent tradeoff of speed and accuracy.
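A back-of-the-envelope weight count suggests why a large kernel separable convolution is cheap. The sketch below compares a dense k × k convolution with a k × 1 convolution followed by a 1 × k convolution; the channel sizes and kernel size in the example call are illustrative assumptions, biases are ignored, and both function names are hypothetical.

```python
def dense_conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights in a single dense k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def separable_conv_weights(c_in: int, c_mid: int, c_out: int, k: int) -> int:
    """Weights in a k x 1 convolution to c_mid channels, then 1 x k to c_out."""
    return c_in * c_mid * k + c_mid * c_out * k


# Illustrative sizes (assumed): 2048 input channels, 64 intermediate channels,
# 490 output channels, kernel size 15.
dense = dense_conv_weights(2048, 490, 15)
separable = separable_conv_weights(2048, 64, 490, 15)
print(dense, separable, round(dense / separable, 1))
```

Under these assumed sizes the separable form needs roughly two orders of magnitude fewer weights while keeping the large receptive field of the k × k kernel.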

Unified Pipeline (One Stage Pipeline)
The region-based pipeline strategies of Section 3.1 have prevailed on detection benchmarks since RCNN [65]. The significant efforts introduced in Section 3.1 have led to faster and more accurate detectors, and the current leading results on popular benchmark datasets are all based on Faster RCNN [175]. In spite of that progress, region-based approaches can be computationally expensive for mobile/wearable devices, which have limited storage and computational capability. Therefore, instead of trying to optimize the individual components of a complex region-based pipeline, researchers have begun to develop unified detection strategies.
Unified pipelines refer broadly to architectures that directly predict class probabilities and bounding box offsets from full images with a single feed forward CNN in a monolithic setting that does not involve region proposal generation or post classification. The approach is simple and elegant because it completely eliminates region proposal generation and subsequent pixel or feature resampling stages, encapsulating all computation in a single network. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
DetectorNet: Szegedy et al. [198] were among the first to explore CNNs for object detection. DetectorNet formulated object detection as a regression problem to object bounding box masks. They use AlexNet [109] and replace the final softmax classifier layer by a regression layer. Given an image window, they use one network to predict foreground pixels over a coarse grid, as well as four additional networks to predict the object's top, bottom, left and right halves. A grouping process then converts the predicted masks into detected bounding boxes. Because a network must be trained per object type and mask type, the approach does not scale up to multiple classes. In addition, DetectorNet must take many crops of the image, and run multiple networks for each part on every crop.
OverFeat, proposed by Sermanet et al. [183], was one of the first modern one-stage object detectors based on fully convolutional deep networks. It is one of the most successful object detection frameworks, winning the ILSVRC2013 localization competition. OverFeat performs object detection in a multiscale sliding window fashion via a single forward pass through the CNN, which (with the exception of the final classification/regression layer) consists only of convolutional layers. In this way, it naturally shares computation between overlapping regions. OverFeat produces a grid of feature vectors, each of which represents a slightly different context view location within the input image and can predict the presence of an object. Once an object is identified, the same features are then used to predict a single bounding box regressor. In addition, OverFeat leverages multiscale features to improve the overall performance by passing up to six enlarged scales of the original image through the network and iteratively aggregating them together, resulting in a significantly increased number of evaluated context views (final feature vectors). OverFeat has a significant speed advantage over RCNN [65], which was proposed during the same period, but is significantly less accurate because fully convolutional networks were hard to train at that time. The speed advantage derives from sharing the computation of convolution between overlapping windows using a fully convolutional network.
YOLO (You Only Look Once): Redmon et al. [174] proposed YOLO, a unified detector casting object detection as a regression problem from image pixels to spatially separated bounding boxes and associated class probabilities. The design of YOLO is illustrated in Fig. 8. Since the region proposal generation stage is completely dropped, YOLO directly predicts detections using a small set of candidate regions. Unlike region-based approaches, e.g. Faster RCNN, that predict detections based on features from a local region, YOLO uses features from the entire image globally. In particular, YOLO divides an image into an S × S grid. Each grid cell predicts C class probabilities, B bounding box locations and confidence scores for those boxes. These predictions are encoded as an S × S × (5B + C) tensor. By throwing out the region proposal generation step entirely, YOLO is fast by design, running in real time at 45 FPS, with a fast version, i.e. Fast YOLO [174], running at 155 FPS. Since YOLO sees the entire image when making predictions, it implicitly encodes contextual information about object classes and is less likely to predict false positives on background. However, YOLO makes more localization errors, resulting from the coarse division of bounding box location, scale and aspect ratio. As discussed in [174], YOLO may fail to localize some objects, especially small ones, possibly because the grid division is quite coarse, and because by construction each grid cell can only contain one object. It is unclear to what extent YOLO can translate to good performance on datasets with significantly more objects, such as the ILSVRC detection challenge.
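As a small worked example of the S × S × (5B + C) encoding, the following sketch computes the prediction tensor shape and the grid cell containing a box centre. The values S = 7, B = 2, C = 20 are the PASCAL VOC settings reported in [174] and are not stated in this section; the function names are ours and purely illustrative.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S x (5B + C)."""
    # Each of the B boxes carries 4 coordinates plus 1 confidence score;
    # the C class probabilities are shared by all boxes of a grid cell.
    return (S, S, 5 * B + C)


def responsible_cell(cx, cy, S):
    """Grid cell (row, col) containing a box centre given in [0, 1] image coordinates."""
    # Clamp so that a centre lying exactly on the right/bottom border stays in range.
    row = min(int(cy * S), S - 1)
    col = min(int(cx * S), S - 1)
    return (row, col)


print(yolo_output_shape(S=7, B=2, C=20))  # (7, 7, 30)
print(responsible_cell(0.5, 0.1, S=7))    # (0, 3)
```

Because every cell emits only B boxes and one set of class probabilities, two objects whose centres fall in the same cell compete for the same prediction slot, which is one way to read the small-object limitation noted above.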
YOLOv2 and YOLO9000: Redmon and Farhadi [173] proposed YOLOv2, an improved version of YOLO, in which the custom GoogLeNet [200] network is replaced with the simpler DarkNet19, together with a number of strategies drawn from existing work, such as batch normalization [78], removing the fully connected layers, using good anchor boxes learned with k-means, and multiscale training. YOLOv2 achieved state of the art on standard detection tasks, like PASCAL VOC and MS COCO. In addition, Redmon and Farhadi [173] introduced YOLO9000, which can detect over 9000 object categories in real time by proposing a joint optimization method to train simultaneously on ImageNet and COCO, with WordTree used to combine data from multiple sources.

SSD (Single Shot Detector): In order to preserve real-time speed without sacrificing too much detection accuracy, Liu et al. [136] proposed SSD, which is faster than YOLO [174] and has accuracy competitive with state-of-the-art region-based detectors, including Faster RCNN [175]. SSD effectively combines ideas from RPN in Faster RCNN [175], YOLO [174] and multiscale CONV features [75] to achieve fast detection speed while still retaining high detection quality. Like YOLO, SSD predicts a fixed number of bounding boxes and scores for the presence of object class instances in these boxes, followed by an NMS step to produce the final detection. The CNN in SSD is fully convolutional; its early layers are based on a standard architecture, such as VGG [191] (truncated before any classification layers), which is referred to as the base network. Several auxiliary CONV layers, progressively decreasing in size, are then added to the end of the base network. The information in the last layer, which has low resolution, may be too coarse spatially to allow precise localization, so SSD uses shallower layers with higher resolution for detecting small objects. For objects of different sizes, SSD performs detection over multiple scales by operating on multiple CONV feature maps, each of which predicts category scores and box offsets for bounding boxes of appropriate sizes. For a 300 × 300 input, SSD achieves 74.3% mAP on the VOC2007 test set at 59 FPS on an Nvidia Titan X.
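The NMS step mentioned above can be illustrated with a minimal greedy sketch. This is a generic, class-agnostic version written for illustration, not the implementation used in [136]; the box format (x1, y1, x2, y2) and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In a multiclass detector this procedure is typically applied per class after discarding low-confidence boxes.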

Fundamental SubProblems
In this section important subproblems are described, including feature representation, region proposal, context information mining, and training strategies. Each approach is reviewed with respect to its primary contribution.

DCNN based Object Representation
As one of the main components in any detector, good feature representations are of primary importance in object detection [46,65,62,249]. In the past, a great deal of effort was devoted to designing local descriptors (e.g., SIFT [139] and HOG [42]) and to exploring approaches (e.g., Bag of Words [194] and Fisher Vector [166]) to group and abstract the descriptors into higher level representations in order to allow the discriminative object parts to begin to emerge. However, these feature representation methods required careful engineering and considerable domain expertise.
In contrast, deep learning methods (especially deep CNNs, or DCNNs), which are composed of multiple processing layers, can learn powerful feature representations with multiple levels of abstraction directly from raw images [12,116]. As the learning procedure reduces the dependence on specific domain knowledge and on the complex procedures needed in traditional feature engineering [12,116], the burden for feature representation has been transferred to the design of better network architectures.
The leading frameworks reviewed in Section 3 (RCNN [65], Fast RCNN [64], Faster RCNN [175], YOLO [174], SSD [136]) have steadily improved detection accuracy and speed. It is generally accepted that the CNN representation plays a crucial role and that the CNN architecture is the engine of a detector. As a result, most of the recent improvements in detection accuracy have been achieved via research into the development of novel networks. Therefore we begin by reviewing popular CNN architectures used in generic object detection, followed by a review of the effort devoted to improving object feature representations, such as developing invariant features to accommodate geometric variations in object scale, pose, viewpoint and part deformation, and performing multiscale analysis to improve object detection over a wide range of scales.

Popular CNN Architectures
CNN architectures serve as network backbones to be used in the detection frameworks described in Section 3. Representative architectures include AlexNet [110], ZFNet [234], VGGNet [191], GoogLeNet [200], the Inception series [99,201,202], ResNet [79], DenseNet [94] and SENet [91], which are summarized in Table 2; the resulting improvement in object recognition can be seen from Fig. 9. A further review of recent CNN advances can be found in [71].
Briefly, a CNN has a hierarchical structure and is composed of a number of layers such as convolution, nonlinearity, pooling etc. From finer to coarser layers, the image repeatedly undergoes filtered convolution, and with each layer the receptive field (region of support) of these filters increases. For example, the pioneering AlexNet [110] has five convolutional layers and three Fully Connected (FC) layers, where the first layer contains 96 filters of size 11 × 11 × 3. In general, the first CNN layer extracts low level features (e.g. edges), intermediate layers extract features of increasing complexity, such as combinations of low level features, and later convolutional layers detect objects as combinations of earlier parts [234,12,116,157].
As can be observed from Table 2, the trend in architecture evolution is that networks are getting deeper: AlexNet consisted of 8 layers, VGGNet 16 layers, and more recently ResNet and DenseNet both surpassed the 100 layer mark. It was VGGNet [191] and GoogLeNet [200], in particular, which showed that increasing depth can improve the representational power of deep networks. Interestingly, as can be observed from Table 2, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of parameters, despite being only a few layers deep, since a large fraction of the parameters come from the FC layers. Therefore, newer networks like Inception, ResNet, and DenseNet, although very deep, have far fewer parameters because they avoid the use of FC layers.
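To make the point about FC layers concrete, the sketch below counts the parameters of AlexNet's first convolutional layer (96 filters of size 11 × 11 × 3, as stated above) and of its first FC layer. The FC input size of 6 × 6 × 256 and output size of 4096 are the standard AlexNet dimensions and are an assumption not given in this section; the helper names are ours.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Number of learnable parameters in a convolutional layer."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def fc_params(n_in, n_out, bias=True):
    """Number of learnable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


conv1 = conv_params(11, 11, 3, 96)   # first AlexNet CONV layer
fc6 = fc_params(6 * 6 * 256, 4096)   # first AlexNet FC layer (assumed sizes)

print(conv1)  # 34944
print(fc6)    # 37752832
```

Under these assumptions a single FC layer holds roughly a thousand times more parameters than the first convolutional layer, which is consistent with the observation that FC layers dominate the parameter count of the shallower networks.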
With the use of Inception modules in carefully designed topologies, the number of parameters in GoogLeNet is dramatically reduced. Similarly, ResNet demonstrated the effectiveness of skip connections for learning extremely deep networks with hundreds of layers, winning the ILSVRC 2015 classification task. Inspired by ResNet [79], InceptionResNets [202] combine the Inception networks with shortcut connections, claiming that shortcut connections can significantly accelerate the training of Inception networks. Extending ResNets, Huang et al. [94] proposed DenseNets, which are built from dense blocks that connect each layer to every other layer in a feed-forward fashion, leading to compelling advantages such as parameter efficiency, implicit deep supervision, and feature reuse. Recently, Hu et al. [91] proposed an architectural unit termed the Squeeze and Excitation (SE) block, which can be combined with existing deep architectures to boost their performance at minimal additional computational cost. The SE block adaptively recalibrates channelwise feature responses by explicitly modeling the interdependencies between convolutional feature channels, and led to winning the ILSVRC 2017 classification task. Research on CNN architectures remains active, and a number of backbone networks are still emerging, such as Dilated Residual Networks [230], Xception [35], DetNet [127], and Dual Path Networks (DPN) [31].
The training of a CNN requires a large labelled dataset with sufficient label and intraclass diversity. Unlike image classification, detection requires localizing (possibly many) objects from an image. It has been shown [161] that pretraining the deep model with a large scale dataset having object-level annotations (such as the ImageNet classification and localization dataset), instead of only image-level annotations, improves the detection performance. However, collecting bounding box labels is expensive, especially for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset, directly, as a generic feature extractor [172,8,49,228], which can support a wider range of visual recognition tasks. For detection, the pretrained network is typically finetuned on a given detection dataset [49,65,67]. Several large scale image classification datasets are used for CNN pretraining, among them the ImageNet1000 dataset [44,179] with 1.2 million images of 1000 object categories, the Places dataset [245], which is much larger than ImageNet1000 but has fewer classes, and a recent hybrid dataset [245] combining the Places and ImageNet datasets.
Pretrained CNNs without finetuning were explored for object classification and detection in [49,67,1], where it was shown that feature performance depends on the layer from which the features are extracted; for example, for AlexNet pretrained on ImageNet, FC6 / FC7 / Pool5 are in descending order of detection accuracy [49,67]. Finetuning a pretrained network can increase detection performance significantly [65,67], although in the case of AlexNet the finetuning performance boost was shown to be much larger for FC6 and FC7 than for Pool5, suggesting that the Pool5 features are more general. Furthermore, the similarity between the source and target datasets plays a critical role; for example, ImageNet based CNN features perform better [243] on object related image datasets.

Methods For Improving Object Representation
Deep CNN based detectors such as RCNN [65], Fast RCNN [64], Faster RCNN [175] and YOLO [174] typically use the deep CNN architectures listed in Table 2 as the backbone network and use features from the top layer of the CNN as the object representation. However, detecting objects across a large range of scales is a fundamental challenge. A classical strategy to address this issue is to run the detector over a number of scaled input images (e.g., an image pyramid) [56,65,77], which typically produces more accurate detection, but at an obvious cost in inference time and memory. In contrast, a CNN computes its feature hierarchy layer by layer, and the subsampling layers in the feature hierarchy lead to an inherent multiscale pyramid.
This inherent feature hierarchy produces feature maps of different spatial resolutions, but has inherent structural problems [75,138,190]: the later (or higher) layers have a large receptive field and strong semantics, and are the most robust to variations such as object pose, illumination and part deformation, but their resolution is low and geometric details are lost. In contrast, the earlier (or lower) layers have a small receptive field, high resolution and rich geometric details, but are much less sensitive to semantics. Intuitively, semantic concepts of objects can emerge in different layers, depending on the size of the objects. So if a target object is small it requires fine detail information in earlier layers and may very well disappear at later layers, in principle making small object detection very challenging, for which tricks such as dilated convolutions [229], also known as atrous convolutions [40,27], have been proposed. On the other hand, if the target object is large then the semantic concept will emerge in much later layers. Clearly it is not optimal to predict objects of different scales with features from only one layer, therefore a number of methods [190,241,130,104] have been proposed to improve detection accuracy by exploiting multiple CNN layers, broadly falling into three types of multiscale object detection: 1. Detecting with combined features of multiple CNN layers [75,103,10]; 2. Detecting at multiple CNN layers; 3. Combinations of the above two methods [58,130,190,104,246,239].
(1) Detecting with combined features of multiple CNN layers seeks to combine features from multiple layers before making a prediction. Representative approaches include Hypercolumns [75], HyperNet [103], and ION [10]. Such feature combining is commonly accomplished via skip connections, a classic neural network idea that skips some layers in the network and feeds the output of an earlier layer as the input to a later layer; such architectures have recently become popular for semantic segmentation [138,185,75]. As shown in Fig. 10 (a), ION [10] uses skip pooling to extract RoI features from multiple layers, and then the object proposals generated by Selective Search and EdgeBoxes are classified by using the combined features. HyperNet [103] follows a similar idea, aggregating multilayer convolutional features into Hyper Features that are shared between proposal generation and detection.
(2) Detecting at multiple CNN layers [138,185] combines coarse to fine predictions from multiple layers by averaging segmentation probabilities. SSD [136], MSCNN [20], RFBNet [135], and DSOD [186] combine predictions from multiple feature maps to handle objects of various sizes. SSD spreads out default boxes of different scales to multiple layers within a CNN and forces each layer to focus on predicting objects of a certain scale. Liu et al. [135] proposed RFBNet, which simply replaces the later convolution layers of SSD with a Receptive Field Block (RFB) to enhance the discriminability and robustness of features. The RFB is a multibranch convolutional block, similar to the Inception block [200], but combining multiple branches with different kernels and convolution layers [27]. MSCNN [20] applies deconvolution on multiple layers of a CNN to increase feature map resolution before using the layers to learn region proposals and pool features.
(3) Combination of the above two methods recognizes that, on the one hand, the hyper feature representation obtained by simply incorporating skip features into detection, as in UNet [154], Hypercolumns [75], HyperNet [103] and ION [10], does not yield significant improvements due to its high dimensionality. On the other hand, it is natural to detect large objects from later layers with large receptive fields and to use earlier layers with small receptive fields to detect small objects; however, simply detecting objects from earlier layers may result in low performance because earlier layers possess less semantic information. Therefore, in order to combine the best of both worlds, some recent works propose to detect objects at multiple layers, where the feature of each detection layer is obtained by combining features from different layers. Representative methods include SharpMask [168], Deconvolutional Single Shot Detector (DSSD) [58], Feature Pyramid Network (FPN) [130], Top Down Modulation (TDM) [190], Reverse connection with Objectness prior Network (RON) [104], ZIP [122] (shown in Fig. 12), Scale Transfer Detection Network (STDN) [246], RefineDet [239] and StairNet [217], as shown in Table 3 and contrasted in Fig. 11.
Table 3 Summary of properties of representative methods for improving DCNN feature representations for generic object detection. See Section 4.1.2 for a more detailed discussion. Abbreviations: Selective Search (SS), EdgeBoxes (EB), InceptionResNet (IRN). Detection results on VOC07, VOC12 and COCO were reported with mAP@IoU=0.5, and the other column results on COCO were reported with a new metric mAP@IoU=[0.5 : 0.05 : 0.95] which averages mAP over different IoU thresholds from 0.5 to 0.95 (written as [0.5:0.95]). Training data: "07"←VOC2007 trainval; "12"←VOC2012 trainval; "07+12"←union of VOC07 and VOC12 trainval; "07++12"←union of VOC07 trainval, VOC07 test, and VOC12 trainval; "07++12+CO"←union of VOC07 trainval, VOC07 test, VOC12 trainval and COCO trainval. The COCO detection results were reported with COCO2015 Test-Dev, except for MPN [233]
As can be observed from Fig. 11 (a1) to (e1), these methods have highly similar detection architectures which incorporate a top-down network with lateral connections to supplement the standard bottom-up, feedforward network. Specifically, after a bottom-up pass the final high level semantic features are transmitted back by the top-down network to combine with the bottom-up features from intermediate layers after lateral processing. The combined features are further processed, then used for detection and also transmitted down by the top-down network. As can be seen from Fig. 11 (a2) to (e2), one main difference is the design of the Reverse Fusion Block (RFB), which handles the selection of the lower layer filters and the combination of multilayer features. The top-down and lateral features are processed with small convolutions and combined by elementwise sum, elementwise product or concatenation. FPN shows significant improvement as a generic feature extractor in several applications including object detection [130,131] and instance segmentation [80], e.g. using FPN in a basic Faster RCNN detector. These methods have to add additional layers to obtain multiscale features, introducing a cost that cannot be neglected. STDN [246] used DenseNet [94] to combine features of different layers and designed a scale transfer module to obtain feature maps with different resolutions. The scale transfer module can be directly embedded into DenseNet with little additional cost.
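The top-down combination step described above can be sketched on toy single-channel feature maps. This is a simplified illustration only: it uses nearest-neighbour 2× upsampling and an elementwise sum, and omits the small convolutions that the actual methods apply to the lateral and merged features. The function names are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse_top_down(top_down, lateral):
    """Combine a coarse top-down map with a finer lateral map by elementwise sum."""
    up = upsample2x(top_down)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the top-down resolution")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2]]                       # 1 x 2 map from a later (semantic) layer
fine = [[1, 1, 1, 1], [0, 0, 0, 0]]     # 2 x 4 map from an earlier (detailed) layer
print(fuse_top_down(coarse, fine))      # [[2, 2, 3, 3], [1, 1, 2, 2]]
```

Replacing the sum with an elementwise product or a channel concatenation gives the other combination variants mentioned in the text.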
(4) Model Geometric Transformations. DCNNs are inherently limited in their ability to model significant geometric transformations. An empirical study of the invariance and equivalence of DCNN representations to image transformations can be found in [118]. Some approaches have been presented to enhance the robustness of CNN representations, aiming at learning invariant CNN representations with respect to different types of transformations such as scale [101,18], rotation [18,32,218,248], or both [100].
Modeling Object Deformations: Before deep learning, Deformable Part based Models (DPMs) [56] were very successful for generic object detection, representing objects by component parts arranged in a deformable configuration. This DPM modeling is less sensitive to transformations in object pose, viewpoint and nonrigid deformations because the parts are positioned accordingly and their local appearances are stable, motivating researchers [41,66,147,160,214] to explicitly model object composition to improve CNN based detection. The first attempts [66,214] combined DPMs with CNNs by using deep features learned by AlexNet in DPM based detection, but without region proposals. To enable a CNN to enjoy the built-in capability of modeling the deformations of object parts, a number of approaches were proposed, including DeepIDNet [160], DCN [41] and DPFCN [147] (shown in Table 3). Although similar in spirit, these approaches compute deformations in different ways. DeepIDNet [161] designed a deformation constrained pooling layer to replace a regular max pooling layer, in order to learn the shared visual patterns and their deformation properties across different object classes. Dai et al. [41] designed a deformable convolution layer and a deformable RoI pooling layer, both of which are based on the idea of augmenting the regular grid sampling locations in the feature maps with additional position offsets and learning the offsets via convolutions, leading to Deformable Convolutional Networks (DCN). In DPFCN [147], Mordan et al. proposed a deformable part based RoI pooling layer which selects discriminative parts of objects around object proposals by simultaneously optimizing latent displacements of all parts.
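The idea of augmenting regular grid sampling locations with position offsets can be sketched for a single output value of a 3 × 3 convolution on a single-channel map. This is an illustrative sketch rather than the implementation of [41]: the offsets are passed in as plain numbers instead of being predicted by a convolution, fractional locations are read with bilinear interpolation, and locations outside the map are treated as zero. All names are ours.

```python
import math


def bilinear_sample(fm, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional location (y, x)."""
    h, w = len(fm), len(fm[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:        # zero padding outside the map
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fm[yy][xx]
    return value


def deformable_conv_at(fm, kernel, offsets, cy, cx):
    """One output of a 3x3 deformable convolution centred at (cy, cx).

    offsets[k] = (dy, dx) is the offset added to the k-th regular grid
    location (row-major order); all-zero offsets give a regular convolution.
    """
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[ky + 1][kx + 1] * bilinear_sample(fm, cy + ky + dy, cx + kx + dx)
            k += 1
    return out


fm = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ones = [[1, 1, 1]] * 3
print(deformable_conv_at(fm, ones, [(0, 0)] * 9, 1, 1))  # 36.0
```

With zero offsets the function reduces to an ordinary 3 × 3 convolution; nonzero offsets let each kernel tap read from a shifted, possibly fractional, location.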

Context Modeling
In the physical world visual objects occur in particular environments and usually coexist with other related objects, and there is strong psychological evidence [13,9] that context plays an essential role in human object recognition. It is recognized that proper modeling of context helps object detection and recognition [203,155,27,26,47,59], especially when object appearance features are insufficient because of small object size, occlusion, or poor image quality. Many different types of context have been discussed; in particular see surveys [47,59]. Context can broadly be grouped into one of three categories [13,59]: 1. Semantic context: The likelihood of an object to be found in some scenes but not in others; 2. Spatial context: The likelihood of finding an object in some position and not others with respect to other objects in the scene; 3. Scale context: Objects have a limited set of sizes relative to other objects in the scene.
A great deal of work [28,47,59,143,152,171,162] preceded the prevalence of deep learning; however, much of this work has not been explored in DCNN based object detectors [29,90]. The current state of the art in object detection [175,136,80] detects objects without explicitly exploiting any contextual information. It is broadly agreed that DCNNs make use of contextual information implicitly [234,242] since they learn hierarchical representations with multiple levels of abstraction. Nevertheless there is still value in exploring contextual information explicitly in DCNN based detectors [90,29,236], and so the following reviews recent work in exploiting contextual cues in DCNN based object detectors, organized into categories of global and local contexts, motivated by earlier work in [240,59]. Representative approaches are summarized in Table 4.
Global context [240,59] refers to image or scene level context, which can serve as cues for object detection (e.g., a bedroom will predict the presence of a bed). In DeepIDNet [160], the image classification scores were used as contextual features, and concatenated with the object detection scores to improve detection results. In ION [10], Bell et al. proposed to use spatial Recurrent Neural Networks (RNNs) to explore contextual information across the entire image. In SegDeepM [250], Zhu et al. proposed an MRF model that scores appearance as well as context for each detection, and allows each candidate box to select a segment and score the agreement between them. In [188], semantic segmentation was used as a form of contextual priming.
Local context [240,59,171] considers the local surroundings of an object: object relations and the interactions between an object and its surrounding area. In general, modeling object relations is challenging, requiring reasoning about bounding boxes of different classes, locations, scales etc. In the deep learning era, research that explicitly models object relations is quite limited, with representative approaches being Spatial Memory Network (SMN) [29], Object Relation Network (ORN) [90], and Structure Inference Network (SIN) [137]. In SMN, spatial memory essentially assembles object instances back into a pseudo image representation that is easy to feed into another CNN for object relations reasoning, leading to a new sequential reasoning architecture where image and memory are processed in parallel to obtain detections which further update memory. Inspired by the recent success of attention modules in the natural language processing field [211], Hu et al. [90] proposed a lightweight ORN, which processes a set of objects simultaneously through interaction between their appearance features and geometry. It does not require additional supervision and is easy to embed in existing networks. It has been shown to be effective in improving the object recognition and duplicate removal steps in modern object detection pipelines, giving rise to the first fully end-to-end object detector. SIN [137] considered two kinds of context: scene contextual information and object relationships within a single image. It formulates object detection as a problem of graph structure inference, where given an image the objects are treated as nodes in a graph and relationships between objects are modeled as edges in the graph.
In MRCNN [62] (Fig. 13 (a)), in addition to the features extracted from the original object proposal at the last CONV layer of the backbone, Gidaris and Komodakis proposed to extract features from a number of different regions of an object proposal (half regions, border regions, central regions, contextual region and semantically segmented regions), in order to obtain a richer and more robust object representation. All of these features are combined simply by concatenation.
Quite a number of methods, all closely related to MRCNN, have been proposed since. The method in [233] used only four contextual regions, organized in a foveal structure, where the classifier is trained jointly end to end. Zeng et al. proposed GBDNet [235,236] (Fig. 13 (b)) to extract features from multiscale contextualized regions surrounding an object proposal to improve detection performance. Different from the naive way of learning CNN features for each region separately and then concatenating them, as in MRCNN, GBDNet can pass messages among features from different contextual regions, implemented through convolution. Noting that message passing is not always helpful but depends on individual samples, Zeng et al. used gated functions to control message transmission, as in Long Short Term Memory (LSTM) networks [83]. Concurrent with GBDNet, Li et al. [123] presented ACCNN (Fig. 13 (c)) to utilize both global and local contextual information to facilitate object detection. To capture global context, an attention-based global contextualized subnetwork was proposed, which recurrently generates an attention map for an input image to highlight useful global contextual locations, through multiple stacked LSTM layers. To encode local surrounding context, Li et al. [123] adopted a Multiscale Local Contextualized (MLC) subnetwork, using a method similar to that in MRCNN [62].
As shown in Fig. 13 (d), CoupleNet [251] is conceptually similar to ACCNN [123], but built upon RFCN [40]. In addition to the original branch in RFCN [40], which captures object information with position sensitive RoI pooling, CoupleNet [251] added one branch to encode the global context information with RoI pooling.

Detection Proposal Methods
An object can be located at any position and scale in an image. During the heyday of handcrafted feature descriptors (e.g., SIFT [140], HOG [42] and LBP [153]), the Bag of Words (BoW) [194,37] and the DPM [55] used sliding window techniques [213,42,55,76,212]. However, the number of windows is large and grows with the number of pixels in an image, and the need to search at multiple scales and aspect ratios further significantly increases the search space. Therefore, it is computationally too expensive to apply more sophisticated classifiers.
Around 2011, researchers proposed to relieve the tension between computational tractability and high detection quality by using detection proposals [210,209]. Originating in the idea of objectness proposed by [2], object proposals are a set of candidate regions in an image that are likely to contain objects. Detection proposals are usually used as a preprocessing step, in order to reduce the computational complexity by limiting the number of regions that need be evaluated by the detector. Therefore, a good detection proposal should have the following characteristics: 1. High recall, which can be achieved with only a few proposals; 2. The proposals match the objects as accurately as possible; 3. High efficiency.
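The first characteristic, recall, can be made concrete with a short sketch: a ground truth object counts as recalled if at least one proposal overlaps it sufficiently. The IoU threshold of 0.5 and the (x1, y1, x2, y2) box format are assumptions chosen for illustration, and the function names are ours.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def proposal_recall(gt_boxes, proposals, iou_threshold=0.5):
    """Fraction of ground truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes
        if any(iou(g, p) >= iou_threshold for p in proposals)
    )
    return hits / len(gt_boxes)


gt = [(0, 0, 10, 10), (50, 50, 60, 60)]
props = [(1, 1, 11, 11), (100, 100, 110, 110)]
print(proposal_recall(gt, props))  # 0.5
```

Evaluating this quantity as a function of the number of proposals kept, or of the IoU threshold, relates directly to the first two characteristics listed above.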
A comprehensive review of object proposal algorithms is outside the scope of this paper, because object proposals have applications beyond object detection [6,72,252]. We refer interested readers to the recent surveys [86,23] which provide an in-depth analysis of many classical object proposal algorithms and their impact on detection performance. Our interest here is to review object proposal methods that are based on DCNNs, output class agnostic proposals, and are related to generic object detection.
Fig. 13 Representative approaches that explore local surrounding contextual features: MRCNN [62], GBDNet [235,236], ACCNN [123] and CoupleNet [251], see also Table 4.
In 2014, the integration of object proposals [210,209] and DCNN features [109] led to the milestone RCNN [65] in generic object detection. Since then, detection proposal algorithms have quickly become a standard preprocessing step, evidenced by the fact that all winning entries in the PASCAL VOC [53], ILSVRC [179] and MS COCO [129] object detection challenges since 2014 used detection proposals [65,160,64,175,236,80].
Among object proposal approaches based on traditional low-level cues (e.g., color, texture, edge and gradients), Selective Search [209], MCG [7] and EdgeBoxes [254] are among the more popular. As the domain rapidly progressed, traditional object proposal approaches [86] (e.g. Selective Search [209] and EdgeBoxes [254]), which were adopted as external modules independent of the detectors, became the bottleneck of the detection pipeline [175]. An emerging class of object proposal algorithms [52,175,111,61,167,224] using DCNNs has attracted broad attention.
Recent DCNN based object proposal methods generally fall into two categories: bounding box based and object segment based, with representative methods summarized in Table 5.
Bounding Box Proposal Methods are best exemplified by the RPN method [175] of Ren et al., illustrated in Fig. 14. RPN predicts object proposals by sliding a small network over the feature map of the last shared CONV layer (as shown in Fig. 14). At each sliding window location, it predicts k proposals simultaneously by using k anchor boxes, where each anchor box (the terminology "an anchor box" or "an anchor" first appeared in [175]) is centered at some location in the image, and is associated with a particular scale and aspect ratio. Ren et al. [175] proposed to integrate RPN and Fast RCNN into a single network by sharing their convolutional layers. Such a design led to substantial speedup and the first end-to-end detection pipeline, Faster RCNN [175]. RPN has been broadly selected as the proposal method by many state of the art object detectors, as can be observed from Tables 3 and 4. Instead of fixing a priori a set of anchors as in MultiBox [52,199] and RPN [175], Lu et al. [141] proposed to generate anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on subregions likely to contain objects. Starting with the whole image, all regions visited during the search process serve as anchors. For any anchor region encountered during the search procedure, a scalar zoom indicator is used to decide whether to further partition the region, and a set of bounding boxes with objectness scores are computed with a deep network called Adjacency and Zoom Network (AZNet). AZNet extends RPN by adding a branch to compute the scalar zoom indicator in parallel with the existing branch.
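The anchor mechanism can be illustrated by generating the k anchor boxes at each sliding window location. The sketch below assumes the commonly cited configuration of [175], namely three scales (128, 256, 512 pixels) and three aspect ratios (1:2, 1:1, 2:1), giving k = 9, with a feature stride of 16; these values are not stated in this section, and the function names are ours.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each ratio is height / width; every anchor of scale s has area s * s.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def all_anchors(feat_h, feat_w, stride=16,
                scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors for every location of a feat_h x feat_w feature map."""
    result = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Map the feature map cell back to its centre in image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            result.extend(anchors_at(cx, cy, scales, ratios))
    return result


print(len(anchors_at(100.0, 100.0)))  # 9
print(len(all_anchors(2, 3)))         # 54
```

The proposal network then scores each anchor for objectness and regresses offsets relative to it, so the number of candidate boxes grows with the feature map size times k.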
There is further work attempting to generate object proposals by exploiting multilayer convolutional features [103,61,224,122]. Concurrent with RPN [175], Ghodrati et al. [61] proposed DeepProposal, which generates object proposals by using a cascade of multiple convolutional features, building an inverse cascade to select the most promising object locations and to refine their boxes in a coarse to fine manner. An improved variant of RPN, HyperNet [103] designs Hyper Features which aggregate multilayer convolutional features and shares them both in generating proposals and detecting objects via an end to end joint training strategy. Yang et al. proposed CRAFT [224], which also used a cascade strategy, first training an RPN network to generate object proposals and then using them to train another binary Fast RCNN network to further distinguish objects from background. Li et al. [122] proposed ZIP to improve RPN by predicting object proposals with multiple convolutional feature maps at different depths of a network, a commonly used idea that integrates both low level details and high level semantics. The backbone network used in ZIP is a "zoom out and in" network inspired by the conv and deconv structure [138].
Finally, recent work which deserves mention includes DeepBox [111], which proposed a lightweight CNN to learn to rerank proposals generated by EdgeBoxes, and DeNet [208], which introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN in a Faster RCNN style two stage detector.
Object Segment Proposal Methods [167,168] aim to generate segment proposals that are likely to correspond to objects. Segment proposals are more informative than bounding box proposals, and take a step further towards object instance segmentation [74,39,126]. A pioneering work was DeepMask, proposed by Pinheiro et al. [167], where segment proposals are learned directly from raw image data with a deep network. Sharing similarities with RPN, after a number of shared convolutional layers DeepMask splits the network into two branches to predict a class agnostic mask and an associated objectness score. Similar to the efficient sliding window prediction strategy in OverFeat [183], the trained DeepMask network is applied in a sliding window manner to an image (and its rescaled versions) during inference. More recently, Pinheiro et al. [168] proposed SharpMask by augmenting the DeepMask architecture with a refinement module, similar to the architectures shown in Fig. 11 (b1) and (b2), which adds a top-down refinement process to the feedforward network. SharpMask can efficiently integrate the spatially rich information from early features with the strong semantic information encoded in later layers to generate high fidelity object masks.
Motivated by Fully Convolutional Networks (FCN) for semantic segmentation [138] and DeepMask [167], Dai et al. proposed InstanceFCN [38] for generating instance segment proposals. Similar to DeepMask, the InstanceFCN network is split into two branches; however, the two branches are fully convolutional, where one branch generates a small set of instance sensitive score maps, followed by an assembling module that outputs instances, and the other branch predicts the objectness score. Hu et al. proposed FastMask [89] to efficiently generate instance segment proposals in a one-shot manner similar to SSD [136], in order to make use of multiscale convolutional features in a deep network. Sliding windows extracted densely from multiscale convolutional feature maps were input to a scale-tolerant attentional head module to predict segmentation masks and objectness scores. FastMask is claimed to run at 13 FPS on an 800 × 600 resolution image with a slight trade off in average recall. Qiao et al. [170] proposed ScaleNet to extend previous object proposal methods like SharpMask [168] by explicitly adding a scale prediction phase. That is, ScaleNet estimates the distribution of object scales for an input image, upon which SharpMask searches the input image at the scales predicted by ScaleNet and outputs instance segment proposals. Qiao et al. [170] showed their method outperformed the previous state of the art on supermarket datasets by a large margin.

Other Special Issues
Aiming at obtaining better and more robust DCNN feature representations, data augmentation tricks are commonly used [22,64,65]. Augmentation can be used at training time, at test time, or both. It refers to perturbing an image by transformations that leave the underlying category unchanged, such as cropping, flipping, rotating, scaling and translating, in order to generate additional samples of the class. Data augmentation can affect the recognition performance of deep feature representations. Nevertheless, it has obvious limitations: both training and inference computational complexity increase significantly, limiting its usage in real applications. Detecting objects under a wide range of variations, and especially detecting very small objects, stands out as one of the key challenges. It has been shown [96,136] that image resolution has a considerable impact on detection accuracy. Therefore, among those data augmentation tricks, scaling (especially a higher resolution input) is mostly used, since high resolution inputs increase the likelihood of small objects being detected [96]. Recently, Singh et al. proposed advanced and efficient data augmentation methods, SNIP [192] and SNIPER [193], to address the scale invariance problem, as summarized in Table 6. Motivated by the intuitive understanding that small and large objects are difficult to detect at smaller and larger scales respectively, Singh et al. presented a novel training scheme named SNIP, which can reduce scale variations during training without reducing the number of training samples. SNIPER [193] is an approach proposed for efficient multiscale training. It only processes context regions around ground truth objects at the appropriate scale instead of processing a whole image pyramid. Shrivastava et al. [189] and Lin et al. [131] explored approaches to handle the extreme foreground-background class imbalance issue. Wang et al. [216] proposed to train an adversarial network to generate examples with occlusions and deformations that are difficult for the object detector to recognize. There are some works focusing on developing better methods for nonmaximum suppression [16,87,207].
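As a small illustration of a category-preserving transformation, the sketch below flips an image horizontally and updates its bounding boxes to match. The function name `hflip`, the nested-list image representation, and the (x1, y1, x2, y2) box convention are assumptions made for this example; it is not taken from any of the cited methods.

```python
def hflip(image, boxes):
    """Flip an image (a list of pixel rows) left-right and update its boxes.

    Boxes are (x1, y1, x2, y2), with x measured in pixels from the left edge.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # x-coordinates are mirrored; x1 and x2 swap so that x1 <= x2 still holds.
    flipped_boxes = [(width - x2, y1, width - x1, y2)
                     for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes


# Toy 2 x 4 "image" with one box covering the leftmost column.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]
flipped_image, flipped_boxes = hflip(image, boxes)
print(flipped_boxes)  # the box now covers the rightmost column
```

In detection, unlike classification, every geometric augmentation has to be applied to the annotations as well as the pixels, which is the point the sketch is meant to make.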
5 Datasets and Performance Evaluation

Datasets
Datasets have played a key role throughout the history of object recognition research. They have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing performance of competing algorithms, but also pushing the field towards increasingly complex and challenging problems. The present access to large numbers of images on the Internet makes it possible to build comprehensive datasets of increasing numbers of images and categories in order to capture an ever greater richness and diversity of objects. The rise of large scale datasets with millions of images has paved the way for significant breakthroughs and enabled unprecedented performance in object recognition. Recognizing space limitations, we refer interested readers to several papers [53,54,129,179,107] for detailed description of related datasets.
Earlier datasets, such as Caltech101 or Caltech256, were criticized because of the lack of intraclass variation that they exhibit. As a result, SUN [221] was collected by finding images depicting various scene categories, and many of its images have scene and object annotations which can support scene recognition and object detection. Tiny Images [204] created a dataset at an unprecedented scale, giving comprehensive coverage of all object categories and scenes; however, its annotations were not manually verified and contain numerous errors, so two benchmarks (CIFAR10 and CIFAR100 [108]) with reliable labels were derived from Tiny Images.
PASCAL VOC [53,54], a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, set the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has grown to 20 categories that are common in everyday life, as shown in Fig. 15. ImageNet [44] contains over 14 million images and over 20,000 categories, and is the backbone of the ILSVRC [44,179] challenge, which has pushed object recognition research to new heights.
ImageNet has been criticized because the objects in the dataset tend to be large and well centered, making the dataset atypical of real world scenarios. With the goal of addressing this problem and pushing research to richer image understanding, researchers created the MS COCO database [129]. Images in MS COCO are complex everyday scenes containing common objects in their natural context, closer to real life, and objects are labeled using fully segmented instances to provide more accurate detector evaluation. The Places database [245] contains 10 million scene images, labeled with scene semantic categories, offering the opportunity for data hungry deep learning algorithms to reach human level recognition of visual patterns. More recently, Open Images [106] is a dataset of about 9 million images that have been annotated with image level labels and object bounding boxes. There are three famous challenges for generic object detection: PASCAL VOC [53,54], ILSVRC [179] and MS COCO [129]. Each challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardized evaluation software; and (ii) an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 8.
For the PASCAL VOC challenge, since 2009 the data consist of the previous years' images augmented with new images, allowing the number of images to grow each year and, more importantly, meaning that test results can be compared with the previous years' images.
ILSVRC [179] scales up PASCAL VOC's goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. The ILSVRC object detection challenge has been run annually from 2013 to the present. See Table 7 for a summary of these datasets.
The COCO object detection challenge is designed to push the state of the art in generic object detection forward, and has been run annually from 2015 to the present. It features two object detection tasks: using either bounding box output or object instance segmentation output. It has fewer object categories than ILSVRC (80 in COCO versus 200 in ILSVRC object detection) but more instances per category (11000 on average compared to about 2600 in ILSVRC object detection). In addition, it contains object segmentation annotations which are not currently available in ILSVRC. COCO introduced several new challenges: (1) it contains objects at a wide range of scales, including a high percentage of small objects (e.g., smaller than 1% of image area [192]), (2) objects are less iconic and amid clutter or heavy occlusion, and (3) the evaluation metric (see Table 9) encourages more accurate object localization.
COCO has become the most widely used dataset for generic object detection, with the dataset statistics for training, validation and testing summarized in Table 8. Starting in 2017, the test set has only the Dev and Challenge splits, where the Test-Dev split is the default test data, and results in papers are generally reported on Test-Dev to allow for fair comparison.
2018 saw the introduction of the Open Images Object Detection Challenge, following in the tradition of PASCAL VOC, ImageNet and COCO, but at an unprecedented scale. It offers a broader range of object classes than previous challenges, and has two tasks: bounding box object detection of 500 different classes and visual relationship detection which detects pairs of objects in particular relations.
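The COCO metric summarized in Table 9 averages AP over ten IOU thresholds and reports separate figures for small, medium and large objects. The sketch below spells out those two ingredients; the function names are hypothetical, the treatment of areas exactly equal to 32^2 or 96^2 is an assumption, and the per-threshold AP values are taken as given rather than computed.

```python
def coco_iou_thresholds():
    """IOU thresholds 0.50, 0.55, ..., 0.95 (the set {0.5 : 0.05 : 0.95})."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def size_bucket(area):
    """Assign an object to a size range from its area in pixels.

    Boundary handling (which bucket an area of exactly 32**2 or 96**2
    falls into) is an assumption of this sketch.
    """
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


def average_over_thresholds(ap_at_iou):
    """Average AP values given as a dict mapping IOU threshold -> AP."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_iou[t] for t in thresholds) / len(thresholds)


thresholds = coco_iou_thresholds()
# Hypothetical AP values that fall off as the IOU threshold gets stricter.
ap_at_iou = {t: 1.0 - t for t in thresholds}
print(average_over_thresholds(ap_at_iou))
```

Averaging over strict thresholds such as 0.9 and 0.95 is what makes this metric reward precise localization more than a single threshold of 0.5 does.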

Evaluation Criteria
There are three criteria for evaluating the performance of detection algorithms: detection speed (Frames Per Second, FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category specific manner, i.e., computed for each object category separately. In generic object detection, detectors are usually tested in terms of detecting a number of object categories. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the final measure of performance. More details on these metrics can be found in [53,54,179,84].
The standard outputs of a detector applied to a testing image I are the predicted detections {(b_j, c_j, p_j)}_j, indexed by j. A given detection (b, c, p) (omitting j for notational simplicity) denotes the predicted location (i.e., the Bounding Box, BB) b with its predicted category label c and its confidence level p. A predicted detection (b, c, p) is regarded as a True Positive (TP) if
• The predicted class label c is the same as the ground truth label c_g.
• The overlap ratio IOU (Intersection Over Union) [53,179] between the predicted BB b and the ground truth one b^g,
$$\mathrm{IOU}(b, b^g) = \frac{\mathrm{area}(b \cap b^g)}{\mathrm{area}(b \cup b^g)},$$
is not smaller than a predefined threshold ε. Here area(b ∩ b^g) denotes the intersection of the predicted and ground truth BBs, and area(b ∪ b^g) their union. A typical value of ε is 0.5.
Otherwise, it is considered as a False Positive (FP). The confidence level p is usually compared with some threshold β to determine whether the predicted class label c is accepted.
AP is computed separately for each of the object classes, based on Precision and Recall. For a given object class c and a testing image I_i, let {(b_ij, p_ij)}_{j=1}^{M} denote the detections returned by a detector, ranked by the confidence p_ij in decreasing order. Let B = {b^g_ik}_{k=1}^{K} be the ground truth boxes on image I_i for the given object class c. Each detection (b_ij, p_ij) is either a TP or a FP, which can be determined via the algorithm in Fig. 16. Based on the TP and FP detections, the precision P(β) and recall R(β) [53] can be computed as a function of the confidence threshold β, so by varying the confidence threshold different pairs (P, R) can be obtained, in principle allowing precision to be regarded as a function of recall, i.e., P(R), from which the Average Precision (AP) [53,179] can be found.
Table 9 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges.

Precision: The fraction of correct detections out of the total detections returned by the detector with confidence of at least β.
Recall: The fraction of all N_c objects detected by the detector having a confidence of at least β.
AP (Average Precision): Computed over the different levels of recall achieved by varying the confidence β.
• VOC: AP at a single IOU and averaged over all classes.
• ILSVRC: AP at a modified IOU and averaged over all classes.
• MS COCO:
  • AP_coco: mAP averaged over ten IOUs: {0.5 : 0.05 : 0.95};
  • AP_coco^small: mAP for small objects of area smaller than 32^2;
  • AP_coco^medium: mAP for objects of area between 32^2 and 96^2;
  • AP_coco^large: mAP for large objects of area bigger than 96^2.
AR (Average Recall): The maximum recall given a fixed number of detections per image, averaged over all categories and IOU thresholds.
• MS COCO:
  • AR_coco^{max=1}: AR given 1 detection per image;
  • AR_coco^{max=10}: AR given 10 detections per image;
  • AR_coco^{max=100}: AR given 100 detections per image;
  • AR_coco^small: AR for small objects of area smaller than 32^2;
  • AR_coco^medium: AR for objects of area between 32^2 and 96^2;
  • AR_coco^large: AR for large objects of area bigger than 96^2.

Fig. 16 The algorithm for determining TPs and FPs by greedily matching object detection results to ground truth boxes.

ACKNOWLEDGMENTS
• Li Liu is with the Information System Engineering Key Lab, College of Information System and Management, National University of Defense Technology, China. She is also a post doctor researcher at the Machine Vision Group, University of Oulu, Finland. email: li.liu@oulu.fi
• Matti Pietikäinen is with the Machine Vision Group, University of Oulu, Finland. email: {matti.pietikainen}@ee.oulu.fi
• Wanli Ouyang and Xiaogang Wang are with the Department of Electronic Engineering, Chinese University of Hong Kong, China. email: wanli.ouyang@gmail.com; xgwang@ee.cuhk.edu.hk

Performance
A large variety of detectors has appeared in the last several years, and the introduction of standard benchmarks such as PASCAL VOC [53,54], ImageNet [179] and COCO [129] has made it easier to compare detectors with respect to accuracy. However, as can be seen from our earlier discussion in Sections 3 and 4, it is difficult to objectively compare detectors in terms of accuracy, speed and memory alone, as they can differ in fundamental / contextual respects.
Although it may be impractical to compare every recently proposed detector, it is nevertheless highly valuable to integrate representative and publicly available detectors into a common platform and to compare them in a unified manner. There has been very limited work in this regard, except for the study by Huang et al. [96] of the trade off between accuracy and speed of three main families of detectors (Faster RCNN [175], RFCN [40] and SSD [136]), obtained by varying the backbone network, image resolution, the number of box proposals, etc.
As can be seen from Tables 3, 4, 5, 6 and 10, we have summarized the best reported performance of many methods on three widely used standard benchmarks. The results of these methods were reported on the same test benchmark, despite their differing in one or more of the respects discussed above.
Figs. 1 and 17 present a very brief overview of the state of the art, summarizing the best detection results of the PASCAL VOC, ILSVRC and MS COCO challenges. More results can be found at detection challenge websites [98,148,163]. In summary, the backbone network, the detection framework design and the availability of large scale datasets are the three most important factors in detection. Furthermore, ensembles of multiple models, the incorporation of context features, and data augmentation all help to achieve better accuracy.
In less than five years since AlexNet [109] was proposed, the Top5 error on ImageNet classification [179] with 1000 classes has dropped from 16% to 2%, as shown in Fig. 9. However, the mAP of the best performing detector [164] (which is only trained to detect 80 classes) on COCO [129] has only reached 73%, even at 0.5 IoU, illustrating clearly how object detection is much harder than image classification. The accuracy level achieved by the state of the art detectors is far from satisfying the requirements of general purpose practical applications, so there remains significant room for future improvement.

Conclusions
Generic object detection is an important and challenging problem in computer vision, and has received considerable attention. Thanks to the remarkable development of deep learning techniques, the field of object detection has dramatically evolved. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent achievements, provided a structural taxonomy for methods according to their roles in detection, summarized existing popular datasets and evaluation criteria, and discussed performance for the most representative methods.
Despite the tremendous successes achieved in the past several years (e.g., detection accuracy improving significantly from 23% in ILSVRC2013 to 73% in ILSVRC2017), there remains a huge gap between the state-of-the-art and human-level performance, especially in terms of open world learning. Much work remains to be done, which we see focused on the following eight domains:
(1) Open World Learning: The ultimate goal is to develop object detection systems that are capable of accurately and efficiently recognizing and localizing instances of all object categories (thousands or more object classes [43]) in all open world scenes, competing with the human visual system. Recent object detection algorithms are learned with limited datasets [53,54,129,179], recognizing and localizing the object categories included in the dataset, but blind, in principle, to other object categories outside the dataset, although ideally a powerful detection system should be able to recognize novel object categories [112,73]. Current detection datasets [53,179,129] contain only dozens to hundreds of categories, which is significantly fewer than those which can be recognized by humans. To achieve this goal, new large-scale labeled datasets with significantly more categories for generic object detection will need to be developed, since state of the art CNNs require extensive data to train well. However, collecting such massive amounts of data, particularly bounding box labels for object detection, is very expensive, especially for hundreds of thousands of categories.
(2) Better and More Efficient Detection Frameworks: One of the factors for the tremendous successes in generic object detection has been the development of better detection frameworks, both region-based (RCNN [65], Fast RCNN [64], Faster RCNN [175], Mask RCNN [80]) and one-stage detectors (YOLO [174], SSD [136]). Region-based detectors have the highest accuracy, but are too computationally intensive for embedded or real-time systems. One-stage detectors have the potential to be faster and simpler, but have not yet reached the accuracy of region-based detectors. One possible limitation is that state of the art object detectors depend heavily on the underlying backbone network, which has been initially optimized for image classification, causing a learning bias due to the differences between classification and detection. One potential strategy is therefore to learn object detectors from scratch, like the DSOD detector [186].
(3) Compact and Efficient Deep CNN Features: Another significant factor in the considerable progress in generic object detection has been the development of powerful deep CNNs, which have increased remarkably in depth, from several layers (e.g., AlexNet [110]) to hundreds of layers (e.g., ResNet [79], DenseNet [94]). These networks have millions to hundreds of millions of parameters, requiring massive data and power-hungry GPUs for training, again limiting their application to real-time / embedded applications. In response, there has been growing research interest in designing compact and lightweight networks [25,4,95,88,132,231], network compression and acceleration [34,97,195,121,124], and network interpretation and understanding [19,142,146].
(4) Robust Object Representations: One important factor which makes the object recognition problem so challenging is the great variability in real-world images, including viewpoint and lighting changes, object scale, object pose, object part deformations, background clutter, occlusions, changes in appearance, image blur, image resolution, noise, and camera limitations and distortions. Despite the advances in deep networks, they are still limited by a lack of robustness to these many variations [134,24], which significantly constrains the usability for real-world applications.
(5) Context Reasoning: Real-world objects typically coexist with other objects and environments. It has been recognized that contextual information (object relations, global scene statistics) helps object detection and recognition [155], especially in situations of small or occluded objects or poor image quality. There was extensive work preceding deep learning [143,152,171,47,59]; however, since the deep learning era there has been only very limited progress in exploiting contextual information [29,62,90]. How to efficiently and effectively incorporate contextual information remains to be explored, ideally guided by how humans are quickly able to guide their attention to objects of interest in natural scenes.
(6) Object Instance Segmentation: Continuing the trend of moving towards a richer and more detailed understanding of image content (e.g., from image classification to single object localization to object detection), a next challenge would be to tackle pixel-level object instance segmentation [129,80,93], as object instance segmentation can play an important role in many potential applications that require the precise boundaries of individual instances.
(7) Weakly Supervised or Unsupervised Learning: Current state of the art detectors employ fully-supervised models learned from labelled data with object bounding boxes or segmentation masks [54,129,179]. However, such fully supervised learning has serious limitations, where the assumption of bounding box annotations may become problematic, especially when the number of object categories is large. Fully supervised learning is not scalable in the absence of fully labelled training data, therefore it is valuable to study how the power of CNNs can be leveraged in weakly supervised or unsupervised detection [15,45,187].
(8) 3D Object Detection: The progress of depth cameras has enabled the acquisition of depth information in the form of RGB-D images or 3D point clouds. The depth modality can be employed to help object detection and recognition; however, there is only limited work in this direction [30,165,220], which might benefit from taking advantage of large collections of high quality CAD models [219].
The research field of generic object detection is still far from complete; given the massive algorithmic breakthroughs over the past five years, we remain optimistic of the opportunities over the next five years.
Table 10 Summarization of properties and performance of milestone detection frameworks for generic object detection. See Section 3 for detailed discussion. The architectures of some methods listed in this table are illustrated in Fig. 8. The properties of the backbone DCNNs can be found in Table 2.

Fig. 1 Recent evolution of object detection performance. We can observe significant performance (mean average precision) improvement since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis. (a) Results on the PASCAL VOC datasets: Detection results of winning entries in the VOC2007-2012 competitions (using only provided training data). (b) Top object detection competition results in ILSVRC2013-2017 (using only provided training data).
Fig. 4 Summary of challenges in generic object detection.
• Thousands of real-world object classes, structured and unstructured
• Thousands of object categories in the real world
• Requiring localizing and recognizing objects
• Large number of possible locations of objects
• Large-scale image/video data

Fig. 6 Milestones in generic object detection based on the point in time of the first arXiv version.

Fig. 9 Performance of winning entries in the ILSVRC competitions from 2011 to 2017 in the image classification task.

Fig. 11 Hourglass architectures: Conv1 to Conv5 are the main Conv blocks in backbone networks such as VGG or ResNet. Comparison of a number of Reverse Fusion Blocks (RFBs) commonly used in recent approaches.
… positive detection, per Fig. 16.
β (Confidence Threshold): A confidence threshold for computing P(β) and R(β).
… (h+10)); w × h is the size of a GT box.
MS COCO: Ten IOU thresholds ε ∈ {0.5 : 0.05 : 0.95}
P(β) …

Input: {(b_j, p_j)}_{j=1}^{M}: M predictions for image I for object class c, ranked by the confidence p_j in decreasing order; B = {b^g_k}_{k=1}^{K}: ground truth BBs on image I for object class c.
Output: a ∈ R^M: a binary vector indicating each (b_j, p_j) to be a TP or FP.
Initialize a = 0;
for j = 1, ..., M do
    Set A = ∅ and t = 0;
    foreach unmatched object b^g_k in B do
        if IOU(b_j, b^g_k) ≥ ε and IOU(b_j, b^g_k) > t then
            A = {b^g_k}; t = IOU(b_j, b^g_k);
        end
    end
    if A ≠ ∅ then
        Set a(j) = 1 since object prediction (b_j, p_j) is a TP;
        Remove the matched GT box in A from B, B = B − A.
    end
end
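The greedy matching procedure above, together with the IOU criterion and the AP definition from the Evaluation Criteria discussion, can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code of any challenge: the function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples, and the AP is the plain (uninterpolated) area under the precision-recall curve for a single image and class, whereas the benchmark toolkits pool detections over the whole test set and may interpolate the curve.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, eps=0.5):
    """Label each detection as TP (1) or FP (0) by greedy matching.

    detections: list of (box, confidence) for one image and one class.
    gt_boxes: ground truth boxes for the same image and class.
    Returns the labels ordered by decreasing confidence.
    """
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    unmatched = list(gt_boxes)
    flags = []
    for box, _ in ranked:
        best, best_iou = None, 0.0
        for gt in unmatched:
            overlap = iou(box, gt)
            if overlap >= eps and overlap > best_iou:
                best, best_iou = gt, overlap
        if best is not None:
            unmatched.remove(best)  # a ground truth box matches at most once
            flags.append(1)
        else:
            flags.append(0)
    return flags


def average_precision(flags, num_gt):
    """Uninterpolated area under the precision-recall curve.

    flags: TP/FP labels ordered by decreasing confidence.
    num_gt: number of ground truth objects of this class.
    """
    tp, ap = 0, 0.0
    for rank, flag in enumerate(flags, start=1):
        if flag:
            tp += 1
            # Recall rises by 1/num_gt at each TP; weight by precision there.
            ap += (tp / rank) / num_gt
    return ap


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((0, 0, 10, 10), 0.9),    # exact hit on the first object
        ((1, 1, 11, 11), 0.8),    # duplicate of the first object
        ((20, 20, 30, 30), 0.7)]  # exact hit on the second object
flags = match_detections(dets, gt)
ap = average_precision(flags, num_gt=len(gt))
print(flags, ap)
```

In the example the second detection overlaps the first object well, but that ground truth box has already been claimed by a higher-confidence detection, so the duplicate is counted as a false positive.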

Fig. 17 Evolution of object detection performance on COCO (Test-Dev results). Results are quoted from [64,80,176] accordingly. The backbone network, the design of detection framework and the availability of good and large scale datasets are the three most important factors in detection.

Table 1
Summarization of a number of related surveys since 2000.

Table 2
DCNN architectures that are commonly used for generic object detection. Regarding the statistics for "#Paras" and "#Layers", we did not consider the final FC prediction layer. The "TestError" column indicates the Top 5 classification test error on ImageNet1000. Explanations: OverFeat (accurate model), DenseNet201 (Growth Rate 32, DenseNet-BC), and ResNeXt50 (32*4d).
The first DCNN; The historical turning point of feature representation from traditional to CNN; In the classification task of ILSVRC2012 competition, achieved a winning Top 5 test error rate of 15.3%, compared to 26.2% given by the second best entry.
[173] Design dense block, which connects each layer to every other layer in a feed forward fashion; Alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
[173] Similar to VGGNet, but with significantly fewer parameters due to the use of fewer filters at each layer.
Proposing a novel block called Squeeze and Excitation to model feature channel relationship; Can be flexibly used in all existing CNNs to improve recognition performance at minimal additional computational cost.
which reported with COCO2015 Test-Standard.
[200]ptive Field Block, RFB); Proposed RFB to improve SSD; RFB is a multibranch convolutional block similar to the Inception block [200], but with dilated CONV layers.

Table 4
Summarization of detectors that exploit context information, similar to Table 3.

Table 5
Summarization of object proposal methods using DCNN. The numbers in blue color denote the number of object proposals. The detection results on COCO are mAP@IoU[0.5, 0.95], unless stated otherwise.
CVPR14 Among the first to explore DCNN for object proposals; Learns a class agnostic regressor on a small set of 800 predefined anchor boxes; Does not share features with the detection network.
Introduced a classification network (i.e., two class Fast RCNN) cascade that comes after the RPN. Not sharing features extracted for detection.

Table 6
Representative methods for training strategies and class imbalance handling. Results on COCO are reported with Test-Dev.

Table 7
Popular databases for object recognition. Some example images from MNIST, Caltech101, CIFAR10, PASCAL VOC and ImageNet are shown in Fig. 15.

Table 8
Statistics of commonly used object detection datasets. Object statistics for VOC challenges list the nondifficult objects used in the evaluation (all annotated objects). For the COCO challenge, prior to 2017, the test set had four splits (Dev, Standard, Reserve, and Challenge), with each having about 20K images. Starting in 2017, the test set has only the Dev and Challenge splits, with the other two splits removed.
Algorithm 1: The algorithm for greedily matching object detection results (for an object category) to ground truth boxes.

Table 9
Summarization of commonly used metrics for evaluating object detectors.