1 Introduction

As a longstanding, fundamental and challenging problem in computer vision, object detection (illustrated in Fig. 1) has been an active area of research for several decades (Fischler and Elschlager 1973). The goal of object detection is to determine whether there are any instances of objects from given categories (such as humans, cars, bicycles, dogs or cats) in an image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box Everingham et al. 2010; Russakovsky et al. 2015). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection supports a wide range of applications, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.

Recently, deep learning techniques (Hinton and Salakhutdinov 2006; LeCun et al. 2015) have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided major improvements in object detection, as illustrated in Fig. 3.

As illustrated in Fig. 2, object detection can be grouped into one of two types (Grauman and Leibe 2011; Zhang et al. 2013): detection of specific instances versus the detection of broad categories. The first type aims to detect instances of a particular object (such as Donald Trump’s face, the Eiffel Tower, or a neighbor’s dog), essentially a matching problem. The goal of the second type is to detect (usually previously unseen) instances of some predefined object categories (for example humans, cars, bicycles, and dogs). Historically, much of the effort in the field of object detection has focused on the detection of a single category (typically faces and pedestrians) or a few specific categories. In contrast, over the past several years, the research community has started moving towards the more challenging goal of building general purpose object detection systems where the breadth of object detection ability rivals that of humans.

Fig. 1

Most frequent keywords in ICCV and CVPR conference papers from 2016 to 2018. The size of each word is proportional to the frequency of that keyword. We can see that object detection has received significant attention in recent years

Fig. 2

Object detection includes localizing instances of a particular object (top), as well as generalizing to detecting object categories in general (bottom). This survey focuses on recent advances for the latter problem of generic object detection

Fig. 3

An overview of recent object detection performance: we can observe a significant improvement in performance (measured as mean average precision) since the arrival of deep learning in 2012. a Detection results of winning entries in the VOC2007-2012 competitions, and b top object detection competition results in ILSVRC2013-2017 (results in both panels use only the provided training data)

Krizhevsky et al. (2012a) proposed a Deep Convolutional Neural Network (DCNN) called AlexNet which achieved record breaking image classification accuracy in the Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015). Since that time, the research focus in most aspects of computer vision has been specifically on deep learning methods, indeed including the domain of generic object detection (Girshick et al. 2014; He et al. 2014; Girshick 2015; Sermanet et al. 2014; Ren et al. 2017). Although tremendous progress has been achieved, illustrated in Fig. 3, we are unaware of comprehensive surveys of this subject over the past 5 years. Given the exceptionally rapid rate of progress, this article attempts to track recent advances and summarize their achievements in order to gain a clearer picture of the current panorama in generic object detection.

Table 1 Summary of related object detection surveys since 2000

1.1 Comparison with Previous Reviews

Many notable object detection surveys have been published, as summarized in Table 1. These include many excellent surveys on the problem of specific object detection, such as pedestrian detection (Enzweiler and Gavrila 2009; Geronimo et al. 2010; Dollar et al. 2012), face detection (Yang et al. 2002; Zafeiriou et al. 2015), vehicle detection (Sun et al. 2006) and text detection (Ye and Doermann 2015). There are comparatively few recent surveys focusing directly on the problem of generic object detection, except for the work by Zhang et al. (2013) who conducted a survey on the topic of object class detection. However, the research reviewed in Grauman and Leibe (2011), Andreopoulos and Tsotsos (2013) and Zhang et al. (2013) is mostly pre-2012, and therefore prior to the recent striking success and dominance of deep learning and related methods.

Deep learning allows computational models to learn fantastically complex, subtle, and abstract representations, driving significant progress in a broad range of problems such as visual recognition, object detection, speech recognition, natural language processing, medical image analysis, drug discovery and genomics. Among different types of deep neural networks, DCNNs (LeCun et al. 1998, 2015; Krizhevsky et al. 2012a) have brought about breakthroughs in processing images, video, speech and audio. To be sure, there have been many published surveys on deep learning, including that of Bengio et al. (2013), LeCun et al. (2015), Litjens et al. (2017), Gu et al. (2018), and more recently in tutorials at ICCV and CVPR.

In contrast, although many deep learning based methods have been proposed for object detection, we are unaware of any comprehensive recent survey. A thorough review and summary of existing work is essential for further progress in object detection, particularly for researchers wishing to enter the field. Since our focus is on generic object detection, the extensive work on DCNNs for specific object detection, such as face detection (Li et al. 2015a; Zhang et al. 2016a; Hu et al. 2017), pedestrian detection (Zhang et al. 2016b; Hosang et al. 2015), vehicle detection (Zhou et al. 2016b) and traffic sign detection (Zhu et al. 2016b) will not be considered.

1.2 Scope

The number of papers on generic object detection based on deep learning is breathtaking. There are so many, in fact, that compiling any comprehensive review of the state of the art is beyond the scope of any reasonable length paper. As a result, it is necessary to establish selection criteria, and we have limited our focus to top journal and conference papers. Due to these limitations, we sincerely apologize to those authors whose works are not included in this paper. For surveys of work on related topics, readers are referred to the articles in Table 1. This survey focuses on major progress of the last 5 years, and we restrict our attention to still pictures, leaving the important subject of video object detection as a topic for separate consideration in the future.

The main goal of this paper is to offer a comprehensive survey of deep learning based generic object detection techniques, and to present some degree of taxonomy, a high level perspective and organization, primarily on the basis of popular datasets, evaluation metrics, context modeling, and detection proposal methods. The intention is that our categorization be helpful for readers to have an accessible understanding of similarities and differences between a wide variety of strategies. The proposed taxonomy gives researchers a framework to understand current research and to identify open challenges for future research.

The remainder of this paper is organized as follows. Related background and the progress made during the last 2 decades are summarized in Sect. 2. A brief introduction to deep learning is given in Sect. 3. Popular datasets and evaluation criteria are summarized in Sect. 4. We describe the milestone object detection frameworks in Sect. 5. From Sects. 6 to 9, fundamental sub-problems and the relevant issues involved in designing object detectors are discussed. Finally, in Sect. 10, we conclude the paper with an overall discussion of object detection, state-of-the-art performance, and future research directions.

2 Generic Object Detection

2.1 The Problem

Generic object detection, also called generic object category detection, object class detection, or object category detection (Zhang et al. 2013), is defined as follows. Given an image, determine whether or not there are instances of objects from predefined categories (usually many categories, e.g., 200 categories in the ILSVRC object detection challenge) and, if present, to return the spatial location and extent of each instance. A greater emphasis is placed on detecting a broad range of natural categories, as opposed to specific object category detection where only a narrower predefined category of interest (e.g., faces, pedestrians, or cars) may be present. Although thousands of objects occupy the visual world in which we live, currently the research community is primarily interested in the localization of highly structured objects (e.g., cars, faces, bicycles and airplanes) and articulated objects (e.g., humans, cows and horses) rather than unstructured scenes (such as sky, grass and cloud).

The spatial location and extent of an object can be defined coarsely using a bounding box (an axis-aligned rectangle tightly bounding the object) (Everingham et al. 2010; Russakovsky et al. 2015), a precise pixelwise segmentation mask (Zhang et al. 2013), or a closed boundary (Lin et al. 2014; Russell et al. 2008), as illustrated in Fig. 4. To the best of our knowledge, for the evaluation of generic object detection algorithms, it is bounding boxes which are most widely used in the current literature (Everingham et al. 2010; Russakovsky et al. 2015), and therefore this is also the approach we adopt in this survey. However, as the research community moves towards deeper scene understanding (from image level object classification to single object localization, to generic object detection, and to pixelwise object segmentation), it is anticipated that future challenges will be at the pixel level (Lin et al. 2014).

Fig. 4

Recognition problems related to generic object detection: a image level object classification, b bounding box level generic object detection, c pixel-wise semantic segmentation, d instance level semantic segmentation

There are many problems closely related to that of generic object detection. The goal of object classification or object categorization (Fig. 4a) is to assess the presence of objects from a given set of object classes in an image; i.e., assigning one or more object class labels to a given image, determining the presence without the need of location. The additional requirement to locate the instances in an image makes detection a more challenging task than classification. The object recognition problem denotes the more general problem of identifying/localizing all the objects present in an image, subsuming the problems of object detection and classification (Everingham et al. 2010; Russakovsky et al. 2015; Opelt et al. 2006; Andreopoulos and Tsotsos 2013). Generic object detection is closely related to semantic image segmentation (Fig. 4c), which aims to assign each pixel in an image to a semantic class label. Object instance segmentation (Fig. 4d) aims to distinguish different instances of the same object class, as opposed to semantic segmentation which does not.

Fig. 5

Taxonomy of challenges in generic object detection

2.2 Main Challenges

The ideal of generic object detection is to develop a general-purpose algorithm that achieves two competing goals of high quality/accuracy and high efficiency (Fig. 5). As illustrated in Fig. 6, high quality detection must accurately localize and recognize objects in images or video frames, such that the large variety of object categories in the real world can be distinguished (i.e., high distinctiveness), and that object instances from the same category, subject to intra-class appearance variations, can be localized and recognized (i.e., high robustness). High efficiency requires that the entire detection task runs in real time with acceptable memory and storage demands.

Fig. 6

Changes in appearance of the same class with variations in imaging conditions (a–h). There is an astonishing variation in what is meant to be a single object class (i). In contrast, the four images in j appear very similar, but in fact are from four different object classes. Most images are from ImageNet (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014)

Fig. 7

Milestones of object detection and recognition, including feature representations (Csurka et al. 2004; Dalal and Triggs 2005; He et al. 2016; Krizhevsky et al. 2012a; Lazebnik et al. 2006; Lowe 1999, 2004; Perronnin et al. 2010; Simonyan and Zisserman 2015; Sivic and Zisserman 2003; Szegedy et al. 2015; Viola and Jones 2001; Wang et al. 2009), detection frameworks (Felzenszwalb et al. 2010b; Girshick et al. 2014; Sermanet et al. 2014; Uijlings et al. 2013; Viola and Jones 2001), and datasets (Everingham et al. 2010; Lin et al. 2014; Russakovsky et al. 2015). The time period up to 2012 is dominated by handcrafted features; a transition took place in 2012 with the development of DCNNs for image classification by Krizhevsky et al. (2012a), and methods after 2012 are dominated by related deep networks. Most of the listed methods are highly cited and won a major ICCV or CVPR prize. See Sect. 2.3 for details

2.2.1 Accuracy Related Challenges

Challenges in detection accuracy stem from (1) the vast range of intra-class variations and (2) the huge number of object categories.

Intra-class variations can be divided into two types: intrinsic factors and imaging conditions. In terms of intrinsic factors, each object category can have many different object instances, possibly varying in one or more of color, texture, material, shape, and size, such as the “chair” category shown in Fig. 6i. Even in a more narrowly defined class, such as human or horse, object instances can appear in different poses, subject to nonrigid deformations or with the addition of clothing.

Imaging condition variations are caused by the dramatic impacts unconstrained environments can have on object appearance, such as lighting (dawn, day, dusk, indoors), physical location, weather conditions, cameras, backgrounds, illuminations, occlusion, and viewing distances. All of these conditions produce significant variations in object appearance, such as illumination, pose, scale, occlusion, clutter, shading, blur and motion, with examples illustrated in Fig. 6a–h. Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortions.

In addition to intra-class variations, the large number of object categories, on the order of \(10^4\)–\(10^5\), demands great discrimination power from the detector to distinguish between subtly different inter-class variations, as illustrated in Fig. 6j. In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC (Everingham et al. 2010), ILSVRC (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014) respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than can be recognized by humans.

2.2.2 Efficiency and Scalability Related Challenges

The prevalence of social media networks and mobile/wearable devices has led to increasing demands for analyzing visual data. However, mobile/wearable devices have limited computational capabilities and storage space, making efficient object detection critical.

The efficiency challenges stem from the need to both localize and recognize objects: computational complexity grows with the (possibly large) number of object categories and with the (possibly very large) number of locations and scales within a single image, such as the examples in Fig. 6c, d.

A further challenge is that of scalability: A detector should be able to handle previously unseen objects, unknown situations, and high data rates. As the number of images and the number of categories continue to grow, it may become impossible to annotate them manually, forcing a reliance on weakly supervised strategies.

Fig. 8

a Illustration of three operations that are repeatedly applied by a typical CNN: convolution with a number of linear filters; nonlinearities (e.g. ReLU); and local pooling (e.g. max pooling). The M feature maps from a previous layer are convolved with N different filters (here shown as size \(3\times 3\times M\)), using a stride of 1. The resulting N feature maps are then passed through a nonlinear function (e.g. ReLU), and pooled (e.g. taking a maximum over \(2\times 2\) regions) to give N feature maps at a reduced resolution. b Illustration of the architecture of VGGNet (Simonyan and Zisserman 2015), a typical CNN with 11 weight layers. An image with 3 color channels is presented as the input. The network has 8 convolutional layers, 3 fully connected layers, 5 max pooling layers and a softmax classification layer. The last three fully connected layers take features from the top convolutional layer as input in vector form. The final layer is a C-way softmax function, C being the number of classes. The whole network can be learned from labeled training data by optimizing an objective function (e.g. mean squared error or cross entropy loss) via stochastic gradient descent

2.3 Progress in the Past 2 Decades

Early research on object recognition was based on template matching techniques and simple part-based models (Fischler and Elschlager 1973), focusing on specific objects whose spatial layouts are roughly rigid, such as faces. Before 1990 the leading paradigm of object recognition was based on geometric representations (Mundy 2006; Ponce et al. 2007), with the focus later moving away from geometry and prior models towards the use of statistical classifiers [such as Neural Networks (Rowley et al. 1998), SVM (Osuna et al. 1997) and Adaboost (Viola and Jones 2001; Xiao et al. 2003)] based on appearance features (Murase and Nayar 1995a; Schmid and Mohr 1997). This successful family of object detectors set the stage for most subsequent research in this field.

The milestones of object detection in more recent years are presented in Fig. 7, in which two main eras (SIFT vs. DCNN) are highlighted. The appearance features moved from global representations (Murase and Nayar 1995b; Swain and Ballard 1991; Turk and Pentland 1991) to local representations that are designed to be invariant to changes in translation, scale, rotation, illumination, viewpoint and occlusion. Handcrafted local invariant features gained tremendous popularity, starting from the Scale Invariant Feature Transform (SIFT) feature (Lowe 1999), and the progress on various visual recognition tasks was based substantially on the use of local descriptors (Mikolajczyk and Schmid 2005) such as Haar-like features (Viola and Jones 2001), SIFT (Lowe 2004), Shape Contexts (Belongie et al. 2002), Histogram of Gradients (HOG) (Dalal and Triggs 2005), Local Binary Patterns (LBP) (Ojala et al. 2002), and region covariances (Tuzel et al. 2006). These local features are usually aggregated by simple concatenation or feature pooling encoders such as the Bag of Visual Words approach, introduced by Sivic and Zisserman (2003) and Csurka et al. (2004), Spatial Pyramid Matching (SPM) of BoW models (Lazebnik et al. 2006), and Fisher Vectors (Perronnin et al. 2010).

For years, the multistage hand tuned pipelines of handcrafted local descriptors and discriminative classifiers dominated a variety of domains in computer vision, including object detection, until the significant turning point in 2012 when DCNNs (Krizhevsky et al. 2012a) achieved their record-breaking results in image classification.

The use of CNNs for detection and localization (Rowley et al. 1998) can be traced back to the 1990s, with a modest number of hidden layers used for object detection (Vaillant et al. 1994; Rowley et al. 1998; Sermanet et al. 2013), successful in restricted domains such as face detection. However, more recently, deeper CNNs have led to record-breaking improvements in the detection of more general object categories, a shift which came about when the successful application of DCNNs in image classification (Krizhevsky et al. 2012a) was transferred to object detection, resulting in the milestone Region-based CNN (RCNN) detector of Girshick et al. (2014).

The successes of deep detectors rely heavily on vast training data and large networks with millions or even billions of parameters. The availability of GPUs with very high computational capability and of large-scale detection datasets [such as ImageNet (Deng et al. 2009; Russakovsky et al. 2015) and MS COCO (Lin et al. 2014)] plays a key role in their success. Large datasets have allowed researchers to target more realistic and complex problems from images with large intra-class variations and inter-class similarities (Lin et al. 2014; Russakovsky et al. 2015). However, accurate annotations are labor intensive to obtain, so detectors must consider methods that can relieve annotation difficulties or can learn with smaller training datasets.

The research community has started moving towards the challenging goal of building general purpose object detection systems whose ability to detect many object categories matches that of humans. This is a major challenge: according to cognitive scientists, human beings can identify around 3000 entry level categories and 30,000 visual categories overall, and the number of categories distinguishable with domain expertise may be on the order of \(10^5\) (Biederman 1987a). Despite the remarkable progress of the past years, designing an accurate, robust, efficient detection and recognition system that approaches human-level performance on \(10^4\)–\(10^5\) categories is undoubtedly an unresolved problem.

3 A Brief Introduction to Deep Learning

Deep learning has revolutionized a wide range of machine learning tasks, from image classification and video processing to speech recognition and natural language understanding. Given this tremendously rapid evolution, there exist many recent survey papers on deep learning (Bengio et al. 2013; Goodfellow et al. 2016; Gu et al. 2018; LeCun et al. 2015; Litjens et al. 2017; Pouyanfar et al. 2018; Wu et al. 2019; Young et al. 2018; Zhang et al. 2018d; Zhou et al. 2018a; Zhu et al. 2017). These surveys have reviewed deep learning techniques from different perspectives (Bengio et al. 2013; Goodfellow et al. 2016; Gu et al. 2018; LeCun et al. 2015; Pouyanfar et al. 2018; Wu et al. 2019; Zhou et al. 2018a), or with applications to medical image analysis (Litjens et al. 2017), natural language processing (Young et al. 2018), speech recognition systems (Zhang et al. 2018d), and remote sensing (Zhu et al. 2017).

Convolutional Neural Networks (CNNs), the most representative models of deep learning, are able to exploit the basic properties underlying natural signals: translation invariance, local connectivity, and compositional hierarchies (LeCun et al. 2015). A typical CNN, illustrated in Fig. 8, has a hierarchical structure and is composed of a number of layers to learn representations of data with multiple levels of abstraction (LeCun et al. 2015). We begin with a convolution

$$\begin{aligned} {\varvec{{x}}}^{l-1} * {\varvec{{w}}}^{l} \end{aligned}$$

between an input feature map \({\varvec{{x}}}^{l-1}\) from the previous layer \(l-1\) and a 2D convolutional kernel (or filter or weights) \({\varvec{{w}}}^{l}\). This convolution is applied over a sequence of layers, each subject to a nonlinear operation \(\sigma \), such that

$$\begin{aligned} {\varvec{{x}}}^l_j = \sigma \left( \sum _{i=1}^{N^{l-1}} {\varvec{{x}}}^{l-1}_i * {\varvec{{w}}}^{l}_{i, j} +b^{l}_j\right) , \end{aligned}$$

with a convolution now between the \(N^{l-1}\) input feature maps \({\varvec{{x}}}^{l-1}_i\) and the corresponding kernel \({\varvec{{w}}}^{l}_{i, j}\), plus a bias term \(b^{l}_j\). The elementwise nonlinear function \(\sigma (\cdot )\) is typically a rectified linear unit (ReLU) for each element,

$$\begin{aligned} \sigma (x) = \max \{x, 0\}. \end{aligned}$$

Finally, pooling corresponds to the downsampling/upsampling of feature maps. These three operations (convolution, nonlinearity, pooling) are illustrated in Fig. 8a; CNNs having a large number of layers, a “deep” network, are referred to as Deep CNNs (DCNNs), with a typical DCNN architecture illustrated in Fig. 8b.
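The three operations above can be sketched in a few lines of plain Python. This is a minimal illustration rather than an efficient implementation: the function names are hypothetical, feature maps are nested lists, the convolution uses no padding and a stride of 1, and (as in most deep learning libraries) the kernel is not flipped, so strictly speaking it computes a cross-correlation.

```python
from typing import List

FeatureMap = List[List[float]]  # one 2D feature map, stored as rows of floats


def conv2d_valid(x: FeatureMap, w: FeatureMap) -> FeatureMap:
    """Slide kernel w over x with stride 1 and no padding (kernel not flipped)."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[r + a][c + b] * w[a][b] for a in range(kh) for b in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(x: FeatureMap) -> FeatureMap:
    """Elementwise sigma(x) = max{x, 0}."""
    return [[max(v, 0.0) for v in row] for row in x]


def conv_layer(xs: List[FeatureMap],
               ws: List[List[FeatureMap]],
               bs: List[float]) -> List[FeatureMap]:
    """Compute x^l_j = sigma(sum_i x^{l-1}_i * w^l_{i,j} + b^l_j) for every j.

    xs[i] is input map i, ws[i][j] is the kernel linking input i to output j,
    and bs[j] is the bias of output map j.
    """
    outputs = []
    for j, bias in enumerate(bs):
        acc = None
        for i, x in enumerate(xs):
            y = conv2d_valid(x, ws[i][j])
            if acc is None:
                acc = y
            else:
                acc = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(acc, y)]
        outputs.append(relu([[v + bias for v in row] for row in acc]))
    return outputs


def max_pool(x: FeatureMap, size: int = 2) -> FeatureMap:
    """Non-overlapping max pooling over size x size regions (downsampling)."""
    return [
        [
            max(x[r + a][c + b] for a in range(size) for b in range(size))
            for c in range(0, len(x[0]) - size + 1, size)
        ]
        for r in range(0, len(x) - size + 1, size)
    ]
```

With M input maps and N output maps, `ws` holds M × N kernels, matching the indices \(i\) and \(j\) in the layer equation; pooling is then applied to each output map independently.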

Most layers of a CNN consist of a number of feature maps, within which each pixel acts like a neuron. Each neuron in a convolutional layer is connected to feature maps of the previous layer through a set of weights \({\varvec{{w}}}_{i,j}\) (essentially a set of 2D filters). As can be seen in Fig. 8b, where the early CNN layers are typically composed of convolutional and pooling layers, the later layers are normally fully connected. From earlier to later layers, the input image is repeatedly convolved, and with each layer, the receptive field or region of support increases. In general, the initial CNN layers extract low-level features (e.g., edges), with later layers extracting more general features of increasing complexity (Zeiler and Fergus 2014; Bengio et al. 2013; LeCun et al. 2015; Oquab et al. 2014).
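The growth of the receptive field with depth can be made concrete with a small sketch. It assumes the standard recurrence for a stack of convolution and pooling layers without dilation, in which each layer widens the receptive field by (kernel size − 1) times the product of all preceding strides; the function name and the layer configuration in the usage note are illustrative only.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field (in input pixels, along one axis) of one output unit.

    layers is a sequence of (kernel_size, stride) pairs ordered from the
    input towards the output. Padding does not change the size of the
    receptive field and dilation is not modelled in this sketch.
    """
    field, jump = 1, 1  # jump = spacing of adjacent units, measured in input pixels
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field
```

For example, `receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)])` describes two 3 × 3 convolutions, a 2 × 2 pooling layer with stride 2, and a further 3 × 3 convolution; the pooling stride doubles the contribution of every later layer, which is why deeper units see progressively larger image regions.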

DCNNs have a number of outstanding advantages: a hierarchical structure to learn representations of data with multiple levels of abstraction, the capacity to learn very complex functions, and learning feature representations directly and automatically from data with minimal domain knowledge. What has particularly made DCNNs successful has been the availability of large scale labeled datasets and of GPUs with very high computational capability.

Despite the great successes, known deficiencies remain. In particular, there is an extreme need for labeled training data and a requirement of expensive computing resources, and considerable skill and experience are still needed to select appropriate learning parameters and network architectures. Trained networks are poorly interpretable, there is a lack of robustness to degradations, and many DCNNs have shown serious vulnerability to attacks (Goodfellow et al. 2015), all of which currently limit the use of DCNNs in real-world applications.

Table 2 Popular databases for object recognition

4 Datasets and Performance Evaluation

4.1 Datasets

Datasets have played a key role throughout the history of object recognition research, not only as a common ground for measuring and comparing the performance of competing algorithms, but also pushing the field towards increasingly complex and challenging problems. In particular, recently, deep learning techniques have brought tremendous success to many visual recognition problems, and it is the large amounts of annotated data which play a key role in their success. Access to large numbers of images on the Internet makes it possible to build comprehensive datasets in order to capture a vast richness and diversity of objects, enabling unprecedented performance in object recognition.

Fig. 9

Some example images with object annotations from PASCAL VOC, ILSVRC, MS COCO and Open Images. See Table 2 for a summary of these datasets

For generic object detection, there are four famous datasets: PASCAL VOC (Everingham et al. 2010, 2015), ImageNet (Deng et al. 2009), MS COCO (Lin et al. 2014) and Open Images (Kuznetsova et al. 2018). The attributes of these datasets are summarized in Table 2, and selected sample images are shown in Fig. 9. There are three steps to creating large-scale annotated datasets: determining the set of target object categories, collecting a diverse set of candidate images to represent the selected categories on the Internet, and annotating the collected images, typically by designing crowdsourcing strategies. Recognizing space limitations, we refer interested readers to the original papers (Everingham et al. 2010, 2015; Lin et al. 2014; Russakovsky et al. 2015; Kuznetsova et al. 2018) for detailed descriptions of these datasets in terms of construction and properties.

Table 3 Statistics of commonly used object detection datasets

The four datasets form the backbone of their respective detection challenges. Each challenge consists of a publicly available dataset of images together with ground truth annotation and standardized evaluation software, and an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 3. The most frequent object classes in VOC, COCO, ILSVRC and Open Images detection datasets are visualized in Table 4.

PASCAL VOC (Everingham et al. 2010, 2015) is a multi-year effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, creating the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has increased to 20 categories that are common in everyday life. Since 2009, the number of images has grown every year, but with all previous images retained to allow test results to be compared from year to year. Due to the availability of larger datasets like ImageNet, MS COCO and Open Images, PASCAL VOC has gradually fallen out of fashion.

ILSVRC, the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), is derived from ImageNet (Deng et al. 2009), scaling up PASCAL VOC’s goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. ImageNet1000, a subset of ImageNet images with 1000 different object categories and a total of 1.2 million images, has been fixed to provide a standardized benchmark for the ILSVRC image classification challenge.

MS COCO is a response to the criticism of ImageNet that objects in its dataset tend to be large and well centered, making the ImageNet dataset atypical of real-world scenarios. To push for richer image understanding, researchers created the MS COCO database (Lin et al. 2014) containing complex everyday scenes with common objects in their natural context, closer to real life, where objects are labeled using fully-segmented instances to provide more accurate detector evaluation. The COCO object detection challenge (Lin et al. 2014) features two object detection tasks: using either bounding box output or object instance segmentation output. COCO introduced three new challenges:

  1. It contains objects at a wide range of scales, including a high percentage of small objects (Singh and Davis 2018);

  2. Objects are less iconic and amid clutter or heavy occlusion;

  3. The evaluation metric (see Table 5) encourages more accurate object localization.

Just like ImageNet in its time, MS COCO has become the standard for object detection today.

OICOD (the Open Image Challenge Object Detection) is derived from Open Images V4 (now V5 in 2019) (Kuznetsova et al. 2018), currently the largest publicly available object detection dataset. OICOD is different from previous large scale object detection datasets like ILSVRC and MS COCO, not merely in terms of the significantly increased number of classes, images, bounding box annotations and instance segmentation mask annotations, but also regarding the annotation process. In ILSVRC and MS COCO, instances of all classes in the dataset are exhaustively annotated, whereas for Open Images V4 a classifier was applied to each image and only those labels with sufficiently high scores were sent for human verification. Therefore in OICOD only the object instances of human-confirmed positive labels are annotated.

Table 4 Most frequent object classes for each detection challenge

4.2 Evaluation Criteria

There are three criteria for evaluating the performance of detection algorithms: detection speed in Frames Per Second (FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category specific manner, i.e., computed for each object category separately. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the final measure of performance (see Footnote 3). More details on these metrics can be found in Everingham et al. (2010), Everingham et al. (2015), Russakovsky et al. (2015), and Hoiem et al. (2012).

The standard outputs of a detector applied to a testing image \(\mathbf{I} \) are the predicted detections \(\{(b_j,c_j,p_j)\}_j\), indexed by object \(j\), of Bounding Box (BB) \(b_j\), predicted category \(c_j\), and confidence \(p_j\). A predicted detection \((b,c,p)\) is regarded as a True Positive (TP) if

  • The predicted category c equals the ground truth label \(c_g\).

  • The overlap ratio IOU (Intersection Over Union) (Everingham et al. 2010; Russakovsky et al. 2015)

    $$\begin{aligned} \text {IOU}(b,b^g)=\frac{{ area}\,(b\cap b^g)}{{ area}\,(b\cup b^g)}, \end{aligned}$$

    between the predicted BB \(b\) and the ground truth \(b^g\) is not smaller than a predefined threshold \(\varepsilon \), where \(\cap \) and \(\cup \) denote intersection and union, respectively. A typical value of \(\varepsilon \) is 0.5.

Otherwise, it is considered as a False Positive (FP). The confidence level p is usually compared with some threshold \(\beta \) to determine whether the predicted class label c is accepted.
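As a minimal sketch of the two conditions above, the following Python fragment computes the IOU of two axis-aligned boxes and applies the TP test. The corner representation `(x1, y1, x2, y2)` and the function names `iou` and `is_true_positive` are assumptions made for illustration; they are not taken from the toolkit of any cited benchmark.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_class, gt_box, gt_class, eps=0.5):
    """Both conditions must hold: matching label and IOU not smaller than eps."""
    return pred_class == gt_class and iou(pred_box, gt_box) >= eps
```

With \(\varepsilon = 0.5\), a prediction covering exactly the top half of a ground truth box of the same class sits precisely on the threshold and is still accepted, since the test is "not smaller than".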

AP is computed separately for each of the object classes, based on Precision and Recall. For a given object class \(c\) and a testing image \(\mathbf{I} _i\), let \(\{(b_{ij},p_{ij})\}_{j=1}^M\) denote the detections returned by a detector, ranked by confidence \(p_{ij}\) in decreasing order. Each detection \((b_{ij},p_{ij})\) is either a TP or an FP, which can be determined via the algorithm (see Footnote 4) in Fig. 10. Based on the TP and FP detections, the precision \(P(\beta )\) and recall \(R(\beta )\) (Everingham et al. 2010) can be computed as a function of the confidence threshold \(\beta \). By varying the confidence threshold, different pairs \((P,R)\) can be obtained, in principle allowing precision to be regarded as a function of recall, i.e. \(P(R)\), from which the Average Precision (AP) (Everingham et al. 2010; Russakovsky et al. 2015) can be found.
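The procedure can be sketched in Python for a single class and a single image. This is an illustrative sketch under stated assumptions, not a reproduction of Fig. 10 or of any benchmark toolkit: the matching rule is the commonly used greedy scheme in which each ground truth box can justify at most one detection, and AP is taken here as the area under the uninterpolated precision-recall curve, whereas benchmark toolkits typically use interpolated variants. The function names and the `(x1, y1, x2, y2)` box format are hypothetical, and at least one ground truth box is assumed.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, eps=0.5):
    """Greedily label detections as TP (True) or FP (False).

    detections: list of (box, confidence); gt_boxes: list of boxes.
    Returns the flags in order of decreasing confidence.
    """
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    matched = [False] * len(gt_boxes)
    flags = []
    for box, _ in ranked:
        # Find the ground truth box that overlaps this detection the most.
        best_iou, best_k = 0.0, -1
        for k, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_k = overlap, k
        # A ground truth box can justify at most one detection.
        if best_k >= 0 and best_iou >= eps and not matched[best_k]:
            matched[best_k] = True
            flags.append(True)
        else:
            flags.append(False)
    return flags


def average_precision(flags, num_gt):
    """Area under the uninterpolated precision-recall curve (num_gt > 0)."""
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, is_tp in enumerate(flags, start=1):
        tp += int(is_tp)
        precision, recall = tp / rank, tp / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Lowering the confidence threshold \(\beta \) corresponds to walking further down the ranked list, so each prefix of `flags` yields one \((P,R)\) pair. A duplicate detection of an already matched object is counted as an FP, which is why redundant boxes lower precision.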

Since the introduction of MS COCO, more attention has been placed on the accuracy of the bounding box location. Instead of using a fixed IOU threshold, MS COCO introduces a few metrics (summarized in Table 5) for characterizing the performance of an object detector. For instance, in contrast to the traditional mAP computed at a single IoU of 0.5, \(AP_{{ coco}}\) is averaged across all object categories and multiple IOU values from 0.5 to 0.95 in steps of 0.05. Because \(41\%\) of the objects in MS COCO are small and \(24\%\) are large, metrics \(AP_{{ coco}}^{{ small}}\), \(AP_{{ coco}}^{{ medium}}\) and \(AP_{{ coco}}^{{ large}}\) are also introduced. Finally, Table 5 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges, with metric modifications for the Open Images challenges proposed in Kuznetsova et al. (2018).
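The averaging behind \(AP_{{ coco}}\) can be sketched as follows. The nested-list input layout (one row of per-threshold AP values for each class) and the function names are assumptions made for illustration; only the threshold range, 0.5 to 0.95 in steps of 0.05, comes from the text.

```python
def coco_iou_thresholds():
    """The ten IOU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def ap_coco(ap_table):
    """Average AP over all classes and all IOU thresholds.

    ap_table[c][t] is the AP of class c at the t-th threshold.
    """
    thresholds = coco_iou_thresholds()
    total, count = 0.0, 0
    for per_threshold in ap_table:
        if len(per_threshold) != len(thresholds):
            raise ValueError("expected one AP value per IOU threshold")
        total += sum(per_threshold)
        count += len(per_threshold)
    return total / count
```

Because the stricter thresholds enter the average with equal weight, a detector whose boxes are only loosely aligned scores well at 0.5 but is penalized in the overall average.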

Table 5 Summary of commonly used metrics for evaluating object detectors

5 Detection Frameworks

There has been steady progress in object feature representations and classifiers for recognition, as evidenced by the dramatic change from handcrafted features (Viola and Jones 2001; Dalal and Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; Vedaldi et al. 2009) to learned DCNN features (Girshick et al. 2014; Ouyang et al. 2015; Girshick 2015; Ren et al. 2015; Dai et al. 2016c). In contrast, in terms of localization, the basic “sliding window” strategy (Dalal and Triggs 2005; Felzenszwalb et al. 2010b, 2008) remains mainstream, although with some efforts to avoid exhaustive search (Lampert et al. 2008; Uijlings et al. 2013). However, the number of windows is large and grows quadratically with the number of image pixels, and the need to search over multiple scales and aspect ratios further increases the search space. Therefore, the design of efficient and effective detection frameworks plays a key role in reducing this computational cost. Commonly adopted strategies include cascading, sharing feature computation, and reducing per-window computation.
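The growth of the search space can be made concrete with a short counting sketch. The helper names and the example window size and stride are illustrative assumptions, not settings of any cited detector.

```python
def count_all_windows(width, height):
    """Number of distinct axis-aligned sub-windows of a width x height pixel grid.

    A window is fixed by choosing its left/right columns and top/bottom rows,
    giving width*(width+1)/2 horizontal and height*(height+1)/2 vertical choices.
    """
    return (width * (width + 1) // 2) * (height * (height + 1) // 2)


def count_strided_windows(width, height, win_w, win_h, stride):
    """Windows of one fixed size visited by a strided sliding window."""
    if win_w > width or win_h > height:
        return 0
    return ((width - win_w) // stride + 1) * ((height - win_h) // stride + 1)


small = count_all_windows(100, 100)
large = count_all_windows(200, 200)  # four times as many pixels
print(large / small)                 # close to 4**2 = 16
```

Quadrupling the number of pixels multiplies the number of candidate windows by roughly sixteen, which is the quadratic growth referred to above. Fixing the window size and using a stride reduces the count sharply, but each additional scale or aspect ratio contributes another such term.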

Fig. 10 The algorithm for determining TPs and FPs by greedily matching object detection results to ground truth boxes

This section reviews detection frameworks, listed in Fig. 11 and Table 11, the milestone approaches appearing since deep learning entered the field, organized into two main categories:

  (a) Two stage detection frameworks, which include a preprocessing step for generating object proposals;

  (b) One stage detection frameworks, or region proposal free frameworks, which take a single-pass approach that does not separate out the generation of detection proposals.

Sections 6–9 will discuss fundamental sub-problems involved in detection frameworks in greater detail, including DCNN features, detection proposals, and context modeling.

Fig. 11 Milestones in generic object detection

5.1 Region Based (Two Stage) Frameworks

In a region-based framework, category-independent region proposals (see Footnote 5) are generated from an image, CNN (Krizhevsky et al. 2012a) features are extracted from these regions, and then category-specific classifiers are used to determine the category labels of the proposals. As can be observed from Fig. 11, DetectorNet (Szegedy et al. 2013), OverFeat (Sermanet et al. 2014), MultiBox (Erhan et al. 2014) and RCNN (Girshick et al. 2014) independently and almost simultaneously proposed using CNNs for generic object detection.

Fig. 12 Illustration of the RCNN detection framework (Girshick et al. 2014, 2016)

RCNN (Girshick et al. 2014): Inspired by the breakthrough image classification results obtained by CNNs and the success of selective search for region proposal with handcrafted features (Uijlings et al. 2013), Girshick et al. (2014, 2016) were among the first to explore CNNs for generic object detection and developed RCNN, which integrates AlexNet (Krizhevsky et al. 2012a) with selective search region proposals (Uijlings et al. 2013). As illustrated in detail in Fig. 12, training an RCNN framework is a multistage pipeline:

  1. Region proposal computation: Class agnostic region proposals, which are candidate regions that might contain objects, are obtained via selective search (Uijlings et al. 2013).

  2. CNN model finetuning: Region proposals, which are cropped from the image and warped into the same size, are used as the input for fine-tuning a CNN model pre-trained using a large-scale dataset such as ImageNet. At this stage, all region proposals with \(\geqslant 0.5\) IOU (see Footnote 6) overlap with a ground truth box are defined as positives for that ground truth box’s class and the rest as negatives.

  3. Class specific SVM classifiers training: A set of class-specific linear SVM classifiers are trained using fixed length features extracted with CNN, replacing the softmax classifier learned by fine-tuning. For training SVM classifiers, positive examples are defined to be the ground truth boxes for each class. A region proposal with less than 0.3 IOU overlap with all ground truth instances of a class is negative for that class. Note that the positive and negative examples defined for training the SVM classifiers are different from those for fine-tuning the CNN.

  4. Class specific bounding box regressor training: Bounding box regression is learned for each object class with CNN features.
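The two labeling rules in steps 2 and 3 differ in a way that is easy to miss, so a small sketch may help. The function names and the string labels are hypothetical. The "ignore" outcome for proposals that overlap a ground truth box by 0.3 or more without being the ground truth box itself reflects our reading of Girshick et al. (2014) and is not stated in the steps above.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_classes, pos_thresh=0.5):
    """Step 2: class of the best-overlapping ground truth box if IOU >= 0.5."""
    best_iou, best_cls = 0.0, "background"
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap >= pos_thresh and overlap > best_iou:
            best_iou, best_cls = overlap, cls
    return best_cls


def svm_label(proposal, cls, gt_boxes, gt_classes, neg_thresh=0.3):
    """Step 3: label of a proposal for the SVM of class `cls`.

    Positives are the ground truth boxes themselves, so a proposal is
    either a negative (IOU < 0.3 with every instance of the class) or
    left out of training (assumed "ignore" zone).
    """
    overlaps = [iou(proposal, b) for b, c in zip(gt_boxes, gt_classes) if c == cls]
    if all(o < neg_thresh for o in overlaps):
        return "negative"
    return "ignore"
```

A proposal with 0.4 IOU illustrates the difference: it is a background example during fine-tuning but contributes nothing to SVM training.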

In spite of achieving high object detection quality, RCNN has notable drawbacks (Girshick 2015):

  1. Training is a multistage pipeline, slow and hard to optimize because each individual stage must be trained separately.

  2. SVM classifier and bounding box regressor training is expensive in both disk space and time, because CNN features need to be extracted from each object proposal in each image, posing great challenges for large scale detection, particularly with very deep networks, such as VGG16 (Simonyan and Zisserman 2015).

  3. Testing is slow, since CNN features are extracted per object proposal in each test image, without shared computation.

All of these drawbacks have motivated successive innovations, leading to a number of improved detection frameworks such as SPPNet, Fast RCNN, Faster RCNN etc., as follows.

SPPNet (He et al. 2014): During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires the extraction of CNN features from thousands of warped region proposals per image. To address this, He et al. (2014) introduced traditional spatial pyramid pooling (SPP) (Grauman and Darrell 2005; Lazebnik et al. 2006) into CNN architectures. Since convolutional layers accept inputs of arbitrary sizes, the requirement of fixed-sized images in CNNs is due only to the Fully Connected (FC) layers; therefore, He et al. added an SPP layer on top of the last convolutional (CONV) layer to obtain features of fixed length for the FC layers. With this SPPNet, RCNN obtains a significant speedup without sacrificing any detection quality, because it only needs to run the convolutional layers once on the entire test image to generate fixed-length features for region proposals of arbitrary size. While SPPNet accelerates RCNN evaluation by orders of magnitude, it does not result in a comparable speedup of the detector training. Moreover, fine-tuning in SPPNet (He et al. 2014) is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks.
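The way an SPP layer turns a feature map of arbitrary size into a fixed-length vector can be sketched for a single channel. The pyramid levels `(1, 2, 4)`, the bin boundary rule and the function name are assumptions chosen for illustration; a real layer would do the same for every channel of the CONV output.

```python
import math


def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D feature map (list of rows) into sum(n*n) values.

    For each pyramid level n, the map is divided into an n x n grid of
    bins that together cover it, and the maximum of each bin is kept.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            y0, y1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                x0, x1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[y][x]
                               for y in range(y0, y1)
                               for x in range(x0, x1)))
    return out
```

With these levels the output always has 1 + 4 + 16 = 21 values per channel, whatever the height and width of the input, which is what allows the FC layers that follow to keep a fixed input size.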

Fig. 13 High level diagrams of the leading frameworks for generic object detection. The properties of these methods are summarized in Table 11

Fast RCNN (Girshick 2015): Girshick proposed Fast RCNN, which addresses some of the disadvantages of RCNN and SPPNet while improving on their detection speed and quality. As illustrated in Fig. 13, Fast RCNN enables end-to-end detector training by developing a streamlined training process that simultaneously learns a softmax classifier and class-specific bounding box regression, rather than separately training a softmax classifier, SVMs, and Bounding Box Regressors (BBRs) as in RCNN/SPPNet. Fast RCNN employs the idea of sharing the computation of convolution across region proposals, and adds a Region of Interest (RoI) pooling layer between the last CONV layer and the first FC layer to extract a fixed-length feature for each region proposal. Essentially, RoI pooling uses warping at the feature level to approximate warping at the image level. The features after the RoI pooling layer are fed into a sequence of FC layers that finally branch into two sibling output layers: softmax probabilities for object category prediction, and class-specific bounding box regression offsets for proposal refinement. Compared to RCNN/SPPNet, Fast RCNN improves the efficiency considerably, typically 3 times faster in training and 10 times faster in testing. In summary, Fast RCNN offers higher detection quality, a single training process that updates all network layers, and no storage required for feature caching.
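A single-channel sketch of RoI max pooling is given below. The RoI is assumed to be already expressed in integer feature-map cells with exclusive end coordinates, and the output size of 2 × 2 and the bin boundary rule are illustrative choices rather than the settings of Girshick (2015).

```python
import math


def roi_max_pool(fmap, roi, out_h=2, out_w=2):
    """Pool the RoI (x0, y0, x1, y1) of a 2D map into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    rh, rw = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * rh) // out_h
        ye = y0 + math.ceil((i + 1) * rh / out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * rw) // out_w
            xe = x0 + math.ceil((j + 1) * rw / out_w)
            # Each output cell keeps the strongest response in its bin.
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Every RoI, whatever its size, yields a grid of the same shape, so all proposals of an image can share one pass through the CONV layers and still feed the same FC layers.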

Faster RCNN (Ren et al. 2015, 2017): Although Fast RCNN significantly sped up the detection process, it still relies on external region proposals, whose computation is exposed as the new speed bottleneck in Fast RCNN. Recent work has shown that CNNs have a remarkable ability to localize objects in CONV layers (Zhou et al. 2015, 2016a; Cinbis et al. 2017; Oquab et al. 2015; Hariharan et al. 2016), an ability which is weakened in the FC layers. Therefore, the selective search can be replaced by a CNN in producing region proposals. The Faster RCNN framework proposed by Ren et al. (2015, 2017) offered an efficient and accurate Region Proposal Network (RPN) for generating region proposals. They utilize the same backbone network, using features from the last shared convolutional layer to accomplish the task of RPN for region proposal and Fast RCNN for region classification, as shown in Fig. 13.

RPN first initializes k reference boxes (i.e. the so called anchors) of different scales and aspect ratios at each CONV feature map location. The anchor positions are image content independent, but the feature vectors themselves, extracted from anchors, are image content dependent. Each anchor is mapped to a lower dimensional vector, which is fed into two sibling FC layers—an object category classification layer and a box regression layer. In contrast to detection in Fast RCNN, the features used for regression in RPN are of the same shape as the anchor box, thus k anchors lead to k regressors. RPN shares CONV features with Fast RCNN, thus enabling highly efficient region proposal computation. RPN is, in fact, a kind of Fully Convolutional Network (FCN) (Long et al. 2015; Shelhamer et al. 2017); Faster RCNN is thus a purely CNN based framework without using handcrafted features.
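Anchor generation can be sketched as follows. The stride of 16 and the three scales and three aspect ratios (giving k = 9) follow the defaults we recall from Ren et al. (2015), but they should be read as assumptions here, as should the convention that the aspect ratio is height divided by width and the function name.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors per feature-map location.

    Each anchor is (x1, y1, x2, y2) in image pixels, centred on the image
    position that corresponds to its feature-map cell.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)  # keeps the area equal to s * s
                    h = s * math.sqrt(r)  # so that h / w == r
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

The anchors depend only on the feature-map size, which is the sense in which their positions are image content independent; the image content enters through the features used to score and refine them.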

For the VGG16 model (Simonyan and Zisserman 2015), Faster RCNN can test at 5 FPS (including all stages) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 using 300 proposals per image. The initial Faster RCNN in Ren et al. (2015) contains several alternating training stages, later simplified in Ren et al. (2017).

Concurrent with the development of Faster RCNN, Lenc and Vedaldi (2015) challenged the role of region proposal generation methods such as selective search, studied the role of region proposal generation in CNN based detectors, and found that CNNs contain sufficient geometric information for accurate object detection in the CONV rather than FC layers. They showed the possibility of building integrated, simpler, and faster object detectors that rely exclusively on CNNs, removing region proposal generation methods such as selective search.

RFCN (Region based Fully Convolutional Network): While Faster RCNN is an order of magnitude faster than Fast RCNN, the fact that the region-wise sub-network still needs to be applied per RoI (several hundred RoIs per image) led Dai et al. (2016c) to propose the RFCN detector, which is fully convolutional (no hidden FC layers) with almost all computations shared over the entire image. As shown in Fig. 13, RFCN differs from Faster RCNN only in the RoI sub-network. In Faster RCNN, the computation after the RoI pooling layer cannot be shared, so Dai et al. (2016c) proposed using all CONV layers to construct a shared RoI sub-network, with RoI crops taken from the last layer of CONV features prior to prediction. However, Dai et al. (2016c) found that this naive design has considerably inferior detection accuracy, and conjectured that this is because deeper CONV layers are more sensitive to category semantics and less sensitive to translation, whereas object detection needs localization representations that respect translation variance. Based on this observation, Dai et al. (2016c) constructed a set of position-sensitive score maps by using a bank of specialized CONV layers as the FCN output, on top of which a position-sensitive RoI pooling layer is added. They showed that RFCN with ResNet101 (He et al. 2016) could achieve comparable accuracy to Faster RCNN, often at faster running times.

Mask RCNN: He et al. (2017) proposed Mask RCNN to tackle pixelwise object instance segmentation by extending Faster RCNN. Mask RCNN adopts the same two stage pipeline, with an identical first stage (RPN), but in the second stage, in parallel to predicting the class and box offset, Mask RCNN adds a branch which outputs a binary mask for each RoI. The new branch is a Fully Convolutional Network (FCN) (Long et al. 2015; Shelhamer et al. 2017) on top of a CNN feature map. In order to avoid the misalignments caused by the original RoI pooling (RoIPool) layer, a RoIAlign layer was proposed to preserve the pixel level spatial correspondence. With a backbone network ResNeXt101-FPN (Xie et al. 2017; Lin et al. 2017a), Mask RCNN achieved top results for the COCO object instance segmentation and bounding box object detection. It is simple to train, generalizes well, and adds only a small overhead to Faster RCNN, running at 5 FPS (He et al. 2017).
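The misalignment arises because RoIPool rounds RoI coordinates to whole feature-map cells. RoIAlign, as we understand He et al. (2017), instead reads the feature map at real-valued positions by bilinear interpolation. The sketch below shows only that sampling step for one channel; the function name and the clamping at the border are assumptions, and the choice and aggregation of sampling points per bin are omitted.

```python
import math


def bilinear_sample(fmap, y, x):
    """Value of a 2D map at the real-valued position (y, x), 0 <= y, x."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # clamp at the border
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

Rounding the position (0.25, 0.75) to a single cell would return one of the four neighbouring values, whereas interpolation returns a blend weighted by distance, so a small shift of the RoI produces a small change in the pooled feature.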

Chained Cascade Network and Cascade RCNN: The essence of cascade (Felzenszwalb et al. 2010a; Bourdev and Brandt 2005; Li and Zhang 2004) is to learn more discriminative classifiers by using multistage classifiers, such that early stages discard a large number of easy negative samples so that later stages can focus on handling more difficult examples. Two-stage object detection can be considered as a cascade, the first detector removing large amounts of background, and the second stage classifying the remaining regions. Recently, end-to-end learning of more than two cascaded classifiers and DCNNs for generic object detection was proposed in the Chained Cascade Network (Ouyang et al. 2017a), extended in Cascade RCNN (Cai and Vasconcelos 2018), and more recently applied for simultaneous object detection and instance segmentation (Chen et al. 2019a), winning the COCO 2018 Detection Challenge.

Light Head RCNN: In order to further increase the detection speed of RFCN (Dai et al. 2016c), Li et al. (2018c) proposed Light Head RCNN, making the head of the detection network as light as possible to reduce the RoI computation. In particular, Li et al. (2018c) applied a convolution to produce thin feature maps with small channel numbers (e.g., 490 channels for COCO) and a cheap RCNN sub-network, leading to an excellent trade-off of speed and accuracy.

5.2 Unified (One Stage) Frameworks

The region-based pipeline strategies of Sect. 5.1 have dominated since RCNN (Girshick et al. 2014), such that the leading results on popular benchmark datasets are all based on Faster RCNN (Ren et al. 2015). Nevertheless, region-based approaches are computationally expensive for current mobile/wearable devices, which have limited storage and computational capability. Therefore, instead of trying to optimize the individual components of a complex region-based pipeline, researchers have begun to develop unified detection strategies.

Unified pipelines refer to architectures that directly predict class probabilities and bounding box offsets from full images with a single feed-forward CNN in a monolithic setting that does not involve region proposal generation or post classification / feature resampling, encapsulating all computation in a single network. Since the whole pipeline is a single network, it can be optimized end-to-end directly on detection performance.

DetectorNet: Szegedy et al. (2013) were among the first to explore CNNs for object detection. DetectorNet formulated object detection as a regression problem to object bounding box masks. They use AlexNet (Krizhevsky et al. 2012a) and replace the final softmax classifier layer with a regression layer. Given an image window, they use one network to predict foreground pixels over a coarse grid, as well as four additional networks to predict the object’s top, bottom, left and right halves. A grouping process then converts the predicted masks into detected bounding boxes. The network needs to be trained per object type and mask type, and does not scale to multiple classes. DetectorNet must take many crops of the image, and run multiple networks for each part on every crop, thus making it slow.

Fig. 14 Illustration of the OverFeat (Sermanet et al. 2014) detection framework

OverFeat, proposed by Sermanet et al. (2014) and illustrated in Fig. 14, can be considered as one of the first single-stage object detectors based on fully convolutional deep networks. It is one of the most influential object detection frameworks, winning the ILSVRC2013 localization and detection competition. OverFeat performs object detection via a single forward pass through the fully convolutional layers in the network (i.e. the “Feature Extractor”, shown in Fig. 14a). The key steps of object detection at test time can be summarized as follows:

  1. Generate object candidates by performing object classification via a sliding window on multiscale images: OverFeat uses a CNN like AlexNet (Krizhevsky et al. 2012a), which would require input images of a fixed size due to its fully connected layers. To make the sliding window approach computationally efficient, OverFeat casts the network (as shown in Fig. 14a) into a fully convolutional network, taking inputs of any size, by viewing fully connected layers as convolutions with kernels of size \(1\times 1\). OverFeat leverages multiscale features to improve the overall performance by passing up to six enlarged scales of the original image through the network (as shown in Fig. 14b), resulting in a significantly increased number of evaluated context views. For each of the multiscale inputs, the classifier outputs a grid of predictions (class and confidence).

  2. Increase the number of predictions by offset max pooling: In order to increase resolution, OverFeat applies offset max pooling after the last CONV layer, i.e. performing a subsampling operation at every offset, yielding many more views for voting, increasing robustness while remaining efficient.

  3. Bounding box regression: Once an object is identified, a single bounding box regressor is applied. The classifier and the regressor share the same feature extraction (CONV) layers, so only the FC layers need to be recomputed after computing the classification network.

  4. Combine predictions: OverFeat uses a greedy merge strategy to combine the individual bounding box predictions across all locations and scales.
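The idea behind step 1, that a fully connected layer can be slid over a larger input as a convolution, can be illustrated with a single-channel, single-output sketch. Here the FC unit is assumed to take a k × k window as input, so its equivalent kernel is k × k; once the spatial size of the features has been reduced to 1 × 1, the same argument gives the \(1\times 1\) kernels mentioned in step 1. The function name and the toy weights are hypothetical.

```python
def fc_as_conv(fmap, weights):
    """Apply one FC unit, with k x k weights, at every k x k window of fmap.

    On a k x k input this is exactly the FC layer (a single output value);
    on a larger input it yields a grid of outputs, one per window position.
    """
    k = len(weights)
    h, w = len(fmap), len(fmap[0])
    return [[sum(weights[a][b] * fmap[y + a][x + b]
                 for a in range(k) for b in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]
```

Running the network once on an enlarged image therefore produces the grid of predictions described in step 1, with the computation for overlapping windows shared rather than repeated.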

OverFeat has a significant speed advantage, but is less accurate than RCNN (Girshick et al. 2014), because it was difficult to train fully convolutional networks at the time. The speed advantage derives from sharing the computation of convolution between overlapping windows in the fully convolutional network. OverFeat is similar to later frameworks such as YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016), except that the classifier and the regressors in OverFeat are trained sequentially.

YOLO: Redmon et al. (2016) proposed YOLO (You Only Look Once), a unified detector casting object detection as a regression problem from image pixels to spatially separated bounding boxes and associated class probabilities, illustrated in Fig. 13. Since the region proposal generation stage is completely dropped, YOLO directly predicts detections using a small set of candidate regions (see Footnote 7). Unlike region based approaches (e.g. Faster RCNN) that predict detections based on features from a local region, YOLO uses features from an entire image globally. In particular, YOLO divides an image into an \(S\times S\) grid, with each grid cell predicting C class probabilities, B bounding box locations, and confidence scores. By throwing out the region proposal generation step entirely, YOLO is fast by design, running in real time at 45 FPS and Fast YOLO (Redmon et al. 2016) at 155 FPS. Since YOLO sees the entire image when making predictions, it implicitly encodes contextual information about object classes, and is less likely to predict false positives in the background. YOLO makes more localization errors than Fast RCNN, resulting from the coarse division of bounding box location, scale and aspect ratio. As discussed in Redmon et al. (2016), YOLO may fail to localize some objects, especially small ones, possibly because of the coarse grid division, and because each grid cell can only contain one object. It is unclear to what extent YOLO can translate to good performance on datasets with many objects per image, such as MS COCO.
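The size of the YOLO output and the assignment of objects to grid cells can be sketched briefly. The defaults S = 7, B = 2 and C = 20, the encoding of each box as four coordinates plus one confidence, and the rule that the cell containing the object centre is responsible for it are taken from our recollection of Redmon et al. (2016) and should be treated as assumptions; the function names are hypothetical.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (4 coordinates + 1 confidence) and C class scores."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, S=7):
    """Grid cell (row, col) containing a normalized object centre in [0, 1]."""
    row = min(int(cy * S), S - 1)
    col = min(int(cx * S), S - 1)
    return (row, col)
```

Two objects whose centres fall into the same cell compete for that cell's single set of class probabilities, which is one way to see the limitation on nearby and small objects noted above.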

YOLOv2 and YOLO9000: Redmon and Farhadi (2017) proposed YOLOv2, an improved version of YOLO, in which the custom GoogLeNet (Szegedy et al. 2015) network is replaced with the simpler DarkNet19, together with the addition of batch normalization (Ioffe and Szegedy 2015), the removal of the fully connected layers, the use of good anchor boxes (see Footnote 8) learned via k-means, and multiscale training. YOLOv2 achieved state-of-the-art on standard detection tasks. Redmon and Farhadi (2017) also introduced YOLO9000, which can detect over 9000 object categories in real time by proposing a joint optimization method to train simultaneously on an ImageNet classification dataset and a COCO detection dataset with WordTree to combine data from multiple sources. Such joint training allows YOLO9000 to perform weakly supervised detection, i.e. detecting object classes that do not have bounding box annotations.
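Learning anchor shapes by k-means can be sketched as below. Following our reading of Redmon and Farhadi (2017), boxes are compared by IOU after aligning them at a common corner, so that the distance is one minus IOU rather than Euclidean; the initialization from the first k boxes, the mean update and the function names are simplifying assumptions.

```python
def wh_iou(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=20):
    """Cluster (width, height) pairs; returns k anchor shapes."""
    centroids = list(boxes[:k])  # naive deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest distance 1 - IOU is the same as largest IOU.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Using IOU instead of Euclidean distance prevents large boxes from dominating the clustering error, so the resulting anchors fit small and large objects about equally well.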

SSD: In order to preserve real-time speed without sacrificing too much detection accuracy, Liu et al. (2016) proposed SSD (Single Shot Detector), faster than YOLO (Redmon et al. 2016) and with an accuracy competitive with region-based detectors such as Faster RCNN (Ren et al. 2015). SSD effectively combines ideas from RPN in Faster RCNN (Ren et al. 2015), YOLO (Redmon et al. 2016) and multiscale CONV features (Hariharan et al. 2016) to achieve fast detection speed, while still retaining high detection quality. Like YOLO, SSD predicts a fixed number of bounding boxes and scores, followed by an NMS step to produce the final detection. The network in SSD is fully convolutional; its early layers are based on a standard architecture, such as VGG (Simonyan and Zisserman 2015), followed by several auxiliary CONV layers, progressively decreasing in size. The information in the last layer may be too coarse spatially to allow precise localization, so SSD performs detection over multiple scales by operating on multiple CONV feature maps, each of which predicts category scores and box offsets for bounding boxes of appropriate sizes. For a \(300\times 300\) input, SSD achieves \(74.3\%\) mAP on the VOC2007 test at 59 FPS versus Faster RCNN 7 FPS / mAP \(73.2\%\) or YOLO 45 FPS / mAP \(63.4\%\).
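The NMS step mentioned above can be sketched as the standard greedy procedure. The IOU threshold of 0.5, the box format `(x1, y1, x2, y2)` and the function names are illustrative assumptions rather than the settings used by Liu et al. (2016).

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In a multi-class detector this is typically applied per class, so that overlapping boxes of different categories do not suppress each other.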

CornerNet: Recently, Law and Deng (2018) questioned the dominant role that anchor boxes have come to play in state-of-the-art object detection frameworks (Girshick 2015; He et al. 2017; Redmon et al. 2016; Liu et al. 2016). Law and Deng (2018) argue that the use of anchor boxes, especially in one stage detectors (Fu et al. 2017; Lin et al. 2017b; Liu et al. 2016; Redmon et al. 2016), has drawbacks (Law and Deng 2018; Lin et al. 2017b) such as causing a huge imbalance between positive and negative examples, slowing down training and introducing extra hyperparameters. Borrowing ideas from the work on Associative Embedding in multiperson pose estimation (Newell et al. 2017), Law and Deng (2018) proposed CornerNet by formulating bounding box object detection as detecting paired top-left and bottom-right keypoints (see Footnote 9). In CornerNet, the backbone network consists of two stacked Hourglass networks (Newell et al. 2016), with a simple corner pooling approach to better localize corners. CornerNet achieved a \(42.1\%\) AP on MS COCO, outperforming all previous one stage detectors; however, the average inference speed is about 4 FPS on a Titan X GPU, significantly slower than SSD (Liu et al. 2016) and YOLO (Redmon et al. 2016). CornerNet can generate incorrect bounding boxes because it is challenging to decide which pairs of keypoints should be grouped into the same objects. To further improve on CornerNet, Duan et al. (2019) proposed CenterNet to detect each object as a triplet of keypoints, by introducing one extra keypoint at the centre of a proposal, raising the MS COCO AP to \(47.0\%\), but with an inference speed slower than CornerNet.

6 Object Representation

As one of the main components in any detector, good feature representations are of primary importance in object detection (Dickinson et al. 2009; Girshick et al. 2014; Gidaris and Komodakis 2015; Zhu et al. 2016a). In the past, a great deal of effort was devoted to designing local descriptors [e.g., SIFT (Lowe 1999) and HOG (Dalal and Triggs 2005)] and to exploring approaches [e.g., Bag of Words (Sivic and Zisserman 2003) and Fisher Vector (Perronnin et al. 2010)] to group and abstract descriptors into higher level representations in order to allow the discriminative parts to emerge; however, these feature representation methods required careful engineering and considerable domain expertise.

In contrast, deep learning methods (especially deep CNNs) can learn powerful feature representations with multiple levels of abstraction directly from raw images (Bengio et al. 2013; LeCun et al. 2015). As the learning procedure reduces the dependency on specific domain knowledge and on the complex procedures needed in traditional feature engineering (Bengio et al. 2013; LeCun et al. 2015), the burden for feature representation has been transferred to the design of better network architectures and training procedures.

The leading frameworks reviewed in Sect. 5 [RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), YOLO (Redmon et al. 2016), SSD (Liu et al. 2016)] have persistently promoted detection accuracy and speed, in which it is generally accepted that the CNN architecture (Sect. 6.1 and Fig. 15) plays a crucial role. As a result, most of the recent improvements in detection accuracy have been via research into the development of novel networks. Therefore we begin by reviewing popular CNN architectures used in Generic Object Detection, followed by a review of the effort devoted to improving object feature representations, such as developing invariant features to accommodate geometric variations in object scale, pose, viewpoint, part deformation and performing multiscale analysis to improve object detection over a wide range of scales.

Fig. 15 Performance of winning entries in the ILSVRC competitions from 2011 to 2017 in the image classification task

Table 6 DCNN architectures that were commonly used for generic object detection

6.1 Popular CNN Architectures

CNN architectures (Sect. 3) serve as network backbones used in the detection frameworks of Sect. 5. Representative architectures include AlexNet (Krizhevsky et al. 2012b), ZFNet (Zeiler and Fergus 2014), VGGNet (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015), Inception series (Ioffe and Szegedy 2015; Szegedy et al. 2016, 2017), ResNet (He et al. 2016), DenseNet (Huang et al. 2017a) and SENet (Hu et al. 2018b), summarized in Table 6, and where the improvement over time is seen in Fig. 15. A further review of recent CNN advances can be found in Gu et al. (2018).

The trend in architecture evolution is for greater depth: AlexNet has 8 layers, VGGNet 16 layers, more recently ResNet and DenseNet both surpassed the 100 layer mark, and it was VGGNet (Simonyan and Zisserman 2015) and GoogLeNet (Szegedy et al. 2015) which showed that increasing depth can improve the representational power. As can be observed from Table 6, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of parameters, despite being only a few layers deep, since a large fraction of the parameters come from the FC layers. Newer networks like Inception, ResNet, and DenseNet, although having a great depth, actually have far fewer parameters by avoiding the use of FC layers.
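The dominance of FC layers in the parameter count can be checked with simple arithmetic. The layer shapes used in the usage lines are those of VGG16 as we recall them (a first CONV layer with 3 × 3 kernels from 3 to 64 channels, and a first FC layer mapping a 7 × 7 × 512 feature map to 4096 units); they are illustrative assumptions rather than values taken from Table 6.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Weights of a kernel x kernel convolution, plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def fc_params(n_in, n_out, bias=True):
    """Weights of a fully connected layer, plus one bias per output unit."""
    return n_in * n_out + (n_out if bias else 0)


first_conv = conv_params(3, 3, 64)        # 1,792 parameters
first_fc = fc_params(7 * 7 * 512, 4096)   # 102,764,544 parameters
print(first_fc // first_conv)             # the FC layer is tens of thousands of times larger
```

A single FC layer of this kind holds over a hundred million parameters, which is why architectures that avoid FC layers can be far deeper and still much smaller.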

With the use of Inception modules (Szegedy et al. 2015) in carefully designed topologies, the number of parameters of GoogLeNet is dramatically reduced, compared to AlexNet, ZFNet or VGGNet. Similarly, ResNet demonstrated the effectiveness of skip connections for learning extremely deep networks with hundreds of layers, winning the ILSVRC 2015 classification task. Inspired by ResNet (He et al. 2016), InceptionResNets (Szegedy et al. 2017) combined the Inception networks with shortcut connections, on the basis that shortcut connections can significantly accelerate network training. Extending ResNets, Huang et al. (2017a) proposed DenseNets, which are built from dense blocks connecting each layer to every other layer in a feedforward fashion, leading to compelling advantages such as parameter efficiency, implicit deep supervision (see Footnote 10), and feature reuse. Recently, Hu et al. (2018b) proposed Squeeze and Excitation (SE) blocks, which can be combined with existing deep architectures to boost their performance at minimal additional computational cost, adaptively recalibrating channel-wise feature responses by explicitly modeling the interdependencies between convolutional feature channels, and which led to winning the ILSVRC 2017 classification task. Research on CNN architectures remains active, with emerging networks such as Hourglass (Law and Deng 2018), Dilated Residual Networks (Yu et al. 2017), Xception (Chollet 2017), DetNet (Li et al. 2018b), Dual Path Networks (DPN) (Chen et al. 2017b), FishNet (Sun et al. 2018), and GLoRe (Chen et al. 2019b).
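The channel recalibration performed by an SE block can be sketched in plain Python. The structure used here (global average pooling, a bottleneck FC layer with ReLU, a second FC layer with a sigmoid, then channel-wise scaling) follows our understanding of the SE design; the omission of biases, the list-based tensor layout and the function name are simplifying assumptions.

```python
import math


def se_block(x, w1, w2):
    """Recalibrate the channels of x.

    x:  list of C channels, each a 2D list of floats.
    w1: (C/r) x C weights of the bottleneck FC layer.
    w2: C x (C/r) weights of the expanding FC layer.
    """
    # Squeeze: one descriptor per channel by global average pooling.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC + ReLU, then FC + sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * hv for w, hv in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]
```

Because each gate depends on the pooled responses of all channels, the block lets the network emphasize or suppress a channel according to what the other channels have detected, at the cost of two small FC layers.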

The training of a CNN requires a large-scale labeled dataset with intraclass diversity. Unlike image classification, detection requires localizing (possibly many) objects in an image. It has been shown (Ouyang et al. 2017b) that pretraining a deep model with a large scale dataset having object level annotations (such as ImageNet), instead of only the image level annotations, improves the detection performance. However, collecting bounding box labels is expensive, especially for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset, directly, as a generic feature extractor (Razavian et al. 2014; Azizpour et al. 2016; Donahue et al. 2014; Yosinski et al. 2014), which can support a wider range of visual recognition tasks. For detection, the pre-trained network is typically fine-tuned (Footnote 11) on a given detection dataset (Donahue et al. 2014; Girshick et al. 2014, 2016). Several large scale image classification datasets are used for CNN pre-training, among them ImageNet1000 (Deng et al. 2009; Russakovsky et al. 2015) with 1.2 million images of 1000 object categories, Places (Zhou et al. 2017a), which is much larger than ImageNet1000 but with fewer classes, a recent Places-Imagenet hybrid (Zhou et al. 2017a), or JFT300M (Hinton et al. 2015; Sun et al. 2017).

Pretrained CNNs without fine-tuning were explored for object classification and detection in Donahue et al. (2014), Girshick et al. (2016), Agrawal et al. (2014), where it was shown that detection accuracies are different for features extracted from different layers; for example, for AlexNet pre-trained on ImageNet, FC6 / FC7 / Pool5 are in descending order of detection accuracy (Donahue et al. 2014; Girshick et al. 2016). Fine-tuning a pre-trained network can increase detection performance significantly (Girshick et al. 2014, 2016), although in the case of AlexNet, the fine-tuning performance boost was shown to be much larger for FC6 / FC7 than for Pool5, suggesting that Pool5 features are more general. Furthermore, the relationship between the source and target datasets plays a critical role; for example, ImageNet based CNN features show better performance for object detection than for human action (Zhou et al. 2015; Azizpour et al. 2016).

6.2 Methods For Improving Object Representation

Deep CNN based detectors such as RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015) and YOLO (Redmon et al. 2016) typically use the deep CNN architectures listed in Table 6 as the backbone network and use features from the top layer of the CNN as object representations; however, detecting objects across a large range of scales is a fundamental challenge. A classical strategy to address this issue is to run the detector over a number of scaled input images (e.g., an image pyramid) (Felzenszwalb et al. 2010b; Girshick et al. 2014; He et al. 2014), which typically produces more accurate detection, but at an obvious cost in inference time and memory.

6.2.1 Handling of Object Scale Variations

Since a CNN computes its feature hierarchy layer by layer, the sub-sampling layers in the feature hierarchy already lead to an inherent multiscale pyramid, producing feature maps at different spatial resolutions, but subject to challenges (Hariharan et al. 2016; Long et al. 2015; Shrivastava et al. 2017). In particular, the higher layers have a large receptive field and strong semantics, and are the most robust to variations such as object pose, illumination and part deformation, but the resolution is low and the geometric details are lost. In contrast, lower layers have a small receptive field, high resolution and rich geometric details, but are much less sensitive to semantics. Intuitively, semantic concepts of objects can emerge in different layers, depending on the size of the objects. If a target object is small, it requires fine detail information from earlier layers and may very well disappear at later layers, in principle making small object detection very challenging. Tricks such as dilated or “atrous” convolution (Yu and Koltun 2015; Dai et al. 2016c; Chen et al. 2018b) have been proposed for this case, increasing feature resolution at the price of increased computational complexity. On the other hand, if the target object is large, then the semantic concept will emerge in much later layers. A number of methods (Shrivastava et al. 2017; Zhang et al. 2018e; Lin et al. 2017a; Kong et al. 2017) have been proposed to improve detection accuracy by exploiting multiple CNN layers, broadly falling into three types of multiscale object detection:

  1. Detecting with combined features of multiple layers;

  2. Detecting at multiple layers;

  3. Combinations of the above two methods.
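
The trade-off behind dilated convolution mentioned above (the same receptive field at a higher feature resolution) can be illustrated with standard receptive-field arithmetic. The sketch below is an illustration under simple assumptions (square kernels, no padding effects); the function names are hypothetical.

```python
def effective_kernel(k, dilation):
    """Span in pixels covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field and output stride of a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for k, stride, dilation in layers:
        rf += (effective_kernel(k, dilation) - 1) * jump
        jump *= stride
    return rf, jump


strided = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]  # sub-sampling in the middle layer
dilated = [(3, 1, 1), (3, 1, 1), (3, 1, 2)]  # no sub-sampling, last layer dilated
print(receptive_field(strided))
print(receptive_field(dilated))
```

Both stacks reach a receptive field of 9 pixels, but the dilated stack keeps an output stride of 1 instead of 2. Its feature map therefore has four times as many positions, which is where the extra computation comes from.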

Fig. 16 Comparison of HyperNet and ION. LRN is local response normalization, which performs a kind of “lateral inhibition” by normalizing over local input regions (Jia et al. 2014)

(1) Detecting with combined features of multiple CNN layers Many approaches, including Hypercolumns (Hariharan et al. 2016), HyperNet (Kong et al. 2016), and ION (Bell et al. 2016), combine features from multiple layers before making a prediction. Such feature combination is commonly accomplished via concatenation, a classic neural network idea, in architectures which have recently become popular for semantic segmentation (Long et al. 2015; Shelhamer et al. 2017; Hariharan et al. 2016). As shown in Fig. 16a, ION (Bell et al. 2016) uses RoI pooling to extract RoI features from multiple layers, and the object proposals generated by selective search and edgeboxes are then classified using the concatenated features. HyperNet (Kong et al. 2016), shown in Fig. 16b, follows a similar idea, and integrates deep, intermediate and shallow features to generate object proposals and to predict objects via an end to end joint training strategy. The combined feature is more descriptive, and is more beneficial for localization and classification, but at increased computational complexity.
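
A minimal sketch of the concatenation idea follows: features from a shallow, a middle and a deep layer are brought to a common resolution and stacked along the channel axis. It assumes a factor of 2 between consecutive layers and substitutes max pooling and nearest-neighbour upsampling for the learned resampling (and normalization) that the actual methods use; all names are illustrative.

```python
def upsample_nearest(channel, factor):
    """Repeat every value `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in channel for _ in range(factor)]


def max_pool(channel, factor):
    """Non-overlapping max pooling; sizes are assumed divisible by `factor`."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[y + dy][x + dx]
                 for dy in range(factor) for dx in range(factor))
             for x in range(0, w, factor)]
            for y in range(0, h, factor)]


def combine_layers(shallow, middle, deep):
    """Resize all layers to the middle layer's resolution, then concatenate channels."""
    shallow_down = [max_pool(ch, 2) for ch in shallow]
    deep_up = [upsample_nearest(ch, 2) for ch in deep]
    return shallow_down + middle + deep_up


shallow = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
middle = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
deep = [[[7]]]
combined = combine_layers(shallow, middle, deep)
print(len(combined), "channels of size", len(combined[0]), "x", len(combined[0][0]))
```

The combined tensor has as many channels as all contributing layers together, which makes the added computational cost noted above easy to see.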

Table 7 Summary of properties of representative methods in improving DCNN feature representations for generic object detection
Fig. 17 Hourglass architectures: Conv1 to Conv5 are the main Conv blocks in backbone networks such as VGG or ResNet. The figure compares a number of feature fusion blocks (FFB) commonly used in recent approaches: FPN (Lin et al. 2017a), TDM (Shrivastava et al. 2017), DSSD (Fu et al. 2017), RON (Kong et al. 2017), RefineDet (Zhang et al. 2018a), ZIP (Li et al. 2018a), PANet (Liu et al. 2018c), FPR (Kong et al. 2018), DetNet (Li et al. 2018b) and M2Det (Zhao et al. 2019). FFM feature fusion module, TUM thinned U-shaped module

(2) Detecting at multiple CNN layers A number of recent approaches improve detection by predicting objects of different resolutions at different layers and then combining these predictions: SSD (Liu et al. 2016), MSCNN (Cai et al. 2016), RFBNet (Liu et al. 2018b), and DSOD (Shen et al. 2017). SSD (Liu et al. 2016) spreads out default boxes of different scales to multiple layers within a CNN, and forces each layer to focus on predicting objects of a certain scale. RFBNet (Liu et al. 2018b) replaces the later convolution layers of SSD with a Receptive Field Block (RFB) to enhance the discriminability and robustness of features. The RFB is a multibranch convolutional block, similar to the Inception block (Szegedy et al. 2015), but combining multiple branches with different kernels and convolution layers (Chen et al. 2018b). MSCNN (Cai et al. 2016) applies deconvolution on multiple layers of a CNN to increase feature map resolution before using the layers to learn region proposals and pool features. Similar to RFBNet (Liu et al. 2018b), TridentNet (Li et al. 2019b) constructs a parallel multibranch architecture where each branch shares the same transformation parameters but has a different receptive field; dilated convolutions with different dilation rates are used to adapt the receptive fields to objects of different scales.
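
How SSD assigns a scale to each prediction layer can be sketched with the linear rule given in the SSD paper. The rule and its defaults (smallest scale 0.2, largest 0.9, both relative to the input size) come from that paper rather than from this survey, and the function name is illustrative.

```python
def default_box_scales(num_layers, s_min=0.2, s_max=0.9):
    """Evenly spaced default-box scales, one per prediction layer.

    Scales are fractions of the input image size; the first entry belongs to
    the highest-resolution layer, the last to the coarsest one.
    """
    if num_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


scales = default_box_scales(6)
print([round(s, 2) for s in scales])
```

Shallow, high-resolution layers thus receive small default boxes and deep, low-resolution layers receive large ones, which is what "forcing each layer to focus on a certain scale" amounts to in practice.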

(3) Combinations of the above two methods Features from different layers are complementary to each other and can improve detection accuracy, as shown by Hypercolumns (Hariharan et al. 2016), HyperNet (Kong et al. 2016) and ION (Bell et al. 2016). On the other hand, it is natural to detect objects of different scales using features of approximately the same size, which can be achieved by detecting large objects from downscaled feature maps while detecting small objects from upscaled feature maps. Therefore, in order to combine the best of both worlds, some recent works propose to detect objects at multiple layers, with the features used at each layer obtained by combining features from different layers. This approach has been found to be effective for segmentation (Long et al. 2015; Shelhamer et al. 2017) and human pose estimation (Newell et al. 2016), and has been widely exploited by both one-stage and two-stage detectors to alleviate problems of scale variation across object instances. Representative methods include SharpMask (Pinheiro et al. 2016), Deconvolutional Single Shot Detector (DSSD) (Fu et al. 2017), Feature Pyramid Network (FPN) (Lin et al. 2017a), Top Down Modulation (TDM) (Shrivastava et al. 2017), Reverse connection with Objectness prior Network (RON) (Kong et al. 2017), ZIP (Li et al. 2018a), Scale Transfer Detection Network (STDN) (Zhou et al. 2018b), RefineDet (Zhang et al. 2018a), StairNet (Woo et al. 2018), Path Aggregation Network (PANet) (Liu et al. 2018c), Feature Pyramid Reconfiguration (FPR) (Kong et al. 2018), DetNet (Li et al. 2018b), Scale Aware Network (SAN) (Kim et al. 2018), Multiscale Location aware Kernel Representation (MLKP) (Wang et al. 2018) and M2Det (Zhao et al. 2019), as shown in Table 7 and contrasted in Fig. 17.

Early works like FPN (Lin et al. 2017a), DSSD (Fu et al. 2017), TDM (Shrivastava et al. 2017), ZIP (Li et al. 2018a), RON (Kong et al. 2017) and RefineDet (Zhang et al. 2018a) construct the feature pyramid according to the inherent multiscale, pyramidal architecture of the backbone, and achieved encouraging results. As can be observed from Fig. 17a1–f1, these methods have very similar detection architectures which incorporate a top-down network with lateral connections to supplement the standard bottom-up, feed-forward network. Specifically, after a bottom-up pass the final high level semantic features are transmitted back by the top-down network to combine with the bottom-up features from intermediate layers after lateral processing, and the combined features are then used for detection. As can be seen from Fig. 17a2–e2, the main differences lie in the design of the simple Feature Fusion Block (FFB), which handles the selection of features from different layers and the combination of multilayer features.
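
The top-down pass with lateral connections can be sketched in a few lines. The example below uses single-channel maps and an FPN-style fusion (upsample the coarser map by 2 and add it element-wise to the laterally processed bottom-up map). A scalar weight stands in for the 1 x 1 lateral convolution, which is a simplification; the other methods in Fig. 17 use different fusion blocks.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [[v for v in row for _ in (0, 1)] for row in fmap for _ in (0, 1)]


def top_down_pyramid(bottom_up, lateral_weights):
    """Build a feature pyramid from bottom-up maps ordered fine to coarse.

    Each map is assumed to be half the size of the previous one.
    lateral_weights holds one scalar per level (stand-in for a 1x1 convolution).
    """
    laterals = [[[w * v for v in row] for row in fmap]
                for fmap, w in zip(bottom_up, lateral_weights)]
    pyramid = [laterals[-1]]  # start from the coarsest, most semantic level
    for lateral in reversed(laterals[:-1]):
        up = upsample2x(pyramid[0])
        merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        pyramid.insert(0, merged)
    return pyramid


c3 = [[1] * 4 for _ in range(4)]   # fine, weak semantics
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]                       # coarse, strong semantics
pyramid = top_down_pyramid([c3, c4, c5], [1, 1, 1])
print(pyramid[0][0][0], pyramid[1][0][0], pyramid[2][0][0])
```

Every level of the resulting pyramid carries a contribution from the coarsest map, so detection at fine resolutions also has access to high-level semantics.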

FPN (Lin et al. 2017a) shows significant improvement as a generic feature extractor in several applications including object detection (Lin et al. 2017a, b) and instance segmentation (He et al. 2017). Using FPN in a basic Faster RCNN system achieved state-of-the-art results on the COCO detection dataset. STDN (Zhou et al. 2018b) used DenseNet (Huang et al. 2017a) to combine features of different layers and designed a scale transfer module to obtain feature maps with different resolutions. The scale transfer module can be directly embedded into DenseNet with little additional cost.

More recent work, such as PANet (Liu et al. 2018c), FPR (Kong et al. 2018), DetNet (Li et al. 2018b), and M2Det (Zhao et al. 2019), as shown in Fig. 17g–j, proposes to further improve on pyramid architectures like FPN in different ways. Based on FPN, Liu et al. designed PANet (Liu et al. 2018c) (Fig. 17g1) by adding another bottom-up path with clean lateral connections from low to top levels, in order to shorten the information path and to enhance the feature pyramid. Then, an adaptive feature pooling was proposed to aggregate features from all feature levels for each proposal. In addition, in the proposal sub-network, a complementary branch capturing different views for each proposal is created to further improve mask prediction. These additional steps bring only slight extra computational overhead, but are effective and allowed PANet to reach 1st place in the COCO 2017 Challenge Instance Segmentation task and 2nd place in the Object Detection task. Kong et al. proposed FPR (Kong et al. 2018) by explicitly reformulating the feature pyramid construction process [e.g. FPN (Lin et al. 2017a)] as feature reconfiguration functions in a highly nonlinear but efficient way. As shown in Fig. 17h1, instead of using a top-down path to propagate strong semantic features from the topmost layer down as in FPN, FPR first extracts features from multiple layers in the backbone network by adaptive concatenation, and then designs a more complex FFB module (Fig. 17h2) to spread strong semantics to all scales. Li et al. (2018b) proposed DetNet (Fig. 17i1) by introducing dilated convolutions to the later layers of the backbone network in order to maintain high spatial resolution in deeper layers. Zhao et al. (2019) proposed a MultiLevel Feature Pyramid Network (MLFPN) to build more effective feature pyramids for detecting objects of different scales. As can be seen from Fig. 17j1, features from two different layers of the backbone are first fused as the base feature, after which a top-down path with lateral connections from the base feature is created to build the feature pyramid. As shown in Fig. 17j2, j5, the FFB module is much more complex than those of methods like FPN, in that FFB involves a Thinned U-shaped Module (TUM) to generate a second pyramid structure, after which the feature maps with equivalent sizes from multiple TUMs are combined for object detection. The authors proposed M2Det by integrating MLFPN into SSD, and achieved better detection performance than other one-stage detectors.

6.3 Handling of Other Intraclass Variations

Powerful object representations should combine distinctiveness and robustness. A large amount of recent work has been devoted to handling changes in object scale, as reviewed in Sect. 6.2.1. As discussed in Sect. 2.2 and summarized in Fig. 5, object detection still requires robustness to real-world variations other than just scale, which we group into three categories:

  • Geometric transformations,

  • Occlusions, and

  • Image degradations.

To handle these intra-class variations, the most straightforward approach is to augment the training datasets with a sufficient amount of variations; for example, robustness to rotation could be achieved by adding rotated objects at many orientations to the training data. Robustness can frequently be learned this way, but usually at the cost of expensive training and complex model parameters. Therefore, researchers have proposed alternative solutions to these problems.

Handling of geometric transformations DCNNs are inherently limited by their lack of spatial invariance to geometric transformations of the input data (Lenc and Vedaldi 2018; Liu et al. 2017; Chellappa 2016). The introduction of local max pooling layers has allowed DCNNs to enjoy some translation invariance; however, the intermediate feature maps are not actually invariant to large geometric transformations of the input data (Lenc and Vedaldi 2018). Therefore, many approaches have been presented to enhance robustness, aiming at learning invariant CNN representations with respect to different types of transformations such as scale (Kim et al. 2014; Bruna and Mallat 2013), rotation (Bruna and Mallat 2013; Cheng et al. 2016; Worrall et al. 2017; Zhou et al. 2017b), or both (Jaderberg et al. 2015). One representative work is Spatial Transformer Network (STN) (Jaderberg et al. 2015), which introduces a new learnable module to handle scaling, cropping, rotations, as well as nonrigid deformations via a global parametric transformation. STN has now been used in rotated text detection (Jaderberg et al. 2015), rotated face detection and generic object detection (Wang et al. 2017).

Although rotation invariance may be attractive in certain applications, such as scene text detection (He et al. 2018; Ma et al. 2018), face detection (Shi et al. 2018), and aerial imagery (Ding et al. 2018; Xia et al. 2018), there is limited generic object detection work focusing on rotation invariance because popular benchmark detection datasets (e.g. PASCAL VOC, ImageNet, COCO) do not actually present rotated images.

Before deep learning, Deformable Part based Models (DPMs) (Felzenszwalb et al. 2010b) were successful for generic object detection, representing objects by component parts arranged in a deformable configuration. Although DPMs have been significantly outperformed by more recent object detectors, their spirit still deeply influences many recent detectors. DPM modeling is less sensitive to transformations in object pose, viewpoint and nonrigid deformations, motivating researchers (Dai et al. 2017; Girshick et al. 2015; Mordan et al. 2018; Ouyang et al. 2015; Wan et al. 2015) to explicitly model object composition to improve CNN based detection. The first attempts (Girshick et al. 2015; Wan et al. 2015) combined DPMs with CNNs by using deep features learned by AlexNet in DPM based detection, but without region proposals. To enable a CNN to benefit from the built-in capability of modeling the deformations of object parts, a number of approaches were proposed, including DeepIDNet (Ouyang et al. 2015), DCN (Dai et al. 2017) and DPFCN (Mordan et al. 2018) (shown in Table 7). Although similar in spirit, deformations are computed in different ways: DeepIDNet (Ouyang et al. 2017b) designed a deformation constrained pooling layer to replace regular max pooling, to learn the shared visual patterns and their deformation properties across different object classes; DCN (Dai et al. 2017) designed a deformable convolution layer and a deformable RoI pooling layer, both of which are based on the idea of augmenting regular grid sampling locations in feature maps; and DPFCN (Mordan et al. 2018) proposed a deformable part-based RoI pooling layer which selects discriminative parts of objects around object proposals by simultaneously optimizing latent displacements of all parts.

Handling of occlusions In real-world images, occlusions are common, resulting in information loss from object instances. A deformable parts idea can be useful for occlusion handling, so deformable RoI Pooling (Dai et al. 2017; Mordan et al. 2018; Ouyang and Wang 2013) and deformable convolution (Dai et al. 2017) have been proposed to alleviate occlusion by giving more flexibility to the typically fixed geometric structures. Wang et al. (2017) propose to learn an adversarial network that generates examples with occlusions and deformations, and context may be helpful in dealing with occlusions (Zhang et al. 2018b). Despite these efforts, the occlusion problem is far from being solved; applying GANs to this problem may be a promising research direction.

Handling of image degradations Image noise is a common problem in many real-world applications. It is frequently caused by insufficient lighting, low quality cameras, image compression, or the intentional use of low-cost sensors on edge devices and wearable devices. While low image quality may be expected to degrade the performance of visual recognition, most current methods are evaluated in a degradation free and clean environment, evidenced by the fact that PASCAL VOC, ImageNet, MS COCO and Open Images all focus on relatively high quality images. To the best of our knowledge, there is so far very limited work to address this problem.

7 Context Modeling

Table 8 Summary of detectors that exploit context information, with labelling details as in Table 7

In the physical world, visual objects occur in particular environments and usually coexist with other related objects. There is strong psychological evidence (Biederman 1972; Bar 2004) that context plays an essential role in human object recognition, and it is recognized that a proper modeling of context helps object detection and recognition (Torralba 2003; Oliva and Torralba 2007; Chen et al. 2018b, 2015a; Divvala et al. 2009; Galleguillos and Belongie 2010), especially when object appearance features are insufficient because of small object size, object occlusion, or poor image quality. Many different types of context have been discussed (Divvala et al. 2009; Galleguillos and Belongie 2010), and can broadly be grouped into one of three categories:

  1. Semantic context: The likelihood of an object to be found in some scenes, but not in others;

  2. Spatial context: The likelihood of finding an object in some position and not others with respect to other objects in the scene;

  3. Scale context: Objects have a limited set of sizes relative to other objects in the scene.

A great deal of work (Chen et al. 2015b; Divvala et al. 2009; Galleguillos and Belongie 2010; Malisiewicz and Efros 2009; Murphy et al. 2003; Rabinovich et al. 2007; Parikh et al. 2012) preceded the prevalence of deep learning, and much of this work has yet to be explored in DCNN-based object detectors (Chen and Gupta 2017; Hu et al. 2018a).

The current state of the art in object detection (Ren et al. 2015; Liu et al. 2016; He et al. 2017) detects objects without explicitly exploiting any contextual information. It is broadly agreed that DCNNs make use of contextual information implicitly (Zeiler and Fergus 2014; Zheng et al. 2015) since they learn hierarchical representations with multiple levels of abstraction. Nevertheless, there is value in exploring contextual information explicitly in DCNN based detectors (Hu et al. 2018a; Chen and Gupta 2017; Zeng et al. 2017), so the following reviews recent work in exploiting contextual cues in DCNN-based object detectors, organized into categories of global and local contexts, motivated by earlier work in Zhang et al. (2013), Galleguillos and Belongie (2010). Representative approaches are summarized in Table 8.

7.1 Global Context

Global context (Zhang et al. 2013; Galleguillos and Belongie 2010) refers to image or scene level contexts, which can serve as cues for object detection (e.g., a bedroom will predict the presence of a bed). In DeepIDNet (Ouyang et al. 2015), the image classification scores were used as contextual features, and concatenated with the object detection scores to improve detection results. In ION (Bell et al. 2016), Bell et al. proposed to use spatial Recurrent Neural Networks (RNNs) to explore contextual information across the entire image. In SegDeepM (Zhu et al. 2015), Zhu et al. proposed a Markov random field model that scores appearance as well as context for each detection, and allows each candidate box to select a segment out of a large pool of object segmentation proposals and score the agreement between them. In Shrivastava and Gupta (2016), semantic segmentation was used as a form of contextual priming.

Fig. 18 Representative approaches that explore local surrounding contextual features: MRCNN (Gidaris and Komodakis 2015), GBDNet (Zeng et al. 2016, 2017), ACCNN (Li et al. 2017b) and CoupleNet (Zhu et al. 2017a); also see Table 8

7.2 Local Context

Local context (Zhang et al. 2013; Galleguillos and Belongie 2010; Rabinovich et al. 2007) considers the relationship among locally nearby objects, as well as the interactions between an object and its surrounding area. In general, modeling object relations is challenging, requiring reasoning about bounding boxes of different classes, locations, scales etc. Deep learning research that explicitly models object relations is quite limited, with representative ones being Spatial Memory Network (SMN) (Chen and Gupta 2017), Object Relation Network (ORN) (Hu et al. 2018a), and Structure Inference Network (SIN) (Liu et al. 2018d). In SMN, spatial memory essentially assembles object instances back into a pseudo image representation that is easy to feed into another CNN for object relations reasoning, leading to a new sequential reasoning architecture where image and memory are processed in parallel to obtain detections which further update memory. Inspired by the recent success of attention modules in natural language processing (Vaswani et al. 2017), ORN processes a set of objects simultaneously through the interaction between their appearance feature and geometry. It does not require additional supervision, is easy to embed into existing networks, and is effective in improving the object recognition and duplicate removal steps in modern object detection pipelines, giving rise to the first fully end-to-end object detector. SIN (Liu et al. 2018d) considered two kinds of context: scene contextual information and object relationships within a single image. It formulates object detection as a problem of graph inference, where the objects are treated as nodes in a graph and relationships between objects are modeled as edges.

A wider range of methods has approached the context challenge with a simpler idea: enlarging the detection window size to extract some form of local context. Representative approaches include MRCNN (Gidaris and Komodakis 2015), Gated BiDirectional CNN (GBDNet) (Zeng et al. 2016, 2017), Attention to Context CNN (ACCNN) (Li et al. 2017b), CoupleNet (Zhu et al. 2017a), and Sermanet et al. (2013). In MRCNN (Gidaris and Komodakis 2015) (Fig. 18a), in addition to the features extracted from the original object proposal at the last CONV layer of the backbone, Gidaris and Komodakis proposed to extract features from a number of different regions of an object proposal (half regions, border regions, central regions, contextual region and semantically segmented regions), in order to obtain a richer and more robust object representation. All of these features are combined by concatenation.
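
The "enlarge the detection window" idea reduces to simple box arithmetic. The sketch below scales a proposal about its centre and clips it to the image; the scale factor of 1.5 in the usage lines is a hypothetical value chosen for illustration, not one taken from any of the cited methods.

```python
def enlarge_box(box, scale, image_w, image_h):
    """Scale a box (x1, y1, x2, y2) about its centre and clip it to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(image_w), cx + half_w), min(float(image_h), cy + half_h))


proposal = (40, 40, 60, 60)
context_region = enlarge_box(proposal, 1.5, image_w=100, image_h=100)
print(context_region)
```

Features pooled from `context_region` can then be concatenated with features pooled from the original proposal, in the spirit of the multi-region representation described above.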

Quite a number of methods, all closely related to MRCNN, have been proposed since then. The method in Zagoruyko et al. (2016) used only four contextual regions, organized in a foveal structure, where the classifiers along multiple paths are trained jointly end-to-end. Zeng et al. (2016, 2017) proposed GBDNet (Fig. 18b) to extract features from multiscale contextualized regions surrounding an object proposal to improve detection performance. In contrast to the somewhat naive approach of learning CNN features for each region separately and then concatenating them, GBDNet passes messages among features from different contextual regions. Noting that message passing is not always helpful, but dependent on individual samples, Zeng et al. (2016) used gated functions to control message transmission. Li et al. (2017b) presented ACCNN (Fig. 18c) to utilize both global and local contextual information: the global context was captured using a Multiscale Local Contextualized (MLC) subnetwork, which recurrently generates an attention map for an input image to highlight promising contextual locations; local context adopted a method similar to that of MRCNN (Gidaris and Komodakis 2015). As shown in Fig. 18d, CoupleNet (Zhu et al. 2017a) is conceptually similar to ACCNN (Li et al. 2017b), but built upon RFCN (Dai et al. 2016c), which captures object information with position sensitive RoI pooling; CoupleNet added a branch to encode the global context with RoI pooling.

8 Detection Proposal Methods

An object can be located at any position and scale in an image. During the heyday of handcrafted feature descriptors [SIFT (Lowe 2004), HOG (Dalal and Triggs 2005) and LBP (Ojala et al. 2002)], the most successful methods for object detection [e.g. DPM (Felzenszwalb et al. 2008)] used sliding window techniques (Viola and Jones 2001; Dalal and Triggs 2005; Felzenszwalb et al. 2008; Harzallah et al. 2009; Vedaldi et al. 2009). However, the number of windows is huge, growing with the number of pixels in an image, and the need to search at multiple scales and aspect ratios further increases the search space (Footnote 12). Therefore, it is computationally too expensive to apply sophisticated classifiers.
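
The size of the sliding-window search space is easy to quantify. The sketch below counts window placements for a given image size, stride and a set of scales and aspect ratios; all concrete numbers in the usage lines (image size, base window, stride) are hypothetical values chosen for illustration.

```python
def count_windows(image_w, image_h, window_w, window_h, stride):
    """Number of placements of one window size on a regular grid."""
    if window_w > image_w or window_h > image_h:
        return 0
    nx = (image_w - window_w) // stride + 1
    ny = (image_h - window_h) // stride + 1
    return nx * ny


def count_multiscale(image_w, image_h, base, scales, aspect_ratios, stride):
    """Total placements over all scale / aspect-ratio combinations."""
    total = 0
    for s in scales:
        for r in aspect_ratios:  # r = width / height
            w = int(round(base * s * r ** 0.5))
            h = int(round(base * s / r ** 0.5))
            total += count_windows(image_w, image_h, w, h, stride)
    return total


single = count_windows(640, 480, 64, 64, stride=8)
multi = count_multiscale(640, 480, 64, [1, 2, 4], [0.5, 1.0, 2.0], stride=8)
print(single, multi)
```

Even a modest 640 x 480 image with a stride of 8 pixels yields several thousand windows for a single window shape, and the count multiplies with every added scale and aspect ratio.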

Around 2011, researchers proposed to relieve the tension between computational tractability and high detection quality by using detection proposals (Footnote 13) (Van de Sande et al. 2011; Uijlings et al. 2013). Originating in the idea of objectness proposed by Alexe et al. (2010), object proposals are a set of candidate regions in an image that are likely to contain objects, and if high object recall can be achieved with a modest number of object proposals (like one hundred), significant speed-ups over the sliding window approach can be gained, allowing the use of more sophisticated classifiers. Detection proposals are usually used as a pre-processing step, limiting the number of regions that need to be evaluated by the detector, and should have the following characteristics:

  1. High recall, which can be achieved with only a few proposals;

  2. Accurate localization, such that the proposals match the object bounding boxes as accurately as possible; and

  3. Low computational cost.
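
The first two characteristics are commonly measured together as recall at a given intersection-over-union (IoU) threshold for the top-N proposals. The sketch below is one simple way to compute this; the 0.5 threshold is a conventional default assumed here, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals, ground_truth, top_n, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by any of the top_n proposals.

    `proposals` is assumed to be sorted by descending objectness score.
    """
    kept = proposals[:top_n]
    hits = sum(1 for gt in ground_truth
               if any(iou(p, gt) >= iou_threshold for p in kept))
    return hits / len(ground_truth) if ground_truth else 0.0


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(1, 1, 10, 10), (50, 50, 60, 60), (20, 20, 30, 31)]
print(recall_at(proposals, ground_truth, top_n=1),
      recall_at(proposals, ground_truth, top_n=3))
```

A good proposal method reaches high recall with a small `top_n`; raising `iou_threshold` additionally tests how accurately the proposals are localized.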

The success of object detection based on detection proposals (Van de Sande et al. 2011; Uijlings et al. 2013) has attracted broad interest (Carreira and Sminchisescu 2012; Arbeláez et al. 2014; Alexe et al. 2012; Cheng et al. 2014; Zitnick and Dollár 2014; Endres and Hoiem 2010; Krähenbühl and Koltun 2014; Manen et al. 2013). A comprehensive review of object proposal algorithms is beyond the scope of this paper, because object proposals have applications beyond object detection (Arbeláez et al. 2012; Guillaumin et al. 2014; Zhu et al. 2017b). We refer interested readers to the recent surveys (Hosang et al. 2016; Chavali et al. 2016) which provide in-depth analysis of many classical object proposal algorithms and their impact on detection performance. Our interest here is to review object proposal methods that are based on DCNNs, output class agnostic proposals, and are related to generic object detection.

In 2014, the integration of object proposals (Van de Sande et al. 2011; Uijlings et al. 2013) and DCNN features (Krizhevsky et al. 2012a) led to the milestone RCNN (Girshick et al. 2014) in generic object detection. Since then, detection proposal has quickly become a standard preprocessing step, based on the fact that all winning entries in the PASCAL VOC (Everingham et al. 2010), ILSVRC (Russakovsky et al. 2015) and MS COCO (Lin et al. 2014) object detection challenges since 2014 used detection proposals (Girshick et al. 2014; Ouyang et al. 2015; Girshick 2015; Ren et al. 2015; Zeng et al. 2017; He et al. 2017).

Among object proposal approaches based on traditional low-level cues (e.g., color, texture, edge and gradients), Selective Search (Uijlings et al. 2013), MCG (Arbeláez et al. 2014) and EdgeBoxes (Zitnick and Dollár 2014) are among the more popular. As the domain rapidly progressed, traditional object proposal approaches (Uijlings et al. 2013; Hosang et al. 2016; Zitnick and Dollár 2014), which were adopted as external modules independent of the detectors, became the speed bottleneck of the detection pipeline (Ren et al. 2015). An emerging class of object proposal algorithms (Erhan et al. 2014; Ren et al. 2015; Kuo et al. 2015; Ghodrati et al. 2015; Pinheiro et al. 2015; Yang et al. 2016a) using DCNNs has attracted broad attention.

Table 9 Summary of object proposal methods using DCNN. Bold values indicate the number of object proposals

Recent DCNN based object proposal methods generally fall into two categories: bounding box based and object segment based, with representative methods summarized in Table 9.

Bounding Box Proposal Methods are best exemplified by the Region Proposal Network (RPN) of Ren et al. (2015), illustrated in Fig. 19. RPN predicts object proposals by sliding a small network over the feature map of the last shared CONV layer. At each sliding window location, k proposals are predicted by using k anchor boxes, where each anchor box is centered at some location in the image, and is associated with a particular scale and aspect ratio. Ren et al. (2015) proposed integrating RPN and Fast RCNN into a single network by sharing their convolutional layers, leading to Faster RCNN, the first end-to-end detection pipeline. RPN has been broadly selected as the proposal method by many state-of-the-art object detectors, as can be observed from Tables 7 and 8.
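To make the anchor mechanism concrete, the following is a minimal sketch of how k anchor boxes per sliding window location could be enumerated. The function name `make_anchors`, the stride of 16, and the particular scales and aspect ratios are illustrative assumptions rather than values stated in this survey, and the sketch covers only anchor enumeration, not the network that scores and regresses them.

```python
from itertools import product


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate (x1, y1, x2, y2) anchors in image coordinates.

    Every feature-map location gets k = len(scales) * len(ratios) anchors.
    All default values are assumptions chosen for illustration.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this feature-map cell, mapped back to image coordinates.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio is height / width; the anchor area stays scale ** 2.
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    anchors = make_anchors(feat_h=2, feat_w=3)
    print(len(anchors))  # 6 locations x 9 anchors per location
```

In this sketch the number of anchors grows linearly with the feature map size, which is why the choice of stride, scales and ratios directly affects both recall and cost.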

Instead of fixing a set of anchors a priori, as in MultiBox (Erhan et al. 2014; Szegedy et al. 2014) and RPN (Ren et al. 2015), Lu et al. (2016) proposed generating anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on sub-regions likely to contain objects. Starting with the whole image, all regions visited during the search process serve as anchors. For any anchor region encountered during the search procedure, a scalar zoom indicator is used to decide whether to further partition the region, and a set of bounding boxes with objectness scores are computed by an Adjacency and Zoom Network (AZNet), which extends RPN by adding a branch to compute the scalar zoom indicator in parallel with the existing branch.

Further work attempts to generate object proposals by exploiting multilayer convolutional features. Concurrent with RPN (Ren et al. 2015), Ghodrati et al. (2015) proposed DeepProposal, which generates object proposals by using a cascade of multiple convolutional features, building an inverse cascade to select the most promising object locations and to refine their boxes in a coarse-to-fine manner. An improved variant of RPN, HyperNet (Kong et al. 2016) designs Hyper Features which aggregate multilayer convolutional features and shares them both in generating proposals and detecting objects via an end-to-end joint training strategy. Yang et al. (2016a) proposed CRAFT which also used a cascade strategy, first training an RPN network to generate object proposals and then using them to train another binary Fast RCNN network to further distinguish objects from background. Li et al. (2018a) proposed ZIP to improve RPN by predicting object proposals with multiple convolutional feature maps at different network depths to integrate both low level details and high level semantics. The backbone used in ZIP is a “zoom out and in” network inspired by the conv and deconv structure (Long et al. 2015).

Finally, recent work which deserves mention includes Deepbox (Kuo et al. 2015), which proposed a lightweight CNN to learn to rerank proposals generated by EdgeBox, and DeNet (TychsenSmith and Petersson 2017) which introduces bounding box corner estimation to predict object proposals efficiently to replace RPN in a Faster RCNN style detector.

Object Segment Proposal Methods (Pinheiro et al. 2015, 2016) aim to generate segment proposals that are likely to correspond to objects. Segment proposals are more informative than bounding box proposals, and take a step further towards object instance segmentation (Hariharan et al. 2014; Dai et al. 2016b; Li et al. 2017e). In addition, using instance segmentation supervision can improve the performance of bounding box object detection. In the pioneering work of DeepMask (Pinheiro et al. 2015), segment proposals are learnt directly from raw image data with a deep network. Similarly to RPN, after a number of shared convolutional layers DeepMask splits the network into two branches in order to predict a class agnostic mask and an associated objectness score. Also similar to the efficient sliding window strategy in OverFeat (Sermanet et al. 2014), the trained DeepMask network is applied in a sliding window manner to an image (and its rescaled versions) during inference. More recently, Pinheiro et al. (2016) proposed SharpMask by augmenting the DeepMask architecture with a refinement module, similar to the architectures shown in Fig. 17 (b1) and (b2), augmenting the feed-forward network with a top-down refinement process. SharpMask can efficiently integrate spatially rich information from early features with strong semantic information encoded in later layers to generate high fidelity object masks.

Motivated by Fully Convolutional Networks (FCN) for semantic segmentation (Long et al. 2015) and DeepMask (Pinheiro et al. 2015), Dai et al. (2016a) proposed InstanceFCN to generate instance segment proposals. Similar to DeepMask, the InstanceFCN network is split into two fully convolutional branches, one to generate instance sensitive score maps, the other to predict the objectness score. Hu et al. (2017) proposed FastMask to efficiently generate instance segment proposals in a one-shot manner, similar to SSD (Liu et al. 2016), in order to make use of multiscale convolutional features. Sliding windows extracted densely from multiscale convolutional feature maps were input to a scale-tolerant attentional head module in order to predict segmentation masks and objectness scores. FastMask is claimed to run at 13 FPS on \(800\times 600\) images.

Fig. 19 Illustration of the region proposal network (RPN) introduced in Ren et al. (2015)

9 Other Issues

Data Augmentation Performing data augmentation for learning DCNNs (Chatfield et al. 2014; Girshick 2015; Girshick et al. 2014) is generally recognized to be important for visual recognition. Trivial data augmentation refers to perturbing an image by transformations that leave the underlying category unchanged, such as cropping, flipping, rotating, scaling, translating, color perturbations, and adding noise. By artificially enlarging the number of samples, data augmentation helps in reducing overfitting and improving generalization. It can be used at training time, at test time, or both. Nevertheless, it has the obvious limitation that the time required for training increases significantly. Data augmentation may synthesize completely new training images (Peng et al. 2015; Wang et al. 2017), however it is hard to guarantee that the synthetic images generalize well to real ones. Some researchers (Dwibedi et al. 2017; Gupta et al. 2016) proposed augmenting datasets by pasting real segmented objects into natural images; indeed, Dvornik et al. (2018) showed that appropriately modeling the visual context surrounding objects is crucial to place them in the right environment, and proposed a context model to automatically find appropriate locations on images to place new objects for data augmentation.
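For detection, a label-preserving transformation must be applied to the bounding boxes as well as to the pixels. The sketch below illustrates this for horizontal flipping. It assumes an image stored as a list of pixel rows and boxes given as (x1, y1, x2, y2) in continuous pixel coordinates; the function name `hflip` and this data layout are hypothetical choices made for the example.

```python
def hflip(image, boxes):
    """Horizontally flip an image and its bounding boxes.

    image: list of rows, each row a list of pixel values.
    boxes: list of (x1, y1, x2, y2) with 0 <= x1 <= x2 <= image width.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirror x about the image width; x1 and x2 swap roles so x1 <= x2 holds.
    flipped_boxes = [(width - x2, y1, width - x1, y2)
                     for x1, y1, x2, y2 in boxes]
    return flipped_image, flipped_boxes


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
    boxes = [(0, 0, 1, 2)]  # covers the left-most column
    print(hflip(image, boxes))
```

Applying the flip twice returns the original image and boxes, which is a convenient sanity check when adding further geometric augmentations.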

Novel Training Strategies Detecting objects under a wide range of scale variations, especially the detection of very small objects, stands out as a key challenge. It has been shown (Huang et al. 2017b; Liu et al. 2016) that image resolution has a considerable impact on detection accuracy; scaling is therefore particularly commonly used in data augmentation, since higher resolutions increase the possibility of detecting small objects (Huang et al. 2017b). Recently, Singh et al. proposed the advanced and efficient data augmentation methods SNIP (Singh and Davis 2018) and SNIPER (Singh et al. 2018b) to illustrate the scale invariance problem, as summarized in Table 10. Motivated by the intuitive understanding that small and large objects are difficult to detect at smaller and larger scales, respectively, SNIP introduces a novel training scheme that can reduce scale variations during training, but without reducing training samples; SNIPER allows for efficient multiscale training, only processing context regions around ground truth objects at the appropriate scale, instead of processing a whole image pyramid. Peng et al. (2018) studied a key factor in training, the minibatch size, and proposed MegDet, a Large MiniBatch Object Detector, to enable training with a much larger minibatch size than before (from 16 to 256). To avoid the failure of convergence and significantly speed up the training process, Peng et al. (2018) proposed a learning rate policy and Cross GPU Batch Normalization, and effectively utilized 128 GPUs, allowing MegDet to finish COCO training in 4 hours and to win the COCO 2017 Detection Challenge.

Table 10 Representative methods for training strategies and class imbalance handling

Reducing Localization Error In object detection, the Intersection Over Union (IOU) between a detected bounding box and its ground truth box is the most popular evaluation metric, and an IOU threshold (typically 0.5) is required to define positives and negatives. As shown in Fig. 13, in most state-of-the-art detectors (Girshick 2015; Liu et al. 2016; He et al. 2017; Ren et al. 2015; Redmon et al. 2016) object detection is formulated as a multitask learning problem, i.e., jointly optimizing a softmax classifier which assigns object proposals with class labels and bounding box regressors, localizing objects by maximizing IOU or other metrics between detection results and ground truth. Bounding boxes are only a crude approximation for articulated objects; consequently, background pixels are almost invariably included in a bounding box, which affects the accuracy of classification and localization. The study in Hoiem et al. (2012) shows that object localization error is one of the most influential forms of error, in addition to confusion between similar objects. Localization error could stem from insufficient overlap (smaller than the required IOU threshold, such as the green box in Fig. 20) or duplicate detections (i.e., multiple overlapping detections for an object instance). Usually, some post-processing step like Non-Maximum Suppression (NMS) (Bodla et al. 2017; Hosang et al. 2017) is used for eliminating duplicate detections. However, due to misalignments the bounding box with better localization could be suppressed during NMS, leading to poorer localization quality (such as the purple box shown in Fig. 20). Therefore, there are quite a few methods aiming at improving detection performance by reducing localization error.
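The two notions used above, IOU and greedy NMS, can be sketched in a few lines. The code below is a minimal illustration under the assumption that boxes are axis-aligned (x1, y1, x2, y2) tuples; the function names and the suppression threshold of 0.5 are choices made for this example and are not tied to any particular detector.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```

Because suppression depends only on the score ranking, a better-localized box with a slightly lower score is discarded whenever it overlaps a higher-scoring neighbour, which is the failure mode described in the paragraph above.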

MRCNN (Gidaris and Komodakis 2015) introduces iterative bounding box regression, where an RCNN is applied several times. CRAFT (Yang et al. 2016a) and AttractioNet (Gidaris and Komodakis 2016) use a multi-stage detection sub-network to generate accurate proposals, to forward to Fast RCNN. Cai and Vasconcelos (2018) proposed Cascade RCNN, a multistage extension of RCNN, in which a sequence of detectors is trained sequentially with increasing IOU thresholds, based on the observation that the output of a detector trained with a certain IOU is a good distribution to train the detector of the next higher IOU threshold, in order to be sequentially more selective against close false positives. This approach can be built with any RCNN-based detector, and is demonstrated to achieve consistent gains (about 2 to 4 points) independent of the baseline detector strength, at a marginal increase in computation. There is also recent work (Jiang et al. 2018; Rezatofighi et al. 2019; Huang et al. 2019) formulating IOU directly as the optimization objective, and work proposing improved NMS methods (Bodla et al. 2017; He et al. 2019; Hosang et al. 2017; TychsenSmith and Petersson 2018), such as Soft NMS (Bodla et al. 2017) and learning NMS (Hosang et al. 2017).
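As an illustration of how Soft NMS differs from hard suppression, the sketch below decays the scores of overlapping boxes instead of removing them. It assumes the Gaussian decay form, in which a neighbour's score is multiplied by exp(-IOU^2 / sigma); the values sigma = 0.5 and the final score cut-off are assumptions for the example, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft NMS sketch: return (index, rescored value) pairs."""
    scores = list(scores)  # work on a copy; the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break  # everything left has decayed below the cut-off
        keep.append((best, scores[best]))
        for i in remaining:
            # Decay, rather than delete, boxes that overlap the selected one.
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    for index, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
        print(index, round(score, 3))
```

In this example the overlapping second box is retained with a reduced score rather than discarded, so a later threshold on the rescored values decides whether it survives.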

Fig. 20 Localization error could stem from insufficient overlap or duplicate detections. Localization error is a frequent cause of false positives (Color figure online)

Class Imbalance Handling Unlike image classification, object detection has another unique problem: the serious imbalance between the number of labeled object instances and the number of background examples (image regions not belonging to any object class of interest). Most background examples are easy negatives; however, this imbalance can make the training very inefficient, and the large number of easy negatives tends to overwhelm the training. In the past, this issue has typically been addressed via techniques such as bootstrapping (Sung and Poggio 1994). More recently, this problem has also seen some attention (Li et al. 2019a; Lin et al. 2017b; Shrivastava et al. 2016). Because the region proposal stage rapidly filters out most background regions and proposes a small number of object candidates, this class imbalance issue is mitigated to some extent in two-stage detectors (Girshick et al. 2014; Girshick 2015; Ren et al. 2015; He et al. 2017), although example mining approaches, such as Online Hard Example Mining (OHEM) (Shrivastava et al. 2016), may be used to maintain a reasonable balance between foreground and background. In the case of one-stage object detectors (Redmon et al. 2016; Liu et al. 2016), this imbalance is extremely serious (e.g. 100,000 background examples to every object). Lin et al. (2017b) proposed Focal Loss to address this by reshaping the Cross Entropy loss, such that it down-weights the loss assigned to correctly classified examples. Li et al. (2019a) studied this issue from the perspective of gradient norm distribution, and proposed a Gradient Harmonizing Mechanism (GHM) to handle it.
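The down-weighting effect of Focal Loss can be seen in a small numerical sketch for the binary case. It assumes the commonly quoted form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), with gamma = 2 and alpha = 0.25 taken as illustrative defaults; these symbols and values are not given in this survey, and the function names are hypothetical.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for predicted foreground probability p, label y."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss sketch: cross entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy negative (background scored 0.01) versus a hard one (scored 0.9).
    for p in (0.01, 0.9):
        print(p, cross_entropy(p, 0), focal_loss(p, 0))
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross entropy, so gamma controls how strongly the many easy negatives are discounted relative to hard examples.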

10 Discussion and Conclusion

Generic object detection is an important and challenging problem in computer vision and has received considerable attention. Thanks to remarkable developments in deep learning techniques, the field of object detection has dramatically evolved. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent achievements, provided a structural taxonomy for methods according to their roles in detection, summarized existing popular datasets and evaluation criteria, and discussed performance for the most representative methods. We conclude this review with a discussion of the state of the art in Sect. 10.1, an overall discussion of key issues in Sect. 10.2, and finally suggested future research directions in Sect. 10.3.

10.1 State of the Art Performance

A large variety of detectors has appeared in the last few years, and the introduction of standard benchmarks, such as PASCAL VOC (Everingham et al. 2010, 2015), ImageNet (Russakovsky et al. 2015) and COCO (Lin et al. 2014), has made it easier to compare detectors. As can be seen from our earlier discussion in Sects. 5–9, it may be misleading to compare detectors in terms of their originally reported performance (e.g. accuracy, speed), as they can differ in fundamental / contextual respects, including the following choices:

  • Meta detection frameworks, such as RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c), Mask RCNN (He et al. 2017), YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016);

  • Backbone networks such as VGG (Simonyan and Zisserman 2015), Inception (Szegedy et al. 2015; Ioffe and Szegedy 2015; Szegedy et al. 2016), ResNet (He et al. 2016), ResNeXt (Xie et al. 2017), and Xception (Chollet 2017) etc. listed in Table 6;

  • Innovations such as multilayer feature combination (Lin et al. 2017a; Shrivastava et al. 2017; Fu et al. 2017), deformable convolutional networks (Dai et al. 2017), deformable RoI pooling (Ouyang et al. 2015; Dai et al. 2017), heavier heads (Ren et al. 2016; Peng et al. 2018), and lighter heads (Li et al. 2018c);

  • Pretraining with datasets such as ImageNet (Russakovsky et al. 2015), COCO (Lin et al. 2014), Places (Zhou et al. 2017a), JFT (Hinton et al. 2015) and Open Images (Krasin et al. 2017);

  • Different detection proposal methods and different numbers of object proposals;

  • Train/test data augmentation, novel multiscale training strategies (Singh and Davis 2018; Singh et al. 2018b) etc, and model ensembling.

Although it may be impractical to compare every recently proposed detector, it is nevertheless valuable to integrate representative and publicly available detectors into a common platform and to compare them in a unified manner. There has been very limited work in this regard, except for Huang’s study (Huang et al. 2017b) of the three main families of detectors [Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c) and SSD (Liu et al. 2016)] by varying the backbone network, image resolution, and the number of box proposals.

As can be seen from Tables 7, 8, 9, 10 and 11, we have summarized the best reported performance of many methods on three widely used standard benchmarks. The results of these methods were reported on the same test benchmark, despite their differing in one or more of the aspects listed above.

Figures 3 and 21 present a very brief overview of the state of the art, summarizing the best detection results of the PASCAL VOC, ILSVRC and MSCOCO challenges; more results can be found at detection challenge websites (ILSVRC 2018; MS COCO 2018; PASCAL VOC 2018). The competition winner of the Open Images challenge object detection task achieved \(61.71\%\) mAP on the public leaderboard and \(58.66\%\) mAP on the private leaderboard, obtained by combining the detection results of several two-stage detectors including Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), FPN (Lin et al. 2017a), Deformable RCNN (Dai et al. 2017), and Cascade RCNN (Cai and Vasconcelos 2018). In summary, the backbone network, the detection framework, and the availability of large scale datasets are the three most important factors in detection accuracy. Ensembles of multiple models, the incorporation of context features, and data augmentation all help to achieve better accuracy.

In less than 5 years since AlexNet (Krizhevsky et al. 2012a) was proposed, the Top5 error on ImageNet classification (Russakovsky et al. 2015) with 1000 classes has dropped from 16% to 2%, as shown in Fig. 15. However, the mAP of the best performing detector (Peng et al. 2018) on COCO (Lin et al. 2014), trained to detect only 80 classes, is only at \(73\%\), even at 0.5 IoU, illustrating how object detection is much harder than image classification. The accuracy and robustness achieved by state-of-the-art detectors are far from satisfying the requirements of real world applications, so there remains significant room for future improvement.

Table 11 Summary of properties and performance of milestone detection frameworks for generic object detection

10.2 Summary and Discussion

With hundreds of references and many dozens of methods discussed throughout this paper, we would now like to focus on the key factors which have emerged in generic object detection based on deep learning.

(1) Detection frameworks: two stage versus one stage

In Sect. 5 we identified two major categories of detection frameworks: region based (two stage) and unified (one stage):

  • When large computational cost is allowed, two-stage detectors generally produce higher detection accuracies than one-stage ones, as evidenced by the fact that most winning approaches in well-known detection challenges are based on two-stage frameworks, because their structure is more flexible and better suited for region based classification. The most widely used frameworks are Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c) and Mask RCNN (He et al. 2017).

  • It has been shown in Huang et al. (2017b) that the detection accuracy of one-stage SSD (Liu et al. 2016) is less sensitive to the quality of the backbone network than representative two-stage frameworks.

  • One-stage detectors like YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016) are generally faster than two-stage ones, because they avoid preprocessing algorithms, use lightweight backbone networks, perform prediction with fewer candidate regions, and make the classification subnetwork fully convolutional. However, two-stage detectors can run in real time with the introduction of similar techniques. In any event, whether one stage or two, the most time consuming step is the feature extractor (backbone network) (Law and Deng 2018; Ren et al. 2015).

  • It has been shown (Huang et al. 2017b; Redmon et al. 2016; Liu et al. 2016) that one-stage frameworks like YOLO and SSD typically have much poorer performance when detecting small objects than two-stage architectures like Faster RCNN and RFCN, but are competitive in detecting large objects.

There have been many attempts to build better (faster, more accurate, or more robust) detectors by attacking each stage of the detection framework. No matter whether one, two or multiple stages, the design of the detection framework has converged towards a number of crucial design choices:

  • A fully convolutional pipeline

  • Exploring complementary information from other correlated tasks, e.g., Mask RCNN (He et al. 2017)

  • Sliding windows (Ren et al. 2015)

  • Fusing information from different layers of the backbone.

The recent success of cascades for object detection (Cai and Vasconcelos 2018; Cheng et al. 2018a, b) and instance segmentation on COCO (Chen et al. 2019a) and other challenges suggests that multistage object detection could be a future framework for a speed-accuracy trade-off. A preliminary investigation is being carried out in the 2019 WIDER Challenge (Loy et al. 2019).

(2) Backbone networks

As discussed in Sect. 6.1, backbone networks are one of the main driving forces behind the rapid improvement of detection performance, because of the key role played by discriminative object feature representation. Generally, deeper backbones such as ResNet (He et al. 2016), ResNeXt (Xie et al. 2017), and InceptionResNet (Szegedy et al. 2017) perform better; however, they are computationally more expensive and require much more data and massive computing for training. Some backbones (Howard et al. 2017; Iandola et al. 2016; Zhang et al. 2018c) focus on speed instead, such as MobileNet (Howard et al. 2017), which has been shown to achieve VGGNet16 accuracy on ImageNet with only \(\frac{1}{30}\) the computational cost and model size. Backbone training from scratch may become possible as more training data and better training strategies become available (Wu and He 2018; Luo et al. 2018, 2019).

(3) Improving the robustness of object representation

The variation of real world images is a key challenge in object recognition. The variations include lighting, pose, deformations, background clutter, occlusions, blur, resolution, noise, and camera distortions.

(3.1) Object scale and small object size

Fig. 21 Evolution of object detection performance on COCO (Test-Dev results). Results are quoted from (Girshick 2015; He et al. 2017; Ren et al. 2017). The backbone network, the design of detection framework and the availability of good and large scale datasets are the three most important factors in detection accuracy

Large variations of object scale, particularly those of small objects, pose a great challenge. Here we summarize and discuss the main strategies identified in Sect. 6.2:

  • Using image pyramids: They are simple and effective, helping to enlarge small objects and to shrink large ones. They are computationally expensive, but are nevertheless commonly used during inference for better accuracy.

  • Using features from convolutional layers of different resolutions: In early work like SSD (Liu et al. 2016), predictions are performed independently, and no information from other layers is combined or merged. Now it is quite standard to combine features from different layers, e.g. in FPN (Lin et al. 2017a).

  • Using dilated convolutions (Li et al. 2018b, 2019b): A simple and effective method to incorporate broader context and maintain high resolution feature maps.

  • Using anchor boxes of different scales and aspect ratios: This has the drawback of introducing many parameters, and the scales and aspect ratios of anchor boxes are usually determined heuristically.

  • Up-scaling: Particularly for the detection of small objects, high-resolution networks (Sun et al. 2019a, b) can be developed. It remains unclear whether super-resolution techniques improve detection accuracy or not.
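To illustrate why dilated convolutions broaden context without reducing feature map resolution, the following sketch computes the effective kernel size k + (k - 1)(d - 1) of a dilated filter and the receptive field of a stack of such layers. It assumes stride 1 throughout (so resolution is preserved); the function names and the example dilation rates are illustrative assumptions rather than settings taken from the cited methods.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k-tap filter with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked stride-1 layers given as (kernel, dilation)."""
    rf = 1
    for k, d in layers:
        # Each stride-1 layer widens the receptive field by (extent - 1).
        rf += effective_kernel(k, d) - 1
    return rf


if __name__ == "__main__":
    plain = [(3, 1), (3, 1), (3, 1)]
    dilated = [(3, 1), (3, 2), (3, 4)]
    print(receptive_field(plain), receptive_field(dilated))
```

Both stacks in the example use the same number of 3 x 3 weights, but the dilated one sees a region roughly twice as wide, which is the trade-off the bullet on dilated convolutions refers to.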

Despite recent advances, the detection accuracy for small objects is still much lower than that of larger ones. Therefore, the detection of small objects remains one of the key challenges in object detection. Perhaps localization requirements need to be generalized as a function of scale, since certain applications, e.g. autonomous driving, only require the identification of the existence of small objects within a larger region, and exact localization is not necessary.

(3.2) Deformation, occlusion, and other factors

As discussed in Sect. 2.2, there are approaches to handling geometric transformation, occlusions, and deformation mainly based on two paradigms. The first is a spatial transformer network, which uses regression to obtain a deformation field and then warp features according to the deformation field (Dai et al. 2017). The second is based on a deformable part-based model (Felzenszwalb et al. 2010b), which finds the maximum response to a part filter with spatial constraints taken into consideration (Ouyang et al. 2015; Girshick et al. 2015; Wan et al. 2015).

Rotation invariance may be attractive in certain applications, but there is limited generic object detection work focusing on rotation invariance, because popular benchmark detection datasets (PASCAL VOC, ImageNet, COCO) do not have large variations in rotation. Occlusion handling is intensively studied in face detection and pedestrian detection, but very little work has been devoted to occlusion handling for generic object detection. In general, despite recent advances, deep networks are still limited by the lack of robustness to a number of variations, which significantly constrains their real-world applications.

(4) Context reasoning

As introduced in Sect. 7, objects in the wild typically coexist with other objects and environments. It has been recognized that contextual information (object relations, global scene statistics) helps object detection and recognition (Oliva and Torralba 2007), especially for small objects, occluded objects, and images of poor quality. There was extensive work preceding deep learning (Malisiewicz and Efros 2009; Murphy et al. 2003; Rabinovich et al. 2007; Divvala et al. 2009; Galleguillos and Belongie 2010), and also quite a few works in the era of deep learning (Gidaris and Komodakis 2015; Zeng et al. 2016, 2017; Chen and Gupta 2017; Hu et al. 2018a). How to efficiently and effectively incorporate contextual information remains to be explored, possibly guided by how human vision uses context, based on scene graphs (Li et al. 2017d), or via the full segmentation of objects and scenes using panoptic segmentation (Kirillov et al. 2018).

(5) Detection proposals

Detection proposals significantly reduce search spaces. As recommended in Hosang et al. (2016), future detection proposals will surely have to improve in repeatability, recall, localization accuracy, and speed. Since the success of RPN (Ren et al. 2015), which integrated proposal generation and detection into a common framework, CNN based detection proposal generation methods have dominated region proposal. It is recommended that new detection proposals should be assessed for object detection, instead of evaluating detection proposals alone.

(6) Other factors

As discussed in Sect. 9, there are many other factors affecting object detection quality: data augmentation, novel training strategies, combinations of backbone models, multiple detection frameworks, incorporating information from other related tasks, methods for reducing localization error, handling the huge imbalance between positive and negative samples, mining of hard negative samples, and improving loss functions.

10.3 Research Directions

Despite the recent tremendous progress in the field of object detection, the technology remains significantly more primitive than human vision and cannot yet satisfactorily address real-world challenges like those of Sect. 2.2. We see a number of long-standing challenges:

  • Working in an open world: being robust to any number of environmental changes, being able to evolve or adapt.

  • Object detection under constrained conditions: learning from weakly labeled data or few bounding box annotations, wearable devices, unseen object categories etc.

  • Object detection in other modalities: video, RGBD images, 3D point clouds, lidar, remotely sensed imagery etc.

Based on these challenges, we see the following directions of future research:

(1) Open World Learning The ultimate goal is to develop object detection capable of accurately and efficiently recognizing and localizing instances in thousands or more object categories in open-world scenes, at a level competitive with the human visual system. Object detection algorithms are unable, in general, to recognize object categories outside of their training dataset, although ideally there should be the ability to recognize novel object categories (Lake et al. 2015; Hariharan and Girshick 2017). Current detection datasets (Everingham et al. 2010; Russakovsky et al. 2015; Lin et al. 2014) contain only a few dozen to hundreds of categories, significantly fewer than those which can be recognized by humans. New larger-scale datasets (Hoffman et al. 2014; Singh et al. 2018a; Redmon and Farhadi 2017) with significantly more categories will need to be developed.

(2) Better and More Efficient Detection Frameworks One of the reasons for the success in generic object detection has been the development of superior detection frameworks, both region-based [RCNN (Girshick et al. 2014), Fast RCNN (Girshick 2015), Faster RCNN (Ren et al. 2015), Mask RCNN (He et al. 2017)] and one-stage detectors [YOLO (Redmon et al. 2016), SSD (Liu et al. 2016)]. Region-based detectors have higher accuracy, one-stage detectors are generally faster and simpler. Object detectors depend heavily on the underlying backbone networks, which have been optimized for image classification, possibly causing a learning bias; learning object detectors from scratch could be helpful for new detection frameworks.

(3) Compact and Efficient CNN Features CNNs have increased remarkably in depth, from several layers [AlexNet (Krizhevsky et al. 2012b)] to hundreds of layers [ResNet (He et al. 2016), DenseNet (Huang et al. 2017a)]. These networks have millions to hundreds of millions of parameters, requiring massive data and GPUs for training. In order to reduce or remove network redundancy, there has been growing research interest in designing compact and lightweight networks (Chen et al. 2017a; Alvarez and Salzmann 2016; Huang et al. 2018; Howard et al. 2017; Lin et al. 2017c; Yu et al. 2018) and network acceleration (Cheng et al. 2018c; Hubara et al. 2016; Han et al. 2016; Li et al. 2017a, c; Wei et al. 2018).

(4) Automatic Neural Architecture Search Deep learning bypasses manual feature engineering which requires human experts with strong domain knowledge, however DCNNs require similarly significant expertise. It is natural to consider automated design of detection backbone architectures, such as the recent Automated Machine Learning (AutoML) (Quanming et al. 2018), which has been applied to image classification and object detection (Cai et al. 2018; Chen et al. 2019c; Ghiasi et al. 2019; Liu et al. 2018a; Zoph and Le 2016; Zoph et al. 2018).

(5) Object Instance Segmentation For a richer and more detailed understanding of image content, there is a need to tackle pixel-level object instance segmentation (Lin et al. 2014; He et al. 2017; Hu et al. 2018c), which can play an important role in potential applications that require the precise boundaries of individual objects.

(6) Weakly Supervised Detection Current state-of-the-art detectors employ fully supervised models learned from labeled data with object bounding boxes or segmentation masks (Everingham et al. 2015; Lin et al. 2014; Russakovsky et al. 2015). However, fully supervised learning has serious limitations, particularly where the collection of bounding box annotations is labor intensive and where the number of images is large. Fully supervised learning is not scalable in the absence of fully labeled training data, so it is essential to understand how the power of CNNs can be leveraged where only weakly / partially annotated data are provided (Bilen and Vedaldi 2016; Diba et al. 2017; Shi et al. 2017).

(7) Few / Zero Shot Object Detection The success of deep detectors relies heavily on gargantuan amounts of annotated training data. When labeled data are scarce, the performance of deep detectors frequently deteriorates and they fail to generalize well. In contrast, humans (even children) can learn a visual concept quickly from very few given examples and can often generalize well (Biederman 1987b; Lake et al. 2015; FeiFei et al. 2006). Therefore, the ability to learn from only a few examples, few shot detection, is very appealing (Chen et al. 2018a; Dong et al. 2018; Finn et al. 2017; Kang et al. 2018; Lake et al. 2015; Ren et al. 2018; Schwartz et al. 2019). Even more constrained, zero shot object detection localizes and recognizes object classes that have never been seen before (Bansal et al. 2018; Demirel et al. 2018; Rahman et al. 2018b, a), which is essential for life-long learning machines that need to intelligently and incrementally discover new object categories.

(8) Object Detection in Other Modalities Most detectors are based on still 2D images; object detection in other modalities can be highly relevant in domains such as autonomous vehicles, unmanned aerial vehicles, and robotics. These modalities raise new challenges in effectively using depth (Chen et al. 2015c; Pepik et al. 2015; Xiang et al. 2014; Wu et al. 2015), video (Feichtenhofer et al. 2017; Kang et al. 2016), and point clouds (Qi et al. 2017, 2018).

(9) Universal Object Detection Recently, there has been increasing effort in learning universal representations, those which are effective in multiple image domains, such as natural images, videos, aerial images, and medical CT images (Rebuffi et al. 2017, 2018). Most such research focuses on image classification, rarely targeting object detection (Wang et al. 2019), and developed detectors are usually domain specific. Object detection independent of image domain and cross-domain object detection represent important future directions.

The research field of generic object detection is still far from complete. However, given the breakthroughs over the past 5 years, we are optimistic about future developments and opportunities.