BING: Binarized Normed Gradients for Objectness Estimation at 300fps

Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with well-defined closed boundaries can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows into a small fixed size. Based on this observation, and for computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300 fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. By increasing the number of proposals and the color spaces used for computing BING features, our performance can be further improved to 99.5% DR.


Introduction
As suggested in the pioneering research [3,4], objectness is usually represented as a value which reflects how likely an image window covers an object of any category.
Especially for object detection, proposal-based detectors have dominated recent state-of-the-art performance. Compared with sliding windows, objectness measures can significantly improve: i) computational efficiency, by reducing the search space, and ii) system accuracy, by allowing the use of complex subsequent processing during testing.
However, designing a good generic objectness measure method is difficult. Such a method should:
• achieve a high object detection rate (DR), as any objects missed at this stage cannot be recovered later;
• attain high proposal localization accuracy, measured by the average best overlap (ABO) for each object in each class and the mean average best overlap (MABO) across all classes;
• obtain high computational efficiency, so that the method can be easily incorporated into various applications, especially real-time and large-scale ones;
• produce a small number of proposals, to reduce the computational time of subsequent processing;
• have good generalization ability to unseen object categories, so that the proposals can be reused by various vision tasks without category bias.
To the best of our knowledge, no prior method satisfies all these ambitious goals simultaneously.
Research from cognitive psychology [74,79] and neurobiology [25,48] suggests that humans have a strong ability to perceive objects before identifying them.
Based on observed human reaction times and estimated biological signal transmission times, human attention theories hypothesize that the human visual system processes only parts of an image in detail, while leaving others nearly unprocessed. This further suggests that before identifying objects, simple mechanisms in the human visual system select possible object locations.
In this paper, we propose a surprisingly simple and powerful feature, "BING", to help the search for objects using objectness scores. Our work is motivated by the fact that objects are stand-alone things with well-defined closed boundaries and centers [4,31,40], although the visibility of these boundaries depends on the characteristics of the background or occluding foreground objects. We observe that generic objects with well-defined closed boundaries share surprisingly strong correlation in terms of the norm of their gradients (see Fig. 1 and Sec. 3), after resizing their corresponding image windows to a small fixed size (e.g. 8 × 8). Therefore, in order to efficiently quantify the objectness of an image window, we resize it to 8 × 8 and use the norm of the gradients as a simple 64D feature for learning a generic objectness measure in a cascaded SVM framework. We further show how the binarized version of the NG feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation of image windows, requiring only a few atomic CPU operations (i.e. ADD, BITWISE SHIFT, etc.). The BING feature's simplicity, combined with advanced speed-up techniques to keep the computational time tractable, contrasts with recent state-of-the-art techniques [4,26,75], which seek increasingly sophisticated features to obtain greater discrimination.
The original conference version of BING [19] has received much attention.
Its efficiency and high detection rate make BING a good choice for a large number of successful applications that require category-independent object proposals [53,62,64,78,80-82]. Recently, deep neural network based object proposal generation methods have become very popular due to their high recall and computational efficiency, e.g. RPN [70], YOLO9000 [68] and SSD [58]. However, these methods generalize poorly to unseen categories, and rely on training with many ground-truth annotations for the target classes. For instance, the object proposals detected by RPN are highly related to the training data: when trained on the PASCAL VOC dataset [27], the model will aim to detect only the 20 object classes in PASCAL VOC and perform poorly on another dataset such as MS COCO (see Sec. 5.4). This poor generalization ability restricts its usage, so RPN is usually only used for object detection. By contrast, BING is built on low-level cues about enclosed boundaries and thus can produce category-independent object proposals, with demonstrated applications in multi-label image classification [78], semantic segmentation [64], video classification [81], co-salient object detection [82], deep multi-instance learning [80], and video summarisation [53]. However, several researchers [41,65,86,90] have noted that BING's proposal localization is weak.
This manuscript further improves the proposal localization of the conference version [19] by applying multi-thresholding straddling expansion (MTSE) [15] as a postprocessing step.
The standard MTSE would introduce a significant computational bottleneck because of its image segmentation step. We therefore propose a novel image segmentation method that generates accurate segments much more efficiently. Our approach starts with a GPU version of the SLIC method [2,69] to quickly obtain initial seed regions (superpixels) by over-segmentation. A region merging process is then performed based on the average pixel distance. We replace [30] in MTSE with this novel grouping method [16], and dub the new proposal system BING-E.
We have extensively evaluated our objectness methods on the PASCAL VOC2007 [27] and Microsoft COCO [56] datasets. The experimental results show that our method efficiently (300 fps for BING and 200 fps for BING-E) generates a small set of data-driven, category-independent, high-quality object windows. BING achieves a 96.2% detection rate (DR) with 1,000 windows at an intersection-over-union (IoU) threshold of 0.5. At the increased IoU threshold of 0.7, BING-E obtains 81.4% DR and 78.6% MABO. Feeding the proposals to the fast R-CNN [32] framework for an object detection task, BING-E achieves 67.4% mean average precision (mAP). Following [4,26,75], we also verify the generalization ability of our method: when training our objectness measure on the VOC2007 training set and testing on the challenging COCO validation set, our method still achieves competitive performance. Compared to the most popular alternatives [4,26,44,49,50,61,65-67,75,85,90], our method achieves competitive performance using a smaller set of proposals, while being 100-1,000 times faster. Thus, our proposed method attains very high efficiency while producing state-of-the-art generic object proposals, fulfilling the previously stated requirements for a good objectness detector. Our source code will be published with the paper.

Related Works
Being able to perceive objects before identifying them is closely related to bottom up visual attention (saliency).According to how saliency is defined, we broadly classify the related research into three categories: fixation prediction, salient object detection, and objectness proposal generation.
Inspired by neurobiology research on the early primate visual system, Itti et al. [45] proposed one of the first computational models for saliency detection, which estimates center-surrounded differences across multi-scale image features. Ma and Zhang [60] proposed a fuzzy growing model to analyze local-contrast-based saliency. Harel et al. [36] proposed normalizing center-surrounded feature maps to highlight conspicuous parts.
Although fixation point prediction models have made remarkable progress, their predictions tend to highlight edges and corners rather than entire objects. Thus, these models are not suitable for generating generic object proposals.
Salient object detection models try to detect the most attention-grabbing object in a scene, and then segment the whole extent of that object [6,7,55]. Liu et al. [57] combined local, regional, and global saliency measurements in a CRF framework. Achanta et al. [1] localized salient regions using a frequency-tuned approach. Cheng et al. [18] proposed a salient object detection and segmentation method based on region contrast analysis and iterative graph based segmentation. More recent research has also tried to produce high quality saliency maps in a filtering based framework [63]. Such salient object segmentation for simple images has achieved great success in image scene analysis [20,54,87] and content aware image editing [83,89], and it can be used as a cheap tool to process a large number of Internet images or build robust applications [12,13,21,37,42,43] by automatically selecting good results [17,18]. However, these approaches are less likely to work for complicated images where many objects are present and rarely dominant (e.g. PASCAL VOC images).
Objectness proposal generation methods avoid making decisions early on, by proposing a small number (e.g. 1,000) of category-independent proposals that are expected to cover all objects in an image [4,26,75]. Producing rough segmentations [10,26] as object proposals has been shown to be an effective way of reducing search spaces for category-specific classifiers, whilst allowing the use of strong classifiers to improve accuracy. However, such methods [10,26] are very computationally expensive. Alexe et al. [4] proposed a cue integration approach to get better prediction performance more efficiently. Broadly speaking, there are two main categories of object proposal generation methods: region based methods and edge based methods.
Region based object proposal generation methods mainly look for sets of regions produced by image segmentation and use the bounding boxes of these sets of regions to generate object proposals.Since image segmentation aims to cluster pixels into regions that are expected to represent objects or object-parts, merging together some regions is likely to find complete objects.A large literature has focused on this aspect.Uijlings et al. [75] proposed a selective search approach, which combined the strength of both an exhaustive search and segmentation, to achieve higher prediction performance.Pont-Tuset et al. [65] proposed a multiscale segmenter to generate segmentation hierarchies, and then explored the combinatorial space of these hierarchical regions to produce high-quality object proposals.Some other well-known algorithms [26,50,61,66,67] fall into this category as well.
Edge based object proposal generation approaches use edges to explore where complete objects occur in an image. As pointed out in [4], complete objects usually have well-defined closed boundaries in space, and some methods have achieved high performance using this intuitive cue. Zitnick et al. [90] proposed a simple box objectness score measuring the number of contours wholly enclosed by a bounding box, generating object bounding box proposals directly from edges in an efficient way. Lu et al. [59] proposed a closed contour measure defined using a closed path integral. Zhang et al. [85] proposed a cascaded ranking SVM approach with an oriented gradient feature for efficient proposal generation.

BING for Objectness Measure
Inspired by the ability of the human visual system which efficiently perceives objects before identifying them [25,48,74,79], we introduce a simple 64D norm of the gradients (NG) feature (Sec.3.1), as well as its binary approximation, i.e. the binarized normed gradients (BING) feature (Sec.3.3), for efficiently capturing the objectness of an image window.
To find generic objects within an image, we scan over a predefined set of quantized window sizes (scales and aspect ratios). Each window is scored with a linear model w ∈ R^64 (Sec. 3.2):

s_l = ⟨w, g_l⟩,    (1)
l = (i, x, y),    (2)

where s_l, g_l, l, i and (x, y) are the filter score, NG feature, location, size and position of a window, respectively. Using non-maximal suppression (NMS), we select a small set of proposals from each size i. Zhao et al. [86] show that this choice of window sizes along with NMS is close to optimal. Some sizes (e.g. 10 × 500) are less likely than others (e.g. 100 × 100) to contain an object instance. Thus we define the objectness score (i.e. the calibrated filter score) as

o_l = v_i · s_l + t_i,    (3)

where v_i, t_i ∈ R are learnt coefficient and bias terms for each quantised size i (Sec. 3.2). Note that calibration using Eq. (3), although very fast, is only required when re-ranking the small set of final proposals.
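The per-window scoring and calibration described above can be sketched in a few lines (a toy illustration; `score_window` and `calibrate` are hypothetical names, not from the authors' code, and the values are made up):

```python
import numpy as np

def score_window(w, g):
    """Filter score s_l = <w, g_l> for one window (Eq. 1)."""
    return float(np.dot(w, g))

def calibrate(s, v_i, t_i):
    """Objectness score o_l = v_i * s_l + t_i (Eq. 3), per window size i."""
    return v_i * s + t_i

w = np.ones(64) / 64.0   # toy 64D linear model
g = np.full(64, 128.0)   # toy 64D NG feature (byte-valued gradients)
s = score_window(w, g)   # filter score of this window
o = calibrate(s, v_i=0.5, t_i=-1.0)  # size-calibrated objectness score
```

In the full pipeline, `calibrate` would only be applied to the small set of NMS-selected proposals, as the text notes.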

Normed gradients (NG) and objectness
Objects are stand-alone things with well-defined closed boundaries and centers [4,31,40], although the visibility of these boundaries depends on the characteristics of the background or occluding foreground objects.
When resizing windows corresponding to real world objects to a small fixed size (e.g. 8 × 8, chosen for computational reasons that will be explained in Sec. 3.3), the norm (i.e. magnitude) of the corresponding image gradients becomes a good discriminative feature, because of the limited variation that closed boundaries can present in such an abstracted view. As demonstrated in Fig. 1, although the cruise ship and the person have huge differences in terms of color, shape, texture, illumination, etc., they share clear similarity in normed gradient space. To utilize this observation for efficiently predicting the existence of object instances, we first resize the input image to different quantized sizes and calculate the normed gradients of each resized image. The values in an 8 × 8 region of these resized normed gradient maps are defined as a 64D normed gradients (NG) feature of the corresponding window.
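A minimal sketch of computing such an NG feature follows. The nearest-neighbour resizing here is an assumption for illustration; the [−1, 0, 1] gradient kernel and the min(|g_x| + |g_y|, 255) norm come from the implementation details given later in the text:

```python
import numpy as np

def ng_feature(window):
    """64D NG feature: resize a grayscale window to 8x8 (nearest
    neighbour, to stay dependency-free), then take min(|gx|+|gy|, 255)."""
    h, w = window.shape
    ys = (np.arange(8) * h) // 8          # sampled row indices
    xs = (np.arange(8) * w) // 8          # sampled column indices
    small = window[np.ix_(ys, xs)].astype(np.int32)
    gx = np.zeros_like(small)
    gy = np.zeros_like(small)
    gx[:, 1:-1] = small[:, 2:] - small[:, :-2]   # [-1, 0, 1] kernel
    gy[1:-1, :] = small[2:, :] - small[:-2, :]
    ng = np.minimum(np.abs(gx) + np.abs(gy), 255)
    return ng.reshape(64)

flat = ng_feature(np.zeros((32, 32)))  # uniform window: all-zero feature
step = np.zeros((16, 16))
step[:, 8:] = 200                      # vertical step edge
edgy = ng_feature(step)                # strong response along the edge
```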
Our NG feature, as a dense and compact objectness feature for an image window, has several advantages. Firstly, no matter how an object changes its position, scale or aspect ratio, its corresponding NG feature remains roughly unchanged, because the region for computing the feature is normalized. In other words, NG features are insensitive to changes of translation, scale and aspect ratio, which is very useful for detecting objects of arbitrary categories; such insensitivity is exactly what a good objectness proposal generation method should have. Secondly, the dense, compact representation of the NG feature makes it very efficient to compute and verify, giving it great potential for real-time applications.
The cost of introducing such advantages to the NG feature is the loss of discriminative ability.However, this is not a problem as BING can be used as a pre-filter, and the resulting false-positives will be processed and eliminated by subsequent category specific detectors.In Sec. 5, we show that our method results in a small set of high quality proposals that cover 96.2% of the true object windows in the challenging VOC2007 dataset.

Learning objectness measurement with NG
To learn an objectness measure of image windows, we follow the two-stage cascaded SVM approach [85].
Stage I. We learn a single model w for Eq. (1) using a linear SVM [28]. NG features of the ground-truth object windows and randomly sampled background windows are used as positive and negative training samples, respectively.
Stage II. To learn v_i and t_i in Eq. (3) using a linear SVM [28], we evaluate Eq. (1) at size i for training images and use the selected (NMS) proposals as training samples, their filter scores as 1D features, and check their labelling using training image annotations (see Sec. 5 for evaluation criteria).
Discussion. As illustrated in Fig. 1d, the learned linear model w (see Sec. 5 for experimental settings) looks similar to the multi-size center-surrounded patterns [45] hypothesized as a biologically plausible architecture in primates [34,48,79].
The large weights along the borders of w favor a boundary that separates an object (center) from its background (surround). Compared to manually designed center-surround patterns [45], our learned w captures a more sophisticated natural prior. For example, lower object regions are more often occluded than upper parts; this is reflected by w placing less confidence in the lower regions.

Binarized normed gradients (BING)
To make use of recent advances in binary model approximation [35,88], we describe an accelerated version of the NG feature, namely binarized normed gradients (BING), to speed up the feature extraction and testing process. Our learned linear model w ∈ R^64 can be approximated with a set of basis vectors w ≈ Σ_{j=1}^{N_w} β_j a_j using Alg. 1, where N_w denotes the number of basis vectors, a_j ∈ {−1, 1}^64 denotes a basis vector, and β_j ∈ R denotes the corresponding coefficient.

Algorithm 1 Binary approximate model w [35].
Input: w, N_w
Output: {β_j}, {a_j}
Initialize residual: ε = w
for j = 1 to N_w do
  a_j = sign(ε)
  β_j = ⟨a_j, ε⟩ / ‖a_j‖²
  ε ← ε − β_j a_j  (update residual)
end for

By further representing each a_j using a binary vector and its complement, a_j = a_j^+ − (complement of a_j^+), where a_j^+ ∈ {0, 1}^64, a binarized feature b can be tested using fast BITWISE AND and BIT COUNT operations (see [35]). The key challenge is how to binarize and calculate our NG features efficiently. We approximate the normed gradient values (each saved as a byte value) using the top N_g binary bits of the byte values. Thus, a 64D NG feature g_l can be approximated by N_g binarized normed gradients (BING) features as

g_l ≈ Σ_{k=1}^{N_g} 2^{8−k} b_{k,l}.    (4)

Notice that these BING features have different weights according to their corresponding bit positions in the byte values.
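A sketch of this greedy binary approximation, following the description in [35]; where the text leaves the coefficient and residual updates implicit, the standard least-squares choices are assumed:

```python
import numpy as np

def binary_approx(w, n_w):
    """Approximate w by sum_j beta_j * a_j with a_j in {-1, +1}^d.
    Greedy: take the sign of the residual, fit the best scalar
    coefficient, subtract, repeat (cf. Alg. 1 of the text)."""
    eps = np.asarray(w, dtype=float).copy()
    betas, bases = [], []
    for _ in range(n_w):
        a = np.sign(eps)
        a[a == 0] = 1.0                        # break ties toward +1
        beta = np.dot(a, eps) / np.dot(a, a)   # least-squares coefficient
        betas.append(beta)
        bases.append(a)
        eps = eps - beta * a                   # update residual
    return betas, bases

w = np.array([0.9, -1.1, 0.5, -0.4])           # toy 4D "model"
betas, bases = binary_approx(w, 2)
approx = sum(b * a for b, a in zip(betas, bases))
```

Each added basis vector reduces the residual, so the approximation error shrinks as N_w grows.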
Naively, obtaining an 8 × 8 BING feature requires a loop accessing 64 positions. By exploiting two special characteristics of an 8 × 8 BING feature, we develop a fast BING feature calculation algorithm (Alg. 2), which uses atomic updates (BITWISE SHIFT and BITWISE OR) to avoid the loop. First, a BING feature b_{x,y} and its last row r_{x,y} are saved in a single INT64 and a BYTE variable, respectively. Second, adjacent BING features and their rows have a simple cumulative relation. As shown in Fig. 2 and Alg. 2, the BITWISE SHIFT operator shifts r_{x−1,y} by one bit, automatically dropping the bit which does not belong to r_{x,y}, and makes room to insert the new bit b_{x,y} using the BITWISE OR operator. Similarly, BITWISE SHIFT shifts b_{x,y−1} by 8 bits, automatically dropping the bits which do not belong to b_{x,y}, and makes room to insert r_{x,y}. (Fig. 2 illustrates these variables: a BING feature b_{x,y}, its last row r_{x,y}, and its last element; note that the subscripts i, x, y, l, k, introduced in Eq. (2) and Eq. (5), are locations of the whole vector rather than indices of vector elements. Using a single atomic variable (INT64 and BYTE) to represent a BING feature and its last row enables efficient feature computation.)
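The cumulative update can be illustrated in plain Python, with bit masks emulating the fixed-width BYTE and INT64 registers (`bing_patterns` and the list-of-lists input format are illustrative choices, not the authors' implementation):

```python
def bing_patterns(bits):
    """bits: HxW grid of 0/1 binarized gradient bits.
    Returns dict (x, y) -> 64-bit BING pattern of the 8x8 window whose
    bottom-right corner is (x, y), built with only SHIFT and OR updates."""
    H, W = len(bits), len(bits[0])
    r = [[0] * W for _ in range(H)]   # last-row byte per position
    b = [[0] * W for _ in range(H)]   # packed 8x8 pattern per position
    out = {}
    for y in range(H):
        for x in range(W):
            left = r[y][x - 1] if x > 0 else 0
            # shift the row byte left, OR in the new bit, keep 8 bits
            r[y][x] = ((left << 1) | bits[y][x]) & 0xFF
            up = b[y - 1][x] if y > 0 else 0
            # shift the packed pattern by 8 bits, OR in the new row byte
            b[y][x] = ((up << 8) | r[y][x]) & 0xFFFFFFFFFFFFFFFF
            if x >= 7 and y >= 7:
                out[(x, y)] = b[y][x]
    return out
```

Each position is thus computed from just its left and upper neighbours, matching the O(N) time and space argument later in the text.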
Our efficient BING feature calculation shares the cumulative nature with the integral image representation [76].Instead of calculating a single scalar value over an arbitrary rectangle range [76], our method uses a few atomic operations (e.g.add, bitwise, etc.) to calculate a set of binary patterns over an 8 × 8 fixed range.
The filter score Eq. (1) of an image window corresponding to BING features b_{k,l} can be efficiently tested using

s_l ≈ Σ_{j=1}^{N_w} β_j Σ_{k=1}^{N_g} C_{j,k},    (6)

where C_{j,k} = 2^{8−k} (2⟨a_j^+, b_{k,l}⟩ − |b_{k,l}|) can be tested using fast BITWISE AND and POPCNT SSE operators.
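The text notes that these binarized inner products reduce to fast BITWISE AND and POPCNT operations, via the identity ⟨a_j, b⟩ = 2·popcount(a_j⁺ & b) − popcount(b). A hedged sketch (function names and toy inputs are illustrative):

```python
def popcount(x):
    """Number of set bits; stands in for the POPCNT instruction."""
    return bin(x).count("1")

def binary_dot(a_plus, b):
    """<a_j, b> where a_j in {-1,+1}^64 is stored as its positive-bit
    mask a_plus: equals 2*popcount(a_plus & b) - popcount(b)."""
    return 2 * popcount(a_plus & b) - popcount(b)

def bing_score(betas, a_pluses, b_bits, n_g=4):
    """Approximate filter score:
    sum_j beta_j * sum_k 2^(8-k) * <a_j, b_{k,l}>."""
    s = 0.0
    for beta, ap in zip(betas, a_pluses):
        for k in range(1, n_g + 1):
            s += beta * (2 ** (8 - k)) * binary_dot(ap, b_bits[k - 1])
    return s
```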
Implementation details. We use the 1-D kernel [−1, 0, 1] to compute image gradients g_x and g_y in the horizontal and vertical directions, calculate normed gradients as min(|g_x| + |g_y|, 255), and save them as byte values. By default, we calculate gradients in RGB color space.

Enhancing BING with Region Cues
BING is not only very efficient but can also achieve a high object detection rate. However, its performance in terms of ABO and MABO is disappointing, and when BING proposals are fed to object detection frameworks such as fast R-CNN, detection accuracy is also poor. This suggests that BING's proposal localization quality is weak.
Two reasons may cause this. On the one hand, given an object, BING tries to capture its closed boundary by resizing it to a small fixed size and placing larger weights at the most probable positions; but object shapes vary, so the closed boundaries of different objects are mapped to different positions in the fixed-size windows, and the learned NG model cannot adequately represent this variability. On the other hand, BING tests only a limited set of quantized window sizes, whereas object sizes vary continuously. Thus, to some extent, bounding boxes generated by BING are unable to tightly cover all objects.
In order to improve the unsatisfactory localization quality caused by the above factors, we consider multi-thresholding straddling expansion (MTSE) [15], an effective method for refining object proposals using segments. Given an image and corresponding initial bounding boxes, MTSE first aligns boxes with potential object boundaries preserved by superpixels, and then performs multi-thresholding expansion with respect to the superpixels straddling each box. In this way, each bounding box tightly covers a set of internal superpixels, and the localization quality of the proposals is significantly improved. However, the MTSE algorithm is slow, and its bottleneck is the segmentation step [30]. We therefore use a new fast image segmentation method [16] to replace the segmentation method in MTSE.
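To make the straddling-expansion idea concrete, here is a much-simplified sketch under strong assumptions: superpixels are axis-aligned rectangles and a single inclusion threshold is used, whereas real MTSE operates on arbitrary segments with multiple thresholds. All names are illustrative:

```python
def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) rectangles."""
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def expand_box(box, superpixels, thresh=0.5):
    """Grow `box` to cover every superpixel that has at least `thresh`
    of its area inside the original box (a toy stand-in for one MTSE
    expansion pass at one threshold)."""
    x1, y1, x2, y2 = box
    for sp in superpixels:
        area = (sp[2] - sp[0]) * (sp[3] - sp[1])
        if area and overlap_area(box, sp) / area >= thresh:
            x1, y1 = min(x1, sp[0]), min(y1, sp[1])
            x2, y2 = max(x2, sp[2]), max(y2, sp[3])
    return (x1, y1, x2, y2)
```

A superpixel mostly outside the box leaves it unchanged; one mostly inside pulls the box outward to enclose it, which is the straddling intuition.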
Recently, SLIC [2] has become a popular superpixel generation method because of its efficiency, and the GPU version of SLIC (i.e. gSLICr) [69] can reach 250 fps. However, SLIC aims to generate small superpixels and is not good at producing large image segments.
In the MTSE algorithm, large image segments are needed to ensure accuracy, so it is not straightforward to apply SLIC within MTSE. However, the high efficiency of SLIC makes it a good starting point for developing new segmentation methods. We first use gSLICr to segment an image into many small superpixels. Then, we view each superpixel as a node whose color is the average color of all its pixels, with the distance between two adjacent nodes computed as the Euclidean distance between their color values. Finally, we feed these nodes into the graph-based segmentation method of [16] to produce the final image segmentation.
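A toy sketch of this merging stage follows: superpixels become graph nodes carrying their mean color, and adjacent nodes are merged when close in color. A single-threshold union-find is used here as a simplified stand-in for the Felzenszwalb-style graph segmentation of [16]; this is not the authors' implementation:

```python
import math

def merge_superpixels(mean_colors, edges, max_dist):
    """mean_colors: list of (r, g, b) per superpixel node.
    edges: list of (i, j) adjacencies between superpixels.
    Merges adjacent nodes whose Euclidean color distance <= max_dist;
    returns a segment label per node (union-find with path halving)."""
    parent = list(range(len(mean_colors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        if math.dist(mean_colors[i], mean_colors[j]) <= max_dist:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(mean_colors))]

# three superpixels in a chain: two similar reds, one very different
labels = merge_superpixels(
    [(0, 0, 0), (10, 0, 0), (200, 0, 0)], [(0, 1), (1, 2)], max_dist=20)
```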
We employ the full MTSE pipeline, modified to use our new segmentation algorithm, and reduce the computation time from 0.15 seconds to 0.0014 seconds per image. Incorporating this improved version of MTSE as a post-processing enhancement step of BING, we obtain a new proposal system, which we call BING-E.

Evaluation
We extensively evaluate our method on the challenging PASCAL VOC2007 [27] and Microsoft COCO [56] datasets, comparing against [26], Objectness [4], GOP [49], LPO [50], Rahtu [66], RandomPrim [61], Rantalankila [67], and SelectiveSearch [75] using publicly available code. All the parameters of these methods are set to their default values, except for [49], for which we use (180, 9) as highlighted on the author's homepage. To make the comparison fair, all the methods except the deep learning based RPN [70] are tested on the same device, with an Intel i7-6700k CPU and an NVIDIA GeForce GTX 970 GPU, with data parallelization enabled. For RPN, we use an NVIDIA GeForce GTX TITAN X GPU. Since objectness is often used as a preprocessing step to reduce the number of windows subsequent processing needs to consider, generating too many proposals defeats this purpose; we therefore use only the top 1,000 proposals for comparison. In order to evaluate the generalization ability of each method, we test the methods on the COCO validation dataset using the same parameters as on VOC2007, without retraining. Since at least 60 categories in COCO differ from those in VOC2007, COCO is a good choice for testing the generalization ability of proposal methods.

Experimental Setup
Discussion of BING. As shown in Tab. 1, with the binary approximation to the learned linear filter (Sec. 3.3) and BING features, computing the response score for each image window needs only a fixed, small number of atomic operations. It is easy to see that the number of positions at each quantized scale and aspect ratio is O(N), where N is the number of pixels in the image. Thus, computing response scores at all scales and aspect ratios also has computational complexity O(N). Furthermore, the BING feature and response score at each potential position (i.e. image window) can be calculated from the information at its 2 neighboring positions (i.e. left and above), so the space complexity is also O(N).
For training, we flip the images and the corresponding annotations.
The positive samples are boxes whose IoU overlap with a ground-truth box is at least 0.5, while negative sample boxes have a maximum IoU overlap with ground truth of less than 0.5. In addition, window sizes with overly large aspect ratios are ignored, because VOC2007 contains too few training samples (fewer than 50) for each of them. Our training on 2501 images (VOC2007) takes only 20 seconds (excluding XML loading time). We further illustrate in Tab. 2 how different approximation levels influence the result quality. Based on this comparison, all further experiments use N_w = 2, N_g = 4.
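The IoU-based labelling rule above can be sketched as follows (helper names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def label_samples(boxes, gt_boxes, thr=0.5):
    """True (positive) if a box's best IoU over ground truth >= thr."""
    return [max(iou(b, g) for g in gt_boxes) >= thr for b in boxes]

labels = label_samples([(0, 0, 10, 10), (20, 20, 30, 30)],
                       [(0, 0, 10, 10)])
```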
Implementation details of BING-E. In the implementation of BING-E, we find that removing small BING windows, with W_o < 30 or H_o < 30, hardly degrades the proposal quality of BING-E while halving the runtime spent on the BING step. When using gSLICr [69] to segment images into superpixels, we set the expected superpixel size to 4 × 4. In the graph-based segmentation system [16,30], we use the scale parameter k = 120, and the minimum number of superpixels in each produced segment is set to 6. We use the default multi-thresholds of MTSE, i.e. {0.1, 0.2, 0.3, 0.4, 0.5}. After refinement, non-maximal suppression (NMS) with an IoU threshold of 0.8 is performed to obtain the final boxes. All the following experiments use these settings.
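The final greedy NMS step mentioned above can be sketched as follows (an illustrative implementation; the box format and helper names are assumptions):

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    ua = ((a[2] - a[0]) * (a[3] - a[1])
          + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / ua if ua else 0.0

def nms(boxes, thr=0.8):
    """Greedy NMS: visit boxes by descending score, keep a box only if
    its IoU with every already-kept box is below thr.
    boxes: list of (x1, y1, x2, y2, score)."""
    keep = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(box_iou(b[:4], k[:4]) < thr for k in keep):
            keep.append(b)
    return keep

kept = nms([(0, 0, 10, 10, 0.9), (1, 0, 11, 10, 0.8),
            (50, 50, 60, 60, 0.7)])
```

With the 0.8 threshold above, the second box (IoU ≈ 0.82 with the first) is suppressed while the distant third box survives.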

PASCAL VOC2007
As demonstrated by [4,75], a small set of coarse locations with high detection recall (DR) is sufficient for effective object detection, and it allows expensive features and complementary cues to be used in subsequent detection, achieving better quality and higher efficiency than traditional methods. Thus, we first compare our method with the competitors using detection recall metrics. Fig. 3 (a) shows detection recall when varying the IoU overlap threshold. RPN achieves very high performance when the IoU threshold is less than 0.7, but then drops rapidly; note that RPN is the only deep learning based method amongst these competitors. BING's performance is not competitive as the IoU threshold increases, but BING-E stays close to the best performance. It should be emphasized that both BING and BING-E are more than two orders of magnitude (i.e. 100+ times) faster than most popular alternatives [26,65,75,90] (see details in Tab. 3). The performance of BING and CSVM [85] almost coincides in all three subfigures, but BING is 100 times faster than CSVM. The significant improvement from BING to BING-E illustrates that BING is a strong basis that can be extended and improved in various ways. Since BING runs at about 300 fps, its variants can still be very fast; for example, BING-E generates competitive candidates at over 200 fps, far faster than most other proposal algorithms. Fig. 3 (b)-(d) show detection recall and MABO versus the number of proposals (#WIN), respectively. When the IoU threshold is 0.5, both BING and BING-E perform very well; in particular, when the number of candidates is sufficient, BING outperforms all other methods. In subfigure (e), the recall curve of BING drops considerably, and the same behavior appears in the MABO evaluation, likely because BING's proposal localization quality is poor. However, the performance of BING-E is consistently close to the best, showing that BING's localization problem has been overcome.
We show a numeric comparison of recall vs. #WIN in Tab. 3. BING-E consistently performs better than most of the competitors, and both BING and BING-E are clearly faster than all the other methods.
Although EdgeBoxes, MCG and SelectiveSearch perform very well, they are too slow for many applications.
By contrast, BING-E is more attractive. It is also interesting that the detection recall of BING-E increases by 46.1% over BING using 1,000 proposals at IoU threshold 0.7, which suggests that BING's accuracy leaves substantial room for improvement through postprocessing steps.
Tab. 4 shows the ABO and MABO comparison of these competitors. MCG consistently outperforms the others by a large margin, and BING-E is competitive with all the methods except MCG.

Since proposal generation is usually a preprocessing step in vision tasks, we feed candidate boxes produced by the objectness methods into the fast R-CNN [32] object detection framework to test the effectiveness of proposals in practical applications. The CNN model of fast R-CNN is retrained using boxes from the respective methods. Tab. 5 shows the evaluation results. In terms of mAP (mean average precision), the overall detection results of all the methods are quite close to each other. RPN performs slightly better, and our BING-E method is very close to the best performance. Although MCG almost dominates the recall, ABO and MABO metrics, it does not achieve the best object detection performance, and is worse than BING-E. Taking all factors into account, BING-E achieves a significantly higher speed while obtaining state-of-the-art generic object proposals. Finally, we illustrate sample results of varied complexity for VOC2007 test images using our improved BING-E method in Fig. 5, to better demonstrate the quality of our proposals.

Discussion on PASCAL VOC2007
In order to perform further analysis, we divide the ground truths into different sets according to their window sizes, and test some of the most competitive methods on these sets. Tab. 6 shows the results. When the ground-truth area is small, BING-E performs much worse than the others. As the ground-truth area increases, the gap between BING-E and other state-of-the-art methods gradually narrows, and BING-E outperforms all of them on the recall metric when the area is larger than 2^12. Fig. 4 shows some failure examples of BING-E. Note that almost all the falsely detected objects are small. Such small objects may have blurry boundaries that make them hard to distinguish from the background.
Note that MCG achieves much better performance on small objects, and this may be the main cause of the drop in detection rate when plugging MCG into the fast R-CNN framework. Fast R-CNN uses the VGG16 [72] model, in which the convolutional layers are pooled several times. By the last convolutional layer of VGG16, a feature map is only 1/2^4 the size of the original object, which is too coarse to classify such small instances. So using MCG proposals to retrain the CNN model may confuse the network because of the small object proposals it detects. Thus, MCG does not achieve the best performance in the object detection task, although it outperforms the others on the recall and MABO metrics.

Microsoft COCO
In order to test the generalization ability of these proposal methods, we extensively evaluate them on the COCO validation set using the same parameters as on the VOC2007 dataset, without retraining. Since the dataset is very large, we only compare against some of the efficient methods.
Fig. 6 (a) shows object detection recall versus IoU overlap threshold using different numbers of proposals. MCG always dominates the performance, but its low speed makes it impractical for many vision applications.
Some methods perform well when the IoU threshold is small, while LPO performs well for large IoU thresholds.
The performance of BING-E is slightly worse than the state of the art. BING, Rahtu, and Objectness all struggle on the COCO dataset, suggesting that these methods may not be robust in complex scenes. Note that RPN performs very poorly on COCO, which means it is highly dependent on its training data. As addressed in [11], a good object proposal algorithm should be category independent.
Although RPN achieves good results on VOC2007, it is not consistent with the goal of designing a category-independent object proposal method.

Conclusion and Future Work
We present a surprisingly simple, fast, and high-quality objectness measure using 8 × 8 binarized normed gradients (BING) features, with which computing the objectness of each image window at any scale and aspect ratio requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). To improve the localization quality of BING, we further propose BING-E, which incorporates an efficient image segmentation strategy. Evaluation results on the most widely used benchmarks (VOC2007 and COCO) and evaluation metrics show that BING-E can generate state-of-the-art generic object proposals at a significantly higher speed. The evaluations also demonstrate that BING is a good basis for object proposal generation.
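The atomic-operation idea can be sketched as follows: each 8 × 8 binary map packs into one 64-bit integer, and moving the window one pixel needs only a shift and an OR, in the spirit of Fig. 2 and Alg. 2 of the paper. This is a simplified Python illustration; the masks and the popcount-based score below are our stand-ins for the paper's learned binarized weight approximation, not the actual BING implementation.

```python
# Hedged sketch of BING's atomic-operation feature updates.

MASK64 = (1 << 64) - 1  # emulate a fixed-width int64 in Python
MASK8 = (1 << 8) - 1    # one row of the 8x8 map is a single byte

def update_row(r_prev, bit):
    """Append one binarized gradient bit to a row: one SHIFT and one OR."""
    return ((r_prev << 1) | bit) & MASK8

def update_feature(b_up, r_cur):
    """Stack the newest row under the 64-bit feature: one SHIFT and one OR."""
    return ((b_up << 8) | r_cur) & MASK64

def score(feature, w_mask):
    """Illustrative linear score: popcount of the AND with a learned bit mask."""
    return bin(feature & w_mask).count("1")
```

Because every step is a shift, OR, AND, or popcount, scanning all window positions stays within a handful of CPU instructions per location, which is what enables the reported 300fps speed.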
Limitations. BING and BING-E predict a small set of object bounding boxes. Thus, they share the limitations of all other bounding-box-based objectness measures [4,85] and classic sliding window based object detection methods [23,29]. For some object categories (e.g. a snake, wires, etc.), a bounding box might not localize the object instances as accurately as a segmentation region [10,26,67].
Future work. The high quality and efficiency of our method make it suitable for many realtime vision applications and for large-scale image collections (e.g. ImageNet [24]). In particular, its operation and memory efficiency make BING suitable for low-power devices [35,88]. Our speed-up strategy of reducing the number of tested windows is complementary to other speed-up techniques that reduce the subsequent processing time required at each location. The efficiency of our method removes the computational bottleneck of proposal-based vision tasks such as object detection [32,39], enabling potentially realtime, high-quality object detection.
We have demonstrated how to generate a small set (e.g. 1,000) of proposals that covers nearly all potential object regions, using very simple BING features and a postprocessing step. It would be interesting to introduce additional cues to further reduce the number of proposals while maintaining a high detection rate [51,84], and to explore more applications [14,53,64,78,80-82] using BING and BING-E. To encourage future work, we will continue to make updated source code available at http://mmcheng.net/bing.

Fig. 1 Although object (red) and non-object (green) windows present huge variation in image space (a), at proper scales and aspect ratios which correspond to a small fixed size (b), their corresponding normed gradients, i.e. an NG feature (c), share strong correlation. We learn a single 64D linear model (d) for selecting object proposals based on their NG features.

Fig. 2 Illustration of variables: a BING feature b_{x,y}, its last row r_{x,y}, and its last element b_{x,y}. Notice that the subscripts i, x, y, l, k, introduced in Eq. (2) and Eq. (5), are locations in the whole vector rather than indices of vector elements. We can use a single atomic variable (int64 and byte) to represent a BING feature and its last row, enabling efficient feature computation (Alg. 2).

Fig. 3 Testing results on PASCAL VOC2007 test set: (a) object detection recall versus IoU overlap threshold; (b, c) recall versus the number of candidates at IoU threshold 0.5 and 0.7 respectively; (d) MABO versus the number of candidates using at most 1000 proposals.

Fig. 4 Some failure examples of BING-E. Failure means that the overlap between the best detected box (green) and the ground truth (red) is less than 0.5. All images are from the VOC2007 test set.

Fig. 5 Illustration of true positive object proposals for VOC2007 test images using our method (BING-E).

Fig. 6 Testing results on COCO validation dataset: (a) object detection recall versus IoU overlap threshold; (b, c) recall versus the number of candidates at IoU threshold 0.5 and 0.7 respectively; (d) MABO versus the number of candidates using at most 1000 proposals.

Fig. 6 (b)-(d) show the recall/MABO when varying the number of proposals. The key observation is again that RPN suffers a big drop in performance relative to VOC2007: its recall at IoU 0.5 and its MABO are even worse than BING's. In addition, our proposed BING and BING-E are very robust when transferring to different object classes. Tab. 7 shows a statistical comparison. Although BING and BING-E do not achieve the best performance, they obtain very high computational efficiency with only a moderate drop in accuracy. The significant improvement from BING to BING-E suggests that BING would be a good basis to combine with other, more accurate bounding box refinement methods if the increased computational load is acceptable.

Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012. He then spent two years as a research fellow with Prof. Philip Torr in Oxford. He is now an associate professor at Nankai University, leading the Media Computing Lab. His research interests include computer graphics, computer vision, and image processing. Yun Liu is a Ph.D. candidate at the College of Computer Science and Control Engineering, Nankai University, under the supervision of Prof. Ming-Ming Cheng. His major research interests are computer vision and machine learning. Wen-Yan Lin received his PhD degree from the National University of Singapore in 2012, supervised by Prof. Loong-Fah Cheong and Dr. Dong Guo. He subsequently worked for the Institute of Infocomm Research Singapore and Prof. Philip Torr. He is currently a post-doc at the Advanced Digital Sciences Center, Singapore.