Branch&Rank for Efficient Object Detection
DOI: 10.1007/s11263-013-0670-8
- Cite this article as:
- Lehmann, A.D., Gehler, P.V. & Gool, L.V. Int J Comput Vis (2014) 106: 252. doi:10.1007/s11263-013-0670-8
Abstract
Ranking hypothesis sets is a powerful concept for efficient object detection. In this work, we propose a branch&rank scheme that detects objects with often fewer than 100 ranking operations. This efficiency enables the use of strong and also costly classifiers like non-linear SVMs with RBF-\(\chi ^2\) kernels. We thereby relieve an inherent limitation of branch&bound methods, as bounds are often not tight enough to be effective in practice. Our approach features three key components: a ranking function, the sets of hypotheses on which it operates, and a grouping of these sets into different tasks. Detection efficiency results from adaptively sub-dividing the object search space into decreasingly smaller sets. This is inherited from branch&bound, while the ranking function supersedes a tight bound which is often unavailable (except for rather limited function classes). The grouping makes the system effective: it separates image classification from object recognition, yet combines them in a single formulation, phrased as a structured SVM problem. A novel aspect of branch&rank is that a better ranking function is expected to decrease the number of classifier calls during detection. We use the VOC’07 dataset to demonstrate the algorithmic properties of branch&rank.
Keywords
Branch&rank, Object detection, Non-linear kernel classifier, Sub-linear detection
1 Introduction
Object class detection in images is challenging because of two problems. First, object appearances exhibit large variations due to intra-class variability, illumination changes, etc. Second, objects may appear anywhere in an image with unknown scale, and need to be localised. Hence, detectors have to simultaneously cope with appearance variations and a large search space of possible object positions.
Reducing the classifier cost is a common strategy to deal with limited computational resources. Cascade classifiers (Viola and Jones 2004; Vedaldi et al. 2009; Felzenszwalb et al. 2010) are prominent examples: they reject many hypotheses with a simple criterion and thereby avoid many computations. However, they do not per se reduce the number of calls, and the total runtime still scales linearly in the number of detection sites. Similarly, one can utilise faster classifiers (Gall and Lempitsky 2009; Zhang et al. 2011a), optimise the implementation (Wei and Tao 2010), or leverage the massive parallelism of GPU architectures (Prisacariu and Reid 2009; Wojek et al. 2008). All these approaches process the search space exhaustively and are thus not scalable. In other words, reducing the classifier cost makes a detector faster but not efficient.
Reducing the number of classifier calls is the alternative method to make detection scalable. Branch&bound (Breuel 2002; Keysers et al. 2007; Lampert et al. 2009; Lehmann et al. 2011b) is a promising approach that falls in this category: it partitions the search space adaptively and thereby avoids exhaustive search. Branch&bound operates on sets of hypotheses and uses a bounding function to process the most promising sets in a best-first order. This allows for sub-linear runtime given a tight bound on the classifier function, with exhaustive search as worst-case complexity. Unfortunately, bounds are generally not tight enough and the method’s efficiency is still insufficient to deploy powerful (and thus often costly) classifiers (cf. Vedaldi et al. 2009); such classifiers seem, however, essential to deal with the appearance variations. This indicates that the concept of bounds is a limiting factor, and we identify the notion of sets as the key to efficiency. This paper sets out to rectify the problem of bounds: branch&rank improves efficiency and therefore enables the use of costly classifiers, i.e., rich appearance models.
Branch&rank (Lehmann et al. 2011a) follows the ideas of branch&bound but overcomes its limitations. Specifically, we abandon the notion of bounds and thereby allow for arbitrary classifiers. We adopt the best-first search and “branch” but do not “bound”. Instead, we explicitly integrate the idea of scoring sets into the training problem. Intuitively speaking, we aim to “learn the bound”. More precisely, we learn a ranking function that prioritises hypothesis sets that do contain an object over those that do not. This branch&rank scheme is more efficient and detects objects with often fewer than \(100\) ranking operations. This enables the use of expensive classifiers. Although we could apply the ranking paradigm to arbitrary functions, we deliberately choose to work with non-linear SVMs and RBF-\(\chi ^2\) kernels. These classifiers have been shown to perform well (Gehler and Nowozin 2009; Lazebnik et al. 2006; Vedaldi et al. 2009), but are generally perceived as being too slow to be directly applicable. Yet, we show that using only evaluations of these non-linear SVMs is feasible. We train them in a multi-task setup which accounts for the size of hypothesis sets; we thereby separate image classification from object recognition, yet combine them in a joint objective.
In summary, the benefits of ranking are the following: (1) The ranking condition combines model estimation and acceleration of the test-time problem in a joint objective: improving the ranking function (classifiers) leads to better and more efficient object detection. (2) The ranking condition is flexible; it allows for arbitrary (ranking-)classifiers since no bound has to be derived. (3) Branch&rank is efficient and enables the use of strong and costly classifiers (like non-linear SVMs) without the need for cascade-like approximations.
In the sequel, Sect. 2 presents related work and Sect. 3 covers the overall branch&rank algorithm. The reason for multiple tasks as well as an SVM-based ranking function are presented in Sect. 4. Various compact set descriptions are detailed in Sect. 5, followed by experiments in Sect. 6.
2 Related Work
Branch&Bound. Branch&bound for bounding box detection (Lampert et al. 2009) pioneered the field of efficient object detection; the number of classifier calls of such methods scales sub-linearly with the search space size. Its efficiency depends on the availability of a tight bounding function, that, unfortunately, is only available for very small function classes. The obvious bounds for non-linear SVMs are simply not sufficiently tight to be of practical value. This severely limits the use of this technique for the task of object detection, that requires strong classification functions. Approximate bounds which are tighter are shown to accelerate convergence (Lehmann et al. 2011b). But again, this was only shown for simple function classes. Moreover, Lehmann et al. (2011b) give up on global optimality guarantees, a much appreciated property of branch&bound (Lampert et al. 2009). We will elaborate on optimality and discuss the property of our method in Sect. 3.5.
Coarse-to-Fine. Coarse-to-fine detectors (Gangaputra and Geman 2006; Pedersoli et al. 2010) also operate on sets. However, they start with a uniform partitioning at the coarsest level that still scales linearly (similar to a sliding-window approach); only the finer levels partition subsets in an adaptive way. Gangaputra and Geman (2006) use a cost-to-power criterion to learn how to partition a given set. They examine potential sets in a breadth-first order and prune non-promising sets. Pedersoli et al. (2010) refine a set uniformly and propagate only the locally best-scoring hypothesis to the next finer level. This local non-maximum suppression focuses on the most promising hypothesis (within a neighbourhood) and prunes all others. In contrast, we start with the entire search space, partition a set and explore the globally best option: We prioritise promising sets and we never prune. In fact, the concept of pruning is closely related to the idea of cascade detectors.
Cascades. Cascades have been used with much success to reduce the computation cost of classifiers (Viola and Jones 2004; Vedaldi et al. 2009; Felzenszwalb et al. 2010). They use simple criteria to reject many hypotheses and thereby reduce the number of strong classifier evaluations. However, they still process every bounding box exhaustively. In other words cascades are fast but not efficient. Although the cascading in (Felzenszwalb et al. 2010) does not examine every possible part configuration, the root part (which is reported as detection) is evaluated exhaustively as in a sliding-window approach.
Adaptive Cascades. Furthermore, cascading and adaptive sub-division are two orthogonal techniques (Lampert 2010; Weiss et al. 2010). Structured ensemble cascades (Weiss et al. 2010) leverage the coarse-to-fine approach and only subdivide hypothesis sets that have not yet been filtered. Again, they prune whereas we prioritise sets. Moreover, they work on pose estimation and focus on localising part configurations; their images are “largely focused on a single actor” and thus avoid the object localisation that we tackle in this work. Lampert (2010) is more closely related to our approach as it does best-first search and it presents a cascade of bounds. The bounding still limits the possible functions to be used and only the final cascade stage is non-linear; our approach uses non-linear SVMs throughout the entire search.
Candidate Proposals. Other approaches reduce the number of classifier calls by proposing possible bounding boxes that can subsequently be verified using an object detector of choice. For example, Chum and Zisserman (2007), Razavi et al. (2011) generate class-specific proposals based on discriminative visual words while Alexe et al. (2010) propose class-independent object positions using various low-level saliency measures. In object segmentation similar techniques are applied with much success (Carreira and Sminchisescu 2010). Such two-step proposal-verification schemes are effective but lack a joint objective function. Thus they are difficult to optimise and the influence of each single part on the entire system is not trivial to measure. Saliency-based attention methods (Itti et al. 1998; Alexe et al. 2010) actually reason in a bottom-up fashion driven by low-level cues. In contrast, our ranking algorithm guides its attention based on high-level hypotheses.
Context. Context is a valuable source of information that many object detectors try to exploit; so does branch&rank. A crucial difference w.r.t. other approaches is that they focus on detection accuracy, while we address efficiency. The work by Blaschko and Lampert (2009) is closest to ours as they combine global and local context in an SVM setup. Their aim is however to improve the detection confidence and it is unclear if this positively affects the convergence of their branch&bound approach. Moreover, they do not consider the continuum between global and local context as we do. Object priming (Torralba 2003) is another prominent example that leverages global context: they capture the “gist” of a scene (Oliva and Torralba 2001). Work along this line (Torralba 2003; Murphy et al. 2003; Bileschi and Wolf 2005; Torralba et al. 2010; Harzallah et al. 2009) has shown that such holistic image features boost detection accuracy, but computational aspects were not considered or improved. Finally, Wolf and Bileschi (2006) argue that context is particularly helpful to detect “difficult” objects (e.g., small, low-resolution instances) but not so much in the general case. They measured only accuracy though and branch&rank suggests that context helps to improve detection efficiency.
3 Branch and Rank
3.1 Overview
Branch&rank (Lehmann et al. 2011a) aims to focus on sets that contain objects, rather than spending computation on sets that do not. The detector iteratively splits such sets to eventually identify a single bounding box. The challenge is to accurately decide which sets contain an object (and should thus be split) without examining every set member individually. To this end, we classify the sub-image that covers the union of every bounding box of a set. This is a challenging task and we thus want to avoid making hard decisions that would reject hypotheses prematurely. Therefore, we keep a partitioning of the search space (represented by sets) and refine (i.e., split/branch) sets that most likely contain an object. This search strategy traverses the hypothesis sets in an order that visits the highest-scoring elements first. In combination with a detection threshold, we can thus avoid, in a principled way, having to compute scores for all possible bounding boxes.
The efficiency of branch&rank results from splitting sets that probably contain objects first, which naturally leads to a ranking problem. This section details the theoretical background as well as the overall algorithm, while the next section gives a concrete implementation.
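The search strategy outlined in this overview can be sketched with a priority queue. The following is a minimal illustration under stated assumptions, not the authors' algorithm: the function `best_first_detect`, its arguments `score`, `split`, and `is_single`, and the toy one-dimensional search space are all hypothetical names introduced for this example.

```python
import heapq
import itertools


def best_first_detect(root, score, split, is_single, threshold, max_iters=10_000):
    """Best-first search over hypothesis sets.

    Returns (hypothesis, score, n_ranking_calls); the hypothesis is None if
    no set scoring at least `threshold` could be refined to a single element.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares sets
    calls = 1
    heap = [(-score(root), next(tie), root)]  # max-heap via negated scores
    for _ in range(max_iters):
        if not heap:
            break
        neg_s, _, current = heapq.heappop(heap)
        if -neg_s < threshold:
            break  # even the best remaining set is below the threshold
        if is_single(current):
            return current, -neg_s, calls
        for child in split(current):  # "branch": refine the most promising set
            heapq.heappush(heap, (-score(child), next(tie), child))
            calls += 1
    return None, None, calls


# Toy setting (assumption): hypotheses are the integers 0..1023, a set is a
# half-open interval (lo, hi), and one "object" sits at position 700.
OBJECT = 700


def perfect_score(s):
    lo, hi = s
    return 1.0 if lo <= OBJECT < hi else 0.0  # idealised ranking function


def halve(s):
    lo, hi = s
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


def single(s):
    return s[1] - s[0] == 1


result = best_first_detect((0, 1024), perfect_score, halve, single, threshold=0.5)
```

In this toy run the idealised ranking function always prefers the half that contains the object, so the search scores the root once and two children per level: 21 ranking calls for 1024 hypotheses. A real ranking function will make mistakes, and each mistake adds iterations.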
3.2 Ranking Condition
Notation
\(\lambda \) | Parametrisation of a single bounding box |
\(\Lambda \) | Parametrisation of a set of bounding boxes |
\(\mathbb {L}\) | The set of all sets of bounding boxes |
\(\mathbb {L^{+}}\subset \mathbb {L}\) | The set of all sets containing at least one object |
\(\Lambda _j^+\in \mathbb {L^{+}}\) | A set containing at least one annotated object |
\(Y = \left\{ \lambda ^+_i \right\} _1^m\) | The training data: \(m\) ground truth annotations \(\lambda ^+_i\) |
\(B(\lambda )\), \(B(\Lambda )\) | (Four corner) bounding box for parametrisation \(\lambda \) and sets \(\Lambda \) |
\( {\mathbf {\phi }}(\Lambda )\) | Appearance descriptor for hypothesis set \(\Lambda \) |
\(\Delta : \Lambda \times \Lambda \mapsto \mathcal {R}\) | (Set-valued) loss function for pair of examples |
\(f: \Lambda \mapsto \mathcal {R}\) | Ranking function: provide a priority for hypothesis set \(\Lambda \) |
\(f^{LA}, f^{GT}\) | Loss-augmented and ground truth score function |
\(q: \Lambda \mapsto \left\{ 1,\ldots ,T \right\} \) | Task mapping to distinguish \(T\) different tasks. |
\(w_t\), \(b_t\) | SVM weight vector and bias term for task \(t\) (with \(t=q(\Lambda )\)) |
3.3 Best-First Search
3.4 Non-maximum Suppression
An image may contain multiple object instances and we would like to detect them all. Moreover, we should avoid detecting a given object twice, as this counts as a false positive. Re-detections are usually eliminated with a non-maximum suppression post-processing step, but a best-first search algorithm cannot wait for the post-processing: it would re-detect the same instance over and over again and lose its sub-linear runtime. We elude this problem with a simple suppression scheme, while more elaborate approaches exist (Blaschko 2011; Desai et al. 2009).
Our best-first algorithm aims to detect local optima and we assume that the first detection would survive a non-maximum post-processing. Hence, we can directly suppress or penalize hypotheses that would get suppressed by this detection. However, to maintain the efficiency of the algorithm, we have to adapt the non-maximum suppression to sets of hypotheses.
3.5 Connection to Branch&Bound
A connection to branch&bound (Lehmann et al. 2009; Lampert and Blaschko 2009) is evident from the name; this subsection comments on the relationship in more detail. First of all, note that the only algorithmic difference is to replace the ranking function while everything else remains unchanged. Branch&bound prioritises sets by an upper bound \(\hat{g}(\Lambda )\ge \max _{\lambda \in \Lambda }g(\lambda )\) to a traditional bounding-box score function \(g(\lambda )\). This leads to different properties and guarantees that we want to elaborate.
Assumptions and Requirements. Branch&bound requires that there exists a tight upper-bound for a given score function. In contrast branch&rank assumes that it is possible to efficiently decide whether a set (in our case represented by a sub-image) contains an object or not. In fact, image classification addresses exactly this problem and it does so with much success (Everingham et al. 2007). This suggests that the assumption is valid in practice. Of course, it also suggests that the detection strategy is challenged when objects are presented out-of-context (e.g., a car in a kitchen). However, it seems that this assumption is rather benign when compared to branch&bound’s requirement of a tight bound (Vedaldi et al. 2009).
On Convergence. We discussed that branch&rank has logarithmic detection time in case of a perfect ranking function. This is in contrast to branch&bound where the efficiency depends on the quality of the bound, which is unrelated to the generalisation problem. In other words, given a perfect bounding-box classifier \(g\) the runtime of branch&bound with bound \(\hat{g}\) can still be linear time (e.g., \(\hat{g}(\Lambda )=\infty \) except for \(\hat{g}(\lambda )=g(\lambda )\)). Of course, we cannot expect perfect generalisation in practice and every incorrect ranking increases the number of iterations till detection. But we conclude that the ranking condition couples accuracy and efficiency. This suggests that the better a ranking function, the better and more efficient the detector.
On Optimality. We follow Bottou and Bousquet (2008) and distinguish the following types of errors: the approximation error, which is due to the function class in which we search for a classifier, and the test-time error, which is incurred by only approximately solving the test-time search problem. We neglect the third possible cause of error, the optimization error, which results from solving learning problems approximately. For branch&bound as in (Lampert et al. 2009) the test-time error is zero, but the approximation error is high due to the limited function classes for which tight bounds are known. Our approach incurs a test-time error because we cannot guarantee finding the best-scoring bounding box, but the approximation error decreases compared to branch&bound because we can use richer function classes, e.g., non-linear SVMs. The recent research on object detection (e.g., Vedaldi et al. 2009) indicates that the decrease of the approximation error outweighs the test-time error. However, the optimization problem that we formulate next aims to actively reduce the test-time error. A lower objective yields a better ranking that should increase accuracy and speed up convergence, both at the same time.
4 Multi-task SVM Ranking
This section presents a concrete ranking function along with its training procedure. We aim to use non-linear SVMs as they are consistently found to cope well with appearance variations of image scenes and object classes (Vedaldi et al. 2009; Gehler and Nowozin 2009). However, the branch&rank paradigm is general and can also be implemented with other classifiers (e.g., random forests, boosting, etc.).
Section 4.1 introduces the ranking function and motivates the idea of grouping hypothesis sets into multiple tasks. Subsequently, Sect. 4.2 revisits a structured SVM formulation dedicated to learning rankings. We then present a transformation of the problem in Sect. 4.3 that decomposes the training. The resulting optimisation problem is stated in Sect. 4.4. Finally, Sect. 4.5 elucidates locally linear SVMs (Zhang et al. 2011b) to speed up the evaluation.
4.1 Multi-task Ranking Function
Bag-of-Words Appearance. Our ranking function builds on the commonly used bag-of-words approach as it copes well with image classification and object detection (Lazebnik et al. 2006; Vedaldi et al. 2009; Lampert et al. 2009; Lehmann et al. 2011b). This aggregates (local) features and represents them by a histogram. Specifically, the appearance descriptor \( {\mathbf {\phi }}(\Lambda )\) uses all features that fall within the bounding box \(B(\Lambda ):=\bigcup _{\lambda \in \Lambda } B(\lambda )\), where the union extends the notion of a bounding box to sets of hypotheses (cf. Fig. 3). For large sets, this actually includes all image features, as is usual in image classification.
Appearance is Not Enough. Figure 4 illustrates a degenerate case which suggests that the appearance within the bounding-box union is not sufficient to properly rank a set. In fact, using only \(\phi (\Lambda )\) may lead to ambiguities that jeopardize the ranking. In short, two sets with different labels can yield the same bounding-box union and thus the same appearance descriptor. We address this problem with a multi-task framework that connects image classification with object detection.
At one end, the initial set represents the entire image and all possible sub-windows. Scoring this set is the task of image classification. The other extreme is a hypothesis set with only one instance, corresponding to scoring a single bounding box. This is an object recognition problem. Both of course are related, but note the difference in the tasks: the first set should have a high score if it contains an object, the latter if it is centered on the object. This suggests that these tasks are better solved separately but combined in a joint objective. For example the first task could benefit from different image features such as the gist of a scene (Oliva and Torralba 2001), while the latter could make use of object specific features (Oliva and Torralba 2001; Dalal and Triggs 2005; Lowe 2004) or spatial configurations of object parts (Felzenszwalb et al. 2008). In our experiments we did not take advantage of size dependent image representations but focus more on the algorithmic properties of the systems. Moreover, grouping related examples into tasks reduces the intra-task variability, which simplifies the learning problem.
Multi-task Mapping. We capture the notion of tasks by a mapping \(q(\Lambda )\mapsto \left\{ 1,2,\ldots , T \right\} \) which assigns a task ID to any hypothesis set. This mapping builds on properties other than a set’s appearance. In other words, the set provides valuable domain knowledge. For example, we can use the number of bounding boxes contained in a set (i.e., its size \(|\Lambda |\)) to differentiate between the classification and recognition tasks. More specifically, our task mapping discretizes \(\log (|\Lambda |)\) uniformly into \(T\) different tasks; the log accounts for the set size’s exponential decay (due to the splitting scheme). This spans the continuum between image classification and the final object recognition problem. While both extremes are often dealt with separately (Griffin et al. 2007; Everingham et al. 2007), we combine and complement them with intermediate tasks. Unless stated otherwise, we will use \(T=6\) tasks. Let us stress that this is one particular choice, while the concept of tasks is more general: we could define other mappings that group examples by scale and aspect-ratio (Park et al. 2010; Zhang et al. 2011b), or also by class labels (Yeh et al. 2009).
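The task mapping described above, a uniform discretisation of \(\log (|\Lambda |)\) into \(T\) tasks, can be sketched as follows. This is an illustration only: the function name `task_id`, the use of the initial (largest) set size as the upper end of the range, and the convention that task 1 holds the smallest sets are assumptions that the text does not specify.

```python
import math


def task_id(set_size, max_set_size, num_tasks=6):
    """Map a hypothesis-set size to a task ID in {1, ..., num_tasks}.

    log(set_size) is discretised uniformly over [0, log(max_set_size)].
    Assumed convention: task 1 = single boxes (recognition),
    task num_tasks = the largest sets (image classification).
    """
    if not 1 <= set_size <= max_set_size:
        raise ValueError("set_size must lie in [1, max_set_size]")
    if max_set_size == 1:
        return 1
    frac = math.log(set_size) / math.log(max_set_size)  # in [0, 1]
    # The upper end (frac == 1) is clamped into the last bin.
    return min(int(frac * num_tasks), num_tasks - 1) + 1
```

Because each split roughly halves a set, the set size decays exponentially along a search path; binning in the log domain therefore spreads the visited sets more evenly over the tasks than binning the raw size would.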
4.2 Structured SVM Ranking
Several ranking problems have been studied in the machine learning literature (Tsochantaridis et al. 2005; Chapelle and Keerthi 2009; Burges et al. 2005) and we adopt structured SVMs using margin-rescaling (Tsochantaridis et al. 2005).
4.3 Problem Decomposition
The multi-task ranking function from the previous subsection allows for a decomposition of the SVM optimisation problem. This reduces the complexity since each of the resulting problems involves only a subset of the data. This subsection details the two necessary steps to accomplish this decomposition.
Doing so reveals the final optimisation problems that we state in the following section. Let us point out that we did not alter the constraint set that enforces the ranking condition (1).
4.4 Decomposed SVMs and Training
Although this objective function, based on the aforementioned decompositions, does allow for separate training, it does not mean that the functions \(\langle w_t,\cdot \rangle \) are independent. The problem Eq. (9) is indeed equivalent to Eq. (5) when using \(f\) as in Eq. (4). This also results in the scores being calibrated, since the same regularization and loss-rescaled margin are used for all of them. In some sense this is only possible because ranking is enforced between two sets where one does contain at least one correct hypothesis. This allows for the constraint decoupling described above. No distinction is made between two different sets that do not contain correct hypotheses, nor between two sets that both do. It would be conceivable to rank smaller sets higher than larger sets, which would break this decomposition. This problem can be readily kernelized and we choose to work with the RBF-\(\chi ^2\) kernel \(\langle x,z\rangle _k = k(x,z) = \exp \left( -\gamma \sum _l \frac{(x_l-z_l)^2}{x_l+z_l}\right) \) as an example. As bandwidth \(\gamma \), we use the inverse of the kernel matrix’s median.
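The RBF-\(\chi ^2\) kernel defined above can be written down directly. The sketch below is a plain NumPy illustration, not the authors' code. The function names are hypothetical, bins that are empty in both histograms are assumed to contribute zero, and `median_bandwidth` encodes one possible reading of the bandwidth rule (the inverse of the median pairwise \(\chi ^2\) distance), which is an assumption.

```python
import numpy as np


def chi2_distance(X, Z, eps=1e-12):
    """Pairwise chi-squared distances sum_l (x_l - z_l)^2 / (x_l + z_l).

    X has shape (n, d), Z has shape (m, d); entries are non-negative
    histogram counts. Returns an (n, m) matrix.
    """
    X = np.asarray(X, dtype=float)[:, None, :]
    Z = np.asarray(Z, dtype=float)[None, :, :]
    num = (X - Z) ** 2
    den = X + Z
    # Assumption: bins that are empty in both histograms contribute zero.
    terms = np.where(den > eps, num / np.maximum(den, eps), 0.0)
    return terms.sum(axis=-1)


def rbf_chi2_kernel(X, Z, gamma):
    """k(x, z) = exp(-gamma * chi2(x, z)), as in the formula above."""
    return np.exp(-gamma * chi2_distance(X, Z))


def median_bandwidth(X):
    """Assumed reading of the bandwidth rule: 1 / median pairwise distance."""
    D = chi2_distance(X, X)
    off_diagonal = D[~np.eye(len(D), dtype=bool)]
    return 1.0 / np.median(off_diagonal)
```

A typical use would be `K = rbf_chi2_kernel(X, X, median_bandwidth(X))` on the training histograms, followed by a kernel SVM solver that accepts a precomputed Gram matrix.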
Constraint Generation. We solve Eq. (9) using SVM\(^{struct}\) (Tsochantaridis et al. 2005) and delayed constraint generation since the constraint set is huge; it consists of all sets of bounding boxes. We initially generate the positive constraints by running Algorithm 1 using the ground truth ranking \(f^{GT}(\Lambda ):= \mathbb {1}_{\Lambda \cap Y\ne \emptyset }\); let us emphasise that branch&rank uses exactly the same annotation as any other detection approach. Thereafter, we alternate between optimising Eq. (9) with the reduced constraint set, and gathering new examples that violate the constraints. We identify new constraints by running a detector that uses the current estimate of the loss-augmented score \(f^{LA}(\Lambda ):=f(\Lambda )+\Delta (\Lambda )\), and subsequently add them to the constraint set. More precisely, in our implementation we perform \(10\) rounds, in each of which we generate new constraints from \(300\) randomly chosen training images.
Hard Negative Mining. The newly gathered examples are in fact those that are easily confused with positive ones. They are often called hard negatives, and delayed constraint generation thus provides a formal justification for the commonly used “hard negative mining” (Dalal and Triggs 2005). This connection was first demonstrated by Blaschko and Lampert (2008) and their extension (Blaschko et al. 2010) addressed the problem of pair-wise constraints in Eq. (5). Our decomposition in Sect. 4.3 breaks the pair-wise coupling and the resulting optimisation problem Eq. (9) makes the connection to the binary SVM setup more explicit. But recall that Eqs. (7–9) are equivalent to the well-established ranking formulation Eq. (5).
4.5 Linearization: Anchor Plane SVMs
We further experiment with locally linear SVMs (Ladický and Torr 2011; Zhang et al. 2011a) to speed up the evaluation of various detector configurations. The evaluation time of non-linear SVMs scales with the number of support vectors, which tends to grow linearly with the training data size. This is a downside, especially since more training data often leads to better classifiers. Consequently, non-linear SVMs become slower as they become better. A possibility to overcome the computational bottleneck is to work with locally linear SVMs (Ladický and Torr 2011; Zhang et al. 2011a). The rationale is that linear SVMs are often too simplistic while non-linear SVMs are too expensive to evaluate; locally linear SVMs aim to combine the advantages of both by assuming that the decision boundary is locally linear. In essence, a locally linear SVM is a linear SVM after mapping each feature \(x\) to \(\psi (x) = x\otimes \gamma (x)\) with a local coordinate coding function \(\gamma : x\mapsto \mathcal {R}^D\); the parameter \(D\) increases the capacity of the classifier by expanding the original descriptor \(x\in \mathcal {R}^{N}\) to \(N\cdot D\) dimensions. This actually induces a kernel \(k(x,z)=(x^Tz)\cdot (\gamma (x)^T\gamma (z))\) that compares only examples that have similar mappings.
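The identity between the explicit feature map \(\psi (x)=x\otimes \gamma (x)\) and the induced kernel can be checked numerically. In the sketch below, the coding function `local_coding` is a hypothetical stand-in (a soft assignment to \(D\) anchor points); the text does not specify which coding the authors use, so only the tensor-product structure should be taken from the source.

```python
import numpy as np


def local_coding(x, anchors, beta=1.0):
    """Hypothetical coding gamma(x): soft assignment to D anchor points.

    `anchors` has shape (D, N); the result has shape (D,) and sums to 1.
    """
    d2 = np.sum((anchors - x) ** 2, axis=1)
    w = np.exp(-beta * (d2 - d2.min()))  # shifted for numerical stability
    return w / w.sum()


def psi(x, anchors, beta=1.0):
    """Explicit feature map psi(x) = x (tensor) gamma(x), of dimension N * D."""
    return np.kron(x, local_coding(x, anchors, beta))


def induced_kernel(x, z, anchors, beta=1.0):
    """k(x, z) = (x^T z) * (gamma(x)^T gamma(z))."""
    gx = local_coding(x, anchors, beta)
    gz = local_coding(z, anchors, beta)
    return float(x @ z) * float(gx @ gz)


rng = np.random.default_rng(0)
N, D = 5, 3
anchors = rng.normal(size=(D, N))
x, z = rng.normal(size=N), rng.normal(size=N)
explicit = float(psi(x, anchors) @ psi(z, anchors))
implicit = induced_kernel(x, z, anchors)
```

Here `explicit` and `implicit` agree up to floating-point error, because the inner product of two Kronecker products factorises into the product of the individual inner products. When two examples are coded onto different anchors, \(\gamma (x)^T\gamma (z)\) is close to zero, which is the sense in which the kernel compares only examples with similar mappings.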
5 Hypothesis Set Representation
This section details the representation of hypothesis sets and we describe how to implement the operations required by the detection algorithm. The overall procedure to represent hypothesis sets is as follows. First, we parametrize bounding boxes using a four dimensional vector. In Sects. 5.1 and 5.2 we work with a reference point plus the box’s scale&aspect-ratio, while in Sect. 5.3 the width&height is used instead. Secondly, we obtain sets of bounding boxes by representing each dimension with an interval, rather than a single number. This provides a compact description of sets as illustrated in Fig. 7. The remainder of this section details the representations and set splitting, as well as how to compute the set size and union bounding box.
5.1 Position, Scale, and Aspect-Ratio
Splitting. We implement the splitting scheme as follows. The largest of all four intervals (defining a set) is split into two equal halves. For this we normalize the size of the spatial (xy) intervals using the largest scale \(\overline{s}\). This scale-adaptation avoids localising objects with unnecessarily high precision: \(xy\)-intervals may already be small w.r.t. the largest bounding box of a set; therefore we prefer splitting along \(ls,lr\) over spatial splits.
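The splitting rule can be sketched as follows. Several details are assumptions made for illustration: a set is stored as a dictionary of four closed intervals with the keys `x`, `y`, `ls`, and `lr`; the largest scale is taken to be \(\overline{s}=\exp (\overline{ls})\); the lengths of the \(ls\) and \(lr\) intervals are compared unnormalised; and ties are broken in favour of \(ls\) and \(lr\).

```python
import math


def split_set(hs):
    """Split the (normalised) largest interval of a hypothesis set in half.

    `hs` maps "x", "y", "ls", "lr" to (lower, upper) tuples. Returns the two
    child sets; the input is left unmodified.
    """
    s_max = math.exp(hs["ls"][1])  # assumption: largest scale in the set

    def size(key):
        lo, hi = hs[key]
        length = hi - lo
        # Spatial intervals are measured relative to the largest scale.
        return length / s_max if key in ("x", "y") else length

    # max() keeps the first maximal key, so ties favour ls, then lr.
    dim = max(("ls", "lr", "x", "y"), key=size)
    lo, hi = hs[dim]
    mid = 0.5 * (lo + hi)
    left, right = dict(hs), dict(hs)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return left, right
```

With large boxes in the set, the spatial intervals shrink after normalisation, so the rule first narrows down scale and aspect ratio and only later refines the position.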
5.2 Including Set Shrinkage
Next we describe a parameterisation that is largely equal to the one described above, but includes a shrinkage step. The previous parameterisation allowed bounding boxes that can be partially outside the image. Although these bounding boxes are generally low-scoring (due to fewer supporting features), one may want to eliminate them explicitly.
Representation & Initialisation as in Sect. 5.1.
Bounding Box & Set Size. This shrinkage procedure is applied after every set splitting and ensures bounding boxes (partially) outside the image are excluded from the search. Consequently, the union of bounding boxes as computed previously can be clamped to the image domain. The size of a (shrunk) set is computed as before in Sect. 5.1.
Splitting. Equations (12, 13) show that the spatial intervals depend on the size of the bounding box: the larger the bounding box, the smaller the interval. As the splitting procedure relies on the interval size relative to the largest bounding box, we also account for the shrinkage when choosing the split dimension. For example, the x-interval size becomes \((\min (\overline{x},W-w_{\max }/2)-\max (\underline{x},w_{\max }/2))/w_{\max }\) with \(w_{\max }=\exp (\overline{ls}+\overline{lr})\).
5.3 Position, Width, and Height
6 Experiments
This section evaluates the performance of branch&rank. All details of the evaluation, the pre-processing steps like image feature extraction are described in Sect. 6.1. Then, individual components are analysed in the following sections. We analyse in isolation the classifier (Sect. 6.2), the parametrisation (Sect. 6.3), and the task quantisation (Sect. 6.4). Finally the overall performance (Sect. 6.5) and the efficiency (Sect. 6.6) of the resulting branch&rank detector are evaluated.
6.1 Testbed and Features
Testbed. We use the VOC’07 dataset (Everingham et al. 2007) as a testbed for the experiments, it consists of about \(10\) k images with \(20\) classes and comes with three pre-defined splits “train”, “val”, and “test”. We compare different configurations of the branch&rank detector on a subset of \(10\) classes, while the final (test-set) evaluation and comparison to published results is done for all classes. The evaluation scheme is as follows: the parameter \(C\) is estimated by training (validating) models on the train (val) data splits. As images are selected randomly during hard negative mining, we always average over three different runs; the variation among runs is plotted with error bars. Using the best \(C\) value, we eventually retrain on the entire ’trainval’ split and evaluate on the test data. For training, we ignore images that contain truncated or difficult examples. Moreover, we use only \(1\),\(000\) negative images during validation. Throughout, we measure detection quality using average precision of VOC’10, unless stated otherwise.
RGB-SIFT-Pyramid-Features. We extract features on a dense grid at multiple scales. We use the code of van de Sande et al. (2010) and select rgb-SIFT descriptors. This typically results in about \(15\)k features, which we quantise with a vocabulary of \(100\) visual words obtained by k-means clustering. Subsequently, we apply a spatial pyramid histogram scheme with \(1\times 1\), \(2\times 2\), and \(4\times 4\) bins that yields a \(2100\)D sub-image descriptor \( {\mathbf {\phi }}(\Lambda )\) for a set \(\Lambda \). We further normalise this descriptor and compare two versions. The first normalises the vector by its \(l_2\)-norm. The second applies term frequency reweighting, as is common in retrieval (Robertson and Walker 1994). Specifically, we rescale every feature \(x\) by \(x/(x+a)\), where we found \(a=7\) to work well. This relates to binarisation/max-pooling (Boureau et al. 2010) as each feature saturates at \(1\), but it does so in a smooth fashion. Results obtained by this normalisation are denoted by a TF suffix.
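The descriptor size and the two normalisations can be illustrated with a small Python sketch. The helper names are hypothetical and the code operates on plain lists rather than on the actual feature pipeline; it only mirrors the arithmetic stated above (\(100\) words times \(1+4+16\) pyramid cells, and the rescaling \(x/(x+a)\) with \(a=7\)).

```python
import math


def pyramid_dimension(vocabulary_size, grids=(1, 2, 4)):
    """Descriptor length: one histogram per cell of each pyramid level."""
    return vocabulary_size * sum(g * g for g in grids)


def tf_normalise(histogram, a=7.0):
    """Term frequency reweighting x / (x + a); each entry saturates at 1."""
    return [x / (x + a) for x in histogram]


def l2_normalise(histogram):
    """Divide the vector by its l2-norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in histogram))
    return [x / norm for x in histogram] if norm > 0 else list(histogram)


dimension = pyramid_dimension(100)   # 100 * (1 + 4 + 16)
tf_counts = tf_normalise([0, 7, 70])  # small counts stay small, large ones approach 1
```

The TF variant maps a count equal to \(a\) to \(0.5\), so \(a\) acts as the count at which a visual word is considered half "present".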
6.2 Comparison of Kernels
The first experiment validates the quality of different kernel functions. In particular, we elucidate the quality of the anchor plane SVMs compared to common non-linear SVMs. Furthermore, we investigate the difference between the RBF-\(\chi ^2\) kernel and a standard Gaussian kernel. The latter uses the Euclidean distance to compare feature vectors. In this section and also in Sects. 6.3 and 6.4, we use the “train” split for training and the “val” split for performance evaluation of the model.
While the performance decreases a bit, the pay-off from locally-linear SVMs becomes apparent when looking at the computation and storage requirements. We report numbers for \(car\) as a representative example. The entire evaluation (including training, hard-negative mining, and testing on the validation set) takes roughly \(25\) min for an anchor plane (\(D=20\)) SVM, compared to \(250\) min for the RBF-\(\chi ^2\) classifier. This is one order of magnitude less runtime. In terms of storage, the models are \(10\)MB and \(66\)MB large, respectively. Hence, anchor plane SVMs are competitive once performance is measured not only in terms of “average precision”.
In conclusion, the RBF-\(\chi ^2\) kernel yields the best accuracy (24.6 %), while anchor plane SVMs have strong computational advantages with only a minor decrease in accuracy (21.4 %). In the sequel, we therefore compare various detector configurations (i.e., different search space partitioning schemes, increasingly many tasks) using the much faster anchor plane SVMs. However, the final evaluation and comparison to published results is done using the more accurate RBF-\(\chi ^2\) kernel.
6.3 Search Space Partitioning Schemes
6.4 Multi-task Improvements
6.5 Performance Evaluation
Finally, we evaluate the detector on the VOC’07 test set and use the “trainval” split for training. The detector distinguishes six tasks and uses the width&height parametrisation with quadruple splits. Figure 14 reports the results of the anchor plane SVM with \(20\) planes and the non-linear RBF-\(\chi ^2\)-SVM. The RBF-\(\chi ^2\) SVM clearly outperforms anchor plane SVMs in terms of average precision; recall, however, that anchor plane SVMs are an order of magnitude faster.
Average precision (AP) on the VOC 2007 test set; detectors are trained on the “trainval” split and evaluated on the “test” split
Measure | Method | Aerop | Bicyc | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Dtable | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | Tv
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AP10 | Anchor plane | 14.3 | 21.9 | 1.7 | 3.4 | 1.3 | 29.0 | 28.5 | 9.9 | 1.2 | 7.1 | 3.0 | 7.3 | 31.9 | 28.4 | 5.7 | 0.5 | 7.9 | 13.3 | 25.6 | 17.0
AP10 | RBF-\(\chi ^2\) | 17.5 | 24.8 | 3.7 | 6.6 | 2.1 | 29.9 | 31.9 | 14.3 | 1.8 | 9.5 | 5.4 | 12.6 | 40.1 | 32.2 | 7.3 | 2.1 | 11.4 | 19.4 | 30.4 | 24.0
AP07 | RBF-\(\chi ^2\) | 20.6 | 26.9 | 9.5 | 11.4 | 8.1 | 32.0 | 32.5 | 17.3 | 6.8 | 12.9 | 10.5 | 15.9 | 41.5 | 33.8 | 13.4 | 8.4 | 14.5 | 23.2 | 32.1 | 26.5
AP07 | v7 | 18.0 | 41.1 | 9.2 | 9.8 | 24.9 | 34.9 | 39.6 | 11.0 | 15.5 | 16.5 | 11.0 | 6.2 | 30.1 | 33.7 | 26.7 | 14.0 | 14.1 | 15.6 | 20.6 | 33.6
AP07 | dt | 26.2 | 40.9 | 9.8 | 9.4 | 21.4 | 39.3 | 43.2 | 24.0 | 12.8 | 14.0 | 9.8 | 16.2 | 33.5 | 37.5 | 22.1 | 12.0 | 17.5 | 14.7 | 33.4 | 28.9
6.6 Efficiency: often less than 100 classifier calls
7 Conclusion
In short, branch&rank (Lehmann et al. 2011a) generalises the idea of branch&bound (Lampert et al. 2009; Lehmann et al. 2011b): ranking improves efficiency and thereby enables the use of arbitrary classifiers, including non-linear SVMs with RBF-\(\chi ^2\) kernels. This is a crucial advance in efficient object detection since strong classifiers are beneficial to properly model the object intra-class variations. To recapitulate, the efficiency of our method results from leveraging the branching step of branch&bound while superseding the bounding step by a ranking step. This removes the former limitation (the need for a tight bounding function) and allows for arbitrary ranking functions. The system is trained in a structured SVM setting, and a multi-task formulation has proven effective: it properly handles image classification, object recognition, and the in-between tasks arising throughout the search procedure. The experiments show that branch&rank often localises objects using less than 100 classifier calls. This efficiency enables the use of costly and thus strong classifiers.
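The interplay of branching and ranking can be sketched as a best-first search over hypothesis sets with a priority queue. The following Python code is a minimal illustration under stated assumptions, not the detector evaluated above: the function names are hypothetical, the ranking function is passed in as a black box, and the toy example searches a one-dimensional interval with an idealised ranking function instead of bounding-box sets.

```python
import heapq
import itertools


def branch_and_rank(root, rank, split, is_single):
    """Best-first search; returns the found hypothesis and the number of rank calls."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal scores
    calls = 1
    # heapq is a min-heap, so scores are negated to pop the best-ranked set first.
    queue = [(-rank(root), next(tie_breaker), root)]
    while queue:
        _, _, hypothesis_set = heapq.heappop(queue)
        if is_single(hypothesis_set):
            return hypothesis_set, calls
        for child in split(hypothesis_set):  # branching step
            heapq.heappush(queue, (-rank(child), next(tie_breaker), child))  # ranking step
            calls += 1
    return None, calls


# Toy problem (assumption for illustration): find position 37 in [0, 64).
TARGET = 37


def toy_rank(interval):
    lo, hi = interval
    return 1.0 if lo <= TARGET < hi else 0.0  # idealised ranking function


def toy_split(interval):
    lo, hi = interval
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]  # bisection


def toy_is_single(interval):
    return interval[1] - interval[0] == 1


found, num_calls = branch_and_rank((0, 64), toy_rank, toy_split, toy_is_single)
```

With this idealised ranking function, the search never expands a set that misses the target, so the number of calls grows with the depth of the splitting hierarchy rather than with the number of hypotheses. A weaker ranking function would expand more sets and thus need more calls.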
A novel aspect of branch&rank is that the notion of sets is already integrated into the training. The ranking function can therefore leverage information about a set that is not available when looking at a single bounding box. This allows us to overcome a systematic bias (towards larger sets) of bounding functions (cf. Lehmann 2011) and thus improves efficiency. As a result, we made detection by non-linear SVMs feasible, without the need for approximations.
Let us emphasise that efficiency, i.e., a small number of classifier calls, is orthogonal to reducing the cost of a single classifier call. Using faster classifiers further reduces the overall runtime, as we showed with anchor plane SVMs. While those performed a bit worse, they run an order of magnitude faster. This makes it possible to combine multiple complementary features, which are the source of most empirical progress in image classification and detection (Gehler and Nowozin 2009; Vedaldi et al. 2009). This is subject to future work; we found a single feature sufficient to demonstrate the algorithmic properties of branch&rank.
Multi-task aspects play a vital role in branch&rank: hypothesis sets throughout the search process correspond to image classification, object categorisation, and in-between tasks. We captured this phenomenon by a task mapping that groups related sets; each task is scored with a dedicated function that still targets a global ranking. This concept is versatile and our grouping (based on the set size) only scratched the surface of what is possible. For example, Park et al. (2010) and Zhang et al. (2011b) group by scale (and aspect-ratio) to better cope with small, low-resolution objects. Our task mapping describes such groupings in a formal, yet simple and general manner.
In the future, we plan to better leverage the power of this task mapping. The flexibility of branch&rank in fact allows us to use different appearance descriptors for different tasks, and to sample features on demand. The next step is thus to take advantage of findings from the image classification and object categorisation community. We envision improving the detection results by tailoring the ranking function for each task separately. Furthermore, it would also be interesting to automatically learn a task mapping from training data.
Another upcoming research challenge lies in developing a better understanding of hypothesis sets and how to partition them. Our current bi/quad-section scheme is simple yet effective, but it is not directly applicable to, e.g., multi-class scenarios. However, we anticipate that the proposed multi-task approach extends to multi-class branching (e.g., Yeh et al. 2009). Designing an appropriate splitting scheme that interleaves spatial and class branching is a promising endeavour: extending branch&rank in this way would provide a principled and efficient true multi-class detector.
To avoid a hard decision, one may subtract a penalty from the score and re-schedule the element in the priority queue.
More generally, this transformation is possible whenever the loss is separable in its two arguments, i.e., \(\Delta (\Lambda ^+,\Lambda )=u(\Lambda ^+)-v(\Lambda )\) for some functions \(u\) and \(v\).
Our visual-words based image descriptors seem inadequate for scales \(<\)50 pixels; unfortunately, this yields an a priori loss of recall. This is a problem of the feature representation, not of the detector.
We cannot directly compare to Lampert (2010) as they evaluate using recall-overlap rather than average precision.
Due to measuring the performance at a discrete set of recall values, an AP of \(1/11\approx 9\,\%\) is obtained if the best-scoring detection is correct even if it is the only one.
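This effect can be reproduced with a small sketch of an 11-point interpolated average precision. The function below is a simplified illustration with hypothetical names, assuming that precision is sampled at the recall levels \(0, 0.1, \ldots , 1\) and that detections are already sorted by descending score; it is not the official evaluation code.

```python
def eleven_point_ap(is_correct, num_positives):
    """11-point interpolated AP for detections sorted by descending score.

    is_correct    : list of booleans, True if the detection matches an object
    num_positives : total number of annotated objects
    """
    precisions, recalls = [], []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        true_positives += int(correct)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_positives)

    ap = 0.0
    for i in range(11):
        threshold = i / 10
        # Interpolated precision: best precision at any recall >= threshold.
        reachable = [p for p, r in zip(precisions, recalls) if r >= threshold]
        ap += max(reachable, default=0.0) / 11
    return ap


# One correct top-scoring detection among 100 objects: only recall level 0 contributes.
single_hit_ap = eleven_point_ap([True, False, False], num_positives=100)
```

Only the recall level \(0\) is reached with precision \(1\) in this example, so the result equals \(1/11\) regardless of how many objects remain undetected.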
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.