Abstract
Cascade classifiers are widely used in real-time object detection. Different from conventional classifiers that are designed for a low overall classification error rate, a classifier in each node of the cascade is required to achieve an extremely high detection rate and moderate false positive rate. Although there are a few reported methods addressing this requirement in the context of object detection, there is no principled feature selection method that explicitly takes into account this asymmetric node learning objective. We provide such an algorithm here. We show that a special case of the biased minimax probability machine has the same formulation as the linear asymmetric classifier (LAC) of Wu et al. (2005). We then design a new boosting algorithm that directly optimizes the cost function of LAC. The resulting totally-corrective boosting algorithm is implemented by the column generation technique in convex optimization. Experimental results on object detection verify the effectiveness of the proposed boosting algorithm as a node classifier in cascade object detection, and show performance better than that of the current state-of-the-art.
Notes
In our object detection experiment, we found that this assumption can always be satisfied.
Since the multi-exit cascade makes use of all previous weak classifiers in earlier nodes, it would meet the Gaussianity requirement better than the conventional cascade classifier.
Training a complete \(22\)-node cascade and choosing the best \( \theta \) on cross-validation data may give better detection rates.
Our implementation is in C++ and only the weak classifier training part is parallelized using OpenMP.
Covariance features capture the relationship between different image statistics and have been shown to perform well in our previous experiments. However, other discriminative features can also be used here instead, e.g., Haar-like features, local binary pattern (LBP) (Mu et al. 2008) and self-similarity of low-level features (CSS) (Walk et al. 2010).
Here \( (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{S} \) denotes the family of distributions in \( (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \) that are also symmetric about the mean \( \boldsymbol{\mu} \). \( (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{SU} \) denotes the family of distributions in \( (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \) that are additionally symmetric and linear unimodal about \( \boldsymbol{\mu} \).
References
Agarwal, S., Awan, A., & Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1475–1490.
Aldavert, D., Ramisa, A., Toledo, R., & Lopez de Mantaras, R. (2010). Fast and robust object segmentation with the integral linear classifier. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.
Bi, J., Periaswamy, S., Okada, K., Kubota, T., Fung, G., & Salganicoff, M., et al. (2006). Computer aided detection via asymmetric cascade of sparse hyperplane classifiers. In Proceedings of ACM International Conference Discovery & Data Mining (pp. 837–844). Philadelphia, PA: ACM Press.
Bourdev, L., & Brandt, J. (2005). Robust object detection via soft cascade. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 236–243). San Diego, CA: IEEE.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Brubaker, S. C., Wu, J., Sun, J., Mullin, M. D., & Rehg, J. M. (2008). On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77(1–3), 65–86.
Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. L. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9, 1775–1822.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 886–893). San Diego, CA: IEEE.
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254.
Dollár, P. (2012). Piotr’s image and video Matlab toolbox. Retrieved December 14, 2012, from http://vision.ucsd.edu/pdollar/toolbox/doc/.
Dollár, P., Babenko, B., Belongie, S., Perona, P., & Tu, Z. (2008). Multiple component learning for object detection. In Proceedings of European Conference on Computer Vision (pp. 211–224). Marseille, France: ECCV.
Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Dundar, M., & Bi, J. (2007). Joint optimization of cascaded classifiers for computer aided detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.
Enzweiler, M., Eigenstetter, A., Schiele, B., & Gavrila, D. M. (2010). Multi-cue pedestrian classification with partial occlusion handling. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Ess, A., Leibe, B., & Van Gool, L. (2007). Depth and appearance for mobile scene analysis. In Proceedings of International Conference on Computer Vision. Rio de Janeiro: ICCV.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In Proceedings of International Conference on Computer Vision. Rio de Janeiro: ICCV.
Huang, K., Yang, H., King, I., Lyu, M., & Chan, L. (2004). The minimum error minimax probability machine. Journal of Machine Learning Research, 5, 1253–1286.
Lanckriet, G. R. G., El Ghaoui, L., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE.
Lefakis, L., & Fleuret, F. (2010). Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems. Vancouver: NIPS.
Li, S. Z., & Zhang, Z. (2004). FloatBoost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1112–1123.
Lin, Z., Hua, G., & Davis, L. S. (2009). Multiple instance feature for robust part-based object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 405–412). Miami, FL: IEEE.
Liu, C., & Shum, H.-Y. (2003) Kullback-Leibler boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 587–594). Madison, Wisconsin: IEEE.
Maji, S., Berg, A. C., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Masnadi-Shirazi, H., & Vasconcelos, N. (2007). Asymmetric boosting. In Proceedings of International Conference on Machine Learning (pp. 609–619). Corvallis, OR: IMLS.
Masnadi-Shirazi, H., & Vasconcelos, N. (2011). Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 294–309.
MOSEK. (2010). The MOSEK optimization toolbox for MATLAB manual, version 6.0, revision 93. Retrieved May 25, 2010, from http://www.mosek.com/.
Mu, Y., Yan, S., Liu, Y., Huang, T., & Zhou, B. (2008). Discriminative local binary patterns for human detection in personal album. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Munder, S., & Gavrila, D. M. (2006). An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1863–1868.
Paisitkriangkrai, S., Shen, C., & Zhang, J. (2008). Fast pedestrian detection using a cascade of boosted covariance features. IEEE Transactions on Circuits and Systems for Video Technology, 18(8), 1140–1151.
Paisitkriangkrai, S., Shen, C., & Zhang, J. (2009). Efficiently training a better visual detector with sparse Eigenvectors. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami, Florida: IEEE.
Pham, M.-T., & Cham, T.-J. (2007a). Fast training and selection of Haar features using statistics in boosting-based face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Pham, M.-T., & Cham, T.-J. (2007b). Online learning asymmetric boosted classifiers for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.
Pham, M.-T., Hoang, V.-D. D., & Cham, T.-J. (2008). Detection with multi-exit asymmetric boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, Alaska: IEEE.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199.
Saberian, M., & Vasconcelos, N. (2012). Learning optimal embedded cascades. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Saberian, M. J., & Vasconcelos, N. (2010). Boosting classifier cascades. In Advances in Neural Information Processing Systems. Vancouver: NIPS.
Shen, C., & Li, H. (2010a). Boosting through optimization of margin distributions. IEEE Transactions on Neural Networks, 21(4), 659–666.
Shen, C., & Li, H. (2010b). On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12), 2216–2231. http://dx.doi.org/10.1109/TPAMI.2010.47.
Shen, C., Paisitkriangkrai, S., & Zhang, J. (2008). Face detection from few training examples. In Proceedings of the International Conference on Image Processing (pp. 2764–2767). San Diego, CA: IEEE.
Shen, C., Wang, P., & Li, H. (2010). LACBoost and FisherBoost: Optimally building cascade classifiers. In Proceedings of European Conference on Computer Vision, LNCS 6312. (Vol. 2, pp. 608–621). Crete Island, Greece: Springer.
Shen, C., Paisitkriangkrai, S., & Zhang, J. (2011). Efficiently learning a detection cascade with sparse Eigenvectors. IEEE Transactions on Image Processing, 20(1), 22–35. http://dx.doi.org/10.1109/TIP.2010.2055880.
Sochman, J., & Matas, J. (2005). Waldboost learning for time constrained sequential detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego: IEEE.
Torralba, A., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.
Tu, H.-H., & Lin, H.-T. (2010). One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the International Conference on Machine Learning. Haifa, Israel: ICML.
Tuzel, O., Porikli, F., & Meer, P. (2008). Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1713–1727.
Viola, P., & Jones, M. (2002). Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in Neural Information Processing Systems (pp. 1311–1318). Cambridge: MIT Press.
Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.
Viola, P., Platt, J. C., & Zhang, C. (2005). Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems (pp. 1417–1424). Vancouver, Canada: NIPS.
Walk, S., Majer, N., Schindler, K., & Schiele, B. (2010). New features and insights for pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Wang, P., Shen, C., Barnes, N., & Zheng, H. (2012). Fast and robust object detection using asymmetric totally-corrective boosting. IEEE Transactions on Neural Networks and Learning Systems, 23(1), 33–46.
Wang, W., Zhang, J., & Shen, C. (2010). Improved human detection and classification in thermal images. In Proceedings of the International Conference on Image Processing. Hong Kong: ICIP.
Wang, X., Han, T. X., & Yan, S. (2007). An HOG-LBP human detector with partial occlusion handling. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Wojek, C., Walk, S., & Schiele, B. (2009). Multi-cue onboard pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami: IEEE.
Wu, B., & Nevatia, R. (2008). Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Wu, J., & Rehg, J. M. (2011). CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1489–1501.
Wu, J., Rehg, J. M., & Mullin, M. D. (2003). Learning a rare event detection cascade by direct feature selection. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems. Sierra Nevada: NIPS.
Wu, J., Mullin, M. D., & Rehg, J. M. (2005). Linear asymmetric classifier for cascade detectors. In Proceedings of the International Conference on Machine Learning (pp. 988–995). Bonn, Germany: IMLS.
Wu, J., Brubaker, S. C., Mullin, M. D., & Rehg, J. M. (2008). Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 369–382.
Xiao, R., Zhu, L., & Zhang, H.-J. (2003). Boosting chain learning for object detection. In Proceedings of International Conference on Computer Vision (pp. 709–715). Nice, France: ICCV.
Xiao, R., Zhu, H., Sun, H., & Tang, X. (2007). Dynamic cascades for face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Yu, Y.-L., Li, Y., Schuurmans, D., & Szepesvári, C. (2009). A general projection property for distribution families. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (pp. 2232–2240). Vancouver, Canada: NIPS.
Zheng, Y., Shen, C., Hartley, R., & Huang, X. (2010). Pyramid center-symmetric local binary, trinary patterns for effective pedestrian detection. In Proceedings of Asian Conference on Computer Vision. Queenstown, New Zealand: ACCV.
Zhu, Q., Avidan, S., Yeh, M.-C., & Cheng, K.-T. (2006). Fast human detection using a cascade of histograms of oriented gradients. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1491–1498). New York: IEEE.
Acknowledgments
This work was in part supported by Australian Research Council Future Fellowship FT120100969.
Appendices
Appendix A: Proof of Theorem 1
Before we present our results, we introduce an important proposition from Yu et al. (2009). Note that we have used different notation.
Proposition 2
For a few different distribution families, the worst-case constraint \( \inf_{\mathbf{x}} \Pr\{ \mathbf{w}^{\top}\mathbf{x} \le b \} \ge \gamma \) can be written as follows:
1. If \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), i.e., \( \mathbf{x} \) follows an arbitrary distribution with mean \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \), then
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \sqrt{ \tfrac{ \gamma }{ 1 - \gamma } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }; \end{aligned}$$(27)
2. if \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{S} \),Footnote 6 then we have
$$\begin{aligned} {\left\{ \begin{array}{ll} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma) } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, & \text{if}\quad \gamma \in (0.5, 1); \\ b \ge \mathbf{w}^{\top}\boldsymbol{\mu}, & \text{if}\quad \gamma \in (0, 0.5]; \end{array}\right.} \end{aligned}$$(28)
3. if \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{SU} \), then
$$\begin{aligned} {\left\{ \begin{array}{ll} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \frac{2}{3} \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma) } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, & \text{if}\quad \gamma \in (0.5, 1); \\ b \ge \mathbf{w}^{\top}\boldsymbol{\mu}, & \text{if}\quad \gamma \in (0, 0.5]; \end{array}\right.} \end{aligned}$$(29)
4. if \( \mathbf{x} \) follows a Gaussian distribution with mean \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \), i.e., \( \mathbf{x} \sim \mathcal{G}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), then
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \Phi^{-1}(\gamma) \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, \end{aligned}$$(30)
where \( \Phi(\cdot) \) is the cumulative distribution function (c.d.f.) of the standard normal distribution \( \mathcal{G}(0, 1) \), and \( \Phi^{-1}(\cdot) \) is the inverse function of \( \Phi(\cdot) \). Two useful observations about \( \Phi^{-1}(\cdot) \) are: \( \Phi^{-1}(0.5) = 0 \); and \( \Phi^{-1}(\cdot) \) is a monotonically increasing function on its domain.
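The four bounds above differ only in the coefficient \( \varphi(\gamma) \) that multiplies \( \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} } \). As a quick numerical sketch (the function names below are ours, not from the paper), these coefficients can be computed and compared:

```python
from math import sqrt
from statistics import NormalDist

def phi_general(g):
    # Case 1: arbitrary distribution with given mean and covariance (Cantelli bound)
    return sqrt(g / (1.0 - g))

def phi_symmetric(g):
    # Case 2: distributions symmetric about the mean
    return sqrt(1.0 / (2.0 * (1.0 - g))) if g > 0.5 else 0.0

def phi_symmetric_unimodal(g):
    # Case 3: symmetric and linear unimodal about the mean
    return (2.0 / 3.0) * sqrt(1.0 / (2.0 * (1.0 - g))) if g > 0.5 else 0.0

def phi_gaussian(g):
    # Case 4: Gaussian; inverse c.d.f. of the standard normal
    return NormalDist().inv_cdf(g)

g = 0.9
vals = [phi_general(g), phi_symmetric(g), phi_symmetric_unimodal(g), phi_gaussian(g)]
# For gamma in (0.5, 1): phi_gnrl > phi_S > phi_SU > phi_G, as used in Appendix A
assert vals[0] > vals[1] > vals[2] > vals[3]
```

The smaller the coefficient, the weaker the required bound, which is why stronger distributional assumptions yield better worst-case guarantees for the same problem.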
We omit the proof of Proposition 2 here and refer the reader to Yu et al. (2009) for details. Next we begin to prove Theorem 1:
Proof
The second constraint of (2) is simply
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu}_2. \end{aligned}$$(31)
The first constraint of (2) can be handled by writing \( \mathbf{w}^{\top}\mathbf{x}_1 \ge b \) as \( -\mathbf{w}^{\top}\mathbf{x}_1 \le -b \) and applying the results in Proposition 2. It can be written as
$$\begin{aligned} b \le \mathbf{w}^{\top}\boldsymbol{\mu}_1 - \varphi(\gamma) \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} }, \end{aligned}$$(32)
with \( \varphi(\gamma) \) given by (6).
Let us assume that \( \boldsymbol{\Sigma}_1 \) is strictly positive definite (if it is only positive semidefinite, we can always add a small regularization to its diagonal components). From (32) we have
$$\begin{aligned} \varphi(\gamma) \le \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }. \end{aligned}$$(33)
So the optimization problem becomes
$$\begin{aligned} \max_{\mathbf{w}, b, \gamma} \; \gamma, \quad \text{s.t. (31) and (33)}. \end{aligned}$$(34)
The maximum value of \( \gamma \) (which we label \( \gamma^\star \)) is achieved when (33) is strictly an equality. To see this, suppose the maximum were achieved when
$$\begin{aligned} \varphi(\gamma^\star) < \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }. \end{aligned}$$
Then a new solution could be obtained by increasing \( \gamma^\star \) by a small positive amount so that (33) becomes an equality. The constraint (31) would not be affected, and the new solution would be better than the previous one. Hence, at the optimum, (5) must be fulfilled.
Because \( \varphi(\gamma) \) is monotonically increasing on its domain \( (0, 1) \) for all four cases (see Fig. 9), maximizing \( \gamma \) is equivalent to maximizing \( \varphi(\gamma) \), and this results in
$$\begin{aligned} \max_{\mathbf{w}, b} \; \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }, \quad \text{s.t.} \; b \ge \mathbf{w}^{\top}\boldsymbol{\mu}_2. \end{aligned}$$(35)
As in Lanckriet et al. (2002) and Huang et al. (2004), we also have a scale ambiguity: if \( ( \mathbf w ^\star , b^\star )\) is a solution, \( ( t \mathbf w ^\star , t b^\star ) \) with \( t > 0 \) is also a solution.
An important observation is that problem (35) must attain its optimum at (4). Otherwise, if \( b > \mathbf{w}^{\top}\boldsymbol{\mu}_2 \), the optimal value of (35) would be smaller. So we can rewrite (35) as the unconstrained problem (3).
We have thus shown that, if \( \mathbf x _1 \) is distributed according to a symmetric, symmetric unimodal, or Gaussian distribution, the resulting optimization problem is identical. This is not surprising considering the latter two cases are merely special cases of the symmetric distribution family.
At optimality, the inequality (33) becomes an equality, and hence \( \gamma^\star \) can be obtained as in (5). For ease of exposition, let us denote the four cases on the right-hand side of (6) as \( \varphi_\mathrm{gnrl}(\cdot) \), \( \varphi_\mathrm{S}(\cdot) \), \( \varphi_\mathrm{SU}(\cdot) \), and \( \varphi_\mathcal{G}(\cdot) \). For \( \gamma \in [0.5, 1) \), as shown in Fig. 9, we have \( \varphi_\mathrm{gnrl}(\gamma) > \varphi_\mathrm{S}(\gamma) > \varphi_\mathrm{SU}(\gamma) > \varphi_\mathcal{G}(\gamma) \). Therefore, when solving (5) for \( \gamma^\star \), we have \( \gamma^\star_\mathrm{gnrl} < \gamma^\star_\mathrm{S} < \gamma^\star_\mathrm{SU} < \gamma^\star_\mathcal{G} \). That is to say, one can obtain a better accuracy guarantee when additional information about the data distribution is available, although the actual optimization problem to be solved is identical.
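As a concrete illustration of the result above: substituting \( b = \mathbf{w}^{\top}\boldsymbol{\mu}_2 \) leaves the problem of maximizing \( \mathbf{w}^{\top}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) / \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } \), whose maximizer is \( \mathbf{w} \propto \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \) by the Cauchy-Schwarz inequality. A small NumPy sketch (toy class statistics and variable names are ours) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class statistics, assumed for illustration only
d = 5
mu1 = rng.normal(size=d)           # positive-class mean
mu2 = rng.normal(size=d)           # negative-class mean
A = rng.normal(size=(d, d))
sigma1 = A @ A.T + np.eye(d)       # positive-class covariance (strictly p.d.)

def objective(w):
    # the ratio (w^T (mu1 - mu2)) / sqrt(w^T Sigma_1 w)
    return w @ (mu1 - mu2) / np.sqrt(w @ sigma1 @ w)

# Closed-form maximizer: w proportional to Sigma_1^{-1} (mu1 - mu2)
w_star = np.linalg.solve(sigma1, mu1 - mu2)
b_star = w_star @ mu2              # the threshold attained at the constraint boundary

# No random direction beats w_star (the objective is scale-invariant)
best_random = max(objective(rng.normal(size=d)) for _ in range(1000))
assert objective(w_star) >= best_random

# The optimal value equals sqrt((mu1 - mu2)^T Sigma_1^{-1} (mu1 - mu2))
gap = mu1 - mu2
assert np.isclose(objective(w_star), np.sqrt(gap @ np.linalg.solve(sigma1, gap)))
```

The scale ambiguity noted in the proof is visible here: any positive multiple of `w_star` (with the correspondingly scaled `b_star`) achieves the same objective value.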
Appendix B: Proof of Theorem 2
Let us assume that the current solution contains \( n \) selected weak classifiers, with corresponding linear weights \( \mathbf{w} = [w_1, \ldots, w_n] \). If, for any weak classifier \( h^{\prime}(\cdot) \) not in the current subset, the corresponding coefficient \( w \) would be zero, then we can conclude that the current weak classifiers and \( \mathbf{w} \) already form the optimal solution. In this case, the best weak classifier found by solving the subproblem (20) does not contribute to solving the master problem.
Now consider the case where the optimality condition is violated. We need to show that we can find a weak learner \( h^{\prime}(\cdot) \), not in the set of currently selected weak classifiers, whose corresponding coefficient satisfies \( w > 0 \). Again, assume \( h^{\prime}(\cdot) \) is the most violated weak learner found by solving (20), and that the convergence condition is not satisfied. In other words, we have
Now, after this weak learner is added into the master problem, the corresponding primal solution \( w \) must be non-zero (positive, because of the nonnegativity constraint on \( \mathbf{w} \)).
If this were not the case, the corresponding \( w \) would be \( 0 \), which is impossible for the following reason. From the Lagrangian (17), at optimality we have \( \partial L / \partial \mathbf{w} = \boldsymbol{0} \), which leads to
Clearly, (36) and (37) contradict each other.
Thus, after the weak classifier \( h^{\prime}(\cdot) \) is added to the primal problem, its corresponding \( w \) must take a positive value. That is to say, one more free variable is added to the problem, and re-solving the primal problem (16) must reduce the objective value, so a strict decrease in the objective is obtained. In other words, Algorithm 1 makes progress at each iteration. Furthermore, since the primal optimization problem is convex, it has no local optima, and the column generation procedure is guaranteed to converge to the global optimum up to a prescribed accuracy.
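The column generation loop proved convergent above can be sketched generically. This is an illustration only, not the exact LACBoost master problem (16): for simplicity we use an exponential surrogate loss over a pool of precomputed weak-classifier outputs, solve the restricted master by projected gradient descent on the nonnegative orthant, and "price out" candidates by the master's gradient; all data and names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: m examples, a pool of n candidate weak classifiers.
# H[i, j] = y_i * h_j(x_i) in {-1, +1}; one column is made informative.
m, n = 200, 40
H = np.sign(rng.normal(size=(m, n)))
H[:, 0] = np.where(rng.random(m) < 0.8, 1.0, -1.0)

def loss_and_grad(w, cols):
    margins = H[:, cols] @ w
    e = np.exp(-margins)                  # exponential surrogate loss terms
    return e.sum(), -(H[:, cols].T @ e)   # loss and gradient w.r.t. selected weights

def solve_restricted(cols, iters=3000, lr=1e-3):
    # Totally-corrective step: re-optimize all selected weights jointly,
    # keeping them nonnegative (projected gradient descent).
    w = np.zeros(len(cols))
    for _ in range(iters):
        _, g = loss_and_grad(w, cols)
        w = np.maximum(w - lr * g, 0.0)
    return w

cols, objs = [], []
for _ in range(5):
    # Price out every candidate: gradient of the master w.r.t. each weight.
    w_full = np.zeros(n)
    if cols:
        w_full[cols] = w
    grad_full = -(H.T @ np.exp(-(H @ w_full)))
    j = int(np.argmin(grad_full))         # most violated weak classifier
    if grad_full[j] >= 0 or j in cols:    # no candidate can improve: stop
        break
    cols.append(j)
    w = solve_restricted(cols)
    objs.append(loss_and_grad(w, cols)[0])

# As argued above, each added column strictly decreases the master objective.
assert objs[-1] < m
assert all(b < a + 1e-6 for a, b in zip(objs, objs[1:]))
```

The structure mirrors the proof: a candidate is added only when its "reduced cost" (here, the master gradient) is negative, and the totally-corrective re-solve then guarantees a strict decrease.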
Appendix C: Exponentiated Gradient Descent
Exponentiated gradient descent (EG) is a very useful tool for solving large-scale convex minimization problems over the unit simplex. Let us first define the unit simplex \( \Delta_n = \{ \mathbf{w} \in \mathbb{R}^n : \mathbf{1}^{\top}\mathbf{w} = 1, \mathbf{w} \succcurlyeq \boldsymbol{0} \} \). EG efficiently solves the convex optimization problem
$$\begin{aligned} \min_{\mathbf{w} \in \Delta_n} \; f(\mathbf{w}) \end{aligned}$$(38)
under the assumption that the objective function \( f(\cdot) \) is a convex Lipschitz continuous function with Lipschitz constant \( L_f \) w.r.t. a fixed given norm \( \|\cdot\| \). The mathematical definition of \( L_f \) is that \( | f(\mathbf{w}) - f(\mathbf{z}) | \le L_f \|\mathbf{w} - \mathbf{z}\| \) holds for any \( \mathbf{w}, \mathbf{z} \) in the domain of \( f(\cdot) \). The EG algorithm is very simple:
1. Initialize with \( \mathbf{w}^0 \) in the interior of \( \Delta_n \);
2. generate the sequence \( \{ \mathbf{w}^k \} \), \( k = 1, 2, \ldots \), with
$$\begin{aligned} w^k_j = \frac{ w^{k-1}_j \exp[ -\tau_k f^{\prime}_{j}(\mathbf{w}^{k-1}) ] }{ \sum_{i=1}^n w^{k-1}_i \exp[ -\tau_k f^{\prime}_{i}(\mathbf{w}^{k-1}) ] }. \end{aligned}$$(39)
Here \( \tau_k \) is the step size, and \( f^{\prime}(\mathbf{w}) = [ f^{\prime}_{1}(\mathbf{w}), \ldots, f^{\prime}_{n}(\mathbf{w}) ]^{\top} \) is the gradient of \( f(\cdot) \);
3. stop if some stopping criterion is met.
Following Beck and Teboulle (2003), the learning step size can be set to
$$\begin{aligned} \tau_k = \frac{ \sqrt{ 2 \ln n } }{ L_f \sqrt{k} }. \end{aligned}$$(40)
In Collins et al. (2008), the authors used a simpler strategy to set the learning rate.
An important parameter in EG is \( L_f \), which determines the step size. \( L_f \) can be bounded by the \( \ell_\infty \)-norm of \( f^{\prime}(\mathbf{w}) \). In our case \( f^{\prime}(\mathbf{w}) \) is a linear function, which is trivial to compute. The convergence of EG is guaranteed; see Beck and Teboulle (2003) for details.
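The update (39) can be implemented in a few lines. The quadratic objective below is a toy example of ours, not the boosting problem of the paper; the step size follows Beck and Teboulle (2003), \( \tau_k = \sqrt{2 \ln n} / (L_f \sqrt{k}) \).

```python
import numpy as np

def eg_minimize(grad, n, n_iters, L_f):
    """Exponentiated gradient descent over the unit simplex Delta_n."""
    w = np.full(n, 1.0 / n)                # start in the interior of Delta_n
    for k in range(1, n_iters + 1):
        g = grad(w)
        tau = np.sqrt(2.0 * np.log(n)) / (L_f * np.sqrt(k))  # diminishing step size
        z = w * np.exp(-tau * g)           # the multiplicative update (39) ...
        w = z / z.sum()                    # ... with simplex renormalization
    return w

# Toy problem: min_{w in Delta_n} 0.5 * ||w - t||^2 for a target t inside the simplex,
# so the constrained minimizer is t itself.
n = 4
t = np.array([0.4, 0.3, 0.2, 0.1])
grad = lambda w: w - t                     # the gradient is linear, as noted above
L_f = 2.0                                  # a valid bound on the gradient norm here
w = eg_minimize(grad, n, n_iters=5000, L_f=L_f)
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
assert np.allclose(w, t, atol=1e-2)
```

Note that the iterates stay strictly inside the simplex by construction, so no explicit projection step is needed; this is the practical appeal of EG over projected (sub)gradient methods for simplex-constrained problems.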
Shen, C., Wang, P., Paisitkriangkrai, S. et al. Training Effective Node Classifiers for Cascade Classification. Int J Comput Vis 103, 326–347 (2013). https://doi.org/10.1007/s11263-013-0608-1