Training Effective Node Classifiers for Cascade Classification


Abstract

Cascade classifiers are widely used in real-time object detection. Unlike conventional classifiers, which are designed for a low overall classification error rate, a classifier in each node of the cascade is required to achieve an extremely high detection rate and only a moderate false positive rate. Although there are a few reported methods addressing this requirement in the context of object detection, there is no principled feature selection method that explicitly takes into account this asymmetric node learning objective. We provide such an algorithm here. We show that a special case of the biased minimax probability machine has the same formulation as the linear asymmetric classifier (LAC) of Wu et al. (linear asymmetric classifier for cascade detectors, 2005). We then design a new boosting algorithm that directly optimizes the cost function of LAC. The resulting totally-corrective boosting algorithm is implemented by the column generation technique in convex optimization. Experimental results on object detection verify the effectiveness of the proposed boosting algorithm as a node classifier in cascade object detection, and show performance better than that of the current state-of-the-art.


Notes

  1. In our object detection experiment, we found that this assumption can always be satisfied.

  2. Since the multi-exit cascade makes use of all previous weak classifiers in earlier nodes, it would meet the Gaussianity requirement better than the conventional cascade classifier.

  3. Training a complete \(22\)-node cascade and choosing the best \( \theta \) on cross-validation data may give better detection rates.

  4. Our implementation is in C++ and only the weak classifier training part is parallelized using OpenMP.

  5. Covariance features capture the relationship between different image statistics and have been shown to perform well in our previous experiments. However, other discriminative features can also be used here instead, e.g., Haar-like features, local binary pattern (LBP) (Mu et al. 2008) and self-similarity of low-level features (CSS) (Walk et al. 2010).

  6. Here \(({\varvec{\mu }}, {\varvec{\varSigma }})_\mathrm{S}\) denotes the family of distributions in \( ( {\varvec{\mu }}, {\varvec{\varSigma }}) \) that are also symmetric about the mean \( {\varvec{\mu }}.\) \(({\varvec{\mu }}, {\varvec{\varSigma }})_\mathrm{SU}\) denotes the family of distributions in \( ( {\varvec{\mu }}, {\varvec{\varSigma }}) \) that are additionally symmetric and linear unimodal about \( {\varvec{\mu }}.\)

References

  • Agarwal, S., Awan, A., & Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1475–1490.

  • Aldavert, D., Ramisa, A., Toledo, R., & Lopez de Mantaras, R. (2010). Fast and robust object segmentation with the integral linear classifier. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.

  • Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.

  • Bi, J., Periaswamy, S., Okada, K., Kubota, T., Fung, G., & Salganicoff, M., et al. (2006). Computer aided detection via asymmetric cascade of sparse hyperplane classifiers. In Proceedings of ACM International Conference Discovery & Data Mining (pp. 837–844). Philadelphia, PA: ACM Press.

  • Bourdev, L., & Brandt, J. (2005). Robust object detection via soft cascade. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 236–243). San Diego, CA: IEEE.

  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  • Brubaker, S. C., Wu, J., Sun, J., Mullin, M. D., & Rehg, J. M. (2008). On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77(1–3), 65–86.

  • Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. L. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9, 1775–1822.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 886–893). San Diego, CA: IEEE.

  • Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254.

  • Dollár, P. (2012). Piotr’s image and video Matlab toolbox. Retrieved December 14, 2012, from http://vision.ucsd.edu/pdollar/toolbox/doc/.

  • Dollár, P., Babenko, B., Belongie, S., Perona, P., & Tu, Z. (2008). Multiple component learning for object detection. In Proceedings of European Conference on Computer Vision (pp. 211–224). Marseille, France: ECCV.

  • Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.

  • Dundar, M., & Bi, J. (2007). Joint optimization of cascaded classifiers for computer aided detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.

  • Enzweiler, M., Eigenstetter, A., Schiele, B., & Gavrila, D. M. (2010). Multi-cue pedestrian classification with partial occlusion handling. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.

  • Ess, A., Leibe, B., & Van Gool, L. (2007). Depth and appearance for mobile scene analysis. In Proceedings of International Conference on Computer Vision. Rio de Janeiro: ICCV.

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In Proceedings of International Conference on Computer Vision. Kyoto, Japan: ICCV.

  • Huang, K., Yang, H., King, I., Lyu, M., & Chan, L. (2004). The minimum error minimax probability machine. Journal of Machine Learning Research, 5, 1253–1286.

  • Lanckriet, G. R. G., El Ghaoui, L., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE.

  • Lefakis, L., & Fleuret, F. (2010). Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems. Vancouver: NIPS.

  • Li, S. Z., & Zhang, Z. (2004). FloatBoost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1112–1123.

  • Lin, Z., Hua, G., & Davis, L. S. (2009). Multiple instance feature for robust part-based object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 405–412). IEEE: Miami, FL.

  • Liu, C., & Shum, H.-Y. (2003). Kullback-Leibler boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 587–594). Madison, Wisconsin: IEEE.

  • Maji, S., Berg, A. C., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.

  • Masnadi-Shirazi, H., & Vasconcelos, N. (2007). Asymmetric boosting. In Proceedings of International Conference on Machine Learning (pp. 609–619). Corvallis, Oregon: IMLS.

  • Masnadi-Shirazi, H., & Vasconcelos, N. (2011). Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 294–309.

  • MOSEK. (2010). The MOSEK optimization toolbox for matlab manual, version 6.0, revision 93. Retrieved May 25, 2010, from http://www.mosek.com/.

  • Mu, Y., Yan, S., Liu, Y., Huang, T., & Zhou, B. (2008). Discriminative local binary patterns for human detection in personal album. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.

  • Munder, S., & Gavrila, D. M. (2006). An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1863–1868.

  • Paisitkriangkrai, S., Shen, C., & Zhang, J. (2008). Fast pedestrian detection using a cascade of boosted covariance features. IEEE Transactions on Circuits and Systems for Video Technology, 18(8), 1140–1151.

  • Paisitkriangkrai, S., Shen, C., & Zhang, J. (2009). Efficiently training a better visual detector with sparse Eigenvectors. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami, Florida: IEEE.

  • Pham, M.-T., & Cham, T.-J. (2007a). Fast training and selection of Haar features using statistics in boosting-based face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.

  • Pham, M.-T., & Cham, T.-J. (2007b). Online learning asymmetric boosted classifiers for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.

  • Pham, M.-T., Hoang, V.-D. D., & Cham, T.-J. (2008). Detection with multi-exit asymmetric boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, Alaska: IEEE.

  • Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199.

  • Saberian, M., & Vasconcelos, N. (2012). Learning optimal embedded cascades. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Saberian, M. J., & Vasconcelos, N. (2010). Boosting classifier cascades. In Advances in Neural Information Processing Systems. Vancouver: NIPS.

  • Shen, C., & Li, H. (2010a). Boosting through optimization of margin distributions. IEEE Transanctions on Neural Networks, 21(4), 659–666.

  • Shen, C., & Li, H. (2010b). On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2216–2231. IEEE computer Society Digital Library. http://dx.doi.org/10.1109/TPAMI.2010.47.

  • Shen, C., Paisitkriangkrai, S., & Zhang, J. (2008). Face detection from few training examples. In Proceedings of the International Conference on Image Processing (pp. 2764–2767). IEEE: San Diego, California.

  • Shen, C., Wang, P., & Li, H. (2010). LACBoost and FisherBoost: Optimally building cascade classifiers. In Proceedings of European Conference on Computer Vision, LNCS 6312. (Vol. 2, pp. 608–621). Crete Island, Greece: Springer.

  • Shen, C., Paisitkriangkrai, S., Zhang, J. (2011). Efficiently learning a detection cascade with sparse Eigenvectors. IEEE Transactions on Image Processing, 20(1):22–35. http://dx.doi.org/10.1109/TIP.2010.2055880.

  • Sochman, J., & Matas, J. (2005). Waldboost learning for time constrained sequential detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego: IEEE.

  • Torralba, A., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.

  • Tu, H.-H., & Lin, H.-T. (2010). One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the International Conference on Machine Learning. Haifa, Israel: ICML.

  • Tuzel, O., Porikli, F., & Meer, P. (2008). Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1713–1727.

  • Viola, P., & Jones, M. (2002). Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in Neural Information Processing Systems (pp. 1311–1318). Cambridge: MIT Press.

  • Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.

  • Viola, P., Platt, J. C., & Zhang, C. (2005). Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems (pp. 1417–1424). Vancouver, Canada: NIPS.

  • Walk, S., Majer, N., Schindler, K., & Schiele, B. (2010). New features and insights for pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.

  • Wang, P., Shen, C., Barnes, N., & Zheng, H. (2012). Fast and robust object detection using asymmetric totally-corrective boosting. IEEE Transactions on Neural Networks and Learning Systems, 23(1), 33–46.

  • Wang, W., Zhang, J., & Shen, C. (2010). Improved human detection and classification in thermal images. In Proceedings of the International Conference on Image Processing. Hong Kong: ICIP.

  • Wang, X., Han, T. X., & Yan, S. (2007). An HOG-LBP human detector with partial occlusion handling. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.

  • Wojek, C., Walk, S., & Schiele, B. (2009). Multi-cue onboard pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami: IEEE.

  • Wu, B., & Nevatia, R. (2008). Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.

  • Wu, J., & Rehg, J. M. (2011). CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1489–1501.

  • Wu, J., Rehg, J. M., & Mullin, M. D. (2003). Learning a rare event detection cascade by direct feature selection. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems. Vancouver, Canada: NIPS.

  • Wu, J., Mullin, M. D., & Rehg, J. M. (2005). Linear asymmetric classifier for cascade detectors. In Proceedings of the International Conference on Machine Learning (pp. 988–995). Bonn, Germany: IMLS.

  • Wu, J., Brubaker, S. C., Mullin, M. D., & Rehg, J. M. (2008). Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 369–382.

  • Xiao, R., Zhu, L., & Zhang, H.-J. (2003). Boosting chain learning for object detection. In Proceedings of International Conference on Computer Vision (pp. 709–715). Nice, France: ICCV.

  • Xiao, R., Zhu, H., Sun, H., & Tang, X. (2007). Dynamic cascades for face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.

  • Yu, Y.-L., Li, Y., Schuurmans, D., & Szepesvári, C. (2009). A general projection property for distribution families. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (pp. 2232–2240). Vancouver, Canada: NIPS.

  • Zheng, Y., Shen, C., Hartley, R., & Huang, X. (2010). Pyramid center-symmetric local binary, trinary patterns for effective pedestrian detection. In Proceedings of Asian Conference on Computer Vision. Queenstown, New Zealand: ACCV.

  • Zhu, Q., Avidan, S., Yeh, M.-C., & Cheng, K.-T. (2006). Fast human detection using a cascade of histograms of oriented gradients. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1491–1498). New York: IEEE.


Acknowledgments

This work was supported in part by Australian Research Council Future Fellowship FT120100969.

Author information

Corresponding author

Correspondence to Chunhua Shen.

Appendices

Appendix A: Proof of Theorem 1

Before we present our results, we introduce an important proposition from Yu et al. (2009). Note that our notation differs from theirs.

Proposition 2

For a few different distribution families, the worst-case constraint

$$\begin{aligned} \left[ \inf _{ \mathbf x \sim ( {\varvec{\mu }}, {\varvec{\varSigma }}) } \Pr \left\{ \mathbf w ^{\!\top }\mathbf x \le b \right\} \right] \ge \gamma , \end{aligned}$$
(26)

can be written as:

  1. 1.

    if \( \mathbf x \sim ({\varvec{\mu }}, {\varvec{\varSigma }}) \), i.e., \( \mathbf x \) follows an arbitrary distribution with mean \( {\varvec{\mu }}\) and covariance \( {\varvec{\varSigma }}\), then

    $$\begin{aligned} b \ge \mathbf w ^{\!\top }{\varvec{\mu }}+ \sqrt{ \tfrac{ \gamma }{ 1 - \gamma } } \cdot \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}\mathbf w }; \end{aligned}$$
    (27)
  2. 2.

    if \( \mathbf x \sim ({\varvec{\mu }}, {\varvec{\varSigma }})_\mathrm{S} \) (Footnote 6), then we have

    $$\begin{aligned} \left\{ \begin{array}{ll} b \ge \mathbf w ^{\!\top }{\varvec{\mu }}+ \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma ) } } \cdot \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}\mathbf w }, & \text{ if}\quad \gamma \in (0.5,1);\\ b \ge \mathbf w ^{\!\top }{\varvec{\mu }}, & \text{ if}\quad \gamma \in (0,0.5]; \end{array}\right. \end{aligned}$$
    (28)
  3. 3.

    if \( \mathbf x \sim ({\varvec{\mu }}, {\varvec{\varSigma }})_\mathrm{SU} \), then

    $$\begin{aligned} \left\{ \begin{array}{ll} b \ge \mathbf w ^{\!\top }{\varvec{\mu }}+ \frac{2}{3} \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma ) } } \cdot \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}\mathbf w }, & \text{ if}\quad \gamma \in (0.5,1);\\ b \ge \mathbf w ^{\!\top }{\varvec{\mu }}, & \text{ if}\quad \gamma \in (0,0.5]; \end{array}\right. \end{aligned}$$
    (29)
  4. 4.

    if \( \mathbf x \) follows a Gaussian distribution with mean \( {\varvec{\mu }}\) and covariance \( {\varvec{\varSigma }}\), i.e., \( \mathbf x \sim \mathcal{G}( {\varvec{\mu }}, {\varvec{\varSigma }}) \), then

    $$\begin{aligned} b \ge \mathbf w ^{\!\top }{\varvec{\mu }}+ \Phi ^{-1} ( \gamma ) \cdot \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}\mathbf w }, \end{aligned}$$
    (30)

    where \( \Phi (\cdot )\) is the cumulative distribution function (c.d.f.) of the standard normal distribution \( \mathcal{G} (0,1)\), and \( \Phi ^{-1} (\cdot )\) is the inverse function of \( \Phi (\cdot ).\) Two useful observations about \( \Phi ^{-1} (\cdot )\) are: \( \Phi ^{-1} ( 0.5) = 0 \); and \( \Phi ^{-1} ( \cdot ) \) is a monotonically increasing function in its domain.

We omit the proof of Proposition 2 here and refer the reader to Yu et al. (2009) for details. Next we begin to prove Theorem 1:

Proof

The second constraint of (2) is simply

$$\begin{aligned} b \ge \mathbf w ^{\!\top }{\varvec{\mu }}_2. \end{aligned}$$
(31)

The first constraint of (2) can be handled by writing \( \mathbf w ^{\!\top }\mathbf x _1 \ge b \) as \( - \mathbf w ^{\!\top }\mathbf x _1 \le - b \) and applying the results in Proposition 2. It can be written as

$$\begin{aligned} - b + \mathbf w ^{\!\top }{\varvec{\mu }}_1 \ge \varphi ( \gamma ) \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}_1 \mathbf w }, \end{aligned}$$
(32)

where \( \varphi ( \gamma ) \) is defined in (6).
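
To spell the step out, this is the bound of Proposition 2 (with the appropriate case of (6)) applied to \( \Pr \left\{ -\mathbf w ^{\!\top }\mathbf x _1 \le -b \right\} \ge \gamma \) with \( \mathbf x _1 \sim ( {\varvec{\mu }}_1, {\varvec{\varSigma }}_1 ) \):

$$\begin{aligned} -b \ge (-\mathbf w )^{\!\top }{\varvec{\mu }}_1 + \varphi ( \gamma ) \sqrt{ (-\mathbf w )^{\!\top }{\varvec{\varSigma }}_1 (-\mathbf w ) } \;\Longleftrightarrow \; -b + \mathbf w ^{\!\top }{\varvec{\mu }}_1 \ge \varphi ( \gamma ) \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}_1 \mathbf w }, \end{aligned}$$

which is exactly (32).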

Let us assume that \( {\varvec{\varSigma }}_1 \) is strictly positive definite (if it is only positive semidefinite, we can always add a small regularization to its diagonal components). From (32) we have

$$\begin{aligned} \varphi ( \gamma ) \le \frac{ -b + \mathbf w ^{\!\top }{\varvec{\mu }}_1 }{ \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}_1 \mathbf w }}. \end{aligned}$$
(33)

So the optimization problem becomes

$$\begin{aligned} \max _{ \mathbf w , b, \gamma } \, \gamma , \quad \text{ s.t.} \quad (31)\quad \text{ and}\quad (33). \end{aligned}$$
(34)

The maximum value of \(\gamma \) (which we label \( \gamma ^\star \)) is achieved when (33) holds with equality. To see this, suppose that the maximum is achieved when

$$\begin{aligned} \varphi ( \gamma ^\star ) < \frac{ -b + \mathbf w ^{\!\top }{\varvec{\mu }}_1 }{ \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}_1 \mathbf w }}. \end{aligned}$$

Then a better solution can be obtained by increasing \( \gamma ^\star \) by a positive amount such that (33) becomes an equality; the constraint (31) is not affected. Hence, at the optimum, (5) must be fulfilled.

Because \( \varphi (\gamma ) \) is monotonically increasing on its domain \( (0,1) \) in all four cases (see Fig. 9), maximizing \( \gamma \) is equivalent to maximizing \( \varphi (\gamma ) \), which results in

$$\begin{aligned} \max _{ \mathbf w , b } \, \frac{ -b + \mathbf w ^{\!\top }{\varvec{\mu }}_1 }{ \sqrt{ \mathbf w ^{\!\top }{\varvec{\varSigma }}_1 \mathbf w } }, \quad \text{ s.t.} \quad b \ge \mathbf w ^{\!\top }{\varvec{\mu }}_2. \end{aligned}$$
(35)

As in Lanckriet et al. (2002) and Huang et al. (2004), we also have a scale ambiguity: if \( ( \mathbf w ^\star , b^\star )\) is a solution, \( ( t \mathbf w ^\star , t b^\star ) \) with \( t > 0 \) is also a solution.

An important observation is that problem (35) must attain its optimum at (4): if instead \( b > \mathbf w ^{\!\top }{\varvec{\mu }}_2\), the optimal value of (35) would be smaller. So we can rewrite (35) as the unconstrained problem (3).

We have thus shown that, if \( \mathbf x _1 \) is distributed according to a symmetric, symmetric unimodal, or Gaussian distribution, the resulting optimization problem is identical. This is not surprising, considering that the latter two cases are merely special cases of the symmetric distribution family.

At optimality, the inequality (33) becomes an equality, and hence \( \gamma ^\star \) can be obtained as in (5). For ease of exposition, let us denote the four cases on the right-hand side of (6) as \( \varphi _\mathrm{gnrl} ( \cdot )\), \( \varphi _\mathrm{S} ( \cdot ) \), \( \varphi _\mathrm{SU} ( \cdot ) \), and \( \varphi _\mathcal{G} ( \cdot ) .\) For \( \gamma \in [0.5, 1) \), as shown in Fig. 9, we have \( \varphi _\mathrm{gnrl} ( \gamma ) > \varphi _\mathrm{S} ( \gamma ) > \varphi _\mathrm{SU} ( \gamma ) > \varphi _\mathcal{G} ( \gamma ).\) Therefore, when solving (5) for \( \gamma ^\star \), we have \( \gamma ^\star _\mathrm{gnrl} < \gamma ^\star _\mathrm{S} < \gamma ^\star _\mathrm{SU} < \gamma ^\star _\mathcal{G}.\) That is to say, one can obtain better accuracy when additional information about the data distribution is available, although the actual optimization problem to be solved is identical.
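
As a concrete sanity check of this ordering (the numbers below follow directly from (6) and are not taken from the paper's experiments), evaluate \( \varphi ( \cdot ) \) at \( \gamma = 0.95 \):

$$\begin{aligned} \varphi _\mathrm{gnrl} (0.95) = \sqrt{ \tfrac{0.95}{0.05} } \approx 4.36, \quad \varphi _\mathrm{S} (0.95) = \sqrt{ \tfrac{1}{2 \times 0.05} } \approx 3.16, \quad \varphi _\mathrm{SU} (0.95) = \tfrac{2}{3} \sqrt{ \tfrac{1}{2 \times 0.05} } \approx 2.11, \quad \varphi _\mathcal{G} (0.95) = \Phi ^{-1} (0.95) \approx 1.64, \end{aligned}$$

which is consistent with the ordering above; the corresponding \( \gamma ^\star \) values obtained from (5) are therefore ordered the other way.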

Fig. 9

The function \( \varphi (\cdot ) \) in (6). The four curves correspond to the four cases. They are all monotonically increasing in \((0,1)\).

Appendix B: Proof of Theorem 2

Let us assume that in the current solution we have selected \( n \) weak classifiers, and that their corresponding linear weights are \( \mathbf w = [ w_1, \ldots , w_n ] .\) If, for any weak classifier \( h^{\prime }(\cdot ) \) that is not in the current subset, the corresponding coefficient \( w \) would be zero, then we can conclude that the current weak classifiers and \(\mathbf w \) are already the optimal solution. In this case, the best weak classifier found by solving the subproblem (20) does not contribute to solving the master problem.

Let us consider the case that the optimality condition is violated. We need to show that we can then find a weak learner \( h^{\prime }(\cdot ) \), not in the set of currently selected weak classifiers, whose corresponding coefficient satisfies \( w > 0 .\) Again, assume that \( h^{\prime }(\cdot ) \) is the most violated weak learner found by solving (20) and that the convergence condition is not satisfied. In other words, we have

$$\begin{aligned} \sum \limits _{i=1}^m u_i y_i h^{\prime }( \mathbf x _i ) \ge r. \end{aligned}$$
(36)

Now, after this weak learner is added into the master problem, the corresponding primal solution \( w \) must be non-zero (positive, because of the nonnegativity constraint on \( \mathbf w \)).

Suppose, to the contrary, that the corresponding \( w = 0 .\) This is impossible for the following reason. From the Lagrangian (17), at optimality we have \( \partial L / \partial \mathbf w = {\varvec{0}} \), which leads to

$$\begin{aligned} r - \sum _{i=1}^m u_i y_i h^{\prime }( \mathbf x _i ) = q > 0. \end{aligned}$$
(37)

Clearly, (36) and (37) contradict each other.

Thus, after the weak classifier \( h^{\prime }(\cdot ) \) is added to the primal problem, its corresponding \( w \) must be positive. That is to say, one more free variable is added into the problem, and re-solving the primal problem (16) must reduce the objective value, so a strict decrease in the objective is obtained. In other words, Algorithm 1 makes progress at each iteration. Furthermore, because the primal optimization problem is convex, there are no local optima, and the column generation procedure is guaranteed to converge to the global optimum up to some prescribed accuracy.
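
To make the procedure concrete, the following is a minimal C++ sketch of a single column generation iteration as described above: the subproblem (20) picks the candidate weak classifier with the largest weighted edge \( \sum _i u_i y_i h( \mathbf x _i ) \), and the procedure stops once no candidate exceeds \( r .\) The weak classifier pool, the labels, the dual weights \( u \), and the value of \( r \) are toy values for illustration only; solving the restricted master problem (16), which supplies \( u \) and \( r \), is omitted.

    #include <cstdio>
    #include <vector>

    int main() {
        // H[t][i]: output of candidate weak classifier t on training example i (+1 / -1).
        std::vector<std::vector<int>> H = {
            { +1, -1, +1, +1 },
            { -1, -1, +1, -1 },
            { +1, +1, -1, +1 },
        };
        std::vector<int>    y = { +1, -1, +1, -1 };      // class labels
        std::vector<double> u = { 0.4, 0.1, 0.3, 0.2 };  // dual weights from the master problem
        const double r = 0.5;                            // current value of the dual variable r

        // Subproblem (20): find the most violated weak classifier, i.e. the one
        // with the largest weighted edge sum_i u_i y_i h(x_i).
        int best = -1;
        double bestEdge = -1e30;
        for (int t = 0; t < (int)H.size(); ++t) {
            double edge = 0.0;
            for (int i = 0; i < (int)y.size(); ++i)
                edge += u[i] * y[i] * H[t][i];
            if (edge > bestEdge) { bestEdge = edge; best = t; }
        }

        if (bestEdge <= r) {
            // No weak classifier violates the dual constraint: the current
            // solution is already optimal and column generation terminates.
            std::printf("Converged: no violated constraint.\n");
        } else {
            // Condition (36) holds for this classifier; add it as a new column
            // and re-solve the restricted master problem (16).
            std::printf("Add weak classifier %d (edge %.3f > r = %.3f).\n", best, bestEdge, r);
        }
        return 0;
    }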

Appendix C: Exponentiated Gradient Descent

Exponentiated gradient descent (EG) is a very useful tool for solving large-scale convex minimization problems over the unit simplex. Let us first define the unit simplex \( \varDelta _n = \left\{ \mathbf{w } \in \mathbb{R }^n : \mathbf{1 } ^ {\!\top }\mathbf w = 1, \mathbf w \succcurlyeq \mathbf{0 } \right\} .\) EG efficiently solves the convex optimization problem

$$\begin{aligned} \min _\mathbf w \quad f(\mathbf w ), \quad \mathrm{s.t.} \quad \mathbf w \in \varDelta _n, \end{aligned}$$
(38)

under the assumption that the objective function \( f(\cdot ) \) is a convex Lipschitz continuous function with Lipschitz constant \( L_f \) w.r.t. a fixed given norm \( ||\cdot ||.\) The mathematical definition of \( L_f \) is that \( | f(\mathbf w ) - f (\mathbf z ) | \le L_f ||\mathbf w - \mathbf z ||\) holds for any \( \mathbf w , \mathbf z \) in the domain of \( f(\cdot ).\) The EG algorithm is very simple:

  1. 1.

    Initialize with \( \mathbf w ^0 \) in the interior of \( \varDelta _n \);

  2. 2.

    Generate the sequence \( \left\{ \mathbf{w }^k \right\} \), \( k=1,2,\ldots \) with:

    $$\begin{aligned} \mathbf w ^k_j = \frac{ \mathbf w ^{k-1}_j \exp [ - \tau _k {f_{j}^{\prime }} ( \mathbf w ^{k-1} ) ] }{ \sum _{j=1}^n \mathbf w ^{k-1}_j \exp [ - \tau _k {f_{j}^{\prime }} ( \mathbf w ^{k-1} ) ] }. \end{aligned}$$
    (39)

    Here \( \tau _k \) is the step-size. \( f^{\prime }( \mathbf w ) = [ f_{1}^{\prime }(\mathbf w ), \dots , f_{n}^{\prime }(\mathbf w ) ] ^{\!\top }\) is the gradient of \( f(\cdot ) \);

  3. 3.

    Stop if some stopping criteria are met.

The learning step-size can be determined by

$$\begin{aligned} \tau _k = \frac{ \sqrt{ 2\log n } }{ L_f } \frac{1}{ \sqrt{ k } }, \end{aligned}$$

following Beck and Teboulle (2003). In Collins et al. (2008), the authors have used a simpler strategy to set the learning rate.

In EG there is an important parameter \( L_f \), which is used to determine the step-size; it can be taken as the \(\ell _\infty \)-norm of \( f^{\prime } (\mathbf w ) .\) In our case \( f^{\prime } (\mathbf w ) \) is a linear function, which is trivial to compute. The convergence of EG is guaranteed; see Beck and Teboulle (2003) for details.
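
The following is a minimal, self-contained C++ sketch of the EG update (39) with the step-size above, applied to a toy quadratic objective \( f(\mathbf w ) = \tfrac{1}{2} \mathbf w ^{\!\top }A \mathbf w + \mathbf c ^{\!\top }\mathbf w \) over the simplex (so that the gradient \( f^{\prime }(\mathbf w ) = A \mathbf w + \mathbf c \) is linear in \( \mathbf w \), as in our setting). The objective, the matrix \( A \), and the vector \( \mathbf c \) are illustrative assumptions, not the actual boosting objective optimized in the paper.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 4;
        // Toy problem data: a diagonal positive definite A and a linear term c.
        std::vector<double> A = {1.0, 2.0, 0.5, 1.5};   // diagonal entries of A
        std::vector<double> c = {0.3, -0.2, 0.7, 0.1};

        // Upper bound on L_f w.r.t. the l1-norm: on the simplex, |(Aw)_j| <= A_jj here,
        // so ||f'(w)||_inf <= max_j (A_jj + |c_j|).
        double Lf = 0.0;
        for (int j = 0; j < n; ++j) Lf = std::max(Lf, A[j] + std::fabs(c[j]));

        std::vector<double> w(n, 1.0 / n);              // start in the interior of the simplex

        for (int k = 1; k <= 500; ++k) {
            // Step-size from Beck and Teboulle (2003).
            double tau = std::sqrt(2.0 * std::log((double)n)) / (Lf * std::sqrt((double)k));

            // Multiplicative update followed by normalization, Eq. (39).
            std::vector<double> wnew(n);
            double Z = 0.0;
            for (int j = 0; j < n; ++j) {
                double grad = A[j] * w[j] + c[j];       // j-th component of f'(w)
                wnew[j] = w[j] * std::exp(-tau * grad);
                Z += wnew[j];
            }
            for (int j = 0; j < n; ++j) w[j] = wnew[j] / Z;
        }

        for (int j = 0; j < n; ++j) std::printf("w[%d] = %.4f\n", j, w[j]);
        return 0;
    }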


Cite this article

Shen, C., Wang, P., Paisitkriangkrai, S. et al. Training Effective Node Classifiers for Cascade Classification. Int J Comput Vis 103, 326–347 (2013). https://doi.org/10.1007/s11263-013-0608-1
