Abstract
Cascade classifiers are widely used in real-time object detection. Different from conventional classifiers that are designed for a low overall classification error rate, a classifier in each node of the cascade is required to achieve an extremely high detection rate and moderate false positive rate. Although there are a few reported methods addressing this requirement in the context of object detection, there is no principled feature selection method that explicitly takes into account this asymmetric node learning objective. We provide such an algorithm here. We show that a special case of the biased minimax probability machine has the same formulation as the linear asymmetric classifier (LAC) of Wu et al. (2005). We then design a new boosting algorithm that directly optimizes the cost function of LAC. The resulting totally-corrective boosting algorithm is implemented by the column generation technique in convex optimization. Experimental results on object detection verify the effectiveness of the proposed boosting algorithm as a node classifier in cascade object detection, and show performance better than that of the current state-of-the-art.
Notes
In our object detection experiment, we found that this assumption can always be satisfied.
Since the multi-exit cascade makes use of all previous weak classifiers in earlier nodes, it would meet the Gaussianity requirement better than the conventional cascade classifier.
Training a complete \(22\)-node cascade and choosing the best \( \theta \) on cross-validation data may give better detection rates.
Our implementation is in C++ and only the weak classifier training part is parallelized using OpenMP.
Covariance features capture the relationship between different image statistics and have been shown to perform well in our previous experiments. However, other discriminative features can also be used here instead, e.g., Haar-like features, local binary pattern (LBP) (Mu et al. 2008) and self-similarity of low-level features (CSS) (Walk et al. 2010).
Here \( (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{S} \) denotes the family of distributions in \( (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \) that are also symmetric about the mean \( \boldsymbol{\mu} \). \( (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{SU} \) denotes the family of distributions in \( (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \) that are additionally symmetric and linear unimodal about \( \boldsymbol{\mu} \).
References
Agarwal, S., Awan, A., & Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1475–1490.
Aldavert, D., Ramisa, A., Toledo, R., & Lopez de Mantaras, R. (2010). Fast and robust object segmentation with the integral linear classifier. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.
Bi, J., Periaswamy, S., Okada, K., Kubota, T., Fung, G., & Salganicoff, M., et al. (2006). Computer aided detection via asymmetric cascade of sparse hyperplane classifiers. In Proceedings of ACM International Conference Discovery & Data Mining (pp. 837–844). Philadelphia, PA: ACM Press.
Bourdev, L., & Brandt, J. (2005). Robust object detection via soft cascade. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 236–243). San Diego, CA: IEEE.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Brubaker, S. C., Wu, J., Sun, J., Mullin, M. D., & Rehg, J. M. (2008). On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77(1–3), 65–86.
Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. L. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9, 1775–1822.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 886–893). San Diego, CA: IEEE.
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254.
Dollár, P. (2012). Piotr’s image and video Matlab toolbox. Retrieved December 14, 2012, from http://vision.ucsd.edu/pdollar/toolbox/doc/.
Dollár, P., Babenko, B., Belongie, S., Perona, P., & Tu, Z. (2008). Multiple component learning for object detection. In Proceedings of European Conference on Computer Vision (pp. 211–224). Marseille, France: ECCV.
Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761.
Dundar, M., & Bi, J. (2007). Joint optimization of cascaded classifiers for computer aided detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.
Enzweiler, M., Eigenstetter, A., Schiele, B., & Gavrila, D. M. (2010). Multi-cue pedestrian classification with partial occlusion handling. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Ess, A., Leibe, B., & Van Gool, L. (2007). Depth and appearance for mobile scene analysis. In Proceedings of International Conference on Computer Vision. Rio de Janeiro: ICCV.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In Proceedings of International Conference on Computer Vision. Rio de Janeiro: ICCV.
Huang, K., Yang, H., King, I., Lyu, M., & Chan, L. (2004). The minimum error minimax probability machine. Journal of Machine Learning Research, 5, 1253–1286.
Lanckriet, G. R. G., El Ghaoui, L., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE.
Lefakis, L., & Fleuret, F. (2010). Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems. Vancouver: NIPS.
Li, S. Z., & Zhang, Z. (2004). FloatBoost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1112–1123.
Lin, Z., Hua, G., & Davis, L. S. (2009). Multiple instance feature for robust part-based object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 405–412). Miami, FL: IEEE.
Liu, C., & Shum, H.-Y. (2003) Kullback-Leibler boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 587–594). Madison, Wisconsin: IEEE.
Maji, S., Berg, A. C., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Masnadi-Shirazi, H., & Vasconcelos, N. (2007). Asymmetric boosting. In Proceedings of International Conference on Machine Learning (pp. 609–619). Corvallis, OR: IMLS.
Masnadi-Shirazi, H., & Vasconcelos, N. (2011). Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 294–309.
MOSEK. (2010). The MOSEK optimization toolbox for MATLAB manual, version 6.0, revision 93. Retrieved May 25, 2010, from http://www.mosek.com/.
Mu, Y., Yan, S., Liu, Y., Huang, T., & Zhou, B. (2008). Discriminative local binary patterns for human detection in personal album. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Munder, S., & Gavrila, D. M. (2006). An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1863–1868.
Paisitkriangkrai, S., Shen, C., & Zhang, J. (2008). Fast pedestrian detection using a cascade of boosted covariance features. IEEE Transactions on Circuits and Systems for Video Technology, 18(8), 1140–1151.
Paisitkriangkrai, S., Shen, C., & Zhang, J. (2009). Efficiently training a better visual detector with sparse Eigenvectors. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami, Florida: IEEE.
Pham, M.-T., & Cham, T.-J. (2007a). Fast training and selection of Haar features using statistics in boosting-based face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Pham, M.-T., & Cham, T.-J. (2007b). Online learning asymmetric boosted classifiers for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Minneapolis, MN: IEEE.
Pham, M.-T., Hoang, V.-D. D., & Cham, T.-J. (2008). Detection with multi-exit asymmetric boosting. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, Alaska: IEEE.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199.
Saberian, M., & Vasconcelos, N. (2012). Learning optimal embedded cascades. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Saberian, M. J., & Vasconcelos, N. (2010). Boosting classifier cascades. In Advances in Neural Information Processing Systems. Vancouver: NIPS.
Shen, C., & Li, H. (2010a). Boosting through optimization of margin distributions. IEEE Transactions on Neural Networks, 21(4), 659–666.
Shen, C., & Li, H. (2010b). On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12), 2216–2231. http://dx.doi.org/10.1109/TPAMI.2010.47.
Shen, C., Paisitkriangkrai, S., & Zhang, J. (2008). Face detection from few training examples. In Proceedings of the International Conference on Image Processing (pp. 2764–2767). San Diego, CA: IEEE.
Shen, C., Wang, P., & Li, H. (2010). LACBoost and FisherBoost: Optimally building cascade classifiers. In Proceedings of European Conference on Computer Vision, LNCS 6312. (Vol. 2, pp. 608–621). Crete Island, Greece: Springer.
Shen, C., Paisitkriangkrai, S., & Zhang, J. (2011). Efficiently learning a detection cascade with sparse Eigenvectors. IEEE Transactions on Image Processing, 20(1), 22–35. http://dx.doi.org/10.1109/TIP.2010.2055880.
Sochman, J., & Matas, J. (2005). Waldboost learning for time constrained sequential detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego: IEEE.
Torralba, A., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.
Tu, H.-H., & Lin, H.-T. (2010). One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the International Conference on Machine Learning. Haifa, Israel: ICML.
Tuzel, O., Porikli, F., & Meer, P. (2008). Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1713–1727.
Viola, P., & Jones, M. (2002). Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in Neural Information Processing Systems (pp. 1311–1318). Cambridge: MIT Press.
Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.
Viola, P., Platt, J. C., & Zhang, C. (2005). Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems (pp. 1417–1424). Vancouver, Canada: NIPS.
Walk, S., Majer, N., Schindler, K., & Schiele, B. (2010). New features and insights for pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, US: IEEE.
Wang, P., Shen, C., Barnes, N., & Zheng, H. (2012). Fast and robust object detection using asymmetric totally-corrective boosting. IEEE Transactions on Neural Networks and Learning Systems, 23(1), 33–46.
Wang, W., Zhang, J., & Shen, C. (2010). Improved human detection and classification in thermal images. In Proceedings of the International Conference on Image Processing. Hong Kong: ICIP.
Wang, X., Han, T. X., & Yan, S. (2007). An HOG-LBP human detector with partial occlusion handling. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Wojek, C., Walk, S., & Schiele, B. (2009). Multi-cue onboard pedestrian detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Miami: IEEE.
Wu, B., & Nevatia, R. (2008). Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE.
Wu, J., & Rehg, J. M. (2011). CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1489–1501.
Wu, J., Rehg, J. M., & Mullin, M. D. (2003). Learning a rare event detection cascade by direct feature selection. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems. Sierra Nevada: NIPS.
Wu, J., Mullin, M. D., & Rehg, J. M. (2005). Linear asymmetric classifier for cascade detectors. In Proceedings of the International Conference on Machine Learning (pp. 988–995). Bonn, Germany: IMLS.
Wu, J., Brubaker, S. C., Mullin, M. D., & Rehg, J. M. (2008). Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 369–382.
Xiao, R., Zhu, L., & Zhang, H.-J. (2003). Boosting chain learning for object detection. In Proceedings of International Conference on Computer Vision (pp. 709–715). Nice, France: ICCV.
Xiao, R., Zhu, H., Sun, H., & Tang, X. (2007). Dynamic cascades for face detection. In Proceedings of International Conference on Computer Vision. Rio de Janeiro, Brazil: ICCV.
Yu, Y.-L., Li, Y., Schuurmans, D., & Szepesvári, C. (2009). A general projection property for distribution families. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (pp. 2232–2240). Vancouver, Canada: NIPS.
Zheng, Y., Shen, C., Hartley, R., & Huang, X. (2010). Pyramid center-symmetric local binary, trinary patterns for effective pedestrian detection. In Proceedings of Asian Conference on Computer Vision. Queenstown, New Zealand: ACCV.
Zhu, Q., Avidan, S., Yeh, M.-C., & Cheng, K.-T. (2006). Fast human detection using a cascade of histograms of oriented gradients. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1491–1498). New York: IEEE.
Acknowledgments
This work was in part supported by Australian Research Council Future Fellowship FT120100969.
Appendices
Appendix A: Proof of Theorem 1
Before we present our results, we introduce an important proposition from Yu et al. (2009). Note that we have used different notation.
Proposition 2
For a few different distribution families, the worst-case constraint \( \inf_{\mathbf{x}} \Pr\{ \mathbf{w}^{\top}\mathbf{x} \le b \} \ge \gamma \) can be written as follows:
1. If \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), i.e., \( \mathbf{x} \) follows an arbitrary distribution with mean \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \), then
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \sqrt{ \tfrac{ \gamma }{ 1 - \gamma } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }; \end{aligned}$$(27)
2. if \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{S} \),Footnote 6 then we have
$$\begin{aligned} {\left\{ \begin{array}{ll} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma) } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, & \text{if}\quad \gamma \in (0.5, 1); \\ b \ge \mathbf{w}^{\top}\boldsymbol{\mu}, & \text{if}\quad \gamma \in (0, 0.5]; \end{array}\right.} \end{aligned}$$(28)
3. if \( \mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_\mathrm{SU} \), then
$$\begin{aligned} {\left\{ \begin{array}{ll} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \frac{2}{3} \sqrt{ \tfrac{ 1 }{ 2 (1 - \gamma) } } \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, & \text{if}\quad \gamma \in (0.5, 1); \\ b \ge \mathbf{w}^{\top}\boldsymbol{\mu}, & \text{if}\quad \gamma \in (0, 0.5]; \end{array}\right.} \end{aligned}$$(29)
4. if \( \mathbf{x} \) follows a Gaussian distribution with mean \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \), i.e., \( \mathbf{x} \sim \mathcal{G}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), then
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu} + \Phi^{-1}(\gamma) \cdot \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} }, \end{aligned}$$(30)
where \( \Phi(\cdot) \) is the cumulative distribution function (c.d.f.) of the standard normal distribution \( \mathcal{G}(0, 1) \), and \( \Phi^{-1}(\cdot) \) is the inverse function of \( \Phi(\cdot) \). Two useful observations about \( \Phi^{-1}(\cdot) \) are: \( \Phi^{-1}(0.5) = 0 \); and \( \Phi^{-1}(\cdot) \) is a monotonically increasing function on its domain.
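The four bounds above differ only in the coefficient \( \varphi(\gamma) \) that multiplies \( \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}\mathbf{w} } \). As a quick numerical sketch (the function names below are ours, not from the paper), these coefficients can be computed and compared:

```python
from math import sqrt
from statistics import NormalDist

def phi_general(g):
    # Case 1: arbitrary distribution with given mean and covariance (Cantelli bound)
    return sqrt(g / (1.0 - g))

def phi_symmetric(g):
    # Case 2: distributions symmetric about the mean
    return sqrt(1.0 / (2.0 * (1.0 - g))) if g > 0.5 else 0.0

def phi_symmetric_unimodal(g):
    # Case 3: symmetric and linear unimodal about the mean
    return (2.0 / 3.0) * sqrt(1.0 / (2.0 * (1.0 - g))) if g > 0.5 else 0.0

def phi_gaussian(g):
    # Case 4: Gaussian; inverse c.d.f. of the standard normal
    return NormalDist().inv_cdf(g)

g = 0.9
vals = [phi_general(g), phi_symmetric(g), phi_symmetric_unimodal(g), phi_gaussian(g)]
# For gamma in (0.5, 1): phi_gnrl > phi_S > phi_SU > phi_G, as used in Appendix A
assert vals[0] > vals[1] > vals[2] > vals[3]
```

The smaller the coefficient, the weaker the required bound, which is why stronger distributional assumptions yield better worst-case guarantees for the same problem.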
We omit the proof of Proposition 2 here and refer the reader to Yu et al. (2009) for details. Next we begin to prove Theorem 1:
Proof
The second constraint of (2) is simply
$$\begin{aligned} b \ge \mathbf{w}^{\top}\boldsymbol{\mu}_2. \end{aligned}$$(31)
The first constraint of (2) can be handled by writing \( \mathbf{w}^{\top}\mathbf{x}_1 \ge b \) as \( -\mathbf{w}^{\top}\mathbf{x}_1 \le -b \) and applying the results in Proposition 2. It can be written as
$$\begin{aligned} b \le \mathbf{w}^{\top}\boldsymbol{\mu}_1 - \varphi(\gamma) \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} }, \end{aligned}$$(32)
with \( \varphi(\gamma) \) given by (6).
Let us assume that \( \boldsymbol{\Sigma}_1 \) is strictly positive definite (if it is only positive semidefinite, we can always add a small regularization to its diagonal components). From (32) we have
$$\begin{aligned} \varphi(\gamma) \le \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }. \end{aligned}$$(33)
So the optimization problem becomes
$$\begin{aligned} \max_{\mathbf{w}, b, \gamma} \; \gamma, \quad \text{s.t. (31) and (33)}. \end{aligned}$$(34)
The maximum value of \( \gamma \) (which we label \( \gamma^\star \)) is achieved when (33) is strictly an equality. To see this, suppose the maximum were achieved when
$$\begin{aligned} \varphi(\gamma^\star) < \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }. \end{aligned}$$
Then a new solution could be obtained by increasing \( \gamma^\star \) by a small positive amount so that (33) becomes an equality. The constraint (31) would not be affected, and the new solution would be better than the previous one. Hence, at the optimum, (5) must be fulfilled.
Because \( \varphi(\gamma) \) is monotonically increasing on its domain \( (0, 1) \) for all four cases (see Fig. 9), maximizing \( \gamma \) is equivalent to maximizing \( \varphi(\gamma) \), and this results in
$$\begin{aligned} \max_{\mathbf{w}, b} \; \frac{ \mathbf{w}^{\top}\boldsymbol{\mu}_1 - b }{ \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } }, \quad \text{s.t.} \; b \ge \mathbf{w}^{\top}\boldsymbol{\mu}_2. \end{aligned}$$(35)
As in Lanckriet et al. (2002) and Huang et al. (2004), we also have a scale ambiguity: if \( ( \mathbf w ^\star , b^\star )\) is a solution, \( ( t \mathbf w ^\star , t b^\star ) \) with \( t > 0 \) is also a solution.
An important observation is that problem (35) must attain its optimum at (4). Otherwise, if \( b > \mathbf{w}^{\top}\boldsymbol{\mu}_2 \), the optimal value of (35) would be smaller. So we can rewrite (35) as the unconstrained problem (3).
We have thus shown that, if \( \mathbf x _1 \) is distributed according to a symmetric, symmetric unimodal, or Gaussian distribution, the resulting optimization problem is identical. This is not surprising considering the latter two cases are merely special cases of the symmetric distribution family.
At optimality, the inequality (33) becomes an equality, and hence \( \gamma^\star \) can be obtained as in (5). For ease of exposition, let us denote the four cases on the right-hand side of (6) as \( \varphi_\mathrm{gnrl}(\cdot) \), \( \varphi_\mathrm{S}(\cdot) \), \( \varphi_\mathrm{SU}(\cdot) \), and \( \varphi_\mathcal{G}(\cdot) \). For \( \gamma \in [0.5, 1) \), as shown in Fig. 9, we have \( \varphi_\mathrm{gnrl}(\gamma) > \varphi_\mathrm{S}(\gamma) > \varphi_\mathrm{SU}(\gamma) > \varphi_\mathcal{G}(\gamma) \). Therefore, when solving (5) for \( \gamma^\star \), we have \( \gamma^\star_\mathrm{gnrl} < \gamma^\star_\mathrm{S} < \gamma^\star_\mathrm{SU} < \gamma^\star_\mathcal{G} \). That is to say, one can obtain a better accuracy guarantee when additional information about the data distribution is available, although the actual optimization problem to be solved is identical.
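As a concrete illustration of the result above: substituting \( b = \mathbf{w}^{\top}\boldsymbol{\mu}_2 \) leaves the problem of maximizing \( \mathbf{w}^{\top}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) / \sqrt{ \mathbf{w}^{\top}\boldsymbol{\Sigma}_1\mathbf{w} } \), whose maximizer is \( \mathbf{w} \propto \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \) by the Cauchy-Schwarz inequality. A small NumPy sketch (toy class statistics and variable names are ours) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class statistics, assumed for illustration only
d = 5
mu1 = rng.normal(size=d)           # positive-class mean
mu2 = rng.normal(size=d)           # negative-class mean
A = rng.normal(size=(d, d))
sigma1 = A @ A.T + np.eye(d)       # positive-class covariance (strictly p.d.)

def objective(w):
    # the ratio (w^T (mu1 - mu2)) / sqrt(w^T Sigma_1 w)
    return w @ (mu1 - mu2) / np.sqrt(w @ sigma1 @ w)

# Closed-form maximizer: w proportional to Sigma_1^{-1} (mu1 - mu2)
w_star = np.linalg.solve(sigma1, mu1 - mu2)
b_star = w_star @ mu2              # the threshold attained at the constraint boundary

# No random direction beats w_star (the objective is scale-invariant)
best_random = max(objective(rng.normal(size=d)) for _ in range(1000))
assert objective(w_star) >= best_random

# The optimal value equals sqrt((mu1 - mu2)^T Sigma_1^{-1} (mu1 - mu2))
gap = mu1 - mu2
assert np.isclose(objective(w_star), np.sqrt(gap @ np.linalg.solve(sigma1, gap)))
```

The scale ambiguity noted in the proof is visible here: any positive multiple of `w_star` (with the correspondingly scaled `b_star`) achieves the same objective value.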
Appendix B: Proof of Theorem 2
Let us assume that the current solution contains \( n \) selected weak classifiers, with corresponding linear weights \( \mathbf{w} = [w_1, \ldots, w_n] \). If, for any weak classifier \( h^{\prime}(\cdot) \) not in the current subset, the corresponding coefficient \( w \) would be zero, then we can conclude that the current weak classifiers and \( \mathbf{w} \) already form the optimal solution. In this case, the best weak classifier found by solving the subproblem (20) does not contribute to solving the master problem.
Now consider the case where the optimality condition is violated. We need to show that we can find a weak learner \( h^{\prime}(\cdot) \), not in the set of currently selected weak classifiers, whose corresponding coefficient satisfies \( w > 0 \). Again, assume \( h^{\prime}(\cdot) \) is the most violated weak learner found by solving (20), and that the convergence condition is not satisfied. In other words, we have
Now, after this weak learner is added into the master problem, the corresponding primal solution \( w \) must be non-zero (positive, because of the nonnegativity constraint on \( \mathbf{w} \)).
If this were not the case, the corresponding \( w \) would be \( 0 \), which is impossible for the following reason. From the Lagrangian (17), at optimality we have \( \partial L / \partial \mathbf{w} = \boldsymbol{0} \), which leads to
Clearly, (36) and (37) contradict each other.
Thus, after the weak classifier \( h^{\prime}(\cdot) \) is added to the primal problem, its corresponding \( w \) must take a positive value. That is to say, one more free variable is added to the problem, and re-solving the primal problem (16) must reduce the objective value, so a strict decrease in the objective is obtained. In other words, Algorithm 1 makes progress at each iteration. Furthermore, since the primal optimization problem is convex, it has no local optima, and the column generation procedure is guaranteed to converge to the global optimum up to a prescribed accuracy.
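The column generation loop proved convergent above can be sketched generically. This is an illustration only, not the exact LACBoost master problem (16): for simplicity we use an exponential surrogate loss over a pool of precomputed weak-classifier outputs, solve the restricted master by projected gradient descent on the nonnegative orthant, and "price out" candidates by the master's gradient; all data and names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: m examples, a pool of n candidate weak classifiers.
# H[i, j] = y_i * h_j(x_i) in {-1, +1}; one column is made informative.
m, n = 200, 40
H = np.sign(rng.normal(size=(m, n)))
H[:, 0] = np.where(rng.random(m) < 0.8, 1.0, -1.0)

def loss_and_grad(w, cols):
    margins = H[:, cols] @ w
    e = np.exp(-margins)                  # exponential surrogate loss terms
    return e.sum(), -(H[:, cols].T @ e)   # loss and gradient w.r.t. selected weights

def solve_restricted(cols, iters=3000, lr=1e-3):
    # Totally-corrective step: re-optimize all selected weights jointly,
    # keeping them nonnegative (projected gradient descent).
    w = np.zeros(len(cols))
    for _ in range(iters):
        _, g = loss_and_grad(w, cols)
        w = np.maximum(w - lr * g, 0.0)
    return w

cols, objs = [], []
for _ in range(5):
    # Price out every candidate: gradient of the master w.r.t. each weight.
    w_full = np.zeros(n)
    if cols:
        w_full[cols] = w
    grad_full = -(H.T @ np.exp(-(H @ w_full)))
    j = int(np.argmin(grad_full))         # most violated weak classifier
    if grad_full[j] >= 0 or j in cols:    # no candidate can improve: stop
        break
    cols.append(j)
    w = solve_restricted(cols)
    objs.append(loss_and_grad(w, cols)[0])

# As argued above, each added column strictly decreases the master objective.
assert objs[-1] < m
assert all(b < a + 1e-6 for a, b in zip(objs, objs[1:]))
```

The structure mirrors the proof: a candidate is added only when its "reduced cost" (here, the master gradient) is negative, and the totally-corrective re-solve then guarantees a strict decrease.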
Appendix C: Exponentiated Gradient Descent
Exponentiated gradient descent (EG) is a very useful tool for solving large-scale convex minimization problems over the unit simplex. Let us first define the unit simplex \( \Delta_n = \{ \mathbf{w} \in \mathbb{R}^n : \mathbf{1}^{\top}\mathbf{w} = 1, \mathbf{w} \succcurlyeq \boldsymbol{0} \} \). EG efficiently solves the convex optimization problem
$$\begin{aligned} \min_{\mathbf{w} \in \Delta_n} \; f(\mathbf{w}) \end{aligned}$$(38)
under the assumption that the objective function \( f(\cdot) \) is a convex Lipschitz continuous function with Lipschitz constant \( L_f \) w.r.t. a fixed given norm \( \|\cdot\| \). The mathematical definition of \( L_f \) is that \( | f(\mathbf{w}) - f(\mathbf{z}) | \le L_f \|\mathbf{w} - \mathbf{z}\| \) holds for any \( \mathbf{w}, \mathbf{z} \) in the domain of \( f(\cdot) \). The EG algorithm is very simple:
1. Initialize with \( \mathbf{w}^0 \) in the interior of \( \Delta_n \);
2. generate the sequence \( \{ \mathbf{w}^k \} \), \( k = 1, 2, \ldots \), with
$$\begin{aligned} w^k_j = \frac{ w^{k-1}_j \exp[ -\tau_k f^{\prime}_{j}(\mathbf{w}^{k-1}) ] }{ \sum_{i=1}^n w^{k-1}_i \exp[ -\tau_k f^{\prime}_{i}(\mathbf{w}^{k-1}) ] }. \end{aligned}$$(39)
Here \( \tau_k \) is the step size, and \( f^{\prime}(\mathbf{w}) = [ f^{\prime}_{1}(\mathbf{w}), \ldots, f^{\prime}_{n}(\mathbf{w}) ]^{\top} \) is the gradient of \( f(\cdot) \);
3. stop if some stopping criterion is met.
Following Beck and Teboulle (2003), the learning step size can be set to
$$\begin{aligned} \tau_k = \frac{ \sqrt{ 2 \ln n } }{ L_f \sqrt{k} }. \end{aligned}$$(40)
In Collins et al. (2008), the authors used a simpler strategy to set the learning rate.
An important parameter in EG is \( L_f \), which determines the step size. \( L_f \) can be bounded by the \( \ell_\infty \)-norm of \( f^{\prime}(\mathbf{w}) \). In our case \( f^{\prime}(\mathbf{w}) \) is a linear function, which is trivial to compute. The convergence of EG is guaranteed; see Beck and Teboulle (2003) for details.
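The update (39) can be implemented in a few lines. The quadratic objective below is a toy example of ours, not the boosting problem of the paper; the step size follows Beck and Teboulle (2003), \( \tau_k = \sqrt{2 \ln n} / (L_f \sqrt{k}) \).

```python
import numpy as np

def eg_minimize(grad, n, n_iters, L_f):
    """Exponentiated gradient descent over the unit simplex Delta_n."""
    w = np.full(n, 1.0 / n)                # start in the interior of Delta_n
    for k in range(1, n_iters + 1):
        g = grad(w)
        tau = np.sqrt(2.0 * np.log(n)) / (L_f * np.sqrt(k))  # diminishing step size
        z = w * np.exp(-tau * g)           # the multiplicative update (39) ...
        w = z / z.sum()                    # ... with simplex renormalization
    return w

# Toy problem: min_{w in Delta_n} 0.5 * ||w - t||^2 for a target t inside the simplex,
# so the constrained minimizer is t itself.
n = 4
t = np.array([0.4, 0.3, 0.2, 0.1])
grad = lambda w: w - t                     # the gradient is linear, as noted above
L_f = 2.0                                  # a valid bound on the gradient norm here
w = eg_minimize(grad, n, n_iters=5000, L_f=L_f)
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
assert np.allclose(w, t, atol=1e-2)
```

Note that the iterates stay strictly inside the simplex by construction, so no explicit projection step is needed; this is the practical appeal of EG over projected (sub)gradient methods for simplex-constrained problems.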
Shen, C., Wang, P., Paisitkriangkrai, S. et al. Training Effective Node Classifiers for Cascade Classification. Int J Comput Vis 103, 326–347 (2013). https://doi.org/10.1007/s11263-013-0608-1