The Consistency of Greedy Algorithms for Classification

  • Shie Mannor
  • Ron Meir
  • Tong Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2375)


We consider a class of algorithms for classification, which are based on sequential greedy minimization of a convex upper bound on the 0-1 loss function. A large class of recently popular algorithms falls within the scope of this approach, including many variants of Boosting algorithms. The basic question addressed in this paper relates to the statistical consistency of such approaches. We provide precise conditions which guarantee that sequential greedy procedures are consistent, and establish rates of convergence under the assumption that the Bayes decision boundary belongs to a certain class of smooth functions. The results are established using a form of regularization which constrains the search space at each iteration of the algorithm. In addition to providing general consistency results, we provide rates of convergence for smooth decision boundaries. A particularly interesting conclusion of our work is that Boosting based on the logistic function provides faster rates of convergence than Boosting based on the exponential function used in AdaBoost.
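To make the setting concrete, the following is a minimal sketch of sequential greedy minimization of a convex surrogate for the 0-1 loss. It is an illustration only and not the procedure analyzed in the paper: the decision-stump base class, the step-size grid, and the names `greedy_fit`, `max_step`, and `grid` are all assumptions introduced here. The cap `max_step` is a loose stand-in for the idea of constraining the search space at each iteration.

```python
import math

def zero_one_loss(margin):
    return 1.0 if margin <= 0 else 0.0

def exp_loss(margin):
    # Exponential surrogate, as used in AdaBoost.
    return math.exp(-margin)

def logistic_loss(margin):
    # Base-2 logarithm so the loss equals 1 at margin 0
    # and therefore upper-bounds the 0-1 loss.
    return math.log2(1.0 + math.exp(-margin))

def stump(threshold, sign):
    # Hypothetical base classifier on scalar inputs, with values in {-1, +1}.
    return lambda x: sign if x > threshold else -sign

def empirical_risk(loss, scores, ys):
    return sum(loss(y * s) for s, y in zip(scores, ys)) / len(ys)

def greedy_fit(xs, ys, loss, rounds=20, max_step=1.0, grid=10):
    """Greedily add one weighted stump per round to reduce the surrogate risk.

    Assumption: step sizes are searched on a grid in (0, max_step], so each
    round can only move the combined score by a bounded amount.
    """
    candidates = [stump(t, s) for t in sorted(set(xs)) for s in (-1.0, 1.0)]
    steps = [max_step * (k + 1) / grid for k in range(grid)]
    scores = [0.0] * len(xs)
    ensemble = []
    history = [empirical_risk(loss, scores, ys)]
    for _ in range(rounds):
        best = None
        for h in candidates:
            outputs = [h(x) for x in xs]
            for a in steps:
                trial = [s + a * o for s, o in zip(scores, outputs)]
                risk = empirical_risk(loss, trial, ys)
                if best is None or risk < best[0]:
                    best = (risk, a, h, trial)
        risk, a, h, trial = best
        if risk >= history[-1]:
            break  # no candidate improves the surrogate risk
        ensemble.append((a, h))
        scores = trial
        history.append(risk)
    return ensemble, history

def predict(ensemble, x):
    score = sum(a * h(x) for a, h in ensemble)
    return 1 if score > 0 else -1
```

Passing `logistic_loss` or `exp_loss` as the `loss` argument switches between the two surrogates mentioned in the abstract. The sketch only tracks the empirical surrogate risk on the training sample; it says nothing about the consistency or convergence-rate results themselves.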







Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Shie Mannor (1)
  • Ron Meir (1)
  • Tong Zhang (2)
  1. Department of Electrical Engineering, Technion, Haifa, Israel
  2. IBM T. J. Watson Research Center, Yorktown Heights, USA
