Axiomatic Characterization of AdaBoost and the Multiplicative Weight Update Procedure

  • Ibrahim Alabdulmohsin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)


AdaBoost was introduced for binary classification tasks by Freund and Schapire in 1995. Since its publication, numerous results have revealed surprising links between AdaBoost and related fields, such as information geometry, game theory, and convex optimization. This remarkably comprehensive set of connections suggests that AdaBoost is a unique approach that may, in fact, arise out of axiomatic principles. In this paper, we prove that this is indeed the case. We show that three natural axioms on adaptive re-weighting and combining algorithms, also called arcing, suffice to construct AdaBoost and, more generally, the multiplicative weight update procedure as the unique family of algorithms that satisfy those axioms. Informally, the three axioms require only that the arcing algorithm satisfy elementary notions of additivity, objectivity, and utility. We prove that any method satisfying these axioms must minimize the composition of an exponential loss with an additive function, and that the weights must be updated according to the multiplicative weight update procedure. This conclusion holds in the general setting of learning, which encompasses regression, classification, ranking, and clustering.
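The link between the multiplicative weight update and an exponential loss over an additive function can be illustrated with a small numerical sketch. The abstract gives no formulas, so the code below assumes the textbook AdaBoost-style update, in which each example weight is multiplied by exp(-alpha * margin) and then renormalized. The function names, the toy margins, and the step sizes are hypothetical and are not taken from the paper.

```python
import math


def multiplicative_update(weights, margins, alpha):
    """One multiplicative weight update step (textbook form, assumed here).

    weights: current distribution over examples (sums to 1)
    margins: y_i * h(x_i) for the newly added hypothesis h
    alpha:   coefficient given to the new hypothesis
    """
    scaled = [w * math.exp(-alpha * m) for w, m in zip(weights, margins)]
    total = sum(scaled)
    return [s / total for s in scaled]


def exponential_loss_weights(margins_per_round, alphas):
    """Weights proportional to exp(-y_i * F(x_i)), with F = sum_t alpha_t * h_t."""
    n = len(margins_per_round[0])
    additive = [
        sum(a * ms[i] for a, ms in zip(alphas, margins_per_round))
        for i in range(n)
    ]
    raw = [math.exp(-f) for f in additive]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical toy data: margins of two hypotheses on four examples.
margins_per_round = [[1, 1, -1, 1], [1, -1, 1, 1]]
alphas = [0.5, 0.3]

# Apply the update round by round, starting from the uniform distribution.
weights = [0.25] * 4
for ms, a in zip(margins_per_round, alphas):
    weights = multiplicative_update(weights, ms, a)

# Compute the same distribution in closed form from the additive function.
closed_form = exponential_loss_weights(margins_per_round, alphas)
```

Under these assumptions, the iterated update and the closed form agree up to floating-point error, and examples with smaller cumulative margin end up with larger weight. This is only a consistency check on the standard update rule; it does not reproduce the paper's axioms or its uniqueness proof.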


Keywords: Ensemble methods, Boosting, AdaBoost, Axioms


References

  1. Ackerman, M., Ben-David, S.: Measures of clustering quality: a working set of axioms for clustering. In: NIPS, pp. 121–128 (2009)
  2. Aczél, J., Forte, B., Ng, C.T.: Why the Shannon and Hartley entropies are natural. Adv. Appl. Probab. 6(01), 131–146 (1974)
  3. Bell, D.A., Wang, H.: A formalism for relevance and its application in feature subset selection. Mach. Learn. 41(2), 175–195 (2000)
  4. Bousquet, O., Elisseeff, A.: Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)
  5. Breiman, L.: Prediction games and arcing algorithms. Neural Comput. 11(7), 1493–1517 (1999)
  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  7. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
  8. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
  9. Cox, R.T.: Probability, frequency and reasonable expectation. Am. J. Phys. 14(1), 1–13 (1946)
  10. Csiszár, I.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 19, 2032–2066 (1991)
  11. Csiszár, I.: Axiomatic characterizations of information measures. Entropy 10(3), 261–273 (2008)
  12. Džeroski, S., Ženko, B.: Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54(3), 255–273 (2004)
  13. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 4, 933–969 (2003)
  14. Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995)
  15. Friedman, J., Hastie, T., Tibshirani, R., et al.: Additive logistic regression: a statistical view of boosting. Ann. Stat. 28(2), 337–407 (2000)
  16. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001)
  17. Jardine, N., Sibson, R.: The construction of hierarchic and non-hierarchic classifications. Comput. J. 11(2), 177–184 (1968)
  18. Jaynes, E.T.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)
  19. Ji, C., Ma, S.: Combined weak classifiers. NIPS 9, 494–500 (1997)
  20. Khanchel, R., Limam, M.: Empirical comparison of arcing algorithms (2005)
  21. Kleinberg, J.: An impossibility theorem for clustering. In: NIPS, vol. 15, pp. 463–470 (2002)
  22. Lee, P.: On the axioms of information theory. Ann. Math. Stat. 35(1), 415–418 (1964)
  23. Mason, L., Baxter, J., Bartlett, P.L., Frean, M.R.: Boosting algorithms as gradient descent. In: NIPS, pp. 512–518 (1999)
  24. Österreicher, F.: Csiszár’s f-divergences-basic properties. Technical report (2002)
  25. Pennock, D.M., Horvitz, E.: Analysis of the axiomatic foundations of collaborative filtering. Ann Arbor 1001, 48109–2110 (1999)
  26. Pennock, D.M., Maynard-Reid II, P., Giles, C.L., Horvitz, E.: A normative examination of ensemble learning algorithms. In: ICML, pp. 735–742 (2000)
  27. Prasad, A., Pareek, H.H., Ravikumar, P.: Distributional rank aggregation, and an axiomatic analysis. In: ICML, pp. 2104–2112 (2015)
  28. Sansone, G.: Orthogonal Functions. Dover Publications, New York (1991)
  29. Schapire, R.E., Freund, Y.: Boosting: Foundations and Algorithms. MIT Press, Cambridge (2012)
  30. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)
  31. Servedio, R.A.: Smooth boosting and learning with malicious noise. J. Mach. Learn. Res. 4, 633–648 (2003)
  32. Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K.: Learnability, stability and uniform convergence. J. Mach. Learn. Res. 11, 2635–2670 (2010)
  33. Shannon, C.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948)
  34. Shore, J., Johnson, R.: Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 26(1), 26–37 (1980)
  35. Skilling, J.: The axioms of maximum entropy. In: Erickson, G.J., Smith, C.R. (eds.) Maximum-Entropy and Bayesian Methods in Science and Engineering, pp. 173–187. Springer, Dordrecht (1988)
  36. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
  37. Wang, Z., et al.: Multi-class hingeboost. Methods Inf. Med. 51(2), 162–167 (2012)
  38. Zhu, J., Zou, H., Rosset, S., Hastie, T.: Multi-class AdaBoost. Stat. Interface 2(3), 349–360 (2009)
  39. Zou, H., Zhu, J., Hastie, T.: New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Ann. Appl. Stat. 2(4), 1290 (2008)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Business Intelligence Division, Saudi Aramco, Dhahran, Saudi Arabia
