An Introduction to Boosting and Leveraging

  • Ron Meir
  • Gunnar Rätsch
Chapter in the Lecture Notes in Computer Science book series (LNCS, volume 2600)

Abstract

We provide an introduction to theoretical and practical aspects of Boosting and Ensemble learning, intended as a useful reference both for researchers in the field of Boosting and for those seeking to enter this fascinating area of research. We begin with a short background on the learning-theoretic foundations of weak learners and their linear combinations. We then point out the useful connection between Boosting and the theory of optimization, which facilitates the understanding of Boosting and later enables us to move on to new Boosting algorithms applicable to a broad spectrum of problems. To increase the relevance of the paper to practitioners, we have added remarks, pseudo code, “tricks of the trade”, and algorithmic considerations where appropriate. Finally, we illustrate the usefulness of Boosting algorithms by giving an overview of some existing applications. The main ideas are illustrated on the problem of binary classification, although several extensions are discussed.
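To make the central idea of combining weak learners concrete, the following is a minimal sketch of the standard AdaBoost procedure for binary classification (labels in {-1, +1}); it is not code from the chapter itself, and the choice of scikit-learn decision stumps as weak learners and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees act as weak learners

def adaboost(X, y, n_rounds=50):
    """Minimal AdaBoost sketch for labels y in {-1, +1}.

    Returns a list of (alpha, weak_learner) pairs whose weighted vote
    sign(sum_t alpha_t * h_t(x)) is the combined classifier.
    """
    y = np.asarray(y)
    n = len(y)
    d = np.full(n, 1.0 / n)              # uniform initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=d) # weak learner trained on the current weighting
        pred = stump.predict(X)
        eps = np.sum(d[pred != y])       # weighted training error
        if eps >= 0.5:                   # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, stump))
        d *= np.exp(-alpha * y * pred)   # up-weight misclassified examples
        d /= d.sum()                     # renormalize to a distribution
    return ensemble

def predict(ensemble, X):
    votes = sum(alpha * h.predict(X) for alpha, h in ensemble)
    return np.sign(votes)
```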


Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Ron Meir¹
  • Gunnar Rätsch²

  1. Department of Electrical Engineering, Technion, Haifa, Israel
  2. Research School of Information Sciences & Engineering, The Australian National University, Canberra ACT 0200, Australia
