Machine Learning

Volume 42, Issue 3, pp 287–320

Soft Margins for AdaBoost

  • G. Rätsch
  • T. Onoda
  • K.-R. Müller

Abstract

Recently, ensemble methods like ADABOOST have been applied successfully to many problems, while seemingly defying the problem of overfitting.

ADABOOST rarely overfits in the low noise regime; however, we show that it clearly does so for higher noise levels. Central to the understanding of this fact is the margin distribution. ADABOOST can be viewed as a constrained gradient descent in an error function with respect to the margin. We find that ADABOOST asymptotically achieves a hard margin distribution, i.e., the algorithm concentrates its resources on a few hard-to-learn patterns that are, interestingly, very similar to support vectors. A hard margin is clearly a sub-optimal strategy in the noisy case, and regularization, in our case a “mistrust” in the data, must be introduced into the algorithm to alleviate the distortions that single difficult patterns (e.g. outliers) can cause to the margin distribution. We propose several regularization methods and generalizations of the original ADABOOST algorithm to achieve a soft margin. In particular, we suggest (1) regularized ADABOOSTREG, where the gradient descent is done directly with respect to the soft margin, and (2) regularized linear and quadratic programming (LP/QP-) ADABOOST, where the soft margin is attained by introducing slack variables.
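
To make the margin view concrete, the following is a minimal sketch, not the authors' implementation, of AdaBoost with decision stumps that reports the resulting normalized margin distribution and adds a soft-margin style reweighting by discounting patterns that have already absorbed a large share of the weight. The stump learner, the constant C, and the use of the cumulative pattern weight as the "mistrust" term are illustrative assumptions; the paper's ADABOOSTREG and LP/QP formulations differ in detail.

```python
# Minimal AdaBoost sketch with a soft-margin flavoured reweighting.
# Illustration only: stump learner, constant C and the cumulative-weight
# "mistrust" term are simplifying assumptions, not the paper's exact method.
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: threshold on one feature, +/-1 output."""
    best = (np.inf, 0, 0.0, 1)                      # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]                                  # (feature, threshold, sign)

def stump_predict(X, stump):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_margins(X, y, T=50, C=0.0):
    """C = 0 gives plain AdaBoost; C > 0 mistrusts often-reweighted patterns."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    influence = np.zeros(n)                          # cumulative weight each pattern received
    hyps, alphas = [], []
    for _ in range(T):
        stump = fit_stump(X, y, w)
        pred = stump_predict(X, stump)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        hyps.append(stump)
        alphas.append(alpha)
        influence += w
        # soft-margin flavour: patterns that already absorbed much weight are
        # trusted less, so their weights are additionally damped by C * influence
        w *= np.exp(-alpha * y * pred - C * influence)
        w /= w.sum()
    F = sum(a * stump_predict(X, h) for a, h in zip(alphas, hyps))
    return y * F / sum(alphas)                       # normalized margin of each pattern

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))    # noisy labels near the boundary
    print("hard margin, smallest margins:", np.sort(adaboost_margins(X, y))[:5])
    print("soft margin, smallest margins:", np.sort(adaboost_margins(X, y, C=0.5))[:5])
```

Setting C = 0 recovers the usual exponential reweighting, which asymptotically pushes all training margins up toward a hard margin; with C > 0 the emphasis placed on repeatedly reweighted (often noisy) patterns is reduced, which is the intended soft-margin effect sketched here under the stated assumptions.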

Extensive simulations demonstrate that the proposed regularized ADABOOST-type algorithms are useful and yield competitive results for noisy data.

Keywords: ADABOOST, arcing, large margin, soft margin, classification, support vectors


Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • G. Rätsch (1)
  • T. Onoda (2)
  • K.-R. Müller (3, 4)

  1. GMD FIRST, Berlin, Germany
  2. CRIEPI, Komae-shi, Iwado Kita, Tokyo, Japan
  3. GMD FIRST, Berlin, Germany
  4. University of Potsdam, Potsdam, Germany
