Machine Learning, Volume 36, Issue 1–2, pp 105–139

An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants

  • Eric Bauer
  • Ron Kohavi

Abstract

Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only “hard” areas but also outliers and noise.

Keywords: classification, boosting, Bagging, decision trees, Naive-Bayes, mean-squared error
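The abstract above contrasts Bagging's bootstrap resampling with AdaBoost's instance reweighting and notes the numerical problems (weight underflows) that arise when implementing boosting. The sketch below is a minimal illustration of those two schemes, assuming scikit-learn decision trees stand in for the paper's decision tree inducer; the helper names (`bagging`, `adaboost_m1`, `vote`), the depth limit, and the synthetic data are illustrative assumptions, not the authors' MLC++ implementation.

```python
# Minimal sketch: Bagging (bootstrap resampling) vs. AdaBoost.M1 (reweighting).
# Assumes scikit-learn trees as the unstable base inducer; not the paper's MLC++ code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def bagging(X, y, n_estimators=25, random_state=0):
    """Train each tree on a bootstrap replicate; combine by unweighted vote."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees


def adaboost_m1(X, y, n_estimators=25, max_depth=3):
    """AdaBoost.M1 with reweighting: pass weights to the inducer instead of resampling."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    trees, betas = [], []
    for _ in range(n_estimators):
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        eps = w[miss].sum()                        # weighted training error
        if eps >= 0.5:                             # base classifier too weak: stop
            break
        trees.append(tree)
        betas.append(max(eps / (1.0 - eps), 1e-10))  # floor keeps log(1/beta) finite
        if eps == 0:                               # perfect classifier: nothing to reweight
            break
        w[~miss] *= betas[-1]                      # shrink weights of correct instances
        w /= w.sum()                               # renormalize each round to limit underflow
    return trees, betas


def vote(trees, X, weights=None):
    """Weighted majority vote over the trees' predictions."""
    preds = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    weights = np.ones(len(trees)) if weights is None else np.asarray(weights)
    classes = np.unique(preds)
    scores = np.array([(weights[:, None] * (preds == c)).sum(axis=0) for c in classes])
    return classes[scores.argmax(axis=0)]


if __name__ == "__main__":
    X, y = make_classification(n_samples=600, n_features=10, random_state=0)
    X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

    bag = bagging(X_tr, y_tr)
    print("bagging  accuracy:", (vote(bag, X_te) == y_te).mean())

    trees, betas = adaboost_m1(X_tr, y_tr)
    alpha = np.log(1.0 / np.array(betas))          # each tree votes with weight log(1/beta)
    print("adaboost accuracy:", (vote(trees, X_te, alpha) == y_te).mean())
```

The per-round renormalization and the floor on beta mirror the practical guards against numerical underflow discussed in the paper; swapping the reweighting step for a weighted bootstrap sample would give the resampling variants compared in the study.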

References

  1. Ali, K.M. (1996). Learning probabilistic relational concept descriptions. Ph.D. thesis, University of California, Irvine. http://www.ics.uci.edu/~ali.
  2. Becker, B., Kohavi, R., & Sommerfield, D. (1997). Visualizing the simple Bayesian classifier. KDD Workshop on Issues in the Integration of Data Mining and Data Visualization.
  3. Bernardo, J.M., & Smith, A.F. (1993). Bayesian theory. John Wiley & Sons.
  4. Breiman, L. (1994). Heuristics of instability in model selection (Technical Report). Berkeley: Statistics Department, University of California.
  5. Breiman, L. (1996a). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California. http://www.stat.Berkeley.EDU/users/breiman/.
  6. Breiman, L. (1996b). Bagging predictors. Machine Learning, 24, 123–140.
  7. Breiman, L. (1997). Arcing the edge (Technical Report 486). Berkeley: Statistics Department, University of California. http://www.stat.Berkeley.EDU/users/breiman/.
  8. Buntine, W. (1992a). Learning classification trees. Statistics and Computing, 2(2), 63–73.
  9. Buntine, W. (1992b). A theory of learning classification rules. Ph.D. thesis, University of Technology, Sydney, School of Computing Science.
  10. Blake, C., Keogh, E., & Merz, C.J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  11. Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. In L.C. Aiello (Ed.), Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 147–149).
  12. Chan, P., Stolfo, S., & Wolpert, D. (1996). Integrating multiple learned models for improving and scaling machine learning algorithms. AAAI Workshop.
  13. Craven, M.W., & Shavlik, J.W. (1993). Learning symbolic rules using artificial neural networks. Proceedings of the Tenth International Conference on Machine Learning (pp. 73–80). Morgan Kaufmann.
  14. Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7).
  15. Dietterich, T.G., & Bakiri, G. (1991). Error-correcting output codes: A general method for improving multiclass inductive learning programs. Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91) (pp. 572–577).
  16. Domingos, P. (1997). Why does bagging work? A Bayesian account and its implications. In D. Heckerman, H. Mannila, D. Pregibon, & R. Uthurusamy (Eds.), Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 155–158). AAAI Press.
  17. Domingos, P., & Pazzani, M. (1997). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning, 29(2/3), 103–130.
  18. Drucker, H., & Cortes, C. (1996). Boosting decision trees. Advances in Neural Information Processing Systems 8 (pp. 479–485).
  19. Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. Wiley.
  20. Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall.
  21. Elkan, C. (1997). Boosting and naive Bayesian learning (Technical Report). San Diego: Department of Computer Science and Engineering, University of California.
  22. Fayyad, U.M., & Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022–1027). Morgan Kaufmann Publishers.
  23. Freund, Y. (1990). Boosting a weak learning algorithm by majority. Proceedings of the Third Annual Workshop on Computational Learning Theory (pp. 202–216).
  24. Freund, Y. (1996). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
  25. Freund, Y., & Schapire, R.E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Proceedings of the Second European Conference on Computational Learning Theory (pp. 23–37). Springer-Verlag. To appear in Journal of Computer and System Sciences.
  26. Freund, Y., & Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta (Ed.), Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156). Morgan Kaufmann.
  27. Friedman, J.H. (1997). On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1(1), 55–77. ftp://playfair.stanford.edu/pub/friedman/curse.ps.Z.
  28. Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–48.
  29. Good, I.J. (1965). The estimation of probabilities: An essay on modern Bayesian methods. M.I.T. Press.
  30. Holte, R.C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–90.
  31. Iba, W., & Langley, P. (1992). Induction of one-level decision trees. Proceedings of the Ninth International Conference on Machine Learning (pp. 233–240). Morgan Kaufmann Publishers.
  32. Kohavi, R. (1995a). A study of cross-validation and bootstrap for accuracy estimation and model selection. In C.S. Mellish (Ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence (pp. 1137–1143). Morgan Kaufmann. http://robotics.stanford.edu/~ronnyk.
  33. Kohavi, R. (1995b). Wrappers for performance enhancement and oblivious decision graphs. Ph.D. thesis, Stanford University, Computer Science Department. STAN-CS-TR–95–1560. http://robotics.Stanford.EDU/~ronnyk/teza.ps.Z.
  34. Kohavi, R., Becker, B., & Sommerfield, D. (1997). Improving simple Bayes. The Ninth European Conference on Machine Learning, Poster Papers (pp. 78–87). Available at http://robotics.stanford.edu/users/ronnyk.
  35. Kohavi, R., & Kunz, C. (1997). Option decision trees with majority votes. In D. Fisher (Ed.), Machine Learning: Proceedings of the Fourteenth International Conference (pp. 161–169). Morgan Kaufmann Publishers. Available at http://robotics.stanford.edu/users/ronnyk.
  36. Kohavi, R., & Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. 114–119).
  37. Kohavi, R., & Sommerfield, D. (1995). Feature subset selection using the wrapper model: Overfitting and dynamic search space topology. The First International Conference on Knowledge Discovery and Data Mining (pp. 192–197).
  38. Kohavi, R., Sommerfield, D., & Dougherty, J. (1997). Data mining using MLC++: A machine learning library in C++. International Journal on Artificial Intelligence Tools, 6(4), 537–566. http://www.sgi.com/Technology/mlc.
  39. Kohavi, R., & Wolpert, D.H. (1996). Bias plus variance decomposition for zero-one loss functions. In L. Saitta (Ed.), Machine Learning: Proceedings of the Thirteenth International Conference (pp. 275–283). Morgan Kaufmann. Available at http://robotics.stanford.edu/users/ronnyk.
  40. Kong, E.B., & Dietterich, T.G. (1995). Error-correcting output coding corrects bias and variance. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (pp. 313–321). Morgan Kaufmann.
  41. Kwok, S.W., & Carter, C. (1990). Multiple decision trees. In R.D. Schachter, T.S. Levitt, L.N. Kanal, & J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence (pp. 327–335). Elsevier Science Publishers.
  42. Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 223–228). AAAI Press and MIT Press.
  43. Langley, P., & Sage, S. (1997). Scaling to domains with many irrelevant features. In R. Greiner (Ed.), Computational learning theory and natural learning systems (Vol. 4). MIT Press.
  44. Oates, T., & Jensen, D. (1997). The effects of training set size on decision tree complexity. In D. Fisher (Ed.), Machine Learning: Proceedings of the Fourteenth International Conference (pp. 254–262). Morgan Kaufmann.
  45. Oliver, J., & Hand, D. (1995). On pruning and averaging decision trees. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (pp. 430–437). Morgan Kaufmann.
  46. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994). Reducing misclassification costs. Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann.
  47. Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, California: Morgan Kaufmann.
  48. Quinlan, J.R. (1994). Comparing connectionist and symbolic learning methods. In S.J. Hanson, G.A. Drastal, & R.L. Rivest (Eds.), Computational learning theory and natural learning systems (Vol. I: Constraints and prospects, chap. 15, pp. 445–456). MIT Press.
  49. Quinlan, J.R. (1996). Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 725–730). AAAI Press and the MIT Press.
  50. Ridgeway, G., Madigan, D., & Richardson, T. (1998). Interpretable boosted naive Bayes classification. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
  51. Schaffer, C. (1994). A conservation law for generalization performance. Machine Learning: Proceedings of the Eleventh International Conference (pp. 259–265). Morgan Kaufmann.
  52. Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
  53. Schapire, R.E., Freund, Y., Bartlett, P., & Lee, W.S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In D. Fisher (Ed.), Machine Learning: Proceedings of the Fourteenth International Conference (pp. 322–330). Morgan Kaufmann.
  54. Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
  55. Wolpert, D.H. (1994). The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In D.H. Wolpert (Ed.), The mathematics of generalization. Addison Wesley.

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Eric Bauer (1)
  • Ron Kohavi (2)
  1. Computer Science Department, Stanford University, Stanford
  2. Blue Martini Software, San Mateo
