A Comparison of Ensemble Creation Techniques

  • Robert E. Banfield
  • Lawrence O. Hall
  • Kevin W. Bowyer
  • Divya Bhadoria
  • W. Philip Kegelmeyer
  • Steven Eschrich
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3077)

Abstract

We experimentally evaluated bagging and six other randomization-based methods for creating ensembles of decision trees. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5, apply randomization in selecting the test at a given node of a tree. Approaches such as random forests and random subspaces instead apply randomization in selecting the attributes used to build the tree. Boosting, in contrast, incrementally builds classifiers by focusing on examples misclassified by the existing ensemble. Experiments were performed on 34 publicly available data sets. While each of the other six approaches has some strengths, we find that none is consistently more accurate than standard bagging when tested for statistical significance.
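The sketch below is not from the paper; it is a minimal illustration, assuming scikit-learn and NumPy and an arbitrary toy data set, of two of the randomization strategies described above: bagging, which resamples the training examples, and the random subspace method, which resamples the attributes. The function names, ensemble sizes, and the choice of the iris data are illustrative only.

```python
# Minimal sketch (not the authors' implementation) contrasting bagging and
# the random subspace method for building a decision-tree ensemble.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagging_ensemble(X, y, n_trees=25, rng=None):
    """Train each tree on a bootstrap sample of the training examples."""
    rng = rng or np.random.default_rng(0)
    trees, n = [], len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # sample rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees


def subspace_ensemble(X, y, n_trees=25, n_feats=2, rng=None):
    """Train each tree on all examples but a random subset of the attributes."""
    rng = rng or np.random.default_rng(1)
    members = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=n_feats, replace=False)
        members.append((feats, DecisionTreeClassifier().fit(X[:, feats], y)))
    return members


def vote(predictions):
    """Combine member predictions by unweighted majority vote."""
    preds = np.asarray(predictions)               # shape: (n_members, n_test)
    return np.array([np.bincount(col).argmax() for col in preds.T])


X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = bagging_ensemble(X_tr, y_tr)
bag_pred = vote([t.predict(X_te) for t in bag])

sub = subspace_ensemble(X_tr, y_tr)
sub_pred = vote([t.predict(X_te[:, f]) for f, t in sub])

print("bagging accuracy:        ", (bag_pred == y_te).mean())
print("random subspace accuracy:", (sub_pred == y_te).mean())
```

Both ensembles are combined by the same unweighted majority vote; only the source of randomization differs, which is the axis along which the paper compares the seven methods.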



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Robert E. Banfield (1)
  • Lawrence O. Hall (1)
  • Kevin W. Bowyer (2)
  • Divya Bhadoria (1)
  • W. Philip Kegelmeyer (3)
  • Steven Eschrich (1)
  1. Department of Computer Science & Engineering, University of South Florida, Tampa, Florida, USA
  2. Computer Science & Engineering, University of Notre Dame, USA
  3. Biosystems Research Department, Sandia National Labs, Livermore, USA