A Comparison of Ensemble Creation Techniques
We experimentally evaluated bagging and six other randomization-based ensemble tree methods. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5, apply randomization in selecting the test at a given node of a tree. Still others, such as random forests and random subspaces, apply randomization in selecting the attributes used to build the tree. Boosting, in contrast, incrementally builds classifiers by focusing on examples misclassified by the existing classifiers. Experiments were performed on 34 publicly available data sets. While each of the other six approaches has some strengths, we find that none is consistently more accurate than standard bagging when tested for statistical significance.
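To make the bagging procedure described above concrete, here is a minimal sketch: draw bootstrap samples of the training set, fit one decision tree per sample, and combine predictions by majority vote. The choice of scikit-learn's DecisionTreeClassifier as the base learner and the ensemble size are assumptions for illustration, not the paper's experimental setup.

```python
# Minimal bagging sketch (illustrative; not the paper's implementation).
# Assumes integer-coded class labels and scikit-learn's decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        # Sample n examples with replacement: one bootstrap training set.
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the per-tree predictions."""
    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    # Most frequent label per example (column).
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

The randomized variants differ only in where the randomness enters: Randomized C4.5 would perturb the split chosen inside the tree-growing routine, while random forests and random subspaces would restrict each split (or each tree) to a random subset of attributes rather than resampling the examples.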