Machine Learning, Volume 63, Issue 1, pp 3–42

Extremely randomized trees

Abstract

This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of strongly randomizing both the attribute and the cut-point choice when splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to the problem at hand through the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter and provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is its computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is provided, together with a geometrical and a kernel characterization of the induced models.
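
To make the splitting scheme concrete, the following is a minimal Python sketch of the randomized node-splitting idea summarized above. It is an illustration under stated assumptions, not the paper's reference implementation: the function names (pick_random_split, variance_reduction) are invented here, the score used is variance reduction (a regression criterion), and the data are taken as plain lists. The essential mechanism is that K attributes are drawn at random, each receives a single cut-point drawn uniformly between its minimum and maximum in the node, and the best-scoring of these K candidate splits is kept; with K = 1 the split is chosen completely at random, which corresponds to totally randomized trees.

```python
# Minimal sketch (illustrative, not the paper's code) of randomized split
# selection: K random attributes, one uniformly drawn cut-point each, keep
# the best candidate according to a simple score.
import random

def pick_random_split(X, y, K):
    """Draw K random (attribute, cut-point) candidates and keep the best one."""
    n_features = len(X[0])
    best = None
    for attr in random.sample(range(n_features), min(K, n_features)):
        lo = min(row[attr] for row in X)
        hi = max(row[attr] for row in X)
        if lo == hi:                      # constant attribute in this node: cannot split
            continue
        cut = random.uniform(lo, hi)      # cut-point drawn uniformly at random
        left = [yi for row, yi in zip(X, y) if row[attr] < cut]
        right = [yi for row, yi in zip(X, y) if row[attr] >= cut]
        if not left or not right:
            continue
        score = variance_reduction(y, left, right)
        if best is None or score > best[0]:
            best = (score, attr, cut)
    return best                           # None if no valid split was found

def variance_reduction(y, left, right):
    """Decrease in output variance obtained by the split (a regression score)."""
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    n = len(y)
    return var(y) - (len(left) / n) * var(left) - (len(right) / n) * var(right)
```

An ensemble of trees grown with such a split rule would then aggregate predictions by majority vote (classification) or averaging (regression); raising K makes each split less random and more dependent on the output values of the learning sample.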

Keywords

Supervised learning · Decision and regression trees · Ensemble methods · Cut-point randomization · Bias/variance tradeoff · Kernel-based models

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

Pierre Geurts, Damien Ernst, Louis Wehenkel
  1. Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium
