Statistics and Computing, Volume 20, Issue 4, pp. 393–407

Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms

  • Marco Sandri
  • Paola Zuccolotto

Abstract

Variable selection is one of the main problems faced by data mining and machine learning techniques. These techniques are often, more or less explicitly, based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, such as Random Forests and Gradient Boosting Machines. In spite of their wide use, some measures of this class are known to be biased, and some correction strategies have been proposed. The aim of this paper is twofold: first, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits; second, to extend a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, to the entire class of TDNI measures and to investigate its performance in the regression framework using simulated and real data.
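
For a tree-based model, the TDNI importance of a predictor is the sum, over all nodes split on that predictor, of the decrease in node impurity produced by the split (averaged over the trees in an ensemble); in regression the impurity is typically the within-node sum of squares. As a rough illustration of the correction idea summarised above, the R sketch below augments the predictors with a row-permuted copy of themselves, so that each original variable gets an uninformative pseudo-twin with the same marginal distribution, and subtracts the pseudo-twin's TDNI importance as a bias estimate. This is only a minimal sketch, not the authors' exact procedure: the toy data, the number of replications, and all variable names are illustrative assumptions. The randomForest package is used because its IncNodePurity column is a TDNI measure.

    ## Minimal sketch of pseudo-variable bias correction for TDNI importance
    ## in regression. Assumed toy data: x1 informative; x2 an uninformative
    ## 8-level factor (the kind of variable whose uncorrected TDNI importance
    ## tends to be inflated); x3 uninformative and continuous.
    library(randomForest)

    set.seed(1)
    n <- 200
    x <- data.frame(
      x1 = rnorm(n),
      x2 = factor(sample(letters[1:8], n, replace = TRUE)),
      x3 = rnorm(n)
    )
    y <- 2 * x$x1 + rnorm(n)

    R <- 20                              # number of replications (assumed)
    vi <- matrix(0, R, ncol(x))
    for (r in 1:R) {
      ## Pseudo-variables: one joint row permutation of the predictors keeps
      ## their marginal distributions and mutual correlations, but breaks any
      ## association with y, so every split on them is uninformative.
      z <- x[sample(n), , drop = FALSE]
      rownames(z) <- NULL
      names(z) <- paste0("z", seq_len(ncol(z)))
      rf <- randomForest(cbind(x, z), y, ntree = 200)
      imp <- importance(rf, type = 2)    # IncNodePurity: a TDNI measure
      ## Corrected importance: TDNI of x_j minus TDNI of its pseudo-twin z_j.
      vi[r, ] <- imp[1:ncol(x), 1] - imp[ncol(x) + (1:ncol(x)), 1]
    }
    colMeans(vi)                         # bias-corrected TDNI importances

Because the pseudo-variables carry no information about the response, any importance they accumulate comes from uninformative splits alone; averaging the difference over replications therefore strips out the bias component that would otherwise make the many-valued factor x2 look spuriously important relative to x3.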

Keywords

Impurity measures, Ensemble learning, Variable importance

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Quantitative Methods, University of Brescia, Brescia, Italy
