
Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms

Abstract

Variable selection is one of the main problems faced by data mining and machine learning techniques. These techniques are often, more or less explicitly, based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, such as Random Forests and Gradient Boosting Machines. Despite their wide use, some measures of this class are known to be biased, and several correction strategies have been proposed. The aim of this paper is twofold: first, to investigate the source and characteristics of bias in TDNI measures using the notions of informative and uninformative splits; second, to extend a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, to the entire class of TDNI measures and to investigate its performance in the regression framework using simulated and real data.
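To make the quantities in the abstract concrete, the sketch below computes a TDNI importance for a regression forest and then applies a pseudo-variable correction in the spirit of the bias-correction idea the abstract refers to. It uses Python with scikit-learn, whose feature_importances_ attribute is a normalized total decrease in node impurity; the simulated data, the number of replications R, and the joint row permutation used to build the pseudo-variables are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (not the authors' implementation): TDNI importance for a
# regression forest, plus a pseudo-variable bias correction in the spirit
# of the algorithm described in the abstract.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 500, 5

# Simulated regression data: only the first two predictors carry signal.
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

# Raw TDNI importance: feature_importances_ accumulates, over all trees,
# the impurity decrease produced by splits on each variable (normalized).
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("raw TDNI importance:      ", rf.feature_importances_)

# Pseudo-variable correction (sketch): augment X with a jointly
# row-permuted copy Z. Z has the same joint distribution as X but is
# unrelated to y, so any importance earned by Z_j estimates the bias
# contributed by uninformative splits and is subtracted from X_j's score.
R = 20  # replications to average over (illustrative choice)
corrected = np.zeros(p)
for r in range(R):
    Z = X[rng.permutation(n)]  # pseudo-variables: permuted rows of X
    rf_aug = RandomForestRegressor(n_estimators=200, random_state=r)
    imp = rf_aug.fit(np.hstack([X, Z]), y).feature_importances_
    corrected += imp[:p] - imp[p:]
corrected /= R
print("bias-corrected importance:", corrected)
```

On this simulated example, the raw scores attribute small but nonzero importance to the three noise predictors, while the corrected scores for those predictors shrink toward zero.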



Author information

Correspondence to Paola Zuccolotto.


Cite this article

Sandri, M., Zuccolotto, P. Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms. Stat Comput 20, 393–407 (2010). https://doi.org/10.1007/s11222-009-9132-0


Keywords

  • Impurity measures
  • Ensemble learning
  • Variable importance