
Debiasing MDI Feature Importance and SHAP Values in Tree Ensembles

Conference paper

Machine Learning and Knowledge Extraction (CD-MAKE 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13480)

Abstract

We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag-based bias correction methods and show their connection to local explanations for trees. In addition, we point out a bias caused by the inclusion of inbag data in the newly developed SHAP values and suggest a remedy.


Notes

  1. Appendix A.1 contains expanded definitions and more thorough notation.

  2. A Python library is available: https://github.com/slundberg/shap.

  3. In all random forest simulations, we choose \(mtry=2, ntrees=100\) and exclude rows with missing Age (see the sketch following these notes).

  4. For easier notation we have (i) left out the multiplier 2 and (ii) omitted an index for the class membership.
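The following minimal sketch (not from the paper) shows one way to reproduce the forest configuration of Note 3 with scikit-learn [16]; the file name and the column names (Age, Survived, Pclass, Sex, Fare) are assumptions for illustration only:

```python
# Minimal sketch of the forest configuration from Note 3 (mtry = 2, ntrees = 100,
# rows with missing Age excluded). Data source and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("titanic.csv")            # assumed data file
df = df.dropna(subset=["Age"])             # exclude rows with missing Age

X = df[["Age", "Pclass", "Sex", "Fare"]]   # assumed feature set
X = pd.get_dummies(X, columns=["Sex"], drop_first=True)
y = df["Survived"]

rf = RandomForestClassifier(
    n_estimators=100,    # ntrees = 100
    max_features=2,      # mtry = 2
    oob_score=True,
    random_state=0,
).fit(X, y)

print(rf.feature_importances_)             # default (inbag) MDI / Gini importance
```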

References

  1. Adler, A.I., Painsky, A.: Feature importance in gradient boosting trees with cross-validation feature selection. Entropy 24(5), 687 (2022)

  2. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  3. Díaz-Uriarte, R., De Andres, S.A.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7(1), 3 (2006)

  4. Grömping, U.: Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63(4), 308–319 (2009)

  5. Grömping, U.: Variable importance in regression models. Wiley Interdiscip. Rev. Comput. Stat. 7(2), 137–152 (2015)

  6. Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 15(3), 651–674 (2006)

  7. Kim, H., Loh, W.Y.: Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 96(454), 589–604 (2001)

  8. Li, X., Wang, Y., Basu, S., Kumbier, K., Yu, B.: A debiased MDI feature importance measure for random forests. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8049–8059 (2019)

  9. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). https://CRAN.R-project.org/doc/Rnews/

  10. Loecher, M.: Unbiased variable importance for random forests. Commun. Stat. Theory Methods 51, 1–13 (2020)

  11. Loh, W.Y., Shih, Y.S.: Split selection methods for classification trees. Stat. Sin. 7, 815–840 (1997)

  12. Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)

  13. Menze, B.H., Kelm, B.M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F.A.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)

  14. Nembrini, S., König, I.R., Wright, M.N.: The revival of the Gini importance? Bioinformatics 34(21), 3711–3718 (2018)

  15. Olson, R.S., Cava, W.L., Mustahsan, Z., Varik, A., Moore, J.H.: Data-driven advice for applying machine learning to bioinformatics problems. In: Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, pp. 192–203. World Scientific (2018)

  16. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  17. Saabas, A.: Interpreting random forests (2019). http://blog.datadive.net/interpreting-random-forests/

  18. Saabas, A.: Treeinterpreter library (2019). https://github.com/andosa/treeinterpreter

  19. Sandri, M., Zuccolotto, P.: A bias correction algorithm for the Gini variable importance measure in classification trees. J. Comput. Graph. Stat. 17(3), 611–628 (2008)

  20. Shih, Y.S.: A note on split selection bias in classification trees. Comput. Stat. Data Anal. 45(3), 457–466 (2004)

  21. Shih, Y.S., Tsai, H.W.: Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. 45(3), 595–607 (2004)

  22. Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8, 1–21 (2007). https://doi.org/10.1186/1471-2105-8-25

  23. Strobl, C., Boulesteix, A.L., Augustin, T.: Unbiased split selection for classification trees based on the Gini index. Comput. Stat. Data Anal. 52(1), 483–501 (2007)

  24. Sun, Q.: tree.interpreter: Random Forest Prediction Decomposition and Feature Importance Measure (2020). https://CRAN.R-project.org/package=tree.interpreter. R package version 0.1.1

  25. Wright, M.N., Ziegler, A.: ranger: a fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 77(1), 1–17 (2017). https://doi.org/10.18637/jss.v077.i01

  26. Zhou, Z., Hooker, G.: Unbiased measurement of feature importance in tree-based methods. ACM Trans. Knowl. Discov. Data (TKDD) 15(2), 1–21 (2021)

A Appendix

A.1 Background and Notations

Definitions needed to understand Eq. (4). (The following paragraph closely follows the definitions in [8].)

Random Forests (RFs) are an ensemble of classification and regression trees, where each tree T defines a mapping from the feature space to the response. Trees are constructed independently of one another on a bootstrapped or subsampled data set \(\mathcal {D}^{(T)}\) of the original data \(\mathcal {D}\). Any node t in a tree T represents a subset (usually a hyper-rectangle) \(R_{t}\) of the feature space. A split of the node t is a pair \((k, z)\) which divides the hyper-rectangle \(R_{t}\) into two hyper-rectangles \(R_{t} \cap \mathbbm {1}\left( X_{k} \le z\right) \) and \(R_{t} \cap \mathbbm {1}\left( X_{k}>z\right) \) corresponding to the left child \(t^{\text {left}}\) and right child \(t^{\text {right}}\) of node t, respectively. For a node t in a tree \(T, N_{n}(t)=\left| \left\{ i \in \mathcal {D}^{(T)}: \mathbf {x}_{i} \in R_{t}\right\} \right| \) denotes the number of samples falling into \(R_{t}\) and

$$ \mu _{n}(t):=\frac{1}{N_{n}(t)} \sum _{i: \mathbf {x}_{i} \in R_{t}} y_{i} $$

denotes the corresponding mean response within \(R_{t}\).

We define the set of inner nodes of a tree T as I(T).
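As an illustration (not part of the original text), these node-level quantities can be read directly off a fitted scikit-learn tree; a minimal sketch for a regression tree:

```python
# Illustration: the node-level quantities of A.1 read off a fitted scikit-learn tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = tree.tree_

N_n = t.n_node_samples                            # N_n(t): samples falling into R_t
mu_n = t.value.ravel()                            # mu_n(t): mean response within R_t
inner_nodes = np.where(t.children_left != -1)[0]  # I(T): the inner (non-leaf) nodes

print(N_n[0], mu_n[0])                            # root: N_n = 200, mu_n = mean(y)
```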

A.2 Debiasing MDI via OOB Samples

In this section we give a short version of the proof that \(PG_{oob}^{(1,2)}\) is equivalent to the MDI-oob measure defined in [8]. For clarity we assume binary classification; Appendix A.3 contains an expanded version of the proof, including the multi-class case. As elegantly demonstrated by [8], the MDI of feature k in a tree T can be written as

$$\begin{aligned} MDI = \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{i \in \mathcal {D}^{(T)}}{f_{T, k}(x_i) \cdot y_i} \end{aligned}$$
(6)

where \(\mathcal {D}^{(T)}\) is the bootstrapped or subsampled data set of the original data \(\mathcal {D}\). Since \(\sum _{i \in \mathcal {D}^{(T)}}{f_{T, k}(x_i) } = 0\), we can view MDI essentially as the sample covariance between \(f_{T, k}(x_i)\) and \(y_i\) on the bootstrapped dataset \(\mathcal {D}^{(T)}\).
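The covariance view of Eq. (6) can be checked numerically; the sketch below (our own code, not the authors') fits a single regression tree on the full data, so that \(\mathcal {D}^{(T)} = \mathcal {D}\), computes the MDI as the sum of weighted variance decreases, and compares it with the inner product \(\frac{1}{n}\sum _i f_{T, k}(x_i)\, y_i\) built from the local increments along each decision path:

```python
# Sketch: checking Eq. (6) for a single regression tree fitted on the full data
# (so D^(T) = D), with variance (MSE) impurity. MDI_k, computed as the sum of
# weighted impurity decreases, matches (1/n) * sum_i f_{T,k}(x_i) * y_i.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=n)

reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
t = reg.tree_

# MDI per feature: weighted variance decrease summed over inner nodes splitting on k.
mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                                   # leaf node
        continue
    k = t.feature[node]
    mdi[k] += (t.n_node_samples[node] * t.impurity[node]
               - t.n_node_samples[left] * t.impurity[left]
               - t.n_node_samples[right] * t.impurity[right]) / n

# Local increments f_{T,k}(x_i): along the decision path of x_i, every node t that
# splits on feature k contributes mu_n(child of t containing x_i) - mu_n(t).
mu = t.value.ravel()
paths = reg.decision_path(X)                         # sparse (samples x nodes) indicator
f = np.zeros((n, X.shape[1]))
for i in range(n):
    nodes = np.sort(paths.indices[paths.indptr[i]:paths.indptr[i + 1]])
    for parent, child in zip(nodes[:-1], nodes[1:]):
        f[i, t.feature[parent]] += mu[child] - mu[parent]

print(np.allclose(mdi, (f * y[:, None]).mean(axis=0)))   # True (up to rounding)
```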

MDI-oob is based on the usual variance reduction per node, as shown in Eq. (34) in the proof of Proposition 1 of [8], but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):

$$ \varDelta _{I}(t) = \frac{1}{N_n(t)} \cdot \sum _{i \in D(T)}{(y_{i,oob} - \mu _{n,in})^2}\, \mathbbm {1}(x_i \in R_t) - \ldots $$

We can, of course, rewrite the variance as

$$\begin{aligned} \frac{1}{N_n(t)} \cdot \sum _{i \in D(T)}{(y_{i,oob} - \mu _{n,in})^2}&= \frac{1}{N_n(t)} \cdot \sum _{i \in D(T)}{(y_{i,oob} - \mu _{n,oob})^2} + (\mu _{n,in} - \mu _{n,oob})^2 \end{aligned}$$
(7)
$$\begin{aligned}&= p_{oob} \cdot (1- p_{oob} ) + (p_{in} - p_{oob})^2 \end{aligned}$$
(8)

where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and the first sum is equal to the binomial variance \(p_{oob} \cdot (1- p_{oob})\). The final expression is effectively equal to \(PG_{oob}^{(1,2)}\).
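A quick numerical check of the decomposition in Eqs. (7)–(8) (a sketch with simulated labels; all numbers are arbitrary):

```python
# Numerical check of Eqs. (7)-(8) for Bernoulli labels (all numbers arbitrary).
import numpy as np

rng = np.random.default_rng(2)
y_oob = rng.binomial(1, 0.3, size=50)   # OOB labels falling into a node
mu_in = 0.45                            # inbag mean (proportion) of the same node
p_oob = y_oob.mean()

lhs = np.mean((y_oob - mu_in) ** 2)                          # Eq. (7), left-hand side
rhs = np.mean((y_oob - p_oob) ** 2) + (mu_in - p_oob) ** 2   # Eq. (7), right-hand side
bern = p_oob * (1 - p_oob) + (mu_in - p_oob) ** 2            # Eq. (8), Bernoulli case
print(np.isclose(lhs, rhs), np.isclose(lhs, bern))           # True True
```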

Lastly, we show that \(PG_{oob}^{(0.5,1)}\) is equivalent to the unbiased split-improvement measure defined in [26]. For the binary classification case, we can rewrite \(PG_{oob}^{(0.5,1)}\) as follows:

$$\begin{aligned} PG_{oob}^{(0.5,1)}&= \frac{1}{2} \cdot \sum _{d=1}^D{ \hat{p}_{d,oob} \cdot \left( 1- \hat{p}_{d,oob} \right) + \hat{p}_{d,in} \cdot \left( 1- \hat{p}_{d,in} \right) + (\hat{p}_{d,oob} - \hat{p}_{d,in})^2} \end{aligned}$$
(9)
$$\begin{aligned}&= \hat{p}_{oob} \cdot \left( 1- \hat{p}_{oob} \right) + \hat{p}_{in} \cdot \left( 1- \hat{p}_{in} \right) + (\hat{p}_{oob} - \hat{p}_{in})^2 \end{aligned}$$
(10)
$$\begin{aligned}&= \hat{p}_{oob} - \hat{p}_{oob}^2 + \hat{p}_{in} - \hat{p}_{in}^2 + \hat{p}_{oob}^2 - 2 \hat{p}_{oob} \cdot \hat{p}_{in} + \hat{p}_{in}^2 \end{aligned}$$
(11)
$$\begin{aligned}&= \hat{p}_{oob} + \hat{p}_{in} - 2 \hat{p}_{oob} \cdot \hat{p}_{in} \end{aligned}$$
(12)
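The simplification in Eqs. (10)–(12) can also be confirmed symbolically, e.g. with sympy (a one-line sketch, not part of the paper):

```python
# Symbolic confirmation of the simplification in Eqs. (10)-(12).
import sympy as sp

p_oob, p_in = sp.symbols("p_oob p_in")
expr = p_oob * (1 - p_oob) + p_in * (1 - p_in) + (p_oob - p_in) ** 2
print(sp.expand(expr))   # p_oob + p_in - 2*p_oob*p_in, i.e. Eq. (12)
```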

A.3 Variance Reduction View

Here, we provide a full version of the proof sketched in Sect. A.2 which leans heavily on the proof of Proposition (1) in [8].

We consider the usual variance reduction per node but with a “variance” defined as the mean squared deviations of \(y_{oob}\) from the inbag mean \(\mu _{in}\):

$$\begin{aligned} \varDelta _{\mathcal {I}}(t)&= \frac{1}{N_n(t)} \sum _{i \in \mathcal {D}^{(T)}}{(y_{i,oob} - \mu _{n,in}(t))^2}\, \mathbbm {1}(x_i \in R_t) \\&\quad - \frac{N_{n}(t^{\text {left}})}{N_{n}(t)} \cdot \frac{1}{N_n(t^{\text {left}})} \sum _{i \in \mathcal {D}^{(T)}}{(y_{i,oob} - \mu _{n,in}(t^{\text {left}}))^2}\, \mathbbm {1}(x_i \in R_{t^{\text {left}}}) \\&\quad - \frac{N_{n}(t^{\text {right}})}{N_{n}(t)} \cdot \frac{1}{N_n(t^{\text {right}})} \sum _{i \in \mathcal {D}^{(T)}}{(y_{i,oob} - \mu _{n,in}(t^{\text {right}}))^2}\, \mathbbm {1}(x_i \in R_{t^{\text {right}}}) \end{aligned}$$
(13)
$$\begin{aligned}&= p_{oob}(t) \cdot \left( 1- p_{oob}(t) \right) + \left[ p_{oob}(t) - p_{in}(t)\right] ^2 - \ldots \end{aligned}$$
(14)

where the last equality is for Bernoulli \(y_i\), in which case the means \(\mu _{in/oob}\) become proportions \(p_{in/oob}\) and we replace the squared deviations with the binomial variance \(p_{oob} \cdot (1- p_{oob} )\). The final expression is then

$$\begin{aligned} \begin{aligned} \varDelta _{\mathcal {I}}(t)&=p_{oob}(t) \cdot \left( 1- p_{oob}(t) \right) + \left[ p_{oob}(t) - p_{in}(t)\right] ^2\\&- \frac{N_{n}(t^{\text {left}})}{N_{n}(t)} \left( p_{oob}(t^{\text {left}}) \cdot \left( 1- p_{oob}(t^{\text {left}}) \right) + \left[ p_{oob}(t^{\text {left}}) - p_{in}(t^{\text {left}})\right] ^2 \right) \\&- \frac{N_{n}(t^{\text {right}})}{N_{n}(t)} \left( p_{oob}(t^{\text {right}}) \cdot \left( 1- p_{oob}(t^{\text {right}}) \right) + \left[ p_{oob}(t^{\text {right}}) - p_{in}(t^{\text {right}})\right] ^2 \right) \end{aligned} \end{aligned}$$
(15)

which, of course, is exactly the impurity reduction due to \(PG_{oob}^{(1,2)}\):

$$\begin{aligned} \varDelta _{\mathcal {I}}(t) =PG_{oob}^{(1,2)}(t) - \frac{N_{n}(t^{\text {left}})}{N_{n}(t)} PG_{oob}^{(1,2)}(t^{\text {left}}) - \frac{N_{n}(t^{\text {right}})}{N_{n}(t)} PG_{oob}^{(1,2)}(t^{\text {right}}) \end{aligned}$$
(16)
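Reading \(PG_{oob}^{(1,2)}(t) = p_{oob}(t)\left( 1-p_{oob}(t)\right) + \left[ p_{oob}(t)-p_{in}(t)\right] ^2\) off the comparison of Eqs. (15) and (16), the node-level computation can be sketched as follows (function and argument names are ours, not from any library):

```python
# Sketch of the node-level quantities in Eqs. (15)-(16). Comparing the two displays,
# PG_oob^{(1,2)}(t) = p_oob(t)*(1 - p_oob(t)) + (p_oob(t) - p_in(t))^2.
# y_in / y_oob are the 0/1 labels of the inbag and OOB samples falling into R_t;
# in_left / oob_left are boolean masks selecting the samples sent to the left child.
import numpy as np

def pg_oob_12(y_in, y_oob):
    p_in, p_oob = np.mean(y_in), np.mean(y_oob)
    return p_oob * (1 - p_oob) + (p_oob - p_in) ** 2

def impurity_reduction(y_in, y_oob, in_left, oob_left):
    """Weighted reduction of PG_oob^{(1,2)} at a node, as in Eq. (16)."""
    n, n_l = len(y_in), int(in_left.sum())     # inbag counts N_n(t), N_n(t_left)
    n_r = n - n_l
    return (pg_oob_12(y_in, y_oob)
            - n_l / n * pg_oob_12(y_in[in_left], y_oob[oob_left])
            - n_r / n * pg_oob_12(y_in[~in_left], y_oob[~oob_left]))
```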

Another, somewhat surprising view of MDI is given by Eqs. (6) and (4), which for binary classification reads as:

$$\begin{aligned} \begin{aligned} MDI&= \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{t \in I(T): v(t)=k} \sum _{i \in \mathcal {D}^{(T)}} \left[ \mu _{n}\left( t^{left}\right) \mathbbm {1}\left( x_i \in R_{t^{left}}\right) \right. \\&\quad \left. +\,\mu _{n}\left( t^{right}\right) \mathbbm {1}\left( x_i \in R_{t^{right}}\right) -\mu _{n}(t) \mathbbm {1}\left( x_i \in R_{t}\right) \right] \cdot y_i \\&= \frac{1}{\left| \mathcal {D}^{(T)}\right| } \sum _{t \in I(T): v(t)=k}{ - p_{in}(t)^2 + \frac{N_{n}(t^{\text {left}})}{N_{n}(t)} p_{in}(t^{\text {left}})^2 + \frac{N_{n}(t^{\text {right}})}{N_{n}(t)} p_{in}(t^{\text {right}})^2 } \end{aligned} \end{aligned}$$
(17)
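The step to the second line of Eq. (17) uses the fact that, for binary labels, summing \(y_i\) over the inbag data of a node recovers the inbag proportion (in our notation); for the parent node, for instance, and analogously for both children,

$$ \sum _{i \in \mathcal {D}^{(T)}} \mu _{n}(t)\, \mathbbm {1}\left( x_i \in R_{t}\right) y_i = \mu _{n}(t)\, N_{n}(t)\, p_{in}(t) = N_{n}(t)\, p_{in}(t)^2 , $$

since \(\mu _{n}(t) = p_{in}(t)\) for binary labels.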

and for the oob version:

$$\begin{aligned} MDI_{oob} = - p_{in}(t) \cdot p_{oob}(t) + \frac{N_{n}(t^{\text {left}})}{N_{n}(t)} p_{in}(t^{\text {left}}) \cdot p_{oob}(t^{\text {left}}) + \frac{N_{n}(t^{\text {right}})}{N_{n}(t)} p_{in}(t^{\text {right}}) \cdot p_{oob}(t^{\text {right}}) \end{aligned}$$
(18)

A.4 \(\mathbf {E(\varDelta \widehat{PG}_{oob}^{(0)}) = 0}\)

The decrease in impurity (\(\varDelta G\)) for a parent node m is the weighted difference between the Gini impurity (see Note 4) \(G(m) = \hat{p}_m (1- \hat{p}_m )\) of node m and those of its left and right children:

$$ \varDelta G(m) = G(m) - \left[ N_{m_l} G(m_l) + N_{m_r} G(m_r) \right] / N_m $$

We assume that the node m splits on an uninformative variable \(X_j\), i.e. \(X_j\) and Y are independent.

We will use the shorthand \(\sigma ^2_{m, \cdot } \equiv p_{m,\cdot } (1-p_{m,\cdot })\), where the placeholder \(\cdot \) stands for either oob or in, and rely on the following facts and notation:

  1. \(E[\hat{p}_{m, oob}] = p_{m,oob}\) is the “population” proportion of the class label in the OOB test data (of node m).

  2. \(E[\hat{p}_{m, in}] = p_{m,in}\) is the “population” proportion of the class label in the inbag data (of node m).

  3. \(E[\hat{p}_{m, oob}] = E[\hat{p}_{m_l, oob}] = E[\hat{p}_{m_r, oob}] = p_{m,oob}\)

  4. \(E[\hat{p}_{m, oob}^2] = var(\hat{p}_{m, oob}) + E[\hat{p}_{m, oob}]^2 = \sigma ^2_{m, oob}/N_m + p_{m,oob}^2\)

     \(\Rightarrow E[G_{oob}(m)] = E[\hat{p}_{m, oob}] - E[\hat{p}_{m, oob}^2] = \sigma ^2_{m, oob} \cdot \left( 1- \frac{1}{N_m}\right) \)

     \(\Rightarrow E[\widehat{G}_{oob}(m)] = \sigma ^2_{m, oob}\)

  5. \(E[\hat{p}_{m, oob} \cdot \hat{p}_{m, in}] = E[\hat{p}_{m, oob}] \cdot E[\hat{p}_{m, in}] = p_{m,oob} \cdot p_{m,in}\)

Equalities 3 and 5 hold because of the independence of the inbag and out-of-bag data as well as the independence of \(X_j\) and Y.

We now show that \(\mathbf {E(\varDelta PG_{oob}^{(0)}) \ne 0}\). We use the shorter notation \(G_{oob} = PG_{oob}^{(0)}\):

$$\begin{aligned} E[\varDelta G_{oob}(m)]&= E[G_{oob}(m)] - \frac{N_{m_l}}{N_{m}} E[G_{oob}(m_l)] - \frac{N_{m_r}}{N_{m}} E[G_{oob}(m_r)] \\&= \sigma ^2_{m,oob} \cdot \left[ 1- \frac{1}{N_m} - \frac{N_{m_l}}{N_{m}} \left( 1- \frac{1}{N_{m_l}}\right) - \frac{N_{m_r}}{N_{m}} \left( 1- \frac{1}{N_{m_r}}\right) \right] \\&= \sigma ^2_{m,oob} \cdot \left[ 1- \frac{1}{N_m} - \frac{N_{m_l} + N_{m_r}}{N_{m}} + \frac{2}{N_m} \right] = \frac{\sigma ^2_{m,oob}}{N_m} \end{aligned}$$

We see that there is a bias if we use only OOB data, which becomes more pronounced for nodes with smaller sample sizes. This is relevant because visualizations of random forests show that splits on uninformative variables happen most frequently at “deeper” nodes.

The above bias is due to the well-known bias in variance estimation, which can be eliminated with the bias correction outlined in the main text. We now show that the bias of this modified Gini impurity is zero for OOB data. As before, \(\widehat{G}_{oob} = \widehat{PG}_{oob}^{(0)}\):

$$\begin{aligned} E[\varDelta \widehat{PG}_{oob}(m)]&= E[\widehat{G}_{oob}(m)] - \frac{N_{m_l}}{N_{m}} E[\widehat{G}_{oob}(m_l)] - \frac{N_{m_r}}{N_{m}} E[\widehat{G}_{oob}(m_r)] \\&= \sigma ^2_{m,oob} \cdot \left[ 1 - \frac{N_{m_l} + N_{m_r}}{N_{m}} \right] = 0 \end{aligned}$$
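Both expectations can be verified by simulation. The sketch below (our own code) uses the plug-in Gini impurity and, for the corrected version, the \(N/(N-1)\) factor implied by fact 4; the exact form of the correction used in the main text may differ:

```python
# Monte Carlo sketch of Appendix A.4: splitting OOB labels on an uninformative
# variable. The plug-in OOB Gini decrease has expectation sigma^2_oob / N_m,
# while the bias-corrected version (N/(N-1) factor, as implied by fact 4) does not.
import numpy as np

rng = np.random.default_rng(3)
p, N_m, N_l = 0.3, 40, 15             # OOB node size and (fixed) left-child size
N_r = N_m - N_l
sigma2 = p * (1 - p)

def gini(y):                          # plug-in Gini impurity p_hat * (1 - p_hat)
    p_hat = y.mean()
    return p_hat * (1 - p_hat)

def gini_hat(y):                      # bias-corrected version: N/(N-1) * gini
    return len(y) / (len(y) - 1) * gini(y)

delta, delta_hat = [], []
for _ in range(100_000):
    y = rng.binomial(1, p, size=N_m)              # OOB labels reaching node m
    y_l, y_r = y[:N_l], y[N_l:]                   # uninformative (random) split
    delta.append(gini(y) - (N_l * gini(y_l) + N_r * gini(y_r)) / N_m)
    delta_hat.append(gini_hat(y) - (N_l * gini_hat(y_l) + N_r * gini_hat(y_r)) / N_m)

print(np.mean(delta), sigma2 / N_m)   # both close to sigma^2_oob / N_m = 0.00525
print(np.mean(delta_hat))             # close to 0
```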

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Loecher, M. (2022). Debiasing MDI Feature Importance and SHAP Values in Tree Ensembles. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2022. Lecture Notes in Computer Science, vol 13480. Springer, Cham. https://doi.org/10.1007/978-3-031-14463-9_8

  • DOI: https://doi.org/10.1007/978-3-031-14463-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14462-2

  • Online ISBN: 978-3-031-14463-9
