Correlation and variable importance in random forests

Gregorutti, Baptiste; Michel, Bertrand; Saint-Pierre, Philippe

doi:10.1007/s11222-016-9646-1

Published: 23 March 2016

Volume 27, pages 659–678, (2017)

Statistics and Computing

Abstract

This paper addresses variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging when predictors are highly correlated. First, we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates variables, using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm in selecting a small number of variables while retaining a good prediction error. Finally, this selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
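
The elimination loop described in the abstract can be sketched in a few lines. This is an illustration only, not the authors' implementation: to stay dependency-free it swaps the random forest for an ordinary least-squares fit, and it measures permutation importance on a held-out sample rather than on out-of-bag data. The helper names (`fit_ols`, `permutation_importance`, `rfe`) and the toy data are assumptions made for this example.

```python
import random


def fit_ols(X, y):
    """Least-squares coefficients (no intercept) from the normal equations."""
    p = len(X[0])
    # Augmented system [X^T X | X^T y].
    A = [[sum(r[i] * r[k] for r in X) for k in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        for r in range(p):
            if r != i:
                m = A[r][i] / A[i][i]
                A[r] = [u - m * v for u, v in zip(A[r], A[i])]
    return [A[i][p] / A[i][i] for i in range(p)]


def mse(beta, X, y):
    """Mean squared prediction error of the linear predictor beta."""
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(X, y)) / len(y)


def permutation_importance(beta, X, y, j, rng):
    """Increase in held-out error after shuffling column j."""
    col = [r[j] for r in X]
    rng.shuffle(col)
    Xp = [r[:j] + [c] + r[j + 1:] for r, c in zip(X, col)]
    return mse(beta, Xp, y) - mse(beta, X, y)


def rfe(X_train, y_train, X_test, y_test, rng):
    """Return variable indices in elimination order (least important first)."""
    active = list(range(len(X_train[0])))
    eliminated = []
    while active:
        Xtr = [[r[k] for k in active] for r in X_train]
        Xte = [[r[k] for k in active] for r in X_test]
        beta = fit_ols(Xtr, y_train)  # refit on the current subset
        imp = [permutation_importance(beta, Xte, y_test, j, rng)
               for j in range(len(active))]
        worst = imp.index(min(imp))   # lowest-ranked variable
        eliminated.append(active.pop(worst))
    return eliminated


# Toy data (assumed): variables 0 and 1 are informative, 2 and 3 are noise.
rng = random.Random(0)


def sample(n):
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    y = [3 * r[0] + 1.5 * r[1] + rng.gauss(0, 0.5) for r in X]
    return X, y


X_train, y_train = sample(300)
X_test, y_test = sample(300)
order = rfe(X_train, y_train, X_test, y_test, rng)
print(order)
```

On this toy data the two noise variables are eliminated first and the strongest predictor (index 0) is eliminated last. The sketch returns only the elimination order; choosing the final subset would additionally require tracking the prediction error at each step.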

References

Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99, 6562–6566 (2002)
Archer, K.J., Kimes, R.V.: Empirical characterization of random forest variable importance measures. Comput. Stat. Data Anal. 52, 2249–2260 (2008)
Auret, L., Aldrich, C.: Empirical comparison of tree ensemble variable importance measures. Chemometr. Intell. Lab. Syst. 105, 157–170 (2011)
Bi, J., Bennett, K.P., Embrechts, M., Breneman, C.M., Song, M.: Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res. 3, 1229–1243 (2003)
Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9, 2015–2033 (2008)
Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97, 245–271 (1997)
Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth Advanced Books and Software, Pacific Grove (1984)
Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.-H.: Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Inference 143, 1835–1858 (2013)
Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 3 (2006)
Genuer, R., Poggi, J.-M., Tuleau-Malot, C.: Variable selection using random forests. Pattern Recogn. Lett. 31, 2225–2236 (2010)
Grömping, U.: Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63, 308–319 (2009)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
Hapfelmeier, A., Ulm, K.: A new variable selection approach using random forests. Comput. Stat. Data Anal. 60, 50–69 (2013)
Haury, A.-C., Gestraud, P., Vert, J.-P.: The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 6, 1–12 (2011)
Ishwaran, H.: Variable importance in binary regression trees and forests. Electron. J. Stat. 1, 519–537 (2007)
Jiang, H., Deng, Y., Chen, H.-S., Tao, L., Sha, Q., Chen, J., Tsai, C.-J., Zhang, S.: Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinform. 5, 81 (2004)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst. 12, 95–116 (2007)
Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1997)
Křížek, P., Kittler, J., Hlaváč, V.: Improving stability of feature selection methods. Comput. Anal. Images Patterns 4673, 929–936 (2007)
Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter, C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A.: A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 1106–1119 (2012)
Louw, N., Steel, S.J.: Variable selection in kernel Fisher discriminant analysis by means of recursive feature elimination. Comput. Stat. Data Anal. 51, 2043–2055 (2006)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection in model-based discriminant analysis. J. Multivar. Anal. 102, 1374–1387 (2011)
Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. Ser. B 72, 417–473 (2010)
Neville, P.G.: Controversy of variable importance in random forests. J. Unified Stat. Tech. 1, 15–20 (2013)
Nicodemus, K.K.: Letter to the editor: on the stability and ranking of predictors from random forest variable importance measures. Brief. Bioinform. 12, 369–373 (2011)
Nicodemus, K.K., Malley, J.D.: Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics 25, 1884–1890 (2009)
Nicodemus, K.K., Malley, J.D., Strobl, C., Ziegler, A.: The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinform. 11, 110 (2010)
Rakotomamonjy, A.: Variable selection using SVM-based criteria. J. Mach. Learn. Res. 3, 1357–1370 (2003)
Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. Wiley, Hoboken (1973)
Reunanen, J.: Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 3, 1371–1382 (2003)
Scornet, E., Biau, G., Vert, J.-P.: Consistency of random forests. arXiv:1405.2881 (2014)
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9, 307 (2008)
Svetnik, V., Liaw, A., Tong, C., Wang, T.: Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules. In: Proceedings of the 5th International Workshop on Multiple Classifier Systems, vol. 3077, pp. 334–343 (2004)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)
Toloşi, L., Lengauer, T.: Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27, 1986–1994 (2011)
van der Laan, M.J.: Statistical inference for variable importance. Int. J. Biostat. 2, 1–33 (2006)
Zhu, R., Zeng, D., Kosorok, M.R.: Reinforcement learning trees. Technical report, University of North Carolina (2012)

Acknowledgments

The authors would like to thank Gérard Biau for helpful discussions and Cathy Maugis for pointing us to the Landsat Satellite dataset. The authors also thank the two anonymous referees for their many helpful comments and valuable suggestions.

Author information

Authors and Affiliations

Safety Line, 15 Rue Jean-Baptiste Berlier, 75013, Paris, France
Baptiste Gregorutti
Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie Curie, 4 Place Jussieu, 75252, Paris Cedex 05, France
Baptiste Gregorutti, Bertrand Michel & Philippe Saint-Pierre

Corresponding author

Correspondence to Baptiste Gregorutti.

Appendix: Proofs

1.1 Proof of Proposition 1

The random variable $X_j^{\prime }$ and the vector ${\mathbf {X}}_{(j)}$ are defined as in Section 2:

$$\begin{aligned} I(X_j)&= \mathbb {E}\left[ (Y - f({\mathbf {X}}) + f({\mathbf {X}}) - f({\mathbf {X}}_{(j)}))^2\right] - \mathbb {E}\left[ (Y-f({\mathbf {X}}))^2\right] \\&= \mathbb {E}[(f({\mathbf {X}}) - f({\mathbf {X}}_{(j)}))^2] + 2 \mathbb {E}\left[ \varepsilon (f({\mathbf {X}}) - f({\mathbf {X}}_{(j)})) \right] \\&= \mathbb {E}[(f({\mathbf {X}}) - f({\mathbf {X}}_{(j)}))^2], \end{aligned}$$

since, writing $\varepsilon = Y - f({\mathbf {X}})$, we have $ \mathbb {E}[ \varepsilon f({\mathbf {X}}) ] = \mathbb {E}[ f({\mathbf {X}}) \mathbb {E}[ \varepsilon | {\mathbf {X}} ] ] = 0 $ and $ \mathbb {E}[ \varepsilon f({\mathbf {X}}_{(j)}) ] = \mathbb {E}[ \varepsilon ] \mathbb {E}[ f({\mathbf {X}}_{(j)}) ] = 0 $. Since the model is additive, we have:

$$\begin{aligned} I(X_j)&= \mathbb {E}[(f_j(X_j) - f_j(X_j^{\prime }))^2]\\&= 2\mathbb {V}[f_j(X_j)], \end{aligned}$$

as $X_j$ and $X_j^{\prime }$ are independent and identically distributed. For the second statement of the proposition, using the fact that $f_j(X_j)$ is centered, we have:

$$\begin{aligned} \mathbb {C}[Y, f_j(X_j)]&= \mathbb {E}\left[ f_j(X_j) \mathbb {E}[Y| {\mathbf {X}}] \right] \\&= \mathbb {E}[f_j(X_j) \sum _{k=1}^p f_k(X_k)]\\&= \mathbb {V}[f_j(X_j)] + \sum _{k\ne j} \mathbb {E}\left[ f_j(X_j) f_k(X_k) \right] \\&= \frac{I (X_j)}{2} + \sum _{k\ne j} \mathbb {C}\left[ f_j(X_j), f_k(X_k) \right] . \end{aligned}$$
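
Both statements of Proposition 1 can be checked numerically. The following Monte Carlo sketch assumes a two-variable additive model with hypothetical components $f_1(x) = 2x$ and $f_2(x) = \tanh (x)$, correlated standard Gaussian predictors, and the true regression function in place of a fitted forest. All of these choices are illustrative and are not taken from the paper.

```python
import math
import random

rng = random.Random(1)
n, rho = 100_000, 0.8


def f1(x):
    return 2.0 * x        # centered because E[X_1] = 0


def f2(x):
    return math.tanh(x)   # odd function of a symmetric variable, so centered


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])


# Correlated standard Gaussian predictors and an additive response.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
y = [f1(a) + f2(b) + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Permutation importance of X_1: error with X_1 shuffled minus original error.
x1_perm = x1[:]
rng.shuffle(x1_perm)
err = mean([(t - f1(a) - f2(b)) ** 2 for t, a, b in zip(y, x1, x2)])
err_perm = mean([(t - f1(a) - f2(b)) ** 2 for t, a, b in zip(y, x1_perm, x2)])
importance = err_perm - err

g1 = [f1(a) for a in x1]
g2 = [f2(b) for b in x2]
var_g1 = cov(g1, g1)

# First statement: I(X_1) should be close to 2 V[f_1(X_1)].
print(importance, 2 * var_g1)
# Second statement: C[Y, f_1(X_1)] should be close to I(X_1)/2 + C[f_1, f_2].
print(cov(y, g1), importance / 2 + cov(g1, g2))
```

Each printed pair should agree up to Monte Carlo error, which shrinks as `n` grows.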

1.2 Proof of Proposition 2

This proposition is an application of Proposition 1 for a particular distribution. We only show that $ \alpha = C^{-1} \varvec{\tau }$ in that case.

Since $({\mathbf {X}}, Y)$ is a multivariate normal vector, the conditional distribution of Y given ${\mathbf {X}}$ is also normal and the conditional mean $f({\mathbf {x}}) = \mathbb {E}[Y|{\mathbf {X}}={\mathbf {x}}]$ is a linear function: $f({\mathbf {x}}) = \sum _{j=1}^p \alpha _j x_j$ (see for instance Rao 1973, p. 522). Then, for any $j \in \{1, \dots , p \}$,

$$\begin{aligned} \tau _j&= \mathbb {E}[X_jY] \\&= \mathbb {E}[\; X_j \mathbb {E}[Y | {\mathbf {X}}] \;] \\&= \alpha _1 \mathbb {E}[X_1X_j] + \cdots + \alpha _j \mathbb {E}[X_j^2] + \cdots + \alpha _p \mathbb {E}[X_p X_j] \\&= \alpha _1 c_{1j} + \cdots + \alpha _j c_{jj}+ \cdots + \alpha _p c_{pj}. \end{aligned}$$

The vector $\alpha $ is thus a solution of the equation ${\varvec{\tau }} = C \alpha $, and the expected result follows since the covariance matrix C is invertible.
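
The identity $\alpha = C^{-1} {\varvec{\tau }}$ can be illustrated by simulation. The sketch below assumes $p = 2$, unit variances, a correlation of $0.5$ and coefficients $\alpha = (2, 1)$, all chosen purely for illustration. It estimates $\tau _j = \mathbb {E}[X_j Y]$ empirically and solves $C \alpha = {\varvec{\tau }}$ with Cramer's rule.

```python
import math
import random

rng = random.Random(2)
n, rho = 100_000, 0.5
alpha_true = (2.0, 1.0)        # illustrative coefficients (assumption)
C = [[1.0, rho], [rho, 1.0]]   # covariance matrix of (X_1, X_2)

# Centered Gaussian predictors with covariance C, and Y = alpha . X + noise.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
y = [alpha_true[0] * a + alpha_true[1] * b + rng.gauss(0, 1)
     for a, b in zip(x1, x2)]

# Empirical tau_j = E[X_j Y].
tau = [sum(a * t for a, t in zip(x1, y)) / n,
       sum(b * t for b, t in zip(x2, y)) / n]

# Solve the 2 x 2 system C alpha = tau by Cramer's rule.
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
alpha_hat = [(tau[0] * C[1][1] - C[0][1] * tau[1]) / det,
             (C[0][0] * tau[1] - C[1][0] * tau[0]) / det]
print(alpha_hat)
```

The recovered `alpha_hat` should be close to `alpha_true`, with a discrepancy that reflects only the sampling error in `tau`.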

1.3 Proof of Proposition 3

The correlation matrix C is assumed to have the form $C = (1-c) I_p + c \mathbbm {1}\mathbbm {1}^t$. We show that the inverse of C can be decomposed in the same way. Let $M = a I_p + b \mathbbm {1}\mathbbm {1}^t$ where a and b are real numbers to be chosen later. Then

$$\begin{aligned} C M= & {} \big ( (1-c)I_p + c \mathbbm {1}\mathbbm {1}^t \big ) \big ( a I_p + b \mathbbm {1}\mathbbm {1}^t \big ) \\= & {} a (1-c) I_p + b (1-c) \mathbbm {1}\mathbbm {1}^t + a c \mathbbm {1}\mathbbm {1}^t + b c \mathbbm {1}\mathbbm {1}^t \mathbbm {1}\mathbbm {1}^t \\= & {} a (1-c) I_p + (b (1-c) + ac + pbc )\mathbbm {1}\mathbbm {1}^t, \end{aligned}$$

since $\mathbbm {1}^t \mathbbm {1}= p$. Thus, $C M = I_p$ if and only if

$$\begin{aligned} \left\{ \begin{array}{l} a (1-c) = 1 \\ b (1-c) + ac + pbc = 0, \end{array} \right. \end{aligned}$$

which is equivalent to

$$\begin{aligned} \left\{ \begin{array}{l} a = \dfrac{1}{(1-c)} \\ b = \dfrac{- c}{(1-c)(1-c+pc)}. \end{array} \right. \end{aligned}$$

Consequently, $[C^{-1}]_{jk} = M_{jk} = b$ if $j \ne k$ and $[C^{-1}]_{jj} = M_{jj} = a+b$. Finally, since all components of ${\varvec{\tau }}$ are equal to $\tau _0$, we find that for any $j \in \{1, \dots , p\}$:

$$\begin{aligned}{}[C^{-1} {\varvec{\tau }}]_j= & {} \tau _0 (a+b) + \tau _0 b (p-1) \\= & {} \tau _0 ( a + pb ) \\= & {} \tau _0 \bigg ( \dfrac{1}{(1-c)} - \dfrac{pc}{(1-c)(1-c+pc)} \bigg ) \\= & {} \dfrac{\tau _0}{1 - c + pc}. \end{aligned}$$

The second point derives from Proposition 2.
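
The closed form for $C^{-1}$ and the resulting value of $[C^{-1} {\varvec{\tau }}]_j$ can be verified in exact rational arithmetic. The values $p = 5$, $c = 1/4$ and $\tau _0 = 3/5$ used below are arbitrary illustrative choices.

```python
from fractions import Fraction

p, c, tau0 = 5, Fraction(1, 4), Fraction(3, 5)  # illustrative values

# Coefficients of M = a I_p + b 1 1^t from the proof.
a = 1 / (1 - c)
b = -c / ((1 - c) * (1 - c + p * c))

# Equicorrelated matrix C and its candidate inverse M.
C = [[Fraction(1) if j == k else c for k in range(p)] for j in range(p)]
M = [[a + b if j == k else b for k in range(p)] for j in range(p)]

# Product C M, to be compared with the identity matrix.
CM = [[sum(C[j][l] * M[l][k] for l in range(p)) for k in range(p)]
      for j in range(p)]
identity = [[Fraction(int(j == k)) for k in range(p)] for j in range(p)]

# [C^{-1} tau]_j when every component of tau equals tau0.
alpha = [sum(M[j][k] * tau0 for k in range(p)) for j in range(p)]
closed_form = tau0 / (1 - c + p * c)

print(CM == identity)
print(alpha[0], closed_form)
```

Because `Fraction` arithmetic is exact, the comparison with the identity matrix and with the closed form involves no rounding tolerance.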

About this article

Cite this article

Gregorutti, B., Michel, B. & Saint-Pierre, P. Correlation and variable importance in random forests. Stat Comput 27, 659–678 (2017). https://doi.org/10.1007/s11222-016-9646-1

Received: 11 March 2014
Accepted: 06 March 2016
Published: 23 March 2016
Issue Date: May 2017
DOI: https://doi.org/10.1007/s11222-016-9646-1
