Abstract
We give examples of data-generating models under which Breiman’s random forest may be extremely slow to converge to the optimal predictor or even fail to be consistent. The evidence provided for these properties is based on mostly intuitive arguments, similar to those used earlier with simpler examples, and on numerical experiments. Although one can always choose models under which random forests perform very badly, we show that simple methods based on statistics of ‘variable use’ and ‘variable importance’ can often be used to construct a much better predictor based on a ‘many-armed’ random forest obtained by forcing initial splits on variables which the default version of the algorithm tends to ignore.
Notes
Duroux and Scornet (2018) show that the number of terminal nodes and the size of the subsamples can have a substantial effect on the finite-sample performance of random forests, and that if the size of the subsamples is about \(1-e^{-1}\) of that of the full sample then random forests with trees based on subsamples perform very similarly to Breiman’s random forests with trees based on bootstrap samples. Wager (2014) presents results of wider scope concerning other variants of random forests under conditions which to us seem more restrictive or more difficult to verify.
This example must be well known, but we do not recall a textbook where we may have seen it before.
As far as we know, the implementations of random forests currently available provide no ‘ready-made’ information on variable usage, but some partial information can sometimes be extracted from them, as illustrated in Sect. 3. We emphasize that measures of variable importance quantify the improvement in accuracy that results from using the various predictor variables, but they provide no information about how often a variable is used in relation to its importance.
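As a hedged illustration of the kind of 'variable use' statistic the note refers to: the paper's experiments use R's randomForest package, but scikit-learn's fitted trees expose the splitting variable of every internal node, so split counts can be tallied directly. Everything below (data, model, variable names) is a made-up example, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data in which only variable 0 determines the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# tree_.feature gives, for each node of a fitted tree, the index of the
# splitting variable (negative values mark leaves).
usage = np.zeros(X.shape[1], dtype=int)
for tree in forest.estimators_:
    feats = tree.tree_.feature
    for f in feats[feats >= 0]:
        usage[f] += 1

print(usage)  # variable 0 dominates the split counts
```

Comparing such counts with importance scores is what reveals the mismatch discussed in the note: a variable may be important yet rarely used.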
We do not want to suggest that other algorithms perform better on such data; the more classic Nadaraya–Watson or nearest-neighbour algorithms, for example, will generally not perform better unless \(d_1\), \(d_2\) and \(d_3\) are ‘small’. A simple R script for simulating data from the last model and comparing the performance of random forests on it with that of the optimal predictor, as well as scripts that reproduce the results of Sect. 3, may be obtained from the author.
We use Breiman’s measure of variable importance as implemented in the randomForest package with scale=TRUE; variable importance is then not really an estimate of the percent worsening of the mean square error that results from a random permutation of the data on a variable but a scaled version of it. As is well known, variable importance must be regarded as a relative measure which quantifies how much more important each variable is relative to the others.
This is just one of the possible definitions of the importance of a variable; for a recent overview of other definitions and methods of estimating variable importance see Loh and Zhou (2021).
This is not the same as the method used in Breiman’s random forest, which for economy computes such estimates per tree and then averages them, but in our experience the two methods generally provide a similar ranking of importance.
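A minimal sketch of the 'whole forest' permutation importance just described, as opposed to Breiman's per-tree averaging: the fitted forest's predictions are recomputed after permuting one variable at a time, and the worsening of the mean squared error is recorded. scikit-learn again stands in for the R randomForest package, the data are invented for illustration, and for brevity the training sample is reused rather than out-of-bag or test data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: variables 0 and 1 drive the response, 2 and 3 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=400)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
base_mse = np.mean((forest.predict(X) - y) ** 2)

importance = np.empty(X.shape[1])
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between X_j and y
    importance[j] = np.mean((forest.predict(Xp) - y) ** 2) - base_mse

ranking = np.argsort(importance)[::-1]
print(ranking)  # variables 0 and 1 rank above the noise variables
```

As the note observes, either computation typically yields a similar relative ranking of the variables.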
References
Biau G, Scornet E (2016) A random forest guided tour. TEST 25:197–227
Biau G, Devroye LP, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015–2033
Breiman L (2001) Random forests. Mach Learn 45:5–32
Burrill CW (1972) Measure, integration, and probability. McGraw-Hill, New York
Devroye LP, Wagner TJ (1980) Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann Stat 8(2):231–239
Devroye LP, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Duroux R, Scornet E (2018) Impact of subsampling and tree depth on random forests. ESAIM: PS 22:96–128
Ferreira JA (2015) Some models and methods for the analysis of observational data. Stat Surv 9:106–208
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York
Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80(13–14):1056–1064
Kim H, Loh WY (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96(454):589–604
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Loh WY, Zhou P (2021) Variable importance scores. J Data Sci (to appear). arXiv.org: https://arxiv.org/abs/2102.07765v1
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142
Scornet E, Biau G, Vert JP (2015) Consistency of random forests. Ann Stat 43:1716–1741
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5(4):595–620
Therneau T, Atkinson B (2019) rpart: recursive partitioning and regression trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
Wager S (2014) Asymptotic theory for random forests. arXiv preprint arXiv:1405.0352 (2016 version)
Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26(4):359–372
Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770–1784
Acknowledgements
The author is indebted to several reviewers and editors for comments that helped improve the paper.
Cite this article
Ferreira, J.A. Models under which random forests perform badly; consequences for applications. Comput Stat 37, 1839–1854 (2022). https://doi.org/10.1007/s00180-021-01182-4