Abstract
We give examples of data-generating models under which Breiman’s random forest may be extremely slow to converge to the optimal predictor or even fail to be consistent. The evidence provided for these properties is based on mostly intuitive arguments, similar to those used earlier with simpler examples, and on numerical experiments. Although one can always choose models under which random forests perform very badly, we show that simple methods based on statistics of ‘variable use’ and ‘variable importance’ can often be used to construct a much better predictor based on a ‘many-armed’ random forest obtained by forcing initial splits on variables which the default version of the algorithm tends to ignore.
Notes
Duroux and Scornet (2018) show that the number of terminal nodes and the size of the subsamples can have a substantial effect on the finite-sample performance of random forests, and that if the size of the subsamples is about \(1-e^{-1}\) of that of the full sample then random forests with trees based on subsamples perform very similarly to Breiman’s random forests with trees based on bootstrap samples. Wager (2014) presents results of wider scope concerning other variants of random forests under conditions which to us seem more restrictive or more difficult to verify.
This example must be well known, but we do not recall a textbook where we may have seen it before.
As far as we know, the implementations of random forests currently available provide no ‘ready-made’ information on variable usage, but some partial information can sometimes be extracted from them, as illustrated in Sect. 3. We emphasize that measures of variable importance quantify the improvement in accuracy that results from using the various predictor variables, but they provide no information about how often a variable is used in relation to its importance.
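As a hedged illustration of the kind of 'variable use' statistic the note refers to: the paper's experiments use R's randomForest package, but scikit-learn's fitted trees expose the splitting variable of every internal node, so split counts can be tallied directly. Everything below (data, model, variable names) is a made-up example, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data in which only variable 0 determines the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# tree_.feature gives, for each node of a fitted tree, the index of the
# splitting variable (negative values mark leaves).
usage = np.zeros(X.shape[1], dtype=int)
for tree in forest.estimators_:
    feats = tree.tree_.feature
    for f in feats[feats >= 0]:
        usage[f] += 1

print(usage)  # variable 0 dominates the split counts
```

Comparing such counts with importance scores is what reveals the mismatch discussed in the note: a variable may be important yet rarely used.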
We do not want to suggest that other algorithms perform better on such data; the more classic Nadaraya–Watson or nearest-neighbour algorithms, for example, will generally not perform better unless \(d_1\), \(d_2\) and \(d_3\) are ‘small’. A simple R script for simulating data from the last model and comparing the performance of random forests on it with that of the optimal predictor, as well as scripts that reproduce the results of Sect. 3, may be obtained from the author.
We use Breiman’s measure of variable importance as implemented in the randomForest package with scale=TRUE; variable importance is then not really an estimate of the percent worsening of the mean square error that results from a random permutation of the data on a variable but a scaled version of it. As is well known, variable importance must be regarded as a relative measure which quantifies how much more important each variable is relative to the others.
This is just one of the possible definitions of the importance of a variable; for a recent overview of other definitions and methods of estimating variable importance see Loh and Zhou (2021).
This is not the same as the method used in Breiman’s random forest, which for economy computes such estimates per tree and then averages them, but in our experience the two methods generally provide a similar ranking of importance.
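A minimal sketch of the 'whole forest' permutation importance just described, as opposed to Breiman's per-tree averaging: the fitted forest's predictions are recomputed after permuting one variable at a time, and the worsening of the mean squared error is recorded. scikit-learn again stands in for the R randomForest package, the data are invented for illustration, and for brevity the training sample is reused rather than out-of-bag or test data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: variables 0 and 1 drive the response, 2 and 3 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=400)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
base_mse = np.mean((forest.predict(X) - y) ** 2)

importance = np.empty(X.shape[1])
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between X_j and y
    importance[j] = np.mean((forest.predict(Xp) - y) ** 2) - base_mse

ranking = np.argsort(importance)[::-1]
print(ranking)  # variables 0 and 1 rank above the noise variables
```

As the note observes, either computation typically yields a similar relative ranking of the variables.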
References
Biau G, Scornet E (2016) A random forest guided tour. TEST 25:197–227
Biau G, Devroye LP, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015–2033
Breiman L (2001) Random forests. Mach Learn 45:5–32
Burrill CW (1972) Measure, integration, and probability. McGraw-Hill, New York
Devroye LP, Wagner TJ (1980) Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann Stat 8(2):231–239
Devroye LP, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Duroux R, Scornet E (2018) Impact of subsampling and tree depth on random forests. ESAIM: PS 22:96–128
Ferreira JA (2015) Some models and methods for the analysis of observational data. Stat Surv 9:106–208
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York
Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80(13–14):1056–1064
Kim H, Loh WY (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96(454):589–604
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Loh WY, Zhou P (2021) Variable importance scores. J Data Sci (to appear). arXiv.org: https://arxiv.org/abs/2102.07765v1
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142
Scornet E, Biau G, Vert JP (2015) Consistency of random forests. Ann Stat 43:1716–1741
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5(4):595–620
Therneau T, Atkinson B (2019) rpart: recursive partitioning and regression trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
Wager S (2014) Asymptotic theory for random forests. arXiv preprint arXiv:1405.0352 (2016 version)
Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26(4):359–372
Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770–1784
Acknowledgements
The author is indebted to several reviewers and editors for comments that helped improve the paper.
Cite this article
Ferreira, J.A. Models under which random forests perform badly; consequences for applications. Comput Stat 37, 1839–1854 (2022). https://doi.org/10.1007/s00180-021-01182-4