Models under which random forests perform badly; consequences for applications

Original paper, published in Computational Statistics.

Abstract

We give examples of data-generating models under which Breiman’s random forest may be extremely slow to converge to the optimal predictor or even fail to be consistent. The evidence provided for these properties is based on mostly intuitive arguments, similar to those used earlier with simpler examples, and on numerical experiments. Although one can always choose models under which random forests perform very badly, we show that simple methods based on statistics of ‘variable use’ and ‘variable importance’ can often be used to construct a much better predictor based on a ‘many-armed’ random forest obtained by forcing initial splits on variables which the default version of the algorithm tends to ignore.
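
The 'many-armed' construction described above can be illustrated with a minimal sketch. Everything below is an invented illustration, not the paper's method or data: a toy model, Python's scikit-learn in place of R's randomForest, a hand-chosen forced split, and the helper name `predict_forced`. The idea is simply to force an initial split on a chosen variable and grow one forest per branch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.uniform(-1.0, 1.0, size=(n, p))
# Toy model: variable 0 acts only through its sign, via an interaction
# with variable 1, so a default forest may rarely split on it early.
y = np.where(X[:, 0] > 0, X[:, 1], -X[:, 1]) + 0.1 * rng.standard_normal(n)

j, t = 0, 0.0  # forced initial split: variable j at threshold t (chosen by hand)
left = X[:, j] <= t
rf_left = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[left], y[left])
rf_right = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[~left], y[~left])

def predict_forced(X_new):
    """Route each point through the forced split, then to that branch's forest."""
    out = np.empty(len(X_new))
    m = X_new[:, j] <= t
    if m.any():
        out[m] = rf_left.predict(X_new[m])
    if (~m).any():
        out[~m] = rf_right.predict(X_new[~m])
    return out
```

In practice the split variable and threshold would be chosen from 'variable use' and 'variable importance' statistics rather than by hand, and more than two arms may be used.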


Notes

  1. Duroux and Scornet (2018) show that the number of terminal nodes and the size of the subsamples can have a substantial effect on the finite-sample performance of random forests, and that if the size of the subsamples is about \(1-e^{-1}\) of that of the full sample then random forests with trees based on subsamples perform very similarly to Breiman’s random forests with trees based on bootstrap samples. Wager (2014) presents results of wider scope concerning other variants of random forests under conditions which to us seem more restrictive or more difficult to verify.

  2. This example must be well known, but we do not recall a textbook where we may have seen it before.

  3. As far as we know, the implementations of random forests currently available provide no ‘ready-made’ information on variable usage, but some partial information can sometimes be extracted from them, as illustrated in Sect. 3. We emphasize that measures of variable importance quantify the improvement in accuracy that results from using the various predictor variables, but they provide no information about how often a variable is used in relation to its importance.

  4. We do not want to suggest that other algorithms perform better on such data; the more classical Nadaraya–Watson or nearest-neighbour algorithms, for example, will generally not perform better unless \(d_1\), \(d_2\) and \(d_3\) are ‘small’. A simple R script for simulating data from the last model and comparing the performance of random forests on it with that of the optimal predictor, as well as scripts that reproduce the results of Sect. 3, may be obtained from the author.

  5. We use Breiman’s measure of variable importance as implemented in the randomForest package with scale=TRUE; variable importance is then not really an estimate of the percent worsening of the mean square error that results from a random permutation of the data on a variable but a scaled version of it. As is well known, variable importance must be regarded as a relative measure which quantifies how much more important each variable is relative to the others.

  6. This is just one of the possible definitions of the importance of a variable; for a recent overview of other definitions and methods of estimating variable importance see Loh and Zhou (2021).

  7. This is not the same as the method used in Breiman’s random forest, which for economy computes such estimates per tree and then averages them, but in our experience the two methods generally provide a similar ranking of importance.
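
As note 3 observes, current implementations provide no ready-made 'variable use' statistics, but they can be reconstructed from fitted trees. The sketch below uses Python's scikit-learn rather than the R randomForest package discussed in the notes (an assumption for illustration, with an invented data-generating model): it counts how often each variable appears as a split variable across the forest.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 4))
y = X[:, 0] + 0.1 * rng.standard_normal(500)  # only variable 0 is informative

rf = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=1).fit(X, y)

# est.tree_.feature holds the split variable of each internal node; leaves are
# marked -2, so keeping the non-negative entries counts actual splits.
usage = Counter()
for est in rf.estimators_:
    feat = est.tree_.feature
    usage.update(int(f) for f in feat if f >= 0)

print(dict(usage))  # variable 0 should dominate the split counts
```

Such counts quantify how often a variable is used, which, as note 3 stresses, is distinct from how important it is.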
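
The scaled permutation importance of note 5 is specific to the R randomForest package. As a rough analogue only (scikit-learn, an invented model, and permutation on the training data are all assumptions here), a permutation importance and its relative version can be computed as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 3))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(500)  # variable 2 is noise

rf = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)

# Mean drop in R^2 over 10 random permutations of each column; computed on the
# training data for brevity (held-out or out-of-bag data would be preferable).
result = permutation_importance(rf, X, y, n_repeats=10, random_state=2)
relative = result.importances_mean / result.importances_mean.max()
```

Dividing by the maximum expresses each importance relative to the most important variable, in the spirit of note 5's remark that variable importance is a relative measure.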

References

  • Biau G, Scornet E (2016) A random forest guided tour. TEST 25:197–227

  • Biau G, Devroye LP, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015–2033

  • Breiman L (2001) Random forests. Mach Learn 45:5–32

  • Burrill CW (1972) Measure, integration, and probability. McGraw-Hill, New York

  • Devroye LP, Wagner TJ (1980) Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann Stat 8(2):231–239

  • Devroye LP, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York

  • Duroux R, Scornet E (2018) Impact of subsampling and tree depth on random forests. ESAIM: PS 22:96–128

  • Ferreira JA (2015) Some models and methods for the analysis of observational data. Stat Surv 9:106–208

  • Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, Berlin

  • Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80(13–14):1056–1064

  • Kim H, Loh WY (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96(454):589–604

  • Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22

  • Loh WY, Zhou P (2021) Variable importance scores. J Data Sci (to appear). arXiv.org: https://arxiv.org/abs/2102.07765v1

  • Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142

  • Scornet E, Biau G, Vert JP (2015) Consistency of random forests. Ann Stat 43:1716–1741

  • Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5(4):595–620

  • Therneau T, Atkinson B (2019) rpart: recursive partitioning and regression trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart

  • Wager S (2014) Asymptotic theory for random forests. arXiv:1405.0352 (version of 2016)

  • Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26(4):359–372

  • Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770–1784

Acknowledgements

The author is indebted to several reviewers and editors for comments that helped to improve the paper.

Author information

Correspondence to José A. Ferreira.


Cite this article

Ferreira, J.A. Models under which random forests perform badly; consequences for applications. Comput Stat 37, 1839–1854 (2022). https://doi.org/10.1007/s00180-021-01182-4
