Abstract
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Amaratunga D, Cabrera J, Lee Y-S (2008) Enriched random forests. Bioinformatics 24:2010–2014
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588
Archer KJ, Kimes RV (2008) Empirical characterization of random forest variable importance measures. Comput Stat Data Anal 52:2249–2260
Arlot S, Genuer R (2014) Analysis of purely random forests bias. arXiv:1407.3939
Auret L, Aldrich C (2011) Empirical comparison of tree ensemble variable importance measures. Chemom Intell Lab Syst 105:157–170
Bai Z-H, Devroye L, Hwang H-K, Tsai T-H (2005) Maxima in hypercubes. Random Struct Algorithms 27:290–309
Banerjee M, McKeague IW (2007) Confidence sets for split points in decision trees. Ann Stat 35:543–574
Barndorff-Nielsen O, Sobel M (1966) On the distribution of the number of admissible points in a vector random sample. Theory Probab Appl 11:249–269
Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2:1–127
Bernard S, Heutte L, Adam S (2008) Forest-RK: a new random forest induction method. In: Huang D-S, Wunsch DC II, Levine DS, Jo K-H (eds) Advanced intelligent computing theories and applications. With aspects of artificial intelligence. Springer, Berlin, pp 430–437
Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognit Lett 33:1580–1586
Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063–1095
Biau G, Devroye L (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J Multivar Anal 101:2499–2518
Biau G, Devroye L (2013) Cellular tree classifiers. Electron J Stat 7:1875–1912
Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015–2033
Biau G, Cérou F, Guyader A (2010) On the rate of convergence of the bagged nearest neighbor estimate. J Mach Learn Res 11:687–712
Boulesteix A-L, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data Mining Knowl Discov 2:493–507
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2000a) Some infinity theory for predictor ensembles. Technical Report 577, University of California, Berkeley
Breiman L (2000b) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229–242
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L (2003a) Setting up, using, and understanding random forests V3.1. https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
Breiman L (2003b) Setting up, using, and understanding random forests V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
Breiman L (2004) Consistency for a simple model of random forests. Technical Report 670, University of California, Berkeley
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC, Boca Raton
Bühlmann P, Yu B (2002) Analyzing bagging. Ann Stat 30:927–961
Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. Technical Report 666, University of California, Berkeley
Clémençon S, Depecker M, Vayatis N (2013) Ranking forests. J Mach Learn Res 14:39–73
Clémençon S, Vayatis N (2009) Tree-based ranking methods. IEEE Trans Inform Theory 55:4316–4336
Criminisi A, Shotton J, Konukoglu E (2011) Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis 7:81–227
Crookston NL, Finley AO (2008) yaImpute: an R package for \(k\)NN imputation. J Stat Softw 23:1–16
Cutler A, Zhao G (2001) PERT—perfect random tree ensembles. Comput Sci Stat 33:490–497
Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88:2783–2792
Davies A, Ghahramani Z (2014) The Random Forest Kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
Deng H, Runger G (2012) Feature selection via regularized trees. In: The 2012 international joint conference on neural networks, pp 1–8
Deng H, Runger G (2013) Gene selection with guided regularized random forest. Pattern Recognit 46:3483–3489
Denil M, Matheson D, de Freitas N (2013) Consistency of online random forests. In: International conference on machine learning (ICML)
Denil M, Matheson D, de Freitas N (2014) Narrowing the gap: random forests in theory and in practice. In: International conference on machine learning (ICML)
Désir C, Bernard S, Petitjean C, Heutte L (2013) One class random forests. Pattern Recognit 46:3490–3506
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7:1–13
Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 1–15
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
Efron B (1982) The jackknife, the bootstrap and other resampling plans, vol 38. CBMS-NSF Regional Conference Series in Applied Mathematics, Philadelphia
Fink D, Hochachka WM, Zuckerberg B, Winkler DW, Shaby B, Munson MA, Hooker G, Riedewald M, Sheldon D, Kelling S (2010) Spatiotemporal exploratory models for broad-scale survey data. Ecol Appl 20:2131–2147
Friedman J, Hastie T, Tibshirani R (2009) The elements of statistical learning, 2nd edn. Springer, New York
Genuer R (2012) Variance reduction in purely random forests. J Nonparametr Stat 24:543–562
Genuer R, Poggi J-M, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recognit Lett 31:2225–2236
Geremia E, Menze BH, Ayache N (2013) Spatially adaptive random forests. In: IEEE international symposium on biomedical imaging: from nano to macro, pp 1332–1335
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
Gregorutti B, Michel B, Saint Pierre P (2016) Correlation and variable importance in random forests. Stat Comput. doi:10.1007/s11222-016-9646-1
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York
Ho T (1998) The random subspace method for constructing decision forests. Pattern Anal Mach Intell 20:832–844
Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15:651–674
Howard J, Bowles M (2012) The two most important algorithms in predictive modeling today. In: Strata Conference: Santa Clara. http://strataconf.com/strata2012/public/schedule/detail/22658
Ishioka T (2013) Imputation of missing values for unsupervised data using the proximity in random forests. In: eLmL 2013, The fifth international conference on mobile, hybrid, and on-line learning, pp 30–36. International Academy, Research, and Industry Association
Ishwaran H (2007) Variable importance in binary regression trees and forests. Electron J Stat 1:519–537
Ishwaran H (2013) The effect of splitting on random forests. Mach Learn 99:75–118
Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80:1056–1064
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2:841–860
Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for high-dimensional data. Stat Anal Data Mining ASA Data Sci J 4:115–132
Jeffrey D, Sanja G (2008) Simplified data processing on large clusters. Commun ACM 51:107–113
Joly A, Geurts P, Wehenkel L (2014) Random forests with random projections of the output space for high dimensional multi-label classification. In: Calders T, Esposito F, Hüllermeier E, Meo R (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 607–622
Kim H, Loh W-Y (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96:589–604
Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2014) A scalable bootstrap for massive data. J Royal Stat Soc Ser B (Stat Methodol) 76:795–816
Konukoglu E, Ganz M (2014) Approximate false positive rate control in selection frequency for random forest. arXiv:1410.2838
Kruppa J, Schwarz A, Arminger G, Ziegler A (2013) Consumer credit risk: individual probability estimates using machine learning. Expert Syst Appl 40:5125–5131
Kruppa J, Liu Y, Biau G, Kohler M, König IR, Malley JD, Ziegler A (2014a) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biometr J 56:534–563
Kruppa J, Liu Y, Diener H-C, Holste T, Weimar C, König IR, Ziegler A (2014b) Probability estimation with machine learning methods for dichotomous and multicategory outcome: applications. Biometr J 56:564–583
Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York
Kyrillidis A, Zouzias A (2014) Non-uniform feature sampling for decision tree ensembles. In: IEEE international conference on acoustics, speech and signal processing, pp 4548–4552
Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: efficient online random forests. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, pp 3140–3148
Latinne P, Debeir O, Decaestecker C (2001) Limiting the number of trees in random forests. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 178–187
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578–590
Louppe G, Wehenkel L, Sutera A, Geurts P (2013) Understanding variable importances in forests of randomized trees. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, pp 431–439
Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inform Med 51:74–81
Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983–999
Meinshausen N (2009) Forest Garrote. Electron J Stat 3:1288–1304
Mentch L, Hooker G (2014) A novel test for additivity in supervised ensemble learners. arXiv:1406.1845
Mentch L, Hooker G (2015) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res (in press)
Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 453–469
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142
Nicodemus KK, Malley JD (2009) Predictor correlation impacts machine learning algorithms: Implications for genomic studies. Bioinformatics 25:1884–1890
Politis DN, Romano JP, Wolf M (1999) Subsampling. Springer, New York
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199
Qian SS, King RS, Richardson CJ (2003) Two statistical methods for the detection of environmental thresholds. Ecol Model 166:87–97
Rieger A, Hothorn T, Strobl C (2010) Random forests with missing values in the covariates. Technical Report 79, University of Munich, Munich
Saffari A, Leistner C, Santner J, Godec M, Bischof H (2009) On-line random forests. In: IEEE 12th international conference on computer vision workshops, pp 1393–1400
Schwarz DF, König IR, Ziegler A (2010) On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26:1752–1758
Scornet E (2015a) On the asymptotics of random forests. J Multivar Anal 146:72–83
Scornet E (2015b) Random forests and kernel methods. IEEE Trans Inform Theory 62:1485–1500
Scornet E, Biau G, Vert J-P (2015) Consistency of random forests. Ann Stat 43:1716–1741
Segal MR (1988) Regression trees for censored data. Biometrics 44:35–47
Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: IEEE conference on computer vision and pattern recognition, pp 1297–1304
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5:595–645
Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat 8:1348–1360
Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. Ann Stat 10:1040–1053
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inform Comput Sci 43:1947–1958
Toloşi L, Lengauer T (2011) Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27:1986–1994
Truong AKY (2009) Fast growing and interpretable oblique trees via logistic regression models. PhD thesis, University of Oxford, Oxford
Varian H (2014) Big data: new tricks for econometrics. J Econ Perspect 28:3–28
Wager S (2014) Asymptotic theory for random forests. arXiv:1405.0352
Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15:1625–1651
Watson GS (1964) Smooth regression analysis. Sankhy\(\bar{a}\) Ser A 26:359–372
Welbl J (2014) Casting random forests as artificial neural networks and profiting from it. In: Jiang X, Hornegger J, Koch R (eds) Pattern recognition. Springer, Berlin, pp 765–771
Winham SJ, Freimuth RR, Biernacka JM (2013) A weighted random forests approach to improve predictive performance. Stat Anal Data Mining ASA Data Sci J 6:496–505
Yan D, Chen A, Jordan MI (2013) Cluster forests. Comput Stat Data Anal 66:178–192
Yang F, Wang J, Fan G (2010) Kernel induced random survival forests. arXiv:1008.3952
Yi Z, Soatto S, Dewan M, Zhan Y (2012) Information forests. In: 2012 information theory and applications workshop, pp 143–146
Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770–1784
Ziegler A, König IR (2014) Mining data with random forests: current options for real-world applications. Wiley Interdiscip Rev Data Mining Knowl Discov 4:55–63
Acknowledgments
We thank the Editors and three anonymous referees for valuable comments and insightful suggestions.
Author information
Authors and Affiliations
Corresponding author
Additional information
This invited paper is discussed in comments available at: doi:10.1007/s11749-016-0482-6; doi:10.1007/s11749-016-0483-5; doi:10.1007/s11749-016-0484-4; doi:10.1007/s11749-016-0485-3; doi:10.1007/s11749-016-0487-1.
Rights and permissions
About this article
Cite this article
Biau, G., Scornet, E. A random forest guided tour. TEST 25, 197–227 (2016). https://doi.org/10.1007/s11749-016-0481-7
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11749-016-0481-7