Abstract
In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed the best, with ER being the second best. For regression problems, RF and SSS performed the best, followed by AR, and then ER at the last. However, each algorithm was most accurate for at least one study. We investigate scenarios where the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
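The acceptance–rejection idea described above can be illustrated with a minimal sketch: draw completely random cutpoints and accept the first whose split quality clears a threshold, rejecting low-quality draws. The quality measure (Gini gain), threshold, and retry cap below are illustrative assumptions, not the authors' exact specification.

```python
import random

def ar_split(x_col, y, quality, threshold, max_draws=50):
    """Draw completely random cutpoints; accept the first draw whose
    split quality reaches the threshold (acceptance-rejection)."""
    lo, hi = min(x_col), max(x_col)
    best = None
    for _ in range(max_draws):
        cut = random.uniform(lo, hi)
        q = quality(x_col, y, cut)
        if q >= threshold:
            return cut, q              # accept this random split
        if best is None or q > best[1]:
            best = (cut, q)            # remember the best rejected draw
    return best                        # fall back if nothing passes

def gini_gain(x_col, y, cut):
    """Toy quality measure: reduction in Gini impurity for a 0/1 response."""
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p = sum(labels) / n
        return 2 * p * (1 - p)
    left = [yi for xi, yi in zip(x_col, y) if xi <= cut]
    right = [yi for xi, yi in zip(x_col, y) if xi > cut]
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Toy usage on a cleanly separable variable
random.seed(0)
x = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
y = [0, 0, 0, 1, 1, 1]
cut, q = ar_split(x, y, gini_gain, threshold=0.2)
```

Because the cutpoint is drawn at random rather than optimized by exhaustive search, the resulting split cannot favor variables with many candidate cutpoints, which is consistent with the reduced variable selection bias reported for AR.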
References
Allwein E, Schapire R, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L (2004) Consistency for a simple model of random forests. Technical report, University of California at Berkeley
Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning, pp 161–168
Caruana R, Karampatziakis N, Yessenalina A (2008) An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th international conference on machine learning, pp 96–103
Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical methods for data analysis. Wadsworth, Belmont
Cutler D, Edwards T Jr, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792
Davis R, Anderson Z (1989) Exponential survival trees. Stat Med 8:947–962
Derrig R, Francis L (2008) Distinguishing the forest from the trees: a comparison of tree-based data mining methods. Variance 2:184–208
Dietterich T, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286
Fan J, Su X, Levine R, Nunn M, LeBlanc M (2006) Trees for correlated survival data by goodness of split, with applications to tooth prognosis. J Am Stat Assoc 101:959–967
Friedman J (2001) Greedy function approximation: the gradient boosting machine. Ann Stat 29:1189–1232
Genuer R, Poggi JM, Tuleau C (2008) Random forests: some methodological insights. arXiv
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
Gordon L, Olshen R (1985) Tree-structured survival analysis. Cancer Treat Rep 69:1065–1069
Hajjem A, Bellavance F, Larocque D (2014) Mixed effects random forest for clustered data. J Stat Comput Simul 84:1313–1328
Hanley J, McNeil B (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
Ho T (1995) Random decision forests. In: Proceedings of the 3rd international conference on document analysis and recognition, vol 1, pp 278–282
Ho T (1998) The random subspace method of constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
Hosmer D, Lemeshow S (1989) Applied logistic regression. Wiley, New York
Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments. J Comput Graph Stat 14:675–699
Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75–118
Ishwaran H, Kogalur UB (2016) Random forests for survival, regression, and classification (RF-SRC). R package version 2.2.0
Ishwaran H, Kogalur U, Gorodeski E, Minn A, Lauer M (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105:205–217
König I, Malley J, Weimar C, Diener HC, Ziegler A (2007) Practical experiences on the necessity of external validation. Stat Med 26:5499–5511
Leisch F, Dimitriadou E (2010) mlbench: Machine Learning Benchmark Problems. R package version 2.1-1
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Malley J, Kruppa J, Dasgupta A, Malley K, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74–81
Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed May 2018
Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1:80–87
Sela R, Simonoff J (2012) RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86:169–207
Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H (2014) Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. Am J Epidemiol 179:764–774
Strobl C, Boulesteix A, Zeileis A, Augustin T (2007a) Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal 52:483–501
Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007b) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform 8:25–46
Su X, Kang J, Liu L, Yang Q, Fan J, Levine R (2016) Smooth sigmoid surrogate (SSS): an alternative to greedy search in recursive partitioning. Comput Stat Data Anal (under review)
Su X, Pena A, Liu L, Levine R (2018) Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med 37:2547–2560
Torgo L (1999) Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto
Yoo W, Ference B, Cote M, Schwartz A (2012) A comparison of logistic regression, logic regression, classification tree, and random forests to identify effective gene–gene and gene–environment interactions. Int J Appl Sci Technol 2:268
Acknowledgements
This research was supported in part by NSF Grant 163310.
Appendix A: datasets
The twenty datasets analyzed to assess prediction accuracy are summarized in Table 8. The Ailerons and Elevators data were taken from Torgo (1999), the Birthwt data were taken from Hosmer and Lemeshow (1989), the Airquality data were taken from Chambers et al. (1983), and the remaining datasets were taken from the UCI Machine Learning Repository (Newman et al. 1998). Most of the datasets are also available in the mlbench R package (Leisch and Dimitriadou 2010). The Breast Cancer dataset had 16 missing values, the Imports85 dataset had 12 missing values, and the Airquality dataset had 42 missing values. Missing values were handled with list-wise deletion for this paper.
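The list-wise deletion used for the datasets with missing values amounts to dropping any observation with at least one missing entry. A minimal sketch (plain Python in place of the paper's R workflow; the data rows are hypothetical):

```python
def listwise_delete(rows):
    """Drop every observation (row) containing a missing value (None),
    as done for the datasets with missing values in this paper."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical toy data: two predictors and a 0/1 response
data = [
    [5.1, 3.5, 0],
    [4.9, None, 1],   # one missing value, so the whole row is dropped
    [6.2, 2.8, 0],
]
complete = listwise_delete(data)
```

In R this corresponds to `na.omit` on a data frame; only complete cases enter the analysis.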
Among the ten binary problems, three are synthetic and seven are real. The Twonorm, Threenorm, and Ringnorm datasets simulate a binary response from continuous inputs that each have the same effect on the response. Four of the real datasets (Breast Cancer, Ionosphere, German Credit, and Birthwt) included nominal or ordinal categorical variables. The binary response datasets ranged from 189 to 4601 observations and from 6 to 60 input variables. For the ten regression problems, half are synthetic with only continuous inputs. The Friedman datasets have nonlinear relationships with interacting variables, and the Ailerons and Elevators datasets were simulated datasets related to the actions of an F16 aircraft. Four of the real datasets (Housing, Servo, Abalone, and Imports85) included nominal or ordinal categorical variables. The regression datasets ranged from 111 to 16,599 observations and from 4 to 40 input variables.
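The Twonorm-style construction can be sketched directly: each class is a unit-variance multivariate normal whose mean is +a or -a in every coordinate, so every input has the same effect on the response. The choice a = 2/sqrt(p) and p = 20 follow the usual mlbench convention and are assumptions here, not a restatement of the paper's exact generator.

```python
import math
import random

def twonorm(n, p=20, seed=None):
    """Simulate a Twonorm-style binary problem: class 1 is drawn around
    mean (+a, ..., +a) and class 0 around (-a, ..., -a), a = 2/sqrt(p),
    with unit variance in every coordinate."""
    rng = random.Random(seed)
    a = 2 / math.sqrt(p)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)       # balanced classes
        mu = a if label == 1 else -a
        X.append([rng.gauss(mu, 1.0) for _ in range(p)])
        y.append(label)
    return X, y

X, y = twonorm(100, seed=42)
```

Because every coordinate is exchangeable by construction, these synthetic problems are well suited for detecting variable selection bias: an unbiased method should treat all inputs alike.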
Cite this article
Calhoun, P., Hallett, M.J., Su, X. et al. Random forest with acceptance–rejection trees. Comput Stat 35, 983–999 (2020). https://doi.org/10.1007/s00180-019-00929-4