Abstract
In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed the best, with ER being the second best. For regression problems, RF and SSS performed the best, followed by AR, and then ER at the last. However, each algorithm was most accurate for at least one study. We investigate scenarios where the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
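The acceptance–rejection idea described above can be illustrated with a minimal sketch: draw completely random cutpoints and accept the first whose split quality clears a threshold, rejecting low-quality draws. The quality measure (Gini gain), threshold, and retry cap below are illustrative assumptions, not the authors' exact specification.

```python
import random

def ar_split(x_col, y, quality, threshold, max_draws=50):
    """Draw completely random cutpoints; accept the first draw whose
    split quality reaches the threshold (acceptance-rejection)."""
    lo, hi = min(x_col), max(x_col)
    best = None
    for _ in range(max_draws):
        cut = random.uniform(lo, hi)
        q = quality(x_col, y, cut)
        if q >= threshold:
            return cut, q              # accept this random split
        if best is None or q > best[1]:
            best = (cut, q)            # remember the best rejected draw
    return best                        # fall back if nothing passes

def gini_gain(x_col, y, cut):
    """Toy quality measure: reduction in Gini impurity for a 0/1 response."""
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p = sum(labels) / n
        return 2 * p * (1 - p)
    left = [yi for xi, yi in zip(x_col, y) if xi <= cut]
    right = [yi for xi, yi in zip(x_col, y) if xi > cut]
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Toy usage on a cleanly separable variable
random.seed(0)
x = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
y = [0, 0, 0, 1, 1, 1]
cut, q = ar_split(x, y, gini_gain, threshold=0.2)
```

Because the cutpoint is drawn at random rather than optimized by exhaustive search, the resulting split cannot favor variables with many candidate cutpoints, which is consistent with the reduced variable selection bias reported for AR.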
References
Allwein E, Schapire R, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L (2004) Consistency for a simple model of random forests. Technical report, University of California at Berkeley
Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning, pp 161–168
Caruana R, Karampatziakis N, Yessenalina A (2008) An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th international conference on machine learning, pp 96–103
Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical methods for data analysis. Wadsworth, Belmont
Cutler D, Edwards T Jr, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792
Davis R, Anderson Z (1989) Exponential survival trees. Stat Med 8:947–962
Derrig R, Francis L (2008) Distinguishing the forest from the trees: a comparison of tree-based data mining methods. Variance 2:184–208
Dietterich T, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286
Fan J, Su X, Levine R, Nunn M, LeBlanc M (2006) Trees for correlated survival data by goodness of split, with applications to tooth prognosis. J Am Stat Assoc 101:959–967
Friedman J (2001) Greedy function approximation: the gradient boosting machine. Ann Stat 29:1189–1232
Genuer R, Poggi JM, Tuleau C (2008) Random forests: some methodological insights. arXiv
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
Gordon L, Olshen R (1985) Tree-structured survival analysis. Cancer Treat Rep 69:1065–1069
Hajjem A, Bellavance F, Larocque D (2014) Mixed effects random forest for clustered data. J Stat Comput Simul 84:1313–1328
Hanley J, McNeil B (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
Ho T (1995) Random decision forests. In: Proceedings of the 3rd international conference on document analysis and recognition, vol 1, pp 278–282
Ho T (1998) The random subspace method of constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
Hosmer D, Lemeshow S (1989) Applied logistic regression. Wiley, New York
Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments. J Comput Graph Stat 14:675–699
Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75–118
Ishwaran H, Kogalur UB (2016) Random forests for survival, regression, and classification (RF-SRC). R package version 2.2.0
Ishwaran H, Kogalur U, Gorodeski E, Minn A, Lauer M (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105:205–217
König I, Malley J, Weimar C, Diener HC, Ziegler A (2007) Practical experiences on the necessity of external validation. Stat Med 26:5499–5511
Leisch F, Dimitriadou E (2010) mlbench: Machine Learning Benchmark Problems. R package version 2.1-1
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Malley J, Kruppa J, Dasgupta A, Malley K, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74–81
Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed May 2018
Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1:80–87
Sela R, Simonoff J (2012) RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86:169–207
Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H (2014) Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. Am J Epidemiol 179:764–774
Strobl C, Boulesteix A, Zeileis A, Augustin T (2007a) Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal 52:483–501
Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007b) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform 8:25–46
Su X, Kang J, Liu L, Yang Q, Fan J, Levine R (2016) Smooth sigmoid surrogate (SSS): an alternative to greedy search in recursive partitioning. Comput Stat Data Anal (under review)
Su X, Pena A, Liu L, Levine R (2018) Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med 37:2547–2560
Torgo L (1999) Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto
Yoo W, Ference B, Cote M, Schwartz A (2012) A comparison of logistic regression, logic regression, classification tree, and random forests to identify effective gene–gene and gene–environment interactions. Int J Appl Sci Technol 2:268
Acknowledgements
This research was supported in part by NSF Grant 163310.
Appendix A: datasets
The twenty datasets analyzed to assess prediction accuracy are summarized in Table 8. The Ailerons and Elevators data were taken from Torgo (1999), the Birthwt data were taken from Hosmer and Lemeshow (1989), the Airquality data were taken from Chambers et al. (1983), and the remaining datasets were taken from the UCI Machine Learning Repository (Newman et al. 1998). Most of the datasets are also available in the mlbench R package (Leisch and Dimitriadou 2010). The Breast Cancer dataset had 16 missing values, the Imports85 dataset had 12 missing values, and the Airquality dataset had 42 missing values. Missing values were handled with list-wise deletion for this paper.
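The list-wise deletion used for the datasets with missing values amounts to dropping any observation with at least one missing entry. A minimal sketch (plain Python in place of the paper's R workflow; the data rows are hypothetical):

```python
def listwise_delete(rows):
    """Drop every observation (row) containing a missing value (None),
    as done for the datasets with missing values in this paper."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical toy data: two predictors and a 0/1 response
data = [
    [5.1, 3.5, 0],
    [4.9, None, 1],   # one missing value, so the whole row is dropped
    [6.2, 2.8, 0],
]
complete = listwise_delete(data)
```

In R this corresponds to `na.omit` on a data frame; only complete cases enter the analysis.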
Among the ten binary problems, three are synthetic and seven are real. The Twonorm, Threenorm, and Ringnorm datasets simulate a binary response from continuous inputs that each have the same effect on the response. Four of the real datasets (Breast Cancer, Ionosphere, German Credit, and Birthwt) included nominal or ordinal categorical variables. The binary response datasets ranged from 189 to 4601 observations and from 6 to 60 input variables. For the ten regression problems, half are synthetic with only continuous inputs. The Friedman datasets have nonlinear relationships with interacting variables, and the Ailerons and Elevators datasets were simulated datasets related to the actions of an F16 aircraft. Four of the real datasets (Housing, Servo, Abalone, and Imports85) included nominal or ordinal categorical variables. The regression datasets ranged from 111 to 16,599 observations and from 4 to 40 input variables.
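The Twonorm-style construction can be sketched directly: each class is a unit-variance multivariate normal whose mean is +a or -a in every coordinate, so every input has the same effect on the response. The choice a = 2/sqrt(p) and p = 20 follow the usual mlbench convention and are assumptions here, not a restatement of the paper's exact generator.

```python
import math
import random

def twonorm(n, p=20, seed=None):
    """Simulate a Twonorm-style binary problem: class 1 is drawn around
    mean (+a, ..., +a) and class 0 around (-a, ..., -a), a = 2/sqrt(p),
    with unit variance in every coordinate."""
    rng = random.Random(seed)
    a = 2 / math.sqrt(p)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)       # balanced classes
        mu = a if label == 1 else -a
        X.append([rng.gauss(mu, 1.0) for _ in range(p)])
        y.append(label)
    return X, y

X, y = twonorm(100, seed=42)
```

Because every coordinate is exchangeable by construction, these synthetic problems are well suited for detecting variable selection bias: an unbiased method should treat all inputs alike.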
Cite this article
Calhoun, P., Hallett, M.J., Su, X. et al. Random forest with acceptance–rejection trees. Comput Stat 35, 983–999 (2020). https://doi.org/10.1007/s00180-019-00929-4