Abstract
This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important obstacle in recent efforts to develop statistical inference for machine learning. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble. Our method is shown to improve test set accuracy over random forests by up to 70% on example problems from the UCI repository.
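The correction the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's code: it fits a forest, draws residual-bootstrap responses \(y^* = \hat{F}(X) + e^*\), refits the forest on them to estimate the ensemble's own bias, and subtracts that estimate. The function name, the use of scikit-learn, and the in-sample residuals are our assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def residual_bootstrap_correction(X, y, n_boot=10, n_trees=100, seed=0):
    """Sketch of a residual-bootstrap bias correction for a forest:
    refit the ensemble on simulated responses y* = F_hat(X) + e*,
    average those refits, and subtract the implied bias estimate."""
    rng = np.random.default_rng(seed)
    base = RandomForestRegressor(n_estimators=n_trees, random_state=seed).fit(X, y)
    fitted = base.predict(X)
    residuals = y - fitted

    def predict(X_new):
        boot = np.zeros(len(X_new))
        for b in range(n_boot):
            # residual bootstrap: resample residuals onto the fitted surface
            y_star = fitted + rng.choice(residuals, size=len(y), replace=True)
            fb = RandomForestRegressor(n_estimators=n_trees,
                                       random_state=seed + b + 1)
            boot += fb.fit(X, y_star).predict(X_new) / n_boot
        # bias-corrected prediction: F_hat - (E* F* - F_hat) = 2 F_hat - E* F*
        return 2.0 * base.predict(X_new) - boot

    return predict
```

Each bootstrap replicate retrains the full ensemble, which is why the paper's approximation (reusing tree structure) matters: it brings the cost down to roughly double that of the original fit.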
References
Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9, 2015–2033 (2008)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Brooks, T.F., Pope, D.S., Marcolini, M.A.: Airfoil Self-Noise and Prediction, vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division (1989)
Cortez, P., Morais, A.: A data mining approach to predict forest fires using meteorological data. In: Neves, J., Santos, M.F., Machado, J. (eds.) New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 – Portuguese Conference on Artificial Intelligence, pp. 512–523. APPIA, Guimaraes (2007)
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47(4), 547–553 (2009)
Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26 (1979)
Efron, B.: Estimation and accuracy after model selection. J. Am. Stat. Assoc. 109(507), 991–1007 (2014)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, New York (1993)
Eubank, R.L.: Nonparametric Regression and Spline Smoothing. CRC Press, New York (1990)
Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. (2013). doi:10.1007/s13748-013-0040-3
Freedman, D.A., et al.: Bootstrapping regression models. Ann. Stat. 9(6), 1218–1228 (1981)
Gerritsma, J., Onnink, R., Versluis, A.: Geometry, Resistance and Stability of the Delft Systematic Yacht Hull Series. Delft University of Technology, Amsterdam (1981)
Hall, P.: The Bootstrap and Edgeworth Expansion. Springer, Berlin (1992a)
Hall, P.: On bootstrap confidence intervals in nonparametric regression. Ann. Stat. 20, 695–711 (1992b)
Hall, P., Horowitz, J.: A simple bootstrap method for constructing nonparametric confidence bands for functions. Ann. Stat. 41(4), 1892–1921 (2013)
Härdle, W., Bowman, A.W.: Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. J. Am. Stat. Assoc. 83(401), 102–110 (1988)
Harrison, D., Rubinfeld, D.L.: Hedonic prices and the demand for clean air. J. Environ. Econ. Manag. 5, 81–102 (1978)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/
Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Little, M.A., McSharry, P.E., Roberts, S.J., Costello, D.A., Moroz, I.M., et al.: Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMed. Eng. OnLine 6(1), 23 (2007)
Mentch, L., Hooker, G.: Formal hypothesis tests for additive structure in random forests. J. Comput. Graph. Stat. (2016a). (In Press)
Mentch, L., Hooker, G.: Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17(26), 1–41 (2016b)
Quinlan, J.R.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243 (1993)
Redmond, M., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. Eur. J. Oper. Res. 141(3), 660–678 (2002)
Scornet, E.: On the asymptotics of random forests (2014). arXiv:1409.2090
Scornet, E., Biau, G., Vert, J.P.: Consistency of random forests. Ann. Stat. 43(4), 1716–1741 (2015)
Sexton, J., Laake, P.: Standard errors for bagged and random forest estimators. Comput. Stat. Data Anal. 53(3), 801–811 (2009)
Thompson, J.J., Blair, M.R., Chen, L., Henrey, A.J.: Video game telemetry as a critical tool in the study of complex skill learning. PLoS ONE 8(9), e75129 (2013)
Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014)
Wager, S.: Asymptotic theory for random forests (2014). arXiv:1405.0352
Wager, S., Hastie, T., Efron, B.: Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15(1), 1625–1651 (2014)
Yeh, I.C.: Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12), 1797–1808 (1998)
Acknowledgements
Supported by NSF grants DMS 1053252 and DEB 1353039.
Electronic supplementary material
Appendices
Appendix 1: Proof of Theorem 1
Proof
We begin by writing the prediction at x from an individual tree as
$$T(x;\Omega _b) = \sum _{i=1}^n W_i(x,\Omega _b) Y_i, \qquad W_i(x,\Omega _b) = \frac{L(x,X_i,\Omega _b)}{N(x,\Omega _b)},$$
where \(\Omega _b\) is the realization of a random variable that describes both the selection of bootstrap or subsamples used in learning the tree \(T_b\) and any additional random variables involved in the learning process (e.g., the selection of candidate split variables in RF). Here \(L(x,X_i,\Omega _b)\) is the indicator that x and \(X_i\) are in the same leaf of a tree learned with randomization parameters \(\Omega _b\), and \(N(x,\Omega _b)\) is the number of observations in the same leaf as x. We will also write
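These per-tree weights can be computed directly from a fitted tree by comparing leaf memberships. The helper below is our own illustration (not the paper's code), using a scikit-learn regression tree: `tree.apply` returns the leaf index of each point, so \(L(x,X_i,\Omega)\) is a leaf-equality indicator and \(N(x,\Omega)\) its count.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_weights(tree, X_train, x):
    """W_i(x) = L(x, X_i) / N(x): the indicator that X_i shares x's leaf,
    divided by the number of training points in that leaf."""
    leaves = tree.apply(X_train)             # leaf id of each training point
    leaf_x = tree.apply(x.reshape(1, -1))[0] # leaf id of the query point
    same = (leaves == leaf_x).astype(float)  # L(x, X_i)
    return same / same.sum()                 # divide by N(x)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] + rng.normal(size=100)
t = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
w = tree_weights(t, X, X[0])
# weights sum to one, and the weighted response reproduces the tree's prediction
print(np.allclose(w.sum(), 1.0), np.allclose(w @ y, t.predict(X[:1])[0]))
```

Since a regression tree predicts the mean response in a leaf, the weighted sum \(\sum_i W_i(x,\Omega) Y_i\) recovers the tree's prediction exactly, which is the representation the proof relies on.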
$$W_i(x) = \frac{1}{B} \sum _{b=1}^B W_i(x,\Omega _b)$$
as the average weight on \(Y_i\) across all resamples, so that
$$\hat{F}_B(x) = \sum _{i=1}^n W_i(x) Y_i.$$
Note that \(\sum _{i=1}^n W_i(x,\Omega _b) = 1\) for every \(\Omega _b\), and hence \(\sum _{i=1}^n W_i(x) = 1\).
We can similarly write a residual-bootstrap tree as
with the corresponding quantities
where we also have
Using these quantities we can write \(\hat{F}^c_{BB_o}(x)\) as
Hence, letting \(\hbox {Var}_{\Omega }(W_i(x,\Omega ))\) indicate variance with respect to only the randomization parameters \(\Omega \), writing \(Y_i = F(X_i) + \epsilon _i\) and observing that \(0 \le W_i(x,\Omega ) \le 1\), \(0 \le V_{ij}(x,\Omega ) \le 1\):
for
Here we use the fact that for \(\epsilon _1,\ldots ,\epsilon _n \sim N(0,1)\), \(E \left( \max _i \epsilon _i^2\right) \le 1 + 4 \log (n)\) (Boucheron et al. 2013).
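The Gaussian maximum bound used in this last step is easy to verify numerically. The following is a quick Monte Carlo sanity check (our own illustration, not part of the proof):

```python
import numpy as np

# Monte Carlo check of E[max_i eps_i^2] <= 1 + 4 log(n) for standard normals
rng = np.random.default_rng(0)
n, reps = 1000, 2000
eps = rng.standard_normal((reps, n))
emp = (eps ** 2).max(axis=1).mean()  # empirical estimate of E[max_i eps_i^2]
bound = 1 + 4 * np.log(n)
print(bool(emp <= bound))
```

For n = 1000 the empirical mean is near \(2\log n \approx 13.8\), comfortably below the bound of roughly 28.6, consistent with the concentration inequality cited above.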
Appendix 2: Details of case study datasets
After processing each dataset as described below, we employed 10-fold cross-validation to obtain cross-validated squared error for both \(\hat{F}_B\) and \(\hat{F}^c_{BB_o}\), removing the final data entries to create equal-sized folds. To maintain comparability, the same folds were used for both estimates. We set \(B = 1000\) and \(B_o = 2000\); the results were insensitive to setting \(B_o = 1000\) or \(B_o = 5000\).
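The shared-fold protocol can be sketched as a small comparison harness. This is a generic illustration, not the paper's code; `fit_a` and `fit_b` are placeholders for the two estimators, each taking training data and returning a prediction function.

```python
import numpy as np
from sklearn.model_selection import KFold

def compare_cv_mse(X, y, fit_a, fit_b, n_splits=10):
    """Cross-validated squared error for two estimators on identical folds."""
    n = (len(y) // n_splits) * n_splits  # drop trailing rows: equal-sized folds
    X, y = X[:n], y[:n]
    kf = KFold(n_splits=n_splits, shuffle=False)
    err_a = err_b = 0.0
    for train, test in kf.split(X):      # the same folds for both estimators
        err_a += np.mean((fit_a(X[train], y[train])(X[test]) - y[test]) ** 2)
        err_b += np.mean((fit_b(X[train], y[train])(X[test]) - y[test]) ** 2)
    return err_a / n_splits, err_b / n_splits
```

Reusing one `KFold` split for both estimators removes fold-to-fold variation from the comparison, so any difference in the two averaged errors reflects the estimators rather than the partition.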
Below we detail each dataset and the processing steps taken for it; unless processing is noted, data were taken as-is from the UCI repository (Lichman 2013).
 Airfoil:

42% improvement over RF. Task is to predict sound pressure in decibels of airfoils at various wind tunnel speeds and angles of attack (Brooks et al. 1989). 1503 observations, 5 features.
 Auto-mpg:

6% improvement over RF. Task is to predict city-cycle fuel consumption in miles per gallon from physical car and engine characteristics (Quinlan 1993). Rows missing horsepower were removed, resulting in 392 examples with 8 features, 3 of which are discrete.
 BikeSharing-hour:

34% improvement over RF. Prediction of the number of rental bikes used each hour in a bike-sharing system (Fanaee-T and Gama 2013). Date and Season (columns 2 and 3) were removed from the features as duplicating information, leaving 13 covariates related to time, weather, and number of users. 17389 examples; the prediction task was for log counts.
 Communities:

−1% improvement over RF. Prediction of the per-capita rate of violent crime in U.S. cities (Redmond and Baveja 2002). 1993 examples, 96 features. 30 of the original 125 features were removed due to high missingness, including state, county, and data associated with police statistics. One row (Natchezcity) was deleted due to missing values. Cross-validation was done using independently generated folds.
 CCPP:

8% improvement over RF. Prediction of net hourly output from Combined Cycle Power Plants (Tüfekci 2014). 4 features and 9568 examples.
 Concrete:

3% improvement over RF. Prediction of concrete compressive strength from constituent components (Yeh 1998). 9 features, 1030 examples.
 Forestfires:

−8% improvement over RF. Prediction of log(area+1) burned by forest fires from location, date, and weather attributes (Cortez and Morais 2007). 517 examples, 13 features. Not reported in the main paper because Random Forest predictions had 15% higher squared error than a constant prediction function.
 Housing:

9% improvement over RF. Predict median housing prices from demographic and geographic features for suburbs of Boston (Harrison and Rubinfeld 1978). The response was taken to be the log of median house prices. 506 examples, 14 attributes.
 Parkinsons:

3% improvement over RF. Prediction of Motor UPDRS from voice monitoring data in early-stage Parkinson's patients (Little et al. 2007). Removed features for age, sex, test time, and Total UPDRS, resulting in 15 features and 5875 examples.
 SkillCraft:

−1% improvement over RF. Predict the league index of gamers playing SkillCraft based on playing statistics (Thompson et al. 2013). Entries with NAs were removed, resulting in 3338 examples and 18 features.
 Winequality-white:

5% improvement over RF. Predict expert quality score on white wines based on 11 measures of wine composition (Cortez et al. 2009). 4898 examples.
 Winequality-red:

3% improvement over RF. As in Winequality-white, for red wines (Cortez et al. 2009). 1599 examples.
 Yacht-hydrodynamics:

70% improvement over RF. Predict residuary resistance per unit weight of displacement of sailing yachts from hull geometry (Gerritsma et al. 1981). 308 examples, 7 features.
About this article
Cite this article
Hooker, G., Mentch, L.: Bootstrap bias corrections for ensemble methods. Stat Comput 28, 77–86 (2018). https://doi.org/10.1007/s11222-016-9717-3
Keywords
 Bagging
 Ensemble methods
 Bias correction
 Bootstrap