Bootstrap bias corrections for ensemble methods


This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is a central challenge in recent efforts to develop statistical inference for machine learning. We demonstrate empirically that the proposed bootstrap bias correction can yield substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble. Our method is shown to improve test set accuracy over random forests by up to 70% on example problems from the UCI repository.


Fig. 1


References

  1. Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9, 2015–2033 (2008)

  2. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)


  3. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)


  4. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)


  5. Brooks, T.F., Pope, D.S., Marcolini, M.A.: Airfoil Self-Noise and Prediction, vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division (1989)

  6. Cortez, P., Morais, A.: A data mining approach to predict forest fires using meteorological data. In: Neves, J., Santos, M.F., Machado, J. (eds.) New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, pp. 512–523. APPIA, Guimaraes (2007)

  7. Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47(4), 547–553 (2009)


  8. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26 (1979)


  9. Efron, B.: Estimation and accuracy after model selection. J. Am. Stat. Assoc. 109(507), 991–1007 (2014)


  10. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. CRC Press, New York (1993)


  11. Eubank, R.L.: Nonparametric Regression and Spline Smoothing. CRC Press, New York (1990)


  12. Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. (2013). doi:10.1007/s13748-013-0040-3

  13. Freedman, D.A.: Bootstrapping regression models. Ann. Stat. 9(6), 1218–1228 (1981)

  14. Gerritsma, J., Onnink, R., Versluis, A.: Geometry, Resistance and Stability of the Delft Systematic Yacht Hull Series. Delft University of Technology, Amsterdam (1981)


  15. Hall, P.: The Bootstrap and Edgeworth Expansion. Springer, Berlin (1992a)


  16. Hall, P.: On bootstrap confidence intervals in nonparametric regression. Ann. Stat. 20, 695–711 (1992b)


  17. Hall, P., Horowitz, J.: A simple bootstrap method for constructing nonparametric confidence bands for functions. Ann. Stat. 41(4), 1892–1921 (2013)


  18. Härdle, W., Bowman, A.W.: Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. J. Am. Stat. Assoc. 83(401), 102–110 (1988)


  19. Harrison, D., Rubinfeld, D.L.: Hedonic prices and the demand for clean air. J. Environ. Econ. Manag. 5, 81–102 (1978)


  20. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)

  21. Lichman, M.: UCI machine learning repository (2013)

  22. Little, M.A., McSharry, P.E., Roberts, S.J., Costello, D.A., Moroz, I.M., et al.: Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMed. Eng. OnLine 6(1), 23 (2007)


  23. Mentch, L., Hooker, G.: Formal hypothesis tests for additive structure in random forests. J. Comput. Graph. Stat. (2016a). (In press)

  24. Mentch, L., Hooker, G.: Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17(26), 1–41 (2016b)


  25. Quinlan, J.R.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243 (1993)

  26. Redmond, M., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. Eur. J. Oper. Res. 141(3), 660–678 (2002)


  27. Scornet, E.: On the asymptotics of random forests (2014). arXiv:1409.2090

  28. Scornet, E., Biau, G., Vert, J.P.: Consistency of random forests. Ann. Stat. 43(4), 1716–1741 (2015)


  29. Sexton, J., Laake, P.: Standard errors for bagged and random forest estimators. Comput. Stat. Data Anal. 53(3), 801–811 (2009)


  30. Thompson, J.J., Blair, M.R., Chen, L., Henrey, A.J.: Video game telemetry as a critical tool in the study of complex skill learning. PLoS ONE 8(9), e75129 (2013)


  31. Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014)

  32. Wager, S.: Asymptotic theory for random forests (2014). arXiv:1405.0352

  33. Wager, S., Hastie, T., Efron, B.: Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15(1), 1625–1651 (2014)


  34. Yeh, I.C.: Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12), 1797–1808 (1998)




Supported by NSF grants DMS 1053252 and DEB 1353039.

Author information



Corresponding author

Correspondence to Giles Hooker.

Electronic supplementary material


Appendix 1: Proof of Theorem 1


We begin by writing the prediction at x from an individual tree as

$$\begin{aligned} T_b(x,\Omega _b)&= \sum _{i=1}^n \frac{ L(x,X_i,\Omega _b) }{N(x,\Omega _b)} Y_i \\&= \sum _{i=1}^n W_i(x,\Omega _b) Y_i, \end{aligned}$$

where \(\Omega _b\) is the realization of a random variable that describes both the selection of bootstrap or subsamples used in learning the tree \(T_b\) as well as any additional random variables involved in the learning process (e.g., the selection of candidate split variables in RF). Here \(L(x,X_i,\Omega _b)\) is the indicator that x and \(X_i\) are in the same leaf of a tree learned with randomization parameters \(\Omega _b\) and \(N(x,\Omega _b)\) is the number of observations in the same leaf as x. We will also write

$$\begin{aligned} \bar{W}_i^{B}(x) = \frac{1}{B} \sum _{b=1}^B W_i (x,\Omega _b) \end{aligned}$$

as the average weight on \(Y_i\) across all resamples so that

$$\begin{aligned} \hat{F}_B(x) = \sum _{i=1}^n \bar{W}_i^{B}(x)Y_i. \end{aligned}$$

Note that

$$\begin{aligned} \sum _{i=1}^n W_i(x,\Omega _b) = \sum _{i=1}^n \bar{W}_i^{B}(x) = 1. \end{aligned}$$
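The weight representation above can be checked numerically on a fitted tree. The following is an illustrative sketch only (it assumes scikit-learn's `DecisionTreeRegressor` as a stand-in tree learner, with weights recovered from leaf co-membership via `tree.apply`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

x0 = X[:1]                       # query point x
leaf_train = tree.apply(X)       # leaf index of each training point
leaf_x0 = tree.apply(x0)[0]      # leaf index of x
L = (leaf_train == leaf_x0)      # indicator L(x, X_i): same leaf as x
W = L / L.sum()                  # weights W_i(x) = L(x, X_i) / N(x)

# weights sum to one, and the weighted average of Y reproduces the
# tree's prediction (the mean response in the leaf containing x)
assert np.isclose(W.sum(), 1.0)
assert np.isclose(W @ y, tree.predict(x0)[0])
```

Here a single tree on the full sample plays the role of \(T_b\); in the ensemble setting each \(\Omega _b\) would additionally encode the resample and split randomization.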

We can similarly write a residual-bootstrap tree as

$$\begin{aligned} T_{b^o}^o(x)&= \sum _{i=1}^n \sum _{j=1}^n V_{ij}(x,\Omega _{b^o})Y_i^o \\&= \sum _{i=1}^n \sum _{j=1}^n V_{ij}(x,\Omega _{b^o})[ \hat{F}_B(X_i) + (Y_j - \hat{F}_B(X_j)) ] \end{aligned}$$

with the corresponding quantities

$$\begin{aligned} \bar{V}_{ij}^{B_o}(x) = \frac{1}{B_o} \sum _{b^o=1}^{B_o} V_{ij}(x,\Omega _{b^o}) \end{aligned}$$

where we also have

$$\begin{aligned} \sum _{i=1}^n \sum _{j=1}^n V_{ij}(x,\Omega _{b^o}) = \sum _{i=1}^n \sum _{j=1}^n \bar{V}_{ij}^{B_o}(x) = 1. \end{aligned}$$

Using these quantities we can write \(\hat{F}^c_{BB_o}(x)\) as

$$\begin{aligned}&2 \hat{F}_B(x) - \hat{F}^o_{B_o}(x) \\&\quad = \sum _{i=1}^n 2 \bar{W}_i^{B}(x) Y_i \\&\quad \quad - \sum _{i=1}^n \sum _{j=1}^n \bar{V}_{ij}^{B_o}(x) [ \hat{F}_B(X_i) + (Y_j - \hat{F}_B(X_j)) ] \\&\quad = \sum _{i=1}^n 2\bar{W}_i^{B}(x) Y_i - \sum _{i=1}^n \sum _{j=1}^n \bar{V}^{B_o}_{ij}(x)Y_i \\&\quad \quad + \sum _{i=1}^n \sum _{j=1}^n \sum _{k=1}^n \bar{V}_{kj}^{B_o}(x) \left( \bar{W}_i^{B}(X_k) - \bar{W}_i^{B}(X_j) \right) Y_i. \end{aligned}$$

Hence, letting \(\hbox {Var}_{\Omega }(W_i(x,\Omega ))\) indicate variance with respect to only the randomization parameters \(\Omega \), writing \(Y_i = F(X_i) + \epsilon _i\) and observing that \(0 \le W_i(x,\Omega ) \le 1\), \(0 \le V_{ij}(x,\Omega ) \le 1\):

$$\begin{aligned}&E \left( \hat{F}^c - \hat{F}^c_{\infty } \right) ^2 \\&\quad \le \frac{8}{B} E_Y \hbox {Var}_{\Omega }\left( \sum _{i=1}^n W_i(x,\Omega ) Y_i\right) \\&\quad \quad + \frac{2}{B} E_Y \hbox {Var}_{\Omega }\left( \sum _{i=1}^n \sum _{j=1}^n V_{ij}(x,\Omega )Y_i \right) \\&\quad \quad + \frac{2}{B_o} E_Y \hbox {Var}_{\Omega _{b^o},\Omega _b} \left( \sum _{i=1}^n \sum _{j=1}^n \sum _{k=1}^n H_{ijk} \right) \\&\quad \le \frac{8}{B} \left[ 2 \max _{ij} (F(X_i)-F(X_j))^2 + 2 \max _{ij} (\epsilon _i - \epsilon _j)^2 \right] \\&\quad \quad + \frac{2}{B_o} \left[ 16 \max _{ij} (F(X_i)-F(X_j))^2 + 10 \max _{ij} (\epsilon _i - \epsilon _j)^2 \right] \\&\quad \le \frac{64}{B} \left[ ||F||_{\infty }^2 + \sigma ^2(1+4 \log (n)) \right] \\&\quad \quad + \frac{80}{B_o} \left[ ||F||_{\infty }^2 + \sigma ^2(1+ 4 \log (n)) \right] \end{aligned}$$

where

$$\begin{aligned} H_{ijk} = V_{kj}(x,\Omega _{b^o}) \left( W_i(X_k,\Omega _b) - W_i(X_j,\Omega _b) \right) Y_i. \end{aligned}$$

Here we use the fact that for \(\epsilon _1,\ldots ,\epsilon _n{\,\sim \,}N(0,1)\), \(E \left( \max _i \epsilon _i^2\right) \le 1 + 4 \log (n)\) (Boucheron et al. 2013).
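This Gaussian maximal inequality is easy to verify by simulation; the following Monte Carlo sketch is illustrative only and not part of the proof:

```python
import numpy as np

# Monte Carlo check of E[max_i eps_i^2] <= 1 + 4*log(n) for eps_i ~ N(0, 1)
rng = np.random.default_rng(1)
n, reps = 100, 20000
eps = rng.standard_normal((reps, n))

emp_mean_max = (eps ** 2).max(axis=1).mean()  # empirical E[max_i eps_i^2]
bound = 1 + 4 * np.log(n)                     # the stated upper bound

# the empirical mean of the maximum sits comfortably below the bound
assert emp_mean_max <= bound
```

For n = 100 the bound is roughly 19.4, which is loose but sufficient for the rate stated in the theorem.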

Appendix 2: Details of case study datasets

After processing each dataset as described below, we employed 10-fold cross-validation to obtain cross-validated squared error for both \(\hat{F}_B\) and \(\hat{F}^c_{BB_o}\), removing the final data entries to create equal-sized folds. To maintain comparability, the same folds were used for both estimates. We set \(B = 1000\) and \(B_o = 2000\), but these results were insensitive to setting \(B_o = 1000\) or \(B_o = 5000\).
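The comparison between \(\hat{F}_B\) and \(\hat{F}^c_{BB_o}\) can be sketched as follows. This is a simplified illustration, not the paper's exact estimator: it assumes scikit-learn's `RandomForestRegressor`, uses in-sample (rather than out-of-bag) residuals, and draws a single shared residual resample for the second forest instead of an independent resample per tree:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bias_corrected_rf(X, y, X_test, B=200, B_o=200, seed=0):
    """Residual-bootstrap bias correction: F_c(x) = 2*F_B(x) - F_o(x)."""
    rng = np.random.default_rng(seed)
    rf = RandomForestRegressor(n_estimators=B, random_state=seed).fit(X, y)
    fitted = rf.predict(X)
    resid = y - fitted                       # in-sample residuals
    # residual bootstrap: fitted values plus resampled centered residuals
    y_o = fitted + rng.choice(resid - resid.mean(), size=len(y), replace=True)
    rf_o = RandomForestRegressor(n_estimators=B_o, random_state=seed + 1).fit(X, y_o)
    # twicing: subtract the second forest's estimate of the bias
    return 2 * rf.predict(X_test) - rf_o.predict(X_test)

# toy usage on a held-out split
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.sin(3 * X[:, 0]) + 0.2 * rng.standard_normal(300)
pred = bias_corrected_rf(X[:250], y[:250], X[250:])
```

Training the second forest is what makes the correction cost roughly double that of the original ensemble.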

Below we detail each dataset and the processing steps taken for it; unless processing is noted, data were taken as-is from the UCI repository (Lichman 2013).


Airfoil Self-Noise. 42% improvement over RF. The task is to predict the sound pressure level, in decibels, of airfoils at various wind tunnel speeds and angles of attack (Brooks et al. 1989). 1503 observations, 5 features.


Auto MPG. 6% improvement over RF. The task is to predict city-cycle fuel consumption in miles per gallon from physical car and engine characteristics (Quinlan 1993). Rows missing horsepower were removed, resulting in 392 examples with 8 features, 3 of which are discrete.


Bike Sharing. 34% improvement over RF. Prediction of the number of rental bikes used each hour in a bike-sharing system (Fanaee-T and Gama 2013). Date and Season (columns 2 and 3) were removed from the features as duplicating information, leaving 13 covariates related to time, weather, and number of users. 17389 examples; the prediction task was for log counts.


Communities and Crime. −1% improvement over RF. Prediction of the per-capita rate of violent crime in U.S. cities (Redmond and Baveja 2002). 1993 examples, 96 features. 30 of the original 125 features were removed due to high missingness, including state, county, and data associated with police statistics. One row (Natchezcity) was deleted due to missing values. Cross-validation was done using independently generated folds.


Combined Cycle Power Plant. 8% improvement over RF. Prediction of net hourly output from combined cycle power plants (Tüfekci 2014). 4 features and 9568 examples.


Concrete Compressive Strength. 3% improvement over RF. Prediction of concrete compressive strength from constituent components (Yeh 1998). 9 features, 1030 examples.


Forest Fires. −8% improvement over RF. Prediction of log(area+1) burned by forest fires from location, date, and weather attributes (Cortez and Morais 2007). 517 examples, 13 features. Not reported in the main paper because the random forest predictions had 15% higher squared error than a constant prediction function.


Boston Housing. 9% improvement over RF. Prediction of median housing prices from demographic and geographic features for suburbs of Boston (Harrison and Rubinfeld 1978). The response was taken to be the log of median house prices. 506 examples, 14 attributes.


Parkinsons Telemonitoring. 3% improvement over RF. Prediction of Motor UPDRS from voice monitoring data in early-stage Parkinson's patients (Little et al. 2007). Removed features for age, sex, test time, and Total UPDRS, resulting in 15 features and 5875 examples.


SkillCraft1. −1% improvement over RF. Prediction of the league index of StarCraft players based on playing statistics (Thompson et al. 2013). Entries with NAs were removed, resulting in 3338 examples and 18 features.


winequality-white. 5% improvement over RF. Prediction of expert quality scores for white wines based on 11 measures of wine composition (Cortez et al. 2009). 4898 examples.


winequality-red. 3% improvement over RF. As in winequality-white, but for red wines (Cortez et al. 2009). 1599 examples.


Yacht Hydrodynamics. 70% improvement over RF. Prediction of the residuary resistance per unit weight of displacement of sailing yachts from hull geometry (Gerritsma et al. 1981). 308 examples, 7 features.


About this article


Cite this article

Hooker, G., Mentch, L. Bootstrap bias corrections for ensemble methods. Stat Comput 28, 77–86 (2018).

Keywords


  • Bagging
  • Ensemble methods
  • Bias correction
  • Bootstrap