Bayesian Additive Regression Trees using Bayesian model averaging

Abstract

Bayesian Additive Regression Trees (BART) is a statistical sum of trees model. It can be considered a Bayesian version of machine learning tree ensemble methods where the individual trees are the base learners. However, for datasets where the number of variables p is large, the algorithm can become inefficient and computationally expensive. Another method which is popular for high-dimensional data is random forests, a machine learning algorithm which grows trees using a greedy search for the best split points. However, its default implementation does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large p. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm which can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the “small n large p” scenario which is common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments, one to distinguish between patients with cardiovascular disease and controls and another to classify aggressive from non-aggressive prostate cancer. We compare our results with those of the main competing methods. Open source code written in R and Rcpp to run BART-BMA can be found at: https://github.com/BelindaHernandez/BART-BMA.git.

References

  • Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 88(422), 669–679 (1993)

  • Archer, K., Kimes, R.: Empirical characterization of random forest variable importance measures. Comput. Stat. Data Anal. 52(4), 2249–2260 (2008). doi:10.1016/j.csda.2007.08.015

  • Beaumont, M.A., Rannala, B.: The Bayesian revolution in genetics. Nat. Rev. Genet. 5(4), 251–261 (2004)

  • Bleich, J., Kapelner, A., George, E.I., Jensen, S.T.: Variable selection for BART: an application to gene regulation. Ann. Appl. Stat. 8(3), 1750–1781 (2014)

  • Breiman, L.: Bagging predictors. Mach. Learn. 26, 123–140 (1996a)

  • Breiman, L.: Stacked regressions. Mach. Learn. 24, 41–64 (1996b)

  • Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). doi:10.1186/1478-7954-9-29

  • Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont (1984)

  • Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

  • Chipman, H., George, E.I., McCulloch, R.E.M.: Bayesian CART model search. J. Am. Stat. Assoc. 93(443), 935–948 (1998)

  • Chipman, H., George, E.I., Mcculloch, R.E.M.: BART: Bayesian additive regression trees. Ann. Appl. Stat. 4(1), 266–298 (2010)

  • Chipman, H., McCulloch, R., Dorie, V.: Package dbarts (2014). https://cran.r-project.org/web/packages/dbarts/dbarts.pdf

  • Cortes, I.: Package conformal (2014). https://cran.r-project.org/web/packages/conformal/conformal.pdf

  • Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 3 (2006). doi:10.1186/1471-2105-7-3

  • Friedman, J.H.: Multivariate adaptive regression splines (with discussion and a rejoinder by the author). Ann. Stat. 19, 1–67 (1991)

  • Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001). doi:10.1214/aos/1013203451

  • Fujikoshi, Y., Ulyanov, V.V., Shimizu, R.: Multivariate Statistics: High-Dimensional and Large-Sample Approximations, vol. 760. Wiley, Hoboken (2011)

  • Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)

  • Ham, J., Chen, Y., Crawford, M.M., Ghosh, J.: Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 43(3), 492–501 (2005). doi:10.1109/TGRS.2004.842481

  • Harris, K., Girolami, M., Mischak, H.: Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, chap. Definition of Valid Proteomic Biomarkers: A Bayesian Solution, pp. 137–149. Springer, Berlin (2009)

  • Hawkins, D.M.: Fitting multiple change-point models to data. Comput. Stat. Data Anal. 37(3), 323–341 (2001)

  • Hernández, B., Parnell, A.C., Pennington, S.R.: Why have so few proteomic biomarkers “survived” validation? (sample size and independent validation considerations). Proteomics 14(13–14), 1587–1592 (2014)

  • Hernández, B., Pennington, S.R., Parnell, A.C.: Bayesian methods for proteomic biomarker development. EuPA Open Proteomics 9, 54–64 (2015)

  • Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K.: Algorithm runtime prediction: methods & evaluation. Artif. Intell. 206, 79–111 (2014)

  • Johansson, U., Boström, H., Löfström, T., Linusson, H.: Regression conformal prediction with random forests. Mach. Learn. 97(1–2), 155–176 (2014)

  • Kapelner, A., Bleich, J.: bartMachine: machine learning with Bayesian additive regression trees. arXiv e-prints (2014a)

  • Kapelner, A., Bleich, J.: Package bartMachine (2014b). http://cran.r-project.org/web/packages/bartMachine/bartMachine.pdf

  • Killick, R., Eckley, I., Haynes, K., Fearnhead, P.: Package changepoint (2014). http://cran.r-project.org/web/packages/changepoint/changepoint.pdf

  • Killick, R., Fearnhead, P., Eckley, I.: Optimal detection of changepoints with a linear computational cost. J. Am. Stat. Assoc. 107(500), 1590–1598 (2012)

  • Lakshminarayanan, B., Roy, D.M., Teh, Y.W.: Particle Gibbs for Bayesian additive regression trees. arXiv preprint arXiv:1502.04622 (2015)

  • Lakshminarayanan, B., Roy, D.M., Teh, Y.W.: Mondrian forests for large-scale regression when uncertainty matters. In: Artificial Intelligence and Statistics, pp. 1478–1487 (2016). arXiv:1506.03805

  • Liaw, A., Matthew, W.: Package randomForest (2015). http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

  • Logothetis, C.J., Gallick, G.E., Maity, S.N., Kim, J., Aparicio, A., Efstathiou, E., Lin, S.H.: Molecular classification of prostate cancer progression: foundation for marker-driven treatment of prostate cancer. Cancer Discov. 3(8), 849–861 (2013)

  • Lynch, C.: Big data: how do your data grow? Nature 455(7209), 28–29 (2008)

  • Madigan, D., Raftery, A.E.: Model selection and accounting for model uncertainty in graphical models using Occam’s window. J. Am. Stat. Assoc. 89(428), 1535–1546 (1994)

  • Meinshausen, N.: Quantile regression forests. J. Mach. Learn. Res. 7, 983–999 (2006)

  • Morgan, J.N.: History and potential of binary segmentation for exploratory data analysis. J. Data Sci. 3, 123–136 (2005)

  • Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data and a proposal. J. Am. Stat. Assoc. 58(302), 415–434 (1963)

  • Nicodemus, K.K., Malley, J.D., Strobl, C., Ziegler, A.: The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinform. 11, 110 (2010). doi:10.1186/1471-2105-11-110

  • Norinder, U., Carlsson, L., Boyer, S., Eklund, M.: Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. J. Chem. Inf. Model. 54(6), 1596–1603 (2014)

  • Pratola, M.: Efficient Metropolis–Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Anal. 11(3), 885–911 (2016)

  • Quinlan, J.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). doi:10.1023/A:1022643204877

  • Quinlan, J.R.: Discovering rules by induction from large collections of examples. In: Michie, D. (ed.) Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh (1979)

  • Raghavan, V., Bollmann, P., Jung, G.S.: A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst. (TOIS) 7(3), 205–229 (1989)

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)

  • Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P.: Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43(6), 1947–1958 (2003). doi:10.1021/ci034160g

  • Wager, S., Hastie, T., Efron, B.: Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15(1), 1625–1651 (2014)

  • Wilkinson, D.J.: Bayesian methods in bioinformatics and computational systems biology. Brief. Bioinform. 8(2), 109–16 (2007). doi:10.1093/bib/bbm007

  • Wu, Y., Tjelmeland, H., West, M.: Bayesian CART: prior specification and posterior simulation. J. Comput. Graph. Stat. 16(1), 44–66 (2007)

  • Yao, Y.: Estimation of a noisy discrete-time step function: Bayes and empirical Bayes approaches. Ann. Stat. 4(12), 1434–1447 (1984)

  • Zhao, T., Liu, H., Roeder, K., Lafferty, J., Wasserman, L.: The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13(1), 1059–1062 (2012)

Acknowledgements

We would like to thank Drs Chris Watson, John Baugh, Mark Ledwidge and Professor Kenneth McDonald for kindly allowing us to use the cardiovascular dataset described. Hernández’s research was supported by the Irish Research Council. Raftery’s research was supported by NIH Grant Nos. R01-HD054511, R01-HD070936, and U54-HL127624, and by a Science Foundation Ireland E.T.S. Walton visitor award, Grant Reference 11/W.1/I2079. Protein biomarker discovery work in the Pennington Biomedical Proteomics Group is supported by grants from Science Foundation Ireland (for mass spectrometry instrumentation), the Irish Cancer Society (PCI11WAT), St Luke’s Institute for Cancer Research, the Health Research Board (HRA_POR/2011/125), Movember GAP1 and the EU FP7 (MIAMI). The UCD Conway Institute is supported by the Program for Research in Third Level Institutions as administered by the Higher Education Authority of Ireland.

Author information

Corresponding author

Correspondence to Belinda Hernández.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 5 KB)

Appendices

Appendix A: BART full conditional distribution \(p(R_j|X,T_j,\sigma ^{-2})\)

Using the forms of \(p(\mu _{ij})\) and \(p(\sigma ^{-2})\) described in Sect. 2.2.2 gives rise to the following full conditional distribution of the partial residuals for the BART model:

$$\begin{aligned} p(R_j|X,T_j,\sigma ^{-2}) \propto{}& \prod _{i=1}^{b}\left( n_i\sigma ^{-2} +\left( \frac{0.5}{e\sqrt{m}}\right) ^{-2}\right) ^{-\frac{1}{2}}\left( \sigma ^{-2}\right) ^{\frac{n+\nu }{2}-1} \\ &\times \exp \left( -\frac{\sigma ^{-2}}{2}\left( \sum _{\iota =1}^{n_i} R_{\iota ij}^2 +\nu \lambda \right) \right) \exp \left( \frac{n_i^2\bar{R}_{ij}^2\sigma ^{-4}}{2\left( n_i\sigma ^{-2}+\left( \frac{0.5}{e\sqrt{m}}\right) ^{-2}\right) } \right) , \end{aligned}$$
(11)

where \(n_i\) is the number of observations in terminal node i of tree j and \(\bar{R}_{ij}\) is the mean of the partial residuals \(R_j\) for terminal node i in tree j.

Appendix B: BART for classification

For binary classification, Chipman et al. (2010) follow the latent variable probit approach of Albert and Chib (1993). Latent variables \(Z_k\) are introduced so that

$$\begin{aligned} Y_k={\left\{ \begin{array}{ll} 1 &{}\text {if }Z_k>0 \\ 0 &{}\text {otherwise} . \end{array}\right. } \end{aligned}$$

The sum of trees prior is then placed on the \(Z_k\) so that \(Z_k \sim N\left( \sum _{j=1}^m g(x_k;T_j,M_j),1 \right) \). It follows that \(Y_k\) is Bernoulli with

$$\begin{aligned} P(Y_k=1|x_k) = \Phi \left( \sum _{j=1}^m g(x_k;T_j,M_j)\right) , \end{aligned}$$
(12)

where \(\Phi \) is the standard normal cumulative distribution function (CDF), used here as the link function. Note that there is no residual variance parameter \(\sigma ^2\) in the classification version of the model.
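
As a concrete illustration of Eq. (12), the class probability is simply the probit transform of the summed tree fits. The minimal R sketch below assumes a hypothetical matrix tree_fits whose jth column holds the fitted values \(g(x_k;T_j,M_j)\) of tree j; it is an illustration, not code from any BART implementation.

# Minimal sketch of Eq. (12); `tree_fits` is an assumed n x m matrix of
# per-tree fitted values g(x_k; T_j, M_j), one column per tree.
p_hat <- pnorm(rowSums(tree_fits))          # P(Y_k = 1 | x_k) under the probit link
y_sim <- rbinom(length(p_hat), 1, p_hat)    # e.g., simulate labels from these probabilities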

Using the same prior distribution structure as in Sect. 2.2.2, the full posterior distribution of this version of the model is:

$$\begin{aligned} p(T,M,Z|X,Y) \propto{}& p(Y|Z)\, p(Z|X,T,M) \\ &\quad \times \left[ \prod _{j}\prod _{i}p(\mu _{ij}|T_j)p(T_j)\right] , \end{aligned}$$
(13)

where the top level of the likelihood (i.e., the first term on the right-hand side) is a deterministic function of the latent variables. The conditional prior distributions of the terminal node parameters \(\mu _{ij}|T_j\) are set exactly as described in Sect. 2.2.2 except that \(\sigma _0=\frac{3}{e\sqrt{m}}\) instead of \(\sigma _0=\frac{0.5}{e\sqrt{m}}\). This is in order to assign high prior probability to the interval \((\Phi [-3],\Phi [3])\) which corresponds to the 0.1 and \(99.9\%\) quantiles of the normal CDF.

The fitting algorithm proposed by Chipman et al. (2010) for the classification model is nearly identical to that of their standard algorithm. The only difference is that the latent variables \(Z_k\) introduce an extra step in the Gibbs algorithm. The full conditional distributions of \(Z_k|\ldots \) are:

$$\begin{aligned} Z_k|\ldots \sim {\left\{ \begin{array}{ll} \max {\left[ N \left( \sum _j g(x_k;T_j,M_j),1 \right) ,0\right] } &{}\text {if }Y_k=1, \\ \min {\left[ N \left( \sum _j g(x_k;T_j,M_j),1 \right) , 0\right] } &{}\text {if }Y_k=0. \end{array}\right. } \end{aligned}$$
(14)

That is, \(Z_k\) is drawn from \(N\left( \sum _j g(x_k;T_j,M_j),1\right) \) truncated to the positive half-line when \(Y_k=1\) and to the non-positive half-line when \(Y_k=0\). The partial residuals used in the individual tree updates are then based on the latent variables \(Z_k\) rather than on Y.
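
The latent-variable update in Eq. (14) therefore amounts to sampling from a normal distribution truncated to one half-line. The R sketch below uses the inverse-CDF method; the vector fit of summed tree fits \(\sum _j g(x_k;T_j,M_j)\) is an assumed input, and this is an illustration rather than the implementation used in any BART package.

# Sketch of the Z_k update in Eq. (14): N(fit_k, 1) truncated to (0, Inf)
# when Y_k = 1 and to (-Inf, 0] when Y_k = 0 (inverse-CDF sampling).
draw_latent <- function(fit, y) {
  p0 <- pnorm(0, mean = fit, sd = 1)                # P(Z_k <= 0) under N(fit_k, 1)
  u  <- ifelse(y == 1, runif(length(fit), p0, 1),   # restrict to Z_k > 0
                       runif(length(fit), 0, p0))   # restrict to Z_k <= 0
  qnorm(u, mean = fit, sd = 1)
}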

Appendix C: BART-BMA algorithm overview

(The BART-BMA algorithm overview is presented as a pseudocode figure in the published article.)

Appendix D: BART-BMA post hoc Gibbs sampler

In order to provide credible intervals for the point predictions \(\hat{Y}\) produced by BART-BMA, we run a post hoc Gibbs sampler. For each sum of trees model \(\mathcal {T}_\ell \) in Occam’s window, a separate MCMC chain is run. Within each model \(\mathcal {T}_\ell \), each terminal node parameter \(\mu _{ij}\) in each tree \(T_j\) is updated, followed by an update of \(\sigma ^{2}\); the full conditionals \(p(\mu _{ij}|T_j,R_j,\sigma ^2)\) and \(p(\sigma ^2|\ldots )\) are given in Sects. D.1 and D.2, respectively. The Gibbs sampler yields credible and prediction intervals, along with updates of \(\sigma ^{-2}\), for each sum of trees model accepted in the final BART-BMA model. The final simulated sample from the overall posterior distribution is obtained by selecting a number of iterations from each model’s chain proportional to that model’s posterior probability and combining them. This post hoc Gibbs sampler is far less computationally expensive than that of BART, as each sweep requires only a draw of \(\mu _{ij}\) from a normal distribution and a draw of \(\sigma ^{2}\) from an inverse-Gamma distribution (see Sects. D.1, D.2, respectively).
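
The combination step can be sketched in R as follows. This is an illustrative sketch rather than BART-BMA package code: the function name, the list-of-matrices layout for the per-model Gibbs output and the model_weights argument (the approximate posterior model probabilities) are all assumptions.

# Pool per-model Gibbs draws, keeping a number of iterations from each chain
# proportional to that model's (approximate) posterior probability.
combine_posterior_draws <- function(draws_per_model, model_weights, n_total = 1000) {
  w <- model_weights / sum(model_weights)
  n_keep <- round(n_total * w)
  kept <- Map(function(d, n) d[sample(nrow(d), min(n, nrow(d))), , drop = FALSE],
              draws_per_model, n_keep)
  do.call(rbind, kept)   # combined sample approximating the model-averaged posterior
}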

Appendix D.1: Update of \(p(M_j|T_j,R_{kij},\sigma ^2)\)

Let \(M_j = (\mu _{1j}, \ldots , \mu _{b_j j})\) denote the \(b_j\) terminal node parameters of tree \(T_j\), and let \(R_{kij}\) be the partial residuals for observations k belonging to terminal node i, used as the response variable to grow tree \(T_j\). BART-BMA assumes the prior on the terminal node parameters is \(\mu _{ij}|T_j,\sigma \sim N(0,\frac{\sigma ^2}{a})\), as in Chipman et al. (1998). The partial residuals are modelled as \(R_{kij}|\ldots \sim N(\mu _{ij},\sigma ^2)\).

The full conditional distribution of \(M_j\) is then

$$\begin{aligned} p(M_j| T_j,R_{kij},\sigma )&\propto p(R_{kij}|T_j,M_j,\sigma )p(M_j| T_j) \nonumber \\&\propto \prod _{k=1}^{n_i} p(R_{kij}|T_j,M_j,\sigma ) p(M_j |T_j) , \end{aligned}$$
(15)

where k indexes the observations within terminal node i of tree \(T_j\) and \(n_i\) refers to the number of observations which fall in terminal node i.

The draw from the full conditional of \(p(M_j|\ldots )\) is then a draw from the normal distribution

$$\begin{aligned} M_j|T_j,R_{kij},\sigma \sim N\left( \frac{\sum _{k=1}^{n_i}{R_{kij}}}{n_i+a},\frac{\sigma ^2}{n_i+a}\right) . \end{aligned}$$
(16)

The full conditional of \(M_j|\ldots \) depends on \(\sigma \) only through the variance, making it slightly more efficient to update than under the BART prior, where \(\sigma \) enters both the mean and the variance.
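
A minimal R sketch of the draw in Eq. (16) is given below; R_node (the partial residuals in terminal node i), sigma2 and the prior precision scaling a are assumed inputs, and the function name is illustrative.

# Draw mu_ij from its full conditional, Eq. (16).
draw_mu <- function(R_node, sigma2, a) {
  n_i <- length(R_node)
  rnorm(1, mean = sum(R_node) / (n_i + a), sd = sqrt(sigma2 / (n_i + a)))
}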

Appendix D.2: Update of \(p(\sigma ^2)\)

BART-BMA performs the update for \(\sigma ^2\) in the same way as Chipman et al. (2010). The full conditional distribution of \(\sigma ^2\) is:

$$\begin{aligned} p(\sigma ^2|R_j,T_j,M_j) \propto \prod _{k=1}^n p\left( R_j|T_j,M_j,\sigma ^2 \right) p\left( \sigma ^2\right) , \end{aligned}$$
(17)

where \(R_j \sim N\left( \sum _{j=1}^m g(x_k,T_j,M_j ),\sigma ^2\right) \) and \(\frac{1}{\sigma ^2} \sim \text{ Gamma }(\zeta ,\eta )\), with \(\zeta =\frac{\nu }{2}\) and \(\eta =\frac{\nu \lambda }{2}\).

BART-BMA makes the draw for \(\sigma ^2\) in terms of the precision \(\sigma ^{-2}=\frac{1}{\sigma ^2}\) where \(p(\sigma ^{-2}|R_j,T_j,M_j )\) is calculated as:

$$\begin{aligned} \sigma ^{-2}|R_j,T_j,M_j \sim \text{ Gamma } \left( \zeta +\frac{1}{2} , \frac{P}{2} + \frac{1}{\eta }\right) , \end{aligned}$$
(18)

where \(P=\sum _{k}\left[ Y_k-\sum _{j}g(x_k,T_j,M_j)\right] ^2\). The next value of \(\sigma ^{-2}\) is drawn from (18), and \(\sigma \) is recovered as the reciprocal square root of that draw.
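
In R this step is a single Gamma draw followed by a reciprocal square root. The sketch below takes the shape and rate of Eq. (18) as arguments rather than hard-coding them, and is illustrative rather than the package implementation.

# Draw the precision sigma^{-2} from its full conditional (shape and rate as
# given in Eq. (18)) and recover sigma by the reciprocal square root.
draw_sigma <- function(shape, rate) {
  prec <- rgamma(1, shape = shape, rate = rate)   # sigma^{-2} | R_j, T_j, M_j
  1 / sqrt(prec)                                  # sigma
}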

Appendix E: Greedy tree growing extra details

Appendix E.1: The PELT algorithm

Univariate changepoint detection algorithms in general search for distributional changes in an ordered series of data. For example, if normality is assumed then such an algorithm may look for changes in the mean or variance of the data. Searching for predictive split points for a single variable in a tree has an equivalent goal, i.e., it is desirable to find split points which maximise the separation of the response variable between the left- and right-hand daughter nodes. For this reason, we use a changepoint detection algorithm called PELT (Pruned Exact Linear Time) in BART-BMA to find predictive split points and greedily grow trees.

PELT was originally proposed to detect changepoints in an ordered series of data \(y_{1:n}=(y_1, \ldots , y_n)\) by minimising the function

$$\begin{aligned} \min _\delta \left[ \sum _{\theta =1}^{\Theta +1} \left[ C(y_{(\delta _{\theta -1}+1):\delta _{\theta }})+ D \right] \right] . \end{aligned}$$
(19)

Here, there are \(\Theta \) changepoints in the series at positions \(\delta _{1:\Theta }=(\delta _1, \ldots , \delta _\Theta )\), which results in \(\Theta +1\) segments. Each changepoint position \(\delta _{\theta }\) can take a value in \(1, \ldots , n-1\). For example, if a changepoint occurs at position \(\delta _1=5\) and another occurs at position \(\delta _2=12\), the second segment (\(\theta =2\)) contains the values \(y_{(6:12)}\). The function \(C(\cdot )\) is a cost function of each segment \(\theta \) containing observations \(y_{(\delta _{\theta -1}+1):\delta _{\theta }}\). In the results which follow, the cost function used is twice the negative log likelihood, assuming that y has a univariate normal distribution (a minimal sketch of this cost is given below). Finally, D is a penalty for adding additional changepoints; default values for D are discussed below.
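
For concreteness, a minimal R sketch of this segment cost, twice the negative maximised log likelihood under a univariate normal model for the segment, is shown here; it illustrates the cost \(C(\cdot )\) in Eq. (19) and is not the changepoint package internals.

# Twice the negative maximised normal log likelihood of a segment y_seg.
segment_cost <- function(y_seg) {
  n <- length(y_seg)
  sigma2_hat <- max(mean((y_seg - mean(y_seg))^2), 1e-8)   # MLE of the segment variance
  n * (log(2 * pi) + log(sigma2_hat) + 1)
}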

Table 8 Coverage for out-of-sample 50% prediction intervals and average interval width for BART-BMA, RF using conformal intervals, bartMachine and dbarts for the Friedman example
Table 9 Coverage for out-of-sample 75% prediction intervals and average interval width for BART-BMA, RF using conformal prediction, bartMachine and dbarts for the Friedman example

PELT extends the optimal partitioning method of Yao (1984) by eliminating any changepoints which cannot be optimal. This is achieved by observing that if there exists a candidate changepoint s with \(\delta<s<S\) which reduces the overall cost of the sequence, then a changepoint at \(\delta \) can never be optimal and so is removed from consideration (Killick et al. 2012). The procedure we use to greedily grow trees with PELT is described in Algorithm 3:

(Algorithm 3 is presented as a pseudocode figure in the published article.)

One disadvantage of using PELT for large datasets is that the number of changepoints detected by the PELT algorithm is linearly related to the number of observations, which can reduce the speed of the BART-BMA algorithm for large n. Our experience is that \(D=10\log (n)\) performs well as a general default for the PELT penalty when \(n<200\). For larger values of n, we recommend using a higher value for D or the grid search option instead (see Sect. 3.4.2) in order to limit the number of split points detected per variable. We implement a version of PELT which is equivalent to the PELT.meanvar.norm function from the changepoint package in R (Killick et al. 2014), which searches for changes in the mean and variance of variables assumed to be normally distributed. Additional changepoints are accepted if there is support for their inclusion according to the log likelihood ratio statistic.
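
A hedged usage sketch with the changepoint package is shown below: the response is first sorted by the predictor of interest and the manual penalty mirrors the default \(D=10\log (n)\) suggested above. The toy data and variable names are purely illustrative, and this is not how split points are generated inside the BART-BMA code.

# Candidate split points for one predictor via the changepoint package
# (Killick et al. 2014), using PELT with a manual penalty of 10*log(n).
library(changepoint)
set.seed(1)
x <- runif(150)                                     # toy predictor
y <- ifelse(x < 0.4, rnorm(150, 0), rnorm(150, 3))  # toy response with a mean shift
y_ordered <- y[order(x)]                            # response sorted by the predictor
n <- length(y_ordered)
fit <- cpt.meanvar(y_ordered, method = "PELT", test.stat = "Normal",
                   penalty = "Manual", pen.value = 10 * log(n))
cpts(fit)   # indices of detected changepoints = candidate split positions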

Appendix E.2: Updating splitting rules

By default, we choose the best numcp% of the total splitting rules before the tree is grown, and only trees using the most predictive splitting rules are considered for inclusion. However, the best splitting rules can also be updated for each internal node i in each tree \(T_j\), similarly to how RF creates trees. We have found that updating splitting rules at each internal node generally results in fewer trees \(T_j\) being included in each sum of trees model; however, each tree \(T_j\) averaged over in the final model tends to be deeper and to choose splits that are similar to the primary splits of trees in the RF. Updating the splitting rules at each internal node can in some cases increase the predictive accuracy, but generally at the expense of computational speed.

Appendix F: Out of sample prediction intervals

This appendix shows the results for the calibration of the Friedman example using 50 and 75% prediction intervals (Tables 8, 9).

Fig. 4 Example of experiments to guide the default value chosen to determine the size of Occam’s window \(o=1000\)

Fig. 5 Example of experiments conducted to guide the default value chosen for the PELT parameter pen, where \(D=pen\log (n)\), given Occam’s window \(o=1000\)

Fig. 6 Example of experiments conducted to guide the default value chosen for numcp

Appendix G: Choice of default values for BART-BMA

This section shows some of the preliminary investigations which guided the choice of default settings for the BART-BMA algorithm, such as the size of Occam’s window, the penalty on the PELT parameter and the size of the sum of trees models to be averaged over. In the results that follow, four datasets are shown: Ozone, Compactiv, Ankara and Baseball. These were also used as the benchmark datasets for the bartMachine package (Kapelner and Bleich 2014b).

For each of the four datasets shown, varying numbers of random noise variables were appended to test the sensitivity of the parameters to the dimensionality of the dataset. In all, 17 different values for the number of appended random noise variables were tested, ranging from 100 to 15,000, so each parameter value of interest was tested a total of 68 times.

For the value of Occam’s window, 20 values were evaluated ranging from 100 to 100,000. A contour plot showing the relative RMSE for the Baseball, Ankara, Compactiv and Ozone datasets can be seen in Fig. 4. Here, the RMSE value for each dataset has been divided by its minimum value, which allows for fair comparison across datasets.

In general, we recommend a default value of \(OW=1000\), as it seems to work well on the majority of datasets tested, as can be seen here (and in other datasets not shown). It was decided that the additional computational complexity involved in setting \(OW=10{,}000\) was not worth the marginal gain in accuracy for the datasets tested.

Figure 5 shows the same experiments conducted by varying the multiplier pen used in the PELT penalty \(D=pen\log (n)\) from 1 to 20. Here, we can see that a value of \(D=10\log (n)\) works well for the majority of the datasets shown.

Figure 6 shows the same datasets where Occam’s window is fixed at its default value of \(OW=1000\) and the PELT penalty parameter is fixed at \(D=10\log (n)\). Here, 7 increments for numcp were chosen, ranging from 5 to 100%.

Cite this article

Hernández, B., Raftery, A.E., Pennington, S.R. et al. Bayesian Additive Regression Trees using Bayesian model averaging. Stat Comput 28, 869–890 (2018). https://doi.org/10.1007/s11222-017-9767-1
