
Statistics and Computing, Volume 28, Issue 4, pp 869–890

Bayesian Additive Regression Trees using Bayesian model averaging

  • Belinda Hernández
  • Adrian E. Raftery
  • Stephen R. Pennington
  • Andrew C. Parnell

Abstract

Bayesian Additive Regression Trees (BART) is a statistical sum-of-trees model. It can be considered a Bayesian version of machine learning tree ensemble methods, where the individual trees are the base learners. However, for datasets where the number of variables p is large, the algorithm can become inefficient and computationally expensive. Another method that is popular for high-dimensional data is random forests, a machine learning algorithm that grows trees using a greedy search for the best split points. However, its default implementation does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large p. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm that can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the “small n, large p” scenario that is common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments: one to distinguish patients with cardiovascular disease from controls, and another to classify aggressive from non-aggressive prostate cancer. We compare our results to those of the main competing methods. Open-source code written in R and Rcpp to run BART-BMA is available at: https://github.com/BelindaHernandez/BART-BMA.git.
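
As a rough illustration of how the package might be called from R, the sketch below simulates a “small n, large p” regression problem and fits the model. The installation call targets the repository above, but the package name bartBMA, the fitting function bartBMA() with its x.train/y.train arguments, and the predict() method are illustrative assumptions, not confirmed from the repository.

    # Minimal usage sketch. Assumptions (not confirmed from the repository):
    # the repo installs as an R package named "bartBMA" exposing a fitting
    # function bartBMA(x.train, y.train) and a predict() method.
    # install.packages("devtools")
    devtools::install_github("BelindaHernandez/BART-BMA")
    library(bartBMA)  # assumed package name

    set.seed(42)

    # Simulate a "small n, large p" dataset: 50 samples, 500 candidate
    # variables, of which only the first 5 carry signal.
    n <- 50
    p <- 500
    x <- matrix(rnorm(n * p), nrow = n)
    y <- as.numeric(x[, 1:5] %*% runif(5, 1, 3) + rnorm(n))

    # Fit the sum-of-trees model via Bayesian model averaging
    # (hypothetical call signature).
    fit <- bartBMA(x.train = x, y.train = y)

    # Posterior predictions for new data (hypothetical accessor).
    x_new <- matrix(rnorm(10 * p), nrow = 10)
    preds <- predict(fit, newdata = x_new)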

Keywords

Bayesian Additive Regression Trees; Bayesian model averaging; Random forest; Biomarker selection; Small n large p


Acknowledgements

We would like to thank Drs. Chris Watson, John Baugh, Mark Ledwidge and Professor Kenneth McDonald for kindly allowing us to use the cardiovascular dataset described here. Hernández’s research was supported by the Irish Research Council. Raftery’s research was supported by NIH Grants R01-HD054511, R01-HD070936, and U54-HL127624, and by a Science Foundation Ireland E.T.S. Walton visitor award, Grant Reference 11/W.1/I2079. Protein biomarker discovery work in the Pennington Biomedical Proteomics Group is supported by grants from Science Foundation Ireland (for mass spectrometry instrumentation), the Irish Cancer Society (PCI11WAT), St. Luke’s Institute for Cancer Research, the Health Research Board (HRA_POR/2011/125), Movember GAP1 and the EU FP7 (MIAMI). The UCD Conway Institute is supported by the Program for Research in Third Level Institutions as administered by the Higher Education Authority of Ireland.

Supplementary material

Supplementary material 1: 11222_2017_9767_MOESM1_ESM.zip (5 KB)

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Belinda Hernández (1)
  • Adrian E. Raftery (3)
  • Stephen R. Pennington (2)
  • Andrew C. Parnell (1, 4)

  1. School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
  2. School of Medicine and Medical Science, University College Dublin, Dublin, Ireland
  3. Department of Statistics, University of Washington, Seattle, USA
  4. Insight: The National Centre for Data Analytics, University College Dublin, Dublin, Ireland
