Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates

  • Janek Thomas
  • Andreas Mayr
  • Bernd Bischl
  • Matthias Schmid
  • Adam Smith
  • Benjamin Hofner

Abstract

We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that makes it possible to incorporate stability selection, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate: the model is fitted repeatedly to subsampled data, and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new "noncyclical" fitting algorithm that incorporates an additional step selecting the best-fitting distribution parameter in each iteration. The new algorithm has the further advantage that it reduces the optimization of the boosting tuning parameters from a multi-dimensional to a one-dimensional problem, with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply the new algorithm to data from a study estimating the abundance of common eider in Massachusetts, USA, which feature excess zeros, overdispersion, nonlinearity and spatiotemporal structure. Eider abundance is estimated via boosted GAMLSS, allowing both the mean and the amount of overdispersion to be regressed on covariates, and stability selection is used to obtain a sparse set of stable predictors.
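As a minimal, hedged illustration of how the pieces described in the abstract fit together, the R sketch below fits a noncyclical boosted GAMLSS and runs stability selection using the authors' gamboostLSS package (version 2.0 or later) together with stabs. The simulated data frame dat and its covariates x1 to x4 are hypothetical placeholders, not the eider data analysed in the paper.

  ## Sketch only: assumes gamboostLSS (>= 2.0) and stabs; data are simulated.
  library("gamboostLSS")   # boosted GAMLSS models; also loads mboost
  library("stabs")         # stability selection with error control

  ## Toy data: negative binomial counts; x1 is informative, x2-x4 are noise.
  set.seed(1)
  n   <- 500
  dat <- data.frame(x1 = runif(n), x2 = runif(n),
                    x3 = runif(n), x4 = runif(n))
  dat$y <- rnbinom(n, mu = exp(1 + 2 * dat$x1), size = 2)

  ## Noncyclical fitting: each iteration updates only the distribution
  ## parameter (here mu or sigma) whose best base-learner improves the fit
  ## most, so a single stopping iteration mstop remains to be tuned.
  mod <- glmboostLSS(y ~ ., data = dat, families = NBinomialLSS(),
                     method = "noncyclic",
                     control = boost_control(mstop = 500, nu = 0.1))

  ## Tuning is one-dimensional: cross-validate mstop along a single grid.
  cvr <- cvrisk(mod, grid = 1:500)
  mod[mstop(cvr)]          # set the model to the optimal stopping iteration

  ## Stability selection: at most q base-learners per subsample, with the
  ## per-family error rate (PFER) bounded by 1.
  sel <- stabsel(mod, q = 4, PFER = 1)
  selected(sel)            # the stable set of base-learners

With the cyclical algorithm, by contrast, mstop would have to be optimized separately for every distribution parameter via a multi-dimensional grid search; the one-dimensional tuning above is exactly what the noncyclical update avoids.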

Keywords

Boosting · Additive models · GAMLSS · gamboostLSS · Stability selection

Notes

Acknowledgements

We thank Mass Audubon for the use of common eider abundance data.

Supplementary material

Supplementary material 1: 11222_2017_9754_MOESM1_ESM.zip (ZIP, 267.2 MB)


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  2. Department of Medical Informatics, Biometry and Epidemiology, FAU Erlangen-Nürnberg, Erlangen, Germany
  3. Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany
  4. U.S. Fish & Wildlife Service, National Wildlife Refuge System, Southeast Inventory & Monitoring Branch, Lewistown, USA
  5. Section Biostatistics, Paul-Ehrlich-Institut, Langen, Germany
