Accelerated gradient boosting

  • G. Biau
  • B. Cadre
  • L. Rouvière


Gradient tree boosting is a prediction algorithm that sequentially produces a model in the form of linear combinations of decision trees, by solving an infinite-dimensional optimization problem. We combine gradient boosting and Nesterov’s accelerated descent to design a new algorithm, which we call AGB (for Accelerated Gradient Boosting). Substantial numerical evidence is provided on both synthetic and real-life data sets to assess the excellent performance of the method in a large variety of prediction problems. It is empirically shown that AGB is less sensitive to the shrinkage parameter and outputs predictors that are considerably more sparse in the number of trees, while retaining the exceptional performance of gradient boosting.
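To illustrate the idea, here is a minimal sketch of accelerated L2 boosting: trees are fitted to the residuals of a Nesterov "lookahead" model rather than the main model, and the two iterates are mixed with the usual momentum weights. The function name `agb_fit`, the stump depth, the shrinkage value, and the toy data are illustrative choices, not the paper's exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def agb_fit(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Sketch of accelerated gradient boosting for squared-error loss.

    f holds the main model's predictions; g holds the auxiliary
    (lookahead) model at which the gradient is evaluated.
    """
    f = np.full(len(y), y.mean())  # initialize at the mean response
    g = f.copy()
    lam = 1.0  # Nesterov sequence, lambda_1 = 1 as in the classic scheme
    for _ in range(n_trees):
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
        gamma = (1.0 - lam) / lam_next  # mixing weight (<= 0 after step 1)
        # Fit a small tree to the negative gradient -- here simply the
        # residuals -- evaluated at the lookahead model g, not at f.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - g)
        f_new = g + nu * tree.predict(X)
        # Momentum step: extrapolate past the new iterate.
        g = (1.0 - gamma) * f_new + gamma * f
        f, lam = f_new, lam_next
    return f  # in-sample predictions of the final model

# Toy check on a smooth one-dimensional signal.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
pred = agb_fit(X, y)
mse = np.mean((pred - y) ** 2)
```

For plain gradient boosting, the momentum step would be replaced by `g = f_new`; the acceleration comes entirely from evaluating residuals at the extrapolated model.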


Keywords: Gradient boosting · Nesterov’s acceleration · Trees



We warmly thank the two referees for their valuable comments and insightful suggestions, which led to a substantial improvement of the paper.



Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. CNRS, LPSM, Sorbonne Université, Paris, France
  2. CNRS, IRMAR, Univ Rennes, Rennes, France
