Minimizing finite sums with the stochastic average gradient

An Erratum to this article was published on 14 July 2016


We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from \(O(1/\sqrt{k})\) to \(O(1/k)\) in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear \(O(1/k)\) to a linear rate of the form \(O(\rho^k)\) for \(\rho < 1\). Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work, Le Roux et al. (Adv Neural Inf Process Syst, 2012), which only led to a faster rate for well-conditioned strongly convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that its performance may be further improved through non-uniform sampling strategies.
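To make the idea concrete, the basic SAG iteration described above (keep a memory of one gradient per term, and step along the average of the stored gradients) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the least-squares objective, the function name `sag_least_squares`, and the step size are assumptions chosen for the example.

```python
import numpy as np

def sag_least_squares(A, b, step, iters, seed=0):
    """Minimize (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 with a basic SAG sketch.

    Keeps a memory y[i] of the last gradient evaluated for each term and
    updates x using the average of the stored gradients, so each iteration
    touches only one data point but uses information from all of them."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    y = np.zeros((n, d))   # stored gradient for each term
    g_sum = np.zeros(d)    # running sum of the stored gradients
    for _ in range(iters):
        i = rng.integers(n)               # sample one term uniformly
        g_new = A[i] * (A[i] @ x - b[i])  # fresh gradient of term i only
        g_sum += g_new - y[i]             # update the sum in O(d) time
        y[i] = g_new
        x -= (step / n) * g_sum           # step along the average gradient
    return x

# Tiny example: recover x for a consistent least-squares system.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
x_hat = sag_least_squares(A, b, step=0.02, iters=20000)
```

Note the key design point the abstract alludes to: unlike plain SG, the iterate moves along an average of `n` stored gradients, so a constant step size can be used and the variance of the update shrinks as the stored gradients approach their values at the solution.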


Figs. 1–3 (figures not included in this preview)







  6. Note that we also compared to a variety of other SG methods, including the popular Pegasos SG method of Shalev-Shwartz et al. [55], SG with momentum, SG with gradient averaging, the regularized dual averaging method of Xiao [63] (a stochastic variant of the primal-dual subgradient method of Nesterov [44] for regularized objectives), the accelerated SG method of Delyon and Juditsky [14], SG methods that average only the later iterations, as in the 'optimal' SG method for non-smooth optimization of Rakhlin et al. [48], the epoch SG method of Hazan and Kale [21], the 'nearly-optimal' SG method of Ghadimi and Lan [18], a diagonally-scaled SG method using the inverses of the coordinate-wise Lipschitz constants as the diagonal terms, and the adaptive diagonally-scaled AdaGrad method of Duchi et al. [15]. However, we omit results obtained using these algorithms, since they never performed substantially better than the better of the SG and ASG methods when their step size was chosen in hindsight.


  1. Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)

  2. Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inf. Process. Syst. 451–459 (2011)

  3. Bach, F., Moulines, E.: Non-strongly-convex smooth stochastic approximation with convergence rate \(o(1/n)\). arXiv preprint (2013)

  4. Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)

  5. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  6. Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 10, 1737–1754 (2009)

  7. Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural Inf. Process. Syst. 217–224 (2003)

  8. Carbonetto, P.: New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods. Ph.D. thesis, University of British Columbia (2009)

  9. Caruana, R., Joachims, T., Backstrom, L.: KDD-cup 2004: results and analysis. ACM SIGKDD Newsl. 6(2), 95–108 (2004)

  10. Cauchy, A.: Méthode générale pour la résolution des systèmes d’équations simultanées. Comptes rendus des séances de l’Académie des sciences de Paris 25, 536–538 (1847)

  11. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)

  12. Cormack, G.V., Lynam, T.R.: Spam corpus creation for TREC. In: Proceedings of 2nd Conference on Email and Anti-Spam (2005).

  13. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inf. Process. Syst. 1646–1654 (2014)

  14. Delyon, B., Juditsky, A.: Accelerated stochastic approximation. SIAM J. Optim. 3(4), 868–881 (1993)

  15. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  16. Frank, A., Asuncion, A.: UCI machine learning repository (2010).

  17. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1351–A1379 (2012)

  18. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optim. Online (2010)

  19. Gong, P., Ye, J.: Linear convergence of variance-reduced projected stochastic gradient without strong convexity. arXiv preprint (2014)

  20. Guyon, I.: SIDO: a pharmacology dataset (2008).

  21. Hazan, E., Kale, S.: Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. J. Mach. Learn. Res. Workshop Conf. Proc. 15, 2489–2512 (2011)

  22. Hu, C., Kwok, J., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. Adv. Neural Inf. Process. Syst. 781–789 (2009)

  23. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 315–323 (2013)

  24. Keerthi, S., DeCoste, D.: A modified finite Newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)

  25. Kesten, H.: Accelerated stochastic approximation. Ann. Math. Stat. 29(1), 41–59 (1958)

  26. Konečnỳ, J., Richtárik, P.: Semi-stochastic gradient descent methods. arXiv preprint (2013)

  27. Konečnỳ, J., Qu, Z., Richtárik, P.: Semi-stochastic coordinate descent. arXiv preprint (2014)

  28. Kushner, H.J., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, 2nd edn. Springer, New York (2003)

  29. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. Int. Conf. Mach. Learn. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 53–61 (2013)

  30. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Adv. Neural Inf. Process. Syst. 2663–2671 (2012)

  31. Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

  32. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Adv. Neural Inf. Process. Syst. 3384–3392 (2015)

  33. Liu, J., Chen, J., Ye, J.: Large-scale sparse logistic regression. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2009)

  34. Mahdavi, M., Zhang, L., Jin, R.: Mixed Optimization of Smooth Functions. Adv. Neural Inf. Process. Syst. 674–682 (2013)

  35. Mairal, J.: Optimization with first-order surrogate functions. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 783–791 (2013)

  36. Martens, J.: Deep learning via Hessian-free optimization. Int. Conf. Mach. Learn. 735–742 (2010)

  37. Nedic, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 263–304. Kluwer Academic Publishers, Dordrecht (2000)

  38. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural Inf. Process. Syst. 1017–1025 (2014)

  39. Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  40. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  41. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\). Doklady AN SSSR 269(3), 543–547 (1983)

  42. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)

  43. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  44. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  45. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper (2010)

  46. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

  47. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)

  48. Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. Int. Conf. Mach. Learn. 449–456 (2012)

  49. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  50. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2004)

  51. Schmidt, M.: minFunc: unconstrained differentiable multivariate optimization in Matlab (2005)

  52. Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint (2013)

  53. Schmidt, M., Babanezhad, R., Ahmed, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. Int. Conf. Artif. Intell. Stat. J. Mach. Learn. Res. Workshop Conf. Proc. 38, 819–828 (2015)

  54. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  55. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1), 3–30 (2011)

  56. Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. J. Mach. Learn. Res. Workshop Conf. Proc. 32 (2014)

  57. Solodov, M.: Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 11(1), 23–35 (1998)

  58. Srebro, N., Sridharan, K.: Theoretical basis for "more data less work"? NIPS Workshop on Computational Trade-offs in Statistical Learning (2011)

  59. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  60. Sunehag, P., Trumpf, J., Vishwanathan, S., Schraudolph, N.: Variable metric stochastic approximation theory. J. Mach. Learn. Res. Workshop Conf. Proc. 5, 560–566 (2009)

  61. Teo, C.H., Le, Q., Smola, A.J., Vishwanathan, S.V.N.: A scalable modular convex solver for regularized risk minimization. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2007)

  62. Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

  63. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)

  64. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(2), 2057–2075 (2014)

  65. Zhong, L., Kwok, J.: Fast stochastic alternating direction method of multipliers. J. Mach. Learn. Res. Workshop Conf. Proc. 65 (2014)

Acknowledgments


We would like to thank the anonymous reviewers for their many useful comments. This work was partially supported by the European Research Council (SIERRA-ERC-239993) and a Google Research Award. Mark Schmidt is also supported by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

Author information


Corresponding author

Correspondence to Mark Schmidt.

Additional information

Electronic supplementary material

Supplementary material 1 (pdf 384 KB)

About this article

Cite this article

Schmidt, M., Le Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017).

