Minimizing finite sums with the stochastic average gradient

An Erratum to this article was published on 14 July 2016


We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from \(O(1/\sqrt{k})\) to \(O(1/k)\) in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear \(O(1/k)\) to a linear rate of the form \(O(\rho^k)\) for \(\rho < 1\). Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work, Le Roux et al. (Adv Neural Inf Process Syst, 2012), which only led to a faster rate for well-conditioned strongly convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that its performance may be further improved through non-uniform sampling strategies.
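To make the idea concrete, the basic SAG iteration described above (keep a memory of one gradient per term, and step along the average of the stored gradients) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the least-squares objective, the function name `sag_least_squares`, and the step size are assumptions chosen for the example.

```python
import numpy as np

def sag_least_squares(A, b, step, iters, seed=0):
    """Minimize (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 with a basic SAG sketch.

    Keeps a memory y[i] of the last gradient evaluated for each term and
    updates x using the average of the stored gradients, so each iteration
    touches only one data point but uses information from all of them."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    y = np.zeros((n, d))   # stored gradient for each term
    g_sum = np.zeros(d)    # running sum of the stored gradients
    for _ in range(iters):
        i = rng.integers(n)               # sample one term uniformly
        g_new = A[i] * (A[i] @ x - b[i])  # fresh gradient of term i only
        g_sum += g_new - y[i]             # update the sum in O(d) time
        y[i] = g_new
        x -= (step / n) * g_sum           # step along the average gradient
    return x

# Tiny example: recover x for a consistent least-squares system.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
x_hat = sag_least_squares(A, b, step=0.02, iters=20000)
```

Note the key design point the abstract alludes to: unlike plain SG, the iterate moves along an average of `n` stored gradients, so a constant step size can be used and the variance of the update shrinks as the stored gradients approach their values at the solution.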


Figs. 1–3 (figures not included in this preview)







  6. Note that we also compared to a variety of other SG methods, including the popular Pegasos SG method of Shalev-Shwartz et al. [55], SG with momentum, SG with gradient averaging, the regularized dual averaging method of Xiao [63] (a stochastic variant of the primal-dual subgradient method of Nesterov [44] for regularized objectives), the accelerated SG method of Delyon and Juditsky [14], SG methods that average only the later iterations, as in the 'optimal' SG method for non-smooth optimization of Rakhlin et al. [48], the epoch SG method of Hazan and Kale [21], the 'nearly-optimal' SG method of Ghadimi and Lan [18], a diagonally-scaled SG method using the inverses of the coordinate-wise Lipschitz constants as the diagonal terms, and the adaptive diagonally-scaled AdaGrad method of Duchi et al. [15]. However, we omit results obtained using these algorithms, since they never performed substantially better than the better of the SG and ASG methods when their step size was chosen in hindsight.


  1. Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)

  2. Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inf. Process. Syst. 451–459 (2011)

  3. Bach, F., Moulines, E.: Non-strongly-convex smooth stochastic approximation with convergence rate \(o(1/n)\). arXiv preprint (2013)

  4. Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)

  5. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  6. Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 10, 1737–1754 (2009)

  7. Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural Inf. Process. Syst. 217–224 (2003)

  8. Carbonetto, P.: New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods. Ph.D. thesis, University of British Columbia (2009)

  9. Caruana, R., Joachims, T., Backstrom, L.: KDD-cup 2004: results and analysis. ACM SIGKDD Newsl. 6(2), 95–108 (2004)

  10. Cauchy, A.: Méthode générale pour la résolution des systèmes d’équations simultanées. Comptes rendus des séances de l’Académie des sciences de Paris 25, 536–538 (1847)

  11. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)

  12. Cormack, G.V., Lynam, T.R.: Spam corpus creation for TREC. In: Proceedings of 2nd Conference on Email and Anti-Spam (2005).

  13. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inf. Process. Syst. 1646–1654 (2014)

  14. Delyon, B., Juditsky, A.: Accelerated stochastic approximation. SIAM J. Optim. 3(4), 868–881 (1993)

  15. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  16. Frank, A., Asuncion, A.: UCI machine learning repository (2010).

  17. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1351–A1379 (2012)

  18. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optim. Online (2010)

  19. Gong, P., Ye, J.: Linear convergence of variance-reduced projected stochastic gradient without strong convexity. arXiv preprint (2014)

  20. Guyon, I.: SIDO: a pharmacology dataset (2008).

  21. Hazan, E., Kale, S.: Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. J. Mach. Learn. Res. Workshop Conf. Proc. 15, 2489–2512 (2011)

  22. Hu, C., Kwok, J., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. Adv. Neural Inf. Process. Syst. 781–789 (2009)

  23. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 315–323 (2013)

  24. Keerthi, S., DeCoste, D.: A modified finite Newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)

  25. Kesten, H.: Accelerated stochastic approximation. Ann. Math. Stat. 29(1), 41–59 (1958)

  26. Konečnỳ, J., Richtárik, P.: Semi-stochastic gradient descent methods. arXiv preprint (2013)

  27. Konečnỳ, J., Qu, Z., Richtárik, P.: Semi-stochastic coordinate descent. arXiv preprint (2014)

  28. Kushner, H.J., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, 2nd edn. Springer, New York (2003)

  29. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. Int. Conf. Mach. Learn. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 53–61 (2013)

  30. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Adv. Neural Inf. Process. Syst. 2663–2671 (2012)

  31. Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

  32. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Adv. Neural Inf. Process. Syst. 3384–3392 (2015)

  33. Liu, J., Chen, J., Ye, J.: Large-scale sparse logistic regression. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2009)

  34. Mahdavi, M., Zhang, L., Jin, R.: Mixed Optimization of Smooth Functions. Adv. Neural Inf. Process. Syst. 674–682 (2013)

  35. Mairal, J.: Optimization with first-order surrogate functions. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 783–791 (2013)

  36. Martens, J.: Deep learning via Hessian-free optimization. Int. Conf. Mach. Learn. 735–742 (2010)

  37. Nedic, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 263–304. Kluwer Academic Publishers, Dordrecht (2000)

  38. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural Inf. Process. Syst. 1017–1025 (2014)

  39. Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  40. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  41. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\). Doklady AN SSSR 269(3), 543–547 (1983)

  42. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)

  43. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  44. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  45. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper (2010)

  46. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

  47. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)

  48. Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. Int. Conf. Mach. Learn. 449–456 (2012)

  49. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  50. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2004)

  51. Schmidt, M.: minFunc: unconstrained differentiable multivariate optimization in Matlab (2005)

  52. Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint (2013)

  53. Schmidt, M., Babanezhad, R., Ahmed, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. Int. Conf. Artif. Intell. Stat. J. Mach. Learn. Res. Workshop Conf. Proc. 38, 819–828 (2015)

  54. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  55. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1), 3–30 (2011)

  56. Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. J. Mach. Learn. Res. Workshop Conf. Proc. 32 (2014)

  57. Solodov, M.: Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 11(1), 23–35 (1998)

  58. Srebro, N., Sridharan, K.: Theoretical basis for "more data less work"? NIPS Workshop on Computational Trade-offs in Statistical Learning (2011)

  59. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  60. Sunehag, P., Trumpf, J., Vishwanathan, S., Schraudolph, N.: Variable metric stochastic approximation theory. J. Mach. Learn. Res. Workshop Conf. Proc. 5, 560–566 (2009)

  61. Teo, C.H., Le, Q., Smola, A.J., Vishwanathan, S.V.N.: A scalable modular convex solver for regularized risk minimization. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2007)

  62. Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

  63. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)

  64. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(2), 2057–2075 (2014)

  65. Zhong, L., Kwok, J.: Fast stochastic alternating direction method of multipliers. J. Mach. Learn. Res. Workshop Conf. Proc. 65 (2014)

Acknowledgments


We would like to thank the anonymous reviewers for their many useful comments. This work was partially supported by the European Research Council (SIERRA-ERC-239993) and a Google Research Award. Mark Schmidt is also supported by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

Author information


Corresponding author

Correspondence to Mark Schmidt.

Additional information

Electronic supplementary material

Supplementary material 1 (pdf 384 KB)

About this article

Cite this article

Schmidt, M., Le Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017).

