Skip to main content

Stochastic Gradient Descent with Polyak’s Learning Rate

Abstract

Stochastic gradient descent (SGD) for strongly convex functions converges at the rate \(\mathcal {O}(1/k)\). However, achieving good results in practice requires tuning the parameters (for example the learning rate) of the algorithm. In this paper we propose a generalization of the Polyak step size, used for subgradient methods, to stochastic gradient descent. We prove a non-asymptotic convergence at the rate \(\mathcal {O}(1/k)\) with a rate constant which can be better than the corresponding rate constant for optimally scheduled SGD. We demonstrate that the method is effective in practice, and on convex optimization problems and on training deep neural networks, and compare to the theoretical rate.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Availability of Data and Materials

The datasets analysed during the current study are available in https://www.cs.toronto.edu/~kriz/cifar.html.

Code Availability

Available at https://github.com/marianapraz/polyakSGD.

Notes

  1. https://github.com/marianapraz/polyakSGD.

References

  1. Agarwal, A., Wainwright, M.J., Bartlett, P.L., Ravikumar, P.K.: Information–theoretic lower bounds on the oracle complexity of convex optimization. In: Advances in Neural Information Processing Systems, pp. 1–9 (2009)

  2. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)

  3. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)

    Book  Google Scholar 

  4. Boyd, S., Mutapcic, A.: Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter, 2013 (2013)

  5. Bottou, L.: Stochastic gradient learning in neural networks. Proc. Neuro-Nımes 91(8), 12 (1991)

    Google Scholar 

  6. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

    MathSciNet  MATH  Google Scholar 

  7. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015)

  8. Hinton, G., Srivastava, N., Swersky, K.: RmsProp: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012)

  9. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  10. Kim, S., Ahn, H., Cho, S.-C.: Variable target value subgradient method. Math. Program. 49(1–3), 359–369 (1990)

    MathSciNet  Article  Google Scholar 

  11. Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82(1), 35–45 (1960)

    MathSciNet  Article  Google Scholar 

  12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an o(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002 (2012)

  14. Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: Chaudhuri, K., Sugiyama, M. (eds.) Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Volume 89 of Proceedings of Machine Learning Research, pp. 983–992. PMLR, 16–18 April 2019

  15. Laborde, M., Oberman, A.: A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In: International Conference on Artificial Intelligence and Statistics, pp. 602–612 (2020)

  16. Laborde, M., Oberman, A.M.: Nesterov’s method with decreasing learning rate leads to accelerated stochastic gradient descent. arXiv preprint arXiv:1908.07861 (2019)

  17. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

    MATH  Google Scholar 

  18. Osher, S., Wang, B., Yin, P., Luo, X., Barekat, F., Pham, M., Lin, A.: Laplacian smoothing gradient descent. arXiv preprint arXiv:1806.06317 (2018)

  19. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

    Article  Google Scholar 

  20. Polyak, R.A.: Introduction to Continuous Optimization, Springer Optimization and Its Applications Series, vol. 172 (2021)

  21. Qian, X., Richtarik, P., Gower, R., Sailanbayev, A., Loizou, N., Shulgin, E.: SGD with arbitrary sampling: general analysis and improved rates. In: International Conference on Machine Learning, pp. 5200–5209 (2019)

  22. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

    MathSciNet  Article  Google Scholar 

  23. Rolinek, M., Martius, G.: L4: practical loss-based stepsize adaptation for deep learning. In: Advances in Neural Information Processing Systems, pp. 6434–6444 (2018)

  24. Raginsky, M., Rakhlin, A.: Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory 57(10), 7036–7056 (2011)

    MathSciNet  Article  Google Scholar 

  25. Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647 (2011)

  26. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806 (2014)

  27. Shor, N.Z.: Minimization Methods for Non-differentiable Functions, vol. 3. Springer, Berlin (2012)

    Google Scholar 

  28. Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)

  29. Wu, X., Ward, R., Bottou, L.: Wngrad: learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865 (2018)

  30. Wang, B., Zou, D., Gu, Q., Osher, S.J.: Laplacian smoothing stochastic gradient Markov chain Monte Carlo. SIAM J. Sci. Comput. 43(1), A26–A53 (2021)

    MathSciNet  Article  Google Scholar 

  31. Yan, Y., Yang, Ti., Li, Z., Lin, Q., Yi, Y.: A unified analysis of stochastic momentum methods for deep learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI18, pp. 2955–2961. AAAI Press (2018)

  32. Zhou, D., Chen, J., Cao, Y., Tang, Y., Yang, Z., Gu, Q.: On the convergence of adaptive gradient methods for nonconvex optimization (2020)

  33. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

  34. Zhang, L., Mahdavi, M., Jin, R.: Linear convergence with condition number independent access of full gradients. In: Advances in Neural Information Processing Systems, pp. 980–988 (2013)

Download references

Funding

Adam Oberman was supported by the Air Force Office of Scientific Research under award number FA9550-18-1-0167 and by IVADO. Mariana Prazeres was supported by the FCT scholarship SFRH/BD/145075/2019 (M.P.).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mariana Prazeres.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Prazeres, M., Oberman, A.M. Stochastic Gradient Descent with Polyak’s Learning Rate. J Sci Comput 89, 25 (2021). https://doi.org/10.1007/s10915-021-01628-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10915-021-01628-3

Keywords

  • Stochastic gradient descent
  • Learning rate
  • Polyak’s learning rate
  • Optimization
  • Strong convexity