
Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

Stopping criteria for Stochastic Gradient Descent (SGD) methods play important roles, ranging from enabling adaptive step size schemes to providing rigor for downstream analyses such as asymptotic inference. Unfortunately, current stopping criteria for SGD methods are often heuristics that rely on asymptotic normality results or convergence to stationary distributions, which may fail to exist for nonconvex functions and, thereby, limit the applicability of such stopping criteria. To address this issue, in this work, we rigorously develop two stopping criteria for SGD that can be applied to a broad class of nonconvex functions, which we term Bottou-Curtis-Nocedal functions. Moreover, as a prerequisite for developing these stopping criteria, we prove that the gradient function evaluated at SGD’s iterates converges strongly to zero for Bottou-Curtis-Nocedal functions, which addresses an open question in the SGD literature. As a result of our work, our rigorously developed stopping criteria can be used to develop new adaptive step size schemes or bolster other downstream analyses for nonconvex functions.
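
As a rough illustration of the role such a criterion plays (and not the criterion developed in the paper), the sketch below runs SGD on a simple nonconvex objective and halts once an exponential moving average of the squared stochastic-gradient norm falls below a tolerance; the objective, noise model, step-size schedule, averaging weight, and tolerance are all illustrative assumptions of ours.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(x):
        # Gradient of the illustrative nonconvex objective
        # f(x) = sum_i ( x_i**2 / 2 + 2 * cos(x_i) ).
        return x - 2.0 * np.sin(x)

    def stochastic_grad(x):
        # Unbiased gradient estimate: true gradient plus zero-mean Gaussian noise.
        return grad(x) + rng.normal(scale=0.1, size=x.shape)

    x = rng.normal(size=5)   # random starting point
    tol = 0.35               # illustrative tolerance on the estimated gradient norm
    ema = None               # running average of ||g_k||^2

    for k in range(1, 100_001):
        g = stochastic_grad(x)
        x = x - (0.3 / k**0.6) * g                   # diminishing step sizes
        sq = float(g @ g)
        ema = sq if ema is None else 0.99 * ema + 0.01 * sq
        if ema < tol**2:                             # heuristic stopping test
            print(f"stopped at iteration {k}; running estimate of ||grad||^2 is {ema:.3f}")
            break

Note that the running average also absorbs the variance of the gradient noise, so a heuristic test of this kind must be tuned by hand; the criteria developed in the paper are instead rigorously justified for Bottou-Curtis-Nocedal functions.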

Notes

  1. For a review of recent approaches to adaptive step size procedures, see [8].

  2. Some recent work disputes whether early termination leads to better generalization; see [1] and related work. However, even in this case, one needs to know that a minimizer has been achieved.

  3. We can include those problems that are nonconvex, yet are locally strongly convex around minimizers.

  4. Including a number of reports that came out after a preprint of this work was made public. See [18, 23, 31].

  5. The notions of “sufficiently small” or “too large” are dependent on the application, just as they are in deterministic optimization.

  6. This bound is required to apply Lemma 1 of [27]. See the second display equation on page 6 of [27].

  7. This type of bound is established in (15) and the subsequent display equation in [26]. The argument then essentially reestablishes Lemma 1 of [27].

  8. We could drop the last term in the optimization problem as it is a constant.

  9. Since \(P_k\) is random, its Schur decomposition is random.

References

  1. Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)

  2. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010(1–38), 3 (2011)

  3. Bi, J., Gunn, S.R.: A stochastic gradient method with biased estimation for faster nonconvex optimization. In: Pacific Rim International Conference on Artificial Intelligence, pp. 337–349. Springer (2019)

  4. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  5. Chee, J., Toulis, P.: Convergence diagnostics for stochastic gradient descent with constant learning rate. In: International Conference on Artificial Intelligence and Statistics, pp. 1476–1485 (2018)

  6. Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)

  7. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25(3), 463–483 (1954)

  8. Curtis, F.E., Scheinberg, K.: Adaptive stochastic optimization. arXiv preprint arXiv:2001.06699 (2020)

  9. Devroye, L., Györfi, L., Lugosi, G.: A probabilistic theory of pattern recognition, vol. 31. Springer Science & Business Media, Berlin (2013)

  10. Durrett, R.: Probability: theory and examples, 4th edn. Cambridge University Press, Cambridge (2010)

  11. Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stoch.: Int. J. Probab. Stoch. Process. 9(1–2), 1–36 (1983)

  12. Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Stat., pp. 191–200 (1967)

  13. Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019)

  14. Farrell, R.: Bounded length confidence intervals for the zero of a regression function. Ann. Math. Stat., pp. 237–247 (1962)

  15. Fehrman, B., Gess, B., Jentzen, A.: Convergence rates for the stochastic gradient descent method for non-convex objective functions. arXiv preprint arXiv:1904.01517 (2019)

  16. Foster, D.J., Sekhari, A., Sridharan, K.: Uniform convergence of gradients for non-convex learning and optimization. In: Advances in Neural Information Processing Systems, pp. 8745–8756 (2018)

  17. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  18. Gower, R.M., Sebbouh, O., Loizou, N.: SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. arXiv preprint arXiv:2006.10311 (2020)

  19. Hu, W., Li, C.J., Li, L., Liu, J.G.: On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562 (2017)

  20. Huang, F., Chen, S.: Linear convergence of accelerated stochastic gradient descent for nonconvex nonsmooth optimization. arXiv preprint arXiv:1704.07953 (2017)

  21. Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.I.: On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811 (2019)

  22. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  23. Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329 (2020)

  24. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)

  25. Lei, J., Shanbhag, U.V.: A randomized block proximal variable sample-size stochastic gradient method for composite nonconvex stochastic optimization. arXiv preprint arXiv:1808.02543 (2018)

  26. Lei, Y., Hu, T., Li, G., Tang, K.: Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Transactions on Neural Networks and Learning Systems (2019)

  27. Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114 (2018)

  28. Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 5564–5574 (2018)

  29. Ma, Y., Klabjan, D.: Convergence analysis of batch normalization for deep neural nets. CoRR, arXiv:1705.08011 (2017)

  30. McDiarmid, C.: Concentration. In: Probabilistic methods for algorithmic discrete mathematics, pp. 195–248. Springer (1998)

  31. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. arXiv preprint arXiv:2006.11144 (2020)

  32. Mirozahmedov, F., Uryasev, S.: Adaptive stepsize regulation for stochastic optimization algorithm. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 23(6), 1314–1325 (1983)

  33. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  34. Park, S., Jung, S.H., Pardalos, P.M.: Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. arXiv preprint arXiv:1906.11417 (2019)

  35. Patel, V.: Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016)

  36. Patel, V.: The impact of local geometry and batch size on the convergence and divergence of stochastic gradient descent. arXiv preprint arXiv:1709.04718 (2017)

  37. Pflug, G.C.: Stepsize rules, stopping times and their implementation in stochastic quasi-gradient algorithms. In: Numerical Techniques for Stochastic Optimization, pp. 353–372 (1988)

  38. Prechelt, L.: Early stopping – but when? In: Neural Networks: Tricks of the Trade, pp. 55–69. Springer (1998)

  39. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: International conference on machine learning, pp. 314–323 (2016)

  40. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

  41. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat., pp. 400–407 (1951)

  42. Roy, V.: Convergence diagnostics for Markov chain Monte Carlo. Annual Rev. Stat. Its Appl. 7, 387–412 (2020)

  43. Sielken, R.L.: Stopping times for stochastic approximation procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 26(1), 67–75 (1973)

  44. Stroup, D.F., Braun, H.I.: On a new stopping rule for stochastic approximation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 60(4), 535–554 (1982)

  45. Van der Vaart, A.W.: Asymptotic statistics, vol. 3. Cambridge University Press, Cambridge (2000)

  46. Wada, T., Itani, T., Fujisaki, Y.: A stopping rule for linear stochastic approximation. In: 49th IEEE Conference on Decision and Control (CDC), pp. 4171–4176. IEEE (2010)

  47. Wang, X., Wang, X., Yuan, Y.X.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34(5), 922–948 (2019)

  48. Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811 (2018)

  49. Wu, L.: Mixed effects models for complex data. Chapman and Hall/CRC, Florida (2009)

  50. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017)

  51. Yin, G.: A stopping rule for the Robbins–Monro method. J. Optim. Theory Appl. 67(1), 151–173 (1990)

  52. Yu, H., Jin, R.: On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346 (2019)

  53. Zhang, P., Lang, H., Liu, Q., Xiao, L.: Statistical adaptive stochastic gradient methods. arXiv preprint arXiv:2002.10597 (2020)

  54. Zhou, Y.: Nonconvex optimization in machine learning: Convergence, landscape, and generalization. Ph.D. thesis, The Ohio State University (2018)

  55. Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of Adam and RMSProp. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135 (2019)

  56. Zoutendijk, G.: Nonlinear programming, computational methods. In: Integer and Nonlinear Programming, pp. 37–86 (1970)

Acknowledgements

We thank the reviewers for their detailed feedback, which has greatly improved the quality of this work.

Author information

Corresponding author

Correspondence to Vivak Patel.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the Wisconsin Alumni Research Foundation.

About this article

Cite this article

Patel, V. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Math. Program. 195, 693–734 (2022). https://doi.org/10.1007/s10107-021-01710-6

