
Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

Stopping criteria for Stochastic Gradient Descent (SGD) methods play important roles, ranging from enabling adaptive step size schemes to providing rigor for downstream analyses such as asymptotic inference. Unfortunately, current stopping criteria for SGD methods are often heuristics that rely on asymptotic normality results or convergence to stationary distributions, which may fail to exist for nonconvex functions and, thereby, limit the applicability of such stopping criteria. To address this issue, in this work, we rigorously develop two stopping criteria for SGD that can be applied to a broad class of nonconvex functions, which we term Bottou-Curtis-Nocedal functions. Moreover, as a prerequisite for developing these stopping criteria, we prove that the gradient function evaluated at SGD’s iterates converges strongly to zero for Bottou-Curtis-Nocedal functions, which addresses an open question in the SGD literature. As a result of our work, our rigorously developed stopping criteria can be used to develop new adaptive step size schemes or bolster other downstream analyses for nonconvex functions.
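
As a rough illustration of the role such a criterion plays (and not the criterion developed in the paper), the sketch below runs SGD on a simple nonconvex objective and halts once an exponential moving average of the squared stochastic-gradient norm falls below a tolerance; the objective, noise model, step-size schedule, averaging weight, and tolerance are all illustrative assumptions of ours.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(x):
        # Gradient of the illustrative nonconvex objective
        # f(x) = sum_i ( x_i**2 / 2 + 2 * cos(x_i) ).
        return x - 2.0 * np.sin(x)

    def stochastic_grad(x):
        # Unbiased gradient estimate: true gradient plus zero-mean Gaussian noise.
        return grad(x) + rng.normal(scale=0.1, size=x.shape)

    x = rng.normal(size=5)   # random starting point
    tol = 0.35               # illustrative tolerance on the estimated gradient norm
    ema = None               # running average of ||g_k||^2

    for k in range(1, 100_001):
        g = stochastic_grad(x)
        x = x - (0.3 / k**0.6) * g                   # diminishing step sizes
        sq = float(g @ g)
        ema = sq if ema is None else 0.99 * ema + 0.01 * sq
        if ema < tol**2:                             # heuristic stopping test
            print(f"stopped at iteration {k}; running estimate of ||grad||^2 is {ema:.3f}")
            break

Note that the running average also absorbs the variance of the gradient noise, so a heuristic test of this kind must be tuned by hand; the criteria developed in the paper are instead rigorously justified for Bottou-Curtis-Nocedal functions.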

Notes

  1. For a review of recent approaches to adaptive step size procedures, see [8].

  2. Some recent work disputes whether early termination leads to better generalization; see [1] and related work. However, even in this case, one needs to know that a minimizer has been achieved.

  3. We can include those problems that are nonconvex, yet are locally strongly convex around minimizers.

  4. Including a number of reports that came out after a preprint of this work was made public. See [18, 23, 31].

  5. The notions of “sufficiently small” or “too large” are dependent on the application, just as they are in deterministic optimization.

  6. This bound is required to apply Lemma 1 of [27]. See the second display equation on page 6 of [27].

  7. This type of bound is established in (15) and the subsequent display equation in [26]. The argument then essentially reestablishes Lemma 1 of [27].

  8. We could drop the last term in the optimization problem as it is a constant.

  9. Since \(P_k\) is random, its Schur decomposition is random.

References

  1. Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)

  2. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010(1–38), 3 (2011)

  3. Bi, J., Gunn, S.R.: A stochastic gradient method with biased estimation for faster nonconvex optimization. In: Pacific Rim International Conference on Artificial Intelligence, pp. 337–349. Springer (2019)

  4. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  5. Chee, J., Toulis, P.: Convergence diagnostics for stochastic gradient descent with constant learning rate. In: International Conference on Artificial Intelligence and Statistics, pp. 1476–1485 (2018)

  6. Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)

  7. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25(3), 463–483 (1954)

  8. Curtis, F.E., Scheinberg, K.: Adaptive stochastic optimization. arXiv preprint arXiv:2001.06699 (2020)

  9. Devroye, L., Györfi, L., Lugosi, G.: A probabilistic theory of pattern recognition, vol. 31. Springer Science & Business Media, Berlin (2013)

  10. Durrett, R.: Probability: theory and examples, 4th edn. Cambridge University Press, Cambridge (2010)

  11. Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stoch.: Int. J. Probab. Stoch. Process. 9(1–2), 1–36 (1983)

  12. Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Stat., pp. 191–200 (1967)

  13. Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019)

  14. Farrell, R.: Bounded length confidence intervals for the zero of a regression function. Ann. Math. Stat., pp. 237–247 (1962)

  15. Fehrman, B., Gess, B., Jentzen, A.: Convergence rates for the stochastic gradient descent method for non-convex objective functions. arXiv preprint arXiv:1904.01517 (2019)

  16. Foster, D.J., Sekhari, A., Sridharan, K.: Uniform convergence of gradients for non-convex learning and optimization. In: Advances in Neural Information Processing Systems, pp. 8745–8756 (2018)

  17. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  18. Gower, R.M., Sebbouh, O., Loizou, N.: SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. arXiv preprint arXiv:2006.10311 (2020)

  19. Hu, W., Li, C.J., Li, L., Liu, J.G.: On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562 (2017)

  20. Huang, F., Chen, S.: Linear convergence of accelerated stochastic gradient descent for nonconvex nonsmooth optimization. arXiv preprint arXiv:1704.07953 (2017)

  21. Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.I.: On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811 (2019)

  22. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  23. Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329 (2020)

  24. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)

  25. Lei, J., Shanbhag, U.V.: A randomized block proximal variable sample-size stochastic gradient method for composite nonconvex stochastic optimization. arXiv preprint arXiv:1808.02543 (2018)

  26. Lei, Y., Hu, T., Li, G., Tang, K.: Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Transactions on Neural Networks and Learning Systems (2019)

  27. Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114 (2018)

  28. Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 5564–5574 (2018)

  29. Ma, Y., Klabjan, D.: Convergence analysis of batch normalization for deep neural nets. CoRR, arXiv:1705.08011 (2017)

  30. McDiarmid, C.: Concentration. In: Probabilistic methods for algorithmic discrete mathematics, pp. 195–248. Springer (1998)

  31. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. arXiv preprint arXiv:2006.11144 (2020)

  32. Mirozahmedov, F., Uryasev, S.: Adaptive stepsize regulation for stochastic optimization algorithm. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 23(6), 1314–1325 (1983)

  33. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  34. Park, S., Jung, S.H., Pardalos, P.M.: Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. arXiv preprint arXiv:1906.11417 (2019)

  35. Patel, V.: Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016)

  36. Patel, V.: The impact of local geometry and batch size on the convergence and divergence of stochastic gradient descent. arXiv preprint arXiv:1709.04718 (2017)

  37. Pflug, G.C.: Stepsize rules, stopping times and their implementation in stochastic quasi-gradient algorithms. In: Numerical Techniques for Stochastic Optimization, pp. 353–372 (1988)

  38. Prechelt, L.: Early stopping – but when? In: Neural Networks: Tricks of the Trade, pp. 55–69. Springer (1998)

  39. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: International conference on machine learning, pp. 314–323 (2016)

  40. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

  41. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat., pp. 400–407 (1951)

  42. Roy, V.: Convergence diagnostics for Markov chain Monte Carlo. Annual Rev. Stat. Its Appl. 7, 387–412 (2020)

  43. Sielken, R.L.: Stopping times for stochastic approximation procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 26(1), 67–75 (1973)

  44. Stroup, D.F., Braun, H.I.: On a new stopping rule for stochastic approximation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 60(4), 535–554 (1982)

  45. Van der Vaart, A.W.: Asymptotic statistics, vol. 3. Cambridge University Press, Cambridge (2000)

  46. Wada, T., Itani, T., Fujisaki, Y.: A stopping rule for linear stochastic approximation. In: 49th IEEE Conference on Decision and Control (CDC), pp. 4171–4176. IEEE (2010)

  47. Wang, X., Wang, X., Yuan, Y.X.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34(5), 922–948 (2019)

  48. Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811 (2018)

  49. Wu, L.: Mixed effects models for complex data. Chapman and Hall/CRC, Florida (2009)

  50. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017)

  51. Yin, G.: A stopping rule for the Robbins–Monro method. J. Optim. Theory Appl. 67(1), 151–173 (1990)

  52. Yu, H., Jin, R.: On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346 (2019)

  53. Zhang, P., Lang, H., Liu, Q., Xiao, L.: Statistical adaptive stochastic gradient methods. arXiv preprint arXiv:2002.10597 (2020)

  54. Zhou, Y.: Nonconvex optimization in machine learning: Convergence, landscape, and generalization. Ph.D. thesis, The Ohio State University (2018)

  55. Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of Adam and RMSProp. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135 (2019)

  56. Zoutendijk, G.: Nonlinear programming, computational methods. In: Integer and Nonlinear Programming, pp. 37–86 (1970)

Acknowledgements

We thank the reviewers for their detailed feedback, which has greatly improved the quality of this work.

Author information

Corresponding author

Correspondence to Vivak Patel.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the Wisconsin Alumni Research Foundation.

About this article

Cite this article

Patel, V. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Math. Program. 195, 693–734 (2022). https://doi.org/10.1007/s10107-021-01710-6

