Abstract
Stopping criteria for Stochastic Gradient Descent (SGD) methods play important roles, from enabling adaptive step size schemes to providing rigor for downstream analyses such as asymptotic inference. Unfortunately, current stopping criteria for SGD methods are often heuristics that rely on asymptotic normality results or convergence to stationary distributions, which may fail to exist for nonconvex functions, thereby limiting the applicability of such stopping criteria. To address this issue, in this work, we rigorously develop two stopping criteria for SGD that can be applied to a broad class of nonconvex functions, which we term Bottou-Curtis-Nocedal functions. Moreover, as a prerequisite for developing these stopping criteria, we prove that the gradient function evaluated at SGD's iterates converges strongly to zero for Bottou-Curtis-Nocedal functions, which addresses an open question in the SGD literature. As a result of our work, our rigorously developed stopping criteria can be used to develop new adaptive step size schemes or bolster other downstream analyses for nonconvex functions.
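To make the role of a stopping criterion concrete, the following is a minimal sketch (not the paper's criteria) of SGD on a toy nonconvex objective that terminates once a running average of squared stochastic gradient norms falls below a tolerance. The objective, the window size, and the tolerance are all illustrative assumptions.

```python
import numpy as np

def sgd_with_stop(grad_est, x0, step=0.01, window=200, tol=1e-3,
                  max_iter=100_000, seed=0):
    """Run SGD; stop when a running average of squared stochastic
    gradient norms drops below tol (a generic heuristic sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    avg = None
    for k in range(max_iter):
        g = grad_est(x, rng)          # noisy gradient estimate
        x = x - step * g              # SGD update
        sq = float(g @ g)             # squared norm of this estimate
        # moving average over (roughly) the last `window` iterates
        avg = sq if avg is None else avg + (sq - avg) / min(k + 1, window)
        if k >= window and avg < tol:
            return x, k               # stopped by the criterion
    return x, max_iter                # iteration budget exhausted

# Toy nonconvex objective f(x) = sum(x_i^2 + 0.5*sin(x_i)^2),
# whose gradient 2x + 0.5*sin(2x) is observed with additive noise.
def noisy_grad(x, rng):
    return 2 * x + 0.5 * np.sin(2 * x) + 0.01 * rng.standard_normal(x.shape)

x_final, iters = sgd_with_stop(noisy_grad, np.ones(5))
```

Note that because the gradient estimates are noisy, the running average never reaches exactly zero; the tolerance must sit above the noise floor of the estimator, which is one reason rigorously justified criteria are needed.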
Notes
For a review of recent approaches to adaptive step size procedures, see [8].
Some recent work disputes whether early termination leads to better generalization; see [1] and related work. However, even in this case, one needs to know that a minimizer has been achieved.
This includes problems that are nonconvex yet locally strongly convex around their minimizers.
The notions of “sufficiently small” or “too large” are dependent on the application, just as they are in deterministic optimization.
We could drop the last term in the optimization problem as it is a constant.
Since \(P_k\) is random, its Schur decomposition is random.
References
Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)
Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010(1–38), 3 (2011)
Bi, J., Gunn, S.R.: A stochastic gradient method with biased estimation for faster nonconvex optimization. In: Pacific Rim International Conference on Artificial Intelligence, pp. 337–349. Springer (2019)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Chee, J., Toulis, P.: Convergence diagnostics for stochastic gradient descent with constant learning rate. In: International Conference on Artificial Intelligence and Statistics, pp. 1476–1485 (2018)
Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)
Chung, K.L., et al.: On a stochastic approximation method. Ann. Math. Stat. 25(3), 463–483 (1954)
Curtis, F.E., Scheinberg, K.: Adaptive stochastic optimization. arXiv preprint arXiv:2001.06699 (2020)
Devroye, L., Györfi, L., Lugosi, G.: A probabilistic theory of pattern recognition, vol. 31. Springer Science & Business Media, Berlin (2013)
Durrett, R.: Probability: theory and examples, 4th edn. Cambridge University Press, Cambridge (2010)
Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stoch.: Int. J. Probab. Stoch. Process. 9(1–2), 1–36 (1983)
Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Stat. pp. 191–200 (1967)
Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019)
Farrell, R.: Bounded length confidence intervals for the zero of a regression function. Ann. Math. Stat. pp. 237–247 (1962)
Fehrman, B., Gess, B., Jentzen, A.: Convergence rates for the stochastic gradient descent method for non-convex objective functions. arXiv preprint arXiv:1904.01517 (2019)
Foster, D.J., Sekhari, A., Sridharan, K.: Uniform convergence of gradients for non-convex learning and optimization. In: Advances in Neural Information Processing Systems, pp. 8745–8756 (2018)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)
Gower, R.M., Sebbouh, O., Loizou, N.: SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. arXiv preprint arXiv:2006.10311 (2020)
Hu, W., Li, C.J., Li, L., Liu, J.G.: On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562 (2017)
Huang, F., Chen, S.: Linear convergence of accelerated stochastic gradient descent for nonconvex nonsmooth optimization. arXiv preprint arXiv:1704.07953 (2017)
Jin, C., Netrapalli, P., Ge, R., Kakade, S.M., Jordan, M.I.: On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811 (2019)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)
Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329 (2020)
Kiefer, J., Wolfowitz, J., et al.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)
Lei, J., Shanbhag, U.V.: A randomized block proximal variable sample-size stochastic gradient method for composite nonconvex stochastic optimization. arXiv preprint arXiv:1808.02543 (2018)
Lei, Y., Hu, T., Li, G., Tang, K.: Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Transactions on Neural Networks and Learning Systems (2019)
Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114 (2018)
Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 5564–5574 (2018)
Ma, Y., Klabjan, D.: Convergence analysis of batch normalization for deep neural nets. arXiv preprint arXiv:1705.080112 (2017)
McDiarmid, C.: Concentration. In: Probabilistic methods for algorithmic discrete mathematics, pp. 195–248. Springer (1998)
Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. arXiv preprint arXiv:2006.11144 (2020)
Mirozahmedov, F., Uryasev, S.: Adaptive stepsize regulation for stochastic optimization algorithm. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 23(6), 1314–1325 (1983)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Park, S., Jung, S.H., Pardalos, P.M.: Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. arXiv preprint arXiv:1906.11417 (2019)
Patel, V.: Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016)
Patel, V.: The impact of local geometry and batch size on the convergence and divergence of stochastic gradient descent. arXiv preprint arXiv:1709.047189 (2017)
Pflug, G.C.: Stepsize rules, stopping times and their implementation in stochastic quasi-gradient algorithms. In: Numerical Techniques for Stochastic Optimization, pp. 353–372 (1988)
Prechelt, L.: Early stopping-but when? In: Neural Networks: Tricks of the trade, pp. 55–69. Springer (1998)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: International conference on machine learning, pp. 314–323 (2016)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. pp. 400–407 (1951)
Roy, V.: Convergence diagnostics for Markov chain Monte Carlo. Annu. Rev. Stat. Appl. 7, 387–412 (2020)
Sielken, R.L.: Stopping times for stochastic approximation procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 26(1), 67–75 (1973)
Stroup, D.F., Braun, H.I.: On a new stopping rule for stochastic approximation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 60(4), 535–554 (1982)
Van der Vaart, A.W.: Asymptotic statistics, vol. 3. Cambridge University Press, Cambridge (2000)
Wada, T., Itani, T., Fujisaki, Y.: A stopping rule for linear stochastic approximation. In: 49th IEEE Conference on Decision and Control (CDC), pp. 4171–4176. IEEE (2010)
Wang, X., Wang, X., Yuan, Y.X.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34(5), 922–948 (2019)
Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811 (2018)
Wu, L.: Mixed effects models for complex data. Chapman and Hall/CRC, Florida (2009)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017)
Yin, G.: A stopping rule for the Robbins-Monro method. J. Optim. Theory Appl. 67(1), 151–173 (1990)
Yu, H., Jin, R.: On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346 (2019)
Zhang, P., Lang, H., Liu, Q., Xiao, L.: Statistical adaptive stochastic gradient methods. arXiv preprint arXiv:2002.10597 (2020)
Zhou, Y.: Nonconvex optimization in machine learning: Convergence, landscape, and generalization. Ph.D. thesis, The Ohio State University (2018)
Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of Adam and RMSProp. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135 (2019)
Zoutendijk, G.: Nonlinear programming, computational methods. In: Integer and Nonlinear Programming, pp. 37–86 (1970)
Acknowledgements
We thank the reviewers for their detailed feedback, which has greatly improved the quality of this work.
This work is supported by the Wisconsin Alumni Research Foundation.
Cite this article
Patel, V. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Math. Program. 195, 693–734 (2022). https://doi.org/10.1007/s10107-021-01710-6