Recent Theoretical Advances in Non-Convex Optimization

Chapter in: High-Dimensional Optimization and Probability

Abstract

Motivated by the recent increase of interest in optimization algorithms for non-convex optimization, arising in applications such as training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that, in general, non-convex problems cannot be solved efficiently in a reasonable time. Then we list problems for which the global minimizer can be found efficiently by exploiting the structure of the problem as much as possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results on the convergence rates of deterministic first-order methods, followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes and an overview of stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as the minimization of α-weakly quasi-convex functions and of functions satisfying the Polyak–Łojasiewicz condition, for which theoretical convergence guarantees for first-order methods can still be obtained. Finally, we consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.
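
For reference, the two function classes highlighted above admit the following standard formulations (stated here in common notation, which may differ slightly from the chapter's own). A differentiable function f with minimal value \(f^*\) satisfies the Polyak–Łojasiewicz (PL) condition with parameter \(\mu > 0\) if

\[ \tfrac{1}{2}\,\|\nabla f(x)\|_2^2 \;\ge\; \mu\bigl(f(x) - f^*\bigr) \quad \text{for all } x, \]

and f is α-weakly quasi-convex with respect to a minimizer \(x^*\) with parameter \(\alpha \in (0,1]\) if

\[ \alpha\bigl(f(x) - f(x^*)\bigr) \;\le\; \langle \nabla f(x),\, x - x^* \rangle \quad \text{for all } x. \]

Both classes contain non-convex functions, yet they still allow global convergence guarantees for gradient-type methods.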

Notes

  1. See also this webpage with a continually updated list of references: https://sunju.org/research/nonconvex/.

  2. By \(\phi(a)\), where \(a = (a_1,\ldots ,a_n)^\top \in {\mathbb R}^n\) is a multidimensional vector, we mean the vector \((\phi(a_1), \ldots, \phi(a_n))\) (see the brief example after these notes).

  3. Here \(\mathbb {E}_{\xi _k}[\cdot ]\) denotes the mathematical expectation conditioned on everything except \(\xi_k\), i.e., the expectation is taken w.r.t. the randomness coming only from \(\xi_k\).

  4. In the original paper [160], the authors considered a more general situation in which the stochastic realizations \(f(x, \xi)\) have Hölder-continuous gradients.

  5. This technique is applied in distributed optimization to reduce the overall communication cost (see, e.g., [4, 27, 113]). However, methods for distributed optimization are outside the scope of our survey.

  6. For simplicity, we neglect all parameters except m and ε; see the details in Table 2.

  7. To distinguish exponents from superscript indices, we use braces {⋅} for exponents.

  8. In fact, most of the results from [118] do not rely on the finite-sum structure of f.
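
To illustrate the componentwise notation from Note 2 with a concrete instance (the choice of \(\phi\) below is only an example, not taken from the chapter): for \(\phi(t) = \max\{0, t\}\) and \(a = (1, -2, 3)^\top\),

\[ \phi(a) = (\phi(1), \phi(-2), \phi(3)) = (1, 0, 3). \]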

References

  1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima faster than gradient descent, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199 (2017)

    Google Scholar 

  2. K. Ahn, C. Yun, S. Sra, Sgd with shuffling: optimal rates without component convexity and large epoch requirements. Adv. Neural Inf. Process. Syst. 33 (2020)

    Google Scholar 

  3. A. Ajalloeian, S.U. Stich, Analysis of sgd with biased gradient estimators. Preprint (2020). arXiv:2008.00051

    Google Scholar 

  4. D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, Qsgd: Communication-efficient sgd via gradient quantization and encoding. Adv. Neural Inf. Process. Syst. 1709–1720 (2017)

    Google Scholar 

  5. Z. Allen-Zhu, Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter, in International Conference on Machine Learning, pp. 89–97 (2017)

    Google Scholar 

  6. Z. Allen-Zhu, How to make the gradients small stochastically: Even faster convex and nonconvex sgd, in Advances in Neural Information Processing Systems, pp. 1157–1167 (2018)

    Google Scholar 

  7. Z. Allen-Zhu, Katyusha x: Simple momentum method for stochastic sum-of-nonconvex optimization, in International Conference on Machine Learning, pp. 179–185 (2018)

    Google Scholar 

  8. Z. Allen-Zhu, Natasha 2: Faster non-convex optimization than sgd, in Advances in Neural Information Processing Systems, pp. 2675–2686 (2018)

    Google Scholar 

  9. Z. Allen-Zhu, Y. Li, Neon2: Finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)

    Google Scholar 

  10. Z. Allen-Zhu, Y. Li, Can sgd learn recurrent neural networks with provable generalization? in Advances in Neural Information Processing Systems, pp. 10331–10341 (2019)

    Google Scholar 

  11. Z. Allen-Zhu, Y. Li, Y. Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, in Advances in Neural Information Processing Systems, pp. 6158–6169 (2019)

    Google Scholar 

  12. Z. Allen-Zhu, Y. Li, Z. Song, A convergence theory for deep learning via over-parameterization, in International Conference on Machine Learning, pp. 242–252 (PMLR, 2019)

    Google Scholar 

  13. Z. Allen-Zhu, Y. Li, Z. Song, On the convergence rate of training recurrent neural networks, in Advances in Neural Information Processing Systems, pp. 6676–6688 (2019)

    Google Scholar 

  14. A. Anandkumar, R. Ge, Efficient approaches for escaping higher order saddle points in non-convex optimization, in Conference on Learning Theory, pp. 81–102 (PMLR, 2016)

    Google Scholar 

  15. Y. Arjevani, Y. Carmon, J.C. Duchi, D.J. Foster, N. Srebro, B. Woodworth, Lower bounds for non-convex stochastic optimization. Preprint (2019). arXiv:1912.02365

    Google Scholar 

  16. Y. Arjevani, Y. Carmon, J.C. Duchi, D.J. Foster, A. Sekhari, K. Sridharan, Second-order information in non-convex stochastic optimization: Power and limitations, in Conference on Learning Theory, pp. 242–299 (2020)

    Google Scholar 

  17. S. Arora, N. Cohen, N. Golowich, W. Hu, A convergence analysis of gradient descent for deep linear neural networks. Preprint (2018). arXiv:1810.02281

    Google Scholar 

  18. F. Bach, R. Jenatton, J. Mairal, G. Obozinski, et al., Optimization with sparsity-inducing penalties. Found. Trends® Mach. Learn. 4(1), 1–106 (2012)

    Google Scholar 

  19. R. Baraniuk, M. Davenport, R. DeVore, M. Wakin, A simple proof of the restricted isometry property for random matrices. Constructive Approximation 28(3), 253–263 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  20. A. Bazarova, A. Beznosikov, A. Gasnikov, Linearly convergent gradient-free methods for minimization of symmetric parabolic approximation. Preprint (2020). arXiv:2009.04906

    Google Scholar 

  21. M. Belkin, Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica 30, 203–248 (2021)

    Article  MathSciNet  Google Scholar 

  22. A. Ben-Tal, A. Nemirovski, Lectures on Modern Convex Optimization (Society for Industrial and Applied Mathematics, 2001)

    Google Scholar 

  23. A.S. Berahas, L. Cao, K. Choromanski, K. Scheinberg, Linear interpolation gives better gradients than gaussian smoothing in derivative-free optimization (2019)

    Google Scholar 

  24. A.S. Berahas, L. Cao, K. Scheinberg, Global convergence rate analysis of a generic line search algorithm with noise (2019)

    Google Scholar 

  25. A.S. Berahas, L. Cao, K. Choromanski, K. Scheinberg, A theoretical and empirical comparison of gradient approximations in derivative-free optimization (2020)

    Google Scholar 

  26. E.H. Bergou, E. Gorbunov, P. Richtárik, Stochastic three points method for unconstrained smooth minimization (2019)

    Google Scholar 

  27. A. Beznosikov, S. Horváth, P. Richtárik, M. Safaryan, On biased compression for distributed learning. Preprint (2020). arXiv:2002.12410

    Google Scholar 

  28. S. Bhojanapalli, A. Kyrillidis, S. Sanghavi, Dropping convexity for faster semi-definite optimization, in Conference on Learning Theory, pp. 530–582 (2016)

    Google Scholar 

  29. E.G. Birgin, J. Gardenghi, J.M. Martínez, S.A. Santos, P.L. Toint, Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming 163(1–2), 359–368 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  30. A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science (Cambridge University Press, 2016)

    Google Scholar 

  31. A. Blum, R.L. Rivest, Training a 3-node neural network is np-complete, in Advances in Neural Information Processing Systems, pp. 494–501 (1989)

    Google Scholar 

  32. T. Blumensath, M.E. Davies, Iterative hard thresholding for compressed sensing. Appl. Comput. Harmonic Anal. 27(3), 265–274 (2009)

    Article  MathSciNet  MATH  Google Scholar 

  33. L. Bogolubsky, P. Dvurechensky, A. Gasnikov, G. Gusev, Y. Nesterov, A.M. Raigorodskii, A. Tikhonov, M. Zhukovskii, Learning supervised pagerank with gradient-based and gradient-free optimization methods, in Advances in Neural Information Processing Systems 29, ed. by D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (Curran Associates, Inc., 2016), pp. 4914–4922. arXiv:1603.00717

    Google Scholar 

  34. L. Bottou, Curiously fast convergence of some stochastic gradient descent algorithms, in Proceedings of the Symposium on Learning and Data Science, Paris (2009)

    Google Scholar 

  35. L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT’2010 (Springer, 2010), pp. 177–186

    Google Scholar 

  36. L. Bottou, Stochastic gradient descent tricks, in Neural Networks: Tricks of the Trade (Springer, 2012), pp. 421–436

    Google Scholar 

  37. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Review 60(2), 223–311 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  38. S. Boyd, L. Vandenberghe, Convex Optimization (NY Cambridge University Press, 2004)

    Google Scholar 

  39. J. Bu, M. Mesbahi, A note on Nesterov’s accelerated method in nonconvex optimization: a weak estimate sequence approach. Preprint (2020). arXiv:2006.08548

    Google Scholar 

  40. S. Bubeck, Introduction to online optimization (2011)

    Google Scholar 

  41. S. Bubeck, Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (Nov 2015)

    Article  MATH  Google Scholar 

  42. E.J. Candès, B. Recht, Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009)

    Google Scholar 

  43. E.J. Candès, T. Tao, Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  44. E.J. Candès, T. Tao, The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 56(5), 2053–2080 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  45. E.J. Candès, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl. 14(5–6), 877–905 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  46. E.J. Candès, X. Li, M. Soltanolkotabi, Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  47. Y. Carmon, J.C. Duchi, Gradient descent efficiently finds the cubic-regularized non-convex newton step. Preprint (2016). arXiv:1612.00547

    Google Scholar 

  48. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, “Convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions, in Proceedings of Machine Learning Research, vol. 70 (International Convention Centre, Sydney, Australia, 06–11 Aug 2017), pp. 654–663. PMLR

    Google Scholar 

  49. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  50. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Lower bounds for finding stationary points II: first-order methods. Mathematical Programming (Sep 2019)

    Google Scholar 

  51. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Lower bounds for finding stationary points I. Mathematical Programming 184(1), 71–120 (Nov 2020)

    Article  MathSciNet  MATH  Google Scholar 

  52. C. Cartis, N.I. Gould, P.L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)

    Google Scholar 

  53. C. Cartis, N.I.M. Gould, P.L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function- and derivative-evaluation complexity. Mathematical Programming 130(2), 295–319 (2011)

    Google Scholar 

  54. C. Cartis, N.I.M. Gould, P.L. Toint, Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models (2018). arXiv:1708.04044

    Google Scholar 

  55. C. Cartis, N.I. Gould, P.L. Toint, Universal regularization methods: Varying the power, the smoothness and the accuracy. SIAM J. Optim. 29(1), 595–615 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  56. V. Charisopoulos, A.R. Benson, A. Damle, Entrywise convergence of iterative methods for eigenproblems. Preprint (2020). arXiv:2002.08491

    Google Scholar 

  57. Y. Chen, Y. Chi, Harnessing structures in big data via guaranteed low-rank matrix estimation. Preprint (2018). arXiv:1802.08397

    Google Scholar 

  58. Z. Chen, T. Yang, A variance reduction method for non-convex optimization with improved convergence under large condition number. Preprint (2018). arXiv:1809.06754

    Google Scholar 

  59. Z. Chen, Y. Zhou, Momentum with variance reduction for nonconvex composition optimization. Preprint (2020). arXiv:2005.07755

    Google Scholar 

  60. X. Chen, S. Liu, R. Sun, M. Hong, On the convergence of a class of adam-type algorithms for non-convex optimization. Preprint (2018). arXiv:1808.02941

    Google Scholar 

  61. Y. Chen, Y. Chi, J. Fan, C. Ma, Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1–2), 5–37 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  62. Y. Chi, Y.M. Lu, Y. Chen, Nonconvex optimization meets low-rank matrix factorization: An overview. Preprint (2018). arXiv:1809.09573

    Google Scholar 

  63. Collection of optimizers for pytorch, https://github.com/jettify/pytorch-optimizer

  64. P.L. Combettes, J.-C. Pesquet, Proximal splitting methods in signal processing, in Fixed-Point Algorithms for Inverse Problems in Science and Engineering (Springer, 2011), pp. 185–212

    Google Scholar 

  65. A. Conn, N. Gould, P. Toint, Trust Region Methods (Society for Industrial and Applied Mathematics, 2000)

    Google Scholar 

  66. A. Conn, K. Scheinberg, L. Vicente, Introduction to Derivative-Free Optimization (Society for Industrial and Applied Mathematics, 2009)

    Google Scholar 

  67. F.E. Curtis, K. Scheinberg, Optimization methods for supervised machine learning: From linear models to deep learning. Preprint (2017). arXiv:1706.10207

    Google Scholar 

  68. A. Cutkosky, F. Orabona, Momentum-based variance reduction in non-convex sgd, in Advances in Neural Information Processing Systems, pp. 15236–15245 (2019)

    Google Scholar 

  69. C.D. Dang, G. Lan, Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM J. Optim. 25(2), 856–881 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  70. D. Davis, D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  71. A. Defazio, Understanding the role of momentum in non-convex optimization: Practical insights from a lyapunov analysis. Preprint (2020). arXiv:2010.00406

    Google Scholar 

  72. A. Defazio, L. Bottou, On the ineffectiveness of variance reduced optimization for deep learning, in Advances in Neural Information Processing Systems, pp. 1753–1763 (2019)

    Google Scholar 

  73. A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, in Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14 (MIT Press, Cambridge, MA, USA, 2014), pp. 1646–1654

    Google Scholar 

  74. A. Defazio, J. Domke, et al., Finito: A faster, permutable incremental gradient method for big data problems, in International Conference on Machine Learning, pp. 1125–1133 (2014)

    Google Scholar 

  75. A. Défossez, L. Bottou, F. Bach, N. Usunier, On the convergence of adam and adagrad. Preprint (2020). arXiv:2003.02395

    Google Scholar 

  76. V. Demin, D. Nekhaev, I. Surazhevsky, K. Nikiruy, A. Emelyanov, S. Nikolaev, V. Rylkov, M. Kovalchuk, Necessary conditions for stdp-based pattern recognition learning in a memristive spiking neural network. Neural Networks 134, 64–75 (2021)

    Article  Google Scholar 

  77. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint (2018). arXiv:1810.04805

    Google Scholar 

  78. J. Diakonikolas, M.I. Jordan, Generalized momentum-based methods: A Hamiltonian perspective. Preprint (2019). arXiv:1906.00436

    Google Scholar 

  79. T. Ding, D. Li, R. Sun, Spurious local minima exist for almost all over-parameterized neural networks (2019)

    Google Scholar 

  80. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul.), 2121–2159 (2011)

    MathSciNet  MATH  Google Scholar 

  81. J. Duchi, M.I. Jordan, B. McMahan, Estimation, optimization, and parallelism when data is sparse, in Advances in Neural Information Processing Systems, pp. 2832–2840 (2013)

    Google Scholar 

  82. D. Dvinskikh, A. Ogaltsov, A. Gasnikov, P. Dvurechensky, V. Spokoiny, On the line-search gradient methods for stochastic optimization. IFAC-PapersOnLine 53(2), 1715–1720 (2020). 21th IFAC World Congress. arXiv:1911.08380

    Google Scholar 

  83. P. Dvurechensky, Gradient method with inexact oracle for composite non-convex optimization (2017). arXiv:1703.09180

    Google Scholar 

  84. P. Dvurechensky, M. Staudigl, Hessian barrier algorithms for non-convex conic optimization (2021). arXiv:2111.00100

    Google Scholar 

  85. P. Dvurechensky, M. Staudigl, C.A. Uribe, Generalized self-concordant Hessian-barrier algorithms (2019). arXiv:1911.01522. WIAS Preprint No. 2693

    Google Scholar 

  86. P.E. Dvurechensky, A.V. Gasnikov, E.A. Nurminski, F.S. Stonyakin, Advances in Low-Memory Subgradient Optimization (Springer International Publishing, Cham, 2020), pp. 19–59. arXiv:1902.01572

    Google Scholar 

  87. P. Dvurechensky, E. Gorbunov, A. Gasnikov, An accelerated directional derivative method for smooth stochastic convex optimization. Eur. J. Oper. Res. 290(2), 601–621 (2021)

    Article  MathSciNet  MATH  Google Scholar 

  88. P. Dvurechensky, S. Shtern, M. Staudigl, First-order methods for convex optimization. EURO J. Comput. Optim. 9, 100015 (2021). arXiv:2101.00935

    Google Scholar 

  89. N. Emmenegger, R. Kyng, A.N. Zehmakan, On the oracle complexity of higher-order smooth non-convex finite-sum optimization. Preprint (2021). arXiv:2103.05138

    Google Scholar 

  90. Y.G. Evtushenko, Numerical methods for finding global extrema (case of a non-uniform mesh). USSR Comput. Math. Math. Phys. 11(6), 38–54 (1971)

    Article  Google Scholar 

  91. C. Fang, C.J. Li, Z. Lin, T. Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, pp. 689–699 (2018)

    Google Scholar 

  92. C. Fang, Z. Lin, T. Zhang, Sharp analysis for nonconvex sgd escaping from saddle points, in Conference on Learning Theory, pp. 1192–1234 (2019)

    Google Scholar 

  93. I. Fatkhullin, B. Polyak, Optimizing static linear feedback: Gradient method. Preprint (2020). arXiv:2004.09875

    Google Scholar 

  94. M. Fazel, R. Ge, S.M. Kakade, M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator (2019)

    Google Scholar 

  95. S. Feizi, H. Javadi, J. Zhang, D. Tse, Porcupine neural networks:(almost) all local optima are global. Preprint (2017). arXiv:1710.02196

    Google Scholar 

  96. A.D. Flaxman, A.T. Kalai, H.B. McMahan, Online convex optimization in the bandit setting: Gradient descent without a gradient, in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05 (Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2005), pp. 385–394

    Google Scholar 

  97. C.A. Floudas, P.M. Pardalos, Encyclopedia of Optimization (Springer Science & Business Media, 2008)

    Google Scholar 

  98. A. Gasnikov, Universal Gradient Descent (MCCME, Moscow, 2021)

    Google Scholar 

  99. A. Gasnikov, P. Dvurechensky, M. Zhukovskii, S. Kim, S. Plaunov, D. Smirnov, F. Noskov, About the power law of the pagerank vector component distribution. part 2. The buckley–osthus model, verification of the power law for this model, and setup of real search engines. Numer. Anal. Appl. 11(1), 16–32 (2018)

    Google Scholar 

  100. R. Ge, J. Zou, Intersecting faces: Non-negative matrix factorization with new guarantees, in International Conference on Machine Learning, pp. 2295–2303 (PMLR, 2015)

    Google Scholar 

  101. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points–online stochastic gradient for tensor decomposition, in Conference on Learning Theory, pp. 797–842 (2015)

    Google Scholar 

  102. S. Ghadimi, G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013). arXiv:1309.5549

    Google Scholar 

  103. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156(1), 59–99 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  104. S. Ghadimi, G. Lan, H. Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155(1), 267–305 (2016). arXiv:1308.6594

    Google Scholar 

  105. S. Ghadimi, G. Lan, H. Zhang, Generalized uniformly optimal methods for nonlinear programming. J. Scientif. Comput. 79(3), 1854–1881 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  106. M.X. Goemans, D.P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM (JACM) 42(6), 1115–1145 (1995)

    Google Scholar 

  107. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 2016). http://www.deeplearningbook.org

  108. I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep learning, vol. 1 (MIT Press Cambridge, 2016)

    Google Scholar 

  109. E. Gorbunov, P. Dvurechensky, A. Gasnikov, An accelerated method for derivative-free smooth stochastic convex optimization. Preprint (2018). arXiv:1802.09022 (accepted to SIOPT)

    Google Scholar 

  110. E. Gorbunov, M. Danilova, A. Gasnikov, Stochastic optimization with heavy-tailed noise via accelerated gradient clipping, in Advances in Neural Information Processing Systems, vol. 33, ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (Curran Associates, Inc., 2020), pp. 15042–15053

    Google Scholar 

  111. E. Gorbunov, F. Hanzely, P. Richtárik, A unified theory of sgd: Variance reduction, sampling, quantization and coordinate descent, in International Conference on Artificial Intelligence and Statistics, pp. 680–690 (2020)

    Google Scholar 

  112. E.A. Gorbunov, A. Bibi, O. Sener, E.H. Bergou, P. Richtárik, A stochastic derivative free optimization method with momentum, in ICLR (2020)

    Google Scholar 

  113. E. Gorbunov, K.P. Burlachenko, Z. Li, P. Richtarik, Marina: Faster non-convex distributed learning with compression, in Proceedings of the 38th International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research, ed. by M. Meila, T. Zhang (PMLR, 18–24 Jul 2021), pp. 3788–3798

    Google Scholar 

  114. E. Gorbunov, M. Danilova, I. Shibaev, P. Dvurechensky, A. Gasnikov, Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise (2021). arXiv:2106.05958

    Google Scholar 

  115. A. Gorodetskiy, A. Shlychkova, A.I. Panov, Delta schema network in model-based reinforcement learning, in Artificial General Intelligence, ed. by B. Goertzel, A. I. Panov, A. Potapov, R. Yampolskiy (Springer International Publishing, Cham, 2020), pp. 172–182

    Chapter  Google Scholar 

  116. A. Gotmare, N. S. Keskar, C. Xiong, R. Socher, A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. Preprint (2018). arXiv:1810.13243

    Google Scholar 

  117. R.M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik, Sgd: General analysis and improved rates, in International Conference on Machine Learning, pp. 5200–5209 (2019)

    Google Scholar 

  118. R. Gower, O. Sebbouh, N. Loizou, Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation, in International Conference on Artificial Intelligence and Statistics (PMLR, 2021), pp. 1315–1323

    Google Scholar 

  119. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch sgd: Training imagenet in 1 hour. Preprint (2017). arXiv:1706.02677

    Google Scholar 

  120. A.O. Griewank, Generalized descent for global optimization. J. Optim. Theory Appl. 34(1), 11–39 (1981)

    Article  MathSciNet  MATH  Google Scholar 

  121. S. Guminov, A. Gasnikov, Accelerated methods for alpha-weakly-quasi-convex problems. Preprint (2017). arXiv:1710.00797

    Google Scholar 

  122. S.V. Guminov, Y.E. Nesterov, P.E. Dvurechensky, A.V. Gasnikov, Accelerated primal-dual gradient descent with linesearch for convex, nonconvex, and nonsmooth optimization problems. Doklady Mathematics 99(2), 125–128 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  123. S. Guminov, P. Dvurechensky, N. Tupitsa, A. Gasnikov, On a combination of alternating minimization and Nesterov’s momentum, in Proceedings of the 38th International Conference on Machine Learning, vol. 145 of Proceedings of Machine Learning Research, Virtual (PMLR, 18–24 Jul 2021). arXiv:1906.03622. WIAS Preprint No. 2695

    Google Scholar 

  124. B.D. Haeffele, R. Vidal, Global optimality in neural network training, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7331–7339 (2017)

    Google Scholar 

  125. G. Haeser, H. Liu, Y. Ye, Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Mathematical Programming 178(1), 263–299 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  126. J.Z. HaoChen, S. Sra, Random shuffling beats sgd after finite epochs. Preprint (2018). arXiv:1806.10077

    Google Scholar 

  127. E. Hazan, K. Levy, S. Shalev-Shwartz, Beyond convexity: Stochastic quasi-convex optimization, in Advances in Neural Information Processing Systems, pp. 1594–1602 (2015)

    Google Scholar 

  128. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

    Google Scholar 

  129. O. Hinder, A. Sidford, N. Sohoni, Near-optimal methods for minimizing star-convex functions and beyond, in Conference on Learning Theory (PMLR, 2020), pp. 1894–1938

    Google Scholar 

  130. T. Hofmann, A. Lucchi, S. Lacoste-Julien, B. McWilliams, Variance reduced stochastic gradient descent with neighbors, in Advances in Neural Information Processing Systems, pp. 2305–2313 (2015)

    Google Scholar 

  131. S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, P. Richtárik, Stochastic distributed learning with gradient quantization and variance reduction. Preprint (2019). arXiv:1904.05115

    Google Scholar 

  132. S.A. Ilyuhin, A.V. Sheshkus, V.L. Arlazarov, Recognition of images of Korean characters using embedded networks, in Twelfth International Conference on Machine Vision (ICMV 2019), vol. 11433, ed. by W. Osten, D.P. Nikolaev (International Society for Optics and Photonics, SPIE, 2020), pp. 273–279

    Google Scholar 

  133. P. Jain, P. Kar, Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3–4), 142–336 (2017)

    Article  MATH  Google Scholar 

  134. P. Jain, P. Netrapalli, S. Sanghavi, Low-rank matrix completion using alternating minimization, in Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665–674 (2013)

    Google Scholar 

  135. Z. Ji, M.J. Telgarsky, Gradient descent aligns the layers of deep linear networks, in 7th International Conference on Learning Representations, ICLR 2019 (2019)

    Google Scholar 

  136. K. Ji, Z. Wang, Y. Zhou, Y. Liang, Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization (2019)

    Google Scholar 

  137. C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, M.I. Jordan, How to escape saddle points efficiently. Proceedings of Machine Learning Research, vol. 70 (International Convention Centre, Sydney, Australia, 06–11 Aug 2017), pp. 1724–1732. PMLR

    Google Scholar 

  138. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in Conference On Learning Theory (PMLR, 2018), pp. 1042–1085

    Google Scholar 

  139. C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, M. I. Jordan, On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. J. ACM (JACM) 68(2), 1–29 (2021)

    Google Scholar 

  140. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, pp. 315–323 (2013)

    Google Scholar 

  141. H. Karimi, J. Nutini, M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (Springer, 2016), pp. 795–811

    Google Scholar 

  142. A. Khaled, P. Richtárik, Better theory for sgd in the nonconvex world. Preprint (2020). arXiv:2002.03329

    Google Scholar 

  143. S. Khot, G. Kindler, E. Mossel, R. O’Donnell, Optimal inapproximability results for max-cut and other 2-variable csps? SIAM J. Comput. 37(1), 319–357 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  144. A. Khritankov, Hidden feedback loops in machine learning systems: A simulation model and preliminary results, in Software Quality: Future Perspectives on Software Engineering Quality, ed. by D. Winkler, S. Biffl, D. Mendez, M. Wimmer, J. Bergsmann (Springer International Publishing, Cham, 2021), pp. 54–65

    Chapter  Google Scholar 

  145. R. Kidambi, P. Netrapalli, P. Jain, S. Kakade, On the insufficiency of existing momentum schemes for stochastic optimization, in 2018 Information Theory and Applications Workshop (ITA) (IEEE, 2018), pp. 1–9

    Google Scholar 

  146. L. Kiefer, M. Storath, A. Weinmann, Iterative potts minimization for the recovery of signals with discontinuities from indirect measurements: The multivariate case. Found. Comput. Math. 1–46 (2020)

    Google Scholar 

  147. D. P. Kingma, J. Ba, Adam: A method for stochastic optimization. Preprint (2014). arXiv:1412.6980

    Google Scholar 

  148. V. V. Kniaz, S. Y. Zheltov, F. Remondino, V. A. Knyaz, A. Gruen, Wire structure image-based 3D reconstruction aided by deep learning, vol. XLIII-B2-2020, pp. 435–441 (Copernicus, Göttingen, 2020). XXIV ISPRS Congress 2020 (held online)

    Google Scholar 

  149. V. V. Kniaz, V. A. Knyaz, V. Mizginov, A. Papazyan, N. Fomin, L. Grodzitsky, Adversarial dataset augmentation using reinforcement learning and 3d modeling, in Advances in Neural Computation, Machine Learning, and Cognitive Research IV, ed. by B. Kryzhanovsky, W. Dunin-Barkowski, V. Redko, Y. Tiumentsev (Springer International Publishing, Cham, 2021), pp. 316–329

    Chapter  Google Scholar 

  150. J. M. Kohler, A. Lucchi, Sub-sampled cubic regularization for non-convex optimization, in International Conference on Machine Learning, pp. 1895–1904 (2017)

    Google Scholar 

  151. G. Kornowski, O. Shamir, Oracle complexity in nonsmooth nonconvex optimization. Preprint (2021). arXiv:2104.06763

    Google Scholar 

  152. D. Kovalev, S. Horváth, P. Richtárik, Don’t jump through hoops and remove those loops: Svrg and katyusha are better without the outer loop, in Algorithmic Learning Theory, pp. 451–467 (2020)

    Google Scholar 

  153. A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009)

    Google Scholar 

  154. P. Kuderov, A. Panov, Planning with hierarchical temporal memory for deterministic markov decision problem, in Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART (INSTICC, SciTePress, 2021), pp. 1073–1081

    Google Scholar 

  155. T. Lacroix, N. Usunier, G. Obozinski, Canonical tensor decomposition for knowledge base completion, in International Conference on Machine Learning, pp. 2863–2872 (2018)

    Google Scholar 

  156. G. Lan, First-Order and Stochastic Optimization Methods for Machine Learning (Springer, 2020)

    Google Scholar 

  157. G. Lan, Y. Yang, Accelerated stochastic algorithms for nonconvex finite-sum and multiblock optimization. SIAM J. Optim. 29(4), 2753–2784 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  158. J. Larson, M. Menickelly, S. M. Wild, Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  159. J. C. H. Lee, P. Valiant, Optimizing star-convex functions, in 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 603–614 (2016)

    Google Scholar 

  160. Y. Lei, T. Hu, G. Li, K. Tang, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Trans. Neural Networks Learn. Syst. (2019)

    Google Scholar 

  161. K. Y. Levy, The power of normalization: Faster evasion of saddle points. Preprint (2016). arXiv:1611.04831

    Google Scholar 

  162. Y. Li, K. Lee, Y. Bresler, Identifiability in blind deconvolution with subspace or sparsity constraints. IEEE Trans. Inf. Theory 62(7), 4266–4275 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  163. D. Li, T. Ding, R. Sun, Over-parameterized deep neural networks have no strict local minima for any continuous activations. Preprint (2018). arXiv:1812.11039

    Google Scholar 

  164. Z. Li, H. Bao, X. Zhang, P. Richtárik, Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. Preprint (2020). arXiv:2008.10898

    Google Scholar 

  165. Z. Li, P. Richtárik, A unified analysis of stochastic gradient methods for nonconvex federated optimization. Preprint (2020). arXiv:2006.07013

    Google Scholar 

  166. S. Liang, R. Sun, Y. Li, R. Srikant, Understanding the loss surface of neural networks for binary classification, in International Conference on Machine Learning, pp. 2835–2843 (2018)

    Google Scholar 

  167. S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, L. Amini, Zeroth-order stochastic variance reduction for nonconvex optimization (2018)

    Google Scholar 

  168. R. Livni, S. Shalev-Shwartz, O. Shamir, On the computational efficiency of training neural networks, in Advances in Neural Information Processing Systems, pp. 855–863 (2014)

    Google Scholar 

  169. N. Loizou, S. Vaswani, I. H. Laradji, S. Lacoste-Julien, Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence, in International Conference on Artificial Intelligence and Statistics (PMLR, 2021), pp. 1306–1314

    Google Scholar 

  170. S. Lojasiewicz, A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles 117, 87–89 (1963)

    Google Scholar 

  171. I. Loshchilov, F. Hutter, Sgdr: Stochastic gradient descent with warm restarts. Preprint (2016). arXiv:1608.03983

    Google Scholar 

  172. A. Lucchi, J. Kohler, A stochastic tensor method for non-convex optimization. Preprint (2019). arXiv:1911.10367

    Google Scholar 

  173. C. Ma, K. Wang, Y. Chi, Y. Chen, Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion, in International Conference on Machine Learning (PMLR, 2018), pp. 3345–3354

    Google Scholar 

  174. S. Ma, R. Bassily, M. Belkin, The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning, in International Conference on Machine Learning (PMLR, 2018), pp. 3325–3334

    Google Scholar 

  175. J. Mairal, Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  176. D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, M. J. Wainwright, Derivative-free methods for policy optimization: Guarantees for linear quadratic systems (2020)

    Google Scholar 

  177. J. Martens, Deep learning via hessian-free optimization, in International Conference on Machine Learning, vo 27, pp. 735–742 (2010)

    Google Scholar 

  178. T. Mikolov, Statistical language models based on neural networks, Presentation at Google, Mountain View, 2nd April, 80 (2012)

    Google Scholar 

  179. K. Mishchenko, E. Gorbunov, M. Takáč, P. Richtárik, Distributed learning with compressed gradient differences. Preprint (2019). arXiv:1901.09269

    Google Scholar 

  180. K. Mishchenko, A. Khaled, P. Richtárik, Random reshuffling: Simple analysis with vast improvements. Preprint (2020). arXiv:2006.05988

    Google Scholar 

  181. K. G. Murty, S. N. Kabadi, Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming 39(2), 117–129 (1987)

    Article  MathSciNet  MATH  Google Scholar 

  182. J. A. Nelder, R. Mead, A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)

    Article  MathSciNet  MATH  Google Scholar 

  183. A. Nemirovski, Orth-method for smooth convex optimization. Izvestia AN SSSR Transl. Eng. Cybern. Soviet J. Comput. Syst. Sci. 2, 937–947 (1982)

    Google Scholar 

  184. Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1∕k²). Soviet Math. Doklady 27(2), 372–376 (1983)

    MATH  Google Scholar 

  185. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, Massachusetts, 2004)

    Book  MATH  Google Scholar 

  186. Y. Nesterov, How to make the gradients small. Optima 88, 10–11 (2012)

    Google Scholar 

  187. Y. Nesterov, Lectures on Convex Optimization, vol. 137 (Springer, 2018)

    Google Scholar 

  188. Y. Nesterov, B. Polyak, Cubic regularization of Newton method and its global performance. Mathematical Programming 108(1), 177–205 (2006)

    Article  MathSciNet  MATH  Google Scholar 

  189. Y. Nesterov, V. Spokoiny, Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017). First appeared in 2011 as CORE discussion paper 2011/16

    Google Scholar 

  190. Y. Nesterov, A. Gasnikov, S. Guminov, P. Dvurechensky, Primal-dual accelerated gradient methods with small-dimensional relaxation oracle. Optim. Methods Softw., 1–28 (2020). arXiv:1809.05895

    Google Scholar 

  191. B. Neyshabur, S. Bhojanapalli, D. McAllester, N. Srebro, Exploring generalization in deep learning, in Advances in Neural Information Processing Systems, pp. 5947–5956 (2017)

    Google Scholar 

  192. L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Sarah: A novel method for machine learning problems using stochastic recursive gradient, in International Conference on Machine Learning, pp. 2613–2621 (2017)

    Google Scholar 

  193. L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, Stochastic recursive gradient algorithm for nonconvex optimization. Preprint (2017). arXiv:1705.07261

    Google Scholar 

  194. Q. Nguyen, M. C. Mukkamala, M. Hein, On the loss landscape of a class of deep neural networks with no bad local valleys. Preprint (2018). arXiv:1809.10749

    Google Scholar 

  195. L. M. Nguyen, Q. Tran-Dinh, D. T. Phan, P. H. Nguyen, M. van Dijk, A unified convergence analysis for shuffling-type gradient methods. Preprint (2020). arXiv:2002.08246

    Google Scholar 

  196. J. Nocedal, S. Wright, Numerical Optimization (Springer Science & Business Media, 2006)

    Google Scholar 

  197. K. Osawa, Y. Tsuji, Y. Ueno, A. Naruse, R. Yokota, S. Matsuoka, Second-order optimization method for large mini-batch: Training resnet-50 on imagenet in 35 epochs. Preprint (2018). arXiv:1811.12019, 1:2

    Google Scholar 

  198. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, A. Swami, Practical black-box attacks against machine learning (2017)

    Google Scholar 

  199. V. Papyan, Y. Romano, J. Sulam, M. Elad, Convolutional dictionary learning via local processing, in Proceedings of the IEEE International Conference on Computer Vision, pp. 5296–5304 (2017)

    Google Scholar 

  200. S. Park, S. H. Jung, P. M. Pardalos, Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. J. Optim. Theory Appl. 184(3), 953–971 (2020)

    Article  MathSciNet  MATH  Google Scholar 

  201. R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in International Conference on Machine Learning, pp. 1310–1318 (2013)

    Google Scholar 

  202. B. Polyak, Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)

    Article  MATH  Google Scholar 

  203. B. T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

    Article  Google Scholar 

  204. B. Polyak, Introduction to Optimization (Optimization Software, New York, 1987)

    MATH  Google Scholar 

  205. Q. Qu, X. Li, Z. Zhu, A nonconvex approach for exact and efficient multichannel sparse blind deconvolution, in Advances in Neural Information Processing Systems, pp. 4015–4026 (2019)

    Google Scholar 

  206. S. Rajput, A. Gupta, D. Papailiopoulos, Closing the convergence gap of sgd without replacement. Preprint (2020). arXiv:2002.10400

    Google Scholar 

  207. S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in International conference on machine learning, pp. 314–323 (2016)

    Google Scholar 

  208. S. J. Reddi, S. Sra, B. Poczos, A. J. Smola, Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization, in Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

    Google Scholar 

  209. S. J. Reddi, S. Kale, S. Kumar, On the convergence of adam and beyond. Preprint (2019). arXiv:1904.09237

    Google Scholar 

  210. A. Rezanov, D. Yudin, Deep neural networks for ortophoto-based vehicle localization, in Advances in Neural Computation, Machine Learning, and Cognitive Research IV, ed. by B. Kryzhanovsky, W. Dunin-Barkowski, V. Redko, Y. Tiumentsev (Springer International Publishing, Cham, 2021), pp. 167–174

    Chapter  Google Scholar 

  211. A. Risteski, Y. Li, Algorithms and matching lower bounds for approximately-convex optimization, in NIPS (2016)

    Google Scholar 

  212. A. Roy, K. Balasubramanian, S. Ghadimi, P. Mohapatra, Escaping saddle-point faster under interpolation-like conditions, in Advances in Neural Information Processing Systems, p. 33 (2020)

    Google Scholar 

  213. C. W. Royer, S. J. Wright, Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J. Optim. 28(2), 1448–1477 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  214. I. Safran, O. Shamir, Spurious local minima are common in two-layer relu neural networks, in International Conference on Machine Learning (PMLR, 2018), pp. 4433–4441

    Google Scholar 

  215. K. A. Sankararaman, S. De, Z. Xu, W. R. Huang, T. Goldstein, The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. Preprint (2019). arXiv:1904.06963

    Google Scholar 

  216. M. Schmidt, N. L. Roux, Fast convergence of stochastic gradient descent under a strong growth condition. Preprint (2013). arXiv:1308.6370

    Google Scholar 

  217. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Mathematical Programming 162(1–2), 83–112 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  218. M. Schumer, K. Steiglitz, Adaptive step size random search. IEEE Trans. Automatic Control 13(3), 270–276 (1968)

    Article  Google Scholar 

  219. O. Sebbouh, R. M. Gower, A. Defazio, On the convergence of the stochastic heavy ball method. Preprint (2020). arXiv:2006.07867

    Google Scholar 

  220. O. Sener, V. Koltun, Learning to guide random search, in International Conference on Learning Representations (2020)

    Google Scholar 

  221. S. Shalev-Shwartz, Sdca without duality, regularization, and individual convexity, in International Conference on Machine Learning, pp. 747–754 (2016)

    Google Scholar 

  222. Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, M. Segev, Phase retrieval with application to optical imaging: a contemporary overview. IEEE Signal Process. Mag. 32(3), 87–109 (2015)

    Article  Google Scholar 

  223. Z. Shen, P. Zhou, C. Fang, A. Ribeiro, A stochastic trust region method for non-convex minimization. Preprint (2019). arXiv:1903.01540

    Google Scholar 

  224. L. Shi, Y. Chi, Manifold gradient descent solves multi-channel sparse blind deconvolution provably and efficiently, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2020), pp. 5730–5734

    Google Scholar 

  225. B. Shi, W. J. Su, M. I. Jordan, On learning rates and schrödinger operators. Preprint (2020). arXiv:2004.06977

    Google Scholar 

  226. N. Shi, D. Li, M. Hong, R. Sun, Rmsprop converges with proper hyper-parameter, in International Conference on Learning Representations (2021)

    Google Scholar 

  227. I. Shibaev, P. Dvurechensky, A. Gasnikov, Zeroth-order methods for noisy Hölder-gradient functions. Optimization Letters (2021). (accepted), arXiv:2006.11857. https://doi.org/10.1007/s11590-021-01742-z

  228. Y. Shin, Effects of depth, width, and initialization: A convergence analysis of layer-wise training for deep linear neural networks. Preprint (2019). arXiv:1910.05874

    Google Scholar 

  229. N. Z. Shor, Generalized gradient descent with application to block programming. Kibernetika 3(3), 53–55 (1967)

    Google Scholar 

  230. A. Skrynnik, A. Staroverov, E. Aitygulov, K. Aksenov, V. Davydov, A. I. Panov, Forgetful experience replay in hierarchical reinforcement learning from expert demonstrations. Knowl. Based Syst. 218, 106844 (2021)

    Article  Google Scholar 

  231. L. N. Smith, Cyclical learning rates for training neural networks, in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2017), pp. 464–472

    Google Scholar 

  232. M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 11(1), 23–35 (1998)

    Article  MathSciNet  MATH  Google Scholar 

  233. M. Soltanolkotabi, A. Javanmard, J. D. Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2), 742–769 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  234. V. Spokoiny et al., Parametric estimation. finite sample theory. Ann. Stat. 40(6), 2877–2909 (2012)

    Google Scholar 

  235. F. S. Stonyakin, D. Dvinskikh, P. Dvurechensky, A. Kroshnin, O. Kuznetsova, A. Agafonov, A. Gasnikov, A. Tyurin, C. A. Uribe, D. Pasechnyuk, S. Artamonov, Gradient methods for problems with inexact model of the objective, in Mathematical Optimization Theory and Operations Research, ed. by M. Khachay, Y. Kochetov, P. Pardalos, (Springer International Publishing, Cham, 2019), pp. 97–114 arXiv:1902.09001

    Google Scholar 

  236. F. Stonyakin, A. Tyurin, A. Gasnikov, P. Dvurechensky, A. Agafonov, D. Dvinskikh, M. Alkousa, D. Pasechnyuk, S. Artamonov, V. Piskunova, Inexact model: A framework for optimization and variational inequalities. Optim. Methods Softw. (2021). (accepted), WIAS Preprint No. 2709, arXiv:2001.09013, arXiv:1902.00990. https://doi.org/10.1080/10556788.2021.1924714

  237. R. Sun, Optimization for deep learning: theory and algorithms. Preprint (2019). arXiv:1912.08957

    Google Scholar 

  238. I. Surazhevsky, V. Demin, A. Ilyasov, A. Emelyanov, K. Nikiruy, V. Rylkov, S. Shchanikov, I. Bordanov, S. Gerasimova, D. Guseinov, N. Malekhonova, D. Pavlov, A. Belov, A. Mikhaylov, V. Kazantsev, D. Valenti, B. Spagnolo, M. Kovalchuk, Noise-assisted persistence and recovery of memory state in a memristive spiking neuromorphic network. Chaos Solitons Fractals 146, 110890 (2021)

    Article  MATH  Google Scholar 

  239. I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in International Conference on Machine Learning, pp. 1139–1147 (2013)

    Google Scholar 

  240. G. Swirszcz, W. M. Czarnecki, R. Pascanu, Local minima in training of deep networks (2016)

    Google Scholar 

  241. Y. S. Tan, R. Vershynin, Online stochastic gradient descent with arbitrary initialization solves non-smooth, non-convex phase retrieval. Preprint (2019). arXiv:1910.12837

    Google Scholar 

  242. W. Tao, Z. Pan, G. Wu, Q. Tao, Primal averaging: A new gradient evaluation step to attain the optimal individual convergence. IEEE Trans. Cybern. 50(2), 835–845 (2018)

    Article  Google Scholar 

  243. A. Taylor, F. Bach, Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions, in Conference on Learning Theory, pp. 2934–2992 (2019)

    Google Scholar 

  244. T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Networks Mach. Learn. 4(2), 26–31 (2012)

    Google Scholar 

  245. N. Tripuraneni, M. Stern, C. Jin, J. Regier, M. I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)

    Google Scholar 

  246. P. Tseng, An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

    Article  MathSciNet  MATH  Google Scholar 

  247. I. Usmanova, Robust solutions to stochastic optimization problems. Master Thesis (MSIAM); Institut Polytechnique de Grenoble ENSIMAG, Laboratoire Jean Kuntzmann (2017)

    Google Scholar 

  248. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

    Google Scholar 

  249. S. Vaswani, F. Bach, M. Schmidt, Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron, in The 22nd International Conference on Artificial Intelligence and Statistics (PMLR, 2019), pp. 1195–1204

    Google Scholar 

  250. S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, S. Lacoste-Julien, Painless stochastic gradient: Interpolation, line-search, and convergence rates, in Advances in Neural Information Processing Systems, pp. 3732–3745 (2019)

    Google Scholar 

  251. S. A. Vavasis, Black-box complexity of local minimization. SIAM J. Optim. 3(1), 60–80 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  252. R. Vidal, J. Bruna, R. Giryes, S. Soatto, Mathematics of deep learning. Preprint (2017). arXiv:1712.04741

    Google Scholar 

  253. Z. Wang, K. Ji, Y. Zhou, Y. Liang, V. Tarokh, Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. Preprint (2018). arXiv:1810.10690

    Google Scholar 

  254. Z. Wang, K. Ji, Y. Zhou, Y. Liang, V. Tarokh, Spiderboost and momentum: Faster variance reduction algorithms, in Advances in Neural Information Processing Systems, pp. 2403–2413 (2019)

    Google Scholar 

  255. Z. Wang, Y. Zhou, Y. Liang, G. Lan, Stochastic variance-reduced cubic regularization for nonconvex optimization, in The 22nd International Conference on Artificial Intelligence and Statistics (PMLR, 2019), pp. 2731–2740

    Google Scholar 

  256. Z. Wang, Y. Zhou, Y. Liang, G. Lan, Cubic regularization with momentum for nonconvex optimization, in Uncertainty in Artificial Intelligence (PMLR, 2020), pp. 313–322

    Google Scholar 

  257. R. Ward, X. Wu, L. Bottou, Adagrad stepsizes: Sharp convergence over nonconvex landscapes, in International Conference on Machine Learning (PMLR, 2019), pp. 6677–6686

    Google Scholar 

  258. A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, B. Recht, The marginal value of adaptive gradient methods in machine learning, in Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)

    Google Scholar 

  259. S. J. Wright, Optimization algorithms for data analysis. Math. Data 25, 49 (2018)

    Article  MathSciNet  MATH  Google Scholar 

  260. F. Wu, P. Rebeschini, Hadamard wirtinger flow for sparse phase retrieval. Preprint (2020). arXiv:2006.01065

    Google Scholar 

  261. G. Xie, L. Luo, Z. Zhang, A general analysis framework of lower complexity bounds for finite-sum optimization. Preprint (2019). arXiv:1908.08394

    Google Scholar 

  262. Y. Xu, Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization. Preprint (2020). arXiv:2006.00425

    Google Scholar 

  263. P. Xu, J. Chen, D. Zou, Q. Gu, Global convergence of langevin dynamics based algorithms for nonconvex optimization, in Advances in Neural Information Processing Systems, pp. 3122–3133 (2018)

    Google Scholar 

  264. Y. Xu, R. Jin, T. Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, in Advances in Neural Information Processing Systems, pp. 5530–5540 (2018)

    Google Scholar 

  265. P. Xu, F. Roosta, M. W. Mahoney, Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184(1), 35–70 (2020)

    Article  MathSciNet  MATH  Google Scholar 

  266. P. Xu, F. Roosta, M. W. Mahoney, Second-order optimization for non-convex machine learning: An empirical study, in Proceedings of the 2020 SIAM International Conference on Data Mining (SIAM, 2020), pp. 199–207

    Google Scholar 

  267. Y. Yan, T. Yang, Z. Li, Q. Lin, Y. Yang, A unified analysis of stochastic momentum methods for deep learning, in Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2955–2961 (2018)

    Google Scholar 

  268. Z. Yang, L. F. Yang, E. X. Fang, T. Zhao, Z. Wang, M. Neykov, Misspecified nonconvex statistical optimization for sparse phase retrieval. Mathematical Programming 176(1–2), 545–571 (2019)

    Article  MathSciNet  MATH  Google Scholar 

  269. C. Yun, S. Sra, A. Jadbabaie, Small nonlinearities in activation functions create bad local minima in neural networks. Preprint (2018). arXiv:1802.03487

    Google Scholar 

  270. J. Yun, A. C. Lozano, E. Yang, A general family of stochastic proximal gradient methods for deep learning. Preprint (2020). arXiv:2007.07484

    Google Scholar 

  271. M. Zaheer, S. Reddi, D. Sachan, S. Kale, S. Kumar, Adaptive methods for nonconvex optimization, in Advances in Neural Information Processing Systems, pp. 9793–9803 (2018)

    Google Scholar 

  272. R. Y. Zhang, Sharp global guarantees for nonconvex low-rank matrix recovery in the overparameterized regime. Preprint (2021). arXiv:2104.10790

    Google Scholar 

  273. J. Zhang, L. Xiao, Stochastic variance-reduced prox-linear algorithms for nonconvex composite optimization. Preprint (2020). arXiv:2004.04357

    Google Scholar 

  274. J. Zhang, L. Xiao, S. Zhang, Adaptive stochastic variance reduction for subsampled newton method with cubic regularization. Preprint (2018). arXiv:1811.11637

    Google Scholar 

  275. B. Zhang, J. Jin, C. Fang, L. Wang, Improved analysis of clipping algorithms for non-convex optimization, in Advances in Neural Information Processing Systems, p. 33 (2020)

    Google Scholar 

  276. J. Zhang, T. He, S. Sra, A. Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity, in International Conference on Learning Representations (2020)

    Google Scholar 

  277. J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, S. Sra, Why are adaptive methods good for attention models? in Advances in Neural Information Processing Systems, p. 33 (2020)

    Google Scholar 

  278. Y. Zhang, Q. Qu, J. Wright, From symmetry to geometry: Tractable nonconvex problems. Preprint (2020). arXiv:2007.06753

    Google Scholar 

  279. Y. Zhang, Q. Qu, J. Wright, From symmetry to geometry: Tractable nonconvex problems. Preprint (2020). arXiv:2007.06753

    Google Scholar 

  280. Y. Zhang, Y. Zhou, K. Ji, M. M. Zavlanos, Boosting one-point derivative-free online optimization via residual feedback (2020)

    Google Scholar 

  281. C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)

    Article  Google Scholar 

  282. H. Zhang, Y. Bi, J. Lavaei, General low-rank matrix optimization: Geometric analysis and sharper bounds. Preprint (2021). arXiv:2104.10356

    Google Scholar 

  283. A. Zhigljavsky, A. Zilinskas, Stochastic Global Optimization, vol. 9 (Springer Science & Business Media, 2007)

    Google Scholar 

  284. D. Zhou, Q. Gu, Lower bounds for smooth nonconvex finite-sum optimization, in International Conference on Machine Learning, pp. 7574–7583 (2019)

    Google Scholar 

  285. D. Zhou, Y. Tang, Z. Yang, Y. Cao, Q. Gu, On the convergence of adaptive gradient methods for nonconvex optimization. Preprint (2018). arXiv:1808.05671

    Google Scholar 

  286. D. Zhou, P. Xu, Q. Gu, Stochastic nested variance reduced gradient descent for nonconvex optimization, in Advances in Neural Information Processing Systems (2018)

    Google Scholar 

  287. D. Zhou, P. Xu, Q. Gu, Stochastic variance-reduced cubic regularization methods. J. Mach. Learn. Res. 20(134), 1–47 (2019)

    MathSciNet  MATH  Google Scholar 

  288. D. Zhou, P. Xu, Q. Gu, Stochastic variance-reduced cubic regularization methods. J. Mach. Learn. Res. 20(134), 1–47 (2019)

    MathSciNet  MATH  Google Scholar 

  289. D. Zhou, Q. Gu, Stochastic recursive variance-reduced cubic regularization methods, in International Conference on Artificial Intelligence and Statistics (PMLR, 2020), pp. 3980–3990

    Google Scholar 

  290. X. Zhu, J. Han, B. Jiang, An adaptive high order method for finding third-order critical points of nonconvex optimization. Preprint (2020). arXiv:2008.04191

    Google Scholar 

Acknowledgements

The authors are grateful to A. Gornov, A. Nazin, Yu. Nesterov, B. Polyak, and K. Scheinberg for fruitful discussions and their suggestions that helped to improve the quality of the text.

The research was partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) No. 075-00337-20-03, project No. 0714-2020-0005. The work of I. Shibaev was supported by the program “Leading Scientific Schools” (grant no. NSh-775.2022.1.1).

Author information

Corresponding author

Correspondence to Alexander Gasnikov.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Danilova, M. et al. (2022). Recent Theoretical Advances in Non-Convex Optimization. In: Nikeghbali, A., Pardalos, P.M., Raigorodskii, A.M., Rassias, M.T. (eds) High-Dimensional Optimization and Probability. Springer Optimization and Its Applications, vol 191. Springer, Cham. https://doi.org/10.1007/978-3-031-00832-0_3
