
Behavior of accelerated gradient methods near critical points of nonconvex functions


We examine the behavior of accelerated gradient methods in smooth nonconvex unconstrained optimization, focusing in particular on their behavior near strict saddle points. Accelerated methods are iterative methods that typically step along a direction that is a linear combination of the previous step and the gradient of the function evaluated at a point at or near the current iterate. (The previous step encodes gradient information from earlier stages of the iterative process.) We show by means of the stable manifold theorem that the heavy-ball method is unlikely to converge to strict saddle points, which are points at which the gradient of the objective is zero but the Hessian has at least one negative eigenvalue. We then examine the behavior of the heavy-ball method and other accelerated gradient methods in the vicinity of a strict saddle point of a nonconvex quadratic function, showing that these methods can diverge from this point more rapidly than the steepest-descent method.
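For intuition, the heavy-ball update described above can be sketched numerically. The following is a minimal illustration (not the paper's experiment; the step size \(\alpha\) and momentum parameter \(\beta\) are assumed values chosen for illustration) comparing heavy-ball and steepest-descent iterates near the strict saddle at the origin of the quadratic \(f(x,y) = (x^2 - y^2)/2\):

```python
import numpy as np

def grad(z):
    # gradient of f(x, y) = (x^2 - y^2) / 2, whose Hessian diag(1, -1)
    # has a negative eigenvalue, so the origin is a strict saddle
    return np.array([z[0], -z[1]])

alpha, beta = 0.1, 0.9            # assumed step size and momentum parameter
z_gd = np.array([1e-3, 1e-3])     # steepest-descent iterate, started near the saddle
z_hb = z_gd.copy()                # heavy-ball iterate
z_prev = z_hb.copy()              # previous heavy-ball iterate

for _ in range(50):
    z_gd = z_gd - alpha * grad(z_gd)
    # heavy ball: step along -alpha * gradient plus beta times the previous step
    z_hb, z_prev = z_hb - alpha * grad(z_hb) + beta * (z_hb - z_prev), z_hb

print(abs(z_gd[1]), abs(z_hb[1]))  # heavy-ball component along the negative-curvature
                                   # direction grows much faster
```

Along the negative-curvature direction the steepest-descent iterate grows by the factor \(1+\alpha\) per step, while the heavy-ball iterate's asymptotic growth factor is the larger root of \(\lambda^2 - (1+\alpha+\beta)\lambda + \beta = 0\), which exceeds \(1+\alpha\); this is consistent with the faster divergence noted above.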



  1. Attouch, H., Cabot, A.: Convergence rates of inertial forward–backward algorithms. SIAM J. Optim. 28(1), 849–874 (2018)


  2. Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(01), 1–34 (2000)


  3. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016)


  4. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)


  5. Chambolle, A., Dossal, Ch.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015)


  6. Du, S.S., Jin, C., Lee, J.D., Jordan, M.I., Singh, A., Poczos, B.: Gradient descent can take exponential time to escape saddle points. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 1067–1077. Curran Associates Inc, Red Hook (2017)


  7. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)


  8. Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 (2017)

  9. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. JMLR Workshop Conf. Proc. 49(1), 1–12 (2016)


  10. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 379–387. Curran Associates Inc, Red Hook (2015)


  11. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. AN SSSR 269, 543–547 (1983)


  12. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York (2004)


  13. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  14. Recht, B., Wright, S.J.: Nonlinear Optimization for Machine Learning (2017). (Manuscript in preparation)

  15. Shub, M.: Global Stability of Dynamical Systems. Springer, Berlin (1987)


  16. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Technical report, Department of Mathematics, University of Washington (2008)

  17. Zavriev, S.K., Kostyuk, F.V.: Heavy-ball method in nonconvex optimization problems. Comput. Math. Model. 4(4), 336–341 (1993)




Acknowledgements

We are grateful to Bin Hu for his advice and suggestions on the manuscript. We are also grateful to the referees and editor for helpful suggestions.

Author information



Corresponding author

Correspondence to Stephen J. Wright.

Additional information

This work was supported by NSF Awards IIS-1447449, 1628384, 1634597, and 1740707; AFOSR Award FA9550-13-1-0138; and Subcontract 3F-30222 from Argonne National Laboratory. Part of this work was done while the second author was visiting the Simons Institute for the Theory of Computing, and partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF Award CCF-1740425.

Appendix A: Properties of the sequence \(\{t_k\}\) defined by (51)


In this appendix we show that the following two properties hold for the sequence defined by (51):

$$\begin{aligned} \frac{t_{k-1}-1}{t_k} \;\; \text{is an increasing nonnegative sequence} \end{aligned} \tag{52}$$


$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{t_{k-1}-1}{t_k} = 1. \end{aligned} \tag{53}$$

We begin by noting two well-known properties of the sequence \(\{t_k\}\) (see, for example, [4, Section 3.7.2]):

$$\begin{aligned} t_k^2 - t_k = t_{k-1}^2 \end{aligned} \tag{54}$$


$$\begin{aligned} t_k \ge \frac{k+1}{2}. \end{aligned} \tag{55}$$
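For completeness, the lower bound admits a short induction, assuming the recursion \(t_k = \bigl(1 + \sqrt{1 + 4 t_{k-1}^2}\bigr)/2\) and \(t_0 = 1\) (both consistent with the derivations below):

$$\begin{aligned} t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2} \ge \frac{1 + 2 t_{k-1}}{2} = t_{k-1} + \frac{1}{2}, \quad \text{so} \quad t_k \ge t_0 + \frac{k}{2} = \frac{k+2}{2} \ge \frac{k+1}{2}. \end{aligned}$$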

To prove that \(\frac{t_{k-1}-1}{t_k}\) is monotonically increasing, we need

$$\begin{aligned} \frac{t_{k-1}-1}{t_k} = \frac{t_{k-1}}{t_k} - \frac{1}{t_k} \le \frac{t_k}{t_{k+1}} - \frac{1}{t_{k+1}} = \frac{t_k - 1}{t_{k+1}}, \quad k=1,2,\ldots . \end{aligned}$$

Since \(t_{k+1} \ge t_k\) (which follows immediately from (51)), it is sufficient to prove that

$$\begin{aligned} \frac{t_{k-1}}{t_k} \le \frac{t_k}{t_{k+1}}. \end{aligned} \tag{56}$$

By manipulating this expression and using (54), we obtain the equivalent expression

$$\begin{aligned} t_{k-1} \le \frac{t_k^2}{t_{k+1}} = \frac{t_{k+1}^2 - t_{k+1}}{t_{k+1}} = t_{k+1} - 1. \end{aligned}$$

By definition of \(t_{k+1}\), we have

$$\begin{aligned} t_{k+1} = \frac{\sqrt{4t_k^2 + 1} + 1}{2} \ge t_k + \frac{1}{2} = \frac{\sqrt{4t_{k-1}^2 + 1} + 1}{2} + \frac{1}{2} \ge t_{k-1} + 1. \end{aligned}$$

Thus (56) holds, so the claim (52) is proved. The sequence \(\{ (t_{k-1}-1)/t_k \}\) is nonnegative, since \((t_0-1)/t_1 = 0\).

Now we prove (53). We can lower-bound \((t_{k-1}-1)/t_k\) as follows:

$$\begin{aligned} \frac{t_{k-1}-1}{t_k}&= \frac{2(t_{k-1} -1)}{\sqrt{4 t_{k-1}^2 +1} + 1}\ge \frac{2(t_{k-1} -1)}{\sqrt{4 t_{k-1}^2} + 2} \nonumber \\&= \frac{2(t_{k-1} -1)}{2 (t_{k-1} + 1)} = 1 - \frac{2}{t_{k-1}+1}. \end{aligned} \tag{57}$$

For an upper bound, we have from \(t_k \ge t_{k-1}\) that

$$\begin{aligned} \frac{t_{k-1} -1}{t_k} \le \frac{t_{k-1}}{t_k} \le 1. \end{aligned} \tag{58}$$

Since \(t_{k-1} \rightarrow \infty \) (because of (55)), it follows from (57) and (58) that (53) holds.
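The two properties can also be checked numerically. A minimal sketch, assuming (51) is the standard recursion \(t_k = \bigl(1 + \sqrt{1 + 4 t_{k-1}^2}\bigr)/2\) with \(t_0 = 1\):

```python
import math

t_prev, t = 1.0, (1.0 + math.sqrt(5.0)) / 2.0    # t_0 = 1, t_1 from the recursion
ratios = []
for _ in range(10000):
    ratios.append((t_prev - 1.0) / t)             # the quantity (t_{k-1} - 1) / t_k
    t_prev, t = t, (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0

increasing = all(b >= a >= 0.0 for a, b in zip(ratios, ratios[1:]))
print(increasing, ratios[0], ratios[-1])          # nonnegative, increasing, tending to 1
```

The first ratio is exactly \((t_0 - 1)/t_1 = 0\), and after \(10^4\) iterations the ratio is within about \(4 \times 10^{-4}\) of 1, matching the lower bound \(1 - 2/(t_{k-1}+1)\).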


Cite this article

O’Neill, M., Wright, S.J. Behavior of accelerated gradient methods near critical points of nonconvex functions. Math. Program. 176, 403–427 (2019).



Keywords

  • Accelerated gradient methods
  • Nonconvex optimization

Mathematics Subject Classification

  • 90C26
  • 49M30