
Inertial Newton Algorithms Avoiding Strict Saddle Points

Journal of Optimization Theory and Applications

Abstract

We study the asymptotic behavior of second-order algorithms mixing Newton’s method and inertial gradient descent in non-convex landscapes. We show that, despite the Newtonian behavior of these methods, they almost always escape strict saddle points. We also evidence the role played by the hyper-parameters of these methods in their qualitative behavior near critical points. The theoretical results are supported by numerical illustrations.


Data Availability Statement

No additional data are needed to reproduce the experiments, and the code is publicly available (see above).

Notes

  1. The limits of sub-sequences of iterates of INNA are critical points of \(\mathcal {J}\), both for vanishing step-sizes [19] and for fixed ones if \(\mathcal {J}\) has a Lipschitz continuous gradient (see Theorem 4.3).

  2. This requires more assumptions on \(\mathcal {J}\), e.g., the Kurdyka-Łojasiewicz property, see [38].

  3. The main algorithmic result of this paper, Theorem 4.1, holds beyond Morse functions.

  4. See, e.g., [3, 19] for precise definitions. These notions are not crucial in what follows.

  5. The distribution of \((\theta _0,\psi _0)\) must be absolutely continuous w.r.t. the Lebesgue measure, that is: for any set \(\textsf{I}\subset \mathbb {R}^P\times \mathbb {R}^P\) with zero Lebesgue measure, \(\mathbb {P}((\theta _0,\psi _0)\in \textsf{I})=0\).

  6. Additionally, H preserves the parametrization by time (i.e., the orientation of oriented curves is preserved; see [36, Chapter 2.8, Definition 1]).

  7. We could consider non-autonomous ODEs [33], but we do not for the sake of simplicity.

  8. Since \(\nabla \mathcal {J}\) is not globally Lipschitz, a local Lipschitz constant \({\hat{L}}=50\) was used to locally satisfy the assumptions of Theorems 4.1 and 4.3. Convergence was also empirically checked.

References

  1. Alecsa, Cristian Daniel, László, Szilárd Csaba, Viorel, Adrian: A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem. Numer. Algor. 84(2), 485–512 (2020)

  2. Alecsa, Cristian Daniel, László, Szilárd Csaba, Pinţa, Titus: An extension of the second order dynamical system that models Nesterov’s convex gradient method. Appl. Math. Optim. 84(2), 1687–1716 (2021)

  3. Alvarez, Felipe, Attouch, Hedy, Bolte, Jérôme, Redont, Patrick: A second-order gradient-like dissipative dynamical system with Hessian-driven damping: application to optimization and mechanics. Journal de Mathématiques Pures et Appliquées 81(8), 747–779 (2002)

  4. Ašić, M.D., Adamović, D.D.: Limit points of sequences in metric spaces. Am. Math. Monthly 77(6), 613–616 (1970). https://www.tandfonline.com/doi/abs/10.1080/00029890.1970.11992549

  5. Attouch, H., László, S.C.: Newton-like inertial dynamics and proximal algorithms governed by maximally monotone operators. SIAM J. Optim. 30(4), 3252–3283 (2020)

  6. Attouch, H., László, S.C.: Continuous Newton-like inertial dynamics for monotone inclusions. Set-Valued Variat. Anal. 29(3), 555–581 (2021)

  7. Attouch, Hedy, Redont, Patrick: The second-order in time continuous Newton method. In: Lassonde, M. (ed.) Approximation, Optimization and Mathematical Economics, pp. 25–36. Springer, New York (2001)

  8. Attouch, Hedy, Bolte, Jérôme, Svaiter, Benar Fux: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1), 91–129 (2013)

  9. Attouch, Hedy, Peypouquet, Juan, Redont, Patrick: A dynamical approach to an inertial forward-backward algorithm for convex minimization. SIAM J. Optim. 24(1), 232–256 (2014)

  10. Attouch, Hedy, Peypouquet, Juan, Redont, Patrick: Fast convex optimization via inertial dynamics with Hessian driven damping. J. Differ. Eq. 261(10), 5734–5783 (2016)

  11. Attouch, Hedy, Chbani, Zaki, Peypouquet, Juan, Redont, Patrick: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168(1), 123–175 (2018)

  12. Attouch, Hedy, Chbani, Zaki, Riahi, Hassan: Rate of convergence of the Nesterov accelerated gradient method in the subcritical case \(\alpha \le 3\). ESAIM Control Optim. Calc. Var. 25(2), 1–34 (2019)

  13. Attouch, Hedy, Chbani, Zaki, Fadili, Jalal, Riahi, Hassan: First-order optimization algorithms via inertial systems with Hessian driven damping. Math. Program. 194(4), 1–43 (2020)

  14. Attouch, H., Boţ, R.I., Csetnek, E.R.: Fast optimization via inertial dynamics with closed-loop damping. J. Eur. Math. Soc. 25(5), 1985–2056 (2022)

  15. Aujol, Jean-François, Dossal, Charles, Rondepierre, Aude: Optimal convergence rates for Nesterov acceleration. SIAM J. Optim. 29(4), 3131–3153 (2019)

  16. Bertsekas, Dimitri P.: Nonlinear Programming. Athena Scientific (1998)

  17. Boţ, R.I., Csetnek, E.R., László, S.C.: An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. EURO J. Comput. Optim. 4(1), 3–25 (2016)

  18. Boţ, R.I., Csetnek, E.R., László, S.C.: Tikhonov regularization of a second order dynamical system with Hessian driven damping. Math. Program. 189(1), 151–186 (2021)

  19. Castera, Camille, Pauwels, Edouard: An inertial Newton algorithm for deep learning. J. Mach. Learn. Res. 22(134), 1–31 (2021)

  20. Chen, Long, Luo, Hao: First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow. arXiv:1912.09276 (2019)

  21. Dauphin, Y.N., Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, Bengio, Yoshua: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems (NeurIPS), vol. 27, pp. 2933–2941 (2014)

  22. Goudou, Xavier, Munier, Julien: The gradient and heavy ball with friction dynamical systems: the quasiconvex case. Math. Program. 116(1), 173–191 (2009)

  23. Grobman, David M.: Homeomorphism of systems of differential equations. Doklady Akademii Nauk SSSR 128(5), 880–881 (1959)

  24. Hartman, Philip: A lemma in the theory of structural stability of differential equations. Proc. Am. Math. Soc. 11(4), 610–620 (1960)

  25. Hunter, John D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007)

  26. Kelley, Al: The stable, center-stable, center, center-unstable, unstable manifolds. J. Differ. Eq. (1966). https://doi.org/10.1016/0022-0396(67)90016-2

  27. Kelley, John L.: General Topology. Springer, New York (1975)

  28. Lee, Jason D., Simchowitz, Max, Jordan, Michael I., Recht, Benjamin: Gradient descent only converges to minimizers. In: Feldman, V., Rakhlin, A., Shamir, O. (eds.) Conference on Learning Theory (COLT), vol. 49, pp. 1246–1257 (2016)

  29. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1117–1128 (2020)

  30. Milnor, John: Morse Theory. Princeton University Press, New Jersey (2016)

  31. Nesterov, Yurii: A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\). Doklady AN USSR 269, 543–547 (1983)

  32. Nocedal, Jorge, Wright, Stephen: Numerical Optimization. Springer, New York (2006)

  33. Ochs, Peter: Local convergence of the heavy-ball method and iPiano for non-convex optimization. J. Optim. Theory Appl. 177(1), 153–180 (2018)

  34. O’Neill, Michael, Wright, Stephen J.: Behavior of accelerated gradient methods near critical points of nonconvex functions. Math. Program. 176(1), 403–427 (2019)

  35. Palmer, Kenneth J.: A generalization of Hartman’s linearization theorem. J. Math. Anal. Appl. 41(3), 753–758 (1973)

  36. Panageas, Ioannis, Piliouras, Georgios: Gradient descent only converges to minimizers: non-isolated critical points and invariant regions. In: Papadimitriou, C.H. (ed.) Innovations in Theoretical Computer Science Conference (ITCS), vol. 67, pp. 1–12 (2017)

  37. Panageas, Ioannis, Piliouras, Georgios, Wang, Xiao: First-order methods almost always avoid saddle points: the case of vanishing step-sizes. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 1–12 (2019)

  38. Perko, Lawrence: Differential Equations and Dynamical Systems. Springer, New York (2013)

  39. Polyak, Boris T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  40. van Rossum, Guido: Python reference manual. CWI (Centre for Mathematics and Computer Science) (1995)

  41. Shi, B., Du, S.S., Jordan, M.I., Su, W.J.: Understanding the acceleration phenomenon via high-resolution differential equations. Math. Program. 195(1), 79–148 (2022)

  42. Shub, Michael: Global Stability of Dynamical Systems. Springer, New York (2013)

  43. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems (NeurIPS), vol. 27, pp. 2510–2518 (2014)

  44. László, Szilárd Csaba: Convergence rates for an inertial algorithm of gradient type associated to a smooth non-convex minimization. Math. Program. 190(1), 285–329 (2021)

  45. Truong, Tuyen Trung: Convergence to minima for the continuous version of backtracking gradient descent. arXiv preprint arXiv:1911.04221 (2019)

  46. Truong, Tuyen Trung, Nguyen, Tuan Hang: Backtracking gradient descent method for general \({C}^{1}\) functions, with applications to deep learning. arXiv preprint arXiv:1808.05160 (2018)

  47. Apidopoulos, Vassilis, Aujol, Jean-François, Dossal, Charles: The differential inclusion modeling FISTA algorithm and optimality of convergence rate in the case \(b \le 3\). SIAM J. Optim. 28(1), 551–574 (2018)

  48. van der Walt, Stéfan, Colbert, Chris, Varoquaux, Gael: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)

  49. Pliss, Viktor Aleksandrovich: A reduction principle in the theory of stability of motion. Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya 28(6), 1297–1324 (1964)


Acknowledgements

The author acknowledges the support of the European Research Council (ERC FACTORY-CoG-6681839) and the Air Force Office of Scientific Research (FA9550-18-1-0226). The author deeply thanks Jérôme Bolte, Cédric Févotte, and Edouard Pauwels for their valuable comments, and the anonymous reviewers for their suggestions which led to significant improvements, such as tackling non-isolated critical points. The numerical experiments were made with the following libraries: [25, 40, 48].

Author information

Correspondence to Camille Castera.

Additional information

Communicated by Olivier Fercoq.


Appendices

A Permutation Matrices

We specify the permutation matrices used to obtain the block diagonalization in (7). Denote by \(\textrm{mod}\) the modulo operator. We can choose the permutation matrix \(U\in \mathbb {R}^{2P\times 2P}\) as the matrix whose coefficients are all zero except the following, for all \(p\in \{1,\ldots ,P\}\),

$$\begin{aligned} \text {$P$ odd:}\ \begin{cases} U_{P-p+1,p} = 1-\textrm{mod}(p,2)\\ U_{P+p,2P-p+1} = \textrm{mod}(p,2)\\ U_{p,2P-p} = \textrm{mod}(p,2) \end{cases}, \qquad \text {$P$ even:}\ \begin{cases} U_{p,p} = \textrm{mod}(p,2)\\ U_{P+p,P+p} = 1-\textrm{mod}(p,2)\\ U_{P+p,p} = \textrm{mod}(p,2)\\ U_{p,P+p} = \textrm{mod}(p,2) \end{cases}. \end{aligned}$$
(16)

B Proof of Theorem 4.1

We consider functions with possibly uncountably many critical points; this yields additional difficulties, which we overcome using the following result, as done in [34].

Lemma B.1

(Lindelöf [27]) Every open cover of a subset of a Euclidean space admits a countable sub-cover.

The proof of Theorem 4.1 follows steps similar to those of Theorem 3.2, so we omit some details and use the notations of Sect. 3.1. First, for any \((\theta ,\psi )\in \mathbb {R}^P\times \mathbb {R}^P\), we redefine G:

$$\begin{aligned} G\begin{pmatrix} \theta \\ \psi \end{pmatrix} = \begin{pmatrix} \theta + \gamma \left[ -(\alpha -\frac{1}{\beta })\theta -\frac{1}{\beta }\psi -\beta \nabla \mathcal {J}(\theta ) \right] \\ \psi + \gamma \left[ -(\alpha -\frac{1}{\beta })\theta -\frac{1}{\beta }\psi \right] \end{pmatrix}, \end{aligned}$$
(17)

so that iterations \(k\in \mathbb {N}\) of INNA read \((\theta _{k+1},\psi _{k+1}) = G(\theta _{k},\psi _{k})\). Remark that the set of fixed points of G is \(\textsf{S}\), the set of stationary points of (3); indeed, for any \((\theta ,\psi )\in \mathbb {R}^P\times \mathbb {R}^P\),

$$\begin{aligned} G(\theta ,\psi ) = (\theta ,\psi ) \iff {\left\{ \begin{array}{ll} -(\alpha -\frac{1}{\beta })\theta -\frac{1}{\beta }\psi -\beta \nabla \mathcal {J}(\theta ) = 0\\ -(\alpha -\frac{1}{\beta })\theta -\frac{1}{\beta }\psi = 0 \end{array}\right. } \iff {\left\{ \begin{array}{ll} \nabla \mathcal {J}(\theta ) = 0\\ \psi = (1-\alpha \beta )\theta \end{array}\right. }. \end{aligned}$$
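In practice, an INNA iteration is just one evaluation of this map. The following minimal NumPy sketch implements G from (17); the gradient routine grad_J and the parameter values are hypothetical placeholders.

    import numpy as np

    def inna_step(theta, psi, grad_J, alpha, beta, gamma):
        # One application of the map G of (17):
        # (theta_{k+1}, psi_{k+1}) = G(theta_k, psi_k).
        common = -(alpha - 1.0 / beta) * theta - psi / beta  # shared linear term
        theta_next = theta + gamma * (common - beta * grad_J(theta))
        psi_next = psi + gamma * common
        return theta_next, psi_next

    # Hypothetical example: J(theta) = 0.5 * ||theta||^2, so grad_J is the identity.
    theta, psi = np.ones(3), np.zeros(3)
    theta, psi = inna_step(theta, psi, lambda t: t, alpha=1.0, beta=0.5, gamma=0.05)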

Since G is \(C^1\) on \(\mathbb {R}^{P}\times \mathbb {R}^P\), the Jacobian matrix of G (displayed in block form) reads,

$$\begin{aligned} DG(\theta ,\psi ) = \begin{pmatrix} (1 - \gamma (\alpha -\frac{1}{\beta })) I_P -\gamma \beta \nabla ^2\mathcal {J}(\theta ) & -\frac{\gamma }{\beta } I_P\\ -\gamma (\alpha -\frac{1}{\beta }) I_P & (1-\frac{\gamma }{\beta })I_P \end{pmatrix}. \end{aligned}$$

We can again block-diagonalize \(DG(\theta ,\psi )\) (see (7)), into blocks of the form (up to symmetric permutations): \(M_p = \begin{pmatrix} 1 - \gamma (\alpha -\frac{1}{\beta }) -\gamma \beta \lambda _p & -\frac{\gamma }{\beta } \\ -\gamma (\alpha -\frac{1}{\beta }) & 1-\frac{\gamma }{\beta } \end{pmatrix}\), where \(\lambda _p\) is an eigenvalue of \(\nabla ^2\mathcal {J}(\theta )\). To use Theorem 4.2, we need G to be a local diffeomorphism.

Theorem B.1

Under the same assumptions as that of Theorem 4.1, the mapping G defined in (17) is a local diffeomorphism from \(\mathbb {R}^P\times \mathbb {R}^P\) to \(\mathbb {R}^P\times \mathbb {R}^P\).

This result is proved later in Section B.1. We can now prove Theorem 4.1.

Proof of Theorem 4.1

Let \(\alpha \), \(\beta \), and \(\gamma \) be such that the assumptions of the theorem hold, and let G be defined as in (17) with these parameters. By Theorem B.1, G is a local diffeomorphism. Let \((\theta ^\star ,\psi ^\star )\in \textsf{S}_{<0}\); to use Theorem 4.2, we study the magnitude of the eigenvalues of \(DG(\theta ^\star ,\psi ^\star )\). Let \(\lambda _p<0\) be a negative eigenvalue of \(\nabla ^2\mathcal {J}(\theta ^\star )\); using the notations and elements stated at the beginning of this section, the eigenvalues of \(M_p\) are the roots of

$$\begin{aligned} \chi _{M_p}(X) &= X^2 - \textrm{trace}(M_p)X + \det (M_p) \\ &= X^2 - (2 - \gamma (\alpha +\beta \lambda _p) )X + 1 - \gamma (\alpha +\beta \lambda _p) + \gamma ^2\lambda _p. \end{aligned}$$

The discriminant of \(\chi _{M_p}\) is:

$$\begin{aligned} \Delta _{M_p} = (2 - \gamma (\alpha +\beta \lambda _p) )^2 - 4 (1 - \gamma (\alpha +\beta \lambda _p) + \gamma ^2\lambda _p) = \gamma ^2\left( (\alpha +\beta \lambda _p)^2 - 4\lambda _p\right) . \end{aligned}$$

Remark that Lemma 3.1 again gives the sign of \(\Delta _{M_p}\). Thus, since \(\lambda _p<0\), we necessarily have \(\Delta _{M_p}\ge 0\) and can ignore the case \(\Delta _{M_p}<0\). So \(M_p\) has two real eigenvalues,

$$\begin{aligned} \sigma _{p,+}&= 1 - \frac{1}{2} \gamma (\alpha +\beta \lambda _p) + \frac{1}{2} \gamma \sqrt{(\alpha +\beta \lambda _p)^2 - 4\lambda _p}\\ \sigma _{p,-}&= 1 - \frac{1}{2} \gamma (\alpha +\beta \lambda _p) - \frac{1}{2} \gamma \sqrt{(\alpha +\beta \lambda _p)^2 - 4\lambda _p}. \end{aligned}$$

Since \(\lambda _p<0\), we have \(\vert \alpha +\beta \lambda _p\vert < \sqrt{(\alpha +\beta \lambda _p)^2 - 4\lambda _p}\), hence \(\sigma _{p,+}>1\) and \(\sigma _{p,-}<1\). Therefore, \(DG(\theta ^\star ,\psi ^\star )\) has at least one eigenvalue with magnitude strictly larger than one.
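As a numerical illustration of this argument, the sketch below builds the \(2\times 2\) block \(M_p\) for hypothetical values of \(\alpha \), \(\beta \), \(\gamma \) and a negative eigenvalue \(\lambda _p\), and checks that one of its eigenvalues lies outside the unit disk.

    import numpy as np

    # Hypothetical parameter values; any alpha, beta, gamma > 0 and any negative
    # Hessian eigenvalue lam exhibit the same behavior.
    alpha, beta, gamma, lam = 0.5, 0.1, 0.01, -2.0
    M_p = np.array([
        [1 - gamma * (alpha - 1 / beta) - gamma * beta * lam, -gamma / beta],
        [-gamma * (alpha - 1 / beta), 1 - gamma / beta],
    ])
    moduli = np.abs(np.linalg.eigvals(M_p))
    assert moduli.max() > 1  # one eigenvalue escapes the unit disk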

We can now use the stable manifold theorem and we omit some details since the arguments are the same as for the proof of Theorem 3.1. By Theorem 4.2, around each \((\theta ^\star ,\psi ^\star )\in \textsf{S}_{<0}\), there exists a neighborhood \(\mathsf {\Omega }_{(\theta ^\star ,\psi ^\star )}\) on which the stable manifold theorem holds. Denote by \(\textsf{A}\) the possibly uncountable union of all these neighborhoods: \(\textsf{A} = \bigcup _{(\theta ^\star ,\psi ^\star )\in \textsf{S}_{<0}} \mathsf {\Omega }_{(\theta ^\star ,\psi ^\star )}\). By Lemma B.1, there exists a countable sub-cover of this set, i.e., there exists a sequence \((\theta _i^\star ,\psi _i^\star )_{i\in \mathbb {N}}\) in \(\textsf{S}_{<0}\) such that

$$\begin{aligned} \textsf{A} = \bigcup _{i\in \mathbb {N}} \mathsf {\Omega }_{(\theta _i^\star ,\psi _i^\star )}. \end{aligned}$$
(18)

Let \((\theta ^\star ,\psi ^\star )\in \textsf{S}_{<0}\); it need not be an element of \((\theta _i^\star ,\psi _i^\star )_{i\in \mathbb {N}}\), but according to (18), there exists \(i\in \mathbb {N}\) such that \((\theta ^\star ,\psi ^\star )\in \mathsf {\Omega }_{(\theta _i^\star ,\psi _i^\star )}\). Consider an initialization \((\theta _0,\psi _0)\) such that the associated realization \((\theta _k,\psi _k)_{k\in \mathbb {N}}\) of INNA converges to \((\theta ^\star ,\psi ^\star )\). This means that there exists \(k_0\in \mathbb {N}\) such that \(\forall k\ge k_0\), \(G^k(\theta _0,\psi _0)\in \mathsf {\Omega }_{(\theta _i^\star ,\psi _i^\star )}\) and thus \(G^k(\theta _0,\psi _0)\in \textsf{W}^{sc}_{(\theta _i^\star ,\psi _i^\star )}\), where \(\textsf{W}^{sc}_{(\theta _i^\star ,\psi _i^\star )}\) is the stable manifold around \((\theta _i^\star ,\psi _i^\star )\) as defined in Theorem 4.2. By Theorem B.1, G is a local diffeomorphism, so we can reverse the iterations and obtain \( (\theta _0,\psi _0) \in \bigcup _{j\in \mathbb {N}}G^{-j}\left( \mathsf {\Omega }_{(\theta _i^\star ,\psi _i^\star )}\cap \textsf{W}^{sc}_{(\theta _i^\star ,\psi _i^\star )}\right) \). Since \((\theta _i^\star ,\psi _i^\star )\in \textsf{S}_{<0}\), we showed that \(DG(\theta _i^\star ,\psi _i^\star )\) has at least one eigenvalue with magnitude strictly larger than 1, so by Theorem 4.2, \(\textsf{W}^{sc}_{(\theta _i^\star ,\psi _i^\star )}\) has zero measure. Then, by Theorem B.1, for all \(j\in \mathbb {N}\), \(G^{-j}\) is a local diffeomorphism, so the union above has zero measure. Using (18), the rest of the proof is then similar to the end of that of Theorem 3.1, since \( \bigcup _{i\in \mathbb {N}}\left[ \bigcup _{j\in \mathbb {N}}G^{-j}\left( \mathsf {\Omega }_{(\theta _i^\star ,\psi _i^\star )}\cap \textsf{W}^{sc}_{(\theta _i^\star ,\psi _i^\star )}\right) \right] \) is a countable union of zero-measure sets, so it again has measure zero. \(\square \)

B.1 Missing Proofs

We begin by proving the lemmas stated in Sect. 3.2.

Proof of Lemma 3.1

Let \(\alpha \ge 0\) and \(\beta >0\). The function \(h(\lambda )=(\alpha +\beta \lambda )^2-4\lambda = \beta ^2\lambda ^2 + 2(\alpha \beta -2)\lambda +\alpha ^2\) is a second-order polynomial in \(\lambda \) whose discriminant is \(16(1-\alpha \beta )\). If \(\alpha \beta >1\), this discriminant is negative, so h is always positive. If \(\alpha \beta \le 1\), then h has two real roots, \(\frac{(2-\alpha \beta )}{\beta ^2} \pm \frac{2\sqrt{1-\alpha \beta }}{\beta ^2}\), which are equal to \(l_{\textrm{min}}\) and \(l_{\textrm{max}}\) since \(X^2\pm 2X+1 = (X\pm 1)^2\). \(\square \)
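These algebraic identities are straightforward to verify numerically; a minimal check, for hypothetical values of \(\alpha \) and \(\beta \) with \(\alpha \beta \le 1\), is sketched below.

    import numpy as np

    # Numeric check of the discriminant and roots computed in the proof of Lemma 3.1,
    # for hypothetical alpha, beta with alpha * beta <= 1.
    alpha, beta = 0.5, 1.0
    coeffs = [beta**2, 2 * (alpha * beta - 2), alpha**2]  # h as a polynomial in lambda
    disc = coeffs[1] ** 2 - 4 * coeffs[0] * coeffs[2]
    assert np.isclose(disc, 16 * (1 - alpha * beta))
    roots = np.sort(np.roots(coeffs))
    expected = np.sort(((2 - alpha * beta)
                        + np.array([-2.0, 2.0]) * np.sqrt(1 - alpha * beta)) / beta**2)
    assert np.allclose(roots, expected)  # the roots l_min and l_max of Lemma 3.1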

Proof of Lemma 3.2

Assume that \(\lambda _p>0\). If \(\Delta _{M_p}<0\), then \( 2\Re (\sigma _{p,-}) = 2\Re (\sigma _{p,+}) = -(\alpha +\beta \lambda _p)<0\). If \(\Delta _{M_p}\ge 0\), then \(\sigma _{p,-}\) and \(\sigma _{p,+}\) are real. Remark that \(\sigma _{p,-}\sigma _{p,+}=\lambda _p\), so the eigenvalues have the same sign, and \(\sigma _{p,-}+\sigma _{p,+}= -(\alpha +\beta \lambda _p)<0\), so they are negative. \(\square \)

Proof of Theorem B.1

Let \((\theta ,\psi )\in \mathbb {R}^P\times \mathbb {R}^P\). To prove that G is a local diffeomorphism, we prove that \(DG(\theta ,\psi )\) is invertible and then use the local inversion theorem. Using again the block transformation of \(DG(\theta ,\psi )\), \(\det (DG(\theta ,\psi )) = \prod _{p=1}^P \det (M_p)\), where

$$\begin{aligned} \det (M_p) &= \left( 1 - \gamma \left( \alpha -\frac{1}{\beta }\right) -\gamma \beta \lambda _p\right) \left( 1-\frac{\gamma }{\beta }\right) - \frac{\gamma }{\beta }\gamma \left( \alpha -\frac{1}{\beta }\right) \\ &= 1 - \gamma (\alpha +\beta \lambda _p) + \gamma ^2\lambda _p. \end{aligned}$$
(19)

We want \(\gamma \) such that \(\det (M_p)\ne 0\) for any \((\theta ,\psi )\in \mathbb {R}^P\times \mathbb {R}^P\), hence for any \(\lambda _p\in [-L,L]\) (using Assumption 1). First, if \(\lambda _p=0\), then from (19) we must take \(\gamma \ne 1/\alpha \). Now let \(\lambda _p\ne 0\); then (19) is a second-order polynomial in \(\gamma \) with discriminant \( (\alpha +\beta \lambda _p)^2 - 4\lambda _p= \Delta _{M_p}\), already studied in Sect. 3.1 and Lemma 3.1. If \(\Delta _{M_p}<0\), then \(\det (M_p)\) has no real roots and the choice of \(\gamma \) is free. Assume now that \(\Delta _{M_p}\ge 0\); then there exist two real roots of (19):

$$\begin{aligned} \gamma ^{+} = \frac{\alpha +\beta \lambda _p}{2\lambda _p} + \frac{\sqrt{(\alpha + \beta \lambda _p)^2 - 4\lambda _p}}{2 \lambda _p} \quad \text {and}\quad \gamma ^{-} = \frac{\alpha +\beta \lambda _p}{2\lambda _p} - \frac{\sqrt{(\alpha + \beta \lambda _p)^2 - 4\lambda _p}}{2 \lambda _p}. \end{aligned}$$
(20)

Remark that when \(\lambda _p<0\), \(\gamma ^+<0\), and when \(\lambda _p>0\), \(0<\gamma ^-<\gamma ^+\), so in every case we only need to ensure \(0<\gamma <\gamma ^-\) for every \(\lambda _p\in [-L,L]\). So, for every \(\lambda \in \mathbb {R}\) for which it is well defined, consider the function \(\gamma ^{-}(\lambda )=\frac{(\alpha +\beta \lambda )}{2\lambda } - \frac{\sqrt{(\alpha + \beta \lambda )^2 - 4\lambda }}{2 \lambda }\). When defined, its derivative is \( -\frac{\alpha \sqrt{\left( \alpha +\beta \lambda \right) ^2-4\lambda }+\left( 2-\alpha \beta \right) \lambda -\alpha ^2}{2\lambda ^2\sqrt{\left( \alpha +\beta \lambda \right) ^2-4\lambda }}\). The denominator is always positive, so we study the numerator \(h(\lambda )=-\alpha \sqrt{(\alpha +\beta \lambda )^2-4\lambda }-(2-\alpha \beta )\lambda +\alpha ^2 \), and we differentiate it:

$$\begin{aligned} h'(\lambda ) = -\frac{\alpha (2\beta (\alpha +\beta \lambda )-4)}{2\sqrt{(\alpha +\beta \lambda )^2-4\lambda }}+\alpha \beta -2, \quad \text {and}\quad h''(\lambda ) = -\frac{4\alpha (\alpha \beta -1)}{((\alpha +\beta \lambda )^2-4\lambda )^{\frac{3}{2}}}. \end{aligned}$$

This allows deducing the minimal value of \(\gamma ^-(\lambda )\) in each setting by constructing the tables of variations displayed in Fig. 4. There, it follows from standard computations that \(h'(0)=h(0)=0\), \(h(l_{\textrm{max}})\le 0\) and \(\lim _{\lambda \rightarrow +\infty } h'(\lambda )=-2\) (when \(\alpha \beta \le 1\)), and via L’Hôpital’s rule we obtain \(\lim _{\lambda \rightarrow 0}\gamma ^-(\lambda )=1/\alpha \). We deduce from the tables that G is a local diffeomorphism if \(\gamma <\gamma ^-(L)\) when \(\alpha \beta >1\), and if \(\gamma <\min (\gamma ^-(L),\gamma ^-(-L))\) when \(\alpha \beta \le 1\) and \(L\notin [l_{\textrm{min}},l_{\textrm{max}}]\). Remark that the condition \(\gamma \ne \frac{1}{\alpha }\) is implied in both cases. This proves the theorem. \(\square \)
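Given a Lipschitz constant L from Assumption 1, the step-size threshold obtained in this proof can be evaluated directly from (20). The sketch below does so for hypothetical parameter values; note that the second branch additionally requires \(L\notin [l_{\textrm{min}},l_{\textrm{max}}]\), which is not checked here.

    import numpy as np

    def gamma_minus(lam, alpha, beta):
        # Smaller root gamma^-(lambda) of (19); defined when (alpha + beta*lam)^2 >= 4*lam.
        disc = (alpha + beta * lam) ** 2 - 4 * lam
        if disc < 0:
            return np.inf  # no real root: this lambda puts no constraint on gamma
        return ((alpha + beta * lam) - np.sqrt(disc)) / (2 * lam)

    # Hypothetical parameters with alpha * beta > 1; L is the constant of Assumption 1.
    alpha, beta, L = 2.0, 1.0, 50.0
    if alpha * beta > 1:
        bound = gamma_minus(L, alpha, beta)
    else:  # also requires L outside [l_min, l_max], not checked here
        bound = min(gamma_minus(L, alpha, beta), gamma_minus(-L, alpha, beta))
    print(bound)  # any step-size 0 < gamma < bound keeps G a local diffeomorphism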

Fig. 4 Tables of variations for the proof of Theorem B.1. The sign of \(h''\) allows deducing the variations and signs of \(h'\) and h, which in turn allow deducing the minima of \(\gamma ^-\)

C Proof of Convergence of INNA

To prove Theorem 4.3, we will use the following lemma.

Lemma C.1

([4]) If a bounded sequence \((u_k)_{k\in \mathbb {N}}\) in \(\mathbb {R}^P\) satisfies \( \lim \limits _{k\rightarrow +\infty } \Vert u_{k+1} - u_k\Vert =0\), then the set of accumulation points of \((u_k)_{k\in \mathbb {N}}\) is connected. If this set is finite, then it reduces to a singleton and \((u_k)_{k\in \mathbb {N}}\) converges.

Proof of Theorem 4.3

Assume that Assumption 1 holds and that \(\alpha >0\). Let \((\theta _0,\psi _0)\in \mathbb {R}^P\times \mathbb {R}^P\), and let \(\gamma >0\) be such that (15) holds. Let \((\theta _k,\psi _k)_{k\in \mathbb {N}}\) be the sequence generated by INNA initialized at \((\theta _0,\psi _0)\). We first show that the sequence \((\mathcal {E}_k)_{k\in \mathbb {N}}\), defined for all \(k\in \mathbb {N}\) by

$$\begin{aligned} \mathcal {E}_k = (1+\alpha \beta -\gamma \alpha )\mathcal {J}(\theta _k) + \frac{1}{2}\Vert \left( \alpha -\frac{1}{\beta }\right) \theta _k +\frac{1}{\beta }\psi _k\Vert ^2 \end{aligned}$$
(21)

converges. The sequence \((\mathcal {E}_k)_{k\in \mathbb {N}}\) represents an “energy” that decreases along the iterations, where the first and second terms in (21) represent “potential” and “kinetic” energies, respectively. This sequence resembles the Lyapunov function of DIN [3, 19] but is more involved to derive, as is often the case for algorithms compared to ODEs. We use the notations \(a=\alpha -1/\beta \), \(b=1/\beta \), \(\Delta \theta _k =\theta _{k+1}-\theta _k\) and \(\Delta \psi _k =\psi _{k+1}-\psi _k\), for \(k\in \mathbb {N}\), so that INNA is rewritten as:

$$\begin{aligned} \Delta \psi _k &= -\gamma a\theta _k -\gamma b\psi _k, \\ \Delta \theta _k &= \Delta \psi _k -\gamma \beta \nabla \mathcal {J}(\theta _k). \end{aligned}$$
(22)

Also denote \(\mu = 1+\alpha \beta -\gamma \alpha \), where \(\mu >0\) since \(\gamma < 1/\alpha + \beta \). Let \(k\in \mathbb {N}\); we will prove that \(\mathcal {E}_{k+1}-\mathcal {E}_{k}\le 0\). From Assumption 1 follows a descent lemma (see [16, Proposition A.24]):

$$\begin{aligned} \mu \mathcal {J}(\theta _{k+1}) - \mu \mathcal {J}(\theta _k) \le \mu \langle \nabla \mathcal {J}(\theta _k), \Delta \theta _k \rangle + \frac{\mu L}{2} \Vert \Delta \theta _k\Vert ^2, \end{aligned}$$

which according to (22), can equivalently be rewritten as,

$$\begin{aligned} \mu \mathcal {J}(\theta _{k+1}) - \mu \mathcal {J}(\theta _k) \le -\mu \langle \frac{\Delta \theta _k-\Delta \psi _k}{\gamma \beta }, \Delta \theta _k \rangle + \frac{\mu L}{2} \Vert \Delta \theta _k\Vert ^2. \end{aligned}$$
(23)

We save this for later and now turn our attention to the other term in \(\mathcal {E}_{k+1}-\mathcal {E}_k\),

$$\begin{aligned} \frac{1}{2}\Vert a\theta _{k+1} + b\psi _{k+1}\Vert ^2 - \frac{1}{2}\Vert a\theta _{k} + b\psi _{k}\Vert ^2 = \frac{1}{2}\Vert a\theta _{k} + a\Delta \theta _k + b\psi _{k}+b\Delta \psi _k\Vert ^2 - \frac{1}{2}\Vert a\theta _{k} + b\psi _{k}\Vert ^2. \end{aligned}$$

Expanding this and using the fact that \( a\theta _{k} + b\psi _{k} = -\Delta \psi _k/\gamma \), we can show that,

$$\begin{aligned} &\frac{1}{2}\Vert a\theta _{k} + a\Delta \theta _k + b\psi _{k}+b\Delta \psi _k\Vert ^2 - \frac{1}{2}\Vert a\theta _{k} + b\psi _{k}\Vert ^2 \\ &\quad = \frac{a^2}{2}\Vert \Delta \theta _k\Vert ^2 + \frac{b^2}{2} \Vert \Delta \psi _k\Vert ^2 + ab\langle \Delta \theta _k,\Delta \psi _k\rangle -\frac{a}{\gamma }\langle \Delta \theta _k,\Delta \psi _k\rangle -\frac{b}{\gamma }\Vert \Delta \psi _k\Vert ^2. \end{aligned}$$
(24)

We then use \(\Vert \Delta \psi _k\Vert ^2 = \Vert \Delta \theta _k - \Delta \psi _k\Vert ^2 + \Vert \Delta \theta _k\Vert ^2 -2\langle \Delta \theta _k,\Delta \theta _k-\Delta \psi _k\rangle \) and \(\langle \Delta \theta _k,\Delta \psi _k\rangle = \Vert \Delta \theta _k\Vert ^2 -\langle \Delta \theta _k,\Delta \theta _k-\Delta \psi _k\rangle \) in (24) to obtain:

$$\begin{aligned} &\frac{1}{2}\Vert a\theta _{k+1} + b\psi _{k+1}\Vert ^2 - \frac{1}{2}\Vert a\theta _{k} + b\psi _{k}\Vert ^2 = \left( \frac{a^2}{2}+\frac{b^2}{2}+ ab - \frac{a}{\gamma } - \frac{b}{\gamma }\right) \Vert \Delta \theta _k\Vert ^2 \\ &\quad +\left( \frac{b^2}{2}-\frac{b}{\gamma }\right) \Vert \Delta \theta _k-\Delta \psi _k\Vert ^2 + \left( -b^2-ab+\frac{a}{\gamma }+\frac{2b}{\gamma }\right) \langle \Delta \theta _k,\Delta \theta _k-\Delta \psi _k\rangle . \end{aligned}$$
(25)

We then simplify the factors using the identity \(a+b=\alpha \), as well as \(\frac{a^2}{2}+\frac{b^2}{2}+ ab = \frac{1}{2}(a+b)^2 = \frac{\alpha ^2}{2}\), and \(-b^2-ab = -\alpha /\beta \) to deduce that (25) is equal to

$$\begin{aligned} \left( \frac{\alpha ^2}{2} - \frac{\alpha }{\gamma }\right) \Vert \Delta \theta _k\Vert ^2 +\frac{\gamma -2\beta }{2\gamma \beta ^2}\Vert \Delta \theta _k-\Delta \psi _k\Vert ^2 +\frac{-\gamma \alpha +\alpha \beta +1}{\gamma \beta }\langle \Delta \theta _k,\Delta \theta _k-\Delta \psi _k\rangle . \end{aligned}$$
(26)

We can finally combine (23) and (26),

$$\begin{aligned} \mathcal {E}_{k+1}-\mathcal {E}_k &\le \left( \frac{\mu L}{2}+\frac{\alpha ^2}{2} - \frac{\alpha }{\gamma }\right) \Vert \Delta \theta _k\Vert ^2 +\frac{\gamma -2\beta }{2\gamma \beta ^2}\Vert \Delta \theta _k-\Delta \psi _k\Vert ^2 \\ &\quad + \left( -\frac{\mu }{\gamma \beta }+\frac{1+\alpha \beta -\gamma \alpha }{\gamma \beta }\right) \langle \Delta \theta _k,\Delta \theta _k-\Delta \psi _k\rangle . \end{aligned}$$
(27)

Notice that \(\mu = 1+\alpha \beta -\gamma \alpha \) is specifically chosen so that the last term in (27) vanishes, so,

$$\begin{aligned} \mathcal {E}_{k+1}-\mathcal {E}_k \le \frac{\gamma \mu L+\gamma \alpha ^2 -2\alpha }{2\gamma }\Vert \Delta \theta _k\Vert ^2 +\frac{\gamma -2\beta }{2\gamma \beta ^2}\Vert \Delta \theta _k-\Delta \psi _k\Vert ^2. \end{aligned}$$
(28)

To prove the decrease of \((\mathcal {E}_k)_{k\in \mathbb {N}}\), it remains to justify that both terms in (28) are negative. First, the condition \(\gamma <2\beta \) in (15) makes the second term negative. Then,

$$\begin{aligned} \gamma \mu L+\gamma \alpha ^2 -2\alpha<0 \iff -\alpha L\gamma ^2 + \left( \alpha ^2+ (1+\alpha \beta )L\right) \gamma - 2\alpha <0. \end{aligned}$$

A simpler sufficient condition for this to hold is \(\left( \alpha ^2+ (1+\alpha \beta )L\right) \gamma - 2\alpha <0\), or equivalently \(\gamma < 2\alpha /\left( \alpha ^2+ (1+\alpha \beta )L\right) \), which holds by (15). So the sequence \((\mathcal {E}_k)_{k\in \mathbb {N}}\) is decreasing. It is also lower-bounded since \(\mathcal {J}\) is lower-bounded, so it converges.
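The decrease of \((\mathcal {E}_k)_{k\in \mathbb {N}}\) established above is easy to observe numerically. The following self-contained sketch runs INNA on a toy quadratic objective (a hypothetical choice with Lipschitz constant L = 1) and asserts that the energy (21) is non-increasing, for a step-size assumed to satisfy (15).

    import numpy as np

    # Toy check of the decrease of (E_k): J(theta) = 0.5 * ||theta||^2 (hypothetical
    # choice, grad_J(theta) = theta, Lipschitz constant L = 1), with a step-size
    # assumed small enough for (15) to hold.
    alpha, beta, gamma = 1.0, 0.5, 0.1
    J = lambda t: 0.5 * float(np.dot(t, t))
    grad_J = lambda t: t

    def energy(theta, psi):
        # E_k as defined in (21)
        v = (alpha - 1 / beta) * theta + psi / beta
        return (1 + alpha * beta - gamma * alpha) * J(theta) + 0.5 * float(np.dot(v, v))

    rng = np.random.default_rng(0)
    theta, psi = rng.standard_normal(5), rng.standard_normal(5)
    E_prev = energy(theta, psi)
    for _ in range(200):
        common = -(alpha - 1 / beta) * theta - psi / beta
        theta, psi = theta + gamma * (common - beta * grad_J(theta)), psi + gamma * common
        E = energy(theta, psi)
        assert E <= E_prev + 1e-12  # (E_k) is non-increasing
        E_prev = E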

The rest of the proof then relies on exploiting (28). Let \(K\in \mathbb {N}\); summing (28) over \(k=0,\ldots ,K\) yields:

$$\begin{aligned} \sum _{k=0}^{K} \mathcal {E}_{k+1} - \mathcal {E}_{k} \le \frac{\gamma \mu L+\gamma \alpha ^2 -2\alpha }{2\gamma } \sum _{k=0}^{K} \Vert \Delta \theta _k\Vert ^2 +\frac{\gamma -2\beta }{2\gamma \beta ^2}\sum _{k=0}^{K} \Vert \Delta \theta _k-\Delta \psi _k\Vert ^2. \end{aligned}$$

The left-hand side is a telescoping sum, and it follows from (22) that \(\forall k\in \mathbb {N}\), \(\Delta \theta _k-\Delta \psi _k = -\gamma \beta \nabla \mathcal {J}(\theta _k)\), so denoting \(C_1 = -(\gamma \mu L+\gamma \alpha ^2 -2\alpha )/(2\gamma )>0\) and \(C_2 = -(\gamma ^2-2\gamma \beta )/2>0\),

$$\begin{aligned} \mathcal {E}_{0}-\mathcal {E}_{K+1} \ge C_1 \sum _{k=0}^{K} \Vert \Delta \theta _k\Vert ^2 + C_2\sum _{k=0}^{K} \Vert \nabla \mathcal {J}(\theta _k)\Vert ^2. \end{aligned}$$

Then, \(\mathcal {E}_{0} - \mathcal {E}_{K+1}\) is upper bounded since \((\mathcal {E}_k)_{k\in \mathbb {N}}\) converges, so \(\sum _{k=0}^{+\infty } \Vert \nabla \mathcal {J}(\theta _k)\Vert ^2<+\infty \). This implies that \(\lim _{k\rightarrow +\infty }\Vert \nabla \mathcal {J}(\theta _k)\Vert ^2 =0\), and we deduce similarly that \(\lim _{k\rightarrow +\infty }\Vert \theta _{k+1}-\theta _{k}\Vert ^2 =0\). Using (14), we also have,

$$\begin{aligned} \Vert (\alpha -\frac{1}{\beta })\theta _k +\frac{1}{\beta }\psi _k\Vert ^2 = \frac{1}{\gamma ^2}\Vert \psi _{k+1}-\psi _{k}\Vert ^2\le \frac{2}{\gamma ^2}\Vert \theta _{k+1}-\theta _{k}\Vert ^2 +2\beta ^2 \Vert \nabla \mathcal {J}(\theta _k)\Vert ^2 \xrightarrow [k\rightarrow \infty ]{}0. \end{aligned}$$
(29)

The convergence of \((\mathcal {E}_k)_{k\in \mathbb {N}}\) and (29) imply that \((\mathcal {J}(\theta _k))_{k\in \mathbb {N}}\) converges, which proves the first part of the theorem. Assume now that the critical points of \(\mathcal {J}\) are isolated and that the sequence \((\theta _k)_{k\in \mathbb {N}}\) is uniformly bounded in \(\mathbb {R}^P\). According to Lemma C.1, since \((\theta _k)_{k\in \mathbb {N}}\) is bounded and \(\lim _{k\rightarrow +\infty }\Vert \theta _{k+1}-\theta _{k}\Vert =0\), the set of accumulation points of \((\theta _k)_{k\in \mathbb {N}}\) is connected. By continuity of \(\nabla \mathcal {J}\), accumulation points of \((\theta _k)_{k\in \mathbb {N}}\) are critical points of \(\mathcal {J}\), which are assumed to be isolated. So the set of accumulation points is a singleton and \((\theta _k)_{k\in \mathbb {N}}\) converges to it. \(\square \)



Cite this article

Castera, C. Inertial Newton Algorithms Avoiding Strict Saddle Points. J Optim Theory Appl 199, 881–903 (2023). https://doi.org/10.1007/s10957-023-02330-0
