A Fast and Simple Modification of Newton’s Method Avoiding Saddle Points

We propose in this paper New Q-Newton's method. The update rule is conceptually very simple, using the projections to the vector subspaces generated by eigenvectors of positive (correspondingly negative) eigenvalues of the Hessian. The main result of this paper roughly says that if a sequence {x_n} constructed by the method from a random initial point x_0 converges, then the limit point is a critical point and not a saddle point, and the convergence rate is the same as that of Newton's method. A subsequent work has recently been successful in incorporating Backtracking line search into New Q-Newton's method, thus resolving the global convergence issue observed for some (non-smooth) functions. An application to quickly finding zeros of a univariate meromorphic function is discussed, accompanied by an illustration on basins of attraction.


Introduction
In this paper, we consider the unconstrained optimization problem of finding min_{x∈R^m} f(x) for a function f : R^m → R. This includes, as a special case, the question of finding solutions to systems of equations. Smale's famous list of problems [23] mentions efficient algorithms to solve systems of polynomial equations as important for twenty-first century mathematics. Throughout the paper, we assume only that f is a C² function (to use the algorithm) or a C³ function (to prove some theoretical results), and do not assume restrictive conditions such as the gradient or the Hessian of f being Lipschitz continuous.
In general, finding global minima of a function is NP-hard. Hence, finding a (good) local minimum is the most one can aim for. Moreover, in reality one cannot hope to find closed-form solutions. Therefore, a common strategy is to design an iterative method, which from an initial point x_0 constructs a sequence {x_n}. To be useful, such a method should satisfy the following important requirements (for many interesting and useful cost functions):

Requirement 1: Any cluster point of {x_n} is a critical point of f.
Requirement 2: If {x_n} has a bounded subsequence, then {x_n} itself converges.
Requirement 3: If x_0 is randomly chosen and {x_n} converges, then the limit point is not a saddle point.
Requirement 4: The method is easy to implement and works well and stably with respect to its parameters.
For first-order methods, there are several algorithms satisfying all of these 4 requirements that work well even in large-scale problems such as deep neural networks. A common theme of these methods is that they incorporate line search, in particular Armijo's [4]. We refer the readers to [27-29] and references therein for recent progress and a historical overview.
This paper concerns variants of Newton's method, which currently has no version satisfying all Requirements 1-4, even though it has a fast rate of convergence (when it does converge) and is easy to implement.

Main contributions of this paper
We define a new variant of Newton's method, named New Q-Newton's method, which for a cost function on R^m depends on m+1 random parameters fixed from the beginning. The method is both conceptually simple and easy to implement. It modifies the Hessian of the cost function by a term depending on the size of the gradient of the function and the above-mentioned m+1 parameters. It then uses the absolute values of the new matrix's eigenvalues, and not the eigenvalues themselves as in Newton's method. Roughly speaking, this helps to stay away from negative eigenvalues of the Hessian and hence is good for the purpose of avoiding saddle points. At the same time, it behaves like Newton's method near critical points where the Hessian matrix is positive definite. Additionally, it satisfies Requirements 3 and 4 (see Theorem 3.1). The new method's advantages are illustrated in Theorem 3.3, the best result so far for finding roots of meromorphic functions in 1 complex variable.
Related works There are many variants of Newton's method in the literature, which makes it impossible to review all of them. Because of limited space, we are only able to pinpoint a few relevant representative classes.
There are algorithms which are computationally not too expensive, like BFGS and many other quasi-Newton methods, but for which no good theoretical guarantees for general non-convex cost functions are known.
There are algorithms which add a correction to the matrix used in the standard Newton's method, so that the final matrix is positive definite. For a direct application of Newton's method to a system of equations F = 0, one well-known method is the Levenberg-Marquardt algorithm [3,9,17,18,33], which adds a positive multiple of the identity matrix to the main matrix. Another method, Regularized Newton's method [20,30,31], adds a positive multiple of the identity to the Hessian of the associated cost function f = (1/2)||F||², where the multiple is bigger than the absolute value of the smallest negative eigenvalue of the Hessian matrix. These methods have a good rate of convergence near the solutions of the system of equations. There are also Backtracking versions of them [2], which, under some restrictions (e.g. requiring that the function has compact sublevels and its gradient is Lipschitz continuous), have good global convergence (satisfying Requirement 1, Requirement 4, as well as Requirement 2 under some restrictions). On the other hand, there is no theoretical proof yet concerning whether these methods (and their Backtracking versions) can avoid saddle points. Experiments in [25] seem to show that, in general, these methods do not have a global convergence guarantee or cannot avoid saddle points. These two methods are relevant to New Q-Newton's method and hence will be discussed more in Sect. 2.
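As an illustration of the Levenberg-Marquardt update just described, here is a minimal numpy sketch. The toy system F and the rule used for the damping parameter are our own illustrative choices, not taken from the cited works; the point is only that the damped matrix J^T J + λ Id is always positive definite for λ > 0, so the linear system is always solvable.

```python
import numpy as np

def levenberg_marquardt_step(F, J, x, lam):
    """One Levenberg-Marquardt step for solving F(x) = 0.
    J(x)^T J(x) + lam * Id is positive definite for lam > 0,
    so the linear system below always has a unique solution."""
    Jx = J(x)
    A = Jx.T @ Jx + lam * np.eye(len(x))   # damped Gauss-Newton matrix
    step = np.linalg.solve(A, Jx.T @ F(x))
    return x - step

# Toy system: F(x, y) = (x^2 + y^2 - 1, x - y), with a root at (1/sqrt(2), 1/sqrt(2)).
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])

x = np.array([2.0, 0.5])
for _ in range(30):
    # Illustrative damping choice: lam shrinks as the residual shrinks.
    x = levenberg_marquardt_step(F, J, x, lam=1e-3 * np.linalg.norm(F(x))**2)
print(x)  # close to (0.7071068, 0.7071068)
```

With this damping rule λ_n → 0 near the solution, so the iteration approaches the plain Gauss-Newton step and converges fast.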
There are algorithms which mix first- and second-order methods. One method, that of directions of negative curvature [10,11], adds to the gradient a negative multiple of an eigenvector corresponding to a negative eigenvalue of the Hessian matrix. It behaves like gradient descent near non-degenerate local minima. A more recent method is inertial Newton's method [6], which is a discretization of a modification of Newton's flow. While it is originally of order 2, it is reduced to a system of order 1. These methods can avoid saddle points, but global convergence guarantees are only proven under several restrictive assumptions, such as the gradient being Lipschitz continuous (see, for example, Section 3 in [10,11]). Moreover, the rate of convergence of these methods is slower than that of Newton's method, and more like that of first-order methods. For comparison, the Backtracking version of gradient descent is a variant of gradient descent which works very well and stably under wider general settings (including deep neural networks), both theoretically and practically.
There are methods which replace calculating the Hessian matrix with a trust-region procedure at each step. One well-known representative is (adaptive) cubic regularization [7,19]. A line search is integrated into adaptive cubic regularization in [5], which, under some assumptions that may be difficult to check beforehand (such as requiring that the sequence {∇²f(x_n)} is uniformly bounded and that {f(x_n)} and {∇f(x_n)} are uniformly continuous), proves only that {∇f(x_n)} converges, and not the convergence of {x_n} itself. Moreover, these methods require solving optimization subproblems in each iterative step, which makes satisfying Requirement 4 extremely difficult (see a discussion about this in [13], and experiments in this paper and in [25]).
In a subsequent work [25], Armijo's Backtracking line search [4] is incorporated into New Q-Newton's method. This is based on the observation that the final matrix used in New Q-Newton's method is positive definite. The resulting algorithm is named Backtracking New Q-Newton's method, which satisfies, besides Requirements 3 and 4, also Requirements 1 and 2. That work also contains a more comprehensive analysis of variants of Newton's method.
Organization of the paper For the readers' convenience, in Sect. 2 we provide a concise review of variants of Newton's method related to New Q-Newton's method. The definition of New Q-Newton's method and its main theoretical properties are presented in Sect. 3, where some pictures of basins of attraction are also given. After that, some conclusions are presented. The appendices present details of some technical proofs, implementation details, experimental settings, as well as some further experimental results. (Many more can be found in the arXiv version of this paper and in the paper [25].)

A Brief Review on Relevant Variants of Newton's Method
Let f : R^m → R be a C² function. We recall some common notations: ∇f is the gradient of f, and ∇²f is its Hessian. A point x* is a critical point of f if ∇f(x*) = 0. A critical point x* is non-degenerate if ∇²f(x*) is invertible, and it is a saddle point if it is non-degenerate and ∇²f(x*) has at least one negative eigenvalue. (Note that this definition is more general than the usual one, in that it includes also local maxima.) A function f is Morse if all of its critical points are non-degenerate.
In Newton's method, from x_0 ∈ R^m one defines subsequently x_{n+1} = x_n − (∇²f(x_n))^{-1} ∇f(x_n). Regularized Newton's method adds c_1 max{0, −λ_min(∇²f(x_n))} Id to the Hessian, where λ_min denotes the smallest eigenvalue. For a direct application to a system of equations F = 0, the main matrix used is J_F(x_n)^T J_F(x_n), where J_F is the Jacobian matrix of F and (.)^T is the transpose of a matrix. The Levenberg-Marquardt algorithm adds λ_n Id to the matrix J_F(x_n)^T J_F(x_n), where usually λ_n is chosen depending on the size of ||F(x_n)||. The final matrices used in both methods are positive semi-definite. New Q-Newton's method also adds a term λ_n Id to the Hessian ∇²f(x_n), but does not require that the final matrix ∇²f(x_n) + λ_n Id be positive definite. Instead, we change the sign of the negative eigenvalues of the matrix ∇²f(x_n) + λ_n Id. We also simplify the choice of λ_n by letting λ_n = δ_j ||∇f(x_n)||^{1+α} for δ_j in a fixed set of m+1 real numbers. This turns out to provide the algorithm with very strong theoretical guarantees, in particular in its Backtracking version in [25], while also making it straightforward to implement the algorithm and its variants. A detailed description of this method is in the next section.

The Algorithm
Here, we introduce New Q-Newton's method. An invertible symmetric square matrix A of dimension m with real coefficients is diagonalizable by an orthogonal matrix. That is, we can find an orthonormal basis e_1, ..., e_m of R^m and m nonzero real numbers λ_1, ..., λ_m such that A.e_j = λ_j e_j for all j = 1, ..., m. For a vector v ∈ R^m, we define pr_{A,+}(v) = Σ_{i: λ_i > 0} ⟨v, e_i⟩ e_i and pr_{A,−}(v) = Σ_{i: λ_i < 0} ⟨v, e_i⟩ e_i. If V^+ is the vector space generated by {e_i : λ_i > 0} and V^− is the vector space generated by {e_i : λ_i < 0}, then V^+ and V^− are uniquely defined and independent of the choice of the vectors e_1, ..., e_m. Then, pr_{A,+} is simply the orthogonal projection to V^+, and pr_{A,−} is simply the orthogonal projection to V^−.
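The projections pr_{A,+} and pr_{A,−} can be computed directly from an eigendecomposition. A minimal numpy sketch (ours, not the authors' code; the matrix A and vector v below are illustrative):

```python
import numpy as np

def projections(A, v):
    """Orthogonal projections of v onto the sums of eigenspaces of A
    with positive / negative eigenvalues. A is assumed symmetric and
    invertible (all eigenvalues nonzero)."""
    eigvals, eigvecs = np.linalg.eigh(A)  # orthonormal eigenbasis for symmetric A
    coeffs = eigvecs.T @ v                # coordinates <v, e_i>
    pos = eigvecs[:, eigvals > 0] @ coeffs[eigvals > 0]
    neg = eigvecs[:, eigvals < 0] @ coeffs[eigvals < 0]
    return pos, neg

A = np.diag([2.0, -3.0])
v = np.array([1.0, 1.0])
pos, neg = projections(A, v)
print(pos, neg)  # [1. 0.] [0. 1.]
```

Since the eigenbasis is orthonormal, pos + neg recovers v whenever A is invertible.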

Algorithm 1: New Q-Newton's method
Result: Find a critical point of f : R^m → R. Given: Δ = {δ_0, δ_1, ..., δ_m} (chosen randomly) and α > 0; Initialization: x_0 ∈ R^m. At step n, if ∇f(x_n) ≠ 0, let j be the smallest index for which A_n := ∇²f(x_n) + δ_j ||∇f(x_n)||^{1+α} Id is invertible; set v_n := A_n^{-1} ∇f(x_n), w_n := pr_{A_n,+}(v_n) − pr_{A_n,−}(v_n), and x_{n+1} := x_n − w_n. The role of the reflection in w_n is to change the sign of the negative eigenvalues of A_n. As the proof of the main results and the experiments show, in Algorithm 1 one does not need to have exact values of the Hessian, its eigenvalues, and eigenvectors for the algorithm to perform well. The randomness of the parameters δ_0, δ_1, ..., δ_m is only needed in the proof (see Theorem 3.1) that the algorithm can avoid saddle points. (For the existence of local Stable-Centre manifolds near saddle points, this randomness is not needed.) See "Appendix" for implementation details. See [25] for variants.
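One iteration of the method can be sketched compactly as follows. This is our own numpy sketch of the update rule described above, not the authors' released code; the quadratic test function, the invertibility tolerance, and the choice Δ = {0, ±1} (used later in the experiments section) are illustrative.

```python
import numpy as np

def new_q_newton_step(grad, hess, x, deltas, alpha=1.0):
    """One step of New Q-Newton's method (sketch).
    Tries delta_j until A_n = Hessian + delta_j * ||grad||^(1+alpha) * Id
    is invertible, then takes the Newton-like step in which each eigenvalue
    lambda of A_n is replaced by |lambda|."""
    g = grad(x)
    H = hess(x)
    for d in deltas:
        A = H + d * np.linalg.norm(g) ** (1.0 + alpha) * np.eye(len(x))
        eigvals, eigvecs = np.linalg.eigh(A)
        if np.min(np.abs(eigvals)) > 1e-12:       # A_n invertible
            coeffs = eigvecs.T @ g
            w = eigvecs @ (coeffs / np.abs(eigvals))  # divide by |lambda|, not lambda
            return x - w
    raise RuntimeError("no delta_j made A_n invertible")

# f(x, y) = x^2 - y^2 has a saddle at the origin; plain Newton jumps
# straight to it, while the reflected step pushes y away from 0.
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, -2.0]])
x = np.array([1.0, 0.5])
x = new_q_newton_step(grad, hess, x, deltas=[0.0, 1.0, -1.0])
print(x)  # [0. 1.]
```

On this example plain Newton's method maps (1, 0.5) to the saddle (0, 0), whereas the reflected step maps it to (0, 1): the component along the negative eigenspace is repelled instead of attracted.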
A disadvantage of (Backtracking) New Q-Newton's method is that in higher dimensions it is very costly to compute the Hessian matrix and its eigenvalues and eigenvectors. Resolving this issue is beyond the scope of the current paper. We note, however, that a simpler version, which does not use all negative eigenvalues but only the smallest negative eigenvalue, has been tested in [25] and performs similarly to Algorithm 1. This suggests that one may reduce the computational cost by using only a few eigenvalues of the Hessian matrix (which can be computed efficiently, e.g. by the Lanczos algorithm). Another efficient modification is to use two-way Backtracking line search; see [28,29].

Rate of Convergence and Avoidance of Saddle Points
The main result we obtain is the following.

Theorem 3.1 Let f : R^m → R be C³. Let {x_n} be a sequence constructed by New Q-Newton's method. Assume that {x_n} converges to x_∞. Then:
(1) x_∞ is a critical point of f.
(2) If δ_0, ..., δ_m are chosen randomly, then there is a set A ⊂ R^m of Lebesgue measure 0, so that if x_0 ∉ A, then x_∞ cannot be a saddle point of f.
(3) If ∇²f(x_∞) is invertible, then the rate of convergence is at least linear; if moreover x_0 ∉ A, then x_∞ is a local minimum and the rate of convergence is quadratic.
Next, we state an interesting immediate consequence of the theorem.

Corollary 3.1
Let f be a C³ Morse function. Let x_0 be a random initial point, and let {x_n} be a sequence constructed by New Q-Newton's method, where the hyperparameters δ_0, ..., δ_m are randomly chosen. If {x_n} converges to x_∞, then x_∞ is a local minimum and the rate of convergence is quadratic.
For simplicity, we can assume that x_∞ = 0. We assume that x_∞ is a saddle point and will arrive at a contradiction. By (1) we have ∇f(0) = 0, and by the assumption ∇²f(0) is invertible and has at least one negative eigenvalue. We look at the following (possibly discontinuous) relevant dynamical system on R^m: x ↦ F(x) = x − w(x), where w(x) is the step constructed by New Q-Newton's method at the point x. Then for an initial point x_0, the sequence constructed by New Q-Newton's method is exactly the orbit of x_0 under the dynamical system x ↦ F(x). The matrix A(x) used by the method is C¹ near x_∞, say in an open neighbourhood U of x_∞, and at every point x ∈ U, the matrix A(x) must be one of the m+1 matrices A_j(x) = ∇²f(x) + δ_j ||∇f(x)||^{1+α} Id, and therefore F(x) must be one of the corresponding m+1 maps F_j(x). Since f is assumed to be C³, it follows that all of the corresponding m+1 maps F_j are locally Lipschitz continuous. Now, we analyse the map F(x) near the point x_∞ = 0, using the fact that the projections pr_{A(x),±} depend differentiably on x near 0. This assertion is probably well known to experts, in particular in the field of perturbations of linear operators. Here, for completeness, we present a proof, following [15], by using an integral formula for projections on eigenspaces via the theory of resolvents. Let λ_1, ..., λ_s be the distinct solutions of the characteristic polynomial of A. By assumption, all λ_j are nonzero. Let γ_j ⊂ C be a small circle with positive orientation enclosing λ_j and no other λ_r's. Moreover, we can assume that γ_j does not contain 0 on it or inside it, for all j = 1, ..., s. Since A(x) converges to A(0), we can assume that for all x close to 0, all roots of the characteristic polynomial of A(x) are contained well inside the union ∪_{j=1,...,s} γ_j. Then by formula (5.22) on page 39 (see also Problem 5.9, chapter 1, in [15]), the projection onto the sum of the eigenspaces of A(x) whose eigenvalues are enclosed by γ_j is given by the resolvent integral P_j(x) = (1/2πi) ∮_{γ_j} (ζ Id − A(x))^{-1} dζ. Then, by the choice of the circles γ_j, we have pr_{A(x),+} = Σ_{j: λ_j > 0} P_j(x), and hence F(x) = F(0) + ∇F(0).x + o(||x||); here, we use the small-o notation. Therefore, we obtain the existence of local stable-central manifolds for the associated dynamical systems near saddle points of f (see Theorems III.6 and III.7 in [21]). We can then use the fact that, under the assumptions, the hyperparameters δ_0, ..., δ_m are randomly chosen, to obtain the following Claim: for a random choice of δ_0, ..., δ_m, the map F is, locally outside the set of critical points of f, invertible outside a set of Lebesgue measure 0.

The relation between the Claim and avoidance of saddle points is as follows. Let A_loc be the union of local stable-central manifolds around all saddle points. Then by the above arguments, we know that A_loc has Lebesgue measure 0. If x_0 is an initial point such that the constructed sequence x_n = F^{∘n}(x_0) converges to a saddle point, then there is some n_0 such that F^{∘n_0}(x_0) ∈ A_loc. Choose 0 ≤ m' ≤ n_0 to be the smallest number such that F^{∘m'}(x_0) ∈ A_loc ∪ E. Then by the Claim, x_0 is in the inverse image of a set of Lebesgue measure zero under a locally invertible map, and hence belongs to a set of Lebesgue measure zero. Taking the union over all n_0 and m', we find the set A of Lebesgue measure zero which contains x_0.
A similar claim has been established for another dynamical system in [27], for a version of Backtracking gradient descent. The idea in [27] is to show that the associated dynamical system (depending on ∇f), which is locally Lipschitz continuous, has locally bounded torsion. The case at hand, where the dynamical system depends on the Hessian and also on orthogonal projections to the eigenspaces of the Hessian, is more complicated to deal with.
We note that the fact that δ_0, ..., δ_m should be random for the Claim to hold has been overlooked in the arXiv version of this paper and has now been corrected in a new work by the first author [26], under more general settings. Here is a sketch of how to prove the Claim; see [26] for details.
Let C be, as above, the set of critical points of f. Since det(A(x, δ)) is a polynomial in δ and is nonzero for x ∉ C, there is a set Γ_1 ⊂ R of Lebesgue measure 0 so that for a given δ ∉ Γ_1, the set of x ∉ C for which A(x, δ) is not invertible has Lebesgue measure 0. One then shows that w(x, δ) (that is, the w(x) as above, with the parameter δ added to make the dependence on δ clear) is a rational function in δ, and is nonzero (by investigating what happens when δ → ∞). This allows one to show that there is a set Γ_2 ⊂ R\Γ_1 of Lebesgue measure 0 so that for all δ ∉ (Γ_1 ∪ Γ_2), the matrix A(x, δ) is invertible and the gradient of the dynamical system F(x) = x − w(x, δ) is, locally outside C, invertible outside a set of Lebesgue measure 0. This proves the Claim.

That δ_0, ..., δ_m are random means that they should avoid the set Γ_1 ∪ Γ_2.
(3) We can assume that x_∞ = 0 and define A = ∇²f(0). Part (1) and the assumption that ∇²f(0) is invertible imply that we can assume, without loss of generality, that A_n is invertible for all n. By Taylor's expansion, using the resulting approximation for A_n^{-1}, we find the asymptotic behaviour of x_{n+1} in terms of x_n. Since we assume that x_0 ∉ A, it follows that A is positive definite. Hence, we can assume, without loss of generality, that A_n is positive definite for all n. Then from the construction, we have that w_n = v_n for all n, and hence we obtain the same quadratic rate of convergence as for Newton's method. (4) The proof of part (3) shows that in general we still have the claimed weaker rate of convergence. (5) This assertion follows immediately from the proof of part (3).

Finding Roots of Meromorphic Functions in 1 Complex Variable
Here, we give an application of the new algorithm to quickly finding roots of meromorphic functions in 1 complex variable. As far as we know, the result in Theorem 3.3 is new and the strongest among all existing iterative algorithms in the contemporary literature.
To illustrate the advantage of the new algorithm, we present some pictures of basins of attraction taken from [25]. Because of the space limit, many lengthy proofs are put in "Appendix A". Concerning the direct use of Newton's method for solving systems of equations, even for polynomials p(z) of 1 complex variable z of small degree (e.g. 4), there is the well-known phenomenon of attracting cycles of at least 2 points. (Hence, as a consequence, Newton's method does not converge to a root of p(z).) Among all previous variants of Newton's method, we are aware of only one method which has a theoretical guarantee for convergence to roots [24]. This is the Random damping Newton's method, which has the update rule z_{n+1} = z_n − γ_n p(z_n)/p'(z_n), where γ_n is randomly chosen. However, this method has some disadvantages. First, the theoretical proof is very complicated. Second, there is no guarantee when it is applied to a meromorphic function as in Theorem 3.3, or in higher dimensions. Indeed, our extensive experiments seem to confirm that this method does not perform well in the more general setting, and in many instances it behaves like the original Newton's method.
Let g be a meromorphic function in 1 complex variable z ∈ C. Then, outside a discrete set (the poles of g), g is a usual holomorphic function. To avoid the trivial case, we can assume that g is non-constant. We write z = x + iy, where x, y ∈ R. We define u(x, y) = the real part of g, and v(x, y) = the imaginary part of g. Then, we consider the function f(x, y) = u(x, y)² + v(x, y)². A zero z = x + iy of g is a global minimum of f, at which the function value is 0. Therefore, an optimization algorithm can be used to find roots of g by applying it to the function f(x, y), provided the algorithm assures convergence to critical points and avoidance of saddle points, and provided that critical points of f which are not zeros of g are saddle points of f.

Theorem 3.2 Let f(x, y) be the function constructed from a non-constant meromorphic function g(z) as before. Assume that the constant α > 0 in the definition of New Q-Newton's method does not belong to the set {(n−3)/(n−1) : n = 2, 3, 4, ...} (e.g. we can choose α = 1). Let (x_n, y_n) be a sequence constructed by Backtracking New Q-Newton's method from an arbitrary initial point which is not a pole of f. Then, either lim_{n→∞} (x_n² + y_n²) = ∞, or the sequence {(x_n, y_n)} converges to a point (x*, y*) which is a critical point of f. Theorem 3.2 provides the needed convergence to critical points. Its proof will be given at the end of this subsection, after some preparations.
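The construction of f from g is easy to check numerically. A small sketch (ours; the example g(z) = z² + 1, with zeros ±i, is an illustrative choice):

```python
import numpy as np

def f_from_g(g):
    """Real cost function f(x, y) = |g(x + iy)|^2 = u^2 + v^2."""
    return lambda x, y: abs(g(complex(x, y))) ** 2

g = lambda z: z**2 + 1.0            # zeros at +-i; g'(z) = 2z vanishes at 0
f = f_from_g(g)

print(f(0.0, 1.0))                  # 0.0: the zero z = i is a global minimum of f
# z = 0 is a zero of g' but not of g: f increases in the x-direction and
# decreases in the y-direction there, so (0, 0) is a saddle point of f.
print(f(0.1, 0.0) > f(0.0, 0.0) > f(0.0, 0.1))  # True
```

This matches the classification below (Lemma A.1): zeros of g are global minima of f, while the remaining critical points are, generically, saddle points.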
To prove avoidance of saddle points, we need to first classify critical points of the function f. This is done in Lemma A.1. For a generic meromorphic function g, the functions g' and g g'' have no common roots. Hence, by Lemma A.1 and Theorem 3.2 (more generally, Theorem A.1), we obtain:

Theorem 3.3 Let g be a generic meromorphic function in 1 complex variable, and let f(x, y) be the function in 2 real variables constructed from g as above. Let (x_n, y_n) be the sequence constructed by applying Backtracking New Q-Newton's method to f from a random initial point (x_0, y_0). Then either (i) lim_{n→∞} ||(x_n, y_n)|| = ∞, or (ii) {(x_n, y_n)} converges to a local minimum of f, and the rate of convergence is quadratic.
Moreover, if g is a polynomial, then f has compact sublevels, and hence, only case (ii) happens.
If h is a non-constant meromorphic function, then g = h/h' has only simple zeros (which are either zeros or poles of h). Hence, they will be non-degenerate global minima of f. If h is a polynomial, then the function f associated with g = h/h' has compact sublevels. Now, we are ready to prove Theorem 3.2.
Proof (of Theorem 3.2) Let Ω be the complement of the set of poles of f. Then, as mentioned, f is real analytic on Ω. Let z_n = (x_n, y_n) be a sequence constructed by Backtracking New Q-Newton's method in [25]. Then, since the sequence of function values {f(z_n)} decreases and the value of f is infinity only at the poles of f, if the initial point is in Ω, then the whole sequence stays in Ω. We know by [25] that any cluster point of {z_n} is a critical point of f. Hence, it remains to show that {z_n} converges. To this end, by the arguments in [1], it suffices to show that for every point (x*, y*) ∈ Ω, if the point z_n = (x_n, y_n) is in a small open neighbourhood of (x*, y*), then there is a constant C > 0 (depending on that neighbourhood but independent of the point z_n) so that a certain estimate holds. Lemma A.2 in "Appendix A", whose proof is lengthy, provides this estimate and completes the proof of the theorem.
We finish this section with some pictures of basins of attraction in finding roots of a polynomial of degree 4: P_4(z) = (z² + 1)(z − 2.3)(z + 2.3), which has 4 roots: ±i and ±2.3. The basins of attraction for Newton's method are fractal. Moreover, there are sets of positive Lebesgue measure such that Newton's method applied to an initial point in these sets will not converge to any of the roots. Basins of attraction for Backtracking gradient descent seem to be more regular than those for Newton's method, but less regular than those for Backtracking New Q-Newton's method. Figure 1 is created by choosing the initial point z_0 in a lattice v + (0.1j, 0.1k), for j, k ∈ [−30, 30], where v is a randomly chosen point.

Conclusions
This paper presented New Q-Newton's method, a new variant of Newton's method which is conceptually simple, easy to implement and can avoid saddle points while having a fast convergence rate (when it converges). New Q-Newton's method has been combined with Backtracking line search, by the first author, to obtain an iterative optimization method which also has the needed convergence guarantee. As an application, we obtain a new result on finding roots of meromorphic functions in 1 complex variable. Some experimental results, reported in "Appendix B", show that the new algorithm works very well against well-known existing variants of Newton's method.

Fig. 1 Basins of attraction for finding the roots of P_4(z). The left image is for Newton's method, the middle image is for the gradient descent method with Backtracking line search, and the right image is for Backtracking New Q-Newton's method. Blue: initial points z_0 for which the constructed sequence converges to z*_1. Cyan: similar for the root z*_2. Green: similar for the root z*_3. Red: similar for the root z*_4. Black: other points.
We now discuss properties of critical points of f(x, y) = u(x, y)² + v(x, y)², outside the poles of the meromorphic function g(z) = u(z) + iv(z), where z = x + iy. Recall that by the Cauchy-Riemann equations, we have u_x = v_y and u_y = −v_x.

Lemma A.1 Let f(x, y) : R² → R be associated with a meromorphic function g(z) as above. Then: (1) a critical point (x*, y*) of f is either a zero of g or a zero of g'; (2) if z* = x* + iy* is a zero of g, then (x*, y*) is an isolated global minimum of f, which is non-degenerate if z* is not a zero of g'; (3) if z* is a zero of g' but not of g, then (x*, y*) is a saddle point of f, except when z* is a zero of g g''.

Proof We will write u_x for ∂u/∂x, u_xy for ∂²u/∂x∂y, and so on. (1) By calculation, we have ∇f = (2uu_x + 2vv_x, 2uu_y + 2vv_y). By the Cauchy-Riemann equations, a critical point (x*, y*) of f satisfies the system of equations uu_x − vu_y = 0, uu_y + vu_x = 0. Considering the above as a system of linear equations in the variables u_x, u_y (with determinant u² + v²), we see that if (x*, y*) is not a root of g, then it must be a root of u_x, u_y. In the latter case, by the Cauchy-Riemann equations, (x*, y*) is also a root of v_x, v_y, and hence, z* = x* + iy* is a root of g'(z).
(2) Since f ≥ 0, and f(x*, y*) = 0 iff z* = x* + iy* is a root of g, such an (x*, y*) is a global minimum of f. Moreover, since the zero set of g is discrete, (x*, y*) is an isolated global minimum.
For the remaining claim, we need to show that if z* is not a root of g', then ∇²f(x*, y*) is invertible. By calculation, the Hessian of f at a general point is 2 times the matrix with diagonal entries u_x² + uu_xx + v_x² + vv_xx and u_y² + uu_yy + v_y² + vv_yy, and off-diagonal entries u_x u_y + uu_xy + v_x v_y + vv_xy. At (x*, y*), we have u = v = 0, and hence, by the Cauchy-Riemann equations, the above matrix becomes (u_x² + u_y²) Id = |g'(z*)|² Id, which is positive definite if z* is not a root of g', as wanted.
(3) Since here (x*, y*) is a solution of u_x = u_y = v_x = v_y = 0, the Hessian of f at (x*, y*) is 2 times the matrix with diagonal entries uu_xx + vv_xx and uu_yy + vv_yy, and off-diagonal entries uu_xy + vv_xy. Note that by the Cauchy-Riemann equations we have u_xx + u_yy = 0 and v_xx + v_yy = 0. Therefore, if we put a = uu_xx + vv_xx and b = uu_xy + vv_xy, then the above matrix becomes ((a, b), (b, −a)). Since the determinant is −a² − b², we conclude that (x*, y*) is a saddle point of f, except in the case where a = b = 0. In the latter case, by the Cauchy-Riemann equations we have u_xy = −v_xx and v_xy = −u_yy, and hence, (x*, y*) must be a solution of uu_xx + vv_xx = 0 and uv_xx − vu_xx = 0. By the Cauchy-Riemann equations again, we find that this cannot be the case, except when z* is a root of g g'' = 0.
Lemma A.2 Assumptions are as in Theorem 3.2. For every point (x*, y*) ∈ Ω, if the point z_n = (x_n, y_n) is in a small open neighbourhood of (x*, y*), then there is a constant C > 0 (depending on that neighbourhood but independent of the point z_n) so that the required estimate holds; in particular, the ratio minsp(A_n)/sp(A_n) of the smallest to the largest absolute value of the eigenvalues of the matrix A_n constructed by the method at z_n is bounded from below by C.

Proof Let us recall that if w_n is the step constructed by New Q-Newton's method, then z_{n+1} = z_n − β_n w_n, where β_n is chosen by the Backtracking line search so that Armijo's condition is satisfied.
We recall the form of the Hessian matrix ∇²f(x, y) computed in the proof of Lemma A.1. The two concerned eigenvalues are the two roots of the characteristic polynomial of A = ∇²f(x, y), which is t² − tr(A)t + det(A). By the Cauchy-Riemann equations again, we can compute tr(A) and det(A), and from this it is easy to arrive at the claimed asymptotic values for the two eigenvalues of A.

Case 1: z* = 0 is a root of g(z), so that we can expand, in a small open neighbourhood of z* = 0: g(z) = τ z^N + h.o.t., where τ ≠ 0 and N ≥ 1. Now, we complete the proof that the required estimate is satisfied in this case. We need to estimate the term added to the Hessian, which is of smaller size compared to λ_1(z_n) and λ_2(z_n). Therefore, we have minsp(A_n)/sp(A_n) ∼ 1/(2N − 1) for z_n near z*, which is bounded away from 0 as wanted.
Case 2: z* = 0 is a root of g'(z).
If z* is also a root of g(z), then we are reduced to Case 1. Hence, we can assume that z* is not a root of g(z). Therefore, we can expand, in a small open neighbourhood of z* = 0: g(z) = γ + τ z^N + h.o.t., where γ, τ ≠ 0.
If N = 2, then z* is not a root of g g''. Then by Lemma A.1, we obtain that z* is a saddle point of f. Hence, for z_n near z* we obtain that the ratio minsp(A_n)/sp(A_n) is bounded away from 0, as wanted.

B Implementation and Experimental Results
In this "Appendix", we present some implementation details and experimental results on New Q-Newton's method.

B.1 Implementation Details
In this subsection, we present some practical points concerning implementation details, for the language Python. Source code is at the GitHub link [14].
Indeed, Python already has enough commands to implement New Q-Newton's method. There is a package, named numdifftools, which allows one to compute approximate gradients and Hessians of a function. This package is also very convenient when working with a family f(x, t) of functions, where t is a parameter. Another package, named scipy.linalg, allows one to find (approximate) eigenvalues and corresponding eigenvectors of a square matrix. More precisely, given a square matrix A, the command eig(A) will give pairs (λ, v_λ), where λ is an approximate eigenvalue of A and v_λ a corresponding eigenvector.
One point to notice is that even if A is a symmetric matrix with real coefficients, the eigenvalues computed by the command eig could be complex numbers, and not real numbers, due to the fact that they are computed approximately. This can be easily resolved by taking the real part of λ, which is given in Python code by λ.real. Similarly, we can do this for the eigenvectors. A very convenient feature of the command eig is that it already computes (approximate) orthonormal bases for the eigenspaces. Now, we present the coding detail of the main part of New Q-Newton's method: given a symmetric invertible matrix A with real coefficients (in our case A = A_n) and a vector v, compute w, which is the reflection of A^{-1}.v along the direct sum of the eigenspaces of the negative eigenvalues of A. First, we use the command eig to get pairs {(λ_j, v_j)}_{j=1,...,m}, and use the command real to get real parts. If we write v = Σ_{j=1}^m a_j v_j, then a_j = ⟨v_j, v⟩ (the inner product), which is computed by the Python command np.dot(v_j, v). Then w = Σ_{j=1}^m (a_j/|λ_j|) v_j.

Remark A.1 (1) We do not need to compute the gradient and the Hessian of the cost function f exactly, only approximately. Indeed, the proof of Theorem 3.1 shows that if one wants to stop when ||∇f(x_n)|| and ||x_n − x_∞|| are smaller than a threshold ε, then it suffices to compute the gradient and the Hessian up to an accuracy of order ε.
Similarly, we do not need to compute the eigenvalues and eigenvectors of the Hessian exactly, but only up to an accuracy of order ε, where ε is the threshold to stop.
In many experiments, we only calculate the Hessian inexactly using the numdifftools package in Python and still obtain good performance.
(2) While theoretical guarantees are proven only when the hyperparameters δ_0, ..., δ_m are randomly chosen and fixed from the beginning, in experiments we have also tried choosing a random δ at each iterate n. We find that this variant, which we name Random New Q-Newton's method, has a performance similar to or better than the original version.
(3) Note that similar commands are also available on PyTorch and TensorFlow, two popular libraries for implementing deep neural networks.
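The coding detail described in B.1 (eig, taking real parts, and the sum w = Σ_j (a_j/|λ_j|) v_j) can be sketched as follows. This is our own illustrative sketch, with a hand-picked diagonal matrix as input; it is not the repository code from [14].

```python
import numpy as np
from scipy.linalg import eig

def reflected_inverse(A, v):
    """Compute w, the reflection of A^{-1} v along the direct sum of the
    eigenspaces of the negative eigenvalues of A.
    A is assumed symmetric and invertible."""
    eigvals, eigvecs = eig(A)
    eigvals = eigvals.real              # eig may return complex values
    eigvecs = eigvecs.real              # due to approximate computation
    w = np.zeros_like(v)
    for lam, e in zip(eigvals, eigvecs.T):
        a = np.dot(e, v)                # coefficient a_j = <v_j, v>
        w = w + (a / abs(lam)) * e      # divide by |lambda_j|, not lambda_j
    return w

A = np.diag([2.0, -2.0])
v = np.array([2.0, -1.0])
w = reflected_inverse(A, v)
print(w)  # [ 1.  -0.5]
```

Note that the result does not depend on the sign conventions of the computed eigenvectors, since each term a_j v_j is invariant under v_j ↦ −v_j.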

B.2 Some Experimental Results
Here, we present a couple of illustrative experimental results. Additional experiments, which are quite extensive, are available in the arXiv version of the paper. We use the Python package numdifftools [12] to compute gradients and Hessians, since symbolic computation is not very efficient. The experiments here are run on a small personal laptop. The unit for running time is seconds.
Here, we compare the performance of New Q-Newton's method against several well-known existing variants of Newton's method: the usual Newton's method, BFGS [32], adaptive cubic regularization [7,19], Random damping Newton's method [24], and Inertial Newton's method [6].
For New Q-Newton's method, we choose α = 1 in the definition. Moreover, we choose the set of hyperparameters to be {0, ±1}, even though for the theoretical proofs we need at least m + 1 elements, where m is the number of variables. The justification is that, when running New Q-Newton's method, it is almost never the case that both ∇²f(x) and ∇²f(x) ± ||∇f(x)||² Id are not invertible. The experiments are coded in Python and run on a usual personal computer. For BFGS, we use the function scipy.optimize.fmin_bfgs available in Python and set gtol = 1e−10 and maxiter = 1e+6. For adaptive cubic regularization for Newton's method, we use the AdaptiveCubicReg module in the implementation in [13]. We use the default hyperparameters as recommended there, and use "exact" for the hessian_update_method. For the hyperparameters in Inertial Newton's method, we choose α = 0.5 and β = 0.1, as recommended by the authors of [6]. Source codes for the current paper are available at the GitHub link [14].
Features reported. We report the number of iterations needed, the function value and the norm of the gradient at the last point, as well as the running time.

B.2.1 A Toy Model for Protein Folding
This problem is taken from [22]. Here is a brief description. The model has only two amino acids, called A and B, among the 20 that occur naturally. A molecule with n amino acids is called an n-mer. The amino acids are linked together, and the conformation is determined by the angles of bend θ_2, . . ., θ_{n−1} ∈ [0, 2π]. We specify the amino acids by Boolean variables ξ_1, . . ., ξ_n ∈ {1, −1}, depending on whether the corresponding amino acid is A or B. The intramolecular potential energy is the sum of V_1, the backbone bend potential, and V_2, the non-bonded interaction. Note that the value of C(ξ_i, ξ_j) belongs to the finite set {1, 0.5, −0.5}.
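A closed form for C that is consistent with the stated value set (taken from the classical AB off-lattice model, and assumed rather than verified to be the one used in [22]) is C(ξ_i, ξ_j) = (1 + ξ_i + ξ_j + 5 ξ_i ξ_j)/8:

```python
def C(xi_i, xi_j):
    """Attraction coefficient between amino acids xi_i, xi_j in {1, -1}.
    Closed form assumed from the classical AB off-lattice model; it yields
    exactly the value set {1, 0.5, -0.5} stated in the text."""
    return (1 + xi_i + xi_j + 5 * xi_i * xi_j) / 8.0

# AA pairs attract strongly, BB pairs weakly, mixed AB pairs repel
assert C(1, 1) == 1.0 and C(-1, -1) == 0.5 and C(1, -1) == -0.5
```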
In the first non-trivial dimension n = 3, the global minimum (ground state) of Φ is attained when cos(θ_2) = 1, at which point the value of Φ is 4(1 − C(ξ_1, ξ_3)). In the special case ξ_1 = 1 = ξ_3 (corresponding to AXA), the global minimum of Φ is 0. This is different from the assertion in Table 1 in [22], where the ground state of Φ has value −0.65821 at θ_2 = 0.61866. Our computations for the other small-dimensional cases n = 4, 5 also obtain values different from those reported in Table 1 in [22]. In [22], results are reported for dimension ≤ 5, while those for dimensions 6 and 7 are available upon request. Table 1 presents the optimal values of the potential-energy function Φ for n-mers, where n ≤ 5, found by running the different optimization methods from many random initial points. To save space, only the cases that differ from [22] are reported; here, θ* is the point found by the methods in [22]. The cases listed are the same as those in Table 1 in [22].
For comparison, we also compute the function value at the points listed in Table 1 in [22].
Here, we perform experiments for two cases: ABBBA (dimension 5) and ABBBABABAB (dimension 10). The other cases (of dimensions 5 and 10) yield similar results. We generate random initial points and report on the performance of the different algorithms. We observe that the performance of Inertial Newton's method and of adaptive cubic regularization is less stable, less accurate, or slower than that of the other methods.
(1) For ABBBA: In this case, the performance of New Q-Newton's method and that of Random New Q-Newton's method are very similar, so we report only that of New Q-Newton's method. We found that the optimal value seems to be about 13.963.

(2) For ABBBABABAB: In this case, Newton's method and Random damping Newton's method usually encounter the error "Singular matrix". Hence, we have to take special care of them and reduce their number of iterations to 50. In this case, Random New Q-Newton's method can obtain better performance than New Q-Newton's method, so we report both of them. It seems that the optimal value is attained only by Backtracking New Q-Newton's method; see the following remark.

Remark. We have tested many random initial points and found that none of the algorithms here (adaptive cubic regularization, BFGS, Newton's method, New Q-Newton's method, Random damping Newton's method, Random New Q-Newton's method, and Inertial Newton's method), nor Backtracking GD, can find the above global minimum. The global minimum value has been found by running Backtracking New Q-Newton's method from, for example, Point 1 below, with a running time of about 16.2 s. We test with 4 random initial points. The function value at the initial point is 579425.218039767.
The root of smallest absolute value of g_3 is close to 0.3430042 + 1.0339458i. g_3 has a pole near −0.227 + 1.115i, of absolute value just slightly larger than that of this root; hence, when one applies the method in [8], one has to be careful. We choose (randomly) an initial point close to the pole of g_3, (x, y) = (−0.227, 1.115), at which the value of f is 0.0415. The fourth test function is a polynomial function with multiple roots. We consider a (random) initial point (x, y) = (4.48270522, 3.79095724), at which the function value is 1e+14.
The fifth is the partial sum of the first 1001 summands of the series defining the Riemann zeta function. Here, recall that n^{−z} = e^{−ln(n)z}. We choose a (randomly chosen) initial point (x, y) = (9.76536427, −4.15647151), at which the function value is 0.9977.
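A hypothetical concrete form of this test function, consistent with the description and with the reported initial function value, is f(x, y) = |Σ_{n=1}^{1001} n^{−(x+iy)}|²:

```python
import numpy as np

def f(x, y, N=1001):
    """Squared modulus of the partial sum sum_{n=1}^{N} n^{-z}, z = x + iy.
    Hypothetical concrete form; uses n^{-z} = exp(-ln(n) * z)."""
    z = complex(x, y)
    s = sum(np.exp(-np.log(n) * z) for n in range(1, N + 1))
    return abs(s) ** 2
```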

Fig. 1 Basins of attraction for the polynomial P_4(z) = (z² + 1)(z − 2.3)(z + 2.3). The left image is for Newton's method, the middle image is for the gradient descent method with Backtracking line search, and the right image is for Backtracking New Q-Newton's method. Blue: initial points z_0 for which the constructed sequence converges to z*_1. Cyan: similar, for the root z*_2. Green: similar, for the root z*_3. Red: similar, for the root z*_4. Black: other points.
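The basins in the left image are produced by iterating the standard Newton map z ↦ z − P_4(z)/P_4'(z) from each grid point and coloring the point by the root reached; a minimal sketch (function names are illustrative):

```python
# Newton's method for P4(z) = (z^2 + 1)(z - 2.3)(z + 2.3) = z^4 - 4.29 z^2 - 5.29
P  = lambda z: (z ** 2 + 1) * (z - 2.3) * (z + 2.3)
dP = lambda z: 4 * z ** 3 - 8.58 * z       # P'(z)

def newton_limit(z0, iters=50):
    """Iterate Newton's map from z0; the basin of z0 is the root returned."""
    z = complex(z0)
    for _ in range(iters):
        d = dP(z)
        if d == 0:                         # critical point: Newton map undefined
            break
        z = z - P(z) / d
    return z
```

Coloring each z_0 on a grid by which of the four roots ±i, ±2.3 the iteration approaches reproduces the fractal basin boundaries visible in the left image.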

With f = (u² + v²)/2, where u and v denote the real and imaginary parts of the function whose zeros are sought, the Hessian is

∇²f = ( u u_xx + v v_xx + u_x² + v_x²        u u_xy + v v_xy + u_x u_y + v_x v_y )
      ( u u_xy + v v_xy + u_x u_y + v_x v_y  u u_yy + v v_yy + u_y² + v_y²       ).

Table 1
Optimal values for the potential energy function Φ for n-mers, where n = 3, 4, 5

Table 2
Performance of different optimization methods for the toy protein folding problem for the 5-mer ABBBA at some random initial points. Table 2 lists the performance of the different methods, with a maximum number of 5000 iterates, stopping earlier if ||∇f(z_n)|| < 1e−10, if ||z_{n+1} − z_n|| < 1e−20, or if there is an unknown error.