Rates of superlinear convergence for classical quasi-Newton methods

We study the local convergence of classical quasi-Newton methods for nonlinear optimization. Although it was well established a long time ago that asymptotically these methods converge superlinearly, the corresponding rates of convergence still remain unknown. In this paper, we address this problem. We obtain first explicit non-asymptotic rates of superlinear convergence for the standard quasi-Newton methods, which are based on the updating formulas from the convex Broyden class. In particular, for the well-known DFP and BFGS methods, we obtain the rates of the form (nL2μ2k)k/2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\frac{n L^2}{\mu ^2 k})^{k/2}$$\end{document} and (nLμk)k/2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\frac{n L}{\mu k})^{k/2}$$\end{document} respectively, where k is the iteration counter, n is the dimension of the problem, μ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document} is the strong convexity parameter, and L is the Lipschitz constant of the gradient.


Introduction
Motivation In this work, we investigate the classical quasi-Newton algorithms for smooth unconstrained optimization, the main examples of which are the Davidon-Fletcher-Powell (DFP) method [1,2] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [3][4][5][6][7]. These algorithms are based on the idea of replacing the exact Hessian in the Newton method with some approximation, that is updated in iterations according to certain formulas, involving only the gradients of the objective function. For an introduction into the topic, see [8] and [9,Chapter 6]; also see [10] for the treatment of quasi-Newton algorithms in the context of nonsmooth optimization and [11][12][13] for randomized variants of quasi-Newton methods.
One of the questions about quasi-Newton methods, that has been extensively studied in the literature, is their superlinear convergence. First theoretical results here were obtained for the methods with exact line search, first by Powell [14], who analyzed the DFP method, and then by Dixon [15,16], who showed that with the exact line search all quasi-Newton algorithms in the Broyden family [17] coincide. Soon after that Broyden, Dennis and Moré [18] considered the quasi-Newton algorithms without line search and proved the local superlinear convergence of DFP, BFGS and several other methods. Their analysis was based on the Frobenius-norm potential function. Later, Dennis and Moré [19] unified the previous proofs by establishing the necessary and sufficient condition of superlinear convergence. This condition together with the original analysis of Broyden, Dennis and Moré have been applied since then in almost every work on quasi-Newton methods for proving superlinear convergence (see e.g. [20][21][22][23][24][25][26][27]). Finally, one should mention that an important contribution to the theoretical analysis of quasi-Newton methods has been made by Byrd, Liu, Nocedal and Yuan in the series of works [28][29][30], where they introduced a new potential function by combining the trace with the logarithm of determinant.
However, the theory of superlinear convergence of quasi-Newton methods is still far from being complete. The main reason for this is that all currently existing results on superlinear convergence of quasi-Newton methods are only asymptotic: they simply show that the ratio of successive residuals in the method tends to zero as the number of iterations goes to infinity, without providing any specific bounds on the corresponding rate of convergence. It is therefore important to obtain some explicit and non-asymptotic rates of superlinear convergence for quasi-Newton methods.
This observation was the starting point for a recent work [31], where the greedy analogs of the classical quasi-Newton methods have been developed. As opposed to the classical quasi-Newton methods, which use the difference of successive iterates for updating Hessian approximations, these methods employ basis vectors, greedily selected to maximize a certain measure of progress. As shown in [31], greedy quasi-Newton methods have superlinear convergence rate of the form (1 − μ nL ) k 2 /2 ( nL μ ) k , where k is the iteration counter, n is the dimension of the problem, μ is the strong convexity parameter, and L is the Lipschitz constant of the gradient.
In this work, we continue the same line of research but now we study the classical quasi-Newton methods. Namely, we consider the methods, based on the updates from the convex Broyden class, which is formed by all convex combinations of the DFP and BFGS updates. For this class, we derive explicit bounds on the rate of superlinear convergence of standard quasi-Newton methods without line search. In particular, for the standard DFP and BFGS methods, we obtain the rates of the form ( nL 2 μ 2 k ) k/2 and ( nL μk ) k/2 respectively. Contents This paper is organized as follows. First, in Sect. 2, we study the convex Broyden class of updating rules for approximating a self-adjoint positive definite linear operator, and establish several important properties of this class. Then, in Sect. 3, we analyze the standard quasi-Newton scheme, based on the updating rules from the convex Broyden class, as applied to minimizing a quadratic function. We show that this scheme has the same rate of linear convergence as that of the classical gradient method, and also a superlinear convergence rate of the form ( Q k ) k/2 , where Q ≥ 1 is a certain constant, related to the condition number, and depending on the method. After that, in Sect. 4, we consider the general problem of unconstrained minimization and the corresponding quasi-Newton scheme for solving it. We show that, for this scheme, it is possible to prove absolutely the same results as for the quadratic function, provided that the starting point is sufficiently close to the solution. In Sect. 5, we compare the rates of superlinear convergence, that we obtain for the classical quasi-Newton methods, with the corresponding rates of the greedy quasi-Newton methods. Sect. 6 contains some auxiliary results, that we use in our analysis.
Notation In what follows, E denotes an arbitrary n-dimensional real vector space. Its dual space, composed by all linear functionals on E, is denoted by E * . The value of a linear function s ∈ E * , evaluated at point x ∈ E, is denoted by s, x .
For a smooth function f : E → R, we denote by ∇ f (x) and ∇ 2 f (x) its gradient and Hessian respectively, evaluated at a point x ∈ E. Note that ∇ f (x) ∈ E * , and ∇ 2 f (x) is a self-adjoint linear operator from E to E * .
The partial ordering of self-adjoint linear operators is defined in the standard way.
Any self-adjoint positive definite linear operator A : E → E * induces in the spaces E and E * the following pair of conjugate Euclidean norms: where f : E → R is a smooth function with positive definite Hessian, and x ∈ E, we prefer to use notation · x and · * x , provided that there is no ambiguity with the reference function f . Sometimes, in the formulas, involving products of linear operators, it is convenient to treat x ∈ E as a linear operator from R to E, defined by xα = αx, and x * as a linear operator from E * to R, defined by x * s = s, x . Likewise, any s ∈ E * can be treated as a linear operator from R to E * , defined by sα = αs, and s * as a linear operator from E to R, defined by s * x = s, x . In this case, x x * and ss * are rank-one self-adjoint linear operators from E * to E and from E * to E respectively, acting as follows: Given two self-adjoint linear operators A : E → E * and W : E * → E, we define the trace and the determinant of A with respect to W as follows: Note that W A is a linear operator from E to itself, and hence its trace and determinant are well-defined real numbers (they coincide with the trace and determinant of the matrix representation of W A with respect to an arbitrary chosen basis in the space E, and the result is independent of the particular choice of the basis). In particular, if W is positive definite, then W , A and Det(W , A) are respectively the sum and the product of the eigenvalues 1 of A relative to W −1 . Observe that ·, · is a bilinear form, and for any x ∈ E, we have When A is invertible, we also have for any δ ∈ R. Also recall the following multiplicative formula for the determinant:

Convex Broyden class
Let A and G be two self-adjoint positive definite linear operators from E to E * , where A is the target operator, which we want to approximate, and G is the current approximation of the operator A. The Broyden family of quasi-Newton updates of G with respect to A along a direction u ∈ E \ {0}, is the following class of updating formulas, parameterized by a scalar φ ∈ R: (2.1) 1 Recall that, for linear operators A, B : E → E * , a scalar λ ∈ R is called a (relative) eigenvalue of A with respect to B if Ax = λBx for some x ∈ E \ {0}. If A, B are self-adjoint and B is positive definite, it is known that there exist eigenvalues λ 1 , . . . , λ n ∈ R and a basis x 1 , . . . , x n ∈ E, such that Ax i = λ i Bx i , Note that Broyd φ (A, G, u) depends on A only through the product Au. For the sake of convenience, we also define Broyd φ (A, G, u) = G when u = 0. Two important members of the Broyden family, DFP and BFGS updates, correspond to the values φ = 1 and φ = 0 respectively: Thus, the Broyden family consists of all affine combinations of DFP and BFGS updates: 3) The subclass of the Broyden family, corresponding to φ ∈ [0, 1], is known as the convex Broyden class (or the restricted Broyden class in some texts).
Our subsequent developments will be based on two properties of the convex Broyden class. The first property states that each update from this class preserves the bounds on the relative eigenvalues with respect to the target operator.
where I E , I E * are the identity operators in the spaces E, E * respectively. Hence, DFP(A, G, u) (2.4) η For the BFGS update, we apply Lemma 6.1 (see Appendix): The proof is finished Remark 2.1 Lemma 2.1 has first been established in [5] in a slightly stronger form and using a different argument. It was also shown there that one of the relations in (2.5) The second property of the convex Broyden class, which we need, is related to the question of convergence of the approximations G to the target operator A. Note that without any restrictions on the choice of the update directions u, one cannot guarantee any convergence of G to A in the usual sense (see [19,31] for more details). However, for our goals it will be sufficient to show that, independently of the choice of u, it is still possible to ensure that G converges to A along the update directions u, and estimate the corresponding rate of convergence.
Let us define the following measure of the closeness of G to A along the direction u: where, for the sake of convenience, we define θ(A, G, u) = 0 if u = 0. Note that θ(A, G, u) = 0 if and only if Gu = Au. Thus, our goal now is to establish some upper bounds on θ , which will help us to estimate the rate, at which this measure goes to zero. For this, we will study how certain potential functions change after one update from the convex Broyden class, and estimate this change from below by an appropriate monotonically increasing function of θ . We will consider two potential functions. The first one is a simple trace potential function, that we will use only when we can guarantee that A G: for some η ≥ 1. Then, for any φ ∈ [0, 1] and any u ∈ E, we have Proof We can assume that u = 0 since otherwise the claim is trivial.
The second potential function is more universal since we can work with it even if the condition A G is violated. This function was first introduced in [29], and is defined as follows: (2.14) In fact, ψ is nothing else but the Bregman divergence, generated by the strictly convex , defined on the set of self-adjoint positive definite linear operators from E to E * , where B : E → E * is an arbitrary fixed self-adjoint positive definite linear operator. Indeed, Let ω : (−1, +∞) → R be the univariate function Clearly, ω is a convex function, which is decreasing on (−1, 0] and increasing on [0, +∞). Also, on the latter interval, it satisfies the following bounds (see [32, Lemma 5.1.5]): Thus, for large values of t, the function ω(t) is approximately linear in t, while for small values of t, it is quadratic.
There is a close relationship between ω and the potential function ψ. Indeed, if λ 1 , . . . , λ n ≥ 0 are the relative eigenvalues of G with respect to A, then We are going to use the function ω to estimate from below the change in the potential function ψ, which is achieved after one update from the convex Broyden class, via the closeness measure θ . However, first of all, we need an auxiliary lemma.

Lemma 2.3 For any real
Proof Equivalently, we need to prove that (2.17) Let us show that the right-hand side of (2.17) is increasing in β. This is evident if α ≥ 2 because ω is increasing on [0, +∞), so suppose that α < 2. Denote Note that t is decreasing in β. Therefore, it suffices to prove that the right-hand side of (2.17) is decreasing in t. But Thus, it suffices to prove (2.17) only in the boundary case β = α: or, equivalently, in view of (2.15), that For α ≥ 1, this is obvious, so suppose that α ≤ 1. It now remains to justify that for all t ∈ [0, 1). But this easily follows by integration from the fact that for all t ∈ [0, 1). Now we are ready to prove the main result.
for some ξ, η ≥ 1. Then, for any φ ∈ [0, 1] and any u ∈ E, we have Proof Suppose that u = 0 since otherwise the claim is trivial. Let us denote . We already know that Applying now Lemma 6.2, we obtain Thus, where we have used the concavity of the logarithm. Denote Clearly, α 1 ≥ β 1 and α 0 ≥ β 0 by the Cauchy-Schwartz inequality. Also, Therefore, by Lemma 2.3 and the fact that ω is increasing on [0, +∞), we have Combining these inequalities with (2.21), we obtain the claim.

Unconstrained quadratic minimization
In this section, we study the classical quasi-Newton methods, based on the updating formulas from the convex Broyden class, as applied to minimizing the quadratic function where A : E → E * is a self-adjoint positive definite operator, and b ∈ E * . Let B : E → E * be a self-adjoint positive definite linear operator, that we will use to initialize our methods. Denote by μ > 0 the strong convexity parameter of f , and by L > 0 the Lipschitz constant of the gradient of f , both measured with respect to B: Consider the following standard quasi-Newton scheme for minimizing (3.1). For the sake of simplicity, we assume that the constant L is available.

Remark 3.1
In an actual implementation of scheme (3.3), it is typical to store in memory and update in iterations the matrix H k def = G −1 k instead of G k (or, alternatively, the Cholesky decomposition of G k ). This allows one to compute G −1 k+1 ∇ f (x k ) in O(n 2 ) operations. Note that, due to a low-rank structure of the update (2.1), H k can be updated into H k+1 also in O(n 2 ) operations (for specific formulas, see e.g. [8,Section 8]).
To measure the convergence rate of scheme (3.3), we look at the norm of the gradient, measured with respect to A: The following lemma shows that the measure θ(A, G k , u k ), that we introduced in (2.6) to measure the closeness of G k to A along the direction u k , is directly related to the progress of one step of the scheme (3.3). Note that it is important here that the updating direction u k = x k+1 − x k is chosen as the difference of the iterates, and, for other choices of u k , this result is no longer true.

The proof is finished
Let us show that the scheme (3.3) has global linear convergence, and that the corresponding rate is at least as good as that of the standard gradient method.
Now, let us establish the superlinear convergence of the scheme (3.3). First, we do this by working with the trace potential function σ , defined by (2.7). Note that this is possible since A G k in view of (3.6).

Theorem 3.2 In scheme (3.3), for all k ≥ 1, we have
Let k ≥ 1 be arbitrary. From (3.6) and Lemma 2.2, it follows that Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality, The proof is finished Remark 3.2 As can be seen from (3.10), the factor nL μ in the efficiency estimate (3.9) can be improved up to . . , λ n are the eigenvalues of A relative to B. This improved factor can be significantly smaller than the original one if the majority of the eigenvalues λ i are much larger than μ. However, for the sake of simplicity, we prefer to work directly with constants n, L and μ. This corresponds to the worst-case analysis. The same remark applies to all other theorems on superlinear convergence, that will follow.
Let us discuss the efficiency estimate (3.9). Note that its maximal value over all φ i ∈ [0, 1] is achieved at φ i = 1 for all 0 ≤ i ≤ k − 1. This corresponds to the DFP method. In this case, the efficiency estimate (3.9) looks as follows: Hence, the moment, when the superlinear convergence starts, can be described as follows: In contrast, the minimal value of the efficiency estimate (3.9) over all φ i ∈ [0, 1] is achieved at φ i = 0 for all 0 ≤ i ≤ k − 1. This corresponds to the BFGS method. In this case, the efficiency estimate (3.9) becomes and the moment, when the superlinear convergence begins, can be described as follows: Thus, we see that, compared to DFP, the superlinear convergence of BFGS starts in L μ times earlier, and its rate is much faster.
Let us present for the scheme (3.3) another justification of the superlinear convergence rate in the form (3.9). For this, instead of σ , we will work with the potential function ω, defined by (2.15). The advantage of this analysis is that it is extendable onto general nonlinear functions. (3.3), for all k ≥ 1, we have

Theorem 3.3 In scheme
Let k ≥ 1 and 0 ≤ i ≤ k − 1 be arbitrary. In view of (3.6) and Lemma 2.4, we have and we conclude that Summing this inequality and using the fact that ψ k ≥ 0, we obtain (3.14) Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality,

The proof is finished
Comparing our new efficiency estimate (3.12) with the previous one (3.9), we see that they differ only in a constant. Thus, for the quadratic function, we do not gain anything by working with the potential function ω instead of σ . Nevertheless, our second proof is more universal, and, in contrast to the first one, can be generalized onto general nonlinear functions, as we will see in the next section.

Minimization of general functions
Consider now a general unconstrained minimization problem: where f : E → R is a twice differentiable function with positive definite Hessian.
To write down the standard quasi-Newton scheme for (4.1), we fix some self-adjoint positive definite linear operator B : E → E * and a constant L > 0, that we use to define the initial Hessian approximation.

Remark 4.1 Similarly to Remark 3.1, when implementing scheme (4.2), it is common
to work directly with the inverse H k def = G −1 k instead of G k . Also note that it is not necessary to compute J k explicitly. Indeed, for implementing the Hessian approximation update at Step 4 (or the corresponding update for its inverse), one only needs the product which is just the difference of the successive gradients.
In what follows, we make the following assumptions about the problem (4.1). First, we assume that, with respect to the operator B, the objective function f is strongly convex with parameter μ > 0 and its gradient is Lipschitz continuous with constant L, i.e.
for all x ∈ E. Second, we assume that the objective function f is strongly selfconcordant with some constant M ≥ 0, i.e.
for all x, y, z, w ∈ E. The class of strongly self-concordant functions was recently introduced in [31], and contains at least all strongly convex functions with Lipschitz continuous Hessian (see [31,Example 4.1]). It gives us the the following convenient relations between the Hessians of the objective function: [31,Lemma 4.1]) Let x, y ∈ E, and let r def = y − x x . Then,

Remark 4.2
Since we are interested in local convergence, it is possible to relax our assumptions by requiring that (4.3), (4.4) hold only in some neighborhood of a minimizer x * . For this, it suffices to assume that the Hessian of f is Lipschitz continuous in this neighborhood with ∇ 2 f (x * ) being positive definite. These are exactly the standard assumptions, used in [8] and many other works, studying local convergence of quasi-Newton methods. However, to avoid excessive technicalities, we do not do this.
Let us now analyze the process (4.2). For measuring its convergence, we look at the local norm of the gradient: First, let us estimate the progress of one step of the scheme (4.2). Recall that θ(J k , G k , u k ) is the measure of closeness of G k to J k along the direction u k (see (2.6)).

Lemma 4.2
In scheme (4.2), for all k ≥ 0 and r k def = u k x k , we have In view of Taylor's formula, Therefore, The proof is finished Our next result states that, if the starting point in scheme (4.2) is chosen sufficiently close to the solution, then the relative eigenvalues of the Hessian approximations G k with respect to both the Hessians ∇ 2 f (x k ) and the integral Hessians J k are always located between 1 and L μ , up to some small numerical constant. As a consequence, the process (4.2) has at least the linear convergence rate of the gradient method. Then, for all k ≥ 0, we have and r i def = u i x i for any i ≥ 0.
Now we are ready to prove the main result of this section on the superlinear convergence of the scheme (4.2). In contrast to the quadratic case, now we cannot use the proof, based on the trace potential function σ , defined by (2.7), because we cannot longer guarantee that J k G k . However, the proof, based on the potential function ψ, defined by (2.14), still works.
Let k ≥ 1 and 0 ≤ i ≤ k − 1 be arbitrary. By (4.12), (4.14) and Lemma 2.4, we have Moreover, since we also have Thus, where where (4.28) Hence, At the same time, Thus, Summing up (4.25) and using the fact that ψ k ≥ 0, we obtain Therefore, by Lemma 4.2 and the arithmetic-geometric mean inequality, The proof is finished

Discussion
Let us compare the rates of superlinear convergence, that we have obtained for the classical quasi-Newton methods, with those of the greedy quasi-Newton methods [31]. For brevity, we discuss only the BFGS method. Moreover, since the complexity bounds for the general nonlinear case differ from those for the quadratic one only in some absolute constants (both for the classical and the greedy methods), we only consider the case, when the objective function f is quadratic. As before, let n be the dimension of the problem, μ be the strong convexity parameter, L be the Lipschitz constant of the gradient of f , and λ f (x) be the local norm of the gradient of f at the point x ∈ E (as defined by (3.4)). Also, let us introduce the following condition number to simplify our notation: The greedy BFGS method [31] is essentially the classical BFGS algorithm (scheme (3.3) with φ k ≡ 0) with the only difference that, at each iteration, the update direction u k is chosen greedily according to the following rule: where e 1 , . . . , e n is a basis in E, such that B −1 = n i=1 e i e * i . For this method, we have the following recurrence (see [31,Theorem 3.2]): Hence, its rate of superlinear convergence is described by the expression Although the inequality (5.2) is valid for all k ≥ 0, it is useful only when In other words, the relation (5.3) specifies the moment, starting from which it becomes meaningful to speak about the superlinear convergence of the greedy BFGS method. For the classical BFGS method, we have the following bound (see (3.11)): and the starting moment of its superlinear convergence is described as follows: Comparing (5.3) and (5.4), we see that, for the standard BFGS, the superlinear convergence may start slightly earlier than for the greedy one. However, the difference is only in the logarithmic factor.
Nevertheless, let us show that, very soon after the superlinear convergence of the greedy BFGS begins, namely, after iterations, it will be significantly faster than the standard BFGS. Indeed, for all k ≥ 1. Note that the function t → ln t t is decreasing on [e, +∞) (since its logarithm ln ln t − ln t is a decreasing function of u = ln t for u ∈ [1, +∞), which is easily verified by differentiation). Hence, for all k ≥ K , we have (using first that k ≤ 2(k − 1) since k ≥ 2) Q ln(Qk) k − 1 ≤ Q ln(2Q(k − 1)) k − 1 ≤ Q ln(2Q(K − 1)) K − 1 Consequently, for all k ≥ K , we obtain Thus, after K iterations, the rate of superlinear convergence of the greedy BFGS is always better than that of the standard BFGS. Moreover, as k → ∞, the gap between these two rates grows as e −k 2 /Q . At the same time, the complexity of the Hessian update for the greedy BFGS method is more expensive than for the standard one. Consequently, Det(G −1 , G + ) (1.4) = Det(G −1 , G 0 )Det(G −1 0 , G + ) (6.6) = Au, u Gu, u Det(G −1 0 , G + ) (6.8) = Au, u Gu, u φ Gu, u AG −1 Au, u Au, u 2

Appendix
The proof is finished Proof Indeed, with respect to A, the operator A + αss * has n − 1 unit eigenvalues and one eigenvalue 1 + α s, A −1 s (corresponding to the eigenvector A −1 s).
Acknowledgements The authors are thankful to two anonymous reviewers for their valuable time and useful feedback.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.