Abstract
We study the local convergence of classical quasi-Newton methods for nonlinear optimization. Although it was well established a long time ago that asymptotically these methods converge superlinearly, the corresponding rates of convergence still remain unknown. In this paper, we address this problem. We obtain the first explicit non-asymptotic rates of superlinear convergence for the standard quasi-Newton methods, which are based on the updating formulas from the convex Broyden class. In particular, for the well-known DFP and BFGS methods, we obtain rates of the form \((\frac{n L^2}{\mu ^2 k})^{k/2}\) and \((\frac{n L}{\mu k})^{k/2}\) respectively, where k is the iteration counter, n is the dimension of the problem, \(\mu \) is the strong convexity parameter, and L is the Lipschitz constant of the gradient.
1 Introduction
Motivation In this work, we investigate the classical quasi-Newton algorithms for smooth unconstrained optimization, the main examples of which are the Davidon–Fletcher–Powell (DFP) method [1, 2] and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [3,4,5,6,7]. These algorithms are based on the idea of replacing the exact Hessian in the Newton method with some approximation, which is updated at every iteration according to certain formulas involving only the gradients of the objective function. For an introduction to the topic, see [8] and [9, Chapter 6]; also see [10] for the treatment of quasi-Newton algorithms in the context of nonsmooth optimization and [11,12,13] for randomized variants of quasi-Newton methods.
One of the questions about quasi-Newton methods that has been extensively studied in the literature is their superlinear convergence. The first theoretical results here were obtained for the methods with exact line search: first by Powell [14], who analyzed the DFP method, and then by Dixon [15, 16], who showed that, with exact line search, all quasi-Newton algorithms in the Broyden family [17] coincide. Soon after that, Broyden, Dennis and Moré [18] considered quasi-Newton algorithms without line search and proved the local superlinear convergence of DFP, BFGS and several other methods. Their analysis was based on the Frobenius-norm potential function. Later, Dennis and Moré [19] unified the previous proofs by establishing a necessary and sufficient condition for superlinear convergence. This condition, together with the original analysis of Broyden, Dennis and Moré, has been applied since then in almost every work on quasi-Newton methods for proving superlinear convergence (see e.g. [20,21,22,23,24,25,26,27]). Finally, one should mention the important contribution to the theoretical analysis of quasi-Newton methods made by Byrd, Liu, Nocedal and Yuan in the series of works [28,29,30], where they introduced a new potential function combining the trace with the logarithm of the determinant.
However, the theory of superlinear convergence of quasi-Newton methods is still far from being complete. The main reason for this is that all currently existing results on superlinear convergence of quasi-Newton methods are only asymptotic: they simply show that the ratio of successive residuals in the method tends to zero as the number of iterations goes to infinity, without providing any specific bounds on the corresponding rate of convergence. It is therefore important to obtain some explicit and non-asymptotic rates of superlinear convergence for quasi-Newton methods.
This observation was the starting point for a recent work [31], where the greedy analogs of the classical quasi-Newton methods have been developed. As opposed to the classical quasi-Newton methods, which use the difference of successive iterates for updating Hessian approximations, these methods employ basis vectors, greedily selected to maximize a certain measure of progress. As shown in [31], greedy quasi-Newton methods have superlinear convergence rate of the form \((1-\frac{\mu }{nL})^{k^2/2} (\frac{n L}{\mu })^k\), where k is the iteration counter, n is the dimension of the problem, \(\mu \) is the strong convexity parameter, and L is the Lipschitz constant of the gradient.
In this work, we continue the same line of research but now we study the classical quasi-Newton methods. Namely, we consider the methods, based on the updates from the convex Broyden class, which is formed by all convex combinations of the DFP and BFGS updates. For this class, we derive explicit bounds on the rate of superlinear convergence of standard quasi-Newton methods without line search. In particular, for the standard DFP and BFGS methods, we obtain the rates of the form \((\frac{n L^2}{\mu ^2 k})^{k/2}\) and \((\frac{n L}{\mu k})^{k/2}\) respectively.
Contents This paper is organized as follows. First, in Sect. 2, we study the convex Broyden class of updating rules for approximating a self-adjoint positive definite linear operator, and establish several important properties of this class. Then, in Sect. 3, we analyze the standard quasi-Newton scheme, based on the updating rules from the convex Broyden class, as applied to minimizing a quadratic function. We show that this scheme has the same rate of linear convergence as that of the classical gradient method, and also a superlinear convergence rate of the form \((\frac{Q}{k})^{k/2}\), where \(Q \ge 1\) is a certain constant, related to the condition number, and depending on the method. After that, in Sect. 4, we consider the general problem of unconstrained minimization and the corresponding quasi-Newton scheme for solving it. We show that, for this scheme, it is possible to prove absolutely the same results as for the quadratic function, provided that the starting point is sufficiently close to the solution. In Sect. 5, we compare the rates of superlinear convergence, that we obtain for the classical quasi-Newton methods, with the corresponding rates of the greedy quasi-Newton methods. Sect. 6 contains some auxiliary results, that we use in our analysis.
Notation In what follows, \(\mathbb {E}\) denotes an arbitrary n-dimensional real vector space. Its dual space, composed of all linear functionals on \(\mathbb {E}\), is denoted by \(\mathbb {E}^*\). The value of a linear functional \(s \in \mathbb {E}^*\), evaluated at a point \(x \in \mathbb {E}\), is denoted by \(\langle s, x \rangle \).
For a smooth function \(f : \mathbb {E}\rightarrow \mathbb {R}\), we denote by \(\nabla f(x)\) and \(\nabla ^2 f(x)\) its gradient and Hessian respectively, evaluated at a point \(x \in \mathbb {E}\). Note that \(\nabla f(x) \in \mathbb {E}^*\), and \(\nabla ^2 f(x)\) is a self-adjoint linear operator from \(\mathbb {E}\) to \(\mathbb {E}^*\).
The partial ordering of self-adjoint linear operators is defined in the standard way. We write \(A \preceq A_1\) for \(A, A_1 : \mathbb {E}\rightarrow \mathbb {E}^*\) if \(\langle (A_1 - A) x, x \rangle \ge 0\) for all \(x \in \mathbb {E}\), and \(W \preceq W_1\) for \(W, W_1 : \mathbb {E}^* \rightarrow \mathbb {E}\) if \(\langle s, (W_1 - W) s \rangle \ge 0\) for all \(s \in \mathbb {E}^*\).
Any self-adjoint positive definite linear operator \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) induces in the spaces \(\mathbb {E}\) and \(\mathbb {E}^*\) the following pair of conjugate Euclidean norms:
\(\Vert x \Vert _A {\mathop {=}\limits ^{\mathrm {def}}}\langle A x, x \rangle ^{1/2}, \ x \in \mathbb {E}, \qquad \Vert s \Vert _A^* {\mathop {=}\limits ^{\mathrm {def}}}\langle s, A^{-1} s \rangle ^{1/2}, \ s \in \mathbb {E}^*.\)
When \(A = \nabla ^2 f(x)\), where \(f : \mathbb {E}\rightarrow \mathbb {R}\) is a smooth function with positive definite Hessian, and \(x \in \mathbb {E}\), we prefer to use the notation \(\Vert \cdot \Vert _x\) and \(\Vert \cdot \Vert _x^*\), provided that there is no ambiguity with the reference function f.
Sometimes, in the formulas, involving products of linear operators, it is convenient to treat \(x \in \mathbb {E}\) as a linear operator from \(\mathbb {R}\) to \(\mathbb {E}\), defined by \(x \alpha = \alpha x\), and \(x^*\) as a linear operator from \(\mathbb {E}^*\) to \(\mathbb {R}\), defined by \(x^* s = \langle s, x \rangle \). Likewise, any \(s \in \mathbb {E}^*\) can be treated as a linear operator from \(\mathbb {R}\) to \(\mathbb {E}^*\), defined by \(s \alpha = \alpha s\), and \(s^*\) as a linear operator from \(\mathbb {E}\) to \(\mathbb {R}\), defined by \(s^* x = \langle s, x \rangle \). In this case, \(x x^*\) and \(s s^*\) are rank-one self-adjoint linear operators from \(\mathbb {E}^*\) to \(\mathbb {E}\) and from \(\mathbb {E}\) to \(\mathbb {E}^*\) respectively, acting as follows:
\((x x^*) s = \langle s, x \rangle x, \ s \in \mathbb {E}^*, \qquad (s s^*) x = \langle s, x \rangle s, \ x \in \mathbb {E}.\)
Given two self-adjoint linear operators \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) and \(W : \mathbb {E}^* \rightarrow \mathbb {E}\), we define the trace and the determinant of A with respect to W as follows:
\(\langle W, A \rangle {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{tr}(W A), \qquad \mathrm{Det}(W, A) {\mathop {=}\limits ^{\mathrm {def}}}\det (W A).\)
Note that WA is a linear operator from \(\mathbb {E}\) to itself, and hence its trace and determinant are well-defined real numbers (they coincide with the trace and determinant of the matrix representation of WA with respect to an arbitrarily chosen basis in the space \(\mathbb {E}\), and the result is independent of the particular choice of the basis). In particular, if W is positive definite, then \(\langle W, A \rangle \) and \(\mathrm{Det}(W, A)\) are respectively the sum and the product of the eigenvalues (see footnote 1) of A relative to \(W^{-1}\). Observe that \(\langle \cdot , \cdot \rangle \) is a bilinear form, and for any \(x \in \mathbb {E}\), we have
\(\langle x x^*, A \rangle = \langle A x, x \rangle .\)
When A is invertible, we also have
for any \(\delta \in \mathbb {R}\). Also recall the following multiplicative formula for the determinant:
\(\mathrm{Det}(W, A) = \mathrm{Det}(W, G) \, \mathrm{Det}(G^{-1}, A),\)
which is valid for any invertible linear operator \(G : \mathbb {E}\rightarrow \mathbb {E}^*\). If the operator W is positive semidefinite, and \(A \preceq A_1\) for some self-adjoint linear operator \(A_1 : \mathbb {E}\rightarrow \mathbb {E}^*\), then \(\langle W, A \rangle \le \langle W, A_1 \rangle \) and \(\mathrm{Det}(W, A) \le \mathrm{Det}(W, A_1)\). Similarly, if A is positive semidefinite and \(W \preceq W_1\) for some self-adjoint linear operator \(W_1 : \mathbb {E}^* \rightarrow \mathbb {E}\), then \(\langle W, A \rangle \le \langle W_1, A \rangle \) and \(\mathrm{Det}(W, A) \le \mathrm{Det}(W_1, A)\).
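These objects are easy to experiment with numerically. The following small self-contained check (our own illustration, identifying \(\mathbb {E}\) with \(\mathbb {R}^n\) so that operators become symmetric matrices) verifies that \(\langle W, A \rangle \) and \(\mathrm{Det}(W, A)\) equal the sum and the product of the eigenvalues of A relative to \(W^{-1}\):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n)); A = M @ M.T + n * np.eye(n)   # A : E -> E*, positive definite
M = rng.standard_normal((n, n)); W = M @ M.T + n * np.eye(n)   # W : E* -> E, positive definite

trace_WA = np.trace(W @ A)        # <W, A>
det_WA = np.linalg.det(W @ A)     # Det(W, A)

# eigenvalues of A relative to W^{-1}, i.e. solutions of A x = lambda W^{-1} x
lam = eigh(A, np.linalg.inv(W), eigvals_only=True)
assert np.isclose(trace_WA, lam.sum()) and np.isclose(det_WA, lam.prod())
```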
2 Convex Broyden class
Let A and G be two self-adjoint positive definite linear operators from \(\mathbb {E}\) to \(\mathbb {E}^*\), where A is the target operator, which we want to approximate, and G is the current approximation of the operator A. The Broyden family of quasi-Newton updates of G with respect to A along a direction \(u \in \mathbb {E}\setminus \{0\}\), is the following class of updating formulas, parameterized by a scalar \(\phi \in \mathbb {R}\):
\(\mathrm{Broyd}_{\phi }(A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}\phi \Bigl[ G - \frac{A u u^* G + G u u^* A}{\langle A u, u \rangle } + \Bigl( \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \Bigr) \frac{A u u^* A}{\langle A u, u \rangle } \Bigr] + (1 - \phi ) \Bigl[ G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle } \Bigr]. \qquad (2.1)\)
Note that \(\mathrm{Broyd}_{\phi }(A, G, u)\) depends on A only through the product Au. For the sake of convenience, we also define \(\mathrm{Broyd}_{\phi }(A, G, u) = G\) when \(u = 0\).
Two important members of the Broyden family, DFP and BFGS updates, correspond to the values \(\phi = 1\) and \(\phi = 0\) respectively:
\(\mathrm{DFP}(A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}G - \frac{A u u^* G + G u u^* A}{\langle A u, u \rangle } + \Bigl( \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \Bigr) \frac{A u u^* A}{\langle A u, u \rangle }, \qquad \mathrm{BFGS}(A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle }. \qquad (2.2)\)
Thus, the Broyden family consists of all affine combinations of DFP and BFGS updates:
\(\mathrm{Broyd}_{\phi }(A, G, u) = \phi \, \mathrm{DFP}(A, G, u) + (1 - \phi ) \, \mathrm{BFGS}(A, G, u), \quad \phi \in \mathbb {R}. \qquad (2.3)\)
The subclass of the Broyden family, corresponding to \(\phi \in [0, 1]\), is known as the convex Broyden class (or the restricted Broyden class in some texts).
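To make these formulas concrete, here is a minimal NumPy sketch of the convex Broyden class (our own illustration, not from the paper), identifying \(\mathbb {E}\) with \(\mathbb {R}^n\) so that operators become symmetric matrices and \(u u^*\) becomes an outer product. As noted above, A enters only through the product \(A u\):

```python
import numpy as np

def dfp(A, G, u):
    # DFP update of G with respect to A along u (phi = 1), cf. (2.2)
    Au, Gu = A @ u, G @ u
    au = u @ Au                                        # <Au, u>
    return (G - (np.outer(Au, Gu) + np.outer(Gu, Au)) / au
              + (u @ Gu / au + 1.0) * np.outer(Au, Au) / au)

def bfgs(A, G, u):
    # BFGS update of G with respect to A along u (phi = 0), cf. (2.2)
    Au, Gu = A @ u, G @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Au, Au) / (u @ Au)

def broyd(A, G, u, phi):
    # convex combination of the two endpoint updates, cf. (2.3)
    if not np.any(u):
        return G.copy()                                # convention for u = 0
    return phi * dfp(A, G, u) + (1.0 - phi) * bfgs(A, G, u)
```

One can check numerically that broyd(A, G, u, phi) @ u coincides with A @ u for every phi: all members of the family satisfy the secant equation along the update direction.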
Our subsequent developments will be based on two properties of the convex Broyden class. The first property states that each update from this class preserves the bounds on the relative eigenvalues with respect to the target operator.
Lemma 2.1
Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that
\(\frac{1}{\xi } A \preceq G \preceq \eta A, \qquad (2.4)\)
where \(\xi , \eta \ge 1\). Then, for any \(u \in \mathbb {E}\), and any \(\phi \in [0, 1]\), we have
\(\frac{1}{\xi } A \preceq \mathrm{Broyd}_{\phi }(A, G, u) \preceq \eta A. \qquad (2.5)\)
Proof
We may assume that \(u \ne 0\) since otherwise the claim is trivial. In view of (2.3), it suffices to prove (2.5) separately for the DFP and BFGS updates.
For the DFP update, we have
\(\mathrm{DFP}(A, G, u) = \Bigl( I_{\mathbb {E}^*} - \frac{A u u^*}{\langle A u, u \rangle } \Bigr) G \Bigl( I_{\mathbb {E}} - \frac{u u^* A}{\langle A u, u \rangle } \Bigr) + \frac{A u u^* A}{\langle A u, u \rangle },\)
where \(I_{\mathbb {E}}\), \(I_{\mathbb {E}^*}\) are the identity operators in the spaces \(\mathbb {E}\), \(\mathbb {E}^*\) respectively. Hence, since \(\bigl( I_{\mathbb {E}^*} - \frac{A u u^*}{\langle A u, u \rangle } \bigr) A \bigl( I_{\mathbb {E}} - \frac{u u^* A}{\langle A u, u \rangle } \bigr) = A - \frac{A u u^* A}{\langle A u, u \rangle }\),
\(\frac{1}{\xi } \Bigl( A - \frac{A u u^* A}{\langle A u, u \rangle } \Bigr) + \frac{A u u^* A}{\langle A u, u \rangle } \preceq \mathrm{DFP}(A, G, u) \preceq \eta \Bigl( A - \frac{A u u^* A}{\langle A u, u \rangle } \Bigr) + \frac{A u u^* A}{\langle A u, u \rangle },\)
and (2.5) follows from \(0 \preceq \frac{A u u^* A}{\langle A u, u \rangle } \preceq A\) and \(\xi , \eta \ge 1\).
For the BFGS update, we apply Lemma 6.1 (see Appendix) to the pairs \(\frac{1}{\xi } A \preceq G\) and \(G \preceq \eta A\):
\(\frac{1}{\xi } \Bigl( A - \frac{A u u^* A}{\langle A u, u \rangle } \Bigr) \preceq G - \frac{G u u^* G}{\langle G u, u \rangle } \preceq \eta \Bigl( A - \frac{A u u^* A}{\langle A u, u \rangle } \Bigr).\)
Adding \(\frac{A u u^* A}{\langle A u, u \rangle }\) to all parts and using again \(0 \preceq \frac{A u u^* A}{\langle A u, u \rangle } \preceq A\) together with \(\xi , \eta \ge 1\), we obtain (2.5).
The proof is finished. \(\square \)
Remark 2.1
Lemma 2.1 was first established in [5], in a slightly stronger form and using a different argument. It was also shown there that one of the relations in (2.5) may no longer be valid if \(\phi \in \mathbb {R}\setminus [0, 1]\).
The second property of the convex Broyden class, which we need, is related to the question of convergence of the approximations G to the target operator A. Note that without any restrictions on the choice of the update directions u, one cannot guarantee any convergence of G to A in the usual sense (see [19, 31] for more details). However, for our goals it will be sufficient to show that, independently of the choice of u, it is still possible to ensure that G converges to A along the update directions u, and estimate the corresponding rate of convergence.
Let us define the following measure of the closeness of G to A along the direction u:
\(\theta (A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}\frac{\Vert (G - A) u \Vert _A^*}{\Vert G u \Vert _A^*} = \Bigl( \frac{\langle (G - A) u, A^{-1} (G - A) u \rangle }{\langle G u, A^{-1} G u \rangle } \Bigr)^{1/2}, \quad u \ne 0, \qquad (2.6)\)
where, for the sake of convenience, we define \(\theta (A, G, u) = 0\) if \(u = 0\). Note that \(\theta (A, G, u) = 0\) if and only if \(G u = A u\). Thus, our goal now is to establish some upper bounds on \(\theta \), which will help us to estimate the rate, at which this measure goes to zero. For this, we will study how certain potential functions change after one update from the convex Broyden class, and estimate this change from below by an appropriate monotonically increasing function of \(\theta \). We will consider two potential functions.
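Before turning to the potential functions, note that in coordinates \(\theta \) is straightforward to evaluate; a small sketch of ours, under the same identification of \(\mathbb {E}\) with \(\mathbb {R}^n\) as before:

```python
import numpy as np

def theta(A, G, u):
    # closeness of G to A along u, cf. (2.6); zero iff G u = A u
    if not np.any(u):
        return 0.0
    r, Gu = (G - A) @ u, G @ u
    Ainv = np.linalg.inv(A)                 # explicit inverse: fine for an illustration
    return np.sqrt((r @ Ainv @ r) / (Gu @ Ainv @ Gu))
```

Right after an update along u this measure vanishes, since every member of the Broyden family satisfies the secant equation; the difficulty is that each new iteration brings a new direction.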
The first one is a simple trace potential function, that we will use only when we can guarantee that \(A \preceq G\):
\(\sigma (A, G) {\mathop {=}\limits ^{\mathrm {def}}}\langle A^{-1}, G - A \rangle . \qquad (2.7)\)
Lemma 2.2
Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that
\(A \preceq G \preceq \eta A \qquad (2.8)\)
for some \(\eta \ge 1\). Then, for any \(\phi \in [0, 1]\) and any \(u \in \mathbb {E}\), we have
\(\sigma (A, \mathrm{Broyd}_{\phi }(A, G, u)) \le \sigma (A, G) - \Bigl( \frac{\phi }{\eta } + 1 - \phi \Bigr) \theta ^2(A, G, u). \qquad (2.9)\)
Proof
We can assume that \(u \ne 0\) since otherwise the claim is trivial. Denote \(G_+ {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Broyd}_{\phi }(A, G, u)\) and \(\theta {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G, u)\). Then,
Note that
Therefore (see footnote 2),
Consequently,
At the same time,
Hence,
Substituting now (2.12) and (2.13) into (2.10), we obtain (2.9). \(\square \)
The second potential function is more universal since we can work with it even if the condition \(A \preceq G\) is violated. This function was first introduced in [29], and is defined as follows:
\(\psi (A, G) {\mathop {=}\limits ^{\mathrm {def}}}\langle A^{-1}, G - A \rangle - \ln \mathrm{Det}(A^{-1}, G). \qquad (2.14)\)
In fact, \(\psi \) is nothing else but the Bregman divergence, generated by the strictly convex function \(d(G) {\mathop {=}\limits ^{\mathrm {def}}}-\ln \mathrm{Det}(B^{-1}, G)\), defined on the set of self-adjoint positive definite linear operators from \(\mathbb {E}\) to \(\mathbb {E}^*\), where \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) is an arbitrary fixed self-adjoint positive definite linear operator. Indeed, since \(\nabla d(A) = -A^{-1}\),
\(d(G) - d(A) - \langle \nabla d(A), G - A \rangle = \ln \mathrm{Det}(B^{-1}, A) - \ln \mathrm{Det}(B^{-1}, G) + \langle A^{-1}, G - A \rangle = \psi (A, G),\)
where the last equality uses the multiplicative formula for the determinant.
Thus, \(\psi (A, G) \ge 0\) and \(\psi (A, G) = 0\) if and only if \(G = A\).
Let \(\omega : (-1, +\infty ) \rightarrow \mathbb {R}\) be the univariate function
\(\omega (t) {\mathop {=}\limits ^{\mathrm {def}}}t - \ln (1 + t). \qquad (2.15)\)
Clearly, \(\omega \) is a convex function, which is decreasing on \((-1, 0]\) and increasing on \([0, +\infty )\). Also, on the latter interval, it satisfies the following bounds (see [32, Lemma 5.1.5]):
\(\frac{t^2}{2 (1 + t)} \le \omega (t) \le \frac{t^2}{2 + t}, \quad t \ge 0.\)
Thus, for large values of t, the function \(\omega (t)\) is approximately linear in t, while for small values of t, it is quadratic.
There is a close relationship between \(\omega \) and the potential function \(\psi \). Indeed, if \(\lambda _1, \ldots , \lambda _n \ge 0\) are the relative eigenvalues of G with respect to A, then
\(\psi (A, G) = \sum _{i=1}^n (\lambda _i - 1 - \ln \lambda _i) = \sum _{i=1}^n \omega (\lambda _i - 1).\)
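This identity is easy to verify numerically; a sketch of ours (the relative eigenvalues are computed with a generalized symmetric eigensolver):

```python
import numpy as np
from scipy.linalg import eigh

def omega(t):
    # omega(t) = t - ln(1 + t), cf. (2.15)
    return t - np.log1p(t)

def psi(A, G):
    # psi(A, G) = <A^{-1}, G - A> - ln Det(A^{-1}, G), cf. (2.14)
    Ainv = np.linalg.inv(A)
    return np.trace(Ainv @ (G - A)) - np.log(np.linalg.det(Ainv @ G))

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); A = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n)); G = M @ M.T + n * np.eye(n)

lam = eigh(G, A, eigvals_only=True)         # relative eigenvalues of G w.r.t. A
assert np.isclose(psi(A, G), omega(lam - 1).sum())
```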
We are going to use the function \(\omega \) to estimate from below the change in the potential function \(\psi \), which is achieved after one update from the convex Broyden class, via the closeness measure \(\theta \). However, first of all, we need an auxiliary lemma.
Lemma 2.3
For any real \(\alpha \ge \beta > 0\), we have
Proof
Equivalently, we need to prove that
Let us show that the right-hand side of (2.17) is increasing in \(\beta \). This is evident if \(\alpha \ge 2\) because \(\omega \) is increasing on \([0, +\infty )\), so suppose that \(\alpha < 2\). Denote
Note that t is decreasing in \(\beta \). Therefore, it suffices to prove that the right-hand side of (2.17) is decreasing in t. But
which is indeed decreasing in t since \(\omega \) is decreasing on \((-1, 0]\).
Thus, it suffices to prove (2.17) only in the boundary case \(\beta = \alpha \):
or, equivalently, in view of (2.15), that
For \(\alpha \ge 1\), this is obvious, so suppose that \(\alpha \le 1\). It now remains to justify that
for all \(t \in [0, 1)\). But this easily follows by integration from the fact that
for all \(t \in [0, 1)\). \(\square \)
Now we are ready to prove the main result.
Lemma 2.4
Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that
\(\frac{1}{\xi } A \preceq G \preceq \eta A\)
for some \(\xi , \eta \ge 1\). Then, for any \(\phi \in [0, 1]\) and any \(u \in \mathbb {E}\), we have
Proof
Suppose that \(u \ne 0\) since otherwise the claim is trivial. Let us denote \(G_+ {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Broyd}_{\phi }(A, G, u)\) and \(\theta {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G, u)\). We already know that
Applying now Lemma 6.2, we obtain
Thus,
where we have used the concavity of the logarithm.
Denote
Clearly, \(\alpha _1 \ge \beta _1\) and \(\alpha _0 \ge \beta _0\) by the Cauchy–Schwarz inequality. Also,
Therefore, by Lemma 2.3 and the fact that \(\omega \) is increasing on \([0, +\infty )\), we have
Combining these inequalities with (2.21), we obtain the claim. \(\square \)
3 Unconstrained quadratic minimization
In this section, we study the classical quasi-Newton methods, based on the updating formulas from the convex Broyden class, as applied to minimizing the quadratic function
\(f(x) {\mathop {=}\limits ^{\mathrm {def}}}\frac{1}{2} \langle A x, x \rangle - \langle b, x \rangle , \qquad (3.1)\)
where \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) is a self-adjoint positive definite operator, and \(b \in \mathbb {E}^*\).
Let \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) be a self-adjoint positive definite linear operator, that we will use to initialize our methods. Denote by \(\mu > 0\) the strong convexity parameter of f, and by \(L > 0\) the Lipschitz constant of the gradient of f, both measured with respect to B:
\(\mu B \preceq A \preceq L B. \qquad (3.2)\)
Consider the following standard quasi-Newton scheme for minimizing (3.1). For the sake of simplicity, we assume that the constant L is available.

Scheme (3.3). Initialization: choose \(x_0 \in \mathbb {E}\) and set \(G_0 = L B\). For \(k \ge 0\), iterate:
1. Update the iterate: \(x_{k+1} = x_k - G_k^{-1} \nabla f(x_k)\).
2. Set \(u_k = x_{k+1} - x_k\).
3. Choose \(\phi _k \in [0, 1]\) and set \(G_{k+1} = \mathrm{Broyd}_{\phi _k}(A, G_k, u_k)\).
Remark 3.1
In an actual implementation of scheme (3.3), it is typical to store in memory and update in iterations the matrix \(H_k {\mathop {=}\limits ^{\mathrm {def}}}G_k^{-1}\) instead of \(G_k\) (or, alternatively, the Cholesky decomposition of \(G_k\)). This allows one to compute \(G_k^{-1} \nabla f(x_k)\) in \(O(n^2)\) operations. Note that, due to the low-rank structure of the update (2.1), \(H_k\) can be updated into \(H_{k+1}\) also in \(O(n^2)\) operations (for specific formulas, see e.g. [8, Section 8]).
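Below is a compact sketch of scheme (3.3) in the BFGS case (\(\phi _k \equiv 0\)) with \(B = I_n\), maintaining \(H_k = G_k^{-1}\) as suggested in the remark. It is our own illustration: the update used is the standard inverse BFGS formula, written here with full matrix products for readability (its expanded rank-two form costs \(O(n^2)\)).

```python
import numpy as np

def bfgs_quadratic(A, b, x0, L, n_steps):
    """Scheme (3.3) with phi_k = 0 (BFGS) and B = I, storing H_k = G_k^{-1}."""
    n = len(b)
    x = np.array(x0, dtype=float)
    H = np.eye(n) / L                       # H_0 = G_0^{-1} = (L B)^{-1}
    for _ in range(n_steps):
        g = A @ x - b                       # gradient of f(x) = 1/2 <Ax, x> - <b, x>
        u = -H @ g                          # u_k = -G_k^{-1} grad f(x_k): an O(n^2) product
        x = x + u
        y = A @ u                           # = grad f(x_{k+1}) - grad f(x_k)
        rho = 1.0 / (y @ u)
        V = np.eye(n) - rho * np.outer(u, y)
        H = V @ H @ V.T + rho * np.outer(u, u)   # inverse BFGS update of Remark 3.1
    return x
```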
To measure the convergence rate of scheme (3.3), we look at the norm of the gradient, measured with respect to A:
\(\lambda _f(x) {\mathop {=}\limits ^{\mathrm {def}}}\Vert \nabla f(x) \Vert _A^* = \langle \nabla f(x), A^{-1} \nabla f(x) \rangle ^{1/2}. \qquad (3.4)\)
The following lemma shows that the measure \(\theta (A, G_k, u_k)\), which we introduced in (2.6) to measure the closeness of \(G_k\) to A along the direction \(u_k\), is directly related to the progress of one step of the scheme (3.3). Note that it is important here that the updating direction \(u_k = x_{k+1} - x_k\) is chosen as the difference of successive iterates; for other choices of \(u_k\), this result is no longer true.
Lemma 3.1
In scheme (3.3), for all \(k \ge 0\), we have
\(\lambda _f(x_{k+1}) = \theta (A, G_k, u_k) \, \lambda _f(x_k). \qquad (3.5)\)
Proof
Indeed, since \(\nabla f(x_k) = -G_k u_k\) by the definition of the scheme,
\(\nabla f(x_{k+1}) = \nabla f(x_k) + A u_k = (A - G_k) u_k.\)
Hence, denoting \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_k, u_k)\), we get
\(\lambda _f(x_{k+1}) = \Vert (G_k - A) u_k \Vert _A^* = \theta _k \Vert G_k u_k \Vert _A^* = \theta _k \Vert \nabla f(x_k) \Vert _A^* = \theta _k \lambda _f(x_k).\)
The proof is finished. \(\square \)
Let us show that the scheme (3.3) has global linear convergence, and that the corresponding rate is at least as good as that of the standard gradient method.
Theorem 3.1
In scheme (3.3), for all \(k \ge 0\), we have
\(A \preceq G_k \preceq \frac{L}{\mu } A, \qquad (3.6)\)
and
\(\lambda _f(x_k) \le \Bigl( 1 - \frac{\mu }{L} \Bigr)^k \lambda _f(x_0). \qquad (3.7)\)
Proof
For \(k=0\), (3.6) follows from the fact that \(G_0 = L B\) and (3.2). For all other \(k \ge 1\), it follows by induction using Lemma 2.1.
Thus, we have
\(0 \preceq G_k - A \preceq \Bigl( 1 - \frac{\mu }{L} \Bigr) G_k.\)
Therefore,
\(\Vert (G_k - A) u_k \Vert _A^* \le \Bigl( 1 - \frac{\mu }{L} \Bigr) \Vert G_k u_k \Vert _A^*,\)
and so
\(\theta (A, G_k, u_k) \le 1 - \frac{\mu }{L}.\)
Applying now Lemma 3.1, we obtain (3.7). \(\square \)
Now, let us establish the superlinear convergence of the scheme (3.3). First, we do this by working with the trace potential function \(\sigma \), defined by (2.7). Note that this is possible since \(A \preceq G_k\) in view of (3.6).
Theorem 3.2
In scheme (3.3), for all \(k \ge 1\), we have
\(\lambda _f(x_k) \le \Bigl( \frac{n L}{\mu k} \Bigr)^{k/2} \prod _{i=0}^{k-1} \Bigl( \phi _i \frac{\mu }{L} + 1 - \phi _i \Bigr)^{-1/2} \lambda _f(x_0). \qquad (3.9)\)
Proof
Denote \(\sigma _i {\mathop {=}\limits ^{\mathrm {def}}}\sigma (A, G_i)\), \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_i, u_i)\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) be arbitrary. From (3.6) and Lemma 2.2, it follows that
\(\sigma _{i+1} \le \sigma _i - p_i \theta _i^2\)
for all \(0 \le i \le k - 1\). Summing up these inequalities, we obtain
\(\sum _{i=0}^{k-1} p_i \theta _i^2 \le \sigma _0 = \langle A^{-1}, L B - A \rangle \le n \Bigl( \frac{L}{\mu } - 1 \Bigr) \le \frac{n L}{\mu }. \qquad (3.10)\)
Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality,
\(\frac{\lambda _f(x_k)}{\lambda _f(x_0)} = \prod _{i=0}^{k-1} \theta _i = \Bigl( \prod _{i=0}^{k-1} p_i \Bigr)^{-1/2} \prod _{i=0}^{k-1} (p_i \theta _i^2)^{1/2} \le \Bigl( \prod _{i=0}^{k-1} p_i \Bigr)^{-1/2} \Bigl( \frac{1}{k} \sum _{i=0}^{k-1} p_i \theta _i^2 \Bigr)^{k/2} \le \Bigl( \frac{n L}{\mu k} \Bigr)^{k/2} \Bigl( \prod _{i=0}^{k-1} p_i \Bigr)^{-1/2}.\)
The proof is finished. \(\square \)
Remark 3.2
As can be seen from (3.10), the factor \(\frac{n L}{\mu }\) in the efficiency estimate (3.9) can be improved up to \(\langle A^{-1}, L B - A \rangle = \sum _{i=1}^n (\frac{L}{\lambda _i} - 1)\), where \(\lambda _1, \ldots , \lambda _n\) are the eigenvalues of A relative to B. This improved factor can be significantly smaller than the original one if the majority of the eigenvalues \(\lambda _i\) are much larger than \(\mu \). However, for the sake of simplicity, we prefer to work directly with constants n, L and \(\mu \). This corresponds to the worst-case analysis. The same remark applies to all other theorems on superlinear convergence, that will follow.
Let us discuss the efficiency estimate (3.9). Note that its maximal value over all \(\phi _i \in [0, 1]\) is achieved at \(\phi _i = 1\) for all \(0 \le i \le k-1\). This corresponds to the DFP method. In this case, the efficiency estimate (3.9) looks as follows:
\(\lambda _f(x_k) \le \Bigl( \frac{n L^2}{\mu ^2 k} \Bigr)^{k/2} \lambda _f(x_0).\)
Hence, the moment when the superlinear convergence starts can be described as follows: the above bound becomes smaller than 1 once
\(k > \frac{n L^2}{\mu ^2}.\)
In contrast, the minimal value of the efficiency estimate (3.9) over all \(\phi _i \in [0, 1]\) is achieved at \(\phi _i = 0\) for all \(0 \le i \le k-1\). This corresponds to the BFGS method. In this case, the efficiency estimate (3.9) becomes
\(\lambda _f(x_k) \le \Bigl( \frac{n L}{\mu k} \Bigr)^{k/2} \lambda _f(x_0), \qquad (3.11)\)
and the moment when the superlinear convergence begins can be described as follows: the bound (3.11) becomes smaller than 1 once
\(k > \frac{n L}{\mu }.\)
Thus, we see that, compared to DFP, the superlinear convergence of BFGS starts \(\frac{L}{\mu }\) times earlier, and its rate is much faster.
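For a rough sense of scale (an illustration of ours, using the worst-case bounds above): take \(n = 100\) and \(\frac{L}{\mu } = 100\). Then the DFP bound becomes nontrivial only after \(k > \frac{n L^2}{\mu ^2} = 10^6\) iterations, while the BFGS threshold \(\frac{n L}{\mu } = 10^4\) is reached a hundred times earlier; moreover, already at \(k = 2 \cdot 10^4\) the BFGS bound equals \((\frac{n L}{\mu k})^{k/2} = 2^{-10^4}\).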
Let us present another justification of the superlinear convergence rate (3.9) for the scheme (3.3). For this, instead of \(\sigma \), we will work with the potential function \(\psi \), defined by (2.14). The advantage of this analysis is that it can be extended to general nonlinear functions.
Theorem 3.3
In scheme (3.3), for all \(k \ge 1\), we have
Proof
Denote \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_i, u_i)\), \(\psi _i {\mathop {=}\limits ^{\mathrm {def}}}\psi (A, G_i)\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) and \(0 \le i \le k - 1\) be arbitrary. In view of (3.6) and Lemma 2.4, we have
Note that \(\theta _i \le 1\). Indeed, if \(u_i = 0\), then \(\theta _i = 0\) by definition. Otherwise, in view of (2.6) and since \(A \preceq G_i\) by (3.6),
\(\theta _i^2 = 1 - \frac{2 \langle G_i u_i, u_i \rangle - \langle A u_i, u_i \rangle }{\langle G_i u_i, A^{-1} G_i u_i \rangle } \le 1.\)
Therefore,
and we conclude that
Summing this inequality and using the fact that \(\psi _k \ge 0\), we obtain
Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality,
The proof is finished. \(\square \)
Comparing our new efficiency estimate (3.12) with the previous one (3.9), we see that they differ only in a constant. Thus, for the quadratic function, we do not gain anything by working with the potential function \(\psi \) instead of \(\sigma \). Nevertheless, our second proof is more universal and, in contrast to the first one, can be generalized to general nonlinear functions, as we will see in the next section.
4 Minimization of general functions
Consider now a general unconstrained minimization problem:
\(\min _{x \in \mathbb {E}} \; f(x), \qquad (4.1)\)
where \(f : \mathbb {E}\rightarrow \mathbb {R}\) is a twice differentiable function with positive definite Hessian.
To write down the standard quasi-Newton scheme for (4.1), we fix some self-adjoint positive definite linear operator \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) and a constant \(L > 0\), that we use to define the initial Hessian approximation.

Scheme (4.2). Initialization: choose \(x_0 \in \mathbb {E}\) and set \(G_0 = L B\). For \(k \ge 0\), iterate:
1. Update the iterate: \(x_{k+1} = x_k - G_k^{-1} \nabla f(x_k)\).
2. Set \(u_k = x_{k+1} - x_k\).
3. Define \(J_k = \int _0^1 \nabla ^2 f(x_k + t u_k) \, dt\).
4. Choose \(\phi _k \in [0, 1]\) and set \(G_{k+1} = \mathrm{Broyd}_{\phi _k}(J_k, G_k, u_k)\).
Remark 4.1
Similarly to Remark 3.1, when implementing scheme (4.2), it is common to work directly with the inverse \(H_k {\mathop {=}\limits ^{\mathrm {def}}}G_k^{-1}\) instead of \(G_k\). Also note that it is not necessary to compute \(J_k\) explicitly. Indeed, for implementing the Hessian approximation update at Step 4 (or the corresponding update for its inverse), one only needs the product
\(J_k u_k = \int _0^1 \nabla ^2 f(x_k + t u_k) u_k \, dt = \nabla f(x_{k+1}) - \nabla f(x_k),\)
which is just the difference of the successive gradients.
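The following sketch (ours, with \(\mathbb {E}= \mathbb {R}^n\) and \(B = I_n\)) implements one possible reading of scheme (4.2), never forming \(J_k\) and using only the gradient difference \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k) = J_k u_k\); the explicit linear solve stands in for the \(O(n^2)\) inverse updates discussed above.

```python
import numpy as np

def quasi_newton(grad, x0, L, n_steps, phi=0.0):
    """A sketch of scheme (4.2) with B = I: the update needs only
    y_k = grad f(x_{k+1}) - grad f(x_k), which equals J_k u_k (Remark 4.1)."""
    n = len(x0)
    x = np.array(x0, dtype=float)
    G = L * np.eye(n)                       # G_0 = L B
    g = grad(x)
    for _ in range(n_steps):
        u = -np.linalg.solve(G, g)          # for clarity; in practice update H_k = G_k^{-1}
        x_new = x + u
        g_new = grad(x_new)
        y = g_new - g                       # = J_k u_k
        Gu = G @ u
        # convex Broyden update (2.1)-(2.3), with J_k entering only through y = J_k u_k:
        dfp = (G - (np.outer(y, Gu) + np.outer(Gu, y)) / (y @ u)
                 + ((u @ Gu) / (y @ u) + 1.0) * np.outer(y, y) / (y @ u))
        bfgs = G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(y, y) / (y @ u)
        G = phi * dfp + (1.0 - phi) * bfgs
        x, g = x_new, g_new
    return x
```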
In what follows, we make the following assumptions about the problem (4.1). First, we assume that, with respect to the operator B, the objective function f is strongly convex with parameter \(\mu > 0\) and its gradient is Lipschitz continuous with constant L, i.e.
\(\mu B \preceq \nabla ^2 f(x) \preceq L B \qquad (4.3)\)
for all \(x \in \mathbb {E}\). Second, we assume that the objective function f is strongly self-concordant with some constant \(M \ge 0\), i.e.
\(\nabla ^2 f(y) - \nabla ^2 f(x) \preceq M \Vert y - x \Vert _z \, \nabla ^2 f(w) \qquad (4.4)\)
for all \(x, y, z, w \in \mathbb {E}\). The class of strongly self-concordant functions was recently introduced in [31], and contains at least all strongly convex functions with Lipschitz continuous Hessian (see [31, Example 4.1]). It gives us the following convenient relations between the Hessians of the objective function:
Lemma 4.1
(see [31, Lemma 4.1]) Let \(x, y \in \mathbb {E}\), and let \(r {\mathop {=}\limits ^{\mathrm {def}}}\Vert y - x \Vert _x\). Then,
\(\frac{1}{1 + M r} \nabla ^2 f(x) \preceq \nabla ^2 f(y) \preceq (1 + M r) \nabla ^2 f(x).\)
Also, for \(J {\mathop {=}\limits ^{\mathrm {def}}}\int _0^1 \nabla ^2 f(x + t (y - x)) dt\), we have
\(\frac{1}{1 + \frac{M r}{2}} \nabla ^2 f(x) \preceq J \preceq \Bigl( 1 + \frac{M r}{2} \Bigr) \nabla ^2 f(x).\)
As a particular example of a nonquadratic function satisfying assumptions (4.3), (4.4), one can consider the regularized log-sum-exp function, defined by \(f(x) {\mathop {=}\limits ^{\mathrm {def}}}\ln (\sum _{i=1}^m e^{\langle a_i, x \rangle + b_i}) + \frac{\mu }{2} \Vert x \Vert ^2\), where \(a_i \in \mathbb {E}^*\), \(b_i \in \mathbb {R}\) for \(i = 1, \ldots , m\), and \(\mu > 0\), \(\Vert x \Vert {\mathop {=}\limits ^{\mathrm {def}}}\langle B x, x \rangle ^{1/2}\).
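For completeness, here is a sketch of this test function and its gradient (our own code; the data \(a_i\), \(b_i\) below are random placeholders), which can be fed directly to the sketch of scheme (4.2) given after Remark 4.1:

```python
import numpy as np
from scipy.special import logsumexp

def make_logsumexp(Amat, c, mu):
    """f(x) = ln(sum_i exp(<a_i, x> + c_i)) + mu/2 ||x||^2, rows of Amat = a_i, B = I."""
    def f(x):
        return logsumexp(Amat @ x + c) + 0.5 * mu * (x @ x)
    def grad(x):
        z = Amat @ x + c
        p = np.exp(z - logsumexp(z))        # softmax weights
        return Amat.T @ p + mu * x
    return f, grad

# example usage with random placeholder data:
rng = np.random.default_rng(0)
m, n = 20, 5
f, grad = make_logsumexp(rng.standard_normal((m, n)), rng.standard_normal(m), mu=0.1)
```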
Remark 4.2
Since we are interested in local convergence, it is possible to relax our assumptions by requiring that (4.3), (4.4) hold only in some neighborhood of a minimizer \(x^*\). For this, it suffices to assume that the Hessian of f is Lipschitz continuous in this neighborhood with \(\nabla ^2 f(x^*)\) being positive definite. These are exactly the standard assumptions, used in [8] and many other works, studying local convergence of quasi-Newton methods. However, to avoid excessive technicalities, we do not do this.
Let us now analyze the process (4.2). For measuring its convergence, we look at the local norm of the gradient:
\(\lambda _f(x) {\mathop {=}\limits ^{\mathrm {def}}}\Vert \nabla f(x) \Vert _x^* = \langle \nabla f(x), \nabla ^2 f(x)^{-1} \nabla f(x) \rangle ^{1/2}.\)
First, let us estimate the progress of one step of the scheme (4.2). Recall that \(\theta (J_k, G_k, u_k)\) is the measure of closeness of \(G_k\) to \(J_k\) along the direction \(u_k\) (see (2.6)).
Lemma 4.2
In scheme (4.2), for all \(k \ge 0\) and \(r_k {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_k \Vert _{x_k}\), we have
Proof
Denote \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_k, G_k, u_k)\). In view of Taylor’s formula,
Therefore,
The proof is finished. \(\square \)
Our next result states that, if the starting point in scheme (4.2) is chosen sufficiently close to the solution, then the relative eigenvalues of the Hessian approximations \(G_k\) with respect to both the Hessians \(\nabla ^2 f(x_k)\) and the integral Hessians \(J_k\) are always located between 1 and \(\frac{L}{\mu }\), up to some small numerical constant. As a consequence, the process (4.2) has at least the linear convergence rate of the gradient method.
Theorem 4.1
Suppose that, in scheme (4.2),
Then, for all \(k \ge 0\), we have
where (see footnote 3)
and \(r_i {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_i \Vert _{x_i}\) for any \(i \ge 0\).
Proof
Note that \(\xi _0 = 1\) and \(G_0 = L B\). Therefore, for \(k = 0\), both (4.11), (4.13) are satisfied. Indeed, the first one reads \(\nabla ^2 f(x_0) \preceq L B \preceq \frac{L}{\mu } \nabla ^2 f(x_0)\) and follows from (4.3), while the second one reads \(\lambda _f(x_0) \le \lambda _f(x_0)\) and is obviously true.
Now assume that \(k \ge 0\), and that (4.11), (4.13) have already been proved for all \(0 \le k' \le k\). Combining (4.11) with (4.6), using the definition of \(\xi _k'\), we obtain (4.12). Further, denote \(\lambda _i {\mathop {=}\limits ^{\mathrm {def}}}\lambda _f(x_i)\) for \(0 \le i \le k\). Note that
Therefore,
Consequently, by the definition of \(\xi _k\) and \(\xi _k'\),
Thus, (4.12), (4.14) are now proved. To finish the proof by induction, it remains to prove (4.11), (4.13) for \(k' = k + 1\).
We start with (4.11). Applying Lemma 2.1, using (4.12), we obtain
Consequently,
and
Thus, (4.11) is proved for \(k'= k+1\).
It remains to prove (4.13) for \(k'= k+1\). By Lemma 4.2,
where \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_k, G_k, u_k)\). Note that
Hence,
where
Therefore,
Thus,
Consequently,
It remains to show that
Note that
Hence,
Also,
Combining (4.22) and (4.23), we obtain
and
Thus,
and (4.20) follows. \(\square \)
Now we are ready to prove the main result of this section on the superlinear convergence of the scheme (4.2). In contrast to the quadratic case, we cannot use the proof based on the trace potential function \(\sigma \), defined by (2.7), because we can no longer guarantee that \(J_k \preceq G_k\). However, the proof based on the potential function \(\psi \), defined by (2.14), still works.
Theorem 4.2
Suppose that the initial point \(x_0\) in scheme (4.2) is chosen sufficiently close to the solution, as specified by (4.10). Then, for all \(k \ge 1\), we have
Proof
Denote \(r_i {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_i \Vert _{x_i}\), \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_i, G_i, u_i)\), \(\psi _i {\mathop {=}\limits ^{\mathrm {def}}}\psi (J_i, G_i)\), \(\tilde{\psi }_{i+1} {\mathop {=}\limits ^{\mathrm {def}}}\psi (J_i,G_{i+1})\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) and \(0 \le i \le k - 1\) be arbitrary. By (4.12), (4.14) and Lemma 2.4, we have
Moreover, since
we also have
Thus,
where
Let us estimate \(\sum _{i=0}^{k-1} \varDelta _i\) from above. Note that
where
Hence,
and
At the same time,
Thus,
Summing up (4.25) and using the fact that \(\psi _k \ge 0\), we obtain
Since \((1 + t)^p \le 1 + p t\) for all \(t \ge -1\) and \(0 \le p \le 1\), we further have
Therefore, by Lemma 4.2 and the arithmetic-geometric mean inequality,
The proof is finished. \(\square \)
5 Discussion
Let us compare the rates of superlinear convergence that we have obtained for the classical quasi-Newton methods with those of the greedy quasi-Newton methods [31]. For brevity, we discuss only the BFGS method. Moreover, since the complexity bounds for the general nonlinear case differ from those for the quadratic one only in some absolute constants (both for the classical and the greedy methods), we only consider the case when the objective function f is quadratic.
As before, let n be the dimension of the problem, \(\mu \) be the strong convexity parameter, L be the Lipschitz constant of the gradient of f, and \(\lambda _f(x)\) be the local norm of the gradient of f at the point \(x \in \mathbb {E}\) (as defined by (3.4)). Also, let us introduce the following condition number to simplify our notation:
\(Q {\mathop {=}\limits ^{\mathrm {def}}}\frac{n L}{\mu }.\)
The greedy BFGS method [31] is essentially the classical BFGS algorithm (scheme (3.3) with \(\phi _k \equiv 0\)) with the only difference that, at each iteration, the update direction \(u_k\) is chosen greedily according to the following rule:
\(u_k = \mathop {\mathrm {argmax}}\limits _{u \in \{e_1, \ldots , e_n\}} \frac{\langle G_k u, u \rangle }{\langle A u, u \rangle },\)
where \(e_1, \ldots , e_n\) is a basis in \(\mathbb {E}\), such that \(B^{-1} = \sum _{i=1}^n e_i e_i^*\). For this method, we have the following recurrence (see [31, Theorem 3.2]):
\(\lambda _f(x_{k+1}) \le \Bigl( 1 - \frac{1}{Q} \Bigr)^{k+1} Q \, \lambda _f(x_k).\)
Hence, its rate of superlinear convergence is described by the expression
\(\lambda _f(x_k) \le \Bigl( 1 - \frac{1}{Q} \Bigr)^{k^2/2} Q^k \, \lambda _f(x_0). \qquad (5.2)\)
Although the inequality (5.2) is valid for all \(k \ge 0\), it is useful only when the factor in front of \(\lambda _f(x_0)\) is smaller than 1, which happens for
\(k \ge 2 Q \ln Q. \qquad (5.3)\)
In other words, the relation (5.3) specifies the moment, starting from which it becomes meaningful to speak about the superlinear convergence of the greedy BFGS method.
For the classical BFGS method, we have the following bound (see (3.11)):
\(\lambda _f(x_k) \le \Bigl( \frac{Q}{k} \Bigr)^{k/2} \lambda _f(x_0),\)
and the starting moment of its superlinear convergence is described as follows:
\(k \ge Q. \qquad (5.4)\)
Comparing (5.3) and (5.4), we see that, for the standard BFGS, the superlinear convergence may start slightly earlier than for the greedy one. However, the difference is only in the logarithmic factor.
Nevertheless, let us show that, very soon after the superlinear convergence of the greedy BFGS begins, namely, after
iterations, it will be significantly faster than the standard BFGS. Indeed,
for all \(k \ge 1\). Note that the function \(t \mapsto \frac{\ln t}{t}\) is decreasing on \([e, +\infty )\) (since its logarithm \(\ln \ln t - \ln t\) is a decreasing function of \(u = \ln t\) for \(u \in [1, +\infty )\), which is easily verified by differentiation). Hence, for all \(k \ge K\), we have (using first that \(k \le 2 (k-1)\) since \(k \ge 2\))
Consequently, for all \(k \ge K\), we obtain
Thus, after K iterations, the rate of superlinear convergence of the greedy BFGS is always better than that of the standard BFGS. Moreover, as \(k \rightarrow \infty \), the ratio of the greedy rate to the standard one decreases as \(e^{-k^2/Q}\), so the gap between the two methods keeps growing. At the same time, the Hessian update in the greedy BFGS method is more expensive than in the standard one.
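The difference between the two regimes is easy to see numerically. The snippet below (an illustration of ours, with arbitrary values of n and \(L/\mu \)) evaluates the logarithms of the two bounds quoted above — \((\frac{n L}{\mu k})^{k/2}\) for the classical BFGS and \((1 - \frac{\mu }{n L})^{k^2/2} (\frac{n L}{\mu })^k\) for the greedy one — working in log scale to avoid overflow:

```python
import numpy as np

n, kappa = 50, 100                  # illustrative dimension and condition number L/mu
Q = n * kappa                       # Q = n L / mu

def log10_classical(k):
    # classical BFGS bound (n L / (mu k))^{k/2} = (Q/k)^{k/2}, in log10
    return (k / 2) * np.log10(Q / k)

def log10_greedy(k):
    # greedy BFGS bound (1 - mu/(nL))^{k^2/2} (nL/mu)^k from [31], in log10
    return (k**2 / 2) * np.log10(1 - 1 / Q) + k * np.log10(Q)

for k in (10_000, 30_000, 100_000, 300_000):
    print(f"k = {k:>7}: classical {log10_classical(k):12.0f}, greedy {log10_greedy(k):12.0f}")
```

The classical bound becomes nontrivial earlier, but for large k the greedy bound's \(e^{-k^2}\)-type factor dominates, in line with the discussion above.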
Notes
Recall that, for linear operators \(A, B : \mathbb {E}\rightarrow \mathbb {E}^*\), a scalar \(\lambda \in \mathbb {R}\) is called a (relative) eigenvalue of A with respect to B if \(A x = \lambda B x\) for some \(x \in \mathbb {E}\setminus \{0\}\). If A, B are self-adjoint and B is positive definite, it is known that there exist eigenvalues \(\lambda _1, \ldots , \lambda _n \in \mathbb {R}\) and a basis \(x_1, \ldots , x_n \in \mathbb {E}\), such that \(A x_i = \lambda _i B x_i\), \(\Vert x_i \Vert _B = 1\), \(\langle B x_i, x_j \rangle = 0\) for all \(1 \le i, j \le n\), \(i \ne j\).
This is evident when \(G - A\) is non-degenerate. The general case then follows by continuity.
Here we follow the standard convention that the sum over the empty set is defined as zero. Thus, \(\xi _0 = e^0 = 1\).
References
Davidon, W.: Variable metric method for minimization. Argonne Natl. Lab. Res. Dev. Rep. 5990 (1959)
Fletcher, R., Powell, M.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
Broyden, C.: The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA J. Appl. Math. 6(1), 76–90 (1970)
Broyden, C.: The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA J. Appl. Math. 6(3), 222–231 (1970)
Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970)
Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23–26 (1970)
Shanno, D.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24(111), 647–656 (1970)
Dennis, J., Moré, J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)
Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)
Lewis, A., Overton, M.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141(1–2), 135–163 (2013)
Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878 (2016)
Gower, R., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)
Kovalev, D., Gower, R., Richtárik, P., Rogozin, A.: Fast linear convergence of randomized BFGS. (2020). arXiv:2002.11337
Powell, M.: On the convergence of the variable metric algorithm. IMA J. Appl. Math. 7(1), 21–36 (1971)
Dixon, L.: Quasi-Newton algorithms generate identical points. Math. Program. 2(1), 383–387 (1972)
Dixon, L.: Quasi Newton techniques generate identical points II: the proofs of four new theorems. Math. Program. 3(1), 345–358 (1972)
Broyden, C.: Quasi-Newton methods and their application to function minimization. Math. Comput. 21(99), 368–381 (1967)
Broyden, C., Dennis, J., Moré, J.: On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 12(3), 223–245 (1973)
Dennis, J., Moré, J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
Stachurski, A.: Superlinear convergence of Broyden’s bounded \(\theta \)-class of methods. Math. Program. 20(1), 196–212 (1981)
Griewank, A., Toint, P.: Local convergence analysis for partitioned quasi-Newton updates. Numer. Math. 39(3), 429–448 (1982)
Engels, J., Martínez, H.: Local and superlinear convergence for partially known quasi-Newton methods. SIAM J. Optim. 1(1), 42–56 (1991)
Yabe, H., Yamaki, N.: Local and superlinear convergence of structured quasi-Newton methods for nonlinear optimization. J. Oper. Res. Soc. Jpn. 39(4), 541–557 (1996)
Wei, Z., Yu, G., Yuan, G., Lian, Z.: The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Comput. Optim. Appl. 29(3), 315–332 (2004)
Yabe, H., Ogasawara, H., Yoshino, M.: Local and superlinear convergence of quasi-Newton methods based on modified secant conditions. J. Comput. Appl. Math. 205(1), 617–632 (2007)
Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)
Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 34(1), 194–217 (2019)
Byrd, R., Nocedal, J., Yuan, Y.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 24(5), 1171–1190 (1987)
Byrd, R., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26(3), 727–739 (1989)
Byrd, R., Liu, D., Nocedal, J.: On the behavior of Broyden’s class of quasi-Newton methods. SIAM J. Optim. 2(4), 533–557 (1992)
Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. CORE Discussion Papers 06 (2020)
Nesterov, Y.: Lectures on Convex Optimization. Springer, Berlin (2018)
Acknowledgements
The authors are thankful to two anonymous reviewers for their valuable time and useful feedback.
Research results in this paper were obtained with support of ERC Advanced Grant 788368.
Appendix
Lemma 6.1
Let \(A, B : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint linear operators such that \(0 \prec A \preceq B\). Then, for any \(u \in \mathbb {E}\setminus \{0\}\), we have
\(A - \frac{A u u^* A}{\langle A u, u \rangle } \preceq B - \frac{B u u^* B}{\langle B u, u \rangle }. \qquad (6.1)\)
Proof
Indeed, for all \(h \in \mathbb {E}\), we have
\(\Bigl\langle \Bigl( A - \frac{A u u^* A}{\langle A u, u \rangle } \Bigr) h, h \Bigr\rangle = \min _{t \in \mathbb {R}} \langle A (h + t u), h + t u \rangle \le \min _{t \in \mathbb {R}} \langle B (h + t u), h + t u \rangle = \Bigl\langle \Bigl( B - \frac{B u u^* B}{\langle B u, u \rangle } \Bigr) h, h \Bigr\rangle .\)
The proof is finished. \(\square \)
Lemma 6.2
For any self-adjoint positive definite linear operators \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\), any scalar \(\phi \in \mathbb {R}\), and any direction \(u \in \mathbb {E}\setminus \{0\}\), we have
\(\mathrm{Det}(G^{-1}, \mathrm{Broyd}_{\phi }(A, G, u)) = \phi \, \frac{\langle A u, G^{-1} A u \rangle }{\langle A u, u \rangle } + (1 - \phi ) \, \frac{\langle A u, u \rangle }{\langle G u, u \rangle }. \qquad (6.2)\)
Remark 6.1
Note that formula (6.2) is known in the literature (see e.g. [30, eq. (1.9)]), although we are not aware of any reference containing an explicit proof of this result.
Proof
Denote \(G_+ {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Broyd}_{\phi }(A, G, u)\),
Note that
and
Let \(Q {\mathop {=}\limits ^{\mathrm {def}}}G + \frac{A u u^* A}{\langle A u, u \rangle }\). Note that
and \(G_0 = Q - \frac{G u u^* G}{\langle G u, u \rangle }\). Therefore, applying twice Lemma 6.3, we find that
Hence,
Further, note that
So, applying Lemma 6.3 again, we obtain
Consequently,
The proof is finished. \(\square \)
Lemma 6.3
(Determinant of rank-1 perturbation) Let \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) be a self-adjoint positive definite linear operator, \(s \in \mathbb {E}^*\), and \(\alpha \in \mathbb {R}\). Then,
\(\mathrm{Det}(A^{-1}, A + \alpha s s^*) = 1 + \alpha \langle s, A^{-1} s \rangle . \qquad (6.3)\)
Proof
Indeed, with respect to A, the operator \(A + \alpha s s^*\) has \(n-1\) unit eigenvalues and one eigenvalue \(1 + \alpha \langle s, A^ {-1} s \rangle \) (corresponding to the eigenvector \(A^{-1} s\)). \(\square \)
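A quick numeric confirmation of this rank-one determinant formula (our own check, with \(\mathbb {E}= \mathbb {R}^n\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n)); A = M @ M.T + n * np.eye(n)   # positive definite
s = rng.standard_normal(n)
alpha = 0.7

Ainv = np.linalg.inv(A)
lhs = np.linalg.det(Ainv @ (A + alpha * np.outer(s, s)))   # Det(A^{-1}, A + alpha s s^*)
rhs = 1.0 + alpha * (s @ Ainv @ s)
assert np.isclose(lhs, rhs)
```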
Keywords
- Quasi-Newton methods
- Convex Broyden class
- DFP
- BFGS
- Superlinear convergence
- Local convergence
- Rate of convergence