1 Introduction

Motivation

In this work, we investigate the classical quasi-Newton algorithms for smooth unconstrained optimization, the main examples of which are the Davidon–Fletcher–Powell (DFP) method [1, 2] and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [3,4,5,6,7]. These algorithms are based on the idea of replacing the exact Hessian in the Newton method with some approximation that is updated at each iteration according to certain formulas involving only the gradients of the objective function. For an introduction to the topic, see [8] and [9, Chapter 6]; also see [10] for the treatment of quasi-Newton algorithms in the context of nonsmooth optimization and [11,12,13] for randomized variants of quasi-Newton methods.

One of the questions about quasi-Newton methods that has been extensively studied in the literature is their superlinear convergence. The first theoretical results were obtained for the methods with exact line search, initially by Powell [14], who analyzed the DFP method, and then by Dixon [15, 16], who showed that with the exact line search all quasi-Newton algorithms in the Broyden family [17] coincide. Soon after that, Broyden, Dennis and Moré [18] considered the quasi-Newton algorithms without line search and proved the local superlinear convergence of DFP, BFGS and several other methods. Their analysis was based on the Frobenius-norm potential function. Later, Dennis and Moré [19] unified the previous proofs by establishing the necessary and sufficient condition of superlinear convergence. This condition, together with the original analysis of Broyden, Dennis and Moré, has been applied since then in almost every work on quasi-Newton methods for proving superlinear convergence (see, e.g., [20,21,22,23,24,25,26,27]). Finally, one should mention that an important contribution to the theoretical analysis of quasi-Newton methods has been made by Byrd, Liu, Nocedal and Yuan in the series of works [28,29,30], where they introduced a new potential function by combining the trace with the logarithm of determinant.

However, the theory of superlinear convergence of quasi-Newton methods is still far from being complete. The main reason for this is that all currently existing results on superlinear convergence of quasi-Newton methods are only asymptotic: they simply show that the ratio of successive residuals in the method tends to zero as the number of iterations goes to infinity, without providing any specific bounds on the corresponding rate of convergence. It is therefore important to obtain some explicit and non-asymptotic rates of superlinear convergence for quasi-Newton methods.

This observation was the starting point for a recent work [31], where the greedy analogs of the classical quasi-Newton methods were developed. As opposed to the classical quasi-Newton methods, which use the difference of successive iterates for updating Hessian approximations, these methods employ basis vectors, greedily selected to maximize a certain measure of progress. As shown in [31], greedy quasi-Newton methods have a superlinear convergence rate of the form \((1-\frac{\mu }{nL})^{k^2/2} (\frac{n L}{\mu })^k\), where k is the iteration counter, n is the dimension of the problem, \(\mu \) is the strong convexity parameter, and L is the Lipschitz constant of the gradient.

In this work, we continue the same line of research but now we study the classical quasi-Newton methods. Namely, we consider the methods, based on the updates from the convex Broyden class, which is formed by all convex combinations of the DFP and BFGS updates. For this class, we derive explicit bounds on the rate of superlinear convergence of standard quasi-Newton methods without line search. In particular, for the standard DFP and BFGS methods, we obtain the rates of the form \((\frac{n L^2}{\mu ^2 k})^{k/2}\) and \((\frac{n L}{\mu k})^{k/2}\) respectively.

Contents

This paper is organized as follows. First, in Sect. 2, we study the convex Broyden class of updating rules for approximating a self-adjoint positive definite linear operator, and establish several important properties of this class. Then, in Sect. 3, we analyze the standard quasi-Newton scheme, based on the updating rules from the convex Broyden class, as applied to minimizing a quadratic function. We show that this scheme has the same rate of linear convergence as that of the classical gradient method, and also a superlinear convergence rate of the form \((\frac{Q}{k})^{k/2}\), where \(Q \ge 1\) is a certain constant, related to the condition number, and depending on the method. After that, in Sect. 4, we consider the general problem of unconstrained minimization and the corresponding quasi-Newton scheme for solving it. We show that, for this scheme, it is possible to prove exactly the same results as for the quadratic function, provided that the starting point is sufficiently close to the solution. In Sect. 5, we compare the rates of superlinear convergence that we obtain for the classical quasi-Newton methods with the corresponding rates of the greedy quasi-Newton methods. Sect. 6 contains some auxiliary results that we use in our analysis.

Notation

In what follows, \(\mathbb {E}\) denotes an arbitrary n-dimensional real vector space. Its dual space, composed of all linear functionals on \(\mathbb {E}\), is denoted by \(\mathbb {E}^*\). The value of a linear functional \(s \in \mathbb {E}^*\), evaluated at a point \(x \in \mathbb {E}\), is denoted by \(\langle s, x \rangle \).

For a smooth function \(f : \mathbb {E}\rightarrow \mathbb {R}\), we denote by \(\nabla f(x)\) and \(\nabla ^2 f(x)\) its gradient and Hessian respectively, evaluated at a point \(x \in \mathbb {E}\). Note that \(\nabla f(x) \in \mathbb {E}^*\), and \(\nabla ^2 f(x)\) is a self-adjoint linear operator from \(\mathbb {E}\) to \(\mathbb {E}^*\).

The partial ordering of self-adjoint linear operators is defined in the standard way. We write \(A \preceq A_1\) for \(A, A_1 : \mathbb {E}\rightarrow \mathbb {E}^*\) if \(\langle (A_1 - A) x, x \rangle \ge 0\) for all \(x \in \mathbb {E}\), and \(W \preceq W_1\) for \(W, W_1 : \mathbb {E}^* \rightarrow \mathbb {E}\) if \(\langle s, (W_1 - W) s \rangle \ge 0\) for all \(s \in \mathbb {E}^*\).

Any self-adjoint positive definite linear operator \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) induces in the spaces \(\mathbb {E}\) and \(\mathbb {E}^*\) the following pair of conjugate Euclidean norms:

$$\begin{aligned} \Vert h \Vert _A {\mathop {=}\limits ^{\mathrm {def}}}\langle A h, h \rangle ^{1/2}, \quad h \in \mathbb {E}, \qquad \quad \Vert s \Vert _A^* {\mathop {=}\limits ^{\mathrm {def}}}\langle s, A^{-1} s \rangle ^{1/2}, \quad s \in \mathbb {E}^*. \end{aligned}$$
(1.1)

When \(A = \nabla ^2 f(x)\), where \(f : \mathbb {E}\rightarrow \mathbb {R}\) is a smooth function with positive definite Hessian, and \(x \in \mathbb {E}\), we prefer to use notation \(\Vert \cdot \Vert _x\) and \(\Vert \cdot \Vert _x^*\), provided that there is no ambiguity with the reference function f.

Sometimes, in the formulas involving products of linear operators, it is convenient to treat \(x \in \mathbb {E}\) as a linear operator from \(\mathbb {R}\) to \(\mathbb {E}\), defined by \(x \alpha = \alpha x\), and \(x^*\) as a linear operator from \(\mathbb {E}^*\) to \(\mathbb {R}\), defined by \(x^* s = \langle s, x \rangle \). Likewise, any \(s \in \mathbb {E}^*\) can be treated as a linear operator from \(\mathbb {R}\) to \(\mathbb {E}^*\), defined by \(s \alpha = \alpha s\), and \(s^*\) as a linear operator from \(\mathbb {E}\) to \(\mathbb {R}\), defined by \(s^* x = \langle s, x \rangle \). In this case, \(x x^*\) and \(s s^*\) are rank-one self-adjoint linear operators from \(\mathbb {E}^*\) to \(\mathbb {E}\) and from \(\mathbb {E}\) to \(\mathbb {E}^*\) respectively, acting as follows:

$$\begin{aligned} (x x^*) s = \langle s, x \rangle x, \qquad (s s^*) x = \langle s, x \rangle s, \qquad x \in \mathbb {E}, \ s \in \mathbb {E}^*. \end{aligned}$$

Given two self-adjoint linear operators \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) and \(W : \mathbb {E}^* \rightarrow \mathbb {E}\), we define the trace and the determinant of A with respect to W as follows:

$$\begin{aligned} \langle W, A \rangle {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Tr}(W A), \qquad \mathrm{Det}(W, A) {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Det}(W A). \end{aligned}$$

Note that WA is a linear operator from \(\mathbb {E}\) to itself, and hence its trace and determinant are well-defined real numbers (they coincide with the trace and determinant of the matrix representation of WA with respect to an arbitrarily chosen basis in the space \(\mathbb {E}\), and the result is independent of the particular choice of the basis). In particular, if W is positive definite, then \(\langle W, A \rangle \) and \(\mathrm{Det}(W, A)\) are respectively the sum and the product of the eigenvalues of A relative to \(W^{-1}\). Observe that \(\langle \cdot , \cdot \rangle \) is a bilinear form, and for any \(x \in \mathbb {E}\), we have

$$\begin{aligned} \langle A x, x \rangle= & {} \langle x x^*, A \rangle . \end{aligned}$$
(1.2)

When A is invertible, we also have

$$\begin{aligned} \langle A^{-1}, A \rangle = n, \qquad \mathrm{Det}(A^{-1}, \delta A) = \delta ^n \end{aligned}$$
(1.3)

for any \(\delta \in \mathbb {R}\). Also recall the following multiplicative formula for the determinant:

$$\begin{aligned} \mathrm{Det}(W, A) = \mathrm{Det}(W, G) \mathrm{Det}(G^{-1}, A), \end{aligned}$$
(1.4)

which is valid for any invertible linear operator \(G : \mathbb {E}\rightarrow \mathbb {E}^*\). If the operator W is positive semidefinite, and \(A \preceq A_1\) for some self-adjoint linear operator \(A_1 : \mathbb {E}\rightarrow \mathbb {E}^*\), then \(\langle W, A \rangle \le \langle W, A_1 \rangle \) and \(\mathrm{Det}(W, A) \le \mathrm{Det}(W, A_1)\). Similarly, if A is positive semidefinite and \(W \preceq W_1\) for some self-adjoint linear operator \(W_1 : \mathbb {E}^* \rightarrow \mathbb {E}\), then \(\langle W, A \rangle \le \langle W_1, A \rangle \) and \(\mathrm{Det}(W, A) \le \mathrm{Det}(W_1, A)\).

2 Convex Broyden class

Let A and G be two self-adjoint positive definite linear operators from \(\mathbb {E}\) to \(\mathbb {E}^*\), where A is the target operator, which we want to approximate, and G is the current approximation of the operator A. The Broyden family of quasi-Newton updates of G with respect to A along a direction \(u \in \mathbb {E}\setminus \{0\}\), is the following class of updating formulas, parameterized by a scalar \(\phi \in \mathbb {R}\):

$$\begin{aligned} \mathrm{Broyd}_{\phi }(A, G, u)&{\mathop {=}\limits ^{\mathrm {def}}}&\phi \left[ G - \frac{A u u^* G + G u u^* A}{\langle A u, u \rangle } + \left( \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \right) \frac{A u u^* A}{\langle A u, u \rangle } \right] \nonumber \\&+ \, (1 - \phi ) \left[ G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle } \right] . \end{aligned}$$
(2.1)

Note that \(\mathrm{Broyd}_{\phi }(A, G, u)\) depends on A only through the product Au. For the sake of convenience, we also define \(\mathrm{Broyd}_{\phi }(A, G, u) = G\) when \(u = 0\).

Two important members of the Broyden family, DFP and BFGS updates, correspond to the values \(\phi = 1\) and \(\phi = 0\) respectively:

$$\begin{aligned}&\mathrm{DFP}(A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}G - \frac{A u u^* G + G u u^* A}{\langle A u, u \rangle } + \left( \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \right) \frac{A u u^* A}{\langle A u, u \rangle }, \nonumber \\&\mathrm{BFGS}(A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle }. \end{aligned}$$
(2.2)

Thus, the Broyden family consists of all affine combinations of DFP and BFGS updates:

$$\begin{aligned} \mathrm{Broyd}_{\phi }(A, G, u) {\mathop {=}\limits ^{(2.1)}} \phi \mathrm{DFP}(A, G, u) + (1 - \phi ) \mathrm{BFGS}(A, G, u). \end{aligned}$$
(2.3)

The subclass of the Broyden family, corresponding to \(\phi \in [0, 1]\), is known as the convex Broyden class (or the restricted Broyden class in some texts).
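To make formula (2.1) concrete, the following is a minimal Python sketch of one update from the Broyden family. It assumes that operators are represented as symmetric matrices stored as nested lists in some fixed basis; the helper names `matvec`, `dot` and `broyd` are illustrative and are not part of the paper. The example also checks numerically that the updated operator reproduces the action of A on the direction u, a property that follows directly from (2.1).

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    """Euclidean inner product of two coordinate vectors."""
    return sum(x * y for x, y in zip(a, b))


def broyd(phi, A, G, u):
    """One update Broyd_phi(A, G, u) following (2.1).

    A and G are symmetric positive definite matrices (lists of rows),
    u is the update direction. Only the product A u is actually used.
    """
    n = len(u)
    if all(c == 0 for c in u):
        # Convention from the text: the update is the identity when u = 0.
        return [row[:] for row in G]
    Au, Gu = matvec(A, u), matvec(G, u)
    uAu, uGu = dot(Au, u), dot(Gu, u)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # DFP part, first bracket of (2.1).
            dfp = (G[i][j]
                   - (Au[i] * Gu[j] + Gu[i] * Au[j]) / uAu
                   + (uGu / uAu + 1.0) * Au[i] * Au[j] / uAu)
            # BFGS part, second bracket of (2.1).
            bfgs = G[i][j] - Gu[i] * Gu[j] / uGu + Au[i] * Au[j] / uAu
            out[i][j] = phi * dfp + (1.0 - phi) * bfgs
    return out


# Small illustrative data (chosen arbitrarily for this sketch).
A = [[2.0, 1.0], [1.0, 3.0]]
G = [[4.0, 0.0], [0.0, 4.0]]
u = [1.0, -1.0]

G_plus = broyd(0.5, A, G, u)
print("G_plus u =", matvec(G_plus, u))
print("A u      =", matvec(A, u))
```

Setting `phi = 1.0` or `phi = 0.0` in this sketch gives the DFP and BFGS updates respectively.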

Our subsequent developments will be based on two properties of the convex Broyden class. The first property states that each update from this class preserves the bounds on the relative eigenvalues with respect to the target operator.

Lemma 2.1

Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that

$$\begin{aligned} \frac{A}{\xi } \preceq G \preceq \eta A, \end{aligned}$$
(2.4)

where \(\xi , \eta \ge 1\). Then, for any \(u \in \mathbb {E}\), and any \(\phi \in [0, 1]\), we have

$$\begin{aligned} \frac{A}{\xi } \preceq \mathrm{Broyd}_{\phi }(A, G, u) \preceq \eta A. \end{aligned}$$
(2.5)

Proof

Suppose that \(u \ne 0\) since otherwise the claim is trivial. In view of (2.3), it suffices to prove (2.5) only for the DFP and BFGS updates independently.

For the DFP update, we have

$$\begin{aligned}&\mathrm{DFP}(A, G, u) {\mathop {=}\limits ^{(2.2)}} G - \frac{A u u^* G + G u u^* A}{\langle A u, u \rangle } + \left( \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \right) \frac{A u u^* A}{\langle A u, u \rangle } \\&\quad = \left( I_{\mathbb {E}^*} - \frac{A u u^*}{\langle A u, u \rangle } \right) G \left( I_{\mathbb {E}} - \frac{u u^* A}{\langle A u, u \rangle } \right) + \frac{A u u^* A}{\langle A u, u \rangle }, \end{aligned}$$

where \(I_{\mathbb {E}}\), \(I_{\mathbb {E}^*}\) are the identity operators in the spaces \(\mathbb {E}\), \(\mathbb {E}^*\) respectively. Hence,

$$\begin{aligned}&\mathrm{DFP}(A, G, u) {\mathop {\preceq }\limits ^{(2.4)}} \eta \left( I_{\mathbb {E}^*} - \frac{A u u^*}{\langle A u, u \rangle } \right) A \left( I_{\mathbb {E}} - \frac{u u^* A}{\langle A u, u \rangle } \right) + \frac{A u u^* A}{\langle A u, u \rangle } \\&\qquad = \eta \left( A - \frac{A u u^* A}{\langle A u, u \rangle } \right) + \frac{A u u^* A}{\langle A u, u \rangle } \;=\; \eta A - (\eta - 1) \frac{A u u^* A}{\langle A u, u \rangle } \;\preceq \; \eta A, \\&\mathrm{DFP}(A, G, u) {\mathop {\succeq }\limits ^{(2.4)}} \frac{1}{\xi } \left( I_ {\mathbb {E}^*} - \frac{A u u^*}{\langle A u, u \rangle } \right) A \left( I_ {\mathbb {E}} - \frac{u u^* A}{\langle A u, u \rangle } \right) + \frac{A u u^* A}{\langle A u, u \rangle } \\&\qquad = \frac{1}{\xi } \left( A - \frac{A u u^* A}{\langle A u, u \rangle } \right) + \frac{A u u^* A}{\langle A u, u \rangle } \;=\; \frac{1}{\xi } A + \left( 1 - \frac{1}{\xi } \right) \frac{A u u^* A}{\langle A u, u \rangle } \;\succeq \; \frac{1}{\xi } A. \end{aligned}$$

For the BFGS update, we apply Lemma 6.1 (see Appendix):

$$\begin{aligned}&\mathrm{BFGS}(A, G, u) {\mathop {=}\limits ^{(2.2)}} G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle } \;{\mathop {\preceq }\limits ^{(2.4)}}\; \eta \left( A - \frac{A u u^* A}{\langle A u, u \rangle } \right) \\&\qquad + \frac{A u u^* A}{\langle A u, u \rangle } \\&\quad = \eta A - (\eta - 1) \frac{A u u^* A}{\langle A u, u \rangle } \;\preceq \; \eta A, \\&\mathrm{BFGS}(A, G, u) {\mathop {=}\limits ^{(2.2)}} G - \frac{G u u^* G}{\langle G u, u \rangle } + \frac{A u u^* A}{\langle A u, u \rangle } \;{\mathop {\succeq }\limits ^{(2.4)}}\; \frac{1}{\xi } \left( A - \frac{A u u^* A}{\langle A u, u \rangle } \right) \\&\qquad + \frac{A u u^* A}{\langle A u, u \rangle } \\&\quad = \frac{1}{\xi } A + \left( 1 - \frac{1}{\xi } \right) \frac{A u u^* A}{\langle A u, u \rangle } \;\succeq \; \frac{1}{\xi } A. \end{aligned}$$

The proof is finished. \(\square \)

Remark 2.1

Lemma 2.1 was first established in [5] in a slightly stronger form and using a different argument. It was also shown there that one of the relations in (2.5) may no longer be valid if \(\phi \in \mathbb {R}\setminus [0, 1]\).

The second property of the convex Broyden class, which we need, is related to the question of convergence of the approximations G to the target operator A. Note that without any restrictions on the choice of the update directions u, one cannot guarantee any convergence of G to A in the usual sense (see [19, 31] for more details). However, for our goals it will be sufficient to show that, independently of the choice of u, it is still possible to ensure that G converges to A along the update directions u, and estimate the corresponding rate of convergence.

Let us define the following measure of the closeness of G to A along the direction u:

$$\begin{aligned} \theta (A, G, u) {\mathop {=}\limits ^{\mathrm {def}}}\left[ \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G A^{-1} G u, u \rangle } \right] ^{1/2}, \end{aligned}$$
(2.6)

where, for the sake of convenience, we define \(\theta (A, G, u) = 0\) if \(u = 0\). Note that \(\theta (A, G, u) = 0\) if and only if \(G u = A u\). Thus, our goal now is to establish some upper bounds on \(\theta \), which will help us to estimate the rate, at which this measure goes to zero. For this, we will study how certain potential functions change after one update from the convex Broyden class, and estimate this change from below by an appropriate monotonically increasing function of \(\theta \). We will consider two potential functions.
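The measure (2.6) can be evaluated with two linear solves against A. The sketch below assumes a matrix representation by nested lists and uses a small hand-written Gaussian elimination (`solve`) in place of a linear algebra library; all function names are illustrative. As a sanity check, for \(G = A\) the measure is zero, and for \(G = 2A\) definition (2.6) gives \(\theta = 1/2\) for every nonzero u.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def theta(A, G, u):
    """Closeness measure (2.6) of G to A along the direction u."""
    if all(c == 0 for c in u):
        return 0.0  # convention from the text
    Gu, Au = matvec(G, u), matvec(A, u)
    w = [g - a for g, a in zip(Gu, Au)]   # (G - A) u
    num = dot(w, solve(A, w))             # <(G - A) A^{-1} (G - A) u, u>
    den = dot(Gu, solve(A, Gu))           # <G A^{-1} G u, u>
    return math.sqrt(num / den)


A = [[2.0, 1.0], [1.0, 3.0]]
twice_A = [[4.0, 2.0], [2.0, 6.0]]
u = [1.0, -1.0]

print("theta(A, A, u)  =", theta(A, A, u))
print("theta(A, 2A, u) =", theta(A, twice_A, u))
```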

The first one is a simple trace potential function, that we will use only when we can guarantee that \(A \preceq G\):

$$\begin{aligned} \sigma (A, G) {\mathop {=}\limits ^{\mathrm {def}}}\langle A^{-1}, G - A \rangle \;\ge \; 0. \end{aligned}$$
(2.7)

Lemma 2.2

Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that

$$\begin{aligned} A \preceq G \preceq \eta A \end{aligned}$$
(2.8)

for some \(\eta \ge 1\). Then, for any \(\phi \in [0, 1]\) and any \(u \in \mathbb {E}\), we have

$$\begin{aligned} \sigma (A, G) - \sigma (A, \mathrm{Broyd}_{\phi }(A, G, u)) \ge \left( \phi \frac{1}{\eta } + 1 - \phi \right) \theta ^2(A, G, u). \end{aligned}$$
(2.9)

Proof

We can assume that \(u \ne 0\) since otherwise the claim is trivial. Denote \(G_+ {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Broyd}_{\phi }(A, G, u)\) and \(\theta {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G, u)\). Then,

$$\begin{aligned}&\sigma (A, G) - \sigma (A, G_+) \;{\mathop {=}\limits ^{(2.7)}}\; \langle A^ {-1}, G - G_+ \rangle \nonumber \\&\quad {\mathop {=}\limits ^{(2.1)}} 2 \phi \frac{\langle G u, u \rangle }{\langle A u, u \rangle } - \left[ \phi \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \right] + (1 - \phi ) \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } \nonumber \\&\quad = \phi \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + (1 - \phi ) \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } - 1 \nonumber \\&\quad = \phi \frac{\langle (G - A) u, u \rangle }{\langle A u, u \rangle } + (1 - \phi ) \frac{\langle G (A^{-1} - G^{-1}) G u, u \rangle }{\langle G u, u \rangle }. \end{aligned}$$
(2.10)

Note that

$$\begin{aligned} 0 {\mathop {\preceq }\limits ^{(2.8)}} G - A {\mathop {\preceq }\limits ^{(2.8)}} (\eta - 1) A \;\preceq \; \eta A. \end{aligned}$$

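Therefore (applying the implication \(0 \preceq X \preceq \eta A \;\Rightarrow \; X A^{-1} X \preceq \eta X\) with \(X = G - A\)),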

$$\begin{aligned} (G - A) A^{-1} (G - A)\preceq & {} \eta (G - A). \end{aligned}$$
(2.11)

Consequently,

$$\begin{aligned}&\frac{\langle (G - A) u, u \rangle }{\langle A u, u \rangle } {\mathop {\ge }\limits ^{(2.11)}} \frac{1}{\eta } \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle A u, u \rangle } \nonumber \\&\quad {\mathop {\ge }\limits ^{(2.8)}} \frac{1}{\eta } \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G A^{-1} G u, u \rangle } \;{\mathop {=}\limits ^{(2.6)}}\; \frac{1}{\eta } \theta ^2. \end{aligned}$$
(2.12)

At the same time,

$$\begin{aligned} (G - A) A^{-1} (G - A)= & {} G A^{-1} G - 2 G + A \\ {\mathop {\preceq }\limits ^{(2.8)}} G A^{-1} G - G= & {} G (A^{-1} - G^{-1}) G. \end{aligned}$$

Hence,

$$\begin{aligned}&\frac{\langle G (A^{-1} - G^{-1}) G u, u \rangle }{\langle G u, u \rangle } \ge \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G u, u \rangle } \nonumber \\&\quad {\mathop {\ge }\limits ^{(2.8)}} \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G A^{-1} G u, u \rangle } \;{\mathop {=}\limits ^{(2.6)}}\; \theta ^2. \end{aligned}$$
(2.13)

Substituting now (2.12) and (2.13) into (2.10), we obtain (2.9). \(\square \)
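As a quick sanity check of (2.9), consider the one-dimensional case \(n = 1\), which is not treated separately in the text. Identify A and G with scalars \(a > 0\) and \(g\) such that \(a \le g \le \eta a\), and take any \(u \ne 0\). Both formulas in (2.2) then give \(G_+ = a\), so \(\sigma (A, G_+) = 0\), and the quantities in (2.9) become

$$\begin{aligned} \sigma (A, G) = \frac{g - a}{a}, \qquad \theta (A, G, u) = \frac{g - a}{g}, \qquad \frac{g - a}{a} \;\ge \; \frac{g - a}{g} \;\ge \; \left( \frac{g - a}{g} \right) ^2 \;\ge \; \left( \phi \frac{1}{\eta } + 1 - \phi \right) \left( \frac{g - a}{g} \right) ^2. \end{aligned}$$

The first inequality uses \(g \ge a\), the second uses \(0 \le \frac{g - a}{g} < 1\), and the third uses \(\phi \frac{1}{\eta } + 1 - \phi \le 1\) for \(\phi \in [0, 1]\) and \(\eta \ge 1\).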

The second potential function is more universal since we can work with it even if the condition \(A \preceq G\) is violated. This function was first introduced in [29], and is defined as follows:

$$\begin{aligned} \psi (A, G) {\mathop {=}\limits ^{\mathrm {def}}}\langle A^{-1}, G - A \rangle - \ln \mathrm{Det}(A^{-1}, G). \end{aligned}$$
(2.14)

In fact, \(\psi \) is nothing else but the Bregman divergence, generated by the strictly convex function \(d(G) {\mathop {=}\limits ^{\mathrm {def}}}-\ln \mathrm{Det}(B^{-1}, G)\), defined on the set of self-adjoint positive definite linear operators from \(\mathbb {E}\) to \(\mathbb {E}^*\), where \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) is an arbitrary fixed self-adjoint positive definite linear operator. Indeed,

$$\begin{aligned}&\psi (A, G) {\mathop {=}\limits ^{(1.4)}} -\ln \mathrm{Det}(B^{-1}, G) + \ln \mathrm{Det}(B^{-1}, A) - \langle -A^{-1}, G - A \rangle \\&\quad = d(G) - d(A) - \langle \nabla d(A), G - A \rangle . \end{aligned}$$

Thus, \(\psi (A, G) \ge 0\) and \(\psi (A, G) = 0\) if and only if \(G = A\).

Let \(\omega : (-1, +\infty ) \rightarrow \mathbb {R}\) be the univariate function

$$\begin{aligned} \omega (t) {\mathop {=}\limits ^{\mathrm {def}}}t - \ln (1 + t) \;\ge \; 0. \end{aligned}$$
(2.15)

Clearly, \(\omega \) is a convex function, which is decreasing on \((-1, 0]\) and increasing on \([0, +\infty )\). Also, on the latter interval, it satisfies the following bounds (see [32, Lemma 5.1.5]):

$$\begin{aligned} \frac{t^2}{2 (1 + t)} \le \frac{t^2}{2 \left( 1 + \frac{2}{3} t \right) } \le \omega (t) \le \frac{t^2}{2 + t}, \qquad t \ge 0. \end{aligned}$$
(2.16)

Thus, for large values of t, the function \(\omega (t)\) is approximately linear in t, while for small values of t, it is quadratic.
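The bounds (2.16) and the two regimes of \(\omega \) are easy to inspect numerically. The following short Python sketch evaluates \(\omega \) together with the three bounding expressions on a few sample points; the function names and the sample grid are arbitrary choices made for illustration.

```python
import math


def omega(t):
    """omega(t) = t - ln(1 + t), defined for t > -1, see (2.15)."""
    return t - math.log1p(t)


def bounds(t):
    """The three bounding expressions from (2.16), valid for t >= 0."""
    lower_weak = t * t / (2.0 * (1.0 + t))
    lower = t * t / (2.0 * (1.0 + 2.0 * t / 3.0))
    upper = t * t / (2.0 + t)
    return lower_weak, lower, upper


samples = [0.01, 0.1, 1.0, 10.0, 100.0]
for t in samples:
    lw, lo, up = bounds(t)
    # For small t all four values are close to t^2 / 2;
    # for large t they grow roughly linearly in t.
    print(f"t={t:7.2f}  {lw:.6e} <= {lo:.6e} <= {omega(t):.6e} <= {up:.6e}")
```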

There is a close relationship between \(\omega \) and the potential function \(\psi \). Indeed, if \(\lambda _1, \ldots , \lambda _n > 0\) are the relative eigenvalues of G with respect to A, then

$$\begin{aligned} \psi (A, G) {\mathop {=}\limits ^{(2.14)}} \sum \limits _{i=1}^n (\lambda _i - 1 - \ln \lambda _i) \;{\mathop {=}\limits ^{(2.15)}}\; \sum \limits _ {i=1}^n \omega (\lambda _i - 1). \end{aligned}$$

We are going to use the function \(\omega \) to estimate from below the change in the potential function \(\psi \), which is achieved after one update from the convex Broyden class, via the closeness measure \(\theta \). However, first of all, we need an auxiliary lemma.

Lemma 2.3

For any real \(\alpha \ge \beta > 0\), we have

$$\begin{aligned} \alpha - \ln \beta - 1\ge & {} \omega (\sqrt{\alpha \beta - 2 \beta + 1}). \end{aligned}$$

Proof

Equivalently, we need to prove that

$$\begin{aligned} \alpha - 1\ge & {} \omega (\sqrt{\alpha \beta - 2 \beta + 1}) + \ln \beta . \end{aligned}$$
(2.17)

Let us show that the right-hand side of (2.17) is increasing in \(\beta \). This is evident if \(\alpha \ge 2\) because \(\omega \) is increasing on \([0, +\infty )\), so suppose that \(\alpha < 2\). Denote

$$\begin{aligned} t {\mathop {=}\limits ^{\mathrm {def}}}\sqrt{\alpha \beta - 2 \beta + 1} \;=\; \sqrt{1 - (2 - \alpha ) \beta } \;\in \; [0, 1). \end{aligned}$$
(2.18)

Note that t is decreasing in \(\beta \). Therefore, it suffices to prove that the right-hand side of (2.17) is decreasing in t. But

$$\begin{aligned}&\omega (\sqrt{\alpha \beta - 2 \beta + 1}) + \ln \beta {\mathop {=}\limits ^{(2.18)}} \omega (t) + \ln \frac{1 - t^2}{2 - \alpha } \\&\quad = \omega (t) + \ln (1 - t^2) - \ln (2 - \alpha ) \\&\quad {\mathop {=}\limits ^{(2.15)}} t - \ln (1 + t) + \ln (1 - t^2) - \ln (2 - \alpha ) \\&\quad = t + \ln (1 - t) - \ln (2 - \alpha ) \\&\quad {\mathop {=}\limits ^{(2.15)}} -\omega (-t) - \ln (2 - \alpha ), \end{aligned}$$

which is indeed decreasing in t since \(\omega \) is decreasing on \((-1, 0]\).

Thus, it suffices to prove (2.17) only in the boundary case \(\beta = \alpha \):

$$\begin{aligned} \alpha - 1\ge & {} \omega (\sqrt{\alpha ^2 - 2 \alpha + 1}) + \ln \alpha \;=\; \omega (|\alpha - 1|) + \ln \alpha , \end{aligned}$$

or, equivalently, in view of (2.15), that

$$\begin{aligned} \omega (\alpha - 1)\ge & {} \omega (|\alpha - 1|). \end{aligned}$$

For \(\alpha \ge 1\), this is obvious, so suppose that \(\alpha \le 1\). It now remains to justify that

$$\begin{aligned} \omega (-t)\ge & {} \omega (t), \end{aligned}$$
(2.19)

for all \(t \in [0, 1)\). But this easily follows by integration from the fact that

$$\begin{aligned} \frac{d}{dt} \omega (-t)= & {} -\omega '(-t) \;{\mathop {=}\limits ^{(2.15)}}\; \frac{t}{1-t} \;\ge \; \frac{t}{1+t} \;{\mathop {=}\limits ^{(2.15)}}\; \omega '(t) \end{aligned}$$

for all \(t \in [0, 1)\). \(\square \)
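The inequality of Lemma 2.3 can also be probed numerically. The sketch below evaluates the difference between its left-hand and right-hand sides on a grid of pairs \(\alpha \ge \beta > 0\); the grid and the names `gap` and `worst` are arbitrary illustrative choices, and such a check is of course not a substitute for the proof. Since the argument above reduces the claim to the boundary case \(\beta = \alpha \), where it holds with equality for \(\alpha \ge 1\), the smallest value on the grid is expected to be zero up to rounding.

```python
import math


def omega(t):
    return t - math.log1p(t)


def gap(alpha, beta):
    """Left-hand side minus right-hand side of the inequality in Lemma 2.3."""
    # For alpha >= beta the radicand is at least (beta - 1)^2 >= 0;
    # the clamp only guards against tiny negative rounding errors.
    radicand = max(alpha * beta - 2.0 * beta + 1.0, 0.0)
    return alpha - math.log(beta) - 1.0 - omega(math.sqrt(radicand))


# Grid over 0 < beta <= alpha <= 5 with step 0.1.
worst = min(gap(a / 10.0, b / 10.0)
            for a in range(1, 51)
            for b in range(1, a + 1))
print("smallest gap on the grid:", worst)
```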

Now we are ready to prove the main result.

Lemma 2.4

Let \(A, G : \mathbb {E}\rightarrow \mathbb {E}^*\) be self-adjoint positive definite linear operators such that

$$\begin{aligned} \frac{1}{\xi } A \preceq G \preceq \eta A \end{aligned}$$
(2.20)

for some \(\xi , \eta \ge 1\). Then, for any \(\phi \in [0, 1]\) and any \(u \in \mathbb {E}\), we have

$$\begin{aligned} \psi (A, G) - \psi (A, \mathrm{Broyd}_{\phi }(A, G, u))\ge & {} \phi \, \omega \left( \frac{\theta (A, G, u)}{\xi ^{3/2} \sqrt{\eta }} \right) + (1-\phi ) \omega \left( \frac{\theta (A, G, u)}{\xi } \right) . \end{aligned}$$

Proof

Suppose that \(u \ne 0\) since otherwise the claim is trivial. Let us denote \(G_+ {\mathop {=}\limits ^{\mathrm {def}}}\mathrm{Broyd}_{\phi }(A, G, u)\) and \(\theta {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G, u)\). We already know that

$$\begin{aligned} \langle A^{-1}, G - G_+ \rangle {\mathop {=}\limits ^{(2.10)}} \phi \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + (1-\phi ) \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } - 1. \end{aligned}$$

Applying now Lemma 6.2, we obtain

$$\begin{aligned} \mathrm{Det}(G^{-1}, G_+)= & {} \phi \frac{\langle A G^{-1} A u, u \rangle }{\langle A u, u \rangle } + (1-\phi ) \frac{\langle A u, u \rangle }{\langle G u, u \rangle }. \end{aligned}$$

Thus,

$$\begin{aligned}&\psi (A, G) - \psi (A, G_+) \nonumber \\&\quad {\mathop {=}\limits ^{(2.14)}} \langle A^{-1}, G - G_+ \rangle + \ln \mathrm{Det}(A^{-1}, G_+) - \ln \mathrm{Det}(A^{-1}, G) \nonumber \\&\quad {\mathop {=}\limits ^{(1.4)}} \langle A^{-1}, G - G_+ \rangle + \ln \mathrm{Det}(G^ {-1}, G_+) \nonumber \\&\quad = \phi \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + (1-\phi ) \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle }\nonumber \\&\qquad - 1 + \ln \left[ \phi \frac{\langle A G^{-1} A u, u \rangle }{\langle A u, u \rangle } + (1-\phi ) \frac{\langle A u, u \rangle }{\langle G u, u \rangle } \right] \nonumber \\&\quad \ge \phi \left[ \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + \ln \frac{\langle A G^{-1} A u, u \rangle }{\langle A u, u \rangle } \right] + (1-\phi ) \left[ \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } + \ln \frac{\langle A u, u \rangle }{\langle G u, u \rangle } \right] - 1 \nonumber \\&\quad = \phi \left[ \frac{\langle G u, u \rangle }{\langle A u, u \rangle } - \ln \frac{\langle A u, u \rangle }{\langle A G^{-1} A u, u \rangle } - 1 \right] \nonumber \\&\qquad + (1-\phi ) \left[ \frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } - \ln \frac{\langle G u, u \rangle }{\langle A u, u \rangle } - 1 \right] , \end{aligned}$$
(2.21)

where we have used the concavity of the logarithm.

Denote

$$\begin{aligned}&\alpha _1 {\mathop {=}\limits ^{\mathrm {def}}}\frac{\langle G u, u \rangle }{\langle A u, u \rangle }, \qquad \beta _1 {\mathop {=}\limits ^{\mathrm {def}}}\frac{\langle A u, u \rangle }{\langle A G^{-1} A u, u \rangle }, \nonumber \\&\alpha _0 {\mathop {=}\limits ^{\mathrm {def}}}\frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle }, \qquad \beta _0 {\mathop {=}\limits ^{\mathrm {def}}}\frac{\langle G u, u \rangle }{\langle A u, u \rangle }. \end{aligned}$$
(2.22)

Clearly, \(\alpha _1 \ge \beta _1\) and \(\alpha _0 \ge \beta _0\) by the Cauchy–Schwarz inequality. Also,

$$\begin{aligned}&\alpha _1 \beta _1 - 2 \beta _1 + 1 {\mathop {=}\limits ^{(2.22)}} \frac{\langle G u, u \rangle }{\langle A G^{-1} A u, u \rangle } - 2 \frac{\langle A u, u \rangle }{\langle A G^{-1} A u, u \rangle } + 1 \;\\&\quad =\; \frac{\langle (G - A) G^{-1} (G - A) u, u \rangle }{\langle A G^{-1} A u, u \rangle } \\&\quad {\mathop {\ge }\limits ^{(2.20)}} \frac{1}{\eta } \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle A G^{-1} A u, u \rangle } \;{\mathop {\ge }\limits ^{(2.20)}}\; \frac{1}{\xi ^3 \eta } \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G A^{-1} G u, u \rangle } \\&\quad {\mathop {=}\limits ^{(2.6)}} \frac{\theta ^2}{\xi ^3 \eta }, \\&\alpha _0 \beta _0 - 2 \beta _0 + 1 {\mathop {=}\limits ^{(2.22)}} \frac{\langle G A^{-1} G u, u \rangle }{\langle A u, u \rangle } - 2 \frac{\langle G u, u \rangle }{\langle A u, u \rangle } + 1 \;\\&\quad =\; \frac{\langle (G - A) A^ {-1} (G - A) u, u \rangle }{\langle A u, u \rangle } \\&\quad {\mathop {\ge }\limits ^{(2.20)}} \frac{1}{\xi ^2} \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle G A^{-1} G u, u \rangle } \;{\mathop {=}\limits ^{(2.6)}}\; \frac{\theta ^2}{\xi ^2}. \end{aligned}$$

Therefore, by Lemma 2.3 and the fact that \(\omega \) is increasing on \([0, +\infty )\), we have

$$\begin{aligned}&\frac{\langle G u, u \rangle }{\langle A u, u \rangle } - \ln \frac{\langle A u, u \rangle }{\langle A G^{-1} A u, u \rangle } - 1 \ge \omega \left( \left[ \frac{\langle (G - A) G^{-1} (G - A) u, u \rangle }{\langle A G^{-1} A u, u \rangle } \right] ^{1/2} \right) \;\\&\quad \ge \; \omega \left( \frac{\theta }{\xi ^{3/2} \sqrt{\eta }} \right) , \\&\frac{\langle G A^{-1} G u, u \rangle }{\langle G u, u \rangle } - \ln \frac{\langle G u, u \rangle }{\langle A u, u \rangle } - 1 \ge \omega \left( \left[ \frac{\langle (G - A) A^{-1} (G - A) u, u \rangle }{\langle A u, u \rangle } \right] ^{1/2} \right) \;\\&\quad \ge \; \omega \left( \frac{\theta }{\xi } \right) . \end{aligned}$$

Combining these inequalities with (2.21), we obtain the claim. \(\square \)

3 Unconstrained quadratic minimization

In this section, we study the classical quasi-Newton methods, based on the updating formulas from the convex Broyden class, as applied to minimizing the quadratic function

$$\begin{aligned} f(x) {\mathop {=}\limits ^{\mathrm {def}}}\frac{1}{2} \langle A x, x \rangle - \langle b, x \rangle , \end{aligned}$$
(3.1)

where \(A : \mathbb {E}\rightarrow \mathbb {E}^*\) is a self-adjoint positive definite operator, and \(b \in \mathbb {E}^*\).

Let \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) be a self-adjoint positive definite linear operator, that we will use to initialize our methods. Denote by \(\mu > 0\) the strong convexity parameter of f, and by \(L > 0\) the Lipschitz constant of the gradient of f, both measured with respect to B:

$$\begin{aligned} \mu B \preceq A \preceq L B. \end{aligned}$$
(3.2)

Consider the following standard quasi-Newton scheme for minimizing (3.1). For the sake of simplicity, we assume that the constant L is available.

(3.3)
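The listing of scheme (3.3) is not reproduced in the text here. The Python sketch below is one reading of it, inferred from the relations used later in this section: the proof of Lemma 3.1 uses \(\nabla f(x_k) = -G_k u_k\) with \(u_k = x_{k+1} - x_k\), and Theorem 3.1 states \(A \preceq G_k \preceq \frac{L}{\mu } A\), which is consistent with the initialization \(G_0 = L B\). That initialization, the choice \(B = I\), the fixed value of \(\phi \), and all function names are assumptions of this sketch rather than statements from the paper.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def broyd(phi, A, G, u):
    """Update (2.1) for a nonzero direction u."""
    Au, Gu = matvec(A, u), matvec(G, u)
    uAu, uGu = dot(Au, u), dot(Gu, u)
    n = len(u)
    return [[G[i][j]
             - phi * (Au[i] * Gu[j] + Gu[i] * Au[j]) / uAu
             + phi * (uGu / uAu + 1.0) * Au[i] * Au[j] / uAu
             - (1.0 - phi) * Gu[i] * Gu[j] / uGu
             + (1.0 - phi) * Au[i] * Au[j] / uAu
             for j in range(n)] for i in range(n)]


def quasi_newton(A, b, x0, L, phi=0.0, iters=10, tol=1e-12):
    """Assumed form of scheme (3.3) for f(x) = <Ax, x>/2 - <b, x>, with B = I."""
    n = len(x0)
    # Assumption: G_0 = L * B with B the identity matrix.
    G = [[L if i == j else 0.0 for j in range(n)] for i in range(n)]
    x, history = list(x0), []
    for _ in range(iters):
        g = [ax - bi for ax, bi in zip(matvec(A, x), b)]   # gradient A x - b
        history.append(math.sqrt(dot(g, solve(A, g))))      # lambda_f(x_k), see (3.4)
        if history[-1] < tol:
            break                                           # avoid updating with u ~ 0
        u = [-s for s in solve(G, g)]                       # u_k = -G_k^{-1} grad f(x_k)
        x = [xi + ui for xi, ui in zip(x, u)]               # x_{k+1} = x_k + u_k
        G = broyd(phi, A, G, u)                             # G_{k+1} = Broyd_phi(A, G_k, u_k)
    return x, history


# Illustrative data: with B = I, the bounds mu = 1 and L = 4 satisfy (3.2) for this A.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 1.0]
x_final, history = quasi_newton(A, b, [0.0, 0.0], L=4.0, phi=0.0)
for k, value in enumerate(history):
    print(k, value)
```

Under these assumptions, the printed values of \(\lambda _f(x_k)\) can be compared with the linear rate \((1 - \frac{\mu }{L})^k\) stated in Theorem 3.1. A practical implementation would typically maintain \(G_k^{-1}\) or a factorization instead of solving a linear system from scratch at every step.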

Remark 3.1

In an actual implementation of scheme (3.3), it is typical to store in memory and update in iterations the matrix \(H_k {\mathop {=}\limits ^{\mathrm {def}}}G_k^{-1}\) instead of \(G_k\) (or, alternatively, the Cholesky decomposition of \(G_k\)). This allows one to compute \(G_k^{-1} \nabla f(x_k)\) in \(O(n^2)\) operations. Note that, due to the low-rank structure of the update (2.1), \(H_k\) can be updated into \(H_{k+1}\) also in \(O(n^2)\) operations (for specific formulas, see e.g. [8, Section 8]).

To measure the convergence rate of scheme (3.3), we look at the norm of the gradient, measured with respect to A:

$$\begin{aligned} \lambda _f(x) {\mathop {=}\limits ^{\mathrm {def}}}\Vert \nabla f(x) \Vert _A^* \;{\mathop {=}\limits ^{(1.1)}}\; \langle \nabla f(x), A^{-1} \nabla f(x) \rangle ^{1/2}. \end{aligned}$$
(3.4)

The following lemma shows that the measure \(\theta (A, G_k, u_k)\), that we introduced in (2.6) to measure the closeness of \(G_k\) to A along the direction \(u_k\), is directly related to the progress of one step of the scheme (3.3). Note that it is important here that the updating direction \(u_k = x_{k+1} - x_k\) is chosen as the difference of the iterates, and, for other choices of \(u_k\), this result is no longer true.

Lemma 3.1

In scheme (3.3), for all \(k \ge 0\), we have

$$\begin{aligned} \lambda _f(x_{k+1})= & {} \theta (A, G_k, u_k) \lambda _f(x_k). \end{aligned}$$
(3.5)

Proof

Indeed,

$$\begin{aligned} \nabla f(x_{k+1}) {\mathop {=}\limits ^{(3.1)}} \nabla f(x_k) + A (x_ {k+1} - x_k) \;{\mathop {=}\limits ^{(3.3)}}\; -G_k u_k + A u_k \;=\; -(G_k - A) u_k. \end{aligned}$$

Hence, denoting \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_k, u_k)\), we get

$$\begin{aligned}&\lambda _f(x_{k+1}) {\mathop {=}\limits ^{(3.4 )}} \langle (G_k - A) A^{-1} (G_k - A) u_k, u_k \rangle ^{1/2} \;{\mathop {=}\limits ^{(2.6)}}\; \theta _k \langle G_k A^{-1} G_k u_k, u_k \rangle ^{1/2} \\&\quad {\mathop {=}\limits ^{(3.3)}} \theta _k \langle \nabla f(x_k), A^{-1} \nabla f(x_k) \rangle ^{1/2} \;{\mathop {=}\limits ^{(3.4)}}\; \theta _k \lambda _f(x_k). \end{aligned}$$

The proof is finished. \(\square \)
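The identity (3.5) is easy to check numerically for a single step. The following is a minimal sketch, assuming \(\mathbb {E}= \mathbb {R}^n\), \(B = I\), the step \(x_{k+1} = x_k - G_k^{-1} \nabla f(x_k)\) used in the proof above, and \(\theta \) evaluated as the ratio of quadratic forms that appears in the proof. The helper names `theta` and `local_norm` are illustrative and do not come from the paper.

```python
import numpy as np

def theta(A, G, u):
    """theta(A, G, u): sqrt of <(G-A)A^{-1}(G-A)u, u> / <G A^{-1} G u, u>."""
    D = G - A
    num = (D @ u) @ np.linalg.solve(A, D @ u)
    den = (G @ u) @ np.linalg.solve(A, G @ u)
    return np.sqrt(num / den)

def local_norm(A, g):
    """lambda_f(x) = <g, A^{-1} g>^{1/2} for the gradient g = A x - b, see (3.4)."""
    return np.sqrt(g @ np.linalg.solve(A, g))

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # self-adjoint positive definite
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(A)[-1]      # with B = I, L is the largest eigenvalue of A
G = L * np.eye(n)                  # G_0 = L B

x = rng.standard_normal(n)
g = A @ x - b                      # gradient at x_k
u = -np.linalg.solve(G, g)         # u_k = x_{k+1} - x_k
g_next = A @ (x + u) - b           # gradient at x_{k+1}

lhs = local_norm(A, g_next)
rhs = theta(A, G, u) * local_norm(A, g)
print(lhs, rhs)                    # the two values agree up to rounding
```

The check does not depend on how \(G_k\) was produced. It only uses the relation \(G_k u_k = -\nabla f(x_k)\), which is why the choice of \(u_k\) as the difference of the iterates matters.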

Let us show that the scheme (3.3) has global linear convergence, and that the corresponding rate is at least as good as that of the standard gradient method.

Theorem 3.1

In scheme (3.3), for all \(k \ge 0\), we have

$$\begin{aligned} A \preceq G_k \preceq \frac{L}{\mu } A, \end{aligned}$$
(3.6)

and

$$\begin{aligned} \lambda _f(x_k) \le \left( 1 - \frac{\mu }{L} \right) ^k \lambda _f(x_0). \end{aligned}$$
(3.7)

Proof

For \(k=0\), (3.6) follows from the fact that \(G_0 = L B\) and (3.2). For all other \(k \ge 1\), it follows by induction using Lemma 2.1.

Thus, we have

$$\begin{aligned} 0 {\mathop {\preceq }\limits ^{(3.6)}} A^{-1} - G_k^{-1} {\mathop {\preceq }\limits ^{(3.6)}} \left( 1 - \frac{\mu }{L} \right) A^{-1}. \end{aligned}$$
(3.8)

Therefore,

$$\begin{aligned} (G_k - A) A^{-1} (G_k - A)= & {} G_k (A^{-1} - G_k^{-1}) A (A^{-1} - G_k^{-1}) G_k \\&\quad \preceq \left( 1 - \frac{\mu }{L} \right) ^2 G_k A^{-1} G_k, \end{aligned}$$

and so

$$\begin{aligned} \theta (A, G_k, u_k) {\mathop {\le }\limits ^{(2.6)}} 1 - \frac{\mu }{L}. \end{aligned}$$

Applying now Lemma 3.1, we obtain (3.7). \(\square \)
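Both claims of Theorem 3.1 can be observed on a small random instance. The sketch below assumes \(B = I\) and the standard BFGS and DFP formulas for the Hessian approximation, combined as \(\phi \cdot \mathrm{DFP} + (1 - \phi ) \cdot \mathrm{BFGS}\). This is our reading of the update (2.1), which is not restated in this section. The function names are illustrative.

```python
import numpy as np

def broyden_update(A, G, u, phi):
    """Convex Broyden-class update phi * DFP + (1 - phi) * BFGS (assumed standard formulas)."""
    Au, Gu = A @ u, G @ u
    uAu, uGu = u @ Au, u @ Gu
    bfgs = G - np.outer(Gu, Gu) / uGu + np.outer(Au, Au) / uAu
    dfp = (G - (np.outer(Au, Gu) + np.outer(Gu, Au)) / uAu
           + (uGu / uAu + 1.0) * np.outer(Au, Au) / uAu)
    return phi * dfp + (1.0 - phi) * bfgs

def run_scheme(A, b, x0, L, phi, iters):
    """Quasi-Newton iterations with G_0 = L I; returns the lists of lambda_f(x_k) and G_k."""
    x, G = x0.copy(), L * np.eye(len(b))
    lams, Gs = [], []
    for _ in range(iters):
        g = A @ x - b
        lams.append(np.sqrt(g @ np.linalg.solve(A, g)))
        Gs.append(G)
        if lams[-1] < 1e-12:       # stop before rounding errors dominate
            break
        u = -np.linalg.solve(G, g)
        x = x + u
        G = broyden_update(A, G, u, phi)
    return lams, Gs

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b, x0 = rng.standard_normal(n), rng.standard_normal(n)
eig = np.linalg.eigvalsh(A)
mu, L = eig[0], eig[-1]            # mu I <= A <= L I

ok_36, ok_37 = True, True
for phi in (0.0, 0.5, 1.0):        # BFGS, a mixture, DFP
    lams, Gs = run_scheme(A, b, x0, L, phi, iters=12)
    for k, (lam_k, G_k) in enumerate(zip(lams, Gs)):
        tol = 1e-8 * L
        lower = np.linalg.eigvalsh(G_k - A)[0] >= -tol
        upper = np.linalg.eigvalsh((L / mu) * A - G_k)[0] >= -tol * L / mu
        ok_36 = ok_36 and bool(lower) and bool(upper)
        rate = (1 - mu / L) ** k * lams[0]
        ok_37 = ok_37 and bool(lam_k <= rate * (1 + 1e-9) + 1e-12)
print(ok_36, ok_37)
```

The tolerances only absorb floating-point rounding. They are not part of the statement being checked.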

Now, let us establish the superlinear convergence of the scheme (3.3). First, we do this by working with the trace potential function \(\sigma \), defined by (2.7). Note that this is possible since \(A \preceq G_k\) in view of (3.6).

Theorem 3.2

In scheme (3.3), for all \(k \ge 1\), we have

$$\begin{aligned} \lambda _f(x_k)\le & {} \frac{1}{\prod _{i=0}^{k-1} \left( \phi _i \frac{\mu }{L} + 1 - \phi _i \right) ^{1/2}} \left( \frac{n L}{\mu k} \right) ^{k/2} \lambda _f(x_0). \end{aligned}$$
(3.9)

Proof

Denote \(\sigma _i {\mathop {=}\limits ^{\mathrm {def}}}\sigma (A, G_i)\), \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_i, u_i)\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) be arbitrary. From (3.6) and Lemma 2.2, it follows that

$$\begin{aligned} \sigma _i - \sigma _{i+1}\ge & {} p_i \theta _i^2 \end{aligned}$$

for all \(0 \le i \le k - 1\). Summing up these inequalities, we obtain

$$\begin{aligned} \sum \limits _{i=0}^{k-1} p_i \theta _i^2\le & {} \sigma _0 - \sigma _k \;{\mathop {\le }\limits ^{(2.7)}}\; \sigma _0 \;{\mathop {=}\limits ^{(3.3)}}\; \sigma (A, L B) \;{\mathop {=}\limits ^{(2.7)}}\; \langle A^{-1}, L B - A \rangle \nonumber \\&{\mathop {\le }\limits ^{(3.2)}} \langle A^{-1}, \frac{L}{\mu } A - A \rangle \;{\mathop {=}\limits ^{(1.3)}}\; n \left( \frac{L}{\mu } - 1 \right) \;\le \; \frac{n L}{\mu }. \end{aligned}$$
(3.10)

Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality,

$$\begin{aligned} \lambda _f(x_k)= & {} \lambda _f(x_0) \prod \limits _{i=0}^ {k-1} \theta _i \;=\; \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left[ \prod \limits _{i=0}^{k-1} p_i \theta _i^2 \right] ^{1/2} \lambda _f(x_0) \\\le & {} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{1}{k}\sum \limits _{i=0}^{k-1} p_i \theta _i^2 \right) ^{k/2} \lambda _f(x_0) \;{\mathop {\le }\limits ^{(3.10)}}\; \frac{1}{\prod _ {i=0}^{k-1} p_i^{1/2}} \left( \frac{n L}{\mu k} \right) ^ {k/2} \lambda _f(x_0). \end{aligned}$$

The proof is finished. \(\square \)
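The per-step inequality \(\sigma _i - \sigma _{i+1} \ge p_i \theta _i^2\) that drives the proof can be tested directly for one update. The sketch below is a hedged illustration: it assumes \(B = I\), reads \(\sigma (A, G) = \langle A^{-1}, G - A \rangle \) as the trace of \(A^{-1} (G - A)\), and uses the standard BFGS and DFP formulas as a stand-in for the update (2.1). All function names are ours.

```python
import numpy as np

def broyden_update(A, G, u, phi):
    """Convex Broyden-class update phi * DFP + (1 - phi) * BFGS (assumed standard formulas)."""
    Au, Gu = A @ u, G @ u
    uAu, uGu = u @ Au, u @ Gu
    bfgs = G - np.outer(Gu, Gu) / uGu + np.outer(Au, Au) / uAu
    dfp = (G - (np.outer(Au, Gu) + np.outer(Gu, Au)) / uAu
           + (uGu / uAu + 1.0) * np.outer(Au, Au) / uAu)
    return phi * dfp + (1.0 - phi) * bfgs

def sigma(A, G):
    """Trace potential sigma(A, G) = trace(A^{-1} (G - A))."""
    return np.trace(np.linalg.solve(A, G - A))

def theta_sq(A, G, u):
    """Squared closeness measure theta(A, G, u)^2."""
    D = G - A
    return ((D @ u) @ np.linalg.solve(A, D @ u)) / ((G @ u) @ np.linalg.solve(A, G @ u))

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
eig = np.linalg.eigvalsh(A)
mu, L = eig[0], eig[-1]
G = L * np.eye(n)                  # G_0 = L B, so A <= G <= (L / mu) A

gaps = []
for phi in (0.0, 0.3, 1.0):
    u = rng.standard_normal(n)     # any nonzero direction works for this inequality
    p = phi * mu / L + 1.0 - phi
    G_next = broyden_update(A, G, u, phi)
    gaps.append(sigma(A, G) - sigma(A, G_next) - p * theta_sq(A, G, u))

sigma0_bound = n * (L / mu - 1.0)  # the estimate of sigma_0 used in (3.10)
print(gaps, sigma(A, G), sigma0_bound)
```

Every entry of `gaps` should be nonnegative up to rounding, and `sigma(A, G)` should not exceed `sigma0_bound`.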

Remark 3.2

As can be seen from (3.10), the factor \(\frac{n L}{\mu }\) in the efficiency estimate (3.9) can be improved up to \(\langle A^{-1}, L B - A \rangle = \sum _{i=1}^n (\frac{L}{\lambda _i} - 1)\), where \(\lambda _1, \ldots , \lambda _n\) are the eigenvalues of A relative to B. This improved factor can be significantly smaller than the original one if the majority of the eigenvalues \(\lambda _i\) are much larger than \(\mu \). However, for the sake of simplicity, we prefer to work directly with constants n, L and \(\mu \). This corresponds to the worst-case analysis. The same remark applies to all other theorems on superlinear convergence, that will follow.

Let us discuss the efficiency estimate (3.9). Note that its maximal value over all \(\phi _i \in [0, 1]\) is achieved at \(\phi _i = 1\) for all \(0 \le i \le k-1\). This corresponds to the DFP method. In this case, the efficiency estimate (3.9) looks as follows:

$$\begin{aligned} \lambda _f(x_k)\le & {} \left( \frac{n L^2}{\mu ^2 k} \right) ^{k/2} \lambda _f(x_0). \end{aligned}$$

Hence, the moment, when the superlinear convergence starts, can be described as follows:

$$\begin{aligned} \frac{n L^2}{\mu ^2 k}\le & {} 1 \qquad \Longleftrightarrow \qquad k \ge \frac{n L^2}{\mu ^2}. \end{aligned}$$

In contrast, the minimal value of the efficiency estimate (3.9) over all \(\phi _i \in [0, 1]\) is achieved at \(\phi _i = 0\) for all \(0 \le i \le k-1\). This corresponds to the BFGS method. In this case, the efficiency estimate (3.9) becomes

$$\begin{aligned} \lambda _f(x_k)\le & {} \left( \frac{n L}{\mu k} \right) ^{k/2} \lambda _f(x_0), \end{aligned}$$
(3.11)

and the moment, when the superlinear convergence begins, can be described as follows:

$$\begin{aligned} \frac{n L}{\mu k}\le & {} 1 \qquad \Longleftrightarrow \qquad k \ge \frac{n L}{\mu }. \end{aligned}$$

Thus, we see that, compared to DFP, the superlinear convergence of BFGS starts \(\frac{L}{\mu }\) times earlier, and its rate is much faster.
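To make the comparison concrete, one can compute the first iteration at which the base of the power in (3.9) drops to one for a constant \(\phi _i \equiv \phi \). The helper below is only an illustration; the name `superlinear_start` and the sample values of n and \(L / \mu \) are arbitrary choices, not taken from the paper.

```python
import math
from fractions import Fraction

def superlinear_start(n, kappa, phi):
    """Smallest k with n * kappa / (p * k) <= 1, where p = phi / kappa + 1 - phi."""
    phi = Fraction(phi)
    p = phi / kappa + 1 - phi      # p equals 1 for BFGS and 1 / kappa for DFP
    return math.ceil(Fraction(n * kappa) / p)

n, kappa = 10, 100                 # dimension and condition number L / mu
k_bfgs = superlinear_start(n, kappa, 0)
k_dfp = superlinear_start(n, kappa, 1)
print(k_bfgs, k_dfp, k_dfp // k_bfgs)
```

For these sample values the DFP estimate starts to be informative after `kappa` times more iterations than the BFGS estimate, in agreement with the discussion above.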

Let us present for the scheme (3.3) another justification of the superlinear convergence rate in the form (3.9). For this, instead of \(\sigma \), we will work with the potential function \(\psi \), defined by (2.14). The advantage of this analysis is that it can be extended to general nonlinear functions.

Theorem 3.3

In scheme (3.3), for all \(k \ge 1\), we have

$$\begin{aligned} \lambda _f(x_k)\le & {} \frac{1}{\prod _{i=0}^{k-1} \left( \phi _i \frac{\mu }{L} + 1 - \phi _i \right) ^{1/2}} \left( \frac{4 n L}{\mu k} \right) ^{k/2} \lambda _f(x_0). \end{aligned}$$
(3.12)

Proof

Denote \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (A, G_i, u_i)\), \(\psi _i {\mathop {=}\limits ^{\mathrm {def}}}\psi (A, G_i)\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) and \(0 \le i \le k - 1\) be arbitrary. In view of (3.6) and Lemma 2.4, we have

$$\begin{aligned} \psi _i - \psi _{i+1} {\mathop {\ge }\limits ^{(2.21)}} \phi _i \omega \left( \sqrt{\frac{\mu }{L}} \theta _i \right) + (1-\phi _i) \omega (\theta _i). \end{aligned}$$
(3.13)

Note that \(\theta _i \le 1\). Indeed, if \(u_i = 0\), then \(\theta _i = 0\) by definition. Otherwise,

$$\begin{aligned} \theta _i^2 {\mathop {=}\limits ^{(2.6)}} 1 - \frac{\langle (2 G_i - A) u_i, u_i \rangle }{\langle G_i A^{-1} G_i u_i, u_i \rangle } \;{\mathop {\le }\limits ^{(3.6)}}\; 1. \end{aligned}$$

Therefore,

$$\begin{aligned} \omega \left( \sqrt{\frac{\mu }{L}} \theta _i \right) \;{\mathop {\ge }\limits ^{(2.16)}}\; \frac{\mu }{L} \frac{\theta _i^2}{2 \left( 1 + \sqrt{\frac{\mu }{L}} \theta _i \right) } \;\ge \; \frac{\mu }{L} \frac{\theta _i^2}{4}, \qquad \omega (\theta _i) \;{\mathop {\ge }\limits ^{(2.16)}}\; \frac{\theta _i^2}{2 (1 + \theta _i)} \;\ge \; \frac{\theta _i^2}{4}, \end{aligned}$$

and we conclude that

$$\begin{aligned} \psi _i - \psi _{i+1} {\mathop {\ge }\limits ^{(3.13)}} \frac{1}{4} p_i \theta _i^2. \end{aligned}$$

Summing this inequality and using the fact that \(\psi _k \ge 0\), we obtain

$$\begin{aligned} \frac{1}{4} \sum \limits _{i=0}^{k-1} p_i \theta _i^2\le & {} \psi _0 - \psi _k \;\le \; \psi _0 \;{\mathop {=}\limits ^{(3.3)}}\; \psi (A, L B) \nonumber \\&{\mathop {=}\limits ^{(2.14)}} \langle A^{-1}, L B - A \rangle - \ln \mathrm{Det}(A^{-1}, L B) \nonumber \\&{\mathop {\le }\limits ^{(3.2)}} \langle A^{-1}, \frac{L}{\mu } A - A \rangle - \ln \mathrm{Det}(A^{-1}, \frac{L}{\mu } A) \nonumber \\&{\mathop {=}\limits ^{(1.3)}} n \left( \frac{L}{\mu } - 1 - \ln \frac{L}{\mu } \right) \;\le \; \frac{n L}{\mu }. \end{aligned}$$
(3.14)

Hence, by Lemma 3.1 and the arithmetic-geometric mean inequality,

$$\begin{aligned} \lambda _f(x_k)= & {} \lambda _f(x_0) \prod \limits _{i=0}^ {k-1} \theta _i \;=\; \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left[ \prod \limits _{i=0}^{k-1} p_i \theta _i^2 \right] ^{1/2} \lambda _f(x_0) \\\le & {} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{1}{k} \sum \limits _{i=0}^{k-1} p_i \theta _i^2 \right) ^{k/2} \lambda _f(x_0) \\&{\mathop {\le }\limits ^{(3.14)}} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{4 n L}{\mu k} \right) ^{k/2} \lambda _f(x_0). \end{aligned}$$

The proof is finished. \(\square \)
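The two ingredients of this proof, the estimate of \(\psi _0\) in (3.14) and the one-step decrease (3.13), can be illustrated numerically. The sketch assumes \(B = I\), reads \(\psi (A, G)\) as \(\mathrm{trace}(A^{-1} (G - A)) - \ln \det (A^{-1} G)\), takes \(\omega (t) = t - \ln (1 + t)\), and checks only the BFGS case \(\phi _i = 0\) with the standard BFGS formula. These readings of (2.14), (2.15) and (2.1) are assumptions, since the definitions are not restated in this section.

```python
import numpy as np

def psi(A, G):
    """psi(A, G) = trace(A^{-1} (G - A)) - ln det(A^{-1} G) (assumed form of (2.14))."""
    AinvG = np.linalg.solve(A, G)
    _, logdet = np.linalg.slogdet(AinvG)
    return np.trace(AinvG) - A.shape[0] - logdet

def omega(t):
    """omega(t) = t - ln(1 + t) (assumed form of (2.15))."""
    return t - np.log1p(t)

def theta(A, G, u):
    """Closeness measure theta(A, G, u)."""
    D = G - A
    num = (D @ u) @ np.linalg.solve(A, D @ u)
    den = (G @ u) @ np.linalg.solve(A, G @ u)
    return np.sqrt(num / den)

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
eig = np.linalg.eigvalsh(A)
mu, L = eig[0], eig[-1]
kappa = L / mu
G = L * np.eye(n)                          # G_0 = L B with B = I

psi0 = psi(A, G)
psi0_bound = n * (kappa - 1.0 - np.log(kappa))   # last line of (3.14)

u = rng.standard_normal(n)
Au, Gu = A @ u, G @ u
G_bfgs = G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Au, Au) / (u @ Au)
decrease = psi0 - psi(A, G_bfgs)           # left-hand side of (3.13) for phi = 0
print(psi0, psi0_bound, decrease, omega(theta(A, G, u)))
```

The printed decrease should be at least \(\omega (\theta )\), and \(\psi _0\) should lie between zero and the bound from (3.14).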

Comparing our new efficiency estimate (3.12) with the previous one (3.9), we see that they differ only in a constant factor. Thus, for the quadratic function, we do not gain anything by working with the potential function \(\psi \) instead of \(\sigma \). Nevertheless, our second proof is more universal and, in contrast to the first one, can be generalized to general nonlinear functions, as we will see in the next section.

4 Minimization of general functions

Consider now a general unconstrained minimization problem:

$$\begin{aligned} \min \limits _{x \in \mathbb {E}} f(x), \end{aligned}$$
(4.1)

where \(f : \mathbb {E}\rightarrow \mathbb {R}\) is a twice differentiable function with positive definite Hessian.

To write down the standard quasi-Newton scheme for (4.1), we fix some self-adjoint positive definite linear operator \(B : \mathbb {E}\rightarrow \mathbb {E}^*\) and a constant \(L > 0\), that we use to define the initial Hessian approximation.

(4.2)

Remark 4.1

Similarly to Remark 3.1, when implementing scheme (4.2), it is common to work directly with the inverse \(H_k {\mathop {=}\limits ^{\mathrm {def}}}G_k^{-1}\) instead of \(G_k\). Also note that it is not necessary to compute \(J_k\) explicitly. Indeed, for implementing the Hessian approximation update at Step 4 (or the corresponding update for its inverse), one only needs the product

$$\begin{aligned} J_k u_k = \nabla f(x_{k+1}) - \nabla f(x_k), \end{aligned}$$

which is just the difference of the successive gradients.
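As an illustration of this remark, the sketch below performs one BFGS update of the inverse \(H_k\) using only the step \(u_k\) and the gradient difference, with matrix-vector products and rank-one terms only. It assumes the standard inverse BFGS formula and, as a test function, the regularized log-sum-exp function with \(B = I\). The names `grad` and `bfgs_inverse_update` are hypothetical.

```python
import numpy as np

def grad(x, a_rows, b, mu):
    """Gradient of ln(sum_i exp(<a_i, x> + b_i)) + (mu / 2) ||x||^2 with B = I."""
    z = a_rows @ x + b
    w = np.exp(z - z.max())            # shift for numerical stability
    return a_rows.T @ (w / w.sum()) + mu * x

def bfgs_inverse_update(H, u, y):
    """Standard BFGS update of H = G^{-1} from the step u and y = J u, in O(n^2)."""
    rho = 1.0 / (y @ u)
    Hy = H @ y
    return (H - rho * (np.outer(Hy, u) + np.outer(u, Hy))
            + (rho ** 2 * (y @ Hy) + rho) * np.outer(u, u))

rng = np.random.default_rng(0)
m, n, mu = 8, 4, 0.5
a_rows = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = mu + np.linalg.norm(a_rows, 2) ** 2   # a valid, not tight, gradient Lipschitz constant

G = L * np.eye(n)                      # G_0 = L B
H = np.eye(n) / L                      # H_0 = G_0^{-1}
x = rng.standard_normal(n)
g = grad(x, a_rows, b, mu)
u = -H @ g                             # quasi-Newton step
y = grad(x + u, a_rows, b, mu) - g     # equals J_k u_k, no Hessian needed

H_next = bfgs_inverse_update(H, u, y)
# Direct BFGS update of G with A u replaced by y, for comparison only.
G_next = G - np.outer(G @ u, G @ u) / (u @ G @ u) + np.outer(y, y) / (y @ u)
print(np.allclose(H_next @ G_next, np.eye(n)))
```

The updated inverse satisfies the secant equation \(H_{k+1} J_k u_k = u_k\) and coincides with the inverse of the directly updated \(G_{k+1}\).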

In what follows, we make the following assumptions about the problem (4.1). First, we assume that, with respect to the operator B, the objective function f is strongly convex with parameter \(\mu > 0\) and its gradient is Lipschitz continuous with constant L, i.e.

$$\begin{aligned} \mu B \preceq \nabla ^2 f(x) \preceq L B \end{aligned}$$
(4.3)

for all \(x \in \mathbb {E}\). Second, we assume that the objective function f is strongly self-concordant with some constant \(M \ge 0\), i.e.

$$\begin{aligned} \nabla ^2 f(y) - \nabla ^2 f(x)\preceq & {} M \Vert y - x \Vert _z \nabla ^2 f(w) \end{aligned}$$
(4.4)

for all \(x, y, z, w \in \mathbb {E}\). The class of strongly self-concordant functions was recently introduced in [31], and contains at least all strongly convex functions with Lipschitz continuous Hessian (see [31, Example 4.1]). It gives us the following convenient relations between the Hessians of the objective function:

Lemma 4.1

(see [31, Lemma 4.1]) Let \(x, y \in \mathbb {E}\), and let \(r {\mathop {=}\limits ^{\mathrm {def}}}\Vert y - x \Vert _x\). Then,

$$\begin{aligned} \frac{\nabla ^2 f(x)}{1 + M r} \preceq \nabla ^2 f(y) \preceq (1 + M r) \nabla ^2 f(x). \end{aligned}$$
(4.5)

Also, for \(J {\mathop {=}\limits ^{\mathrm {def}}}\int _0^1 \nabla ^2 f(x + t (y - x)) dt\), we have

$$\begin{aligned}&\frac{\nabla ^2 f(x)}{1 + \frac{M r}{2}} \preceq J \preceq \left( 1 + \frac{M r}{2} \right) \nabla ^2 f(x), \end{aligned}$$
(4.6)
$$\begin{aligned}&\frac{\nabla ^2 f(y)}{1 + \frac{M r}{2}} \preceq J \preceq \left( 1 + \frac{M r}{2} \right) \nabla ^2 f(y). \end{aligned}$$
(4.7)

As a particular example of a nonquadratic function, satisfying assumptions (4.3), (4.4), one can consider the regularized log-sum-exp function, defined by \(f(x) {\mathop {=}\limits ^{\mathrm {def}}}\ln (\sum _{i=1}^m e^{\langle a_i, x \rangle + b_i}) + \frac{\mu }{2} \Vert x \Vert ^2\), where \(a_i \in \mathbb {E}^*\), \(b_i \in \mathbb {R}\) for \(i = 1, \ldots , m\), and \(\mu > 0\), \(\Vert x \Vert {\mathop {=}\limits ^{\mathrm {def}}}\langle B x, x \rangle ^{1/2}\).
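For this example, assumption (4.3) can be checked numerically at any point. The sketch below takes \(B = I\) and uses the standard closed form of the log-sum-exp Hessian, \(\sum _i p_i a_i a_i^* - (\sum _i p_i a_i)(\sum _i p_i a_i)^*\) with softmax weights \(p_i\). The constant \(L = \mu + \sigma _{\max }^2\), where \(\sigma _{\max }\) is the largest singular value of the matrix with rows \(a_i\), is our own valid but not tight choice and is not taken from the paper.

```python
import numpy as np

def hessian(x, a_rows, b, mu):
    """Hessian of ln(sum_i exp(<a_i, x> + b_i)) + (mu / 2) ||x||^2 with B = I."""
    z = a_rows @ x + b
    w = np.exp(z - z.max())            # shift for numerical stability
    p = w / w.sum()                    # softmax weights
    return a_rows.T @ (np.diag(p) - np.outer(p, p)) @ a_rows + mu * np.eye(len(x))

rng = np.random.default_rng(0)
m, n, mu = 8, 4, 0.5
a_rows = rng.standard_normal((m, n))   # rows are the vectors a_i
b = rng.standard_normal(m)
L = mu + np.linalg.norm(a_rows, 2) ** 2

eigs = np.linalg.eigvalsh(hessian(rng.standard_normal(n), a_rows, b, mu))
print(mu, eigs[0], eigs[-1], L)        # mu <= eigenvalues <= L
```

Since the matrix \(\mathrm{diag}(p) - p p^*\) is positive semidefinite and bounded above by the identity, all eigenvalues of the Hessian lie between \(\mu \) and this L.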

Remark 4.2

Since we are interested in local convergence, it is possible to relax our assumptions by requiring that (4.3), (4.4) hold only in some neighborhood of a minimizer \(x^*\). For this, it suffices to assume that the Hessian of f is Lipschitz continuous in this neighborhood with \(\nabla ^2 f(x^*)\) being positive definite. These are exactly the standard assumptions, used in [8] and many other works, studying local convergence of quasi-Newton methods. However, to avoid excessive technicalities, we do not do this.

Let us now analyze the process (4.2). For measuring its convergence, we look at the local norm of the gradient:

$$\begin{aligned} \lambda _f(x) {\mathop {=}\limits ^{\mathrm {def}}}\Vert \nabla f(x) \Vert _x^* \;{\mathop {=}\limits ^{(1.1)}}\; \langle \nabla f(x), \nabla ^2 f(x)^{-1} \nabla f (x) \rangle ^{1/2}, \qquad x \in \mathbb {E}. \end{aligned}$$
(4.8)

First, let us estimate the progress of one step of the scheme (4.2). Recall that \(\theta (J_k, G_k, u_k)\) is the measure of closeness of \(G_k\) to \(J_k\) along the direction \(u_k\) (see (2.6)).

Lemma 4.2

In scheme (4.2), for all \(k \ge 0\) and \(r_k {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_k \Vert _{x_k}\), we have

$$\begin{aligned} \lambda _f(x_{k+1})\le & {} \left( 1 + \frac{M r_k}{2} \right) \theta (J_k, G_k, u_k) \lambda _f(x_k). \end{aligned}$$

Proof

Denote \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_k, G_k, u_k)\). In view of Taylor’s formula,

$$\begin{aligned} \nabla f(x_{k+1})= & {} \nabla f(x_k) + J_k (x_{k+1} - x_k) \;{\mathop {=}\limits ^{(4.2)}}\; -(G_k - J_k) u_k. \end{aligned}$$
(4.9)

Therefore,

$$\begin{aligned}&\lambda _f(x_{k+1}) {\mathop {=}\limits ^{(4.8)}} \langle \nabla f(x_{k+1}), \nabla ^2 f(x_{k+1})^{-1} \nabla f(x_{k+1}) \rangle ^{1/2} \\&\quad {\mathop {\le }\limits ^{(4.7)}} \sqrt{1 + \frac{M r_k}{2}} \, \langle \nabla f (x_{k+1}), J_k^{-1} \nabla f (x_{k+1}) \rangle ^{1/2} \\&\quad {\mathop {=}\limits ^{(4.9)}} \sqrt{1 + \frac{M r_k}{2}} \, \langle (G_k - J_k) J_k^{-1} (G_k - J_k) u_k, u_k \rangle ^{1/2} \\&\quad {\mathop {=}\limits ^{(2.6)}} \sqrt{1 + \frac{M r_k}{2}} \, \theta _k \langle G_k J_k^{-1} G_k u_k, u_k \rangle ^{1/2} \\&\quad {\mathop {=}\limits ^{(4.2)}} \sqrt{1 + \frac{M r_k}{2}} \, \theta _k \langle \nabla f(x_k), J_k^{-1} \nabla f(x_k) \rangle ^{1/2} \\&\quad {\mathop {\le }\limits ^{(4.6)}} \left( 1 + \frac{M r_k}{2} \right) \, \theta _k \langle \nabla f(x_k), \nabla ^2 f(x_k)^{-1} \nabla f (x_k) \rangle ^{1/2} \\&\quad {\mathop {=}\limits ^{(4.8)}} \left( 1 + \frac{M r_k}{2} \right) \theta _k \lambda _f(x_k). \end{aligned}$$

The proof is finished. \(\square \)

Our next result states that, if the starting point in scheme (4.2) is chosen sufficiently close to the solution, then the relative eigenvalues of the Hessian approximations \(G_k\) with respect to both the Hessians \(\nabla ^2 f(x_k)\) and the integral Hessians \(J_k\) are always located between 1 and \(\frac{L}{\mu }\), up to some small numerical constant. As a consequence, the process (4.2) has at least the linear convergence rate of the gradient method.

Theorem 4.1

Suppose that, in scheme (4.2),

$$\begin{aligned} M \lambda _f(x_0)\le & {} \frac{\ln \frac{3}{2}}{4} \frac{\mu }{L}. \end{aligned}$$
(4.10)

Then, for all \(k \ge 0\), we have

$$\begin{aligned}&\frac{1}{\xi _k} \nabla ^2 f(x_k) \preceq G_k \preceq \xi _k \frac{L}{\mu } \nabla ^2 f(x_k), \end{aligned}$$
(4.11)
$$\begin{aligned}&\frac{1}{\xi _k'} J_k \preceq G_k \preceq \xi _k' \frac{L}{\mu } J_k, \end{aligned}$$
(4.12)
$$\begin{aligned}&\xi _k \lambda _f(x_k) \le \left( 1 - \frac{\mu }{2 L} \right) ^k \lambda _f(x_0), \end{aligned}$$
(4.13)

where

$$\begin{aligned} \xi _k {\mathop {=}\limits ^{\mathrm {def}}}e^{M \sum _{i=0}^{k-1} r_i}\le & {} \left( 1 + \frac{M r_k}{2} \right) e^{M \sum _{i=0}^{k-1} r_i} {\mathop {=}\limits ^{\mathrm {def}}}\xi _k'\le \sqrt{\frac{3}{2}}, \end{aligned}$$
(4.14)

and \(r_i {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_i \Vert _{x_i}\) for any \(i \ge 0\).

Proof

Note that \(\xi _0 = 1\) and \(G_0 = L B\). Therefore, for \(k = 0\), both (4.11), (4.13) are satisfied. Indeed, the first one reads \(\nabla ^2 f(x_0) \preceq L B \preceq \frac{L}{\mu } \nabla ^2 f(x_0)\) and follows from (4.3), while the second one reads \(\lambda _f(x_0) \le \lambda _f(x_0)\) and is obviously true.

Now assume that \(k \ge 0\), and that (4.11), (4.13) have already been proved for all \(0 \le k' \le k\). Combining (4.11) with (4.6), using the definition of \(\xi _k'\), we obtain (4.12). Further, denote \(\lambda _i {\mathop {=}\limits ^{\mathrm {def}}}\lambda _f(x_i)\) for \(0 \le i \le k\). Note that

$$\begin{aligned}&r_k {\mathop {=}\limits ^{(4.2)}} \Vert G_k^{-1} \nabla f(x_k) \Vert _{x_k} \;{\mathop {=}\limits ^{(1.1)}}\; \langle \nabla f(x_k), G_k^{-1} \nabla ^2 f (x_k) G_k^{-1} \nabla f(x_k) \rangle ^{1/2} \nonumber \\&\quad {\mathop {\le }\limits ^{(4.11)}} \xi _k \langle \nabla f(x_k), \nabla ^2 f(x_k)^{-1} \nabla f(x_k) \rangle ^{1/2} \;{\mathop {=}\limits ^{(4.8)}}\; \xi _k \lambda _k. \end{aligned}$$
(4.15)

Therefore,

$$\begin{aligned}&M \sum \limits _{i=0}^k r_i {\mathop {\le }\limits ^{(4.15)}} M \sum \limits _ {i=0}^k \xi _i \lambda _i \;{\mathop {\le }\limits ^{(4.13)}}\; M \lambda _0 \sum \limits _{i=0}^k \left( 1 - \frac{\mu }{2 L} \right) ^i \nonumber \\&\quad \le \frac{2 L}{\mu } M \lambda _0 \;{\mathop {\le }\limits ^{(4.10)}}\; \frac{\ln \frac{3}{2}}{2}. \end{aligned}$$
(4.16)

Consequently, by the definition of \(\xi _k\) and \(\xi _k'\),

$$\begin{aligned} \xi _k\le & {} \xi _k' \;\le \; e^{\frac{M r_k}{2}} e^{M \sum _ {i=0}^{k-1} r_i} \;\le \; e^{M \sum _{i=0}^k r_i} \;{\mathop {\le }\limits ^{(4.16)}}\; \sqrt{\frac{3}{2}}. \end{aligned}$$

Thus, (4.12), (4.14) are now proved. To finish the proof by induction, it remains to prove (4.11), (4.13) for \(k' = k + 1\).

We start with (4.11). Applying Lemma 2.1, using (4.12), we obtain

$$\begin{aligned} \frac{1}{\xi _k'} J_k \preceq G_{k+1} \preceq \xi _k' \frac{L}{\mu } J_k. \end{aligned}$$
(4.17)

Consequently,

$$\begin{aligned}&G_{k+1} {\mathop {\preceq }\limits ^{(4.7)}} \left( 1 + \frac{M r_k}{2} \right) \xi _k' \frac{L}{\mu } \nabla ^2 f(x_{k+1}) \;{\mathop {=}\limits ^{(4.14)}}\; \left( 1 + \frac{M r_k}{2} \right) ^2 \xi _k \frac{L}{\mu } \nabla ^2 f(x_{k+1}) \nonumber \\&\quad \preceq e^{M r_k} \xi _k \frac{L}{\mu } \nabla ^2 f(x_{k+1}) \;{\mathop {=}\limits ^{(4.14)}}\; \xi _{k+1} \frac{L}{\mu } \nabla ^2 f(x_ {k+1}), \end{aligned}$$

and

$$\begin{aligned} G_{k+1} {\mathop {\succeq }\limits ^{(4.7)}} \frac{\nabla ^2 f(x_{k+1})}{\left( 1 + \frac{M r_k}{2} \right) \xi _k'} \;{\mathop {=}\limits ^{(4.14)}}\; \frac{\nabla ^2 f (x_{k+1})}{\left( 1 + \frac{M r_k}{2} \right) ^2 \xi _k} \;\succeq \; \frac{\nabla ^2 f(x_{k+1})}{e^{M r_k} \cdot \xi _k} \;{\mathop {=}\limits ^{(4.14)}}\; \frac{\nabla ^2 f(x_{k+1})}{\xi _ {k+1}}. \end{aligned}$$

Thus, (4.11) is proved for \(k'= k+1\).

It remains to prove (4.13) for \(k'= k+1\). By Lemma 4.2,

$$\begin{aligned} \lambda _{k+1}\le & {} \left( 1 + \frac{M r_k}{2} \right) \theta _k \lambda _k, \end{aligned}$$
(4.18)

where \(\theta _k {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_k, G_k, u_k)\). Note that

$$\begin{aligned} -\left( 1 - \frac{\mu }{\xi _k' L} \right) J_k^{-1} {\mathop {\preceq }\limits ^{(4.12)}} G_k^{-1} - J_k^{-1} {\mathop {\preceq }\limits ^{(4.12)}} (\xi _k' - 1) J_k^{-1}. \end{aligned}$$

Hence,

$$\begin{aligned} (J_k^{-1} - G_k^{-1}) J_k (J_k^{-1} - G_k^{-1})\preceq & {} \rho _k^2 J_k^{-1}, \end{aligned}$$

where

$$\begin{aligned} \rho _k {\mathop {=}\limits ^{\mathrm {def}}}\max \left\{ 1 - \frac{\mu }{\xi _k' L} , \, \xi _k' - 1 \right\} \;{\mathop {\ge }\limits ^{(4.14)}}\; 0. \end{aligned}$$
(4.19)

Therefore,

$$\begin{aligned}&\theta _k^2 {\mathop {=}\limits ^{(2.6)}} \frac{\langle (G_k - J_k) J_k^{-1} (G_k - J_k) u_k, u_k \rangle }{\langle G_k J_k^{-1} G_k u_k, u_k \rangle } \;\\&\quad =\; \frac{\langle G_k (J_k^{-1} - G_k^{-1}) J_k (J_k^ {-1} - G_k^{-1}) G_k u_k, u_k \rangle }{\langle G_k J_k^{-1} G_k u_k, u_k \rangle } \;\le \; \rho _k^2. \end{aligned}$$

Thus,

$$\begin{aligned} \lambda _{k+1} {\mathop {\le }\limits ^{(4.18)}} \left( 1 + \frac{M r_k}{2} \right) \rho _k \lambda _k. \end{aligned}$$

Consequently,

$$\begin{aligned} \xi _{k+1} \lambda _{k+1}\le & {} \xi _{k+1} \left( 1 + \frac{M r_k}{2} \right) \rho _k \lambda _k \;{\mathop {=}\limits ^{(4.14)}}\; e^ {M r_k} \left( 1 + \frac{M r_k}{2} \right) \rho _k \xi _k \lambda _k \\\le & {} e^{\frac{3 M r_k}{2}} \rho _k \xi _k \lambda _k \;{\mathop {\le }\limits ^{(4.13)}}\; e^{\frac{3 M r_k}{2}} \rho _k \left( 1 - \frac{\mu }{2 L} \right) ^k \lambda _0. \end{aligned}$$

It remains to show that

$$\begin{aligned} e^{\frac{3 M r_k}{2}} \rho _k\le & {} 1 - \frac{\mu }{2 L}. \end{aligned}$$
(4.20)

Note that

$$\begin{aligned}&\zeta _k {\mathop {=}\limits ^{\mathrm {def}}}\frac{3 M r_k}{2} \;{\mathop {\le }\limits ^{(4.15)}}\; \frac{3 M \xi _k \lambda _k}{2} \;{\mathop {\le }\limits ^{(4.13)}}\; \frac{3 M \lambda _0}{2} \nonumber \\&\quad {\mathop {\le }\limits ^{(4.10)}} \frac{3 \ln \frac{3}{2}}{8} \frac{\mu }{L} \;\le \; \frac{3 \mu }{16 L} \;\le \; \frac{\mu }{5 L} \;\le \; \frac{1}{5}. \end{aligned}$$
(4.21)

Hence,

$$\begin{aligned} e^{\zeta _k}\le & {} \sum \limits _{i=0}^{\infty } \zeta _k^i = 1 + \zeta _k \sum \limits _{i=0}^\infty \zeta _k^i \;{\mathop {\le }\limits ^{(4.21)}}\; 1 + \zeta _k \sum \limits _{i=0}^\infty \left( \frac{1}{5}\right) ^i \nonumber \\= & {} 1 + \frac{5 \zeta _k}{4} \;{\mathop {\le }\limits ^{(4.21)}}\; 1 + \frac{\mu }{4 L}. \end{aligned}$$
(4.22)

Also,

$$\begin{aligned} \xi _k' {\mathop {\le }\limits ^{(4.14)}} \sqrt{\frac{3}{2}} \;\le \; \frac{4}{3}. \end{aligned}$$
(4.23)

Combining (4.22) and (4.23), we obtain

$$\begin{aligned} e^{\frac{3 M r_k}{2}} \left( 1 - \frac{\mu }{\xi _k' L} \right)\le & {} \left( 1 + \frac{\mu }{4 L} \right) \left( 1 - \frac{3 \mu }{4 L} \right) \;\le \; 1 - \left( \frac{3}{4} - \frac{1}{4} \right) \frac{\mu }{L} \;=\; 1 - \frac{\mu }{2 L}, \end{aligned}$$

and

$$\begin{aligned} e^{\frac{3 M r_k}{2}} (\xi _k' - 1)\le & {} \left( 1 + \frac{1}{4} \right) \left( \sqrt{\frac{3}{2}} - 1 \right) \;=\; \frac{\frac{5}{4} \cdot \frac{1}{2}}{\sqrt{\frac{3}{2}} + 1} \;\le \; \frac{5}{16} \;\le \; \frac{1}{2} \;\le \; 1 - \frac{\mu }{2 L}. \end{aligned}$$

Thus,

$$\begin{aligned} e^{\frac{3 M r_k}{2}} \rho _k {\mathop {=}\limits ^{(4.19)}} e^{\frac{3 M r_k}{2}} \max \left\{ 1 - \frac{\mu }{\xi _k' L}, \, \xi _k' - 1 \right\} \;\le \; 1 - \frac{\mu }{2 L}, \end{aligned}$$

and (4.20) follows. \(\square \)
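The numerical constants used in the last part of this proof can be verified mechanically. The short check below evaluates the scalar inequalities from (4.21), (4.22), (4.23) and the two final estimates on a grid of values \(q = \mu / L \in (0, 1]\). The grid size and the dictionary keys are arbitrary illustration choices.

```python
import math

sqrt32 = math.sqrt(1.5)
checks = {
    "(4.21)": 3 * math.log(1.5) / 8 <= 3 / 16 <= 1 / 5,
    "(4.23)": sqrt32 <= 4 / 3,
    "second estimate": 1.25 * (sqrt32 - 1) <= 5 / 16 <= 1 / 2,
}

grid = [i / 1000 for i in range(1, 1001)]    # values of q = mu / L in (0, 1]
# (4.22): exp(zeta) <= 1 + 5 zeta / 4 for every 0 < zeta <= q / 5 <= 1 / 5.
checks["(4.22)"] = all(math.exp(q / 5) <= 1 + 1.25 * (q / 5) for q in grid)
# First estimate: (1 + q / 4) (1 - 3 q / 4) <= 1 - q / 2.
checks["first estimate"] = all((1 + q / 4) * (1 - 0.75 * q) <= 1 - q / 2 for q in grid)

print(checks)
```

A grid check is of course not a proof. It only confirms that the constants are consistent with each other.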

Now we are ready to prove the main result of this section on the superlinear convergence of the scheme (4.2). In contrast to the quadratic case, now we cannot use the proof, based on the trace potential function \(\sigma \), defined by (2.7), because we can no longer guarantee that \(J_k \preceq G_k\). However, the proof, based on the potential function \(\psi \), defined by (2.14), still works.

Theorem 4.2

Suppose that the initial point \(x_0\) in scheme (4.2) is chosen sufficiently close to the solution, as specified by (4.10). Then, for all \(k \ge 1\), we have

$$\begin{aligned} \lambda _f(x_k)\le & {} \frac{1}{\prod _{i=0}^{k-1} \left( \phi _i \frac{\mu }{L} + 1 - \phi _i \right) ^{1/2}} \left( \frac{11 n L}{\mu k} \right) ^{k/2} \lambda _f(x_0). \end{aligned}$$

Proof

Denote \(r_i {\mathop {=}\limits ^{\mathrm {def}}}\Vert u_i \Vert _{x_i}\), \(\theta _i {\mathop {=}\limits ^{\mathrm {def}}}\theta (J_i, G_i, u_i)\), \(\psi _i {\mathop {=}\limits ^{\mathrm {def}}}\psi (J_i, G_i)\), \(\tilde{\psi }_{i+1} {\mathop {=}\limits ^{\mathrm {def}}}\psi (J_i,G_{i+1})\), and \(p_i {\mathop {=}\limits ^{\mathrm {def}}}\phi _i \frac{\mu }{L} + 1 - \phi _i\) for any \(i \ge 0\). Let \(k \ge 1\) and \(0 \le i \le k - 1\) be arbitrary. By (4.12), (4.14) and Lemma 2.4, we have

$$\begin{aligned} \psi _i - \tilde{\psi }_{i+1}\ge & {} \phi _i \omega \left( \frac{2}{3} \sqrt{\frac{\mu }{L}} \theta _i \right) + (1-\phi _i) \omega \left( \sqrt{\frac{2}{3}} \theta _i \right) . \end{aligned}$$
(4.24)

Moreover, since

$$\begin{aligned} \theta _i^2 {\mathop {=}\limits ^{(2.6)}} \frac{\langle (G_i - J_i) J_i^ {-1} (G_i - J_i) u_i, u_i \rangle }{\langle G_i J_i^{-1} G_i u_i, u_i \rangle } \;=\; 1 - \frac{\langle (2 G_i - J_i) u_i, u_i \rangle }{\langle G_i J_i^{-1} G_i u_i, u_i \rangle } \;{\mathop {\le }\limits ^{(4.12)}}\; 1, \end{aligned}$$

we also have

$$\begin{aligned}&\omega \left( \frac{2}{3} \sqrt{\frac{\mu }{L}} \theta _i \right) {\mathop {\ge }\limits ^{(2.16)}} \frac{\frac{4}{9} \frac{\mu }{L} \theta _i^2}{2 \left( 1 + \frac{2}{3} \cdot \frac{2}{3} \sqrt{\frac{\mu }{L}} \theta _i \right) } \;\ge \; \frac{ \frac{4}{9}}{2 (1 + \frac{4}{9})} \frac{\mu }{L} \theta _i^2 \;=\; \frac{2}{13} \frac{\mu }{L} \theta _i^2 \;\ge \; \frac{1}{7} \frac{\mu }{L} \theta _i^2, \\&\omega \left( \sqrt{\frac{2}{3}} \theta _i \right) {\mathop {\ge }\limits ^{(2.16)}} \frac{\frac{2}{3} \theta _i^2}{2 \left( 1 + \sqrt{\frac{2}{3}} \theta _i \right) } \;\ge \; \frac{ \frac{2}{3}}{2 \left( 1 + \sqrt{\frac{2}{3}} \right) } \theta _i^2 \;\ge \; \frac{\frac{2}{3}}{4} \theta _i^2 \;=\; \frac{1}{6} \theta _i^2 \;\ge \; \frac{1}{7} \theta _i^2. \end{aligned}$$

Thus,

$$\begin{aligned} \frac{1}{7} p_i \theta _i^2 {\mathop {\le }\limits ^{(4.24)}} \psi _i - \tilde{\psi }_{i+1} \;=\; \psi _i - \psi _{i+1} + \varDelta _i, \end{aligned}$$
(4.25)

where

$$\begin{aligned} \varDelta _i {\mathop {=}\limits ^{\mathrm {def}}}\psi _{i+1} - \tilde{\psi }_{i+1} {\mathop {=}\limits ^{(2.14)}} \langle J_{i+1}^{-1} - J_i^{-1}, G_{i+1} \rangle + \ln \mathrm{Det}(J_i^{-1}, J_{i+1}). \end{aligned}$$
(4.26)

Let us estimate \(\sum _{i=0}^{k-1} \varDelta _i\) from above. Note that

$$\begin{aligned} J_{i+1} {\mathop {\succeq }\limits ^{(4.6)}} \frac{\nabla ^2 f(x_{i+1})}{1 + \frac{M r_{i+1}}{2}} \;{\mathop {\succeq }\limits ^{(4.7)}}\; \frac{1}{\delta _i} J_i, \end{aligned}$$
(4.27)

where

$$\begin{aligned} \delta _i {\mathop {=}\limits ^{\mathrm {def}}}\left( 1 + \frac{M r_{i+1}}{2} \right) \left( 1 + \frac{M r_i}{2} \right) . \end{aligned}$$
(4.28)

Hence,

$$\begin{aligned}&\langle J_{i+1}^{-1} - J_i^{-1}, G_{i+1} \rangle {\mathop {\le }\limits ^{(4.27)}} (1 - \delta _i^{-1}) \langle J_{i+1}^{-1}, G_{i+1} \rangle \\&\quad {\mathop {\le }\limits ^{(4.12)}} (1 - \delta _i^{-1}) \sqrt{\frac{3}{2}} \frac{L}{\mu } \langle J_{i+1}^{-1}, J_{i+1} \rangle \\&\quad {\mathop {=}\limits ^{(1.3)}} \sqrt{\frac{3}{2}} \frac{n L}{\mu } (1 - \delta _i^{-1}) \;{\mathop {\le }\limits ^{(4.23)}}\; \frac{4 n L}{3 \mu } (1 - \delta _i^{-1}), \end{aligned}$$

and

$$\begin{aligned}&\sum \limits _{i=0}^{k-1} \varDelta _i {\mathop {\le }\limits ^{(4.26)}} \frac{4 n L}{3 \mu } \sum \limits _{i=0}^ {k-1} (1 - \delta _i^{-1}) + \sum \limits _{i=0}^{k-1} \ln \mathrm{Det}(J_i^{-1}, J_{i+1}) \nonumber \\&\quad = \frac{4 n L}{3 \mu }\sum \limits _{i=0}^ {k-1}(1 - \delta _i^{-1}) + \ln \mathrm{Det}(J_0^{-1}, J_k). \end{aligned}$$
(4.29)

At the same time,

$$\begin{aligned} \sum \limits _{i=0}^{k-1} (1 - \delta _i^{-1})\le & {} \sum \limits _{i=0}^{k-1} \left( 1 - e^{-\frac{M (r_i + r_{i+1})}{2}} \right) \;\le \; \frac{M}{2} \sum \limits _{i=0}^{k-1} (r_i + r_{i+1}) \;\le \; M \sum \limits _{i=0}^k r_i \\&{\mathop {\le }\limits ^{(4.13)}} M \lambda _0 \sum \limits _{i=0}^{k} \left( 1 - \frac{\mu }{2 L} \right) ^i \;\le \; \frac{2 L}{\mu } M \lambda _0 \;{\mathop {\le }\limits ^{(4.10)}}\; \frac{\ln \frac{3}{2}}{2} \;\le \; \frac{1}{4}. \end{aligned}$$

Thus,

$$\begin{aligned} \sum \limits _{i=0}^{k-1} \varDelta _i {\mathop {\le }\limits ^{(4.29)}} \frac{n L}{3 \mu } + \ln \mathrm{Det}(J_0^{-1}, J_k). \end{aligned}$$
(4.30)

Summing up (4.25) and using the fact that \(\psi _k \ge 0\), we obtain

$$\begin{aligned}&\frac{1}{7} \sum \limits _{i=0}^{k-1} p_i \theta _i^2 {\mathop {\le }\limits ^{(4.25)}} \psi _0 - \psi _k + \sum \limits _{i=0}^{k-1} \varDelta _i \;\le \; \psi _0 + \sum \limits _{i=0}^{k-1} \varDelta _i \nonumber \\&\quad {\mathop {=}\limits ^{(4.2)}} \psi (J_0, L B) + \sum \limits _{i=0}^{k-1} \varDelta _i \nonumber \\&\quad {\mathop {=}\limits ^{(2.14)}} \langle J_0^{-1}, L B - J_0 \rangle - \ln \mathrm{Det}(J_0^{-1}, L B) + \sum \limits _{i=0}^{k-1} \varDelta _i \nonumber \\&\quad {\mathop {\le }\limits ^{(4.30)}} \langle J_0^{-1}, L B - J_0 \rangle + \frac{n L}{3 \mu } - \ln \mathrm{Det}(J_k^{-1}, L B) \nonumber \\&\quad {\mathop {\le }\limits ^{(4.3)}} \langle J_0^{-1}, \frac{L}{\mu } J_0 - J_0 \rangle + \frac{n L}{3 \mu } \nonumber \\&\quad {\mathop {=}\limits ^{(1.3)}} n \left( \frac{L}{\mu } - 1 \right) + \frac{n L}{3 \mu } \;\le \; \frac{4}{3} \frac{n L}{\mu }. \end{aligned}$$
(4.31)

Since \((1 + t)^p \le 1 + p t\) for all \(t \ge -1\) and \(0 \le p \le 1\), we further have

$$\begin{aligned} 1 + \frac{M r_i}{2}\le & {} e^{\frac{M r_i}{2}} \;{\mathop {\le }\limits ^{(4.13)}}\; e^{\frac{M \lambda _0}{2}} \;{\mathop {\le }\limits ^{(4.10)}}\; \left( \frac{3}{2} \right) ^{1/8} \\= & {} \sqrt{\left( \frac{3}{2} \right) ^{1/4}} \;\le \; \sqrt{1 + \frac{1}{4} \cdot \frac{1}{2}} \;=\; \sqrt{\frac{9}{8}}. \end{aligned}$$

Therefore, by Lemma 4.2 and the arithmetic-geometric mean inequality,

$$\begin{aligned} \lambda _f(x_k)\le & {} \lambda _f(x_0) \prod \limits _{i=0}^ {k-1} \left[ \sqrt{\frac{9}{8}} \theta _i \right] \;=\; \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left[ \left( \frac{9}{8} \right) ^k \prod \limits _{i=0}^{k-1} p_i \theta _i^2 \right] ^{1/2} \lambda _f(x_0) \\\le & {} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{9}{8} \cdot \frac{1}{k} \sum \limits _{i=0}^{k-1} p_i \theta _i^2 \right) ^{k/2} \lambda _f(x_0) \\&{\mathop {\le }\limits ^{(4.31)}} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{9}{8} \cdot 7 \cdot \frac{4}{3} \frac{n L}{\mu k} \right) ^{k/2} \lambda _f(x_0) \\= & {} \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{21 n L}{2 \mu k} \right) ^{k/2} \lambda _f(x_0) \;\le \; \frac{1}{\prod _{i=0}^{k-1} p_i^{1/2}} \left( \frac{11 n L}{\mu k} \right) ^ {k/2} \lambda _f(x_0). \end{aligned}$$

The proof is finished. \(\square \)
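The absolute constant 11 in Theorem 4.2 is the product of three factors that appear in the proof, and the arithmetic can be reproduced with exact rational numbers. The variable names below are ours and serve only as labels for the corresponding steps.

```python
from fractions import Fraction as F

coef_first = F(4, 9) / (2 * (1 + F(4, 9)))   # coefficient in the first lower bound on omega
coef_second = F(2, 3) / 4                    # coefficient in the second lower bound on omega
potential = F(4, 3)                          # right-hand side of (4.31) in units of n L / mu
step_factor = F(9, 8)                        # bound on (1 + M r_i / 2)^2

constant = step_factor * 7 * potential       # 7 comes from the factor 1/7 in (4.25)
print(coef_first, coef_second, constant)     # 2/13 1/6 21/2
```

Both coefficients are at least \(\frac{1}{7}\), which justifies (4.25), and the resulting constant \(\frac{21}{2}\) is rounded up to 11 in the statement of the theorem.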

5 Discussion

Let us compare the rates of superlinear convergence, that we have obtained for the classical quasi-Newton methods, with those of the greedy quasi-Newton methods [31]. For brevity, we discuss only the BFGS method. Moreover, since the complexity bounds for the general nonlinear case differ from those for the quadratic one only in some absolute constants (both for the classical and the greedy methods), we only consider the case, when the objective function f is quadratic.

As before, let n be the dimension of the problem, \(\mu \) be the strong convexity parameter, L be the Lipschitz constant of the gradient of f, and \(\lambda _f(x)\) be the local norm of the gradient of f at the point \(x \in \mathbb {E}\) (as defined by (3.4)). Also, let us introduce the following condition number to simplify our notation:

$$\begin{aligned} Q {\mathop {=}\limits ^{\mathrm {def}}}\frac{n L}{\mu } \;\ge \; 1. \end{aligned}$$
(5.1)

The greedy BFGS method [31] is essentially the classical BFGS algorithm (scheme (3.3) with \(\phi _k \equiv 0\)) with the only difference that, at each iteration, the update direction \(u_k\) is chosen greedily according to the following rule:

$$\begin{aligned} u_k {\mathop {=}\limits ^{\mathrm {def}}}\mathop {{{\,\mathrm{argmax}\,}}}\limits _{u \in \{e_1, \ldots , e_n\}} \frac{\langle G_k u, u \rangle }{\langle A u, u \rangle }, \end{aligned}$$

where \(e_1, \ldots , e_n\) is a basis in \(\mathbb {E}\), such that \(B^{-1} = \sum _{i=1}^n e_i e_i^*\). For this method, we have the following recurrence (see [31, Theorem 3.2]):

$$\begin{aligned} \lambda _f(x_{k+1})\le & {} \left( 1 - \frac{1}{Q} \right) ^k Q \lambda _f(x_k) \;\le \; e^{-\frac{k}{Q}} Q \lambda _f(x_k), \quad k \ge 0. \end{aligned}$$

Hence, its rate of superlinear convergence is described by the expression

$$\begin{aligned} \lambda _f(x_k)\le & {} \lambda _f(x_0) \prod \limits _{i=0}^{k-1} \left[ e^{-\frac{i}{Q}} Q \right] \;=\; e^{-\frac{k (k-1)}{2 Q}} Q^k \lambda _f(x_0) \;{\mathop {=}\limits ^{\mathrm {def}}}\; A_k, \quad k \ge 0. \end{aligned}$$
(5.2)
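The closed form in (5.2) is obtained by unrolling the one-step recurrence. A minimal numerical sketch of this telescoping, working with logarithms to avoid overflow, is given below; the function names and the value of \(Q\) are illustrative assumptions.

```python
import math

def log_greedy_bound_unrolled(k, Q):
    """Log of the product of the factors e^{-i/Q} * Q over i = 0, ..., k-1."""
    return sum(-i / Q + math.log(Q) for i in range(k))

def log_greedy_bound_closed(k, Q):
    """Log of the closed form e^{-k(k-1)/(2Q)} * Q^k from (5.2)."""
    return -k * (k - 1) / (2 * Q) + k * math.log(Q)

Q = 50.0  # illustrative value of the condition number
for k in (0, 1, 10, 200):
    print(k, log_greedy_bound_unrolled(k, Q), log_greedy_bound_closed(k, Q))
```

Both columns agree up to rounding, since \(\sum _{i=0}^{k-1} i = k (k-1) / 2\).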

Although the inequality (5.2) is valid for all \(k \ge 0\), it is useful only when

$$\begin{aligned} e^{-\frac{k (k-1)}{2 Q}} Q^k \le 1 \qquad \Longleftrightarrow \qquad k \ge 1 + 2 Q \ln Q. \end{aligned}$$
(5.3)

In other words, the relation (5.3) specifies the moment, starting from which it becomes meaningful to speak about the superlinear convergence of the greedy BFGS method.
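The equivalence in (5.3) can be illustrated numerically: the first integer \(k\) with \(k \ge 1 + 2 Q \ln Q\) is also the first \(k \ge 1\) at which the factor in front of \(\lambda _f(x_0)\) in (5.2) drops to at most one. The sketch below uses hypothetical helper names and sample values of \(Q\).

```python
import math

def greedy_start(Q):
    """Smallest integer k >= 1 with k >= 1 + 2 Q ln Q, cf. (5.3)."""
    return max(1, math.ceil(1 + 2 * Q * math.log(Q)))

def log_greedy_factor(k, Q):
    """Log of the factor e^{-k(k-1)/(2Q)} * Q^k in (5.2)."""
    return -k * (k - 1) / (2 * Q) + k * math.log(Q)

for Q in (2.0, 10.0, 37.5):
    k0 = greedy_start(Q)
    # The factor is still above one at k0 - 1 and at most one at k0.
    print(Q, k0, log_greedy_factor(k0 - 1, Q) > 0, log_greedy_factor(k0, Q) <= 0)
```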

For the classical BFGS method, we have the following bound (see (3.11)):

$$\begin{aligned} \lambda _f(x_k)\le & {} \left( \frac{Q}{k} \right) ^{k/2} \lambda _f(x_0) \;{\mathop {=}\limits ^{\mathrm {def}}}\; B_k, \quad k \ge 1, \end{aligned}$$

and the starting moment of its superlinear convergence is described as follows:

$$\begin{aligned} \left( \frac{Q}{k} \right) ^{k/2}\le & {} 1 \qquad \Longleftrightarrow \qquad k \ge Q. \end{aligned}$$
(5.4)

Comparing (5.3) and (5.4), we see that, for the standard BFGS, the superlinear convergence may start slightly earlier than for the greedy one. However, the difference is only in the logarithmic factor.
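To get a feeling for the size of this logarithmic factor, one can tabulate the right-hand sides of (5.3) and (5.4) for several values of \(Q\). The values of \(Q\) below are chosen only for illustration, and the function name is ours.

```python
import math

def start_moments(Q):
    """Right-hand sides of (5.3) (greedy BFGS) and (5.4) (classical BFGS)."""
    greedy = 1 + 2 * Q * math.log(Q)
    classical = Q
    return greedy, classical

for Q in (10.0, 100.0, 1000.0):
    g, c = start_moments(Q)
    # The ratio g / c equals 2 ln Q + 1 / Q.
    print(f"Q={Q:g}: greedy >= {g:.1f}, classical >= {c:.1f}, ratio = {g / c:.2f}")
```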

Nevertheless, let us show that, very soon after the superlinear convergence of the greedy BFGS begins, namely, after

$$\begin{aligned} K {\mathop {=}\limits ^{\mathrm {def}}}1 + 6 Q \ln (4 Q) \qquad ({\mathop {\ge }\limits ^{(5.1)}}\; 7) \end{aligned}$$
(5.5)

iterations, it will be significantly faster than the standard BFGS. Indeed,

$$\begin{aligned} \frac{A_k}{B_k}= & {} e^{-\frac{k (k-1)}{2 Q}} Q^k \left( \frac{k}{Q} \right) ^{k/2} \;=\; e^{-\frac{k (k-1)}{2 Q}} (Q k)^{k/2} \nonumber \\= & {} e^{-\frac{k (k-1)}{2 Q} + \frac{k}{2} \ln (Q k) } \;=\; e^{-\frac{k (k-1)}{2 Q} \left[ 1 - \frac{Q \ln (Q k)}{k-1} \right] } \end{aligned}$$
(5.6)

for all \(k \ge 2\). Note that the function \(t \mapsto \frac{\ln t}{t}\) is decreasing on \([e, +\infty )\) (since its logarithm \(\ln \ln t - \ln t\) is a decreasing function of \(u = \ln t\) for \(u \in [1, +\infty )\), which is easily verified by differentiation). Hence, for all \(k \ge K\), we have (using first that \(k \le 2 (k-1)\) since \(k \ge 2\))

$$\begin{aligned} \frac{Q \ln (Q k)}{k-1}\le & {} \frac{Q \ln (2 Q (k-1))}{k-1} \;\le \; \frac{Q \ln (2 Q (K-1))}{K-1} \;{\mathop {=}\limits ^{(5.5)}}\; \frac{\ln \left( 12 Q^2 \ln (4 Q) \right) }{6 \ln (4 Q)} \\\le & {} \frac{\ln (48 Q^3)}{6 \ln (4 Q)} \;\le \; \frac{\ln (64 Q^3)}{6 \ln (4 Q)} \;=\; \frac{3 \ln (4 Q)}{6 \ln (4 Q)} \;=\; \frac{1}{2}. \end{aligned}$$

Consequently, for all \(k \ge K\), we obtain

$$\begin{aligned} \frac{A_k}{B_k} {\mathop {\le }\limits ^{(5.6)}} e^{-\frac{k (k-1)}{4 Q}} \;\le \; 1. \end{aligned}$$
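This inequality can also be checked numerically on a finite range of iteration counters. The sketch below evaluates \(\ln (A_k / B_k)\) via (5.6) and compares it with \(-\frac{k (k-1)}{4 Q}\) for integer \(k \ge K\); the helper names, the sample values of \(Q\), and the length of the tested range are our assumptions.

```python
import math

def log_ratio(k, Q):
    """ln(A_k / B_k) according to (5.6)."""
    return -k * (k - 1) / (2 * Q) + 0.5 * k * math.log(Q * k)

def K_threshold(Q):
    """K = 1 + 6 Q ln(4 Q) from (5.5)."""
    return 1 + 6 * Q * math.log(4 * Q)

def bound_holds(Q, extra=2000):
    """Check ln(A_k / B_k) <= -k(k-1)/(4Q) for integers k in [ceil(K), ceil(K) + extra)."""
    k0 = math.ceil(K_threshold(Q))
    return all(
        log_ratio(k, Q) <= -k * (k - 1) / (4 * Q)
        for k in range(k0, k0 + extra)
    )

for Q in (1.0, 10.0, 250.0):
    print(Q, math.ceil(K_threshold(Q)), bound_holds(Q))
```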

Thus, after K iterations, the rate of superlinear convergence of the greedy BFGS is always better than that of the standard BFGS. Moreover, as \(k \rightarrow \infty \), the ratio \(A_k / B_k\) of the two bounds decays at least as fast as \(e^{-\frac{k (k-1)}{4 Q}}\). At the same time, the Hessian update of the greedy BFGS method is more expensive than that of the standard one.
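For concreteness, the greedy selection rule from the beginning of this section can be sketched as follows. This is only an illustration under simplifying assumptions, not the implementation from [31]: we take \(e_1, \ldots , e_n\) to be the standard coordinate basis (so that \(B\) is the identity), store matrices as nested lists, and use a hypothetical function name and hypothetical data.

```python
def greedy_direction_index(G, A):
    """Return the index i maximizing G[i][i] / A[i][i].

    Assumption: e_1, ..., e_n is the standard coordinate basis, so that
    <G e_i, e_i> = G[i][i] and <A e_i, e_i> = A[i][i]. The matrix A is
    assumed positive definite, hence A[i][i] > 0.
    """
    n = len(A)
    return max(range(n), key=lambda i: G[i][i] / A[i][i])

# Hypothetical 3x3 data: A is symmetric positive definite, G is a multiple of the identity.
A = [[2.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 4.0]]
G = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
print(greedy_direction_index(G, A))
```

Under these assumptions, the sketch reads only the diagonals of \(G_k\) and \(A\); with a general basis, the two quadratic forms would have to be evaluated for each \(e_i\).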