1 Introduction

In this paper, we focus on the non-asymptotic convergence analysis of quasi-Newton methods for the problem of minimizing a convex function \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\), i.e.,

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} f(x). \end{aligned}$$

Specifically, we focus on two different settings. In the first case, we assume that the objective function f is strongly convex, smooth (its gradient is Lipschitz continuous), and its Hessian is Lipschitz continuous at the optimal solution. In the second case, we study the setting where the objective function f is self-concordant. We formally define these settings in the following sections. In both considered cases, the optimal solution is unique and denoted by \(x_*\).

There is an extensive literature on the use of first-order methods for convex optimization, and it is well-known that the best achievable convergence rate for first-order methods, when the objective function is strongly convex and smooth, is linear. Specifically, we say a sequence \(\{x_k\}\) converges linearly if \(\Vert x_k - x_*\Vert \le C\gamma ^k\Vert x_0 - x_*\Vert \), where \(\gamma \in (0, 1)\) is the constant of linear convergence, and C is a constant possibly depending on problem parameters. Among first-order methods, the accelerated gradient method proposed in [1] achieves a fast linear convergence rate of \((1-\sqrt{{\mu }/{L}})^{k/2}\), where \(\mu \) is the strong convexity parameter and L is the smoothness parameter (the Lipschitz constant of the gradient) [2]. It is also known that the convergence rate of the accelerated gradient method is optimal for first-order methods in the regime where the problem dimension d is sufficiently large relative to the number of iterations [3].

Classical alternatives to improve the convergence rate of first-order methods are second-order methods [4,5,6,7] and in particular Newton’s method. It has been shown that if in addition to smoothness and strong convexity assumptions, the objective function f has a Lipschitz continuous Hessian, then the iterates generated by Newton’s method converge to the optimal solution at a quadratic rate in a local neighborhood of the optimal solution; see [8, Chapter 9]. A similar result has been established for the case that the objective function is self-concordant [9]. Despite the fact that the quadratic convergence rate of Newton’s method holds only in a local neighborhood of the optimal solution, it could reduce the overall number of iterations significantly as it is substantially faster than the linear rate of first-order methods. The fast quadratic convergence rate of Newton’s method, however, does not come for free. Implementation of Newton’s method requires solving a linear system at each iteration with the matrix defined by the objective function Hessian \(\nabla ^2 f(x)\). As a result, the computational cost of implementing Newton’s method in high-dimensional problems is prohibitive, as it could be \({\mathcal {O}}(d^3)\), unlike first-order methods that have a per iteration cost of \({\mathcal {O}}(d)\).

Quasi-Newton algorithms are quite popular since they serve as a middle ground between first-order methods and Newton-type algorithms. They improve the linear convergence rate of first-order methods and achieve a local superlinear rate, while their computational cost per iteration is \({\mathcal {O}}(d^2)\) instead of \({\mathcal {O}}(d^3)\) of Newton’s method. The main idea of quasi-Newton methods is to approximate the step of Newton’s method without computing the objective function Hessian \(\nabla ^2 f(x)\) or its inverse \(\nabla ^2 f(x)^{-1}\) at every iteration [10, Chapter 6]. To be more specific, quasi-Newton methods aim at approximating the curvature of the objective function by using only first-order information of the function, i.e., its gradients \(\nabla f(x)\); see Sect. 2 for more details. There are several different approaches for approximating the objective function Hessian and its inverse using first-order information, which leads to different quasi-Newton updates, but perhaps the most popular quasi-Newton algorithms are the Symmetric Rank-One (SR1) method [11], the Broyden method [12,13,14], the Davidon-Fletcher-Powell (DFP) method [15, 16], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [17,18,19,20], and the limited-memory BFGS (L-BFGS) method [21, 22].

As mentioned earlier, a major advantage of quasi-Newton methods is their asymptotic local superlinear convergence rate. More precisely, we say that the sequence \(\{x_k\}\) converges to the optimal solution \(x_*\) superlinearly when the ratio between the distances to the optimal solution at times \(k+1\) and k approaches zero as k approaches infinity, i.e.,

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{\Vert x_{k+1} - x_*\Vert }{\Vert x_k - x_*\Vert } =0. \end{aligned}$$

For various settings, this superlinear convergence result has been established for a large class of quasi-Newton methods, including the Broyden method [13, 17, 23], the DFP method [13, 24, 25], the BFGS method [13, 25,26,27], and several other variants of these algorithms [28,29,30,31,32,33,34]. Although this result is promising and lies between the linear rate of first-order methods and the quadratic rate of Newton’s method, it only holds asymptotically and does not characterize an explicit upper bound on the error of quasi-Newton methods after a finite number of iterations. As a result, the overall complexity of quasi-Newton methods for achieving an \(\epsilon \)-accurate solution, i.e., \(\Vert x_k - x_*\Vert \le \epsilon \), cannot be explicitly characterized. Hence, it is essential to establish a non-asymptotic convergence rate for quasi-Newton methods, which is the main goal of this paper.

In this paper, we show that if the initial iterate is close to the optimal solution and the initial Hessian approximation error is sufficiently small, then the iterates of the convex Broyden class including both the DFP and BFGS methods converge to the optimal solution at a superlinear rate of \((1/k)^{k/2}\). We further show that our theoretical result suggests a trade-off between the size of the superlinear convergence neighborhood and the rate of superlinear convergence. In other words, one can improve the numerical constant in the above rate at the cost of reducing the radius of the neighborhood in which DFP and BFGS converge superlinearly. We believe that our theoretical guarantee provides one of the first non-asymptotic results for the superlinear convergence rate of BFGS and DFP.

Related work In a recent work [35], the authors presented a non-asymptotic analysis of a class of greedy quasi-Newton methods that are based on the update rule of the Broyden family and use greedily selected basis vectors for updating Hessian approximations. In particular, they show a superlinear convergence rate of \((1-\frac{\mu }{dL})^{k^2/2}(\frac{dL}{\mu })^k\) for this class of algorithms. However, greedy quasi-Newton methods are more computationally costly than standard quasi-Newton methods, as they require computing a greedily selected basis vector at each iteration. It is worth noting that such computation requires access to additional information beyond the objective function gradient, e.g., the diagonal components of the Hessian. Also, two recent concurrent papers study the non-asymptotic superlinear convergence rate of the DFP and BFGS methods [36, 37]. In [36], the authors show that when the objective function is smooth, strongly convex, and strongly self-concordant, the iterates of BFGS and DFP, in a local neighborhood of the optimal solution, achieve the superlinear convergence rates of \((\frac{dL}{\mu k})^{k/2}\) and \((\frac{dL^2}{\mu ^2 k})^{k/2}\), respectively. In their follow-up paper [37], they improve the superlinear convergence results to \([e^{\frac{d}{k}\ln {\frac{L}{\mu }}} - 1]^{k/2}\) and \([\frac{L}{\mu }(e^{\frac{d}{k}\ln {\frac{L}{\mu }}} - 1)]^{k/2}\), respectively. We would like to highlight that the proof techniques, assumptions, and final theoretical results of [36, 37] and our paper are different and derived independently. The major difference in the analysis is that in [36, 37], the authors use a potential function related to the trace and the logarithm of the determinant of the Hessian approximation matrix, while we use a Frobenius norm potential function. In addition, our convergence rates for both DFP and BFGS are independent of the problem dimension d. Nevertheless, in our results, the neighborhood of superlinear convergence depends on d. Moreover, to derive our results we consider two settings: in the first, the objective function is strongly convex, smooth, and has a Lipschitz continuous Hessian at the optimal solution, and in the second, the function is self-concordant. Both of these settings are more general than the setting in [36, 37], which requires the objective function to be strongly convex, smooth, and strongly self-concordant.

Outline In Sect. 2, we discuss the Broyden class of quasi-Newton methods, DFP and BFGS. In Sect. 3, we present our assumptions and notation, as well as some general technical lemmas. Then, in Sect. 4, we present the main theoretical results of our paper on the non-asymptotic superlinear convergence of DFP and BFGS for the setting that the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Sect. 5, we extend our theoretical results to the class of self-concordant functions by exploiting the proof techniques developed in Sect. 4. In Sect. 6, we provide a detailed discussion on the advantages and drawbacks of our theoretical results and compare them with some concurrent works. In Sect. 7, we numerically evaluate the performance of DFP and BFGS on several datasets and compare their convergence rates with our theoretical bounds. Finally, in Sect. 8, we close the paper with some concluding remarks.

Notation For a vector \(v\in {\mathbb {R}}^{d}\), its Euclidean norm (\(\ell _2\)-norm) is denoted by \(\Vert v\Vert \). We denote the Frobenius norm of a matrix \(A \in {\mathbb {R}}^{d \times d}\) as \(\Vert A\Vert _F = \sqrt{\sum _{i = 1}^d\sum _{j = 1}^d A_{ij}^2}\) and its induced 2-norm is denoted by \(\Vert A\Vert = \max _{\Vert v\Vert =1}\Vert Av\Vert \). The trace of a matrix A, which is the sum of its diagonal elements, is denoted by \(\mathrm {Tr}\left( A\right) \). For any two symmetric matrices \(A, B \in {\mathbb {R}}^{d \times d}\), we write \(A \preceq B\) if and only if \(B - A\) is a symmetric positive semidefinite matrix.

2 Quasi-Newton methods

In this section, we review standard quasi-Newton methods, and, in particular, we discuss the updates of the DFP and BFGS algorithms. Consider a time index k, a step size \(\eta _k\), and a positive-definite matrix \(B_k\) to define a generic descent algorithm through the iteration

$$\begin{aligned} x_{k+1} = x_k - \eta _k B_k^{-1}\nabla f(x_k). \end{aligned}$$
(1)

Note that if we simply replace \(B_k\) by the identity matrix I, we recover the update of gradient descent, and if we replace it by the objective function Hessian \(\nabla ^2 f(x_k)\), we obtain the update of Newton’s method. The main goal of quasi-Newton methods is to find a symmetric positive-definite matrix \(B_k\) using only first-order information such that \(B_k\) is close to the Hessian \(\nabla ^2 f(x_k)\). Note that the step size \(\eta _k\) is often computed according to a line search routine for the global convergence of quasi-Newton methods. Our focus in this paper, however, is on the local convergence of quasi-Newton methods, which requires the unit step size \(\eta _k=1\). Hence, in the rest of the paper, we assume that the iterate \(x_k\) is sufficiently close to the optimal solution \(x_*\) and the step size is \(\eta _k = 1\).
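To make the role of \(B_k\) concrete, the following Python snippet sketches one step of the generic iteration (1) with unit step size; the quadratic objective and the numerical values are hypothetical and only serve to contrast the choices \(B_k = I\) (gradient descent) and \(B_k = \nabla ^2 f(x_k)\) (Newton's method).

```python
import numpy as np

def descent_step(x, grad, B):
    """One step of the generic iteration (1) with unit step size:
    x_{k+1} = x_k - B_k^{-1} grad f(x_k)."""
    # Solve B d = grad rather than forming B^{-1} explicitly.
    d = np.linalg.solve(B, grad)
    return x - d

# Hypothetical quadratic f(x) = 0.5 x^T A x, whose minimizer is x_* = 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, -1.0])
g = A @ x
print(descent_step(x, g, np.eye(2)))  # B_k = I: a gradient descent step
print(descent_step(x, g, A))          # B_k = Hessian: the Newton step, lands at x_* = 0
```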

In most quasi-Newton methods, the function’s curvature is approximated in a way that satisfies the secant condition. To better explain this property, let us first define the variable difference \(s_k\) and gradient difference \(y_k\) as

$$\begin{aligned} s_k = x_{k+1} - x_k, \quad \text {and} \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k). \end{aligned}$$
(2)

The goal is to find a matrix \(B_{k+1}\) that satisfies the secant condition \( B_{k+1} s_k = y_k\). The rationale for satisfying the secant condition is that the Hessian \(\nabla ^2 f(x_k)\) approximately satisfies this condition when \(x_{k+1}\) and \(x_k\) are close to each other, e.g., they are both close to the optimal solution \(x_*\). However, the secant condition alone is not sufficient to specify \(B_{k+1}\). To resolve this indeterminacy, different quasi-Newton algorithms consider different additional conditions. One common constraint is to enforce the Hessian approximation (or its inverse) at time \(k+1\) to be close to the one computed at time k. This is a reasonable extra condition as we expect the Hessian (or its inverse) evaluated at \(x_{k+1}\) to be close to the one computed at \(x_k\).
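To see why the secant condition is natural, note that for a quadratic objective with constant Hessian A, the gradient difference satisfies \(y_k = A s_k\) exactly, so the true Hessian itself satisfies the secant condition. The short Python check below (with hypothetical data) verifies this.

```python
import numpy as np

# For a quadratic f(x) = 0.5 x^T A x + b^T x the secant condition is exact:
# y_k = grad f(x_{k+1}) - grad f(x_k) = A (x_{k+1} - x_k) = A s_k.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
x_k, x_next = np.array([0.5, 0.0]), np.array([0.2, 0.4])

s_k = x_next - x_k
y_k = (A @ x_next + b) - (A @ x_k + b)
print(np.allclose(y_k, A @ s_k))  # True: B_{k+1} = A satisfies B_{k+1} s_k = y_k
```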

In the DFP method, we enforce the proximity condition on Hessian approximations \(B_k\) and \(B_{k+1}\). Basically, we aim to find the closest positive-definite matrix to \(B_k\) (in some weighted matrix norm) that satisfies the secant condition; see Chapter 6 of [10] for more details. The update of the Hessian approximation matrices of DFP is given by

$$\begin{aligned} B^{\text {DFP}}_{k+1} = \left( I-\frac{y_k s_k^\top }{y_k^\top s_k}\right) B_k \left( I-\frac{ s_ky_k^\top }{s_k^\top y_k}\right) +\frac{y_k y_k^\top }{y_k^\top s_k}. \end{aligned}$$
(3)

Since implementation of the update in (1) requires access to the inverse of the Hessian approximation, it is essential to derive an explicit update for the Hessian inverse approximation to avoid the cost of inverting a matrix at each iteration. If we define \(H_k\) as the inverse of \(B_k\), i.e., \(H_k=B_k^{-1}\), using the Sherman-Morrison-Woodbury formula, one can write

$$\begin{aligned} H^{\text {DFP}}_{k+1} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top }{s_k^\top y_k}. \end{aligned}$$
(4)
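As a sanity check of this duality, the following Python snippet (with randomly generated data, not taken from the paper) verifies numerically that the inverse update (4) produces exactly the inverse of the matrix generated by (3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite Hessian approximation and a curvature pair.
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)
H = np.linalg.inv(B)
s = rng.standard_normal(4)
y = s + 0.1 * rng.standard_normal(4)
assert s @ y > 0  # curvature condition

# DFP update of the Hessian approximation, Eq. (3).
I = np.eye(4)
P = I - np.outer(y, s) / (y @ s)
B_dfp = P @ B @ P.T + np.outer(y, y) / (y @ s)

# DFP update of the inverse approximation, Eq. (4).
H_dfp = H - (H @ np.outer(y, y) @ H) / (y @ H @ y) + np.outer(s, s) / (s @ y)

print(np.allclose(H_dfp, np.linalg.inv(B_dfp)))  # True: (4) is the inverse of (3)
```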

The BFGS method can be considered as the dual of DFP. In BFGS, we also seek a positive-definite matrix that satisfies the secant condition, but instead of forcing the proximity condition on the Hessian approximation B, we enforce it on the Hessian inverse approximation H. To be more precise, we aim to find a positive-definite matrix \(H_{k+1}\) that satisfies the secant condition \( s_k = H_{k+1} y_k\) and is the closest matrix (in some weighted norm) to the previous Hessian inverse approximation \(H_k\). The update of the Hessian inverse approximation matrices of BFGS is given by,

$$\begin{aligned} H^{\text {BFGS}}_{k+1}= \left( I-\frac{s_k y_k^\top }{y_k^\top s_k}\right) H_k \left( I-\frac{ y_k s_k^\top }{s_k^\top y_k}\right) +\frac{s_k s_k^\top }{y_k^\top s_k}. \end{aligned}$$
(5)

Similarly, by the Sherman-Morrison-Woodbury formula, the update of the BFGS method for the Hessian approximation matrices is given by,

$$\begin{aligned} B^{\text {BFGS}}_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top }{s_k^\top y_k}. \end{aligned}$$
(6)

Note that both DFP and BFGS belong to a more general class of quasi-Newton methods called the Broyden class. The Hessian approximation \(B_{k+1}\) of the Broyden class is defined as

$$\begin{aligned} B_{k+1} = \phi _k B^{\text {DFP}}_{k+1} + (1 - \phi _k) B^{\text {BFGS}}_{k+1}, \end{aligned}$$
(7)

and the Hessian inverse approximation is defined as

$$\begin{aligned} H_{k+1} = (1 - \psi _k) H^{\text {DFP}}_{k+1} + \psi _k H^{\text {BFGS}}_{k+1}, \end{aligned}$$
(8)

where \(\phi _k, \psi _k \in {\mathbb {R}}\). In this paper, we only focus on the convex class of Broyden quasi-Newton methods, where \(\phi _k, \psi _k \in [0, 1]\). The steps of this class of methods are summarized in Algorithm 1. In fact, in Algorithm 1, if we set \(\psi _k = 0\), we recover DFP, and if we set \(\psi _k = 1\), we recover BFGS. It is worth noting that the cost of computing the descent direction \(H_k \nabla f(x_k)\) for this class of quasi-Newton methods is \({\mathcal {O}}(d^2)\), which improves upon the \({\mathcal {O}}(d^3)\) per-iteration cost of Newton’s method.

[Algorithm 1: The convex Broyden class of quasi-Newton methods]

Remark 1

Note that when \(s_k = 0\), we have \(\nabla {f(x_k)} = 0\) from (1) and thus \(x_k = x_*\). Hence, in our implementation and analysis we assume \(s_k \ne 0\). Moreover, in both considered settings, the objective function is at least strictly convex. As a result, if \(s_k \ne 0\), then it follows that \(y_k \ne 0\) and \(s_k^\top y_k > 0\). This observation shows that the updates of BFGS and DFP are well-defined. Finally, it is well-known that for the convex class of Broyden methods if \(B_{k}\) is symmetric positive-definite and \(s_k^\top y_k > 0\), then \(B_{k+1}\) is also symmetric positive-definite [10]. In Algorithm 1, we assume that the initial Hessian approximation \(B_0\) is symmetric positive-definite, and, hence, all Hessian approximation matrices \(B_k\) and their inverse matrices \(H_k\) are symmetric positive-definite.
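For concreteness, the following Python sketch illustrates the convex Broyden class described above with the unit step size \(\eta _k = 1\) used in our local analysis; the quadratic test problem in the usage example is hypothetical. Setting psi = 1 recovers BFGS and psi = 0 recovers DFP.

```python
import numpy as np

def convex_broyden(grad, x0, H0, psi=1.0, iters=50, tol=1e-10):
    """A minimal sketch of the convex Broyden class with unit step size.
    psi = 1 recovers BFGS, psi = 0 recovers DFP (inverse update (8))."""
    x, H = np.array(x0, dtype=float), np.array(H0, dtype=float)
    I = np.eye(len(x))
    g = grad(x)
    for _ in range(iters):
        x_new = x - H @ g                       # x_{k+1} = x_k - H_k grad f(x_k), cf. (1)
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g             # variable and gradient differences, cf. (2)
        sy = s @ y
        if np.linalg.norm(s) < tol or sy <= 0:  # cf. Remark 1: s_k = 0 means we are done
            return x_new
        # DFP update of the inverse approximation, cf. (4).
        H_dfp = H - (H @ np.outer(y, y) @ H) / (y @ H @ y) + np.outer(s, s) / sy
        # BFGS update of the inverse approximation, cf. (5).
        V = I - np.outer(s, y) / sy
        H_bfgs = V @ H @ V.T + np.outer(s, s) / sy
        H = (1 - psi) * H_dfp + psi * H_bfgs    # convex combination, cf. (8)
        x, g = x_new, g_new
    return x

# Hypothetical usage on a well-conditioned quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[1.0, 0.2], [0.2, 0.7]])
b = np.array([1.0, -1.0])
x_hat = convex_broyden(lambda x: A @ x - b, x0=np.zeros(2), H0=np.eye(2), psi=1.0)
print(np.linalg.norm(A @ x_hat - b))  # gradient norm at the returned iterate
```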

3 Preliminaries

In this section, we first specify the required assumptions for our results in Sect. 4 and introduce some notation to simplify our expressions. Moreover, we present some intermediate lemmas that will be used later in Sect. 4 to prove our main theoretical results for the setting that the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Sect. 5, we will use a subset of these intermediate results to extend our analysis to the class of self-concordant functions.

3.1 Assumptions

We next state the required assumptions for establishing our theoretical results in Sect. 4.

Assumption 3.1

The objective function f(x) is twice-differentiable. Moreover, the function f(x) is strongly convex with parameter \(\mu > 0\), i.e.,

$$\begin{aligned} \Vert \nabla {f(x)} - \nabla {f(y)}\Vert \ge \mu \Vert x - y\Vert , \quad \forall x, y \in {\mathbb {R}}^{d}. \end{aligned}$$
(9)

Assumption 3.2

The gradient of the objective function f(x) is Lipschitz continuous with parameter \(L > 0\), i.e.,

$$\begin{aligned} \Vert \nabla {f(x)} - \nabla {f(y)}\Vert \le L\Vert x - y\Vert , \quad \forall x, y \in {\mathbb {R}}^{d}. \end{aligned}$$
(10)

As f is twice-differentiable, Assumptions 3.1 and 3.2 imply that the eigenvalues of the Hessian are bounded below by \(\mu \) and above by L, i.e., \(\mu I \preceq \nabla ^{2}{f(x)} \preceq LI, \forall x \in {\mathbb {R}}^{d}\). Note that for our main theoretical results, we only require Assumption 3.1, but to compare our results with other theoretical bounds we will use the condition in Assumption 3.2 in our discussions.

Assumption 3.3

The Hessian \(\nabla ^2 f(x)\) satisfies the following condition for some constant \(M\ge 0\),

$$\begin{aligned} \Vert \nabla ^{2}{f(x)} - \nabla ^{2}{f(x_*)}\Vert \le M\Vert x - x_{*}\Vert , \quad \forall x \in {\mathbb {R}}^{d}. \end{aligned}$$
(11)

The condition in Assumption 3.3 is common for analyzing second-order methods as we require a regularity condition on the objective function Hessian. In fact, Assumption 3.3 is one of the least strict conditions required for the analysis of second-order type methods as it requires Lipschitz continuity of the Hessian only at (near) the optimal solution. This condition is, indeed, weaker than assuming that the Hessian is Lipschitz continuous everywhere. Note that for the class of strongly convex and smooth functions, the strong self-concordance assumption required in [36, 37] is equivalent to assuming that the Hessian is Lipschitz continuous everywhere. Hence, the condition in Assumption 3.3 is also weaker than the one in [36, 37]. Assumption 3.3 leads to the following corollary.

Corollary 1

If the condition in Assumption 3.3 holds, then for all \( x, y \in {\mathbb {R}}^{d}\), we have

$$\begin{aligned} \Vert \nabla {f(x)} - \nabla {f(y)} - \nabla ^{2}{f(x_*)}(x - y)\Vert \le \frac{M}{2}\Vert x - y\Vert (\Vert x - x_{*}\Vert + \Vert y - x_{*}\Vert ).\nonumber \\ \end{aligned}$$
(12)

Proof

Check Appendix A. \(\square \)

Remark 2

Our analysis can be extended to the case that Assumptions 3.1, 3.2, and 3.3 only hold in a local neighborhood of the optimal solution \(x_{*}\). Here, we assume they hold in \({\mathbb {R}}^{d}\) to simplify our proofs.

3.2 Notations

Next, we briefly mention some of the definitions and notations that will be used in the following theorems and proofs. We consider \(\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) as the square roots of the matrices \(\nabla ^{2}f(x_*)\) and \(\nabla ^{2}f(x_*)^{-1}\), i.e., \(\nabla ^{2}f(x_*) = \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-1} = \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). By Assumption 3.1, both \(\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) are symmetric positive-definite. Throughout the paper, we analyze and study a weighted version of the Hessian approximation \({\hat{B}}_k\) defined as

$$\begin{aligned} {\hat{B}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}B_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}. \end{aligned}$$
(13)

\({\hat{B}}_k\) is symmetric positive-definite, since \(B_k\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) are both symmetric positive-definite. We also use \(\Vert {\hat{B}}_k - I\Vert _F\) as the measure of closeness between \(B_k\) and \(\nabla ^{2}f(x_*)\), which can be written as

$$\begin{aligned} \Vert {\hat{B}}_k - I\Vert _F= \Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\left( B_k - \nabla ^{2}f(x_*)\right) \nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert _F. \end{aligned}$$
(14)

We further define the weighted gradient difference \({\hat{y}}_k\), the weighted variable difference \({\hat{s}}_k\), and the weighted gradient \(\widehat{\nabla {f}}(x_k)\) as

$$\begin{aligned} {\hat{y}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}y_k, \qquad {\hat{s}}_k = \nabla ^{2}f(x_*)^{\frac{1}{2}}s_k, \qquad \widehat{\nabla {f}}(x_k) = \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla {f(x_k)}. \end{aligned}$$
(15)

To measure closeness to the optimal solution for iterate \(x_k\), we use \(r_k\in {\mathbb {R}}^d\), \(\sigma _k \in {\mathbb {R}}\), and \(\tau _k \in {\mathbb {R}}\) which are formally defined as

$$\begin{aligned} r_k = \nabla ^{2}f(x_*)^{\frac{1}{2}}(x_k - x_*), \qquad \sigma _k = \frac{M}{\mu ^{\frac{3}{2}}}\Vert r_k\Vert , \qquad \tau _k = \max \{\sigma _k, \sigma _{k+1}\}. \end{aligned}$$
(16)

In (16), \(\mu \) is the strong convexity parameter defined in Assumption 3.1 and M is the Lipschitz continuity parameter of the Hessian at the optimal solution defined in Assumption 3.3. In our analysis, we also use the average Hessian \(J_k\) and its weighted version \({\hat{J}}_k\) that are formally defined as

$$\begin{aligned} J_k = \int _{0}^{1}\nabla ^2{f(x_* + \alpha (x_k - x_*))}d\alpha , \qquad {\hat{J}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}J_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}. \end{aligned}$$
(17)
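For concreteness, the following Python sketch shows how the weighted quantities in (13), (14), and (16) can be computed from a given Hessian at the optimum; the function name and its inputs are hypothetical and only serve to make the weighting by \(\nabla ^{2}f(x_*)^{\pm \frac{1}{2}}\) explicit.

```python
import numpy as np

def weighted_quantities(B_k, x_k, x_star, hess_star, M, mu):
    """A sketch of the weighted quantities in (13), (14), and (16),
    assuming hess_star = Hessian of f at x_star (all inputs hypothetical)."""
    # Symmetric square root of hess_star via its eigendecomposition.
    w, U = np.linalg.eigh(hess_star)
    S = U @ np.diag(np.sqrt(w)) @ U.T            # Hessian^{1/2}
    S_inv = U @ np.diag(1.0 / np.sqrt(w)) @ U.T  # Hessian^{-1/2}
    B_hat = S_inv @ B_k @ S_inv                  # Eq. (13)
    potential = np.linalg.norm(B_hat - np.eye(len(x_k)), 'fro')  # Eq. (14)
    r_k = S @ (x_k - x_star)                     # Eq. (16)
    sigma_k = (M / mu**1.5) * np.linalg.norm(r_k)
    return B_hat, potential, r_k, sigma_k
```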

3.3 Intermediate Lemmas

Next, we present some lemmas that we will later use to establish the non-asymptotic superlinear convergence of DFP and BFGS. Proofs of these lemmas are relegated to the appendix.

Lemma 1

For any matrix \(A \in {\mathbb {R}}^{d \times d}\) and vector \(u \in {\mathbb {R}}^{d}\) with \(\Vert u\Vert = 1\), we have

$$\begin{aligned} \Vert A\Vert ^2_F - \Vert (I - uu^\top )A(I - uu^\top )\Vert ^2_F \ge \Vert Au\Vert ^2. \end{aligned}$$
(18)

Proof

Check Appendix B. \(\square \)
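As a quick numerical sanity check of (18) (not a substitute for the proof in Appendix B), one can evaluate both sides for a random matrix A and a random unit vector u, as in the following hypothetical snippet.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

P = np.eye(d) - np.outer(u, u)  # projection onto the orthogonal complement of u
lhs = np.linalg.norm(A, 'fro')**2 - np.linalg.norm(P @ A @ P, 'fro')**2
print(lhs >= np.linalg.norm(A @ u)**2 - 1e-12)  # True, consistent with (18)
```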

Lemma 2

For any matrices \(A, B \in {\mathbb {R}}^{d \times d}\), we have

$$\begin{aligned} \Vert AB\Vert _F \le \Vert A\Vert \Vert B\Vert _F,\qquad \Vert B^\top AB\Vert _F \le \Vert A\Vert \Vert B\Vert \Vert B\Vert _F. \end{aligned}$$
(19)

Proof

Check Appendix C. \(\square \)

The results in Lemma 1 and Lemma 2 hold for arbitrary matrices. The next lemma focuses on some properties of the weighted average Hessian \({\hat{J}}_k\) under Assumptions 3.1 and 3.3.

Lemma 3

Recall the definition of \(\sigma _k\) in (16) and \({\hat{J}}_k\) in (17). Suppose \(\alpha _k \in [0, 1]\) and define the matrix \(H_k = \nabla ^2{f(x_* + \alpha _k(x_k - x_*))}\) and \({\hat{H}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). If Assumptions 3.1 and 3.3 hold, then the following inequalities hold for all \(k \ge 0\),

$$\begin{aligned} \frac{1}{1 + \frac{\sigma _k}{2}}I \preceq {\hat{J}}_k \preceq (1 + \frac{\sigma _k}{2})I, \qquad \frac{1}{1 + \sigma _k}I \preceq {\hat{H}}_k \preceq (1 + \sigma _k)I. \end{aligned}$$
(20)

Proof

Check Appendix D. \(\square \)

In the following lemma, we establish some bounds that depend on the weighted gradient difference \({\hat{y}}_k\) and the weighted variable difference \({\hat{s}}_k\).

Lemma 4

Recall the definitions in (13)–(16). If Assumptions 3.1 and 3.3 hold, then the following inequalities hold for all \(k \ge 0\),

$$\begin{aligned} \Vert {\hat{y}}_k - {\hat{s}}_k\Vert\le & {} \tau _k\Vert {\hat{s}}_k\Vert , \end{aligned}$$
(21)
$$\begin{aligned} (1 - \tau _k)\Vert {\hat{s}}_k\Vert ^2\le & {} {\hat{s}}_k^\top {\hat{y}}_k \le (1 + \tau _k)\Vert {\hat{s}}_k\Vert ^2, \end{aligned}$$
(22)
$$\begin{aligned} (1 - \tau _k)\Vert {\hat{s}}_k\Vert\le & {} \Vert {\hat{y}}_k\Vert \le (1 + \tau _k)\Vert {\hat{s}}_k\Vert , \end{aligned}$$
(23)
$$\begin{aligned} \Vert \widehat{\nabla {f}}(x_k) - r_k\Vert\le & {} \frac{\sigma _k}{2}\Vert r_k\Vert . \end{aligned}$$
(24)

Proof

Check Appendix E. \(\square \)

4 Main theoretical results

In this section, we characterize the non-asymptotic superlinear convergence of the Broyden class of quasi-Newton methods, when Assumptions 3.1, 3.2, and 3.3 hold. In Sect. 4.1, we first establish a crucial proposition which characterizes the error of Hessian approximation for this class of quasi-Newton methods. Then, in Sect. 4.2, we leverage this result to show that the iterates of this class of algorithms converge at least linearly to the optimal solution, if the initial distance to the optimal solution and the initial Hessian approximation error are sufficiently small. Finally, we use these intermediate results in Sect. 4.3 to prove that the iterates of the convex Broyden class, including both DFP and BFGS, converge to the optimal solution at a superlinear rate of \((1/k)^{k/2}\). Note that in Algorithm 1 we use the Hessian inverse approximation matrix \(H_k\) to describe the algorithm, but in our analysis we will study the behavior of the Hessian approximation matrix \(B_k\).

4.1 Hessian approximation error: Frobenius norm potential function

Next, we use the Frobenius norm of the Hessian approximation error \(\Vert {\hat{B}}_{k} - I\Vert _F\) as the potential function in our analysis. Specifically, we will use the results of Lemma 1, Lemma 2, and Lemma 4 to study the dynamics of the Hessian approximation error \(\Vert {\hat{B}}_{k} - I\Vert _F\) for both DFP and BFGS. We first start with the DFP method.

Lemma 5

Consider the update of DFP in (3) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B^{\text {DFP}}_{k+1}\) generated by the DFP update satisfies the following inequality

$$\begin{aligned} \Vert {\hat{B}}^{\text {DFP}}_{k+1} - I\Vert _F \le \Vert {\hat{B}}_k - I\Vert _{F} - \frac{\Vert ({\hat{B}}_k - I){\hat{s}}_k\Vert ^2}{2\delta \Vert {\hat{s}}_k\Vert ^2} + W_k\tau _k, \end{aligned}$$
(25)

where \(W_k = \Vert {\hat{B}}_k\Vert \frac{4}{(1 - \tau _k)^2} + \frac{3 + \tau _k}{1 - \tau _k}\).

Proof

The proof and conclusion of this lemma are similar to the ones in Lemma 3.2 in [33], except for the value of the parameter \(W_k\). This difference comes from the fact that [33] analyzed the modified DFP update, while we consider the standard DFP method. Recall the DFP update in (3) and multiply both sides of that expression by the matrix \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) from the left and the right to obtain

$$\begin{aligned} {\hat{B}}^{\text {DFP}}_{k+1} = \left( I - \frac{{\hat{y}}_k{\hat{s}}_k^\top }{{\hat{y}}_k^\top {\hat{s}}_k}\right) {\hat{B}}_{k}\left( I - \frac{{\hat{s}}_k{\hat{y}}_k^\top }{{\hat{s}}_k^\top {\hat{y}}_k}\right) +\frac{{\hat{y}}_k{\hat{y}}_k^\top }{{\hat{y}}_k^\top {\hat{s}}_k}, \end{aligned}$$
(26)

where we used the fact that \(s_k^\top y_k = s_k^\top \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}y_k = {\hat{s}}_k^\top {\hat{y}}_k\). To simplify the proof, we use the following notations:

$$\begin{aligned}&B = {\hat{B}}_{k}, \quad B_+ = {\hat{B}}^{\text {DFP}}_{k+1}, \quad s = {\hat{s}}_k, \quad y = {\hat{y}}_k, \quad \tau = \tau _k, \quad P = I - \frac{ss^\top }{\Vert s\Vert ^2},\nonumber \\&\quad Q = \frac{ss^\top }{\Vert s\Vert ^2} - \frac{sy^\top }{s^\top y}. \end{aligned}$$
(27)

Hence, (26) is equivalent to

$$\begin{aligned} B_+ = \left( I - \frac{ys^\top }{s^\top y}\right) B\left( I - \frac{sy^\top }{s^\top y}\right) + \frac{yy^\top }{s^\top y}. \end{aligned}$$

Moreover, we can express \( B_+ - I\) as

$$\begin{aligned} B_+ - I&= (P + Q^\top )B(P + Q) - I + \frac{yy^\top }{s^\top y}\nonumber \\&= PBP + Q^\top BP + PBQ + Q^\top BQ - I + \frac{yy^\top }{s^\top y}\nonumber \\&= P(B - I)P + P^2 - I + \frac{yy^\top }{s^\top y} + Q^\top BP + PBQ + Q^\top BQ. \end{aligned}$$
(28)

Notice that \(P^2 = P\) and \(P = P^\top \). Thus, (28) can be simplified as

$$\begin{aligned} B_+ - I = D + E + G^\top + G + H, \end{aligned}$$

where

$$\begin{aligned} D = P(B - I)P, \quad E = \frac{yy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2}, \quad G = PBQ, \quad H = Q^\top BQ. \end{aligned}$$

Next, we proceed to upper bound \( \Vert B_+ - I \Vert _F\). To do so, we derive upper bounds on the Frobenius norms of the matrices D, E, G and H. We start with \(\Vert D\Vert _F\). If we set \(u=s/\Vert s\Vert \) and \(A=B-I\) in Lemma 1, we obtain that

$$\begin{aligned} \frac{\Vert (B - I)s\Vert ^2}{\Vert s\Vert ^2} \le \Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F, \end{aligned}$$
(29)

which implies \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\). Moreover, using the fact that \(a^2 - b^2 \le 2a(a - b)\) we can write

$$\begin{aligned} \Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \le 2\Vert B - I\Vert _F(\Vert B - I\Vert _F - \Vert D\Vert _F) \le 2\delta (\Vert B - I\Vert _F - \Vert D\Vert _F),\nonumber \\ \end{aligned}$$
(30)

where the second inequality follows from the fact that \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\) and the assumption that \(\Vert B - I\Vert _F \le \delta \). Next, if we replace the right hand side of (29) by its upper bound in (30) and rearrange the resulting expression, we obtain that

$$\begin{aligned} \Vert D\Vert _F \le \Vert B - I\Vert _F - \frac{\Vert (B - I)s\Vert ^2}{2\delta \Vert s\Vert ^2}, \end{aligned}$$
(31)

which provides an upper bound on \(\Vert D\Vert _F\). To derive upper bounds for \(\Vert E\Vert _F\), \(\Vert G\Vert _F\) and \(\Vert H\Vert _F\), we first need to find an upper bound for \(\Vert Q\Vert _F\), where Q is defined in (27). Note that

$$\begin{aligned} \Vert Q\Vert _F= & {} \left\| \frac{ss^\top }{\Vert s\Vert ^2} - \frac{sy^\top }{s^\top y}\right\| _F = \left\| \frac{ss^\top }{\Vert s\Vert ^2} -\frac{sy^\top }{\Vert s\Vert ^2}+\frac{sy^\top }{\Vert s\Vert ^2}- \frac{sy^\top }{s^\top y}\right\| _F \\\le & {} \left\| \frac{ss^\top }{\Vert s\Vert ^2} - \frac{sy^\top }{\Vert s\Vert ^2}\right\| _F + \left\| \frac{sy^\top }{\Vert s\Vert ^2} - \frac{sy^\top }{s^\top y}\right\| _F, \end{aligned}$$

where the first equality holds by the definition of Q, the second equality is obtained by adding and subtracting \(\frac{sy^\top }{\Vert s\Vert ^2}\), and the inequality holds due to the triangle inequality. We can further simplify the right hand side as

$$\begin{aligned} \Vert Q\Vert _F\le & {} \frac{\Vert s(s - y)^\top \Vert _F}{\Vert s\Vert ^2} + \frac{|s^T(s - y)|\Vert sy^\top \Vert _F}{\Vert s\Vert ^2s^\top y} \le \frac{\Vert y - s\Vert }{\Vert s\Vert } + \frac{\Vert y - s\Vert \Vert y\Vert }{s^\top y} \nonumber \\\le & {} \tau + \frac{\tau (1 + \tau )\Vert s\Vert ^2}{(1 - \tau )\Vert s\Vert ^2} = \frac{2\tau }{1 - \tau }, \end{aligned}$$
(32)

where the second inequality holds using the Cauchy–Schwarz inequality and the fact that \(\Vert ab^\top \Vert _F = \Vert a\Vert \Vert b\Vert \) for \(a, b \in {\mathbb {R}}^d\), and the last inequality holds due to the results in (21), (22), and (23).

Next using the upper bound in (32) on \(\Vert Q\Vert _F\) we derive an upper bound on \(\Vert E\Vert _F\). Note that

$$\begin{aligned} \Vert E\Vert _F= & {} \left\| \frac{yy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2}\right\| _F = \left\| \frac{yy^\top }{s^\top y} - \frac{sy^\top }{s^\top y}+\frac{sy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2}\right\| _F \\\le & {} \left\| \frac{yy^\top }{s^\top y} - \frac{sy^\top }{s^\top y}\right\| _F + \left\| \frac{sy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2}\right\| _F, \end{aligned}$$

where we used the triangle inequality in the last step. Using the definition of Q we can show that

$$\begin{aligned} \Vert E\Vert _F\le & {} \frac{\Vert (y - s)y^\top \Vert _F}{s^\top y} + \Vert Q\Vert _F \le \frac{\Vert y - s\Vert \Vert y\Vert }{s^\top y} + \frac{2\tau }{1 - \tau }\nonumber \\\le & {} \frac{\tau (1 + \tau )\Vert s\Vert ^2}{(1 - \tau )\Vert s\Vert ^2} + \frac{2\tau }{1 - \tau } = \frac{3 + \tau }{1 - \tau }\tau , \end{aligned}$$
(33)

where for the second inequality we use (32) and \(\Vert ab^\top \Vert _F = \Vert a\Vert \Vert b\Vert \), and for the third inequality we use the results in (21), (22), and (23).

We proceed to derive an upper bound for \(\Vert G\Vert _F\). Note that \(0 \preceq P \preceq I\) and thus \(\Vert P\Vert \le 1\). Using this observation, (32) and the first inequality in (19), we can show that \(\Vert G\Vert _F\) is bounded above by

$$\begin{aligned} \Vert G\Vert _F = \Vert PBQ\Vert _F \le \Vert PB\Vert \Vert Q\Vert _F \le \Vert P\Vert \Vert B\Vert \Vert Q\Vert _F \le \Vert B\Vert \Vert Q\Vert _F \le \Vert B\Vert \frac{2}{1 - \tau }\tau .\nonumber \\ \end{aligned}$$
(34)

Finally, we provide an upper bound for \(\Vert H\Vert _F\). By leveraging the second inequality in (19) and the fact that \(\Vert A\Vert \le \Vert A\Vert _F\) for any matrix \(A \in {\mathbb {R}}^{d \times d}\), we can show that

$$\begin{aligned} \Vert H\Vert _F = \Vert Q^\top BQ\Vert _F \le \Vert B\Vert \Vert Q\Vert \Vert Q\Vert _F \le \Vert B\Vert \Vert Q\Vert ^2_F \le \Vert B\Vert \frac{4\tau ^2}{(1 - \tau )^2}, \end{aligned}$$
(35)

where for the last inequality we used the result in (32).

If we replace \(\Vert D\Vert _F\), \(\Vert E\Vert _F\), \(\Vert G\Vert _F\), and \(\Vert H\Vert _F\) with their upper bounds in (31), (33), (34) and (35), respectively, we obtain that

$$\begin{aligned} \Vert B_+ - I\Vert _F&\le \Vert D\Vert _F + \Vert E\Vert _F + 2\Vert G\Vert _F + \Vert H\Vert _F\\&\le \Vert B - I\Vert _F - \frac{\Vert (B - I)s\Vert ^2}{2\delta \Vert s\Vert ^2} + \frac{3 + \tau }{1 - \tau }\tau + \Vert B\Vert \frac{4}{1 - \tau }\tau + \Vert B\Vert \frac{4\tau }{(1 - \tau )^2}\tau \\&= \Vert B - I\Vert _F - \frac{\Vert (B - I)s\Vert ^2}{2\delta \Vert s\Vert ^2} + W\tau , \end{aligned}$$

where \(W = \Vert B\Vert \frac{4}{1 - \tau } + \Vert B\Vert \frac{4\tau }{(1 - \tau )^2} + \frac{3 + \tau }{1 - \tau } = \Vert B\Vert \frac{4}{(1 - \tau )^2} + \frac{3 + \tau }{1 - \tau }\). Considering the notations introduced in (27), the result in (25) follows from the above inequality and the proof is complete. \(\square \)

The result in Lemma 5 shows how the error of Hessian approximation in DFP evolves as we run the updates. Next, we establish a similar result for the BFGS method.

Lemma 6

Consider the update of BFGS in (6) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B^{\text {BFGS}}_{k+1}\) generated by the BFGS update satisfies the following inequality

$$\begin{aligned} \Vert {\hat{B}}^{\text {BFGS}}_{k+1} - I\Vert _F \le \Vert {\hat{B}}_k - I\Vert _F - \frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{2\delta {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k} + V_k\tau _k, \end{aligned}$$
(36)

where \(V_k = \frac{3 + \tau _k}{1 - \tau _k}\).

Proof

The proof of this lemma is adapted from the proof of Lemma 3.6 in [32]. We should also add that our upper bound in (36) improves the bound in [32] as it contains an additional negative term, i.e., \(- \frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{2\delta {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k}\). Recall the BFGS update in (6) and multiply both sides of that expression by \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) from the left and the right to obtain

$$\begin{aligned} {\hat{B}}^{\text {BFGS}}_{k+1} = {\hat{B}}_k - \frac{{\hat{B}}_k {\hat{s}}_k {\hat{s}}_k^\top {\hat{B}}_k}{{\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k} + \frac{{\hat{y}}_k {\hat{y}}_k^\top }{{\hat{s}}_k^\top {\hat{y}}_k}, \end{aligned}$$
(37)

where we used the fact that \(s_k^\top B_k s_k = s_k^\top \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}B_k \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla ^{2}f(x_*)^{\frac{1}{2}} s_k = {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k\). To simplify the proof, we use the following notations:

$$\begin{aligned} B = {\hat{B}}_{k}, \quad B_+ = {\hat{B}}^{\text {BFGS}}_{k+1}, \quad s = {\hat{s}}_k, \quad y = {\hat{y}}_k, \quad \tau = \tau _k. \end{aligned}$$
(38)

Considering these notations, the expression in (37) can be written as

$$\begin{aligned} B_+ = B - \frac{Bss^\top B}{s^\top Bs} + \frac{yy^\top }{s^\top y}. \end{aligned}$$

Moreover, we can show that \( B_+ - I\) is given by

$$\begin{aligned} B_+ - I = B - I - \frac{Bss^\top B}{s^\top Bs} + \frac{ss^\top }{\Vert s\Vert ^2} + \frac{yy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2} = D+ E, \end{aligned}$$

where

$$\begin{aligned} D = B - I - \frac{Bss^\top B}{s^\top Bs} + \frac{ss^\top }{\Vert s\Vert ^2}, \qquad E = \frac{yy^\top }{s^\top y} - \frac{ss^\top }{\Vert s\Vert ^2}. \end{aligned}$$

To establish an upper bound on \(\Vert B_+ - I \Vert _F\), we find upper bounds on \(\Vert D\Vert ^2_F\) and \(\Vert E\Vert ^2_F\). Note that using the fact that \(\Vert D\Vert ^2_F=\mathrm {Tr}\left[ DD^\top \right] \) and properties of the trace operator we can show that

$$\begin{aligned} \begin{aligned} \Vert D\Vert ^2_F&= \mathrm {Tr}\left[ \left( B - I - \frac{Bss^\top B}{s^\top Bs} + \frac{ss^\top }{\Vert s\Vert ^2}\right) \left( B - I - \frac{Bss^\top B}{s^\top Bs} + \frac{ss^\top }{\Vert s\Vert ^2}\right) ^\top \right] \\&= \mathrm {Tr}\left[ (B - I)^2 - \frac{Bss^\top B(B - I) + (B - I)Bss^\top B}{s^\top Bs} - \frac{ss^\top Bss^\top B + Bss^\top Bss^\top }{s^\top Bs\Vert s\Vert ^2}\right] \\&\quad + \mathrm {Tr}\left[ \frac{ss^\top (B - I) + (B - I)ss^\top }{\Vert s\Vert ^2} + \frac{Bss^\top BBss^\top B}{(s^\top Bs)^2} + \frac{ss^\top ss^\top }{\Vert s\Vert ^4} \right] .\\&= \mathrm {Tr}\left[ (B - I)^2 - \frac{Bss^\top B(B - I) + (B - I)Bss^\top B}{s^\top Bs} - \frac{ss^\top B + Bss^\top }{\Vert s\Vert ^2}\right] \\&\quad + \mathrm {Tr}\left[ \frac{ss^\top (B - I) + (B - I)ss^\top }{\Vert s\Vert ^2} + \frac{\Vert Bs\Vert ^2Bss^\top B}{(s^\top Bs)^2} + \frac{ss^\top }{\Vert s\Vert ^2} \right] .\\ \end{aligned} \end{aligned}$$
(39)

Using the fact that \(\mathrm {Tr}\left( ab^\top \right) = a^\top b\) for any \(a, b \in {\mathbb {R}}^d\) we can write the following simplifications:

$$\begin{aligned}&\mathrm {Tr}\left[ \frac{Bss^\top B(B - I) + (B - I)Bss^\top B}{s^\top Bs}\right] = 2\frac{s^\top B(B - I)Bs}{s^\top Bs}, \\&\mathrm {Tr}\left[ (B - I)^2\right] = \Vert B - I\Vert ^2_F,\qquad \mathrm {Tr}\left[ \frac{Bss^\top + ss^\top B}{\Vert s\Vert ^2}\right] = 2\frac{s^\top Bs}{\Vert s\Vert ^2}, \qquad \mathrm {Tr}\left[ \frac{ss^\top }{\Vert s\Vert ^2}\right] = 1, \\&\mathrm {Tr}\left[ \frac{ss^\top (B - I) + (B - I)ss^\top }{\Vert s\Vert ^2}\right] = 2\frac{s^\top (B - I)s}{\Vert s\Vert ^2},\qquad \mathrm {Tr}\left[ \frac{\Vert Bs\Vert ^2Bss^\top B}{(s^\top Bs)^2}\right] = \frac{\Vert Bs\Vert ^4}{(s^\top Bs)^2}. \end{aligned}$$

Substituting the above simplifications into (39), we obtain that

$$\begin{aligned} \Vert D\Vert ^2_F&= \Vert B - I\Vert ^2_F - 2\frac{s^\top B(B - I)Bs}{s^\top Bs} - 2\frac{s^\top Bs}{\Vert s\Vert ^2} + 2\frac{s^\top (B - I)s}{\Vert s\Vert ^2} + \frac{\Vert Bs\Vert ^4}{(s^\top Bs)^2} + 1\nonumber \\&= \Vert B - I\Vert ^2_F + \left[ \left( \frac{\Vert Bs\Vert ^2}{s^\top Bs}\right) ^2 - \frac{s^\top B^3s}{s^\top Bs}\right] - \frac{s^\top (B - I)B(B - I)s}{s^\top Bs}. \end{aligned}$$
(40)

Next, we proceed to show that the second term on the right hand side of (40), i.e., \(\left( \frac{\Vert Bs\Vert ^2}{s^\top Bs}\right) ^2 - \frac{s^\top B^3s}{s^\top Bs}\), is non-positive. Note that by using the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \Vert Bs\Vert ^2 = s^\top B^2s = s^\top B^{\frac{3}{2}}B^{\frac{1}{2}}s \le \Vert B^{\frac{3}{2}}s\Vert \Vert B^{\frac{1}{2}}s\Vert . \end{aligned}$$

Now, by squaring both sides, we obtain \( \Vert Bs\Vert ^4 \le \Vert B^{\frac{3}{2}}s\Vert ^2\Vert B^{\frac{1}{2}}s\Vert ^2 = s^\top B^3 ss^\top Bs, \) which implies that

$$\begin{aligned} \left( \frac{\Vert Bs\Vert ^2}{s^\top Bs}\right) ^2 - \frac{s^\top B^3s}{s^\top Bs} \le 0. \end{aligned}$$
(41)

By combining (40) and (41), we obtain that

$$\begin{aligned} \frac{s^\top (B - I)B(B - I)s}{s^\top Bs} \le \Vert B- I\Vert ^2_F - \Vert D\Vert ^2_F. \end{aligned}$$
(42)

The above inequality implies that \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\). Moreover, using the fact that \(a^2 - b^2 \le 2a(a - b), \forall a,b \in {\mathbb {R}}\), we can show that

$$\begin{aligned} \Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F\le & {} 2\Vert B - I\Vert _F(\Vert B - I\Vert _F - \Vert D\Vert _F)\nonumber \\\le & {} 2\delta (\Vert B - I\Vert _F - \Vert D\Vert _F), \end{aligned}$$
(43)

where the second inequality follows from \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\) and the fact that \(\Vert B - I\Vert _F \le \delta \). Now, if we combine the results in (42) and (43), we obtain that

$$\begin{aligned} \Vert D\Vert _F \le \Vert B - I\Vert _F - \frac{s^\top (B - I)B(B - I)s}{2\delta s^\top Bs}, \end{aligned}$$
(44)

which provides an upper bound on \(\Vert D\Vert _F\). Moreover, according to (33), \(\Vert E\Vert _F\) is bounded above by

$$\begin{aligned} \Vert E\Vert _F \le \frac{3 + \tau }{1 - \tau }\tau . \end{aligned}$$
(45)

If we replace \(\Vert D\Vert _F\) and \(\Vert E\Vert _F\) with their upper bounds in (44) and (45), we obtain that

$$\begin{aligned} \Vert B_+ - I\Vert _F \le \Vert D\Vert _F + \Vert E\Vert _F \le \Vert B - I\Vert _F - \frac{s^\top (B - I)B(B - I)s}{2\delta s^\top Bs} + V\tau , \end{aligned}$$

where \(V = \frac{3 + \tau }{1 - \tau }\). Considering the notations in (38), the claim follows from the above inequality. \(\square \)

Now we can combine Lemma 5 and Lemma 6 to derive a bound on the error of Hessian approximation for the (convex) Broyden class of quasi-Newton methods.

Lemma 7

Consider the update of the (convex) Broyden family in (7) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B_{k+1}\) generated by (7) satisfies the following inequality

$$\begin{aligned}&\Vert {\hat{B}}_{k+1} - I\Vert _F \le \Vert {\hat{B}}_k - I\Vert _F - \phi _k\frac{\Vert ({\hat{B}}_k - I){\hat{s}}_k\Vert ^2}{2\delta \Vert {\hat{s}}_k\Vert ^2}\nonumber \\&\quad \quad \quad \quad \quad \quad \quad \quad \quad \qquad - (1 - \phi _k)\frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{2\delta {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k} + Z_k\tau _k, \end{aligned}$$
(46)

where \(Z_k = \phi _k\Vert {\hat{B}}_k\Vert \frac{4}{(1 - \tau _k)^2} + \frac{3 + \tau _k}{1 - \tau _k}\). We also have that

$$\begin{aligned} \Vert {\hat{B}}_{k+1} - I\Vert _F \le \Vert {\hat{B}}_k - I\Vert _F + Z_k\tau _k. \end{aligned}$$
(47)

Proof

Notice that \(B_{k+1} = \phi _k B^{\text {DFP}}_{k+1} + (1 - \phi _k) B^{\text {BFGS}}_{k+1}\), and hence \({\hat{B}}_{k+1} = \phi _k {\hat{B}}^{\text {DFP}}_{k+1} + (1 - \phi _k) {\hat{B}}^{\text {BFGS}}_{k+1}\). Using this expression and the convexity of the norm, we can show that

$$\begin{aligned} \Vert {\hat{B}}_{k+1} - I\Vert _F = \Vert \phi _k {\hat{B}}^{\text {DFP}}_{k+1} + (1 - \phi _k) {\hat{B}}^{\text {BFGS}}_{k+1} - I\Vert _F \le \phi _k \Vert {\hat{B}}^{\text {DFP}}_{k+1} - I\Vert _F + (1 - \phi _k)\Vert {\hat{B}}^{\text {BFGS}}_{k+1} - I\Vert _F. \end{aligned}$$

By replacing \(\Vert {\hat{B}}^{\text {DFP}}_{k+1}-I\Vert _F\) and \(\Vert {\hat{B}}^{\text {BFGS}}_{k+1}-I\Vert _F\) with their upper bounds in Lemma 5 and Lemma 6, the claim in (46) follows. Moreover, since \(\phi _k \in [0,1]\), \(\delta > 0\), \(\frac{\Vert ({\hat{B}}_k - I){\hat{s}}_k\Vert ^2}{\Vert {\hat{s}}_k\Vert ^2} \ge 0\) and \(\frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{{\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k} \ge 0\), the result in (46) implies (47). \(\square \)

4.2 Linear convergence

In this section, we leverage the results from the previous section on the error of Hessian approximation to show that if the initial iterate is sufficiently close to the optimal solution and the initial Hessian approximation matrix is close to the Hessian at the optimal solution, the iterates of BFGS and DFP converge at least linearly to the optimal solution. Moreover, the Hessian approximation matrices always stay close to the Hessian at the optimal solution, and the norms of the Hessian approximation matrices and their inverses are uniformly bounded above. These results are essential in proving our non-asymptotic superlinear convergence results.

Lemma 8

Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1, and recall the definitions in (13)–(16). Suppose Assumptions 3.1 and 3.3 hold. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned} \sigma _0 \le \epsilon , \qquad \Vert {\hat{B}}_0 - I\Vert _F \le \delta , \end{aligned}$$
(48)

where \(\epsilon , \delta \in (0, \frac{1}{2})\) such that for some \(\rho \in (0, 1)\), they satisfy

$$\begin{aligned}&\left( \phi _{\text {max}}\frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right) \frac{\epsilon }{1 - \rho } \le \delta ,\nonumber \\&\quad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho , \qquad \phi _{\text {max}} = \sup _{k \ge 0}\phi _k \ . \end{aligned}$$
(49)

Then, the sequence of iterates \(\{x_k\}_{k=0}^{+\infty }\) converges to the optimal solution \(x_*\) with

$$\begin{aligned} \sigma _{k + 1} \le \rho \sigma _{k}, \qquad \forall k \ge 0. \end{aligned}$$
(50)

Furthermore, the matrices \(\{B_k\}_{k=0}^{+\infty }\) stay in a neighborhood of \(\nabla ^{2}{f(x_*)}\) defined as

$$\begin{aligned} \Vert {\hat{B}}_{k} - I\Vert _F \le 2\delta , \qquad \forall k \ge 0. \end{aligned}$$
(51)

Moreover, the norms \(\{\Vert {\hat{B}}_k\Vert \}_{k=0}^{+\infty }\) and \(\{\Vert {\hat{B}}_k^{-1}\Vert \}_{k=0}^{+\infty }\) are all uniformly bounded above by

$$\begin{aligned} \Vert {\hat{B}}_k\Vert \le 1 + 2\delta , \qquad \Vert {\hat{B}}_k^{-1}\Vert \le \frac{1}{1 - 2\delta }, \qquad \forall k \ge 0. \end{aligned}$$
(52)

Proof

The proof of this lemma is adapted from the proof of Theorem 3.1 in [33]. In [33], the authors prove the results for the modified DFP method, while we consider the more general class of Broyden methods. We will use induction to prove (50), (51) and (52). First consider the base case of \(k = 0\). By the initial condition (48), it is clear that (51) holds for \(k = 0\). From (51) we know that all the eigenvalues of \({\hat{B}}_0\) are in the interval \([1 - 2\delta , 1 + 2\delta ]\). Let \(\lambda _{max}({\hat{B}}_0)\) and \(\lambda _{min}({\hat{B}}_0)\) denote the largest and smallest eigenvalues of \({\hat{B}}_0\), respectively; then we have

$$\begin{aligned} \Vert {\hat{B}}_0\Vert = \lambda _{max}({\hat{B}}_0) \le 1 + 2\delta , \quad \Vert {\hat{B}}_0^{-1}\Vert = \frac{1}{\lambda _{min}({\hat{B}}_0)} \le \frac{1}{1 - 2\delta }. \end{aligned}$$

Hence, (52) holds for \(k = 0\). Based on Assumptions 3.1 and 3.3 and the definitions in (13)–(16), we have

$$\begin{aligned} \sigma _1&= \frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}{f(x_*)}^{\frac{1}{2}}(x_1 - x_*)\Vert = \frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}{f(x_*)}^{\frac{1}{2}}(x_0 - B_0^{-1}\nabla {f(x_0)} - x_*)\Vert \nonumber \\&= \frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}{f(x_*)}^{\frac{1}{2}}B_0^{-1}[\nabla {f(x_0)} - \nabla ^2{f(x_*)}(x_0 - x_*) - (B_0 - \nabla ^2{f(x_*)})(x_0 - x_*)]\Vert \nonumber \\&= \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\hat{B}}_0^{-1}[\widehat{\nabla {f}}(x_0) - r_0 - ({\hat{B}}_0 - I)r_0]\Vert \nonumber \\&\le \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\hat{B}}_0^{-1}\Vert \left( \Vert \widehat{\nabla {f}}(x_0) - r_0\Vert + \Vert {\hat{B}}_0 - I\Vert \Vert r_0\Vert \right) . \end{aligned}$$
(53)

Now using the result in (24), and the bounds in (48), (49), (51) and (52) for \(k = 0\), we can write

$$\begin{aligned} \sigma _1\le & {} \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\hat{B}}_0^{-1}\Vert (\frac{\sigma _0}{2}\Vert r_0\Vert + \Vert {\hat{B}}_0 - I\Vert \Vert r_0\Vert ) = \Vert {\hat{B}}_0^{-1}\Vert (\frac{\sigma _0}{2} + \Vert {\hat{B}}_0 - I\Vert )\sigma _0 \\\le & {} \frac{1}{1 - 2\delta }(\frac{\epsilon }{2} + 2\delta )\sigma _0 \le \rho \sigma _0. \end{aligned}$$

This indicates that the condition in (50) holds for \(k = 0\). Hence, all the conditions in (50), (51) and (52) hold for \(k = 0\), and the base of induction is complete. Now we assume that the conditions in (50), (51) and (52) hold for all \(0 \le k \le t\), where \(t \ge 0\). Our goal is to show that these conditions are also satisfied for the case of \(k = t + 1\). Since (50) holds for all \(0\le k \le t\), we have \(\tau _k = \max \{\sigma _k, \sigma _{k + 1}\} = \sigma _k \le \epsilon < 1\) for \(0 \le k \le t\). Moreover, since the condition in (51) holds for \(0\le k\le t\), we know that \(\Vert {\hat{B}}_k-I\Vert _F \le 2\delta \) for \(0 \le k \le t\). Hence, by (47) in Lemma 7, we obtain that

$$\begin{aligned} \Vert {\hat{B}}_{k+1} - I\Vert _F \le \Vert {\hat{B}}_k - I\Vert _{F} + Z_k\sigma _k, \qquad 0 \le k \le t, \end{aligned}$$
(54)

where \(Z_k = \phi _k\Vert {\hat{B}}_k\Vert \frac{4}{(1 - \sigma _k)^2} + \frac{3 + \sigma _k}{1 - \sigma _k}\). Using (52) and \(\sigma _k \le \epsilon \) for \(0 \le k \le t\), we obtain that

$$\begin{aligned} Z_k \le \phi _k \frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }, \qquad 0 \le k \le t. \end{aligned}$$

Further, since (48) holds and (50) holds for \(0 \le k \le t\), we have that

$$\begin{aligned} \sum _{k = 0}^{t}\sigma _k \le \sum _{k = 0}^{t}\rho ^{k}\sigma _0 \le \frac{\sigma _0}{1 - \rho } \le \frac{\epsilon }{1 - \rho }. \end{aligned}$$
(55)

Considering these results we can show that

$$\begin{aligned} \sum _{k = 0}^{t}Z_k\sigma _k\le & {} \sup _{k \ge 0}{\left[ \phi _k\frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right] }\sum _{k = 0}^{t}\sigma _k\nonumber \\\le & {} \left( \phi _{\text {max}}\frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right) \frac{\epsilon }{1 - \rho } \le \delta , \end{aligned}$$
(56)

where we use the definition \(\phi _{\text {max}} = \sup _{k \ge 0}\phi _k\) and the last inequality holds due to the first inequality in (49). By leveraging (56) and (48) and computing the sum of the terms in the left and right hand side of (54) from \(k = 0\) to t, we obtain

$$\begin{aligned} \Vert {\hat{B}}_{t + 1} - I\Vert _F \le \Vert {\hat{B}}_0 - I\Vert _{F} + \sum _{k = 0}^{t}Z_k\sigma _k \le \delta + \delta = 2\delta , \end{aligned}$$

which implies that (51) holds for \(k = t + 1\). Applying the same techniques we used in the base case, we can prove that (50) and (52) hold for \(k = t + 1\). Hence, all the claims in (50), (51) and (52) hold for \(k = t + 1\), and our induction step is complete. \(\square \)

4.3 Explicit non-asymptotic superlinear rate

In the previous section, we established local linear convergence of iterates generated by the convex Broyden class including DFP and BFGS. Indeed, these local linear results are not our ultimate goal, as first-order methods are also linearly convergent under the same assumptions. However, the linear convergence is required to establish a local non-asymptotic superlinear convergence result, which is our main contribution. Next, we state the main results of this paper on the non-asymptotic superlinear convergence rate of the convex Broyden class of quasi-Newton methods. To prove this claim, we use the results in Lemma 7 and Lemma 8.

Theorem 1

Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumptions 3.1 and 3.3. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned}&\frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \le \epsilon ,\nonumber \\&\qquad \Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!\nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert _F \le \delta , \end{aligned}$$
(57)

where \(\epsilon , \delta \in (0, \frac{1}{2})\) such that for some \(\rho \in (0, 1)\), they satisfy

$$\begin{aligned}&\left( \phi _{\text {max}}(2\delta + 1)\frac{4}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right) \frac{\epsilon }{1 - \rho } \le \delta ,\nonumber \\&\qquad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho , \qquad \phi _{\text {max}} = \sup _{k \ge 0}\phi _k \ . \end{aligned}$$
(58)

Then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the convex Broyden class of quasi-Newton methods converge to \(x_*\) at a superlinear rate of

$$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert }\le & {} \left( \frac{C_1 q\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$
(59)
$$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\le & {} (1 + \epsilon )^2\left( \frac{C_1 q\sqrt{k} + C_2}{k}\right) ^{2k}, \qquad \forall k \ge 1, \end{aligned}$$
(60)
$$\begin{aligned} C_1= & {} 2\sqrt{2}\delta (1 + \rho )\left( 1 + \frac{\epsilon }{2}\right) , \qquad C_2 = \frac{(1 + \rho )(1 + \frac{\epsilon }{2})\epsilon }{2(1 - \rho )},\nonumber \\ \end{aligned}$$
(61)

where \(q = \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}} \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \) and \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\).
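To get a sense of how fast the bound in (59) decays, the following snippet evaluates it for sample values of \(\epsilon \), \(\delta \), and \(\rho \) chosen here purely for illustration so that (58) holds with \(\phi _{\text {max}} = 1\); these values are not tied to any particular objective function.

```python
import numpy as np

# Illustrative evaluation of the superlinear bound in (59); the constants below
# are hypothetical sample values for which (58) can be checked to hold with phi_max = 1.
eps, delta, rho = 0.008, 0.1, 0.3
q = np.sqrt((1 + 2 * delta) / (1 - 2 * delta))           # upper end of the range of q
C1 = 2 * np.sqrt(2) * delta * (1 + rho) * (1 + eps / 2)  # Eq. (61)
C2 = (1 + rho) * (1 + eps / 2) * eps / (2 * (1 - rho))   # Eq. (61)
for k in [1, 5, 10, 20, 40]:
    print(k, ((C1 * q * np.sqrt(k) + C2) / k) ** k)      # right hand side of (59)
```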

Proof

When both conditions (57) and (58) hold, by Lemma 8, the results in (50), (51) and (52) hold. This indicates that for any \(t \ge 0\), we have

$$\begin{aligned} \tau _t = \max \{\sigma _t, \sigma _{t+1}\} = \sigma _t \le \sigma _0 \le \epsilon < 1,\qquad \Vert {\hat{B}}_t - I\Vert _F \le 2\delta . \end{aligned}$$

Hence, using Lemma 7 for any \(t \ge 0\), we can show that

$$\begin{aligned} \Vert {\hat{B}}_{t+1} - I\Vert _F\le & {} \Vert {\hat{B}}_t - I\Vert _F - \phi _t\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{4\delta \Vert {\hat{s}}_t\Vert ^2}\nonumber \\&- (1 - \phi _t)\frac{{\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t}{4\delta {\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t} + Z_t\sigma _t, \end{aligned}$$
(62)

where \(Z_t = \phi _t\Vert {\hat{B}}_t\Vert \frac{4}{(1 - \sigma _t)^2} + \frac{3 + \sigma _t}{1 - \sigma _t}\). Using (55) and (56), for \(k \ge 0\) we have

$$\begin{aligned} \sum _{t = 0}^{k}\sigma _t \le \frac{\epsilon }{1 - \rho }, \qquad \sum _{t = 0}^{k}Z_t \sigma _t \le \delta . \end{aligned}$$
(63)

Now compute the sum of both sides of (62) from \(t = 0\) to \(k - 1\) to obtain

$$\begin{aligned}&\Vert {\hat{B}}_k - I\Vert _F \le \Vert {\hat{B}}_0 - I\Vert _F - \sum _{t = 0}^{k - 1}\left[ \phi _t\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{4\delta \Vert {\hat{s}}_t\Vert ^2} + (1 - \phi _t)\frac{{\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t}{4\delta {\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t}\right] \\&\quad \quad \quad \quad \quad \quad + \sum _{t = 0}^{k - 1}Z_t\sigma _t. \end{aligned}$$

Regroup the terms and use the results in (57) and (63) to show that

$$\begin{aligned}&\sum _{t = 0}^{k - 1}\left[ \phi _t\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{4\delta \Vert {\hat{s}}_t\Vert ^2} + (1 - \phi _t)\frac{{\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t}{4\delta {\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t}\right] \\&\quad \le \Vert {\hat{B}}_0 - I\Vert _F - \Vert {\hat{B}}_k - I\Vert _F + \sum _{t = 0}^{k - 1}Z_t\sigma _t \le \Vert {\hat{B}}_0 - I\Vert _F + \sum _{t = 0}^{k - 1}Z_t\sigma _t\le \delta + \delta = 2\delta , \end{aligned}$$

which leads to

$$\begin{aligned} \sum _{t = 0}^{k - 1}\left[ \phi _t\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2} + (1 - \phi _t)\frac{{\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t}{{\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t}\right] \le 8\delta ^2. \end{aligned}$$
(64)

Moreover, using the bounds in (52) we can show that

$$\begin{aligned} {\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t\ge & {} \frac{1}{\Vert {\hat{B}}_t^{-1}\Vert }\Vert ({\hat{B}}_t - I) {\hat{s}}_t\Vert ^2 \ge (1 - 2\delta )\Vert ({\hat{B}}_t - I) {\hat{s}}_t\Vert ^2, \\ {\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t\le & {} \Vert {\hat{B}}_t\Vert \Vert {\hat{s}}_t\Vert ^2 \le (1 + 2\delta )\Vert {\hat{s}}_t\Vert ^2. \end{aligned}$$

Hence, we have

$$\begin{aligned} \frac{{\hat{s}}_t^\top ({\hat{B}}_t - I) {\hat{B}}_t ({\hat{B}}_t - I) {\hat{s}}_t}{{\hat{s}}_t^\top {\hat{B}}_t {\hat{s}}_t} \ge \frac{1 - 2\delta }{1 + 2\delta }\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2}. \end{aligned}$$
(65)

By combining the bounds in (64) and (65), we obtain

$$\begin{aligned} \sum _{t = 0}^{k - 1}\left[ \phi _t + (1 - \phi _t)\frac{1 - 2\delta }{1 + 2\delta }\right] \frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2} \le 8\delta ^2. \end{aligned}$$

Now by computing the minimum value of the term \(\phi _t + (1 - \phi _t)\frac{1 - 2\delta }{1 + 2\delta }\), we can show

$$\begin{aligned} \inf _{k \ge 0}{\left[ \phi _k + (1 - \phi _k)\frac{1 - 2\delta }{1 + 2\delta }\right] }\sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2}&\le 8\delta ^2, \\ \left( \phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }\right) \sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2}&\le 8\delta ^2, \end{aligned}$$

where \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\) and by regrouping the terms, we obtain that

$$\begin{aligned} \sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2} \le \frac{8\delta ^2}{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}. \end{aligned}$$

Considering the definition \(q := \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}}\), we can simplify our upper bound as

$$\begin{aligned} \sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert ^2}{\Vert {\hat{s}}_t\Vert ^2} \le 8\delta ^2q^2. \end{aligned}$$

By the Cauchy-Schwarz inequality, the sum of the \(k\) nonnegative terms \({\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }/{\Vert {\hat{s}}_t\Vert }\) is at most \(\sqrt{k}\) times the square root of the sum of their squares, and hence

$$\begin{aligned} \sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert } \le 2\sqrt{2}\delta q\sqrt{k}. \end{aligned}$$
(66)

Note that since \(\phi _k \in [0, 1]\), we have \(q \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \). The result in (66) provides an upper bound on \(\sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert } \), which is a crucial term in the remainder of our proof.

Now, note that \(\nabla {f(x_t)} = J_t(x_t - x_*)\), where \(J_t\) is defined in (17). This implies that \( x_t - x_*=J_t^{-1}\nabla {f(x_t)}\) and hence we have

$$\begin{aligned} x_{t + 1} - x_* = x_t - x_* + s_t =J_t^{-1}\nabla {f(x_t)}+ s_t = - J_t^{-1}B_t s_t + s_t = J_t^{-1}(J_t - B_t)s_t, \end{aligned}$$

where the third equality holds since \(-B_t s_t = \nabla {f(x_t)} \). Pre-multiply both sides of the above expression by \(\nabla ^2{f(x_*)}^{\frac{1}{2}}\) to obtain

$$\begin{aligned} r_{t + 1} = {\hat{J}}_t^{-1}({\hat{J}}_t - {\hat{B}}_t){\hat{s}}_t = {\hat{J}}_t^{-1}[({\hat{J}}_t - I){\hat{s}}_t - ({\hat{B}}_t - I){\hat{s}}_t]. \end{aligned}$$

Therefore, we obtain that

$$\begin{aligned} \Vert r_{t + 1}\Vert \le \Vert {\hat{J}}_t^{-1}\Vert \left( \Vert ({\hat{J}}_t - I){\hat{s}}_t\Vert + \Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert \right) \le \Vert {\hat{J}}_t^{-1}\Vert \left( \Vert {\hat{J}}_t - I\Vert + \frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert }\right) \Vert {\hat{s}}_t\Vert . \end{aligned}$$

From Lemma 3 we know that \(\Vert {\hat{J}}_t^{-1}\Vert \le 1 + \frac{\sigma _t}{2}\) and \(\Vert {\hat{J}}_t - I\Vert \le \frac{\sigma _t}{2}\). Therefore, we have

$$\begin{aligned} \Vert r_{t + 1}\Vert \le \left( 1 + \frac{\sigma _t}{2}\right) \left( \frac{\sigma _t}{2} + \frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert }\right) \Vert {\hat{s}}_t\Vert . \end{aligned}$$
(67)

Also, since \(\sigma _{t+1} \le \rho \sigma _t\) and \(\sigma _t = \frac{M}{\mu ^{\frac{3}{2}}}\Vert r_t\Vert \), we obtain that \(\Vert r_{t+1}\Vert \le \rho \Vert r_t\Vert \). Hence, we can write

$$\begin{aligned} \Vert {\hat{s}}_t\Vert = \Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_{t+1} - x_* + x_* - x_t)\Vert \le \Vert r_{t+1}\Vert + \Vert r_t\Vert \le (1 + \rho )\Vert r_t\Vert .\qquad \end{aligned}$$
(68)

Using the expressions in (67) and (68), we can show that \(\frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert }\) is bounded above by

$$\begin{aligned} \frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert } \le (1 + \rho )\left( 1 + \frac{\sigma _t}{2}\right) \left( \frac{\sigma _t}{2} + \frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert }\right) . \end{aligned}$$
(69)

Compute the sum of both sides of (69) from \(t = 0\) to \(k - 1\) and use \(\sigma _t \le \epsilon \), (63), and (66) to obtain

$$\begin{aligned} \sum _{t=0}^{k-1}\frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert }\le & {} (1 + \rho )\left( 1 + \frac{\epsilon }{2}\right) \!\left( \sum _{t = 0}^{k - 1}\frac{\sigma _t}{2} + \sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert }\right) \! \\\le & {} (1 + \rho ) \left( 1 + \frac{\epsilon }{2}\right) \!\left( \frac{\epsilon }{2(1 - \rho )} + 2\sqrt{2}\delta q\sqrt{k}\right) \!. \end{aligned}$$

By the arithmetic mean-geometric mean inequality, we obtain that

$$\begin{aligned} \frac{\Vert r_k\Vert }{\Vert r_0\Vert }= & {} \prod _{t=0}^{k-1}\frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert }\nonumber \\\le & {} \left( \frac{\sum _{t=0}^{k-1}\frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert }}{k}\right) ^k \le \left( \frac{2\sqrt{2}\delta (1 + \rho )(1 + \frac{\epsilon }{2})q\sqrt{k} + \frac{(1 + \rho )(1 + \frac{\epsilon }{2})\epsilon }{2(1 - \rho )}}{k}\right) ^k.\nonumber \\ \end{aligned}$$
(70)

Using the definitions of \(C_1\) and \(C_2\) in (61), the proof of (59) is complete. Next, we proceed to prove (60). By Taylor's theorem, there exist \(\alpha _t \in [0, 1]\) and a matrix \(H_t = \nabla ^2{f(x_* + \alpha _t(x_t - x_*))}\) such that

$$\begin{aligned} f(x_t) - f(x_*)&= \nabla {f(x_*)}^\top (x_t - x_*) + \frac{1}{2}(x_t - x_*)^\top H_t(x_t - x_*) = \frac{1}{2}r_t^\top {\hat{H}}_t r_t, \end{aligned}$$

where we used \(\nabla {f(x_*)} = 0\) and \({\hat{H}}_t = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_t\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). By Lemma 3 and \(\sigma _t \le \epsilon \), we have

$$\begin{aligned} f(x_0) - f(x_*) = \frac{1}{2}r_0^\top {\hat{H}}_0 r_0 \ge \frac{1}{2(1 + \sigma _0)}\Vert r_0\Vert ^2 \ge \frac{1}{2(1 + \epsilon )}\Vert r_0\Vert ^2, \end{aligned}$$
(71)

and

$$\begin{aligned} f(x_k) - f(x_*) = \frac{1}{2}r_k^\top {\hat{H}}_k r_k \le \frac{1 + \sigma _k}{2}\Vert r_k\Vert ^2 \le \frac{1 + \epsilon }{2}\Vert r_k\Vert ^2. \end{aligned}$$
(72)

By combining (70), (71) and (72), we obtain that

$$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le \frac{\frac{1 + \epsilon }{2}\Vert r_k\Vert ^2}{\frac{1}{2(1 + \epsilon )}\Vert r_0\Vert ^2} \le (1 + \epsilon )^2\left( \frac{2\sqrt{2}\delta (1 + \rho )(1 + \frac{\epsilon }{2})q\sqrt{k} + \frac{(1 + \rho )(1 + \frac{\epsilon }{2})\epsilon }{2(1 - \rho )}}{k}\right) ^{2k}, \end{aligned}$$

and the claim in (60) holds. \(\square \)

The above theorem establishes the non-asymptotic superlinear convergence of the Broyden class of quasi-Newton methods. Notice that we use the weighted norm in (59) to characterize the convergence rate. If in addition to the strong convexity condition in Assumption 3.1, we also assume that the gradient is Lipschitz continuous as in Assumption 3.2, then we have that \(\sqrt{\mu }\Vert x_t - x_*\Vert \le \Vert r_t\Vert \le \sqrt{L}\Vert x_t - x_*\Vert , \forall t \ge 0\). Hence, the result in (59) implies that

$$\begin{aligned} \frac{\Vert x_k - x_*\Vert }{\Vert x_0 - x_*\Vert } \le \sqrt{\frac{L}{\mu }}\left( \frac{C_1 q\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$
(73)

where \(C_1\) and \(C_2\) are defined in (61). Next, we use the above theorem to report the results for DFP and BFGS, which are two special cases of the convex Broyden class of quasi-Newton methods.

Corollary 2

Consider the DFP and BFGS methods. Suppose Assumptions 3.1 and 3.3 hold and for some \(\epsilon , \delta \in (0, \frac{1}{2})\) and \(\rho \in (0, 1)\), the initial point \(x_0\) and initial Hessian approximation \(B_0\) satisfy

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \le \epsilon , \quad \Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!\nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert _F \le \delta .\nonumber \\ \end{aligned}$$
(74)
  • For the DFP method, if the tuple \((\epsilon , \delta , \rho )\) satisfies

    $$\begin{aligned} \left[ \frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right] \frac{\epsilon }{1 - \rho } \le \delta , \qquad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho \ , \end{aligned}$$
    (75)

    then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the DFP method converge to \(x_*\) at a superlinear rate of

    $$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } \le \left( \frac{C_1\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$
    (76)
    $$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le (1 + \epsilon )^2\left( \frac{C_1\sqrt{k} + C_2}{k}\right) ^{2k}, \qquad \forall k \ge 1. \end{aligned}$$
    (77)
  • For the BFGS method, if the tuple \((\epsilon , \delta , \rho )\) satisfies

    $$\begin{aligned} \frac{(3 + \epsilon )\epsilon }{(1 - \epsilon )(1 - \rho )} \le \delta , \qquad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho \ , \end{aligned}$$
    (78)

    then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the BFGS method converge to \(x_*\) at a superlinear rate of

    $$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } \le \left( \frac{C_1\sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$
    (79)
    $$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le (1 + \epsilon )^2\left( \frac{C_1\sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\sqrt{k} + C_2}{k}\right) ^{2k}, \qquad \forall k \ge 1, \end{aligned}$$
    (80)

where \(C_1\) and \(C_2\) are defined in (61).

Proof

In Theorem 1, set \(\phi _k = 1\) for all \(k \ge 0\) to obtain the results for DFP and set \(\phi _k = 0\) for all \(k \ge 0\) to obtain the results for BFGS. \(\square \)
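
To make the role of the parameter \(\phi _k\) concrete, the following Python sketch forms one update of the Hessian approximation as a convex combination of the textbook DFP and BFGS formulas, so that \(\phi _k = 1\) recovers DFP and \(\phi _k = 0\) recovers BFGS, matching the convention in the proof above. This is an illustrative sketch only; the method analyzed in the paper applies the corresponding update to the inverse approximation as in (8).

```python
import numpy as np

def broyden_class_update(B, s, y, phi):
    """One convex Broyden-class update of the Hessian approximation B.

    Convex combination of the textbook DFP and BFGS updates of B:
    phi = 1 recovers DFP and phi = 0 recovers BFGS (the convention of
    Corollary 2). Assumes the curvature condition s^T y > 0 holds.
    """
    sy = s @ y                   # curvature term s_k^T y_k
    Bs = B @ s
    # BFGS update of the Hessian approximation
    B_bfgs = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
    # DFP update of the Hessian approximation
    V = np.eye(len(s)) - np.outer(y, s) / sy
    B_dfp = V @ B @ V.T + np.outer(y, y) / sy
    return phi * B_dfp + (1.0 - phi) * B_bfgs
```

Choosing phi strictly between 0 and 1 yields the intermediate members of the convex Broyden class covered by Theorem 1.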

The results in Corollary 2 indicate that, in a local neighborhood of the optimal solution, the iterates generated by DFP and BFGS converge to the optimal solution at a superlinear rate of \(({(C_1\sqrt{k}+C_2)}/{k})^k\), where the constants \(C_1\) and \(C_2\) are determined by \(\rho \), \(\epsilon \) and \(\delta \). Indeed, as \(k\) grows, the rate behaves as \({\mathcal {O}}\left( (1/{\sqrt{k}})^{k}\right) \). The tuple \((\rho , \epsilon , \delta )\) is independent of the problem parameters \((\mu , L, M, d)\), and the only requirement on the tuple \((\rho , \epsilon , \delta )\) is that it satisfies (75) or (78). Note that the superlinear rate in (76) and (79) is faster than the linear rate of first-order methods, as the contraction coefficient approaches zero at a sublinear rate of \({\mathcal {O}}({1}/{\sqrt{k}})\). Similarly, in terms of the function value, the superlinear rate shown in (77) and (80) behaves as \({\mathcal {O}}\left( (1/k)^{k}\right) \). The result in Corollary 2 also shows the existence of a trade-off between the rate of convergence and the neighborhood of superlinear convergence. We highlight this point in the following remark.

Remark 3

There exists a trade-off between the size of the local neighborhood in which DFP or BFGS converges superlinearly and their rate of convergence. To be more precise, by choosing larger values for \(\epsilon \) and \(\delta \) (as long as they satisfy (75) or (78)), we can increase the size of the region in which the quasi-Newton method converges superlinearly, but this leads to a slower superlinear convergence rate according to the bounds in (76), (77), (79) and (80). Conversely, by choosing small values for \(\epsilon \) and \(\delta \), the rate of convergence becomes faster, but the local neighborhood defined in (74) becomes smaller.

The final convergence results of Corollary 2 depend on the choice of parameters \((\rho , \epsilon , \delta )\), and it may not be easy to quantify the exact convergence rate at first glance. To better quantify the superlinear convergence rate of DFP and BFGS, in the following corollary, we state the results of Corollary 2 for specific choices of \(\rho \), \(\epsilon \) and \(\delta \) which simplify our expressions. Indeed, one can choose another set of values for these parameters to control the neighborhood and rate of superlinear convergence, as long as they satisfy the conditions in (75) for DFP and (78) for BFGS.

Corollary 3

Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.3 hold. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) of DFP satisfy

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{1}{120}, \quad \Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F \le \frac{1}{7},\nonumber \\ \end{aligned}$$
(81)

and the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) of BFGS satisfy

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{1}{50}, \quad \Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F \le \frac{1}{7}.\nonumber \\ \end{aligned}$$
(82)

Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by the DFP and BFGS methods satisfy

$$\begin{aligned} \frac{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert } \le \left( \frac{1}{k}\right) ^{\frac{k}{2}}, \qquad \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le 1.1\left( \frac{1}{k}\right) ^{k}, \qquad \forall k \ge 1.\nonumber \\ \end{aligned}$$
(83)

Proof

The results for DFP can be shown by setting \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{120}\) and \(\delta = \frac{1}{7}\) in Corollary 2. We can check that for those values, the conditions in (75) are all satisfied. Moreover, the expressions in (76) and (77) can be simplified as

$$\begin{aligned}&\frac{2\sqrt{2}\delta (1 + \rho )(1 + \frac{\epsilon }{2})\sqrt{k} + \frac{(1 + \rho )(1 + \frac{\epsilon }{2})\epsilon }{2(1 - \rho )}}{k} \\&\quad \qquad = \frac{\frac{2\sqrt{2}}{7}(1 + \frac{1}{2})(1 + \frac{1}{240})\sqrt{k} + \frac{(1 + \frac{1}{2})(1 + \frac{1}{240})\frac{1}{120}}{2(1 - \frac{1}{2})}}{k} < \frac{1}{\sqrt{k}}, \end{aligned}$$

and \((1 + \epsilon )^2 = (1 + \frac{1}{120})^2 \le 1.1\). So the claims in (83) follow. The results for BFGS can be shown similarly by setting \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{50}\) and \(\delta = \frac{1}{7}\) in (78), (79) and (80). \(\square \)
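
For readers who want to double-check this arithmetic, the following short Python snippet (purely a verification aid) evaluates the conditions (75) and (78) and the contraction factors of (76) and (79) for the parameter values used in the proof.

```python
import math

rho, delta = 0.5, 1 / 7

# DFP: epsilon = 1/120, conditions in (75)
eps = 1 / 120
lhs = (4 * (2 * delta + 1) / (1 - eps) ** 2 + (3 + eps) / (1 - eps)) * eps / (1 - rho)
print(lhs <= delta, eps / 2 + 2 * delta <= (1 - 2 * delta) * rho)   # True True

# Contraction factor of (76): (C1*sqrt(k) + C2)/k stays below 1/sqrt(k) for all k >= 1
C1 = 2 * math.sqrt(2) * delta * (1 + rho) * (1 + eps / 2)
C2 = (1 + rho) * (1 + eps / 2) * eps / (2 * (1 - rho))
print(all((C1 * math.sqrt(k) + C2) / k < 1 / math.sqrt(k) for k in range(1, 10**4)))  # True

# BFGS: epsilon = 1/50, conditions in (78) and the contraction factor of (79)
eps = 1 / 50
lhs = (3 + eps) * eps / ((1 - eps) * (1 - rho))
print(lhs <= delta, eps / 2 + 2 * delta <= (1 - 2 * delta) * rho)   # True True
C1 = 2 * math.sqrt(2) * delta * (1 + rho) * (1 + eps / 2)
C2 = (1 + rho) * (1 + eps / 2) * eps / (2 * (1 - rho))
qB = math.sqrt((1 + 2 * delta) / (1 - 2 * delta))                   # q for phi_min = 0
print(all((C1 * qB * math.sqrt(k) + C2) / k < 1 / math.sqrt(k) for k in range(1, 10**4)))  # True
```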

The results in Corollary 3 show that for some specific choices of \((\epsilon , \delta , \rho )\), the convergence rate of DFP and BFGS is \(\left( 1/k\right) ^{k/2}\), which is asymptotically faster than any linear convergence rate of first-order methods. Moreover, we observe that the neighborhood in which this fast superlinear rate holds is slightly larger for BFGS compared to DFP, i.e., compare the first conditions in (81) and (82). This is consistent with the fact that, in practice, BFGS often outperforms DFP.

A major shortcoming of the results in Corollary 2 and Corollary 3 is that, in addition to assuming that the initial iterate \(x_0\) is sufficiently close to the optimal solution, we also require the initial Hessian approximation error to be sufficiently small. In the following theorem, we resolve this issue by suggesting a practical choice for \(B_0\) such that the second assumption in (81) and (82) can be satisfied under some conditions. To be more precise, we show that if \(\Vert \nabla ^2{f(x_*)}^\frac{1}{2}(x_0 - x_*)\Vert \) is sufficiently small (we formally describe this condition), then by setting \(B_0 = \nabla ^{2}{f(x_0)}\), the second condition in (81) and (82) for Hessian approximation is satisfied, and we can achieve the convergence rate in (83).

Theorem 2

Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.3 hold. Moreover, for DFP, suppose the initial point \(x_0\) and initial Hessian approximation \(B_0\) satisfy

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \min \left\{ \frac{1}{120}, \frac{1}{7\sqrt{d}}\right\} , \qquad B_0 = \nabla ^{2}{f(x_0)}, \end{aligned}$$
(84)

and for BFGS, they satisfy

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \min \left\{ \frac{1}{50}, \frac{1}{7\sqrt{d}}\right\} , \qquad B_0 = \nabla ^{2}{f(x_0)}. \end{aligned}$$
(85)

Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by the DFP and BFGS methods satisfy

$$\begin{aligned} \frac{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert } \le \left( \frac{1}{k}\right) ^{\frac{k}{2}}, \qquad \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le 1.1\left( \frac{1}{k}\right) ^{k}, \qquad \forall k \ge 1.\nonumber \\ \end{aligned}$$
(86)

Proof

First we consider the case of the DFP method. Notice that by (84), we obtain

$$\begin{aligned} \frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \le \frac{1}{120}. \end{aligned}$$

Hence, the first part of (81) is satisfied. Moreover, using Assumptions 3.1 and 3.3, we have

$$\begin{aligned}&\Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}(\nabla ^{2}f(x_0) - \nabla ^{2}f(x_*))\nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert _F\\&\quad \le \sqrt{d}\Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}(\nabla ^{2}f(x_0) - \nabla ^{2}f(x_*))\nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert \\&\quad \le \sqrt{d}\Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert ^2\Vert \nabla ^{2}f(x_0) - \nabla ^{2}f(x_*)\Vert \le \sqrt{d}\frac{M}{\mu }\Vert x_0 - x_*\Vert \\&\quad = \sqrt{d}\frac{M}{\mu }\Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \\&\quad \le \sqrt{d}\frac{M}{\mu }\Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert \Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \\&\quad \le \sqrt{d}\frac{M}{\mu ^{\frac{3}{2}}}\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \le \frac{1}{7}. \end{aligned}$$

The first inequality holds as \(\Vert A\Vert _F \le \sqrt{d}\Vert A\Vert \) for any matrix \(A \in {\mathbb {R}}^{d \times d}\), and the last inequality is due to the first part of (84). The above bound shows that the second part of (81) is also satisfied, and by Corollary 3 the claim follows. The proof for BFGS is similar: it can be derived by following the steps of the proof for DFP and exploiting the BFGS results in Corollary 3. \(\square \)

According to Theorem 2, if the initial weighted error \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \) is sufficiently small, then by setting the initial Hessian approximation \(B_0\) as the Hessian at the initial point \(\nabla ^{2}{f(x_0)}\), the iterates converge superlinearly at a rate of \((1/k)^{k/2}\). More specifically, based on the result in (24), it suffices to have \(\Vert \nabla ^2{f(x_*)}^{-\frac{1}{2}}\nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^{\frac{3}{2}}/(M\sqrt{d}))\) to ensure \(\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert \le {\mathcal {O}}(\mu ^{\frac{3}{2}}/(M\sqrt{d}))\) as stated in (84) and (85). Hence, this condition is satisfied when \(\Vert \nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^2/(M\sqrt{d}))\). This observation implies that, in practice, we can exploit any optimization algorithm to find an initial point \(x_0\) such that \(\Vert \nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^2/(M\sqrt{d}))\), and once this condition is satisfied, by setting \(B_0=\nabla ^2 f(x_0)\) we obtain the guaranteed superlinear convergence result. The suggested procedure requires only one Hessian evaluation and inversion, at the initial iterate; in the rest of the algorithm, the Hessian inverse approximations are updated according to the convex Broyden update in (8).
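
The following Python sketch illustrates this two-phase procedure under simplifying assumptions: `grad` and `hess` are user-supplied oracles for \(\nabla f\) and \(\nabla ^2 f\), `eta` is a step size for the first-order phase, and `tol_switch` stands in for the threshold \({\mathcal {O}}(\mu ^2/(M\sqrt{d}))\), whose constants are typically unknown in practice. The quasi-Newton phase uses the standard BFGS update of the inverse Hessian approximation with unit step size.

```python
import numpy as np

def warm_start_quasi_newton(x, grad, hess, eta, tol_switch, tol=1e-10, max_iter=500):
    """Two-phase procedure sketched above (illustrative, not the paper's exact algorithm).

    Phase 1: gradient descent with step size `eta` until ||grad(x)|| <= tol_switch,
             a stand-in for the O(mu^2 / (M sqrt(d))) threshold.
    Phase 2: BFGS with unit step size, initialized with H_0 = hess(x_0)^{-1},
             i.e., B_0 = hess(x_0); only one Hessian evaluation/inversion is needed.
    """
    while np.linalg.norm(grad(x)) > tol_switch:
        x = x - eta * grad(x)

    H = np.linalg.inv(hess(x))               # the single Hessian (inverse) evaluation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        s = -H @ g                           # quasi-Newton step, unit step size
        x_next = x + s
        g_next = grad(x_next)
        y = g_next - g
        rho_k = 1.0 / (y @ s)                # curvature term; positive for convex f
        V = np.eye(len(x)) - rho_k * np.outer(s, y)
        H = V @ H @ V.T + rho_k * np.outer(s, s)   # BFGS update of the inverse
        x, g = x_next, g_next
    return x
```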

5 Analysis of self-concordant functions

The results that we have presented so far require three assumptions: (i) the objective function is strongly convex, (ii) its gradient is Lipschitz continuous, and (iii) its Hessian is Lipschitz continuous only at the optimal solution. In this section, we extend our theoretical results to a different setting where the objective function is self-concordant.

Assumption 5.1

The objective function f is standard self-concordant. In other words, it satisfies the following conditions: (i) f is closed with open domain dom(f), (ii) it is three times continuously differentiable, (iii) \(\nabla ^2{f(x)} \succ 0\) for all \(x \in dom(f)\), and (iv) the Hessian satisfies

$$\begin{aligned} \left. \frac{d}{dt}\nabla ^2{f(x + ty)}\right| _{t=0} \preceq 2\left( y^\top \nabla ^2{f(x)}y\right) ^{\frac{1}{2}}\nabla ^2{f(x)}, \quad \forall x \in dom(f), \quad \forall y \in {\mathbb {R}}^d.\nonumber \\ \end{aligned}$$
(87)

Notice that the constant 2 in the above condition corresponds to standard self-concordant functions; in principle, it can be replaced by an arbitrary positive constant. The analysis of Newton-type methods for self-concordant functions (see, e.g., [38, 39]) expands the theory of second-order algorithms beyond the classic setting considered in the previous section. This family of functions is of interest as it includes a large set of loss functions that are widely used in machine learning, such as linear functions, convex quadratic functions, and negative logarithm functions. In this section, we extend our results to this class of functions.

We should mention that the setup considered in this section is neither more general nor more restrictive than the setup in the previous section. For instance, the function \(f(x) = -\log {x}\) is self-concordant and satisfies Assumption 5.1, but it does not satisfy Assumptions 3.1, 3.2 or 3.3 on its domain \(x > 0\). Conversely, satisfying the assumptions of the previous section does not imply self-concordance. For instance, the objective function

$$\begin{aligned} f(x) = {\left\{ \begin{array}{ll} 7x^2 + 8x + 3 &{}\quad \text {if} \quad x \in (-\infty , -1)\\ x^4 + x^2 &{}\quad \text {if} \quad x \in [-1, 1]\\ 7x^2 - 8x + 3 &{}\quad \text {if} \quad x \in (1, +\infty )\\ \end{array}\right. } \end{aligned}$$
(88)

satisfies the conditions in Assumptions 3.1, 3.2 and 3.3. However, it is not self-concordant, as its third derivative is not continuous.
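
A quick symbolic check (using SymPy, purely for illustration) confirms these properties of the function in (88) at the breakpoint \(x = 1\): the function values and first two derivatives of the adjacent pieces match, while the third derivatives do not; the breakpoint \(x = -1\) is symmetric.

```python
import sympy as sp

x = sp.symbols('x')
middle = x**4 + x**2          # piece on [-1, 1]
right = 7*x**2 - 8*x + 3      # piece on (1, +infinity)

# f, f' and f'' of the two pieces agree at x = 1, while f''' jumps (24 vs. 0).
for order in range(4):
    at_one = [sp.diff(p, x, order).subs(x, 1) for p in (middle, right)]
    print(order, at_one)
# Output: 0 [2, 2]   1 [6, 6]   2 [14, 14]   3 [24, 0]
```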

Based on these points, the analysis in this section extends our convergence analysis of quasi-Newton methods to a new setting that is not covered by the setup in the previous section.

We should also mention that, for their finite-time analysis of quasi-Newton methods, the authors of [35,36,37] assume that the objective function is strongly self-concordant, a subclass of self-concordant functions formally defined in [35]. A function f is strongly self-concordant when there exists a constant \(K \ge 0\) such that for any \(x, y, z, w \in dom(f)\), we have

$$\begin{aligned} \nabla ^2{f(y)} -\nabla ^2{f(x)} \preceq K\left( (y - x)^\top \nabla ^2{f(z)}(y - x)\right) ^{\frac{1}{2}}\nabla ^2{f(w)}. \end{aligned}$$
(89)

In addition, in [35,36,37] the authors require the objective function to be strongly convex and smooth. Indeed, the setting considered in this section is more general than the setup in these works, as we only require the function to be self-concordant.

Note that the condition \(\nabla ^2{f(x)} \succ 0\) guarantees that the inner product \(s_k^\top y_k\) in quasi-Newton updates is positive in all iterations, as stated in Sect. 2. Also, by the definition of self-concordance, the function f is strictly convex. We start our analysis by stating the following lemma, which plays an important role in our analysis for self-concordant functions.

Lemma 9

Suppose function f satisfies Assumption 5.1 and \(x, y \in dom(f)\). Further, consider the definition \(G := \int _{0}^{1}\nabla ^2{f(x + \alpha (y - x))}d\alpha \). If x and y are such that \(r = \Vert \nabla ^2{f(x)}^{\frac{1}{2}}(y - x)\Vert < 1\), then

$$\begin{aligned} (1 - r)^{2}\nabla ^2{f(x)}\preceq & {} \nabla ^2{f(y)} \preceq \frac{1}{(1 - r)^{2}}\nabla ^2{f(x)}, \end{aligned}$$
(90)
$$\begin{aligned} (1 - r + \frac{r^2}{3})\nabla ^2{f(x)}\preceq & {} G \preceq \frac{1}{1 - r}\nabla ^2{f(x)}. \end{aligned}$$
(91)

Proof

Check Theorem 4.1.6 and Corollary 4.1.4 of [9]. \(\square \)

The next two lemmas are based on Lemma 9 and are similar to the results in Lemmas 3 and 4, except that here we prove them under the conditions in Assumption 5.1.

Lemma 10

Recall the definition of \(r_k\) in (16) and \({\hat{J}}_k\) in (17). For some \(\alpha _k \in [0, 1]\), define the matrices \(H_k = \nabla ^2{f(x_* + \alpha _k(x_k - x_*))}\) and \({\hat{H}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). If Assumption 5.1 holds and \(\Vert r_k\Vert \le \frac{1}{2}\), then for all \(k \ge 0\) we have

$$\begin{aligned} \frac{1}{1 + 2\Vert r_k\Vert }I \preceq {\hat{J}}_k \preceq (1 + 2\Vert r_k\Vert )I, \qquad (1 - \Vert r_k\Vert )^2I \preceq {\hat{H}}_k \preceq \frac{1}{(1 - \Vert r_k\Vert )^2}I.\qquad \end{aligned}$$
(92)

Proof

Check Appendix F. \(\square \)

Lemma 11

Recall the definitions in (13) - (16) and consider the definition \(\theta _k := \max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \}\). Suppose that for some \(k \ge 0\), we have \(\theta _k \le \frac{1}{2}\). If Assumption 5.1 holds, we have

$$\begin{aligned} \Vert {\hat{y}}_k - {\hat{s}}_k\Vert\le & {} 6\theta _k\Vert {\hat{s}}_k\Vert , \end{aligned}$$
(93)
$$\begin{aligned} (1 - 6\theta _k)\Vert {\hat{s}}_k\Vert ^2\le & {} {\hat{s}}_k^\top {\hat{y}}_k \le (1 + 6\theta _k)\Vert {\hat{s}}_k\Vert ^2, \end{aligned}$$
(94)
$$\begin{aligned} (1 - 6\theta _k)\Vert {\hat{s}}_k\Vert\le & {} \Vert {\hat{y}}_k\Vert \le (1 + 6\theta _k)\Vert {\hat{s}}_k\Vert , \end{aligned}$$
(95)
$$\begin{aligned} \Vert \widehat{\nabla {f}}(x_k) - r_k\Vert\le & {} 2\Vert r_k\Vert ^2. \end{aligned}$$
(96)

Proof

Check Appendix G. \(\square \)

By comparing Lemma 10 and Lemma 11 with Lemma 3 and Lemma 4, respectively, we observe that the only difference between these results is that \({\sigma _k}/{2} = ({M}/{2\mu ^{\frac{3}{2}}})\Vert r_k\Vert \) is replaced by \(2\Vert r_k\Vert \) and \(\tau _k = \max \{\sigma _k, \sigma _{k+1}\}\) is replaced by \(6\theta _k = 6\max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \}\). Due to this similarity, the superlinear convergence proof for the self-concordant setting closely follows the one in Sect. 4, and we directly present the final superlinear convergence rate results for self-concordant functions.

Theorem 3

Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumption 5.1. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned} \Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{\epsilon }{6}, \qquad \Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F \le \delta ,\nonumber \\ \end{aligned}$$
(97)

where \(\epsilon , \delta \in (0, \frac{1}{2})\) such that for some \(\rho \in (0, 1)\), they satisfy

$$\begin{aligned}&\left( \phi _{\text {max}}(2\delta + 1)\frac{4}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right) \frac{\epsilon }{1 - \rho } \le \delta , \qquad \frac{\epsilon }{3} + 2\delta \le (1 - 2\delta )\rho ,\nonumber \\&\quad \phi _{\text {max}} = \sup _{k \ge 0}\phi _k \ . \end{aligned}$$
(98)

Then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the convex Broyden class of quasi-Newton methods converge to \(x_*\) at a superlinear rate of

$$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert }\le & {} \left( \frac{C_3q\sqrt{k} + C_4}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$
(99)
$$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\le & {} \frac{1}{(1 - \frac{\epsilon }{6})^4}\left( \frac{C_3 q\sqrt{k} + C_4}{k}\right) ^{2k}, \qquad \forall k \ge 1, \end{aligned}$$
(100)
$$\begin{aligned} C_3= & {} 2\sqrt{2}\delta (1 + \rho )\left( 1 + \frac{\epsilon }{3}\right) , \qquad C_4 = \frac{(1 + \rho )(1 + \frac{\epsilon }{3})\epsilon }{3(1 - \rho )},\nonumber \\ \end{aligned}$$
(101)

where \(q = \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}} \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \) and \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\).

Proof

Check Appendix H. \(\square \)

Similarly, we can set \(\phi _k = 1\) or \(\phi _k = 0\) for all \(k \ge 0\) to obtain the results for DFP and BFGS, respectively, as stated in Corollary 2. We can also select specific values for \((\epsilon , \delta , \rho )\) to simplify our bounds.

Corollary 4

Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned} \Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{1}{720}, \qquad \Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F \le \frac{1}{7},\nonumber \\ \end{aligned}$$
(102)

and for the BFGS method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned} \Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{1}{300}, \qquad \Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F \le \frac{1}{7}.\nonumber \\ \end{aligned}$$
(103)

Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by these methods satisfy

$$\begin{aligned} \frac{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert } \le \left( \frac{1}{k}\right) ^{\frac{k}{2}}, \qquad \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le 1.1\left( \frac{1}{k}\right) ^{k}, \qquad \forall k \ge 1.\nonumber \\ \end{aligned}$$
(104)

Proof

As in the proof of Corollary 3, we set \(\phi _k = 1\), \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{120}\), \(\delta = \frac{1}{7}\) for the DFP method and \(\phi _k = 0\), \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{50}\), \(\delta = \frac{1}{7}\) for the BFGS method in Theorem 3. Then, the claims follow. \(\square \)

We can also set the initial Hessian approximation matrix to be \(\nabla ^2{f(x_0)}\) as in Theorem 2 to achieve the same superlinear convergence rate as long as the distance between the initial point \(x_0\) and the optimal point \(x_*\) is sufficiently small.

Theorem 4

Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy

$$\begin{aligned} \Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \min \left\{ \frac{1}{720}, \frac{1}{21\sqrt{d}}\right\} , \qquad B_0 = \nabla ^{2}{f(x_0)}, \end{aligned}$$
(105)

and for the BFGS method, they satisfy

$$\begin{aligned} \Vert {\nabla ^{2}f(x_*)^\frac{1}{2}}(x_0 - x_*)\Vert \le \min \left\{ \frac{1}{300}, \frac{1}{21\sqrt{d}}\right\} , \qquad B_0 = \nabla ^{2}{f(x_0)}. \end{aligned}$$
(106)

Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by these methods satisfy

$$\begin{aligned} \frac{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert } \le \left( \frac{1}{k}\right) ^{\frac{k}{2}}, \qquad \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le 1.1\left( \frac{1}{k}\right) ^{k}, \quad \forall k \ge 1.\nonumber \\ \end{aligned}$$
(107)

Proof

Check Appendix I. \(\square \)

In summary, we established the local convergence rate of the convex Broyden class of quasi-Newton methods for self-concordant functions. We showed that if the initial distance to the optimal solution is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(1)\) and the initial Hessian approximation error is \(\Vert \nabla ^{2}{f(x_*)}^{-\frac{1}{2}}(B_0 - \nabla ^{2}f(x_*))\nabla ^{2}{f(x_*)}^{-\frac{1}{2}}\Vert _F = {\mathcal {O}}(1)\), the iterations converge to the optimal solution at a superlinear rate of \(\frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } = {\mathcal {O}}{\left( \frac{1}{\sqrt{k}}\right) ^{k}}\) and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} = {\mathcal {O}}{\left( \frac{1}{k}\right) ^{k}}\). Moreover, we can achieve the same superlinear rate if the initial error is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{1}{\sqrt{d}})\) and the initial Hessian approximation matrix is \(B_0 = \nabla ^2{f(x_0)}\).

6 Discussion

In this section, we discuss the strengths and shortcomings of our theoretical results and compare them with concurrent papers [36, 37] on the non-asymptotic superlinear convergence of DFP and BFGS.

Initial Hessian approximation condition Note that in our main theoretical results, in addition to the fact that the initial iterate \(x_0\) has to be close to the optimal solution \(x_*\), which is a common condition for local convergence results, we also need the initial Hessian approximation \(B_0\) to be close to the Hessian at the optimal solution \(\nabla ^2 f(x_*)\). At first glance, this might seem restrictive, but as we have shown in Theorem 2 and Theorem 4, if we set the initial Hessian approximation to the Hessian at the initial point \(\nabla ^2 f(x_0)\), this condition is automatically satisfied as long as the initial iterate error \(\Vert x_0-x_*\Vert \) is sufficiently small. From a complexity point of view, this approach is reasonable as quasi-Newton methods and Newton’s method outperform first-order methods in a local neighborhood of the optimal solution, and their global linear convergence rate may not be faster than the linear convergence rate of first-order methods. Hence, as suggested in [2], to optimize the overall iteration complexity according to theoretical bounds, one might use first-order methods such as Nesterov’s accelerated gradient method to reach a local neighborhood of the optimal solution, and then switch to locally fast methods such as quasi-Newton methods. If this procedure is used, our theoretical results show that by setting \(B_0=\nabla ^2 f(x_0)\) (and equivalently \(H_0=\nabla ^2 f(x_0)^{-1}\)) for the convex Broyden class of quasi-Newton, the fast superlinear convergence rate of \((1/k)^{k/2}\) can be obtained.

It is worth noting that, in practice, algorithms that do not switch between different methods or require knowledge of the problem parameters are more favorable. For these reasons, quasi-Newton methods with an Armijo-Wolfe line search are more practical, as they offer an adaptive choice of the stepsize with global convergence guarantees and without requiring typically unknown constants such as the Lipschitz constant of the gradient, the Lipschitz constant of the Hessian, and the strong convexity parameter. Indeed, both this framework and the framework in [36, 37] require re-initializing the Hessian approximation when the iterates are sufficiently close to the solution. An ideal theoretical result would cover a line-search approach in which, once the iterates reach a local neighborhood of the optimal solution, the Hessian approximation for DFP or BFGS automatically satisfies the required conditions for superlinear convergence without any modification.

Convergence rate-neighborhood trade-off As mentioned earlier, we observe a trade-off between the radius of the neighborhood in which BFGS and DFP converge superlinearly to the optimal solution and the rate (speed) of superlinear convergence. One important observation here is that for specific choices of \(\epsilon \), \(\delta \) and \(\rho \), the rate of convergence could be independent of the problem dimension d, while the neighborhood of convergence would depend on d. Note that by selecting different parameters we could improve the dependency of the neighborhood on d, at the cost of achieving a contraction factor that depends on d. In this case, the contraction factor may not always be smaller than 1, and we can only guarantee that after a few iterations it becomes smaller than 1 and eventually behaves as 1/k. The results in [36, 37] have a similar structure. For instance, in [36], the authors show that when the initial Newton decrement is smaller than \(\frac{\mu ^{\frac{5}{2}}}{ML}\), which is independent of the problem dimension, the convergence rate would be of the form \((\frac{dL}{\mu k})^{k/2}\). Hence, to observe the superlinear convergence rate, one needs to run the BFGS method for at least \(d L/\mu \) iterations to ensure the contraction factor is smaller than 1. A similar conclusion could be made using our results, if we adjust the neighborhood. In our main result, we only report the case where the neighborhood depends on d and the rate is independent of d, since in this case the contraction factor is always smaller than 1 and the superlinear behavior starts from the first iteration.

7 Numerical experiments

In this section, we present our numerical experiments and compare the non-asymptotic performance of quasi-Newton methods with Newton’s method and the gradient descent algorithm. We further investigate if the convergence rates of quasi-Newton methods are consistent with our theoretical guarantees. In particular, we solve the following logistic regression problem with \(l_2\) regularization

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} f(x) = \frac{1}{N}\sum _{i = 1}^{N}\ln {(1 + e^{-y_i z_i^\top x})} + \frac{\mu }{2}\Vert x\Vert ^2. \end{aligned}$$
(108)

We assume that \(\{z_i\}_{i = 1}^{N}\) are the data points and \(\{y_i\}_{i = 1}^{N}\) are their corresponding labels, where \(z_i \in {\mathbb {R}}^d\) and \(y_i \in \{-1, 1\}\) for \(1 \le i \le N\). Note that the function f(x) in (108) is strongly convex with parameter \(\mu > 0\). We normalize all data points such that \(\Vert z_i\Vert = 1\) for all \(1 \le i \le N\). Therefore, the gradient of the function f(x) is Lipschitz continuous with parameter \(L = {1}/{4} + \mu \). It is also known that the logistic regression objective function is self-concordant after a suitable scaling, i.e., it is self-concordant but not standard self-concordant (with constant 2). Moreover, its Hessian is Lipschitz continuous. In summary, the objective function f in (108) satisfies Assumptions 3.1, 3.2, 3.3 and Assumption 5.1.
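
For concreteness, a minimal NumPy sketch of the objective in (108) and its derivatives is given below; the data matrix `Z` (whose rows are the normalized \(z_i^\top \)) and the label vector `y` with entries in \(\{-1, 1\}\) are assumptions supplied by the user.

```python
import numpy as np

def logistic_objective(x, Z, y, mu):
    """Regularized logistic loss in (108); rows of Z are assumed normalized."""
    margins = -y * (Z @ x)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * mu * (x @ x)

def logistic_gradient(x, Z, y, mu):
    p = 1.0 / (1.0 + np.exp(y * (Z @ x)))     # sigmoid of the negative margin
    return Z.T @ (-y * p) / len(y) + mu * x

def logistic_hessian(x, Z, y, mu):
    p = 1.0 / (1.0 + np.exp(y * (Z @ x)))
    w = p * (1.0 - p)                         # curvature weights in (0, 1/4]
    # With ||z_i|| = 1, the loss Hessian is at most I/4, so L = 1/4 + mu.
    return (Z.T * w) @ Z / len(y) + mu * np.eye(Z.shape[1])
```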

We conduct our experiments on four different datasets: (i) colon-cancer dataset [40], (ii) Covertype dataset [41], (iii) GISETTE handwritten digits classification dataset from the NIPS 2003 feature selection challenge [42] and (iv) MNIST dataset of handwritten digits [43]. We compare the performance of DFP, BFGS, Newton's method, and gradient descent. We initialize all the algorithms with the same initial point \(x_0 = c\,\mathbf {1}\), where \(c > 0\) is a tuned parameter and \(\mathbf {1} \in {\mathbb {R}}^d\) is the all-ones vector. We set the initial Hessian inverse approximation matrix as \(\nabla ^2{f(x_0)}^{-1}\) for the DFP and BFGS methods. The step size is 1 for DFP, BFGS, and Newton's method. The step size of the gradient descent method is tuned by hand to achieve the best performance on each dataset.

All the parameters (sample size N, dimension d, initial point parameter c and regularization \(\mu \)) of these different datasets are provided in Table 1. Notice that the initial point parameter c is selected from the set \({\mathcal {A}} = \{ {0.001}, {0.01}, 0.1, 1, 10\}\) to guarantee that the initial point \(x_0\) is close enough to the optimal solution \(x_*\) so that we can achieve the superlinear convergence rate of DFP and BFGS on each dataset. The regularization parameter \(\mu \) is also chosen from the same set \({\mathcal {A}}\) to obtain the best performance on each dataset.

Table 1 Sample size N, dimension d, initial point parameter c and regularization \(\mu \) of each dataset

From the theoretical results of Sects. 4.3 and 5, we expect the iterates \(\{x_k\}_{k = 0}^{\infty }\) generated by the DFP method and the BFGS method to satisfy the following superlinear convergence rate

$$\begin{aligned} \frac{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert } \le \left( \frac{1}{\sqrt{k}}\right) ^{k}, \qquad \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le 1.1\left( \frac{1}{k}\right) ^{k}, \qquad \forall k \ge 1. \end{aligned}$$

Hence, in our numerical experiments, we compare the convergence rate of \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) with \((\frac{1}{\sqrt{k}})^{k}\) and the convergence rate of \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) with \((\frac{1}{k})^{k}\) to check the tightness of our theoretical bounds. Our numerical experiments are shown in Figs. 1, 2, 3 and 4 for different datasets. Note that for each problem, we present two plots. The left plot (plot (a)) showcases \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) for different algorithms as well as our theoretical bound \((\frac{1}{\sqrt{k}})^{k}\). In the right plot (plot (b)), we compare \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) for different methods with our theoretical bound \((\frac{1}{k})^{k}\).
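
As a sketch of how the plotted quantities and the theoretical reference curves can be computed, one may reuse the hypothetical `logistic_objective` and `logistic_hessian` helpers from the snippet above, with `x_star` obtained by running Newton's method to high accuracy:

```python
import numpy as np
from scipy.linalg import sqrtm

def convergence_curves(xs, x_star, Z, y, mu):
    """xs: list of iterates of one method; x_star: high-accuracy minimizer."""
    H_half = np.real(sqrtm(logistic_hessian(x_star, Z, y, mu)))
    err = np.array([np.linalg.norm(H_half @ (x - x_star)) for x in xs])
    gap = np.array([logistic_objective(x, Z, y, mu)
                    - logistic_objective(x_star, Z, y, mu) for x in xs])
    k = np.arange(1, len(xs))                 # iterations k = 1, ..., len(xs) - 1
    bound_norm = (1.0 / np.sqrt(k)) ** k      # reference curve (1/sqrt(k))^k
    bound_value = 1.1 * (1.0 / k) ** k        # reference curve 1.1 (1/k)^k
    return err / err[0], gap / gap[0], bound_norm, bound_value
```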

Fig. 1 Convergence rates of logistic regression on the Colon-cancer dataset

Fig. 2 Convergence rates of logistic regression on the Covertype dataset

Fig. 3 Convergence rates of logistic regression on the GISETTE dataset

Fig. 4 Convergence rates of logistic regression on the MNIST dataset

We observe that, for both the DFP and BFGS methods, \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) is bounded above by \((\frac{1}{\sqrt{k}})^{k}\) and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) is bounded above by \((\frac{1}{k})^{k}\). These experimental results are therefore consistent with our theoretical superlinear convergence rates for quasi-Newton methods.

8 Conclusion

In this paper, we studied the local convergence rate of the convex Broyden class of quasi-Newton methods which includes the DFP and BFGS methods. We focused on two settings: (i) the objective function is \(\mu \)-strongly convex, its gradient is L-Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution with parameter M, (ii) the objective function is self-concordant. For these two settings we characterized the explicit non-asymptotic superlinear convergence rate of Broyden class of quasi-Newton methods. In particular, for the first setting, we showed that if the initial distance to the optimal solution is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{\mu ^{\frac{3}{2}}}{M})\) and the initial Hessian approximation error is \(\Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F = {\mathcal {O}}(1)\), the iterations generated by the DFP and BFGS methods converge to the optimal solution at a superlinear rate of \(\frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } = {\mathcal {O}}{\left( \frac{1}{\sqrt{k}}\right) ^{k}}\) and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} = {\mathcal {O}}{\left( \frac{1}{k}\right) ^{k}}\). We further showed that we can achieve the same superlinear convergence rate if the initial error is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{\mu ^\frac{3}{2}}{M\sqrt{d}})\) and the initial Hessian approximation matrix is \(B_0 = \nabla ^2{f(x_0)}\). We proved similar convergence rate results for the second setting where the objective function is self-concordant.