Abstract
In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton algorithms, which includes the Davidon–Fletcher–Powell (DFP) method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied in the literature, but their explicit finite-time local convergence rate has not been fully investigated. In this paper, we provide a finite-time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of \((1/k)^{k/2}\), where k is the number of iterations. We also prove that a similar local superlinear convergence result holds when the objective function is self-concordant. Numerical experiments on several datasets confirm our explicit convergence rate bounds. Our theoretical guarantee is among the first results to provide a non-asymptotic superlinear convergence rate for quasi-Newton methods.
1 Introduction
In this paper, we focus on the non-asymptotic convergence analysis of quasi-Newton methods for the problem of minimizing a convex function \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\), i.e.,

\(\min _{x \in {\mathbb {R}}^{d}} f(x).\)
Specifically, we focus on two different settings. In the first case, we assume that the objective function f is strongly convex, smooth (its gradient is Lipschitz continuous), and its Hessian is Lipschitz continuous at the optimal solution. In the second case, we study the setting where the objective function f is self-concordant. We formally define these settings in the following sections. In both considered cases, the optimal solution is unique and denoted by \(x_*\).
There is an extensive literature on the use of first-order methods for convex optimization, and it is well-known that the best achievable convergence rate for first-order methods, when the objective function is strongly convex and smooth, is a linear convergence rate. Specifically, we say a sequence \(\{x_k\}\) converges linearly if \(\Vert x_k - x_*\Vert \le C\gamma ^k\Vert x_0 - x_*\Vert \), where \(\gamma \in (0, 1)\) is the constant of linear convergence, and C is a constant possibly depending on problem parameters. Among first-order methods, the accelerated gradient method proposed in [1] achieves a fast linear convergence rate of \((1-\sqrt{{\mu }/{L}})^{k/2}\), where \(\mu \) is the strong convexity parameter and L is the smoothness parameter (the Lipschitz constant of the gradient) [2]. It is also known that the convergence rate of the accelerated gradient method is optimal for first-order methods in the setting that the problem dimension d is sufficiently larger than the number of iterations [3].
Classical alternatives to improve the convergence rate of first-order methods are second-order methods [4,5,6,7] and in particular Newton’s method. It has been shown that if in addition to smoothness and strong convexity assumptions, the objective function f has a Lipschitz continuous Hessian, then the iterates generated by Newton’s method converge to the optimal solution at a quadratic rate in a local neighborhood of the optimal solution; see [8, Chapter 9]. A similar result has been established for the case that the objective function is self-concordant [9]. Despite the fact that the quadratic convergence rate of Newton’s method holds only in a local neighborhood of the optimal solution, it could reduce the overall number of iterations significantly as it is substantially faster than the linear rate of first-order methods. The fast quadratic convergence rate of Newton’s method, however, does not come for free. Implementation of Newton’s method requires solving a linear system at each iteration with the matrix defined by the objective function Hessian \(\nabla ^2 f(x)\). As a result, the computational cost of implementing Newton’s method in high-dimensional problems is prohibitive, as it could be \({\mathcal {O}}(d^3)\), unlike first-order methods that have a per iteration cost of \({\mathcal {O}}(d)\).
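The local quadratic convergence of Newton's method described above is easy to observe numerically. The following sketch (our illustration, not from the paper) runs Newton's method on a toy one-dimensional strongly convex function \(f(x) = x^2/2 + \log (1 + e^x)\), which has \(1 \le f''(x) \le 1.25\) and a Lipschitz continuous second derivative. In one dimension the linear system reduces to a scalar division; the \({\mathcal {O}}(d^3)\) cost arises from this solve in higher dimensions.

```python
import math

# Toy objective f(x) = x^2/2 + log(1 + e^x): strongly convex (f'' >= 1),
# with Lipschitz gradient (f'' <= 1.25) and Lipschitz Hessian.
def grad(x):
    return x + 1.0 / (1.0 + math.exp(-x))

def hess(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return 1.0 + s * (1.0 - s)

x = 2.0
residuals = []
for _ in range(8):
    g = grad(x)
    residuals.append(abs(g))
    x -= g / hess(x)  # Newton step; in d dimensions this is a linear solve
```

After only a handful of iterations the gradient norm is at machine-precision level, reflecting the quadratic contraction of the error once the iterate enters the local neighborhood.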
Quasi-Newton algorithms are quite popular since they serve as a middle ground between first-order methods and Newton-type algorithms. They improve the linear convergence rate of first-order methods and achieve a local superlinear rate, while their computational cost per iteration is \({\mathcal {O}}(d^2)\) instead of \({\mathcal {O}}(d^3)\) of Newton’s method. The main idea of quasi-Newton methods is to approximate the step of Newton’s method without computing the objective function Hessian \(\nabla ^2 f(x)\) or its inverse \(\nabla ^2 f(x)^{-1}\) at every iteration [10, Chapter 6]. To be more specific, quasi-Newton methods aim at approximating the curvature of the objective function by using only first-order information of the function, i.e., its gradients \(\nabla f(x)\); see Sect. 2 for more details. There are several different approaches for approximating the objective function Hessian and its inverse using first-order information, which lead to different quasi-Newton updates, but perhaps the most popular quasi-Newton algorithms are the Symmetric Rank-One (SR1) method [11], the Broyden method [12,13,14], the Davidon-Fletcher-Powell (DFP) method [15, 16], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [17,18,19,20], and the limited-memory BFGS (L-BFGS) method [21, 22].
As mentioned earlier, a major advantage of quasi-Newton methods is their asymptotic local superlinear convergence rate. More precisely, we say that the sequence \(\{x_k\}\) converges to the optimal solution \(x_*\) superlinearly when the ratio between the distances to the optimal solution at times \(k+1\) and k approaches zero as k approaches infinity, i.e.,

\(\lim _{k \rightarrow \infty } \frac{\Vert x_{k+1} - x_*\Vert }{\Vert x_k - x_*\Vert } = 0.\)
For various settings, this superlinear convergence result has been established for a large class of quasi-Newton methods, including the Broyden method [13, 17, 23], the DFP method [13, 24, 25], the BFGS method [13, 25,26,27], and several other variants of these algorithms [28,29,30,31,32,33,34]. Although this guarantee is promising, as the superlinear rate lies between the linear rate of first-order methods and the quadratic rate of Newton’s method, it only holds asymptotically and does not characterize an explicit upper bound on the error of quasi-Newton methods after a finite number of iterations. As a result, the overall complexity of quasi-Newton methods for achieving an \(\epsilon \)-accurate solution, i.e., \(\Vert x_k - x_*\Vert \le \epsilon \), cannot be explicitly characterized. Hence, it is essential to establish a non-asymptotic convergence rate for quasi-Newton methods, which is the main goal of this paper.
In this paper, we show that if the initial iterate is close to the optimal solution and the initial Hessian approximation error is sufficiently small, then the iterates of the convex Broyden class including both the DFP and BFGS methods converge to the optimal solution at a superlinear rate of \((1/k)^{k/2}\). We further show that our theoretical result suggests a trade-off between the size of the superlinear convergence neighborhood and the rate of superlinear convergence. In other words, one can improve the numerical constant in the above rate at the cost of reducing the radius of the neighborhood in which DFP and BFGS converge superlinearly. We believe that our theoretical guarantee provides one of the first non-asymptotic results for the superlinear convergence rate of BFGS and DFP.
Related work In a recent work [35], the authors studied the non-asymptotic analysis of a class of greedy quasi-Newton methods that are based on the update rule of the Broyden family and use greedily selected basis vectors for updating Hessian approximations. In particular, they show a superlinear convergence rate of \((1-\frac{\mu }{dL})^{k^2/2}(\frac{dL}{\mu })^k\) for this class of algorithms. However, greedy quasi-Newton methods are more computationally costly than standard quasi-Newton methods, as they require computing a greedily selected basis vector at each iteration. It is worth noting that such computation requires access to additional information beyond the objective function gradient, e.g., the diagonal components of the Hessian. Also, two recent concurrent papers study the non-asymptotic superlinear convergence rate of the DFP and BFGS methods [36, 37]. In [36], the authors show that when the objective function is smooth, strongly convex, and strongly self-concordant, the iterates of BFGS and DFP, in a local neighborhood of the optimal solution, achieve the superlinear convergence rate of \((\frac{dL}{\mu k})^{k/2}\) and \((\frac{dL^2}{\mu ^2 k})^{k/2}\), respectively. In their follow-up paper [37], they improve the superlinear convergence results to \([e^{\frac{d}{k}\ln {\frac{L}{\mu }}} - 1]^{k/2}\) and \([\frac{L}{\mu }(e^{\frac{d}{k}\ln {\frac{L}{\mu }}} - 1)]^{k/2}\), respectively. We would like to highlight that the proof techniques, assumptions, and final theoretical results of [36, 37] and our paper are different and derived independently. The major difference in the analysis is that in [36, 37], the authors use a potential function related to the trace and the logarithm of the determinant of the Hessian approximation matrix, while we use a Frobenius norm potential function. In addition, our convergence rates for both DFP and BFGS are independent of the problem dimension d.
Nevertheless, in our results, the neighborhood of superlinear convergence depends on d. Moreover, our results are derived under two settings: in the first, the objective function is strongly convex, smooth, and has a Lipschitz continuous Hessian at the optimal solution; in the second, the function is self-concordant. Both of these settings are more general than the setting in [36, 37], which requires the objective function to be strongly convex, smooth, and strongly self-concordant.
Outline In Sect. 2, we discuss the Broyden class of quasi-Newton methods, DFP and BFGS. In Sect. 3, we state our assumptions and notation, as well as some general technical lemmas. Then, in Sect. 4, we present the main theoretical results of our paper on the non-asymptotic superlinear convergence of DFP and BFGS for the setting that the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Sect. 5, we extend our theoretical results to the class of self-concordant functions, by exploiting the proof techniques developed in Sect. 4. In Sect. 6, we provide a detailed discussion on the advantages and drawbacks of our theoretical results and compare them with some concurrent works. In Sect. 7, we numerically evaluate the performance of DFP and BFGS on several datasets and compare their convergence rates with our theoretical bounds. Finally, in Sect. 8, we close the paper with some concluding remarks.
Notation For a vector \(v\in {\mathbb {R}}^{d}\), its Euclidean norm (\(\ell _2\) norm) is denoted by \(\Vert v\Vert \). We denote the Frobenius norm of a matrix \(A \in {\mathbb {R}}^{d \times d}\) by \(\Vert A\Vert _F = \sqrt{\sum _{i = 1}^d\sum _{j = 1}^d A_{ij}^2}\), and its induced 2-norm is denoted by \(\Vert A\Vert = \max _{\Vert v\Vert =1}\Vert Av\Vert \). The trace of a matrix A, which is the sum of its diagonal elements, is denoted by \(\mathrm {Tr}\left( A\right) \). For any two symmetric matrices \(A, B \in {\mathbb {R}}^{d \times d}\), we write \(A \preceq B\) if and only if \(B - A\) is a symmetric positive semidefinite matrix.
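As a quick illustration of this notation (an illustrative addition, not part of the paper), the following sketch computes the Frobenius norm, the induced 2-norm, the trace, and checks the Loewner order \(A \preceq B\) for a small pair of symmetric matrices:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
B = A + np.eye(2)  # B - A = I is positive semidefinite, so A ⪯ B

fro = np.sqrt((A ** 2).sum())   # Frobenius norm, directly from the definition
two = np.linalg.norm(A, 2)      # induced 2-norm = largest singular value
tr = float(np.trace(A))         # trace = sum of diagonal entries

# Loewner order A ⪯ B holds iff all eigenvalues of B - A are nonnegative.
a_preceq_b = bool(np.all(np.linalg.eigvalsh(B - A) >= 0))
```

Note that the induced 2-norm never exceeds the Frobenius norm, a fact used implicitly later in the analysis (\(\Vert A\Vert \le \Vert A\Vert _F\)).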
2 Quasi-Newton methods
In this section, we review standard quasi-Newton methods, and, in particular, we discuss the updates of the DFP and BFGS algorithms. Consider a time index k, a step size \(\eta _k\), and a positive-definite matrix \(B_k\) to define a generic descent algorithm through the iteration

\(x_{k+1} = x_k - \eta _k B_k^{-1} \nabla f(x_k).\)
Note that if we simply replace \(B_k\) by the identity matrix I, we recover the update of gradient descent, and if we replace it by the objective function Hessian \(\nabla ^2 f(x_k)\), we obtain the update of Newton’s method. The main goal of quasi-Newton methods is to find a symmetric positive-definite matrix \(B_k\) using only first-order information such that \(B_k\) is close to the Hessian \(\nabla ^2 f(x_k)\). Note that the step size \(\eta _k\) is often computed according to a line search routine for the global convergence of quasi-Newton methods. Our focus in this paper, however, is on the local convergence of quasi-Newton methods, which requires the unit step size \(\eta _k=1\). Hence, in the rest of the paper, we assume that the iterate \(x_k\) is sufficiently close to the optimal solution \(x_*\) and the step size is \(\eta _k = 1\).
In most quasi-Newton methods, the function’s curvature is approximated in a way that it satisfies the secant condition. To better explain this property, let us first define the variable difference \(s_k\) and gradient difference \(y_k\) as

\(s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).\)
The goal is to find a matrix \(B_{k+1}\) that satisfies the secant condition \( B_{k+1} s_k = y_k\). The rationale for satisfying the secant condition is that the Hessian \(\nabla ^2 f(x_k)\) approximately satisfies this condition when \(x_{k+1}\) and \(x_k\) are close to each other, e.g., they are both close to the optimal solution \(x_*\). However, the secant condition alone is not sufficient to specify \(B_{k+1}\). To resolve this indeterminacy, different quasi-Newton algorithms consider different additional conditions. One common constraint is to enforce the Hessian approximation (or its inverse) at time \(k+1\) to be close to the one computed at time k. This is a reasonable extra condition as we expect the Hessian (or its inverse) evaluated at \(x_{k+1}\) to be close to the one computed at \(x_k\).
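The rationale above can be made concrete: for a quadratic \(f(x) = \frac{1}{2}x^\top A x\), the gradient difference satisfies \(y_k = A s_k\) exactly, so the true Hessian itself satisfies the secant condition. A minimal sketch (our illustration, with an arbitrary SPD matrix standing in for the Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)       # SPD matrix: Hessian of f(x) = 0.5 x^T A x

x_k = rng.standard_normal(d)
x_next = x_k + 0.1 * rng.standard_normal(d)
s_k = x_next - x_k
y_k = A @ x_next - A @ x_k    # gradient difference for the quadratic f

# For a quadratic, the Hessian satisfies the secant condition exactly, so
# any B_{k+1} with B_{k+1} s_k = y_k matches the true curvature along s_k.
secant_gap = np.linalg.norm(A @ s_k - y_k)
```

For non-quadratic functions the relation holds only approximately, with an error controlled by how far apart \(x_{k+1}\) and \(x_k\) are, which is exactly the local regime studied in this paper.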
In the DFP method, we enforce the proximity condition on Hessian approximations \(B_k\) and \(B_{k+1}\). Basically, we aim to find the closest positive-definite matrix to \(B_k\) (in some weighted matrix norm) that satisfies the secant condition; see Chapter 6 of [10] for more details. The update of the Hessian approximation matrices of DFP is given by

\(B_{k+1} = \left( I - \frac{y_k s_k^\top }{s_k^\top y_k}\right) B_k \left( I - \frac{s_k y_k^\top }{s_k^\top y_k}\right) + \frac{y_k y_k^\top }{s_k^\top y_k}.\)
Since implementation of the update in (1) requires access to the inverse of the Hessian approximation, it is essential to derive an explicit update for the Hessian inverse approximation to avoid the cost of inverting a matrix at each iteration. If we define \(H_k\) as the inverse of \(B_k\), i.e., \(H_k=B_k^{-1}\), using the Sherman-Morrison-Woodbury formula, one can write

\(H_{k+1} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top }{s_k^\top y_k}.\)
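The DFP pair of updates can be verified numerically. The sketch below (our illustration, using the standard textbook forms of the DFP updates) checks that the updated matrix satisfies the secant condition, that the inverse update indeed produces \(H_{k+1} = B_{k+1}^{-1}\), and that positive definiteness is preserved when \(s_k^\top y_k > 0\):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
B = M @ M.T + np.eye(d)               # SPD Hessian approximation B_k
H = np.linalg.inv(B)                  # H_k = B_k^{-1}
s = rng.standard_normal(d)
y = (M @ M.T + 2.0 * np.eye(d)) @ s   # y = (SPD matrix) s guarantees s^T y > 0

def dfp_B(B, s, y):
    # Standard DFP update for the Hessian approximation.
    rho = 1.0 / (s @ y)
    E = np.eye(len(s)) - rho * np.outer(y, s)
    return E @ B @ E.T + rho * np.outer(y, y)

def dfp_H(H, s, y):
    # DFP update for the inverse approximation (Sherman-Morrison-Woodbury form).
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

B_next = dfp_B(B, s, y)
H_next = dfp_H(H, s, y)
```

The inverse update costs only matrix-vector products and rank-one corrections, i.e., \({\mathcal {O}}(d^2)\) per iteration, which is the point of maintaining \(H_k\) directly.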
The BFGS method can be considered as the dual of DFP. In BFGS, we also seek a positive-definite matrix that satisfies the secant condition, but instead of forcing the proximity condition on the Hessian approximation B, we enforce it on the Hessian inverse approximation H. To be more precise, we aim to find a positive-definite matrix \(H_{k+1}\) that satisfies the secant condition \( s_k = H_{k+1} y_k\) and is the closest matrix (in some weighted norm) to the previous Hessian inverse approximation \(H_k\). The update of the Hessian inverse approximation matrices of BFGS is given by

\(H_{k+1} = \left( I - \frac{s_k y_k^\top }{s_k^\top y_k}\right) H_k \left( I - \frac{y_k s_k^\top }{s_k^\top y_k}\right) + \frac{s_k s_k^\top }{s_k^\top y_k}.\)
Similarly, by the Sherman-Morrison-Woodbury formula, the update of the BFGS method for the Hessian approximation matrices is given by

\(B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top }{s_k^\top y_k}.\)
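The duality with DFP can likewise be checked numerically. The following sketch (our illustration, standard textbook BFGS formulas) verifies the inverse secant condition \(s_k = H_{k+1} y_k\) and the Sherman-Morrison-Woodbury relation between the two BFGS updates:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
H = np.linalg.inv(M @ M.T + np.eye(d))   # SPD inverse approximation H_k
s = rng.standard_normal(d)
y = (M @ M.T + 2.0 * np.eye(d)) @ s      # guarantees s^T y > 0

def bfgs_H(H, s, y):
    # Standard BFGS update for the Hessian inverse approximation.
    rho = 1.0 / (s @ y)
    E = np.eye(len(s)) - rho * np.outer(s, y)
    return E @ H @ E.T + rho * np.outer(s, s)

def bfgs_B(B, s, y):
    # Dual BFGS update for the Hessian approximation itself.
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

H_next = bfgs_H(H, s, y)
B_next = bfgs_B(np.linalg.inv(H), s, y)
```

Comparing with the DFP sketch above, the roles of \((B, s)\) and \((H, y)\) are exactly interchanged, which is the sense in which BFGS is the dual of DFP.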
Note that both DFP and BFGS belong to a more general class of quasi-Newton methods called the Broyden class. The Hessian approximation \(B_{k+1}\) of the Broyden class is defined as

\(B_{k+1} = \phi _k B^{\text {DFP}}_{k+1} + (1 - \phi _k) B^{\text {BFGS}}_{k+1},\)
and the Hessian inverse approximation is defined as

\(H_{k+1} = \psi _k H^{\text {BFGS}}_{k+1} + (1 - \psi _k) H^{\text {DFP}}_{k+1},\)
where \(\phi _k, \psi _k \in {\mathbb {R}}\). In this paper, we only focus on the convex class of Broyden quasi-Newton methods, where \(\phi _k, \psi _k \in [0, 1]\). The steps of this class of methods are summarized in Algorithm 1. In fact, in Algorithm 1, if we set \(\psi _k = 0\), we recover DFP, and if we set \(\psi _k = 1\), we recover BFGS. It is worth noting that the cost of computing the descent direction \(H_k \nabla f(x_k)\) for this class of quasi-Newton methods is \({\mathcal {O}}(d^2)\), which improves upon the \({\mathcal {O}}(d^3)\) per-iteration cost of Newton’s method.
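Putting the pieces together, a minimal sketch of the convex Broyden class with unit step size on a quadratic objective is given below (our illustration; the convex combination of the DFP and BFGS inverse updates is one convenient parameterization, consistent with \(\psi _k = 0\) recovering DFP and \(\psi _k = 1\) recovering BFGS). The quadratic is chosen with eigenvalues close to one so that unit steps are well inside the local regime:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
C = rng.standard_normal((d, d)); C = C @ C.T
A = np.eye(d) + 0.1 * C / np.linalg.norm(C, 2)   # SPD, eigenvalues in [1, 1.1]
b = rng.standard_normal(d)
grad = lambda x: A @ x - b                        # f(x) = 0.5 x^T A x - b^T x
x_star = np.linalg.solve(A, b)

def dfp_H(H, s, y):
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

def bfgs_H(H, s, y):
    rho = 1.0 / (s @ y)
    E = np.eye(len(s)) - rho * np.outer(s, y)
    return E @ H @ E.T + rho * np.outer(s, s)

x, H = np.zeros(d), np.eye(d)   # H_0 = I plays the role of B_0^{-1}
psi = 0.5                       # psi = 0 -> DFP, psi = 1 -> BFGS
for _ in range(40):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:
        break                   # s_k = 0 would mean x_k = x_* (see Remark 1)
    s = -H @ g                  # unit step size, as in the local analysis
    x_new = x + s
    y = grad(x_new) - grad(x)
    H = psi * bfgs_H(H, s, y) + (1 - psi) * dfp_H(H, s, y)
    x = x_new
```

Each iteration uses only matrix-vector products and rank-one updates on \(H\), so the per-iteration cost is \({\mathcal {O}}(d^2)\) as claimed.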
Remark 1
Note that when \(s_k = 0\), we have \(\nabla {f(x_k)} = 0\) from (1) and thus \(x_k = x_*\). Hence, in our implementation and analysis we assume \(s_k \ne 0\). Moreover, in both considered settings, the objective function is at least strictly convex. As a result, if \(s_k \ne 0\), then it follows that \(y_k \ne 0\) and \(s_k^\top y_k > 0\). This observation shows that the updates of BFGS and DFP are well-defined. Finally, it is well-known that for the convex class of Broyden methods if \(B_{k}\) is symmetric positive-definite and \(s_k^\top y_k > 0\), then \(B_{k+1}\) is also symmetric positive-definite [10]. In Algorithm 1, we assume that the initial Hessian approximation \(B_0\) is symmetric positive-definite, and, hence, all Hessian approximation matrices \(B_k\) and their inverse matrices \(H_k\) are symmetric positive-definite.
3 Preliminaries
In this section, we first specify the required assumptions for our results in Sect. 4 and introduce some notation to simplify our expressions. Moreover, we present some intermediate lemmas that will be used later in Sect. 4 to prove our main theoretical results for the setting that the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Sect. 5, we will use a subset of these intermediate results to extend our analysis to the class of self-concordant functions.
3.1 Assumptions
We next state the required assumptions for establishing our theoretical results in Sect. 4.
Assumption 3.1
The objective function f(x) is twice-differentiable. Moreover, the function f(x) is strongly convex with parameter \(\mu > 0\), i.e.,

\(f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu }{2}\Vert y - x\Vert ^2, \qquad \forall \, x, y \in {\mathbb {R}}^{d}.\)
Assumption 3.2
The gradient of the objective function f(x) is Lipschitz continuous with parameter \(L > 0\), i.e.,

\(\Vert \nabla f(x) - \nabla f(y)\Vert \le L\Vert x - y\Vert , \qquad \forall \, x, y \in {\mathbb {R}}^{d}.\)
As f is twice-differentiable, Assumptions 3.1 and 3.2 imply that the eigenvalues of the Hessian are bounded below by \(\mu \) and above by L, i.e., \(\mu I \preceq \nabla ^{2}{f(x)} \preceq LI, \forall x \in {\mathbb {R}}^{d}\). Note that for our main theoretical results, we only require Assumption 3.1, but to compare our results with other theoretical bounds we will use the condition in Assumption 3.2 in our discussions.
Assumption 3.3
The Hessian \(\nabla ^2 f(x)\) satisfies the following condition for some constant \(M\ge 0\),

\(\Vert \nabla ^2 f(x) - \nabla ^2 f(x_*)\Vert \le M\Vert x - x_*\Vert , \qquad \forall \, x \in {\mathbb {R}}^{d}.\)
The condition in Assumption 3.3 is common for analyzing second-order methods as we require a regularity condition on the objective function Hessian. In fact, Assumption 3.3 is one of the least strict conditions required for the analysis of second-order type methods as it requires Lipschitz continuity of the Hessian only at (near) the optimal solution. This condition is, indeed, weaker than assuming that the Hessian is Lipschitz continuous everywhere. Note that for the class of strongly convex and smooth functions, the strong self-concordance assumption required in [36, 37] is equivalent to assuming that the Hessian is Lipschitz continuous everywhere. Hence, the condition in Assumption 3.3 is also weaker than the one in [36, 37]. Assumption 3.3 leads to the following corollary.
Corollary 1
If the condition in Assumption 3.3 holds, then for all \( x, y \in {\mathbb {R}}^{d}\), we have
Proof
Check Appendix A. \(\square \)
Remark 2
Our analysis can be extended to the case that Assumptions 3.1, 3.2 and 3.3 only hold in a local neighborhood of the optimal solution \(x_{*}\). Here, we assume they hold in \({\mathbb {R}}^{d}\) to simplify our proofs.
3.2 Notations
Next, we briefly mention some of the definitions and notation that will be used in the following theorems and proofs. We consider \(\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) as the square roots of the matrices \(\nabla ^{2}f(x_*)\) and \(\nabla ^{2}f(x_*)^{-1}\), i.e., \(\nabla ^{2}f(x_*) = \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-1} = \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). By Assumption 3.1, both \(\nabla ^{2}f(x_*)^{\frac{1}{2}}\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) are symmetric positive-definite. Throughout the paper, we analyze and study the weighted version of the Hessian approximation, \({\hat{B}}_k\), defined as

\({\hat{B}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}} B_k \nabla ^{2}f(x_*)^{-\frac{1}{2}}.\)
\({\hat{B}}_k\) is symmetric positive-definite, since \(B_k\) and \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) are both symmetric positive-definite. We also use \(\Vert {\hat{B}}_k - I\Vert _F\) as the measure of closeness between \(B_k\) and \(\nabla ^{2}f(x_*)\), which can be written as

\(\Vert {\hat{B}}_k - I\Vert _F = \Vert \nabla ^{2}f(x_*)^{-\frac{1}{2}}\left( B_k - \nabla ^{2}f(x_*)\right) \nabla ^{2}f(x_*)^{-\frac{1}{2}}\Vert _F.\)
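The weighted quantities can be formed numerically via an eigendecomposition of the Hessian at the optimum. The following sketch (our illustration; an arbitrary SPD matrix stands in for \(\nabla ^{2}f(x_*)\)) builds the square-root matrices and the weighted approximation error:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
M = rng.standard_normal((d, d))
Hess_star = M @ M.T + np.eye(d)                   # stands in for the Hessian at x_*
w, V = np.linalg.eigh(Hess_star)
sqrt_H = V @ np.diag(np.sqrt(w)) @ V.T            # the matrix square root
inv_sqrt_H = V @ np.diag(1.0 / np.sqrt(w)) @ V.T  # inverse square root

B = Hess_star + 0.2 * np.eye(d)                   # an SPD Hessian approximation B_k
B_hat = inv_sqrt_H @ B @ inv_sqrt_H               # weighted approximation
err = np.linalg.norm(B_hat - np.eye(d), 'fro')    # ||B̂_k - I||_F
```

The weighting makes the error measure invariant to the conditioning of the Hessian at the optimum: \({\hat{B}}_k = I\) exactly when \(B_k = \nabla ^{2}f(x_*)\).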
We further define the weighted gradient difference \({\hat{y}}_k\), the weighted variable difference \({\hat{s}}_k\), and the weighted gradient \(\widehat{\nabla {f}}(x_k)\) as

\({\hat{y}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}} y_k, \qquad {\hat{s}}_k = \nabla ^{2}f(x_*)^{\frac{1}{2}} s_k, \qquad \widehat{\nabla {f}}(x_k) = \nabla ^{2}f(x_*)^{-\frac{1}{2}} \nabla f(x_k).\)
To measure closeness to the optimal solution for iterate \(x_k\), we use \(r_k\in {\mathbb {R}}^d\), \(\sigma _k \in {\mathbb {R}}\), and \(\tau _k \in {\mathbb {R}}\) which are formally defined as
In (16), \(\mu \) is the strong convexity parameter defined in Assumption 3.1 and M is the Lipschitz continuity parameter of the Hessian at the optimal solution defined in Assumption 3.3. In our analysis, we also use the average Hessian \(J_k\) and its weighted version \({\hat{J}}_k\) that are formally defined as
3.3 Intermediate Lemmas
Next, we present some lemmas that we will later use to establish the non-asymptotic superlinear convergence of DFP and BFGS. Proofs of these lemmas are relegated to the appendix.
Lemma 1
For any matrix \(A \in {\mathbb {R}}^{d \times d}\) and vector \(u \in {\mathbb {R}}^{d}\) with \(\Vert u\Vert = 1\), we have
Proof
Check Appendix B. \(\square \)
Lemma 2
For any matrices \(A, B \in {\mathbb {R}}^{d \times d}\), we have
Proof
Check Appendix C. \(\square \)
The results in Lemma 1 and Lemma 2 hold for arbitrary matrices. The next lemma focuses on some properties of the weighted average Hessian \({\hat{J}}_k\) under Assumptions 3.1 and 3.3.
Lemma 3
Recall the definition of \(\sigma _k\) in (16) and \({\hat{J}}_k\) in (17). Suppose \(\alpha _k \in [0, 1]\) and define the matrix \(H_k = \nabla ^2{f(x_* + \alpha _k(x_k - x_*))}\) (with a slight abuse of notation, as \(H_k\) in this lemma is unrelated to the Hessian inverse approximation) and \({\hat{H}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). If Assumptions 3.1 and 3.3 hold, then the following inequalities hold for all \(k \ge 0\),
Proof
Check Appendix D. \(\square \)
In the following lemma, we establish some bounds that depend on the weighted gradient difference \({\hat{y}}_k\) and the weighted variable difference \({\hat{s}}_k\).
Lemma 4
Recall the definitions in (13–16). If Assumptions 3.1 and 3.3 hold, then the following inequalities hold for all \(k \ge 0\),
Proof
Check Appendix E. \(\square \)
4 Main theoretical results
In this section, we characterize the non-asymptotic superlinear convergence of the Broyden class of quasi-Newton methods, when Assumptions 3.1, 3.2 and 3.3 hold. In Sect. 4.1, we first establish a crucial proposition which characterizes the error of Hessian approximation for this class of quasi-Newton methods. Then, in Sect. 4.2, we leverage this result to show that the iterates of this class of algorithms converge at least linearly to the optimal solution, if the initial distance to the optimal solution and the initial Hessian approximation error are sufficiently small. Finally, we use these intermediate results in Sect. 4.3 to prove that the iterates of the convex Broyden class, including both DFP and BFGS, converge to the optimal solution at a superlinear rate of \((1/k)^{k/2}\). Note that in Algorithm 1 we use the Hessian inverse approximation matrix \(H_k\) to describe the algorithm, but in our analysis we will study the behavior of the Hessian approximation matrix \(B_k\).
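To get a feel for how fast the \((1/k)^{k/2}\) factor shrinks compared with a typical linear rate, the following snippet (an illustrative addition; the linear factor \(0.9^k\) is an arbitrary example) evaluates both contraction factors:

```python
import math

def linear_rate(k, gamma=0.9):
    # contraction factor of a typical linear rate, gamma^k
    return gamma ** k

def superlinear_rate(k):
    # the (1/k)^{k/2} factor established for the convex Broyden class
    return (1.0 / k) ** (k / 2.0)

# At k = 20 the superlinear factor is already about 1e-13,
# while 0.9^20 is only about 0.12.
gap = linear_rate(20) / superlinear_rate(20)
```

The factor not only decreases with k, but decreases faster with every additional iteration, which is what distinguishes a superlinear rate from any fixed linear rate.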
4.1 Hessian approximation error: Frobenius norm potential function
Next, we use the Frobenius norm of the Hessian approximation error \(\Vert {\hat{B}}_{k} - I\Vert _F\) as the potential function in our analysis. Specifically, we will use the results of Lemma 1, Lemma 2, and Lemma 4 to study the dynamics of the Hessian approximation error \(\Vert {\hat{B}}_{k} - I\Vert _F\) for both DFP and BFGS. We first start with the DFP method.
Lemma 5
Consider the update of DFP in (3) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B^{\text {DFP}}_{k+1}\) generated by the DFP update satisfies the following inequality
where \(W_k = \Vert {\hat{B}}_k\Vert \frac{4}{(1 - \tau _k)^2} + \frac{3 + \tau _k}{1 - \tau _k}\).
Proof
The proof and conclusion of this lemma are similar to the ones in Lemma 3.2 in [33], except the value of parameter \(W_k\). This difference comes from the fact that [33] analyzed the modified DFP update, while we consider the standard DFP method. Recall the DFP update in (3) and multiply both sides of that expression by the matrix \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) from left and right to obtain
where we used the fact that \(s_k^\top y_k = s_k^\top \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}y_k = {\hat{s}}_k^\top {\hat{y}}_k\). To simplify the proof, we use the following notations:
Hence, (26) is equivalent to
Moreover, we can express \( B_+ - I\) as
Notice that \(P^2 = P\) and \(P = P^\top \). Thus, (28) can be simplified as
where
Next, we proceed to upper bound \( \Vert B_+ - I \Vert _F\). To do so, we derive upper bounds on the Frobenius norm of matrices D, E, G and H. We start by \(\Vert D\Vert _F\). If we set \(u=s/\Vert s\Vert \) and \(A=B-I\) in Lemma 1, we obtain that
which implies \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\). Moreover, using the fact that \(a^2 - b^2 \le 2a(a - b)\) we can write
where the second inequality follows from the fact that \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\) and the assumption that \(\Vert B - I\Vert _F \le \delta \). Next, if we replace the right hand side of (29) by its upper bound in (30) and rearrange the resulting expression, we obtain that
which provides an upper bound on \(\Vert D\Vert _F\). To derive upper bounds for \(\Vert E\Vert _F\), \(\Vert G\Vert _F\) and \(\Vert H\Vert _F\), we first need to find an upper bound for \(\Vert Q\Vert _F\), where Q is defined in (27). Note that
where the first equality holds by the definition of Q, the second equality is obtained by adding and subtracting \(\frac{sy^\top }{\Vert s\Vert ^2}\), and the inequality holds due to the triangle inequality. We can further simplify the right hand side as
where the second inequality holds using the Cauchy–Schwarz inequality and the fact that \(\Vert ab^\top \Vert _F = \Vert a\Vert \Vert b\Vert \) for \(a, b \in {\mathbb {R}}^d\), and the last inequality holds due to the results in (21), (22), and (23).
Next using the upper bound in (32) on \(\Vert Q\Vert _F\) we derive an upper bound on \(\Vert E\Vert _F\). Note that
where we used the triangle inequality in the last step. Using the definition of Q we can show that
where for the second inequality we use (32) and \(\Vert ab^\top \Vert _F = \Vert a\Vert \Vert b\Vert \), and for the third inequality we use the results in (21), (22), and (23).
We proceed to derive an upper bound for \(\Vert G\Vert _F\). Note that \(0 \preceq P \preceq I\) and thus \(\Vert P\Vert \le 1\). Using this observation, (32) and the first inequality in (19), we can show that \(\Vert G\Vert _F\) is bounded above by
Finally, we provide an upper bound for \(\Vert H\Vert _F\). By leveraging the second inequality in (19) and the fact that \(\Vert A\Vert \le \Vert A\Vert _F\) for any matrix \(A \in {\mathbb {R}}^{d \times d}\), we can show that
where for the last inequality we used the result in (32).
If we replace \(\Vert D\Vert _F\), \(\Vert E\Vert _F\), \(\Vert G\Vert _F\), and \(\Vert H\Vert _F\) with their upper bounds in (31), (33), (34) and (35), respectively, we obtain that
where \(W = \Vert B\Vert \frac{4}{1 - \tau } + \Vert B\Vert \frac{4\tau }{(1 - \tau )^2} + \frac{3 + \tau }{1 - \tau } = \Vert B\Vert \frac{4}{(1 - \tau )^2} + \frac{3 + \tau }{1 - \tau }\). Considering the notations introduced in (27), the result in (25) follows from the above inequality and the proof is complete. \(\square \)
The result in Lemma 5 shows how the error of Hessian approximation in DFP evolves as we run the updates. Next, we establish a similar result for the BFGS method.
Lemma 6
Consider the update of BFGS in (6) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B^{\text {BFGS}}_{k+1}\) generated by the BFGS update satisfies the following inequality
where \(V_k = \frac{3 + \tau _k}{1 - \tau _k}\).
Proof
The proof of this lemma is adapted from the proof of Lemma 3.6 in [32]. We should also add that our upper bound in (36) improves the bound in [32] as it contains an additional negative term, i.e., \(- \frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{2\delta {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k}\). Recall the BFGS update in (6) and multiply both sides of that expression with \(\nabla ^{2}f(x_*)^{-\frac{1}{2}}\) from left and right to obtain
where we used the fact that \(s_k^\top B_k s_k = s_k^\top \nabla ^{2}f(x_*)^{\frac{1}{2}}\nabla ^{2}f(x_*)^{-\frac{1}{2}}B_k \nabla ^{2}f(x_*)^{-\frac{1}{2}}\nabla ^{2}f(x_*)^{\frac{1}{2}} s_k = {\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k\). To simplify the proof, we use the following notations:
Considering these notations, the expression in (37) can be written as
Moreover, we can show that \( B_+ - I\) is given by
where
To establish an upper bound on \(\Vert B_+ - I \Vert _F\), we find upper bounds on \(\Vert D\Vert ^2_F\) and \(\Vert E\Vert ^2_F\). Using the fact that \(\Vert D\Vert ^2_F=\mathrm {Tr}\left[ DD^\top \right] \) and properties of the trace operator, we can show that
Using the fact that \(\mathrm {Tr}\left( ab^\top \right) = a^\top b\) for any \(a, b \in {\mathbb {R}}^d\) we can write the following simplifications:
Substituting the above simplifications into (39), we obtain that
Next, we proceed to show that the second term on the right hand side of (40), i.e., \(\left( \frac{\Vert Bs\Vert ^2}{s^\top Bs}\right) ^2 - \frac{s^\top B^3s}{s^\top Bs}\), is non-positive. Note that by using the Cauchy–Schwarz inequality, we have
Now by squaring both sides we obtain \( \Vert Bs\Vert ^4 \le \Vert B^{\frac{3}{2}}s\Vert ^2\Vert B^{\frac{1}{2}}s\Vert ^2 = (s^\top B^3 s)(s^\top Bs), \) which implies that
By combining (40) and (41), we obtain that
The above inequality implies that \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\). Moreover, using the fact that \(a^2 - b^2 \le 2a(a - b), \forall a,b \in {\mathbb {R}}\), we can show that
where the second inequality follows from \(\Vert B - I\Vert ^2_F - \Vert D\Vert ^2_F \ge 0\) and the fact that \(\Vert B - I\Vert _F \le \delta \). Now if we combine the results in (42) and (43), we obtain that
which provides an upper bound on \(\Vert D\Vert _F\). Moreover, according to (33), \(\Vert E\Vert _F\) is bounded above by
If we replace \(\Vert D\Vert _F\) and \(\Vert E\Vert _F\) with their upper bounds in (44) and (45), we obtain that
where \(V = \frac{3 + \tau }{1 - \tau }\). Considering the notations in (38), the claim follows from the above inequality. \(\square \)
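The Cauchy–Schwarz step used in the proof above, \(\Vert Bs\Vert ^4 \le (s^\top B^3 s)(s^\top Bs)\), can be sanity-checked numerically for a random SPD matrix (an illustrative check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
M = rng.standard_normal((d, d))
B = M @ M.T + np.eye(d)   # SPD, so B^{1/2} and B^{3/2} are well defined
s = rng.standard_normal(d)

lhs = float(s @ B @ B @ s) ** 2                    # ||B s||^4 = (s^T B^2 s)^2
rhs = float(s @ B @ B @ B @ s) * float(s @ B @ s)  # (s^T B^3 s)(s^T B s)
```

Equality holds only when s is an eigenvector of B; for a generic direction the inequality is strict, which is what makes the corresponding term in (40) strictly negative.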
Now we can combine Lemma 5 and Lemma 6 to derive a bound on the error of Hessian approximation for the (convex) Broyden class of quasi-Newton methods.
Lemma 7
Consider the update of the (convex) Broyden family in (7) and recall the definition of \(\tau _k\) in (16). Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have that \(\tau _k < 1\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B_{k+1}\) generated by (7) satisfies the following inequality
where \(Z_k = \phi _k\Vert {\hat{B}}_k\Vert \frac{4}{(1 - \tau _k)^2} + \frac{3 + \tau _k}{1 - \tau _k}\). We also have that
Proof
Notice that \(B_{k+1} = \phi _k B^{\text {DFP}}_{k+1} + (1 - \phi _k) B^{\text {BFGS}}_{k+1}\). Using this expression and the convexity of the norm, we can show that
By replacing \(\Vert B^{\text {DFP}}_{k+1}-I\Vert _F\) and \(\Vert B^{\text {BFGS}}_{k+1}-I\Vert _F\) with their upper bounds in Lemma 5 and Lemma 6, the claim in (46) follows. Moreover, since \(\phi _k \in [0,1]\), \(\delta > 0\), \(\frac{\Vert ({\hat{B}}_k - I){\hat{s}}_k\Vert ^2}{\Vert {\hat{s}}_k\Vert ^2} \ge 0\) and \(\frac{{\hat{s}}_k^\top ({\hat{B}}_k - I) {\hat{B}}_k ({\hat{B}}_k - I) {\hat{s}}_k}{{\hat{s}}_k^\top {\hat{B}}_k {\hat{s}}_k} \ge 0\), the result in (46) implies (47). \(\square \)
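For concreteness, a minimal sketch of the convex Broyden-class update used above, assuming that (7) coincides with the classical DFP and BFGS formulas for the Hessian approximation (the convex combination itself is stated in the proof; the explicit DFP/BFGS formulas are the standard ones, not reproduced from (7)):

```python
import numpy as np

def broyden_update(B, s, y, phi):
    """Convex Broyden-class update B_next = phi*B_DFP + (1 - phi)*B_BFGS,
    assuming (7) coincides with the classical DFP/BFGS Hessian-approximation
    formulas (only the convex combination itself is taken from the text)."""
    sy = s @ y                       # positive for strongly convex f
    Bs = B @ s
    # BFGS update of the Hessian approximation:
    B_bfgs = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
    # DFP update of the Hessian approximation:
    V = np.eye(len(s)) - np.outer(y, s) / sy
    B_dfp = V @ B @ V.T + np.outer(y, y) / sy
    return phi * B_dfp + (1 - phi) * B_bfgs

# Every member of the family satisfies the secant equation B_next s = y.
rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)          # Hessian of a quadratic model
s = rng.standard_normal(d)
y = H @ s                            # y = Hessian * s for a quadratic f
B = np.eye(d)
for phi in (0.0, 0.5, 1.0):
    assert np.allclose(broyden_update(B, s, y, phi) @ s, y)
```

Both endpoints, and hence any convex combination, satisfy the secant equation, which the final assertions verify on a quadratic model.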
4.2 Linear convergence
In this section, we leverage the results from the previous section on the error of the Hessian approximation to show that if the initial iterate is sufficiently close to the optimal solution and the initial Hessian approximation matrix is close to the Hessian at the optimal solution, then the iterates of BFGS and DFP converge at least linearly to the optimal solution. Moreover, the Hessian approximation matrices always stay close to the Hessian at the optimal solution, and the norms of the Hessian approximation matrices and their inverses are uniformly bounded above. These results are essential for proving our non-asymptotic superlinear convergence results.
Lemma 8
Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1, and recall the definitions in (13–16). Suppose Assumptions 3.1 and 3.3 hold. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
where \(\epsilon , \delta \in (0, \frac{1}{2})\) are such that, for some \(\rho \in (0, 1)\), they satisfy
Then, the sequence of iterates \(\{x_k\}_{k=0}^{+\infty }\) converges to the optimal solution \(x_*\) with
Furthermore, the matrices \(\{B_k\}_{k=0}^{+\infty }\) stay in a neighborhood of \(\nabla ^{2}{f(x_*)}\) defined as
Moreover, the norms \(\{\Vert {\hat{B}}_k\Vert \}_{k=0}^{+\infty }\) and \(\{\Vert {\hat{B}}_k^{-1}\Vert \}_{k=0}^{+\infty }\) are all uniformly bounded above by
Proof
The proof of this lemma is adapted from the proof of Theorem 3.1 in [33]. In [33], the authors prove the results for the modified DFP method, while we consider the more general class of Broyden methods. We will use induction to prove (50), (51) and (52). First consider the base case of \(k = 0\). By the initial condition (48), it is clear that (51) holds for \(k = 0\). From (51) we know that all the eigenvalues of \({\hat{B}}_0\) are in the interval \([1 - 2\delta , 1 + 2\delta ]\). Letting \(\lambda _{max}({\hat{B}}_0)\) and \(\lambda _{min}({\hat{B}}_0)\) denote the largest and smallest eigenvalues of \({\hat{B}}_0\), respectively, we have
Hence, (52) holds for \(k = 0\). Based on Assumptions 3.1 and 3.3 and the definitions in (13–16), we have
Now using the result in (24), and the bounds in (48), (49), (51) and (52) for \(k = 0\), we can write
This indicates that the condition in (50) holds for \(k = 0\). Hence, all the conditions in (50), (51) and (52) hold for \(k = 0\), and the base case of the induction is complete. Now we assume that the conditions in (50), (51) and (52) hold for all \(0 \le k \le t\), where \(t \ge 0\). Our goal is to show that these conditions are also satisfied for the case of \(k = t + 1\). Since (50) holds for all \(0\le k \le t\), we have \(\tau _k = \max \{\sigma _k, \sigma _{k + 1}\} = \sigma _k \le \epsilon < 1\) for \(0 \le k \le t\). Moreover, since the condition in (51) holds for \(0\le k\le t\), we know that \(\Vert {\hat{B}}_k-I\Vert _F \le 2\delta \) for \(0 \le k \le t\). Hence, by (47) in Lemma 7, we obtain that
where \(Z_k = \phi _k\Vert {\hat{B}}_k\Vert \frac{4}{(1 - \sigma _k)^2} + \frac{3 + \sigma _k}{1 - \sigma _k}\). Using (52) and \(\sigma _k \le \epsilon \) for \(0 \le k \le t\), we obtain that
Furthermore, since (48) and (50) hold for \(0 \le k \le t\), we have that
Considering these results we can show that
where we use the definition \(\phi _{\text {max}} = \sup _{k \ge 0}\phi _k\), and the last inequality holds due to the first inequality in (49). By leveraging (56) and (48) and summing both sides of (54) from \(k = 0\) to t, we obtain
which implies that (51) holds for \(k = t + 1\). Applying the same techniques we used in the base case, we can prove that (50) and (52) hold for \(k = t + 1\). Hence, all the claims in (50), (51) and (52) hold for \(k = t + 1\), and our induction step is complete. \(\square \)
4.3 Explicit non-asymptotic superlinear rate
In the previous section, we established local linear convergence of iterates generated by the convex Broyden class including DFP and BFGS. Indeed, these local linear results are not our ultimate goal, as first-order methods are also linearly convergent under the same assumptions. However, the linear convergence is required to establish a local non-asymptotic superlinear convergence result, which is our main contribution. Next, we state the main results of this paper on the non-asymptotic superlinear convergence rate of the convex Broyden class of quasi-Newton methods. To prove this claim, we use the results in Lemma 7 and Lemma 8.
Theorem 1
Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumptions 3.1 and 3.3. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
where \(\epsilon , \delta \in (0, \frac{1}{2})\) are such that, for some \(\rho \in (0, 1)\), they satisfy
Then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the convex Broyden class of quasi-Newton methods converge to \(x_*\) at a superlinear rate of
where \(q = \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}} \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \) and \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\).
Proof
When both conditions (57) and (58) hold, by Lemma 8, the results in (50), (51) and (52) hold. This indicates that for any \(t \ge 0\), we have
Hence, using Lemma 7 for any \(t \ge 0\), we can show that
where \(Z_t = \phi _t\Vert {\hat{B}}_t\Vert \frac{4}{(1 - \sigma _t)^2} + \frac{3 + \sigma _t}{1 - \sigma _t}\). Using (55) and (56), for \(k \ge 0\) we have
Now compute the sum of both sides of (62) from \(t = 0\) to \(k - 1\) to obtain
Regroup the terms and use the results in (57) and (63) to show that
which leads to
Moreover, using the bounds in (52) we can show that
Hence, we have
By combining the bounds in (64) and (65), we obtain
Now by computing the minimum value of the term \(\phi _t + (1 - \phi _t)\frac{1 - 2\delta }{1 + 2\delta }\), we can show
where \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\) and by regrouping the terms, we obtain that
Considering the definition \(q := \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}}\), we can simplify our upper bound as
By using the Cauchy–Schwarz inequality, we obtain that
Note that since \(\phi _k \in [0, 1]\), we have \(q \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \). The result in (66) provides an upper bound on \(\sum _{t = 0}^{k - 1}\frac{\Vert ({\hat{B}}_t - I){\hat{s}}_t\Vert }{\Vert {\hat{s}}_t\Vert } \), which is a crucial term in the remainder of our proof.
Now, note that \(\nabla {f(x_t)} = J_t(x_t - x_*)\), where \(J_t\) is defined in (17). This implies that \( x_t - x_*=J_t^{-1}\nabla {f(x_t)}\) and hence we have
where the third equality holds since \(-B_t s_t = \nabla {f(x_t)} \). Pre-multiply both sides of the above expression by \(\nabla ^2{f(x_*)}^{\frac{1}{2}}\) to obtain
Therefore, we obtain that
From Lemma 3 we know that \(\Vert {\hat{J}}_t^{-1}\Vert \le 1 + \frac{\sigma _t}{2}\) and \(\Vert {\hat{J}}_t - I\Vert \le \frac{\sigma _t}{2}\). Therefore, we have
Also, since \(\sigma _{t+1} \le \rho \sigma _t\) and \(\sigma _t = \frac{M}{\mu ^{\frac{3}{2}}}\Vert r_t\Vert \), we obtain that \(\Vert r_{t+1}\Vert \le \rho \Vert r_t\Vert \). Hence, we can write
Using the expressions in (67) and (68), we can show that \(\frac{\Vert r_{t+1}\Vert }{\Vert r_t\Vert }\) is bounded above by
Compute the sum of both sides of (69) from \(t = 0\) to \(k - 1\) and use \(\sigma _t \le \epsilon \), (63), and (66) to obtain
By leveraging the arithmetic-geometric inequality, we obtain that
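For completeness, the arithmetic–geometric mean inequality invoked here is the standard bound for nonnegative scalars, which (assuming it is applied to the per-iteration ratios \(a_t = \Vert r_{t+1}\Vert /\Vert r_t\Vert \)) reads

$$\begin{aligned} \prod _{t=0}^{k-1} a_t \le \left( \frac{1}{k}\sum _{t=0}^{k-1} a_t\right) ^{k}; \end{aligned}$$

the product on the left telescopes to \(\Vert r_k\Vert /\Vert r_0\Vert \), while the sum on the right is controlled by the bound obtained from (69).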
Using condition (61), the proof of (59) is complete. Next, we proceed to prove (60). By Taylor's theorem, there exist \(\alpha _t \in [0, 1]\) and a matrix \(H_t = \nabla ^2{f(x_* + \alpha _t(x_t - x_*))}\) such that
where we used \(\nabla {f(x_*)} = 0\) and \({\hat{H}}_t = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_t\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). By Lemma 3 and \(\sigma _t \le \epsilon \), we have
and
By combining (70), (71) and (72), we obtain that
and the claim in (60) holds. \(\square \)
The above theorem establishes the non-asymptotic superlinear convergence of the Broyden class of quasi-Newton methods. Notice that we use the weighted norm in (59) to characterize the convergence rate. If in addition to the strong convexity condition in Assumption 3.1, we also assume that the gradient is Lipschitz continuous as in Assumption 3.2, then we have that \(\sqrt{\mu }\Vert x_t - x_*\Vert \le \Vert r_t\Vert \le \sqrt{L}\Vert x_t - x_*\Vert , \forall t \ge 0\). Hence, the result in (59) implies that
where \(C_1\) and \(C_2\) are defined in (61). Next, we use the above theorem to report the results for DFP and BFGS, which are two special cases of the convex Broyden class of quasi-Newton methods.
Corollary 2
Consider the DFP and BFGS methods. Suppose Assumptions 3.1 and 3.3 hold and for some \(\epsilon , \delta \in (0, \frac{1}{2})\) and \(\rho \in (0, 1)\), the initial point \(x_0\) and initial Hessian approximation \(B_0\) satisfy
-
For the DFP method, if the tuple \((\epsilon , \delta , \rho )\) satisfies
$$\begin{aligned} \left[ \frac{4(2\delta + 1)}{(1 - \epsilon )^2} + \frac{3 + \epsilon }{1 - \epsilon }\right] \frac{\epsilon }{1 - \rho } \le \delta , \qquad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho \ , \end{aligned}$$(75)then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the DFP method converge to \(x_*\) at a superlinear rate of
$$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } \le \left( \frac{C_1\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$(76)$$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le (1 + \epsilon )^2\left( \frac{C_1\sqrt{k} + C_2}{k}\right) ^{2k}, \qquad \forall k \ge 1. \end{aligned}$$(77) -
For the BFGS method, if the tuple \((\epsilon , \delta , \rho )\) satisfies
$$\begin{aligned} \frac{(3 + \epsilon )\epsilon }{(1 - \epsilon )(1 - \rho )} \le \delta , \qquad \frac{\epsilon }{2} + 2\delta \le (1 - 2\delta )\rho \ , \end{aligned}$$(78)then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the BFGS method converge to \(x_*\) at a superlinear rate of
$$\begin{aligned} \frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } \le \left( \frac{C_1\sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\sqrt{k} + C_2}{k}\right) ^k, \qquad \forall k \ge 1, \end{aligned}$$(79)$$\begin{aligned} \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} \le (1 + \epsilon )^2\left( \frac{C_1\sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\sqrt{k} + C_2}{k}\right) ^{2k}, \qquad \forall k \ge 1, \end{aligned}$$(80)
where \(C_1\) and \(C_2\) are defined in (61).
Proof
In Theorem 1, set \(\phi _k = 1\) for all \(k \ge 0\) to obtain the results for DFP and set \(\phi _k = 0\) for all \(k \ge 0\) to obtain the results for BFGS. \(\square \)
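As a quick numerical check of the constant q from Theorem 1 at these two endpoints (this snippet just evaluates the closed-form expression; the value \(\delta = 1/7\) is the choice used later in Corollary 3):

```python
import math

def q(phi_min, delta):
    # q = 1 / sqrt(phi_min * 4*delta/(1+2*delta) + (1-2*delta)/(1+2*delta))
    return 1.0 / math.sqrt(
        phi_min * 4 * delta / (1 + 2 * delta) + (1 - 2 * delta) / (1 + 2 * delta)
    )

delta = 1 / 7
# DFP endpoint (phi_k = 1): the two terms inside the root sum to 1, so q = 1.
assert abs(q(1.0, delta) - 1.0) < 1e-12
# BFGS endpoint (phi_k = 0): q attains sqrt((1+2*delta)/(1-2*delta)),
# matching the extra factor in the BFGS bound (79).
assert abs(q(0.0, delta) - math.sqrt((1 + 2 * delta) / (1 - 2 * delta))) < 1e-12
```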
The results in Corollary 2 indicate that, in a local neighborhood of the optimal solution, the iterates generated by DFP and BFGS converge to the optimal solution at a superlinear rate of \(({(C_1\sqrt{k}+C_2)}/{k})^k\), where the constants \(C_1\) and \(C_2\) are determined by \(\rho \), \(\epsilon \) and \(\delta \). Indeed, as time progresses, the rate behaves as \({\mathcal {O}}\left( (1/{\sqrt{k}})^{k}\right) \). The tuple \((\rho , \epsilon , \delta )\) is independent of the problem parameters \((\mu , L, M, d)\), and the only required condition on the tuple \((\rho , \epsilon , \delta )\) is that it satisfies (75) or (78). Note that the superlinear rate in (76) and (79) is faster than the linear rate of first-order methods, as the contraction coefficient approaches zero at a sublinear rate of \({\mathcal {O}}({1}/{\sqrt{k}})\). Similarly, in terms of the function value, the superlinear rate shown in (77) and (80) behaves as \({\mathcal {O}}\left( (1/k)^{k}\right) \). The result in Corollary 2 also shows the existence of a trade-off between the rate of convergence and the neighborhood of superlinear convergence. We highlight this point in the following remark.
Remark 3
There exists a trade-off between the size of the local neighborhood in which DFP or BFGS converges superlinearly and their rate of convergence. To be more precise, by choosing larger values for \(\epsilon \) and \(\delta \) (as long as they satisfy (75) or (78)), we can increase the size of the region in which the quasi-Newton method has a fast superlinear convergence rate, but this leads to a slower superlinear convergence rate according to the bounds in (76), (77), (79) and (80). Conversely, by choosing small values for \(\epsilon \) and \(\delta \), the rate of convergence becomes faster, but the local neighborhood defined in (74) becomes smaller.
The final convergence results of Corollary 2 depend on the choice of parameters \((\rho , \epsilon , \delta )\), and it may not be easy to quantify the exact convergence rate at first glance. To better quantify the superlinear convergence rate of DFP and BFGS, in the following corollary, we state the results of Corollary 2 for specific choices of \(\rho \), \(\epsilon \) and \(\delta \) which simplify our expressions. Indeed, one can choose other values for these parameters to control the neighborhood and rate of superlinear convergence, as long as they satisfy the conditions in (75) for DFP and (78) for BFGS.
Corollary 3
Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.3 hold. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) of DFP satisfy
and the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) of BFGS satisfy
Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by the DFP and BFGS methods satisfy
Proof
The results for DFP can be shown by setting \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{120}\) and \(\delta = \frac{1}{7}\) in Corollary 2. We can check that for those values, the conditions in (75) are all satisfied. Moreover, the expressions in (76) and (77) can be simplified as
and \((1 + \epsilon )^2 = (1 + \frac{1}{120})^2 \le 1.1\). So the claims in (83) follow. The results for BFGS can be shown similarly by setting \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{50}\) and \(\delta = \frac{1}{7}\) in (78), (79) and (80). \(\square \)
The results in Corollary 3 show that for some specific choices of \((\epsilon , \delta , \rho )\), the convergence rate of DFP and BFGS is \(\left( 1/k\right) ^{k/2}\), which is asymptotically faster than any linear convergence rate of first-order methods. Moreover, we observe that the neighborhood in which this fast superlinear rate holds is slightly larger for BFGS compared to DFP, i.e., compare the first conditions in (81) and (82). This is consistent with the fact that in practice, BFGS often outperforms DFP.
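To get a concrete sense of how fast \((1/k)^{k/2}\) is, the following snippet compares it against a linear rate; the contraction factor 0.9 is purely illustrative and not taken from the paper:

```python
# The superlinear factor (1/k)^(k/2) versus an arbitrary linear rate rho^k;
# the contraction factor rho = 0.9 is purely illustrative.
superlinear = lambda k: (1.0 / k) ** (k / 2)
linear = lambda k: 0.9 ** k

# At k = 1 the linear rate is smaller, but the per-step factor of the
# superlinear bound, 1/sqrt(k), eventually drops below any fixed rho.
assert superlinear(1) > linear(1)
assert superlinear(20) < linear(20)
assert abs(superlinear(4) - (1 / 4) ** 2) < 1e-15
```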
A major shortcoming of the results in Corollary 2 and Corollary 3 is that, in addition to assuming that the initial iterate \(x_0\) is sufficiently close to the optimal solution, we also require the initial Hessian approximation error to be sufficiently small. In the following theorem, we resolve this issue by suggesting a practical choice for \(B_0\) such that the second assumption in (81) and (82) can be satisfied under some conditions. To be more precise, we show that if \(\Vert \nabla ^2{f(x_*)}^\frac{1}{2}(x_0 - x_*)\Vert \) is sufficiently small (we formally describe this condition), then by setting \(B_0 = \nabla ^{2}{f(x_0)}\), the second condition in (81) and (82) for Hessian approximation is satisfied, and we can achieve the convergence rate in (83).
Theorem 2
Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.3 hold. Moreover, for DFP, suppose the initial point \(x_0\) and initial Hessian approximation \(B_0\) satisfy
and for BFGS, they satisfy
Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by the DFP and BFGS methods satisfy
Proof
First we consider the case of the DFP method. Notice that by (84), we obtain
Hence, the first part of (81) is satisfied. Moreover, using Assumptions 3.1 and 3.3, we have
The first inequality holds as \(\Vert A\Vert _F \le \sqrt{d}\Vert A\Vert \) for any matrix \(A \in {\mathbb {R}}^{d \times d}\), and the last inequality is due to the first part of (84). The above bound shows that the second part of (81) is also satisfied, and by Corollary 3 the claim follows. The proof for BFGS is similar to the proof for DFP: it can be derived by following the steps of the DFP proof and exploiting the BFGS results in Corollary 3. \(\square \)
According to Theorem 2, if the initial weighted error \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert \) is sufficiently small, then by setting the initial Hessian approximation \(B_0\) as the Hessian at the initial point \(\nabla ^{2}{f(x_0)}\), the iterates converge superlinearly at a rate of \((1/k)^{k/2}\). More specifically, based on the result in (24), it suffices to have \(\Vert \nabla ^2{f(x_*)}^{-\frac{1}{2}}\nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^{\frac{3}{2}}/(M\sqrt{d}))\) to ensure \(\Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert \le {\mathcal {O}}(\mu ^{\frac{3}{2}}/(M\sqrt{d}))\) as stated in (84) and (85). Hence, this condition is satisfied when \(\Vert \nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^2/(M\sqrt{d}))\). This observation implies that, in practice, we can use any optimization algorithm to find an initial point \(x_0\) such that \(\Vert \nabla {f(x_0)}\Vert \le {\mathcal {O}}(\mu ^2/(M\sqrt{d}))\), and once this condition is satisfied, by setting \(B_0=\nabla ^2 f(x_0)\) we obtain the guaranteed superlinear convergence result. The suggested procedure requires only one evaluation of the Hessian inverse, at the initial iterate; in the rest of the algorithm, the Hessian inverse approximations are updated according to the convex Broyden update in (8).
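The warm-start procedure described above can be sketched as follows. The helpers `grad` and `hess`, the threshold `tol` (standing in for the \({\mathcal {O}}(\mu ^2/(M\sqrt{d}))\) bound), and the fixed first-phase step size are all illustrative assumptions, and for simplicity plain gradient descent and the standard inverse-BFGS update are used in place of the general convex Broyden update:

```python
import numpy as np

def warm_start_bfgs(x0, grad, hess, tol, gd_step=0.1, n_qn=50):
    """Sketch of the suggested procedure: run a globally convergent method
    until ||grad|| <= tol (tol stands in for the O(mu^2/(M sqrt(d))) bound),
    then switch to BFGS with H0 = hess(x)^{-1} and unit step size.
    `grad`, `hess`, `tol`, and the fixed gradient step are illustrative."""
    x = np.asarray(x0, dtype=float)
    # Phase 1: plain gradient descent (any first-order method would do).
    while np.linalg.norm(grad(x)) > tol:
        x = x - gd_step * grad(x)
    # Phase 2: BFGS on the inverse Hessian approximation.
    H = np.linalg.inv(hess(x))
    for _ in range(n_qn):
        g = grad(x)
        s = -H @ g                            # unit step size
        x_new = x + s
        y = grad(x_new) - g
        if s @ y <= 1e-16 * np.linalg.norm(s) * np.linalg.norm(y):
            break                             # curvature information lost
        rho = 1.0 / (s @ y)
        V = np.eye(len(x)) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)   # inverse-BFGS update
        x = x_new
    return x

# Quadratic sanity check: f(x) = 0.5 x^T A x is minimized at the origin.
A = np.diag([1.0, 2.0, 3.0])
x_min = warm_start_bfgs(np.ones(3), grad=lambda x: A @ x,
                        hess=lambda x: A, tol=1e-3)
assert np.linalg.norm(x_min) < 1e-8
```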
5 Analysis of self-concordant functions
The results that we have presented so far require three assumptions: (i) the objective function is strongly convex, (ii) its gradient is Lipschitz continuous, and (iii) its Hessian is Lipschitz continuous at the optimal solution. In this section, we extend our theoretical results to a different setting where the objective function is self-concordant.
Assumption 5.1
The objective function f is standard self-concordant. In other words, it satisfies the following conditions: (i) f is closed with open domain dom(f), (ii) it is three times continuously differentiable, (iii) \(\nabla ^2{f(x)} \succ 0\) for all \(x \in dom(f)\), and (iv) the Hessian satisfies
Notice that the constant 2 in the above condition corresponds to standard self-concordant functions; in principle, any other constant could be used instead of 2. The analysis of Newton-type methods for self-concordant functions (see, e.g., [38, 39]) expands the theory of second-order algorithms beyond the classic setting considered in the previous section. This family of functions is of interest as it includes a large set of loss functions that are widely used in machine learning, such as linear functions, convex quadratic functions, and negative logarithm functions. In this section, we extend our results to this class of functions.
We should mention that the setup considered in this section is neither more general nor more restrictive than the setup in the previous section. For instance, the function \(f(x) = -\log {x}\) is self-concordant and satisfies Assumption 5.1, but it does not satisfy Assumption 3.1, 3.2 or 3.3 for any \(x > 0\). Conversely, self-concordance does not follow from our earlier assumptions, in particular from the assumption that the Hessian is Lipschitz continuous at the optimal solution. For instance, the objective function
satisfies the conditions in Assumptions 3.1, 3.2 and 3.3. However, it is not self-concordant, as its third derivative is not continuous.
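Returning to the first example, a standard calculation makes the claim about \(f(x) = -\log {x}\) concrete: on \(x > 0\),

$$\begin{aligned} f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}, \qquad |f'''(x)| = \frac{2}{x^3} = 2\left( \frac{1}{x^2}\right) ^{\frac{3}{2}} = 2 f''(x)^{\frac{3}{2}}, \end{aligned}$$

so the self-concordance condition holds with equality, while \(f''(x) = 1/x^2\) is unbounded as \(x \rightarrow 0\) and vanishes as \(x \rightarrow \infty \), so neither strong convexity nor Lipschitz continuity of the gradient holds on the whole domain.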
Based on these points, the analysis in this section extends our convergence analysis of quasi-Newton methods to a new setting that is not covered by the setup in the previous section.
We should also mention that in the finite-time analyses of quasi-Newton methods in [35,36,37], the authors assume that the objective function is strongly self-concordant, which forms a subclass of self-concordant functions, formally defined in [35]. Note that a function f is strongly self-concordant when there exists a constant \(K \ge 0\) such that for any \(x, y, z, w \in dom(f)\), we have
In addition, in [35,36,37] the authors require the objective function to be strongly convex and smooth. Indeed, the setting considered in this section is more general than the setup in these works, as we only require the function to be self-concordant.
Note that the condition \(\nabla ^2{f(x)} \succ 0\) guarantees that the inner product \(s_k^\top y_k\) in the quasi-Newton updates is positive at every iteration, as stated in Sect. 2. Also, by the definition of self-concordance, the function f(x) is always strictly convex. We start our analysis by stating the following lemma, which plays an important role in our analysis for self-concordant functions.
Lemma 9
Suppose function f satisfies Assumption 5.1 and \(x, y \in dom(f)\). Further, consider the definition \(G := \int _{0}^{1}\nabla ^2{f(x + \alpha (y - x))}d\alpha \). If x and y are such that \(r = \Vert \nabla ^2{f(x)}^{\frac{1}{2}}(y - x)\Vert < 1\), then
Proof
Check Theorem 4.1.6 and Corollary 4.1.4 of [9]. \(\square \)
The next two lemmas are based on Lemma 9 and are similar to the results in Lemma 3 and 4, except here we prove them for the case that the conditions in Assumption 5.1 are satisfied.
Lemma 10
Recall the definition of \(r_k\) in (16) and \({\hat{J}}_k\) in (17). Suppose that there exists \(\alpha _k \in [0, 1]\), and define the matrices \(H_k = \nabla ^2{f(x_* + \alpha _k(x_k - x_*))}\) and \({\hat{H}}_k = \nabla ^{2}f(x_*)^{-\frac{1}{2}}H_k\nabla ^{2}f(x_*)^{-\frac{1}{2}}\). If Assumption 5.1 holds and \(\Vert r_k\Vert \le \frac{1}{2}\), then for all \(k \ge 0\) we have
Proof
Check Appendix F. \(\square \)
Lemma 11
Recall the definitions in (13–16) and consider the definition \(\theta _k := \max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \}\). Suppose that for some \(k \ge 0\), we have \(\theta _k \le \frac{1}{2}\). If Assumption 5.1 holds, we have
Proof
Check Appendix G. \(\square \)
By comparing Lemma 10 and Lemma 11 with Lemma 3 and Lemma 4, respectively, we observe that the only difference between these results is that we replaced \({\sigma _k}/{2} = ({M}/{(2\mu ^{\frac{3}{2}})})\Vert r_k\Vert \) by \(2\Vert r_k\Vert \) and \(\tau _k = \max \{\sigma _k, \sigma _{k+1}\}\) by \(6\theta _k = 6\max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \}\). Due to this similarity, the analysis and the superlinear convergence proof for the self-concordant setting closely parallel those of Sect. 4. Next, we directly present the final superlinear convergence rate results for self-concordant functions.
Theorem 3
Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumption 5.1. Moreover, suppose the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
where \(\epsilon , \delta \in (0, \frac{1}{2})\) are such that, for some \(\rho \in (0, 1)\), they satisfy
Then the iterates \(\{x_{k}\}_{k=0}^{+\infty }\) generated by the convex Broyden class of quasi-Newton methods converge to \(x_*\) at a superlinear rate of
where \(q = \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}} \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \) and \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\).
Proof
Check Appendix H. \(\square \)
Similarly, we can set \(\phi _k = 1\) or \(\phi _k = 0\) for all \(k \ge 0\) to obtain the results for DFP and BFGS, respectively, as stated in Corollary 2. We can also select specific values for \((\epsilon , \delta , \rho )\) to simplify our bounds.
Corollary 4
Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
and for the BFGS method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by these methods satisfy
Proof
As in the proof of Corollary 3, we set \(\phi _k = 1\), \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{120}\), \(\delta = \frac{1}{7}\) for the DFP method and \(\phi _k = 0\), \(\rho = \frac{1}{2}\), \(\epsilon = \frac{1}{50}\), \(\delta = \frac{1}{7}\) for the BFGS method in Theorem 3. Then, the claims follow. \(\square \)
We can also set the initial Hessian approximation matrix to be \(\nabla ^2{f(x_0)}\) as in Theorem 2 to achieve the same superlinear convergence rate as long as the distance between the initial point \(x_0\) and the optimal point \(x_*\) is sufficiently small.
Theorem 4
Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point \(x_0\) and initial Hessian approximation matrix \(B_0\) satisfy
and for the BFGS method, they satisfy
Then, the iterates \(\{x_k\}_{k=0}^{+\infty }\) generated by these methods satisfy
Proof
Check Appendix I. \(\square \)
In summary, we established the local convergence rate of the convex Broyden class of quasi-Newton methods for self-concordant functions. We showed that if the initial distance to the optimal solution is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(1)\) and the initial Hessian approximation error is \(\Vert \nabla ^{2}{f(x_*)}^{-\frac{1}{2}}(B_0 - \nabla ^{2}f(x_*))\nabla ^{2}{f(x_*)}^{-\frac{1}{2}}\Vert _F = {\mathcal {O}}(1)\), the iterations converge to the optimal solution at a superlinear rate of \(\frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } = {\mathcal {O}}{\left( \frac{1}{\sqrt{k}}\right) ^{k}}\) and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} = {\mathcal {O}}{\left( \frac{1}{k}\right) ^{k}}\). Moreover, we can achieve the same superlinear rate if the initial error is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{1}{\sqrt{d}})\) and the initial Hessian approximation matrix is \(B_0 = \nabla ^2{f(x_0)}\).
6 Discussion
In this section, we discuss the strengths and shortcomings of our theoretical results and compare them with concurrent papers [36, 37] on the non-asymptotic superlinear convergence of DFP and BFGS.
Initial Hessian approximation condition Note that in our main theoretical results, in addition to the fact that the initial iterate \(x_0\) has to be close to the optimal solution \(x_*\), which is a common condition for local convergence results, we also need the initial Hessian approximation \(B_0\) to be close to the Hessian at the optimal solution \(\nabla ^2 f(x_*)\). At first glance, this might seem restrictive, but as we have shown in Theorem 2 and Theorem 4, if we set the initial Hessian approximation to the Hessian at the initial point \(\nabla ^2 f(x_0)\), this condition is automatically satisfied as long as the initial iterate error \(\Vert x_0-x_*\Vert \) is sufficiently small. From a complexity point of view, this approach is reasonable as quasi-Newton methods and Newton’s method outperform first-order methods in a local neighborhood of the optimal solution, and their global linear convergence rate may not be faster than the linear convergence rate of first-order methods. Hence, as suggested in [2], to optimize the overall iteration complexity according to theoretical bounds, one might use first-order methods such as Nesterov’s accelerated gradient method to reach a local neighborhood of the optimal solution, and then switch to locally fast methods such as quasi-Newton methods. If this procedure is used, our theoretical results show that by setting \(B_0=\nabla ^2 f(x_0)\) (and equivalently \(H_0=\nabla ^2 f(x_0)^{-1}\)) for the convex Broyden class of quasi-Newton, the fast superlinear convergence rate of \((1/k)^{k/2}\) can be obtained.
It is worth noting that, in practice, algorithms that do not switch between different methods or require knowledge of the problem parameters are more favorable. For these reasons, quasi-Newton methods with an Armijo–Wolfe line search are more practical, as they offer an adaptive choice of the stepsize with global convergence guarantees and without knowledge of typically unknown constants such as the Lipschitz constant of the gradient, the Lipschitz constant of the Hessian, and the strong convexity parameter. Indeed, both this framework and the framework in [36, 37] require re-initializing the Hessian approximation once the iterate is sufficiently close to the solution. An ideal theoretical guarantee would follow a line-search approach that guarantees that, once the iterates reach a local neighborhood of the optimal solution, the Hessian approximation for DFP or BFGS automatically satisfies the required conditions for superlinear convergence without any modification to the Hessian approximation.
Convergence rate-neighborhood trade-off As mentioned earlier, we observe a trade-off between the radius of the neighborhood in which BFGS and DFP converge superlinearly to the optimal solution and the rate (speed) of superlinear convergence. One important observation here is that for specific choices of \(\epsilon \), \(\delta \) and \(\rho \), the rate of convergence could be independent of the problem dimension d, while the neighborhood of convergence would depend on d. Note that by selecting different parameters we could improve the dependency of the neighborhood on d, at the cost of a contraction factor that depends on d. In this case, the contraction factor may not always be smaller than 1, and we can only guarantee that after a few iterations it becomes smaller than 1 and eventually behaves as 1/k. The results in [36, 37] have a similar structure. For instance, in [36], the authors show that when the initial Newton decrement is smaller than \(\frac{\mu ^{\frac{5}{2}}}{ML}\), which is independent of the problem dimension, the convergence rate is of the form \((\frac{dL}{\mu k})^{k/2}\). Hence, to observe the superlinear convergence rate, one needs to run the BFGS method for at least \(d L/\mu \) iterations to ensure the contraction factor is smaller than 1. A similar conclusion could be made using our results, if we adjust the neighborhood. In our main result, we only report the case where the neighborhood depends on d and the rate is independent of it, since in this case the contraction factor is always smaller than 1 and the superlinear behavior starts from the first iteration.
7 Numerical experiments
In this section, we present our numerical experiments and compare the non-asymptotic performance of quasi-Newton methods with Newton’s method and the gradient descent algorithm. We further investigate whether the convergence rates of quasi-Newton methods are consistent with our theoretical guarantees. In particular, we solve the following logistic regression problem with \(l_2\) regularization
We assume that \(\{z_i\}_{i = 1}^{N}\) are the data points and \(\{y_i\}_{i = 1}^{N}\) are their corresponding labels, where \(z_i \in {\mathbb {R}}^d\) and \(y_i \in \{-1, 1\}\) for \(1 \le i \le N\). Note that the function f(x) in (108) is strongly convex with parameter \(\mu > 0\). We normalize all data points so that \(\Vert z_i\Vert = 1\) for all \(1 \le i \le N\). Therefore, the gradient of the function f(x) is Lipschitz continuous with parameter \(L = {1}/{4} + \mu \). It is also known that the logistic regression objective is self-concordant after a suitable scaling, i.e., it is self-concordant with some constant, but not standard self-concordant (with constant 2). Moreover, its Hessian is Lipschitz continuous. In summary, the objective function f in (108) satisfies Assumptions 3.1, 3.2, 3.3 and Assumption 5.1.
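As a concrete illustration, the regularized logistic loss and its gradient can be sketched as follows. This is not the authors' experiment code; the \(1/N\) averaging and the exact form of the regularizer are our assumptions for illustration, since the precise expression in (108) is not reproduced here.

```python
import numpy as np

def logistic_loss(x, Z, y, mu):
    """l2-regularized logistic loss with unit-norm rows z_i and labels y_i in {-1, +1}.
    The 1/N averaging and the (mu/2)*||x||^2 regularizer are assumed for illustration."""
    margins = -y * (Z @ x)
    # log(1 + exp(m)) evaluated stably as logaddexp(0, m)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * mu * np.dot(x, x)

def logistic_grad(x, Z, y, mu):
    """Gradient of logistic_loss with respect to x."""
    margins = -y * (Z @ x)
    sigma = 1.0 / (1.0 + np.exp(-margins))  # sigmoid of the margin
    return -(Z.T @ (y * sigma)) / len(y) + mu * x
```

With unit-norm data, the Hessian of the data term has spectral norm at most 1/4, which is consistent with the stated smoothness constant \(L = 1/4 + \mu \).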
We conduct our experiments on four different datasets: (i) colon-cancer dataset [40], (ii) Covertype dataset [41], (iii) GISETTE handwritten digits classification dataset from the NIPS 2003 feature selection challenge [42] and (iv) MNIST dataset of handwritten digits [43].Footnote 1 We compare the performance of DFP, BFGS, Newton’s method, and gradient descent. We initialize all the algorithms with the same initial point \(x_0 = c\mathbf {1}\), where \(c > 0\) is a tuned parameter and \(\mathbf {1} \in {\mathbb {R}}^d\) is the all-ones vector. We set the initial Hessian inverse approximation matrix to \(\nabla ^2{f(x_0)}^{-1}\) for the DFP and BFGS methods. The step size is 1 for DFP, BFGS, and Newton’s method. The step size of the gradient descent method is tuned by hand to achieve the best performance on each dataset.
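The quasi-Newton setup above (inverse Hessian approximation initialized at \(\nabla ^2{f(x_0)}^{-1}\), unit step size) can be sketched as a plain BFGS loop. This is a minimal sketch, not the authors' implementation; the curvature-condition skip is a standard safeguard that we add for robustness.

```python
import numpy as np

def bfgs_unit_step(grad, hess, x0, iters=20):
    """BFGS with constant step size 1, tracking the inverse Hessian
    approximation H_k and initializing H_0 = (hess(x0))^{-1}, as in the
    experimental setup.  `grad` and `hess` are callables."""
    x = np.asarray(x0, dtype=float).copy()
    H = np.linalg.inv(hess(x))
    g = grad(x)
    I = np.eye(len(x))
    for _ in range(iters):
        x_new = x - H @ g              # unit step along the quasi-Newton direction
        g_new = grad(x_new)
        s, yv = x_new - x, g_new - g
        sy = float(s @ yv)
        if sy > 1e-16:                 # curvature safeguard (our addition)
            rho = 1.0 / sy
            V = I - rho * np.outer(s, yv)
            H = V @ H @ V.T + rho * np.outer(s, s)   # inverse BFGS update
        x, g = x_new, g_new
    return x
```

On a strongly convex quadratic, initializing with the exact inverse Hessian makes the first unit step land on the minimizer, so the sketch is easy to sanity-check.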
All the parameters (sample size N, dimension d, initial point parameter c and regularization \(\mu \)) of these different datasets are provided in Table 1. Notice that the initial point parameter c is selected from the set \({\mathcal {A}} = \{ {0.001}, {0.01}, 0.1, 1, 10\}\) to guarantee that the initial point \(x_0\) is close enough to the optimal solution \(x_*\) so that we can achieve the superlinear convergence rate of DFP and BFGS on each dataset. The regularization parameter \(\mu \) is also chosen from the same set \({\mathcal {A}}\) to obtain the best performance on each dataset.
From the theoretical results of Sects. 4.3 and 5, we expect the iterates \(\{x_k\}_{k = 0}^{\infty }\) generated by the DFP method and the BFGS method to satisfy the following superlinear convergence rate
Hence, in our numerical experiments, we compare the convergence rate of \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) with \((\frac{1}{\sqrt{k}})^{k}\) and the convergence rate of \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) with \((\frac{1}{k})^{k}\) to check the tightness of our theoretical bounds. Our numerical results are shown in Figs. 1, 2, 3 and 4 for the different datasets. For each problem, we present two plots. The left plot (plot (a)) shows \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) for the different algorithms together with our theoretical bound \((\frac{1}{\sqrt{k}})^{k}\). In the right plot (plot (b)), we compare \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) for the different methods with our theoretical bound \((\frac{1}{k})^{k}\).
We observe that \(\frac{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_k - x_*)\Vert }{\Vert \nabla ^2{f(x_*)}^{{1}/{2}}(x_0 - x_*)\Vert }\) for the DFP and BFGS methods is bounded above by \((\frac{1}{\sqrt{k}})^{k}\), and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)}\) for the DFP and BFGS methods is bounded above by \((\frac{1}{k})^{k}\). These experimental results therefore confirm our theoretical superlinear convergence rates for quasi-Newton methods.
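The comparison against the theoretical envelope can be sketched as a simple programmatic check; the helper name and the exact comparison are our choices for illustration.

```python
import numpy as np

def superlinear_bound_check(errors):
    """Return True if errors[k] / errors[0] <= (1/sqrt(k))^k for all k >= 1,
    where errors[k] stands for ||hess(x_*)^{1/2} (x_k - x_*)||."""
    e0 = errors[0]
    return all(errors[k] / e0 <= (1.0 / np.sqrt(k)) ** k
               for k in range(1, len(errors)))
```

A sequence decaying faster than \((1/\sqrt{k})^{k}\) passes the check, while a merely linearly decaying sequence eventually violates the envelope.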
8 Conclusion
In this paper, we studied the local convergence rate of the convex Broyden class of quasi-Newton methods, which includes the DFP and BFGS methods. We focused on two settings: (i) the objective function is \(\mu \)-strongly convex, its gradient is L-Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution with parameter M; (ii) the objective function is self-concordant. For these two settings we characterized the explicit non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton methods. In particular, for the first setting, we showed that if the initial distance to the optimal solution is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{\mu ^{\frac{3}{2}}}{M})\) and the initial Hessian approximation error is \(\Vert {\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\ \! (B_0 - \nabla ^{2}f(x_*))\ \!{\nabla ^{2}f(x_*)^{-\frac{1}{2}}}\Vert _F = {\mathcal {O}}(1)\), the iterates generated by the DFP and BFGS methods converge to the optimal solution at a superlinear rate of \(\frac{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_k - x_*)\Vert }{\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert } = {\mathcal {O}}{\left( \frac{1}{\sqrt{k}}\right) ^{k}}\) and \(\frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} = {\mathcal {O}}{\left( \frac{1}{k}\right) ^{k}}\). We further showed that we can achieve the same superlinear convergence rate if the initial error is \(\Vert \nabla ^{2}f(x_*)^\frac{1}{2}(x_0 - x_*)\Vert = {\mathcal {O}}(\frac{\mu ^\frac{3}{2}}{M\sqrt{d}})\) and the initial Hessian approximation matrix is \(B_0 = \nabla ^2{f(x_0)}\). We proved similar convergence rate results for the second setting where the objective function is self-concordant.
Code Availability
The code of the numerical experiments is available at the following link: https://github.com/QiujiangJin/Non-asymptotic-Superlinear-Convergence-of-Standard-Quasi-Newton-Methods
Notes
We use LIBSVM [44] with license: https://www.csie.ntu.edu.tw/~cjlin/libsvm/COPYRIGHT.
References
Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science & Business Media, Berlin (2013)
Nemirovsky, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Bennett, A.A.: Newton’s method in general analysis. Proc. Natl. Acad. Sci. U. S. A. 2(10), 592 (1916)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables, vol. 30. SIAM, Philadelphia (1970)
Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Springer Science & Business Media, Berlin (2004)
Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media, Berlin (2006)
Conn, A.R., Gould, N.I.M., Toint, P.L.: Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 50(1–3), 177–195 (1991)
Broyden, C.G.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. 19(92), 577–593 (1965)
Broyden, C.G., Dennis, J.E., Jr., Moré, J.J.: On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 12(3), 223–245 (1973)
Gay, D.M.: Some convergence properties of Broyden’s method. SIAM J. Numer. Anal. 16(4), 623–630 (1979)
Davidon, W.C.: Variable metric method for minimization. SIAM J. Optim. 1(1), 1–17 (1991). https://epubs.siam.org/doi/10.1137/0801001
Fletcher, R., Powell, M.J.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
Broyden, C.G.: The convergence of single-rank quasi-Newton methods. Math. Comput. 24(110), 365–382 (1970)
Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970)
Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23–26 (1970)
Shanno, D.F.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24(111), 647–656 (1970)
Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35(151), 773–782 (1980)
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1–3), 503–528 (1989)
Moré, J.J., Trangenstein, J.A.: On the global convergence of Broyden’s method. Math. Comput. 30(135), 523–540 (1976)
Powell, M.: On the convergence of the variable metric algorithm. IMA J. Appl. Math. 7(1), 21–36 (1971)
Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
Byrd, R.H., Nocedal, J., Yuan, Y.-X.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 24(5), 1171–1190 (1987)
Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 34(1), 194–217 (2019)
Griewank, A., Toint, P.L.: Local convergence analysis for partitioned quasi-Newton updates. Numer. Math. 39(3), 429–448 (1982)
Dennis, J., Martinez, H.J., Tapia, R.A.: Convergence theory for the structured BFGS secant method with an application to nonlinear least squares. J. Optim. Theory Appl. 61(2), 161–178 (1989)
Yuan, Y.-X.: A modified BFGS algorithm for unconstrained optimization. IMA J. Numer. Anal. 11(3), 325–332 (1991)
Al-Baali, M.: Global and superlinear convergence of a restricted class of self-scaling methods with inexact line searches, for convex functions. Comput. Optim. Appl. 9(2), 191–203 (1998)
Li, D., Fukushima, M.: A globally and superlinearly convergent Gauss-Newton-based BFGS method for symmetric nonlinear equations. SIAM J. Numer. Anal. 37(1), 152–172 (1999)
Yabe, H., Ogasawara, H., Yoshino, M.: Local and superlinear convergence of quasi-Newton methods based on modified secant conditions. J. Comput. Appl. Math. 205(1), 617–632 (2007)
Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)
Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 31(1), 785–811 (2021)
Rodomanov, A., Nesterov, Y.: Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 194, 159–190 (2022)
Rodomanov, A., Nesterov, Y.: New results on superlinear convergence of classical quasi-Newton methods. J. Optim. Theory Appl. 188(3), 744–769 (2021)
Nesterov, Y.: Self-concordant functions and polynomial-time methods in convex programming. Report, Central Economic and Mathematical Institute, USSR Academy of Sciences (1989)
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Cell Biol. 96, 6745–6750 (1999)
Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric. 24(3), 131–151 (2000)
Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 17 (2004)
LeCun, Y., Cortes, C., Burges, C.J.: MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist/ (2010)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3) (2011). https://doi.org/10.1145/1961189.1961199
Acknowledgements
Q. Jin acknowledges support from the National Initiative for Modeling and Simulation (NIMS) Graduate Research Fellowship. This research is supported by NSF Award CCF-2007668.
Appendices
A Proof of Corollary 1
According to the definition \(J = \int _{0}^{1}\nabla ^2{f(x + t(y - x))}dt\), we have \( \nabla {f(x)} - \nabla {f(y)} = J(x - y) \). Hence, we can write
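The identity \(\nabla {f(x)} - \nabla {f(y)} = J(x - y)\) with the averaged Hessian \(J\) can be verified numerically by approximating the integral with a quadrature rule; the midpoint rule and the test function used below are our choices for illustration.

```python
import numpy as np

def avg_hessian(hess, x, y, n=2000):
    """Midpoint-rule approximation of J = int_0^1 hess(x + t (y - x)) dt."""
    ts = (np.arange(n) + 0.5) / n
    return sum(hess(x + t * (y - x)) for t in ts) / n
```

For a smooth convex function such as \(f(v) = \sum _i v_i^4 + \frac{1}{2}\Vert v\Vert ^2\), the product \(J(x-y)\) matches \(\nabla f(x) - \nabla f(y)\) up to the quadrature error.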
Moreover, we can show that
By Assumption 3.3, we can replace the upper bound in the above expression by the following
By combining (109) and (110), the result in (12) follows.
B Proof of Lemma 1
Define \(P := I - uu^\top \). Since \(\Vert u\Vert = 1\), we have \(P = P^\top \), \(P^2 = P\), and \(0 \preceq P \preceq I\). These properties imply that
Moreover, for symmetric matrices \(X_1\) and \(X_2\) that satisfy \(X_1 \preceq X_2\) we have \(\mathrm {Tr}(X_1 Y) \le \mathrm {Tr}(X_2 Y)\) when Y is positive-semidefinite. This result and \(0 \preceq P \preceq I\) imply that
By combining the results in (111) and (112), and considering the definition \(P := I - uu^\top \), the claim in (18) follows.
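The stated properties of \(P = I - uu^\top \) are easy to verify numerically; the helper below is a sketch for illustration, not part of the proof.

```python
import numpy as np

def projection_complement(u):
    """P = I - u u^T for a unit vector u: the orthogonal projector onto the
    complement of span{u}.  P is symmetric, idempotent, and 0 <= P <= I."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)     # normalize so that ||u|| = 1
    return np.eye(len(u)) - np.outer(u, u)
```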
C Proof of Lemma 2
Notice that \(\mathrm {Tr}(X_1 Y) \le \mathrm {Tr}(X_2 Y)\) for any symmetric matrices \(X_1 \preceq X_2\) and symmetric positive-semidefinite matrix Y. Since \(A^\top A \preceq \Vert A\Vert ^2 I\), we obtain that
which leads to the first inequality in (19). The second inequality in (19) follows from the first one, since
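The argument rests on \(A^\top A \preceq \Vert A\Vert ^2 I\) combined with trace monotonicity, which implies \(\mathrm {Tr}(A^\top A Y) \le \Vert A\Vert ^2 \mathrm {Tr}(Y)\) for positive-semidefinite Y; a quick numerical check of this consequence:

```python
import numpy as np

def trace_bound_holds(A, Y, tol=1e-10):
    """Check Tr(A^T A Y) <= ||A||^2 Tr(Y) for PSD Y, which follows from
    A^T A <= ||A||^2 I and trace monotonicity."""
    lhs = np.trace(A.T @ A @ Y)
    rhs = np.linalg.norm(A, 2) ** 2 * np.trace(Y)   # ||A|| is the spectral norm
    return bool(lhs <= rhs + tol)
```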
D Proof of Lemma 3
By Assumption 3.3, we have that
Hence, we have \(J_k - \nabla ^2{f(x_*)}\preceq \frac{M}{2}\Vert x_k - x_*\Vert I\). Considering this bound and Assumption 3.1, we obtain
Similarly, we have that
Combining (113) and (114), we obtain that
Multiplying both sides of the above expression by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) from left and right leads to the first result in (20). By Assumption 3.3 and \(\alpha _k \in [0, 1]\), we have that
Hence, we have \(H_k - \nabla ^2{f(x_*)} \preceq M\Vert x_k - x_*\Vert I\). Considering this bound and Assumption 3.1, we obtain
Similarly, we have that
Combining (115) and (116), we obtain that
Multiplying both sides of the above expression by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) from left and right leads to the second result in (20).
E Proof of Lemma 4
By Assumption 3.1 and Corollary 1, we have
Notice that
Based on the definition \(r_k = \nabla ^{2}f(x_*)^{\frac{1}{2}}(x_k - x_*)\), we have \(x_k - x_* = \nabla ^{2}f(x_*)^{-\frac{1}{2}}r_k\) and hence
Substitute (118) and (119) into (117) and recall the definition in (16) to obtain
Hence, the proof of the first claim in (21) is complete. By using the Cauchy-Schwarz inequality and (21), we can write
Therefore, we obtain that
and the second claim in (22) holds. Using the reverse triangle inequality and (21), we have \(|\Vert {\hat{y}}_k\Vert - \Vert {\hat{s}}_k\Vert | \le \Vert {\hat{y}}_k - {\hat{s}}_k\Vert \le \tau _k\Vert {\hat{s}}_k\Vert \). Hence, the third claim in (23) holds. Finally, to prove the last claim in (24), we use Assumption 3.1 and Corollary 1 to show that
F Proof of Lemma 10
Set \(x = x_*\) and \(y = x_k\) in (91) of Lemma 9 and note that \(\Vert r_k\Vert \le \frac{1}{2} < 1\). By (91), we have
Multiply the above expressions from left and right by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) to obtain
Using the fact that \(\Vert r_k\Vert \le \frac{1}{2}\), we have
Replace the lower and upper bounds in (120) with the ones in (121) and (122), respectively, to obtain the first result in (92). Set \(x = x_*\) and \(y = x_* + \alpha _k(x_k - x_*)\) in (90) of Lemma 9 and notice that since \(\alpha _k \in [0, 1]\), we obtain that
By (90), we have that
Using that \(r \le \Vert r_k\Vert < 1\), we get that
Multiply the above expressions from left and right by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) to obtain the second result in (92).
G Proof of Lemma 11
We first show that for \(x = x_*\) and \(y=x_k + \alpha (x_{k+1} - x_k)\), where \(\alpha \in [0, 1]\), the value of \(r=\Vert \nabla ^2{f(x)}^{\frac{1}{2}}(y - x)\Vert \) defined in Lemma 9 is less than 1. To do so, note that
where \(\Vert r_k\Vert =\Vert \nabla ^2{f(x_*)}^\frac{1}{2}(x_k-x_*)\Vert \). Note that in the above simplification we used the assumption that \(\theta _k \le 1/2\). Now using the result in (90) we have
Moreover, since \(r \le \theta _k \in [0, 1)\), we can write
By computing the integral for \(\alpha \) from 0 to 1 in the above inequality, we get that
where we used the definition \(G_k := \int _{0}^{1}\nabla ^2{f(x_k + \alpha (x_{k+1} - x_k))}d\alpha \). Multiplying the above expression from left and right by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) leads to
where \({\hat{G}}_k = \nabla ^2{f(x_*)}^{-\frac{1}{2}}G_k\nabla ^2{f(x_*)}^{-\frac{1}{2}}\). The above inequality is equivalent to
which indicates that
Since \(\theta _k \in [0, 1)\), we have that
Hence, (123) can be simplified as
where the second inequality holds due to \(\theta _k \le \frac{1}{2}\).
Considering the definition \(G_k := \int _{0}^{1}\nabla ^2{f(x_k + \alpha (x_{k+1} - x_k))}d\alpha \), we have \(y_k = G_k s_k\). Using this observation, we have
where the last inequality holds due to (124). Hence, the proof of the first claim in (93) is complete.
By using the Cauchy-Schwarz inequality and (93), we can write
Therefore, we obtain that
and the second claim in (94) holds. Using the reverse triangle inequality and (93), we have \(|\Vert {\hat{y}}_k\Vert - \Vert {\hat{s}}_k\Vert | \le \Vert {\hat{y}}_k - {\hat{s}}_k\Vert \le 6\theta _k\Vert {\hat{s}}_k\Vert \). Hence, the third claim in (95) holds. Finally, using Lemma 10 we know that
and
Thus, the last claim in (96) holds.
H Proof of Theorem 3
From the assumption on the initial point, it follows that \(x_k \in \) dom f for all \(k \ge 0\), so that Algorithm 1 is well-defined. The proof of Theorem 3 is very similar to the proof of Theorem 1. The only difference is that we utilize Lemmas 10 and 11 instead of Lemmas 3 and 4. Hence, we need to replace all the terms \(\frac{\sigma _k}{2} = \frac{M}{2\mu ^{\frac{3}{2}}}\Vert r_k\Vert \) by \(2\Vert r_k\Vert \) and \(\tau _k = \max \{\sigma _k, \sigma _{k+1}\}\) by \(6\theta _k = 6\max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \}\). Here, we only state the outline of the proof and omit the details to avoid redundancy.
First we present the potential function similar to the (46) from Lemma 7. Suppose that for some \(\delta > 0\) and some \(k \ge 0\), we have \(\theta _k = \max \{\Vert r_k\Vert , \Vert r_{k+1}\Vert \} < \frac{1}{6}\) and \(\Vert {\hat{B}}_k - I\Vert _{F} \le \delta \). Then, the matrix \(B_{k+1}\) generated by the convex Broyden class update (7) satisfies
where \(Z_k = \phi _k\Vert {\hat{B}}_k\Vert \frac{4}{(1 - 6\theta _k)^2} + \frac{3 + 6\theta _k}{1 - 6\theta _k}\). We also have that
The proof of the above conclusion is the same as the proofs presented in Lemmas 5, 6, and 7, except that we use the results of Lemma 11 instead of Lemma 4. Then, we present linear convergence results similar to Lemma 8. Suppose that the objective function f satisfies the conditions in Assumption 5.1. Moreover, suppose the initial point \(x_0\) and the initial Hessian approximation matrix \(B_0\) satisfy
where \(\epsilon , \delta \in (0, \frac{1}{2})\) such that for some \(\rho \in (0, 1)\), they satisfy
Then, the sequence of iterates \(\{x_k\}_{k=0}^{+\infty }\) converges to the optimal solution \(x_*\) with
Furthermore, the matrices \(\{B_k\}_{k=0}^{+\infty }\) stay in a neighborhood of \(\nabla ^{2}{f(x_*)}\) defined as
Moreover, the norms \(\{\Vert {\hat{B}}_k\Vert \}_{k=0}^{+\infty }\) and \(\{\Vert {\hat{B}}_k^{-1}\Vert \}_{k=0}^{+\infty }\) are all uniformly bounded above by
We apply the same induction technique used in the proof of Lemma 8 to establish the above linear convergence results, utilizing the potential function in (126) and Lemma 11. Finally, we can prove the superlinear convergence result
where \(q = \frac{1}{\sqrt{\phi _{\text {min}}\frac{4\delta }{1 + 2\delta } + \frac{1 - 2\delta }{1 + 2\delta }}} \in \left[ 1, \sqrt{\frac{1 + 2\delta }{1 - 2\delta }}\right] \) and \(\phi _{\text {min}} = \inf _{k \ge 0}\phi _k\). This proof is based on the linear convergence results in (129), (130), and (131) and is the same as the proof of Theorem 1, except that here we replace the results of Lemma 3 by the results of Lemma 10, substitute the results of Lemma 4 with the results of Lemma 11, and utilize the intermediate inequality (125) instead of (46). Notice that all the terms \(\frac{\epsilon }{2}\) have been replaced with \(\frac{\epsilon }{3}\), since in this setting we use the term \(2\Vert r_t\Vert \) instead of \(\frac{\sigma _t}{2}\) and \(2\Vert r_t\Vert \le 2\Vert r_0\Vert \le 2\frac{\epsilon }{6} = \frac{\epsilon }{3}\).
I Proof of Theorem 4
First we focus on the DFP method. Notice that by (105) we have
Hence, the first condition in (102) is satisfied. Set \(x = x_*\) and \(y = x_0\) in Lemma 9. Notice that \(\Vert r_0\Vert = \Vert \nabla ^2{f(x_*)}^{\frac{1}{2}}(x_0 - x_*)\Vert \le \frac{1}{720} < 1\). Hence, using (90) we obtain that
Multiply the above expression by \(\nabla ^2{f(x_*)}^{-\frac{1}{2}}\) from left and right to obtain
which implies that
The above two inequalities indicate that
Since \(\Vert r_0\Vert \in [0, 1)\), we have that
Hence, (135) can be simplified as
where the second inequality holds due to \(\Vert r_0\Vert \le \frac{1}{720}\). Therefore, we can show that
where the first inequality holds since \(\Vert A\Vert _F \le \sqrt{d}\Vert A\Vert \) for any matrix \(A \in {\mathbb {R}}^{d \times d}\), and the last inequality follows from the first part of (105). Hence, the second condition in (102) is also satisfied. By Corollary 4, we conclude that (107) holds. The proof for BFGS is similar to the proof for DFP: it follows the same steps, exploiting the BFGS results in Corollary 4.
Jin, Q., Mokhtari, A. Non-asymptotic superlinear convergence of standard quasi-Newton methods. Math. Program. 200, 425–473 (2023). https://doi.org/10.1007/s10107-022-01887-4