Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods

In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton algorithms, which includes the Davidon--Fletcher--Powell (DFP) method and the Broyden--Fletcher--Goldfarb--Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied in the literature, but their explicit finite-time local convergence rate has not been fully investigated. In this paper, we provide a finite-time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$, where $k$ is the number of iterations. We also prove that a similar local superlinear convergence result holds for the case that the objective function is self-concordant. Numerical experiments on several datasets confirm our explicit convergence rate bounds. Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for quasi-Newton methods.


Introduction
In this paper, we focus on the non-asymptotic convergence analysis of quasi-Newton methods for the problem of minimizing a convex function $f:\mathbb{R}^d \to \mathbb{R}$, i.e., $\min_{x \in \mathbb{R}^d} f(x)$. Specifically, we focus on two different settings. In the first case, we assume that the objective function $f$ is strongly convex, smooth (its gradient is Lipschitz continuous), and its Hessian is Lipschitz continuous at the optimal solution. In the second case, we study the setting where the objective function $f$ is self-concordant. We formally define these settings in the following sections. In both considered cases, the optimal solution is unique and denoted by $x^*$.
There is an extensive literature on the use of first-order methods for convex optimization, and it is well known that the best achievable convergence rate for first-order methods, when the objective function is strongly convex and smooth, is a linear convergence rate. Specifically, we say a sequence $\{x_k\}$ converges linearly if $\|x_k - x^*\| \le C \gamma^k \|x_0 - x^*\|$, where $\gamma \in (0,1)$ is the constant of linear convergence and $C$ is a constant possibly depending on problem parameters. Among first-order methods, the accelerated gradient method proposed in [1] achieves a fast linear convergence rate of $(1 - \sqrt{\mu/L})^{k/2}$, where $\mu$ is the strong convexity parameter and $L$ is the smoothness parameter (the Lipschitz constant of the gradient) [2]. It is also known that the convergence rate of the accelerated gradient method is optimal for first-order methods in the setting where the problem dimension $d$ is sufficiently larger than the number of iterations [3].
Classical alternatives to improve the convergence rate of first-order methods are second-order methods [4,5,6,7], and in particular Newton's method. It has been shown that if, in addition to the smoothness and strong convexity assumptions, the objective function $f$ has a Lipschitz continuous Hessian, then the iterates generated by Newton's method converge to the optimal solution at a quadratic rate in a local neighborhood of the optimal solution; see [8, Chapter 9]. A similar result has been established for the case that the objective function is self-concordant [9]. Although the quadratic convergence rate of Newton's method holds only in a local neighborhood of the optimal solution, it can reduce the overall number of iterations significantly, as it is substantially faster than the linear rate of first-order methods. The fast quadratic convergence rate of Newton's method, however, does not come for free. Implementing Newton's method requires solving a linear system at each iteration with the matrix defined by the objective function Hessian $\nabla^2 f(x)$. As a result, the computational cost of implementing Newton's method in high-dimensional problems is prohibitive, as it could be $O(d^3)$, unlike first-order methods, which have a per-iteration cost of $O(d)$.
Quasi-Newton algorithms are quite popular, since they serve as a middle ground between first-order methods and Newton-type algorithms. They improve upon the linear convergence rate of first-order methods and achieve a local superlinear rate, while their computational cost per iteration is $O(d^2)$ instead of the $O(d^3)$ of Newton's method. The main idea of quasi-Newton methods is to approximate the step of Newton's method without computing the objective function Hessian $\nabla^2 f(x)$ or its inverse $\nabla^2 f(x)^{-1}$ at every iteration [10, Chapter 6]. To be more specific, quasi-Newton methods aim at approximating the curvature of the objective function by using only first-order information about the function, i.e., its gradients $\nabla f(x)$; see Section 2 for more details. There are several different approaches for approximating the objective function Hessian and its inverse using first-order information, which lead to different quasi-Newton updates, but perhaps the most popular quasi-Newton algorithms are the Symmetric Rank-One (SR1) method [11], the Broyden method [12,13,14], the Davidon-Fletcher-Powell (DFP) method [15,16], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [17,18,19,20], and the limited-memory BFGS (L-BFGS) method [21,22].
As mentioned earlier, a major advantage of quasi-Newton methods is their asymptotic local superlinear convergence rate. More precisely, we say that the sequence $\{x_k\}$ converges to the optimal solution $x^*$ superlinearly when the ratio between the distances to the optimal solution at times $k+1$ and $k$ approaches zero as $k$ approaches infinity, i.e., $\lim_{k \to \infty} \|x_{k+1} - x^*\| / \|x_k - x^*\| = 0$. This superlinear convergence result has been established in various settings for a large class of quasi-Newton methods, including the Broyden method [17,13,23], the DFP method [24,13,25], the BFGS method [13,25,26,27], and several other variants of these algorithms [28,29,30,31,32,33,34]. Although this result is promising and lies between the linear rate of first-order methods and the quadratic rate of Newton's method, it only holds asymptotically and does not characterize an explicit upper bound on the error of quasi-Newton methods after a finite number of iterations. As a result, the overall complexity of quasi-Newton methods for achieving an $\epsilon$-accurate solution, i.e., $\|x_k - x^*\| \le \epsilon$, cannot be explicitly characterized. Hence, it is essential to establish a non-asymptotic convergence rate for quasi-Newton methods, which is the main goal of this paper.
In this paper, we show that if the initial iterate is close to the optimal solution and the initial Hessian approximation error is sufficiently small, then the iterates of the convex Broyden class, including both the DFP and BFGS methods, converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$. We further show that our theoretical result suggests a trade-off between the size of the superlinear convergence neighborhood and the rate of superlinear convergence.
In other words, one can improve the numerical constant in the above rate at the cost of reducing the radius of the neighborhood in which DFP and BFGS converge superlinearly.We believe that our theoretical guarantee provides one of the first non-asymptotic results for the superlinear convergence rate of BFGS and DFP.
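As a quick numerical sanity check (not taken from the paper's experiments), the following sketch compares a generic linear bound $\gamma^k$ with the superlinear bound $(1/k)^{k/2}$; the value $\gamma = 0.9$ is an arbitrary illustrative choice.

```python
def linear_error(k, gamma=0.9):
    """Generic linear-convergence bound: gamma**k (gamma is illustrative)."""
    return gamma ** k

def superlinear_error(k):
    """Superlinear bound of the form (1/k)**(k/2) established in the paper."""
    return (1.0 / k) ** (k / 2)

# The superlinear contraction factor 1/sqrt(k) shrinks with k, so this bound
# eventually drops far below any fixed linear rate:
for k in (5, 10, 20, 40):
    print(k, linear_error(k), superlinear_error(k))
```

Already at $k = 40$ the superlinear bound is many orders of magnitude below the linear one, which is the qualitative behavior the analysis below makes precise.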
Related Work. In a recent work [35], the authors studied the non-asymptotic analysis of a class of greedy quasi-Newton methods that are based on the update rule of the Broyden family and use greedily selected basis vectors for updating the Hessian approximations. In particular, they show a superlinear convergence rate of $(1 - \frac{\mu}{dL})^{k^2/2} (\frac{dL}{\mu})^{k}$ for this class of algorithms. However, greedy quasi-Newton methods are more computationally costly than standard quasi-Newton methods, as they require computing a greedily selected basis vector at each iteration. It is worth noting that such computation requires access to additional information beyond the objective function gradient, e.g., the diagonal components of the Hessian. Also, two recent concurrent papers study the non-asymptotic superlinear convergence rate of the DFP and BFGS methods [36,37]. In [36], the authors show that when the objective function is smooth, strongly convex, and strongly self-concordant, the iterates of BFGS and DFP, in a local neighborhood of the optimal solution, achieve the superlinear convergence rates of $(\frac{dL}{\mu k})^{k/2}$ and $(\frac{dL^2}{\mu^2 k})^{k/2}$, respectively. In their follow-up paper [37], they further tighten these superlinear convergence rates for both methods. We would like to highlight that the proof techniques, assumptions, and final theoretical results of [36,37] and our paper are different and were derived independently. The major difference in the analysis is that in [36,37] the authors use a potential function related to the trace and the logarithm of the determinant of the Hessian approximation matrix, while we use a Frobenius-norm potential function. In addition, our convergence rates for both DFP and BFGS are independent of the problem dimension $d$. Nevertheless, in our results, the neighborhood of superlinear convergence depends on $d$. Moreover, to derive our results we consider two settings: in the first, the objective function is strongly convex, smooth, and has a Lipschitz continuous Hessian at the optimal solution; in the second, the function is self-concordant. Both of these settings are more general than the setting in [36,37], which requires the objective function to be strongly convex, smooth, and strongly self-concordant.
Outline. In Section 2, we discuss the Broyden class of quasi-Newton methods, including DFP and BFGS.
In Section 3, we state our assumptions and notation, as well as some general technical lemmas. Then, in Section 4, we present the main theoretical results of our paper on the non-asymptotic superlinear convergence of DFP and BFGS for the setting in which the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Section 5, we extend our theoretical results to the class of self-concordant functions by exploiting the proof techniques developed in Section 4. In Section 6, we provide a detailed discussion of the advantages and drawbacks of our theoretical results and compare them with some concurrent works. In Section 7, we numerically evaluate the performance of DFP and BFGS on several datasets and compare their convergence rates with our theoretical bounds. Finally, in Section 8, we close the paper with some concluding remarks.

Notation.
For a vector $v \in \mathbb{R}^d$, its Euclidean norm ($\ell_2$-norm) is denoted by $\|v\|$. We denote the Frobenius norm of a matrix $A \in \mathbb{R}^{d \times d}$ by $\|A\|_F = \sqrt{\sum_{i=1}^{d} \sum_{j=1}^{d} A_{ij}^2}$, and its induced $2$-norm is denoted by $\|A\| = \max_{\|v\|=1} \|Av\|$. The trace of a matrix $A$, which is the sum of its diagonal elements, is denoted by $\mathrm{Tr}(A)$. For any two symmetric matrices $A, B \in \mathbb{R}^{d \times d}$, we write $A \preceq B$ if and only if $B - A$ is a symmetric positive semidefinite matrix.

Quasi-Newton Methods
In this section, we review standard quasi-Newton methods and, in particular, discuss the updates of the DFP and BFGS algorithms. Consider a time index $k$, a step size $\eta_k$, and a positive-definite matrix $B_k$, and define a generic descent algorithm through the iteration $x_{k+1} = x_k - \eta_k B_k^{-1} \nabla f(x_k)$. Note that if we simply replace $B_k$ by the identity matrix $I$, we recover the update of gradient descent, and if we replace it by the objective function Hessian $\nabla^2 f(x_k)$, we obtain the update of Newton's method. The main goal of quasi-Newton methods is to find a symmetric positive-definite matrix $B_k$, using only first-order information, such that $B_k$ is close to the Hessian $\nabla^2 f(x_k)$. Note that the step size $\eta_k$ is often computed according to a line search routine to ensure the global convergence of quasi-Newton methods. Our focus in this paper, however, is on the local convergence of quasi-Newton methods, which requires the unit step size $\eta_k = 1$. Hence, in the rest of the paper, we assume that the iterate $x_k$ is sufficiently close to the optimal solution $x^*$ and that the step size is $\eta_k = 1$.
In most quasi-Newton methods, the function's curvature is approximated in a way that satisfies the secant condition. To better explain this property, let us first define the variable difference $s_k$ and gradient difference $y_k$ as $s_k := x_{k+1} - x_k$ and $y_k := \nabla f(x_{k+1}) - \nabla f(x_k)$. The goal is to find a matrix $B_{k+1}$ that satisfies the secant condition $B_{k+1} s_k = y_k$. The rationale for satisfying the secant condition is that the Hessian $\nabla^2 f(x_k)$ approximately satisfies this condition when $x_{k+1}$ and $x_k$ are close to each other, e.g., when they are both close to the optimal solution $x^*$. However, the secant condition alone is not sufficient to specify $B_{k+1}$. To resolve this indeterminacy, different quasi-Newton algorithms consider different additional conditions. One common constraint is to enforce the Hessian approximation (or its inverse) at time $k+1$ to be close to the one computed at time $k$. This is a reasonable extra condition, as we expect the Hessian (or its inverse) evaluated at $x_{k+1}$ to be close to the one computed at $x_k$.
In the DFP method, we enforce the proximity condition on the Hessian approximations $B_k$ and $B_{k+1}$. Basically, we aim to find the closest positive-definite matrix to $B_k$ (in some weighted matrix norm) that satisfies the secant condition; see Chapter 6 of [10] for more details. The update of the Hessian approximation matrices of DFP is given by
$$B_{k+1} = \Big(I - \frac{y_k s_k^\top}{y_k^\top s_k}\Big) B_k \Big(I - \frac{s_k y_k^\top}{y_k^\top s_k}\Big) + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
Since implementation of the update in (1) requires access to the inverse of the Hessian approximation, it is essential to derive an explicit update for the Hessian inverse approximation to avoid the cost of inverting a matrix at each iteration. If we define $H_k$ as the inverse of $B_k$, i.e., $H_k = B_k^{-1}$, then using the Sherman-Morrison-Woodbury formula one can write
$$H_{k+1} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top}{s_k^\top y_k}.$$
The BFGS method can be considered as the dual of DFP. In BFGS, we also seek a positive-definite matrix that satisfies the secant condition, but instead of enforcing the proximity condition on the Hessian approximation $B$, we enforce it on the Hessian inverse approximation $H$. To be more precise, we aim to find a positive-definite matrix $H_{k+1}$ that satisfies the secant condition $s_k = H_{k+1} y_k$ and is the closest matrix (in some weighted norm) to the previous Hessian inverse approximation $H_k$. The update of the Hessian inverse approximation matrices of BFGS is given by
$$H_{k+1} = \Big(I - \frac{s_k y_k^\top}{s_k^\top y_k}\Big) H_k \Big(I - \frac{y_k s_k^\top}{s_k^\top y_k}\Big) + \frac{s_k s_k^\top}{s_k^\top y_k}.$$
Similarly, by the Sherman-Morrison-Woodbury formula, the update of the BFGS method for the Hessian approximation matrices is given by
$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
Note that both DFP and BFGS belong to a more general class of quasi-Newton methods called the Broyden class. The Hessian approximation $B_{k+1}$ of the Broyden class is defined as a combination of the DFP and BFGS updates with coefficients $\phi_k, \psi_k \in \mathbb{R}$.
In this paper, we focus only on the convex class of Broyden quasi-Newton methods, where $\phi_k, \psi_k \in [0, 1]$. The steps of this class of methods are summarized in Algorithm 1. In fact, in Algorithm 1, if we set $\psi_k = 0$, we recover DFP, and if we set $\psi_k = 1$, we recover BFGS. It is worth noting that the cost of computing the descent direction for this class of quasi-Newton methods is $O(d^2)$, which improves upon the $O(d^3)$ per-iteration cost of Newton's method.
Remark 2.1. Note that when $s_k = 0$, we have $\nabla f(x_k) = 0$ from (1) and thus $x_k = x^*$. Hence, in our implementation and analysis we assume $s_k \neq 0$. Moreover, in both considered settings, the objective function is at least strictly convex. As a result, if $s_k \neq 0$, then it follows that $y_k \neq 0$ and $s_k^\top y_k > 0$. This observation shows that the updates of BFGS and DFP are well-defined. Finally, it is well known that for the convex class of Broyden methods, if $B_k$ is symmetric positive-definite and $s_k^\top y_k > 0$, then $B_{k+1}$ is also symmetric positive-definite [10]. In Algorithm 1, we assume that the initial Hessian approximation $B_0$ is symmetric positive-definite; hence, all Hessian approximation matrices $B_k$ and their inverses $H_k$ are symmetric positive-definite.
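The positive-definiteness claim in Remark 2.1 can be checked numerically. The sketch below uses the common parameterization $B_{k+1} = \phi_k B^{\mathrm{DFP}}_{k+1} + (1 - \phi_k) B^{\mathrm{BFGS}}_{k+1}$ with $\phi_k \in [0, 1]$ (an assumed convention in which $\phi_k = 1$ recovers DFP and $\phi_k = 0$ recovers BFGS); the matrices and vectors are illustrative.

```python
import numpy as np

def bfgs_B_update(B, s, y):
    """BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_B_update(B, s, y):
    """DFP update of the Hessian approximation B."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)

def broyden_B_update(B, s, y, phi):
    """Convex Broyden-class update: phi = 1 gives DFP, phi = 0 gives BFGS."""
    return phi * dfp_B_update(B, s, y) + (1 - phi) * bfgs_B_update(B, s, y)

# Illustrative SPD matrix and curvature pair with s @ y > 0:
B = np.diag([1.0, 2.0, 3.0])
s = np.array([1.0, -0.5, 0.2])
y = np.array([1.1, -0.4, 0.3])
Bnext = broyden_B_update(B, s, y, phi=0.5)
# Bnext stays symmetric positive-definite and satisfies B_{k+1} s_k = y_k.
```

Since both endpoint updates satisfy the secant condition and preserve positive-definiteness when $s^\top y > 0$, any convex combination of them does as well.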

Preliminaries
In this section, we first specify the assumptions required for our results in Section 4 and introduce some notation to simplify our expressions. Moreover, we present some intermediate lemmas that will be used later in Section 4 to prove our main theoretical results for the setting in which the objective function is strongly convex, smooth, and its Hessian is Lipschitz continuous at the optimal solution. In Section 5, we will use a subset of these intermediate results to extend our analysis to the class of self-concordant functions.

Assumptions
We formally state the required assumptions for establishing our theoretical results in Section 4.
Assumption 3.1. The objective function $f(x)$ is twice differentiable. Moreover, it is strongly convex with parameter $\mu > 0$ and its gradient $\nabla f$ is Lipschitz continuous with parameter $L > 0$. Hence, $\mu \|x - y\| \le \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$. As $f$ is twice differentiable, Assumption 3.1 implies that the eigenvalues of the Hessian are larger than $\mu$ and smaller than $L$, i.e., $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x \in \mathbb{R}^d$.
Assumption 3.2. The Hessian $\nabla^2 f(x)$ satisfies the following condition for some constant $M \ge 0$: $\|\nabla^2 f(x) - \nabla^2 f(x^*)\| \le M \|x - x^*\|$ for all $x \in \mathbb{R}^d$. The condition in Assumption 3.2 is common for analyzing second-order methods, as we require a regularity condition on the objective function Hessian. In fact, Assumption 3.2 is one of the least strict conditions required for the analysis of second-order-type methods, as it requires Lipschitz continuity of the Hessian only at (near) the optimal solution. This condition is, indeed, weaker than assuming that the Hessian is Lipschitz continuous everywhere. Note that for the class of strongly convex and smooth functions, the strong self-concordance assumption required in [36,37] is equivalent to assuming that the Hessian is Lipschitz continuous everywhere. Hence, the condition in Assumption 3.2 is also weaker than the one in [36,37]. Assumption 3.2 leads to the following corollary.
Corollary 3.1. If the condition in Assumption 3.2 holds, then for all $x, y \in \mathbb{R}^d$, we have
Proof. Check Appendix A.
Remark 3.2. Our analysis can be extended to the case in which Assumptions 3.1 and 3.2 only hold in a local neighborhood of the optimal solution $x^*$. Here, we assume they hold on $\mathbb{R}^d$ to simplify our proofs.

Notations
Next, we briefly mention some definitions and notation that will be used in the following theorems and proofs. Since $\nabla^2 f(x^*)$ is symmetric positive-definite, so are $\nabla^2 f(x^*)^{1/2}$ and $\nabla^2 f(x^*)^{-1/2}$. Throughout the paper, we analyze and study the weighted version of the Hessian approximation $B_k$ defined as $\hat{B}_k := \nabla^2 f(x^*)^{-1/2} B_k \nabla^2 f(x^*)^{-1/2}$. Note that $\hat{B}_k$ is symmetric positive-definite, since $B_k$ and $\nabla^2 f(x^*)^{-1/2}$ are both symmetric positive-definite. We also use $\|\hat{B}_k - I\|_F$ as the measure of closeness between $B_k$ and $\nabla^2 f(x^*)$, which can be written as $\|\hat{B}_k - I\|_F = \|\nabla^2 f(x^*)^{-1/2} (B_k - \nabla^2 f(x^*)) \nabla^2 f(x^*)^{-1/2}\|_F$. We further define the weighted gradient difference $\hat{y}_k := \nabla^2 f(x^*)^{-1/2} y_k$, the weighted variable difference $\hat{s}_k := \nabla^2 f(x^*)^{1/2} s_k$, and the weighted gradient $\nabla^2 f(x^*)^{-1/2} \nabla f(x_k)$. To measure the closeness of the iterate $x_k$ to the optimal solution, we use $r_k \in \mathbb{R}^d$, $\sigma_k \in \mathbb{R}$, and $\tau_k \in \mathbb{R}$, which are formally defined in (15). In (15), $\mu$ is the strong convexity parameter defined in Assumption 3.1 and $M$ is the Lipschitz continuity parameter of the Hessian at the optimal solution defined in Assumption 3.2. In our analysis, we also use the average Hessian $J_k := \int_0^1 \nabla^2 f(x^* + t(x_k - x^*))\, dt$ and its weighted version $\hat{J}_k := \nabla^2 f(x^*)^{-1/2} J_k \nabla^2 f(x^*)^{-1/2}$, as formalized in (16).
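The weighted Frobenius-norm potential $\|\hat{B}_k - I\|_F$ can be computed as sketched below; the matrices are illustrative choices, and the inverse square root is formed via an eigendecomposition.

```python
import numpy as np

def inv_sqrt_spd(A):
    """Inverse square root of a symmetric positive-definite matrix via eigh."""
    w, V = np.linalg.eigh(A)
    return (V / np.sqrt(w)) @ V.T  # V diag(1/sqrt(w)) V^T

def weighted_error(B, hess_star):
    """Frobenius-norm potential || hat(B) - I ||_F, where
    hat(B) = hess_star^{-1/2} B hess_star^{-1/2}."""
    P = inv_sqrt_spd(hess_star)
    return np.linalg.norm(P @ B @ P - np.eye(B.shape[0]), ord="fro")

# Illustrative Hessian at the optimum; a perfect approximation gives zero error:
hess_star = np.diag([1.0, 4.0])
print(weighted_error(hess_star, hess_star))   # 0.0
print(weighted_error(np.eye(2), hess_star))   # 0.75
```

The weighting removes the conditioning of $\nabla^2 f(x^*)$ from the error measure, so an exact Hessian approximation has potential zero regardless of how ill-conditioned the problem is.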

Intermediate Lemmas
Next, we present some lemmas that we will later use to establish the non-asymptotic superlinear convergence of DFP and BFGS. Proofs of these lemmas are relegated to the appendix.
Lemma 3.3. For any matrix $A \in \mathbb{R}^{d \times d}$ and vector $u \in \mathbb{R}^d$ with $\|u\| = 1$, we have
Proof. Check Appendix B.
Lemma 3.4. For any matrices $A, B \in \mathbb{R}^{d \times d}$, we have
Proof. Check Appendix C.
The results in Lemma 3.3 and Lemma 3.4 hold for arbitrary matrices. The next lemma focuses on some properties of the weighted average Hessian $\hat{J}_k$ under Assumptions 3.1 and 3.2.
Lemma 3.5. Recall the definition of $\sigma_k$ in (15) and $\hat{J}_k$ in (16). If Assumptions 3.1 and 3.2 hold, then the following inequalities hold for all $k \ge 0$:
Proof. Check Appendix D.
In the following lemma, we establish some bounds that depend on the weighted gradient difference $\hat{y}_k$ and the weighted variable difference $\hat{s}_k$.

Main Theoretical Results
In this section, we characterize the non-asymptotic superlinear convergence of the Broyden class of quasi-Newton methods when Assumptions 3.1 and 3.2 hold. In Section 4.1, we first establish a crucial proposition that characterizes the Hessian approximation error for this class of quasi-Newton methods. Then, in Section 4.2, we leverage this result to show that the iterates of this class of algorithms converge at least linearly to the optimal solution if the initial distance to the optimal solution and the initial Hessian approximation error are sufficiently small. Finally, we use these intermediate results in Section 4.3 to prove that the iterates of the convex Broyden class, including both DFP and BFGS, converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$. Note that in Algorithm 1 we use the Hessian inverse approximation matrix $H_k$ to describe the algorithm, but in our analysis we study the behavior of the Hessian approximation matrix $B_k$.

Hessian approximation error: Frobenius norm potential function
Next, we use the Frobenius norm of the Hessian approximation error, $\|\hat{B}_k - I\|_F$, as the potential function in our analysis. Specifically, we will use the results of Lemma 3.3, Lemma 3.4, and Lemma 3.6 to study the dynamics of the Hessian approximation error $\|\hat{B}_k - I\|_F$ for both DFP and BFGS. We first start with the DFP method.
Lemma 4.1. Consider the update of DFP in (3) and recall the definition of $\tau_k$ in (15). Suppose that for some $\delta > 0$ and some $k \ge 0$, we have $\tau_k < 1$ and $\|\hat{B}_k - I\|_F \le \delta$. Then, the matrix $B^{\mathrm{DFP}}_{k+1}$ generated by the DFP update satisfies the inequality in (24).
Proof. The proof and conclusion of this lemma are similar to those of Lemma 3.2 in [33], except for the value of the parameter $W_k$. This difference comes from the fact that [33] analyzed the modified DFP update, while we consider the standard DFP method. Recall the DFP update in (3) and multiply both sides of that expression by the matrix $\nabla^2 f(x^*)^{-1/2}$ from the left and right to obtain (25). To simplify the proof, we use the shorthand notation in (26). Hence, (25) is equivalent to (27). Moreover, we can express $B^+ - I$ accordingly. Notice that $P^2 = P$ and $P^\top = P$. Thus, (27) can be simplified as (28), where the matrices $D$, $E$, $G$, and $H$ are defined by the corresponding terms. Next, we proceed to upper bound $\|B^+ - I\|_F$. To do so, we derive upper bounds on the Frobenius norms of the matrices $D$, $E$, $G$, and $H$. We start with $\|D\|_F$. If we set $u = s/\|s\|$ and $A = B - I$ in Lemma 3.3, we obtain (29), where the second inequality follows from the fact that $\|B - I\|_F^2 - \|D\|_F^2 \ge 0$ and the assumption that $\|B - I\|_F \le \delta$. Next, if we replace the right-hand side of (28) by its upper bound in (29) and rearrange the resulting expression, we obtain (30), which provides an upper bound on $\|D\|_F$. To derive upper bounds for $\|E\|_F$, $\|G\|_F$, and $\|H\|_F$, we first need to find an upper bound for $\|Q\|_F$, where $Q$ is defined in (26). In the corresponding chain of relations, the first equality holds by the definition of $Q$, the second equality is obtained by adding and subtracting the rank-one term $s y^\top / \|s\|^2$, and the inequality holds due to the triangle inequality. We can further simplify the right-hand side to obtain (31), where the second inequality holds by the Cauchy-Schwarz inequality and the fact that $\|ab^\top\|_F = \|a\|\,\|b\|$ for $a, b \in \mathbb{R}^d$, and the last inequality holds due to the results in (20), (21), and (22).
Next, using the upper bound in (31), we can bound $\|E\|_F$ as in (32), where we used the triangle inequality in the last step. Using the definition of $Q$, we can establish the corresponding bound, where for the second inequality we use (31) and $\|ab^\top\|_F = \|a\|\,\|b\|$, and for the third inequality we use the results in (20), (21), and (22).
We proceed to derive an upper bound for $\|G\|_F$. Note that $0 \preceq P \preceq I$, and thus $\|P\| \le 1$. Using this observation, (31), and the first inequality in (18), we can show that $\|G\|_F$ is bounded above as in (33). Finally, we provide an upper bound for $\|H\|_F$. By leveraging the second inequality in (18) and the fact that $\|A\| \le \|A\|_F$ for any matrix $A \in \mathbb{R}^{d \times d}$, we obtain the bound in (34), where for the last inequality we used the result in (31).
If we replace $\|D\|_F$, $\|E\|_F$, $\|G\|_F$, and $\|H\|_F$ with their upper bounds in (30), (32), (33), and (34), respectively, we obtain the combined bound. Considering the notation introduced in (26), the result in (24) follows from the above inequality, and the proof is complete.
The result in Lemma 4.1 shows how the Hessian approximation error in DFP evolves as we run the updates. Next, we establish a similar result for the BFGS method.
Lemma 4.2. Consider the update of BFGS in (6) and recall the definition of $\tau_k$ in (15). Suppose that for some $\delta > 0$ and some $k \ge 0$, we have $\tau_k < 1$ and $\|\hat{B}_k - I\|_F \le \delta$. Then, the matrix $B^{\mathrm{BFGS}}_{k+1}$ generated by the BFGS update satisfies the inequality in (35).
Proof. The proof of this lemma is adapted from the proof of Lemma 3.6 in [32]. We should also add that our upper bound in (35) improves the bound in [32], as it contains an additional negative term. Recall the BFGS update in (6) and multiply both sides of that expression by $\nabla^2 f(x^*)^{-1/2}$ from the left and right to obtain (36). To simplify the proof, we use the shorthand notation in (37). Considering this notation, the expression in (36) can be written as (38). Substituting the above simplifications into (38), we obtain (39). Next, we proceed to show that the second term on the right-hand side of (39) is non-positive, which follows from the Cauchy-Schwarz inequality, as stated in (40). By combining (39) and (40), we obtain (41). The above inequality implies that $\|B - I\|_F^2 - \|D\|_F^2 \ge 0$. Moreover, using this fact, we obtain (42), where the second inequality follows from $\|B - I\|_F^2 - \|D\|_F^2 \ge 0$ and the fact that $\|B - I\|_F \le \delta$. Now, if we combine the results in (41) and (42), we obtain (43), which provides an upper bound on $\|D\|_F$. Moreover, according to (32), $\|E\|_F$ is bounded above as in (44). If we replace $\|D\|_F$ and $\|E\|_F$ with their upper bounds in (43) and (44), we obtain the claimed bound with $V = \frac{3+\tau}{1-\tau}$. Considering the notation in (37), the claim follows from the above inequality.
Now we can combine Lemma 4.1 and Lemma 4.2 to derive a bound on the Hessian approximation error for the (convex) Broyden class of quasi-Newton methods.
Lemma 4.3. Consider the update of the (convex) Broyden family in (7) and recall the definition of $\tau_k$ in (15). Suppose that for some $\delta > 0$ and some $k \ge 0$, we have $\tau_k < 1$ and $\|\hat{B}_k - I\|_F \le \delta$. Then, the matrix $B_{k+1}$ generated by (7) satisfies the inequality in (45). To see this, note that the Broyden update is a convex combination of the DFP and BFGS updates. Using this expression and the convexity of the norm, we can bound $\|\hat{B}_{k+1} - I\|_F$ by the corresponding convex combination of the two errors. By replacing $\|\hat{B}^{\mathrm{DFP}}_{k+1} - I\|_F$ and $\|\hat{B}^{\mathrm{BFGS}}_{k+1} - I\|_F$ with their upper bounds in Lemma 4.1 and Lemma 4.2, the claim in (45) follows. Moreover, since the subtracted term is non-negative, the result in (45) implies (46).

Linear convergence
In this section, we leverage the results from the previous section on the Hessian approximation error to show that if the initial iterate is sufficiently close to the optimal solution and the initial Hessian approximation matrix is close to the Hessian at the optimal solution, then the iterates of BFGS and DFP converge at least linearly to the optimal solution. Moreover, the Hessian approximation matrices always stay close to the Hessian at the optimal solution, and the norms of the Hessian approximation matrices and their inverses are always bounded above. These results are essential for proving our non-asymptotic superlinear convergence results.
Lemma 4.4. Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1, and recall the definitions in (12)-(15). Suppose Assumptions 3.1 and 3.2 hold. Moreover, suppose the initial point $x_0$ and the initial Hessian approximation matrix $B_0$ satisfy the condition in (47), where $\epsilon, \delta \in (0, \frac{1}{2})$ are such that, for some $\rho \in (0, 1)$, they satisfy (48). Then, the sequence of iterates $\{x_k\}_{k=0}^{+\infty}$ converges to the optimal solution $x^*$ as stated in (49). Furthermore, the matrices $\{B_k\}_{k=0}^{+\infty}$ stay in a neighborhood of $\nabla^2 f(x^*)$ defined by (50). Moreover, the norms $\{\|\hat{B}_k\|\}_{k=0}^{+\infty}$ and $\{\|\hat{B}_k^{-1}\|\}_{k=0}^{+\infty}$ are all uniformly bounded above as in (51).
Proof. The proof of this lemma is adapted from the proof of Theorem 3.1 in [33]. In [33], the authors prove the results for the modified DFP method, while we consider the more general class of Broyden methods. We will use induction to prove (49), (50), and (51). First, consider the base case $k = 0$. By the initial condition (47), it is clear that (50) holds for $k = 0$. From (50) we know that all the eigenvalues of $\hat{B}_0$ are in the interval $[1 - 2\delta, 1 + 2\delta]$. Letting $\lambda_{\max}(\hat{B}_0)$ and $\lambda_{\min}(\hat{B}_0)$ denote the largest and smallest eigenvalues of $\hat{B}_0$, respectively, we conclude that (51) holds for $k = 0$. Based on Assumptions 3.1 and 3.2 and the definitions in (12)-(15), we can bound the initial quantities. Now, using the result in (23) and the bounds in (47), (48), (50), and (51) for $k = 0$, we can show that the condition in (49) holds for $k = 0$. Hence, all the conditions in (49), (50), and (51) hold for $k = 0$, and the base case of the induction is complete. Now we assume that the conditions in (49), (50), and (51) hold for all $0 \le k \le t$, where $t \ge 0$.
Our goal is to show that these conditions are also satisfied for $k = t + 1$. Since (49) holds for all $0 \le k \le t$, we have the bound in (53). Using (51) and $\sigma_k \le \epsilon$ for $0 \le k \le t$, we obtain (54). Further, if (47) and (49) hold for $0 \le k \le t$, we have (55). Considering these results, we can show that the accumulated approximation error remains bounded, where the last inequality holds due to the first inequality in (48). By leveraging (55) and (47) and summing both sides of (53) from $k = 0$ to $t$, we obtain that (50) holds for $k = t + 1$. Applying the same techniques we used in the base case, we can prove that (49) and (51) hold for $k = t + 1$. Hence, all the claims in (49), (50), and (51) hold for $k = t + 1$, and our induction step is complete.

Explicit non-asymptotic superlinear rate
In the previous section, we established local linear convergence of the iterates generated by the convex Broyden class, including DFP and BFGS. Indeed, these local linear results are not our ultimate goal, as first-order methods are also linearly convergent under the same assumptions. However, the linear convergence is required to establish a local non-asymptotic superlinear convergence result, which is our main contribution. Next, we state the main results of this paper on the non-asymptotic superlinear convergence rate of the convex Broyden class of quasi-Newton methods. To prove this claim, we use the results in Lemma 4.3 and Lemma 4.4.
Theorem 4.5. Suppose Assumptions 3.1 and 3.2 hold, and suppose the initial point $x_0$ and the initial Hessian approximation $B_0$ satisfy the condition in (56), where $\epsilon, \delta \in (0, \frac{1}{2})$ are such that, for some $\rho \in (0, 1)$, they satisfy (57). Then the iterates $\{x_k\}_{k=0}^{+\infty}$ generated by the convex Broyden class of quasi-Newton methods converge to $x^*$ at the superlinear rate given in (58), where $q$ is a constant, defined in the proof, satisfying $q \in [1, \frac{1+2\delta}{1-2\delta}]$.
Proof. When both conditions (56) and (57) hold, by Lemma 4.4, the results in (49), (50), and (51) hold for any $t \ge 0$. Hence, using Lemma 4.3, for any $t \ge 0$ we can establish the recursion in (60). Using (54) and (55), for $k \ge 0$ we have (61). Now, summing both sides of (60) from $t = 0$ to $k - 1$, regrouping the terms, and using the results in (56) and (61), we obtain (62). Moreover, using the bounds in (51), we obtain (63). By combining the bounds in (62) and (63), computing the minimum value of the term $\phi_t + (1 - \phi_t)\frac{1-2\delta}{1+2\delta}$, and regrouping the terms, we can simplify our upper bound. Considering the definition of $q$ as the maximum over $k \ge 0$ of the resulting $\phi_k$-dependent factor, and using the Cauchy-Schwarz inequality, we arrive at (64). Note that since $\phi_k \in [0, 1]$, we have $q \in [1, \frac{1+2\delta}{1-2\delta}]$. The result in (64) provides an upper bound on the sum in (64), which is a crucial term in the remainder of our proof. Now, note that $\nabla f(x_t) = J_t (x_t - x^*)$, where $J_t$ is defined in (16). This implies that $x_t - x^* = J_t^{-1} \nabla f(x_t)$, and hence, using the fact that $-B_t s_t = \nabla f(x_t)$, we can express $x_{t+1} - x^*$ in terms of $x_t - x^*$. Pre-multiplying both sides of the resulting expression by $\nabla^2 f(x^*)^{1/2}$, we obtain (65). From Lemma 3.5 we know that $\|\hat{J}_t^{-1}\| \le 1 + \frac{\sigma_t}{2}$ and $\|\hat{J}_t - I\| \le \frac{\sigma_t}{2}$. Also, since $\sigma_{t+1} \le \rho \sigma_t$, we obtain that $\|r_{t+1}\| \le \rho \|r_t\|$. Hence, we can write (66). Using the expressions in (65) and (66), we can show that the ratio $\|r_{t+1}\| / \|r_t\|$ is bounded above as in (67). Summing both sides of (67) from $t = 0$ to $k - 1$, using $\sigma_t \le \epsilon$, (61), and (64), and then leveraging the arithmetic-geometric mean inequality, we obtain the claimed bound. The proof of (58) is complete. Next, we proceed to prove (59). Based on the definition of $J_t$, we have the expression in (68), where we used
∇f(x*) = 0 and the definitions in (15) and (16). By Lemma 3.5 and σ_t ≤ ϵ, we have

and

By combining (68), (69) and (70), we obtain that

and the claim in (59) holds.
The above theorem establishes the non-asymptotic superlinear convergence of the convex Broyden class of quasi-Newton methods. Notice that we use the weighted norm in (58) to characterize the convergence rate; since, under our assumptions, the weighted norm ‖∇²f(x*)^{1/2}(x − x*)‖ and the Euclidean norm ‖x − x*‖ are equivalent up to constant factors, an analogous rate holds for the Euclidean distance to the optimal solution. Next, we use the above theorem to report the results for DFP and BFGS, which are two special cases of the convex Broyden class of quasi-Newton methods.
Corollary 4.6. Consider the DFP and BFGS methods. Suppose Assumptions 3.1 and 3.2 hold and, for some ϵ, δ ∈ (0, 1/2) and ρ ∈ (0, 1), the initial point x_0 and initial Hessian approximation B_0 satisfy

• For the DFP method, if the tuple (ϵ, δ, ρ) satisfies

then the iterates {x_k}_{k=0}^{+∞} generated by the DFP method converge to x* at a superlinear rate of

• For the BFGS method, if the tuple (ϵ, δ, ρ) satisfies

then the iterates {x_k}_{k=0}^{+∞} generated by the BFGS method converge to x* at a superlinear rate of (78).

Proof. In Theorem 4.5, set φ_k = 1 to obtain the results for DFP and set φ_k = 0 to obtain the results for BFGS.
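As a concrete reference for these two special cases, the standard convex Broyden-class update of the Hessian approximation is the convex combination of the BFGS and DFP updates. The sketch below is our illustrative NumPy implementation (the function name and interface are not from the paper), following the paper's convention that φ = 1 recovers DFP and φ = 0 recovers BFGS.

```python
import numpy as np

def broyden_update(B, s, y, phi):
    """One convex Broyden-class update of the Hessian approximation B,
    with s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k).
    phi = 0 gives the BFGS update, phi = 1 gives the DFP update."""
    sy = s @ y                            # curvature s^T y (positive for strongly convex f)
    Bs = B @ s
    # BFGS update of B
    B_bfgs = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
    # DFP update of B
    V = np.eye(len(s)) - np.outer(y, s) / sy
    B_dfp = V @ B @ V.T + np.outer(y, y) / sy
    # convex combination: phi = 1 -> DFP, phi = 0 -> BFGS
    return (1 - phi) * B_bfgs + phi * B_dfp
```

Every member of the class satisfies the secant equation B_{k+1} s_k = y_k, which is a quick correctness check for the update.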
The results in Corollary 4.6 indicate that, in a local neighborhood of the optimal solution, the iterates generated by DFP and BFGS converge to the optimal solution at a superlinear rate whose constants C_1 and C_2 are determined by ρ, ϵ and δ. Indeed, as time progresses, the rate behaves as O((1/√k)^k). The tuple (ρ, ϵ, δ) is independent of the problem parameters (µ, L, M, d), and the only required condition on the tuple (ρ, ϵ, δ) is that it satisfies (73) or (76). Note that the superlinear rates in (74) and (77) are faster than the linear rate of first-order methods, as the contraction coefficient approaches zero at a sublinear rate of O(1/√k). Similarly, in terms of the function value, the superlinear rates in (75) and (78) behave as O((1/k)^k). The result in Corollary 4.6 also shows the existence of a trade-off between the rate of convergence and the neighborhood of superlinear convergence. We highlight this point in the following remark.

Remark 4.7.
There exists a trade-off between the size of the local neighborhood in which DFP or BFGS converges superlinearly and their rate of convergence. To be more precise, by choosing larger values for ϵ and δ (as long as they satisfy (73) or (76)), we can increase the size of the region in which the quasi-Newton method has a fast superlinear convergence rate, but this leads to a slower superlinear convergence rate according to the bounds in (74), (75), (77) and (78). Conversely, by choosing small values for ϵ and δ, the rate of convergence becomes faster, but the local neighborhood defined in (72) becomes smaller.
The final convergence results of Corollary 4.6 depend on the choice of the parameters (ρ, ϵ, δ), and it may not be easy to quantify the exact convergence rate at first glance. To better quantify the superlinear convergence rate of DFP and BFGS, in the following corollary we state the results of Corollary 4.6 for specific choices of ρ, ϵ and δ which simplify our expressions. Indeed, one can choose another set of values for these parameters to control the neighborhood and rate of superlinear convergence, as long as they satisfy the conditions in (73) for DFP and (76) for BFGS.
The results in Corollary 4.8 show that for some specific choices of (ϵ, δ, ρ), the convergence rate of DFP and BFGS is (1/k)^{k/2}, which is asymptotically faster than any linear convergence rate of first-order methods. Moreover, we observe that the neighborhood in which this fast superlinear rate holds is slightly larger for BFGS than for DFP; compare the first conditions in (79) and (80). This is consistent with the fact that, in practice, BFGS often outperforms DFP.
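To make the comparison with linearly convergent methods concrete, a small back-of-the-envelope computation (with an illustrative linear contraction of 0.9 per step, not a value from the paper) shows how quickly the superlinear factor (1/k)^{k/2} overtakes a typical linear one:

```python
# Compare the superlinear factor (1/k)^(k/2) against a linear contraction 0.9^k.
for k in [2, 5, 10, 20]:
    superlinear = (1.0 / k) ** (k / 2)
    linear = 0.9 ** k
    print(f"k={k:2d}  (1/k)^(k/2)={superlinear:.2e}  0.9^k={linear:.2e}")
```

Already at k = 2 the superlinear factor is below 0.9^k, and the gap widens rapidly with k.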
A major shortcoming of the results in Corollary 4.6 and Corollary 4.8 is that, in addition to assuming that the initial iterate x_0 is sufficiently close to the optimal solution, we also require the initial Hessian approximation error to be sufficiently small. In the following theorem, we resolve this issue by suggesting a practical choice for B_0 such that the second assumption in (79) and (80) can be satisfied under some conditions. To be more precise, we show that if the initial distance to the optimal solution is sufficiently small (we formally describe this condition), then by setting B_0 = ∇²f(x_0), the second condition in (79) and (80) on the Hessian approximation is satisfied, and we can achieve the convergence rate in (81).
and for BFGS, they satisfy

Then, the iterates {x_k}_{k=0}^{+∞} generated by the DFP and BFGS methods satisfy

Proof. First we consider the case of the DFP method. Notice that by (82), we obtain

Hence, the first part of (79) is satisfied. Moreover, using Assumptions 3.1 and 3.2, we have

The first inequality holds since ‖A‖_F ≤ √d ‖A‖ for any matrix A ∈ R^{d×d}, and the last inequality is due to the first part of (82). The above bound shows that the second part of (79) is also satisfied, and the claim follows by Corollary 4.8. The proof for BFGS is similar: it can be derived by following the steps of the proof for DFP and exploiting the BFGS results in Corollary 4.8.
According to Theorem 4.9, if the initial weighted error ‖∇²f(x*)^{1/2}(x_0 − x*)‖ is sufficiently small, then by setting the initial Hessian approximation B_0 to the Hessian at the initial point, ∇²f(x_0), the iterates converge superlinearly at a rate of (1/k)^{k/2}. More specifically, based on the result in (23), it suffices to have the bounds stated in (82) and (83). This observation implies that, in practice, we can use any optimization algorithm to find an initial point x_0 satisfying this condition, and once it is satisfied, by setting B_0 = ∇²f(x_0) we obtain the guaranteed superlinear convergence result. The suggested procedure requires only one evaluation of the Hessian inverse, at the initial iterate; in the rest of the algorithm, the Hessian inverse approximations are updated according to the convex Broyden update in (8).
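The procedure suggested above can be sketched as a two-phase driver. This is a hypothetical implementation (step size, tolerance, and names are ours): run any first-order method until the iterate is close enough to x*, then pay for a single Hessian inversion to form H_0 = ∇²f(x_0)^{-1} before handing off to the quasi-Newton loop.

```python
import numpy as np

def warm_start(grad, hess, x, step=0.1, tol=1e-3, max_iter=10_000):
    """Phase 1: plain gradient descent until the gradient is small.
    Phase 2: a single Hessian inversion builds the initial inverse approximation H0."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - step * g                  # any globally convergent first-order method works here
    H0 = np.linalg.inv(hess(x))           # the only Hessian inversion in the whole scheme
    return x, H0
```

The returned pair (x, H0) is then used to initialize the quasi-Newton iteration x_{k+1} = x_k − H_k ∇f(x_k).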
Note that the condition ∇²f(x) ≻ 0 guarantees that the inner product s_k^T y_k in the quasi-Newton updates is positive in all iterations, as stated in Section 2. Also, by the definition of self-concordance, the function f(x) is strictly convex. We start our analysis by stating the following lemma, which plays an important role in our analysis for self-concordant functions.
The next two lemmas are based on Lemma 5.1 and are similar to the results in Lemmas 3.5 and 3.6, except that here we prove them for the case where the conditions in Assumption 5.1 are satisfied.
Lemma 5.2. Recall the definition of r_k in (15) and Ĵ_k in (16). If Assumption 5.1 holds and r_k ≤ 1/2, then for all k ≥ 0 we have

Proof. Check Appendix F.

Lemma 5.3. Suppose that for any k ≥ 0, we have θ_k ≤ 1/2. If Assumption 5.1 holds, then for all k ≥ 0 we have

Proof. Check Appendix G.
By comparing Lemma 5.2 and Lemma 5.3 with Lemma 3.5 and Lemma 3.6, respectively, we observe that the only difference between these results is that the term σ_k/2 = (M/(2µ^{3/2})) r_k from the strongly convex setting is replaced by 2r_k.

Theorem 5.4. Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumption 5.1. Moreover, suppose the initial point x_0 and initial Hessian approximation matrix B_0 satisfy

where ϵ, δ ∈ (0, 1/2) are such that, for some ρ ∈ (0, 1), they satisfy

Then the iterates {x_k}_{k=0}^{+∞} generated by the convex Broyden class of quasi-Newton methods converge to x* at a superlinear rate of

where q = max_{k≥0}

Proof. Check Appendix H.
Similarly, we can set φ_k = 1 or φ_k = 0 to obtain the results for DFP and BFGS, respectively, as in Corollary 4.6. We can also select specific values for (ϵ, δ, ρ) to simplify our bounds.
Corollary 5.5. Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point x_0 and initial Hessian approximation matrix B_0 satisfy

and for the BFGS method, the initial point x_0 and initial Hessian approximation matrix B_0 satisfy

Then, the iterates {x_k}_{k=0}^{+∞} generated by these methods satisfy

Proof. As in the proof of Corollary 4.8, we set φ_k = 1, ρ = 1/2, ϵ = 1/120, δ = 1/7 for the DFP method and φ_k = 0, ρ = 1/2, ϵ = 1/50, δ = 1/7 for the BFGS method in Theorem 5.4. Then, the claims follow.
We can also set the initial Hessian approximation matrix to ∇²f(x_0), as in Theorem 4.9, and achieve the same superlinear convergence rate as long as the distance between the initial point x_0 and the optimal point x* is sufficiently small.

Theorem 5.6. Consider the DFP and BFGS methods and suppose Assumption 5.1 holds. Moreover, suppose for the DFP method, the initial point x_0 and initial Hessian approximation matrix B_0 satisfy

and for the BFGS method, they satisfy

Then, the iterates {x_k}_{k=0}^{+∞} generated by these methods satisfy

Proof. First we focus on the DFP method. Notice that by (102) we have

Hence, the first condition in (99) is satisfied. Set x = x* and y = x_0 in Lemma 5.1, and notice that

Hence, using (88) we obtain that

Multiply the above expression by ∇²f(x*)^{-1/2} from left and right to obtain

The above two inequalities indicate that (105)

Since r_0 ∈ [0, 1), we have that

Hence, (105) can be simplified as

where the second inequality holds due to r_0 ≤ 1/720. Therefore, we can show that

where the first inequality is true since ‖A‖_F ≤ √d ‖A‖ for any matrix A ∈ R^{d×d}, and the last inequality is due to the first part of (102). Hence, the second condition in (99) is also satisfied. By Corollary 5.5, we conclude that (104) holds. The proof for BFGS is similar: it can be derived by following the steps of the proof for DFP and exploiting the BFGS results in Corollary 5.5.
In summary, we established the local convergence rate of the convex Broyden class of quasi-Newton methods for self-concordant functions. We showed that if the initial distance to the optimal solution and the initial Hessian approximation error are sufficiently small, the iterates converge to the optimal solution at a superlinear rate of (1/k)^{k/2}. Moreover, by setting B_0 = ∇²f(x_0), we can achieve the same superlinear rate if the initial weighted error ‖∇²f(x*)^{1/2}(x_0 − x*)‖ is sufficiently small.

Discussion
In this section, we discuss the strengths and shortcomings of our theoretical results and compare them with the concurrent papers [36, 37] on the non-asymptotic superlinear convergence of DFP and BFGS.

Initial Hessian approximation condition. Note that in our main theoretical results, in addition to requiring the initial iterate x_0 to be close to the optimal solution x*, which is a common condition for local convergence results, we also need the initial Hessian approximation B_0 to be close to the Hessian at the optimal solution ∇²f(x*). At first glance this might seem restrictive, but as we have shown in Theorem 4.9 and Theorem 5.6, if we set the initial Hessian approximation to the Hessian at the initial point ∇²f(x_0), this condition is automatically satisfied as long as the initial error ‖x_0 − x*‖ is sufficiently small. From a practical point of view, this approach is reasonable, as quasi-Newton methods and Newton's method outperform first-order methods in a local neighborhood of the optimal solution, while their global linear convergence rate may not be faster than that of first-order methods. Hence, as suggested in [2], to optimize the overall iteration complexity according to theoretical bounds, one might use first-order methods such as Nesterov's accelerated gradient method to reach a local neighborhood of the optimal solution, and then switch to locally fast methods such as quasi-Newton methods. If this procedure is used, our theoretical results show that by setting B_0 = ∇²f(x_0) (and equivalently H_0 = ∇²f(x_0)^{-1}) for the convex Broyden class of quasi-Newton methods, the fast superlinear convergence rate of (1/k)^{k/2} can be obtained.
However, it is worth noting that, in practice, algorithms that do not require switching between methods or knowledge of problem parameters are more favorable. For these reasons, quasi-Newton methods with an Armijo–Wolfe line search are more practical, as they offer an adaptive choice of the step length with global convergence and avoid specifying typically unknown constants such as the Lipschitz constant of the gradient, the Lipschitz constant of the Hessian, and the strong convexity parameter.
We should add that the frameworks in [36, 37] require the initial Hessian approximation to be B_0 = LI, where I is the identity matrix and L is the Lipschitz constant of the gradient. Satisfying this condition is computationally more affordable than our proposed scheme, as it does not require access to the Hessian or its inverse at the initial iterate x_0. However, it still requires a switching scheme: one needs to monitor the error of the iterates and set the Hessian approximation to LI once the error ‖x − x*‖ is sufficiently small. An ideal theoretical guarantee would be compatible with line-search schemes. To be more precise, in both mentioned analyses, we need to monitor the error ‖x − x*‖ and reset the Hessian approximation once the error is small. A more comprehensive analysis would apply to the case where we follow a line-search approach from the very beginning, and would automatically guarantee that, once the iterates reach a local neighborhood of the optimal solution, the Hessian approximation of DFP or BFGS satisfies the required conditions for superlinear convergence without resetting the Hessian approximation matrix. That said, the results in this work and in [36, 37] are first attempts to study the non-asymptotic behavior of quasi-Newton methods, and there is indeed room for improving these results.
Convergence rate-neighborhood trade-off. As mentioned earlier, we observe a trade-off between the radius of the neighborhood in which BFGS and DFP converge superlinearly to the optimal solution and the rate (speed) of superlinear convergence. One important observation is that for specific choices of ϵ, δ and ρ, the rate of convergence can be independent of the problem dimension d, while the neighborhood of convergence depends on d. By selecting different parameters, we could improve the dependency of the neighborhood on d, at the cost of a contraction factor that depends on d. In this case, the contraction factor is not always smaller than 1, and we can only guarantee that after a number of iterations it becomes smaller than 1 and eventually behaves as 1/k. The results in [36, 37] have a similar structure. For instance, in [36], the authors show that when the initial Newton decrement is smaller than a threshold depending on M and L, which is independent of the problem dimension, the convergence rate is of the form (dL/(µk))^{k/2}. Hence, to observe the superlinear convergence rate, one needs to run the BFGS method for at least dL/µ iterations to ensure the contraction factor is smaller than 1. A similar conclusion can be drawn from our results if we adjust the neighborhood. In our main result, we only report the case where the neighborhood depends on d and the rate is independent of it, since in this case the contraction factor is always smaller than 1 and the superlinear behavior starts from the first iteration.
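As a quick sanity check on this point (with hypothetical values of d, L and µ, chosen only for illustration), the dimension-dependent contraction factor (dL/(µk))^{k/2} from [36] indeed only drops below 1 once k exceeds dL/µ:

```python
# Hypothetical problem parameters: d*L/mu = 100, so superlinear behavior
# of the factor (d*L/(mu*k))**(k/2) should only appear for k > 100.
d, L, mu = 10, 1.0, 0.1
for k in [50, 100, 200]:
    factor = (d * L / (mu * k)) ** (k / 2)
    print(f"k={k:3d}  (dL/(mu*k))^(k/2) = {factor:.3e}")
```

For k = 50 the factor is huge, at k = 100 it equals 1, and at k = 200 it is vanishingly small, matching the discussion above.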

Numerical Experiments
In this section, we present our numerical experiments and compare the non-asymptotic performance of quasi-Newton methods with Newton's method and the gradient descent algorithm. We further investigate whether the convergence rates of quasi-Newton methods are consistent with our theoretical guarantees. In particular, we solve the following logistic regression problem with ℓ2 regularization:

min_{x ∈ R^d} f(x) := (1/N) ∑_{i=1}^N ln(1 + exp(−y_i x^T z_i)) + (µ/2) ‖x‖². (106)

We assume that {z_i}_{i=1}^N are the data points and {y_i}_{i=1}^N are their corresponding labels, where z_i ∈ R^d and y_i ∈ {−1, 1} for 1 ≤ i ≤ N. Note that the function f(x) in (106) is strongly convex with parameter µ > 0. We normalize all data points such that ‖z_i‖ = 1 for all 1 ≤ i ≤ N; therefore, the gradient of the function f(x) is Lipschitz continuous with parameter L = 1 + µ. It is also well known that the logistic regression objective function is self-concordant and its Hessian is Lipschitz continuous. In summary, the objective function f(x) defined in (106) satisfies Assumptions 3.1, 3.2 and Assumption 5.1.
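For reference, the first- and second-order oracle for (106) can be written in a few lines. This is a minimal NumPy sketch (function name and interface are ours, not the authors' code), assuming rows of Z are the normalized data points:

```python
import numpy as np

def logistic_oracle(x, Z, y, mu):
    """Loss, gradient, and Hessian of l2-regularized logistic regression:
    f(x) = (1/N) * sum_i log(1 + exp(-y_i * z_i^T x)) + (mu/2) * ||x||^2."""
    N, d = Z.shape
    m = y * (Z @ x)                       # margins y_i * z_i^T x
    loss = np.mean(np.log1p(np.exp(-m))) + 0.5 * mu * (x @ x)
    p = 1.0 / (1.0 + np.exp(m))           # sigmoid(-m_i)
    grad = -(Z.T @ (y * p)) / N + mu * x
    w = p * (1.0 - p)                     # per-sample Hessian weights
    hess = (Z.T * w) @ Z / N + mu * np.eye(d)
    return loss, grad, hess
```

The µI term in the Hessian reflects the µ-strong convexity of (106), which is what guarantees s_k^T y_k > 0 for the quasi-Newton updates.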
We conduct our experiments on four different datasets: (i) the Colon-cancer dataset [40], (ii) the Covertype dataset [41], (iii) the GISETTE handwritten digit classification dataset from the NIPS 2003 feature selection challenge [42], and (iv) the MNIST dataset of handwritten digits [43]. We compare the performance of DFP, BFGS, Newton's method, and gradient descent. We initialize all the algorithms with the same initial point x_0 = c · 1, where c > 0 is a tuned parameter and 1 ∈ R^d is the all-ones vector. We set the initial Hessian inverse approximation matrix to ∇²f(x_0)^{-1} for the DFP and BFGS methods. The step size is 1 for DFP, BFGS, and Newton's method. The step size of the gradient descent method is tuned by hand to achieve the best performance on each dataset.
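The experimental setup described above (unit step size, H_0 = ∇²f(x_0)^{-1}) can be sketched as a short BFGS driver. This is our illustrative implementation, not the authors' code; it uses the standard inverse-Hessian BFGS update, and the gradient/Hessian callables are assumptions of the sketch:

```python
import numpy as np

def bfgs_unit_step(grad, hess, x0, iters=50, tol=1e-12):
    """BFGS with step size 1 and H0 = inv(Hessian at x0), as in the experiments."""
    x = np.asarray(x0, dtype=float).copy()
    H = np.linalg.inv(hess(x))            # H0 = grad^2 f(x0)^{-1}
    g = grad(x)
    I = np.eye(len(x))
    for _ in range(iters):
        s = -H @ g                        # unit step: x_{k+1} = x_k - H_k grad f(x_k)
        x_new = x + s
        g_new = grad(x_new)
        if np.linalg.norm(g_new) < tol:   # stop before the update degenerates (y^T s -> 0)
            return x_new
        y = g_new - g
        rho = 1.0 / (y @ s)
        # standard inverse-Hessian BFGS update
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```

On a quadratic, this initialization makes the very first step exact, which is the extreme case of the "small initial Hessian approximation error" condition.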
All the parameters (sample size N, dimension d, initial point parameter c, and regularization µ) of these datasets are provided in Table 1. Notice that the initial point parameter c is selected from the set A = {0.001, 0.01, 0.1, 1, 10} to guarantee that the initial point x_0 is close enough to the optimal solution x* so that we achieve the superlinear convergence rate of DFP and BFGS on each dataset. The regularization parameter µ is also chosen from the same set A to obtain the best performance on each dataset.
From the theoretical results of Section 4.3 and Section 5, we expect the iterates {x_k}_{k=0}^∞ generated by the DFP method and the BFGS method to satisfy the superlinear convergence rate established above. Hence, in our numerical experiments, we compare the observed convergence rates against these bounds to check the tightness of our theoretical guarantees. Our numerical experiments are shown in Figures 1, 2, 3 and 4 for the different datasets. Note that for each problem, we present two plots. The left plot (plot (a)) showcases the errors of the different algorithms as well as our theoretical bound of (1/k)^{k/2}. We observe that the errors for the DFP and BFGS methods are bounded above by this rate, and the function value errors for the DFP and BFGS methods are bounded above by (1/k)^k. Therefore, these experimental results confirm our theoretical superlinear convergence rates for quasi-Newton methods.

Fig. 1 Convergence rates of logistic regression on the Colon-cancer dataset.
Fig. 3 Convergence rates of logistic regression on the GISETTE dataset.
Fig. 4 Convergence rates of logistic regression on the MNIST dataset.

Conclusion
In this paper, we studied the local convergence rate of the convex Broyden class of quasi-Newton methods, which includes the DFP and BFGS methods. We focused on two settings: (i) the objective function is µ-strongly convex, its gradient is L-Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution with parameter M; (ii) the objective function is self-concordant. For these two settings, we characterized an explicit non-asymptotic superlinear convergence rate for the convex Broyden class of quasi-Newton methods. In particular, for the first setting, we showed that if the initial distance to the optimal solution and the initial Hessian approximation error are sufficiently small, the iterates generated by the DFP and BFGS methods converge to the optimal solution at a superlinear rate of (1/k)^{k/2}. We further showed that we can achieve the same superlinear convergence rate if the initial error ‖∇²f(x*)^{1/2}(x_0 − x*)‖ is sufficiently small and the initial Hessian approximation matrix is B_0 = ∇²f(x_0). We proved similar convergence rate results for the second setting, where the objective function is self-concordant.

Notice that
Based on the definition of max{r_k, r_{k+1}}.
(115) Substitute (114) and (115) into (113) and recall the definition in (15). Hence, the proof of the first claim in (20) is complete. By using the Cauchy–Schwarz inequality and (20), we can write

Therefore, we obtain that

and the second claim follows.

F Proof of Lemma 5.2

Set x = x* and y = x_k in Lemma 5.1 and note that r_k ≤ 1/2 < 1. By (89), we have

Multiply the above expressions from left and right by ∇²f(x*)^{-1/2} to obtain

Using the fact that r_k ≤ 1/2, we have

Replace the lower and upper bounds in (116) with the ones in (117) and (118), respectively, to obtain the result in (90).

G Proof of Lemma 5.3
We first show that for x = x* and y = x_k + α(x_{k+1} − x_k), where α ∈ [0, 1], the value of r = ‖∇²f(x)^{1/2}(y − x)‖ defined in Lemma 5.1 is less than 1. To do so, note that

r = ‖∇²f(x*)^{1/2}((1 − α)(x_k − x*) + α(x_{k+1} − x*))‖ ≤ (1 − α) r_k + α r_{k+1} ≤ θ_k,

where r_k = ‖∇²f(x*)^{1/2}(x_k − x*)‖. Note that in the above simplification we used the assumption that θ_k ≤ 1/2, so r < 1. Now, using the result in (88), we have

Moreover, since r ≤ θ_k ∈ [0, 1), we can write

By integrating the above inequality over α from 0 to 1, we get

where we used the definition G_k := ∫_0^1 ∇²f(x_k + α(x_{k+1} − x_k)) dα. Multiplying the above expression from left and right by ∇²f(x*)^{-1/2} leads to

where Ĝ_k = ∇²f(x*)^{-1/2} G_k ∇²f(x*)^{-1/2}. The above inequality is equivalent to

Since θ_k ∈ [0, 1), we have that

Hence, (119) can be simplified as

where the second inequality holds due to θ_k ≤ 1/2. Considering the definition G_k := ∫_0^1 ∇²f(x_k + α(x_{k+1} − x_k)) dα, we have y_k = G_k s_k. Using this observation, we have

where the last inequality holds due to (120). Hence, the proof of the first claim in (91) is complete.
By using the Cauchy-Schwarz inequality and (91), we can write Therefore, we obtain that Thus, the last claim in (94) holds.

H Proof of Theorem 5.4
The proof of Theorem 5.4 is very similar to the proof of Theorem 4.5. The only difference is that we utilize Lemma 5.2 and Lemma 5.3, with the terms (1 − 6θ_k)² + (3 + 6θ_k)/(1 − 6θ_k). We also have that

The proof of the above conclusion is the same as the proofs we presented in Lemmas 4.1, 4.2, and 4.3, except that we use the results of Lemma 5.3 instead of Lemma 3.6. Then, we establish linear convergence results similar to Lemma 4.4. Suppose that the objective function f satisfies the conditions in Assumption 5.1. Moreover, suppose the initial point x_0 and initial Hessian approximation matrix B_0 satisfy

where ϵ, δ ∈ (0, 1/2) are such that, for some ρ ∈ (0, 1),

We apply the same induction technique used in the proof of Lemma 4.4 to prove these linear convergence results, utilizing the potential function in (122) and Lemma 5.3. Finally, we can prove the superlinear convergence results

where q = max_{k≥0} ... (1 − 2δ). This proof is based on the linear convergence results in (125), (126) and (127) and is the same as the proof of Theorem 4.5, except that here we replace the results of Lemma 3.5 by the results of Lemma 5.2, substitute the results of Lemma 3.6 with the results of Lemma 5.3, and utilize the intermediate inequality (121) instead of (45). Notice that all the terms ϵ/2 have been replaced with ϵ/3, since in this setting we use the term 2r_t instead of the term σ_t/2, and 2r_t ≤ 2r_0 ≤ 2ϵ/6 = ϵ/3.

Theorem 4.5. Consider the convex Broyden class of quasi-Newton methods described in Algorithm 1. Suppose the objective function f satisfies the conditions in Assumptions 3.1 and 3.2. Moreover, suppose the initial point x_0 and initial Hessian approximation matrix B_0 satisfy (M/µ^{3/2})

Corollary 4.8. Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.2 hold. Moreover, suppose the initial point x_0 and initial Hessian approximation matrix B_0 of DFP satisfy (M/µ^{3/2})

Theorem 4.9. Consider the DFP and BFGS methods and suppose Assumptions 3.1 and 3.2 hold. Moreover, for DFP, suppose the initial point x_0 and initial Hessian approximation B_0 satisfy (M/µ^{3/2})

Fig. 2 Convergence rates of logistic regression on the Covertype dataset.

Algorithm 1 The convex Broyden class of quasi-Newton methods
Require: Initial iterate x_0 and initial Hessian inverse approximation H_0.
1: for k = 0, 1, 2, ... do
Update the variable: x_{k+1} = x_k − H_k ∇f(x_k);

Based on the upper bound on ‖Q‖_F, we derive an upper bound on ‖E‖_F. Note that to establish an upper bound on ‖B_+ − I‖_F, we find upper bounds on ‖D‖²_F and ‖E‖²_F. Using the fact that ‖D‖²_F = Tr(D^T D) and properties of the trace operator, we can show that

Moreover, we can show that B_+ − I is given by B_+ − I = B − I +

= Tr((B − I)²) − Tr(B s s^T B (B − I) + (B − I) B s s^T B)

Using the fact that Tr(a b^T) = a^T b for any a, b ∈ R^d, we can write the following simplifications:

Moreover, since the condition in (50) holds for 0 ≤ k ≤ t, we know that ‖B̂_k − I‖_F ≤ 2δ for 0 ≤ k ≤ t. Hence, by (46) in Lemma 4.3, we obtain that B̂_{k+1}

Table 1: Sample size N, dimension d, initial point parameter c and regularization µ of each dataset.