On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions

We consider the gradient (or steepest) descent method with exact line search applied to a strongly convex function with Lipschitz continuous gradient. We establish the exact worst-case rate of convergence of this scheme, and show that this worst-case behavior is exhibited by a certain convex quadratic function. We also give the tight worst-case complexity bound for a noisy variant of the gradient descent method, where exact line search is performed in a search direction that differs from the negative gradient by at most a prescribed relative tolerance. The proofs are computer-assisted, and rely on solving semidefinite programming performance estimation problems, as introduced in the paper [Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1-2):451-482, 2014].


Introduction
The gradient (or steepest) descent method was devised by Augustin-Louis Cauchy (1789-1857) in the 19th century, and remains one of the most iconic algorithms for unconstrained optimization. Indeed, it is usually the first algorithm that is taught during introductory courses on nonlinear optimization. It is therefore somewhat surprising that the worst-case convergence rate of the method is not yet precisely understood for smooth strongly convex functions.
In this paper, we settle the worst-case convergence rate question of the gradient descent method with exact line search for strongly convex, continuously differentiable functions $f$ with Lipschitz continuous gradient. Formally, we consider the following function class.
Definition 1.1. A continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $L$-smooth, $\mu$-strongly convex with parameters $L > 0$ and $\mu > 0$ if
1. $x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is a convex function on $\mathbb{R}^n$, where the norm is the Euclidean norm;
2. $\|\nabla f(x + \Delta x) - \nabla f(x)\| \le L \|\Delta x\|$ holds for all $x \in \mathbb{R}^n$ and $\Delta x \in \mathbb{R}^n$.
The class of L-smooth, µ-strongly convex functions on R n will be denoted by F µ,L (R n ).
Note that, if $f$ is twice continuously differentiable, then $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$ is equivalent to
$$L I \succeq \nabla^2 f(x) \succeq \mu I \quad \forall x \in \mathbb{R}^n,$$
where the notation $A \succeq B$ for symmetric matrices $A$ and $B$ means that the matrix $A - B$ is positive semidefinite, and $I$ is the identity matrix. Equivalently, the eigenvalues of the Hessian matrix $\nabla^2 f(x)$ lie in the interval $[\mu, L]$ for all $x$.
The gradient method with exact line search may be described as follows.

Gradient descent method with exact line search
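Input: $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$, $x_0 \in \mathbb{R}^n$.
for $i = 0, 1, \ldots$
    $\gamma = \operatorname{argmin}_{\gamma \in \mathbb{R}} f(x_i - \gamma \nabla f(x_i))$
    $x_{i+1} = x_i - \gamma \nabla f(x_i)$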
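The scheme can be sketched numerically as follows. This is only an illustrative sketch under stated assumptions: the test function, the bisection-based line search and all function names are hypothetical choices made here, not part of the paper. The exact step is found by locating the root of the directional derivative along the negative gradient.

```python
import math

# Hypothetical test function (an assumption for illustration):
#   f(x) = sum_i [ a_i/2 * x_i^2 + log(cosh(x_i)) ]
# Its Hessian is diagonal with entries in (a_i, a_i + 1], so f is
# mu-strongly convex and L-smooth with mu = min(a) and L = max(a) + 1.
# The minimizer is x* = 0 with f* = 0.

def f(x, a):
    return sum(0.5 * ai * xi * xi + math.log(math.cosh(xi)) for ai, xi in zip(a, x))

def grad(x, a):
    return [ai * xi + math.tanh(xi) for ai, xi in zip(a, x)]

def exact_step(x, g, a, mu, iters=200):
    """Solve phi'(t) = 0 for phi(t) = f(x - t*g) by bisection.

    phi' is increasing, phi'(0) = -||g||^2 < 0, and strong convexity
    gives phi'(1/mu) >= 0, so the root lies in [0, 1/mu].
    """
    lo, hi = 0.0, 1.0 / mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        trial = [xi - mid * gi for xi, gi in zip(x, g)]
        slope = -sum(gi * ti for gi, ti in zip(g, grad(trial, a)))
        if slope < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gradient_descent_exact_line_search(x0, a, mu, n_iter):
    iterates = [list(x0)]
    for _ in range(n_iter):
        x = iterates[-1]
        g = grad(x, a)
        if all(gi == 0.0 for gi in g):
            break  # already at the minimizer
        t = exact_step(x, g, a, mu)
        iterates.append([xi - t * gi for xi, gi in zip(x, g)])
    return iterates

a = [1.0, 3.0, 9.0]
mu, L = min(a), max(a) + 1.0
iterates = gradient_descent_exact_line_search([1.0, -2.0, 0.5], a, mu, 10)
values = [f(x, a) for x in iterates]
print(values)  # objective values along the iterates
```

Bisection is used here only because it needs nothing beyond the gradient; any one-dimensional minimizer could play the same role.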
Our main result may now be stated concisely.
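Theorem 1.2. Let $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$, $x_*$ a global minimizer of $f$ on $\mathbb{R}^n$, and $f_* = f(x_*)$. Each iteration of the gradient method with exact line search satisfies
$$f(x_{i+1}) - f_* \le \left(\frac{L-\mu}{L+\mu}\right)^2 \left(f(x_i) - f_*\right), \quad i = 0, 1, \ldots \tag{1}$$
Note that the result in Theorem 1.2, which establishes a global linear convergence rate on objective function accuracy, is known for the case of quadratic functions in $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$, that is, for functions of the form $f(x) = \frac{1}{2} x^T Q x + c^T x$, where $Q$ is a symmetric matrix with eigenvalues in $[\mu, L]$ and $c \in \mathbb{R}^n$.
Example 1.3. Consider the quadratic function
$$f(x) = \frac{1}{2} \sum_{i=1}^n \lambda_i x_i^2, \quad \text{where } 0 < \mu = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n = L,$$
and the starting point
$$x_0 = \left(\frac{1}{\mu}, 0, \ldots, 0, \frac{1}{L}\right)^T.$$
One may readily check that the gradient at $x_0$ is equal to $\nabla f(x_0) = (1, 0, \ldots, 0, 1)^T$ and that the minimum of the line search from $x_0$ in that direction is attained for the step $\gamma = \frac{2}{L+\mu}$. One therefore obtains
$$x_1 = \frac{L-\mu}{L+\mu} \left(\frac{1}{\mu}, 0, \ldots, 0, -\frac{1}{L}\right)^T$$
and, for all $i = 0, 1, \ldots$,
$$x_{2i} = \left(\frac{L-\mu}{L+\mu}\right)^{2i} x_0, \qquad x_{2i+1} = \left(\frac{L-\mu}{L+\mu}\right)^{2i} x_1.$$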
Since $f_* = 0$, it is straightforward to verify that equality holds as required.
Figure 1: Illustration of Example 1.3 for the case n = 2 (small arrows indicate direction of negative gradient).
The construction in Example 1.3 is illustrated in Figure 1 in the case n = 2, where the ellipses shown are level curves of the objective function. Each step from x i to x i+1 is orthogonal to the ellipse at x i (since it uses the steepest descent direction) and tangent to the ellipse at x i+1 (because of the exact line-search direction), hence successive steps are orthogonal to each other.
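The worst-case behavior of Example 1.3 can be checked numerically. The sketch below is only an illustration under assumptions: it takes a diagonal quadratic with extreme eigenvalues $\mu$ and $L$ and the starting point $(1/\mu, 0, 1/L)$, chosen so that the gradient at $x_0$ is $(1, 0, 1)^T$ as stated in the example; the middle eigenvalue and all function names are hypothetical. For a quadratic, the exact line search step has the closed form $\gamma = g^T g / g^T Q g$.

```python
def f_quad(x, lam):
    """f(x) = 0.5 * sum_i lam_i * x_i^2, minimized at x* = 0 with f* = 0."""
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def exact_line_search_step(x, lam):
    """One steepest descent step with the closed-form exact step size."""
    g = [l * xi for l, xi in zip(lam, x)]
    gamma = sum(gi * gi for gi in g) / sum(l * gi * gi for l, gi in zip(lam, g))
    return [xi - gamma * gi for xi, gi in zip(x, g)], gamma

mu, L = 1.0, 10.0
lam = [mu, 4.0, L]               # assumed spectrum; only mu and L matter here
x0 = [1.0 / mu, 0.0, 1.0 / L]    # gradient at x0 is (1, 0, 1)

x1, gamma0 = exact_line_search_step(x0, lam)
x2, gamma1 = exact_line_search_step(x1, lam)

ratio = f_quad(x1, lam) / f_quad(x0, lam)
rho_sq = ((L - mu) / (L + mu)) ** 2

# Successive steps should be orthogonal (the zigzag of Figure 1).
step0 = [b - a for a, b in zip(x0, x1)]
step1 = [b - a for a, b in zip(x1, x2)]
steps_dot = sum(u * v for u, v in zip(step0, step1))

print(gamma0, 2.0 / (L + mu))    # step size versus 2/(L+mu)
print(ratio, rho_sq)             # one-step decrease versus ((L-mu)/(L+mu))^2
print(steps_dot)                 # inner product of successive steps
```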
As an immediate consequence of Theorem 1.2 and Example 1.3, one has the following tight bound on the number of steps needed to obtain $\epsilon$-relative accuracy on the objective function for a given $\epsilon > 0$.
Corollary 1.4. Given $\epsilon > 0$, the gradient method with exact line search yields a solution with relative accuracy $\epsilon$ for any function $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$ after at most
$$N = \left\lceil \frac{1}{2}\, \frac{\log(1/\epsilon)}{\log\left(\frac{L+\mu}{L-\mu}\right)} \right\rceil$$
iterations, i.e. $f(x_N) - f_* \le \epsilon \left(f(x_0) - f_*\right)$, where $x_0$ is the starting point. Moreover, this iteration bound is tight for the quadratic function defined in Example 1.3.
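The iteration count can be obtained by unrolling the one-step contraction of Theorem 1.2. The sketch below assumes that relative accuracy means $f(x_N) - f_* \le \epsilon (f(x_0) - f_*)$ and that each step contracts the objective gap by at most $((L-\mu)/(L+\mu))^2$; the function name is hypothetical.

```python
import math

def iterations_needed(mu, L, eps):
    """Smallest N with ((L - mu)/(L + mu))^(2N) <= eps, assuming 0 < mu < L."""
    if eps >= 1.0:
        return 0  # the starting point already has relative accuracy eps
    rho_sq = ((L - mu) / (L + mu)) ** 2
    return math.ceil(math.log(eps) / math.log(rho_sq))

mu, L, eps = 1.0, 10.0, 1e-6
N = iterations_needed(mu, L, eps)
rho_sq = ((L - mu) / (L + mu)) ** 2

# On the worst-case quadratic the gap shrinks by exactly rho_sq per step,
# so N steps suffice and N - 1 steps do not.
print(N, rho_sq ** N <= eps, rho_sq ** (N - 1) > eps)
```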
For non-quadratic functions in $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$, only bounds weaker than (1) are known. For example, in [3, p. 240], the following bound is shown:
In [8, Theorem 3.4] a stronger result than Theorem 1.2 was claimed, but this was retracted in a subsequent erratum, and only an asymptotic result is claimed in the erratum.
A result related to Theorem 1.2 is given in [5], where an Armijo-rule line search is used instead of exact line search. An explicit rate in the strongly convex case is given there in Proposition 3.3.5 on page 53 (the method is defined in (3.1.2) on page 44). More general upper bounds on the convergence rates of gradient-type methods for convex functions may be found in the books [6,7]. We mention one more particular result by Nesterov [7] that is similar to our main result in Theorem 1.2, but that uses a fixed step length and relies on the initial distance to the solution.
Theorem 1.5 (see [7]). Given $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$ and $x_0 \in \mathbb{R}^n$, the gradient descent method with fixed step length $\gamma = \frac{2}{\mu+L}$ generates iterates $x_i$ ($i = 0, 1, 2, \ldots$) that satisfy

Background results
In this section we collect some known results on strongly convex functions and on the gradient method. We will need these results in the proof of our main result, Theorem 1.2.

Properties of the gradient method with exact line search
Let $x_i$ ($i = 1, 2, \ldots, N$) be the iterates produced by the gradient method with exact line search started at $x_0$. Those iterates are defined by the following two conditions for $i = 0, 1, \ldots, N-1$:
$$x_{i+1} - x_i + \gamma \nabla f(x_i) = 0 \quad \text{for some } \gamma \ge 0, \tag{2}$$
$$\nabla f(x_{i+1})^T (x_{i+1} - x_i) = 0, \tag{3}$$
where the first condition (2) states that we move in the direction of the negative gradient, and the second condition (3) expresses the exact line search condition.
A consequence of those conditions is that successive gradients are orthogonal, i.e.
$$\nabla f(x_{i+1})^T \nabla f(x_i) = 0, \quad i = 0, 1, \ldots, N-1. \tag{4}$$
Instead of relying on conditions (2)-(3) that define the iterates of the gradient method with exact line search, our analysis will be based on the weaker conditions (3)-(4), which are also satisfied by other sequences of iterates.
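For completeness, the orthogonality of successive gradients follows in one line. The derivation below is a sketch that assumes condition (2) reads "the step is a nonnegative multiple $\gamma$ of the negative gradient" and condition (3) reads "the new gradient is orthogonal to the step", as described above.

```latex
\begin{align*}
0 &= \nabla f(x_{i+1})^T (x_{i+1} - x_i)
     && \text{by (3)} \\
  &= -\gamma \, \nabla f(x_{i+1})^T \nabla f(x_i)
     && \text{by (2), since } x_{i+1} - x_i = -\gamma \nabla f(x_i).
\end{align*}
% If \gamma > 0, dividing by \gamma gives (4).
% If \gamma = 0, then t = 0 minimizes \varphi(t) = f(x_i - t \nabla f(x_i)),
% so \varphi'(0) = -\|\nabla f(x_i)\|^2 = 0, hence \nabla f(x_i) = 0
% and (4) holds trivially.
```

The converse does not hold in general: a pair of points may satisfy (3) and (4) without the step being parallel to the gradient, which is why (3)-(4) describe a larger family of methods.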

Interpolation with functions in $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$
We now consider the following interpolation problem over the class of functions F µ,L (R n ).

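Definition 2.1. Consider an integer $N \ge 1$ and given data $\{(x_i, g_i, f_i)\}_{i \in \{0, 1, \ldots, N\}}$, where $x_i \in \mathbb{R}^n$, $g_i \in \mathbb{R}^n$ and $f_i \in \mathbb{R}$. If there exists a function $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$ such that $f(x_i) = f_i$ and $\nabla f(x_i) = g_i$ for all $i \in \{0, 1, \ldots, N\}$, then the data is called $\mathcal{F}_{\mu,L}$-interpolable.
A necessary and sufficient condition for $\mathcal{F}_{\mu,L}$-interpolability is given in the next theorem, taken from [11].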

interpolable if and only if the following inequality
In principle, Theorem 2.2 allows one to generate all possible valid inequalities that hold for functions in $\mathcal{F}_{\mu,L}(\mathbb{R}^n)$ in terms of their function values and gradients at a set of points $x_0, \ldots, x_N$. This will be essential for the proof of our main result, Theorem 1.2.
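To make the role of such pairwise inequalities concrete, the sketch below evaluates an interpolation condition on sampled data. It assumes that the condition of Theorem 2.2 takes the form $f_i \ge f_j + g_j^T (x_i - x_j) + \frac{1}{2(1 - \mu/L)} \left( \frac{1}{L}\|g_i - g_j\|^2 + \mu \|x_i - x_j\|^2 - 2\frac{\mu}{L} (g_j - g_i)^T (x_j - x_i) \right)$ for all pairs $i, j$; the formula is not reproduced in the text above, so this form is an assumption to be checked against [11]. The test function and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def interpolation_gap(pi, pj, mu, L):
    """Slack of the assumed pairwise condition; nonnegative when it holds."""
    (xi, gi, fi), (xj, gj, fj) = pi, pj
    dx = [a - b for a, b in zip(xi, xj)]
    dg = [a - b for a, b in zip(gi, gj)]
    quad = (dot(dg, dg) / L + mu * dot(dx, dx)
            - 2.0 * mu / L * dot(dg, dx)) / (2.0 * (1.0 - mu / L))
    return fi - fj - dot(gj, dx) - quad

def sample(x, a):
    """(x, grad f(x), f(x)) for f(x) = sum_i a_i/2 x_i^2 + log cosh(x_i)."""
    g = [ai * xi + math.tanh(xi) for ai, xi in zip(a, x)]
    fx = sum(0.5 * ai * xi * xi + math.log(math.cosh(xi)) for ai, xi in zip(a, x))
    return (x, g, fx)

# Hessian entries lie in (a_i, a_i + 1], so mu = 1 and L = 10 are valid.
a, mu, L = [1.0, 9.0], 1.0, 10.0
points = [sample(p, a) for p in ([0.0, 0.0], [1.0, -0.5], [-2.0, 0.3], [0.4, 1.5])]
gaps = [interpolation_gap(p, q, mu, L) for p in points for q in points if p is not q]

# Data from 10 * x^2, whose curvature 20 exceeds L = 10, should violate it.
bad = [([t], [20.0 * t], 10.0 * t * t) for t in (0.0, 1.0)]
bad_gap = interpolation_gap(bad[0], bad[1], mu, L)

print(min(gaps), bad_gap)
```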

A performance estimation problem
The proof technique we will use for Theorem 1.2 is inspired by recent work on the so-called performance estimation problem, as introduced in [2] and further developed in [11]. The idea is to formulate the computation of the worst-case behavior of certain iterative methods as an explicit semidefinite programming (SDP) problem. We first recall the definition of SDP problems (in a form that is suitable to our purposes).

Semidefinite programs
We will consider semidefinite programs (SDPs) of the form
where $S^n$ is the set of symmetric matrices of size $n$, and matrices $A_k = (a^{(k)}_{ij}) \in S^n$ and the matrix $C = (c_{ij}) \in S^n$ are given, as well as the scalars $b_k$ and vectors $a_k \in \mathbb{R}^\ell$ ($k = 1, \ldots, m$), and $c \in \mathbb{R}^\ell$.
Since every positive semidefinite matrix $X \in S^n$ is a Gram matrix, there exist vectors $v_1, \ldots, v_n \in \mathbb{R}^n$ such that $X_{ij} = v_i^T v_j$ for all $i, j \in \{1, \ldots, n\}$. Thus the SDP problem (5) may be equivalently rewritten as
which features terms that are linear in the inner products $v_i^T v_j$ in the objective function and constraints. The associated dual SDP problem is
We will later use the fact that each dual variable $y_k$ may be viewed as a (Lagrange) multiplier of the primal constraint
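The Gram matrix correspondence can be illustrated on a small instance. The sketch below is an assumption-laden illustration: it handles only the positive definite case through a Cholesky factorization (a merely semidefinite matrix would need, for example, an eigenvalue decomposition), and the example matrix and function names are hypothetical.

```python
import math

def cholesky(X):
    """Lower-triangular C with C * C^T = X, for symmetric positive definite X."""
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = X[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C

X = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 3.0],
     [0.0, 3.0, 10.0]]

# The rows of the Cholesky factor serve as the vectors v_1, ..., v_n.
V = cholesky(X)
gram = [[sum(a * b for a, b in zip(V[i], V[j])) for j in range(3)] for i in range(3)]

# A linear function of X is therefore linear in the inner products v_i^T v_j.
C_obj = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 0.5]]
via_X = sum(C_obj[i][j] * X[i][j] for i in range(3) for j in range(3))
via_V = sum(C_obj[i][j] * gram[i][j] for i in range(3) for j in range(3))

print(gram)
print(via_X, via_V)
```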

Performance estimation of the gradient method with exact line search
Consider the following SDP problem, for fixed parameters $N \ge 1$, $R > 0$, $\mu > 0$ and $L > \mu$:
where the variables are $x_i \in \mathbb{R}^n$, $f_i \in \mathbb{R}$ and $g_i \in \mathbb{R}^n$ ($i \in \{*, 0, 1, \ldots, N\}$). Note that this is indeed an SDP problem of the form (6), with dual problem of the form (7), since equalities and interpolability conditions are linear in the inner products of variables $x_i$ and $g_i$.
Proof. Fix any $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$, and let $x_0, \ldots, x_N$ be the iterates of the gradient method with exact line search applied to $f$. Now a feasible solution to the SDP problem is given by
The objective function value at this feasible point is $f_N = f(x_N)$, so that the optimal value of the SDP is an upper bound on $f(x_N) - f_*$.
We are now ready to give a proof of our main result. We already mention that the SDP relaxation (8) is not used directly in the proof, but was used to devise the proof, in a sense that will be explained later.

Proof of Theorem 1.2
A little reflection shows that, to prove Theorem 1.2, we need only consider one iteration of the gradient method with exact line search. Thus we consider only the first iterate, given by x 0 and x 1 , as well as the minimizer x * of f ∈ F µ,L .
Set f i = f (x i ) and g i = ∇f (x i ) for i ∈ { * , 0, 1}. Note that g * = 0. The following five inequalities are now satisfied:

Indeed, the first three inequalities are the $\mathcal{F}_{\mu,L}$-interpolability conditions, the fourth inequality is a relaxation of (4), and the fifth inequality is a relaxation of (3). We aggregate these five inequalities by defining the following positive multipliers,
and adding the five inequalities together after multiplying each one by the corresponding multiplier. The result is the following inequality (as may be verified directly):
Since the last two right-hand-side terms are nonpositive, we obtain:
Since $x_0$ was arbitrary, this completes the proof of Theorem 1.2.

Remarks on the proof of Theorem 1.2.
• First, note that we have proven a bit more than what is stated in Theorem 1.2. Indeed, the result in Theorem 1.2 holds for any iterative method that satisfies the five inequalities used in its proof.
• Although the proof of Theorem 1.2 is easy to verify, it is not apparent how the multipliers $y_1, \ldots, y_5$ in (9) were obtained. This was in fact done via preliminary computations, and subsequently guessing the values in (9), through the following steps:
1. The SDP performance estimation problem (8) with $N = 1$ was solved numerically for various values of the parameters $\mu$, $L$ and $R$ (in fact, the values of $L$ and $R$ can safely be fixed to some positive constants using appropriate scaling arguments; see e.g. [11, Section 3.5] for a related discussion).
2. The optimal values of the dual SDP multipliers of the constraints corresponding to the five inequalities in the proof gave the guesses for the correct values $y_1, \ldots, y_5$ as stated in (9).
3. Finally, the correctness of the guess was verified directly (by symbolic computation and by hand).
• The key inequality (10) may be rewritten in another, more symmetric way where κ = µ/L is the condition number (between 0 and 1) and slack vectors s 1 and s 2 are Note that the four expressions Lµ expressions are invariant under dilation of f , and that cases of equality in (10) simply correspond to equalities s 1 = s 2 = 0.
• It is interesting to note that the known proof of Theorem 1.2 for the quadratic case only requires the so-called Kantorovich inequality, which may be stated as follows.
Theorem 4.1 (Kantorovich inequality; see e.g. Lemma 3.1 in [1]). Let $Q$ be a symmetric positive definite $n \times n$ matrix with smallest and largest eigenvalues $\mu > 0$ and $L \ge \mu$ respectively. Then, for any unit vector $x \in \mathbb{R}^n$, one has:
$$\left(x^T Q x\right)\left(x^T Q^{-1} x\right) \le \frac{(\mu + L)^2}{4 \mu L}.$$
Thus, the inequality (10) replaces the Kantorovich inequality in the proof of Theorem 1.2 for non-quadratic $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$.
• Finally, we note that this proof can be modified very easily to handle the case of the fixed-step gradient method that was mentioned in Theorem 1.5. Indeed, observe that the proof aggregates the fourth and fifth inequalities with multipliers $y_4 = \frac{2}{L+\mu}$ and $y_5 = 1$, which leads to the combined inequality
Now note that the gradient method with fixed step $\gamma = \frac{2}{L+\mu}$ satisfies this combined inequality (since the second factor in the left-hand side becomes zero), and hence the rest of the proof establishes the same rate for this method as for the gradient descent with exact line search.
Finally, note that Example 1.3 also establishes that this rate is tight. Hence we have the relatively surprising fact that, when looking at the worst-case convergence rate of the objective function accuracy, performing exact line-search is not better than using a well-chosen fixed step length.
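The last remark can be illustrated on a two-dimensional quadratic. The sketch below is an illustration under assumptions: it uses $f(x) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$, the starting point $(1/\mu, 1/L)$ as the worst case, and an arbitrary second starting point; all names are hypothetical. It compares one step with the fixed length $2/(L+\mu)$ against one exact line search step.

```python
def f_quad(x, lam):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def fixed_step(x, lam, gamma):
    """One gradient step with a prescribed step length."""
    return [xi - gamma * l * xi for l, xi in zip(lam, x)]

def exact_line_search_step(x, lam):
    """One gradient step with the closed-form exact step for a quadratic."""
    g = [l * xi for l, xi in zip(lam, x)]
    gamma = sum(gi * gi for gi in g) / sum(l * gi * gi for l, gi in zip(lam, g))
    return [xi - gamma * gi for xi, gi in zip(x, g)]

mu, L = 1.0, 10.0
lam = [mu, L]
gamma = 2.0 / (L + mu)
rho_sq = ((L - mu) / (L + mu)) ** 2

results = {}
for name, x0 in {"worst_case": [1.0 / mu, 1.0 / L], "other": [1.0, 1.0]}.items():
    f0 = f_quad(x0, lam)
    ratio_fixed = f_quad(fixed_step(x0, lam, gamma), lam) / f0
    ratio_exact = f_quad(exact_line_search_step(x0, lam), lam) / f0
    results[name] = (ratio_fixed, ratio_exact)

print(rho_sq)
print(results)  # (fixed-step ratio, exact-line-search ratio) per start
```

From any single point, the exact line search step can never do worse than the fixed step, since it minimizes over all step lengths; the two only coincide in the worst case.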

Extension to 'noisy' gradient descent with exact line search
Theorem 1.2 may be generalized to what we will call the noisy gradient descent method with exact line search; see e.g. [1, p. 59], where it is called the gradient descent method with (relative) error. Here the search direction at iteration $i$, say $d_i$, satisfies
$$\| -\nabla f(x_i) - d_i \| \le \varepsilon \|\nabla f(x_i)\|, \tag{11}$$
where $0 \le \varepsilon < 1$ is some given relative tolerance on the deviation from the negative gradient. Note that the algorithm cannot be guaranteed to converge as soon as $\varepsilon \ge 1$, since $d_i = 0$ then becomes feasible. We recover the normal gradient descent algorithm when $\varepsilon = 0$.
In the case of more general values of ε, one can for example satisfy the relative error criterion by imposing a restriction of the type | sin θ| ≤ ε on the angle θ between search direction d i and the current negative gradient −∇f (x i ).
Using a search direction d i that satisfies (11) corresponds, for example, to an implementation of the gradient descent method where each component of −∇f (x i ) is only calculated to a fixed number of significant digits. It is also related to the so-called stochastic gradient descent method that is used in training neural networks; see e.g. [4] and the references therein.
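The link between the angle restriction and the relative error criterion can be made explicit in two dimensions. The sketch below assumes that criterion (11) bounds the norm of the difference between $d_i$ and the negative gradient by $\varepsilon$ times the gradient norm; the gradient vector, the angle and all names are hypothetical. Since exact line search does not depend on the length of the search direction, a rotated direction may be rescaled before the criterion is checked.

```python
import math

def rotate(v, theta):
    """Counterclockwise rotation of a 2D vector by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def relative_error(d, g):
    """||d - (-g)|| / ||g||: relative deviation from the negative gradient."""
    return math.hypot(d[0] + g[0], d[1] + g[1]) / math.hypot(g[0], g[1])

g = [3.0, 4.0]                       # hypothetical gradient
theta = math.radians(20.0)

d_rotated = rotate([-g[0], -g[1]], theta)           # same length as g
d_scaled = [math.cos(theta) * c for c in d_rotated]  # closest point on that ray

err_rotated = relative_error(d_rotated, g)
err_scaled = relative_error(d_scaled, g)

print(err_rotated, 2.0 * math.sin(theta / 2.0))  # deviation of the pure rotation
print(err_scaled, math.sin(theta))               # deviation after rescaling
```

The pure rotation deviates by $2\sin(\theta/2)$, while the rescaled direction along the same line attains $\sin\theta$, which is the smallest relative error available for that angle.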
Select any search direction $d_i$ that satisfies (11);
One may show the following generalization of Theorem 1.2.
Theorem 5.1. Let $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^n)$, $x_*$ a global minimizer of $f$ on $\mathbb{R}^n$, and $f_* = f(x_*)$. Given a relative tolerance $\varepsilon$, each iteration of the noisy gradient descent method with exact line search satisfies
$$f(x_{i+1}) - f_* \le \left(\frac{1-\kappa_\varepsilon}{1+\kappa_\varepsilon}\right)^2 \left(f(x_i) - f_*\right), \quad i = 0, 1, \ldots, \tag{12}$$
where $\kappa_\varepsilon = \frac{\mu}{L}\,\frac{1-\varepsilon}{1+\varepsilon}$.
When $\varepsilon = 0$, the rate becomes $\left(\frac{1-\kappa}{1+\kappa}\right)^2 = \left(\frac{L-\mu}{L+\mu}\right)^2$, which matches exactly Theorem 1.2. The proof of Theorem 5.1 is a straightforward generalization of the proof of Theorem 1.2. The key is again to consider a wider class of iterative methods that satisfies certain inequalities. Here we use the inequalities:
The first four inequalities are the same as before, and the fifth is satisfied by the iterates of the noisy gradient descent with exact line search. Indeed, in the first iteration one has:
(by Cauchy-Schwarz and (11)).
We rewrite the fifth inequality as the equivalent linear matrix inequality:
We first aggregate the first four inequalities in (13) by adding them together after multiplication by the respective multipliers:
where $L_\varepsilon = (1 + \varepsilon)L$, $\mu_\varepsilon = (1 - \varepsilon)\mu$, $\kappa_\varepsilon = \frac{\mu_\varepsilon}{L_\varepsilon}$ and $\rho_\varepsilon = \frac{1-\kappa_\varepsilon}{1+\kappa_\varepsilon}$. Next we define a positive semidefinite matrix multiplier for the linear matrix inequality (14), namely
with $a = \frac{1}{L_\varepsilon+\mu_\varepsilon}$, and add nonnegativity of the inner product between the left-hand side of (14) and the multiplier matrix (15) to the aggregated constraints. It can now be checked that the resulting expression is the following (slight) generalization of (10)
with the appropriate coefficients
and $\alpha_5 = -\frac{L+\mu}{2L\mu}$. This completes the proof.
To conclude this section, the following example, based on the same quadratic function as Example 1.3, shows that our bound (12) for the noisy gradient descent is also tight.
Let $\theta$ be an angle satisfying $0 \le \theta < \frac{\pi}{2}$. Consider the noisy gradient descent method where direction $d_0$ is obtained by performing a clockwise 2D-rotation with angle $\theta$ on the first and last coordinates of the gradient $\nabla f(x_0)$. As mentioned above, this satisfies our definition with relative tolerance $\varepsilon = \sin\theta$. Define now the starting point
Tedious but straightforward computations show that
Moreover, if one chooses $d_1$ by rotating the second gradient $\nabla f(x_1)$ by the same angle $\theta$ in the counterclockwise direction, one obtains
A similar reasoning for the next iterates, alternating clockwise and counterclockwise rotations, shows that
for all $i = 0, 1, \ldots$ and hence we have that equality holds as announced. Figure 2 displays a few iterates, and can be compared to Figure 1.
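Tightness of the noisy bound can also be probed numerically. The sketch below is an illustration under assumptions: it takes the two-dimensional quadratic $f(x) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$, rotates the negative gradient by $\pm\theta$, and scans over starting points rather than using the specific starting point of the example; it further assumes that the bound of Theorem 5.1 is $\left(\frac{1-\kappa_\varepsilon}{1+\kappa_\varepsilon}\right)^2$ with $\kappa_\varepsilon = \frac{\mu(1-\varepsilon)}{L(1+\varepsilon)}$ and $\varepsilon = \sin\theta$. All names are hypothetical.

```python
import math

def noisy_ratio(a, theta, mu, L):
    """f(x1)/f(x0) when grad f(x0) = (cos a, sin a) and the search
    direction is the negative gradient rotated by the angle theta."""
    g = (math.cos(a), math.sin(a))
    x = (g[0] / mu, g[1] / L)  # point whose gradient is g
    c, s = math.cos(theta), math.sin(theta)
    d = (-(c * g[0] - s * g[1]), -(s * g[0] + c * g[1]))
    # Exact line search along d: minimize f(x + t*d) over t.
    t = -(g[0] * d[0] + g[1] * d[1]) / (mu * d[0] ** 2 + L * d[1] ** 2)
    y = (x[0] + t * d[0], x[1] + t * d[1])

    def f(z):
        return 0.5 * (mu * z[0] ** 2 + L * z[1] ** 2)

    return f(y) / f(x)

mu, L, theta = 1.0, 2.0, math.pi / 6
eps = math.sin(theta)
kappa_eps = (mu * (1.0 - eps)) / (L * (1.0 + eps))
bound = ((1.0 - kappa_eps) / (1.0 + kappa_eps)) ** 2

# Scan gradient directions on a grid, with both rotation senses.
n = 20000
worst = max(noisy_ratio(math.pi * k / n, sign * theta, mu, L)
            for k in range(n) for sign in (1.0, -1.0))

print(worst, bound)  # the scanned worst case should approach the bound
```

A grid scan only approximates the worst case from below, so a small gap between the two printed values is expected.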

Concluding remarks
The main results of this paper are the exact convergence rates of the gradient descent method with exact line search and its noisy variant for strongly convex functions with Lipschitz continuous gradients. The computer-assisted technique of proof is also of independent interest, and demonstrates the importance of the SDP performance estimation problems (PEPs) introduced in [2]. Indeed, to obtain our proof of Theorem 5.1, the following SDP PEP was solved numerically for various fixed values of R, µ and L: max f 1 − f * subject to (13) and f 0 − f * ≤ R.
It was observed that, for each set of values, the optimal value of the SDP corresponded exactly to the bound in Theorem 5.1 (actually, for homogeneity reasons, $L$ and $R$ could be fixed and only $\mu$ needed to vary). Based on this, a rigorous proof of Theorem 5.1 could be given by guessing the correct values of the dual SDP multipliers as functions of $\mu$, $L$ and $R$, and then verifying the guess through an explicit computation. We believe this type of computer-assisted proof could prove useful in the analysis of more methods where exact line search is used (see for example [10], which studies conditional gradient methods).
PEPs have been used by now to study worst-case convergence rates of several first-order optimization methods [2,10,11]. This paper differs in an important aspect: the performance estimation problem considered actually characterizes a whole class of methods that contains the method of interest (gradient descent with exact line search) as well as many other methods. This relaxation in principle only provides an upper bound on the worst-case of gradient descent, and it is the fact that Example 1.3 matches this bound that allows us to conclude with a tight result.
The reason we could not solve the performance estimation problem for the gradient descent method itself is that equation (2), which essentially states that the step $x_{i+1} - x_i$ is parallel to the gradient $\nabla f(x_i)$, cannot be formulated as a convex constraint in the SDP formulation. The main obstruction appears to be that requiring that two vectors are parallel is a nonconvex constraint, even when working with their inner products. Instead, our convex formulation enforces that those two vectors are both orthogonal to a third one, the next gradient $\nabla f(x_{i+1})$.