The exact worst-case convergence rate of the gradient method with fixed step lengths for L-smooth functions

In this paper, we study the convergence rate of the gradient (or steepest descent) method with fixed step lengths for finding a stationary point of an $L$-smooth function. We establish a new convergence rate, and show that the bound may be exact in some cases, in particular when all step lengths lie in the interval $(0,1/L]$. In addition, we derive an optimal step length with respect to the new bound.

In this paper, we consider the unconstrained optimization problem

$$\min_{x\in\mathbb{R}^n} f(x), \tag{1}$$

where $f:\mathbb{R}^n\to\mathbb{R}$ is bounded from below, and we let a real number $f^\star$ denote a lower bound of problem (1). In addition, we assume throughout the paper that $f$ has an $L$-Lipschitz gradient, that is,

$$\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|\quad\text{for all } x,y\in\mathbb{R}^n,$$

for some (known) Lipschitz constant $L>0$. Following the notation used by Nesterov [10], we let $C^{1,1}_L(\mathbb{R}^n)$ denote the class of functions with an $L$-Lipschitz gradient. Problem (1) arises naturally in many applications, including machine learning and signal and image processing, to name but a few [1,9]. One of the historic solution methods for problem (1) is the gradient method, proposed by Cauchy in 1847 [4].
The gradient method with fixed step lengths may be described as follows.

Algorithm 1 Gradient method with fixed step lengths
Set $N$ and the step lengths $\{t_k\}_{k=1}^N$, and pick $x_1\in\mathbb{R}^n$. For $k=1,2,\ldots,N$ perform the following step:
1. $x_{k+1}=x_k-t_k\nabla f(x_k)$.
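As a minimal illustration (not part of the paper), Algorithm 1 can be rendered in a few lines of Python; the objective below, $f(x)=\tfrac12\|x\|^2$ with $L=1$, and the constant step length are arbitrary choices for the example.

```python
import numpy as np

def gradient_method(grad, x1, steps):
    """Run the fixed-step gradient method: x_{k+1} = x_k - t_k * grad(x_k).

    Returns the iterates x_1, ..., x_{N+1}, where N = len(steps).
    """
    xs = [np.asarray(x1, dtype=float)]
    for t_k in steps:
        xs.append(xs[-1] - t_k * grad(xs[-1]))
    return xs

# Example: f(x) = 0.5 * ||x||^2, so grad(x) = x and L = 1; step length 0.5.
iterates = gradient_method(lambda x: x, x1=[4.0, -2.0], steps=[0.5] * 5)
```

With these choices each step halves the iterate, so the method contracts toward the minimizer at the origin.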
Recently, semidefinite programming performance estimation has been employed as a tool for the worst-case analysis of first-order methods [5,6,8,12,14]. In this method, the worst-case convergence rate is cast as a quadratic program with quadratic constraints, and the problem is then solved by semidefinite programming methods. By employing the performance estimation method, Taylor [11, page 190], without giving a proof, states the convergence rate (3). It can be observed that when the step lengths are the same for each iteration and tend to $\frac1L$, the bound (4) reduces to Taylor's convergence rate. In this paper, we investigate the convergence rate of Algorithm 1 further. By using the performance estimation method, we provide a convergence rate which is tighter than all the aforementioned bounds. For example, as part of our main result in Theorem 2, we improve on (4) by showing, for any choice of step lengths $t_k\in(0,\sqrt3/L)$, that

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|\le\sqrt{\frac{4\Delta}{\sum_{k=1}^N\min\left(4t_k-L^2t_k^3,\;4t_k-Lt_k^2\right)}},\tag{5}$$

where $\Delta=f(x_1)-f^\star$. As a consequence, we also prove, and improve on, (3) in the special case where all step lengths equal $\frac1L$. In addition, we construct an $L$-smooth function that attains the given bound in Theorem 2 for certain step lengths. We also propose an optimal step length that minimizes the right-hand side of the bound (5), namely $t_k=\sqrt{4/3}/L$ for all $k\in\{1,\ldots,N\}$.
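For concreteness, one can compare the constants in the squared worst-case bound (which scales as $c\,L\Delta/N$) for the step length $1/L$ and for $\sqrt{4/3}/L$. The computation below is an illustrative sketch, assuming the per-iteration denominator term takes the two-branch form $\min(4t-L^2t^3,\,4t-Lt^2)$; it is not text from the paper.

```python
import math

L, Delta, N = 1.0, 1.0, 1       # unit quantities: constants scale as L*Delta/N

def denom_term(t):
    """Per-iteration denominator term (assumed two-branch form of the bound)."""
    return min(4 * t - L**2 * t**3, 4 * t - L * t**2)

c_unit = 4 * Delta / (N * denom_term(1.0 / L))                  # t = 1/L
c_opt = 4 * Delta / (N * denom_term(math.sqrt(4.0 / 3.0) / L))  # t = sqrt(4/3)/L
```

The constant drops from $4/3\approx1.333$ to $3\sqrt3/4\approx1.299$, a modest (about 2.6%) improvement in the squared bound.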

Outline
The paper is organized as follows. We describe the performance estimation technique in Section 2. In Section 3, we study the convergence rate by using performance estimation. Finally, we conclude the paper with a conjecture.

Notation
The $n$-dimensional Euclidean space is denoted by $\mathbb{R}^n$. We use $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ to denote the Euclidean inner product and norm, respectively. For a matrix $A$, $A_{ij}$ denotes its $(i,j)$-th entry, and $A^T$ represents the transpose of $A$. The notation $A\succeq 0$ means the matrix $A$ is symmetric positive semidefinite.

Performance estimation
Computing the worst-case convergence rate of a given iterative method on a given class of functions is an infinite-dimensional optimization problem. In their seminal paper [8], Drori and Teboulle take advantage of this idea, called performance estimation, and introduce a relaxation scheme to deal with this infinite-dimensional optimization problem. Performance estimation has since been used extensively for the analysis of first-order methods [5,6,8,12,14].
Similarly to problem (P) in [8], the worst-case convergence rate of Algorithm 1 may be formulated as the following abstract optimization problem:

$$
\begin{array}{ll}
\max & \min_{1\le k\le N+1}\|\nabla f(x_k)\|^2\\[2pt]
\text{s.t.} & f\in C^{1,1}_L(\mathbb{R}^n),\quad f(x)\ge f^\star\ \ \forall x\in\mathbb{R}^n,\\[2pt]
& x_{k+1}=x_k-t_k\nabla f(x_k),\quad k\in\{1,\ldots,N\},\\[2pt]
& f(x_1)-f^\star=\Delta,
\end{array}\tag{6}
$$

where $\Delta\ge 0$ denotes the difference between the value of $f$ at the starting point and the given lower bound $f^\star$. In problem (6), $f$ and $x_1$ are the decision variables. This is an infinite-dimensional optimization problem with an infinite number of constraints, and consequently it is intractable in general. In what follows, we provide a semidefinite programming relaxation of the problem.
The following proposition states a well-known characterization of $L$-smooth functions; it follows, e.g., from [10, Lemma 1.2.3].

Proposition 1 Let $f\in C^{1,1}_L(\mathbb{R}^n)$. Then, for all $x,y\in\mathbb{R}^n$,
$$\left|f(y)-f(x)-\langle\nabla f(x),y-x\rangle\right|\le\frac{L}{2}\|y-x\|^2.$$

The following well-known result is a fundamental property of gradient descent for $L$-smooth functions when the step length $\frac1L$ is used.

Proposition 2 Let $f\in C^{1,1}_L(\mathbb{R}^n)$. Then, for any $x\in\mathbb{R}^n$,
$$f\left(x-\tfrac1L\nabla f(x)\right)\le f(x)-\frac{1}{2L}\|\nabla f(x)\|^2.$$
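The descent property of the step length $1/L$ is easy to sanity-check numerically. The snippet below is an illustration (not from the paper): it verifies the inequality $f(x-\nabla f(x)/L)\le f(x)-\|\nabla f(x)\|^2/(2L)$ at a few points for $f=\sin$, which has a 1-Lipschitz gradient.

```python
import math

L = 1.0                        # f = sin has an L-Lipschitz gradient with L = 1
f, grad = math.sin, math.cos

def descent_lemma_holds(x, tol=1e-12):
    """Check f(x - grad(x)/L) <= f(x) - |grad(x)|^2 / (2L) at the point x."""
    return f(x - grad(x) / L) <= f(x) - grad(x) ** 2 / (2 * L) + tol

samples = [-3.0, -1.0, 0.5, 2.0, 2.5]
results = [descent_lemma_holds(x) for x in samples]
```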
The following theorem plays a key role in our analysis. Indeed, it provides necessary and sufficient conditions for the interpolation of $L$-smooth functions. Using this theorem, we will formulate problem (6) as a finite-dimensional optimization problem.
Theorem 1 Let $I$ be an index set and let $\{(x_i;g_i;f_i)\}_{i\in I}\subseteq\mathbb{R}^n\times\mathbb{R}^n\times\mathbb{R}$. There exists a function $f\in C^{1,1}_L(\mathbb{R}^n)$ with

$$f(x_i)=f_i,\quad \nabla f(x_i)=g_i,\quad i\in I,\tag{8}$$

if and only if, for all $i,j\in I$,

$$f_i\ge f_j+\frac12\langle g_i+g_j,x_i-x_j\rangle-\frac{L}{4}\|x_i-x_j\|^2+\frac{1}{4L}\|g_i-g_j\|^2.\tag{9}$$

In addition, if the triple $\{(x_i;g_i;f_i)\}_{i\in I}$ satisfies (9), then there exists an $L$-smooth function $f$ for which (8) holds and $\min_{x\in\mathbb{R}^n}f(x)=\min_{i\in I}f_i-\frac{1}{2L}\|g_i\|^2$. Moreover, letting $i^*\in\arg\min_{i\in I}f_i-\frac{1}{2L}\|g_i\|^2$, a global minimizer of this function is given by $x^\star=x_{i^*}-\frac1L g_{i^*}$.

Another proof of the first part of Theorem 1 may be found in [15, Theorem 2, Page 148]. By virtue of Theorem 1, problem (6) may be reformulated as follows:

$$
\begin{array}{ll}
\max & \min_{1\le k\le N+1}\|g_k\|^2\\[2pt]
\text{s.t.} & \{(x_k;g_k;f_k)\}_{k=1}^{N+1}\ \text{satisfies (9)},\\[2pt]
& x_{k+1}=x_k-t_kg_k,\quad k\in\{1,\ldots,N\},\\[2pt]
& f_1-f^\star=\Delta.
\end{array}\tag{10}
$$

In the above formulation, $g_k$ and $f_k$ stand for $\nabla f(x_k)$ and $f(x_k)$, $k\in\{1,\ldots,N+1\}$. These constraints do not necessarily impose a given $L$-smooth function $f$ which has the lower bound $f^\star$. Therefore, the optimal values of (6) and (10) may not be equal in general. However, if an optimal solution of problem (10) satisfies $f^\star=\min_{1\le k\le N+1}f_k-\frac{1}{2L}\|g_k\|^2$, the formulation will be exact; see the second part of Theorem 1. By Proposition 2, we have $f_k-\frac{1}{2L}\|g_k\|^2\ge f^\star$, so we add this inequality to (10) and consider the following problem:

$$
\begin{array}{ll}
\max & \min_{1\le k\le N+1}\|g_k\|^2\\[2pt]
\text{s.t.} & \{(x_k;g_k;f_k)\}_{k=1}^{N+1}\ \text{satisfies (9)},\\[2pt]
& x_{k+1}=x_k-t_kg_k,\quad k\in\{1,\ldots,N\},\\[2pt]
& f_k-\frac{1}{2L}\|g_k\|^2\ge f^\star,\quad k\in\{1,\ldots,N+1\},\\[2pt]
& f_1-f^\star=\Delta.
\end{array}\tag{11}
$$

From the constraint $x_{k+1}=x_k-t_kg_k$, we get $x_i=x_1-\sum_{k=1}^{i-1}t_kg_k$, $i\in\{2,\ldots,N+1\}$. By using this relation to eliminate the $x_i$ ($i\in\{2,\ldots,N+1\}$), problem (11) may be written as follows:

$$
\begin{array}{ll}
\max & \ell\\[2pt]
\text{s.t.} & \ell\le\|g_k\|^2,\quad k\in\{1,\ldots,N+1\},\\[2pt]
& \text{the remaining constraints of (11), with the } x_i \text{ eliminated},
\end{array}\tag{12}
$$

where $\ell$ is an auxiliary variable used to convert problem (11) into a quadratic program. Problem (12) is a non-convex quadratic program with quadratic constraints. In the following proposition, we show that the optimal values of problems (6) and (11) (or, equivalently, problem (12)) are the same for step lengths in the interval $(0,\frac2L)$.
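The pairwise interpolation conditions can be checked mechanically. The snippet below is an illustration, not from the paper; the inequality coded is the standard $C^{1,1}_L$ pairwise condition $f_i\ge f_j+\tfrac12\langle g_i+g_j,x_i-x_j\rangle-\tfrac L4\|x_i-x_j\|^2+\tfrac1{4L}\|g_i-g_j\|^2$. Samples of the 1-smooth function $\sin$ must pass, while data whose gradient varies faster than $L$ allows must fail.

```python
import math

def interpolable(points, L, tol=1e-12):
    """Check the pairwise C^{1,1}_L interpolation conditions for a list of
    scalar triples (x_i, g_i, f_i)."""
    for (xi, gi, fi) in points:
        for (xj, gj, fj) in points:
            rhs = (fj + 0.5 * (gi + gj) * (xi - xj)
                   - L / 4 * (xi - xj) ** 2 + (gi - gj) ** 2 / (4 * L))
            if fi < rhs - tol:
                return False
    return True

# Samples of sin (1-smooth) should satisfy the conditions ...
smooth_pts = [(x, math.cos(x), math.sin(x)) for x in (-2.0, -0.5, 1.0, 2.5)]
ok = interpolable(smooth_pts, L=1.0)
# ... while a gradient jump of 0.5 over a distance of 0.1 is too steep for L = 1.
steep_pts = [(0.0, 0.0, 0.0), (0.1, 0.5, 0.0)]
bad = interpolable(steep_pts, L=1.0)
```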
Proof. Clearly, problem (11) is a relaxation of problem (6). Therefore, we only need to show that, for any feasible solution of (11), say $\{(x_k;g_k;f_k)\}_{k=1}^{N+1}$, there exists a function $f\in C^{1,1}_L(\mathbb{R}^n)$ with $f(x_k)=f_k$ and $\nabla f(x_k)=g_k$ for $k\in\{1,\ldots,N+1\}$, and $\min_{x\in\mathbb{R}^n}f(x)\ge f^\star$. The existence of such a function follows from Theorem 1, as all assumptions of Theorem 1 are satisfied.
To obtain a tractable form of problem (12), we relax it to a semidefinite program, in the spirit of [8]. To this end, we define the $(N+1)\times(N+1)$ positive semidefinite matrix $G$ as the Gram matrix of the gradients,

$$G_{ij}=\langle g_i,g_j\rangle,\quad i,j\in\{1,\ldots,N+1\}.$$

We may now formulate the following semidefinite program:

$$
\begin{array}{ll}
\max & \ell\\[2pt]
\text{s.t.} & \text{the constraints of (12), written as linear constraints in } G,\ \ell\ \text{and the } f_i,\\[2pt]
& G\succeq 0,
\end{array}\tag{13}
$$

where the $(N+1)\times(N+1)$ matrices $A_{ij}$, $i\ne j\in\{1,\ldots,N+1\}$, are formed according to the constraints of (12), and $G$, $\ell$, and $f_i$, $i\in\{1,\ldots,N+1\}$, are the decision variables. Problem (13) is a relaxation of (12), but if $n\ge N+1$ the relaxation is exact, that is, the optimal values of (12) and (13) are the same. Indeed, if $n\ge N+1$ and $G$ is a feasible matrix in (13), then $G$ is the Gram matrix of $N+1$ vectors in $\mathbb{R}^n$, and these vectors may be identified with $g_1,\ldots,g_{N+1}$; a similar argument is used in [14, Theorem 5].
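The exactness argument for $n\ge N+1$ rests on the fact that any positive semidefinite matrix of order $N+1$ is the Gram matrix of $N+1$ vectors, which can be recovered by factoring $G$. A short numerical sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 6, 4                      # ambient dimension n >= N + 1

# Gradient vectors g_1, ..., g_{N+1} as the columns of a matrix V.
V = rng.standard_normal((n, N + 1))
G = V.T @ V                      # Gram matrix: G_ij = <g_i, g_j>

# Factor G = U^T U via an eigendecomposition; the columns of U are vectors
# in R^{N+1} (embeddable in R^n since n >= N+1) with the same Gram matrix.
w, Q = np.linalg.eigh(G)
U = np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T
recovered_G = U.T @ U
```

The recovered vectors reproduce all inner products $\langle g_i,g_j\rangle$, which is all that problem (13) sees.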

Worst-case convergence rate
In this section, we investigate the convergence rate of the gradient method with fixed step lengths. The next theorem gives the worst-case convergence rate of Algorithm 1 to a stationary point of an $L$-smooth function. The technique of the proof, as is usual for SDP performance estimation, is to use weak duality. In particular, we will in fact construct a feasible solution to the dual SDP problem of (13), and thus derive an upper bound for problem (12). In practice, this dual feasible solution is constructed in a computer-assisted manner, by solving the primal and dual SDP problems for different fixed values of the parameters, and subsequently guessing the values of the dual multipliers. (There is also dedicated software for this purpose, namely 'PESTO' by Taylor, Glineur, and Hendrickx [13].) In the proof of Theorem 2, we simply verify that these 'guesses' are correct.

Theorem 2 Let $f\in C^{1,1}_L(\mathbb{R}^n)$ with $f(x)\ge f^\star$ for all $x\in\mathbb{R}^n$, let $\Delta=f(x_1)-f^\star$, and let $t_k\in(0,\sqrt3/L)$ for $k\in\{1,\ldots,N\}$. Then, if $x_1,\ldots,x_{N+1}$ denote the iterates of Algorithm 1, one has

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|\le\sqrt{\frac{4\Delta}{\sum_{k=1}^N\min\left(4t_k-L^2t_k^3,\;4t_k-Lt_k^2\right)}}.\tag{14}$$

In particular, if $t_k=\sqrt{4/3}/L$ for $k\in\{1,\ldots,N\}$, one has

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|\le\sqrt{\frac{3\sqrt3\,L\Delta}{4N}}.\tag{15}$$

Similarly, if $t_k=\frac1L$ for $k\in\{1,\ldots,N\}$, one has

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|\le\sqrt{\frac{4L\Delta}{3N}}.\tag{16}$$

Proof. Let $U$ denote the square of the right-hand side of inequality (14) and let $B=\frac{U}{\Delta}$. To establish the bound, we show that $U$ is an upper bound for problem (12). Consider a feasible point $\left(\{g_k;f_k\}_{k=1}^{N+1};\ell\right)$ of problem (12), and associate with its constraints suitable dual multipliers $\sigma_1,\ldots,\sigma_N$; for step lengths $t_k\in(0,\sqrt3/L)$, $k\in\{1,\ldots,N\}$, the $\sigma_k$'s are non-negative. Aggregating the constraints of (12) with these multipliers, one may verify directly, through elementary algebra, that each resulting quadratic function $Q_k$ is non-negative. Since each $Q_k$ is a non-negative quadratic function and the given dual multipliers are non-negative, we conclude that $\ell\le U$ for any feasible solution of (12). $\square$
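As a numerical sanity check (an illustration, assuming the worst-case bound takes the form $\min_k\|\nabla f(x_k)\|\le\sqrt{4\Delta/\sum_k\min(4t_k-L^2t_k^3,\,4t_k-Lt_k^2)}$), one can run the fixed-step gradient method on the 1-smooth function $\sin$ and compare the smallest observed gradient norm with the bound:

```python
import math

L = 1.0
f, grad = math.sin, math.cos     # sin is 1-smooth, with lower bound f_star = -1
f_star = -1.0

x, N, t = 2.0, 10, 0.9           # fixed step length t in (0, sqrt(3)/L)
delta = f(x) - f_star

grad_norms = []
for _ in range(N):
    grad_norms.append(abs(grad(x)))
    x -= t * grad(x)
grad_norms.append(abs(grad(x)))  # gradient norm at the last iterate x_{N+1}

# Right-hand side of the bound in the assumed form.
denom = N * min(4 * t - L**2 * t**3, 4 * t - L * t**2)
bound = math.sqrt(4 * delta / denom)
holds = min(grad_norms) <= bound
```

A single run is of course no proof; it merely checks that the stated inequality is not violated on this instance.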
The special step length $t_k=\sqrt{4/3}/L$ for $k\in\{1,\ldots,N\}$ used to obtain (15) will be motivated later, in Theorem 3. Note that (16) gives a formal proof (with a small improvement) of the bound claimed by Taylor [11, page 190]; see (3).
An important question concerning the bound (14) is how far it may be from the optimal value of (6). It is known that the lower bound for Algorithm 1 is of the order $\Omega\left(\frac{1}{\sqrt N}\right)$ [2,3]. In what follows, we establish that the bound (14) is exact in some cases.
We now show that, when $t_k\in(0,\frac1L]$ for $k\in\{1,\ldots,N\}$, the bound (14) is attained, that is, there exist $f\in C^{1,1}_L(\mathbb{R})$ and a starting point $x_1$ for which

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|=\sqrt{\frac{4\Delta}{\sum_{k=1}^N\left(4t_k-Lt_k^2\right)}}.\tag{17}$$

Proof. It suffices, for a given $N$, to exhibit an $L$-smooth function $f$ and a point $x_1$ for which equality holds in (17). Suppose now that $t_k\in(0,\frac1L]$, $k\in\{1,\ldots,N\}$, and let $U$ denote the right-hand side of equality (17). We set $t_{N+1}=\frac1L$, and we define points $\ell_1>\ell_2>\cdots>\ell_{N+1}>\ell_{N+2}=0$. By elementary calculus, one can check that the function $f:\mathbb{R}\to\mathbb{R}$, defined piecewise on the intervals $[\ell_{i+1},\ell_i]$ for $i\in\{1,\ldots,N+1\}$, is $L$-smooth with the optimal value $f^\star=0$ and the optimal solution $x^\star=0$. In addition, equality (17) holds for $x_1=\ell_1$. $\square$

Note that, though we have only shown the exactness of the bound (14) for step lengths in the interval $(0,\frac1L]$, we conjecture that the bound (14) is in fact exact for all step lengths in the interval $(0,\frac{\sqrt3}{L})$. By minimizing the right-hand side of (14), the next theorem gives the 'optimal' step lengths with respect to the bound.
Theorem 3 Let $f$ be an $L$-smooth function. Then the optimal step lengths for the gradient method with respect to the bound (14) are given by

$$t_k=\frac{\sqrt{4/3}}{L},\quad k\in\{1,\ldots,N\}.$$

Proof. We minimize the right-hand side of (14), that is,

$$\sqrt{\frac{4\Delta}{\sum_{k=1}^N\min\left(4t_k-L^2t_k^3,\;4t_k-Lt_k^2\right)}},$$

which is equivalent to maximizing, for each $k$ separately, the function

$$H(t)=\min\left(4t-L^2t^3,\;4t-Lt^2\right).$$

Since $H$ is a strictly concave function on $\left(0,\frac{\sqrt3}{L}\right)$ (as the minimum of two strictly concave functions), and since $4t-Lt^2$ is increasing on $(0,\frac1L]$, the maximum is attained at the stationary point of the branch $4t-L^2t^3$, namely $t=\frac{2}{\sqrt3\,L}=\frac{\sqrt{4/3}}{L}$. $\square$

The step length $\frac1L$ is commonly regarded as the optimal step length in the literature; see [10, Chapter 1]. Due to the example introduced in (18), we see that the worst-case convergence rate for the step length $\frac1L$ cannot be better than the bound (16), so with respect to the worst case the step length $\sqrt{4/3}/L$ is preferable.
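The optimality of $\sqrt{4/3}/L$ can be checked by a brute-force grid search over the per-iteration denominator term; the two-branch form used below matches the expression $\min(-L^2t_k^3+4t_k,\,-Lt_k^2+4t_k)$ appearing in the concluding remarks (an illustrative check, not part of the proof):

```python
import math

L = 2.0                          # any L > 0; the maximizer scales as 1/L

def H(t):
    """Per-iteration denominator term of the bound (two-branch form)."""
    return min(4 * t - L**2 * t**3, 4 * t - L * t**2)

# Maximize H over a fine grid inside (0, sqrt(3)/L).
M = 200000
grid = [i * (math.sqrt(3.0) / L) / M for i in range(1, M)]
t_best = max(grid, key=H)
t_theory = math.sqrt(4.0 / 3.0) / L   # claimed optimal step length
```

The grid maximizer agrees with $\sqrt{4/3}/L$ up to the grid resolution, for any positive $L$ one plugs in.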

Concluding remarks
In this paper, we studied the convergence rate of the gradient method for $L$-smooth functions, and we provided a new convergence rate when the step lengths belong to the interval $(0,\frac{\sqrt3}{L})$. Moreover, we showed that this convergence rate is tight for step lengths in the interval $(0,\frac1L]$. As mentioned in the introduction, Algorithm 1 is convergent for step lengths in the larger interval $(0,\frac2L)$. Following extensive numerical experiments, in which we solved the semidefinite program (13) for different parameter values, we conjecture that when $t_k\in\left(0,\frac2L\right)$ for $k\in\{1,\ldots,N\}$, we have

$$\min_{1\le k\le N+1}\|\nabla f(x_k)\|\le\sqrt{\frac{4\Delta}{\sum_{k=1}^N\min\left(-L^2t_k^3+4t_k,\;-Lt_k^2+4t_k\right)}}.$$
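To see which branch of the conjectured denominator is active (an illustrative check, not from the paper): for $t\le\frac1L$ the quadratic term $-Lt^2+4t$ is the smaller of the two, while for $t\ge\frac1L$ the cubic term $-L^2t^3+4t$ takes over, the two coinciding at $t=\frac1L$.

```python
L = 1.0

def term(t):
    """Per-iteration denominator term of the conjectured bound."""
    return min(-L**2 * t**3 + 4 * t, -L * t**2 + 4 * t)

# Quadratic branch is active on (0, 1/L] ...
short_steps = all(term(t) == -L * t**2 + 4 * t for t in [0.1, 0.5, 0.9, 1.0])
# ... and the cubic branch on [1/L, 2/L).
long_steps = all(term(t) == -L**2 * t**3 + 4 * t for t in [1.1, 1.5, 1.9])
```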