On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions
Abstract
We consider the gradient (or steepest) descent method with exact line search applied to a strongly convex function with Lipschitz continuous gradient. We establish the exact worst-case rate of convergence of this scheme, and show that this worst-case behavior is exhibited by a certain convex quadratic function. We also give the tight worst-case complexity bound for a noisy variant of the gradient descent method, where exact line search is performed in a search direction that differs from the negative gradient by at most a prescribed relative tolerance. The proofs are computer-assisted, and rely on the resolution of semidefinite programming performance estimation problems as introduced in the paper (Drori and Teboulle, Math Progr 145(1–2):451–482, 2014).
Keywords
Gradient method · Steepest descent · Semidefinite programming · Performance estimation problem
1 Introduction
The gradient (or steepest) descent method for unconstrained optimization was devised by Augustin–Louis Cauchy (1789–1857) in the nineteenth century, and remains one of the most iconic algorithms for unconstrained optimization. Indeed, it is usually the first algorithm that is taught during introductory courses on nonlinear optimization. It is therefore somewhat surprising that the worst-case convergence rate of the method is not yet precisely understood for smooth strongly convex functions.
In this paper, we settle the worst-case convergence rate question of the gradient descent method with exact line search for strongly convex, continuously differentiable functions f with Lipschitz continuous gradient. Formally, we consider the following function class.
Definition 1.1
A continuously differentiable function \(f : \mathbb {R}^n \rightarrow \mathbb {R}\) belongs to the class \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\), with \(0< \mu < L\), if the following two conditions hold:
 1.
\(\mathbf {x} \mapsto f(\mathbf {x}) - \frac{\mu }{2}\Vert \mathbf {x}\Vert ^2\) is a convex function on \(\mathbb {R}^n\), where the norm is the Euclidean norm;
 2.
\(\Vert \nabla f(\mathbf {x}+\Delta \mathbf {x}) - \nabla f(\mathbf {x}) \Vert \le L \Vert \Delta \mathbf {x}\Vert \) holds for all \(\mathbf {x} \in \mathbb {R}^n\) and \(\Delta \mathbf {x} \in \mathbb {R}^n\).
The gradient method with exact line search may be described as follows.
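At each iteration, the method moves from the current point along the negative gradient to the exact minimizer of f along that ray. As an illustration (not taken from the paper; all names are ours), here is a minimal implementation for a convex quadratic \(f(\mathbf {x}) = \tfrac{1}{2}\mathbf {x}^{\mathsf{T}} A \mathbf {x} - \mathbf {b}^{\mathsf{T}}\mathbf {x}\), for which the exact line search has a closed form:

```python
import numpy as np

def gradient_descent_exact_ls(A, b, x0, n_iter=200):
    """Gradient descent with exact line search on the quadratic
    f(x) = 0.5 x'Ax - b'x with A symmetric positive definite, for which
    the exact step gamma = (g'g)/(g'Ag) is available in closed form."""
    x = x0.astype(float)
    for _ in range(n_iter):
        g = A @ x - b                    # gradient of f at x
        gAg = g @ (A @ g)
        if gAg == 0.0:                   # g = 0: x is already the minimizer
            break
        gamma = (g @ g) / gAg            # exact minimizer of t -> f(x - t*g)
        x = x - gamma * g
    return x

# Eigenvalues mu = 1 and L = 10; the minimizer is A^{-1} b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x = gradient_descent_exact_ls(A, b, np.array([10.0, 1.0]))
```

On this starting point the iterates zigzag between two fixed directions, which is the worst-case behavior made precise below (in the spirit of Example 1.3).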
Our main result may now be stated concisely.
Theorem 1.2
Example 1.3
As an immediate consequence of Theorem 1.2 and Example 1.3, one has the following tight bound on the number of steps needed to obtain \(\epsilon \)-relative accuracy on the objective function for a given \(\epsilon > 0\).
Corollary 1.4
A result related to Theorem 1.2 is given in [5], where an Armijo-rule line search is used instead of exact line search. An explicit rate in the strongly convex case is given there in Proposition 3.3.5 on page 53 (the definition of the method is (3.1.2) on page 44). More general upper bounds on the convergence rates of gradient-type methods for convex functions may be found in the books [6, 7]. We mention one more particular result by Nesterov [7] that is similar to our main result in Theorem 1.2, but that uses a fixed step length and relies on the initial distance to the solution.
Theorem 1.5
Note that this result does not imply Theorem 1.2.
2 Background results
In this section we collect some known results on strongly convex functions and on the gradient method. We will need these results in the proof of our main result, Theorem 1.2.
2.1 Properties of the gradient method with exact line search
2.2 Interpolation with functions in \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\)
We now consider the following interpolation problem over the class of functions \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\).
Definition 2.1
A necessary and sufficient condition for \(\mathcal {F}_{\mu ,L}\)-interpolability is given in the next theorem, taken from [11].
Theorem 2.2
In principle, Theorem 2.2 allows one to generate all possible valid inequalities that hold for functions in \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\) in terms of their function values and gradients at a set of points \(\mathbf {x}_0,\ldots ,\mathbf {x}_N\). This will be essential for the proof of our main result, Theorem 1.2.
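For reference, the interpolation condition of [11] requires, for every pair of indices i, j, that \(f_i \ge f_j + \mathbf {g}_j^{\mathsf{T}}(\mathbf {x}_i-\mathbf {x}_j) + \frac{1}{2(1-\mu /L)}\big (\tfrac{1}{L}\Vert \mathbf {g}_i-\mathbf {g}_j\Vert ^2 + \mu \Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2 - \tfrac{2\mu }{L}(\mathbf {g}_i-\mathbf {g}_j)^{\mathsf{T}}(\mathbf {x}_i-\mathbf {x}_j)\big )\); we restate this from [11], and the following numerical check (ours) verifies it on a quadratic in \(\mathcal {F}_{1,10}\):

```python
import numpy as np

def interp_gap(mu, L, fi, fj, xi, xj, gi, gj):
    """Left-hand side minus right-hand side of the F_{mu,L}-interpolation
    inequality of [11] for the pair (i, j); it is nonnegative whenever the
    data come from a function in F_{mu,L}.  (Condition restated from [11];
    function and variable names are ours.)"""
    dx, dg = xi - xj, gi - gj
    rhs = fj + gj @ dx + (
        (1.0 / L) * dg @ dg + mu * dx @ dx - (2.0 * mu / L) * dg @ dx
    ) / (2.0 * (1.0 - mu / L))
    return fi - rhs

# Sanity check on a quadratic in F_{1,10}: eigenvalues lie in [mu, L],
# so the gap must be nonnegative for every sampled pair of points.
rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 10.0])
pts = rng.standard_normal((5, 3))
gaps = [interp_gap(1.0, 10.0, 0.5 * x @ A @ x, 0.5 * y @ A @ y,
                   x, y, A @ x, A @ y)
        for x in pts for y in pts]
```

For f(x) = (L/2)‖x‖² or (μ/2)‖x‖² the condition holds with equality, which is a quick way to confirm the constants in the formula.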
3 A performance estimation problem
The proof technique we will use for Theorem 1.2 is inspired by recent work on the so-called performance estimation problem, as introduced in [2] and further developed in [11]. The idea is to formulate the computation of the worst-case behavior of certain iterative methods as an explicit semidefinite programming (SDP) problem. We first recall the definition of SDP problems (in a form that is suitable to our purposes).
3.1 Semidefinite programs
3.2 Performance estimation of the gradient method with exact line search
Note that this is indeed an SDP problem of the form (6), with dual problem of the form (7), since equalities and interpolability conditions are linear in the inner products of variables \(\mathbf {x}_i\) and \(\mathbf {g}_i\).
Lemma 3.1
The optimal value of the above SDP problem (8) is an upper bound on \(f(\mathbf {x}_N) - f_*\), where f is any function from \(\mathcal {F}_{\mu ,L}(\mathbb {R}^n)\), \(f_*\) is its minimum and \(\mathbf {x}_N\) is the Nth iterate of the gradient method with exact line search applied to f from any starting point \(\mathbf {x}_0\) that satisfies \(f(\mathbf {x}_0) - f_*\le R\).
Proof
We are now ready to give a proof of our main result. We already mention that the SDP relaxation (8) is not used directly in the proof, but was used to devise the proof, in a sense that will be explained later.
4 Proof of Theorem 1.2
A little reflection shows that, to prove Theorem 1.2, we need only consider one iteration of the gradient method with exact line search. Thus we consider only the first iterate, given by \(\mathbf {x}_0\) and \(\mathbf {x}_1\), as well as the minimizer \(\mathbf {x}_*\) of \(f \in \mathcal {F}_{\mu ,L}\).
4.1 Remarks on the proof of Theorem 1.2

First, note that we have proven a bit more than what is stated in Theorem 1.2. Indeed, the result in Theorem 1.2 holds for any iterative method that satisfies the five inequalities used in its proof.
 Although the proof of Theorem 1.2 is easy to verify, it is not apparent how the multipliers \(y_1,\ldots , y_5\) in (9) were obtained. This was in fact done via preliminary computations, and subsequently guessing the values in (9), through the following steps:
 1.
The SDP performance estimation problem (8) with \(N=1\) was solved numerically for various values of the parameters \(\mu \), L and R (in fact, the values of L and R can safely be fixed to some positive constants using appropriate scaling arguments; see e.g. [11, Section 3.5] for a related discussion).
 2.
The optimal values of the dual SDP multipliers of the constraints corresponding to the five inequalities in the proof gave the guesses for the correct values \(y_1,\ldots , y_5\) as stated in (9).
 3.
Finally the correctness of the guess was verified directly (by symbolic computation and by hand).
The key inequality (10) may be rewritten in another, more symmetric way:
$$\begin{aligned} (f_1 - f_*) \le (f_0 - f_*) \left( \frac{1-\kappa }{1+\kappa }\right) ^2 - \frac{\mu }{4} \left( \frac{\Vert \mathbf {s}_1\Vert ^2}{1+\sqrt{\kappa }} + \frac{\Vert \mathbf {s}_2\Vert ^2}{1-\sqrt{\kappa }} \right) , \end{aligned}$$
where \(\kappa = \mu /L\) is the condition number (between 0 and 1) and the slack vectors \(\mathbf {s}_1\) and \(\mathbf {s}_2\) are
$$\begin{aligned} \mathbf {s}_1= & {} -\frac{(1+\sqrt{\kappa })^2}{1+\kappa } \left( \mathbf {x}_0 - \mathbf {x}_* - \mathbf {g}_0/\sqrt{L\mu } \right) + \left( \mathbf {x}_1 - \mathbf {x}_* + \mathbf {g}_1/\sqrt{L\mu } \right) \\ \mathbf {s}_2= & {} \ \ \ \frac{(1-\sqrt{\kappa })^2}{1+\kappa } \left( \mathbf {x}_0 - \mathbf {x}_* + \mathbf {g}_0/\sqrt{L\mu } \right) - \left( \mathbf {x}_1 - \mathbf {x}_* - \mathbf {g}_1/\sqrt{L\mu } \right) . \end{aligned}$$
Note that the four expressions \(\mathbf {x}_i - \mathbf {x}_* \pm \mathbf {g}_i/\sqrt{L\mu }\) are invariant under dilation of f, and that cases of equality in (10) simply correspond to the equalities \(\mathbf {s}_1=\mathbf {s}_2=0\).
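A quick numerical check (ours) of this equality case: along the worst-case quadratic trajectory of Example 1.3 (with minimizer \(\mathbf {x}_*=0\)), both slack vectors vanish. The sign conventions for \(\mathbf {s}_1\) and \(\mathbf {s}_2\) below follow our reading of the displayed formulas:

```python
import numpy as np

# Check (ours) that both slack vectors vanish along the worst-case
# quadratic trajectory, i.e. that inequality (10) is tight there.
mu, L = 1.0, 4.0
kappa = mu / L
A = np.diag([mu, L])                    # f(x) = 0.5 x'Ax, minimizer x* = 0
x0 = np.array([1.0 / mu, 1.0 / L])      # worst-case start (cf. Example 1.3)
g0 = A @ x0
x1 = x0 - ((g0 @ g0) / (g0 @ (A @ g0))) * g0   # exact line-search step
g1 = A @ x1
r = np.sqrt(L * mu)
s1 = (-(1 + np.sqrt(kappa)) ** 2 / (1 + kappa) * (x0 - g0 / r)
      + (x1 + g1 / r))
s2 = ((1 - np.sqrt(kappa)) ** 2 / (1 + kappa) * (x0 + g0 / r)
      - (x1 - g1 / r))
```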

It is interesting to note that the known proof of Theorem 1.2 for the quadratic case only requires the so-called Kantorovich inequality, which may be stated as follows.
Theorem 4.1
Finally, we note that this proof can be modified very easily to handle the case of the fixed-step gradient method that was mentioned in Theorem 1.5. Indeed, observe that the proof aggregates the fourth and fifth inequalities with multipliers \(y_4=\frac{2}{L+\mu }\) and \(y_5=1\), which leads to the combined inequality
$$\begin{aligned} -\frac{2}{L+\mu } ( \mathbf {g}_{0}^{\mathsf{T}} \mathbf {g}_{1}) + \mathbf {g}_{1}^{\mathsf{T}} (\mathbf {x}_0-\mathbf {x}_1)\ge 0 \quad \Leftrightarrow \quad \mathbf {g}_{1}^{\mathsf{T}} \left( \mathbf {x}_0 - \frac{2}{L+\mu } \mathbf {g}_0 - \mathbf {x}_1\right) \ge 0. \end{aligned}$$
Now note that the gradient method with fixed step \(\gamma = \frac{2}{L+\mu }\) satisfies this combined inequality (since the second factor in the left-hand side becomes zero), and hence the rest of the proof establishes the same rate for this method as for gradient descent with exact line search.
Theorem 4.2
Note that Example 1.3 also establishes that this rate is tight. Hence we have the relatively surprising fact that, when looking at the worst-case convergence rate of the objective function accuracy, performing an exact line search is not better than using a well-chosen fixed step length.
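A small numerical illustration of this equivalence (ours): on the worst-case quadratic, the fixed step \(\gamma = \frac{2}{L+\mu }\) and exact line search contract the objective by exactly the same factor \(\left( \frac{L-\mu }{L+\mu }\right) ^2\) at every iteration:

```python
import numpy as np

# On the worst-case quadratic, the fixed step gamma = 2/(L+mu) and exact
# line search produce the same contraction of f per iteration.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
f = lambda x: 0.5 * x @ A @ x          # minimizer x* = 0, so f* = 0
rho2 = ((L - mu) / (L + mu)) ** 2      # worst-case rate of Theorem 1.2

x = np.array([1.0 / mu, 1.0 / L])      # worst-case start (cf. Example 1.3)
ratios = []
for _ in range(5):
    g = A @ x
    x_fixed = x - (2.0 / (L + mu)) * g                 # fixed step
    x_exact = x - ((g @ g) / (g @ (A @ g))) * g        # exact line search
    ratios.append((f(x_fixed) / f(x), f(x_exact) / f(x)))
    x = x_exact
```

Along this trajectory the exact line-search step happens to equal 2/(L+μ) at every iteration, so the two methods even produce identical iterates here.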
5 Extension to ‘noisy’ gradient descent with exact line search
In the case of more general values of \(\varepsilon \), one can, for example, satisfy the relative error criterion by imposing a restriction of the type \(|\sin \theta | \le \varepsilon \) on the angle \(\theta \) between the search direction \(\mathbf {d}_i\) and the current negative gradient \(-\nabla f(\mathbf {x}_i)\).
Using a search direction \(\mathbf {d}_i\) that satisfies (11) corresponds, for example, to an implementation of the gradient descent method where each component of \(\nabla f(\mathbf {x}_i)\) is only calculated to a fixed number of significant digits. It is also related to the so-called stochastic gradient descent method that is used in the training of neural networks; see e.g. [4] and the references therein.
Thus we consider the following algorithm:
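A minimal sketch of such a noisy variant on a quadratic (ours, not the paper's pseudocode): the search direction \(\mathbf {d}_i\) deviates from the gradient by a relative error of at most \(\varepsilon \), and exact line search is performed along \(\mathbf {d}_i\):

```python
import numpy as np

def noisy_gd_exact_ls(A, b, x0, eps, n_iter=200, seed=0):
    """Gradient descent on f(x) = 0.5 x'Ax - b'x where the search
    direction d satisfies ||d - g|| <= eps * ||g|| (relative tolerance
    eps < 1), with exact line search along d (closed form here)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    for _ in range(n_iter):
        g = A @ x - b
        gn = np.linalg.norm(g)
        if gn == 0.0:                    # exact stationary point reached
            break
        u = rng.standard_normal(g.shape)
        d = g + eps * gn * u / np.linalg.norm(u)   # noisy direction
        x = x - ((d @ g) / (d @ (A @ d))) * d      # argmin_t f(x - t*d)
    return x

A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x = noisy_gd_exact_ls(A, b, np.array([5.0, 5.0]), eps=0.2)
```

Since \(\varepsilon < 1\), d is always a descent direction (\(\mathbf {d}^{\mathsf{T}}\mathbf {g} \ge (1-\varepsilon )\Vert \mathbf {g}\Vert ^2 > 0\)), so each exact line search strictly decreases f and the method still converges, only at a slower worst-case rate.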
One may show the following generalization of Theorem 1.2.
Theorem 5.1
To conclude this section, the following example, based on the same quadratic function as Example 1.3, shows that our bound (12) for the noisy gradient descent is also tight.
Example 5.2
6 Concluding remarks
The main results of this paper are the exact convergence rates of the gradient descent method with exact line search and its noisy variant for strongly convex functions with Lipschitz continuous gradients. The computer-assisted technique of proof is also of independent interest, and demonstrates the importance of the SDP performance estimation problems (PEPs) introduced in [2].
We believe this type of computer-assisted proof could prove useful in the analysis of other methods that use exact line search (see for example [10], which studies conditional gradient methods).
PEPs have by now been used to study worst-case convergence rates of several first-order optimization methods [2, 10, 11]. This paper differs in an important aspect: the performance estimation problem considered actually characterizes a whole class of methods that contains the method of interest (gradient descent with exact line search) as well as many other methods. This relaxation in principle only provides an upper bound on the worst case of gradient descent, and it is the fact that Example 1.3 matches this bound that allows us to conclude with a tight result.
The reason we could not solve the performance estimation problem for the gradient descent method itself is that Eq. (2), which essentially states that the step \(\mathbf {x}_{i+1}-\mathbf {x}_i\) is parallel to the gradient \(\nabla f(\mathbf {x}_i)\), cannot be formulated as a convex constraint in the SDP formulation. The main obstruction appears to be that requiring that two vectors be parallel is a nonconvex constraint, even when working with their inner products.^{2} Instead, our convex formulation enforces that those two vectors are both orthogonal to a third one, the next gradient \(\nabla f(\mathbf {x}_{i+1})\).
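The two orthogonality conditions used instead (the next gradient is orthogonal both to the step taken and to the previous gradient) are easy to verify numerically for one exact line-search step on a quadratic (our sketch):

```python
import numpy as np

# After an exact line-search step on a quadratic, the next gradient is
# orthogonal both to the step taken and to the previous gradient.
A = np.diag([2.0, 7.0])
b = np.array([1.0, -1.0])
x0 = np.array([3.0, 4.0])
g0 = A @ x0 - b
x1 = x0 - ((g0 @ g0) / (g0 @ (A @ g0))) * g0   # exact line-search step
g1 = A @ x1 - b
orth_step = g1 @ (x1 - x0)   # should vanish: g1 is orthogonal to the step
orth_grad = g1 @ g0          # should vanish: g1 is orthogonal to g0
```

Both inner products are linear in the Gram-matrix entries, which is what makes this pair of conditions usable inside the SDP relaxation.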
Footnotes
 1.
The erratum is available at: http://users.iems.northwestern.edu/~nocedal/book/2ndprint.pdf.
 2.
One such nonconvex formulation would be \(\mathbf {g}_{i}^{\mathsf{T}} (\mathbf {x}_i-\mathbf {x}_{i+1}) = \Vert \mathbf {g}_{i}\Vert \, \Vert \mathbf {x}_i-\mathbf {x}_{i+1} \Vert \).
References
 1. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Massachusetts (1999)
 2. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Progr. 145(1–2), 451–482 (2014)
 3. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. Springer, Berlin (2008)
 4. Neelakantan, A., Vilnis, L., Le, Q.V., Sutskever, I., Kaiser, L., Kurach, K., Martens, J.: Adding gradient noise improves learning for very deep networks (2015). arXiv:1511.06807v1
 5. Nemirovski, A.: Optimization II: numerical methods for nonlinear continuous optimization. Lecture Notes (1999). http://www2.isye.gatech.edu/~nemirovs/Lect_OptII.pdf
 6. Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
 7. Nesterov, Yu.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston (2004)
 8. Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media, Berlin (2006)
 9. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
10. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization (2015). arXiv:1512.07516
11. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Progr. (2016). doi:10.1007/s10107-016-1009-3
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.