
Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization


Abstract

We study the worst-case convergence rates of the proximal gradient method for minimizing the sum of a smooth strongly convex function and a non-smooth convex function, whose proximal operator is available. We establish the exact worst-case convergence rates of the proximal gradient method in this setting for any step size and for different standard performance measures: objective function accuracy, distance to optimality and residual gradient norm. The proof methodology relies on recent developments in performance estimation of first-order methods, based on semidefinite programming. In the case of the proximal gradient method, this methodology allows obtaining exact and non-asymptotic worst-case guarantees that are conceptually very simple, although apparently new. On the way, we discuss how strong convexity can be replaced by weaker assumptions, while preserving the corresponding convergence rates. We also establish that the same fixed step size policy is optimal for all three performance measures. Finally, we extend recent results on the worst-case behavior of gradient descent with exact line search to the proximal case.

Notes

  1. A list of useful analytical proximal operators is available in [13].

  2. Those \(\lambda \)’s were found by identifying an analytical optimal solution to the dual performance estimation problem. That is, each \(\lambda \) can be seen as a Lagrange multiplier for the corresponding inequality. The methodology is explained and illustrated in detail in [14, Section 4.1].

  3. Actually, both regimes are valid for \(\gamma =\frac{2}{L+\mu }\).

  4. The difference between the gradient mapping and the residual gradient norm is simple, but somewhat subtle. The gradient mapping measures \({\left\| \nabla f(x_k)+s_{k+1}\right\| }\), whereas the residual gradient norm measures \({\left\| \nabla f(x_{k+1})+s_{k+1}\right\| }\) with \(s_{k+1}\in \partial h(x_{k+1})\) the subgradient used in the proximal operation.

References

  1. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  2. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)

  3. Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  4. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  5. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)

  6. Zhang, H., Cheng, L.: Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization. Optim. Lett. 9(5), 961–979 (2015)

  7. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 1–39 (2018). https://doi.org/10.1007/s10107-018-1232-1

  8. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  9. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

  10. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

  11. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

  12. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  13. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)

  14. de Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)

  15. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)

  16. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  17. Drori, Y.: Contributions to the Complexity Analysis of Optimization Algorithms. Ph.D. Thesis, Tel-Aviv University (2014)

  18. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  19. Taylor, A.B.: Convex Interpolation and Performance Estimation of First-Order Methods for Convex Optimization. Ph.D. Thesis, Université catholique de Louvain (2017)

  20. Nemirovski, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)

  21. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

  22. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  23. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)

  24. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)

  25. Taylor, A., Hendrickx, J., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: Proceedings of the 56th IEEE Conference on Decision and Control (CDC 2017) (2017)

Acknowledgements

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office, and of the Concerted Research Action (ARC) programme supported by the Federation Wallonia-Brussels (contract ARC 14/19-060). The scientific responsibility rests with its authors. The authors also thank the anonymous referee for his very careful reading and comments that include Remark 2.1.

Author information

Correspondence to Adrien B. Taylor.

Appendices

Appendix A: Details of the Proof of Theorem 3.1

In this section, we provide some details on the verification of the proof of Theorem 3.1 (convergence in distance to optimality). The proofs of Theorem 3.2 (convergence in residual gradient norm) and Theorem 3.3 (convergence in function value) follow the same lines, although the results in function values are technically more involved (we advise the reader to use appropriate computer algebra software to preserve his sanity).

As the proofs for both regimes (small and large step sizes) follow the same lines, we only consider the case \(0\le \gamma \le \frac{2}{L+\mu }\) here.

The goal is to prove that the inequality

$$\begin{aligned} \left( 1-\gamma \mu \right) ^2 {\left\| x_k-x_\star \right\| ^2}\ge & {} {\left\| x_{k+1}-x_\star \right\| ^2} +\gamma ^2 {\left\| g_\star +s_{k+1}\right\| ^2}\nonumber \\&+\frac{\gamma (2-\gamma (L+\mu ))}{L-\mu } {\left\| \mu {(x_k -x_\star )} - g_k+ g_\star \right\| ^2}, \end{aligned}$$
(7)

can be obtained by performing a weighted sum of the following inequalities:

$$\begin{aligned}&\begin{array}{ll} f_\star &{}\ge f_k+{\left\langle g_k, x_\star -x_k\right\rangle }+\frac{1}{2L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{2(1-\mu /L)}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ), \\&\begin{array}{ll} f_k&{}\ge f_\star +{\left\langle g_\star , x_k-x_\star \right\rangle }+\frac{1}{2L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{2(1-\mu /L)}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ),\\&\begin{array}{ll} h_\star \ge h_{k+1}+{\left\langle s_{k+1}, x_\star -x_{k+1}\right\rangle }&\end{array}\quad&:2\gamma ,\\&\begin{array}{ll}h_{k+1}\ge h_\star + {\left\langle s_\star , x_{k+1}-x_\star \right\rangle }&\end{array} \quad&:2\gamma . \end{aligned}$$

For simplicity, let us first sum the previous inequalities two by two:

$$\begin{aligned}&\begin{array}{ll} 0&{}\ge -{\left\langle g_\star -g_k, x_\star -x_k\right\rangle } +\frac{1}{L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{1-\mu /L}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ), \\&\begin{array}{ll} 0\ge -{\left\langle s_\star -s_{k+1}, x_\star -x_{k+1}\right\rangle }&\end{array}\quad&:2\gamma . \end{aligned}$$

We proceed by showing that (7) can be obtained by reformulation of the following expression:

$$\begin{aligned} 0\ge & {} 2\gamma \rho (\gamma ) \left[ {\left\langle g_k-g_\star , x_\star -x_k\right\rangle } +\frac{1}{L} {\left\| g_k-g_\star \right\| ^2} \right. \nonumber \\&\left. +\frac{\mu }{1-\mu /L} {\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2}\right] \nonumber \\&+\,2\gamma {\left\langle s_{k+1}-s_\star , x_\star -x_{k+1}\right\rangle }. \end{aligned}$$
(8)

To do so, we simply verify that expression (7) minus expression (8) is identically zero; that is:

$$\begin{aligned} 0= & {} 2\gamma \rho (\gamma ) \left[ {\left\langle g_k-g_\star , x_\star -x_k\right\rangle } +\frac{1}{L}{\left\| g_k-g_\star \right\| ^2} \right. \nonumber \\&\left. +\frac{\mu }{1-\mu /L}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2}\right] \nonumber \\&+\,2\gamma {\left\langle s_{k+1}-s_\star , x_\star -x_{k+1}\right\rangle }\nonumber \\&-\left[ -\left( 1-\gamma \mu \right) ^2 {\left\| x_k-x_\star \right\| ^2}+{\left\| x_{k+1}-x_\star \right\| ^2} +\gamma ^2 {\left\| g_\star +s_{k+1}\right\| ^2}\right] \nonumber \\&-\frac{\gamma (2-\gamma (L+\mu ))}{L-\mu } {\left\| \mu {(x_k -x_\star )} - g_k+ g_\star \right\| ^2}, \end{aligned}$$
(9)

with \(x_{k+1}=x_k-\gamma \left( g_k+s_{k+1}\right) \) and \(s_\star =-g_\star \).

Finally, one can simply verify the equality (9) by expanding (7) and (8), which are both equal to:

$$\begin{aligned} \begin{aligned} 0\ge \frac{2}{L-\mu }&\Bigg ((\gamma -\gamma ^2\mu ) {\left\| g_k\right\| ^2} +{(\gamma ^2\mu +\gamma ^2 L-2 \gamma ){\left\langle g_k, g_\star \right\rangle } }\\&+{( \gamma ^2 L-\gamma ^2\mu ) {\left\langle g_k, s_{k+1}\right\rangle }}+{(\gamma ^2 \mu ^2+\gamma ^2 \mu L-\gamma L -\gamma \mu ) {\left\langle g_k, x_k\right\rangle }}\\&+{(- \gamma ^2 \mu ^2-\gamma ^2\mu L +\gamma L+\gamma \mu ){\left\langle g_k, x_\star \right\rangle }}+{(\gamma -\gamma ^2 \mu ) {\left\| g_\star \right\| ^2} }\\&+{( \gamma ^2 L-\gamma ^2 \mu ){\left\langle g_\star , s_{k+1}\right\rangle }}+{(2\gamma \mu - \gamma ^2 \mu ^2-\gamma ^2 \mu L ){\left\langle g_\star , x_k\right\rangle }}\\&+{( \gamma ^2 \mu ^2+\gamma ^2\mu L-2 \gamma \mu ) {\left\langle g_\star , x_\star \right\rangle }}+{( \gamma ^2 L-\gamma ^2 \mu ) {\left\| s_{k+1}\right\| ^2}}\\&+{(\gamma \mu -\gamma L ) {\left\langle s_{k+1}, x_k\right\rangle }}+{(\gamma L-\gamma \mu ) {\left\langle s_{k+1}, x_\star \right\rangle }}\\&+{(\gamma \mu L -\gamma ^2 \mu ^2 L) {\left\| x_k\right\| ^2}}+{(2 \gamma ^2 \mu ^2 L-2 \gamma \mu L) {\left\langle x_k, x_\star \right\rangle }}\\&+{( \gamma \mu L-\gamma ^2 \mu ^2 L) {\left\| x_\star \right\| ^2}}\Bigg ). \end{aligned} \end{aligned}$$

\(\square \)

Note that all proofs presented in the symbolic validation code rely on the exact same idea: the equivalence between two expressions is verified by checking that their difference is identically zero.
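As an illustration of this verification (a minimal SymPy sketch, not the authors' validation code), the check below confirms that expression (9) vanishes identically. Since every term is a quadratic form in the five quantities \(x_k\), \(x_\star \), \(g_k\), \(g_\star \) and \(s_{k+1}\), performing the check with scalar variables verifies every coefficient listed above, and hence the identity itself; the weight \(\rho (\gamma )\) is taken equal to \(1-\gamma \mu \), the contraction factor appearing in (7) for the regime \(0\le \gamma \le \frac{2}{L+\mu }\).

```python
import sympy as sp

# Scalar stand-ins for x_k, x_star, g_k, g_star and s_{k+1}; gamma, L, mu > 0.
xk, xs, gk, gs, sk1 = sp.symbols('x_k x_star g_k g_star s_k1', real=True)
gam, L, mu = sp.symbols('gamma L mu', positive=True)

rho = 1 - gam*mu                 # contraction factor of (7) when gamma <= 2/(L+mu)
xk1 = xk - gam*(gk + sk1)        # proximal gradient step
ss = -gs                         # optimality condition: s_star = -g_star

# Right-hand side of expression (8): weighted sum of the two aggregated inequalities.
expr8 = 2*gam*rho*((gk - gs)*(xs - xk) + (gk - gs)**2/L
                   + mu/(1 - mu/L)*(xk - xs - (gk - gs)/L)**2) \
        + 2*gam*(sk1 - ss)*(xs - xk1)

# Inequality (7) rearranged as "0 >= expr7".
expr7 = (xk1 - xs)**2 + gam**2*(gs + sk1)**2 \
        + gam*(2 - gam*(L + mu))/(L - mu)*(mu*(xk - xs) - gk + gs)**2 \
        - (1 - gam*mu)**2*(xk - xs)**2

# Expression (9): the difference must be identically zero.
print(sp.simplify(expr8 - expr7))    # prints 0
```

The proofs in residual gradient norm and in function values can be checked in exactly the same way, with the corresponding weights.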

Appendix B: Details on Lower Bounds for Mixed Performance Measures

In this section, we provide details for obtaining the lower bounds marked with (\(\star \)) in Table 2 (i.e., those that do not come from purely quadratic functions).

First, consider two constants \(0<\mu \le L<\infty \), a parameter \(c\ge 0\) (to be chosen later), and the following one-dimensional quadratic minimization problem

$$\begin{aligned} \min _{x\ge 0} \left( \frac{\mu }{2} x^2+cx\right) . \end{aligned}$$

The function \(f(x)=\frac{\mu }{2} x^2+cx\) is clearly L-smooth and \(\mu \)-strongly convex with unique optimal point \(x_\star =0\) over the nonnegative reals. A corresponding composite problem can be written as \(F(x)=f(x)+i_{\ge 0}(x)\) with \(i_{\ge 0}(.)\) being the indicator function for the nonnegative real half-line.

Second, consider the starting point \(x_0\ge 0\) and a number of iterations \(N\in \mathbb {N}\). Using the proximal gradient method with step size \(\frac{1}{L}\) results in the following rule for the iterates, assuming c is small enough (i.e., such that \(x_k\ge 0\) for all \(0\le k \le N\), so that the projection onto the nonnegative reals remains inactive):

$$\begin{aligned} x_{k+1}&=x_k-\frac{1}{L} \nabla f(x_k),\\&=\left( 1-\frac{\mu }{L}\right) x_k-\frac{c}{L}. \end{aligned}$$

Solving the recurrence equation provides us with the following rule for the iterates:

$$\begin{aligned}x_{k}= {\frac{(1-\kappa )^k (c+\mu x_0)-c}{\mu }}, \end{aligned}$$

with \(\kappa {:=}\frac{\mu }{L}\) the (inverse) condition number, and the corresponding values:

$$\begin{aligned}&\nabla f(x_N)={(1-\kappa )^N (c+\mu x_0)},\\&f(x_N)-f(x_\star )={\frac{(1-\kappa )^{2 N} (c+\mu x_0)^2-c^2}{2 \mu }}. \end{aligned}$$
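The closed-form expression for the iterates and the two values above can be double-checked symbolically; the following is a minimal SymPy sketch of such a sanity check (not part of the paper's derivation).

```python
import sympy as sp

k, N = sp.symbols('k N', integer=True, nonnegative=True)
x0, c, mu, L = sp.symbols('x_0 c mu L', positive=True)
kappa = mu / L

x = ((1 - kappa)**k * (c + mu*x0) - c) / mu      # claimed closed form for x_k

# The closed form satisfies the recurrence x_{k+1} = (1 - kappa) x_k - c/L and matches x_0.
print(sp.simplify(sp.expand(x.subs(k, k + 1) - ((1 - kappa)*x - c/L))))   # 0
print(sp.simplify(x.subs(k, 0) - x0))                                     # 0

# Gradient and objective gap at x_N, with f(x) = mu/2 x^2 + c x and f(x_star) = f(0) = 0.
xN = x.subs(k, N)
print(sp.simplify(mu*xN + c - (1 - kappa)**N * (c + mu*x0)))              # 0
gap = mu/2*xN**2 + c*xN
print(sp.simplify(sp.expand(gap - ((1 - kappa)**(2*N)*(c + mu*x0)**2 - c**2)/(2*mu))))  # 0
```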

By optimizing over c, we get the following extreme cases:

\(\diamond \) (quadratic optimization: maximize \(f(x_N)-f(x_\star )\) with respect to c) \(c=\frac{\mu x_0}{(1-\kappa )^{-2N}-1}\) results in \(F(x_N)-F(x_\star )=\frac{\mu }{2} \frac{{\left\| x_0-x_\star \right\| }^2}{\rho ^{-2N}-1}\),

\(\diamond \) (linear optimization: maximize \(|\nabla f(x_N)|\) with respect to c by enforcing equality in the constraint \(x_N\ge 0\), that is, by imposing \(x_N=0\)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) results in \(||\tilde{\nabla } F(x_N)||^2=\frac{\mu ^2{\left\| x_0-x_\star \right\| }^2}{(\rho ^{-N}-1)^2}\),

\(\diamond \) (linear optimization, as in the previous case, with the result expressed in terms of \(F(x_0)-F_\star \)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) (or equivalently \(c=\frac{\sqrt{2\mu } \sqrt{F(x_0)-F_\star }}{\sqrt{(1-\kappa )^{-2N}-1}}\)) results in \(||\tilde{\nabla } F(x_N)||^2={2\mu \frac{F(x_0)-F_\star }{\rho ^{-2N}-1}}\),

which match the entries marked (\(\star \)) in Table 2.
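As a complementary numerical sanity check (with illustrative parameter values that are not taken from the paper), one can run the proximal gradient method on this one-dimensional example and compare the observed quantities with the closed-form worst-case values above; a short Python sketch:

```python
L, mu, x0, N = 1.0, 0.1, 1.0, 10           # illustrative values only
kappa = mu / L

def prox_grad(c, n_iter):
    """Proximal gradient, step 1/L, on f(x) = mu/2 x^2 + c x with h the indicator of x >= 0."""
    x = x0
    for _ in range(n_iter):
        y = x - (1.0 / L) * (mu * x + c)   # gradient step on f
        x = max(y, 0.0)                    # prox of the indicator = projection onto [0, +inf)
    return x

# First case: c maximizing the objective gap F(x_N) - F(x_star), with F(x_star) = F(0) = 0.
c = mu * x0 / ((1 - kappa) ** (-2 * N) - 1)
xN = prox_grad(c, N)
gap = mu / 2 * xN ** 2 + c * xN
print(gap - mu / 2 * x0 ** 2 / ((1 - kappa) ** (-2 * N) - 1))       # ~ 0 (rounding errors only)

# Second and third cases: c forcing x_N = 0; the gradient step lands exactly at zero,
# so the residual gradient reduces to mu * x_N + c.
c = mu * x0 / ((1 - kappa) ** (-N) - 1)
xN = prox_grad(c, N)
res_sq = (mu * xN + c) ** 2
print(res_sq - mu ** 2 * x0 ** 2 / ((1 - kappa) ** (-N) - 1) ** 2)  # ~ 0
F0 = mu / 2 * x0 ** 2 + c * x0
print(res_sq - 2 * mu * F0 / ((1 - kappa) ** (-2 * N) - 1))         # ~ 0
```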

Cite this article

Taylor, A.B., Hendrickx, J.M. & Glineur, F. Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization. J Optim Theory Appl 178, 455–476 (2018). https://doi.org/10.1007/s10957-018-1298-1
