
Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization


Abstract

We study the worst-case convergence rates of the proximal gradient method for minimizing the sum of a smooth strongly convex function and a non-smooth convex function, whose proximal operator is available. We establish the exact worst-case convergence rates of the proximal gradient method in this setting for any step size and for different standard performance measures: objective function accuracy, distance to optimality and residual gradient norm. The proof methodology relies on recent developments in performance estimation of first-order methods, based on semidefinite programming. In the case of the proximal gradient method, this methodology allows obtaining exact and non-asymptotic worst-case guarantees that are conceptually very simple, although apparently new. On the way, we discuss how strong convexity can be replaced by weaker assumptions, while preserving the corresponding convergence rates. We also establish that the same fixed step size policy is optimal for all three performance measures. Finally, we extend recent results on the worst-case behavior of gradient descent with exact line search to the proximal case.

Notes

  1. A list of useful analytical proximal operators is available in [13].

  2. Those \(\lambda \)’s were found by identifying an analytical optimal solution to the dual performance estimation problem. That is, each \(\lambda \) can be seen as a Lagrange multiplier for the corresponding inequality. The methodology is explained and illustrated in detail in [14, Section 4.1].

  3. Actually, both regimes are valid for \(\gamma =\frac{2}{L+\mu }\).

  4. The difference between the gradient mapping and the residual gradient norm is simple, but somewhat subtle. The gradient mapping measures \({\left\| \nabla f(x_k)+s_{k+1}\right\| }\), whereas the residual gradient norm measures \({\left\| \nabla f(x_{k+1})+s_{k+1}\right\| }\) with \(s_{k+1}\in \partial h(x_{k+1})\) the subgradient used in the proximal operation.

References

  1. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  2. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)

  3. Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  4. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  5. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)

  6. Zhang, H., Cheng, L.: Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization. Optim. Lett. 9(5), 961–979 (2015)

  7. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 1–39 (2018). https://doi.org/10.1007/s10107-018-1232-1

  8. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  9. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

  10. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

  11. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

  12. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  13. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)

  14. de Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)

  15. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)

  16. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  17. Drori, Y.: Contributions to the Complexity Analysis of Optimization Algorithms. Ph.D. Thesis, Tel-Aviv University (2014)

  18. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  19. Taylor, A.B.: Convex Interpolation and Performance Estimation of First-Order Methods for Convex Optimization. Ph.D. Thesis, Université catholique de Louvain (2017)

  20. Nemirovski, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)

  21. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

  22. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  23. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)

  24. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)

  25. Taylor, A., Hendrickx, J., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: Proceedings of the 56th IEEE Conference on Decision and Control (CDC 2017) (2017)

Acknowledgements

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office, and of the Concerted Research Action (ARC) programme supported by the Federation Wallonia-Brussels (contract ARC 14/19-060). The scientific responsibility rests with its authors. The authors also thank the anonymous referee for his very careful reading and comments that include Remark 2.1.

Author information

Correspondence to Adrien B. Taylor.

Appendices

Appendix A: Details of the Proof of Theorem 3.1

In this section, we provide some details on the verification of the proof of Theorem 3.1 (convergence in distance to optimality). The proofs of Theorem 3.2 (convergence in residual gradient norm) and Theorem 3.3 (convergence in function value) follow the same lines, although the results in function values are technically more involved (we advise the reader to use appropriate computer algebra software to preserve his sanity).

As the proofs for both regimes (small and large step sizes) follow the same lines, we only consider the case \(0\le \gamma \le \frac{2}{L+\mu }\) here.

The goal is to prove that the inequality

$$\begin{aligned} \left( 1-\gamma \mu \right) ^2 {\left\| x_k-x_\star \right\| ^2}\ge & {} {\left\| x_{k+1}-x_\star \right\| ^2} +\gamma ^2 {\left\| g_\star +s_{k+1}\right\| ^2}\nonumber \\&+\frac{\gamma (2-\gamma (L+\mu ))}{L-\mu } {\left\| \mu {(x_k -x_\star )} - g_k+ g_\star \right\| ^2}, \end{aligned}$$
(7)

can be obtained by performing a weighted sum of the following inequalities:

$$\begin{aligned}&\begin{array}{ll} f_\star &{}\ge f_k+{\left\langle g_k, x_\star -x_k\right\rangle }+\frac{1}{2L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{2(1-\mu /L)}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ), \\&\begin{array}{ll} f_k&{}\ge f_\star +{\left\langle g_\star , x_k-x_\star \right\rangle }+\frac{1}{2L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{2(1-\mu /L)}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ),\\&\begin{array}{ll} h_\star \ge h_{k+1}+{\left\langle s_{k+1}, x_\star -x_{k+1}\right\rangle }&\end{array}\quad&:2\gamma ,\\&\begin{array}{ll}h_{k+1}\ge h_\star + {\left\langle s_\star , x_{k+1}-x_\star \right\rangle }&\end{array} \quad&:2\gamma . \end{aligned}$$

For simplicity, let us first sum the previous inequalities two by two:

$$\begin{aligned}&\begin{array}{ll} 0&{}\ge -{\left\langle g_\star -g_k, x_\star -x_k\right\rangle } +\frac{1}{L}{\left\| g_k-g_\star \right\| ^2} \\ &{}\quad +\frac{\mu }{1-\mu /L}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2} \end{array} \quad&:2\gamma \rho (\gamma ), \\&\begin{array}{ll} 0\ge -{\left\langle s_\star -s_{k+1}, x_\star -x_{k+1}\right\rangle }&\end{array}\quad&:2\gamma . \end{aligned}$$

We proceed by showing that (7) can be obtained by reformulation of the following expression:

$$\begin{aligned} 0\ge & {} 2\gamma \rho (\gamma ) \left[ {\left\langle g_k-g_\star , x_\star -x_k\right\rangle } +\frac{1}{L} {\left\| g_k-g_\star \right\| ^2} \right. \nonumber \\&\left. +\frac{\mu }{1-\mu /L} {\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2}\right] \nonumber \\&+\,2\gamma {\left\langle s_{k+1}-s_\star , x_\star -x_{k+1}\right\rangle }. \end{aligned}$$
(8)

To do so, we simply verify that expression (7) minus expression (8) is identically zero; that is:

$$\begin{aligned} 0= & {} 2\gamma \rho (\gamma ) \left[ {\left\langle g_k-g_\star , x_\star -x_k\right\rangle } +\frac{1}{L}{\left\| g_k-g_\star \right\| ^2} \right. \nonumber \\&\left. +\frac{\mu }{1-\mu /L}{\left\| x_k-x_\star -\frac{{1}}{L}(g_k-g_\star )\right\| ^2}\right] \nonumber \\&+\,2\gamma {\left\langle s_{k+1}-s_\star , x_\star -x_{k+1}\right\rangle }\nonumber \\&-\left[ -\left( 1-\gamma \mu \right) ^2 {\left\| x_k-x_\star \right\| ^2}+{\left\| x_{k+1}-x_\star \right\| ^2} +\gamma ^2 {\left\| g_\star +s_{k+1}\right\| ^2}\right] \nonumber \\&-\frac{\gamma (2-\gamma (L+\mu ))}{L-\mu } {\left\| \mu {(x_k -x_\star )} - g_k+ g_\star \right\| ^2}, \end{aligned}$$
(9)

with \(x_{k+1}=x_k-\gamma \left( g_k+s_{k+1}\right) \) and \(s_\star =-g_\star \).

Finally, one can simply verify the equality (9) by expanding (7) and (8), which are both equal to:

$$\begin{aligned} \begin{aligned} 0\ge \frac{2}{L-\mu }&\Bigg ((\gamma -\gamma ^2\mu ) {\left\| g_k\right\| ^2} +{(\gamma ^2\mu +\gamma ^2 L-2 \gamma ){\left\langle g_k, g_\star \right\rangle } }\\&+{( \gamma ^2 L-\gamma ^2\mu ) {\left\langle g_k, s_{k+1}\right\rangle }}+{(\gamma ^2 \mu ^2+\gamma ^2 \mu L-\gamma L -\gamma \mu ) {\left\langle g_k, x_k\right\rangle }}\\&+{(- \gamma ^2 \mu ^2-\gamma ^2\mu L +\gamma L+\gamma \mu ){\left\langle g_k, x_\star \right\rangle }}+{(\gamma -\gamma ^2 \mu ) {\left\| g_\star \right\| ^2} }\\&+{( \gamma ^2 L-\gamma ^2 \mu ){\left\langle g_\star , s_{k+1}\right\rangle }}+{(2\gamma \mu - \gamma ^2 \mu ^2-\gamma ^2 \mu L ){\left\langle g_\star , x_k\right\rangle }}\\&+{( \gamma ^2 \mu ^2+\gamma ^2\mu L-2 \gamma \mu ) {\left\langle g_\star , x_\star \right\rangle }}+{( \gamma ^2 L-\gamma ^2 \mu ) {\left\| s_{k+1}\right\| ^2}}\\&+{(\gamma \mu -\gamma L ) {\left\langle s_{k+1}, x_k\right\rangle }}+{(\gamma L-\gamma \mu ) {\left\langle s_{k+1}, x_\star \right\rangle }}\\&+{(\gamma \mu L -\gamma ^2 \mu ^2 L) {\left\| x_k\right\| ^2}}+{(2 \gamma ^2 \mu ^2 L-2 \gamma \mu L) {\left\langle x_k, x_\star \right\rangle }}\\&+{( \gamma \mu L-\gamma ^2 \mu ^2 L) {\left\| x_\star \right\| ^2}}\Bigg ). \end{aligned} \end{aligned}$$

\(\square \)

Note that all proofs presented in the symbolic validation code rely on the exact same idea: the equivalence between two expressions is verified by checking that their difference is identically zero.
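As an illustration of this verification (a minimal SymPy sketch, not the authors' validation code), the check below confirms that expression (9) vanishes identically. Since every term is a quadratic form in the five quantities \(x_k\), \(x_\star \), \(g_k\), \(g_\star \) and \(s_{k+1}\), performing the check with scalar variables verifies every coefficient listed above, and hence the identity itself; the weight \(\rho (\gamma )\) is taken equal to \(1-\gamma \mu \), the contraction factor appearing in (7) for the regime \(0\le \gamma \le \frac{2}{L+\mu }\).

```python
import sympy as sp

# Scalar stand-ins for x_k, x_star, g_k, g_star and s_{k+1}; gamma, L, mu > 0.
xk, xs, gk, gs, sk1 = sp.symbols('x_k x_star g_k g_star s_k1', real=True)
gam, L, mu = sp.symbols('gamma L mu', positive=True)

rho = 1 - gam*mu                 # contraction factor of (7) when gamma <= 2/(L+mu)
xk1 = xk - gam*(gk + sk1)        # proximal gradient step
ss = -gs                         # optimality condition: s_star = -g_star

# Right-hand side of expression (8): weighted sum of the two aggregated inequalities.
expr8 = 2*gam*rho*((gk - gs)*(xs - xk) + (gk - gs)**2/L
                   + mu/(1 - mu/L)*(xk - xs - (gk - gs)/L)**2) \
        + 2*gam*(sk1 - ss)*(xs - xk1)

# Inequality (7) rearranged as "0 >= expr7".
expr7 = (xk1 - xs)**2 + gam**2*(gs + sk1)**2 \
        + gam*(2 - gam*(L + mu))/(L - mu)*(mu*(xk - xs) - gk + gs)**2 \
        - (1 - gam*mu)**2*(xk - xs)**2

# Expression (9): the difference must be identically zero.
print(sp.simplify(expr8 - expr7))    # prints 0
```

The proofs in residual gradient norm and in function values can be checked in exactly the same way, with the corresponding weights.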

Appendix B: Details on Lower Bounds for Mixed Performance Measures

In this section, we provide details for obtaining the lower bounds marked with (\(\star \)) in Table 2 (i.e., those that do not come from purely quadratic functions).

First, consider two constants \(0<\mu \le L<\infty \), a parameter \(c\ge 0\) (to be chosen later), and the following one-dimensional quadratic minimization problem

$$\begin{aligned} \min _{x\ge 0} \left( \frac{\mu }{2} x^2+cx\right) . \end{aligned}$$

The function \(f(x)=\frac{\mu }{2} x^2+cx\) is clearly L-smooth and \(\mu \)-strongly convex with unique optimal point \(x_\star =0\) over the nonnegative reals. A corresponding composite problem can be written as \(F(x)=f(x)+i_{\ge 0}(x)\) with \(i_{\ge 0}(.)\) being the indicator function for the nonnegative real half-line.

Second, consider the starting point \(x_0\ge 0\) and a number of iterations \(N\in \mathbb {N}\). Using the proximal gradient method with step size \(\frac{1}{L}\) results in the following rule for the iterates, assuming c is small enough (i.e., such that \(x_k\ge 0\) for all \(0\le k \le N\), so that the projection onto the nonnegative reals remains inactive):

$$\begin{aligned} x_{k+1}&=x_k-\frac{1}{L} \nabla f(x_k),\\&=\left( 1-\frac{\mu }{L}\right) x_k-\frac{c}{L}. \end{aligned}$$

Solving the recurrence equation provides us with the following rule for the iterates:

$$\begin{aligned}x_{k}= {\frac{(1-\kappa )^k (c+\mu x_0)-c}{\mu }}, \end{aligned}$$

with \(\kappa {:=}\frac{\mu }{L}\) the (inverse) condition number, and the corresponding values:

$$\begin{aligned}&\nabla f(x_N)={(1-\kappa )^N (c+\mu x_0)},\\&f(x_N)-f(x_\star )={\frac{(1-\kappa )^{2 N} (c+\mu x_0)^2-c^2}{2 \mu }}. \end{aligned}$$
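The closed-form expression for the iterates and the two values above can be double-checked symbolically; the following is a minimal SymPy sketch of such a sanity check (not part of the paper's derivation).

```python
import sympy as sp

k, N = sp.symbols('k N', integer=True, nonnegative=True)
x0, c, mu, L = sp.symbols('x_0 c mu L', positive=True)
kappa = mu / L

x = ((1 - kappa)**k * (c + mu*x0) - c) / mu      # claimed closed form for x_k

# The closed form satisfies the recurrence x_{k+1} = (1 - kappa) x_k - c/L and matches x_0.
print(sp.simplify(sp.expand(x.subs(k, k + 1) - ((1 - kappa)*x - c/L))))   # 0
print(sp.simplify(x.subs(k, 0) - x0))                                     # 0

# Gradient and objective gap at x_N, with f(x) = mu/2 x^2 + c x and f(x_star) = f(0) = 0.
xN = x.subs(k, N)
print(sp.simplify(mu*xN + c - (1 - kappa)**N * (c + mu*x0)))              # 0
gap = mu/2*xN**2 + c*xN
print(sp.simplify(sp.expand(gap - ((1 - kappa)**(2*N)*(c + mu*x0)**2 - c**2)/(2*mu))))  # 0
```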

By optimizing over c, we get the following extreme cases:

\(\diamond \) (quadratic optimization: maximize \(f(x_N)-f(x_\star )\) with respect to c) \(c=\frac{\mu x_0}{(1-\kappa )^{-2N}-1}\) results in \(F(x_N)-F(x_\star )=\frac{\mu }{2} \frac{{\left\| x_0-x_\star \right\| }^2}{\rho ^{-2N}-1}\),

\(\diamond \) (linear optimization: maximize \(|\nabla f(x_N)|\) with respect to c by enforcing equality in the constraint \(x_N\ge 0\), that is, by imposing \(x_N=0\)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) results in \(||\tilde{\nabla } F(x_N)||^2=\frac{\mu ^2{\left\| x_0-x_\star \right\| }^2}{(\rho ^{-N}-1)^2}\),

\(\diamond \) (linear optimization, as in the previous case, with the result expressed in terms of \(F(x_0)-F_\star \)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) (or equivalently \(c=\frac{\sqrt{2\mu } \sqrt{F(x_0)-F_\star }}{\sqrt{(1-\kappa )^{-2N}-1}}\)) results in \(||\tilde{\nabla } F(x_N)||^2={2\mu \frac{F(x_0)-F_\star }{\rho ^{-2N}-1}}\),

which match the entries marked (\(\star \)) in Table 2.
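As a complementary numerical sanity check (with illustrative parameter values that are not taken from the paper), one can run the proximal gradient method on this one-dimensional example and compare the observed quantities with the closed-form worst-case values above; a short Python sketch:

```python
L, mu, x0, N = 1.0, 0.1, 1.0, 10           # illustrative values only
kappa = mu / L

def prox_grad(c, n_iter):
    """Proximal gradient, step 1/L, on f(x) = mu/2 x^2 + c x with h the indicator of x >= 0."""
    x = x0
    for _ in range(n_iter):
        y = x - (1.0 / L) * (mu * x + c)   # gradient step on f
        x = max(y, 0.0)                    # prox of the indicator = projection onto [0, +inf)
    return x

# First case: c maximizing the objective gap F(x_N) - F(x_star), with F(x_star) = F(0) = 0.
c = mu * x0 / ((1 - kappa) ** (-2 * N) - 1)
xN = prox_grad(c, N)
gap = mu / 2 * xN ** 2 + c * xN
print(gap - mu / 2 * x0 ** 2 / ((1 - kappa) ** (-2 * N) - 1))       # ~ 0 (rounding errors only)

# Second and third cases: c forcing x_N = 0; the gradient step lands exactly at zero,
# so the residual gradient reduces to mu * x_N + c.
c = mu * x0 / ((1 - kappa) ** (-N) - 1)
xN = prox_grad(c, N)
res_sq = (mu * xN + c) ** 2
print(res_sq - mu ** 2 * x0 ** 2 / ((1 - kappa) ** (-N) - 1) ** 2)  # ~ 0
F0 = mu / 2 * x0 ** 2 + c * x0
print(res_sq - 2 * mu * F0 / ((1 - kappa) ** (-2 * N) - 1))         # ~ 0
```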

Cite this article

Taylor, A.B., Hendrickx, J.M. & Glineur, F. Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization. J Optim Theory Appl 178, 455–476 (2018). https://doi.org/10.1007/s10957-018-1298-1
