Abstract
We study the worst-case convergence rates of the proximal gradient method for minimizing the sum of a smooth strongly convex function and a non-smooth convex function, whose proximal operator is available. We establish the exact worst-case convergence rates of the proximal gradient method in this setting for any step size and for different standard performance measures: objective function accuracy, distance to optimality and residual gradient norm. The proof methodology relies on recent developments in performance estimation of first-order methods, based on semidefinite programming. In the case of the proximal gradient method, this methodology allows obtaining exact and non-asymptotic worst-case guarantees that are conceptually very simple, although apparently new. On the way, we discuss how strong convexity can be replaced by weaker assumptions, while preserving the corresponding convergence rates. We also establish that the same fixed step size policy is optimal for all three performance measures. Finally, we extend recent results on the worst-case behavior of gradient descent with exact line search to the proximal case.
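For concreteness, the method under study can be sketched in a few lines. The following is a minimal illustrative implementation (our own, not from the paper): the smooth part is a one-dimensional quadratic and the non-smooth part is the \(\ell_1\) norm, whose proximal operator is soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of x -> t*|x| (classical soft-thresholding formula)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox_h, x0, gamma, n_iters=50):
    # Proximal gradient iteration: x_{k+1} = prox_{gamma*h}(x_k - gamma*grad_f(x_k))
    x = float(x0)
    for _ in range(n_iters):
        x = prox_h(x - gamma * grad_f(x), gamma)
    return x

# Example: minimize F(x) = 0.5*(x - 3)^2 + |x|, so L = mu = 1 and gamma = 1/L = 1;
# the minimizer is the soft-thresholding of 3 by 1, i.e. x = 2.
x_final = proximal_gradient(lambda x: x - 3.0, soft_threshold, x0=0.0, gamma=1.0)
```

The example is deliberately strongly convex (here \(\mu = L\)), matching the setting of the paper; the fixed step size \(1/L\) is one of the policies analyzed.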

Notes
A list of useful analytical proximal operators is available in [13].
Those \(\lambda \)’s were found by identifying an analytical optimal solution to the dual performance estimation problem. That is, each \(\lambda \) can be seen as a Lagrange multiplier for the corresponding inequality. The methodology is explained and illustrated in detail in [14, Section 4.1].
Actually, both regimes are valid for \(\gamma =\frac{2}{L+\mu }\).
The difference between the gradient mapping and the residual gradient norm is simple, but somewhat subtle. The gradient mapping measures \({\left\| \nabla f(x_k)+s_{k+1}\right\| }\), whereas the residual gradient norm measures \({\left\| \nabla f(x_{k+1})+s_{k+1}\right\| }\) with \(s_{k+1}\in \partial h(x_{k+1})\) the subgradient used in the proximal operation.
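The distinction can be made concrete on a one-dimensional example (our own illustration; the quadratic, the \(\ell_1\) term, and the step size are arbitrary choices). The subgradient \(s_{k+1}\) selected by the proximal operation is recovered from its optimality condition, and the two quantities are then computed at the same step.

```python
def soft(v, t):
    # Proximal operator of x -> t*|x|
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

# One proximal gradient step on F(x) = (x - 3)^2 + |x| (so L = 2), with gamma = 0.25
gamma = 0.25
grad_f = lambda x: 2.0 * (x - 3.0)

x_k = 0.0
x_next = soft(x_k - gamma * grad_f(x_k), gamma)
# Subgradient of h implicitly chosen by the prox, from its optimality condition:
# (x_k - gamma*grad_f(x_k) - x_next)/gamma is in the subdifferential of h at x_next
s_next = (x_k - x_next) / gamma - grad_f(x_k)

gradient_mapping = abs(grad_f(x_k) + s_next)   # equals |x_k - x_next| / gamma
residual_grad = abs(grad_f(x_next) + s_next)   # gradient evaluated at the NEW point
```

Here the gradient mapping evaluates the smooth gradient at \(x_k\) while the residual gradient norm evaluates it at \(x_{k+1}\); on this example the two values differ (5 versus 2.5), illustrating that they are genuinely distinct performance measures.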
References
Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)
Ryu, E.K., Boyd, S.: A primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)
Zhang, H., Cheng, L.: Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization. Optim. Lett. 9(5), 961–979 (2015)
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 1–39 (2018). https://doi.org/10.1007/s10107-018-1232-1
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)
de Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
Drori, Y.: Contributions to the Complexity Analysis of Optimization Algorithms. Ph.D. Thesis, Tel-Aviv University (2014)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Taylor, A.B.: Convex Interpolation and Performance Estimation of First-Order Methods for Convex Optimization. Ph.D. Thesis, Université catholique de Louvain (2017)
Nemirovski, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Taylor, A., Hendrickx, J., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: Proceedings of the 56th IEEE Conference on Decision and Control (CDC 2017) (2017)
Acknowledgements
This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office, and of the Concerted Research Action (ARC) programme supported by the Federation Wallonia-Brussels (contract ARC 14/19-060). The scientific responsibility rests with its authors. The authors also thank the anonymous referee for a very careful reading and for comments, which include Remark 2.1.
Appendices
Appendix A: Details of the Proof of Theorem 3.1
In this section, we provide some details on the verification of the proof of Theorem 3.1 (convergence in distance to optimality). The proofs of Theorem 3.2 (convergence in residual gradient norm) and Theorem 3.3 (convergence in function value) follow the same lines, although the results in function values are technically more involved (we advise the reader to use appropriate computer algebra software to preserve their sanity).
As the proofs for the two regimes (small and large step sizes) proceed analogously, we only consider the case \(0\le \gamma \le \frac{2}{L+\mu }\) here.
The goal is to prove that the inequality
can be obtained by performing a weighted sum of the following inequalities:
For simplicity, let us first sum the previous inequalities two by two:
We proceed by showing that (7) can be obtained by reformulation of the following expression:
To do so, we simply verify that expression (7) minus expression (8) is identically zero; that is:
with \(x_{k+1}=x_k-\frac{1}{L}\left( g_k+s_{k+1}\right) \) and \(s_\star =-g_\star \).
Finally, one can simply verify the equality (9) by expanding (7) and (8), which are both equal to:
\(\square \)
Note that all proofs presented in the symbolic validation code rely on the exact same idea: the equivalence between each pair of expressions is verified by checking that their difference is identically zero.
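The mechanics of such a symbolic check can be illustrated with a computer algebra system. The snippet below follows the same pattern (substitute the update rule, subtract, expand, check for zero), but uses simple placeholder expressions of our own rather than the paper's expressions (7) and (8), and works with scalars rather than inner products.

```python
import sympy as sp

# Scalar symbols playing the roles of x_k, x_*, g_k, s_{k+1}, and L
x_k, x_s, g_k, s_k1, L = sp.symbols('x_k x_s g_k s_k1 L', positive=True)

# Substitution used in the proof: x_{k+1} = x_k - (g_k + s_{k+1})/L
x_k1 = x_k - (g_k + s_k1) / L

# Placeholder stand-ins for two expressions claimed to be equal: here two
# algebraically equivalent forms of (x_{k+1} - x_*)^2 (the paper's expressions
# are longer weighted sums of interpolation inequalities)
expr_a = (x_k1 - x_s) ** 2
expr_b = ((x_k - x_s) ** 2
          - 2 * (x_k - x_s) * (g_k + s_k1) / L
          + (g_k + s_k1) ** 2 / L ** 2)

# The verification: the difference must expand to the zero polynomial
difference = sp.expand(expr_a - expr_b)
```

Expanding rather than merely simplifying guarantees the check is a pure polynomial identity, which is exactly the "difference is identically zero" criterion used in the validation code.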
Appendix B: Details on Lower Bounds for Mixed Performance Measures
In this section, we provide details for obtaining the lower bounds marked with (\(\star \)) in Table 2 (i.e., those that do not come from purely quadratic functions).
First, consider two constants \(0<\mu \le L<\infty \) and the following one-dimensional quadratic minimization problem
The function \(f(x)=\frac{\mu }{2} x^2+cx\) is clearly L-smooth and \(\mu \)-strongly convex, with unique optimal point \(x_\star =0\) over the nonnegative reals. A corresponding composite problem can be written as \(F(x)=f(x)+i_{\ge 0}(x)\), with \(i_{\ge 0}(\cdot )\) the indicator function of the nonnegative real half-line.
Second, consider the starting point \(x_0\ge 0\) and a number of iterations \(N\in \mathbb {N}\). Using the proximal gradient method with step size \(\frac{1}{L}\) results in the following rule for the iterates, assuming c is small enough (i.e., such that \(x_k\ge 0\) for all \(0\le k \le N\)):
Solving the recurrence equation provides us with the following rule for the iterates:
with \(\kappa {:=}\frac{\mu }{L}\) the (inverse) condition number, and the corresponding values:
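The recursion can be checked numerically. In the sketch below, the proximal gradient step with step size \(1/L\) reduces to a gradient step followed by projection onto the nonnegative half-line; the closed form in the comment is our own solution of the linear recurrence \(x_{k+1}=(1-\kappa )x_k-c/L\), valid as long as c is small enough that all iterates stay nonnegative (so the projection never activates).

```python
# Prox-grad step with gamma = 1/L on f(x) = mu/2*x^2 + c*x plus the indicator
# of x >= 0: x_{k+1} = max(0, x_k - (mu*x_k + c)/L).
# Closed form (our derivation), while iterates remain nonnegative:
#     x_k = (1 - kappa)^k * (x_0 + c/mu) - c/mu,   kappa = mu/L
L, mu = 4.0, 1.0
kappa = mu / L
x0, c, N = 1.0, 0.05, 10   # c chosen small enough that x_k >= 0 for k <= N

x = x0
for _ in range(N):
    x = max(0.0, x - (mu * x + c) / L)

closed_form = (1 - kappa) ** N * (x0 + c / mu) - c / mu
```

With these values the iterates decrease monotonically toward the unconstrained fixed point \(-c/\mu\) but remain positive up to iteration N, so the recursion is exactly linear and matches the closed form.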
By optimizing over c, we get the following extreme cases:
- \(\diamond \) (quadratic optimization: maximize \(f(x_N)-f(x_\star )\) with respect to c) \(c=\frac{\mu x_0}{(1-\kappa )^{-2N}-1}\) results in \(F(x_N)-F(x_\star )=\frac{\mu }{2} \frac{{\left\| x_0-x_\star \right\| }^2}{\rho ^{-2N}-1}\),
- \(\diamond \) (linear optimization: maximize \(|\nabla f(x_N)|\) with respect to c by enforcing equality in the constraint \(x_N\ge 0\), that is, by imposing \(x_N=0\)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) results in \({\left\| \tilde{\nabla } F(x_N)\right\| }^2=\frac{\mu ^2{\left\| x_0-x_\star \right\| }^2}{(\rho ^{-N}-1)^2}\),
- \(\diamond \) (linear optimization: maximize \(|\nabla f(x_N)|\) with respect to c by again imposing \(x_N=0\)) \(c=\frac{\mu x_0}{(1-\kappa )^{-N}-1}\) (or, equivalently, \(c=\frac{\sqrt{2\mu } \sqrt{F(x_0)-F_\star }}{\sqrt{(1-\kappa )^{-2N}-1}}\)) results in \({\left\| \tilde{\nabla } F(x_N)\right\| }^2={2\mu \frac{F(x_0)-F_\star }{\rho ^{-2N}-1}}\),
which match the entries marked (\(\star \)) in Table 2.
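Both constructions can be verified numerically by running the method on the one-dimensional example. The sketch below is ours; it assumes \(\rho =1-\kappa \) for the step size \(1/L\), consistent with the appearance of \((1-\kappa )^{-2N}\) in the optimal c above.

```python
# Check the two (*) constructions on f(x) = mu/2*x^2 + c*x with h the indicator
# of x >= 0 (so F_* = 0 when c > 0). Assumes rho = 1 - kappa for gamma = 1/L.
L, mu, x0, N = 4.0, 1.0, 1.0, 5
kappa = mu / L
rho = 1 - kappa

def run(c, n):
    # n proximal gradient steps with gamma = 1/L, returning x_n and F(x_n)
    x = x0
    for _ in range(n):
        x = max(0.0, x - (mu * x + c) / L)
    return x, 0.5 * mu * x * x + c * x

# Case 1: c maximizing the function-value gap F(x_N) - F(x_*)
c1 = mu * x0 / (rho ** (-2 * N) - 1)
_, F_xN = run(c1, N)
bound1 = 0.5 * mu * x0 ** 2 / (rho ** (-2 * N) - 1)

# Case 2: c enforcing x_N = 0, maximizing the residual gradient norm
c2 = mu * x0 / (rho ** (-N) - 1)
xN, _ = run(c2, N)
grad_sq = (mu * xN + c2) ** 2   # |grad f(x_N)|^2, evaluated at x_N = 0
bound2 = mu ** 2 * x0 ** 2 / (rho ** (-N) - 1) ** 2
```

In both cases the quantity reached by the method matches the corresponding analytical bound up to floating-point rounding, confirming that the constructions are tight.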
Taylor, A.B., Hendrickx, J.M. & Glineur, F. Exact Worst-Case Convergence Rates of the Proximal Gradient Method for Composite Convex Minimization. J Optim Theory Appl 178, 455–476 (2018). https://doi.org/10.1007/s10957-018-1298-1