Abstract
We describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, non-smooth, and strongly convex minimization. The technique builds upon a certain variant of the conjugate gradient method to construct a family of methods such that (a) all methods in the family share the same worst-case guarantee as the base conjugate gradient method, and (b) the family includes a fixed-step first-order method. We demonstrate the effectiveness of the approach by deriving optimal methods for the smooth and non-smooth cases, including new methods that forego knowledge of the problem parameters at the cost of a one-dimensional line search per iteration, and a universal method for the union of these classes that requires a three-dimensional search per iteration. In the strongly convex case, we show how numerical tools can be used to perform the construction, and show that the resulting method offers an improved worst-case bound compared to Nesterov’s celebrated fast gradient method.
References
Arjevani, Y., Shalev-Shwartz, S., Shamir, O.: On lower and upper bounds in smooth and strongly convex optimization. J. Mach. Learn. Res. 17(126), 1–51 (2016)
Beck, A.: Quadratic matrix programming. SIAM J. Optim. 17(4), 1224–1238 (2007)
Beck, A., Drori, Y., Teboulle, M.: A new semidefinite programming relaxation scheme for a class of quadratic matrix problems. Oper. Res. Lett. 40(4), 298–302 (2012)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bubeck, S., Lee, Y.T., Singh, M.: A geometric alternative to Nesterov’s accelerated gradient descent (2015). arXiv preprint arXiv:1506.08187
De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)
Devolder, O., Glineur, F., Nesterov, Y.: Intermediate gradient methods for smooth convex problems with inexact oracle. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), Technical report (2013)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Diehl, M., Ferreau, H.J., Haverbeke, N.: Efficient numerical methods for nonlinear MPC and moving horizon estimation. Nonlinear Model Predict. Control 384, 391–417 (2009)
Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. thesis, Tel-Aviv University (2014)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Drori, Y., Teboulle, M.: An optimal variant of Kelley’s cutting-plane method. Math. Program. 160(1–2), 321–351 (2016)
Drusvyatskiy, D., Fazel, M., Roy, S.: An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)
Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM J. Optim. 28(3), 2654–2689 (2018)
Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming. version 2.0 beta. http://cvxr.com/cvx (2013)
Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bureau Stand. 49(6), 409–436 (1952)
Hu, B., Lessard, L.: Dissipativity theory for Nesterov’s accelerated method. In: International Conference on Machine Learning (ICML), pp. 1549–1557 (2017)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)
Karimi, S., Vavasis, S.A.: A unified convergence bound for conjugate gradient and accelerated gradient. (2016). arXiv preprint arXiv:1605.00320
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2663–2671 (2012)
Lemaréchal, C., Sagastizábal, C.: Variable metric bundle methods: from conceptual to implementable forms. Math. Program. 76(3), 393–410 (1997)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)
MOSEK ApS: The MOSEK Optimization Software (2010). http://www.mosek.com
Narkiss, G., Zibulevsky, M.: Sequential subspace optimization method for large-scale unconstrained problems. Technical report, Technion-IIT, Department of Electrical Engineering (2005)
Nemirovski, A.: Orth-method for smooth convex optimization. Izvestia AN SSSR 2, 937–947 (1982). (in Russian)
Nemirovski, A.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovski, A., Yudin, D.: Information-based complexity of mathematical programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika 1 (1983) (in Russian)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(\(1/k^2\)). Sov. Math. Dokl. 27, 372–376 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, London (2004)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
Ruszczyński, A.P.: Nonlinear Optimization, vol. 13. Princeton University Press, Princeton (2006)
Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection (2018). arXiv preprint arXiv:1812.00146
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems (NIPS), pp. 1458–1466 (2011)
Scieur, D., Roulet, V., Bach, F., d’Aspremont, A.: Integration methods and optimization algorithms. In: Advances in Neural Information Processing Systems (NIPS), pp. 1109–1118 (2017)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)
Taylor, A.: Convex interpolation and performance estimation of first-order methods for convex optimization. Ph.D. thesis, Université catholique de Louvain (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: IEEE 56th Annual Conference on Decision and Control (CDC), pp. 1278–1283 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. J. Optim. Theory Appl. 178(2), 455–476 (2018)
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization. (2016). arXiv preprint arXiv:1611.02635
Wright, S.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Wright, S., Nocedal, J.: Numerical Optimization. Springer, New York (1999)
Adrien B. Taylor was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement 724063).
Appendices
Appendix A: Proof of Lemma 1
We start the proof of Lemma 1 with the following technical lemma.
Lemma 5
Let \({\mathcal {F}}\) be a class of contraction-preserving c.c.p. functions (see Definition 3), and let \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) be an \({\mathcal {F}}\)-interpolable set satisfying
then there exists \(\{{\hat{x}}_i\}_{i\in I^*_N}\subset \mathbb {R}^d\) such that the set \({\hat{S}}=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}\)-interpolable, and
Proof
By the orthogonal decomposition theorem, there exist \(\{h_{i,j}\}_{0\le j<i\le N} \subset \mathbb {R}\) and \(\{v_i\}_{0\le i\le N} \subset \mathbb {R}^d\) with \({\left\langle g_k, v_i\right\rangle }=0\) for all \(0\le k<i \le N\) such that
furthermore, there exist \(r_*\in \mathbb {R}^d\) satisfying \({\left\langle r_*, v_j\right\rangle }=0\) for all \(0\le j \le N\) and some \(\{\nu _{j}\}_{0\le j\le N}\subset \mathbb {R}\), such that
By (23) and (24) it then follows that for all \(k\ge i\)
hence, together with the definition of \(v_i\), we get
Let us now choose \(\{{\hat{x}}_i\}_{i\in I^*_N}\) as follows:
It follows immediately from this definition that (26) holds; it thus remains to show that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable and that (25) holds.
In order to establish that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable, by Definition 3 it is enough to show that the conditions in (4) are satisfied. This is indeed the case, as \({\left\langle g_j, {\hat{x}}_i - {\hat{x}}_0\right\rangle }={\left\langle g_j, x_i-x_0\right\rangle }\) follows directly from the definition of \(\{{\hat{x}}_i\}\) and (27), whereas \({\left||{\hat{x}}_i - {\hat{x}}_j\right||}\le {\left||x_i-x_j\right||}\) in the case \(i,j\ne *\) follows from
and in the case \(j=*\), follows from
where for the second equality we used \({\left\langle v_i, r_*\right\rangle }=0\). The last inequality also establishes (25), which completes the proof. \(\square \)
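As a reading aid (and as our own paraphrase rather than a verbatim restatement of the displays (23)–(24) referenced above), the orthogonal decompositions invoked at the beginning of this proof take the form
$$\begin{aligned} x_i - x_0&= \sum _{j=0}^{i-1} h_{i,j}\, g_j + v_i, \qquad 0\le i\le N,\\ x_* - x_0&= \sum _{j=0}^{N} \nu _{j}\, v_j + r_*, \end{aligned}$$
with \({\left\langle g_k, v_i\right\rangle }=0\) for all \(0\le k<i\le N\) and \({\left\langle r_*, v_j\right\rangle }=0\) for all \(0\le j\le N\), consistent with the index sets and orthogonality relations stated in the proof.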
Proof of Lemma 1
By the first-order necessary and sufficient optimality conditions (see, e.g., [42, Theorem 3.5]), the iterates \(x_i\) and subgradients \(f'(x_i)\) defined in (5) and (6) can be equivalently characterized as a solution to the problem of finding \(x_i\in \mathbb {R}^d\) and \(f'(x_i)\in \partial f(x_i)\) (\(0\le i\le N\)) that satisfy:
hence the problem (PEP) can be equivalently expressed as follows:
Now, since all constraints in (28) depend only on the first-order information of f at \(\{x_i\}_{i\in I^*_N}\), by taking advantage of Definition 2 we can denote \(f_i:=f(x_i)\) and \(g_i:=f'(x_i)\) and treat these as optimization variables, thereby reaching the following equivalent formulation:
Since (PEP-GFOM) is a relaxation of (29), we get
which establishes the bound (13).
In order to establish the second part of the claim, let \(\varepsilon >0\). We proceed to show that there exists some valid input \((f, x_0)\) for GFOM such that \(f(\mathrm {GFOM}_N(f, x_0)) - f_*\ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \).
Indeed, by the definition of (PEP-GFOM), there exists a set \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) that satisfies the constraints in (PEP-GFOM) and reaches an objective value \(f_N-f_* \ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \). Since S satisfies the requirements of Lemma 5 [as these requirements are constraints in (PEP-GFOM)], there exists a set of vectors \(\{{\hat{x}}_i\}_{i\in I^*_N}\) for which
hold, and in addition, \({\hat{S}}:=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable. By definition of an \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable set, it follows that there exists a function \({\hat{f}}\in {\mathcal {F}}(\mathbb {R}^d)\) such that \({\hat{f}}({\hat{x}}_i) = f_i\), \(g_i \in \partial {\hat{f}}({\hat{x}}_i)\), hence satisfying
Furthermore, since \(g_*=0\) we have that \({\hat{x}}_*\) is an optimal solution of \({\hat{f}}\).
We conclude that the sequence \({\hat{x}}_0, \dots , {\hat{x}}_N\) forms a valid execution of GFOM on the input \(({\hat{f}}, {\hat{x}}_0)\), that the requirement \({\left||{\hat{x}}_0 - {\hat{x}}_*\right||}\le R_x\) is satisfied, and that the output of the method, \({\hat{x}}_N\), attains the absolute inaccuracy value of \({\hat{f}}({\hat{x}}_N) -{\hat{f}}({\hat{x}}_*) = f_N - f_* \ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \). \(\square \)
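To make the Gram-matrix (SDP) lifting used in this proof concrete, the following is a minimal numerical sketch of a performance estimation problem in the spirit of (sdp-PEP-GFOM), written for the simpler case of fixed-step gradient descent on \(L\)-smooth convex functions rather than for GFOM itself. The use of Python/CVXPY, the parameter values, and the helper `ip` are illustrative assumptions on our part and not part of the paper; the interpolation constraints follow the derivations referenced in [13, 50].

```python
import numpy as np
import cvxpy as cp

# Illustrative PEP (not the GFOM PEP itself): worst-case value of
# f(x_N) - f_* for N steps of gradient descent with step 1/L on
# L-smooth convex functions, subject to ||x_0 - x_*|| <= R.
N, L, R = 3, 1.0, 1.0

dim = N + 2                              # Gram basis: x_0 - x_*, g_0, ..., g_N
G = cp.Variable((dim, dim), PSD=True)    # Gram matrix of the basis vectors
F = cp.Variable(N + 1)                   # F[i] = f(x_i) - f(x_*)

# Coordinates of x_i - x_* and g_i in the Gram basis.
xs = [np.eye(dim)[0]]
gs = [np.eye(dim)[1 + i] for i in range(N + 1)]
for i in range(N):
    xs.append(xs[i] - (1.0 / L) * gs[i])     # fixed-step gradient iteration
xs.append(np.zeros(dim))                     # x_* - x_* = 0
gs.append(np.zeros(dim))                     # g_* = 0
Fs = [F[i] for i in range(N + 1)] + [0]      # f_* = 0

def ip(u, v):
    """Inner product <U, V> of lifted vectors, expressed through G."""
    return cp.sum(cp.multiply(np.outer(u, v), G))

constraints = [ip(xs[0], xs[0]) <= R ** 2]   # initial distance condition
for i in range(N + 2):
    for j in range(N + 2):
        if i != j:   # smooth convex interpolation conditions (cf. [13, 50])
            constraints.append(
                Fs[i] >= Fs[j] + ip(gs[j], xs[i] - xs[j])
                + ip(gs[i] - gs[j], gs[i] - gs[j]) / (2 * L))

prob = cp.Problem(cp.Maximize(F[N]), constraints)
prob.solve()         # requires an SDP-capable solver, e.g. SCS or MOSEK
print(prob.value)    # close to L * R**2 / (4*N + 2), the bound of [13]
```

In (sdp-PEP-GFOM), the fixed-step update above is instead replaced by the orthogonality constraints \({\left\langle g_i, g_j\right\rangle }=0\) and \({\left\langle g_i, x_j-x_0\right\rangle }=0\), with the differences \(x_i-x_0\) included alongside the gradients in the Gram basis, as in the proof of Lemma 6 below.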
Appendix B: Proof of Theorem 3
Lemma 6
Suppose there exists a pair \((f,x_0)\) such that \(f\in {\mathcal {F}}\), \({\left||x_0-x_*\right||}\le R_x\), and \(\mathrm {GFOM}_{2N+1}(f, x_0)\) is not optimal for f. Then (sdp-PEP-GFOM) satisfies Slater’s condition; in particular, no duality gap occurs between the primal-dual pair (sdp-PEP-GFOM), (dual-PEP-GFOM), and the dual optimal value is attained.
Proof
Let \((f,x_0)\) be a pair satisfying the premise of the lemma, denote by \(\{x_i\}_{i\ge 0}\) the sequence generated by GFOM, and denote by \(\{f'(x_i)\}_{i\ge 0}\) the subgradients chosen at each iteration of the method. By the assumption that the optimal value is not attained after \(2N+1\) iterations, we have \(f(x_{2N+1})>f_*\).
We show that the set \(\{({\tilde{x}}_i,{\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) with
corresponds to a Slater point for (sdp-PEP-GFOM).
In order to proceed, we consider the Gram matrix \({\tilde{G}}\) and the vector \({\tilde{F}}\) constructed from the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) as in Sect. 3.2. We then continue in two steps:
(i) we show that \(({\tilde{G}}, {\tilde{F}})\) is feasible for (sdp-PEP-GFOM);

(ii) we show that \({\tilde{G}}\succ 0\).

The proofs follow.

(i) First, we note that the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) satisfies the interpolation conditions for \({\mathcal {F}}\), as it was obtained by taking the values and gradients of a function in \({\mathcal {F}}\). Furthermore, since \({\tilde{x}}_0 = x_0\) and \({\tilde{x}}_*=x_*\) we also get that the initial condition \({\left||{\tilde{x}}_0-{\tilde{x}}_*\right||}\le R_x\) is respected, and since \(\{x_i\}\) correspond to the iterates of GFOM, we also have by Lemma 5 that
$$\begin{aligned}&{\left\langle {\tilde{g}}_i, {\tilde{g}}_j\right\rangle }= 0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \\&{\left\langle {\tilde{g}}_i, {\tilde{x}}_j-{\tilde{x}}_0\right\rangle }= 0, \quad \text {for all } 1\le j \le i=1,\ldots ,N. \end{aligned}$$It then follows from the construction of \({\tilde{G}}\) and \({\tilde{F}}\) and from (10) that \({\tilde{G}}\) and \({\tilde{F}}\) satisfy the constraints of (sdp-PEP-GFOM).
(ii) In order to establish that \({\tilde{G}}\succ 0\), it suffices to show that the vectors
$$\begin{aligned} \{{\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots ,{\tilde{x}}_N- {\tilde{x}}_0 ; {\tilde{x}}_*- {\tilde{x}}_0 \} \end{aligned}$$are linearly independent. Indeed, this follows from Lemma 5, since these vectors are all non-zero, and since \({\tilde{x}}_*\) does not fall in the linear space spanned by \({\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots , {\tilde{x}}_N- {\tilde{x}}_0\) (as otherwise \(x_{2N+1}\) would be an optimal solution).
We conclude that \(({\tilde{G}}, {\tilde{F}})\) forms a Slater point for (sdp-PEP-GFOM).\(\square \)
Proof of Theorem 3
The bound follows directly from
established by Lemmas 1 and 2. The tightness claim follows from the tightness claims of Lemmas 1, 2 and 6. \(\square \)
Appendix C: Proof of Theorem 4
We begin the proof of Theorem 4 by recalling a well-known lemma on constraint aggregation, showing that it is possible to aggregate the constraints of a minimization problem while keeping the optimal value of the resulting program bounded from below.
Lemma 7
Consider the problem
where \(f:\mathbb {R}^d\rightarrow \mathbb {R}\), \(h:\mathbb {R}^d\rightarrow \mathbb {R}^n\), \(g:\mathbb {R}^d\rightarrow \mathbb {R}^m\) are some (not necessarily convex) functions, and suppose \(({\tilde{\alpha }}, {\tilde{\beta }})\in \mathbb {R}^{n}\times \mathbb {R}_+^{m}\) is a feasible point for the Lagrangian dual of (P) that attains the value \({\tilde{\omega }}\). Let \(k\in {\mathbb {N}}\), and let \(M\in \mathbb {R}^{n \times k}\) be a linear map such that \({\tilde{\alpha }} \in \mathrm {range}(M)\). Then the problem
is bounded from below by \({\tilde{\omega }}\).
Proof
Let
be the Lagrangian of the problem (P); then, by the assumption on \(({\tilde{\alpha }}, {\tilde{\beta }})\), we have \( \min _x L(x, {\tilde{\alpha }}, {\tilde{\beta }}) = {\tilde{\omega }}. \) Now, let \(u\in \mathbb {R}^k\) be some vector such that \(Mu = {\tilde{\alpha }}\); then, for every x in the domain of (P\('\)),
where the last inequality follows from the nonnegativity of \({\tilde{\beta }}\). We get
and thus the desired result \(w'\ge {\tilde{\omega }}\) holds. \(\square \)
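In compact form, and under the natural reading of the aggregated program (P\('\)) as \(\min _x \{f(x) : M^T h(x)=0,\; g(x)\le 0\}\) (an assumption on our part, as the display defining (P\('\)) appears in the main text), the argument above can be summarized by the chain
$$\begin{aligned} f(x) \ge f(x) + {\left\langle u, M^T h(x)\right\rangle } + {\left\langle {\tilde{\beta }}, g(x)\right\rangle } = f(x) + {\left\langle {\tilde{\alpha }}, h(x)\right\rangle } + {\left\langle {\tilde{\beta }}, g(x)\right\rangle } = L(x, {\tilde{\alpha }}, {\tilde{\beta }}) \ge {\tilde{\omega }}, \end{aligned}$$
valid for every x feasible for (P\('\)): the first inequality uses \(M^T h(x)=0\), \(g(x)\le 0\) and \({\tilde{\beta }}\ge 0\), the middle equality uses \(Mu={\tilde{\alpha }}\), and the last inequality uses \(\min _x L(x, {\tilde{\alpha }}, {\tilde{\beta }})={\tilde{\omega }}\). Taking the infimum over feasible x then gives \(w'\ge {\tilde{\omega }}\).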
Before proceeding with the proof of the main results, let us first formulate a performance estimation problem for the class of methods described by (14).
Lemma 8
Let \( R_x\ge 0\), let \(\{\beta _{i,j}\}_{1\le i\le N, 0\le j\le i-1}\), \(\{\gamma _{i,j}\}_{1\le i\le N, 1\le j\le i}\) be given sets of real numbers, and let \((f, x_0)\) be any pair such that \(f\in {\mathcal {F}}(\mathbb {R}^d)\) and \({\left||x_0-x_*\right||}\le R_x\) (where \(x_*\in {{\,\mathrm{argmin}\,}}_x f(x)\)). Then for any sequence \(\{x_i\}_{1\le i\le N}\) that satisfies
for some \(f'(x_i)\in \partial f(x_i)\), the following bound holds:
We omit the proof, as it follows exactly the same lines as that of (sdp-PEP-GFOM) (cf. the derivations in [13, 50]).
Proof of Theorem 4
The key observation underlying the proof is that by taking the PEP for GFOM (sdp-PEP-GFOM) and aggregating the constraints that define its iterates, we can reach a PEP for the class of methods (14). Furthermore, by Lemma 7, this aggregation can be done in a way that maintains the optimal value of the program, thereby reaching a specific method in this class whose corresponding PEP attains an optimal value that is at least as good as that of the PEP for GFOM.
We perform the aggregation of the constraints as follows: for all \(i=1,\dots ,N\) we aggregate the constraints which correspond to \(\{\beta _{i,j}\}_{0\le j<i}\), \(\{\gamma _{i,j}\}_{1\le j\le i}\) (weighted by \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\), respectively) into a single constraint, reaching
By Lemma 7 and the choice of weights \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\) it follows that
Finally, by Lemma 8, we conclude that \(w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x)\) forms an upper bound on the performance of the method (14), i.e., for any valid pair \((f, x_0)\) and any \(\{x_i\}_{i\ge 0}\) that satisfies (14) we have
\(\square \)
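For concreteness, and under the notation of Lemma 8 together with the constraint structure displayed in the proof of Lemma 6, the aggregated constraint obtained above for each \(i\) can plausibly be read as requiring the iterates to satisfy (this explicit form is our reading of the construction; the authoritative definition of the method class is (14) in the main text)
$$\begin{aligned} \sum _{j=0}^{i-1} {\tilde{\beta }}_{i,j} {\left\langle f'(x_i), f'(x_j)\right\rangle } + \sum _{j=1}^{i} {\tilde{\gamma }}_{i,j} {\left\langle f'(x_i), x_j-x_0\right\rangle } = 0, \qquad i=1,\ldots ,N, \end{aligned}$$
that is, each iteration enforces a single weighted combination of the GFOM orthogonality conditions \({\left\langle g_i, g_j\right\rangle }=0\) and \({\left\langle g_i, x_j-x_0\right\rangle }=0\), with weights given by the dual multipliers \({\tilde{\beta }}_{i,j}\), \({\tilde{\gamma }}_{i,j}\).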