Abstract
We develop subgradient- and gradient-based methods for minimizing strongly convex functions, where strong convexity is understood in a sense that generalizes the standard Euclidean one. We propose a unifying framework for subgradient methods which yields two kinds of methods, namely, the proximal gradient method (PGM) and the conditional gradient method (CGM), and which subsumes several existing methods. The framework provides tools to analyze the convergence of PGMs and CGMs for non-smooth, (weakly) smooth, and further structured problems such as those with inexact oracle models. The proposed subgradient methods yield optimal PGMs for several classes of problems, and yield optimal and nearly optimal CGMs for smooth and weakly smooth problems, respectively.
Notes
Notice that the function \(\varphi (x):=\psi (x)-\tau d(x)\) satisfies \(\varphi '(y;x-y)=\psi '(y;x-y)-\tau \left\langle \nabla {d}(y),x-y\right\rangle \), \(\forall x,y \in Q\). Hence, the convexity of \(\varphi (x)\) on Q implies \(\varphi (x)\ge \varphi (y)+\varphi '(y;x-y),\forall x,y \in Q\), which is equivalent to (2). Conversely, since \(\psi '(y;x-y)\ge -\psi '(y;y-x)\) holds, and the same holds for \(\varphi (\cdot )\), for all \(x,y \in Q\), (2) implies the two inequalities \(\varphi (y)\ge \varphi (z)+\varphi '(z;y-z)\) and \(\varphi (x)\ge \varphi (z)-\varphi '(z;z-x)\) for \(x,y,z \in Q\). Since \(\varphi '(y;\cdot )\) is positively homogeneous, the convexity of \(\varphi (\cdot )\) on Q follows by taking a convex combination of the two with \(z=\alpha x + (1-\alpha )y,\alpha \in [0,1],x,y \in Q\).
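To make the final combination explicit: with \(z=\alpha x+(1-\alpha )y\) we have \(y-z=\alpha (y-x)\) and \(z-x=(1-\alpha )(y-x)\), so multiplying the two inequalities by \(1-\alpha \) and \(\alpha \), respectively, and using positive homogeneity gives
\[
(1-\alpha )\varphi (y)+\alpha \varphi (x)\;\ge\;\varphi (z)+\alpha (1-\alpha )\varphi '(z;y-x)-\alpha (1-\alpha )\varphi '(z;y-x)\;=\;\varphi \bigl(\alpha x+(1-\alpha )y\bigr),
\]
which is the convexity of \(\varphi \) on Q.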
In fact, since they have the convergence rate \(f(\hat{x}_k)-f(x^*)\le \frac{cL\Vert x_0-x^* \Vert _2^2}{2k^2}\) for a constant \(c>0\), after \(k\ge \sqrt{2cL/\sigma _f}\) iterations we have \(f(\hat{x}_k)-f(x^*)\le \frac{\sigma _f}{4}\Vert x_0-x^* \Vert _2^2\le \frac{1}{2}(f(x_0)-f(x^*))\) by the strong convexity of f and the optimality of \(x^*\). Hence, restarting the method every \(\sqrt{2cL/\sigma _f}\) iterations and repeating this \(O(\log _2(1/\varepsilon ))\) times ensures an \(\varepsilon \)-solution.
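As a rough illustration of this restart argument, here is a minimal sketch (the names `run_accelerated`, `f`, and `f_star` are hypothetical placeholders, not from the paper; in practice one would fix the number of restarts in advance rather than test against \(f(x^*)\)):

```python
import math

def restarted_scheme(run_accelerated, f, f_star, x0, L, sigma_f, eps, c=1.0):
    """Restart wrapper for a method with rate f(x_k) - f* <= c*L*||x0 - x*||^2 / (2 k^2).

    Running k0 = ceil(sqrt(2*c*L/sigma_f)) iterations at least halves the objective
    gap (by strong convexity), so O(log2(1/eps)) restarts give an eps-solution.
    """
    k0 = math.ceil(math.sqrt(2.0 * c * L / sigma_f))
    x = x0
    while f(x) - f_star > eps:       # O(log2(1/eps)) passes in total
        x = run_accelerated(x, k0)   # restart the method from the last iterate
    return x
```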
The auxiliary function \(\varphi _k(x)\) may be an affine function. In that case, we assume the boundedness of Q in order to ensure the existence of a minimizer \(z_k\).
The proof of [16, Theorem 5.3], with the notation \((h(\cdot ),\lambda _{k+1},\tilde{\lambda }_{k+1},L_{k+1},\delta _{k+1},\bar{\alpha }_{k+1},\beta _{k+1},\alpha _k)\) of [16] replaced by \((-f(\cdot ),x_k,z_k,L(x_k),\delta (x_k,x_{k+1}),\tau _k,S_k/\lambda _0,\lambda _k/\lambda _0)\) for \(k \ge 0\), shows the desired estimate, because that proof uses the assumption [16, eq. (52)] with \((L,\delta )=(L_{k+1},\delta _{k+1})\) only at \((\lambda ,\bar{\lambda })=(\lambda _{k+2},\lambda _{k+1})\), which corresponds to our assumption (6) at \((x,y)=(x_k,x_{k+1})\).
As is indicated in [31], an obvious upper bound of \(d(x^*)\) can be obtained if \(\nabla {f}(x^*)=0\) and we know M for the weakly smooth problems [example (iv) in Sect. 2.3.1] in the Euclidean setting \(d(x)=\frac{1}{2}\Vert x-x_0 \Vert _2^2\) : The inequality \(d(x^*) \le \frac{1}{2}(\frac{2M}{\rho \sigma _f})^{2/(2-\rho )}\) follows since we have \(\frac{\sigma _f}{2}\Vert x^*-x_0 \Vert _2^2 \le f(x_0)-f(x^*) \le \frac{M}{\rho }\Vert x_0-x^* \Vert _2^\rho \) [recall the strong convexity and (6)].
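Written out, the rearrangement behind this bound is (for \(\rho <2\)):
\[
\frac{\sigma _f}{2}\Vert x^*-x_0 \Vert _2^2 \le \frac{M}{\rho }\Vert x_0-x^* \Vert _2^{\rho }
\;\Longrightarrow\;
\Vert x^*-x_0 \Vert _2^{2-\rho } \le \frac{2M}{\rho \sigma _f}
\;\Longrightarrow\;
d(x^*)=\frac{1}{2}\Vert x^*-x_0 \Vert _2^2 \le \frac{1}{2}\Bigl(\frac{2M}{\rho \sigma _f}\Bigr)^{2/(2-\rho )}.
\]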
References
Argyriou, A., Signoretto, M., Suykens, J.: Hybrid conditional gradient - smoothing algorithms with applications to sparse and low rank regularization. In: Suykens, J., Argyriou, A., Signoretto, M. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, pp. 53–82. Chapman & Hall/CRC, Boca Raton (2014)
Auslender, A., Teboulle, M.: Interior gradient and proximal method for convex and conic optimization. SIAM J. Optim. 16, 697–725 (2006)
Bach, F.: Duality between subgradient and conditional gradient methods. SIAM J. Optim. 25, 115–129 (2015)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22, 557–580 (2012)
Bregman, L.: The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–217 (1967)
Chen, X., Lin, Q., Peña, J.: Optimal regularized dual averaging methods for stochastic optimization. Adv. Neural Inf. Process. Syst. 25, 395–403 (2012)
Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148, 143–180 (2013)
Demyanov, V.F., Rubinov, A.M.: Approximate Methods in Optimization Problems. American Elsevier Publishing Company, New York (1970)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods with inexact oracle: the strongly convex case. CORE discussion paper, 2013/16 (2013)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146, 37–75 (2014)
Dunn, J., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62, 432–444 (1978)
Elster, K.-H. (ed.): Modern Mathematical Methods in Optimization. Akademie Verlag, Berlin (1993)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, 95–110 (1956)
Freund, R.M., Grigas, P.: New analysis and results for the Frank–Wolfe method. Math. Program. 155, 199–230 (2014)
Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain non-convex minimization problems. Int. J. Syst. Sci. 12, 989–1000 (1981)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM J. Optim. 22, 1469–1492 (2012)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM J. Optim. 23, 2061–2089 (2013)
Guzmán, C., Nemirovski, A.: On lower complexity bounds for large-scale convex optimization. J. Complex. 31, 1–14 (2015)
Harchaoui, Z., Juditsky, A., Nemirovski, A.: Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 152, 75–112 (2015)
Ito, M., Fukuda, M.: A family of subgradient-based methods for convex optimization problems in a unifying framework. Optim. Methods Softw. (to appear)
Jaggi, M.: Sparse convex optimization methods for machine learning, Ph.D. thesis, ETH Zurich (2011)
Jaggi, M.: Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In: Proceedings of the 30th International Conference on Machine Learning, pp. 427–435 (2013)
Juditsky, A., Nesterov, Y.: Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stoch. Syst. 4, 44–80 (2014)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133, 365–397 (2012)
Lan, G.: The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2 (2014)
Lan, G.: Gradient sliding for composite optimization. Math. Program. (to appear)
Nedić, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 223–264. Kluwer Academic Publishers, Dordrecht (2001)
Nedić, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24, 84–107 (2014)
Nemirovski, A., Nesterov, Y.: Optimal methods for smooth convex minimization, Zh. Vychisl. Mat. i Mat. Fiz., 25, 356–369 (1985) (in Russian); English translation: USSR Computational Mathematics and Mathematical Physics, 24, 80–82 (1984)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization, Nauka Publishers, Moscow, Russia (1979) (in Russian). English translation: Wiley, New York (1983)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16, 235–249 (2005)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120, 221–259 (2009)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140, 125–161 (2013)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)
Nesterov, Y.: Complexity bounds for primal-dual methods minimizing the model of objective function. CORE discussion paper, 2015/3 (2015)
Pshenichny, B.N., Danilin, Y.M.: Numerical Methods in Extremal Problems. MIR Publishers, Moscow (1978)
Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization, Technical Report, University of Washington (2008)
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125, 263–295 (2010)
Acknowledgments
The author is very thankful to the anonymous referees, whose constructive suggestions substantially improved the readability of the paper. He is also thankful to Prof. Mituhiro Fukuda for comments and suggestions, and to Prof. Guanghui Lan for pointing out some related results. This work was partially supported by JSPS Grant-in-Aid for Scientific Research (C) Number 26330024.
Appendix
In order to complete the proof of Theorem 5.3, we need to obtain upper bounds for \(1/S_k\) and \(\sum _{i=0}^kS_i/S_k\) for the sequence \(\{S_k\}_{k \ge 0}\) defined by (46). Since \(\lambda _{k+1}=S_{k+1}-S_k\), writing \(r:=\frac{\sigma _f\sigma _d}{L-\bar{\sigma }_f\sigma _d}\ge 0\), the sequence \(\{S_k\}_{k\ge 0}\) in (46) is determined by the recurrence
where \(S_{k+1}\) is taken to be the larger root of the equation in \(S_{k+1}\), namely,
The lemmas below are essentially the same as [11, Lemmas 4–7], except that \(\mu /L\) in that article is replaced by an arbitrary \(r\ge 0\).
Lemma 7.1
For any sequence \(\{S_k\}_{k \ge 0}\) defined by (52) for \(r \ge 0\), we have
Proof
Since \(S_{k+1}\ge S_k\), we have
which shows \(\sqrt{S_k}\ge \frac{k}{2}+\sqrt{S_0}=\frac{k+2}{2}\) for all \(k \ge 0\). Then, we have
which gives \(S_k \ge S_0+\frac{k(k+5)}{4}=\frac{(k+1)(k+4)}{4}\). On the other hand, using (53) yields that
for all \(k \ge 0\). Hence, we have \(S_k\ge S_0\left( \frac{2+r+\sqrt{r^2+4r}}{2}\right) ^k=\left( \frac{2+r+\sqrt{r^2+4r}}{2}\right) ^k\). \(\square \)
Remark 7.2
The linear convergence factor \(\frac{2}{2+r+\sqrt{r^2+4r}}\) in the above lemma satisfies
In fact, since
we obtain
Note that if \(\bar{\sigma }_f=\sigma _f\) and \(r=\frac{\sigma _f\sigma _d}{L-\bar{\sigma }_f\sigma _d}\), then \(\sqrt{\frac{r}{r+1}}=\sqrt{\frac{\sigma _f\sigma _d}{L}}\).
Lemma 7.3
The sequence \(\{S_k\}_{k \ge 0}\) defined by (52) for \(r>0\) satisfies
Proof
Notice that \(\gamma := \frac{1+\sqrt{1+4r^{-1}}}{2}\) satisfies
Therefore, we obtain \(\frac{S_k}{S_{k+1}}\le 1-\frac{1}{\gamma }\) by (55). Now the result follows by induction: if \(\sum _{i=0}^kS_i/S_k\le \gamma \) holds for some \(k \ge 0\), we have
This proves the first inequality; the second can be verified from \(\sqrt{1+4r^{-1}}\le 1+2\sqrt{r^{-1}}\). \(\square \)
Note that the result of Lemma 7.3 is the same as [11, Lemma 5] because \(1+\frac{2\sqrt{r^{-1}}}{\sqrt{r}+\sqrt{r+4}}=\frac{1+\sqrt{1+4r^{-1}}}{2}\).
Lemma 7.4
Let \(\{S_k\}_{k \ge 0}\) be defined as Lemma 7.3 and \(\{T_k\}_{k \ge 0}\) be defined by (52) with \(r:=0\), namely \(T_0:=1\) and \(T_{k+1}:=\frac{1+2T_k+\sqrt{1+4T_k}}{2}\) for \(k \ge 0\). Then, we have
Proof
Due to the identity
it is enough to show that \(\frac{S_k}{S_{k+1}} \le \frac{T_k}{T_{k+1}}\) for every \(k \ge 0\). Notice that we have
which suggests proving \(\frac{1+rS_k}{S_k} \ge \frac{1}{T_k}\) for \(k \ge 0\). It is true for \(k=0\) since \(S_0=T_0\). If it holds for some \(k \ge 0\), then, writing \(\alpha :=\frac{1+rS_k}{S_k} \ge \beta := \frac{1}{T_k}\), we obtain
since \(S_{k+1} \ge S_k\) and \(x \mapsto \frac{2x}{x+2+\sqrt{(x+2)^2-4}}=\frac{2}{1+2x^{-1}+\sqrt{1+4x^{-1}}}\) is non-decreasing on \((0,\infty )\). Hence, \(\frac{1+rS_k}{S_k} \ge \frac{1}{T_k}\) holds for all \(k \ge 0\), which completes the proof. \(\square \)
Lemma 7.5
Let \(\{T_k\}_{k \ge 0}\) be a sequence defined by (52) with \(r:=0\), namely \(T_0:=1\) and \(T_{k+1}:=\frac{1+2T_k+\sqrt{1+4T_k}}{2}\) for \(k \ge 0\). Then, we have
Proof
The case \(k =0\) is obvious. Assume that the assertion is true for some \(k \ge 0\). Putting \(U_k:=\frac{1}{3}k +\frac{1}{6}\log (k+2)+1\), we have
Hence, it remains to show \(1+\frac{T_k}{T_{k+1}}U_k \le U_{k+1}\) for \(k \ge 0\). For that, we analyze the sequence \(t_0:=1,~t_{k+1}:=T_{k+1}-T_k\) for \(k \ge 0\) (namely, \(T_k=\sum _{i=0}^k t_i\)). The recurrence relation of \(T_k\) implies \(t_{k}^2=(T_{k}-T_{k-1})^2=T_{k}\) and
Analyzing the difference \(t_{k+1}-t_k\) shows for \(k \ge 0\) that
Since Lemma 7.1 yields \(t_k=\sqrt{T_k}\ge \sqrt{(k+1)(k+4)/4}\ge (k+2)/2\) for \(k \ge 0\), we obtain
for all \(k \ge 0\). Finally, this upper bound of \(t_k\) concludes that
Taking reciprocals and multiplying both sides by \(U_k\) yields \(1+\frac{T_k}{T_{k+1}}U_k \le U_{k+1}\). \(\square \)
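Since the \(r=0\) recurrence is explicit, the bound of Lemma 7.5 (taken here to be \(\sum _{i=0}^{k}T_i/T_k\le U_k=\frac{1}{3}k+\frac{1}{6}\log (k+2)+1\), with \(\log \) the natural logarithm, matching \(U_k\) in the proof) can be checked numerically with a small sketch:

```python
import math

T, partial_sum = 1.0, 0.0
for k in range(200):
    partial_sum += T
    U_k = k / 3.0 + math.log(k + 2) / 6.0 + 1.0
    # Lemma 7.5 bound: sum_{i=0..k} T_i / T_k <= U_k
    assert partial_sum / T <= U_k + 1e-9
    T = (1.0 + 2.0 * T + math.sqrt(1.0 + 4.0 * T)) / 2.0

print("Lemma 7.5 bound verified for k = 0, ..., 199")
```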
Cite this article
Ito, M. New results on subgradient methods for strongly convex optimization problems with a unified analysis. Comput Optim Appl 65, 127–172 (2016). https://doi.org/10.1007/s10589-016-9841-1
Keywords
- Non-smooth/smooth convex optimization
- Structured convex optimization
- Subgradient/gradient-based proximal method
- Conditional gradient method
- Complexity theory
- Strongly convex functions
- Weakly smooth functions