Abstract
In this paper, we develop a composite version of the Mirror Prox algorithm for solving convex–concave saddle point problems and monotone variational inequalities of special structure, allowing us to cover saddle point/variational analogies of what is usually called “composite minimization” (minimizing a sum of an easy-to-handle nonsmooth function and a general-type smooth convex function “as if” there were no nonsmooth component at all). We demonstrate that the composite Mirror Prox inherits the favourable (and unimprovable already in the large-scale bilinear saddle point case) efficiency estimate of its prototype. We further show that the proposed approach can be successfully applied to Lasso-type problems with several penalizing terms (e.g., \(\ell _1\) and nuclear norm regularization acting together) and to problems of semi-separable structure considered in alternating directions methods, implying in both cases methods with \(O(1/t)\) complexity bounds.



Notes
The precise meaning of simplicity and fitting will be specified later. As of now, it suffices to give a couple of examples. When \(\varPsi _k\) is the \(\ell _1\) norm, \(Y_k\) can be the entire space, or the centered at the origin \(\ell _p\)-ball, \(1\le p\le 2\); when \(\varPsi _k\) is the nuclear norm, \(Y_k\) can be the entire space, or the centered at the origin Frobenius/nuclear norm ball.
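As a side illustration (not part of the paper's development, and using the plain Euclidean proximal setup rather than the prox-mappings introduced later), the “simplicity” of these two norms can be seen from the fact that their Euclidean proximal operators are available in closed form, via soft-thresholding and singular value shrinkage respectively; a minimal sketch in Python/NumPy:

import numpy as np

def prox_l1(y, gamma):
    # soft-thresholding: argmin_x 0.5*||x - y||_2^2 + gamma*||x||_1
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def prox_nuclear(Y, gamma):
    # singular value thresholding: argmin_X 0.5*||X - Y||_F^2 + gamma*||X||_nuc
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt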
Our exposition follows.
With our implementation, we run this test for both search points and approximate solutions generated by the algorithm.
Note that the latter relation implies that what was denoted by \(\widetilde{\varPhi }\) in Proposition 2 is nothing but \(\overline{\varPhi }\).
If the goal of solving (56) were to recover \(y_{\#}\), our \({\lambda }\) and \(\mu \) would, perhaps, be too large. Our goal, however, was solving (56) as an “optimization beast,” and we were interested in “meaningful” contribution of \(\varPsi _0\) and \(\varPsi _1\) to the objective of the problem, and thus in not too small \({\lambda }\) and \(\mu \).
Recall that we do not expect linear convergence, just \(O(1/t)\) one.
Note that in a more complicated matrix recovery problem, where noisy linear combinations of the matrix entries rather than just some of these entries are observed, applying ADMM becomes somewhat problematic, while the proposed algorithm still is applicable “as is.”
In what follows, we call a collection \(a_{s,t}\) of reals nonincreasing in time if \(a_{s',t'}\le a_{s,t}\) whenever \(s'>s\), as well as whenever \(s'=s\) and \(t'\ge t\). “Nondecreasing in time” is defined similarly.
We assume w.l.o.g. that \(|\underline{{\hbox {Opt}}}_{s,t}|\le L\).
References
Andersen, E. D., Andersen, K. D.: The MOSEK optimization tools manual. http://www.mosek.com/fileadmin/products/6_0/tools/doc/pdf/tools.pdf
Aujol, J.-F., Chambolle, A.: Dual norms and image decomposition models. Int. J. Comput. Vis. 63(1), 85–104 (2005)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)
Buades, A., Coll, B., Morel, J.-M.: A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4(2), 490–530 (2005)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Deng, W., Lai, M.-J., Peng, Z., Yin, W.: Parallel multi-block ADMM with \(O(1/k)\) convergence. http://www.optimization-online.org/DB_HTML/2014/03/4282.html (2013)
Goldfarb, D., Ma, S.: Fast multiple-splitting algorithms for convex optimization. SIAM J. Optim. 22(2), 533–556 (2012)
Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1–2), 349–382 (2013)
Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx (2013)
Juditsky, A., Nemirovski, A.: First-order methods for nonsmooth large-scale convex minimization: I. General purpose methods; II. Utilizing problem's structure. In: Sra, S., Nowozin, S., Wright, S. (eds.) Optimization for Machine Learning, pp. 121–183. The MIT Press (2011)
Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)
Monteiro, R.D., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)
Nemirovski, A., Rubinstein, R.: An efficient stochastic approximation algorithm for stochastic saddle point problems. In: Dror, M., L’Ecuyer, P., Szidarovszky, F. (eds.) Modeling Uncertainty and Examination of Stochastic Theory, Methods, and Applications, pp. 155–184. Kluwer Academic Publishers, Boston (2002)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Orabona, F., Argyriou, A., Srebro, N.: PRISMA: proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372 (2012)
Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E. Jr.: An accelerated linearized alternating direction method of multipliers, arXiv:1401.6607 (2014)
Qin, Z., Goldfarb, D.: Structured sparsity via alternating direction methods. J. Mach. Learn. Res. 13, 1373–1406 (2012)
Scheinberg, K., Goldfarb, D., Bai, X.: Fast first-order methods for composite convex optimization with backtracking. http://www.optimization-online.org/DB_FILE/2011/04/3004.pdf (2011)
Tseng, P.: Alternating projection-proximal methods for convex programming and variational inequalities. SIAM J. Optim. 7(4), 951–965 (1997)
Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. SIAM J. Optim. (2008, submitted)
Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Program. Comput. 2(3–4), 203–230 (2010)
Acknowledgments
Research of the first and the third authors was supported by the NSF Grant CMMI-1232623. Research of the second author was supported by the CNRS-Mastodons Project GARGANTUA, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025).
Appendices
Appendix 1: Proof of Theorem 1
\(0^{\circ }\). Let us verify that the prox-mapping (28) is indeed well defined whenever \(\zeta =\gamma F_v\) with \(\gamma >0\). All we need is to show that whenever \(u\in U\), \(\eta \in E_u\), \(\gamma >0\) and \([w_t;s_t]\in X\), \(t=1,2, \ldots \), are such that \(\Vert w_t\Vert _2+\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we have
Indeed, assuming the opposite and passing to a subsequence, we may assume that the sequence \(\{r_t\}\) is bounded. Since \(\omega (\cdot )\) is strongly convex, modulus 1, w.r.t. \(\Vert \cdot \Vert \), and the linear function \(\langle F_v,s\rangle \) of \([w;s]\) is bounded below on \(X\) by A4, boundedness of the sequence \(\{r_t\}\) implies boundedness of the sequence \(\{w_t\}\), and since \(\Vert [w_t;s_t]\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we get \(\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \). Since \(\langle F_v,s\rangle \) is coercive in \(s\) on \(X\) by A4, and \(\gamma >0\), we conclude that \(b_t\rightarrow \infty \) as \(t\rightarrow \infty \), while the sequence \(\{a_t\}\) is bounded, since the sequence \(\{w_t\}\subset U\) is bounded and \(\omega \) is continuously differentiable. Thus, \(\{a_t\}\) is bounded and \(b_t\rightarrow \infty \) as \(t\rightarrow \infty \), implying that \(r_t\rightarrow \infty \) as \(t\rightarrow \infty \), which is the desired contradiction.
\(1^{\circ }\). Recall the well-known identity [9]: for all \(u,u',w\in U\) one has
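(The display referred to here is not reproduced above; for the reader's convenience, the identity of [9] that we believe is meant, stated for the Bregman distance \(V_u(w)=\omega (w)-\omega (u)-\langle \omega '(u),w-u\rangle \), reads
\[
\langle \omega '(u')-\omega '(u),\,w-u'\rangle \;=\;V_u(w)-V_{u'}(w)-V_u(u'),
\]
which is verified by direct substitution of the definition of \(V\).)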
Indeed, the right hand side is
For \(x=[u;v]\in X,\;\xi =[\eta ;\zeta ]\), let \(P_x(\xi )=[u';v']\in X\). By the optimality condition for the problem (28), for all \([s;w]\in X\),
which by (78) implies that
\(2^{\circ }\). When applying (79) with \([u;v]=[u_\tau ;v_\tau ]=x_\tau \), \(\xi =\gamma _\tau F(x_\tau )=[\gamma _\tau F_u(u_\tau );\gamma _\tau F_v]\), \([u';v']=[u'_\tau ;v'_\tau ]=y_\tau \), and \([s;w]=[u_{\tau +1};v_{\tau +1}]=x_{\tau +1}\) we obtain:
and applying (79) with \([u;v]=x_\tau \), \(\xi =\gamma _\tau F(y_\tau )\), \([u';v']=x_{\tau +1}\), and \([s;w]=z\in X\) we get:
Adding (81) to (80) we obtain for every \(z=[s;w]\in X\)
Due to the strong convexity, modulus 1, of \(V_u(\cdot )\) w.r.t. \(\Vert \cdot \Vert \), \(V_u(u')\ge {1\over 2}\Vert u-u'\Vert ^2\) for all \(u,u'\). Therefore,
where the last inequality is due to (23). Note that \(\gamma _\tau L<1\) implies that
Let us assume that the stepsizes \(\gamma _\tau >0\) ensure that (30) holds, meaning that \( \delta _\tau \le \gamma _\tau ^2M^2\) (which, by the above analysis, is definitely the case when \(0<\gamma _\tau \le {1\over \sqrt{2}L}\); when \(M=0\), we can also take \(\gamma _\tau \le {1\over L}\)). Summing up inequalities (82) over \(\tau =1,2, \ldots ,t\) and taking into account that \(V_{u_{t+1}}(s)\ge 0\), we conclude that for all \(z=[s;w]\in X\),
\(\square \)
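To make the extragradient recursion analyzed above concrete, here is a minimal sketch (our illustration only, not the paper's composite algorithm: we restrict to the Euclidean setup \(\omega (x)={1\over 2}\Vert x\Vert _2^2\) on a convex set \(X\), a Lipschitz monotone operator \(F\) with constant \(L\), and a hypothetical user-supplied Euclidean projection routine project_X) of a basic Mirror Prox step with the stepsize restriction \(\gamma \le {1\over \sqrt{2}L}\) used in the proof:

import numpy as np

def mirror_prox(F, project_X, x0, L, n_iters):
    # Euclidean Mirror Prox (extragradient) sketch for a monotone operator F;
    # project_X is assumed to compute the Euclidean projection onto X.
    gamma = 1.0 / (np.sqrt(2.0) * L)      # stepsize obeying gamma <= 1/(sqrt(2)*L)
    x = np.asarray(x0, dtype=float)
    y_sum = np.zeros_like(x)
    for _ in range(n_iters):
        y = project_X(x - gamma * F(x))   # extrapolation: y_tau = P_{x_tau}(gamma*F(x_tau))
        x = project_X(x - gamma * F(y))   # update:        x_{tau+1} = P_{x_tau}(gamma*F(y_tau))
        y_sum += y
    return y_sum / n_iters                # approximate solution: average of the y_tau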
Appendix 2: Proof of Lemma 1
Proof
All we need to verify is the second inequality in (38). To this end note that when \(t=1\), the inequality in (38) holds true by definition of \(\widehat{\varTheta }(\cdot )\). Now let \(1<t\le N+1\). Summing up the inequalities (82) over \(\tau =1, \ldots ,t-1\), we get for every \(x=[u;v]\in X\):
(we have used (30)). When \([u;v]\) is \(z_*\), the left hand side in the resulting inequality is \(\ge 0\), and we arrive at
hence
hence also
and therefore
and (38) follows. \(\square \)
Appendix 3: Proof of Proposition 3
Proof
From (82) and (30) it follows that
Summing up these inequalities over \(\tau =1, \ldots ,N\), we get \(\forall (x=[u;v]\in X)\):
where the concluding inequality is due to (38), and (40) follows.\(\square \)
Appendix 4: Proof of Proposition 5
\(1^{\circ }\). The functions \(h_{s,t}(\alpha )\) are concave and piecewise linear on \([0,1]\), and clearly are pointwise nonincreasing in time. As a result, \({\hbox {Gap}}(s,t)\) is nonincreasing in time. Further, we have
where the coefficients \(\lambda ^*_{pq}\ge 0\) sum up to 1. Recalling that for every \((p,q)\in Q_{s,t}\) we have at our disposal \(y_{pq}\in Y\) such that \(p\ge f(y_{pq})\) and \(q\ge g(y_{pq})\), setting \(\widehat{y}^{s,t}=\sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}y_{pq}\) and invoking convexity of \(f,g\), we get
and (69) follows, due to \(\underline{{\hbox {Opt}}}_{s,t}\le {\hbox {Opt}}\).
\(2^{\circ }\). We have \(\overline{f}_s^t=\alpha _sf(y^{s,t}) +(1-\alpha _s)g(y^{s,t})\) for some \(y^{s,t}\in Y\) which we have at our disposal at step \(t\), implying that \((\widehat{p}=f(y^{s,t}),\widehat{q}=g(y^{s,t}))\in Q_{s,t}\). Hence by definition of \(h_{s,t}(\cdot )\) it holds
where the concluding inequality is given by (67). Thus, \(h_{s,t}(\alpha _s)\le \overline{f}_s^t-\underline{f}_{s,t} \le \epsilon _t\). On the other hand, if stage \(s\) does not terminate in the course of the first \(t\) steps, \(\alpha _s\) is well-centered in the segment \(\varDelta _{s,t}\) where the concave function \(h_{s,t}(\alpha )\) is nonnegative. We conclude that \(0\le {\hbox {Gap}}(s,t)=\max _{0\le \alpha \le 1}h_{s,t}(\alpha )=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3h_{s,t}(\alpha _s)\). Thus, if a stage \(s\) does not terminate in the course of the first \(t\) steps, we have \({\hbox {Gap}}(s,t)\le 3\epsilon _t\), which implies (71). Further, \(\alpha _s\) is the midpoint of the segment \(\varDelta ^{s-1}=\varDelta _{s-1,t_{s-1}}\), where \(t_r\) is the last step of stage \(r\) (when \(s=1\), we define \(\varDelta ^0\) as \([0,1]\)), and \(\alpha _s\) is not well-centered in the segment \(\varDelta ^s=\varDelta _{s,t_s}\subset \varDelta _{s-1,t_{s-1}}\), which clearly implies that \(|\varDelta ^s|\le {3\over 4}|\varDelta ^{s-1}|\). Thus, \(|\varDelta ^s|\le \left({3\over 4}\right)^s\) for all \(s\). On the other hand, when \(|\varDelta _{s,t}|<1\), we have \({\hbox {Gap}}(s,t)=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3L |\varDelta _{s,t}|\) (since \(h_{s,t}(\cdot )\) is Lipschitz continuous with constant \(3L\), see footnote 10, and \(h_{s,t}(\cdot )\) vanishes at (at least) one endpoint of \(\varDelta _{s,t}\)). Thus, the number of stages before \({\hbox {Gap}}(s,t)\le \epsilon \) is reached indeed obeys the bound (70).\(\square \)
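(For completeness, the elementary computation behind the last claim, in our paraphrase: since \(\varDelta _{s,t}\subset \varDelta ^{s-1}\) and \(|\varDelta ^{s-1}|\le \left({3\over 4}\right)^{s-1}\), the bound \({\hbox {Gap}}(s,t)\le 3L|\varDelta _{s,t}|\) gives \({\hbox {Gap}}(s,t)\le 3L\left({3\over 4}\right)^{s-1}\), so that \({\hbox {Gap}}(s,t)\le \epsilon \) is guaranteed as soon as \(s\ge 1+{\ln (3L/\epsilon )\over \ln (4/3)}\), i.e., after \(O(1)\ln (3L/\epsilon )\) stages, a bound of the form announced in (70).)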
Keywords
- Numerical algorithms for variational problems
- Composite optimization
- Minimization problems with multi-term penalty
- Proximal methods