
Mirror Prox algorithm for multi-term composite minimization and semi-separable problems


In the paper, we develop a composite version of the Mirror Prox algorithm for solving convex–concave saddle point problems and monotone variational inequalities of special structure, which allows us to cover saddle point/variational analogues of what is usually called “composite minimization” (minimizing a sum of an easy-to-handle nonsmooth convex function and a general-type smooth convex function “as if” there were no nonsmooth component at all). We demonstrate that the composite Mirror Prox inherits the favourable (and unimprovable already in the large-scale bilinear saddle point case) efficiency estimate of its prototype. We demonstrate that the proposed approach can be successfully applied to Lasso-type problems with several penalizing terms (e.g. \(\ell _1\) and nuclear norm regularization acting together) and to problems with the semi-separable structures considered in alternating directions methods, yielding in both cases methods with \(O(1/t)\) complexity bounds.




  1. The precise meaning of simplicity and fitting will be specified later. As of now, it suffices to give a couple of examples. When \(\varPsi _k\) is the \(\ell _1\) norm, \(Y_k\) can be the entire space, or the \(\ell _p\)-ball centered at the origin, \(1\le p\le 2\); when \(\varPsi _k\) is the nuclear norm, \(Y_k\) can be the entire space, or the Frobenius/nuclear norm ball centered at the origin.

  2. Our exposition follows.

  3. In principle, these parameters should be chosen to optimize the resulting efficiency estimates; this indeed is doable, provided that we have at our disposal upper bounds on the Lipschitz constants of the components of \(F_u\) and that \(U\) is bounded, see [17, Section 5] or [14, Section 6.3.3].

  4. With our implementation, we run this test for both search points and approximate solutions generated by the algorithm.

  5. Note that the latter relation implies that what was denoted by \(\widetilde{\varPhi }\) in Proposition 2 is nothing but \(\overline{\varPhi }\).

  6. If the goal of solving (56) were to recover \(y_{\#}\), our \({\lambda }\) and \(\mu \) would, perhaps, be too large. Our goal, however, was solving (56) as an “optimization beast,” and we were interested in “meaningful” contribution of \(\varPsi _0\) and \(\varPsi _1\) to the objective of the problem, and thus in not too small \({\lambda }\) and \(\mu \).

  7. Recall that we do not expect linear convergence, just \(O(1/t)\) one.

  8. Note that in a more complicated matrix recovery problem, where noisy linear combinations of the matrix entries rather than just some of these entries are observed, applying ADMM becomes somewhat problematic, while the proposed algorithm is still applicable “as is.”

  9. In what follows, we call a collection \(a_{s,t}\) of reals nonincreasing in time if \(a_{s',t'}\le a_{s,t}\) whenever \(s'> s\), as well as whenever \(s=s'\) and \(t'\ge t\). “Nondecreasing in time” is defined similarly.

  10. We assume w.l.o.g. that \(|\underline{{\hbox {Opt}}}_{s,t}|\le L\).


  1. Andersen, E.D., Andersen, K.D.: The MOSEK optimization tools manual.

  2. Aujol, J.-F., Chambolle, A.: Dual norms and image decomposition models. Int. J. Comput. Vis. 63(1), 85–104 (2005)

  3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  4. Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)

  5. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)

  6. Buades, A., Coll, B., Morel, J.-M.: A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4(2), 490–530 (2005)

  7. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)

  8. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  9. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)

  10. Deng, W., Lai, M.-J., Peng, Z., Yin, W.: Parallel multi-block ADMM with \(O(1/k)\) convergence (2013)

  11. Goldfarb, D., Ma, S.: Fast multiple-splitting algorithms for convex optimization. SIAM J. Optim. 22(2), 533–556 (2012)

  12. Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1–2), 349–382 (2013)

  13. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.0 beta (2013)

  14. Juditsky, A., Nemirovski, A.: First-order methods for nonsmooth large-scale convex minimization: I. General purpose methods; II. Utilizing problem's structure. In: Sra, S., Nowozin, S., Wright, S. (eds.) Optimization for Machine Learning, pp. 121–183. The MIT Press (2011)

  15. Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)

  16. Monteiro, R.D., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)

  17. Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  18. Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)

  19. Nemirovski, A., Rubinstein, R.: An efficient stochastic approximation algorithm for stochastic saddle point problems. In: Dror, M., L’Ecuyer, P., Szidarovszky, F. (eds.) Modeling Uncertainty and Examination of Stochastic Theory, Methods, and Applications, pp. 155–184. Kluwer Academic Publishers, Boston (2002)

  20. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  21. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  22. Orabona, F., Argyriou, A., Srebro, N.: PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372 (2012)

  23. Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E. Jr.: An accelerated linearized alternating direction method of multipliers. arXiv:1401.6607 (2014)

  24. Qin, Z., Goldfarb, D.: Structured sparsity via alternating direction methods. J. Mach. Learn. Res. 13, 1373–1406 (2012)

  25. Scheinberg, K., Goldfarb, D., Bai, X.: Fast first-order methods for composite convex optimization with backtracking (2011)

  26. Tseng, P.: Alternating projection-proximal methods for convex programming and variational inequalities. SIAM J. Optim. 7(4), 951–965 (1997)

  27. Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. SIAM J. Optim. (2008, submitted)

  28. Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Program. Comput. 2(3–4), 203–230 (2010)


Research of the first and the third authors was supported by the NSF Grant CMMI-1232623. Research of the second author was supported by the CNRS-Mastodons Project GARGANTUA, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025).

Author information



Corresponding author

Correspondence to Niao He.


Appendix 1: Proof of Theorem 1

\(0^{\circ }\). Let us verify that the prox-mapping (28) indeed is well defined whenever \(\zeta =\gamma F_v\) with \(\gamma >0\). All we need is to show that whenever \(u\in U\), \(\eta \in E_u\), \(\gamma >0\) and \([w_t;s_t]\in X\), \(t=1,2, \ldots \), are such that \(\Vert w_t\Vert _2+\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we have

$$\begin{aligned} r_t:=\underbrace{\langle \eta -\omega '(u),w_t\rangle +\omega (w_t)}_{a_t}+\underbrace{\gamma \langle F_v,s_t\rangle }_{b_t} \rightarrow \infty ,\,t\rightarrow \infty . \end{aligned}$$

Indeed, assuming the opposite and passing to a subsequence, we make the sequence \(r_t\) bounded. Since \(\omega (\cdot )\) is strongly convex, modulus 1, w.r.t. \(\Vert \cdot \Vert \), and the linear function \(\langle F_v,s\rangle \) of \([w;s]\) is bounded below on \(X\) by A4, boundedness of the sequence \(\{r_t\}\) implies boundedness of the sequence \(\{w_t\}\), and since \(\Vert [w_t;s_t]\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we get \(\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \). Since \(\langle F_v,s\rangle \) is coercive in \(s\) on \(X\) by A4, and \(\gamma >0\), we conclude that \(b_t\rightarrow \infty \), \(t\rightarrow \infty \), while the sequence \(\{a_t\}\) is bounded since the sequence \(\{w_t\in U\}\) is so and \(\omega \) is continuously differentiable. Thus, \(\{a_t\}\) is bounded, \(b_t\rightarrow \infty \), \(t\rightarrow \infty \), implying that \(r_t\rightarrow \infty \), \(t\rightarrow \infty \), which is the desired contradiction.

\(1^{\circ }\). Recall the well-known identity [9]: for all \(u,u',w\in U\) one has

$$\begin{aligned} \langle V'_{u}(u'),w-u'\rangle =V_{u}(w)-V_{u'}(w)-V_{u}(u'). \end{aligned}\tag{78}$$

Indeed, the right hand side is

$$\begin{aligned}&[\omega (w)-\omega (u)-\langle \omega '(u),w-u\rangle ] -\,[\omega (w)-\omega (u')-\langle \omega '(u'),w-u'\rangle ]\\&\qquad -[\omega (u') -\omega (u)-\langle \omega '(u),u'-u\rangle ]\\&\quad =\langle \omega '(u),u-w\rangle +\langle \omega '(u),u'-u\rangle +\langle \omega '(u'),w-u'\rangle \\&\quad =\langle \omega '(u') - \omega '(u),w-u'\rangle =\langle V'_{u}(u'),w-u'\rangle . \end{aligned}$$
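The identity can also be checked numerically. The following is a minimal sketch, assuming the entropy distance-generating function \(\omega (u)=\sum _i u_i\ln u_i\) on positive vectors; this choice of \(\omega \) and all function names are illustrative assumptions, not taken from the paper.

```python
import math
import random


def omega(u):
    """Assumed distance-generating function: entropy sum_i u_i * ln(u_i)."""
    return sum(x * math.log(x) for x in u)


def grad_omega(u):
    """Gradient of the entropy: ln(u_i) + 1."""
    return [math.log(x) + 1.0 for x in u]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bregman(u, w):
    """V_u(w) = omega(w) - omega(u) - <omega'(u), w - u>."""
    diff = [wi - ui for wi, ui in zip(w, u)]
    return omega(w) - omega(u) - dot(grad_omega(u), diff)


def three_point_residual(u, u_prime, w):
    """Left-hand side minus right-hand side of the identity."""
    # V'_u(u') = omega'(u') - omega'(u)
    grad_v = [a - b for a, b in zip(grad_omega(u_prime), grad_omega(u))]
    lhs = dot(grad_v, [wi - ui for wi, ui in zip(w, u_prime)])
    rhs = bregman(u, w) - bregman(u_prime, w) - bregman(u, u_prime)
    return lhs - rhs


rng = random.Random(0)
u, u_prime, w = ([rng.uniform(0.1, 2.0) for _ in range(5)] for _ in range(3))
residual = three_point_residual(u, u_prime, w)
```

Up to rounding, `residual` should vanish for any positive `u`, `u_prime`, `w`, since the identity is purely algebraic and does not use convexity of \(\omega \).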

For \(x=[u;v]\in X,\;\xi =[\eta ;\zeta ]\), let \(P_x(\xi )=[u';v']\in X\). By the optimality condition for the problem (28), for all \([s;w]\in X\),

$$\begin{aligned} \langle \eta +V'_u(u'),u'-s\rangle +\langle \zeta ,v'-w\rangle \le 0, \end{aligned}$$

which by (78) implies that

$$\begin{aligned} \langle \eta ,u'-s\rangle +\langle \zeta ,v'-w\rangle \le \langle V'_u(u'),s-u'\rangle = V_{u}(s)-V_{u'}(s)-V_{u}(u'). \end{aligned}\tag{79}$$

\(2^{\circ }\). When applying (79) with \([u;v]=[u_\tau ;v_\tau ]=x_\tau \), \(\xi =\gamma _\tau F(x_\tau )=[\gamma _\tau F_u(u_\tau );\gamma _\tau F_v]\), \([u';v']=[u'_\tau ;v'_\tau ]=y_\tau \), and \([s;w]=[u_{\tau +1};v_{\tau +1}]=x_{\tau +1}\) we obtain:

$$\begin{aligned} \gamma _\tau [\langle F_u(u_\tau ),u'_\tau -u_{\tau +1}\rangle +\langle F_v,v'_\tau -v_{\tau +1}\rangle ]\le V_{u_\tau }(u_{\tau +1})-V_{u'_\tau }(u_{\tau +1})-V_{u_\tau }(u'_{\tau }); \end{aligned}\tag{80}$$

and applying (79) with \([u;v]=x_\tau \), \(\xi =\gamma _\tau F(y_\tau )\), \([u';v']=x_{\tau +1}\), and \([s;w]=z\in X\) we get:

$$\begin{aligned} \gamma _\tau [\langle F_u(u'_\tau ),u_{\tau +1}-s\rangle +\langle F_v,v_{\tau +1}-w\rangle ]\le V_{u_\tau }(s)-V_{u_{\tau +1}}(s)-V_{u_\tau }(u_{\tau +1}). \end{aligned}\tag{81}$$

Adding (81) to (80) we obtain for every \(z=[s;w]\in X\)

$$\begin{aligned}&\gamma _\tau \langle F(y_\tau ),y_\tau -z\rangle =\gamma _\tau [\langle F_u(u'_\tau ),u'_\tau -s\rangle +\langle F_v,v'_\tau -w\rangle ]\le V_{u_\tau }(s)-V_{u_{\tau +1}}(s)\\&\qquad +\underbrace{\gamma _\tau \langle F_u(u'_\tau )-F_u(u_\tau ), u'_\tau -u_{\tau +1}\rangle -V_{u'_\tau }(u_{\tau +1})-V_{u_\tau }(u'_{\tau }) }_{\delta _\tau }. \end{aligned}\tag{82}$$

Due to the strong convexity, modulus 1, of \(V_u(\cdot )\) w.r.t. \(\Vert \cdot \Vert \), \(V_u(u')\ge {1\over 2}\Vert u-u'\Vert ^2\) for all \(u,u'\). Therefore,

$$\begin{aligned} \delta _\tau&\le \gamma _\tau \Vert F_u(u'_\tau )-F_u(u_\tau )\Vert _*\Vert u'_\tau -u_{\tau +1}\Vert - {\frac{1}{2}}\Vert u'_\tau -u_{\tau +1}\Vert ^2- {\frac{1}{2}}\Vert u_\tau -u'_\tau \Vert ^2\\&\le {\frac{1}{2}}\left[ \gamma _\tau ^2\Vert F_u(u'_\tau )-F_u(u_\tau )\Vert _*^2-\Vert u_\tau -u'_\tau \Vert ^2\right] \\&\le {\frac{1}{2}}\left[ \gamma _\tau ^2[M+L\Vert u'_\tau -u_\tau \Vert ]^2-\Vert u_\tau -u'_\tau \Vert ^2\right] , \end{aligned}$$

where the last inequality is due to (23). Note that \(\gamma _\tau L<1\) implies that

$$\begin{aligned} \gamma _\tau ^2[M+L\Vert u'_\tau -u_\tau \Vert ]^2-\Vert u'_\tau -u_\tau \Vert ^2\le \max _r \left[ \gamma _\tau ^2[M+Lr]^2-r^2\right] ={\gamma _\tau ^2M^2\over 1-\gamma _\tau ^2L^2}. \end{aligned}$$
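For completeness, the maximum over \(r\) can be worked out as follows; the auxiliary symbols \(\phi \) and \(r_*\) are introduced here only for this computation. Since \(\gamma _\tau L<1\), the quadratic \(\phi \) has negative leading coefficient \(\gamma _\tau ^2L^2-1\), so its stationary point is its maximizer:

$$\begin{aligned} \phi (r)&:=\gamma _\tau ^2[M+Lr]^2-r^2,\qquad \phi '(r)=2\gamma _\tau ^2L[M+Lr]-2r=0\;\Longrightarrow \; r_*={\gamma _\tau ^2LM\over 1-\gamma _\tau ^2L^2},\\ M+Lr_*&={M\over 1-\gamma _\tau ^2L^2},\\ \phi (r_*)&={\gamma _\tau ^2M^2-\gamma _\tau ^4L^2M^2\over (1-\gamma _\tau ^2L^2)^2}={\gamma _\tau ^2M^2\over 1-\gamma _\tau ^2L^2}. \end{aligned}$$

In particular, when \(\gamma _\tau \le {1\over \sqrt{2}L}\) we have \(1-\gamma _\tau ^2L^2\ge {1\over 2}\), so that \(\delta _\tau \le {1\over 2}\phi (r_*)\le \gamma _\tau ^2M^2\).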

Let us assume that the stepsizes \(\gamma _\tau >0\) ensure that (30) holds, meaning that \( \delta _\tau \le \gamma _\tau ^2M^2\) (which, by the above analysis, is definitely the case when \(0<\gamma _\tau \le {1\over \sqrt{2}L}\); when \(M=0\), we can take also \(\gamma _\tau \le {1\over L}\)). When summing up inequalities (82) over \(\tau =1,2, \ldots ,t\) and taking into account that \(V_{u_{t+1}}(s)\ge 0\), we conclude that for all \(z=[s;w]\in X\),

$$\begin{aligned} \sum _{\tau =1}^t\lambda ^t_\tau \langle F(y_\tau ),y_\tau -z\rangle&\le {V_{u_1}(s) +\sum _{\tau =1}^t\delta _\tau \over \sum _{\tau =1}^t\gamma _\tau }\le {V_{u_1}(s) +M^2\sum _{\tau =1}^t\gamma _\tau ^2\over \sum _{\tau =1}^t\gamma _\tau },\\ \lambda ^t_\tau&= \gamma _\tau /\sum _{i=1}^t\gamma _i. \end{aligned}$$

\(\square \)
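To make the recurrence behind the proof concrete, here is a minimal sketch of the prototype (non-composite) Mirror Prox in the Euclidean setup, that is, with \(\omega (u)={1\over 2}\Vert u\Vert _2^2\) and no \(v\)-component. The toy saddle point problem \(\min _{x\in [-1,1]}\max _{y\in [-1,1]}xy\), the stepsize and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
def clip(value, lo=-1.0, hi=1.0):
    """Projection of a scalar onto the segment [lo, hi]."""
    return max(lo, min(hi, value))


def operator(z):
    """Monotone operator F(x, y) = (d phi/dx, -d phi/dy) for phi(x, y) = x * y."""
    x, y = z
    return (y, -x)


def prox(center, xi):
    """Euclidean prox-mapping on the box: projection of center - xi."""
    return tuple(clip(c - g) for c, g in zip(center, xi))


def mirror_prox(z1, gamma, steps):
    """Run `steps` iterations and return the average of the search points."""
    center = z1
    total = [0.0, 0.0]
    for _ in range(steps):
        f_center = operator(center)
        # search point y_tau = P_{x_tau}(gamma * F(x_tau))
        search = prox(center, tuple(gamma * g for g in f_center))
        f_search = operator(search)
        # next prox center x_{tau+1} = P_{x_tau}(gamma * F(y_tau))
        center = prox(center, tuple(gamma * g for g in f_search))
        total = [t + s for t, s in zip(total, search)]
    return tuple(t / steps for t in total)


def duality_gap(z):
    """max_y x*y - min_x x*y over the box [-1, 1]^2 equals |x| + |y|."""
    x, y = z
    return abs(x) + abs(y)


gap_200 = duality_gap(mirror_prox((1.0, 1.0), 0.5, 200))
```

In this toy instance \(M=0\), \(L=1\) and \(V_{u_1}(s)\le 4\) on the box, so with the constant stepsize \(\gamma _\tau =0.5\) the final bound of the proof reads \(4/(0.5\,t)=8/t\); the duality gap of the averaged search points should stay below this value.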

Appendix 2: Proof of Lemma 1


All we need to verify is the second inequality in (38). To this end note that when \(t=1\), the inequality in (38) holds true by definition of \(\widehat{\varTheta }(\cdot )\). Now let \(1<t\le N+1\). Summing up the inequalities (82) over \(\tau =1, \ldots ,t-1\), we get for every \(x=[u;v]\in X\):

$$\begin{aligned} \sum _{\tau =1}^{t-1}\gamma _\tau \langle F(y_\tau ),y_\tau -[u;v]\rangle&\le V_{u_1}(u)-V_{u_t}(u)+\sum _{\tau =1}^{t-1}\delta _\tau \\&\le V_{u_1}(u)-V_{u_t}(u)+M^2\sum _{\tau =1}^{t-1}\gamma _\tau ^2 \end{aligned}$$

(we have used (30)). When \([u;v]\) is \(z_*\), the left hand side in the resulting inequality is \(\ge 0\), and we arrive at

$$\begin{aligned} V_{u_t}(u_*)\le V_{u_1}(u_*)+M^2\sum _{\tau =1}^{t-1}\gamma _\tau ^2, \end{aligned}$$

whence, since \(V_{u_t}(u_*)\ge {1\over 2}\Vert u_t-u_*\Vert ^2\),

$$\begin{aligned} {1\over 2}\Vert u_t-u_*\Vert ^2\le V_{u_1}(u_*)+M^2\sum _{\tau =1}^{t-1}\gamma _\tau ^2, \end{aligned}$$

hence also

$$\begin{aligned} \Vert u_t-u_1\Vert ^2\le 2\Vert u_t-u_*\Vert ^2+2\Vert u_*-u_1 \Vert ^2\le 4\left[ V_{u_1}(u_*)+M^2\sum _{\tau =1}^{t-1}\gamma _\tau ^2\right] +4V_{u_1}(u_*) \end{aligned}$$

and therefore

$$\begin{aligned} \Vert u_t-u_1\Vert \le 2\sqrt{2V_{u_1}(u_*)+M^2\sum _{\tau =1}^{t-1}\gamma _\tau ^2}\le R_N, \end{aligned}$$

and (38) follows. \(\square \)

Appendix 3: Proof of Proposition 3


From (82) and (30) it follows that

$$\begin{aligned}&\forall (x=[u;v]\in X,\ \tau \le N):\\&\quad \lambda _\tau \langle F(y_\tau ),y_\tau -x\rangle \le {\lambda _\tau \over \gamma _\tau }[V_{u_\tau }(u)-V_{u_{\tau +1}}(u)]+M^2\lambda _\tau \gamma _\tau . \end{aligned}$$

Summing up these inequalities over \(\tau =1, \ldots ,N\), we get \(\forall (x=[u;v]\in X)\):

$$\begin{aligned}&\sum \limits _{\tau =1}^N\lambda _\tau \langle F(y_\tau ),y_\tau -x\rangle \\&\quad \le {\lambda _1\over \gamma _1}[V_{u_1}(u)-V_{u_2}(u)] +\,{\lambda _2\over \gamma _2}[V_{u_2}(u)-V_{u_3}(u)]+ \cdots \\&\quad \quad + {\lambda _N\over \gamma _N}[V_{u_N}(u)-V_{u_{N+1}}(u)] +M^2\sum \limits _{\tau =1}^N\lambda _\tau \gamma _\tau \\&\quad =\underbrace{{\lambda _1\over \gamma _1}}_{\ge 0}V_{u_1}(u) +\underbrace{\left[ {\lambda _2\over \gamma _2} -\,{\lambda _1\over \gamma _1}\right] }_{\ge 0}V_{u_2}(u)+\cdots + \underbrace{\left[ {\lambda _N\over \gamma _N} -{\lambda _{N-1}\over \gamma _{N-1}}\right] }_{\ge 0} V_{u_N}(u)\\&\quad \quad -{\lambda _N\over \gamma _N}\underbrace{V_{u_{N+1}}(u)}_{\ge 0} +M^2\sum \limits _{\tau =1}^N\lambda _\tau \gamma _\tau \\&\qquad \le {\lambda _1\over \gamma _1}\widehat{\varTheta } (\max [R_N,\Vert u-u_1\Vert ])+\left[ {\lambda _2\over \gamma _2} -{\lambda _1\over \gamma _1}\right] \widehat{\varTheta }(\max [R_N,\Vert u-u_1\Vert ])+\cdots \\&\qquad \quad +\left[ {\lambda _N\over \gamma _N}-{\lambda _{N-1}\over \gamma _{N-1}}\right] \widehat{\varTheta }(\max [R_N,\Vert u-u_1\Vert ])+M^2\sum \limits _{\tau =1}^N \lambda _\tau \gamma _\tau ,\\&\quad ={\lambda _N\over \gamma _N}\widehat{\varTheta }(\max [R_N,\Vert u-u_1\Vert ]) +M^2\sum \limits _{\tau =1}^N\lambda _\tau \gamma _\tau , \end{aligned}$$

where the concluding inequality is due to (38), and (40) follows.\(\square \)

Appendix 4: Proof of Proposition 5

\(1^{\circ }\). \(h_{s,t}(\alpha )\) are concave piecewise linear functions on \([0,1]\) which clearly are pointwise nonincreasing in time. As a result, \({\hbox {Gap}}(s,t)\) is nonincreasing in time. Further, we have

$$\begin{aligned} {\hbox {Gap}}(s,t)&= \max \limits _{\alpha \in [0,1]}\left\{ \min \limits _{\lambda } \sum \limits _{(p,q)\in Q_{s,t}}\lambda _{pq} [\alpha (p-\underline{{\hbox {Opt}}}_{s,t})+(1-\alpha )q]:\; \lambda _{pq}\ge 0,\right. \\&\qquad \quad \left. \sum _{(p,q)\in Q_{s,t}}\lambda _{pq}=1\right\} \\&= \max _{\alpha \in [0,1]}\sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}[\alpha (p-\underline{{\hbox {Opt}}}_{s,t})+(1-\alpha )q]\\&= \max \left[ \sum _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}(p -\underline{{\hbox {Opt}}}_{s,t}),\sum _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}q\right] , \end{aligned}$$

where \(\lambda ^*_{pq}\ge 0\) and sum up to 1. Recalling that for every \((p,q)\in Q_{s,t}\) we have at our disposal \(y_{pq}\in Y\) such that \(p\ge f(y_{pq})\) and \(q\ge g(y_{pq})\), setting \(\widehat{y}^{s,t}=\sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}y_{pq}\) and invoking convexity of \(f,g\), we get

$$\begin{aligned} f(\widehat{y}^{s,t})&\le \sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}p\le \underline{{\hbox {Opt}}}_{s,t}+{\hbox {Gap}}(s,t), \,\,g(\widehat{y}^{s,t})\\&\le \sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}q\le {\hbox {Gap}}(s,t); \end{aligned}$$

and (69) follows, due to \(\underline{{\hbox {Opt}}}_{s,t}\le {\hbox {Opt}}\).
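As an illustration of how \({\hbox {Gap}}(s,t)\) can be evaluated from a finite set of pairs, the following minimal sketch maximizes the concave piecewise linear function \(h(\alpha )=\min _{(p,q)}[\alpha (p-\underline{{\hbox {Opt}}})+(1-\alpha )q]\) over \([0,1]\) by checking the endpoints and all pairwise crossings of the linear pieces. The function names and the sample data are hypothetical; the brute-force enumeration is chosen for clarity, not efficiency.

```python
def h(alpha, pairs, opt_lower):
    """Pointwise minimum of the linear pieces at a given alpha."""
    return min(alpha * (p - opt_lower) + (1.0 - alpha) * q for p, q in pairs)


def gap(pairs, opt_lower):
    """Return (maximizing alpha, max over alpha in [0, 1] of h)."""
    # Each piece is q + alpha * ((p - opt_lower) - q): store (slope, intercept).
    lines = [((p - opt_lower) - q, q) for p, q in pairs]
    candidates = {0.0, 1.0}
    for i, (slope_1, icpt_1) in enumerate(lines):
        for slope_2, icpt_2 in lines[i + 1:]:
            if slope_1 != slope_2:
                alpha = (icpt_2 - icpt_1) / (slope_1 - slope_2)
                if 0.0 <= alpha <= 1.0:
                    candidates.add(alpha)
    # A concave piecewise linear function attains its maximum over [0, 1]
    # at an endpoint or at a crossing of two pieces.
    best = max(candidates, key=lambda a: h(a, pairs, opt_lower))
    return best, h(best, pairs, opt_lower)


# Hypothetical data: two pairs (p, q) and a lower bound on Opt.
example_pairs = [(3.0, 0.0), (1.0, 2.0)]
alpha_star, gap_value = gap(example_pairs, 1.0)
```

For this sample data the two pieces are \(2\alpha \) and \(2-2\alpha \), which cross at \(\alpha =1/2\). The weights \(\lambda ^*=(1/2,1/2)\) then give \(\sum \lambda ^*_{pq}(p-\underline{{\hbox {Opt}}})=\sum \lambda ^*_{pq}q=1\), in agreement with the last expression for \({\hbox {Gap}}(s,t)\) above.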

\(2^{\circ }\). We have \(\overline{f}_s^t=\alpha _sf(y^{s,t}) +(1-\alpha _s)g(y^{s,t})\) for some \(y^{s,t}\in Y\) which we have at our disposal at step \(t\), implying that \((\widehat{p}=f(y^{s,t}),\widehat{q}=g(y^{s,t}))\in Q_{s,t}\). Hence by definition of \(h_{s,t}(\cdot )\) it holds

$$\begin{aligned} h_{s,t}(\alpha _s)\le \alpha _s (\widehat{p}-\underline{{\hbox {Opt}}}_{s,t})+(1-\alpha _s)\widehat{q}=\overline{f}_s^t-\alpha _s\underline{{\hbox {Opt}}}_{s,t}\le \overline{f}_s^t-\underline{f}_{s,t}, \end{aligned}$$

where the concluding inequality is given by (67). Thus, \(h_{s,t}(\alpha _s)\le \overline{f}_s^t-\underline{f}_{s,t} \le \epsilon _t\). On the other hand, if stage \(s\) does not terminate in the course of the first \(t\) steps, \(\alpha _s\) is well-centered in the segment \(\varDelta _{s,t}\) where the concave function \(h_{s,t}(\alpha )\) is nonnegative. We conclude that \(0\le {\hbox {Gap}}(s,t)=\max _{0\le \alpha \le 1}h_{s,t}(\alpha )=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3h_{s,t}(\alpha _s)\). Thus, if a stage \(s\) does not terminate in the course of the first \(t\) steps, we have \({\hbox {Gap}}(s,t)\le 3\epsilon _t\), which implies (71). Further, \(\alpha _s\) is the midpoint of the segment \(\varDelta ^{s-1}=\varDelta _{s-1,t_{s-1}}\), where \(t_r\) is the last step of stage \(r\) (when \(s=1\), we define \(\varDelta ^0\) as \([0,1]\)), and \(\alpha _s\) is not well-centered in the segment \(\varDelta ^s=\varDelta _{s,t_s}\subset \varDelta _{s-1,t_{s-1}}\), which clearly implies that \(|\varDelta ^s|\le \tfrac{3}{4}|\varDelta ^{s-1}|\). Thus, \(|\varDelta ^s|\le \left( \tfrac{3}{4}\right) ^s\) for all \(s\). On the other hand, when \(|\varDelta _{s,t}|<1\), we have \({\hbox {Gap}}(s,t)=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3L |\varDelta _{s,t}|\) (since \(h_{s,t}(\cdot )\) is Lipschitz continuous with constant \(3L\), see footnote 10, and \(h_{s,t}(\cdot )\) vanishes at (at least) one endpoint of \(\varDelta _{s,t}\)). Thus, the number of stages before \({\hbox {Gap}}(s,t)\le \epsilon \) is reached indeed obeys the bound (70). \(\square \)
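The stage count implied by the last two estimates can be illustrated numerically. The sketch below combines \(|\varDelta ^s|\le (3/4)^s\) with \({\hbox {Gap}}(s,t)\le 3L|\varDelta _{s,t}|\) and returns the smallest \(s\) for which \(3L(3/4)^s\le \epsilon \). The exact form of the bound (70) is not reproduced in this part of the text, so the function name and the sample values of \(L\) and \(\epsilon \) are assumptions for illustration only.

```python
import math


def stages_needed(lipschitz_l, epsilon):
    """Smallest s >= 0 such that 3 * L * (3/4)**s <= epsilon."""
    if lipschitz_l <= 0.0 or epsilon <= 0.0:
        raise ValueError("L and epsilon must be positive")
    stages = 0
    width = 1.0  # upper bound (3/4)**stages on the segment length
    while 3.0 * lipschitz_l * width > epsilon:
        width *= 0.75
        stages += 1
    return stages


# Hypothetical values: L = 1 and target accuracy epsilon = 0.01.
stages = stages_needed(1.0, 0.01)

# Closed form of the same count: ceil(ln(3L / epsilon) / ln(4 / 3)).
closed_form = math.ceil(math.log(3.0 * 1.0 / 0.01) / math.log(4.0 / 3.0))
```

The count grows only logarithmically in \(L/\epsilon \), which is what one expects from a bisection-type scheme in \(\alpha \).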


Cite this article

He, N., Juditsky, A. & Nemirovski, A. Mirror Prox algorithm for multi-term composite minimization and semi-separable problems. Comput Optim Appl 61, 275–319 (2015).
