Unifying mirror descent and dual averaging

Abstract

We introduce and analyze a new family of first-order optimization algorithms which generalizes and unifies both mirror descent and dual averaging. Within the framework of this family, we define new algorithms for constrained optimization that combine the advantages of mirror descent and dual averaging. Our preliminary simulation study shows that these new algorithms significantly outperform available methods in some situations.

Notes

  1. We also refer to [27,  Appendix C] for a discussion comparing MD and DA.

  2. Similar statements can be found in the literature (cf. e.g., [14,  Lemma A.1]), but we could not find one that exactly matches the assumptions of Theorem 1 on F and \({{\mathcal {X}}}\). We provide a detailed proof in Appendix B for completeness.

  3. In its general form [35], the DA algorithm allows for a time-variable regularizer. For the sake of clarity, we consider here the simple case of time-invariant regularizers which already captures some essential differences between MD and DA.

  4. With some terminological abuse, we say that g is strongly convex when it is strongly convex with modulus 1.

  5. In the case of compact \({{\mathcal {X}}}\) one can take \(\Omega _{{\mathcal {X}}}=\left[ \max _{x\in {{\mathcal {X}}}}2D_h(x,x_1;\ {\vartheta }_1)\right] ^{1/2}\). Note that in this case, due to the strong convexity of \(D_h(\cdot ,x_1;\ {\vartheta }_1)\), one has \(\Omega _{{\mathcal {X}}}\geqslant \max _{x\in {{\mathcal {X}}}}\Vert x-x_1\Vert \).

  6. The APDD and IPDD algorithms should be seen merely as examples; nothing in particular sets them apart from other possible UMD implementations.

  7. Recall that satisfaction of this condition (cf. (12)) at each iteration of the method ensures that the bound (13) of Theorem 2 holds.

  8. https://archive.ics.uci.edu/ml/datasets/BlogFeedback.

  9. https://archive.ics.uci.edu/ml/datasets/Madelon.

  10. The second parameter in the definition of the k-\(\ell \)-APDD corresponds to the \(\ell \)-step-ahead computation of the objective when determining the choice of update every k steps of the algorithm.

  11. The domain of a convex function is convex, and therefore \({\mathcal {D}}_F={\text {int}}{\text {dom}}F\) is convex as the interior of a convex set.

References

  1. Audibert, J.Y., Bubeck, S.: Minimax policies for adversarial and stochastic bandits. In: Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pp. 217–226 (2009)

  2. Audibert, J.Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11, 2785–2836 (2010)

  3. Audibert, J.Y., Bubeck, S., Lugosi, G.: Regret in online combinatorial optimization. Math. Oper. Res. 39(1), 31–45 (2013)

  4. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  6. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)

  7. Bubeck, S.: Introduction To Online Optimization: Lecture Notes. Princeton University, Princeton, NJ (2011)

  8. Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3–4), 231–357 (2015)

  9. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Mach. Learn. 5(1), 1–122 (2012)

  10. Bubeck, S., Cesa-Bianchi, N., Kakade, S.M.: Towards minimax policies for online linear optimization with bandit feedback. In: JMLR: Workshop and Conference Proceedings (COLT), vol. 23, pp. 41.1–41.14 (2012)

  11. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)

  12. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)

  13. Cohen, A., Hazan, T., Koren, T.: Tight bounds for bandit combinatorial optimization. In: Proceedings of Machine Learning Research (COLT 2017) vol. 65, pp. 1–14. (2017)

  14. Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148(1–2), 143–180 (2014)

  15. Dasgupta, S., Telgarsky, M.J.: Agglomerative Bregman clustering. In: Proceedings of the 29th International Conference on Machine Learning (ICML 12), pp. 1527–1534 (2012)

  16. Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(Jan), 165–202 (2012)

  17. Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Autom. Control. 57(3), 592–606 (2012)

  18. Duchi, J.C., Ruan, F.: Asymptotic optimality in stochastic optimization. The Annals of Statistics (to appear)

  19. Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate O(1/n). In: Proceedings of Machine Learning Research (COLT 2017), vol. 65, pp. 1–44 (2017)

  20. Hazan, E.: The convex optimization approach to regret minimization. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 287–303. MIT Press (2012)

  21. Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, II: utilizing problems structure. Optimization for Machine Learning 30(9), 149–183 (2011)

  22. Juditsky, A., Rigollet, P., Tsybakov, A.B.: Learning by mirror averaging. Ann. Stat. 36(5), 2183–2206 (2008)

  23. Juditsky, A.B., Nazin, A.V., Tsybakov, A.B., Vayatis, N.: Recursive aggregation of estimators by the mirror descent algorithm with averaging. Probl. Inf. Transm. 41(4), 368–384 (2005)

  24. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)

  25. Lee, S., Wright, S.J.: Manifold identification in dual averaging for regularized stochastic online learning. J. Mach. Learn. Res. 13(Jun), 1705–1744 (2012)

  26. McMahan, B.: Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 525–533 (2011)

  27. McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(1), 3117–3166 (2017)

  28. Nazin, A.V.: Algorithms of inertial mirror descent in convex problems of stochastic optimization. Autom. Remote. Control. 79(1), 78–88 (2018)

  29. Nemirovski, A.: Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody 15 (1979)

  30. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  31. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  32. Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, UK (1983)

  33. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  34. Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2–3), 319–344 (2007)

  35. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  36. Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)

  37. Rakhlin, A., Tewari, A.: Lecture notes on online learning (2009)

  38. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton, NJ (1970)

  39. Shalev-Shwartz, S.: Online learning: Theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem, Israel (2007)

  40. Shalev-Shwartz, S.: Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2), 107–194 (2011)

  41. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML) (2003)

Acknowledgements

The authors are grateful to Roberto Cominetti, Cristóbal Guzmán, Nicolas Flammarion and Sylvain Sorin for inspiring discussions and suggestions. A. Juditsky was supported by MIAI @ Grenoble Alpes (ANR-19-P3IA-0003). J. Kwon was supported by a public grant as part of the “Investissement d’avenir” project (ANR-11-LABX-0056-LMH), LabEx LMH.

Author information

Corresponding author

Correspondence to Joon Kwon.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Convex analysis tools

Definition 9

(Lower-semicontinuity) A function \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is lower-semicontinuous if for all \(c\in {\mathbb {R}}\), the sublevel set \(\left\{ x\in {\mathbb {R}}^n\,:g(x)\leqslant c \right\} \) is closed.

One can easily check that the sum of two lower-semicontinuous functions is lower-semicontinuous. Continuous functions and characteristic functions \(I_{{{\mathcal {X}}}}\) of closed sets \({{\mathcal {X}}}\subset {\mathbb {R}}^n\) are examples of lower-semicontinuous functions.
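
Indeed, for a closed set \({{\mathcal {X}}}\subset {\mathbb {R}}^n\), the sublevel sets of the characteristic function \(I_{{{\mathcal {X}}}}\) (which equals 0 on \({{\mathcal {X}}}\) and \(+\infty \) outside) are

$$\begin{aligned} \left\{ x\in {\mathbb {R}}^n\,:I_{{{\mathcal {X}}}}(x)\leqslant c \right\} ={\left\{ \begin{array}{ll} \emptyset &{}\text {if }c<0,\\ {{\mathcal {X}}}&{}\text {if }c\geqslant 0, \end{array}\right. } \end{aligned}$$

which are closed in both cases; this is the reason why \(I_{{{\mathcal {X}}}}\) is lower-semicontinuous.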

Definition 10

(Strong-convexity) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\), let \(\left\| \,\cdot \,\right\| _{}\) be a norm on \({\mathbb {R}}^n\) and let \(\kappa > 0\). The function g is said to be strongly convex with modulus \(\kappa \) with respect to the norm \(\left\| \,\cdot \,\right\| _{}\) if for all \(x,x'\in {\mathbb {R}}^n\) and \(\lambda \in \left[ 0,1 \right] \),

$$\begin{aligned} g(\lambda x+(1-\lambda )x')\leqslant \lambda g(x)+(1-\lambda )g(x')-\frac{\kappa \lambda (1-\lambda )}{2}\left\| x'-x \right\| _{}^2. \end{aligned}$$
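
As an illustration (added here; it is not used in the sequel), take \(g=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2\): expanding the squares gives, for all \(x,x'\in {\mathbb {R}}^n\) and \(\lambda \in [0,1]\),

$$\begin{aligned} \lambda g(x)+(1-\lambda )g(x')-g(\lambda x+(1-\lambda )x')=\frac{\lambda (1-\lambda )}{2}\left\| x'-x \right\| _2^2, \end{aligned}$$

so the defining inequality holds with equality, and g is strongly convex with modulus 1 with respect to \(\left\| \,\cdot \,\right\| _2\).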

Proposition 12

(Theorem 23.5 in [38]) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a lower-semicontinuous convex function with nonempty domain. Then for all \(x,y\in {\mathbb {R}}^n\), the following statements are equivalent.

  (i) \(x\in \partial g^*(y)\);

  (ii) \(y\in \partial g(x)\);

  (iii) \(\langle y | x \rangle =g(x)+g^*(y)\);

  (iv) \(x\in {{\,\mathrm{Arg\,max}\,}}_{x'\in {\mathbb {R}}^n}\left\{ \langle y | x' \rangle - g(x')\right\} \);

  (v) \(y\in {{\,\mathrm{Arg\,max}\,}}_{y'\in {\mathbb {R}}^n}\left\{ \langle y' | x \rangle - g^*(y')\right\} \).
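
As a one-dimensional illustration of these equivalences (added for the reader's convenience), take \(g(x)=\left| x \right| \), whose conjugate is

$$\begin{aligned} g^*(y)=\sup _{x'\in {\mathbb {R}}}\left\{ yx'-\left| x' \right| \right\} =I_{[-1,1]}(y). \end{aligned}$$

For \(x=0\) and any \(y\in [-1,1]\) one has \(y\in \partial g(0)=[-1,1]\), \(\langle y | x \rangle =0=g(0)+g^*(y)\), and 0 maximizes \(x'\mapsto yx'-\left| x' \right| \) while y maximizes \(y'\mapsto \langle y' | 0 \rangle -g^*(y')\), so that statements (i)–(v) indeed hold simultaneously.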

Postponed proofs

1.1 Proofs for Section 2

1.1.1 Proof of Proposition 1

Let \({\vartheta }\in {\mathbb {R}}^n\). By property (iii) from Definition 1, there exists \(x_1\in {\mathcal {D}}_F\) such that \(\nabla F(x_1)={\vartheta }\). Therefore, function \(\varphi _{{\vartheta }}:x\mapsto \langle {\vartheta }| x \rangle -F(x)\) is differentiable at \(x_1\) and \(\nabla \varphi _{{\vartheta }}(x_1)=0\). Moreover, \(\varphi _{{\vartheta }}\) is strictly concave as a consequence of property (i) from Definition 1. Therefore, \(x_1\) is the unique maximizer of \(\varphi _{{\vartheta }}\) and:

$$\begin{aligned} F^*({\vartheta })=\max _{x\in {\mathbb {R}}^n}\left\{ \langle {\vartheta }| x \rangle -F(x) \right\} <+\infty , \end{aligned}$$

which proves property (i).

Besides, we have

$$\begin{aligned} x_1\in \partial F^*({\vartheta }) \quad \Longleftrightarrow \quad {\vartheta }=\nabla F(x_1) \quad \Longleftrightarrow \quad x_1\text { maximizer of }\varphi _{{\vartheta }}, \end{aligned}$$
(18)

where the first equivalence comes from Proposition 12. The point \(x_1\) being the unique maximizer of \(\varphi _{{\vartheta }}\), we have that \(\partial F^*({\vartheta })\) is a singleton. In other words, \(F^*\) is differentiable at \({\vartheta }\) and

$$\begin{aligned} \nabla F^*({\vartheta })=x_1\in {\mathcal {D}}_F. \end{aligned}$$
(19)

First, the above (19) proves property (ii). Second, this equality combined with the equality from (18) gives the second identity from property (iv). Third, this proves that \(\nabla F^*({\mathbb {R}}^n)\subset {\mathcal {D}}_F\).

It remains to prove the reverse inclusion to get property (iii). Let \(x\in {\mathcal {D}}_F\). By property (ii) from Definition 1, F is differentiable at x. Consider

$$\begin{aligned} {\vartheta }:=\nabla F(x), \end{aligned}$$
(20)

and all the above holds with this special point \({\vartheta }\). In particular, \(x_1=x\) by uniqueness of \(x_1\). Therefore (19) gives

$$\begin{aligned} \nabla F^*({\vartheta })=x, \end{aligned}$$
(21)

and this proves \(\nabla F^*({\mathbb {R}}^n)\supset {\mathcal {D}}_F\) and thus property (iii). Combining (20) and (21) gives the first identity from property (iv).
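
A standard example illustrating properties (i)–(iv) (it is added here for concreteness and can be checked to satisfy Definition 1) is the entropic mirror map on \({\mathcal {D}}_F={\mathbb {R}}^n_{>0}\),

$$\begin{aligned} F(x)=\sum _{i=1}^n(x_i\log x_i-x_i),\qquad \nabla F(x)=(\log x_i)_{1\leqslant i\leqslant n}, \end{aligned}$$

for which a direct computation gives

$$\begin{aligned} F^*({\vartheta })=\sum _{i=1}^n\mathrm {e}^{{\vartheta }_i},\qquad \nabla F^*({\vartheta })=(\mathrm {e}^{{\vartheta }_i})_{1\leqslant i\leqslant n}\in {\mathcal {D}}_F. \end{aligned}$$

Thus \(F^*\) is finite and differentiable on all of \({\mathbb {R}}^n\), \(\nabla F^*({\mathbb {R}}^n)={\mathcal {D}}_F\), and \(\nabla F^*\circ \nabla F\) and \(\nabla F\circ \nabla F^*\) are the identity maps on \({\mathcal {D}}_F\) and \({\mathbb {R}}^n\) respectively.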

1.1.2 Proof of Theorem 1

Let \(x_0\in {\mathcal {D}}_F\). By definition of the mirror map, F is differentiable at \(x_0\). Therefore, \(D_F(x,x_0)\) is well-defined for all \(x\in {\mathbb {R}}^n\).

For each real value \(\alpha \in {\mathbb {R}}\), consider the sublevel set \(S_{{{\mathcal {X}}}}(\alpha )\) of the function \(x\mapsto D_F(x,x_0)\) associated with the value \(\alpha \) and restricted to \({{\mathcal {X}}}\):

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha ):=\left\{ x\in {{\mathcal {X}}}\,:D_F(x,x_0)\leqslant \alpha \right\} . \end{aligned}$$

Inheriting properties from F, function \(D_F(\,\cdot \,,x_0)\) is lower-semicontinuous and strictly convex: consequently, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are closed and convex.

Let us also prove that the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded. For each value \(\alpha \in {\mathbb {R}}\), we write

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha )\subset S_{{\mathbb {R}}^n}(\alpha ):=\left\{ x\in {\mathbb {R}}^n\,:\,D_F(x,x_0)\leqslant \alpha \right\} \end{aligned}$$

and aim at proving that the latter set is bounded. By contradiction, let us suppose that there exists an unbounded sequence in \(S_{{\mathbb {R}}^n}(\alpha )\): let \((x_k)_{k\geqslant 1}\) be such that \(0<\left\| x_k-x_0 \right\| _{}\xrightarrow [k \rightarrow +\infty ]{}+\infty \) and \(D_F(x_k,x_0)\leqslant \alpha \) for all \(k\geqslant 1\). Using the Bolzano–Weierstrass theorem, there exists \(v\ne 0\) and a subsequence \((x_{\phi (k)})_{k\geqslant 1}\) such that

$$\begin{aligned} \frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\xrightarrow [k \rightarrow +\infty ]{}v. \end{aligned}$$

The point \(x_0+\frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\) being a convex combination of \(x_0\) and \(x_{\phi (k)}\) (for k large enough so that \(\left\| x_{\phi (k)}-x_0 \right\| \geqslant 1\)), we can write the corresponding convexity inequality for the function \(D_F(\,\cdot \,,x_0)\):

$$\begin{aligned} D_F\left( x_0+\lambda _k(x_{\phi (k)}-x_0),x_0 \right)&\leqslant (1-\lambda _k)D_F(x_0,x_0) +\lambda _kD_F(x_{\phi (k)},x_0 )\\&\leqslant \lambda _k\alpha \xrightarrow [k \rightarrow +\infty ]{}0, \end{aligned}$$

where we used shorthand \(\lambda _k:=\left\| x_{\phi (k)}-x_0 \right\| ^{-1}\). For the first above inequality, we used \(D_F(x_0,x_0)=0\) and that \(D_F(x_{\phi (k)},x_0)\leqslant \alpha \) by definition of \((x_k)_{k\geqslant 1}\). Then, using the lower-semicontinuity of \(D_F(\,\cdot \,,x_0)\) and the fact that \(x_0+\lambda _k(x_{\phi (k)}-x_0) \xrightarrow [k \rightarrow +\infty ]{}x_0+v\), we have

$$\begin{aligned} D_F(x_0+v,x_0)\leqslant \liminf _{k\rightarrow +\infty }D_F(x_0+\lambda _k(x_{\phi (k)}-x_0),x_0)\leqslant \liminf _{k\rightarrow +\infty }\lambda _k\alpha =0. \end{aligned}$$

The Bregman divergence of a convex function being nonnegative, the above implies \(D_F(x_0+v,x_0)=0\). Thus, the function \(D_F(\,\cdot \,,x_0)\) attains its minimum (0) at two different points (at \(x_0\) and at \(x_0+v\)): this contradicts its strict convexity. Therefore, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded and thus compact.

We now consider the value \(\alpha _{\text {inf}}\) defined as

$$\begin{aligned} \alpha _{\text {inf}}:=\inf \left\{ \alpha \,:S_{{{\mathcal {X}}}}(\alpha )\ne \emptyset \right\} . \end{aligned}$$

In other words, \(\alpha _{\text {inf}}\) is the infimum value of \(D_F(\,\cdot \,,x_0)\) on \({{\mathcal {X}}}\), and thus the only possible value for the minimum (if it exists). We know that \(\alpha _{\text {inf}} \geqslant 0\) because the Bregman divergence is always nonnegative. From the definition of the sets \(S_{{{\mathcal {X}}}}(\alpha )\), it easily follows that:

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})=\bigcap _{\alpha > \alpha _{\text {inf}}}^{}S_{{{\mathcal {X}}}}(\alpha ). \end{aligned}$$

Naturally, the sets \(S_{{{\mathcal {X}}}}(\alpha )\) are increasing in \(\alpha \) with respect to the inclusion order. Therefore, \(S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})\) is the intersection of a nested family of nonempty compact sets. It is thus nonempty as well by Cantor’s intersection theorem. Consequently, \(D_F(\,\cdot \,,x_0)\) does admit a minimum on \({{\mathcal {X}}}\), and the minimizer is unique because of the strict convexity.

Let us now prove that the minimizer \(x_*:=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x\in {{\mathcal {X}}}}D_F(x,\ x_0)\) also belongs to \({\mathcal {D}}_F\). Let us assume by contradiction that \(x_*\in {{\mathcal {X}}}{\setminus } {\mathcal {D}}_F\). By definition of the mirror map, \({{\mathcal {X}}}\cap {\mathcal {D}}_F\) is nonempty; let \(x_1\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). The set \({\mathcal {D}}_F\) being open by definition, there exists \(\varepsilon > 0\) such that the closed Euclidean ball \({\overline{B}}(x_1,\varepsilon )\) centered in \(x_1\) and of radius \(\varepsilon \) is a subset of \({\mathcal {D}}_F\). We consider the convex hull

$$\begin{aligned} {\mathcal {C}}:={\text {co}}\left( \left\{ x_* \right\} \cup {\overline{B}}(x_1,\varepsilon ) \right) , \end{aligned}$$

which is clearly a compact set.

Consider function G defined by:

$$\begin{aligned} G(x):=D_F(x,x_0)=F(x)-F(x_0)-\left\langle \nabla F(x_0) \vert x-x_0 \right\rangle , \end{aligned}$$

so that \(x_*\) is the minimizer of G on \({{\mathcal {X}}}\). In particular, G is finite at \(x_*\). G inherits strict convexity, lower-semicontinuity, and differentiability on \({\mathcal {D}}_F\) from the function F. G is continuous on the compact set \({\overline{B}}(x_1,\varepsilon )\) because a finite convex function is continuous on the interior of its domain and \({\overline{B}}(x_1,\varepsilon )\) is contained in the open set \({\mathcal {D}}_F\). Therefore, G is bounded on \({\overline{B}}(x_1,\varepsilon )\). Let us prove that G is also bounded on \({\mathcal {C}}\). Let \(x\in {\mathcal {C}}\). By definition of \({\mathcal {C}}\), there exist \(\lambda \in [0,1]\) and \(x'\in {\overline{B}}(x_1,\varepsilon )\) such that \(x=\lambda x_*+(1-\lambda )x'\). By convexity of G, we have:

$$\begin{aligned} G(x)\leqslant \lambda G(x_*)+(1-\lambda )G(x')\leqslant G(x_*)+G(x'). \end{aligned}$$

We know that \(G(x_*)\) is finite and that \(G(x')\) is bounded for \(x'\in {\overline{B}}(x_1,\varepsilon )\). Therefore G is bounded on \({\mathcal {C}}\): let \(G_{\text {max}}\) and \(G_{\text {min}}\) denote upper and lower bounds for the values of G on \({\mathcal {C}}\).

Because \({{\mathcal {X}}}\) is a convex set, the segment \([x_*,x_1]\) (in other words the convex hull of \(\left\{ x_*,x_1 \right\} \)) is a subset of \({{\mathcal {X}}}\). Besides, let us prove that the set

$$\begin{aligned} (x_*,x_1]:=\left\{ (1-\lambda )x_*+\lambda x_1\,:\lambda \in (0,1] \right\} \end{aligned}$$

is a subset of \({\mathcal {D}}_F\). Let \(x_{\lambda }:=(1-\lambda )x_*+\lambda x_1\) (with \(\lambda \in (0,1]\)) be a point in the above set, and let us prove that it belongs to \({\mathcal {D}}_F\). By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), and besides \(x_*\in {{\mathcal {X}}}\) by definition. Therefore, there exists a sequence \((x_k)_{k\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_k\rightarrow x_*\) as \(k\rightarrow +\infty \). Then, we can write

$$\begin{aligned} x_{\lambda }&=(1-\lambda )x_*+\lambda x_1\\&=(1-\lambda )x_k + (1-\lambda )(x_*-x_k)+\lambda x_1\\&=(1-\lambda )x_k+\lambda \left( x_1+\frac{1-\lambda }{\lambda }(x_*-x_k) \right) . \end{aligned}$$

Since \(x_k\rightarrow x_*\), for large enough k, the point \(x_1+(1-\lambda )\lambda ^{-1}(x_*-x_k)\) belongs to \({\overline{B}}(x_1,\varepsilon )\) and therefore to \({\mathcal {D}}_F\). Then, the point \(x_{\lambda }\) belongs to the convex set \({\mathcal {D}}_F\) (see Note 11) as the convex combination of two points in \({\mathcal {D}}_F\). Therefore, \((x_*,x_1]\) is indeed a subset of \({\mathcal {D}}_F\).

G being differentiable on \({\mathcal {D}}_F\) by definition of the mirror map, the gradient of G exists at each point of \((x_*,x_1]\). Let us prove that \(\nabla G\) is bounded on \((x_*,x_1]\). Let \(x_{\lambda }\in (x_*,x_1]\), where \(\lambda \in (0,1]\) is such that

$$\begin{aligned} x_{\lambda }= (1-\lambda )x_*+\lambda x_1, \end{aligned}$$

and let \(u\in {\mathbb {R}}^n\) such that \(\left\| u \right\| _2=1\). The point \(x_1+\varepsilon u\) belongs to \({\mathcal {C}}\) because it belongs to \({\overline{B}}(x_1,\varepsilon )\). The following point also belongs to convex set \({\mathcal {C}}\) as the convex combination of \(x_*\) and \(x_1+\varepsilon u\) which both belong to \({\mathcal {C}}\):

$$\begin{aligned} x_{\lambda }+\lambda \varepsilon u = (1-\lambda )x_*+\lambda (x_1+\varepsilon u)\in {\mathcal {C}}. \end{aligned}$$
(22)

Let \(h\in (0,\varepsilon ]\). The following point also belongs to \({\mathcal {C}}\) as a convex combination of \(x_{\lambda }\) and the above point \(x_{\lambda }+\lambda \varepsilon u\):

$$\begin{aligned} x_{\lambda }+\lambda hu = \left( 1-\frac{h}{\varepsilon } \right) x_{\lambda }+\frac{h}{\varepsilon }\left( x_{\lambda }+\lambda \varepsilon u \right) \in {\mathcal {C}}. \end{aligned}$$
(23)

Now using for G the convexity inequality associated with the convex combination from (23), we write:

$$\begin{aligned} G(x_{\lambda }+h\lambda u)-G(x_{\lambda })&\leqslant \frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_{\lambda }) \right) \nonumber \\&=\frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_*)+G(x_*)-G(x_{\lambda }) \right) \nonumber \\&\leqslant \frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_*) \right) , \end{aligned}$$
(24)

where for the last line we used \(G(x_*)\leqslant G(x_{\lambda })\) which is true because \(x_{\lambda }\) belongs to \({{\mathcal {X}}}\) and \(x_*\) is by definition the minimizer of G on \({{\mathcal {X}}}\). Using the convexity inequality associated with the convex combination from (22), we also write

$$\begin{aligned} G(x_{\lambda }+\lambda \varepsilon u)-G(x_*)&\leqslant \lambda \left( G(x_1+\varepsilon u)-G(x_*) \right) \nonumber \\&\leqslant \lambda \left( G_{\text {max}}-G_{\text {min}} \right) . \end{aligned}$$
(25)

Combining (24) and (25) and dividing by \(h\lambda \), we get

$$\begin{aligned} \frac{G(x_{\lambda }+h\lambda u)-G(x_{\lambda })}{h\lambda }\leqslant \frac{G_{\text {max}}-G_{\text {min}}}{\varepsilon }. \end{aligned}$$

Taking the limit as \(h\rightarrow 0^+\), we get that \(\langle \nabla G(x_{\lambda }) | u \rangle \leqslant (G_{\text {max}}-G_{\text {min}})/\varepsilon \). This being true for every vector u such that \(\left\| u \right\| _2=1\), we have

$$\begin{aligned} \left\| \nabla G(x_{\lambda }) \right\| _{2}=\max _{\left\| u \right\| _{2}=1 }\langle \nabla G(x_{\lambda }) | u \rangle \leqslant \frac{G_{\text {max}}-G_{\text {min}}}{\varepsilon }. \end{aligned}$$

As a result, \(\nabla G\) is bounded on \((x_*,x_1]\).

Let us deduce that \(\partial G(x_*)\) is nonempty. The sequence \((\nabla G(x_{1/k}))_{k\geqslant 1}\) is bounded. Using the Bolzano–Weierstrass theorem, there exists a subsequence \((\nabla G(x_{1/\phi (k)}))_{k\geqslant 1}\) which converges to some vector \({\vartheta }_*\in {\mathbb {R}}^n\). For each \(k\geqslant 1\), the following is satisfied by convexity of G:

$$\begin{aligned} \langle \nabla G(x_{1/\phi (k)}) | x-x_{1/\phi (k)} \rangle \leqslant G(x)-G(x_{1/\phi (k)}),\quad x\in {\mathbb {R}}^n. \end{aligned}$$

Taking the limsup of both sides for each \(x\in {\mathbb {R}}^n\) as \(k\rightarrow +\infty \), and using the fact that \(x_{1/\phi (k)}\rightarrow x_*\), we get:

$$\begin{aligned} \langle {\vartheta }_* | x-x_* \rangle \leqslant G(x)-\liminf _{k\rightarrow +\infty }G(x_{1/\phi (k)})\leqslant G(x)-G(x_*),\quad x\in {\mathbb {R}}^n, \end{aligned}$$

where the second inequality follows from the lower-semicontinuity of G. Consequently, \({\vartheta }_*\) belongs to \(\partial G(x_*)\).

But by definition of the mirror map, \(\nabla F\) takes all possible values, and so does \(\nabla G\), since it follows from the definition of G that \(\nabla G=\nabla F-\nabla F(x_0)\). Therefore, there exists a point \({\tilde{x}}\in {\mathcal {D}}_F\) (thus \({\tilde{x}} \ne x_*\), since \(x_*\notin {\mathcal {D}}_F\) by assumption) such that \(\nabla G({\tilde{x}})={\vartheta }_*\). Considering the point \(x_{\text {mid}}=\frac{1}{2}(x_*+{\tilde{x}})\), we can write the following convexity inequalities:

$$\begin{aligned} \langle {\vartheta }_* | x_{\text {mid}}-x_* \rangle&\leqslant G(x_{\text {mid}})-G(x_*)\\ \langle {\vartheta }_* | x_{\text {mid}}-{\tilde{x}} \rangle&\leqslant G(x_{\text {mid}})-G({\tilde{x}}). \end{aligned}$$

We now add both inequalities and use the fact that \(x_{\text {mid}}-{\tilde{x}}=x_*-x_{\text {mid}}\) by definition of \(x_{\text {mid}}\) to get \(0\leqslant 2G(x_{\text {mid}})-G(x_*)-G({\tilde{x}})\), which can also be written

$$\begin{aligned} G\left( \frac{x_*+{\tilde{x}}}{2} \right) \geqslant \frac{G(x_*)+G({\tilde{x}})}{2}, \end{aligned}$$

which contradicts the strict convexity of G (since \(x_*\ne {\tilde{x}}\)). We conclude that \(x_*\in {\mathcal {D}}_F\).
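
To make the conclusion concrete (this example is added here and is not part of the proof), take the entropic mirror map \(F(x)=\sum _{i=1}^n(x_i\log x_i-x_i)\) with \({\mathcal {D}}_F={\mathbb {R}}^n_{>0}\) and \({{\mathcal {X}}}=\Delta _n:=\{x\in {\mathbb {R}}^n_{\geqslant 0}:\sum _{i=1}^nx_i=1\}\) the simplex, assuming as usual that this pair satisfies the requirements of the mirror map definition. For \(x_0\in {\mathcal {D}}_F\),

$$\begin{aligned} D_F(x,x_0)=\sum _{i=1}^n\left( x_i\log \frac{x_i}{x_{0,i}}-x_i+x_{0,i}\right) ,\qquad \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x\in \Delta _n}D_F(x,x_0)=\frac{x_0}{\sum _{i=1}^nx_{0,i}}, \end{aligned}$$

and the minimizer is entrywise positive: it indeed belongs to \({{\mathcal {X}}}\cap {\mathcal {D}}_F\), as asserted by Theorem 1.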

1.1.3 Proof of Proposition 2

Let \({\vartheta }\in {\mathbb {R}}^n\). For each of the three assumptions, let us prove that \(h^*({\vartheta })\) is finite. This will prove that \({\text {dom}}h^*={\mathbb {R}}^n\).

  (i) Because \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\) by definition of a pre-regularizer, we have:

    $$\begin{aligned} h^*({\vartheta })=\max _{x\in {\mathbb {R}}^n}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} =\max _{x\in {{\mathcal {X}}}}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} . \end{aligned}$$

    Besides, the function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is upper-semicontinuous and therefore attains a maximum on \({{\mathcal {X}}}\) because \({{\mathcal {X}}}\) is assumed to be compact. Therefore \(h^*({\vartheta })<+\infty \).

  (ii) Because \(\nabla h({\mathcal {D}}_h)={\mathbb {R}}^n\) by assumption, there exists \(x\in {\mathcal {D}}_h\) such that \(\nabla h(x)={\vartheta }\). Then, by Proposition 12, \(h^*({\vartheta })=\left\langle {\vartheta } \vert x \right\rangle -h(x)<+\infty \).

  (iii) The function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is strongly concave on \({\mathbb {R}}^n\) and therefore admits a maximum. Therefore, \(h^*({\vartheta })<+\infty \).
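
For instance (an illustration added here), with \({{\mathcal {X}}}=\Delta _n\) the simplex and \(h(x)=\sum _{i=1}^nx_i\log x_i+I_{\Delta _n}(x)\), a direct computation gives

$$\begin{aligned} h^*({\vartheta })=\max _{x\in \Delta _n}\left\{ \left\langle {\vartheta } \vert x \right\rangle -\sum _{i=1}^nx_i\log x_i \right\} =\log \sum _{i=1}^n\mathrm {e}^{{\vartheta }_i}, \end{aligned}$$

which is finite for every \({\vartheta }\in {\mathbb {R}}^n\). Case (i) applies since \(\Delta _n\) is compact; case (iii) also applies, since this h is strongly convex with respect to \(\left\| \,\cdot \,\right\| _1\) (by Pinsker's inequality).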

1.1.4 Proof of Proposition 3

Let \({\vartheta }\in {\mathbb {R}}^n\). Because \({\text {dom}}h^*={\mathbb {R}}^n\), the subdifferential \(\partial h^*({\vartheta })\) is nonempty—see e.g. [38,  Theorem 23.4]. By Proposition 12, \(\partial h^*({\vartheta })\) is the set of maximizers of function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\), which is strictly concave. Therefore, the maximizer is unique and \(h^*\) is differentiable at \({\vartheta }\).

Let \(x\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\) and let us prove that \(\nabla F(x)\in \partial h(x)\). By convexity of F, the following is true

$$\begin{aligned} \forall x'\in {\mathbb {R}}^n,\quad F(x')-F(x)\geqslant \langle \nabla F(x) | x'-x \rangle . \end{aligned}$$

By definition of h, we obviously have \(h(x')\geqslant F(x')\) for all \(x'\in {\mathbb {R}}^n\), and \(h(x)=F(x)+I_{{{\mathcal {X}}}}(x)=F(x)\) because \(x\in {{\mathcal {X}}}\). Therefore, the following is also true

$$\begin{aligned} \forall x'\in {\mathbb {R}}^n,\quad h(x')-h(x)\geqslant \langle \nabla F(x) | x'-x \rangle . \end{aligned}$$

In other words, \(\nabla F(x)\in \partial h(x)\).

1.1.5 Proof of Proposition 4

h is strictly convex as the sum of two convex functions, one of which (F) is strictly convex. h is lower-semicontinuous as the sum of two lower-semicontinuous functions.

Let us now prove that \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\). First, we write

$$\begin{aligned} {\text {dom}}h={\text {dom}}(F+I_{{{\mathcal {X}}}})={\text {dom}}F\cap {\text {dom}}I_{{{\mathcal {X}}}}={\text {dom}}F\cap {{\mathcal {X}}}. \end{aligned}$$

Let \(x\in {\text {cl}}{\text {dom}}h={\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\). There exists a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) such that \(x_k\rightarrow x\). In particular, each \(x_k\) belongs to closed set \({{\mathcal {X}}}\), and so does the limit: \(x\in {{\mathcal {X}}}\).

Conversely, let \(x\in {{\mathcal {X}}}\) and let us prove that \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) by constructing a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), where \({\mathcal {D}}_F:={\text {int}}{\text {dom}}F\). Therefore, there exists a sequence \((x_l')_{l\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_l'\rightarrow x\) as \(l\rightarrow +\infty \). From the definition of the mirror map, we also have that \({{\mathcal {X}}}\cap {\mathcal {D}}_F\ne \emptyset \). Let \(x_0\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). In particular, \(x_0\) belongs to \({\mathcal {D}}_F\), which is an open set by definition. Therefore, there exists a neighborhood \(U\subset {\mathcal {D}}_F\) of the point \(x_0\). We now construct the sequence \((x_k)_{k\geqslant 1}\) as follows:

$$\begin{aligned} x_k:=\left( 1-\frac{1}{k} \right) x+\frac{1}{k}x_0,\quad k\geqslant 1. \end{aligned}$$

\(x_k\) belongs to \({{\mathcal {X}}}\) as the convex combination of two points in the convex set \({{\mathcal {X}}}\), and obviously converges to x. Besides, \(x_k\) can also be written, for any \(k,l\geqslant 1\),

$$\begin{aligned} x_k&= \left( 1-\frac{1}{k} \right) x'_l+\left( 1-\frac{1}{k} \right) (x-x_l')+\frac{1}{k}x_0 \\&=\left( 1-\frac{1}{k} \right) x_l'+ \frac{1}{k}\left( x_0+(k-1)(x-x_l') \right) \\&=\left( 1-\frac{1}{k} \right) x_l'+\frac{1}{k}x_{0}^{(kl)}, \end{aligned}$$

where we set \(x_0^{(kl)}:=x_0+(k-1)(x-x_l')\). For a given \(k\geqslant 1\), we see that \(x_0^{(kl)}\rightarrow x_0\) as \(l\rightarrow +\infty \) because \(x_l'\rightarrow x\) by definition of \((x_l')_{l\geqslant 1}\). Therefore, for large enough l, \(x_0^{(kl)}\) belongs to the neighborhood U and therefore to \({\mathcal {D}}_F\). \(x_k\) then appears as the convex combination of \(x_l'\) and \(x_0^{(kl)}\) which both belong to the convex set \({\mathcal {D}}_F\subset {\text {dom}}F\). \((x_k)\) is thus a sequence in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. Therefore, \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) and h is an \({{\mathcal {X}}}\)-pre-regularizer.

Finally, we have \(F\leqslant h\) by definition of h. One can easily check that this implies \(h^*\leqslant F^*\) and we know from Proposition 1 that \({\text {dom}}F^*={\mathbb {R}}^n\), in other words that \(F^*\) only takes finite values. Therefore, so does \(h^*\) and h is an \({{\mathcal {X}}}\)-regularizer.

1.2 Proofs for Section 4

1.2.1 Proof of Proposition 11

Let \(t\geqslant 2\). It follows from the definition of the iterates that \(x_t-y_t=(\nu _{t-1}^{-1}-1)(y_t-y_{t-1})\). Therefore, utilizing the convexity of f, we get

$$\begin{aligned} \langle \gamma _tf'(y_t) | x_t-x_* \rangle&=\gamma _t\langle f'(y_t) | y_t-x_* \rangle +\gamma _t\langle f'(y_t) | x_t-y_t \rangle \\&=\gamma _t\langle f'(y_t) | y_t-x_* \rangle +\gamma _t(\nu _{t-1}^{-1}-1)\langle f'(y_t) | y_t-y_{t-1} \rangle \\&\geqslant \gamma _t\left( f(y_t)-f_* \right) +\gamma _t(\nu _{t-1}^{-1}-1)\left( f(y_t)-f(y_{t-1}) \right) \\&= \gamma _t\nu _{t-1}^{-1}f(y_t)-\gamma _t(\nu _{t-1}^{-1}-1)f(y_{t-1})-\gamma _tf_*. \end{aligned}$$

Besides this, for \(t=1\), we have \(\gamma _1\langle f'(y_1) | x_1-x_* \rangle \geqslant \gamma _1(f(y_1)-f_*)\) because \(x_1=y_1\) by definition. Then, summing over \(t=1,\dots ,T\), we obtain after simplifications:

$$\begin{aligned}&(\gamma _1-\gamma _2(\nu _1^{-1}-1))f(y_1)+\sum _{t=2}^{T-1}(\gamma _t\nu _{t-1}^{-1}-\gamma _{t+1}(\nu _t^{-1}-1))f(y_t)+\gamma _T\nu _{T-1}^{-1}f(y_T)\\&\quad -\left( \sum _{t=1}^T\gamma _t \right) f_*\leqslant \sum _{t=1}^T\langle \gamma _tf'(y_t) | x_t-x_* \rangle . \end{aligned}$$

Using the definition of the coefficients \(\nu _t\), the coefficients of \(f(y_1),\dots ,f(y_{T-1})\) in the left-hand side vanish and the coefficient of \(f(y_T)\) equals \(\sum _{t=1}^T\gamma _t\), resulting in the inequality

$$\begin{aligned} \left( \sum _{t=1}^T\gamma _t \right) \left( f(y_T)-f_* \right) \leqslant \sum _{t=1}^T\langle \gamma _tf'(y_t) | x_t-x_* \rangle . \end{aligned}$$
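
For the reader's convenience, here is how the simplification can be verified under the assumption (which should be matched against the definition of \(\nu _t\) used in Section 4 and is not restated in this appendix) that \(\nu _t=\gamma _{t+1}/\sum _{s=1}^{t+1}\gamma _s\). In that case,

$$\begin{aligned} \gamma _1-\gamma _2(\nu _1^{-1}-1)=0,\qquad \gamma _t\nu _{t-1}^{-1}-\gamma _{t+1}(\nu _t^{-1}-1)=\sum _{s=1}^{t}\gamma _s-\sum _{s=1}^{t}\gamma _s=0,\qquad \gamma _T\nu _{T-1}^{-1}=\sum _{s=1}^{T}\gamma _s, \end{aligned}$$

so that the coefficients of \(f(y_1),\dots ,f(y_{T-1})\) indeed vanish and only the terms appearing in the display above remain.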

Finally, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD\((h,\xi )\) iterates with dual increments \(\xi :=(-\gamma _tf'(y_t))_{t\geqslant 1}\), the result then follows by applying inequality (8) from Corollary 2 and dividing by \(\sum _{t=1}^T\gamma _t\). \(\square \)

1.2.2 Proof of Theorem 2

First, observe that whenever \(\gamma _t\leqslant 1/L\), due to (11),

$$\begin{aligned} f(x_{t+1})-f(x_{t})-\left\langle \nabla f(x_t) \vert x_{t+1}-x_t \right\rangle \leqslant {L\over 2}\Vert x_{t+1}-x_t\Vert ^2\leqslant (2\gamma _t)^{-1}\Vert x_{t+1}-x_t\Vert ^2. \end{aligned}$$
(26)

Thus,

$$\begin{aligned} \gamma _t D_f(x_{t+1},x_t)&=\gamma _t[f(x_{t+1})-f(x_{t})-\left\langle \nabla f(x_t) \vert x_{t+1}-x_t \right\rangle ]\\&\leqslant \frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\\&\leqslant D_h(x_{t+1},x_t;\ {\vartheta }_t), \end{aligned}$$

where the first inequality follows from (26) and the second from the strong convexity of \(D_h\). On the other hand, by (6) of Lemma 1, for any \(x\in {{\mathcal {X}}}\cap {\text {dom}}h\),

$$\begin{aligned} D_h(x,x_{t+1}; {\vartheta }_{t+1})&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t+1} \right\rangle -D_h(x_{t+1},x_t; {\vartheta }_t)\\ [\text{ by } (12)]&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t} \right\rangle \\&\qquad -\gamma _t\left\langle \nabla f(x_t) \vert x_{t+1}-x_{t} \right\rangle -\gamma _tD_f(x_{t+1},x_t)\\ [\text{ by definition of } D_f]&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t} \right\rangle -\gamma _t[f(x_{t+1})-f(x_t)]\\ [\text{ by convexity of } f]&\leqslant D_h(x,x_t;{\vartheta }_t)-\gamma _t(f(x_{t+1})-f(x)). \end{aligned}$$

Consequently, \(\forall x\in {{\mathcal {X}}}\cap {\text {dom}}h\),

$$\begin{aligned} \gamma _t(f(x_{t+1})-f(x))\leqslant D_h(x,x_t;{\vartheta }_t)-D_h(x,x_{t+1}; {\vartheta }_{t+1}). \end{aligned}$$

When applying the above inequality to \(x=x_t\) we conclude that

$$\begin{aligned} \gamma _t(f(x_{t+1})-f(x_t))\leqslant -D_h(x_t,x_{t+1}; {\vartheta }_{t+1})\leqslant 0. \end{aligned}$$

Finally, when setting \(x=x_*\), we obtain

$$\begin{aligned} \left( \sum _{t=1}^T\gamma _t\right) (f(x_{T+1})-f_*)\leqslant \sum _{t=1}^T\gamma _t(f(x_{t+1})-f(x_*))\leqslant D_h(x_*,x_1;{\vartheta }_1) \end{aligned}$$

which implies (13). \(\square \)
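
As an added illustration (it is not part of the original argument), consider constant step-sizes \(\gamma _t\equiv 1/L\), which satisfy the condition \(\gamma _t\leqslant 1/L\) used above. Dividing the last display by \(\sum _{t=1}^T\gamma _t=T/L\) gives

$$\begin{aligned} f(x_{T+1})-f_*\leqslant \frac{L\,D_h(x_*,x_1;{\vartheta }_1)}{T}. \end{aligned}$$

In the unconstrained Euclidean case \(h=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2\) with \({\vartheta }_1=\nabla h(x_1)=x_1\), and assuming the generalized divergence is given by \(D_h(x,x';{\vartheta })=h(x)-h(x')-\left\langle {\vartheta } \vert x-x' \right\rangle \) as the notation suggests, one has \(D_h(x_*,x_1;{\vartheta }_1)=\frac{1}{2}\Vert x_*-x_1\Vert _2^2\) and the bound becomes the familiar \(L\Vert x_*-x_1\Vert _2^2/(2T)\) guarantee of gradient descent.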

1.2.3 Proof of Theorem 3

We start with the following technical result.

Lemma 2

Assume that positive step-sizes \(\nu _t\in (0,1]\) and \(\gamma _t>0\) are such that the relationship

$$\begin{aligned} f({z}_{t+1})\leqslant f(y_t)+\nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_t \right\rangle +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$
(27)

holds for all t, which is certainly the case if \(\nu _t\gamma _t\leqslant L^{-1}\). Denote \(s_t=f(z_t)-f_*\); then

$$\begin{aligned} {\gamma _t\nu _t^{-1}}(s_{t+1}-s_t)+\gamma _ts_t&\leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1}). \end{aligned}$$
(28)

Proof of the lemma

Observe first that by construction,

$$\begin{aligned} {z}_{t+1}-y_t= (1-\nu _t){z}_t+\nu _t{x}_{t+1}-[(1-\nu _t){z}_t+\nu _t{x}_{t}]=\nu _t({x}_{t+1}-{x}_t). \end{aligned}$$

By the smoothness of f and the strong convexity of h, for \(\nu _t\gamma _t\leqslant L^{-1}\) we have

$$\begin{aligned} f({z}_{t+1})&\leqslant f(y_t)+\langle \nabla f(y_t),{z}_{t+1}-y_t\rangle +{L\over 2}\Vert {z}_{t+1}-y_t\Vert ^2\\&= f(y_t)+\nu _t\langle \nabla f(y_t),{x}_{t+1}-x_t\rangle +{L\nu _t^2\over 2}\Vert {x}_{t+1}-{x}_t\Vert ^2\\&\leqslant f(y_t)+\nu _t\langle \nabla f(y_t),{x}_{t+1}-x_t\rangle +{\nu _t\over \gamma _t}D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

which is (27).

Next, observe that by (14a),

$$\begin{aligned} \nu _t(x_*-x_t)=(\nu _tx_*+(1-\nu _t)z_t)-y_t, \end{aligned}$$

whence, by convexity of f,

$$\begin{aligned} \nu _t\left\langle \nabla f(y_t) \vert x_*-x_t \right\rangle&=\left\langle \nabla f(y_t) \vert (\nu _tx_*+(1-\nu _t)z_t)-y_t \right\rangle \\&\leqslant f(\nu _tx_*+(1-\nu _t)z_t)-f(y_t)\\&\leqslant \nu _t(f(x_*)-f(y_t))+(1-\nu _t)(f(z_t)-f(y_t)). \end{aligned}$$

Writing \(\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_t \right\rangle =\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\left\langle \nabla f(y_t) \vert x_*-x_t \right\rangle \) in (27) and substituting the latter bound, we get

$$\begin{aligned} f({z}_{t+1})&\leqslant f(y_t)+\nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\nu _t(f(x_*)-f(y_t))\\ {}&\qquad +(1-\nu _t)(f(z_t)-f(y_t)) +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

or

$$\begin{aligned} f({z}_{t+1})-f(z_t)\leqslant \nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\nu _t(f_*-f(z_t)) +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t). \end{aligned}$$

Now, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD iterates, by (6) of Lemma 1,

$$\begin{aligned} \gamma _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle \leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1})-D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

and we arrive at

$$\begin{aligned} {\gamma _t\nu _t^{-1}}(f({z}_{t+1})-f(z_t))\leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1})+\gamma _t(f_*-f(z_t)), \end{aligned}$$

which is (28). \(\square \)

Proof of the Theorem

Assume that \(\nu _t\) and \(\gamma _t\) satisfy

$$\begin{aligned} \nu _1=1,\quad \nu _t\in (0,1], \quad \gamma _{t+1}(\nu _{t+1}^{-1}-1)\leqslant \gamma _t\nu _{t}^{-1}. \end{aligned}$$
(29)

When summing (28) up from 1 to T we get

$$\begin{aligned} D_h(x_*,x_1;{\vartheta }_1)&\geqslant \sum _{t=1}^T[{\gamma _t\nu _t^{-1}}(s_{t+1}-s_t)+\gamma _ts_t]\\&={\gamma _T\nu _T^{-1}}s_{T+1}+\sum _{t=2}^Ts_t\left( {\gamma _{t-1}\nu _{t-1}^{-1}}-\gamma _t(\nu _t^{-1}-1)\right) -\gamma _1(\nu _1^{-1}-1)s_1\\&\quad \underbrace{\geqslant }_{[\text{ by } (29)]} {\gamma _T\nu _T^{-1}}s_{T+1}={\gamma _T\nu _T^{-1}}(f(z_{T+1})-f_*). \end{aligned}$$

It is clear that the choice of \(\gamma _1=L^{-1}\), \(\nu _1=1\) and \(\nu _t=(\gamma _tL)^{-1}\) satisfies the relationship \(\gamma _t\nu _t\leqslant L^{-1}\). In this case, when choosing step-sizes \((\gamma _t)_{t\geqslant 1}\) to saturate recursively the last relation in (29), specifically,

$$\begin{aligned} \gamma ^2_{t+1}L-\gamma _{t+1}= \gamma _t^2L \end{aligned}$$

we obtain the celebrated Nesterov step-sizes (15), which satisfy \(\gamma _t\nu _t^{-1}\geqslant {(t+1)^2\over 4L}\), and we arrive at (16). \(\square \)
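
For concreteness (a numerical check added here, not part of the proof), the saturated recursion can be solved for its positive root,

$$\begin{aligned} \gamma _{t+1}=\frac{1+\sqrt{1+4L^2\gamma _t^2}}{2L},\qquad \gamma _1=\frac{1}{L}, \end{aligned}$$

which for \(L=1\) gives \(\gamma _1=1\), \(\gamma _2=\frac{1+\sqrt{5}}{2}\approx 1.618\), \(\gamma _3\approx 2.194\). A simple induction shows that \(\gamma _t\geqslant \frac{t+1}{2L}\): indeed \(\gamma _1=\frac{2}{2L}\) and \(\gamma _{t+1}\geqslant \frac{1+2L\gamma _t}{2L}\geqslant \frac{t+2}{2L}\). Since \(\nu _t=(\gamma _tL)^{-1}\), this yields \(\gamma _t\nu _t^{-1}=L\gamma _t^2\geqslant \frac{(t+1)^2}{4L}\), which is exactly the property of the step-sizes used above.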

About this article

Cite this article

Juditsky, A., Kwon, J. & Moulines, É. Unifying mirror descent and dual averaging. Math. Program. 199, 793–830 (2023). https://doi.org/10.1007/s10107-022-01850-3
