Unifying mirror descent and dual averaging

Abstract

We introduce and analyze a new family of first-order optimization algorithms which generalizes and unifies both mirror descent and dual averaging. Within the framework of this family, we define new algorithms for constrained optimization that combine the advantages of mirror descent and dual averaging. Our preliminary simulation study shows that these new algorithms significantly outperform available methods in some situations.

Notes

  1. We also refer to [27,  Appendix C] for a discussion comparing MD and DA.

  2. Similar statements can be found in the literature (cf. e.g., [14,  Lemma A.1]), but we could not find one that exactly matches the assumptions of Theorem 1 on F and \({{\mathcal {X}}}\). We provide a detailed proof in Appendix B for completeness.

  3. In its general form [35], the DA algorithm allows for a time-variable regularizer. For the sake of clarity, we consider here the simple case of time-invariant regularizers which already captures some essential differences between MD and DA.

  4. With some terminological abuse, we say that g is strongly convex when it is strongly convex with modulus 1.

  5. In the case of compact \({{\mathcal {X}}}\) one can take \(\Omega _{{\mathcal {X}}}=\left[ \max _{x\in {{\mathcal {X}}}}2D_h(x,x_1;\ {\vartheta }_1)\right] ^{1/2}\). Note that in this case, due to the strong convexity of \(D_h(\cdot ,x_1;\ {\vartheta }_1)\), one has \(\Omega _{{\mathcal {X}}}\geqslant \max _{x\in {{\mathcal {X}}}}\Vert x-x_1\Vert \).

  6. The APDD and IPDD algorithms should be seen merely as examples; nothing in particular sets them apart from other possible UMD implementations.

  7. Recall that satisfaction of this condition (cf. (12)) at each iteration of the method ensures that the bound (13) of Theorem 2 holds.

  8. https://archive.ics.uci.edu/ml/datasets/BlogFeedback.

  9. https://archive.ics.uci.edu/ml/datasets/Madelon.

  10. The second parameter in the definition of the k-\(\ell \)-APDD corresponds to the \(\ell \)-step-ahead computation of the objective when determining the choice of update every k steps of the algorithm.

  11. The domain of a convex function is convex, and therefore \({\mathcal {D}}_F={\text {int}}{\text {dom}}F\) is convex as the interior of a convex set.

References

  1. Audibert, J.Y., Bubeck, S.: Minimax policies for adversarial and stochastic bandits. In: Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pp. 217–226 (2009)

  2. Audibert, J.Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11, 2785–2836 (2010)

  3. Audibert, J.Y., Bubeck, S., Lugosi, G.: Regret in online combinatorial optimization. Math. Oper. Res. 39(1), 31–45 (2013)

  4. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  6. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)

  7. Bubeck, S.: Introduction To Online Optimization: Lecture Notes. Princeton University, Princeton, NJ (2011)

  8. Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3–4), 231–357 (2015)

  9. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Mach. Learn. 5(1), 1–122 (2012)

  10. Bubeck, S., Cesa-Bianchi, N., Kakade, S.M.: Towards minimax policies for online linear optimization with bandit feedback. In: JMLR: Workshop and Conference Proceedings (COLT), vol. 23, pp. 41.1–41.14 (2012)

  11. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)

  12. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)

  13. Cohen, A., Hazan, T., Koren, T.: Tight bounds for bandit combinatorial optimization. In: Proceedings of Machine Learning Research (COLT 2017) vol. 65, pp. 1–14. (2017)

  14. Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148(1–2), 143–180 (2014)

  15. Dasgupta, S., Telgarsky, M.J.: Agglomerative Bregman clustering. In: Proceedings of the 29th International Conference on Machine Learning (ICML 12), pp. 1527–1534 (2012)

  16. Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(Jan), 165–202 (2012)

  17. Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Autom. Control. 57(3), 592–606 (2012)

  18. Duchi, J.C., Ruan, F.: Asymptotic optimality in stochastic optimization. The Annals of Statistics (to appear)

  19. Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate O(1/n). In: Proceedings of Machine Learning Research (COLT 2017), vol. 65, pp. 1–44 (2017)

  20. Hazan, E.: The convex optimization approach to regret minimization. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 287–303. MIT Press (2012)

  21. Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, II: utilizing problems structure. Optimization for Machine Learning 30(9), 149–183 (2011)

  22. Juditsky, A., Rigollet, P., Tsybakov, A.B.: Learning by mirror averaging. Ann. Stat. 36(5), 2183–2206 (2008)

  23. Juditsky, A.B., Nazin, A.V., Tsybakov, A.B., Vayatis, N.: Recursive aggregation of estimators by the mirror descent algorithm with averaging. Probl. Inf. Transm. 41(4), 368–384 (2005)

  24. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)

  25. Lee, S., Wright, S.J.: Manifold identification in dual averaging for regularized stochastic online learning. J. Mach. Learn. Res. 13(Jun), 1705–1744 (2012)

  26. McMahan, B.: Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 525–533 (2011)

  27. McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(1), 3117–3166 (2017)

  28. Nazin, A.V.: Algorithms of inertial mirror descent in convex problems of stochastic optimization. Autom. Remote. Control. 79(1), 78–88 (2018)

  29. Nemirovski, A.: Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody 15 (1979)

  30. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  31. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  32. Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, UK (1983)

  33. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  34. Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2–3), 319–344 (2007)

  35. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  36. Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)

  37. Rakhlin, A., Tewari, A.: Lecture notes on online learning (2009)

  38. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton, NJ (1970)

  39. Shalev-Shwartz, S.: Online learning: Theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem, Israel (2007)

  40. Shalev-Shwartz, S.: Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2), 107–194 (2011)

  41. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML) (2003)

Acknowledgements

The authors are grateful to Roberto Cominetti, Cristóbal Guzmán, Nicolas Flammarion and Sylvain Sorin for inspiring discussions and suggestions. A. Juditsky was supported by MIAI @ Grenoble Alpes (ANR-19-P3IA-0003). J. Kwon was supported by a public grant as part of the “Investissement d’avenir” project (ANR-11-LABX-0056-LMH), LabEx LMH.

Author information

Corresponding author

Correspondence to Joon Kwon.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Convex analysis tools

Definition 9

(Lower-semicontinuity) A function \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is lower-semicontinuous if for all \(c\in {\mathbb {R}}\), the sublevel set \(\left\{ x\in {\mathbb {R}}^n\,:g(x)\leqslant c \right\} \) is closed.

One can easily check that the sum of two lower-semicontinuous functions is lower-semicontinuous. Continuous functions and characteristic functions \(I_{{{\mathcal {X}}}}\) of closed sets \({{\mathcal {X}}}\subset {\mathbb {R}}^n\) are examples of lower-semicontinuous functions.
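
Indeed, for a closed set \({{\mathcal {X}}}\subset {\mathbb {R}}^n\), the sublevel sets of the characteristic function \(I_{{{\mathcal {X}}}}\) (which equals 0 on \({{\mathcal {X}}}\) and \(+\infty \) outside) are

$$\begin{aligned} \left\{ x\in {\mathbb {R}}^n\,:I_{{{\mathcal {X}}}}(x)\leqslant c \right\} ={\left\{ \begin{array}{ll} \emptyset &{}\text {if }c<0,\\ {{\mathcal {X}}}&{}\text {if }c\geqslant 0, \end{array}\right. } \end{aligned}$$

which are closed in both cases; this is the reason why \(I_{{{\mathcal {X}}}}\) is lower-semicontinuous.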

Definition 10

(Strong-convexity) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\), let \(\left\| \,\cdot \,\right\| _{}\) be a norm on \({\mathbb {R}}^n\) and let \(\kappa > 0\). The function g is said to be strongly convex with modulus \(\kappa \) with respect to the norm \(\left\| \,\cdot \,\right\| _{}\) if for all \(x,x'\in {\mathbb {R}}^n\) and \(\lambda \in \left[ 0,1 \right] \),

$$\begin{aligned} g(\lambda x+(1-\lambda )x')\leqslant \lambda g(x)+(1-\lambda )g(x')-\frac{\kappa \lambda (1-\lambda )}{2}\left\| x'-x \right\| _{}^2. \end{aligned}$$
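
As an illustration (added here; it is not used in the sequel), take \(g=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2\): expanding the squares gives, for all \(x,x'\in {\mathbb {R}}^n\) and \(\lambda \in [0,1]\),

$$\begin{aligned} \lambda g(x)+(1-\lambda )g(x')-g(\lambda x+(1-\lambda )x')=\frac{\lambda (1-\lambda )}{2}\left\| x'-x \right\| _2^2, \end{aligned}$$

so the defining inequality holds with equality, and g is strongly convex with modulus 1 with respect to \(\left\| \,\cdot \,\right\| _2\).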

Proposition 12

(Theorem 23.5 in [38]) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a lower-semicontinuous convex function with nonempty domain. Then for all \(x,y\in {\mathbb {R}}^n\), the following statements are equivalent.

  (i) \(x\in \partial g^*(y)\);

  (ii) \(y\in \partial g(x)\);

  (iii) \(\langle y | x \rangle =g(x)+g^*(y)\);

  (iv) \(x\in {{\,\mathrm{Arg\,max}\,}}_{x'\in {\mathbb {R}}^n}\left\{ \langle y | x' \rangle - g(x')\right\} \);

  (v) \(y\in {{\,\mathrm{Arg\,max}\,}}_{y'\in {\mathbb {R}}^n}\left\{ \langle y' | x \rangle - g^*(y')\right\} \).
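
As a one-dimensional illustration of these equivalences (added for the reader's convenience), take \(g(x)=\left| x \right| \), whose conjugate is

$$\begin{aligned} g^*(y)=\sup _{x'\in {\mathbb {R}}}\left\{ yx'-\left| x' \right| \right\} =I_{[-1,1]}(y). \end{aligned}$$

For \(x=0\) and any \(y\in [-1,1]\) one has \(y\in \partial g(0)=[-1,1]\), \(\langle y | x \rangle =0=g(0)+g^*(y)\), and 0 maximizes \(x'\mapsto yx'-\left| x' \right| \) while y maximizes \(y'\mapsto \langle y' | 0 \rangle -g^*(y')\), so that statements (i)–(v) indeed hold simultaneously.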

Postponed proofs

1.1 Proofs for Section 2

1.1.1 Proof of Proposition 1

Let \({\vartheta }\in {\mathbb {R}}^n\). By property (iii) from Definition 1, there exists \(x_1\in {\mathcal {D}}_F\) such that \(\nabla F(x_1)={\vartheta }\). Therefore, function \(\varphi _{{\vartheta }}:x\mapsto \langle {\vartheta }| x \rangle -F(x)\) is differentiable at \(x_1\) and \(\nabla \varphi _{{\vartheta }}(x_1)=0\). Moreover, \(\varphi _{{\vartheta }}\) is strictly concave as a consequence of property (i) from Definition 1. Therefore, \(x_1\) is the unique maximizer of \(\varphi _{{\vartheta }}\) and:

$$\begin{aligned} F^*({\vartheta })=\max _{x\in {\mathbb {R}}^n}\left\{ \langle {\vartheta }| x \rangle -F(x) \right\} <+\infty , \end{aligned}$$

which proves property (i).

Besides, we have

$$\begin{aligned} x_1\in \partial F^*({\vartheta }) \quad \Longleftrightarrow \quad {\vartheta }=\nabla F(x_1) \quad \Longleftrightarrow \quad x_1\text { maximizer of }\varphi _{{\vartheta }}, \end{aligned}$$
(18)

where the first equivalence comes from Proposition 12. The point \(x_1\) being the unique maximizer of \(\varphi _{{\vartheta }}\), we have that \(\partial F^*({\vartheta })\) is a singleton. In other words, \(F^*\) is differentiable at \({\vartheta }\) and

$$\begin{aligned} \nabla F^*({\vartheta })=x_1\in {\mathcal {D}}_F. \end{aligned}$$
(19)

First, the above (19) proves property (ii). Second, this equality combined with the equality from (18) gives the second identity from property (iv). Third, this proves that \(\nabla F^*({\mathbb {R}}^n)\subset {\mathcal {D}}_F\).

It remains to prove the reverse inclusion to get property (iii). Let \(x\in {\mathcal {D}}_F\). By property (ii) from Definition 1, F is differentiable at x. Consider

$$\begin{aligned} {\vartheta }:=\nabla F(x), \end{aligned}$$
(20)

and all the above holds with this special point \({\vartheta }\). In particular, \(x_1=x\) by uniqueness of \(x_1\). Therefore (19) gives

$$\begin{aligned} \nabla F^*({\vartheta })=x, \end{aligned}$$
(21)

and this proves \(\nabla F^*({\mathbb {R}}^n)\supset {\mathcal {D}}_F\) and thus property (iii). Combining (20) and (21) gives the first identity from property (iv).
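
A standard example illustrating properties (i)–(iv) (it is added here for concreteness and can be checked to satisfy Definition 1) is the entropic mirror map on \({\mathcal {D}}_F={\mathbb {R}}^n_{>0}\),

$$\begin{aligned} F(x)=\sum _{i=1}^n(x_i\log x_i-x_i),\qquad \nabla F(x)=(\log x_i)_{1\leqslant i\leqslant n}, \end{aligned}$$

for which a direct computation gives

$$\begin{aligned} F^*({\vartheta })=\sum _{i=1}^n\mathrm {e}^{{\vartheta }_i},\qquad \nabla F^*({\vartheta })=(\mathrm {e}^{{\vartheta }_i})_{1\leqslant i\leqslant n}\in {\mathcal {D}}_F. \end{aligned}$$

Thus \(F^*\) is finite and differentiable on all of \({\mathbb {R}}^n\), \(\nabla F^*({\mathbb {R}}^n)={\mathcal {D}}_F\), and \(\nabla F^*\circ \nabla F\) and \(\nabla F\circ \nabla F^*\) are the identity maps on \({\mathcal {D}}_F\) and \({\mathbb {R}}^n\) respectively.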

1.1.2 Proof of Theorem 1

Let \(x_0\in {\mathcal {D}}_F\). By definition of the mirror map, F is differentiable at \(x_0\). Therefore, \(D_F(x,x_0)\) is well-defined for all \(x\in {\mathbb {R}}^n\).

For each real value \(\alpha \in {\mathbb {R}}\), consider the sublevel set \(S_{{{\mathcal {X}}}}(\alpha )\) of the function \(x\mapsto D_F(x,x_0)\) associated with the value \(\alpha \) and restricted to \({{\mathcal {X}}}\):

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha ):=\left\{ x\in {{\mathcal {X}}}\,:D_F(x,x_0)\leqslant \alpha \right\} . \end{aligned}$$

Inheriting properties from F, function \(D_F(\,\cdot \,,x_0)\) is lower-semicontinuous and strictly convex: consequently, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are closed and convex.

Let us also prove that the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded. For each value \(\alpha \in {\mathbb {R}}\), we write

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha )\subset S_{{\mathbb {R}}^n}(\alpha ):=\left\{ x\in {\mathbb {R}}^n\,:\,D_F(x,x_0)\leqslant \alpha \right\} \end{aligned}$$

and aim at proving that the latter set is bounded. By contradiction, let us suppose that there exists an unbounded sequence in \(S_{{\mathbb {R}}^n}(\alpha )\): let \((x_k)_{k\geqslant 1}\) be such that \(0<\left\| x_k-x_0 \right\| _{}\xrightarrow [k \rightarrow +\infty ]{}+\infty \) and \(D_F(x_k,x_0)\leqslant \alpha \) for all \(k\geqslant 1\). Using the Bolzano–Weierstrass theorem, there exists \(v\ne 0\) and a subsequence \((x_{\phi (k)})_{k\geqslant 1}\) such that

$$\begin{aligned} \frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\xrightarrow [k \rightarrow +\infty ]{}v. \end{aligned}$$

The point \(x_0+\frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\) being a convex combination of \(x_0\) and \(x_{\phi (k)}\) (for k large enough so that \(\left\| x_{\phi (k)}-x_0 \right\| \geqslant 1\)), we can write the corresponding convexity inequality for the function \(D_F(\,\cdot \,,x_0)\):

$$\begin{aligned} D_F\left( x_0+\lambda _k(x_{\phi (k)}-x_0),x_0 \right)&\leqslant (1-\lambda _k)D_F(x_0,x_0) +\lambda _kD_F(x_{\phi (k)},x_0 )\\&\leqslant \lambda _k\alpha \xrightarrow [k \rightarrow +\infty ]{}0, \end{aligned}$$

where we used shorthand \(\lambda _k:=\left\| x_{\phi (k)}-x_0 \right\| ^{-1}\). For the first above inequality, we used \(D_F(x_0,x_0)=0\) and that \(D_F(x_{\phi (k)},x_0)\leqslant \alpha \) by definition of \((x_k)_{k\geqslant 1}\). Then, using the lower-semicontinuity of \(D_F(\,\cdot \,,x_0)\) and the fact that \(x_0+\lambda _k(x_{\phi (k)}-x_0) \xrightarrow [k \rightarrow +\infty ]{}x_0+v\), we have

$$\begin{aligned} D_F(x_0+v,x_0)\leqslant \liminf _{k\rightarrow +\infty }D_F(x_0+\lambda _k(x_{\phi (k)}-x_0),x_0)\leqslant \liminf _{k\rightarrow +\infty }\lambda _k\alpha =0. \end{aligned}$$

The Bregman divergence of a convex function being nonnegative, the above implies \(D_F(x_0+v,x_0)=0\). Thus, the function \(D_F(\,\cdot \,,x_0)\) attains its minimum (0) at two different points (at \(x_0\) and at \(x_0+v\)): this contradicts its strict convexity. Therefore, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded and thus compact.

We now consider the value \(\alpha _{\text {inf}}\) defined as

$$\begin{aligned} \alpha _{\text {inf}}:=\inf \left\{ \alpha \,:S_{{{\mathcal {X}}}}(\alpha )\ne \emptyset \right\} . \end{aligned}$$

In other words, \(\alpha _{\text {inf}}\) is the infimum value of \(D_F(\,\cdot \,,x_0)\) on \({{\mathcal {X}}}\), and thus the only possible value for the minimum (if it exists). We know that \(\alpha _{\text {inf}} \geqslant 0\) because the Bregman divergence is always nonnegative. From the definition of the sets \(S_{{{\mathcal {X}}}}(\alpha )\), it easily follows that:

$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})=\bigcap _{\alpha > \alpha _{\text {inf}}}^{}S_{{{\mathcal {X}}}}(\alpha ). \end{aligned}$$

Naturally, the sets \(S_{{{\mathcal {X}}}}(\alpha )\) are increasing in \(\alpha \) with respect to the inclusion order. Therefore, \(S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})\) is the intersection of a nested family of nonempty compact sets. It is thus nonempty as well by Cantor’s intersection theorem. Consequently, \(D_F(\,\cdot \,,x_0)\) does admit a minimum on \({{\mathcal {X}}}\), and the minimizer is unique because of the strict convexity.

Let us now prove that the minimizer \(x_*:=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x\in {{\mathcal {X}}}}D_F(x,\ x_0)\) also belongs to \({\mathcal {D}}_F\). Let us assume by contradiction that \(x_*\in {{\mathcal {X}}}{\setminus } {\mathcal {D}}_F\). By definition of the mirror map, \({{\mathcal {X}}}\cap {\mathcal {D}}_F\) is nonempty; let \(x_1\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). The set \({\mathcal {D}}_F\) being open by definition, there exists \(\varepsilon > 0\) such that the closed Euclidean ball \({\overline{B}}(x_1,\varepsilon )\) centered in \(x_1\) and of radius \(\varepsilon \) is a subset of \({\mathcal {D}}_F\). We consider the convex hull

$$\begin{aligned} {\mathcal {C}}:={\text {co}}\left( \left\{ x_* \right\} \cup {\overline{B}}(x_1,\varepsilon ) \right) , \end{aligned}$$

which is clearly a compact set.

Consider function G defined by:

$$\begin{aligned} G(x):=D_F(x,x_0)=F(x)-F(x_0)-\left\langle \nabla F(x_0) \vert x-x_0 \right\rangle , \end{aligned}$$

so that \(x_*\) is the minimizer of G on \({{\mathcal {X}}}\). In particular, G is finite at \(x_*\). G inherits strict convexity, lower-semicontinuity, and differentiability on \({\mathcal {D}}_F\) from the function F. G is continuous on the compact set \({\overline{B}}(x_1,\varepsilon )\) because a finite convex function is continuous on the interior of its domain and \({\overline{B}}(x_1,\varepsilon )\) is contained in the open set \({\mathcal {D}}_F\). Therefore, G is bounded on \({\overline{B}}(x_1,\varepsilon )\). Let us prove that G is also bounded on \({\mathcal {C}}\). Let \(x\in {\mathcal {C}}\). By definition of \({\mathcal {C}}\), there exist \(\lambda \in [0,1]\) and \(x'\in {\overline{B}}(x_1,\varepsilon )\) such that \(x=\lambda x_*+(1-\lambda )x'\). By convexity of G, we have:

$$\begin{aligned} G(x)\leqslant \lambda G(x_*)+(1-\lambda )G(x')\leqslant G(x_*)+G(x'). \end{aligned}$$

We know that \(G(x_*)\) is finite and that \(G(x')\) is bounded for \(x'\in {\overline{B}}(x_1,\varepsilon )\). Therefore G is bounded on \({\mathcal {C}}\): let \(G_{\text {max}}\) and \(G_{\text {min}}\) denote upper and lower bounds for the values of G on \({\mathcal {C}}\).

Because \({{\mathcal {X}}}\) is a convex set, the segment \([x_*,x_1]\) (in other words the convex hull of \(\left\{ x_*,x_1 \right\} \)) is a subset of \({{\mathcal {X}}}\). Besides, let us prove that the set

$$\begin{aligned} (x_*,x_1]:=\left\{ (1-\lambda )x_*+\lambda x_1\,:\lambda \in (0,1] \right\} \end{aligned}$$

is a subset of \({\mathcal {D}}_F\). Let \(x_{\lambda }:=(1-\lambda )x_*+\lambda x_1\) (with \(\lambda \in (0,1]\)) be a point in the above set, and let us prove that it belongs to \({\mathcal {D}}_F\). By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), and besides \(x_*\in {{\mathcal {X}}}\) by definition. Therefore, there exists a sequence \((x_k)_{k\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_k\rightarrow x_*\) as \(k\rightarrow +\infty \). Then, we can write

$$\begin{aligned} x_{\lambda }&=(1-\lambda )x_*+\lambda x_1\\&=(1-\lambda )x_k + (1-\lambda )(x_*-x_k)+\lambda x_1\\&=(1-\lambda )x_k+\lambda \left( x_1+\frac{1-\lambda }{\lambda }(x_*-x_k) \right) . \end{aligned}$$

Since \(x_k\rightarrow x_*\), for large enough k, the point \(x_1+(1-\lambda )\lambda ^{-1}(x_*-x_k)\) belongs to \({\overline{B}}(x_1,\varepsilon )\) and therefore to \({\mathcal {D}}_F\). Then, the point \(x_{\lambda }\) belongs to the convex set \({\mathcal {D}}_F\) (see Note 11) as the convex combination of two points in \({\mathcal {D}}_F\). Therefore, \((x_*,x_1]\) is indeed a subset of \({\mathcal {D}}_F\).

G being differentiable on \({\mathcal {D}}_F\) by definition of the mirror map, the gradient of G exists at each point of \((x_*,x_1]\). Let us prove that \(\nabla G\) is bounded on \((x_*,x_1]\). Let \(x_{\lambda }\in (x_*,x_1]\), where \(\lambda \in (0,1]\) is such that

$$\begin{aligned} x_{\lambda }= (1-\lambda )x_*+\lambda x_1, \end{aligned}$$

and let \(u\in {\mathbb {R}}^n\) such that \(\left\| u \right\| _2=1\). The point \(x_1+\varepsilon u\) belongs to \({\mathcal {C}}\) because it belongs to \({\overline{B}}(x_1,\varepsilon )\). The following point also belongs to convex set \({\mathcal {C}}\) as the convex combination of \(x_*\) and \(x_1+\varepsilon u\) which both belong to \({\mathcal {C}}\):

$$\begin{aligned} x_{\lambda }+\lambda \varepsilon u = (1-\lambda )x_*+\lambda (x_1+\varepsilon u)\in {\mathcal {C}}. \end{aligned}$$
(22)

Let \(h\in (0,\varepsilon ]\). The following point also belongs to \({\mathcal {C}}\) as a convex combination of \(x_{\lambda }\) and the above point \(x_{\lambda }+\lambda \varepsilon u\):

$$\begin{aligned} x_{\lambda }+\lambda hu = \left( 1-\frac{h}{\varepsilon } \right) x_{\lambda }+\frac{h}{\varepsilon }\left( x_{\lambda }+\lambda \varepsilon u \right) \in {\mathcal {C}}. \end{aligned}$$
(23)

Now using for G the convexity inequality associated with the convex combination from (23), we write:

$$\begin{aligned} G(x_{\lambda }+h\lambda u)-G(x_{\lambda })&\leqslant \frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_{\lambda }) \right) \nonumber \\&=\frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_*)+G(x_*)-G(x_{\lambda }) \right) \nonumber \\&\leqslant \frac{h}{\varepsilon }\left( G(x_{\lambda }+\lambda \varepsilon u)-G(x_*) \right) , \end{aligned}$$
(24)

where for the last line we used \(G(x_*)\leqslant G(x_{\lambda })\) which is true because \(x_{\lambda }\) belongs to \({{\mathcal {X}}}\) and \(x_*\) is by definition the minimizer of G on \({{\mathcal {X}}}\). Using the convexity inequality associated with the convex combination from (22), we also write

$$\begin{aligned} G(x_{\lambda }+\lambda \varepsilon u)-G(x_*)&\leqslant \lambda \left( G(x_1+\varepsilon u)-G(x_*) \right) \nonumber \\&\leqslant \lambda \left( G_{\text {max}}-G_{\text {min}} \right) . \end{aligned}$$
(25)

Combining (24) and (25) and dividing by \(h\lambda \), we get

$$\begin{aligned} \frac{G(x_{\lambda }+h\lambda u)-G(x_{\lambda })}{h\lambda }\leqslant \frac{G_{\text {max}}-G_{\text {min}}}{\varepsilon }. \end{aligned}$$

Taking the limit as \(h\rightarrow 0^+\), we get that \(\langle \nabla G(x_{\lambda }) | u \rangle \leqslant (G_{\text {max}}-G_{\text {min}})/\varepsilon \). This being true for every vector u such that \(\left\| u \right\| _2=1\), we have

$$\begin{aligned} \left\| \nabla G(x_{\lambda }) \right\| _{2}=\max _{\left\| u \right\| _{2}=1 }\langle \nabla G(x_{\lambda }) | u \rangle \leqslant \frac{G_{\text {max}}-G_{\text {min}}}{\varepsilon }. \end{aligned}$$

As a result, \(\nabla G\) is bounded on \((x_*,x_1]\).

Let us deduce that \(\partial G(x_*)\) is nonempty. The sequence \((\nabla G(x_{1/k}))_{k\geqslant 1}\) is bounded. Using the Bolzano–Weierstrass theorem, there exists a subsequence \((\nabla G(x_{1/\phi (k)}))_{k\geqslant 1}\) which converges to some vector \({\vartheta }_*\in {\mathbb {R}}^n\). For each \(k\geqslant 1\), the following is satisfied by convexity of G:

$$\begin{aligned} \langle \nabla G(x_{1/\phi (k)}) | x-x_{1/\phi (k)} \rangle \leqslant G(x)-G(x_{1/\phi (k)}),\quad x\in {\mathbb {R}}^n. \end{aligned}$$

Taking the limsup of both sides for each \(x\in {\mathbb {R}}^n\) as \(k\rightarrow +\infty \), and using the fact that \(x_{1/\phi (k)}\rightarrow x_*\), we get:

$$\begin{aligned} \langle {\vartheta }_* | x-x_* \rangle \leqslant G(x)-\liminf _{k\rightarrow +\infty }G(x_{1/\phi (k)})\leqslant G(x)-G(x_*),\quad x\in {\mathbb {R}}^n, \end{aligned}$$

where the second inequality follows from the lower-semicontinuity of G. Consequently, \({\vartheta }_*\) belongs to \(\partial G(x_*)\).

But by definition of the mirror map, \(\nabla F\) takes all possible values, and so does \(\nabla G\), since it follows from the definition of G that \(\nabla G=\nabla F-\nabla F(x_0)\). Therefore, there exists a point \({\tilde{x}}\in {\mathcal {D}}_F\) (thus \({\tilde{x}} \ne x_*\), since \(x_*\notin {\mathcal {D}}_F\) by assumption) such that \(\nabla G({\tilde{x}})={\vartheta }_*\). Considering the point \(x_{\text {mid}}=\frac{1}{2}(x_*+{\tilde{x}})\), we can write the following convexity inequalities:

$$\begin{aligned} \langle {\vartheta }_* | x_{\text {mid}}-x_* \rangle&\leqslant G(x_{\text {mid}})-G(x_*)\\ \langle {\vartheta }_* | x_{\text {mid}}-{\tilde{x}} \rangle&\leqslant G(x_{\text {mid}})-G({\tilde{x}}). \end{aligned}$$

We now add both inequalities and use the fact that \(x_{\text {mid}}-{\tilde{x}}=x_*-x_{\text {mid}}\) by definition of \(x_{\text {mid}}\) to get \(0\leqslant 2G(x_{\text {mid}})-G(x_*)-G({\tilde{x}})\), which can also be written

$$\begin{aligned} G\left( \frac{x_*+{\tilde{x}}}{2} \right) \geqslant \frac{G(x_*)+G({\tilde{x}})}{2}, \end{aligned}$$

which contradicts the strict convexity of G (since \(x_*\ne {\tilde{x}}\)). We conclude that \(x_*\in {\mathcal {D}}_F\).
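
To make the conclusion concrete (this example is added here and is not part of the proof), take the entropic mirror map \(F(x)=\sum _{i=1}^n(x_i\log x_i-x_i)\) with \({\mathcal {D}}_F={\mathbb {R}}^n_{>0}\) and \({{\mathcal {X}}}=\Delta _n:=\{x\in {\mathbb {R}}^n_{\geqslant 0}:\sum _{i=1}^nx_i=1\}\) the simplex, assuming as usual that this pair satisfies the requirements of the mirror map definition. For \(x_0\in {\mathcal {D}}_F\),

$$\begin{aligned} D_F(x,x_0)=\sum _{i=1}^n\left( x_i\log \frac{x_i}{x_{0,i}}-x_i+x_{0,i}\right) ,\qquad \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x\in \Delta _n}D_F(x,x_0)=\frac{x_0}{\sum _{i=1}^nx_{0,i}}, \end{aligned}$$

and the minimizer is entrywise positive: it indeed belongs to \({{\mathcal {X}}}\cap {\mathcal {D}}_F\), as asserted by Theorem 1.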

1.1.3 Proof of Proposition 2

Let \({\vartheta }\in {\mathbb {R}}^n\). For each of the three assumptions, let us prove that \(h^*({\vartheta })\) is finite. This will prove that \({\text {dom}}h^*={\mathbb {R}}^n\).

  (i) Because \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\) by definition of a pre-regularizer, we have:

    $$\begin{aligned} h^*({\vartheta })=\max _{x\in {\mathbb {R}}^n}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} =\max _{x\in {{\mathcal {X}}}}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} . \end{aligned}$$

    Besides, the function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is upper-semicontinuous and therefore attains a maximum on \({{\mathcal {X}}}\) because \({{\mathcal {X}}}\) is assumed to be compact. Therefore \(h^*({\vartheta })<+\infty \).

  (ii) Because \(\nabla h({\mathcal {D}}_h)={\mathbb {R}}^n\) by assumption, there exists \(x\in {\mathcal {D}}_h\) such that \(\nabla h(x)={\vartheta }\). Then, by Proposition 12, \(h^*({\vartheta })=\left\langle {\vartheta } \vert x \right\rangle -h(x)<+\infty \).

  (iii) The function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is strongly concave on \({\mathbb {R}}^n\) and therefore admits a maximum. Therefore, \(h^*({\vartheta })<+\infty \).
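
For instance (an illustration added here), with \({{\mathcal {X}}}=\Delta _n\) the simplex and \(h(x)=\sum _{i=1}^nx_i\log x_i+I_{\Delta _n}(x)\), a direct computation gives

$$\begin{aligned} h^*({\vartheta })=\max _{x\in \Delta _n}\left\{ \left\langle {\vartheta } \vert x \right\rangle -\sum _{i=1}^nx_i\log x_i \right\} =\log \sum _{i=1}^n\mathrm {e}^{{\vartheta }_i}, \end{aligned}$$

which is finite for every \({\vartheta }\in {\mathbb {R}}^n\). Case (i) applies since \(\Delta _n\) is compact; case (iii) also applies, since this h is strongly convex with respect to \(\left\| \,\cdot \,\right\| _1\) (by Pinsker's inequality).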

1.1.4 Proof of Proposition 3

Let \({\vartheta }\in {\mathbb {R}}^n\). Because \({\text {dom}}h^*={\mathbb {R}}^n\), the subdifferential \(\partial h^*({\vartheta })\) is nonempty—see e.g. [38,  Theorem 23.4]. By Proposition 12, \(\partial h^*({\vartheta })\) is the set of maximizers of function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\), which is strictly concave. Therefore, the maximizer is unique and \(h^*\) is differentiable at \({\vartheta }\).

Let \(x\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\) and let us prove that \(\nabla F(x)\in \partial h(x)\). By convexity of F, the following is true

$$\begin{aligned} \forall x'\in {\mathbb {R}}^n,\quad F(x')-F(x)\geqslant \langle \nabla F(x) | x'-x \rangle . \end{aligned}$$

By definition of h, we obviously have \(h(x')\geqslant F(x')\) for all \(x'\in {\mathbb {R}}^n\), and \(h(x)=F(x)+I_{{{\mathcal {X}}}}(x)=F(x)\) because \(x\in {{\mathcal {X}}}\). Therefore, the following is also true

$$\begin{aligned} \forall x'\in {\mathbb {R}}^n,\quad h(x')-h(x)\geqslant \langle \nabla F(x) | x'-x \rangle . \end{aligned}$$

In other words, \(\nabla F(x)\in \partial h(x)\).

1.1.5 Proof of Proposition 4

h is strictly convex as the sum of two convex functions, one of which (F) is strictly convex. h is lower-semicontinuous as the sum of two lower-semicontinuous functions.

Let us now prove that \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\). First, we write

$$\begin{aligned} {\text {dom}}h={\text {dom}}(F+I_{{{\mathcal {X}}}})={\text {dom}}F\cap {\text {dom}}I_{{{\mathcal {X}}}}={\text {dom}}F\cap {{\mathcal {X}}}. \end{aligned}$$

Let \(x\in {\text {cl}}{\text {dom}}h={\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\). There exists a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) such that \(x_k\rightarrow x\). In particular, each \(x_k\) belongs to closed set \({{\mathcal {X}}}\), and so does the limit: \(x\in {{\mathcal {X}}}\).

Conversely, let \(x\in {{\mathcal {X}}}\) and let us prove that \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) by constructing a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), where \({\mathcal {D}}_F:={\text {int}}{\text {dom}}F\). Therefore, there exists a sequence \((x_l')_{l\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_l'\rightarrow x\) as \(l\rightarrow +\infty \). From the definition of the mirror map, we also have that \({{\mathcal {X}}}\cap {\mathcal {D}}_F\ne \emptyset \). Let \(x_0\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). In particular, \(x_0\) belongs to \({\mathcal {D}}_F\), which is an open set by definition. Therefore, there exists a neighborhood \(U\subset {\mathcal {D}}_F\) of the point \(x_0\). We now construct the sequence \((x_k)_{k\geqslant 1}\) as follows:

$$\begin{aligned} x_k:=\left( 1-\frac{1}{k} \right) x+\frac{1}{k}x_0,\quad k\geqslant 1. \end{aligned}$$

\(x_k\) belongs to \({{\mathcal {X}}}\) as the convex combination of two points in the convex set \({{\mathcal {X}}}\), and obviously converges to x. Besides, \(x_k\) can also be written, for any \(k,l\geqslant 1\),

$$\begin{aligned} x_k&= \left( 1-\frac{1}{k} \right) x'_l+\left( 1-\frac{1}{k} \right) (x-x_l')+\frac{1}{k}x_0 \\&=\left( 1-\frac{1}{k} \right) x_l'+ \frac{1}{k}\left( x_0+(k-1)(x-x_l') \right) \\&=\left( 1-\frac{1}{k} \right) x_l'+\frac{1}{k}x_{0}^{(kl)}, \end{aligned}$$

where we set \(x_0^{(kl)}:=x_0+(k-1)(x-x_l')\). For a given \(k\geqslant 1\), we see that \(x_0^{(kl)}\rightarrow x_0\) as \(l\rightarrow +\infty \) because \(x_l'\rightarrow x\) by definition of \((x_l')_{l\geqslant 1}\). Therefore, for large enough l, \(x_0^{(kl)}\) belongs to the neighborhood U and therefore to \({\mathcal {D}}_F\). \(x_k\) then appears as the convex combination of \(x_l'\) and \(x_0^{(kl)}\) which both belong to the convex set \({\mathcal {D}}_F\subset {\text {dom}}F\). \((x_k)\) is thus a sequence in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. Therefore, \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) and h is an \({{\mathcal {X}}}\)-pre-regularizer.

Finally, we have \(F\leqslant h\) by definition of h. One can easily check that this implies \(h^*\leqslant F^*\) and we know from Proposition 1 that \({\text {dom}}F^*={\mathbb {R}}^n\), in other words that \(F^*\) only takes finite values. Therefore, so does \(h^*\) and h is an \({{\mathcal {X}}}\)-regularizer.

1.2 Proofs for Section 4

1.2.1 Proof of Proposition 11

Let \(t\geqslant 2\). It follows from the definition of the iterates that \(x_t-y_t=(\nu _{t-1}^{-1}-1)(y_t-y_{t-1})\). Therefore, utilizing the convexity of f, we get

$$\begin{aligned} \langle \gamma _tf'(y_t) | x_t-x_* \rangle&=\gamma _t\langle f'(y_t) | y_t-x_* \rangle +\gamma _t\langle f'(y_t) | x_t-y_t \rangle \\&=\gamma _t\langle f'(y_t) | y_t-x_* \rangle +\gamma _t(\nu _{t-1}^{-1}-1)\langle f'(y_t) | y_t-y_{t-1} \rangle \\&\geqslant \gamma _t\left( f(y_t)-f_* \right) +\gamma _t(\nu _{t-1}^{-1}-1)\left( f(y_t)-f(y_{t-1}) \right) \\&= \gamma _t\nu _{t-1}^{-1}f(y_t)-\gamma _t(\nu _{t-1}^{-1}-1)f(y_{t-1})-\gamma _tf_*. \end{aligned}$$

Besides this, for \(t=1\), we have \(\gamma _1\langle f'(y_1) | x_1-x_* \rangle \geqslant \gamma _1(f(y_1)-f_*)\) because \(x_1=y_1\) by definition. Then, summing over \(t=1,\dots ,T\), we obtain after simplifications:

$$\begin{aligned}&(\gamma _1-\gamma _2(\nu _1^{-1}-1))f(y_1)+\sum _{t=2}^{T-1}(\gamma _t\nu _{t-1}^{-1}-\gamma _{t+1}(\nu _t^{-1}-1))f(y_t)+\gamma _T\nu _{T-1}^{-1}f(y_T)\\&\quad -\left( \sum _{t=1}^T\gamma _t \right) f_*\leqslant \sum _{t=1}^T\langle \gamma _tf'(y_t) | x_t-x_* \rangle . \end{aligned}$$

Using the definition of the coefficients \(\nu _t\), the coefficients of \(f(y_1),\dots ,f(y_{T-1})\) in the left-hand side vanish and the coefficient of \(f(y_T)\) equals \(\sum _{t=1}^T\gamma _t\), resulting in the inequality

$$\begin{aligned} \left( \sum _{t=1}^T\gamma _t \right) \left( f(y_T)-f_* \right) \leqslant \sum _{t=1}^T\langle \gamma _tf'(y_t) | x_t-x_* \rangle . \end{aligned}$$
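
For the reader's convenience, here is how the simplification can be verified under the assumption (which should be matched against the definition of \(\nu _t\) used in Section 4 and is not restated in this appendix) that \(\nu _t=\gamma _{t+1}/\sum _{s=1}^{t+1}\gamma _s\). In that case,

$$\begin{aligned} \gamma _1-\gamma _2(\nu _1^{-1}-1)=0,\qquad \gamma _t\nu _{t-1}^{-1}-\gamma _{t+1}(\nu _t^{-1}-1)=\sum _{s=1}^{t}\gamma _s-\sum _{s=1}^{t}\gamma _s=0,\qquad \gamma _T\nu _{T-1}^{-1}=\sum _{s=1}^{T}\gamma _s, \end{aligned}$$

so that the coefficients of \(f(y_1),\dots ,f(y_{T-1})\) indeed vanish and only the terms appearing in the display above remain.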

Finally, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD\((h,\xi )\) iterates with dual increments \(\xi :=(-\gamma _tf'(y_t))_{t\geqslant 1}\), the result then follows by applying inequality (8) from Corollary 2 and dividing by \(\sum _{t=1}^T\gamma _t\). \(\square \)

1.2.2 Proof of Theorem 2

First, observe that whenever \(\gamma _t\leqslant 1/L\), due to (11),

$$\begin{aligned} f(x_{t+1})-f(x_{t})-\left\langle \nabla f(x_t) \vert x_{t+1}-x_t \right\rangle \leqslant {L\over 2}\Vert x_{t+1}-x_t\Vert ^2\leqslant (2\gamma _t)^{-1}\Vert x_{t+1}-x_t\Vert ^2. \end{aligned}$$
(26)

Thus,

$$\begin{aligned} \gamma _t D_f(x_{t+1},x_t)&=\gamma _t[f(x_{t+1})-f(x_{t})-\left\langle \nabla f(x_t) \vert x_{t+1}-x_t \right\rangle ]\\&\leqslant \frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\\&\leqslant D_h(x_{t+1},x_t;\ {\vartheta }_t), \end{aligned}$$

where the first inequality follows from (26) and the second from the strong convexity of \(D_h\). On the other hand, by (6) of Lemma 1, for any \(x\in {{\mathcal {X}}}\cap {\text {dom}}h\),

$$\begin{aligned} D_h(x,x_{t+1}; {\vartheta }_{t+1})&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t+1} \right\rangle -D_h(x_{t+1},x_t; {\vartheta }_t)\\ [\text{ by } (12)]&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t} \right\rangle \\&\qquad -\gamma _t\left\langle \nabla f(x_t) \vert x_{t+1}-x_{t} \right\rangle -\gamma _tD_f(x_{t+1},x_t)\\ [\text{ by definition of } D_f]&\leqslant D_h(x,x_t;{\vartheta }_t)+\gamma _t\left\langle \nabla f(x_t) \vert x-x_{t} \right\rangle -\gamma _t[f(x_{t+1})-f(x_t)]\\ [\text{ by convexity of } f]&\leqslant D_h(x,x_t;{\vartheta }_t)-\gamma _t(f(x_{t+1})-f(x)). \end{aligned}$$

Consequently, \(\forall x\in {{\mathcal {X}}}\cap {\text {dom}}h\),

$$\begin{aligned} \gamma _t(f(x_{t+1})-f(x))\leqslant D_h(x,x_t;{\vartheta }_t)-D_h(x,x_{t+1}; {\vartheta }_{t+1}). \end{aligned}$$

When applying the above inequality to \(x=x_t\) we conclude that

$$\begin{aligned} \gamma _t(f(x_{t+1})-f(x_t))\leqslant -D_h(x_t,x_{t+1}; {\vartheta }_{t+1})\leqslant 0. \end{aligned}$$

Finally, when setting \(x=x_*\), we obtain

$$\begin{aligned} \left( \sum _{t=1}^T\gamma _t\right) (f(x_{T+1})-f_*)\leqslant \sum _{t=1}^T\gamma _t(f(x_{t+1})-f(x_*))\leqslant D_h(x_*,x_1;{\vartheta }_1) \end{aligned}$$

which implies (13). \(\square \)
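
As an added illustration (it is not part of the original argument), consider constant step-sizes \(\gamma _t\equiv 1/L\), which satisfy the condition \(\gamma _t\leqslant 1/L\) used above. Dividing the last display by \(\sum _{t=1}^T\gamma _t=T/L\) gives

$$\begin{aligned} f(x_{T+1})-f_*\leqslant \frac{L\,D_h(x_*,x_1;{\vartheta }_1)}{T}. \end{aligned}$$

In the unconstrained Euclidean case \(h=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2\) with \({\vartheta }_1=\nabla h(x_1)=x_1\), and assuming the generalized divergence is given by \(D_h(x,x';{\vartheta })=h(x)-h(x')-\left\langle {\vartheta } \vert x-x' \right\rangle \) as the notation suggests, one has \(D_h(x_*,x_1;{\vartheta }_1)=\frac{1}{2}\Vert x_*-x_1\Vert _2^2\) and the bound becomes the familiar \(L\Vert x_*-x_1\Vert _2^2/(2T)\) guarantee of gradient descent.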

1.2.3 Proof of Theorem 3

We start with the following technical result.

Lemma 2

Assume that positive step-sizes \(\nu _t\in (0,1]\) and \(\gamma _t>0\) are such that the relationship

$$\begin{aligned} f({z}_{t+1})\leqslant f(y_t)+\nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_t \right\rangle +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$
(27)

holds for all t, which is certainly the case if \(\nu _t\gamma _t\leqslant L^{-1}\). Denote \(s_t=f(z_t)-f_*\); then

$$\begin{aligned} {\gamma _t\nu _t^{-1}}(s_{t+1}-s_t)+\gamma _ts_t&\leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1}). \end{aligned}$$
(28)

Proof of the lemma

Observe first that by construction,

$$\begin{aligned} {z}_{t+1}-y_t= (1-\nu _t){z}_t+\nu _t{x}_{t+1}-[(1-\nu _t){z}_t+\nu _t{x}_{t}]=\nu _t({x}_{t+1}-{x}_t). \end{aligned}$$

By the smoothness of f and the strong convexity of h, for \(\nu _t\gamma _t\leqslant L^{-1}\) we have

$$\begin{aligned} f({z}_{t+1})&\leqslant f(y_t)+\langle \nabla f(y_t),{z}_{t+1}-y_t\rangle +{L\over 2}\Vert {z}_{t+1}-y_t\Vert ^2\\&= f(y_t)+\nu _t\langle \nabla f(y_t),{x}_{t+1}-x_t\rangle +{L\nu _t^2\over 2}\Vert {x}_{t+1}-{x}_t\Vert ^2\\&\leqslant f(y_t)+\nu _t\langle \nabla f(y_t),{x}_{t+1}-x_t\rangle +{\nu _t\over \gamma _t}D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

which is (27).

Next, observe that by (14a),

$$\begin{aligned} \nu _t(x_*-x_t)=(\nu _tx_*+(1-\nu _t)z_t)-y_t, \end{aligned}$$

whence, by convexity of f,

$$\begin{aligned} \nu _t\left\langle \nabla f(y_t) \vert x_*-x_t \right\rangle&=\left\langle \nabla f(y_t) \vert (\nu _tx_*+(1-\nu _t)z_t)-y_t \right\rangle \\&\leqslant f(\nu _tx_*+(1-\nu _t)z_t)-f(y_t)\\&\leqslant \nu _t(f(x_*)-f(y_t))+(1-\nu _t)(f(z_t)-f(y_t)). \end{aligned}$$

Writing \(\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_t \right\rangle =\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\left\langle \nabla f(y_t) \vert x_*-x_t \right\rangle \) in (27) and substituting the latter bound, we get

$$\begin{aligned} f({z}_{t+1})&\leqslant f(y_t)+\nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\nu _t(f(x_*)-f(y_t))\\ {}&\qquad +(1-\nu _t)(f(z_t)-f(y_t)) +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

or

$$\begin{aligned} f({z}_{t+1})-f(z_t)\leqslant \nu _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle +\nu _t(f_*-f(z_t)) +{\nu _t\over \gamma _t} D_h({x}_{t+1},{x}_t;{\vartheta }_t). \end{aligned}$$

Now, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD iterates, by (6) of Lemma 1,

$$\begin{aligned} \gamma _t\left\langle \nabla f(y_t) \vert {x}_{t+1}-x_* \right\rangle \leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1})-D_h({x}_{t+1},{x}_t;{\vartheta }_t), \end{aligned}$$

and we arrive at

$$\begin{aligned} {\gamma _t\nu _t^{-1}}(f({z}_{t+1})-f(z_t))\leqslant D_h(x_*,{x}_t;{\vartheta }_t)-D_h(x_*,{x}_{t+1};{\vartheta }_{t+1})+\gamma _t(f_*-f(z_t)), \end{aligned}$$

which is (28). \(\square \)

Proof of the Theorem

Assume that \(\nu _t\) and \(\gamma _t\) satisfy

$$\begin{aligned} \nu _1=1,\quad \nu _t\in (0,1], \quad \gamma _{t+1}(\nu _{t+1}^{-1}-1)\leqslant \gamma _t\nu _{t}^{-1}. \end{aligned}$$
(29)

When summing (28) up from 1 to T we get

$$\begin{aligned} D_h(x_*,x_1;{\vartheta }_1)&\geqslant \sum _{t=1}^T[{\gamma _t\nu _t^{-1}}(s_{t+1}-s_t)+\gamma _ts_t]\\&={\gamma _T\nu _T^{-1}}s_{T+1}+\sum _{t=2}^Ts_t\left( {\gamma _{t-1}\nu _{t-1}^{-1}}-\gamma _t(\nu _t^{-1}-1)\right) -\gamma _1(\nu _1^{-1}-1)s_1\\&\quad \underbrace{\geqslant }_{[\text{ by } (29)]} {\gamma _T\nu _T^{-1}}s_{T+1}={\gamma _T\nu _T^{-1}}(f(z_{T+1})-f_*). \end{aligned}$$

It is clear that the choice of \(\gamma _1=L^{-1}\), \(\nu _1=1\) and \(\nu _t=(\gamma _tL)^{-1}\) satisfies the relationship \(\gamma _t\nu _t\leqslant L^{-1}\). In this case, when choosing step-sizes \((\gamma _t)_{t\geqslant 1}\) to saturate recursively the last relation in (29), specifically,

$$\begin{aligned} \gamma ^2_{t+1}L-\gamma _{t+1}= \gamma _t^2L \end{aligned}$$

we obtain the celebrated Nesterov step-sizes (15), which satisfy \(\gamma _t\nu _t^{-1}\geqslant {(t+1)^2\over 4L}\), and we arrive at (16). \(\square \)
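
For concreteness (a numerical check added here, not part of the proof), the saturated recursion can be solved for its positive root,

$$\begin{aligned} \gamma _{t+1}=\frac{1+\sqrt{1+4L^2\gamma _t^2}}{2L},\qquad \gamma _1=\frac{1}{L}, \end{aligned}$$

which for \(L=1\) gives \(\gamma _1=1\), \(\gamma _2=\frac{1+\sqrt{5}}{2}\approx 1.618\), \(\gamma _3\approx 2.194\). A simple induction shows that \(\gamma _t\geqslant \frac{t+1}{2L}\): indeed \(\gamma _1=\frac{2}{2L}\) and \(\gamma _{t+1}\geqslant \frac{1+2L\gamma _t}{2L}\geqslant \frac{t+2}{2L}\). Since \(\nu _t=(\gamma _tL)^{-1}\), this yields \(\gamma _t\nu _t^{-1}=L\gamma _t^2\geqslant \frac{(t+1)^2}{4L}\), which is exactly the property of the step-sizes used above.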

About this article

Cite this article

Juditsky, A., Kwon, J. & Moulines, É. Unifying mirror descent and dual averaging. Math. Program. 199, 793–830 (2023). https://doi.org/10.1007/s10107-022-01850-3
