
An online convex optimization-based framework for convex bilevel optimization


Abstract

We propose a new framework for solving the convex bilevel optimization problem, where one optimizes a convex objective over the optimal solutions of another convex optimization problem. As a key step of our framework, we form an online convex optimization (OCO) problem in which the objective function remains fixed while the domain changes over time. We note that the structure of our OCO problem is different from the classical OCO problem that has been intensively studied in the literature. We first develop two OCO algorithms that work under different assumptions and provide their theoretical convergence rates. Our first algorithm works under minimal convexity assumptions, while our second algorithm is equipped to exploit structural information on the objective function, including smoothness, lack of first-order smoothness, and strong convexity. We then carefully translate our OCO results into their counterparts for solving the convex bilevel problem. In the context of convex bilevel optimization, our results lead to rates of convergence in terms of both inner and outer objective functions simultaneously, and in particular without assuming strong convexity in the outer objective function. Specifically, after T iterations, our first algorithm achieves an \(O(T^{-1/3})\) error bound in both levels, and this is further improved to \(O(T^{-1/2})\) by our second algorithm. We illustrate the numerical efficacy of our algorithms on standard linear inverse problems and a large-scale text classification problem.


References

1. Amini, M., Yousefian, F.: An iterative regularized incremental projected subgradient method for a class of bilevel optimization problems. In: 2019 American Control Conference (ACC), pp. 4069–4074, Philadelphia, PA, USA (2019). IEEE

2. Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA (2017)

3. Beck, A., Sabach, S.: A first order method for finding minimal norm-like solutions of convex optimization problems. Math. Program. 147(1–2), 25–46 (2014)

4. Cabot, A.: Proximal point algorithm controlled by a slowly vanishing term: applications to hierarchical minimization. SIAM J. Optim. 15(2), 555–572 (2005)

5. Dutta, J., Pandit, T.: Algorithms for Simple Bilevel Programming, pp. 253–291. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-52119-6_9 (ISBN 978-3-030-52119-6)

6. Garrigos, G., Rosasco, L., Villa, S.: Iterative regularization via dual diagonal descent. J. Math. Imaging Vis. 60(2), 189–215 (2018)

7. Hansen, P.C.: Regularization tools: a Matlab package for analysis and solution of discrete ill-posed problems. Numer. Algorithms 6(1), 1–35 (1994)

8. Helou, E.S., Simões, L.E.A.: \(\epsilon \)-subgradient algorithms for bilevel convex optimization. Inverse Prob. 33(5), 055020 (2017)

9. Juditsky, A., Nemirovski, A.: First-order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning. The MIT Press, Cambridge (2011) (ISBN 978-0-262-29877-3)

10. Kaushik, H.D., Yousefian, F.: A method with convergence rates for optimization problems with variational inequality constraints. SIAM J. Optim. 31(3), 2171–2198 (2021)

11. Koshal, J., Nedić, A., Shanbhag, U.V.: Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Trans. Autom. Control 58(3), 594–609 (2013)

12. Liu, S., Vicente, L.N.: Accuracy and fairness trade-offs in machine learning: a stochastic multi-objective approach (2020). arXiv:2008.01132

13. Nedić, A., Özdağlar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Optim. 19(4), 1757–1780 (2009)

14. Neely, M.J., Yu, H.: Online convex optimization with time-varying constraints (2017). arXiv:1702.04783v2

15. Nesterov, Y.: Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer, Berlin (2018)

16. Sabach, S., Shtern, S.: A first order method for solving convex bilevel optimization problems. SIAM J. Optim. 27(2), 640–660 (2017)

17. Solodov, M.: An explicit descent method for bilevel convex optimization. J. Convex Anal. 14(2), 227–237 (2007)

18. Solodov, M.V.: A bundle method for a class of bilevel nonsmooth convex minimization problems. SIAM J. Optim. 18(1), 242–259 (2007). https://doi.org/10.1137/050647566

19. Yousefian, F., Nedić, A., Shanbhag, U.V.: On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Math. Program. 165(1), 391–431 (2017)

20. Yu, H., Neely, M.J.: A primal-dual type algorithm with the \({O}(1/t)\) convergence rate for large scale constrained convex programs. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1900–1905, Las Vegas, NV, USA (2016). IEEE

21. Yu, H., Neely, M.J.: A low complexity algorithm with \({O}(\sqrt{T})\) regret and \({O}(1)\) constraint violations for online convex optimization with long term constraints. J. Mach. Learn. Res. 21(1), 1–24 (2020)

22. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)


Acknowledgements

The authors wish to thank the review team for their constructive feedback that improved the presentation of the material in this paper.

Author information

Corresponding author

Correspondence to Fatma Kılınç-Karzan.


Appendices

Fundamental results for mirror descent-type updates

In this appendix, we present some classical results on mirror descent-type updates originating from the setup in Sect. 2.2 that we use in our analysis.

Lemma 5

([2, Lemma 9.11]) For any \(x,y,z\in X\), we have

$$\begin{aligned} \langle \nabla \omega (y)-\nabla \omega (x),z-y\rangle&= V_x(z) - V_y(z) - V_x(y). \end{aligned}$$

Remark 7

In the particular case of Euclidean setup, Lemma 5 implies that for any \(x,y,z\in X\), we have

$$\begin{aligned} 2\langle y-x,z-y\rangle&= \Vert x-z\Vert _2^2 - \Vert y-z\Vert _2^2 - \Vert y-x\Vert _2^2. \end{aligned}$$

\(\square \)
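
To spell out how Remark 7 follows from Lemma 5, consider the Euclidean setup in which the d.g.f. is \(\omega (x)=\frac{1}{2}\Vert x\Vert _2^2\), so that \(\nabla \omega (x)=x\) and \(V_x(z)=\frac{1}{2}\Vert x-z\Vert _2^2\) (the same setup used for the dual step following Lemma 8). In this case Lemma 5 reads

$$\begin{aligned} \langle y-x,z-y\rangle&= \tfrac{1}{2}\Vert x-z\Vert _2^2 - \tfrac{1}{2}\Vert y-z\Vert _2^2 - \tfrac{1}{2}\Vert x-y\Vert _2^2, \end{aligned}$$

and multiplying both sides by 2 gives the identity stated in Remark 7.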

Next, recall the following summary of the classical results related to a single iteration of the proximal mapping.

Lemma 6

Suppose \(x_{t+1}={{\,\mathrm{Prox}\,}}_{x_t}(\gamma _t\xi _t)\). Then, for any \(x\in X\),

  1. (a)

    \(\Vert x_{t+1}-x_t\Vert \le \gamma _t\Vert \xi _t\Vert _*\).

  2. (b)

    \(\gamma _t\langle \xi _t,x_{t+1}-x\rangle \le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1})\).

  3. (c)

    \(\gamma _t\langle \xi _t,x_t-x\rangle \le V_{x_t}(x)-V_{x_{t+1}}(x) +\frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2\).

Proof

  1. (a)

    By the definition of the proximal mapping,

    $$\begin{aligned} x_{t+1} = \mathop {{\mathrm{arg\,min}}}\limits _{y\in X}\{V_{x_t}(y)+\langle \gamma _t\xi _t,y\rangle \}. \end{aligned}$$

    The optimality condition is that, for any \(x\in X\),

    $$\begin{aligned} \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t)+\gamma _t\xi _t,x-x_{t+1}\rangle \ge 0. \end{aligned}$$
    (18)

    Plugging in \(x=x_t\) and rearranging, we obtain

    $$\begin{aligned} \langle \gamma _t\xi _t,x_t-x_{t+1}\rangle&\ge \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x_{t+1}-x_t\rangle . \end{aligned}$$

    By Cauchy–Schwarz inequality, \(\langle \gamma _t\xi _t,x_t-x_{t+1}\rangle \le \Vert \gamma _t\xi _t\Vert _*\Vert x_t-x_{t+1}\Vert \). Since \(\omega \) is 1-strongly convex with respect to \(\Vert \cdot \Vert \), we have \(\langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x_{t+1}-x_t\rangle \ge \Vert x_{t+1}-x_t\Vert ^2\). Therefore, we conclude that \(\gamma _t\Vert \xi _t\Vert _*\ge \Vert x_{t+1}-x_t\Vert \).

  2. (b)

    Rearranging (18), we have

    $$\begin{aligned} \langle \gamma _t\xi _t,x_{t+1}-x\rangle&\le \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x-x_{t+1}\rangle . \end{aligned}$$

    Applying Lemma 5 to the right hand side gives the result.

  3. (c)

    Following part (b), we have

    $$\begin{aligned} \gamma _t\langle \xi _t,x_t-x\rangle&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \gamma _t\langle \xi _t,x_t-x_{t+1}\rangle \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \gamma _t\Vert \xi _t\Vert _*\Vert x_t-x_{t+1}\Vert \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2 + \frac{1}{2}\Vert x_t-x_{t+1}\Vert ^2 \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) + \frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2, \end{aligned}$$

    where we use Cauchy–Schwarz inequality, the fact that \(ab\le \frac{1}{2}(a^2+b^2)\) and \(\frac{1}{2}\Vert x_t-x_{t+1}\Vert ^2\le V_{x_t}(x_{t+1})\).

\(\square \)
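
To make the proximal mapping concrete, here is a minimal Python sketch (not part of the paper) of \({{\,\mathrm{Prox}\,}}_{x_t}(\gamma _t\xi _t)\) in the Euclidean setup, where the mapping reduces to a projection, \({{\,\mathrm{Prox}\,}}_x(\gamma \xi )=\text {Proj}_X(x-\gamma \xi )\). The choice of X as a Euclidean ball and the random data are illustrative assumptions; the final assertion numerically checks the bound of Lemma 6(a).

    import numpy as np

    def proj_ball(y, radius):
        # Euclidean projection onto X = {x : ||x||_2 <= radius}
        nrm = np.linalg.norm(y)
        return y if nrm <= radius else (radius / nrm) * y

    def prox(x, v, radius):
        # Euclidean-setup proximal mapping: Prox_x(v) = argmin_{y in X} { V_x(y) + <v, y> }
        # which equals Proj_X(x - v) when V_x(y) = 0.5 * ||x - y||_2^2
        return proj_ball(x - v, radius)

    rng = np.random.default_rng(0)
    radius, gamma_t = 1.0, 0.1
    x_t = proj_ball(rng.standard_normal(5), radius)
    xi_t = rng.standard_normal(5)
    x_next = prox(x_t, gamma_t * xi_t, radius)

    # Lemma 6(a): ||x_{t+1} - x_t|| <= gamma_t * ||xi_t||_*  (both norms Euclidean here)
    assert np.linalg.norm(x_next - x_t) <= gamma_t * np.linalg.norm(xi_t) + 1e-12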

In our developments we also use the following lemma relating the Bregman distance \(V_x(x')\) and the subgradient inequality of an arbitrary convex function under a variety of structural assumptions on \(f_t\).

Lemma 7

The following statements hold for convex functions \(\{f_t\}_{t\in [T]}\).

  1. (a)

    For any \(x\in X\) and \(t\in [T]\),

    $$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t} V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)]. \end{aligned}$$
  2. (b)

    Under Assumption 8, for any \(x\in X\) and \(t\in [T]\), we have

    $$\begin{aligned} \langle \nabla f_{t+1}(x_t),x_{t+1}-x\rangle&\ge f_{t+1}(x_{t+1}) - f_{t+1}(x) - H_fV_{x_t}(x_{t+1}). \end{aligned}$$
  3. (c)

    Under Assumption 9, for any \(x\in X\) and \(t\in [T]\), we have

$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t}V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)] + cV_{x_t}(x). \end{aligned}$$

Proof

  1. (a)

    Starting from the subgradient inequality, we apply Cauchy–Schwarz inequality, and the fact that \(a^2+b^2\ge 2ab\), to obtain

    $$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&= \langle \nabla f_t(x_t),[x_{t+1}-x_t]+[x_t-x]\rangle \\&\ge \langle \nabla f_t(x_t),x_{t+1}-x_t\rangle + [f_t(x_t)-f_t(x)] \\&\ge -\Vert \nabla f_t(x_t)\Vert _*\Vert x_{t+1}-x_t\Vert + [f_t(x_t)-f_t(x)] \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{4\gamma _t}\Vert x_{t+1}-x_t\Vert ^2 + [f_t(x_t)-f_t(x)] \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t} V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)], \end{aligned}$$

    where the last step follows from \(\frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\le V_{x_t}(x_{t+1})\).

  2. (b)

    Combining the subgradient inequality and Assumption 8,

    $$\begin{aligned} \langle \nabla f_{t+1}(x_t),x_{t+1}-x\rangle&= \langle \nabla f_{t+1}(x_t),[x_{t+1}-x_t]-[x-x_t]\rangle \\&\ge \left[ f_{t+1}(x_{t+1})-f_{t+1}(x_t)-\frac{1}{2}H_f\Vert x_{t+1}-x_t\Vert ^2\right] \\&\quad - [f_{t+1}(x)-f_{t+1}(x_t)] \\&= f_{t+1}(x_{t+1}) - f_{t+1}(x) - \frac{1}{2}H_f\Vert x_{t+1}-x_t\Vert ^2 \\&\ge f_{t+1}(x_{t+1}) - f_{t+1}(x) - H_fV_{x_t}(x_{t+1}). \end{aligned}$$
  3. (c)

    Applying Assumption 9,

$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&= \langle \nabla f_t(x_t),[x_{t+1}-x_t]+[x_t-x]\rangle \\&\ge \langle \nabla f_t(x_t),x_{t+1}-x_t\rangle + [f_t(x_t)-f_t(x)] + cV_{x_t}(x) \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t}V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)] + cV_{x_t}(x), \end{aligned}$$

    where the last step follows the same analysis as in Lemma 7(a).

\(\square \)

Proofs for Section 4

We first give the per-step analysis of Algorithm 1. This serves as an important ingredient for Theorem 1 that provides a bound for the objective regret term in (5).

Lemma 8

For any \(t\in [T]\) and \(x\in X\), in Algorithm 1, we have

$$\begin{aligned} f_t(x_t)-f_t(x) - \sum _{i\in [k]}\lambda _{t,i}g_{t,i}(x)&\le \frac{1}{\gamma _t}V_{x_t}(x) - \frac{1}{\gamma _t}V_{x_{t+1}}(x) + \frac{\gamma _t}{2}\Vert \nabla f_t(x_t)\Vert _*^2 \\&\quad + \frac{1}{2\beta _t}\Vert \lambda _t\Vert _2^2 - \frac{1}{2\beta _t}\Vert \lambda _{t+1}\Vert _2^2 + \frac{\beta _t}{2}\Vert \zeta _t\Vert _2^2 . \end{aligned}$$

Proof

Fix \(t\in [T]\). It follows from Lemma 6(b) that for any \(x\in X\)

$$\begin{aligned} \gamma _t\langle \nabla _xL_t(x_t,\lambda _t), x_{t+1}-x\rangle&\le V_{x_t}(x) - V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}). \end{aligned}$$
(19)

Recall that \(\nabla _xL_t(x_t,\lambda _t)=\nabla f_t(x_t)+\sum _{i\in [k]}\lambda _{t,i}\nabla g_{t,i}(x_t)\). Moreover,

$$\begin{aligned} \langle \nabla g_{t,i}(x_t),x_{t+1}-x\rangle&= \langle \nabla g_{t,i}(x_t),x_t-x\rangle + \langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle \\&\ge g_{t,i}(x_t)-g_{t,i}(x) + \langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle , \end{aligned}$$

where the last step follows from the subgradient inequality associated with \(g_{t,i}(x)\), i.e., \( \langle \nabla g_{t,i}(x_t),x_t-x\rangle \ge g_{t,i}(x_t)-g_{t,i}(x). \) Plugging in the definition of \(\nabla _xL_t(x_t,\lambda _t)\) along with the above relation in (19) then results in

$$\begin{aligned}&\gamma _t\langle \nabla f_t(x_t),x_{t+1}-x\rangle + \gamma _t\sum _{i\in [k]}\lambda _{t,i}[-g_{t,i}(x)+g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ] \nonumber \\&\quad \le V_{x_t}(x) - V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}). \end{aligned}$$
(20)

Moreover, using the Cauchy–Schwarz inequality and the fact that \(ab\le a^2/2+b^2/2\), we deduce

$$\begin{aligned} -\gamma _t\langle \nabla f_t(x_t),x_{t+1}-x_t\rangle&\le \gamma _t\Vert \nabla f_t(x_t)\Vert _*\Vert x_{t+1}-x_t\Vert \\&\le \frac{\gamma _t^2}{2}\Vert \nabla f_t(x_t)\Vert _*^2 + \frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2 \\&\le \frac{\gamma _t^2}{2}\Vert \nabla f_t(x_t)\Vert _*^2 + V_{x_t}(x_{t+1}). \end{aligned}$$

By summing this relation with (20) and the subgradient inequality associated with \(f_{t}(x)\), i.e., \( \langle \nabla f_{t}(x_t),x_t-x\rangle \ge f_{t}(x_t)-f_{t}(x) \), we arrive at the following conclusion from the primal step

$$\begin{aligned}&\gamma _t[f_t(x_t)-f_t(x)] + \gamma _t\sum _{i\in [k]}\lambda _{t,i}[-g_{t,i}(x)+g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ] \\&\quad \le V_{x_t}(x) - V_{x_{t+1}}(x) + \frac{\gamma _t^2}{2}\Vert \nabla f_t(x_t)\Vert _*^2. \end{aligned}$$

Recall from Remark 4 that the dual step can be viewed as the projected subgradient update, i.e.,

$$\begin{aligned} \lambda _{t+1} = {{\,\mathrm{Proj}\,}}_{{{\mathbb {R}}}^k_+}(\lambda _t+\beta _t\zeta _t). \end{aligned}$$

Thus, the dual update is an MD update on a domain equipped with the Euclidean setup, i.e., the d.g.f. is selected to be the squared Euclidean norm and \(V_x(y)=\frac{1}{2}\Vert x-y\Vert ^2\). Then, it follows from Lemma 6(c) that

$$\begin{aligned} \beta _t\sum _{i\in [k]}\lambda _{t,i}[g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ]&\ge \frac{1}{2}\Vert \lambda _{t+1}\Vert _2^2 - \frac{1}{2}\Vert \lambda _t\Vert _2^2 - \frac{\beta _t^2}{2}\Vert \zeta _t\Vert _2^2. \end{aligned}$$

Dividing the inequality for the primal by \(\gamma _t\) and dividing the above inequality for the dual step by \(\beta _t\) and then summing them up leads to the desired conclusion. \(\square \)

Theorem 1

Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\). Then, under Assumptions 1 and 2, for any \(x^*\in X_0\), Algorithm 1 guarantees the following objective regret bound

$$\begin{aligned} {{{\mathcal {R}}}}\left( \left\{ x_t\right\} _{t \in [T]}; x^* \right) = \frac{1}{T}\sum _{t\in [T]}[f_t(x_t)-f_t(x^*)]&\le \frac{\varOmega ^2}{2\gamma _0T} + \frac{\gamma _0G^2}{2} + \frac{\beta _0k(F+G\varOmega )^2}{2} . \end{aligned}$$

Proof

Note that for any \(x\in X_0\), we have \(\sum _{i\in [k]}\lambda _{t,i}g_{t,i}(x)\le 0\). Then, by summing up the inequality in Lemma 8 over \(t\in [T]\) and noting also that the right hand side telescopes, we obtain

$$\begin{aligned} \sum _{t\in [T]}[f_t(x_t)-f_t(x)]&\le \frac{1}{\gamma _0}V_{x_1}(x) - \frac{1}{\gamma _0}V_{x_{T+1}}(x) + \frac{\gamma _0}{2}\sum _{t\in [T]}\Vert \nabla f_t(x_t)\Vert _*^2 \\&\quad + \frac{1}{2\beta _0}\Vert \lambda _1\Vert _2^2 - \frac{1}{2\beta _0}\Vert \lambda _{T+1}\Vert _2^2 + \frac{\beta _0}{2}\sum _{t\in [T]}\Vert \zeta _t\Vert _2^2 \\&\le \frac{\varOmega ^2}{2\gamma _0} + \frac{\gamma _0G^2}{2}T + \frac{\beta _0k(F+G\varOmega )^2}{2}T. \end{aligned}$$

In the last inequality, we drop nonpositive terms, note that by definition \(\lambda _1=0\), and use the bounds from Assumptions 1 and 2. In particular, when bounding \(\zeta _{t,i} = g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ,\,\forall i\in [k]\), recall from Remark 2 that we have \(\Vert x_{t+1}-x_t\Vert \le \sqrt{2V_{x_t}(x_{t+1})}\le \varOmega \). \(\square \)

We next establish a bound on the value of the dual variable \(\lambda _t\) over time when running Algorithm 1. This is a key component in the analysis of Algorithm 1, and it is critical in providing a bound for the constraint violation in Theorem 2. We define the notation

$$\begin{aligned} \varLambda _0(\tau )&:=\beta _0\sqrt{k}(F+G\varOmega )\tau + \frac{\varOmega ^2}{2\gamma _0r\tau } + \frac{G\varOmega }{r} + \frac{\beta _0k(F+G\varOmega )^2}{2r}, \quad \tau \in [T]\\ \varLambda _0&:=\sqrt{\frac{2\beta _0\sqrt{k}(F+G\varOmega )\varOmega ^2}{\gamma _0r}} + \frac{G\varOmega }{r} + \frac{\beta _0k(F+G\varOmega )^2}{2r}. \end{aligned}$$

Note that by comparing the first two terms in \(\varLambda _0(\tau )\) and the first term in \(\varLambda _0\) and using the arithmetic-geometric mean inequality, we deduce that \(\varLambda _0(\tau ) \le \varLambda _0\) holds for any \(\tau \in [T]\).

Lemma 9

Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\) in Algorithm 1. Under Assumptions 1 to 3, we have

$$\begin{aligned}&\Vert \lambda _t\Vert _2 \le \varLambda _0(\tau ), \quad \forall \tau \in [T],\,\forall t\in [T] \end{aligned}$$
(21)
$$\begin{aligned}&\quad \implies \Vert \lambda _t\Vert _2 \le \varLambda _0, \quad \forall t\in [T]. \end{aligned}$$
(22)

Proof

Based on the dual update, we have

$$\begin{aligned} \Vert \lambda _{t+1}\Vert _2&\le \Vert \lambda _t + \beta _t\zeta _t\Vert _2 \le \Vert \lambda _t\Vert _2 + \beta _t\Vert \zeta _t\Vert _2, \end{aligned}$$

and

$$\begin{aligned} \Vert \lambda _{t+1}\Vert _2^2&\le \Vert \lambda _t + \beta _t\zeta _t\Vert _2^2 = \Vert \lambda _t\Vert _2^2 + 2\beta _t\langle \lambda _t,\zeta _t\rangle + \beta _t^2\Vert \zeta _t\Vert _2^2. \end{aligned}$$

The inner product term can be bounded using (20) and plugging in \(x={{\hat{x}}}\in X\) as follows.

$$\begin{aligned} \beta _t\langle \lambda _t,\zeta _t\rangle&=\beta _t\sum _{i\in [k]}\lambda _{t,i}[g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ] \\&\le \beta _t\sum _{i\in [k]}\lambda _{t,i}g_{t,i}({{\hat{x}}}) - \beta _t\langle \nabla f_t(x_t),x_{t+1}-{{\hat{x}}}\rangle + \frac{\beta _t}{\gamma _t}[V_{x_t}({{\hat{x}}}) - V_{x_{t+1}}({{\hat{x}}})] \\&\le - \beta _tr\Vert \lambda _t\Vert _2 + \beta _tG\varOmega + \frac{\beta _t}{\gamma _t}[V_{x_t}({{\hat{x}}}) - V_{x_{t+1}}({{\hat{x}}})], \end{aligned}$$

where the last step follows from Assumptions 1 to 3. Under the same assumptions, we also have

$$\begin{aligned} \beta _t^2\Vert \zeta _t\Vert _2^2&\le \beta _t^2k(F+G\varOmega )^2. \end{aligned}$$

Thus, we obtain

$$\begin{aligned} \Vert \lambda _{t+1}\Vert _2\le \Vert \lambda _t\Vert _2+\beta _t\sqrt{k}(F+G\varOmega ), \end{aligned}$$
(23)

and

$$\begin{aligned} \Vert \lambda _{t+1}\Vert _2^2&\le \Vert \lambda _t\Vert _2^2 - 2\beta _tr\Vert \lambda _t\Vert _2 + 2\beta _tG\varOmega + 2\frac{\beta _t}{\gamma _t}[V_{x_t}({{\hat{x}}}) - V_{x_{t+1}}({{\hat{x}}})] + \beta _t^2k(F+G\varOmega )^2. \end{aligned}$$

For any \(\tau \in [t-1]\), by summing up the above inequality for \(\tau '=t-\tau ,\dots ,t-1\), and noting that \(\beta _t=\beta _0\) and \(\gamma _t=\gamma _0\), we arrive at

$$\begin{aligned} \Vert \lambda _t\Vert _2^2&\le \Vert \lambda _{t-\tau }\Vert _2^2-2\beta _0r\sum _{\tau '\in [\tau ]}\Vert \lambda _{t-\tau '}\Vert _2 + 2\beta _0G\varOmega \tau \\&\quad + 2\frac{\beta _0}{\gamma _0}[V_{x_{t-\tau }}({{\hat{x}}})-V_{x_t}({{\hat{x}}})] + \beta _0^2k(F+G\varOmega )^2\tau \\&\le \Vert \lambda _{t-\tau }\Vert _2^2-2\beta _0r\sum _{\tau '\in [\tau ]}[\Vert \lambda _t\Vert _2-\tau '\beta _0\sqrt{k}(F+G\varOmega )] + 2\beta _0G\varOmega \tau \\&\quad + \frac{\beta _0\varOmega ^2}{\gamma _0}+ \beta _0^2k(F+G\varOmega )^2\tau \\&= \Vert \lambda _{t-\tau }\Vert _2^2-2\beta _0r\tau \Vert \lambda _t\Vert _2+\beta _0^2r\sqrt{k}(F+G\varOmega )\tau (\tau +1) \\&\quad + 2\beta _0G\varOmega \tau + \frac{\beta _0\varOmega ^2}{\gamma _0}+ \beta _0^2k(F+G\varOmega )^2\tau \\&= \Vert \lambda _{t-\tau }\Vert _2^2-2\beta _0r\tau \Vert \lambda _t\Vert _2+2\beta _0 r \tau \varLambda _0(\tau ) - \beta _0^2r\sqrt{k}(F+G\varOmega )(\tau ^2-\tau ), \end{aligned}$$

where the second inequality follows from \(0\le V_x(y)\le \frac{1}{2}\varOmega ^2\) for any \(x,y\in X\), relation (23) implying \(\Vert \lambda _{t-\tau '}\Vert \ge \Vert \lambda _t\Vert _2 -\tau '\beta _0\sqrt{k}(F+G\varOmega )\), and part of the terms telescoping since \(\gamma _t=\gamma _0,\,\beta _t=\beta _0\). The last equation above follows from the definition of \(\varLambda _0(\tau )\) in (21). Hence, by dropping the last term which is nonpositive and rearranging terms, we deduce that the relation

$$\begin{aligned} \Vert \lambda _t\Vert _2^2 + 2\beta _0r\tau \Vert \lambda _t\Vert _2 \le \Vert \lambda _{t-\tau }\Vert _2^2+2\beta _0 r \tau \varLambda _0(\tau ) \end{aligned}$$
(24)

holds for all \(\tau \in [t-1]\) and \(t\in [T]\).

As a last step, we will show by induction that \(\Vert \lambda _t\Vert _2\le \varLambda _0(\tau )\) for any \(\tau \in [T]\) fixed and \(t\in [T]\). The base case holds since \(\lambda _1=0\) and \(\varLambda _0(\tau )\ge 0\) for any \(\tau \). For \(t\le \tau \), by (23) we have \(\Vert \lambda _t\Vert _2\le \Vert \lambda _1\Vert _2+t\beta _0\sqrt{k}(F+G\varOmega )\le \varLambda _0(\tau )\). For \(t>\tau \), assuming by induction hypothesis that \(\Vert \lambda _{t-\tau }\Vert _2\le \varLambda _0(\tau )\) and using (24), we arrive at

$$\begin{aligned} \Vert \lambda _t\Vert _2^2 + 2\beta _0r\tau \Vert \lambda _t\Vert _2&\le \varLambda _0(\tau )^2 + 2\beta _0r\tau \varLambda _0(\tau ). \end{aligned}$$

This implies that \(\Vert \lambda _t\Vert _2\le \varLambda _0(\tau )\) as desired. \(\square \)

Theorem 2

Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\). Then, under Assumptions 1 to 3, for any \(x^*\in X_0\), Algorithm 1 guarantees the following constraint violation bound

$$\begin{aligned} {{{\mathcal {V}}}}_i\left( \left\{ x_t\right\} _{t \in [T]} \right) = \frac{1}{T}\sum _{t\in [T]}g_{t,i}(x_t)&\le \frac{\varLambda _0}{\beta _0T} + \gamma _0G^2(\sqrt{k}\varLambda _0+1), \quad \forall i \in [k]. \end{aligned}$$

Proof

Consider any \(t\in [T]\). From the dual update step in Algorithm 1, we have

$$\begin{aligned} \lambda _{t+1,i} - \lambda _{t,i}&\ge \beta _t \zeta _{t,i}=\beta _t[g_{t,i}(x_t) + \langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ] \\&\ge \beta _tg_{t,i}(x_t) - \beta _t\Vert \nabla g_{t,i}(x_t)\Vert _*\Vert x_{t+1}-x_t\Vert . \end{aligned}$$

It follows from Lemma 6(a) that

$$\begin{aligned} \Vert x_{t+1}-x_t\Vert&\le \Vert \gamma _t\nabla _xL_t(x_t,\lambda _t)\Vert _* \\&\le \gamma _t\Vert \nabla f_t(x_t)\Vert _*+\gamma _t\sum _{i\in [k]}\lambda _{t,i}\Vert \nabla g_{t,i}(x_t)\Vert _* \\&\le \gamma _tG+ \gamma _t \sqrt{k} \varLambda _0G, \end{aligned}$$

where the second inequality follows from the definition of \(L_t(x_t,\lambda _t)\) and the last one from the relation \(\Vert \lambda _t\Vert _1\le \sqrt{k}\Vert \lambda _t\Vert _2\), Assumptions 1 and 2, and Lemma 9. Combining these two inequalities, we arrive at

$$\begin{aligned} g_{t,i}(x_t)&\le \frac{1}{\beta _t}[\lambda _{t+1,i}-\lambda _{t,i}] + \gamma _tG^2(1+ \sqrt{k} \varLambda _0),\quad \forall i\in [k]. \end{aligned}$$

Summing this inequality over \(t\in [T]\), and using the fact that \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all t, we observe that the right hand side telescopes. Then, from Lemma 9, we deduce that

$$\begin{aligned} \sum _{t\in [T]}g_{t,i}(x_t)&\le \frac{\varLambda _0}{\beta _0} + \gamma _0G^2(1+ \sqrt{k} \varLambda _0)T\quad \forall i\in [k]. \end{aligned}$$

\(\square \)

Corollary 1

Let \(\gamma _0 = T^{-(1/2+\delta )}\), \(\beta _0 = T^{-(1/2-\delta )}\), where \(-1/2< \delta < 1/2\). Then, for any \(x^*\in X_0\), under Assumptions 1 to 3, Algorithm 1 with \(\gamma _t = \gamma _0\), \(\beta _t = \beta _0\) for \(t \in [T]\) leads to

$$\begin{aligned} {{{\mathcal {R}}}}\left( \{x_t\}_{t \in [T]}; x^* \right)&\le O\left( T^{-(1/2+\delta )} + T^{-(1/2-\delta )} \right) ,\\ {{{\mathcal {V}}}}\left( \{x_t\}_{t \in [T]} \right)&\le O\left( \frac{T^{-1/2}}{\sqrt{r}} + \frac{T^{-(1/2+\delta )}}{r} + T^{-(1/2+\delta )} \right) . \end{aligned}$$

Proof

We hide constants \(k,F,G, \varOmega \) in the bounds. Thus,

$$\begin{aligned} \beta _0 T&= T^{1/2 + \delta } = 1/\gamma _0, \qquad \gamma _0 T = T^{1/2 - \delta } = 1/\beta _0,\\ \varLambda _0&= O\left( T^\delta /\sqrt{r} + 1/r + T^{-(1/2-\delta )}/r \right) ,\\ \gamma _0 \varLambda _0&= O\left( T^{-1/2}/\sqrt{r} + T^{-(1/2+\delta )}/r + T^{-1}/r \right) = O\left( T^{-1/2}/\sqrt{r} + T^{-(1/2+\delta )}/r \right) \\ \frac{\varLambda _0}{\beta _0 T}&= \gamma _0 \varLambda _0 = O\left( T^{-1/2}/\sqrt{r} + T^{-(1/2+\delta )}/r \right) . \end{aligned}$$

Therefore,

$$\begin{aligned} {{{\mathcal {R}}}}\left( \{x_t\}_{t \in [T]}; x^* \right)&\le O(\gamma _0 + \beta _0) = O\left( T^{-(1/2+\delta )} + T^{-(1/2-\delta )} \right) \\ {{{\mathcal {V}}}}\left( \{x_t\}_{t \in [T]} \right)&\le O(\gamma _0 \varLambda _0 + \gamma _0) = O\left( \frac{T^{-1/2}}{\sqrt{r}} + \frac{T^{-(1/2+\delta )}}{r} + T^{-(1/2+\delta )} \right) . \end{aligned}$$

\(\square \)
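
To summarize the analysis of this section in executable form, the following Python sketch runs the primal-dual updates used above (the primal proximal step on \(\nabla _xL_t(x_t,\lambda _t)\) and the dual projected-subgradient step on \(\zeta _t\)) in the Euclidean setup with the constant stepsizes of Corollary 1. The oracle handles grad_f, g_vec, grad_g and the projection proj_X are user-supplied placeholders; this is a hedged illustration of the loop's structure, not the paper's reference implementation.

    import numpy as np

    def primal_dual_oco_sketch(x1, T, k, delta, grad_f, g_vec, grad_g, proj_X):
        # grad_f(t, x): gradient of f_t at x;  g_vec(t, x): (g_{t,1}(x), ..., g_{t,k}(x));
        # grad_g(t, x): k-by-n matrix whose rows are the constraint gradients at x.
        gamma0 = T ** (-(0.5 + delta))    # primal stepsize gamma_t = gamma_0
        beta0 = T ** (-(0.5 - delta))     # dual stepsize beta_t = beta_0
        x, lam = np.array(x1, dtype=float), np.zeros(k)   # lambda_1 = 0
        iterates = []
        for t in range(1, T + 1):
            Jg = grad_g(t, x)
            # primal step: x_{t+1} = Prox_{x_t}(gamma_0 * grad_x L_t(x_t, lambda_t))
            x_next = proj_X(x - gamma0 * (grad_f(t, x) + Jg.T @ lam))
            # dual step: lambda_{t+1} = Proj_{R_+^k}(lambda_t + beta_0 * zeta_t), where
            # zeta_{t,i} = g_{t,i}(x_t) + <grad g_{t,i}(x_t), x_{t+1} - x_t>
            zeta = g_vec(t, x) + Jg @ (x_next - x)
            lam = np.maximum(lam + beta0 * zeta, 0.0)
            iterates.append(x_next)
            x = x_next
        return iterates, lam

In the Euclidean setup proj_X plays the role of the proximal mapping; for a general d.g.f. it would be replaced by the corresponding mirror-descent prox step.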

Proofs for Section 5

1.1 Proofs for Lemmas

Lemma 1

In Algorithm 2, the following holds for all \(t\ge 1\).

  1. (a)

    \(\lambda _t-g_t(x_t)\ge 0\).

  2. (b)

    \(g_{t+1}(x_{t+1})\le [\lambda _{t+1}-g_{t+1}(x_{t+1})]-[\lambda _t-g_t(x_t)]\).

  3. (c)

    \(\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\le \Vert \lambda _t+g_{t+1}(x_{t+1})-g_t(x_t)\Vert _2^2+\Vert g_{t+1}(x_{t+1})\Vert _2^2\).

  4. (d)

    \(\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2 \ge \Vert g_{t+1}(x_{t+1})\Vert _2\), and \(\Vert \lambda _1-g_1(x_1)\Vert _2\le \Vert g_1(x_1)\Vert _2\).

Proof

By definition in Algorithm 2, we have \(\lambda _1-g_1(x_1)\ge 0\). Assume, as the induction hypothesis, that \(\lambda _t - g_t(x_t) \ge 0\) for some \(t \ge 1\). For any \(i\in [k]\), the dual update rule leads to

$$\begin{aligned} \lambda _{t+1,i}-g_{t+1,i}(x_{t+1}) = \max \{\lambda _{t,i}-g_{t,i}(x_t)+g_{t+1,i}(x_{t+1}),~ -g_{t+1,i}(x_{t+1})\}. \end{aligned}$$
(25)

For any \(i\in [k]\), if \(\lambda _{t+1,i}-g_{t+1,i}(x_{t+1}) < 0\), then \(\lambda _{t,i}-g_{t,i}(x_t)< -g_{t+1,i}(x_{t+1}) < 0\). But this contradicts the induction hypothesis \(\lambda _t-g_t(x_t) \ge 0\). Thus, we have \(\lambda _{t+1}-g_{t+1}(x_{t+1})\ge 0\). The inequalities in (b) and (c) follow directly from (25) (note that (c) follows since \(\Vert a\Vert _2^2 \le \Vert b\Vert _2^2 + \Vert c\Vert _2^2\) when \(a = \max \{b,c\}\) componentwise). The second part in (d) follows from the initialization of \(\lambda _1\) in Algorithm 2, which implies \(\lambda _{1,i}-g_{1,i}(x_1)=\max \{0,-g_{1,i}(x_1)\}\) for all \(i\in [k]\). To show the first part of (d), note that from (25) and part (a), for any \(i\in [k]\) we have

$$\begin{aligned} |\lambda _{t+1,i}-g_{t+1,i}(x_{t+1})|&= \lambda _{t+1,i}-g_{t+1,i}(x_{t+1}) \\&= \max \{\lambda _{t,i}-g_{t,i}(x_t)+g_{t+1,i}(x_{t+1}),-g_{t+1,i}(x_{t+1})\} \\&\ge \max \{g_{t+1,i}(x_{t+1}),-g_{t+1,i}(x_{t+1})\} \\&= |g_{t+1,i}(x_{t+1})|. \end{aligned}$$

\(\square \)

Lemma 2

Under Assumption 2, running Algorithm 2 guarantees for any \(t\in [T]\) that

  1. (a)
    $$\begin{aligned} -\langle \lambda _t,g_{t+1}(x_{t+1})\rangle&\le \frac{1}{2}\left[ \Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\right] \\&\quad + \frac{1}{2}\left[ \Vert g_{t+1}(x_{t+1})\Vert _2^2-\Vert g_t(x_t)\Vert _2^2\right] \\&\quad + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 + 4kG^2V_{x_t}(x_{t+1}), \end{aligned}$$
  2. (b)

    and if we further assume Assumption 5, we also have for any \(x\in X_0\) and \(t_0\in \{0,1\}\)

    $$\begin{aligned} \langle \nabla f_{t+t_0}(x_t),x_{t+1}-x\rangle&\le \frac{1}{\gamma _t}\left[ V_{x_t}(x)-V_{x_{t+1}}(x)\right] \\&\quad + \frac{1}{2}\left[ \Vert \lambda _t-g_t(x_t)\Vert _2^2- \Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\right] \\&\quad + \frac{1}{2}\left[ \Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2\right] + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 \\&\quad + \left( \sqrt{k}H_g\Vert \lambda _t\Vert _2+4kG^2-\frac{1}{\gamma _t}\right) V_{x_t}(x_{t+1}). \end{aligned}$$

Proof

We first prove part (a). For any \(t\in [T]\), expanding the inequality in Lemma 1(c), we have

$$\begin{aligned} \Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2&\le \Vert \lambda _t-g_t(x_t)\Vert _2^2 +2\langle \lambda _t-g_t(x_t),g_{t+1}(x_{t+1})\rangle + 2\Vert g_{t+1}(x_{t+1})\Vert _2^2. \end{aligned}$$
(26)

Recall from Remark 7 that by taking \(x=g_{t+1}(x_{t+1})\), \(y=0\), \(z=g_t(x_t)\), we obtain

$$\begin{aligned} -2\langle g_t(x_t),g_{t+1}(x_{t+1})\rangle&= \Vert g_{t+1}(x_{t+1})-g_t(x_t)\Vert _2^2 - \Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2. \end{aligned}$$

Plugging this into (26) and rearranging results in

$$\begin{aligned}&-2\langle \lambda _t,g_{t+1}(x_{t+1})\rangle \nonumber \\&\quad \le \Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2 + \Vert g_{t+1}(x_{t+1})-g_t(x_t)\Vert _2^2 \nonumber \\&\qquad + \Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2. \end{aligned}$$
(27)

Note also that using the subgradient inequality, the definition of the dual norm, and Assumption 2, for any \(i\in [k]\), we have the following relations

$$\begin{aligned} g_{t+1,i}(x_{t+1})-g_{t+1,i}(x_t)&\ge \langle \nabla g_{t+1,i}(x_t),x_{t+1}-x_t\rangle \\&\ge -\Vert \nabla g_{t+1,i}(x_t)\Vert _* \Vert x_{t+1}-x_t\Vert \ge -G\Vert x_{t+1}-x_t\Vert , \\ g_{t+1,i}(x_{t+1})-g_{t+1,i}(x_t)&\le \langle \nabla g_{t+1,i}(x_{t+1}),x_{t+1}-x_t\rangle \\&\le \Vert \nabla g_{t+1,i}(x_{t+1})\Vert _* \Vert x_{t+1}-x_t\Vert \le G\Vert x_{t+1}-x_t\Vert . \end{aligned}$$

These two inequalities together imply \(\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2\le \sqrt{k}G\Vert x_{t+1}-x_t\Vert \). Then, using the triangle inequality, we deduce

$$\begin{aligned} \Vert g_{t+1}(x_{t+1})-g_t(x_t)\Vert _2^2&\le [\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2+\Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2]^2 \\&\le 2\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2^2+2\Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 \\&\le 2kG^2\Vert x_{t+1}-x_t\Vert ^2 + 2\Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2, \end{aligned}$$

where the last inequality follows from the fact that \(\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2\le \sqrt{k}G\Vert x_{t+1}-x_t\Vert \). By plugging the above inequality into (27), we arrive at

$$\begin{aligned} -2\langle \lambda _t,g_{t+1}(x_{t+1})\rangle&\le \Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2 \\&\quad + 2kG^2\Vert x_{t+1}-x_t\Vert ^2 + 2\Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 \\&\quad + \Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2. \end{aligned}$$

Finally, by noting that \(\Vert x_{t+1}-x_t\Vert ^2\le 2V_{x_t}(x_{t+1})\) and rearranging the above inequality, we obtain the desired inequality for part (a).

To prove part (b), we first show that under Assumption 5, for any \(x\in X_0\) and \(t_0\in \{0,1\}\), Algorithm 2 ensures

$$\begin{aligned}&\langle \nabla f_{t+t_0}(x_t),x_{t+1}-x\rangle + \sum _{i\in [k]}\lambda _{t,i}\,g_{t+1,i}(x_{t+1})\\&\quad \le \frac{1}{\gamma _t}\left[ V_{x_t}(x)-V_{x_{t+1}}(x)\right] + \left( \sqrt{k}H_g\Vert \lambda _t\Vert _2-\frac{1}{\gamma _t}\right) V_{x_t}(x_{t+1}) \quad \forall t\in [T].\nonumber \end{aligned}$$
(28)

To see this, fix any \(x\in X_0\) and \(t_0\in \{0,1\}\). It follows from Lemma 6(b) that

$$\begin{aligned} \gamma _t\langle \xi _t,x_{t+1}-x\rangle&\le V_{x_t}(x) - V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}). \end{aligned}$$
(29)

Recall that in Algorithm 2, \(\xi _t=\nabla f_{t+t_0}(x_t)+\sum _{i\in [k]}\lambda _{t,i}\nabla g_{t+1,i}(x_t)\). Combining the subgradient inequality and Assumption 5 results in

$$\begin{aligned} \langle \nabla g_{t+1,i}(x_t),x_{t+1}-x\rangle&= \langle \nabla g_{t+1,i}(x_t),[x_{t+1}-x_t]-[x-x_t]\rangle \\&\ge [g_{t+1,i}(x_{t+1})-g_{t+1,i}(x_t) - \frac{1}{2}H_g\Vert x_{t+1}-x_t\Vert ^2] \\&\quad - [g_{t+1,i}(x)-g_{t+1,i}(x_t)] \\&= g_{t+1,i}(x_{t+1})-g_{t+1,i}(x) - \frac{1}{2}H_g\Vert x_{t+1}-x_t\Vert ^2. \end{aligned}$$

By plugging this into (29) and rearranging, we arrive at

$$\begin{aligned}&\gamma _t\langle \nabla f_{t+t_0}(x_t),x_{t+1}-x\rangle + \gamma _t\sum _{i\in [k]}\lambda _{t,i}[g_{t+1,i}(x_{t+1})-g_{t+1,i}(x)]\\&\quad \le V_{x_t}(x) - V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \frac{1}{2}\sum _{i\in [k]}\lambda _{t,i}H_g\Vert x_{t+1}-x_t\Vert ^2. \end{aligned}$$

Equation (28) then follows by noting that \(\frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\le V_{x_t}(x_{t+1})\), \(\sum _{i\in [k]}\lambda _{t,i}\le \sqrt{k}\Vert \lambda _t\Vert _2\) and that \(g_{t+1,i}(x)\le 0\) for \(x\in X_0\).

Now, part (b) follows from combining part (a) and (28). \(\square \)

In our next result, we observe a consequence of Assumption 4. Lemma 10 will play a crucial role in bounding the dual variable \(\lambda _t\) in Lemma 3.

Lemma 10

Under Assumption 4, running Algorithm 2 for any \(\tau \ge 2\), we have

$$\begin{aligned} \sum _{t-1\in [\tau -1]}[f_t(x_t^*)-f_t(x_t)]&\le \frac{M}{r}\Vert \lambda _\tau -g_\tau (x_\tau )\Vert _2. \end{aligned}$$

Proof

For all \(t\ge 1\), under Assumption 4, using the definition of saddle point \((x_t^*,\lambda _t^*)\), we have

$$\begin{aligned} f_t(x_t^*) = L_t(x_t^*,0) \le L_t(x_t^*,\lambda _t^*) \le L_t(x_t,\lambda _t^*) = f_t(x_t)+\langle \lambda _t^*,g_t(x_t)\rangle . \end{aligned}$$

Thus, \(f_t(x_t^*)-f_t(x_t) \le \langle \lambda _t^*,g_t(x_t)\rangle \). By summing over \(t-1\in [\tau ]\), for \(\tau \ge 1\), we get

$$\begin{aligned}&\sum _{t\in [\tau ]}[f_{t+1}(x_{t+1}^*)-f_{t+1}(x_{t+1})] \\&\quad \le \sum _{t\in [\tau ]}\langle \lambda _{t+1}^*,g_{t+1}(x_{t+1})\rangle \\&\quad \le \sum _{t\in [\tau ]}\langle \lambda _{t+1}^*,[\lambda _{t+1}-g_{t+1}(x_{t+1})]-[\lambda _t-g_t(x_t)]\rangle \\&\quad = \langle \lambda _{\tau +1}^*,\lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\rangle + \sum _{t-1\in [\tau -1]}\langle \lambda _t^*-\lambda _{t+1}^*,\lambda _t-g_t(x_t)\rangle - \langle \lambda _2^*,\lambda _1-g_1(x_1)\rangle \\&\quad \le \langle \lambda _{\tau +1}^*,\lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\rangle . \end{aligned}$$

Here, the second inequality follows from Lemma 1(b). The last inequality follows from Assumption 4 that guarantees \(\lambda _t^*\le \lambda _{t+1}^*\) and Lemma 1(a) that ensures \(\lambda _t-g_t(x_t)\ge 0\) for any \(t\ge 1\). Finally, applying Cauchy-Schwarz inequality and using Assumption 4 once more lead to the desired conclusion. \(\square \)

Lemma 3

Suppose for any \(\tau \in [T]\), the relation (10) holds with \(C_0(T), C_1, C_2 \ge 0\). Let \(C_3(T)\) be defined in (11), and let \(U_f(T)\) be defined in Assumption 6. For \(t\in [T]\), let \(0<\gamma _t\le \gamma _0\), where \(\gamma _0\) satisfies

$$\begin{aligned} \frac{1}{\gamma _0}&\ge \frac{1}{C_2^2}\left[ kH_g^2\varOmega ^2+2C_2(\sqrt{k}H_gC_3(T)+C_1)\right] . \end{aligned}$$
(12)

Then, under Assumptions 1 to 6, Algorithm 2 guarantees for any \(t\in [T]\),

  1. (a)

    \(\Vert \lambda _t-g_t(x_t)\Vert _2\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)-\sqrt{k}F\),

  2. (b)

    \(\Vert \lambda _t\Vert _2 \le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)\), and

  3. (c)

    \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\).

Proof

We will show by induction that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) and

$$\begin{aligned} \Vert \lambda _t-g_t(x_t)\Vert _2&\le \varLambda ' :=\frac{M}{r} + \sqrt{ \frac{M^2}{r^2} + \frac{\varOmega ^2}{\gamma _0} + kF^2 + 2C_0(T) + 2U_f(T) + 2F_0 }\\&= \frac{M}{r} + \sqrt{\frac{M^2}{r^2} + \frac{\varOmega ^2}{\gamma _0} + (C_3(T) - \sqrt{k}F - 2M/r)^2}. \end{aligned}$$

The base case holds since \(\Vert \lambda _1-g_1(x_1)\Vert _2\le \Vert g_1(x_1)\Vert _2\le \sqrt{k}F\le \varLambda '\) by definition \(\lambda _1=\max \{g_1(x_1),0\}\) and Assumption 2. Moreover,

$$\begin{aligned} \sqrt{k}H_g\Vert \lambda _1\Vert _2+C_1-C_2\frac{1}{\gamma _1}&\le \sqrt{k}H_g\sqrt{k}F+C_1-2(\sqrt{k}H_gC_3(T)+C_1)\le 0, \end{aligned}$$

where the first inequality follows as (12) implies that \(\frac{1}{\gamma _1}\ge \frac{2}{C_2}(\sqrt{k}H_gC_3(T)+C_1)\) and, since by definition \(\lambda _1=\max \{g_1(x_1),0\}\), Assumption 2 implies that \(\Vert \lambda _1\Vert _2 \le \Vert g_1(x_1)\Vert _2\le \sqrt{k}F\); the last inequality follows because \(C_3(T)\ge \sqrt{k}F\) holds by its definition. Fix any \(x^*\in X_0\). Assume by the induction hypothesis that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t} \le 0\) for all \(t\le \tau \). Then, using the induction hypothesis and the fact that \(V_x(y)\ge 0\) for all \(x,y\), for any \(t_0\in \{0,1\}\), we deduce from (10) that

$$\begin{aligned}&\sum _{t\in [\tau ]}[f_{t+t_0}(x_{t+t_0})-f_{t+t_0}(x^*)] \\&\quad \le \frac{1}{\gamma _0}V_{x_1}(x^*) + \frac{1}{2}[\Vert \lambda _1-g_1(x_1)\Vert _2^2-\Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2]\\&\qquad + \frac{1}{2}[\Vert g_{\tau +1}(x_{\tau +1})\Vert _2^2-\Vert g_1(x_1)\Vert _2^2] + C_0(T) \\&\quad \le -\frac{1}{2}\Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2 + \frac{\varOmega ^2}{2\gamma _0} + \frac{kF^2}{2} + C_0(T), \end{aligned}$$

where the last inequality follows from Lemma 1(d), Assumptions 1 and 2. By Lemma 10 and Assumption 6, for any \(t_0\in \{0,1\}\) we have

$$\begin{aligned}&\sum _{t\in [\tau ]}[f_{t+t_0}(x^*)-f_{t+t_0}(x_{t+t_0})] \\&\quad = \sum _{t\in [\tau ]}[f_{t+t_0}(x^*)-f_{t+t_0}(x_{t+t_0}^*)] + \sum _{t\in [\tau ]}[f_{t+t_0}(x_{t+t_0}^*)-f_{t+t_0}(x_{t+t_0})] \\&\quad \le U_f(T) + F_0 + \frac{M}{r}\Vert \lambda _{\tau +t_0}-g_{\tau +t_0}(x_{\tau +t_0})\Vert _2. \end{aligned}$$

By adding the above two inequalities we arrive at

$$\begin{aligned} 0&\le -\frac{1}{2}\Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2 + \frac{M}{r}\Vert \lambda _{\tau +t_0}-g_{\tau +t_0}(x_{\tau +t_0})\Vert _2 \nonumber \\&\quad + \frac{\varOmega ^2}{2\gamma _0} + \frac{kF^2}{2} + C_0(T) + U_f(T) + F_0. \end{aligned}$$
(30)

When \(t_0=1\), by completing the square in (30) we observe that \( \Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2 \le \varLambda '\). When \(t_0=0\), by induction hypothesis we have \(\Vert \lambda _\tau -g_\tau (x_\tau )\Vert _2\le \varLambda '\). Plugging this into (30), we obtain

$$\begin{aligned} \Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2&\le 2\frac{M}{r}\Vert \lambda _\tau -g_\tau (x_\tau )\Vert _2 + \frac{\varOmega ^2}{\gamma _0} + kF^2 + 2C_0(T) + 2U_f(T) + 2F_0 \\&\le 2\frac{M}{r}\varLambda ' + \left[ \left( \varLambda '-\frac{M}{r}\right) ^2-\left( \frac{M}{r}\right) ^2\right] = (\varLambda ')^2, \end{aligned}$$

where we have used the definition of \(\varLambda '\) in the second inequality as well. Therefore, we have shown by induction that \(\Vert \lambda _t-g_t(x_t)\Vert _2\le \varLambda '\) for all \(t\ge 1\). Observe that \(a^2+b^2\le (a+b)^2\) for \(a,b\ge 0\), and we have \(M/r \ge 0,~ \varOmega /\sqrt{\gamma _0} \ge 0,~ C_3(T) - \sqrt{k}F - 2M/r \ge 0\). Here the last relation holds due to the definition of \(C_3(T)\) in (11). So

$$\begin{aligned} \varLambda '&= \frac{M}{r} + \sqrt{\frac{M^2}{r^2} + \frac{\varOmega ^2}{\gamma _0} + (C_3(T) - \sqrt{k}F - 2M/r)^2} \\&\le \frac{M}{r} + \sqrt{\frac{M^2}{r^2} + (\varOmega /\sqrt{\gamma _0} + C_3(T) - \sqrt{k}F - 2M/r)^2}\\&\le \frac{M}{r} + \sqrt{(M/r + \varOmega /\sqrt{\gamma _0} + C_3(T) - \sqrt{k}F - 2M/r)^2}\\&\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T) - \sqrt{k}F \le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T). \end{aligned}$$

Thus, we deduce that for all \(t\ge 1\), \(\Vert \lambda _t-g_t(x_t)\Vert _2\le \varLambda '\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)-\sqrt{k}F\) and furthermore \(\Vert \lambda _t\Vert _2\le \Vert \lambda _t-g_t(x_t)\Vert _2+\Vert g_t(x_t)\Vert _2\le \varLambda '+ \sqrt{k}F \le \frac{\varOmega }{\sqrt{\gamma _0}}+C_3(T)\). Then, from \(\gamma _0\ge \gamma _t>0\) for all \(t\ge 1\), we have

$$\begin{aligned} \sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}&\le \sqrt{k}H_gC_3(T)+C_1+\sqrt{k}H_g\varOmega \frac{1}{\sqrt{\gamma _0}} -C_2\frac{1}{\gamma _0}. \end{aligned}$$

Solving the inequality \(\sqrt{k}H_gC_3(T)+C_1+\sqrt{k}H_g\varOmega \frac{1}{\sqrt{\gamma _0}} -C_2\frac{1}{\gamma _0}\le 0\) in terms of \(\gamma _0^{-1/2}>0\) gives

$$\begin{aligned} \frac{1}{\sqrt{\gamma _0}}&\ge \frac{1}{2C_2}\left[ \sqrt{k}H_g\varOmega +(kH_g^2\varOmega ^2+4C_2(\sqrt{k}H_gC_3(T)+C_1))^{1/2}\right] . \end{aligned}$$

Using the fact that \((a+b)^2\le 2a^2+2b^2\), for \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) to hold, we conclude that it suffices to have

$$\begin{aligned} \frac{1}{\gamma _0}&\ge \frac{1}{C_2^2}\left[ kH_g^2\varOmega ^2+2C_2(\sqrt{k}H_gC_3(T)+C_1)\right] . \end{aligned}$$

Note that this inequality is exactly the premise of this lemma. Hence, this concludes the induction step and establishes that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) holds for all \(t\in [T]\). \(\square \)

1.2 Exploiting smoothness

In this section, we exploit the additional smoothness assumption on \(\{f_t\}_{t\in [T]}\), i.e., Assumption 8, and prove Theorem 5.

Theorem 5

Suppose that Assumptions 1 to 8 hold. Let \(t_0=1,\) and for \(t\in [T]\) select \(\gamma _t=\gamma _0\) such that

$$\begin{aligned} \frac{1}{\gamma _0} \ge 2\sqrt{k}H_g\left( \sqrt{2U_g(T)}+\sqrt{2U_f(T)}+2\tfrac{M}{r}+2\sqrt{k}F\right) + kH_g^2\varOmega ^2 + 2H_f + 4kG^2. \end{aligned}$$

Then, for any \(x^*\in X_0\), running Algorithm 2 guarantees that

$$\begin{aligned}&\frac{1}{T}\sum _{t\in [T]}[f_{t+1}(x_{t+1})-f_{t+1}(x^*)] \le \frac{U_g(T)}{T} + \frac{\varOmega ^2}{2\gamma _0T}, \\&\frac{1}{T}\sum _{t\in [T]}g_{t,i}(x_t) \le \frac{\sqrt{2U_f(T)}}{T}+\frac{\sqrt{2U_g(T)}}{T} + \frac{\varOmega }{\sqrt{\gamma _0}T} + \frac{2(M+r\sqrt{k}F)}{rT}. \end{aligned}$$

Proof

Combining Lemma 7(b) with Lemma 2(b) leads to

$$\begin{aligned} f_{t+1}(x_{t+1})-f_{t+1}(x)&\le \frac{1}{\gamma _t}[V_{x_t}(x)-V_{x_{t+1}}(x)] \\&\quad + \frac{1}{2}[\Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2]\\&\quad + \frac{1}{2}[\Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2] + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 \\&\quad + \left( \sqrt{k}H_g\Vert \lambda _t\Vert _2+H_f+4kG^2-\frac{1}{\gamma _t}\right) V_{x_t}(x_{t+1}). \end{aligned}$$

Let \(\tau \in [T]\). By summing up the inequality over \(t\in [\tau ]\), we arrive at

$$\begin{aligned} \sum _{t\in [\tau ]}[f_{t+1}(x_{t+1})-f_{t+1}(x)]&\le \frac{1}{\gamma _0}V_{x_1}(x) + \frac{1}{2}[\Vert \lambda _1-g_1(x_1)\Vert _2^2-\Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2]\\&\quad + \frac{1}{2}[\Vert g_{\tau +1}(x_{\tau +1})\Vert _2^2 - \Vert g_1(x_1)\Vert _2^2] + U_g(T)\\&\quad + \sum _{t\in [\tau ]}\left( \sqrt{k}H_g\Vert \lambda _t\Vert _2+H_f+4kG^2-\frac{1}{\gamma _t} \right) V_{x_t}(x_{t+1}), \end{aligned}$$

where we note the telescoping terms due to \(\gamma _t=\gamma _0\), and apply Assumption 7. Thus, (10) is satisfied by taking \(C_0(T)=U_g(T)\), \(C_1=H_f+2kG^2\) and \(C_2=1\). From Theorem 3, we deduce

$$\begin{aligned} \sum _{t\in [T]}[f_{t+1}(x_{t+1})-f_{t+1}(x^*)]&\le \frac{\varOmega ^2}{2\gamma _0} + U_g(T), \end{aligned}$$

and

$$\begin{aligned} \sum _{t\in [T]}g_{t,i}(x_t)&\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T), \end{aligned}$$

where

$$\begin{aligned} C_3(T)&= 2\tfrac{M}{r} + \sqrt{k}F + \left( kF^2 + 2U_g(T) + 2U_f(T) \right) ^{1/2} \\&\le 2\tfrac{M}{r} + 2\sqrt{k}F + \sqrt{2U_g(T)} + \sqrt{2U_f(T)}. \end{aligned}$$

Recall that we can get rid of \(F_0\) in the case where \(t_0=1\). Moreover, (12) turns into

$$\begin{aligned} \frac{1}{\gamma _0}&\ge kH_g^2\varOmega ^2 + 2\sqrt{k}H_gC_3(T) + 2H_f+4kG^2. \end{aligned}$$

\(\square \)

1.3 Exploiting strong convexity

Theorem 6

Suppose that Assumptions 1 to 7 and 9 hold. Let \(t_0=0\), and for \(t\in [T]\) select \(\gamma _t=\min \{\frac{1}{ct},\gamma _0\}\), where

$$\begin{aligned} \frac{1}{\gamma _0}&\ge 4\sqrt{k}H_g\left( \sqrt{2U_f(T)} + \sqrt{2U_g(T)} + G\frac{\sqrt{2\ln T}}{\sqrt{c}} +2\frac{M}{r}+2\sqrt{k}F+\sqrt{2F_0}\right) \\&\quad + 4kH_g^2\varOmega ^2 + 8kG^2. \end{aligned}$$

Then, for any \(x^*\in X_0\), running Algorithm 2 guarantees that

$$\begin{aligned} \frac{1}{T}\sum _{t\in [T]}[f_t(x_t)-f_t(x^*)]&\le \frac{U_g(T)}{T} + \frac{\varOmega ^2}{2\gamma _0T} + \frac{G^2\ln T}{cT}, \\ \frac{1}{T}\sum _{t\in [T]}g_{t,i}(x_t)&\le \frac{\sqrt{2U_f(T)}}{T} + \frac{\sqrt{2U_g(T)}}{T} + \frac{\varOmega }{\sqrt{\gamma _0}T} + \frac{G\sqrt{2\ln T}}{\sqrt{c}T} \\&\quad + \frac{2(M+r\sqrt{k}F)+r\sqrt{2F_0}}{rT}. \end{aligned}$$

Proof

Plugging Lemma 7(c) into Lemma 2(b) leads to

$$\begin{aligned} f_t(x_t)-f_t(x)&\le \left( \frac{1}{\gamma _t}-c\right) V_{x_t}(x)-\frac{1}{\gamma _t}V_{x_{t+1}}(x)\\&\quad + \frac{1}{2}[\Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2] \\&\quad + \frac{1}{2}[\Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2] + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 + \gamma _tG^2 \\&\quad + \left( \sqrt{k} H_g\Vert \lambda _t\Vert _2+4kG^2-\frac{1}{2\gamma _t}\right) V_{x_t}(x_{t+1}). \end{aligned}$$

Note that

$$\begin{aligned}&\sum _{t\in [\tau ]}\left[ \left( \frac{1}{\gamma _t}-c\right) V_{x_t}(x)-\frac{1}{\gamma _t}V_{x_{t+1}}(x)\right] \\&\quad = \left( \frac{1}{\gamma _1}-c\right) V_{x_1}(x) + \sum _{t\in [\tau -1]}\left( \frac{1}{\gamma _{t+1}}-c-\frac{1}{\gamma _t}\right) V_{x_{t+1}}(x) - \frac{1}{\gamma _\tau }V_{x_{\tau +1}}(x) \\&\quad \le \left( \frac{1}{\gamma _1}-c\right) V_{x_1}(x) - \frac{1}{\gamma _\tau }V_{x_{\tau +1}}(x) \\&\quad \le \frac{1}{\gamma _0}V_{x_1}(x). \end{aligned}$$

The first inequality follows since \(\frac{1}{\gamma _{t+1}}-c-\frac{1}{\gamma _t}\le \max \{\frac{1}{\gamma _0}-c-\frac{1}{\gamma _0},c(t+1)-c-ct\}\le 0\), and in the last step we drop nonpositive terms. Moreover, \(\sum _{t\in [\tau ]}\gamma _t\le \sum _{t\in [\tau ]}\frac{1}{ct}\le \frac{1}{c}\ln T\). Summing up the combined inequality above over \(t\in [\tau ]\),

$$\begin{aligned} \sum _{t\in [\tau ]}[f_t(x_t)-f_t(x)]&\le \frac{1}{\gamma _0}V_{x_1}(x) + \frac{1}{2}[\Vert \lambda _1-g_1(x_1)\Vert _2^2-\Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2^2]\\&\quad + \frac{1}{2}[\Vert g_{\tau +1}(x_{\tau +1})\Vert _2^2 - \Vert g_1(x_1)\Vert _2^2] + U_g(T) + \frac{G^2}{c}\ln T \\&\quad + \sum _{t\in [\tau ]}\left( \sqrt{k}H_g\Vert \lambda _t\Vert _2+4kG^2 -\frac{1}{2\gamma _t}\right) V_{x_t}(x_{t+1}), \end{aligned}$$

where we apply Assumption 7 in addition to the above observations. Thus, (10) is satisfied by taking \(C_0(T)=U_g(T)+\frac{G^2}{c}\ln T\), \(C_1=4kG^2\) and \(C_2=\frac{1}{2}\). By Theorem 3,

$$\begin{aligned}&\sum _{t\in [T]}[f_{t+1}(x_{t+1})-f_{t+1}(x^*)] \le \frac{\varOmega ^2}{2\gamma _0} + U_g(T) + \frac{G^2}{c}\ln T, \\&\sum _{t\in [T]}g_{t,i}(x_t) \le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T),\\&\quad \text {where } C_3(T) = 2\frac{M}{r} + \sqrt{k}F + \left( kF^2 + 2U_g(T) + 2U_f(T) + \frac{2G^2}{c}\ln T + 2F_0\right) ^{1/2} \\&\quad \le 2\frac{M}{r} + 2\sqrt{k}F + \sqrt{2U_g(T)} + \sqrt{2U_f(T)} + G\sqrt{\frac{2\ln T}{c}} + \sqrt{2F_0}. \end{aligned}$$

\(\square \)

Implementation details

1.1 Stepsize computation for Section 7.1

In this section, we give the details of estimating the constants and computing the stepsizes for Algorithms 1 and 2 in Section 7.1.

For Algorithm 1, according to Theorem 7, we take stepsizes \(\gamma _0=T^{-2/3}\), \(\beta _0=T^{-1/3}\), and the Slater constant \(r=T^{-1/3}\).

To estimate the stepsize bound for Algorithm 2, we first take \(r=O(T^{-1/2})\) according to Theorem 8. Now, recall that

$$\begin{aligned} f_t(x)&= \phi (x) = \frac{1}{2}x^\top Qx, \\ g_{t}(x)&= \eta (x)-\eta _t-r = \frac{1}{2}\Vert Ax-b\Vert _2^2-\eta _t-r. \end{aligned}$$

Since \(\{\eta _t\}\) is generated by mirror descent with stepsize \(1/H_g\), where \(H_g\) is the Lipschitz constant of \(\nabla \eta \), we have \(\eta _t-\eta ^*\le \frac{H_g}{2t}\Vert x_0-x_\eta ^*\Vert _2^2\). Then we have the following estimates for the parameters \(F=\max \{F_f,F_g\}\), \(G=\max \{G_f,G_g\}\), \(H_f\), \(H_g\), \(U_f(T)\), \(U_g(T)\), \(M\) and \(F_0\).

$$\begin{aligned} \Vert f_t(x)\Vert _2&\le \frac{1}{2}\Vert Q\Vert _2\Vert x\Vert _2^2 \le \frac{1}{2}\Vert Q\Vert _2\left( \frac{\varOmega }{2}\right) ^2 =:F_f \\ \Vert g_{t}(x)\Vert _2&\le \frac{1}{2}\Vert Ax-b\Vert _2^2+\eta _t + r \\&\le 2\cdot \frac{1}{2}(\Vert A\Vert _2\Vert x\Vert _2+\Vert b\Vert _2)^2+r \\&\le (\Vert A\Vert _2\frac{\varOmega }{2}+\Vert b\Vert _2)^2+r =:F_g \\ \Vert \nabla f_t(x)\Vert _2&= \Vert Qx\Vert _2 \le \Vert Q\Vert _2\frac{\varOmega }{2} =:G_f \\ \Vert \nabla g_{t}(x)\Vert _2&= \Vert A^\top (Ax-b)\Vert _2 \le \Vert A\Vert _2\left( \Vert A\Vert _2\frac{\varOmega }{2}+\Vert b\Vert _2\right) =:G_g \\ \Vert \nabla ^2f_t(x)\Vert _2&= \Vert Q\Vert _2 =:H_f \\ \Vert \nabla ^2g_{t}(x)\Vert _2&= \Vert A^\top A\Vert _2 =:H_g \\ \sum _{t\in [T]}|f_t(x^*)-f_t(x_t^*)|&\le \sum _{t\in [T]}\phi (x^*) \le TF_f =:U_f \\ \sum _{t\in [T]}\max _x\Vert g_{t+1}(x)-g_t(x)\Vert _2^2&= \sum _{t\in [T]}(\eta _{t+1}-\eta _t)^2 \\&\le \sum _{t\in [T]}(\eta _t-\eta _*)^2 \\&\le \sum _{t\in [T]}\frac{H_g^2}{2^2t^2}\Vert x_0-x_{\eta }^*\Vert ^4 \\&\le \frac{H_g^2}{4T}\left( \frac{\varOmega }{2}\right) ^4 \qquad \text {(as }x_0={\mathbf {0}})=:U_g \\ \max _x\phi (x)-\min _x\phi (x)&\le \max _x\phi (x)\le F_f =:M \\ f_1(x_1^*)-f_1(x_1)&\le f_1(x_1^*) \le F_f =:F_0 \end{aligned}$$

According to Theorem 4, plugging the parameters into the following expression

$$\begin{aligned} \frac{1}{\gamma _0}&\ge 4\sqrt{k}H_g\left( \sqrt{2U_f(T)} + \sqrt{2U_g(T)} + G\sqrt{2T} +\tfrac{2M}{r}+2\sqrt{k}F+\sqrt{2F_0}\right) \\&\quad + 4kH_g^2\varOmega ^2 + 8kG^2 \end{aligned}$$

gives the stepsize \(\gamma _0\) for Algorithm 2. Table 8 summarizes the bounds of the parameters and the stepsize for the three linear inverse problems foxgood, baart and phillips when \(X=\{x\in {{\mathbb {R}}}^{1000}:\Vert x\Vert _2\le 20\}\).

Table 8 Stepsize computation for Algorithm 2, \(X=\{x:\Vert x\Vert _2\le 20\}\)
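
For reference, the entries of Table 8 can be reproduced by evaluating the estimates above. The following Python helper is a sketch under the assumptions of this subsection (\(k=1\), \(X=\{x:\Vert x\Vert _2\le \varOmega /2\}\), \(r=T^{-1/2}\)); the function name and the use of numpy operator norms are our own implementation choices.

    import numpy as np

    def stepsize_bound(Q, A, b, radius, T):
        # Evaluate the parameter estimates above and the resulting bound on 1/gamma_0.
        # radius is Omega/2 (e.g. 20 for X = {x : ||x||_2 <= 20}); k = 1 constraint.
        Omega = 2.0 * radius
        r = T ** (-0.5)                          # Slater constant, r = O(T^{-1/2})
        normQ, normA = np.linalg.norm(Q, 2), np.linalg.norm(A, 2)
        F_f = 0.5 * normQ * radius ** 2
        F_g = (normA * radius + np.linalg.norm(b)) ** 2 + r
        G_f = normQ * radius
        G_g = normA * (normA * radius + np.linalg.norm(b))
        H_f, H_g = normQ, normA ** 2             # H_g = ||A^T A||_2
        U_f = T * F_f
        U_g = H_g ** 2 / (4.0 * T) * radius ** 4
        M = F_0 = F_f
        F, G = max(F_f, F_g), max(G_f, G_g)
        inv_gamma0 = (4.0 * H_g * (np.sqrt(2 * U_f) + np.sqrt(2 * U_g) + G * np.sqrt(2 * T)
                                   + 2 * M / r + 2 * F + np.sqrt(2 * F_0))
                      + 4.0 * H_g ** 2 * Omega ** 2 + 8.0 * G ** 2)
        return 1.0 / inv_gamma0, dict(F=F, G=G, H_f=H_f, H_g=H_g,
                                      U_f=U_f, U_g=U_g, M=M, F_0=F_0, r=r)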

1.2 Adaptive stepsizes via backtracking

Let us plug \(f_t=\phi \) and \(g_t=\eta -\eta _t-r\) in the description of Algorithm 1. Then, we obtain the following.

  1. 1.

    Set \(x_1\in X\), \(\lambda _1=0\). Let \(\gamma _t\) and \(\beta _t\) be the primal and dual stepsizes.

  2. 2.

    For \(t = 1,\dots ,T\), update

    $$\begin{aligned} x_{t+1}&= \text {Prox}_{x_t}(\gamma _t(\nabla \phi (x_t)+\lambda _t\nabla \eta (x_t))), \\ \lambda _{t+1}&= \max \{\lambda _t+\beta _t(\eta (x_t)-\eta _t-r+\langle \nabla \eta (x_t),x_{t+1}-x_t\rangle ),0\}, \end{aligned}$$

where \(\{\eta _t\}\) is generated by the mirror descent algorithm. When the objective functions \(\phi \), \(\eta \) are smooth, the stepsize \(\gamma _t\) can be selected by a backtracking procedure (see [3]). In particular, we consider two additional parameters \(l_x\), \(e_x\). At each iteration, the next iterate \(x_{t+1}\) is computed based on the current \(\gamma _t=1/l_x\), and then we examine whether a certain condition is satisfied. If not, \(l_x\) is updated by multiplying it by \(e_x\) recursively until the condition is satisfied. In our setting, we check the following condition

$$\begin{aligned} L_t(x_{t+1},\lambda _t) - L_t(x_t,\lambda _t)&\le \langle \nabla L_t(x_t),x_{t+1}-x_t\rangle + \frac{1}{2}l_x\Vert x_{t+1}-x_t\Vert ^2 . \end{aligned}$$

Plugging in \(L_t(x,\lambda ) = \phi (x) + \lambda (\eta (x)-\eta _t-r)\), the above inequality becomes

$$\begin{aligned} \phi (x_{t+1})-\phi (x_t)+\lambda _t(\eta (x_{t+1})-\eta (x_t)) \le \langle \nabla \phi (x_t)+\lambda _t\nabla \eta (x_t),x_{t+1}-x_t\rangle + \frac{1}{2}l_x\Vert x_{t+1}-x_t\Vert ^2 . \end{aligned}$$

By comparing the updating rules for the dual variable in Algorithms 1 and 2, i.e., \(\lambda _{t+1,i} = \max \{\lambda _{t,i}+\beta _t\zeta _{t,i}, 0\}\) where \(\zeta _{t,i} :=g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle \) and \(\lambda _{t+1,i}=\max \{\lambda _{t,i}+2g_{t+1,i}(x_{t+1})-g_{t,i}(x_t),0\}\), we observe that taking \(\beta _t=1\) in Algorithm 1 would make the increments \(\beta _t\zeta _{t,i}\) and \(2g_{t+1,i}(x_{t+1})-g_{t,i}(x_t)\) roughly of the same magnitude. Therefore, we set the dual stepsize in Algorithm 1 to \(\beta _t=1\). Moreover, we take \(\lambda _1=100\), the Slater constant \(r=1/T=10^{-4}\), \(s_\lambda =1\), \(l_z=0.01\), \(l_x=1\), and \(e_z=e_x=1+2^{-5}\). The details of our implementation of Algorithm 1 are summarized in Algorithm 3.
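
As a companion to this description, here is a minimal Python sketch of the resulting primal-dual loop (a hedged rendering, not a transcription of the paper's Algorithm 3). The handles phi, grad_phi, eta, grad_eta, the projection proj_X and the precomputed mirror-descent values eta_seq are assumed to be supplied by the user, and the parameters \(s_\lambda \), \(l_z\), \(e_z\) governing the inner mirror-descent updates are omitted since \(\{\eta _t\}\) is taken as given.

    import numpy as np

    def algorithm3_sketch(x1, T, phi, grad_phi, eta, grad_eta, eta_seq, proj_X,
                          lam1=100.0, r=1e-4, l_x=1.0, e_x=1.0 + 2 ** -5):
        # Algorithm 1 specialized to f_t = phi and g_t = eta - eta_t - r, with the
        # primal stepsize gamma_t = 1/l_x selected by backtracking and beta_t = 1.
        x, lam = np.array(x1, dtype=float), lam1
        for t in range(1, T + 1):
            eta_t = eta_seq[t - 1]                    # mirror-descent estimate of eta^*
            grad_L = grad_phi(x) + lam * grad_eta(x)  # grad_x L_t(x_t, lambda_t)
            while True:
                x_next = proj_X(x - grad_L / l_x)     # Prox step with gamma_t = 1/l_x
                lhs = phi(x_next) - phi(x) + lam * (eta(x_next) - eta(x))
                rhs = grad_L @ (x_next - x) + 0.5 * l_x * np.dot(x_next - x, x_next - x)
                if lhs <= rhs + 1e-12:                # backtracking (descent) condition
                    break
                l_x *= e_x                            # shrink the stepsize 1/l_x and retry
            # dual step with beta_t = 1
            lam = max(lam + eta(x) - eta_t - r + grad_eta(x) @ (x_next - x), 0.0)
            x = x_next
        return x, lam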


Similarly, when plugging in \(f_t=\phi \), \(g_t=\eta -\eta _t-r\), Algorithm 2 turns into the following.

  1. 1.

Set \(x_1\in X\), \(\lambda _1=\max \{\eta (x_1)-\eta _1-r,0\}\). Let \(\gamma _t\) be the primal stepsize.

  2. 2.

    For \(t = 1,\dots ,T\), update by

    $$\begin{aligned} x_{t+1}&= \text {Prox}_{x_t}(\gamma _t(\nabla \phi (x_t)+\lambda _t\nabla \eta (x_t))), \\ \lambda _{t+1}&= \max \{\lambda _t+2(\eta (x_{t+1})-\eta _{t+1}-r)-(\eta (x_t)-\eta _t-r),0\}. \end{aligned}$$

Again \(\{\eta _t\}\) is generated by the mirror descent algorithm. The stepsize \(\gamma _t\) is determined by a backtracking procedure, where we adopt a different condition to guarantee that the last term in Lemma 2(b) is nonpositive, i.e.,

$$\begin{aligned} \left( H_g\lambda _t + 2G^2 - \frac{1}{\gamma _t}\right) \cdot \frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\le 0. \end{aligned}$$

According to the calculations in Lemma 2(a), it suffices to have

$$\begin{aligned} (H_g\lambda _t -l_x)\cdot \frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2 + (\eta (x_{t+1})-\eta (x_t))^2 \le 0. \end{aligned}$$

Note that the condition requires an estimate of \(H_g\). Moreover, we take \(\lambda _1=0\), \(r=1/T=10^{-4}\), \(l_z=0.01\), \(l_x=1\), \(e_z=e_x=1+2^{-5}\), and \(c_x=1.1\). The implementation of Algorithm 2 is summarized in Algorithm 4.
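
A matching Python sketch of this specialization is given below (again a hedged illustration rather than the paper's Algorithm 4). Here grad_phi, eta, grad_eta, proj_X and the mirror-descent values eta_seq (of length at least T+1) are user-supplied, the constant \(H_g\) is an estimate as noted above, and the parameters \(l_z\), \(e_z\), \(c_x\) tied to the inner mirror-descent bookkeeping are omitted.

    import numpy as np

    def algorithm4_sketch(x1, T, grad_phi, eta, grad_eta, eta_seq, proj_X, H_g,
                          lam1=0.0, r=1e-4, l_x=1.0, e_x=1.0 + 2 ** -5):
        # Algorithm 2 specialized to f_t = phi and g_t = eta - eta_t - r; the primal
        # stepsize gamma_t = 1/l_x is chosen by backtracking so that the last term in
        # Lemma 2(b) is nonpositive, and the dual update is
        # lambda_{t+1} = max{lambda_t + 2*g_{t+1}(x_{t+1}) - g_t(x_t), 0}.
        x, lam = np.array(x1, dtype=float), lam1   # general init: max{eta(x1)-eta_1-r, 0}
        for t in range(1, T + 1):
            eta_t, eta_next = eta_seq[t - 1], eta_seq[t]
            xi = grad_phi(x) + lam * grad_eta(x)
            while True:
                x_next = proj_X(x - xi / l_x)      # Prox step with gamma_t = 1/l_x
                gap = (0.5 * (H_g * lam - l_x) * np.dot(x_next - x, x_next - x)
                       + (eta(x_next) - eta(x)) ** 2)
                if gap <= 1e-12:                   # backtracking condition from above
                    break
                l_x *= e_x
            lam = max(lam + 2.0 * (eta(x_next) - eta_next - r) - (eta(x) - eta_t - r), 0.0)
            x = x_next
        return x, lam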

Fig. 3: Comparison of Algorithm 1 with different r values in terms of the inner optimality gap (left) and the outer objective value (right), on problem instances foxgood, baart and phillips with bounded domain. Note that both the x-axis and the y-axis are in logarithmic scale in the left column.

Remark 8

We note that the updates for \(\{x_t\}\) in Algorithms 3 and 4 are very similar to the algorithm proposed by Solodov [17]: both compute \(x_{t+1} \leftarrow \text {Proj}(x_t-l_x^{-1}\nabla L_t(x_t,\lambda _t))\), where \(L_t(x,\lambda ) = \phi (x) + \lambda (\eta (x)-\eta _t-r)\), for some stepsize \(l_x^{-1}\) computed via backtracking. However, there are critical differences between the method of Solodov [17] and ours.

First, the descent condition that Solodov [17] uses in the backtracking is

$$\begin{aligned} L_t(x_{t+1},\lambda _t) - L_t(x_t,\lambda _t)&\le \langle \nabla L_t(x_t),x_{t+1}-x_t\rangle , \end{aligned}$$

whereas ours are slightly different

$$\begin{aligned}&L_t(x_{t+1},\lambda _t) - L_t(x_t,\lambda _t) \le \langle \nabla L_t(x_t),x_{t+1}-x_t\rangle + \frac{1}{2}l_x\Vert x_{t+1}-x_t\Vert ^2\\&\frac{1}{2}(H_g\lambda _t-l_x)\Vert x_{t+1}-x_t\Vert ^2 + (\eta (x_{t+1})-\eta (x_t))^2 \le 0 \end{aligned}$$

for Algorithms 3 and 4 respectively.

Second, and most critically, the dual variable sequence \(\{\lambda _t\}\) is chosen differently. Solodov [17] does not provide an explicit update scheme, but (in line with other iterative regularization schemes) analyzes the case when \(\lambda _t \rightarrow \infty \) sufficiently slowly, so that \(\sum _{t=1}^\infty (1/\lambda _t) = \infty \). On the other hand we provide an analysis when \(\lambda _{t+1}\) is explicitly updated using knowledge of \(\eta (x_t), \eta (x_{t+1})\) and estimates \(\eta _t,\eta _{t+1}\) of the inner optimal value \(\eta ^*\). \(\square \)

1.3 Additional numerical results

In this appendix, we provide additional numerical results that illustrate how the choice of the Slater constant r changes the behavior of Algorithm 1 in the context of Remarks 1 and 3. In particular, based on Theorem 7 in Section 6.2, \(r=T^{-1/3}\) was used for Algorithm 1 in Sect. 7.1. One can notice in Fig. 1 that with this choice of r Algorithm 1 seems to be stuck in terms of inner objective value accuracy for the instances baart and phillips. This is precisely due to the phenomenon of convergence only to a neighborhood of the optimal solution discussed in Remarks 1 and 3. To illustrate this point, here we present numerical results with the identical setup as in Sect. 7.1, where the only difference is that we use \(r=T^{-1/2}\) for Algorithm 1. With this smaller choice of r, we observe in Fig. 3 that Algorithm 1 is no longer stuck in terms of the inner objective accuracy for these instances. Thus, we conclude that the choice of r is critical not only to ensure a desired convergence rate but also to ensure convergence to the desired neighborhood of the optimizer set \(X^*\).


Cite this article

Shen, L., Ho-Nguyen, N. & Kılınç-Karzan, F. An online convex optimization-based framework for convex bilevel optimization. Math. Program. 198, 1519–1582 (2023). https://doi.org/10.1007/s10107-022-01894-5

