Abstract
We propose a new framework for solving the convex bilevel optimization problem, where one optimizes a convex objective over the optimal solutions of another convex optimization problem. As a key step of our framework, we form an online convex optimization (OCO) problem in which the objective function remains fixed while the domain changes over time. We note that the structure of our OCO problem is different from the classical OCO problem that has been intensively studied in the literature. We first develop two OCO algorithms that work under different assumptions and provide their theoretical convergence rates. Our first algorithm works under minimal convexity assumptions, while our second algorithm is equipped to exploit structural information on the objective function, including smoothness, lack of first-order smoothness, and strong convexity. We then carefully translate our OCO results into their counterparts for solving the convex bilevel problem. In the context of convex bilevel optimization, our results lead to rates of convergence in terms of both inner and outer objective functions simultaneously, and in particular without assuming strong convexity in the outer objective function. Specifically, after T iterations, our first algorithm achieves an \(O(T^{-1/3})\) error bound in both levels, and this is further improved to \(O(T^{-1/2})\) by our second algorithm. We illustrate the numerical efficacy of our algorithms on standard linear inverse problems and a large-scale text classification problem.
References
Amini, M., Yousefian, F.: An iterative regularized incremental projected subgradient method for a class of bilevel optimization problems. In: 2019 American Control Conference (ACC), pp. 4069–4074, Philadelphia, PA, USA (2019). IEEE
Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA (2017)
Beck, A., Sabach, S.: A first order method for finding minimal norm-like solutions of convex optimization problems. Math. Program. 147(1–2), 25–46 (2014)
Cabot, A.: Proximal point algorithm controlled by a slowly vanishing term: applications to hierarchical minimization. SIAM J. Optim. 15(2), 555–572 (2005)
Dutta, J., Pandit, T.: Algorithms for Simple Bilevel Programming, pp. 253–291. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-52119-6_9 (ISBN 978-3-030-52119-6)
Garrigos, G., Rosasco, L., Villa, S.: Iterative regularization via dual diagonal descent. J. Math. Imaging Vis. 60(2), 189–215 (2018)
Hansen, P.C.: Regularization tools: a Matlab package for analysis and solution of discrete ill-posed problems. Numer. Algorithms 6(1), 1–35 (1994)
Helou, E.S., Simões, L.E.A.: \(\epsilon \)-subgradient algorithms for bilevel convex optimization. Inverse Prob. 33(5), 055020 (2017)
Juditsky, A., Nemirovski, A.: First-order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning. The MIT Press, Cambridge (2011). (ISBN 978-0-262-29877-3)
Kaushik, H.D., Yousefian, F.: A method with convergence rates for optimization problems with variational inequality constraints. SIAM J. Optim. 31(3), 2171–2198 (2021)
Koshal, J., Nedić, A., Shanbhag, U.V.: Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Trans. Autom. Control 58(3), 594–609 (2013)
Liu, S., Vicente, L.N.: Accuracy and fairness trade-offs in machine learning: a stochastic multi-objective approach (2020). arXiv:2008.01132
Nedić, A., Özdağlar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Optim. 19(4), 1757–1780 (2009)
Neely, M.J., Yu, H.: Online convex optimization with time-varying constraints (2017). arXiv:1702.04783v2
Nesterov, Y.: Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer, Berlin (2018)
Sabach, S., Shtern, S.: A first order method for solving convex bilevel optimization problems. SIAM J. Optim. 27(2), 640–660 (2017)
Solodov, M.: An explicit descent method for bilevel convex optimization. J. Convex Anal. 14(2), 227–237 (2007)
Solodov, M.V.: A bundle method for a class of bilevel nonsmooth convex minimization problems. SIAM J. Optim. 18(1), 242–259 (2007). https://doi.org/10.1137/050647566
Yousefian, F., Nedić, A., Shanbhag, U.V.: On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Math. Program. 165(1), 391–431 (2017)
Yu, H., Neely, M.J.: A primal-dual type algorithm with the \({O}(1/t)\) convergence rate for large scale constrained convex programs. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1900–1905, Las Vegas, NV, USA (2016). IEEE
Yu, H., Neely, M.J.: A low complexity algorithm with \({O}(\sqrt{T})\) regret and \({O}(1)\) constraint violations for online convex optimization with long term constraints. J. Mach. Learn. Res. 21(1), 1–24 (2020)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Acknowledgements
The authors wish to thank the review team for their constructive feedback that improved the presentation of the material in this paper.
Appendices
Fundamental results for mirror descent-type updates
In this appendix, we present some classical results on mirror descent-type updates originating from the setup in Sect. 2.2 that we use in our analysis.
Lemma 5
([2, Lemma 9.11]) For any \(x,y,z\in X\), we have
$$\begin{aligned} \langle \nabla \omega (y)-\nabla \omega (x),z-y\rangle = V_x(z)-V_y(z)-V_x(y). \end{aligned}$$
Remark 7
In the particular case of the Euclidean setup, Lemma 5 implies that for any \(x,y,z\in X\), we have
$$\begin{aligned} \langle y-x,z-y\rangle = \frac{1}{2}\Vert z-x\Vert _2^2-\frac{1}{2}\Vert z-y\Vert _2^2-\frac{1}{2}\Vert y-x\Vert _2^2. \end{aligned}$$
\(\square \)
Next, recall the following summary of the classical results related to a single iteration of the proximal mapping.
Lemma 6
Suppose \(x_{t+1}={{\,\mathrm{Prox}\,}}_{x_t}(\gamma _t\xi _t)\). Then, for any \(x\in X\),
-
(a)
\(\Vert x_{t+1}-x_t\Vert \le \gamma _t\Vert \xi _t\Vert _*\).
-
(b)
\(\gamma _t\langle \xi _t,x_{t+1}-x\rangle \le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1})\).
-
(c)
\(\gamma _t\langle \xi _t,x_t-x\rangle \le V_{x_t}(x)-V_{x_{t+1}}(x) +\frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2\).
Proof
-
(a)
By the definition of the proximal mapping,
$$\begin{aligned} x_{t+1} = \mathop {{\mathrm{arg\,min}}}\limits _{y\in X}\{V_{x_t}(y)+\langle \gamma _t\xi _t,y\rangle \}. \end{aligned}$$The optimality condition is that, for any \(x\in X\),
$$\begin{aligned} \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t)+\gamma _t\xi _t,x-x_{t+1}\rangle \ge 0. \end{aligned}$$(18)Plugging in \(x=x_t\) and rearranging, we obtain
$$\begin{aligned} \langle \gamma _t\xi _t,x_t-x_{t+1}\rangle&\ge \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x_{t+1}-x_t\rangle . \end{aligned}$$By Cauchy–Schwarz inequality, \(\langle \gamma _t\xi _t,x_t-x_{t+1}\rangle \le \Vert \gamma _t\xi _t\Vert _*\Vert x_t-x_{t+1}\Vert \). Since \(\omega \) is 1-strongly convex with respect to \(\Vert \cdot \Vert \), we have \(\langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x_{t+1}-x_t\rangle \ge \Vert x_{t+1}-x_t\Vert ^2\). Therefore, we conclude that \(\gamma _t\Vert \xi _t\Vert _*\ge \Vert x_{t+1}-x_t\Vert \).
-
(b)
Rearranging (18), we have
$$\begin{aligned} \langle \gamma _t\xi _t,x_{t+1}-x\rangle&\le \langle \nabla \omega (x_{t+1})-\nabla \omega (x_t),x-x_{t+1}\rangle . \end{aligned}$$Applying Lemma 5 to the right hand side gives the result.
-
(c)
Following part (b), we have
$$\begin{aligned} \gamma _t\langle \xi _t,x_t-x\rangle&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \gamma _t\langle \xi _t,x_t-x_{t+1}\rangle \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \gamma _t\Vert \xi _t\Vert _*\Vert x_t-x_{t+1}\Vert \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) - V_{x_t}(x_{t+1}) + \frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2 + \frac{1}{2}\Vert x_t-x_{t+1}\Vert ^2 \\&\le V_{x_t}(x)-V_{x_{t+1}}(x) + \frac{\gamma _t^2}{2}\Vert \xi _t\Vert _*^2, \end{aligned}$$where we use Cauchy–Schwarz inequality, the fact that \(ab\le \frac{1}{2}(a^2+b^2)\) and \(\frac{1}{2}\Vert x_t-x_{t+1}\Vert ^2\le V_{x_t}(x_{t+1})\).
\(\square \)
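To make the proximal mapping above concrete, the following Python sketch implements \(x_{t+1}={{\,\mathrm{Prox}\,}}_{x_t}(\gamma _t\xi _t)\) for two standard setups, the Euclidean setup over a ball and the entropy setup over the simplex, and numerically checks the bound in Lemma 6(a). The choice of domains and all helper names are illustrative assumptions, not prescribed by the analysis.

```python
import numpy as np

# Euclidean setup on X = {x : ||x||_2 <= R}: omega(x) = 0.5*||x||_2^2,
# so V_x(y) = 0.5*||y - x||_2^2 and Prox_x(gamma*xi) = Proj_X(x - gamma*xi).
def prox_euclidean(x, gamma_xi, R=1.0):
    y = x - gamma_xi
    nrm = np.linalg.norm(y)
    return y if nrm <= R else y * (R / nrm)

# Entropy setup on the simplex X = {x >= 0 : sum(x) = 1}: omega(x) = sum_i x_i*log(x_i),
# V_x(y) = KL(y || x), and Prox_x(gamma*xi) is the exponentiated-gradient update.
def prox_entropy(x, gamma_xi):
    y = x * np.exp(-gamma_xi)
    return y / y.sum()

# Numerical check of Lemma 6(a): ||x_{t+1} - x_t|| <= gamma_t * ||xi_t||_*.
# For the entropy setup, ||.|| is the l1-norm and ||.||_* the l_inf-norm.
rng = np.random.default_rng(0)
x_t = rng.random(5); x_t /= x_t.sum()
xi_t, gamma_t = rng.normal(size=5), 0.1
x_next = prox_entropy(x_t, gamma_t * xi_t)
assert np.abs(x_next - x_t).sum() <= gamma_t * np.abs(xi_t).max() + 1e-12
```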
In our developments we also use the following lemma relating the Bregman distance \(V_x(x')\) and the subgradient inequality of an arbitrary convex function under a variety of structural assumptions on \(f_t\).
Lemma 7
The following statements hold for convex functions \(\{f_t\}_{t\in [T]}\).
-
(a)
For any \(x\in X\) and \(t\in [T]\),
$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t} V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)]. \end{aligned}$$ -
(b)
Under Assumption 8, for any \(x\in X\) and \(t\in [T]\), we have
$$\begin{aligned} \langle \nabla f_{t+1}(x_t),x_{t+1}-x\rangle&\ge f_{t+1}(x_{t+1}) - f_{t+1}(x) - H_fV_{x_t}(x_{t+1}). \end{aligned}$$ -
(c)
Under Assumption 9, for any \(x\in X\) and \(t\in [T]\), we have
$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t}V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)] + cV_{x_t}(x). \end{aligned}$$
Proof
-
(a)
Starting from the subgradient inequality, we apply Cauchy–Schwarz inequality, and the fact that \(a^2+b^2\ge 2ab\), to obtain
$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&= \langle \nabla f_t(x_t),[x_{t+1}-x_t]+[x_t-x]\rangle \\&\ge \langle \nabla f_t(x_t),x_{t+1}-x_t\rangle + [f_t(x_t)-f_t(x)] \\&\ge -\Vert \nabla f_t(x_t)\Vert _*\Vert x_{t+1}-x_t\Vert + [f_t(x_t)-f_t(x)] \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{4\gamma _t}\Vert x_{t+1}-x_t\Vert ^2 + [f_t(x_t)-f_t(x)] \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t} V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)], \end{aligned}$$where the last step follows from \(\frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\le V_{x_t}(x_{t+1})\).
-
(b)
Combining the subgradient inequality and Assumption 8,
$$\begin{aligned} \langle \nabla f_{t+1}(x_t),x_{t+1}-x\rangle&= \langle \nabla f_{t+1}(x_t),[x_{t+1}-x_t]-[x-x_t]\rangle \\&\ge \left[ f_{t+1}(x_{t+1})-f_{t+1}(x_t)-\frac{1}{2}H_f\Vert x_{t+1}-x_t\Vert ^2\right] \\&\quad - [f_{t+1}(x)-f_{t+1}(x_t)] \\&= f_{t+1}(x_{t+1}) - f_{t+1}(x) - \frac{1}{2}H_f\Vert x_{t+1}-x_t\Vert ^2 \\&\ge f_{t+1}(x_{t+1}) - f_{t+1}(x) - H_fV_{x_t}(x_{t+1}). \end{aligned}$$ -
(c)
Applying Assumption 9,
$$\begin{aligned} \langle \nabla f_t(x_t),x_{t+1}-x\rangle&= \langle \nabla f_t(x_t),[x_{t+1}-x_t]+[x_t-x]\rangle \\&\ge \langle \nabla f_t(x_t),x_{t+1}-x_t\rangle + [f_t(x_t)-f_t(x)] + cV_{x_t}(x) \\&\ge -\gamma _t\Vert \nabla f_t(x_t)\Vert _*^2 - \frac{1}{2\gamma _t}V_{x_t}(x_{t+1}) + [f_t(x_t)-f_t(x)] + cV_{x_t}(x), \end{aligned}$$where the last step follows the same analysis as in Lemma 7(a).
\(\square \)
Proofs for Section 4
We first give the per-step analysis of Algorithm 1. This serves as an important ingredient for Theorem 1 that provides a bound for the objective regret term in (5).
Lemma 8
For any \(t\in [T]\) and \(x\in X\), in Algorithm 1, we have
Proof
Fix \(t\in [T]\). It follows from Lemma 6(b) that for any \(x\in X\)
Recall that \(\nabla _xL_t(x_t,\lambda _t)=\nabla f_t(x_t)+\sum _{i\in [k]}\lambda _{t,i}\nabla g_{t,i}(x_t)\). Moreover,
where the last step follows from the subgradient inequality associated with \(g_{t,i}(x)\), i.e., \( \langle \nabla g_{t,i}(x_t),x_t-x\rangle \ge g_{t,i}(x_t)-g_{t,i}(x). \) Plugging in the definition of \(\nabla _xL_t(x_t,\lambda _t)\) along with the above relation in (19) then results in
Moreover, using the Cauchy–Schwarz inequality and the fact that \(ab\le a^2/2+b^2/2\), we deduce
By summing this relation with (20) and the subgradient inequality associated with \(f_{t}(x)\), i.e., \( \langle \nabla f_{t}(x_t),x_t-x\rangle \ge f_{t}(x_t)-f_{t}(x) \), we arrive at the following conclusion from the primal step
Recall from Remark 4 that the dual step can be viewed as the projected subgradient update, i.e.,
Thus, the dual update is an MD update on a domain equipped with the Euclidean setup, i.e., the d.g.f. is selected to be the squared Euclidean norm and \(V_x(y)=\frac{1}{2}\Vert x-y\Vert ^2\). Then, it follows from Lemma 6(c) that
Dividing the inequality for the primal by \(\gamma _t\) and dividing the above inequality for the dual step by \(\beta _t\) and then summing them up leads to the desired conclusion. \(\square \)
Theorem 1
Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\). Then, under Assumptions 1 and 2, for any \(x^*\in X_0\), Algorithm 1 guarantees the following objective regret bound
Proof
Note that for any \(x\in X_0\), we have \(\sum _{i\in [k]}\lambda _{t,i}g_{t,i}(x)\le 0\). Then, by summing up the inequality in Lemma 8 over \(t\in [T]\) and noting also that the right hand side telescopes, we obtain
In the last inequality, we drop nonpositive terms, note that by definition \(\lambda _1=0\), and use the bounds from Assumptions 1 and 2. In particular, when bounding \(\zeta _{t,i} = g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle ,\,\forall i\in [k]\), recall from Remark 2 that we have \(\Vert x_{t+1}-x_t\Vert \le \sqrt{2V_{x_t}(x_{t+1})}\le \varOmega \). \(\square \)
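For concreteness, the following Python sketch assembles a single primal–dual iteration of Algorithm 1 in the Euclidean setup, combining the primal mirror-descent step on \(\nabla _xL_t(x_t,\lambda _t)\) with the dual update \(\lambda _{t+1,i}=\max \{\lambda _{t,i}+\beta _t\zeta _{t,i},0\}\) used above; the ball-shaped domain and the function names are illustrative assumptions.

```python
import numpy as np

def project_ball(y, R=10.0):
    # Euclidean projection onto X = {x : ||x||_2 <= R} (illustrative choice of X).
    nrm = np.linalg.norm(y)
    return y if nrm <= R else y * (R / nrm)

def algorithm1_step(x_t, lam_t, grad_f, g_list, grad_g_list, gamma_t, beta_t):
    """One iteration of Algorithm 1 with the Euclidean d.g.f.

    Primal: x_{t+1} = Prox_{x_t}(gamma_t * grad_x L_t(x_t, lam_t)).
    Dual:   lam_{t+1,i} = max{lam_{t,i} + beta_t * zeta_{t,i}, 0}, where
            zeta_{t,i} = g_{t,i}(x_t) + <grad g_{t,i}(x_t), x_{t+1} - x_t>.
    """
    grads_g = np.array([dg(x_t) for dg in grad_g_list])   # k x n matrix of constraint gradients
    grad_L = grad_f(x_t) + grads_g.T @ lam_t               # grad_x L_t(x_t, lam_t)
    x_next = project_ball(x_t - gamma_t * grad_L)
    zeta = np.array([g(x_t) for g in g_list]) + grads_g @ (x_next - x_t)
    lam_next = np.maximum(lam_t + beta_t * zeta, 0.0)
    return x_next, lam_next
```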
We next establish a bound on the value of the dual variable \(\lambda _t\) over time when running Algorithm 1. This is a key component in the analysis of Algorithm 1, and it is critical in providing a bound for the constraint violation in Theorem 2. We define the notation
Note that by comparing the first two terms in \(\varLambda _0(\tau )\) with the first term in \(\varLambda _0\) and using the arithmetic–geometric mean inequality, we deduce that \(\varLambda _0(\tau ) \le \varLambda _0\) holds for any \(\tau \in [T]\).
Lemma 9
Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\) in Algorithm 1. Under Assumptions 1 to 3, we have
Proof
Based on the dual update, we have
and
The inner product term can be bounded using (20) and plugging in \(x={{\hat{x}}}\in X\) as follows.
where the last step follows from Assumptions 1 to 3. Under the same assumptions, we also have
Thus, we obtain
and
For any \(\tau \in [t-1]\), by summing up the above inequality for \(\tau '=t-\tau ,\dots ,t-1\), and noting that \(\beta _t=\beta _0\) and \(\gamma _t=\gamma _0\), we arrive at
where the second inequality follows from \(0\le V_x(y)\le \frac{1}{2}\varOmega ^2\) for any \(x,y\in X\), from relation (23), which implies \(\Vert \lambda _{t-\tau '}\Vert _2\ge \Vert \lambda _t\Vert _2 -\tau '\beta _0\sqrt{k}(F+G\varOmega )\), and from the fact that some of the terms telescope since \(\gamma _t=\gamma _0,\,\beta _t=\beta _0\). The last equality above follows from the definition of \(\varLambda _0(\tau )\) in (21). Hence, by dropping the last term, which is nonpositive, and rearranging terms, we deduce that the relation
holds for all \(\tau \in [t-1]\) and \(t\in [T]\).
As a last step, we will show by induction that \(\Vert \lambda _t\Vert _2\le \varLambda _0(\tau )\) for any \(\tau \in [T]\) fixed and \(t\in [T]\). The base case holds since \(\lambda _1=0\) and \(\varLambda _0(\tau )\ge 0\) for any \(\tau \). For \(t\le \tau \), by (23) we have \(\Vert \lambda _t\Vert _2\le \Vert \lambda _1\Vert _2+t\beta _0\sqrt{k}(F+G\varOmega )\le \varLambda _0(\tau )\). For \(t>\tau \), assuming by induction hypothesis that \(\Vert \lambda _{t-\tau }\Vert _2\le \varLambda _0(\tau )\) and using (24), we arrive at
This implies that \(\Vert \lambda _t\Vert _2\le \varLambda _0(\tau )\) as desired. \(\square \)
Theorem 2
Let \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all \(t\in [T]\). Then, under Assumptions 1 to 3, for any \(x^*\in X_0\), Algorithm 1 guarantees the following constraint violation bound
Proof
Consider any \(t\in [T]\). From the dual update step in Algorithm 1, we have
It follows from Lemma 6(a) that
where the second inequality follows from the definition of \(L_t(x_t,\lambda _t)\), and the last one from the relation \(\Vert \lambda _t\Vert _1\le \sqrt{k}\Vert \lambda _t\Vert _2\), Assumptions 1 and 2, and Lemma 9. Combining these two inequalities, we arrive at
Summing this inequality over \(t\in [T]\), and using the fact that \(\gamma _t=\gamma _0\), \(\beta _t=\beta _0\) for all t, we observe that the right hand side telescopes. Then, from Lemma 9, we deduce that
\(\square \)
Corollary 1
Let \(\gamma _0 = T^{-(1/2+\delta )}\), \(\beta _0 = T^{-(1/2-\delta )}\), where \(-1/2< \delta < 1/2\). Then, for any \(x^*\in X_0\), under Assumptions 1 to 3, Algorithm 1 with \(\gamma _t = \gamma _0\), \(\beta _t = \beta _0\) for \(t \in [T]\) leads to
Proof
We hide constants \(k,F,G, \varOmega \) in the bounds. Thus,
Therefore,
\(\square \)
Proofs for Section 5
1.1 Proofs for Lemmas
Lemma 1
In Algorithm 2, the following holds for all \(t\ge 1\).
-
(a)
\(\lambda _t-g_t(x_t)\ge 0\).
-
(b)
\(g_{t+1}(x_{t+1})\le [\lambda _{t+1}-g_{t+1}(x_{t+1})]-[\lambda _t-g_t(x_t)]\).
-
(c)
\(\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\le \Vert \lambda _t+g_{t+1}(x_{t+1})-g_t(x_t)\Vert _2^2+\Vert g_{t+1}(x_{t+1})\Vert _2^2\).
-
(d)
\(\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2 \ge \Vert g_{t+1}(x_{t+1})\Vert _2\), and \(\Vert \lambda _1-g_1(x_1)\Vert _2\le \Vert g_1(x_1)\Vert _2\).
Proof
By definition in Algorithm 2, we have \(\lambda _1-g_1(x_1)\ge 0\). Assume, as the induction hypothesis, that \(\lambda _t - g_t(x_t) \ge 0\) for some \(t \ge 1\). For any \(i\in [k]\), the dual update rule leads to
For any \(i\in [k]\), if \(\lambda _{t+1,i}-g_{t+1,i}(x_{t+1}) < 0\), then \(\lambda _{t,i}-g_{t,i}(x_t)< -g_{t+1,i}(x_{t+1}) < 0\). But this contradicts the induction hypothesis \(\lambda _t-g_t(x_t) \ge 0\). Thus, we have \(\lambda _{t+1}-g_{t+1}(x_{t+1})\ge 0\). The inequalities in (b) and (c) follow directly from (25) (note that (c) follows since \(\Vert a\Vert _2^2 \le \Vert b\Vert _2^2 + \Vert c\Vert _2^2\) when \(a = \max \{b,c\}\)). The second part in (d) follows from the initialization of \(\lambda _1\) in Algorithm 2, which implies \(\lambda _{1,i}-g_{1,i}(x_1)=\max \{0,-g_{1,i}(x_1)\}\) for all \(i\in [k]\). To show the first part of (d), note that from (25) and part (a), for any \(i\in [k]\) we have
\(\square \)
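The invariants in Lemma 1 depend only on the dual update \(\lambda _{t+1}=\max \{\lambda _t+2g_{t+1}(x_{t+1})-g_t(x_t),0\}\) and the initialization \(\lambda _1=\max \{g_1(x_1),0\}\), not on how the constraint values are generated. The following Python sketch checks parts (a) and (b) along an arbitrary (hypothetical) sequence of constraint values.

```python
import numpy as np

def dual_update(lam, g_next, g_cur):
    # Algorithm 2 dual step: lam_{t+1} = max{lam_t + 2*g_{t+1}(x_{t+1}) - g_t(x_t), 0}.
    return np.maximum(lam + 2.0 * g_next - g_cur, 0.0)

rng = np.random.default_rng(1)
g = rng.normal(size=(50, 3))            # hypothetical values g_t(x_t), t = 1, ..., 50, with k = 3
lam = np.maximum(g[0], 0.0)             # lam_1 = max{g_1(x_1), 0}
assert np.all(lam - g[0] >= 0)          # Lemma 1(a), base case
for t in range(len(g) - 1):
    lam_next = dual_update(lam, g[t + 1], g[t])
    assert np.all(lam_next - g[t + 1] >= -1e-12)                             # Lemma 1(a)
    assert np.all(g[t + 1] <= (lam_next - g[t + 1]) - (lam - g[t]) + 1e-12)  # Lemma 1(b)
    lam = lam_next
```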
Lemma 2
Under Assumption 2, running Algorithm 2 guarantees for any \(t\in [T]\) that
-
(a)
$$\begin{aligned} -\langle \lambda _t,g_{t+1}(x_{t+1})\rangle&\le \frac{1}{2}\left[ \Vert \lambda _t-g_t(x_t)\Vert _2^2-\Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\right] \\&\quad + \frac{1}{2}\left[ \Vert g_{t+1}(x_{t+1})\Vert _2^2-\Vert g_t(x_t)\Vert _2^2\right] \\&\quad + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 + 4kG^2V_{x_t}(x_{t+1}), \end{aligned}$$
-
(b)
and if we further assume Assumption 5, we also have for any \(x\in X_0\) and \(t_0\in \{0,1\}\)
$$\begin{aligned} \langle \nabla f_{t+t_0}(x_t),x_{t+1}-x\rangle&\le \frac{1}{\gamma _t}\left[ V_{x_t}(x)-V_{x_{t+1}}(x)\right] \\&\quad + \frac{1}{2}\left[ \Vert \lambda _t-g_t(x_t)\Vert _2^2- \Vert \lambda _{t+1}-g_{t+1}(x_{t+1})\Vert _2^2\right] \\&\quad + \frac{1}{2}\left[ \Vert g_{t+1}(x_{t+1})\Vert _2^2 - \Vert g_t(x_t)\Vert _2^2\right] + \Vert g_{t+1}(x_t)-g_t(x_t)\Vert _2^2 \\&\quad + \left( \sqrt{k}H_g\Vert \lambda _t\Vert _2+4kG^2-\frac{1}{\gamma _t}\right) V_{x_t}(x_{t+1}). \end{aligned}$$
Proof
We first prove part (a). For any \(t\in [T]\), expanding the inequality in Lemma 1(c), we have
Recall from Remark 7 that by taking \(x=g_{t+1}(x_{t+1})\), \(y=0\), \(z=g_t(x_t)\), we obtain
Plugging this into (26) and rearranging results in
Note also that using the subgradient inequality, the definition of the dual norm, and Assumption 2, for any \(i\in [k]\), we have the following relations
These two inequalities together imply \(\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2\le \sqrt{k}G\Vert x_{t+1}-x_t\Vert \). Then, using triangle inequality, we deduce
where the last inequality follows from the fact that \(\Vert g_{t+1}(x_{t+1})-g_{t+1}(x_t)\Vert _2\le \sqrt{k}G\Vert x_{t+1}-x_t\Vert \). By plugging the above inequality into (27), we arrive at
Finally, by noting that \(\Vert x_{t+1}-x_t\Vert ^2\le 2V_{x_t}(x_{t+1})\) and rearranging the above inequality, we obtain the desired inequality for part (a).
To prove part (b), we first show that under Assumption 5, for any \(x\in X_0\) and \(t_0\in \{0,1\}\), Algorithm 2 ensures
To see this, fix any \(x\in X_0\) and \(t_0\in \{0,1\}\). It follows from Lemma 6(b) that
Recall that in Algorithm 2, \(\xi _t=\nabla f_{t+t_0}(x_t)+\sum _{i\in [k]}\lambda _{t,i}\nabla g_{t+1,i}(x_t)\). Combining the subgradient inequality and Assumption 5 results in
By plugging this into (29) and rearranging, we arrive at
Equation (28) then follows by noting that \(\frac{1}{2}\Vert x_{t+1}-x_t\Vert ^2\le V_{x_t}(x_{t+1})\), \(\sum _{i\in [k]}\lambda _{t,i}\le \sqrt{k}\Vert \lambda _t\Vert _2\) and that \(g_{t+1,i}(x)\le 0\) for \(x\in X_0\).
Now, part (b) follows from combining part (a) and (28). \(\square \)
In our next result, we observe a consequence of Assumption 4. Lemma 10 will play a crucial role in bounding the dual variable \(\lambda _t\) in Lemma 3.
Lemma 10
Under Assumption 4, running Algorithm 2 for any \(\tau \ge 2\), we have
Proof
For all \(t\ge 1\), under Assumption 4, using the definition of saddle point \((x_t^*,\lambda _t^*)\), we have
Thus, \(f_t(x_t^*)-f_t(x_t) \le \langle \lambda _t^*,g_t(x_t)\rangle \). By summing over \(t-1\in [\tau ]\), for \(\tau \ge 1\), we get
Here, the second inequality follows from Lemma 1(b). The last inequality follows from Assumption 4 that guarantees \(\lambda _t^*\le \lambda _{t+1}^*\) and Lemma 1(a) that ensures \(\lambda _t-g_t(x_t)\ge 0\) for any \(t\ge 1\). Finally, applying Cauchy-Schwarz inequality and using Assumption 4 once more lead to the desired conclusion. \(\square \)
Lemma 3
Suppose for any \(\tau \in [T]\), the relation (10) holds with \(C_0(T), C_1, C_2 \ge 0\). Let \(C_3(T)\) be defined in (11), and let \(U_f(T)\) be defined in Assumption 6. For \(t\in [T]\), let \(0<\gamma _t\le \gamma _0\), where \(\gamma _0\) satisfies
Then, under Assumptions 1 to 6, Algorithm 2 guarantees for any \(t\in [T]\),
-
(a)
\(\Vert \lambda _t-g_t(x_t)\Vert _2\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)-\sqrt{k}F\),
-
(b)
\(\Vert \lambda _t\Vert _2 \le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)\), and
-
(c)
\(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\).
Proof
We will show by induction that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) and
The base case holds since \(\Vert \lambda _1-g_1(x_1)\Vert _2\le \Vert g_1(x_1)\Vert _2\le \sqrt{k}F\le \varLambda '\) by definition \(\lambda _1=\max \{g_1(x_1),0\}\) and Assumption 2. Moreover,
where the first inequality follows since (12) implies \(\frac{1}{\gamma _1}\ge \frac{2}{C_2}(\sqrt{k}H_gC_3(T)+C_1)\) and since the definition \(\lambda _1=\max \{g_1(x_1),0\}\) together with Assumption 2 implies \(\Vert \lambda _1\Vert _2 \le \Vert g_1(x_1)\Vert _2\le \sqrt{k}F\); the last inequality follows because \(C_3(T)\ge \sqrt{k}F\) holds by its definition. Fix any \(x^*\in X_0\). Assume, as the induction hypothesis, that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t} \le 0\) for any \(t\le \tau \). Then, using the induction hypothesis and the fact that \(V_x(y)\ge 0\) for all \(x,y\), for any \(t_0\in \{0,1\}\), we deduce from (10) that
where the last inequality follows from Lemma 1(d), Assumptions 1 and 2. By Lemma 10 and Assumption 6, for any \(t_0\in \{0,1\}\) we have
By adding the above two inequalities we arrive at
When \(t_0=1\), by completing the square in (30) we observe that \( \Vert \lambda _{\tau +1}-g_{\tau +1}(x_{\tau +1})\Vert _2 \le \varLambda '\). When \(t_0=0\), by induction hypothesis we have \(\Vert \lambda _\tau -g_\tau (x_\tau )\Vert _2\le \varLambda '\). Plugging this into (30), we obtain
where we have used the definition of \(\varLambda '\) in the second inequality as well. Therefore, we have shown by induction that \(\Vert \lambda _t-g_t(x_t)\Vert _2\le \varLambda '\) for all \(t\ge 1\). Observe that \(a^2+b^2\le (a+b)^2\) for \(a,b\ge 0\), and we have \(M/r \ge 0,~ \varOmega /\sqrt{\gamma _0} \ge 0,~ C_3(T) - \sqrt{k}F - 2M/r \ge 0\). Here the last relation holds due to the definition of \(C_3(T)\) in (11). So
Thus, we deduce for all \(t\ge 1\) that \(\Vert \lambda _t-g_t(x_t)\Vert _2\le \varLambda '\le \frac{\varOmega }{\sqrt{\gamma _0}} + C_3(T)-\sqrt{k}F\) and furthermore \(\Vert \lambda _t\Vert _2\le \Vert \lambda _t-g_t(x_t)\Vert _2+\Vert g_t(x_t)\Vert _2\le \varLambda '+ \sqrt{k}F \le \frac{\varOmega }{\sqrt{\gamma _0}}+C_3(T)\). Then, from \(\gamma _0\ge \gamma _t>0\) for all \(t\ge 1\), we have
Solving the inequality \(\sqrt{k}H_gC_3(T)+C_1+\sqrt{k}H_g\varOmega \frac{1}{\sqrt{\gamma _0}} -C_2\frac{1}{\gamma _0}\le 0\) in terms of \(\gamma _0^{-1/2}>0\) gives
Using the fact that \((a+b)^2\le 2a^2+2b^2\), for \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) to hold, we conclude that it suffices to have
Note that this inequality is exactly the premise of this lemma. Hence, this concludes the induction step and establishes that \(\sqrt{k}H_g\Vert \lambda _t\Vert _2+C_1-C_2\frac{1}{\gamma _t}\le 0\) holds for all \(t\in [T]\). \(\square \)
1.2 Exploiting smoothness
In this section, we exploit an extra smoothness assumption on \(\{f_t\}_{t\in [T]}\), namely Assumption 8, and prove Theorem 5.
Theorem 5
Suppose that Assumptions 1 to 8 hold. Let \(t_0=1,\) and for \(t\in [T]\) select \(\gamma _t=\gamma _0\) such that
Then, for any \(x^*\in X_0\), running Algorithm 2 guarantees that
Proof
Combining Lemma 7(b) with Lemma 2(b) leads to
Let \(\tau \in [T]\). By summing up the inequality over \(t\in [\tau ]\), we arrive at
where we note the telescoping terms due to \(\gamma _t=\gamma _0\), and apply Assumption 7. Thus, (10) is satisfied by taking \(C_0(T)=U_g(T)\), \(C_1=H_f+2kG^2\) and \(C_2=1\). From Theorem 3, we deduce
and
where
Recall that we can get rid of \(F_0\) in the case where \(t_0=1\). Moreover, (12) turns into
\(\square \)
1.3 Exploiting strong convexity
Theorem 6
Suppose that Assumptions 1 to 7 and 9 hold. Let \(t_0=0\), and for \(t\in [T]\) select \(\gamma _t=\min \{\frac{1}{ct},\gamma _0\}\), where
Then, for any \(x^*\in X_0\), running Algorithm 2 guarantees that
Proof
Plugging Lemma 7(c) into Lemma 2(b) leads to
Note that
The first inequality follows since \(\frac{1}{\gamma _{t+1}}-c-\frac{1}{\gamma _t}\le \max \{\frac{1}{\gamma _0}-c-\frac{1}{\gamma _0},c(t+1)-c-ct\}\le 0\), and in the last step we drop nonpositive terms. Moreover, \(\sum _{t\in [\tau ]}\gamma _t\le \sum _{t\in [\tau ]}\frac{1}{ct}\le \frac{1}{c}\ln T\). Summing up the inequality in Lemma 7(c) over \(t\in [\tau ]\),
where we apply Assumption 7 in addition to the above observations. Thus, (10) is satisfied by taking \(C_0(T)=U_g(T)+\frac{G^2}{c}\ln T\), \(C_1=4kG^2\) and \(C_2=\frac{1}{2}\). By Theorem 3,
\(\square \)
Implementation details
1.1 Stepsize computation for Section 7.1
In this section, we give the details of estimating the constants and computing the stepsizes for Algorithms 1 and 2 in Section 7.1.
For Algorithm 1, according to Theorem 7, we take stepsizes \(\gamma _0=T^{-2/3}\), \(\beta _0=T^{-1/3}\), and the Slater constant \(r=T^{-1/3}\).
To estimate the stepsize bound for Algorithm 2, we first take \(r=O(T^{-1/2})\) according to Theorem 8. Now, recall that
Since \(\{\eta _t\}\) is generated by mirror descent with stepsize \(1/H_g\), where \(H_g\) is the Lipschitz constant of \(\nabla \eta \), we have \(\eta _t-\eta ^*\le \frac{H_g}{2T}\Vert x_0-x_\eta ^*\Vert _2^2\). We then have the following estimates for the parameters \(F=\max \{F_f,F_g\}\), \(G=\max \{G_f,G_g\}\), \(H_f\), \(H_g\), \(U_f(T)\), \(U_g(T)\), M and \(F_0\).
According to Theorem 4, plugging the parameters into the following expression
gives the stepsize \(\gamma _0\) for Algorithm 2. Table 8 summarizes the bounds of the parameters and the stepsize for the three linear inverse problems foxgood, baart and phillips when \(X=\{x\in {{\mathbb {R}}}^{1000}:\Vert x\Vert _2\le 20\}\).
1.2 Adaptive stepsizes via backtracking
Let us plug \(f_t=\phi \) and \(g_t=\eta -\eta _t-r\) in the description of Algorithm 1. Then, we obtain the following.
-
1.
Set \(x_1\in X\), \(\lambda _1=0\). Let \(\gamma _t\) and \(\beta _t\) be the primal and dual stepsizes.
-
2.
For \(t = 1,\dots ,T\), update
$$\begin{aligned} x_{t+1}&= \text {Prox}_{x_t}(\gamma _t(\nabla \phi (x_t)+\lambda _t\nabla \eta (x_t))), \\ \lambda _{t+1}&= \max \{\lambda _t+\beta _t(\eta (x_t)-\eta _t-r+\langle \nabla \eta (x_t),x_{t+1}-x_t\rangle ),0\}, \end{aligned}$$
where \(\{\eta _t\}\) is generated by the mirror descent algorithm. When the objective functions \(\phi \), \(\eta \) are smooth, the stepsize \(\gamma _t\) can be selected by a backtracking procedure (see [3]). In particular, we consider two additional parameters \(l_x\), \(e_x\). At each iteration, the next iterate \(x_{t+1}\) is computed based on the current \(\gamma _t=1/l_x\), and then we check whether a certain condition is satisfied. If not, \(l_x\) is multiplied by \(e_x\) repeatedly until the condition is satisfied. In our setting, we check the following condition
Plugging in \(L_t(x,\lambda ) = \phi (x) + \lambda (\eta (x)-\eta _t-r)\), the above inequality becomes
By comparing the updating rules for the dual variable in Algorithms 1 and 2, i.e., \(\lambda _{t+1,i} = \max \{\lambda _{t,i}+\beta _t\zeta _{t,i}, 0\}\) where \(\zeta _{t,i} :=g_{t,i}(x_t)+\langle \nabla g_{t,i}(x_t),x_{t+1}-x_t\rangle \) and \(\lambda _{t+1,i}=\max \{\lambda _{t,i}+2g_{t+1,i}(x_{t+1})-g_{t,i}(x_t),0\}\), we observe that taking \(\beta _t=1\) in Algorithm 1 would make the increments \(\beta _t\zeta _{t,i}\) and \(2g_{t+1,i}(x_{t+1})-g_{t,i}(x_t)\) roughly of the same magnitude. Therefore, we set the dual stepsize in Algorithm 1 to \(\beta _t=1\). Moreover, we take \(\lambda _1=100\), the Slater constant \(r=1/T=10^{-4}\), \(s_\lambda =1\), \(l_z=0.01\), \(l_x=1\), and \(e_z=e_x=1+2^{-5}\). The details of our implementation of Algorithm 1 are summarized in Algorithm 3.
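A minimal Python sketch of the backtracking primal step of Algorithm 3 is given below. Since the exact acceptance test is the condition displayed above, which we do not restate here, the sketch uses the standard descent-lemma condition \(L_t(x_{t+1},\lambda _t)\le L_t(x_t,\lambda _t)+\langle \nabla _xL_t(x_t,\lambda _t),x_{t+1}-x_t\rangle +\frac{l_x}{2}\Vert x_{t+1}-x_t\Vert _2^2\) in the spirit of [3] as a stand-in; all helper names are hypothetical. The dual step then follows the explicit update displayed earlier with \(\beta _t=1\).

```python
import numpy as np

def backtracking_primal_step(x_t, lam_t, phi, grad_phi, eta, grad_eta,
                             eta_t, r, l_x, e_x, project):
    """One primal step of Algorithm 3 with backtracking on gamma_t = 1/l_x.

    Acceptance test: the descent-lemma condition for
    L_t(x, lam) = phi(x) + lam*(eta(x) - eta_t - r); the condition actually
    used in Algorithm 3 (displayed above) may differ slightly."""
    def L(x):
        return phi(x) + lam_t * (eta(x) - eta_t - r)
    grad_L = grad_phi(x_t) + lam_t * grad_eta(x_t)
    while True:
        x_next = project(x_t - grad_L / l_x)
        d = x_next - x_t
        if L(x_next) <= L(x_t) + grad_L @ d + 0.5 * l_x * (d @ d):
            return x_next, l_x
        l_x *= e_x   # increase l_x (i.e., shrink the stepsize) and retry
```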
Similarly, when plugging in \(f_t=\phi \), \(g_t=\eta -\eta _t-r\), Algorithm 2 turns into the following.
-
1.
Set \(x_1\in X\), \(\lambda _1=\max \{\eta (x_1)-\eta _1-r,0\}\). Let \(\gamma _t\) be the primal stepsize.
-
2.
For \(t = 1,\dots ,T\), update by
$$\begin{aligned} x_{t+1}&= \text {Prox}_{x_t}(\gamma _t(\nabla \phi (x_t)+\lambda _t\nabla \eta (x_t))), \\ \lambda _{t+1}&= \max \{\lambda _t+2(\eta (x_{t+1})-\eta _{t+1}-r)-(\eta (x_t)-\eta _t-r),0\}. \end{aligned}$$
Again, \(\{\eta _t\}\) is generated by the mirror descent algorithm. The stepsize \(\gamma _t\) is determined by a backtracking procedure, where we adopt a different condition to guarantee that the last term in Lemma 2(b) is nonpositive, i.e.,
According to the calculations in Lemma 2(a), it suffices to have
Note that the condition entails an estimate for \(H_g\). Moreover, we take \(\lambda _1=0\), \(r=1/T=10^{-4}\), \(l_z=0.01\), \(l_x=1\), \(e_z=e_x=1+2^{-5}\), and \(c_x=1.1\). The implementation of Algorithm 2 is summarized in Algorithm 4.
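For completeness, the following Python sketch shows one iteration of the resulting Algorithm 4 with \(k=1\) in the Euclidean setup. It assumes that \(\gamma _t\) has already been accepted by the backtracking test, i.e., that the last term in Lemma 2(b) is nonpositive; the helper names are hypothetical.

```python
def algorithm4_step(x_t, lam_t, phi, grad_phi, eta, grad_eta,
                    eta_t, eta_next, r, gamma_t, project):
    """One iteration of the instantiated Algorithm 2 (k = 1, Euclidean setup).

    gamma_t is assumed small enough that sqrt(k)*H_g*lam_t + 4*k*G**2 <= 1/gamma_t,
    i.e., the last term in Lemma 2(b) is nonpositive."""
    grad_L = grad_phi(x_t) + lam_t * grad_eta(x_t)   # gradient of phi + lam*(eta - eta_t - r)
    x_next = project(x_t - gamma_t * grad_L)
    g_cur = eta(x_t) - eta_t - r                     # g_t(x_t)
    g_next = eta(x_next) - eta_next - r              # g_{t+1}(x_{t+1})
    lam_next = max(lam_t + 2.0 * g_next - g_cur, 0.0)
    return x_next, lam_next
```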
Remark 8
We note that the updates for \(\{x_t\}\) in Algorithms 3 and 4 are very similar to the algorithm proposed by Solodov [17]: both compute \(x_{t+1} \leftarrow \text {Proj}(x_t-l_x^{-1}\nabla _x L_t(x_t,\lambda _t))\), where \(L_t(x,\lambda ) = \phi (x) + \lambda (\eta (x)-\eta _t-r)\), for some stepsize \(l_x^{-1}\) computed via backtracking. However, there are critical differences between the method of Solodov [17] and ours.
First, the descent condition that Solodov [17] uses in the backtracking is
whereas ours are slightly different
for Algorithms 3 and 4 respectively.
Second, and most critically, the dual variable sequence \(\{\lambda _t\}\) is chosen differently. Solodov [17] does not provide an explicit update scheme, but (in line with other iterative regularization schemes) analyzes the case when \(\lambda _t \rightarrow \infty \) sufficiently slowly, so that \(\sum _{t=1}^\infty (1/\lambda _t) = \infty \). On the other hand we provide an analysis when \(\lambda _{t+1}\) is explicitly updated using knowledge of \(\eta (x_t), \eta (x_{t+1})\) and estimates \(\eta _t,\eta _{t+1}\) of the inner optimal value \(\eta ^*\). \(\square \)
1.3 Additional numerical results
In this appendix, we provide additional numerical results that illustrate how the choice of the Slater constant r changes the behavior of Algorithm 1 in the context of Remarks 1 and 3. In particular, based on Theorem 7 in Section 6.2, \(r=T^{-1/3}\) was used for Algorithm 1 in Sect. 7.1. One can notice in Fig. 1 that with this choice of r, Algorithm 1 seems to get stuck in terms of inner objective value accuracy for the instances baart and phillips. This is precisely due to the phenomenon of convergence to a neighborhood of the optimal solution discussed in Remarks 1 and 3. To illustrate this point, here we present numerical results with a setup identical to that of Sect. 7.1, where the only difference is that we use \(r=T^{-1/2}\) for Algorithm 1. With this smaller choice of r, we observe in Fig. 3 that Algorithm 1 is no longer stuck in terms of the inner objective accuracy for these instances. Thus, we conclude that the choice of r is critical not only to ensure a desired convergence rate but also to ensure convergence to the desired neighborhood of the optimizer set \(X^*\).