
Adaptive primal-dual stochastic gradient method for expectation-constrained convex stochastic programs

  • Full Length Paper
  • Published in Mathematical Programming Computation

Abstract

Stochastic gradient methods (SGMs) have been widely used for solving stochastic optimization problems. A majority of existing works assume no constraints or easy-to-project constraints. In this paper, we consider convex stochastic optimization problems with expectation constraints. For these problems, it is often extremely expensive to perform projection onto the feasible set. Several SGMs in the literature can be applied to solve such expectation-constrained stochastic problems. We propose a novel primal-dual type SGM based on the Lagrangian function. Different from existing methods, our method incorporates an adaptiveness technique to speed up convergence. At each iteration, our method queries an unbiased stochastic subgradient of the Lagrangian function, and then it renews the primal variables by an adaptive-SGM update and the dual variables by a vanilla-SGM update. We show that the proposed method has a convergence rate of \(O(1/\sqrt{k})\) in terms of the objective error and the constraint violation. Although the convergence rate is the same as those of existing SGMs, we observe its significantly faster convergence than an existing non-adaptive primal-dual SGM and a primal SGM on solving Neyman–Pearson classification and quadratically constrained quadratic programs. Furthermore, we modify the proposed method to solve convex–concave stochastic minimax problems, for which we perform adaptive-SGM updates to both primal and dual variables. A convergence rate of \(O(1/\sqrt{k})\) in terms of the primal-dual gap is also established for the modified method on minimax problems. Our code has been released at https://github.com/RPI-OPT/APriD.
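
The updates analyzed in the appendices below, (2.3)–(2.9), specify one iteration of the proposed method: clip the sampled subgradient, form adaptive first- and second-moment estimates for the primal step, and take a vanilla projected step for the dual. The following is a minimal NumPy sketch of a single iteration assembled from those formulas; the oracle sample_subgradients, the box feasible set \(X=[-B,B]^n\), the safeguard eps, and all constants are illustrative assumptions rather than the authors' released implementation (see the repository linked above for the official code).

```python
import numpy as np

def aprid_step(x, z, m, v, v_hat, sample_subgradients, alpha, rho,
               beta1=0.9, beta2=0.99, theta=10.0, B=1.0, eps=1e-8):
    """One iteration of an adaptive primal-dual SGM, assembled from the
    update formulas (2.3)-(2.9) quoted in the appendix proofs
    (a sketch, not the authors' released code)."""
    # Unbiased stochastic subgradients of the Lagrangian w.r.t. x and z
    # (the oracle `sample_subgradients` is a placeholder assumed here).
    u, w = sample_subgradients(x, z)

    # (2.4): scale u so that ||u_hat|| <= theta.
    u_hat = u / max(1.0, np.linalg.norm(u) / theta)

    # (2.3), (2.5), (2.6): first/second-moment estimates with max-tracking.
    m = beta1 * m + (1 - beta1) * u
    v = beta2 * v + (1 - beta2) * u_hat**2
    v_hat = np.maximum(v_hat, v)

    # (2.7): adaptive primal step, projected onto X (a box [-B, B]^n here;
    # eps is a numerical safeguard added in this sketch, not in the paper).
    x = np.clip(x - alpha * m / (np.sqrt(v_hat) + eps), -B, B)

    # (2.9): vanilla dual step, projected onto the nonnegative orthant.
    z = np.maximum(z + rho * w, 0.0)
    return x, z, m, v, v_hat
```

For a box X, the projection under the diagonal weighted norm induced by \((\widehat{\mathbf {v}}^k)^{1/2}\) coincides with coordinate-wise clipping, which is why np.clip suffices in this sketch.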


Data Availability Statement

This manuscript has associated data in a data repository at https://github.com/RPI-OPT/APriD.


Acknowledgements

The authors would like to thank three anonymous reviewers for their valuable comments and suggestions, which improved the quality of the paper, and for their careful testing of our code. The authors are partly supported by NSF award 2053493 and the RPI-IBM AIRC faculty fund.

Funding

This work was partly supported by NSF Award 2053493.

Author information

Corresponding author

Correspondence to Yangyang Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Lemma 2

Proof

First, consider the case of a non-constant primal step size. By \(\eta _1 = \sum _{i=1}^K \alpha _i\beta _1^{i-1}\) and the \(\eta \)-update in (2.8), we have \(\eta _k = \sum _{i=k}^K \alpha _i \beta _1^{i-k}\) for \(k\in [K]\), and thus the \(\rho \)-update becomes

$$\begin{aligned} \rho _k = \frac{\rho _{k-1}}{\beta _1 + \frac{\alpha _{k-1} }{\sum _{i=k}^K \alpha _i \beta _1^{i-k}} }. \end{aligned}$$

By the above equation, we have for \(2\le j\le K\),

$$\begin{aligned}&\rho _j = \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _{j-1} }{\sum _{k=j}^K \alpha _k \beta _1^{k-j}} } \le \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _{j-1}}{\alpha _{j-1}\sum _{k=j}^K \beta _1^{k-j}} } \le \frac{\rho _{j-1}}{\beta _1+ \frac{1}{\sum _{k=j}^\infty \beta _1^{k-j}} } = \rho _{j-1}, \end{aligned}$$

where the first inequality follows from the non-increasing monotonicity of \(\{\alpha _j\}_{j= 1}^K\) and the second from \(\beta _1\in (0,1)\). Hence, \(\{\rho _j\}_{j= 1}^K\) is a non-increasing sequence. Using (2.8) again, we have for \(2\le j\le t\le K\),

$$\begin{aligned}\rho _j = \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _{j-1} }{\sum _{k=j}^K \alpha _k \beta _1^{k-j}} } \ge \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _{j-1} }{\sum _{k=j}^t \alpha _k \beta _1^{k-j}} }= \frac{\sum _{k=j}^t \alpha _k \beta _1^{k-j}}{\sum _{k=j-1}^t \alpha _k \beta _1^{k-(j-1)}}\rho _{j-1},\end{aligned}$$

which clearly implies the inequality in (2.11).

For \(j=1\), (2.12) holds because \(\beta _1\in (0,1)\). To show that it holds for \(2\le j\le K\), we rewrite \(\rho _j\) and obtain

$$\begin{aligned}&\rho _j = \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _{j-1} }{\sum _{k=j}^K \alpha _k \beta _1^{k-j}} } = \frac{\sum _{k=j}^K \alpha _k \beta _1^{k-j}}{\sum _{k=j-1}^K \alpha _k \beta _1^{k-(j-1)}} \rho _{j-1}\\&\quad = \frac{\sum _{k=j}^K \alpha _k \beta _1^{k-j}}{\sum _{k=j-1}^K \alpha _k \beta _1^{k-(j-1)}} \times \frac{\sum _{k=j-1}^K \alpha _k \beta _1^{k-(j-1)}}{\sum _{k=j-2}^K \alpha _k \beta _1^{k-(j-2)}} \times \cdots \times \frac{\sum _{k=2}^K \alpha _k \beta _1^{k-2}}{\sum _{k=1}^K \alpha _k \beta _1^{k-1}} \rho _{1} \\&\quad = \frac{\sum _{k=j}^K \alpha _k \beta _1^{k-j}}{\sum _{k=1}^K \alpha _k \beta _1^{k-1}} \rho _{1} \le \frac{ \frac{\alpha _j}{(1\!-\!\beta _1)}}{\sum _{k=1}^K \alpha _k \beta _1^{k-1} } \rho _{1} \le \frac{ \frac{\alpha _j}{(1\!-\!\beta _1)}}{\alpha _1} \rho _{1} =\frac{ \rho _{1} \alpha _j}{\alpha _1(1\!-\!\beta _1)}, \end{aligned}$$

where the third equality recursively applies the second, and the inequalities hold by the two inequalities in (2.10).

Now, consider the case of a constant primal step size, i.e., \(\alpha _j = \alpha _1\) for all \(j\in [K]\). We can prove \(\eta _j = \frac{\alpha _1}{1-\beta _1}\) for all \(j\in [K]\) by induction, and thus

$$\begin{aligned} \rho _j = \frac{\rho _{j-1}}{\beta _1 + \frac{\alpha _1 }{\alpha _1/(1\!-\!\beta _1)}} = \frac{\rho _{j-1}}{\beta _1 + (1\!-\!\beta _1)} = \rho _{j-1}, \quad \forall \,\, j\ge 2, \end{aligned}$$

which completes the proof. \(\square \)
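
As a quick numerical sanity check of the relations just proved, the snippet below uses the illustrative step-size choice \(\alpha _j = \alpha /\sqrt{j+1}\) (an assumption of this sketch, not part of the lemma) and verifies that \(\{\rho _j\}\) is nonincreasing and obeys the bound in (2.12):

```python
import numpy as np

# Check the relations from the proof of Lemma 2: with
#   eta_k = sum_{i=k}^K alpha_i * beta1^(i-k)  and
#   rho_k = rho_{k-1} / (beta1 + alpha_{k-1} / eta_k),
# the sequence {rho_k} is nonincreasing and satisfies
#   rho_j <= rho_1 * alpha_j / (alpha_1 * (1 - beta1))   (cf. (2.12)).
K, beta1, rho1 = 200, 0.9, 1.0
alpha = 0.1 / np.sqrt(np.arange(1, K + 1) + 1.0)   # nonincreasing example

eta = np.array([sum(alpha[i] * beta1**(i - k) for i in range(k, K))
                for k in range(K)])                # 0-based: eta[k-1] = eta_k

rho = np.empty(K)
rho[0] = rho1
for k in range(1, K):
    rho[k] = rho[k - 1] / (beta1 + alpha[k - 1] / eta[k])

assert np.all(np.diff(rho) <= 1e-12)               # nonincreasing
assert np.all(rho <= rho1 * alpha / (alpha[0] * (1 - beta1)) + 1e-12)
print("Lemma 2 relations verified for this step-size choice.")
```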

Proof of Lemma 3

Proof

As \((\mathbf {x}^*,\mathbf {z}^*)\) satisfies the KKT conditions in Assumption 3, there exist \(\tilde{\nabla }f_i(\mathbf {x}^*)\), \(i\in [M]\), such that

$$\begin{aligned} -\sum _{i=1}^{M} z_i^*\tilde{\nabla }f_i(\mathbf {x}^*) \in \partial f_0(\mathbf {x}^*) + \mathcal {N}_X(\mathbf {x}^*). \end{aligned}$$

From the convexity of \(f_0\) and X, it follows that \( f_0(\mathbf {x}) - f_0(\mathbf {x}^*)\ge - \big \langle \sum _{i=1}^{M} z_i^*\tilde{\nabla }f_i(\mathbf {x}^*), \mathbf {x}-\mathbf {x}^*\big \rangle , \forall \mathbf {x}\in X\). Since \(f_i\) is convex for each \(i\in [M]\), we have \(f_i(\mathbf {x}) - f_i(\mathbf {x}^*) \ge \langle \tilde{\nabla }f_i(\mathbf {x}^*),\mathbf {x}-\mathbf {x}^*\rangle \). Noticing \(\mathbf {z}^*\ge \mathbf {0}\), we have for any \( \mathbf {x}\in X\),

$$\begin{aligned} f_0(\mathbf {x}) - f_0(\mathbf {x}^*)\ge - \sum _{i=1}^{M} z_i^*\langle \tilde{\nabla }f_i(\mathbf {x}^*), \mathbf {x}-\mathbf {x}^* \rangle \ge -\sum _{i=1}^M z_i^*(f_i(\mathbf {x})-f_i(\mathbf {x}^*)).\end{aligned}$$

Because \(z_i^*f_i(\mathbf {x}^*)=0\) for all \(i\in [M]\), we obtain (3.7).

Furthermore, for any \(\mathbf {z}\ge \mathbf {0}\), we have \(\langle \mathbf {z}, \mathbf {f}(\mathbf {x}^*)\rangle \le 0\) from \(f_i(\mathbf {x}^*)\le 0\), \(\forall i \in [M]\). Hence, combining with (3.7), we have the inequality in (3.8). \(\square \)

Proof of Lemma 4

Proof

For \(\widehat{\mathbf {u}}^k\) given in (2.4), we have \(\Vert \widehat{\mathbf {u}}^k\Vert \le \theta \), and thus each coordinate of \(\widehat{\mathbf {u}}^k\) is at most \(\theta \) in absolute value, i.e., \(-\theta \mathbf {1} \le \widehat{\mathbf {u}}^k \le \theta \mathbf {1}\). Recursively rewriting the updates in (2.3), (2.5) and (2.6) gives

$$\begin{aligned}&\mathbf {m}^k = \beta _{1}\mathbf {m}^{k-1} + (1-\beta _{1})\mathbf {u}^k = (1-\beta _{1})\sum _{j=1}^{k} \beta _{1}^{k-j} \mathbf {u}^j, \end{aligned}$$
(C.1)
$$\begin{aligned}&\mathbf {v}^k = \beta _2 \mathbf {v}^{k-1} + (1-\beta _2)({\widehat{\mathbf {u}}}^k)^2 =(1-\beta _2)\sum _{j=1}^k \beta _2^{k-j} ({\widehat{\mathbf {u}}}^j)^2, \end{aligned}$$
(C.2)
$$\begin{aligned}&\widehat{\mathbf {v}}^k = \max \{\widehat{\mathbf {v}}^{k-1}, \mathbf {v}^k\} = \max _{j\in [k]} \mathbf {v}^j, \end{aligned}$$
(C.3)

where \(\mathbf {m}^0=\mathbf {v}^0=\widehat{\mathbf {v}}^0=\mathbf {0}\). By (C.2) and \(-\theta \mathbf {1} \le \widehat{\mathbf {u}}^k \le \theta \mathbf {1}\), we have \( \mathbf {v}^k \le \theta ^2 (1-\beta _2)\sum _{j=1}^k \beta _2^{k-j} \mathbf {1} \le \theta ^2 \mathbf {1}\). By (C.3), we further have \( \widehat{\mathbf {v}}^k \le \theta ^2 \mathbf {1}\). Thus \( \mathbb {E}\big [\Vert (\widehat{\mathbf {v}}^k)^{1/2}\Vert _1\big ] \le n\theta \) holds.

Notice \( \mathbb {E}\big [\Vert \mathbf {m}^{k}\Vert _{({\widehat{\mathbf {v}}}^k)^{-{1/2}}}^2\big ] = \mathbb {E}\big [\Vert \frac{\mathbf {m}^{k}}{({\widehat{\mathbf {v}}}^k)^{{1/4}}}\Vert ^2\big ]\). We can lower bound \(\mathbf {v}^k\) by keeping only the last term in (C.2) since \(({\widehat{\mathbf {u}}}^{j})^2\ge \mathbf {0}\), i.e. \( \mathbf {v}^k \ge (1-\beta _2)({\widehat{\mathbf {u}}}^k)^2 \). By (C.3), we also have \(\widehat{\mathbf {v}}^k \ge (1-\beta _2) \max _{j\in [k]} ({\widehat{\mathbf {u}}}^j)^2.\) Plugging the inequality and (C.1) into \( \mathbb {E}\big [\Vert \mathbf {m}^{k}\Vert _{({\widehat{\mathbf {v}}}^k)^{-{1/2}}}^2\big ] \) gives

$$\begin{aligned} \mathbb {E}\big [ \Vert \mathbf {m}^{k}\Vert _{( \widehat{\mathbf {v}}^k )^{-{1/2}}}^2\big ] \le \,&\mathbb {E}\left[ \left\| \frac{\mathbf {m}^k}{{\big ((1-\beta _2) \max _{j\in [k]} ({\widehat{\mathbf {u}}}^j)^2\big )^{1/4}}} \right\| ^2\right] \nonumber \\ =&\frac{(1\!-\!\beta _1)^2}{(1-\beta _2)^{1/2}} \mathbb {E}\left[ \left\| \frac{\sum _{j=1}^{k} \beta _{1}^{k-j} \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\right] . \end{aligned}$$
(C.4)

Then we bound \(\left\| \frac{\sum _{j=1}^{k} \beta _{1}^{k-j} \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}} \right\| ^2\) by the Cauchy–Schwarz inequality:

$$\begin{aligned}&\left\| \frac{\sum _{j=1}^{k} \beta _{1}^{k-j} \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2 = \left\| \sum _{j=1}^k (\beta _1^{k-j})^{1/2} \frac{(\beta _{1}^{k-j})^{1/2} \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2 \nonumber \\&\quad \le \bigg ( \sum _{j=1}^{k} \Big ((\beta _{1}^{k-j})^{1/2}\Big )^2 \bigg ) \bigg (\sum _{j=1}^{k}\left\| \frac{(\beta _{1}^{k-j})^{1/2} \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\bigg ) \nonumber \\&\quad = \Big ( \sum _{j=1}^{k} \beta _{1}^{k-j} \Big ) \bigg (\sum _{j=1}^{k}\beta _{1}^{k-j}\left\| \frac{ \mathbf {u}^j }{\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\bigg )\nonumber \\&\quad \le \Big ( \sum _{j=1}^{k} \beta _{1}^{k-j} \Big ) \bigg (\sum _{j=1}^{k}\beta _{1}^{k-j}\left\| \frac{ \mathbf {u}^j }{\mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\bigg ), \end{aligned}$$
(C.5)

where the second inequality uses \(\max _{j\in [k]} \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}\ge \mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}\) for \(j\in [k]\). To bound \(\left\| \frac{ \mathbf {u}^j }{\mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\), we use \({\widehat{\mathbf {u}}}^j\) given in (2.4) and notice that \(\max \big \{1, \frac{\Vert \mathbf {u}^j\Vert }{\theta }\big \}\) is a scalar:

$$\begin{aligned} \left\| \frac{ \mathbf {u}^j }{\mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2&= \bigg \Vert \frac{ \mathbf {u}^j }{ \mid \frac{\mathbf {u}^j}{\max \big \{1, \frac{\Vert \mathbf {u}^j\Vert }{\theta }\big \}}\mid ^{1/2}}\bigg \Vert ^2 = \max \big \{1, \frac{\Vert \mathbf {u}^j\Vert }{\theta }\big \} \left\| \mathbf {u}^j \right\| _1\\&\le \max \big \{1, \frac{\Vert \mathbf {u}^j\Vert }{\theta }\big \} \sqrt{n}\big \Vert \mathbf {u}^j \big \Vert = {\left\{ \begin{array}{ll} \sqrt{n} \big \Vert \mathbf {u}^j\big \Vert , &{} \text{ if } \Vert \mathbf {u}^j\Vert \le \theta , \\ \frac{\sqrt{n}\big \Vert \mathbf {u}^j\big \Vert ^2}{\theta } , &{} \text{ if } \Vert \mathbf {u}^j\Vert > \theta . \end{array}\right. } \end{aligned}$$

Hence, \(\left\| \frac{ \mathbf {u}^j }{\mid {\widehat{\mathbf {u}}}^j\mid ^{1/2}}\right\| ^2\le \sqrt{n} \left( \theta + \frac{\Vert \mathbf {u}^j \Vert ^2}{\theta }\right) \). Plugging this inequality into (C.5), and then (C.5) into (C.4), gives

$$\begin{aligned}&\mathbb {E}\big [\Vert \mathbf {m}^{k}\Vert _{({\widehat{\mathbf {v}}}^k)^{-{1/2}}}^2\big ]\\&\le \frac{(1\!-\!\beta _1)^2}{(1-\beta _2)^{1/2}} \mathbb {E}\left[ \Big ( \sum _{j=1}^{k} \beta _{1}^{k-j} \Big ) \Bigg (\sum _{j=1}^{k}\beta _{1}^{k-j} \sqrt{n} \bigg (\theta + \frac{\Vert \mathbf {u}^j \Vert ^2}{\theta }\bigg ) \Bigg )\right] \\&= \frac{ \sqrt{n} (1\!-\!\beta _1)^2}{(1-\beta _2)^{1/2}} \Big ( \sum _{j=1}^{k} \beta _{1}^{k-j} \Big ) \Bigg (\sum _{j=1}^{k}\beta _{1}^{k-j} \bigg (\theta + \frac{ \mathbb {E}\left[ \Vert \mathbf {u}^j \Vert ^2\right] }{\theta }\bigg ) \Bigg )\\&\le \frac{ \sqrt{n} (1\!-\!\beta _1)^2}{(1-\beta _2)^{1/2}} \Big ( \sum _{j=1}^{k} \beta _{1}^{k-j} \Big )^2\max _{j\in [k]} \bigg (\theta + \frac{ \mathbb {E}\left[ \Vert \mathbf {u}^j \Vert ^2\right] }{\theta }\bigg ) \\&\le \frac{\sqrt{n} \Big (\theta + \frac{\max _{j\in [k]} \mathbb {E}\left[ \Vert \mathbf {u}^j \Vert ^2 \right] }{\theta }\Big )}{(1-\beta _2)^{1/2}}, \end{aligned}$$

where we have used \( \sum _{j=1}^{k} \beta _{1}^{k-j} = \sum _{j=0}^{k-1} \beta _{1}^{j}\le \frac{1}{1-\beta _1}\) for \(\beta _1\in (0,1)\) in the last inequality. With (3.2), the proof is finished. \(\square \)
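
The two mechanical facts driving this proof, namely \(\Vert \widehat{\mathbf {u}}^k\Vert \le \theta \) from the clipping step (2.4) and the resulting coordinate-wise bound \(\widehat{\mathbf {v}}^k \le \theta ^2\mathbf {1}\), can be illustrated numerically; the random vectors below merely stand in for the stochastic subgradients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, beta2, theta = 5, 1000, 0.99, 1.0

v = np.zeros(n)
v_hat = np.zeros(n)
for _ in range(K):
    u = 10.0 * rng.standard_normal(n)                # stand-in subgradient
    u_hat = u / max(1.0, np.linalg.norm(u) / theta)  # clipping step (2.4)
    assert np.linalg.norm(u_hat) <= theta + 1e-12
    v = beta2 * v + (1 - beta2) * u_hat**2           # (C.2)
    v_hat = np.maximum(v_hat, v)                     # (C.3)

# Coordinate-wise bound used in the proof: v_hat^k <= theta^2 * 1,
# hence ||(v_hat^k)^{1/2}||_1 <= n * theta.
assert np.all(v_hat <= theta**2 + 1e-12)
assert np.sum(np.sqrt(v_hat)) <= n * theta + 1e-9
print("Clipping and moment bounds from Lemma 4 hold numerically.")
```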

Proof of Lemma 5

Proof

From the projection (2.7) in the primal variable update, we have for \(k\in [K]\) and \( \forall \mathbf {x}\in X\),

$$\begin{aligned} 0&\ge \left\langle \mathbf {x}^{k+1}-\mathbf {x}, \mathbf {x}^{k+1}-\Big (\mathbf {x}^k - \alpha _k \mathbf {m}^k/(\widehat{\mathbf {v}}^k)^{1/2}\Big )\right\rangle _{(\widehat{\mathbf {v}}^k)^{1/2}}\nonumber \\&= \left\langle \mathbf {x}^{k+1}-\mathbf {x}, (\widehat{\mathbf {v}}^k)^{1/2}\big (\mathbf {x}^{k+1}- \mathbf {x}^k\big ) + \alpha _k \mathbf {m}^k \right\rangle . \end{aligned}$$
(D.1)

The first term on the right-hand side equals

$$\begin{aligned}&\left\langle \mathbf {x}^{k+1}-\mathbf {x}, (\widehat{\mathbf {v}}^k)^{1/2}\Big (\mathbf {x}^{k+1} - \mathbf {x}^k\Big ) \right\rangle \nonumber \\&\quad = \frac{1}{2} \left( \big \Vert \mathbf {x}^{k+1}-\mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} - \big \Vert \mathbf {x}^k - \mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} + \big \Vert \mathbf {x}^{k+1} - \mathbf {x}^k \big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} \right) . \end{aligned}$$
(D.2)

Recursively rewriting \( \big \langle \mathbf {x}^{k+1}-\mathbf {x}, \mathbf {m}^k \big \rangle \) with the update (2.3) gives

$$\begin{aligned}&\big \langle \mathbf {x}^{k+1}-\mathbf {x}, \mathbf {m}^k \big \rangle \end{aligned}$$
(D.3)
$$\begin{aligned}&= \big \langle \mathbf {x}^{k+1} -\mathbf {x}^k, \mathbf {m}^k \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^k - \mathbf {x}, \mathbf {u}^k \big \rangle + \beta _{1}\big \langle \mathbf {x}^k - \mathbf {x},\mathbf {m}^{k-1} \big \rangle \nonumber \\&= \big \langle \mathbf {x}^{k+1} -\mathbf {x}^k, \mathbf {m}^k \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^k - \mathbf {x}, \mathbf {u}^k \big \rangle \nonumber \\&\qquad + \beta _{1}\left( \big \langle \mathbf {x}^{k} -\mathbf {x}^{k-1}, \mathbf {m}^{k-1} \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^{k-1} - \mathbf {x}, \mathbf {u}^{k-1} \big \rangle \right) \nonumber \\&\qquad \ldots \nonumber \\&\qquad + \beta _{1}^{k-1}\left( \big \langle \mathbf {x}^{2} -\mathbf {x}^{1}, \mathbf {m}^{1} \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^{1} - \mathbf {x}, \mathbf {u}^{1} \big \rangle \right) + \beta _{1}^k\big \langle \mathbf {x}^{1} - \mathbf {x},\mathbf {m}^{0}\big \rangle \nonumber \\&= \sum _{j=1}^k \beta _1^{k-j} \Big (\big \langle \mathbf {x}^{j+1} -\mathbf {x}^{j}, \mathbf {m}^{j} \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}\big \rangle \Big ), \end{aligned}$$
(D.4)

where the second equality recursively applies the first, and the last term \(\beta _{1}^k\big \langle \mathbf {x}^{1} - \mathbf {x},\mathbf {m}^{0}\big \rangle \) vanishes because \(\mathbf {m}^0 = \mathbf {0}\). Plugging (D.2) and (D.4) into the inequality (D.1) gives

$$\begin{aligned}&\alpha _k \sum _{j=1}^k \beta _1^{k-j} \Big (\big \langle \mathbf {x}^{j+1} -\mathbf {x}^{j}, \mathbf {m}^{j} \big \rangle + (1-\beta _{1})\big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j} \big \rangle \Big ) \nonumber \\&\quad \le \frac{1}{2} \left( -\big \Vert \mathbf {x}^{k+1}-\mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} +\big \Vert \mathbf {x}^k - \mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} - \big \Vert \mathbf {x}^{k+1} - \mathbf {x}^k \big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}}\right) . \end{aligned}$$
(D.5)

Now sum the inequality (D.5) over \(k=1,\ldots ,t\). For the left-hand side, we have

$$\begin{aligned}&\sum _{k=1}^t\alpha _k \sum _{j=1}^k \beta _1^{k-j} \Big (\big \langle \mathbf {x}^{j+1} -\mathbf {x}^{j}, \mathbf {m}^{j} \big \rangle \Big )= \sum _{j=1}^t \Big (\big \langle \mathbf {x}^{j+1} -\mathbf {x}^{j}, \mathbf {m}^{j} \big \rangle \Big )\sum _{k=j}^t \alpha _k \beta _1^{k-j}\nonumber \\&\ge \sum _{j=1}^t\Big ( -\frac{\big \Vert \mathbf {x}^{j+1} -\mathbf {x}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{1/2}}}{2\sum _{k=j}^t \alpha _k \beta _1^{k-j}}-\frac{\sum _{k=j}^t \alpha _k \beta _1^{k-j}}{2}\big \Vert \mathbf {m}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{-{1/2}}} \Big ) \sum _{k=j}^t \alpha _k \beta _1^{k-j}\nonumber \\&\overset{(2.10)}{\ge }\sum _{j=1}^t\Big ( -\frac{\big \Vert \mathbf {x}^{j+1} -\mathbf {x}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{1/2}}}{2} -\frac{\alpha _j^2\big \Vert \mathbf {m}^{j} \big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{-{1/2}}}}{2(1\!-\!\beta _1)^2} \Big ). \end{aligned}$$
(D.6)

For the right-hand side of the summed inequality (D.5), since the update (2.6) implies \((\widehat{\mathbf {v}}^{k})^{1/2} \ge (\widehat{\mathbf {v}}^{k-1})^{1/2}\ge \mathbf {0}\) for \(k \in [t]\), we have by Assumption 1 that

$$\begin{aligned}&\sum _{k=1}^{t} \frac{1}{2} \left( -\big \Vert \mathbf {x}^{k+1}-\mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}} + \big \Vert \mathbf {x}^k - \mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}}\right) \nonumber \\&\quad = \frac{1}{2} \left( -\big \Vert \mathbf {x}^{t+1}-\mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^t)^{1/2}} + \sum _{k=2}^{t} \big \Vert \mathbf {x}^{k}-\mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}-(\widehat{\mathbf {v}}^{k-1})^{1/2}} + \big \Vert \mathbf {x}^1 - \mathbf {x}\big \Vert ^2_{(\widehat{\mathbf {v}}^1)^{1/2}}\right) \nonumber \\&\quad \le \frac{B^2}{2} \left( \sum _{k=2}^{t} \big \Vert (\widehat{\mathbf {v}}^k)^{1/2}-(\widehat{\mathbf {v}}^{k-1})^{1/2}\big \Vert _1 +\big \Vert (\widehat{\mathbf {v}}^1)^{1/2}\big \Vert _1\right) = \frac{B^2}{2} \big \Vert (\widehat{\mathbf {v}}^t)^{1/2}\big \Vert _1. \end{aligned}$$
(D.7)

Thus, with the inequalities (D.6) and (D.7), the sum of (D.5) over \(k=1,\ldots ,t\) becomes

$$\begin{aligned}&\sum _{k=1}^t \! \alpha _k \! \sum _{j=1}^k \! \beta _1^{k-j} (1\!-\!\beta _{1})\big \langle \mathbf {x}^{j} \!-\! \mathbf {x}, \mathbf {u}^{j} \big \rangle \!+\! \sum _{j=1}^t\!\Big (\! -\!\frac{\big \Vert \mathbf {x}^{j+1} \!-\!\mathbf {x}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{1/2}}}{2}\!-\!\frac{\alpha _j^2\big \Vert \mathbf {m}^{j} \big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{\!-\!{1/2}}}}{2(1\!-\!\beta _1)^2} \Big )\\&\quad \le \frac{B^2}{2} \big \Vert (\widehat{\mathbf {v}}^t)^{1/2}\big \Vert _1 - \frac{1}{2}\sum _{k=1}^t \big \Vert \mathbf {x}^{k+1} \!-\! \mathbf {x}^k \big \Vert ^2_{(\widehat{\mathbf {v}}^k)^{1/2}}. \end{aligned}$$

Canceling the term \(\big \Vert \mathbf {x}^{j+1} -\mathbf {x}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{1/2}}\) on both sides and exchanging the order of summation in the first term gives

$$\begin{aligned} (1-\beta _{1}) \sum _{j=1}^t \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j} \big \rangle \sum _{k=j}^t \alpha _k \beta _1^{k-j} \le \frac{B^2}{2} \big \Vert (\widehat{\mathbf {v}}^t)^{1/2}\big \Vert _1 + \frac{\sum _{j=1}^t \alpha _j^2\big \Vert \mathbf {m}^{j}\big \Vert ^2_{(\widehat{\mathbf {v}}^j)^{-1/2}}}{2(1\!-\!\beta _1)^2} . \end{aligned}$$

Taking the expectation of the above inequality and using the bounds given in Lemma 4, we have

$$\begin{aligned}&(1-\beta _{1}) \sum _{j=1}^t \mathbb {E}\left[ \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j} \big \rangle \right] \sum _{k=j}^t \alpha _k \beta _1^{k-j}\\&\quad \le \frac{n\theta B^2}{2} + \frac{\sum _{j=1}^t \alpha _j^2 \frac{\sqrt{n}(\theta + \frac{\widehat{P}_\mathbf {z}^j}{\theta })}{(1-\beta _2)^{1/2}} }{2(1\!-\!\beta _1)^2}\\&\quad \le \frac{n\theta B^2}{2} + \frac{ \sqrt{n}(\theta + \frac{\widehat{P}_\mathbf {z}^t}{\theta }) \sum _{j=1}^t \alpha _j^2 }{2(1\!-\!\beta _1)^2(1\!-\!\beta _2)^{1/2}}, \end{aligned}$$

where the last inequality holds because \(\widehat{P}_\mathbf {z}^k\), defined in (3.2), is nondecreasing with respect to k. \(\square \)

Proof of Lemma 6

Proof

Since the dual variable is projected onto the nonnegative orthant in the update (2.9), it follows that for any \(\mathbf {z}\ge \mathbf {0}\) and \(j\in [K]\),

$$\begin{aligned}&\Big \langle \mathbf {z}^{j+1}-\mathbf {z}, \mathbf {z}^{j+1}-\big (\mathbf {z}^j + {\rho _j}\mathbf {w}^j\big ) \Big \rangle \le 0. \end{aligned}$$

This can be rewritten as

$$\begin{aligned}&\big \langle \mathbf {z}^{j+1}-\mathbf {z}, \mathbf {z}^{j+1}-\mathbf {z}^j\big \rangle \le \big \langle \mathbf {z}^{j+1}-\mathbf {z}, {\rho _j}\mathbf {w}^j\big \rangle \nonumber \\&\quad = \big \langle \mathbf {z}^{j+1}-\mathbf {z}^j, {\rho _j}\mathbf {w}^j\big \rangle + \big \langle \mathbf {z}^{j}-\mathbf {z}, {\rho _j}\mathbf {w}^j\big \rangle . \end{aligned}$$
(E.1)

For each term of the above inequality (E.1), we have

$$\begin{aligned} \big \langle \mathbf {z}^{j+1}-\mathbf {z}, \mathbf {z}^{j+1}-\mathbf {z}^j \big \rangle&= \frac{1}{2}\Big ( \big \Vert \mathbf {z}^{j+1}-\mathbf {z}^j\big \Vert ^2+\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 - \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\Big ),\\ \big \langle \mathbf {z}^{j+1}-\mathbf {z}^j, \mathbf {w}^j\big \rangle&\le \frac{1}{2{\rho _j}}\big \Vert \mathbf {z}^{j+1}-\mathbf {z}^j\big \Vert ^2 + \frac{{\rho _j}}{2} \big \Vert \mathbf {w}^j\big \Vert ^2, \\ \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j\big \rangle&= \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle + \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {f}(\mathbf {x}^j)\big \rangle . \end{aligned}$$

Plugging the above three relations into the inequality (E.1) and canceling \(\big \Vert \mathbf {z}^{j+1}-\mathbf {z}^j\big \Vert ^2\) gives

$$\begin{aligned}\frac{1}{2\rho _j}\Big (\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 \!-\! \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\Big ) \le \frac{{\rho _j}}{2} \big \Vert \mathbf {w}^j\big \Vert ^2 \!+\! \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle \!+\! \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {f}(\mathbf {x}^j)\big \rangle .\end{aligned}$$

Rearranging the above inequality gives the inequality (3.10). \(\square \)
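
The only projection fact used at the start of this proof is the standard optimality property of the Euclidean projection onto the nonnegative orthant: if \(\mathbf {z}^{j+1} = [\mathbf {z}^j + \rho _j\mathbf {w}^j]_+\), then \(\big \langle \mathbf {z}^{j+1}-\mathbf {z}, \mathbf {z}^{j+1}-(\mathbf {z}^j + \rho _j\mathbf {w}^j)\big \rangle \le 0\) for every \(\mathbf {z}\ge \mathbf {0}\). A randomized check on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
M, rho = 4, 0.5
for _ in range(10000):
    z_j = rng.uniform(0, 2, M)          # current dual iterate (nonnegative)
    w_j = rng.standard_normal(M)        # stand-in stochastic constraint value
    z_next = np.maximum(z_j + rho * w_j, 0.0)   # update (2.9)
    z = rng.uniform(0, 2, M)            # arbitrary comparison point z >= 0
    # Projection optimality condition used at the start of the proof of Lemma 6
    assert np.dot(z_next - z, z_next - (z_j + rho * w_j)) <= 1e-10
print("Projection inequality verified on random instances.")
```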

Proof of Lemma 7

Proof

For any \(j\in [K]\), we have

$$\begin{aligned} \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j} \big \rangle = \big \langle \mathbf {x}^{j} - \mathbf {x}, \tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle +\big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}-\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle . \end{aligned}$$
(F.1)

Here \( \tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) = \mathbb {E}\big [\mathbf {u}^j\mid \mathcal {H}^j\big ] \in \partial _\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\) according to Assumption 2. By the convexity of \(f_i(\mathbf {x}), i=0,1, \ldots , M\), we know \(\mathcal {L}(\mathbf {x},\mathbf {z})\) is convex with respect to \(\mathbf {x}\) and

$$\begin{aligned}&\big \langle \mathbf {x}^j \!-\!\mathbf {x}, \tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \!\ge \! \mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \!-\! \mathcal {L}(\mathbf {x},\mathbf {z}^j) \!=\! f_0(\mathbf {x}^j) \!- \! f_0(\mathbf {x}) \!+\! \big \langle \mathbf {z}^j, \mathbf {f}(\mathbf {x}^j)\big \rangle \!-\!\big \langle \mathbf {z}^j,\mathbf {f}(\mathbf {x})\big \rangle . \end{aligned}$$

Plugging the lower bound of \(\langle \mathbf {z}^j, \mathbf {f}(\mathbf {x}^j)\rangle \) given in Lemma 6 into the above inequality yields

$$\begin{aligned}&\big \langle \mathbf {x}^j-\mathbf {x}, \tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \\&\quad \ge f_0(\mathbf {x}^j)-f_0(\mathbf {x}) - \big \langle \mathbf {z}^j, \mathbf {f}(\mathbf {x})\big \rangle + \big \langle \mathbf {z}, \mathbf {f}(\mathbf {x}^j)\big \rangle + \frac{1}{2\rho _j}\Big (\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 - \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\Big ) \\&\qquad - \frac{{\rho _j}}{2} \big \Vert \mathbf {w}^j\big \Vert ^2 -\big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle . \end{aligned}$$

Summing (F.1) over \(j\in [t]\) with weights \( \sum _{k=j}^t \alpha _k\beta _1^{k-j}\), and plugging in the above inequality, gives

$$\begin{aligned}&\sum _{j=1}^t \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j} \big \rangle \sum _{k=j}^t \alpha _k\beta _1^{k-j}\nonumber \\&\quad \ge \sum _{j=1}^t \bigg (f_0(\mathbf {x}^j)-f_0(\mathbf {x}) - \big \langle \mathbf {z}^j, \mathbf {f}(\mathbf {x})\big \rangle + \big \langle \mathbf {z}, \mathbf {f}(\mathbf {x}^j)\big \rangle +\frac{1}{2\rho _j}\Big (\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 -\big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\Big ) \nonumber \\&\qquad - \frac{\rho _j}{2} \big \Vert \mathbf {w}^j\big \Vert ^2-\big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle +\big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}-\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \bigg ) \sum _{k=j}^t \alpha _k\beta _1^{k-j}. \end{aligned}$$
(F.2)

The weighted sum of the terms \(\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 - \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\) can be lower bounded as follows:

$$\begin{aligned}&\sum _{j=1}^t \frac{1}{2\rho _j}\Big (\big \Vert \mathbf {z}^{j+1} - \mathbf {z}\big \Vert ^2 - \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2\Big ) \sum _{k=j}^t \alpha _k\beta _1^{k-j}\nonumber \\&= -\frac{ \sum _{k=1}^t \alpha _k \beta _1^{k-1} }{2\rho _1}\big \Vert \mathbf {z}^1-\mathbf {z}\big \Vert ^2 + \frac{\alpha _t }{2\rho _t}\big \Vert \mathbf {z}^{t+1}-\mathbf {z}\big \Vert ^2 \nonumber \\&\quad + \sum _{j=2}^t\left( \frac{\sum _{k=j-1}^t \alpha _k \beta _1^{k-(j-1)}}{2\rho _{j-1}} - \frac{\sum _{k=j}^t \alpha _k \beta _1^{k-j}}{2\rho _j}\right) \big \Vert \mathbf {z}^j-\mathbf {z}\big \Vert ^2 \nonumber \\&\overset{(2.11) }{\ge }-\frac{ \sum _{k=1}^t \alpha _k \beta _1^{k-1} }{2\rho _1}\big \Vert \mathbf {z}^1- \mathbf {z}\big \Vert ^2 + \frac{\alpha _t }{2\rho _t}\big \Vert \mathbf {z}^{t+1}-\mathbf {z}\big \Vert ^2 \nonumber \\&\overset{(2.10)}{\ge }-\frac{ \alpha _1\big \Vert \mathbf {z}^1-\mathbf {z}\big \Vert ^2}{2\rho _1(1-\beta _{1})} + \frac{\alpha _t }{2\rho _t}\big \Vert \mathbf {z}^{t+1}-\mathbf {z}\big \Vert ^2. \end{aligned}$$
(F.3)

The weighted sum involving \(\big \Vert \mathbf {w}^j\big \Vert ^2\) can also be lower bounded:

$$\begin{aligned}&\!-\! \sum _{j=1}^t \frac{{\rho _j}}{2} \big \Vert \mathbf {w}^j\big \Vert ^2 \sum _{k=j}^t \alpha _k\beta _1^{k-j} \!\overset{(2.10)}{\ge }\! -\! \sum _{j=1}^t \frac{{\rho _j}\alpha _j\big \Vert \mathbf {w}^j\big \Vert ^2}{2(1\!-\!\beta _1)} \nonumber \\&\qquad \overset{ (2.12)}{\ge }\! - \! \frac{ \rho _1}{2\alpha _1(1\!-\!\beta _1)^2} \sum _{j=1}^t \alpha _j^2 \Vert \mathbf {w}^j\Vert ^2. \end{aligned}$$
(F.4)

Plugging the above two inequalities (F.3) and (F.4) into the inequality (F.2) gives

$$\begin{aligned}&\sum _{j=1}^t \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}\big \rangle \sum _{k=j}^t \alpha _k\beta _1^{k-j}\\&\quad \ge \sum _{j=1}^t \Big (f_0(\mathbf {x}^j)-f_0(\mathbf {x}) - \big \langle \mathbf {z}^j, \mathbf {f}(\mathbf {x})\big \rangle + \big \langle \mathbf {z}, \mathbf {f}(\mathbf {x}^j)\big \rangle \Big ) \sum _{k=j}^t \alpha _k\beta _1^{k-j} \\&\qquad -\frac{ \alpha _1\big \Vert \mathbf {z}^1- \mathbf {z}\big \Vert ^2 }{2\rho _1 (1-\beta _{1})} + \frac{\alpha _t }{2\rho _t}\big \Vert \mathbf {z}^{t+1}-\mathbf {z}\big \Vert ^2 - \frac{ \rho _1 \sum _{j=1}^t \alpha _j^2 \Vert \mathbf {w}^j\Vert ^2}{2\alpha _1(1\!-\!\beta _1)^2}\\&\qquad +\sum _{j=1}^t \Big (\!-\!\big \langle \mathbf {z}^{j}\!-\!\mathbf {z}, \mathbf {w}^j \!-\! \mathbf {f}(\mathbf {x}^j)\big \rangle +\big \langle \mathbf {x}^{j} \!-\! \mathbf {x}, \mathbf {u}^{j}\!-\!\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \Big ) \sum _{k=j}^t \alpha _k\beta _1^{k-j}. \end{aligned}$$

Taking the expectation of the above inequality and using Assumption 2 to bound \(\mathbb {E}\big [\big \Vert \mathbf {w}^j\big \Vert ^2\big ]\) gives the result (3.11). \(\square \)

Proof of Lemma 8

Proof

If \((\mathbf {x},\mathbf {z})\) is deterministic, we can prove (3.13) by conditioning on \(\mathcal {H}^j\) and using Assumption 2, i.e.,

$$\begin{aligned}&\mathbb {E}\left[ \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle \right] = \mathbb {E}\left[ \mathbb {E}\big [\big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle \mid \mathcal {H}^j\big ]\right] \\&\quad = \mathbb {E}\left[ \Big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbb {E}[\mathbf {w}^j - \mathbf {f}(\mathbf {x}^j) \mid \mathcal {H}^j]\Big \rangle \right] = \mathbb {E}\left[ \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {0}\big \rangle \right] =0,\\&\mathbb {E}\left[ \big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}-\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \right] = \mathbb {E}\left[ \mathbb {E}\Big [ \langle \mathbf {x}^{j} - \mathbf {x}, \mathbf {u}^{j}-\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \rangle \mid \mathcal {H}^j\Big ] \right] \\&\quad = \mathbb {E}\left[ \Big \langle \mathbf {x}^{j} - \mathbf {x}, \mathbb {E}\big [ \mathbf {u}^{j}-\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \mid \mathcal {H}^j\big ] \Big \rangle \right] = \mathbb {E}\left[ \langle \mathbf {x}^{j} - \mathbf {x},\mathbf {0}\rangle \right] =0. \end{aligned}$$

We then prove the stochastic case by bounding the two terms on the left-hand side of (3.12) separately, in a similar way. Let \(\tilde{\mathbf {z}}^1 = \mathbf {z}^1\) and \(\tilde{\mathbf {z}}^{j+1} = \tilde{\mathbf {z}}^j - \gamma _j( \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j))\). Then \( \mathbf {z}^j-\tilde{\mathbf {z}}^j\) is known given \(\mathcal {H}^j\), and, as in the deterministic case above, \(\mathbb {E}\big [ \big \langle \mathbf {z}^j-\tilde{\mathbf {z}}^j, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j) \big \rangle \big ] = 0\). Thus we have

$$\begin{aligned}&\sum _{j=1}^t \gamma _j \mathbb {E}\big [ \big \langle \mathbf {z}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle \big ] \nonumber \\&\quad = \sum _{j=1}^t \gamma _j\mathbb {E}\big [ \big \langle \tilde{\mathbf {z}}^{j}-\mathbf {z}, \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \rangle \big ] =\sum _{j=1}^t \mathbb {E}\big [ \big \langle \tilde{\mathbf {z}}^{j}-\mathbf {z}, \tilde{\mathbf {z}}^j-\tilde{\mathbf {z}}^{j+1} \big \rangle \big ] \nonumber \\&\quad = \sum _{j=1}^t\frac{1}{2}\mathbb {E}\left[ \big \Vert \tilde{\mathbf {z}}^{j}-\mathbf {z}\big \Vert ^2 + \big \Vert \tilde{\mathbf {z}}^j-\tilde{\mathbf {z}}^{j+1}\big \Vert ^2 - \big \Vert \mathbf {z}- \tilde{\mathbf {z}}^{j+1} \big \Vert ^2\right] \nonumber \\&\quad = \frac{1}{2}\bigg (\mathbb {E}\big [ \big \Vert \tilde{\mathbf {z}}^{1}-\mathbf {z}\big \Vert ^2\big ] + \sum _{j=1}^t\mathbb {E}\big [ \big \Vert \tilde{\mathbf {z}}^j-\tilde{\mathbf {z}}^{j+1}\big \Vert ^2\big ] - \mathbb {E}\big [ \big \Vert \mathbf {z}- \tilde{\mathbf {z}}^{t+1} \big \Vert ^2\big ] \bigg ) \nonumber \\&\quad \le \frac{1}{2}\bigg (\mathbb {E}\big [ \big \Vert \mathbf {z}^1-\mathbf {z}\big \Vert ^2\big ] + \sum _{j=1}^t \gamma _j^2\mathbb {E}\big [ \big \Vert \mathbf {w}^j - \mathbf {f}(\mathbf {x}^j)\big \Vert ^2\big ] \bigg ) \nonumber \\&\quad \le \frac{1}{2}\bigg (\mathbb {E}\big [ \big \Vert \mathbf {z}^1-\mathbf {z}\big \Vert ^2\big ] + \sum _{j=1}^t \gamma _j^2 \mathbb {E}\big [ \big \Vert \mathbf {w}^j\big \Vert ^2\big ] \bigg ) \le \frac{1}{2}\bigg (\mathbb {E}\big [ \big \Vert \mathbf {z}^1-\mathbf {z}\big \Vert ^2\big ] + F^2 \sum _{j=1}^t \gamma _j^2 \bigg ), \end{aligned}$$
(G.1)

where the first inequality holds because we drop the nonpositive term and \(\tilde{\mathbf {z}}^1 = \mathbf {z}^1\); the second inequality holds because for any random vector \(\mathbf {w}\), \(\mathbb {E}\big [\big \Vert \mathbf {w}-\mathbb {E}[\mathbf {w}]\big \Vert ^2\big ]\le \mathbb {E}\big [\big \Vert \mathbf {w}\big \Vert ^2\big ]\), and here \(\mathbb {E}\big [\mathbf {w}^j\mid \mathcal {H}^j\big ] = \mathbf {f}(\mathbf {x}^j)\) for \(j\in [t]\); and the last inequality holds by Assumption 2.

For the summation of \(\gamma _j\mathbb {E}\big [ \big \langle \mathbf {x}^j-\mathbf {x}, \mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\big \rangle \big ]\), let \(\tilde{\mathbf {x}}^1 = \mathbf {x}^1\) and \(\tilde{\mathbf {x}}^{j+1} = \tilde{\mathbf {x}}^j + \gamma _j\big (\mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\big )\). Then \(\mathbb {E}\big [\big \langle \mathbf {x}^j-\tilde{\mathbf {x}}^j, \mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j) \big \rangle \big ] = 0\) and

$$\begin{aligned}&- \sum _{j=1}^t \gamma _j\mathbb {E}\big [\big \langle \mathbf {x}^j-\mathbf {x}, \mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\big \rangle \big ] \nonumber \\&\quad = - \sum _{j=1}^t \gamma _j\mathbb {E}\big [\big \langle \tilde{\mathbf {x}}^j-\mathbf {x}, \mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\big \rangle \big ] = \sum _{j=1}^t \mathbb {E}\big [\big \langle \tilde{\mathbf {x}}^{j}-\mathbf {x}, \tilde{\mathbf {x}}^j-\tilde{\mathbf {x}}^{j+1} \big \rangle \big ] \nonumber \\&\quad =\sum _{j=1}^t \frac{1}{2} \mathbb {E}\left[ \big \Vert \tilde{\mathbf {x}}^{j}-\mathbf {x}\big \Vert ^2 + \big \Vert \tilde{\mathbf {x}}^j-\tilde{\mathbf {x}}^{j+1}\big \Vert ^2 - \big \Vert \mathbf {x}- \tilde{\mathbf {x}}^{j+1} \big \Vert ^2\right] \nonumber \\&\quad = \frac{1}{2}\bigg (\mathbb {E}\big [\big \Vert \tilde{\mathbf {x}}^{1}-\mathbf {x}\big \Vert ^2\big ] + \sum _{j=1}^t\mathbb {E}\big [\big \Vert \tilde{\mathbf {x}}^j-\tilde{\mathbf {x}}^{j+1}\big \Vert ^2\big ] - \mathbb {E}\big [\big \Vert \mathbf {x}- \tilde{\mathbf {x}}^{t+1} \big \Vert ^2\big ]\bigg )\nonumber \\&\quad \le \frac{1}{2}\bigg (n B^2 + \sum _{j=1}^t \gamma _j^2 \mathbb {E}\big [\big \Vert \mathbf {u}^j -\tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\big \Vert ^2\big ]\bigg ) \le \frac{1}{2}\bigg (n B^2 + \sum _{j=1}^t \gamma _j^2 \mathbb {E}\big [\big \Vert \mathbf {u}^j\big \Vert ^2\big ]\bigg )\nonumber \\&\quad \le \frac{1}{2}\bigg (n B^2 +( \max _{j\in [t]} \mathbb {E}\big [\big \Vert \mathbf {u}^j\big \Vert ^2\big ]) \sum _{j=1}^t \gamma _j^2 \bigg ) \le \frac{1}{2}\bigg (n B^2 + \widehat{P}_\mathbf {z}^t \sum _{j=1}^t \gamma _j^2\bigg ), \end{aligned}$$
(G.2)

where the first inequality holds because we drop the nonpositive term and use (3.1); the second inequality holds because \(\mathbb {E}\big [\mathbf {u}^j\mid \mathcal {H}^j\big ]= \tilde{\nabla }_\mathbf {x}\mathcal {L}(\mathbf {x}^j,\mathbf {z}^j)\) for \(j\in [t]\); and the last two inequalities hold by Assumption 2 and (3.2).

Adding (G.1) and (G.2) gives the result (3.12). \(\square \)

Proof of Lemma 9

Proof

Letting \(\mathbf {x}= \mathbf {x}^*\) in (3.17) and using \(\mathbf {f}(\mathbf {x}^*)\le \mathbf {0}\), we have

$$\begin{aligned} \mathbb {E}\big [f_0(\bar{\mathbf {x}}) - f_0(\mathbf {x}^*) + \big \langle \mathbf {z}, \mathbf {f}(\bar{\mathbf {x}})\big \rangle \big ] \le \epsilon _1+\epsilon _0\mathbb {E}\big [\big \Vert \mathbf {z}\big \Vert ^2\big ]. \end{aligned}$$
(H.1)

Since \(f_j({\bar{\mathbf {x}}})\le [f_j({\bar{\mathbf {x}}})]_+\) and \(\mathbf {z}^*\ge 0\), we have from (3.7) that

$$\begin{aligned} -\sum _{j=1}^M z_j^* [f_j(\bar{\mathbf {x}})]_+ \le f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*). \end{aligned}$$
(H.2)

Substituting (H.2) into (H.1) with \(\mathbf {z}\) given by \(z_j = 1 + z_j^*\) if \(f_j({\bar{\mathbf {x}}})>0\) and \(z_j = 0\) otherwise for any \(j\in [M]\) gives

$$\begin{aligned} -\mathbb {E}\Big [\sum _{j=1}^M z_j^* [f_j(\bar{\mathbf {x}})]_+\Big ] +\mathbb {E}\Big [\sum _{j=1}^M (1+z_j^*) [f_j(\bar{\mathbf {x}})]_+\Big ] \le \epsilon _1+\epsilon _0 \big \Vert 1+\mathbf {z}^*\big \Vert ^2. \end{aligned}$$

Simplifying the above inequality gives (3.19).

Letting \(z_j = 3 z_j^*\) if \(f_j({\bar{\mathbf {x}}})>0\) and \(z_j = 0\) otherwise for each \(j\in [M]\) in (H.1), and adding the expectation of (H.2), gives

$$\begin{aligned} \mathbb {E}\Big [\sum _{j=1}^M z_j^* [f_j(\bar{\mathbf {x}})]_+\Big ]\le \frac{ \epsilon _1}{2}+\frac{9\epsilon _0}{2}\big \Vert \mathbf {z}^*\big \Vert ^2. \end{aligned}$$
(H.3)

Hence, by the above inequality and (H.2), we obtain

$$\begin{aligned} -\mathbb {E}\big [f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)\big ]\le \frac{ \epsilon _1}{2}+\frac{9\epsilon _0}{2} \big \Vert \mathbf {z}^*\big \Vert ^2. \end{aligned}$$
(H.4)

Thus, since (H.2) implies \([f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)]_- \le \sum _{j=1}^M z_j^* [f_j(\bar{\mathbf {x}})]_+\), we have from (H.3) that \(\mathbb {E}\big [[f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)]_-\big ]\le \frac{\epsilon _1}{2}+\frac{9\epsilon _0}{2}\big \Vert \mathbf {z}^*\big \Vert ^2\). In addition, from (H.1) with \(\mathbf {z}= 0\), it follows that \(\mathbb {E}\big [f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)\big ]\le \epsilon _1\). Since \(\mid a\mid = a + 2[a]_-\) for any real number a, we have

$$\begin{aligned} \mathbb {E}\big [\mid f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)\mid \big ]= & {} \mathbb {E}\big [f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)\big ]+2\mathbb {E}\big [[f_0({\bar{\mathbf {x}}})-f_0(\mathbf {x}^*)]_-\big ]\\\le & {} 2 \epsilon _1+9\epsilon _0\big \Vert \mathbf {z}^*\big \Vert ^2, \end{aligned}$$

which gives (3.18).

Furthermore, let \(\mathbf {z}= 0\) in (3.17) and take \({\widehat{\mathbf {x}}}\in \text{ argmin}_{\mathbf {x}\in X} f_0(\mathbf {x}) + \big \langle \bar{\mathbf {z}},\mathbf {f}(\mathbf {x})\big \rangle \). By (2.2), we have \(\mathbb {E}\big [f_0({\bar{\mathbf {x}}})\big ]\le \mathbb {E}\big [d(\bar{\mathbf {z}})\big ] + \epsilon _1\), which together with (3.6) gives

$$\begin{aligned} \mathbb {E}\big [d(\mathbf {z}^*)-d(\bar{\mathbf {z}})\big ] \le \mathbb {E}\big [f_0(\mathbf {x}^*)-f_0(\bar{\mathbf {x}})\big ]+ \epsilon _1. \end{aligned}$$
(H.5)

Combining the above inequality with (H.4) gives (3.20). \(\square \)

Proof of Corollary 4

Proof

For the step sizes \(\{\alpha _j\}_{j=1}^K\), we have

$$\begin{aligned}&\sum _{j=1}^{K} \alpha _j =\alpha \sum _{j=1}^{K} \frac{1}{\sqrt{j+1}}\ge \alpha \int _{s = 2}^{K+2} \frac{1}{\sqrt{s}} \text{ d }s = 2\alpha (\sqrt{K+2}-\sqrt{2}),\\&\sum _{j=1}^{K} \alpha _j^2 =\alpha ^2 \sum _{j=1}^{K} \frac{1}{j+1}\le \alpha ^2 \int _{s = 1}^{K+1} \frac{1}{s} \text{ d }s = \alpha ^2\log {(K+1)}. \end{aligned}$$

Similarly, for \(\{\rho _j\}_{j=1}^K\), it holds that \( \sum _{j=1}^{K} \rho _j^2\le \rho ^2\log {(K+1)}\). Plugging these bounds into the result of Theorem 3 and noting that \(\log (K+1)\ge 1\) for \(K\ge 2\) finishes the proof. \(\square \)
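
The two integral comparisons used above are easy to confirm numerically; the values of \(\alpha \) and K in the snippet are arbitrary illustrative choices:

```python
import numpy as np

a, K = 1.0, 10_000                 # alpha and K are illustrative choices
j = np.arange(1, K + 1)
alpha = a / np.sqrt(j + 1)         # alpha_j = alpha / sqrt(j + 1)

# sum_j alpha_j >= 2 * alpha * (sqrt(K + 2) - sqrt(2))
assert alpha.sum() >= 2 * a * (np.sqrt(K + 2) - np.sqrt(2))
# sum_j alpha_j^2 <= alpha^2 * log(K + 1)
assert (alpha**2).sum() <= a**2 * np.log(K + 1)
print("Integral bounds from the proof of Corollary 4 hold for K =", K)
```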


About this article


Cite this article

Yan, Y., Xu, Y. Adaptive primal-dual stochastic gradient method for expectation-constrained convex stochastic programs. Math. Prog. Comp. 14, 319–363 (2022). https://doi.org/10.1007/s12532-021-00214-w

