
An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

We consider solving nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We assume that the objective value, gradient, and Hessian are inaccessible, while stochastic estimates of them can be computed, for example, by subsampling. We propose a stochastic algorithm based on sequential quadratic programming (SQP) that uses a differentiable exact augmented Lagrangian as the merit function. To motivate our algorithm design, we first revisit and simplify an old SQP method of Lucidi (J. Optim. Theory Appl. 67(2): 227–245, 1990) developed for solving deterministic problems, which serves as the skeleton of our stochastic algorithm. Based on the simplified deterministic algorithm, we then propose a non-adaptive SQP for dealing with a stochastic objective, where the gradient and Hessian are replaced by stochastic estimates but the stepsizes are deterministic and prespecified. Finally, we incorporate a recent stochastic line search procedure of Paquette and Scheinberg (SIAM J. Optim. 30(1): 349–376, 2020) into the non-adaptive stochastic SQP to adaptively select the random stepsizes, which leads to an adaptive stochastic SQP. The global “almost sure” convergence of both the non-adaptive and adaptive SQP methods is established. Numerical experiments on nonlinear problems in the CUTEst test set demonstrate the superiority of the adaptive algorithm.


Notes

  1. The implicit cost of deriving the sequence of Lipschitz constants in [3] is not counted here; obtaining it, however, requires a non-negligible number of evaluations of the objective and constraints.

References

  1. Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization 24(3), 1238–1264 (2014). https://doi.org/10.1137/130915984

  2. Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 661–680 (2020). https://doi.org/10.1080/10556788.2020.1725751

  3. Berahas, A.S., Curtis, F.E., Robinson, D., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31(2), 1352–1379 (2021). https://doi.org/10.1137/20m1354556

  4. Bertsekas, D.: Constrained Optimization and Lagrange Multiplier Methods. Elsevier, Belmont, Mass, (1982). https://doi.org/10.1016/c2013-0-10366-2

  5. Bertsekas, D.P.: Projected Newton methods for optimization problems with simple constraints. SIAM J. Control. Optim. 20(2), 221–246 (1982). https://doi.org/10.1137/0320018

  6. Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization 1(2), 92–119 (2019). https://doi.org/10.1287/ijoo.2019.0016

  7. Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018). https://doi.org/10.1137/17m1154679

  8. Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545–578 (2018). https://doi.org/10.1093/imanum/dry009

  9. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018). https://doi.org/10.1137/16m1080173

  10. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5

  11. Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2017). https://doi.org/10.1007/s10107-017-1137-4

  12. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Math. Program. 169(2), 447–487 (2017). https://doi.org/10.1007/s10107-017-1141-8

  13. Curtis, F.E., Shi, R.: A fully stochastic second-order trust region method. Optimization Methods and Software 1–34, (2020). https://doi.org/10.1080/10556788.2020.1852403

  14. Curtis, F.E., Scheinberg, K., Shi, R.: A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization 1(3), 200–220 (2019). https://doi.org/10.1287/ijoo.2018.0010

  15. De, S., Yadav, A., Jacobs, D., Goldstein, T.: Automated Inference with Adaptive Batches. PMLR, Fort Lauderdale, FL, USA, Proceedings of Machine Learning Research, 54, 1504–1513 (2017). URL http://proceedings.mlr.press/v54/de17a.html

  16. Dupacova, J., Wets, R.: Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. Ann. Stat. 16(4), 1517–1549 (1988). https://doi.org/10.1214/aos/1176351052

  17. Durrett, R.: Probability: Theory and Examples, vol. 49. Cambridge University Press (2019). https://doi.org/10.1017/9781108591034

  18. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012). https://doi.org/10.1137/110830629

  19. Gallager, R.G.: Stochastic Processes. Cambridge University Press (2013). https://doi.org/10.1017/cbo9781139626514

  20. Gill, P.E., Wong, E.: Sequential quadratic programming methods. In: Mixed Integer Nonlinear Programming, IMA Vol. Math. Appl., vol 154, Springer New York, 147–224 (2011). https://doi.org/10.1007/978-1-4614-1927-3_6

  21. Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005). https://doi.org/10.1137/s0036144504446096

  22. Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2014). https://doi.org/10.1007/s10589-014-9687-3

  23. Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2017). https://doi.org/10.1093/imanum/drx043

  24. Krejić, N., Krklec, N.: Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl. Math. 245, 213–231 (2013). https://doi.org/10.1016/j.cam.2012.12.020

  25. Lucidi, S.: Recursive quadratic programming algorithm that uses an exact augmented Lagrangian function. J. Optim. Theory Appl. 67(2), 227–245 (1990). https://doi.org/10.1007/bf00940474

  26. Maratos, N.: Exact penalty function algorithms for finite dimensional and control optimization problems. Doctoral dissertation (1978). URL http://hdl.handle.net/10044/1/7283

  27. Nagaraj, N.K., Fuller, W.A.: Estimation of the parameters of linear time series models subject to nonlinear restrictions. Ann. Stat. 19(3), 1143–1154 (1991). https://doi.org/10.1214/aos/1176348242

  28. Nandwani, Y., Pathak, A., Mausam, Singla, P.: A primal dual formulation for deep learning with constraints. In: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 12157–12168 (2019). URL http://papers.nips.cc/paper/9385-a-primal-dual-formulation-for-deep-learning-with-constraints.pdf

  29. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009). https://doi.org/10.1137/070704277

  30. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, Springer New York (2006). https://doi.org/10.1007/978-0-387-40065-5

  31. Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020). https://doi.org/10.1137/18m1216250

  32. Pillo, G.: Exact penalty methods. In: Algorithms for Continuous Optimization, NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., vol 434, Springer Netherlands, 209–253 (1994). https://doi.org/10.1007/978-94-009-0369-2_8

  33. Pillo, G.D., Grippo, L.: A new class of augmented Lagrangians in nonlinear programming. SIAM J. Control. Optim. 17(5), 618–628 (1979). https://doi.org/10.1137/0317044

  34. Pillo, G.D., Grippo, L., Lampariello, F.: A method for solving equality constrained optimization problems by unconstrained minimization. In: Optimization Techniques, Springer-Verlag, Lecture Notes in Control and Information Sci. 23, 96–105 (1980). https://doi.org/10.1007/bfb0006592

  35. Prékopa, A.: Contributions to the theory of stochastic programming. Math. Program. 4(1), 202–221 (1973). https://doi.org/10.1007/bf01584661

  36. Ravi, S.N., Dinh, T., Lokhande, V.S., Singh, V.: Explicitly imposing constraints in deep networks via conditional gradients gives improved generalization and faster convergence. In: Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), vol 33, 4772–4779 (2019), https://doi.org/10.1609/aaai.v33i01.33014772

  37. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1–2), 293–326 (2018). https://doi.org/10.1007/s10107-018-1346-5

  38. di Serafino, D., Krejić, N., Jerinkić, N.K., Viola, M.: LSOS: Line-search second-order stochastic optimization methods. arXiv preprint arXiv:2007.15966 (2020)

  39. Shapiro, A.: On the asymptotics of constrained local $M$-estimators. Ann. Stat. 28(3), 948–960 (2000). https://doi.org/10.1214/aos/1015952006

  40. Siqueira, A.S., Orban, D.: CUTEst.jl (2020). https://github.com/JuliaSmoothOptimizers/CUTEst.jl, https://doi.org/10.5281/ZENODO.1188851

  41. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in neural information processing systems, 2899–2908 (2018). https://doi.org/10.5555/3327144.3327213

  42. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2011). https://doi.org/10.1007/s10208-011-9099-z

  43. Tropp, J.A.: An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8(1-2), 1–230 (2015). https://doi.org/10.1561/2200000048

  44. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017). https://doi.org/10.1137/15m1053141

  45. Wets, R.: Stochastic programming: Solution techniques and approximation schemes. In: Mathematical Programming The State of the Art, Springer Berlin Heidelberg, 566–603 (1983). https://doi.org/10.1007/978-3-642-68874-4_22


Acknowledgements

We would like to thank the Associate Editor and two anonymous reviewers for their helpful and instructive comments. Sen Na would like to thank Nick Gould for helping with the implementation of CUTEst in Julia. This material was completed in part with resources provided by the University of Chicago Research Computing Center. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347 and by NSF through award CNS-1545046.

Author information


Corresponding author

Correspondence to Sen Na.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Government License: The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.

Appendices

Proofs of section 3

1.1 Proof of lemma 1

The expectation is taken over the randomness of \(\xi _g^k\) and \(\xi _H^k\); that is, the k-th iterate \(({{\varvec{x}}}_k, {\varvec{\lambda }}_k)\) is held fixed. By the unbiasedness condition in Assumption 2, we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}[{{\bar{\varDelta }}}{{\varvec{x}}}_k]&= {\mathbb {E}}[{{\bar{\varDelta }}}{{\varvec{x}}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] {\mathop {=}\limits ^{(9)}} -{\mathbb {E}}\left[ \begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}\left( {\begin{matrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k\right] \\&= -\begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}{\mathbb {E}}\left[ \left( {\begin{matrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k \right] \\&{\mathop {=}\limits ^{(11)}} -\begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}\left( {\begin{matrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\\ c_k \end{matrix}}\right) {\mathop {=}\limits ^{{(5)}}} \varDelta {{\varvec{x}}}_k, \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{{\bar{\varDelta }}}{\varvec{\lambda }}_k]&= {\mathbb {E}}[{{\bar{\varDelta }}}{\varvec{\lambda }}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] {\mathop {=}\limits ^{{(9)}}} -{\mathbb {E}}\left[ (G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k +(G_kG_k^T)^{-1}{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k\right] \\&= -(G_kG_k^T)^{-1}G_k{\mathbb {E}}[{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k] - (G_kG_k^T)^{-1}{\mathbb {E}}[{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]\\&= -(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - (G_kG_k^T)^{-1}{\mathbb {E}}[{{\bar{M}}}_k^T \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]{\mathbb {E}}[{{\bar{\varDelta }}}{{\varvec{x}}}_k\mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k]\\&{\mathop {=}\limits ^{{(11)}}} -(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - (G_kG_k^T)^{-1}M_k^T\varDelta {{\varvec{x}}}_k\\&{\mathop {=}\limits ^{{(5)}}} \varDelta {\varvec{\lambda }}_k, \end{aligned}$$

where the fourth equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). Moreover, we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k\Vert ^2\right]&= \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k\Vert ^2\right] \nonumber \\&{\mathop {=}\limits ^{{(9)}}}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2+ {\mathbb {E}}\bigg [\bigg \Vert \begin{pmatrix} I&{{\varvec{0}}}\end{pmatrix}\underbrace{\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) ^{-1}}_{=:{\mathcal {K}}}\left( {\begin{matrix} {{\bar{g}}}_k - \nabla f_k\\ {{\varvec{0}}}\end{matrix}}\right) \bigg \Vert ^2{\bigg \vert }{{\varvec{x}}}_k, {\varvec{\lambda }}_k \bigg ] \nonumber \\&{\mathop {\le }\limits ^{{(11)}}} \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert {\mathcal {K}}\Vert ^2\psi . \end{aligned}$$
(A.1)

We now bound \(\Vert {\mathcal {K}}\Vert \). Suppose \(G_k^T = Y_kE_k\) where \(Y_k\) has orthonormal columns spanning \(\text {Image}(G_k^T)\) and \(E_k\) is a square matrix, and suppose \(Z_k\) has orthonormal columns spanning \(\text {Ker}(G_k)\). By Assumption 1, we know \(E_k\) is nonsingular and \(Z_k^TB_kZ_k \succeq \gamma _{RH} I\). By directly verifying \(\left( {\begin{matrix} B_k &{} G_k^T\\ G_k &{} {{\varvec{0}}}\end{matrix}}\right) {\mathcal {K}}= I\), one can show

$$\begin{aligned} {\mathcal {K}}= \left( {\begin{matrix} Z_k(Z_k^TB_kZ_k)^{-1}Z_k^T &{} (I - Z_k(Z_k^TB_kZ_k)^{-1}Z_k^TB_k)Y_kE_k^{-1}\\ E_k^{-1}Y_k^T(I - B_kZ_k(Z_k^TB_kZ_k)^{-1}Z_k^T) &{} E_k^{-1}Y_k^T(B_kZ_k(Z_k^TB_kZ_k)^{-1}Z_k^TB_k - B_k)Y_kE_k^{-1} \end{matrix}}\right) . \end{aligned}$$

Noting that \(\Vert E_k^{-1}\Vert ^2 = \Vert E_k^{-1}(E_k^{-1})^T\Vert = \Vert (G_kG_k^T)^{-1}\Vert {\mathop {\le }\limits ^{(12)}} 1/\kappa _{1,G}\), and \(\Vert (Z_k^TB_kZ_k)^{-1}\Vert \le 1/\gamma _{RH}\), we can bound \(\Vert {\mathcal {K}}\Vert \) by summing the spectral norms of the blocks, and get

$$\begin{aligned} \Vert {\mathcal {K}}\Vert \le \frac{1}{\gamma _{RH}} + \frac{2}{\sqrt{\kappa _{1,G}}}\left( 1 + \frac{\kappa _B}{\gamma _{RH}}\right) + \frac{1}{\kappa _{1,G}}\left( \frac{\kappa _B^2}{\gamma _{RH}} + \kappa _B\right) . \end{aligned}$$

Since \(\kappa _B \ge 1\ge \gamma _{RH}\vee \kappa _{1,G}\), we can simplify the bound to

$$\begin{aligned} \Vert {\mathcal {K}}\Vert \le \frac{5\kappa _B}{\sqrt{\kappa _{1,G}}\gamma _{RH}} + \frac{2\kappa _B^2}{\kappa _{1,G}\gamma _{RH}} \le \frac{7\kappa _B^2}{\kappa _{1,G}\gamma _{RH}}. \end{aligned}$$

Plugging the above inequality into (A.1), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{\varDelta }}}{{\varvec{x}}}_k\Vert ^2\right] \le \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2}\psi . \end{aligned}$$
(A.2)
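As a sanity check, the explicit expression for \({\mathcal {K}}\) can also be verified numerically. The following minimal NumPy sketch (an illustration only, not part of the analysis) builds a random instance with \(G_k\) of full row rank and \(Z_k^TB_kZ_k\succ 0\), taking \(E_k = (G_kG_k^T)^{1/2}\) and \(Y_k = G_k^TE_k^{-1}\) as one valid choice of the decomposition used above, and checks that the block formula inverts the KKT matrix.

```python
# Minimal NumPy sketch: verify the explicit block formula for
# K = [[B, G^T], [G, 0]]^{-1} on a random instance (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
G = rng.standard_normal((m, n))                   # full row rank w.h.p.
B = rng.standard_normal((n, n))
B = 0.5 * (B + B.T) + n * np.eye(n)               # makes Z^T B Z positive definite

w, V = np.linalg.eigh(G @ G.T)
E = V @ np.diag(np.sqrt(w)) @ V.T                 # symmetric square root of G G^T
Y = G.T @ np.linalg.inv(E)                        # orthonormal columns spanning Image(G^T)
Z = np.linalg.qr(G.T, mode="complete")[0][:, m:]  # orthonormal columns spanning Ker(G)

S = np.linalg.inv(Z.T @ B @ Z)                    # inverse reduced Hessian
Einv = np.linalg.inv(E)
K = np.block([
    [Z @ S @ Z.T, (np.eye(n) - Z @ S @ Z.T @ B) @ Y @ Einv],
    [Einv @ Y.T @ (np.eye(n) - B @ Z @ S @ Z.T),
     Einv @ Y.T @ (B @ Z @ S @ Z.T @ B - B) @ Y @ Einv],
])
KKT = np.block([[B, G.T], [G, np.zeros((m, m))]])
print(np.allclose(KKT @ K, np.eye(n + m)))        # expected: True
```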

Similarly, we can bound \({\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[\Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2]\) as follows.

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2\right] \nonumber \\&\quad = \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k - \varDelta {\varvec{\lambda }}_k\Vert ^2\right] \nonumber \\&\quad {\mathop {=}\limits ^{{(9)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \left\| (G_kG_k^T)^{-1}\left( G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k -G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - M_k\varDelta {{\varvec{x}}}_k\right) \right\| ^2 \right] \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ {\mathbb {E}}_{\xi ^k_g}\left[ \left\| G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k - G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\right\| ^2\right] + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}}\Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ \kappa _{2,G} {\mathbb {E}}_{\xi ^k_g}\left[ \Vert {{\bar{g}}}_k - \nabla f_k\Vert ^2\right] + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} \nonumber \\&\quad {\mathop {\le }\limits ^{{(11)}}} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{2}{\kappa _{1,G}^2}\left\{ \kappa _{2,G} \psi + {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \right\} . \end{aligned}$$
(A.3)

For the last term, we have

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad = {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert ({{\bar{M}}}_k - M_k)^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k) + M_k^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k) + ({{\bar{M}}}_k - M_k)^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad = {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert ({{\bar{M}}}_k - M_k)^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2 + \Vert M_k^T({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2 + \Vert ({{\bar{M}}}_k - M_k)^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} {\mathbb {E}}_{\xi ^k_H}\left[ \Vert {{\bar{M}}}_k - M_k\Vert ^2\right] {\mathbb {E}}_{\xi ^k_g}\left[ \Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2\right] + \kappa _M^2{\mathbb {E}}_{\xi ^k_g}\left[ \Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k - \varDelta {{\varvec{x}}}_k)\Vert ^2\right] \nonumber \\&\quad \quad \qquad + {\mathbb {E}}_{\xi ^k_H}\left[ \Vert {{\bar{M}}}_k - M_k\Vert ^2\right] \Vert \varDelta {{\varvec{x}}}_k\Vert ^2, \end{aligned}$$
(A.4)

where the second equality is due to the independence between \(\xi _g^k\) and \(\xi _H^k\). We note that

$$\begin{aligned} \Vert {{\bar{M}}}_k - M_k\Vert ^2&{\mathop {\le }\limits ^{(8)}} 2\Vert G_k\Vert ^2\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\Vert {{\bar{T}}}_k - T_k\Vert ^2 \nonumber \\&{\mathop {\le }\limits ^{(8)}} 2\Vert G_k\Vert ^2\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\Vert \nabla f({{\varvec{x}}}_k; \xi _H^k) - \nabla f_k\Vert ^2\sum _{i=1}^{m}\Vert \nabla ^2c_i({{\varvec{x}}}_k)\Vert ^2 \nonumber \\&{\mathop {\le }\limits ^{(12)}} 2\kappa _{2,G}\Vert {{\bar{H}}}_k - \nabla ^2f_k\Vert ^2 + 2\kappa _{\nabla _{{{\varvec{x}}}}^2c}^2\Vert \nabla f({{\varvec{x}}}^k; \xi _H^k) - \nabla f_k\Vert ^2. \end{aligned}$$

Taking expectation over \(\xi _H^k\) on both sides and using (11), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_H}[\Vert {{\bar{M}}}_k - M_k\Vert ^2] \le 2(\kappa _{2,G}+\kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi . \end{aligned}$$
(A.5)

Combining (A.5) with (A.4) and (A.1), we obtain

$$\begin{aligned}&{\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k-M_k^T\varDelta {{\varvec{x}}}_k\Vert ^2 \right] \\&\quad \le \left\{ 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +\kappa _M^2 \right\} \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2}\psi + 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi \Vert \varDelta {{\varvec{x}}}_k\Vert ^2. \end{aligned}$$

Plugging the above inequality into (A.3),

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \Vert {{\bar{\varDelta }}}{\varvec{\lambda }}_k\Vert ^2\right]\le & {} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 + \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi }{\kappa _{1,G}^2}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2 \nonumber \\&+ \left\{ \frac{2\kappa _{2,G}}{\kappa _{1,G}^2} +\frac{98\kappa _B^4}{\kappa _{1,G}^4\gamma _{RH}^2}\left( 2(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +\kappa _M^2\right) \right\} \psi .\nonumber \\ \end{aligned}$$
(A.6)

Combining (A.6) with (A.2), we can define

$$\begin{aligned} \varUpsilon _0 :=1 + \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi }{\kappa _{1,G}^2} \vee \frac{2\kappa _{2,G}}{\kappa _{1,G}^2} + \frac{49\kappa _B^4}{\kappa _{1,G}^2\gamma _{RH}^2} \cdot \left\{ 1+ \frac{4(\kappa _{2,G} + \kappa _{\nabla _{{{\varvec{x}}}}^2c}^2)\psi +2\kappa _M^2}{\kappa _{1,G}^2}\right\} \end{aligned}$$
(A.7)

and have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}\left[ \left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}\right\| ^2\right] \le \varUpsilon _0\left( \left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}\right\| ^2+\psi \right) . \end{aligned}$$

This completes the proof.

1.2 Proof of theorem 1

We require the following lemmas.

Lemma 10

Let \((\varDelta {{\varvec{x}}}_k, \varDelta {\varvec{\lambda }}_k)\) be given by (5). Then, for any \(\mu , \nu \),

$$\begin{aligned} \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{\mu , \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{\mu , \nu }^k \end{pmatrix}^T \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}=(\varDelta {{\varvec{x}}}_k)^T\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - \mu \Vert c_k\Vert ^2 + c_k^T\varDelta {\varvec{\lambda }}_k - \nu \Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2. \end{aligned}$$

Proof

We have

$$\begin{aligned} \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{\mu , \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{\mu , \nu }^k \end{pmatrix}^T \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}&{\mathop {=}\limits ^{(4)}} (\varDelta {{\varvec{x}}}_k)^T\left( I + \nu M_kG_k\right) \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k + \mu (\varDelta {{\varvec{x}}}_k)^TG_k^Tc_k + (\varDelta {\varvec{\lambda }}_k)^Tc_k \\&\quad + \nu (\varDelta {\varvec{\lambda }}_k)^TG_kG_k^TG_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\\&{\mathop {=}\limits ^{(5)}} (\varDelta {{\varvec{x}}}_k)^T\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k - \mu \Vert c_k\Vert ^2 + c_k^T\varDelta {\varvec{\lambda }}_k - \nu \Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2. \end{aligned}$$

This completes the proof. \(\square \)

Lemma 11

Under Assumption 1, we have

$$\begin{aligned} \left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ \varDelta {\varvec{\lambda }}_k \end{pmatrix}\right\| ^2 \le \frac{3\kappa _M^2}{\kappa _{1,G}^2}\left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

Proof

By (5), we know

$$\begin{aligned} \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 =&\Vert (G_kG_k^T)^{-1}(G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k+M_k^T\varDelta {{\varvec{x}}}_k)\Vert ^2{\mathop {\le }\limits ^{(12)}}\frac{2}{\kappa _{1,G}^2}(\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \kappa _M^2\Vert \varDelta {{\varvec{x}}}_k\Vert ^2)\\ \le&\,\, \frac{2\kappa _M^2}{\kappa _{1,G}^2}(\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \Vert \varDelta {{\varvec{x}}}_k\Vert ^2).\text {(since }\kappa _M\ge 1\ge \kappa _{1,G}) \end{aligned}$$

Adding \(\Vert \varDelta {{\varvec{x}}}_k\Vert ^2\) to both sides and using \(\kappa _M\ge 1\ge \kappa _{1,G}\) again, we obtain \(\Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert \varDelta {\varvec{\lambda }}_k\Vert ^2 \le \frac{3\kappa _M^2}{\kappa _{1,G}^2}\left( \Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2\right) \). This completes the proof. \(\square \)

Now we are ready to prove Theorem 1. We note that

$$\begin{aligned} \Vert \nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2&= \Vert (I - G_k^T(G_kG_k^T)^{-1}G_k)\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2 + \Vert G_k^T(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2\\&{\mathop {=}\limits ^{{(5)}}} \Vert (I - G_k^T(G_kG_k^T)^{-1}G_k)B_k\varDelta {{\varvec{x}}}_k\Vert ^2 + \Vert G_k^T(G_kG_k^T)^{-1}G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2\\&{\mathop {\le }\limits ^{{(12)}}} \kappa _B^2\Vert \varDelta {{\varvec{x}}}_k\Vert ^2 + \frac{1}{\kappa _{1,G}}\Vert G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k\Vert ^2,\\ \Vert c_k\Vert ^2&{\mathop {=}\limits ^{{(5)}}}\Vert G_k\varDelta {{\varvec{x}}}_k\Vert ^2 {\mathop {\le }\limits ^{(12)}}\kappa _{2,G}\Vert \varDelta {{\varvec{x}}}_k\Vert ^2. \end{aligned}$$

Since \(\kappa _{2,G}\wedge \kappa _B \ge 1\ge \kappa _{1,G}\),

$$\begin{aligned} \Vert \nabla {\mathcal {L}}_k\Vert ^2\le \frac{\kappa _B^2+\kappa _{2,G}}{\kappa _{1,G}}\left\| \begin{pmatrix} \varDelta {{\varvec{x}}}_k\\ G_k\nabla _{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

Plugging the above inequality into (15), we have

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{\mathcal {L}}_{\mu , \nu }^{k+1}]\le {\mathcal {L}}_{\mu , \nu }^k - \frac{\alpha _k{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2 + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\alpha _k^2. \end{aligned}$$
(A.8)

Case 1: constant stepsize. If \(\alpha _k = \alpha \), we take the full expectation of (A.8), sum over \(k = 0,\ldots ,K\), and obtain

$$\begin{aligned} \min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu } - {\mathcal {L}}_{\mu , \nu }^0 \le - \frac{{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\cdot \alpha \sum _{k=0}^{K}{\mathbb {E}}[\Vert \nabla {\mathcal {L}}_k\Vert ^2] + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi (K+1)}{2}\alpha ^2. \end{aligned}$$

Rearranging the above inequality leads to

$$\begin{aligned} \frac{1}{K+1}\sum _{k=0}^{K}{\mathbb {E}}[\Vert \nabla {\mathcal {L}}_k\Vert ^2]\le & {} \frac{2(\kappa _B^2+\kappa _{2,G})}{{\tilde{\delta }} \kappa _{1,G}}\cdot \frac{{\mathcal {L}}_{\mu , \nu }^0 - \min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }}{(K+1)\alpha }\\&+ \frac{\varUpsilon _0(\kappa _B^2+\kappa _{2,G})\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{{\tilde{\delta }}\kappa _{1,G}}\alpha . \end{aligned}$$
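For instance, the constant stepsize \(\alpha = 1/\sqrt{K+1}\) balances the two terms on the right-hand side, giving

$$\begin{aligned} \frac{1}{K+1}\sum _{k=0}^{K}{\mathbb {E}}[\Vert \nabla {\mathcal {L}}_k\Vert ^2]\le \frac{\kappa _B^2+\kappa _{2,G}}{{\tilde{\delta }}\kappa _{1,G}}\cdot \frac{2\left( {\mathcal {L}}_{\mu , \nu }^0 - \min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }\right) + \varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{\sqrt{K+1}}. \end{aligned}$$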

Case 2: decaying stepsize. Let us define two sequences

$$\begin{aligned} s_k = {\mathcal {L}}_{\mu , \nu }^k + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k}^{\infty }\alpha _i^2,\quad \quad e_k = \frac{\alpha _k{{\tilde{\delta }}}\kappa _{1,G}}{2(\kappa _B^2 +\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2. \end{aligned}$$

From (A.8), we know

$$\begin{aligned} {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[s_{k+1}]&= {\mathbb {E}}_{\xi ^k_g,\xi ^k_H}[{\mathcal {L}}_{\mu , \nu }^{k+1}] + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k+1}^{\infty }\alpha _i^2\\&\le {\mathcal {L}}_{\mu , \nu }^k - \frac{\alpha _k{\tilde{\delta }}\kappa _{1,G}}{2(\kappa _B^2+\kappa _{2,G})}\Vert \nabla {\mathcal {L}}_k\Vert ^2 + \frac{\varUpsilon _0\kappa _{{\mathcal {L}}_{\mu , \nu }}\psi }{2}\sum _{i=k}^{\infty }\alpha _i^2 = s_k - e_k. \end{aligned}$$

Since \(e_k \ge 0\), \(\{s_k -\min _{{\mathcal {X}}\times \varLambda }{\mathcal {L}}_{\mu , \nu }\}_k\) is a nonnegative supermartingale. By [17, Theorem 4.2.12], we have \(s_k\rightarrow s\) almost surely for some random variable s with \({\mathbb {E}}[s]\le s_0<\infty \). Therefore, \({\mathbb {E}}[\sum _{k=0}^{\infty }e_k] = \sum _{k=0}^{\infty }{\mathbb {E}}\left[ e_k\right] \le \sum _{k=0}^{\infty }\left( {\mathbb {E}}[s_k] - {\mathbb {E}}[s_{k+1}]\right) <\infty \), where the first equality is due to the Fubini–Tonelli theorem [17, Theorem 1.7.2]. The above display implies that \(\sum _{k=0}^{\infty }e_k<\infty \) almost surely. Moreover, since \(\sum _{k=0}^{\infty }\alpha _k = \infty \), we obtain \(\sum _{k=0}^{\infty }e_k/\sum _{k=0}^{\infty }\alpha _k = 0\) and hence \(\liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {L}}_k\Vert = 0\) almost surely. This completes the proof.
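The scheme analyzed above can be illustrated on a toy problem. The sketch below (a simplified illustration, not our implementation) runs the non-adaptive stochastic SQP iteration on an equality-constrained quadratic program with Gaussian gradient noise, using the exact Hessian for \(B_k\) and a prespecified decaying stepsize; for brevity the dual step is taken directly from the KKT solve rather than from the update (9).

```python
# Toy sketch of the non-adaptive stochastic SQP iteration (illustration only).
# Problem: min 0.5 x^T A x - b^T x  s.t.  G x = d, with Gaussian gradient noise.
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 8, 3, 0.1
A = rng.standard_normal((n, n))
A = A @ A.T + np.eye(n)                                    # positive definite Hessian
b, G, d = rng.standard_normal(n), rng.standard_normal((m, n)), rng.standard_normal(m)

x, lam = np.zeros(n), np.zeros(m)
KKT = np.block([[A, G.T], [G, np.zeros((m, m))]])
for k in range(2000):
    grad_bar = A @ x - b + sigma * rng.standard_normal(n)  # stochastic objective gradient
    nabla_L_bar = grad_bar + G.T @ lam                     # stochastic Lagrangian gradient
    c = G @ x - d
    step = np.linalg.solve(KKT, -np.concatenate([nabla_L_bar, c]))
    alpha = (k + 1) ** (-0.7)                              # prespecified decaying stepsize
    x += alpha * step[:n]
    lam += alpha * step[n:]

kkt_res = np.linalg.norm(np.concatenate([A @ x - b + G.T @ lam, G @ x - d]))
print(f"final KKT residual: {kkt_res:.2e}")
```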

Proofs of section 4

1.1 Proof of lemma 3

Let \(P_{\xi _f^k}(\cdot ) = P(\cdot \mid {{\varvec{x}}}_k, {\varvec{\lambda }}_k,{{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\) be the conditional probability over randomness in \(\xi _f^k\). We have that

$$\begin{aligned}&\left| {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \right| \nonumber \\&\quad {\mathop {\le }\limits ^{{(21)}}} \left| {{\bar{f}}}_k - f_k\right| + \frac{\nu }{2}\left| \Vert G_k({{\bar{\nabla }}}f_k + G_k^T{\varvec{\lambda }}_k) \Vert ^2 - \Vert G_k(\nabla f_k + G_k^T{\varvec{\lambda }}_k) \Vert ^2\right| \nonumber \\&\quad \le \left| {{\bar{f}}}_k - f_k\right| + \frac{\nu }{2}\Vert G_k\Vert ^2\left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| ^2 + \nu \Vert G_k(\nabla f_k + G_k^T{\varvec{\lambda }}_k)\Vert \Vert G_k\Vert \Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert \nonumber \\&\quad {\mathop {\le }\limits ^{{(12)}}} \left| {{\bar{f}}}_k - f_k\right| + \nu \kappa _{2,G}\left( \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| + \kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) \Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert . \end{aligned}$$
(B.1)

Let us denote

$$\begin{aligned} m_k = - \kappa _f{{\bar{\alpha }}}_k^2\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}, \end{aligned}$$

then the above display implies that \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k |\le m_k\) by requiring

$$\begin{aligned} \left| {{\bar{f}}}_k - f_k\right| \le \frac{m_k}{2} \quad \text { and}\quad \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| \le \frac{m_k}{4\nu \kappa _{2,G}\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}}\wedge 1. \end{aligned}$$

The above condition is further implied by requiring

$$\begin{aligned} \left| {{\bar{f}}}_k - f_k\right| \vee \left\| {{\bar{\nabla }}}f_k - \nabla f_k\right\| \le \frac{m_k\wedge 1}{4\nu \kappa _{2,G}\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}} \vee 2} . \end{aligned}$$

Thus, we apply the Bernstein inequality [43, Theorem 6.1.1] and obtain that if

$$\begin{aligned} |\xi _f^k| \ge \frac{16\max \{\varOmega _0, \varOmega _1\}^2 (4\nu ^2\kappa _{2,G}^2\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}^2\vee 1) }{m_k^2\wedge 1 }\log \left( \frac{8d}{p_f}\right) , \end{aligned}$$
(B.2)

then

$$\begin{aligned} P_{\xi _f^k}(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k|> m_k)\le p_f/2. \end{aligned}$$

Under (B.2), we also have \(P_{\xi _f^k}(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k}|> m_k )\le p_f/2\), which implies that condition (23) holds. As for condition (24), we apply (B.1) to obtain

$$\begin{aligned}&{\mathbb {E}}_{\xi _f^k}\left[ \left| {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^k \right| ^2\right] \\&\quad \le 2{\mathbb {E}}_{\xi _f^k}[|{{\bar{f}}}_k - f_k|^2] + 2\nu ^2\kappa _{2,G}^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2 {\mathbb {E}}_{\xi _f^k}[\Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert ^2]\\&\quad \le \frac{2\varOmega _0^2 + 2\nu ^2\kappa _{2,G}^2\varOmega _1^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2}{|\xi _f^k|}, \end{aligned}$$

where the last inequality applies the variance bound for the sample average, that is, \({\mathbb {E}}[|{{\bar{f}}}_k - f_k|^2] = {\mathbb {E}}[|f({{\varvec{x}}}_k; \xi ) - f_k|^2]/|\xi _f^k| \le \varOmega _0^2/|\xi _f^k|\) (and similarly for \({\mathbb {E}}[\Vert {{\bar{\nabla }}}f_k - \nabla f_k\Vert ^2]\)). Following the same calculation, we can show the same bound for \(|{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_k, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_k, \nu }^{s_k} |^2\). Thus, condition (24) holds if

$$\begin{aligned} |\xi _f^k| \ge \frac{2\varOmega _0^2 + 2\nu ^2\kappa _{2,G}^2\varOmega _1^2\left( \varOmega _1+\kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}}\right) ^2}{{{\bar{\epsilon }}}_k^2}. \end{aligned}$$
(B.3)

Combining (B.2) and (B.3) and taking the maximum of the two right-hand sides, conditions (23) and (24) are satisfied simultaneously when

$$\begin{aligned} |\xi _f^k| \ge \frac{C_f\log \left( \frac{8d}{p_f}\right) }{m_k^2\wedge {{\bar{\epsilon }}}_k^2\wedge 1}, \end{aligned}$$
(B.4)

with \(C_f = 16\max \{\varOmega _0,\varOmega _1\}^2\{1\vee 4\nu ^2\kappa _{2,G}^2(\varOmega _1\vee \kappa _{\nabla _{{{\varvec{x}}}}{\mathcal {L}}})^2\}\). Finally, we note that \(m_k\ne 0\). Otherwise, by (19) we know \({{\bar{\varDelta }}}{{\varvec{x}}}_k = {{\varvec{0}}}\) and \(G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k = {{\varvec{0}}}\). Then, by (9) we have \({{\bar{\varDelta }}}{\varvec{\lambda }}_k = {{\varvec{0}}}\), which contradicts \(\Vert ({{\bar{\varDelta }}}{{\varvec{x}}}_k, {{\bar{\varDelta }}}{\varvec{\lambda }}_k)\Vert >0\) in Assumption 3. This completes the proof.
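A small sketch of how the batch-size rule (B.4) could be evaluated in practice is given below; the constants and the trial values passed to the function are placeholders, not values from our experiments.

```python
# Sketch of the batch-size rule (B.4); all constants here are placeholders.
import numpy as np

def batch_size_f(m_k, eps_k, d, p_f, Omega_0, Omega_1, nu, kappa_2G, kappa_gradL):
    """Smallest |xi_f^k| satisfying (B.4) for the given problem constants."""
    C_f = 16 * max(Omega_0, Omega_1) ** 2 \
          * max(1.0, 4 * nu ** 2 * kappa_2G ** 2 * max(Omega_1, kappa_gradL) ** 2)
    return int(np.ceil(C_f * np.log(8 * d / p_f) / min(m_k ** 2, eps_k ** 2, 1.0)))

# Example call with arbitrary trial values.
print(batch_size_f(m_k=0.05, eps_k=0.02, d=10, p_f=0.1,
                   Omega_0=1.0, Omega_1=1.0, nu=0.01, kappa_2G=2.0, kappa_gradL=5.0))
```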

1.2 Proof of proposition 2

Note that

$$\begin{aligned}&{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^Tc_k \\&\quad = (I - G_k^T(G_kG_k^T)^{-1}G_k){{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^T(G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}^k + G_k^Tc_k\\&\quad {\mathop {=}\limits ^{(9)}} - (I - G_k^T(G_kG_k^T)^{-1}G_k)B_k{{\bar{\varDelta }}}{{\varvec{x}}}_k \\&\qquad + \left( G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\right) G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k - G_k^TG_k{{\bar{\varDelta }}}{{\varvec{x}}}_k. \end{aligned}$$

Thus, we know

$$\begin{aligned}&\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + \nu {{\bar{M}}}_k G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + G_k^Tc_k\\ \nu G_kG_k^TG_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} \nonumber \\&\quad = \begin{pmatrix} -(I - G_k^T(G_kG_k^T)^{-1}G_k)B_k - G_k^TG_k &{} G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\\ {{\varvec{0}}}&{} \nu G_kG_k^T \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}.\nonumber \\ \end{aligned}$$
(B.5)

Using (12) and (36), we upper bound the spectral norm of each block of the right-hand-side matrix and define

$$\begin{aligned} \varUpsilon _1 = \kappa _B+\kappa _{2,G} + \frac{1}{\sqrt{\kappa _{1,G}}} + \nu (\kappa _{{{\bar{M}}}} + \kappa _{2,G}) \end{aligned}$$

with \(\kappa _{{{\bar{M}}}}\) given in (36); then the first inequality in (39) is satisfied. Furthermore, we know

$$\begin{aligned} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} {\mathop {=}\limits ^{(9)}} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ -{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k - G_kG_k^T{{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} = \begin{pmatrix} I &{} {{\varvec{0}}}\\ -{{\bar{M}}}_k^T &{} -G_kG_k^T \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}. \end{aligned}$$

Thus, we let

$$\begin{aligned} \varUpsilon _2 = \frac{1}{1 + \kappa _{{{\bar{M}}}} + \kappa _{2,G}} \end{aligned}$$

and the second inequality in (39) is satisfied. Moreover,

$$\begin{aligned} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}&{\mathop {=}\limits ^{(9)}} \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ -(G_kG_k^T)^{-1}{{\bar{M}}}_k^T{{\bar{\varDelta }}}{{\varvec{x}}}_k - (G_kG_k^T)^{-1}G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix} \\&= \begin{pmatrix} I &{} {{\varvec{0}}}\\ -(G_kG_k^T)^{-1}{{\bar{M}}}_k^T &{} - (G_kG_k^T)^{-1} \end{pmatrix}\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}. \end{aligned}$$

Thus, we let

$$\begin{aligned} \varUpsilon _3 = 1 + \frac{\kappa _{{{\bar{M}}}}+1}{\kappa _{1,G}} \end{aligned}$$

and the third inequality in (39) is satisfied. Finally,

$$\begin{aligned} \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}&{\mathop {=}\limits ^{(20)}} \begin{pmatrix} \left( I+\nu {{\bar{M}}}_kG_k\right) {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k + {{\bar{\mu }}}_{{{\bar{K}}}}G_k^Tc_k\\ c_k + \nu G_kG_k^TG_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\\&{\mathop {=}\limits ^{(B.5)}}\begin{pmatrix} -(I - G_k^T(G_kG_k^T)^{-1}G_k)B_k - {{\bar{\mu }}}_{{{\bar{K}}}}G_k^TG_k &{} G_k^T(G_kG_k^T)^{-1} + \nu {{\bar{M}}}_k\\ -G_k &{} \nu G_kG_k^T \end{pmatrix}\\&\quad \qquad \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}. \end{aligned}$$

By Lemma 4, we know \({{\bar{\mu }}}_{{{\bar{K}}}} \le {{\tilde{\mu }}}\) with deterministic \({{\tilde{\mu }}}\). Thus, we let

$$\begin{aligned} \varUpsilon _4 = \kappa _B + ({{\tilde{\mu }}}+1)\kappa _{2,G} +\frac{1}{\sqrt{\kappa _{1,G}}} + \nu (\kappa _{{{\bar{M}}}}+\kappa _{2,G}) \end{aligned}$$

and the fourth inequality in (39) is satisfied. This completes the proof.

1.3 Proof of lemma 8

We consider the following three cases.

Case 1: reliable step. For the reliable step, we apply Lemma 6 and (43) is changed to

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k&\le \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix}\nonumber \\&{\mathop {\le }\limits ^{(26)}} \frac{{{\bar{\alpha }}}_k\beta }{4}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} - \frac{{{\bar{\epsilon }}}_k}{4} \nonumber \\&{\mathop {\le }\limits ^{(19)}} -\frac{{{\bar{\alpha }}}_k\beta (\gamma _{RH}\wedge \nu )}{8}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{k} \end{pmatrix}\right\| ^2 - \frac{{{\bar{\epsilon }}}_k}{4}. \end{aligned}$$
(B.6)

By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), while (44) is the same. Thus, by the condition on \(\omega \) in (41) we obtain

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{\omega {{\bar{\epsilon }}}_k}{8}\nonumber \\&+ \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2. \end{aligned}$$
(B.7)

Case 2: unreliable step. For the unreliable step, (B.6) is changed to

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \le \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} {\mathop {\le }\limits ^{(19)}} -\frac{\beta (\gamma _{RH}\wedge \nu )}{4}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2. \end{aligned}$$

By Line 15 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\) and (44) is the same. Thus, under the condition on \(\omega \) in (41),

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{k} \end{pmatrix}\right\| ^2\nonumber \\&-\frac{1}{2}(1-\omega )\left( 1-\frac{1}{\rho }\right) {{\bar{\epsilon }}}_k + \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2.\nonumber \\ \end{aligned}$$
(B.8)

Case 3: unsuccessful step. The result in (45) holds.

Combining (B.7), (B.8), and (45), we have

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k \le \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2, \end{aligned}$$

which completes the proof.

1.4 Proof of lemma 9

We consider the following three cases.

Case 1: reliable step. We have

$$\begin{aligned}&{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \nonumber \\&\quad \le |{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(25)}} {{\bar{\alpha }}}_k\beta \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(26)}} \frac{{{\bar{\alpha }}}_k\beta }{2}\begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} -\frac{{{\bar{\epsilon }}}_k}{2} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&\quad {\mathop {\le }\limits ^{(19)}} -\frac{\beta {{\bar{\alpha }}}_k(\gamma _{RH}\wedge \nu )}{4}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{{{\bar{\epsilon }}}_k}{2} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k|. \end{aligned}$$
(B.9)

By Line 13 of Algorithm 3, \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = (\rho -1){{\bar{\epsilon }}}_k\), and (44) holds as well. By (41), we have

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k&\le -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 \nonumber \\&\quad - \frac{\omega {{\bar{\epsilon }}}_k}{8} + \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2 \nonumber \\&\quad +\omega ( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| ). \end{aligned}$$
(B.10)

Case 2: unreliable step. The inequality in (B.9) is changed to

$$\begin{aligned} {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k&\le |{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{k+1} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&{\mathop {\le }\limits ^{(25)}} {{\bar{\alpha }}}_k\beta \begin{pmatrix} {{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ {{\bar{\nabla }}}_{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}^T\begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ {{\bar{\varDelta }}}{\varvec{\lambda }}_k \end{pmatrix} + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \nonumber \\&{\mathop {\le }\limits ^{(19)}} -\frac{\beta {{\bar{\alpha }}}_k(\gamma _{RH}\wedge \nu )}{2}\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}-{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}|\\&\quad \quad + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k|. \end{aligned}$$

We further have \({{\bar{\epsilon }}}_{k+1} - {{\bar{\epsilon }}}_k = -(1-1/\rho ){{\bar{\epsilon }}}_k\) and (44) holds as well. Using (41), we get

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} -\frac{\beta (\gamma _{RH}\wedge \nu )\omega }{32}{{\bar{\alpha }}}_k\left\| \begin{pmatrix} {{\bar{\varDelta }}}{{\varvec{x}}}_k\\ G_k{{\bar{\nabla }}}_{{{\varvec{x}}}}{\mathcal {L}}_k \end{pmatrix}\right\| ^2 - \frac{1}{2}(1-\omega )\left( 1-\frac{1}{\rho }\right) {{\bar{\epsilon }}}_k\nonumber \\&+ \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2 +\omega ( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| \nonumber \\&+ |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| ). \end{aligned}$$
(B.11)

Case 3: unsuccessful step. The result in (45) holds.

Combining (B.10), (B.11), and (45), we obtain

$$\begin{aligned} \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^{k+1} - \varPhi _{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu , \omega }^k\le & {} \rho (1-\omega ){{\bar{\alpha }}}_k\left\| \begin{pmatrix} \nabla _{{{\varvec{x}}}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k\\ \nabla _{{\varvec{\lambda }}}{\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k \end{pmatrix}\right\| ^2\\&+ \omega \left( |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k} - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^{s_k}| + |{{\bar{{\mathcal {L}}}}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k - {\mathcal {L}}_{{{\bar{\mu }}}_{{{\bar{K}}}}, \nu }^k| \right) , \end{aligned}$$

which completes the proof.

Additional simulation results

We list the detailed results for each selected CUTEst problem. For each problem and each method, we consider five levels of noise. For all four methods, we report the smallest result among the different setups. In particular, for NonAdapSQP and \(\ell _1\) SQP, we take the minimum over all six setups of stepsize sequences (four constant sequences and two decaying sequences). For AdapSQP and \(\ell _1\) AdapSQP, we take the minimum over the four setups of the constants \(C_{grad}, C_f \in \{1,5,10,50\}\). The reported result is the KKT residual of the last iterate, averaged over the convergent runs among five independent runs. The convergence results are summarized in Tables 1, 2, and 3.

In the tables, the column logR shows \(\log (\Vert \nabla {\mathcal {L}}_k\Vert )\) and the column logStd shows the logarithm of the standard deviation. For each problem, the first line is the result of AdapSQP; the second line is the result of \(\ell _1\) AdapSQP; the third line is the result of \(\ell _1\) SQP; and the fourth line is the result of NonAdapSQP. The entry ‘\(\diagup \)’ means that the algorithm does not converge within the given budget.
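For concreteness, the quantities reported in the tables can be computed along the following lines (a sketch only; grad_f, jac_c, and cons stand for problem callables and are not from our code).

```python
# Sketch of the reported quantities: the KKT residual of the last iterate per
# run, then logR / logStd over the convergent runs.  grad_f, jac_c, cons are
# assumed problem callables (placeholders, not from the paper's code).
import numpy as np

def kkt_residual(x, lam, grad_f, jac_c, cons):
    """|| (grad f(x) + G(x)^T lam, c(x)) || at a primal-dual iterate."""
    g, G, c = grad_f(x), jac_c(x), cons(x)
    return np.linalg.norm(np.concatenate([g + G.T @ lam, c]))

def summarize(residuals):
    """logR and logStd over the convergent runs (finite residuals only)."""
    r = np.asarray([v for v in residuals if np.isfinite(v)])
    return np.log(r.mean()), np.log(r.std())
```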

Table 1 KKT results summary
Table 2 KKT results summary
Table 3 KKT results summary


Cite this article

Na, S., Anitescu, M. & Kolar, M. An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Math. Program. 199, 721–791 (2023). https://doi.org/10.1007/s10107-022-01846-z
