
A stochastic two-step inertial Bregman proximal alternating linearized minimization algorithm for nonconvex and nonsmooth problems

  • Original Paper
  • Published in: Numerical Algorithms

Abstract

In this paper, we propose a stochastic two-step inertial Bregman proximal alternating linearized minimization (STiBPALM) algorithm with variance-reduced stochastic gradient estimators for solving a broad class of large-scale nonconvex and nonsmooth optimization problems, and we show that the SAGA and SARAH estimators are variance-reduced. Assuming the Kurdyka–Łojasiewicz property holds in expectation, together with suitable conditions on the parameters, we prove that the sequence generated by the proposed algorithm converges to a critical point, and we also derive a general convergence rate. Numerical experiments on sparse nonnegative matrix factorization and blind image deblurring demonstrate the performance of the proposed algorithm.


Data Availability

All data generated or analyzed during this study are included in this article.

Notes

  1. http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.


Acknowledgements

We are grateful to the editor and reviewers for their comments that improved the quality of our paper.

Funding

This work is supported by the Scientific Research Project of Tianjin Municipal Education Commission (2022ZD007).

Author information


Contributions

All authors contributed to the manuscript and approved the submitted version.

Corresponding author

Correspondence to Jing Zhao.

Ethics declarations

Ethics approval

Not applicable

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A. SAGA variance bound

We define the SAGA gradient estimators \(\widetilde{\nabla }_x(u_k,y_k)\) and \(\widetilde{\nabla }_y(x_{k+1},v_k)\) as follows:

$$\begin{aligned} \widetilde{\nabla }_x(u_k,y_k)=&\frac{1}{b}\sum _{i\in I_k^x}\left( \nabla _xH_i(u_k,y_k)- \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right) + \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k}), \\ \widetilde{\nabla }_y(x_{k+1},v_k)=&\frac{1}{b}\sum _{i\in I_k^y}\left( \nabla _yH_i(x_{k+1},v_k)\!-\! \nabla _yH_i(x_{k+1},\xi _{k}^{i}) \right) \!+\! \frac{1}{n}\sum _{j=1}^n\nabla _yH_j(x_{k+1},\xi _{k}^{j}),\nonumber \end{aligned}$$
(A.1)

where \(I_k^x\) and \(I_k^y\) are mini-batches containing b indices. The variables \(\varphi _{k}^{i}\) and \(\xi _{k}^{i}\) follow the update rules \(\varphi _{k+1}^{i}=u_k\) if \(i\in I_k^x\) and \(\varphi _{k+1}^{i}=\varphi _{k}^{i}\) otherwise, and \(\xi _{k+1}^{i}=v_k\) if \(i\in I_k^y\) and \(\xi _{k+1}^{i}=\xi _{k}^{i}\) otherwise.

To prove our variance bounds, we require the following lemma.

Lemma A.1

Suppose \(X_1,\cdots ,X_t\) are independent random variables satisfying \(\mathbb {E}_{k}X_i=0\) for \(1\le i\le t\). Then

$$\begin{aligned} \mathbb {E}_{k}\left\| X_1+\cdots +X_t \right\| ^2=\mathbb {E}_{k}\left[ \left\| X_1 \right\| ^2 +\cdots +\left\| X_t \right\| ^2 \right] . \end{aligned}$$
(A.2)

Proof

Our hypotheses on these random variables imply \(\mathbb {E}_{k}\left\langle X_i,X_j \right\rangle =0\) for \(i\ne j\). Therefore,

$$\begin{aligned} \mathbb {E}_{k}\left\| X_1+\cdots +X_t \right\| ^2= \mathbb {E}_{k}\sum _{i,j=1}^{t}\left\langle X_i,X_j \right\rangle =\mathbb {E}_{k}\left[ \left\| X_1 \right\| ^2 +\cdots +\left\| X_t \right\| ^2 \right] . \end{aligned}$$

\(\square \)

We are now prepared to prove that the SAGA gradient estimator is variance-reduced.

Lemma A.2

The SAGA gradient estimator satisfies

$$\begin{aligned}{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2\le \frac{1}{bn}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2,\nonumber \\{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _yH(x_{k+1},v_k) \right\| ^2\!\le \!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_k,v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\nonumber \\{} & {} +\frac{16N^2\gamma ^2}{b}\left( \mathbb {E}_{k}\left\| z_{k+1}-z_{k}\right\| ^2+\left\| z_{k}-z_{k-1}\right\| ^2+\left\| z_{k-1}-z_{k-2}\right\| ^2\right) , \end{aligned}$$
(A.3)

as well as

$$\begin{aligned}{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| \le \frac{1}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ,\nonumber \\{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _yH(x_{k+1},v_k) \right\| \!\le \!\frac{2}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _yH_j(x_k,v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| \nonumber \\{} & {} +\frac{4N\gamma }{\sqrt{b}}\left( \mathbb {E}_{k}\left\| z_{k+1}-z_{k}\right\| +\left\| z_{k}-z_{k-1}\right\| +\left\| z_{k-1}-z_{k-2}\right\| \right) , \end{aligned}$$
(A.4)

where \(N=\max \left\{ M,L \right\} \), \(\gamma =\max \left\{ \gamma _1,\gamma _2 \right\} \).

Proof

According to (A.1), we have

$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2\\ =&\mathbb {E}_{k}\left\| \frac{1}{b} \sum _{i\in I_k^x}\left( \nabla _xH_i(u_k,y_k)\!-\! \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right) \!-\!\nabla _xH(u_k,y_k)\!+\! \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\nonumber \\ \overset{(1)}{\le }\&\frac{1}{b^2}\mathbb {E}_{k}\sum _{i\in I_k^x}\left\| \nabla _xH_i(u_k,y_k)- \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right\| ^2\nonumber \\ =&\frac{1}{bn}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2.\nonumber \end{aligned}$$
(A.5)

Inequality (1) follows from Lemma A.1. By Jensen's inequality, we have

$$\begin{aligned} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| \le&\sqrt{\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2}\\ \le&\frac{1}{\sqrt{bn}}\sqrt{\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2}\nonumber \\ \le&\frac{1}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| .\nonumber \end{aligned}$$
(A.6)

We use an analogous argument for \(\widetilde{\nabla }_y(x_{k+1},v_k)\). Let \(\mathbb {E}_{k,x}\) denote the expectation conditional on the first k iterations and \(I_k^x\). By the same reasoning as in (A.5), applying the Lipschitz continuity of \(\nabla _yH_j\), we obtain that

$$\begin{aligned}&\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\nonumber \\ \le&\frac{1}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k+1},v_k)- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k+1},v_k)\!-\! \nabla _yH_j(x_{k},y_{k}) \right\| ^2\!+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},y_k)\!-\! \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},\xi _{k}^{j})\!-\! \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\ \le&\frac{4M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{4M^2}{b}\left\| v_{k}-y_{k}\right\| ^2+\frac{4L^2}{b}\left\| y_{k}-v_{k}\right\| ^2\nonumber \\&+\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2+\frac{4M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2\nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2+\frac{8M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2\nonumber \\&+\frac{4(M^2+L^2)}{b}\left( 2\gamma _1^2\left\| y_{k}-y_{k-1}\right\| ^2+2\gamma _2^2\left\| y_{k-1}-y_{k-2}\right\| ^2\right) \nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{16N^2\gamma ^2}{b}\left( \left\| z_{k+1}\!-\!z_{k}\right\| ^2\!+\!\left\| z_{k}\!-\!z_{k-1}\right\| ^2\right. \nonumber \\&\left. +\left\| z_{k-1}-z_{k-2}\right\| ^2\right) , \end{aligned}$$
(A.7)

where \(N=\max \left\{ M,L \right\} \), \(\gamma =\max \left\{ \gamma _1,\gamma _2 \right\} \). Also, by the same reasoning as in (A.6),

$$\begin{aligned}&\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \\ \le&\sqrt{\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2}\nonumber \\ \le&\frac{2}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| +\frac{4N\gamma }{\sqrt{b}}\left( \left\| z_{k+1}-z_{k}\right\| +\left\| z_{k}-z_{k-1}\right\| \right. \nonumber \\&\left. +\left\| z_{k-1}-z_{k-2}\right\| \right) ,\nonumber \end{aligned}$$
(A.8)

Applying the operator \(\mathbb {E}_{k}\) to (A.7) and (A.8), we get the desired result. \(\square \)

Now, define

$$\begin{aligned} \Upsilon _{k+1}=&\frac{1}{bn}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2 \right. \\&\left. +4\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2 \right) ,\nonumber \\ \Gamma _{k+1}=&\frac{1}{\sqrt{bn}}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| \right. \nonumber \\&\left. +2\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| \right) .\nonumber \end{aligned}$$
(A.9)

By Lemma A.2, we have

$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right] \\ \le&\Upsilon _k+V_1\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$

and

$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right] \\ \le&\Gamma _k+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| \right) . \end{aligned}$$

This is exactly the MSE bound, where \(V_{1}=\frac{16N^2\gamma ^2}{b}\) and \(V_{2}=\frac{4N\gamma }{\sqrt{b}}\).

Lemma A.3

(Geometric decay) Let \(\Upsilon _{k}\) be defined as in (A.9). Then the following geometric decay property holds:

$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\le \left( 1-\rho \right) \Upsilon _{k}+V_{\Upsilon }\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$
(A.10)

where \(\rho =\frac{b}{2n}\), \(V_{\Upsilon }=\frac{408nN^2(1+2\gamma _1^2+\gamma _2^2)}{b^2}\).

Proof

We show that \(\mathbb {E}_{k}\Upsilon _{k+1}\) is decreasing at a geometric rate. By applying the inequality \(\left\| a-c \right\| ^2\le (1+\varepsilon )\left\| a-b \right\| ^2+(1+\varepsilon ^{-1} )\left\| b-c \right\| ^2\) twice, it follows that

$$\begin{aligned}&\frac{1}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2\nonumber \\ \le&\frac{1+\varepsilon }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n\mathbb {E}_{k} \left\| \nabla _xH_j(u_{k+1},y_{k+1})\right. \nonumber \\&\left. - \nabla _xH_j(u_{k},y_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^2 }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1} ) }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(\varphi _{k+1}^{j},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(u_{k},y_{k})\right\| ^2\nonumber \\ \le&\frac{(1\!+\varepsilon )^2 (1-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(1+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k}\!-\! y_{k+1} \right\| ^2\nonumber \\&+\frac{(1+\varepsilon ^{-1})M^2}{b}\mathbb {E}_{k}\left( \left\| u_{k+1}-u_{k}\right\| ^2+\left\| y_{k+1}-y_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1\!+\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(2+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k+1}\!- y_{k} \right\| ^2\nonumber \\&+\frac{(1+\varepsilon ^{-1})M^2}{b}\mathbb {E}_{k}\left( 3\left\| u_{k+1}-x_{k+1}\right\| ^2+3\left\| x_{k+1}-x_{k}\right\| ^2+3\left\| x_{k}-u_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1+\!\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(2+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k+1}\!-\! y_{k} \right\| ^2\nonumber \\&+\frac{3M^2(1+\varepsilon ^{-1})(1+2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\left\| x_{k}-x_{k-1}\right\| ^2\nonumber \\&+\frac{6M^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| x_{k-1}-x_{k-2}\right\| ^2. \end{aligned}$$
(A.11)

Similarly,

$$\begin{aligned}&\frac{1}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2\nonumber \\ \le&\frac{1+\varepsilon }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^3 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )^2(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k},\xi _{k}^{j})- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(1\!+\!\varepsilon )^2(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k} \right\| ^2\!+\!\frac{(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k} \right\| ^2\!+\!\frac{(1\!+\!\varepsilon ^{-1})L^2}{b}\mathbb {E}_{k}\left\| v_{k+1}\!-\!v_{k} \right\| ^2\nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1\!-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2+\frac{(1+\varepsilon ^{-1})L^2}{b}\mathbb {E}_{k}\left( 3\left\| v_{k+1}-y_{k+1}\right\| ^2+3\left\| y_{k+1}-y_{k}\right\| ^2+3\left\| y_{k}-v_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1\!-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2+\frac{3L^2(1+\varepsilon ^{-1})(1+2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| y_{k+1}-y_{k}\right\| ^2+\frac{6L^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\nonumber \\&\left\| y_{k}-y_{k-1}\right\| ^2+\frac{6L^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| y_{k-1}-y_{k-2}\right\| ^2. \end{aligned}$$
(A.12)

With

$$\begin{aligned} \Upsilon _{k+1}=&\frac{1}{bn}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2 \right. \\&\left. +4\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2 \right) , \end{aligned}$$

adding (A.11) and (A.12), we obtain

$$\begin{aligned}&\mathbb {E}_{k}\Upsilon _{k+1}\\ \le&\!(1\!+\!\varepsilon )^3 (\!1\!-\!b/n)\Upsilon _{k}\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k\!+1}\!-\! y_{k} \right\| ^2\!+\!\frac{3M^2(\!1\!+\!\varepsilon ^{-1})(\!1\!+\!2\gamma _{1}^2)}{b}\\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\left\| x_{k}-x_{k-1}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\\&\left\| x_{k-1}-x_{k-2}\right\| ^2+\frac{4(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)M^2(2+\varepsilon )}{b}\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2\\&\!+\!\frac{12L^2(1\!+\!\varepsilon ^{-1})(1\!+\!2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| y_{k+1}\!-\!y_{k}\right\| ^2\!+\!\frac{24L^2(1\!+\!\varepsilon ^{-1})(\gamma _{1}^2\!+\!\gamma _{2}^2)}{b}\left\| y_{k}\!-\!y_{k-1}\right\| ^2\\&+\frac{24L^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| y_{k-1}-y_{k-2}\right\| ^2\\ \le&(1+\varepsilon )^3 (1-b/n)\Upsilon _{k}+\frac{13N^2(1+\varepsilon )(2+\varepsilon )(1+\varepsilon ^{-1} )(1+2\gamma _1^2)}{b}\mathbb {E}_{k}\left\| z_{k+1}- z_{k} \right\| ^2\\&+\frac{24N^2(1+\varepsilon ^{-1})(\gamma _1^2+\gamma _{2}^2)}{b}\left\| z_{k}-z_{k-1}\right\| ^2+\frac{24N^2\gamma _{2}^2(1+\varepsilon ^{-1})}{b}\left\| z_{k-1}-z_{k-2}\right\| ^2\\ \le&(1\!+\!\varepsilon )^3 (1\!-\!b/n)\Upsilon _{k}\!+\!\frac{24N^2(1\!+\!\varepsilon )(2\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1})(1\!+\!2\gamma _1^2\!+\!\gamma _2^2)}{b}\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\right. \\&\left. +\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$

where \(N=\max \left\{ M,L \right\} \). Choosing \(\varepsilon =\frac{b}{6n}\), we have \((1+\varepsilon )^3(1-\frac{b}{n} ) \le 1-\frac{b}{2n}\), producing the inequality

$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\le&(1-\frac{b}{2n})\Upsilon _{k}+\frac{24N^2(1+\frac{b}{6n})(2+\frac{b}{6n})(1+\frac{6n}{b})(1+2\gamma _1^2+\gamma _2^2)}{b}\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}\right. \nonumber \\&\left. +\left\| z_{k}-z_{k-1} \right\| ^{2}+\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) \nonumber \\ \le&(1\!-\!\frac{b}{2n})\Upsilon _{k}\!+\!\frac{408nN^2(1\!+\!2\gamma _1^2\!+\!\gamma _2^2)}{b^2}\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\right) . \end{aligned}$$
(A.13)

This completes the proof. \(\square \)

Lemma A.4

(Convergence of estimator) If \(\left\{ z_k \right\} _{k\in \mathbb {N} }\) satisfies \(\lim _{k \rightarrow \infty } \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^{2}\!=0\), then \( \mathbb {E}\Upsilon _k\rightarrow 0\) and \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k\rightarrow \infty \).

Proof

We first show that \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\rightarrow 0\) as \(k\rightarrow \infty \). Indeed,

$$\begin{aligned}&\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\le L^2\sum _{j=1}^n \mathbb {E}\left\| u_{k}-\varphi _{k}^{j} \right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1+\frac{b}{2n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1+\frac{b}{2n})(1-\frac{b}{n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1-\frac{b}{2n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\sum _{l=1}^k(1-\frac{b}{2n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| ^2. \end{aligned}$$
(A.14)

Since \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}\rightarrow 0\), we have \(\mathbb {E}\left\| u_{k}-u_{k-1} \right\| ^{2}\rightarrow 0\), so \(\sum _{l=1}^k(1-\frac{b}{2n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| ^2 \rightarrow 0\), and hence \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2 \rightarrow 0\) as \(k\rightarrow \infty \). An analogous argument shows that \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\rightarrow 0\) as \(k\rightarrow \infty \), so \(\mathbb {E}\Upsilon _k\rightarrow 0\) as \(k\rightarrow \infty \). Similarly, \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k\rightarrow \infty \). Indeed,

$$\begin{aligned}&\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| \le L\sum _{j=1}^n \mathbb {E}\left\| u_{k}-\varphi _{k}^{j} \right\| \nonumber \\ \le&nL\mathbb {E}\left\| u_{k}-u_{k-1}\right\| +L\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k}^{j}\right\| \nonumber \\ \le&nL\mathbb {E}\left\| u_{k}-u_{k-1}\right\| +L(1-\frac{b}{n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| \nonumber \\ \le&nL\sum _{l=1}^k(1-\frac{b}{n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| . \end{aligned}$$
(A.15)

Because \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}\rightarrow 0\), Jensen's inequality gives \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| \le \sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}}\rightarrow 0\). Hence \(\mathbb {E}\left\| u_{k}-u_{k-1} \right\| \rightarrow 0\), so the bound on the right-hand side of (A.15) tends to zero as \(k\rightarrow \infty \), and therefore \(\mathbb {E}\Gamma _k\rightarrow 0\).

\(\square \)

B. SARAH variance bound

As in the previous section, we use \(I_k^x\) and \(I_k^y\) to denote the mini-batches used to approximate \(\nabla _xH(u_k,y_k) \) and \(\nabla _yH(x_{k+1},v_k)\), respectively.

Lemma B.1

The SARAH gradient estimator satisfies

$$\begin{aligned}&\mathbb {E}_{k}\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right) \\ \le&\left( \! 1\!-\!\frac{1}{p} \!\right) \left( \! \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\! \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\right) \\&+V_{1}\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}+\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) , \end{aligned}$$

as well as

$$\begin{aligned}&\mathbb {E}_k\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right) \\ \le&\sqrt{1-\frac{1}{p} } \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}) \right\| +\left\| \widetilde{\nabla }_y(x_{k},v_{k-1})-\nabla _yH(x_{k},v_{k-1}) \right\| \right) \\&+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) , \end{aligned}$$

where \(V_{1}=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\) and \(V_{2}=M\sqrt{6(1-\frac{1}{p})(1+2\gamma _{1}^2+\gamma _{2}^2) }\).

Proof

Let \(\mathbb {E}_{k,p}\) denote the expectation conditional on the first k iterations and the event that we do not compute the full gradient at iteration k. The conditional expectation of the SARAH gradient estimator in this case is

$$\begin{aligned} \mathbb {E}_{k,p}\widetilde{\nabla }_x(u_k,y_k)=&\frac{1}{b}\mathbb {E}_{k,p}\left( \! \sum _{i\in I_k^x} \nabla _xH_i(u_k,y_k)\!-\! \nabla _xH_i(u_{k-1},y_{k-1}) \!\right) \!+\!\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \nonumber \\ =&\nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1})+\widetilde{\nabla }_x(u_{k-1},y_{k-1}), \end{aligned}$$
(B.1)

and further

$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \nonumber \\ =&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})+\nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k)\right. \nonumber \\&\left. +\widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| ^2 \nonumber \\ =&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2+\left\| \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\| ^2\nonumber \\&+\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2 \nonumber \\&+2\left\langle \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}), \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\rangle \nonumber \\&-2\left\langle \nabla _xH(u_{k-1},y_{k-1})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle \nonumber \\&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle . \end{aligned}$$
(B.2)

By (B.1), we see that

$$\mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) =\nabla _xH(u_{k},y_{k})-\nabla _xH(u_{k-1},y_{k-1}).$$

Thus, the first two inner products in (B.2) sum to zero and the third one is equal to

$$\begin{aligned}&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle \\ =&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \nabla _xH(u_{k},y_{k})-\nabla _xH(u_{k-1},y_{k-1})\right\rangle \\ =&-2\left\| \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$

This yields

$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ =&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2-\left\| \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\| ^2 \\&+\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2\\ \le&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)\!-\!\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2. \end{aligned}$$

We can bound the second term by computing the expectation.

$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2\\ =&\mathbb {E}_{k,p}\left\| \frac{1}{b}\left( \sum _{i\in I_k^x} \nabla _xH_i(u_k,y_k)- \nabla _xH_i(u_{k-1},y_{k-1}) \right) \right\| ^2\\ \le&\frac{1}{b}\mathbb {E}_{k,p}\left[ \sum _{i\in I_k^x} \left\| \nabla _xH_i(u_k,y_k)- \nabla _xH_i(u_{k-1},y_{k-1}) \right\| ^2 \right] \\ =&\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$

The inequality is due to the convexity of the function \(x\mapsto \left\| x \right\| ^2\). This results in the recursive inequality

$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ \le&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)\!-\! \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$

This bounds the MSE under the condition that the full gradient is not computed. When the full gradient is computed, the MSE is equal to zero, so taking the M-Lipschitz continuity of the gradients of the \(H_j\) into account, we get

$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)\!-\! \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2 \right) \\ \le&\left( 1-\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2+M^2\left\| (u_k,y_k)- (u_{k-1},y_{k-1}) \right\| ^2 \right) . \end{aligned}$$

Using \((a+b+c) ^2\le 3(a^2+b^2+c^2)\), we can estimate

$$\begin{aligned}&\left\| (u_k,y_k)- (u_{k-1},y_{k-1}) \right\| ^2=\left\| u_k-u_{k-1}\right\| ^2+\left\| y_k-y_{k-1}\right\| ^2\\ \le&3\left\| u_k-x_{k}\right\| ^2+3\left\| x_k-x_{k-1}\right\| ^2+3\left\| x_{k-1}-u_{k-1}\right\| ^2+\left\| y_k-y_{k-1}\right\| ^2\\ \le&3(1\!+\!2\gamma _{1}^2)\left\| x_k\!-\!x_{k-1}\right\| ^2\!+\!6(\gamma _{1}^2\!+\!\gamma _{2}^2)\left\| x_{k-1}\!-\!x_{k-2}\right\| ^2\!+\!6\gamma _{2}^2\left\| x_{k-2}\!-\!x_{k-3}\right\| ^2\!+\!\left\| y_k-y_{k-1}\right\| ^2. \end{aligned}$$

Substituting the above inequality, we obtain

$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \nonumber \\ \le&\left( 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!3M^2(1\!+\!2\gamma _{1}^2)\left\| x_k\!-\!x_{k-1}\right\| ^2\right. \nonumber \\&\left. +\!6M^2(\gamma _{1}^2\!+\!\gamma _{2}^2)\left\| x_{k-1}\!-\!x_{k-2}\right\| ^2\!+\!6M^2\gamma _{2}^2\left\| x_{k-2}\!-\!x_{k-3}\right\| ^2\!+\!M^2\left\| y_k\!-\!y_{k-1}\right\| ^2 \right) . \end{aligned}$$
(B.3)

By symmetric arguments, it holds

$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k) -\nabla _yH(x_{k+1},v_k) \right\| ^2 \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| (x_{k+1},v_k)\!-\! (x_{k},v_{k-1}) \right\| ^2 \right) \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k}\right\| ^2\!+\!3M^2(1\!+\!2\mu _{1k}^2)\right. \nonumber \\&\left. \left\| y_k\!-\!y_{k-1}\right\| ^2\!+\!6M^2(\mu _{1,k-1}^2\!+\!\mu _{2k}^2)\left\| y_{k-1}\!-\!y_{k-2}\right\| ^2\!+\!6M^2\mu _{2,k-1}^2\left\| y_{k-2}\!-\!y_{k-3}\right\| ^2 \right) \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k}\right\| ^2\!+\!3M^2(1\!+\!2\gamma _{1}^2)\right. \nonumber \\&\left. \left\| y_k-y_{k-1}\right\| ^2+6M^2(\gamma _{1}^2+\gamma _{2}^2)\left\| y_{k-1}-y_{k-2}\right\| ^2+6M^2\gamma _{2}^2\left\| y_{k-2}-y_{k-3}\right\| ^2 \right) . \end{aligned}$$
(B.4)

Combining (B.3) and (B.4), we obtain

$$\begin{aligned}&\mathbb {E}_{k}\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right) \\ \le&\left( \!1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\! \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\right. \\&\left. +M^2\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+M^2\left\| y_{k}-y_{k-1}\right\| ^2+3M^2(1+2\gamma _{1}^2)\left\| z_k-z_{k-1}\right\| ^2\right. \\&\left. +6M^2(\gamma _{1}^2+\gamma _{2}^2)\left\| z_{k-1}-z_{k-2}\right\| ^2+6M^2\gamma _{2}^2\left\| z_{k-2}-z_{k-3}\right\| ^2 \right) \\ \le&\left( 1-\frac{1}{p} \right) \Upsilon _{k}+6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2}\right. \\&\left. +\left\| z_{k-1}-z_{k-2} \right\| ^{2}+\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) . \end{aligned}$$

Similar bounds hold for \(\Gamma _k\) due to Jensen’s inequality:

$$\begin{aligned}&\mathbb {E}_k\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right) \\ \le&\sqrt{1-\frac{1}{p} } \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}) \right\| +\left\| \widetilde{\nabla }_y(x_{k},v_{k-1})-\nabla _yH(x_{k},v_{k-1}) \right\| \right) \\&+M\sqrt{6(1-\frac{1}{p})(1+2\gamma _{1}^2+\gamma _{2}^2) }\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) . \end{aligned}$$

This completes the proof. \(\square \)

Now, define

$$\begin{aligned} \Upsilon _{k+1}=&\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2,\nonumber \\ \Gamma _{k+1}=&\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| . \end{aligned}$$
(B.5)

By Lemma B.1, we have

$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right] \\ \le&\Upsilon _k\!+\!V_1\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}\right) , \end{aligned}$$

and

$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right] \\ \le&\Gamma _k+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) . \end{aligned}$$

This is exactly the MSE bound, where \(V_{1}=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\) and \(V_{2}=M\sqrt{6(1-\frac{1}{p})(1+2\gamma _{1}^2+\gamma _{2}^2) }\).

Lemma B.2

(Geometric decay) Let \(\Upsilon _{k}\) be defined as in (B.5). Then the following geometric decay property holds:

$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\!\le \!\left( 1\!-\!\rho \right) \Upsilon _{k}\!+\!V_{\Upsilon }\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}\right) , \end{aligned}$$
(B.6)

where \(\rho = \frac{1}{p}\), \(V_{\Upsilon }=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\).

Proof

This is a direct result of Lemma B.1. \(\square \)

Lemma B.3

(Convergence of estimator) If \(\left\{ z_k \right\} _{k\in \mathbb {N} }\) satisfies \(\lim _{k \rightarrow \infty } \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^{2}\!=0\), then \( \mathbb {E}\Upsilon _k\rightarrow 0\) and \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k \rightarrow \infty \).

Proof

By (B.6), we have

$$\begin{aligned}&\mathbb {E}\Upsilon _{k}\\ \le&\left( 1\!-\!\rho \right) \mathbb {E}\Upsilon _{k-1}\!+\!V_{\Upsilon }\mathbb {E}\left( \left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\left\| z_{k-2}\!-z_{k-3} \right\| ^{2}+\left\| z_{k-3}-z_{k-4} \right\| ^{2}\right) \\ \le&V_{\Upsilon }\sum _{l=1}^{k}\left( 1-\rho \right) ^{k-l}\mathbb {E}\left( \left\| z_{l}-z_{l-1} \right\| ^{2} +\left\| z_{l-1}-z_{l-2} \right\| ^{2}+\left\| z_{l-2}-z_{l-3} \right\| ^{2}+\left\| z_{l-3}-z_{l-4} \right\| ^{2}\right) , \end{aligned}$$

which implies \( \mathbb {E}\Upsilon _k\rightarrow 0\) as \(k \rightarrow \infty \). By Jensen’s inequality, we have \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k \rightarrow \infty \). \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, C., Zhao, J. & Dong, QL. A stochastic two-step inertial Bregman proximal alternating linearized minimization algorithm for nonconvex and nonsmooth problems. Numer Algor (2023). https://doi.org/10.1007/s11075-023-01693-9

