
Distributed stochastic gradient tracking methods with momentum acceleration for non-convex optimization

Published in: Computational Optimization and Applications

Abstract

We consider the distributed non-convex optimization problem of minimizing the sum of local cost functions over a network of agents. This problem frequently arises in large-scale distributed machine learning as non-convex empirical risk minimization. In this paper, we propose two accelerated algorithms, named DSGT-HB and DSGT-NAG, which combine the distributed stochastic gradient tracking (DSGT) method with momentum acceleration techniques. Under appropriate assumptions, we prove that both algorithms converge sublinearly to a neighborhood of a first-order stationary point of the distributed non-convex optimization problem. Moreover, we derive conditions under which DSGT-HB and DSGT-NAG achieve a network-independent linear speedup. Numerical experiments on a distributed non-convex logistic regression problem with real data sets and on a deep neural network with the MNIST database demonstrate the superiority of DSGT-HB and DSGT-NAG over DSGT.
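The gradient-tracking backbone that DSGT-HB and DSGT-NAG build on can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it runs plain DSGT (the momentum term set to zero) with exact gradients in place of stochastic ones, on scalar quadratics where agent \(i\) holds \(f_i(\theta)=\frac{1}{2}(\theta-c_i)^2\), so the sum of the local costs is minimized at the mean of the \(c_i\). The ring mixing matrix, targets, and step size are illustrative choices.

```python
import numpy as np

# Minimal sketch of distributed gradient tracking (DSGT with the momentum
# term switched off); exact gradients stand in for stochastic ones.
# Agent i minimizes f_i(theta) = 0.5 * (theta - c[i])^2, so the sum of the
# local costs is minimized at mean(c).

n = 4                                   # number of agents
c = np.array([1.0, -2.0, 0.5, 3.0])     # local targets (illustrative data)
W = np.array([[.5, .25, 0., .25],       # doubly stochastic ring mixing matrix
              [.25, .5, .25, 0.],
              [0., .25, .5, .25],
              [.25, 0., .25, .5]])

grad = lambda theta: theta - c          # local gradients, stacked over agents

alpha = 0.2                             # step size
theta = np.zeros(n)                     # local iterates
s = grad(theta)                         # gradient trackers, s_0 = d_0
d_old = s.copy()

for _ in range(300):
    theta = W @ (theta - alpha * s)     # consensus step + descent along tracker
    d_new = grad(theta)
    s = W @ (s + d_new - d_old)         # gradient-tracking update
    d_old = d_new

# Every agent's iterate approaches the global minimizer mean(c) = 0.625.
print(theta)
```

The tracker \(s_k\) lets each agent follow an estimate of the *global* gradient rather than only its local one, which is what removes the steady-state consensus error of plain decentralized gradient descent; the momentum variants of the paper additionally mix a momentum direction into the descent step.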


Data availability

The datasets tested in this study are available on the LIBSVM website, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The MNIST database used in this study is available on this site: http://yann.lecun.com/exdb/mnist/.

Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/

  2. Each epoch evaluates m component gradients at each agent.

  3. http://yann.lecun.com/exdb/mnist/


Funding

This work was supported by the National Natural Science Foundation of China (Grant Numbers 12071108, 11671116, 12021001, 11991021, 11991020, 11971372, 11701137); and the Strategic Priority Research Program of CAS (Grant number XDA27000000).

Author information

Corresponding author

Correspondence to Xin-Wei Liu.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.


Appendices

Appendix A Proof of Lemma 8

Proof

Using \(W_\infty W=W_\infty\) and (4c), we obtain

$$\begin{aligned}&\Vert \mathbf {s}_{k+1}-W_\infty \mathbf {s}_{k+1}\Vert ^2\nonumber \\&\quad =\Vert W(\mathbf {s}_{k}+ \mathbf {d}_{k+1}-\mathbf {d}_{k})-W_\infty W(\mathbf {s}_{k}+ \mathbf {d}_{k+1}-\mathbf {d}_{k})\Vert ^2\nonumber \\&\quad =\Vert W\mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2+\Vert (W-W_\infty )(\mathbf {d}_{k+1}- \mathbf {d}_{k})\Vert ^2\nonumber \\&\quad +2\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\mathbf {d}_{k+1}- \mathbf {d}_{k})\rangle \nonumber \\&\quad \le \delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2+\delta ^2\Vert \mathbf {d}_{k+1}- \mathbf {d}_{k}\Vert ^2 +2\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\mathbf {d}_{k+1}- \mathbf {d}_{k})\rangle , \end{aligned}$$
(A1)

where the inequality follows from Lemma 2(c) and \(\Vert W-W_\infty \Vert =\delta\). Notice that

$$\begin{aligned}&\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\mathbf {d}_{k+1}- \mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\nabla \mathbf {f}_{k+1}- \mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\nabla \mathbf {f}_{k+1}- \nabla \mathbf {f}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad +\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_k,(W-W_\infty )(\nabla \mathbf {f}_{k}- \mathbf {d}_{k})\rangle \mid \mathcal {F}_k]. \end{aligned}$$
(A2)

Taking the conditional expectation on (A1) and using (A2) lead to

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {s}_{k+1}-W_\infty \mathbf {s}_{k+1}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le \delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k]+ \delta ^2\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +2\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad +2\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \mid \mathcal {F}_k]. \end{aligned}$$
(A3)

Consider the second term on the right-hand side of (A3). Note that

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\Vert \mathbf {d}_{k+1}-\nabla \mathbf {f}_{k+1}\Vert ^2\mid \mathcal {F}_k] +\mathbb {E}[\Vert \nabla \mathbf {f}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le n\sigma ^2+\mathbb {E}[\Vert \nabla \mathbf {f}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =n\sigma ^2+\mathbb {E}[\Vert \nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k}\Vert ^2\mid \mathcal {F}_k]+ \mathbb {E}[\Vert \nabla \mathbf {f}_{k}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +2\mathbb {E}[\langle \nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k},\nabla \mathbf {f}_{k}-\mathbf {d}_{k}\rangle \mid \mathcal {F}_k]\nonumber \\&\quad \le 2n\sigma ^2+L^2\mathbb {E}[\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +2\mathbb {E}[\langle \nabla \mathbf {f}_{k+1},\nabla \mathbf {f}_{k}-\mathbf {d}_{k}\rangle \mid \mathcal {F}_k], \end{aligned}$$
(A4)

where the last inequality follows from Assumption 1 and \(\mathbb {E}[\langle \nabla \mathbf {f}_{k},\nabla \mathbf {f}_{k}-\mathbf {d}_{k}\rangle \mid \mathcal {F}_k]=0\). We note that

$$\begin{aligned}&\mathbb {E}[\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\Vert W(\varvec{\theta }_k-\alpha (1-\beta )\mathbf {s}_k-\alpha \beta \mathbf {u}_k)-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\Vert (W-I_{np})(\varvec{\theta }_k-W_\infty \varvec{\theta }_{k})-\alpha (1-\beta ) W \mathbf {s}_{k}-\alpha \beta W\mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le 2\Vert W-I_{np}\Vert ^2\mathbb {E}[\Vert \varvec{\theta }_k-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]+4\alpha ^2(1-\beta )^2\mathbb {E}[\Vert W\mathbf {s}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +4\alpha ^2\beta ^2\mathbb {E}[\Vert \mathbf {u}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le 8\mathbb {E}[\Vert \varvec{\theta }_k-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]+4\alpha ^2\beta ^2\mathbb {E}[\Vert \mathbf {u}_{k}\Vert ^2\mid \mathcal {F}_k] +8\alpha ^2(1-\beta )^2\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +8\alpha ^2(1-\beta )^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k] +8\alpha ^2(1-\beta )^2\sigma ^2, \end{aligned}$$
(A5)

where the first inequality uses Lemma 1(c) with \(\eta =1\) and the last inequality uses \(\Vert W-I_{np}\Vert \le 2\) and (13). Next, we consider the last term on the right-hand side of (A4). From (4), it follows that
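The spectral facts invoked here and throughout the proof (\(W_\infty W=W_\infty\), \(\Vert W-W_\infty \Vert =\delta <1\), \(\Vert W-I_{np}\Vert \le 2\), \(\Vert W^2\Vert =1\)) hold for any symmetric doubly stochastic mixing matrix. A quick numerical sanity check on an illustrative 4-agent ring matrix (with \(p=1\), so \(W_\infty =\frac{1}{n}\mathbf {1}\mathbf {1}^T\)):

```python
import numpy as np

# Check the mixing-matrix identities used in the proof on a concrete
# symmetric doubly stochastic matrix (4-agent ring, p = 1).
n = 4
W = np.array([[.5, .25, 0., .25],
              [.25, .5, .25, 0.],
              [0., .25, .5, .25],
              [.25, 0., .25, .5]])
W_inf = np.ones((n, n)) / n                      # limit of W^k

assert np.allclose(W_inf @ W, W_inf)             # W_inf W = W_inf
assert np.allclose(W @ W_inf, W_inf)             # W W_inf = W_inf

delta = np.linalg.norm(W - W_inf, 2)             # spectral norm
assert delta < 1                                 # ||W - W_inf|| = delta < 1
assert np.linalg.norm(W - np.eye(n), 2) <= 2     # ||W - I|| <= 2
assert np.isclose(np.linalg.norm(W @ W, 2), 1)   # ||W^2|| = 1
print(delta)                                     # 0.5 for this ring, up to rounding
```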

$$\begin{aligned} \nabla \mathbf {f}_{k+1}&=\nabla \mathbf {f}(W(\varvec{\theta }_k-\alpha (1-\beta ) \mathbf {s}_k-\alpha \beta \mathbf {u}_{k}))\nonumber \\&=\nabla \mathbf {f}(W(\varvec{\theta }_k-\alpha (1-\beta ) W(\mathbf {s}_{k-1}+\mathbf {d}_{k}-\mathbf {d}_{k-1})-\alpha \beta \mathbf {u}_{k})). \end{aligned}$$
(A6)

Denote

$$\begin{aligned} \underline{\nabla \mathbf {f}}_{k+1}=\nabla \mathbf {f}(W(\varvec{\theta }_k-\alpha (1-\beta ) W(\mathbf {s}_{k-1}+\nabla \mathbf {f}_{k}-\mathbf {d}_{k-1})-\alpha \beta \mathbf {u}_{k})). \end{aligned}$$
(A7)

By Assumption 3,

$$\begin{aligned} \mathbb {E}[\langle \underline{\nabla \mathbf {f}}_{k+1},\nabla \mathbf {f}_k-\mathbf {d}_k\rangle \mid \mathcal {F}_k]=0. \end{aligned}$$
(A8)

In light of Assumption 1 and \(\Vert W^2\Vert =1\),

$$\begin{aligned} \Vert \nabla \mathbf {f}_{k+1}-\underline{\nabla \mathbf {f}}_{k+1}\Vert&\le \alpha (1-\beta ) L\Vert W^2(\mathbf {d}_k-\nabla \mathbf {f}_k)\Vert \nonumber \\&\le \alpha (1-\beta ) L\Vert \mathbf {d}_k-\nabla \mathbf {f}_k\Vert . \end{aligned}$$
(A9)

Using (A6)-(A9), one has

$$\begin{aligned} \mathbb {E}[\langle \nabla \mathbf {f}_{k+1},\nabla \mathbf {f}_k-\mathbf {d}_k\rangle \mid \mathcal {F}_k]&=\mathbb {E}[\langle \nabla \mathbf {f}_{k+1}-\underline{\nabla \mathbf {f}}_{k+1},\nabla \mathbf {f}_k-\mathbf {d}_k\rangle \mid \mathcal {F}_k]\nonumber \\&\le \mathbb {E}[\Vert \nabla \mathbf {f}_{k+1}-\underline{\nabla \mathbf {f}}_{k+1}\Vert \Vert \nabla \mathbf {f}_k-\mathbf {d}_k\Vert \mid \mathcal {F}_k]\nonumber \\&\le \alpha (1-\beta ) L\mathbb {E}[\Vert \nabla \mathbf {f}_k-\mathbf {d}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\le \alpha (1-\beta ) Ln\sigma ^2. \end{aligned}$$
(A10)

Substituting (A5) and (A10) into (A4) yields

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le 8L^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +4\alpha ^2\beta ^2 L^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +8\alpha ^2(1-\beta )^2 L^2\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k] +8\alpha ^2(1-\beta )^2L^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +2(1+\alpha (1-\beta ) L)n\sigma ^2+8\alpha ^2(1-\beta )^2L^2\sigma ^2. \end{aligned}$$
(A11)

Consider the third term on the right-hand side of (A3). Using \(W_\infty (W-W_\infty )=0\) and (4c), we have

$$\begin{aligned}&\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle W\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle W^2(\mathbf {s}_{k-1}+\mathbf {d}_{k}-\mathbf {d}_{k-1}),(W\!-\!W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle W^2\mathbf {d}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\langle W^2(\mathbf {d}_{k}-\nabla \mathbf {f}_{k}),(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})^T(W_\infty -W^TW^2)(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})\mid \mathcal {F}_k], \end{aligned}$$
(A12)

where the second and third equalities use \(\mathbb {E}[\mathbf {d}_k\mid \mathcal {F}_k]=\nabla \mathbf {f}_k\). Since, for any \(i\ne j\in \mathcal {V}\), it holds that

$$\begin{aligned} \mathbb {E}[\langle d_i(\theta ^i_k,\xi ^i_k)-\nabla f_i(\theta ^i_k), d_j(\theta ^j_k,\xi ^j_k)-\nabla f_j(\theta ^j_k)\rangle \mid \mathcal {F}_k]=0, \end{aligned}$$

we get

$$\begin{aligned}&\mathbb {E}[ (\mathbf {d}_{k}-\nabla \mathbf {f}_{k})^T(W_\infty -W^TW^2)(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})^T\text {diag}(W_\infty -W^TW^2)(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})\mid \mathcal {F}_k]\nonumber \\&\quad \le \mathbb {E}[(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})^T\text {diag}(W_\infty )(\mathbf {d}_{k}-\nabla \mathbf {f}_{k})\mid \mathcal {F}_k]\nonumber \\&\quad =(1/n)\mathbb {E}[\Vert \mathbf {d}_{k}-\nabla \mathbf {f}_{k}\Vert ^2\mid \mathcal {F}_k], \end{aligned}$$
(A13)

where the inequality uses the fact that \(\text {diag}(W^TW^2)\) is nonnegative. Using Assumption 3 and combining (A13) with (A12), we obtain

$$\begin{aligned} \mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\le \sigma ^2. \end{aligned}$$
(A14)
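The step from (A12) to (A14) rests on the cross-agent noise terms vanishing: with independent zero-mean per-agent noise, only the diagonal blocks of \(W_\infty\) survive in the quadratic form, so \(\mathbb {E}[\varvec{\epsilon }^TW_\infty \varvec{\epsilon }]=\frac{1}{n}\mathbb {E}[\Vert \varvec{\epsilon }\Vert ^2]\le \sigma ^2\). A small Monte Carlo check, using Gaussian noise as an illustrative stand-in for the stochastic-gradient errors:

```python
import numpy as np

# Monte Carlo check of the identity behind (A13): with independent zero-mean
# per-agent noise, only diag(W_inf) survives in the quadratic form, so
# E[eps^T W_inf eps] = (1/n) E[||eps||^2].
rng = np.random.default_rng(0)
n, p = 3, 2                                      # agents, dimension (illustrative)
W_inf = np.kron(np.ones((n, n)) / n, np.eye(p))  # (1/n) 1 1^T  (kron)  I_p

samples = 200_000
eps = rng.standard_normal((samples, n * p))      # stand-in per-agent noise
quad = np.einsum('si,ij,sj->s', eps, W_inf, eps)

mc = quad.mean()                                 # estimate of E[eps^T W_inf eps]
target = (eps ** 2).sum(axis=1).mean() / n       # estimate of (1/n) E ||eps||^2
print(mc, target)                                # both close to p = 2 here
```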

Consider the fourth term on the right-hand side of (A3). By the Cauchy–Schwarz inequality, we have

$$\begin{aligned}&\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \nonumber \\&\quad \le \Vert W\mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \Vert (W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\Vert \nonumber \\&\quad \le \delta ^2L\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert , \end{aligned}$$
(A15)

where the last inequality applies \(\Vert W-W_\infty \Vert =\delta\), Lemma 2(c) and Assumption 1. Next, we bound \(\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert\). Note that

$$\begin{aligned}&\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert \nonumber \\&\quad =\Vert W(\varvec{\theta }_{k}-\alpha (1-\beta )\mathbf {s}_k-\alpha \beta \mathbf {u}_{k})-\varvec{\theta }_{k}\Vert \nonumber \\&\quad =\Vert (W-I_{np})(\varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k})-\alpha (1-\beta )W\mathbf {s}_k-\alpha \beta W\mathbf {u}_k\Vert \nonumber \\&\quad \le 2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert +\alpha \beta \Vert \mathbf {u}_{k}\Vert +\alpha (1-\beta )\Vert W\mathbf {s}_k\Vert \nonumber \\&\quad \le 2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert +\alpha \beta \Vert \mathbf {u}_{k}\Vert +\alpha (1-\beta )\delta \Vert \mathbf {s}_k-W_\infty \mathbf {s}_k\Vert +\alpha (1-\beta ) \sqrt{n}\Vert \overline{d}_k\Vert , \end{aligned}$$
(A16)

where the first inequality uses \(\Vert W-I_{np}\Vert \le 2\) and \(\Vert W\Vert =1\). Substituting (A16) into (A15) yields

$$\begin{aligned}&\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \nonumber \\&\quad \le \alpha (1-\beta ) L\delta ^3\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2+2\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \delta L\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert \nonumber \\&\quad +\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \alpha \beta \delta L\Vert \mathbf {u}_k\Vert +\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \alpha (1-\beta ) L\delta \sqrt{n}\Vert \overline{d}_k\Vert \nonumber \\&\quad \le \alpha (1-\beta ) L\delta ^3\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2+\eta _1\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +\eta _1^{-1}\delta ^2L^2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\nonumber \\&\quad +0.5\eta _2\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +0.5\eta _2^{-1}\alpha ^2\beta ^2\delta ^2 L^2\Vert \mathbf {u}_k\Vert ^2\nonumber \\&\quad +0.5\eta _3\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +0.5\eta _3^{-1}\alpha ^2(1-\beta )^2L^2\delta ^2n\Vert \overline{d}_k\Vert ^2, \end{aligned}$$
(A17)

where \(\eta _i>0, i=1,2,3\) and the last inequality uses Lemma 1(a) with \(p=1\). Taking the conditional expectation on (A17) and using (13) lead to

$$\begin{aligned}&\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad \le (\alpha (1-\beta ) L\delta +\eta _1+0.5\eta _2+0.5\eta _3)\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +\eta _1^{-1}L^2\delta ^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +0.5\eta _2^{-1}\alpha ^2L^2\beta ^2\delta ^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +0.5\eta _3^{-1}\alpha ^2(1-\beta )^2L^2\delta ^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k]+0.5\eta _3^{-1}\alpha ^2(1-\beta )^2L^2\delta ^2\sigma ^2. \end{aligned}$$
(A18)

Applying (A11), (A14) and (A18) to (A3) and then taking the total expectation, we further obtain

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {s}_{k+1}-W_\infty \mathbf {s}_{k+1}\Vert ^2]\nonumber \\&\quad \le (8+2\eta _1^{-1})L^2\delta ^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2]\nonumber \\&\quad +(1+2\alpha (1-\beta ) L\delta +8\alpha ^2(1-\beta )^2 L^2\delta ^2+2\eta _1+\eta _2+\eta _3)\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2]\nonumber \\&\quad +(4+\eta _2^{-1})\alpha ^2L^2\beta ^2\delta ^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2]+(8+\eta _3^{-1})\alpha ^2(1-\beta )^2L^2\delta ^2n\mathbb {E}[\Vert h_k\Vert ^2]\nonumber \\&\quad +(8+\eta _3^{-1})\alpha ^2(1-\beta )^2L^2\delta ^2\sigma ^2+2((1+\alpha (1-\beta ) L)n\delta ^2+1)\sigma ^2. \end{aligned}$$
(A19)

We set \(\eta _1=\frac{1-\delta ^2}{8\delta ^2}\), \(\eta _2=\frac{1-\delta ^2}{32\delta ^2}\) and \(\eta _3=\frac{1-\delta ^2}{16\delta ^2}\). If \(0<\alpha \le \frac{1-\delta ^2}{16(1-\beta )L\delta }\), then we get

$$\begin{aligned} 1+2\alpha (1-\beta ) L\delta +8\alpha ^2(1-\beta )^2 L^2\delta ^2+2\eta _1+\eta _2+\eta _3\le \frac{1+\delta ^2}{2\delta ^2}. \end{aligned}$$
(A20)

If \(0<\alpha \le \min \left\{ \frac{1-\delta ^2}{8\delta },1\right\} \frac{1}{2(1-\beta )L}\), then the desired result follows from applying (A20) and the values of \(\eta _1\), \(\eta _2\) and \(\eta _3\) to (A19). \(\square\)
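The choice of \(\eta _1,\eta _2,\eta _3\) and the step-size cap can be checked directly. Writing \(t=\alpha (1-\beta )L\delta\), the cap gives \(t\le \frac{1-\delta ^2}{16}\), so \(L\) and \(\beta\) cancel out of (A20); the following sketch verifies the inequality over a grid of \(\delta \in (0,1)\):

```python
import numpy as np

# Verify (A20): with eta_1 = (1-d^2)/(8 d^2), eta_2 = (1-d^2)/(32 d^2),
# eta_3 = (1-d^2)/(16 d^2), and alpha at its cap (1-d^2)/(16 (1-beta) L d),
# the quantity 1 + 2t + 8t^2 + 2 eta_1 + eta_2 + eta_3 (where
# t = alpha (1-beta) L d) stays below (1+d^2)/(2 d^2).
for d in np.linspace(0.05, 0.999, 200):    # delta ranges over (0, 1)
    t = (1 - d**2) / 16                    # t at the step-size cap
    eta1 = (1 - d**2) / (8 * d**2)
    eta2 = (1 - d**2) / (32 * d**2)
    eta3 = (1 - d**2) / (16 * d**2)
    lhs = 1 + 2*t + 8*t**2 + 2*eta1 + eta2 + eta3
    rhs = (1 + d**2) / (2 * d**2)
    assert lhs <= rhs + 1e-12
print("(A20) holds on the grid")
```

The margin tightens as \(\delta \rightarrow 1\) (a poorly connected network), which is consistent with the step-size caps shrinking like \(1-\delta ^2\).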

Appendix B Proof of Lemma 17

Proof

First, using the same argument as in the proof of Lemma 8, we have

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {s}_{k+1}-W_\infty \mathbf {s}_{k+1}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le \delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k]+ \delta ^2\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +2\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad +2\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \mid \mathcal {F}_k]. \end{aligned}$$
(B21)

Consider the second term on the right-hand side of (B21). Following the same steps as in (A4), we have

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le \!2n\sigma ^2+L^2\mathbb {E}[\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +2\mathbb {E}[\langle \nabla \mathbf {f}_{k+1},\nabla \mathbf {f}_{k}-\mathbf {d}_{k}\rangle \mid \mathcal {F}_k]. \end{aligned}$$
(B22)

Then, let us consider the second term on the right-hand side of (B22).

$$\begin{aligned}&\mathbb {E}[\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\Vert W(\varvec{\theta }_k-\alpha (1-\beta ^2)\mathbf {s}_k -\alpha \beta ^2\mathbf {u}_k)-\varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad =\mathbb {E}[\Vert (W-I_{np})(\varvec{\theta }_k-W_\infty \varvec{\theta }_{k}) -\alpha (1-\beta ^2) W \mathbf {s}_{k}\!-\!\alpha \beta ^2W\mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le 2\Vert W-I_{np}\Vert ^2\mathbb {E}[\Vert \varvec{\theta }_k-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +4\alpha ^2(1-\beta ^2)^2\mathbb {E}[\Vert W\mathbf {s}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +4\alpha ^2\beta ^4\mathbb {E}[\Vert \mathbf {u}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le 8\mathbb {E}[\Vert \varvec{\theta }_k-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +4\alpha ^2\beta ^4\mathbb {E}[\Vert \mathbf {u}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +8\alpha ^2(1-\beta ^2)^2\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k} -W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k] +8\alpha ^2(1-\beta ^2)^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +8\alpha ^2(1-\beta ^2)^2\sigma ^2, \end{aligned}$$
(B23)

where the first inequality applies Lemma 1(c) with \(\eta =1\) and the last inequality uses \(\Vert W-I_{np}\Vert \le 2\) and (13). Next, we consider the last term on the right-hand side of (B22). From (5), it follows that

$$\begin{aligned} \nabla \mathbf {f}_{k+1}&=\nabla \mathbf {f}(W(\varvec{\theta }_k-\alpha (1-\beta ^2) \mathbf {s}_k-\alpha \beta ^2\mathbf {u}_{k}))\\&=\nabla \mathbf {f}(W(\varvec{\theta }_k-\alpha (1-\beta ^2)W(\mathbf {s}_{k-1}+\mathbf {d}_{k}-\mathbf {d}_{k-1})-\alpha \beta ^2\mathbf {u}_{k})). \end{aligned}$$

Denote

$$\begin{aligned} \underline{\nabla \mathbf {f}}_{k+1} =\nabla \mathbf {f}(W(\varvec{\theta }_k -\alpha (1-\beta ^2) W(\mathbf {s}_{k-1}+\nabla \mathbf {f}_{k} -\mathbf {d}_{k-1})-\alpha \beta ^2\mathbf {u}_{k})). \end{aligned}$$

Since \(\underline{\nabla \mathbf {f}}_{k+1}\) is \(\mathcal {F}_k\)-measurable, Assumption 3 gives

$$\begin{aligned} \mathbb {E}[\langle \underline{\nabla \mathbf {f}}_{k+1},\nabla \mathbf {f}_k-\mathbf {d}_k\rangle \mid \mathcal {F}_k]=0. \end{aligned}$$
(B24)

In light of Assumption 1 and \(\Vert W^2\Vert =1\),

$$\begin{aligned} \Vert \nabla \mathbf {f}_{k+1}-\underline{\nabla \mathbf {f}}_{k+1}\Vert\le & {} \,\alpha (1-\beta ^2) L\Vert W^2(\mathbf {d}_k-\nabla \mathbf {f}_k)\Vert \nonumber \\\le & {} \,\alpha (1-\beta ^2) L\Vert \mathbf {d}_k-\nabla \mathbf {f}_k\Vert . \end{aligned}$$
(B25)

Using (B24) and (B25), we have

$$\begin{aligned} \mathbb {E}[\langle \nabla \mathbf {f}_{k+1},\nabla \mathbf {f}_k-\mathbf {d}_k\rangle \mid \mathcal {F}_k]\le & {}\, \mathbb {E}[\Vert \nabla \mathbf {f}_{k+1}-\underline{\nabla \mathbf {f}}_{k+1}\Vert \Vert \nabla \mathbf {f}_k-\mathbf {d}_k\Vert \mid \mathcal {F}_k]\nonumber \\\le & {} \,\alpha (1-\beta ^2) L\mathbb {E}[\Vert \nabla \mathbf {f}_k-\mathbf {d}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\\le & {}\, \alpha (1-\beta ^2) Ln\sigma ^2. \end{aligned}$$
(B26)

Combining (B23) and (B26) with (B22) yields

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {d}_{k+1}-\mathbf {d}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \le \! 8L^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +4\alpha ^2\beta ^4L^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +8\alpha ^2(1-\beta ^2)^2 L^2\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_k\Vert ^2\mid \mathcal {F}_k] +8\alpha ^2(1-\beta ^2)^2L^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad +2(1+\alpha (1-\beta ^2) L)n\sigma ^2 +8\alpha ^2(1-\beta ^2)^2L^2\sigma ^2. \end{aligned}$$
(B27)

Consider the third term on the right-hand side of (B21). It has the same bound as the corresponding term of (A3), that is,

$$\begin{aligned} \mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k}-\mathbf {d}_{k})\rangle \mid \mathcal {F}_k]\le \sigma ^2. \end{aligned}$$
(B28)

Consider the fourth term on the right-hand side of (B21). Following the same steps as in (A15), we have

$$\begin{aligned}&\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \nonumber \\&\quad \le \!\delta ^2L\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert . \end{aligned}$$
(B29)

Moreover,

$$\begin{aligned}&\Vert \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\Vert \nonumber \\&\quad =\!\Vert W(\varvec{\theta }_{k}-\alpha (1-\beta ^2)\mathbf {s}_k-\alpha \beta ^2\mathbf {u}_{k})-\varvec{\theta }_{k}\Vert \nonumber \\&\quad =\!\Vert (W-I_{np})(\varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k})-\alpha (1-\beta ^2) W\mathbf {s}_k-\alpha \beta ^2 W\mathbf {u}_k\Vert \nonumber \\&\quad \le \!2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert +\alpha \beta ^2\Vert \mathbf {u}_{k}\Vert +\alpha (1-\beta ^2)\Vert W\mathbf {s}_k\Vert \nonumber \\&\quad \le \!2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert +\alpha \beta ^2\Vert \mathbf {u}_{k}\Vert +\alpha (1-\beta ^2)\delta \Vert \mathbf {s}_k-W_\infty \mathbf {s}_k\Vert +\alpha (1-\beta ^2)\sqrt{n}\Vert \overline{d}_k\Vert ,\,\,\,\,\,\,\, \end{aligned}$$
(B30)

where the first inequality uses \(\Vert W-I_{np}\Vert \le 2\) and \(\Vert W\Vert =1\). Substituting (B30) into (B29) yields

$$\begin{aligned}&\!\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \nonumber \\&\quad \le \!\alpha (1-\beta ^2) L\delta ^3\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +2\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \delta L\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert \nonumber \\&\quad +\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \alpha \beta ^2\delta L\Vert \mathbf {u}_k\Vert +\delta \Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert \alpha (1-\beta ^2) L\delta \sqrt{n}\Vert \overline{d}_k\Vert \nonumber \\&\quad \le \alpha (1-\beta ^2) L\delta ^3\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2+\eta _1\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +\eta _1^{-1}\delta ^2L^2\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\nonumber \\&\quad \!+0.5\eta _2\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +0.5\eta _2^{-1}\alpha ^2\beta ^4\delta ^2 L^2\Vert \mathbf {u}_k\Vert ^2\nonumber \\&\quad \!+0.5\eta _3\delta ^2\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2 +0.5\eta _3^{-1}\alpha ^2(1-\beta ^2)^2 L^2\delta ^2n\Vert \overline{d}_k\Vert ^2, \end{aligned}$$
(B31)

where \(\eta _i>0, i=1,2,3\) and the last inequality uses Lemma 1(a) with \(p=1\). Taking the conditional expectation of (B31) and using (13) leads to

$$\begin{aligned}&\!\mathbb {E}[\langle (W-W_\infty )\mathbf {s}_{k},(W-W_\infty )(\nabla \mathbf {f}_{k+1}-\nabla \mathbf {f}_{k})\rangle \mid \mathcal {F}_k]\nonumber \\&\quad \le \! (\alpha (1-\beta ^2) L\delta +\eta _1+0.5\eta _2+0.5\eta _3)\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \!+\eta _1^{-1}L^2\delta ^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2\mid \mathcal {F}_k] +0.5\eta _2^{-1}\alpha ^2L^2\beta ^4\delta ^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2\mid \mathcal {F}_k]\nonumber \\&\quad \!+0.5\eta _3^{-1}\alpha ^2(1-\beta ^2)^2L^2\delta ^2n\mathbb {E}[\Vert h_k\Vert ^2\mid \mathcal {F}_k] +0.5\eta _3^{-1}\alpha ^2(1-\beta ^2)^2L^2\delta ^2\sigma ^2. \end{aligned}$$
(B32)

Applying (B27), (B28) and (B32) to (B21) and then taking the total expectation, we further obtain

$$\begin{aligned}&\mathbb {E}[\Vert \mathbf {s}_{k+1}-W_\infty \mathbf {s}_{k+1}\Vert ^2]\nonumber \\&\quad \le \!(8+2\eta _1^{-1})L^2\delta ^2\mathbb {E}[\Vert \varvec{\theta }_{k}-W_\infty \varvec{\theta }_{k}\Vert ^2]\nonumber \\&\quad \!+(1+2\alpha (1-\beta ^2) L\delta +8\alpha ^2(1-\beta ^2)^2 L^2\delta ^2+2\eta _1+\eta _2+\eta _3)\delta ^2\mathbb {E}[\Vert \mathbf {s}_{k}-W_\infty \mathbf {s}_{k}\Vert ^2]\nonumber \\&\quad \!+(4+\eta _2^{-1})\alpha ^2L^2\beta ^4\delta ^2\mathbb {E}[\Vert \mathbf {u}_k\Vert ^2] +(8+\eta _3^{-1})\alpha ^2(1-\beta ^2)^2L^2\delta ^2n\mathbb {E}[\Vert h_k\Vert ^2]\nonumber \\&\quad \!+(8+\eta _3^{-1})\alpha ^2(1-\beta ^2)^2L^2\delta ^2\sigma ^2 +2((1+\alpha (1-\beta ^2) L)n\delta ^2+1)\sigma ^2. \end{aligned}$$
(B33)

We set \(\eta _1=\frac{1-\delta ^2}{8\delta ^2}\), \(\eta _2=\frac{1-\delta ^2}{32\delta ^2}\) and \(\eta _3=\frac{1-\delta ^2}{16\delta ^2}\). It can then be verified that if \(0<\alpha \le \frac{1-\delta ^2}{16(1-\beta ^2)L\delta }\),

$$\begin{aligned} 1+2\alpha (1-\beta ^2) L\delta +8\alpha ^2(1-\beta ^2)^2 L^2\delta ^2+2\eta _1+\eta _2+\eta _3 \le \frac{1+\delta ^2}{2\delta ^2}. \end{aligned}$$
(B34)
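For completeness, the verification of (B34) under these choices of \(\eta _1\), \(\eta _2\) and \(\eta _3\) (a step left implicit above) can be sketched as follows. First,

$$\begin{aligned} 2\eta _1+\eta _2+\eta _3=\frac{1-\delta ^2}{4\delta ^2}+\frac{1-\delta ^2}{32\delta ^2}+\frac{1-\delta ^2}{16\delta ^2}=\frac{11(1-\delta ^2)}{32\delta ^2}, \end{aligned}$$

and

$$\begin{aligned} \frac{1+\delta ^2}{2\delta ^2}-1-\frac{11(1-\delta ^2)}{32\delta ^2} =\frac{16(1+\delta ^2)-32\delta ^2-11(1-\delta ^2)}{32\delta ^2} =\frac{5(1-\delta ^2)}{32\delta ^2}. \end{aligned}$$

Since \(0<\alpha \le \frac{1-\delta ^2}{16(1-\beta ^2)L\delta }\) implies \(\alpha (1-\beta ^2)L\delta \le \frac{1-\delta ^2}{16}\), we obtain \(2\alpha (1-\beta ^2)L\delta +8\alpha ^2(1-\beta ^2)^2L^2\delta ^2\le \frac{1-\delta ^2}{8}+\frac{(1-\delta ^2)^2}{32}\le \frac{5(1-\delta ^2)}{32}\le \frac{5(1-\delta ^2)}{32\delta ^2}\), which establishes (B34).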

If \(0<\alpha \le \min \{\frac{1-\delta ^2}{8\delta },1\}\frac{1}{2(1-\beta ^2)L}\), then the desired result follows from applying (B34) and the values of \(\eta _1\), \(\eta _2\) and \(\eta _3\) to (B33). \(\square\)
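As an illustrative numerical sanity check (not part of the proof), the following Python sketch evaluates both sides of (B34) at the largest admissible step size \(\alpha =\frac{1-\delta ^2}{16(1-\beta ^2)L\delta }\) over a small grid of parameters. The function name `b34_holds` and the sampled values of \(\delta \), \(\beta \) and \(L\) are our own choices for illustration.

```python
# Numerical sanity check of inequality (B34): with
#   eta1 = (1-d^2)/(8 d^2), eta2 = (1-d^2)/(32 d^2), eta3 = (1-d^2)/(16 d^2),
# and alpha at its upper bound (1-d^2)/(16 (1-b^2) L d), verify
#   1 + 2 a(1-b^2)L d + 8 a^2(1-b^2)^2 L^2 d^2 + 2 eta1 + eta2 + eta3
#     <= (1+d^2)/(2 d^2).

def b34_holds(delta: float, beta: float, L: float) -> bool:
    """Check (B34) at the largest admissible step size alpha."""
    alpha = (1 - delta**2) / (16 * (1 - beta**2) * L * delta)
    eta1 = (1 - delta**2) / (8 * delta**2)
    eta2 = (1 - delta**2) / (32 * delta**2)
    eta3 = (1 - delta**2) / (16 * delta**2)
    # x = alpha*(1-beta^2)*L*delta equals (1-delta^2)/16 at the boundary,
    # so beta and L cancel out here.
    x = alpha * (1 - beta**2) * L * delta
    lhs = 1 + 2 * x + 8 * x**2 + 2 * eta1 + eta2 + eta3
    rhs = (1 + delta**2) / (2 * delta**2)
    return lhs <= rhs

if __name__ == "__main__":
    # Spot-check spectral parameters delta in (0,1), momentum parameters
    # beta in [0,1), and Lipschitz constants L of different magnitudes.
    for delta in (0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
        for beta in (0.0, 0.5, 0.9):
            for L in (0.1, 1.0, 10.0):
                assert b34_holds(delta, beta, L), (delta, beta, L)
    print("(B34) holds at the boundary step size for all sampled parameters")
```

Note that at the boundary step size the product \(\alpha (1-\beta ^2)L\delta \) reduces to \(\frac{1-\delta ^2}{16}\), so the check is effectively independent of \(\beta \) and \(L\), matching the analytical verification.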


Cite this article

Gao, J., Liu, XW., Dai, YH. et al. Distributed stochastic gradient tracking methods with momentum acceleration for non-convex optimization. Comput Optim Appl 84, 531–572 (2023). https://doi.org/10.1007/s10589-022-00432-5
