
Communication-efficient local SGD with age-based worker selection


Abstract

A major bottleneck of distributed learning under the parameter server (PS) framework is the communication cost caused by frequent bidirectional transmissions between the PS and the workers. To address this issue, local stochastic gradient descent (SGD) and worker selection have been exploited to reduce the communication frequency and the number of participating workers per round, respectively. However, partial participation can be detrimental to the convergence rate, especially for heterogeneous local datasets. In this paper, to improve communication efficiency and speed up the training process, we develop a novel worker selection strategy named AgeSel. The key enabler of AgeSel is the use of the workers' ages to balance their participation frequencies. The convergence of local SGD with the proposed age-based partial worker participation is rigorously established. Simulation results demonstrate that the proposed AgeSel strategy can significantly reduce the number of training rounds needed to achieve a target accuracy, as well as the communication cost. The influence of the algorithm's hyper-parameter is also explored to manifest the benefit of age-based worker selection.
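
The full AgeSel algorithm is given in the main text (not reproduced on this page); as a rough illustration only, the following Python sketch mimics one plausible form of the selection step described in the appendices: up to \(A^j=\min \{S,S^j\}\) workers are chosen by age, and the remaining slots are filled by sampling according to the weights \(p_m=N_m/N\). The age-threshold rule, the reset logic and all names here are our assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def agesel_round(ages, data_sizes, S, tau, rng):
    """One illustrative selection step: prioritize workers whose age
    (rounds since last participation) is at least tau, then fill the
    remaining slots by sampling proportionally to local dataset sizes."""
    M = len(ages)
    p = data_sizes / data_sizes.sum()          # p_m = N_m / N
    old = np.flatnonzero(ages >= tau)          # candidates selected by age
    A_j = min(S, len(old))                     # A^j = min{S, S^j}
    by_age = rng.choice(old, size=A_j, replace=False) if A_j > 0 else np.array([], dtype=int)
    rest = np.setdiff1d(np.arange(M), by_age)
    q = p[rest] / p[rest].sum()
    by_weight = rng.choice(rest, size=S - A_j, replace=False, p=q)
    selected = np.concatenate([by_age, by_weight])
    ages += 1                                  # every worker ages by one round
    ages[selected] = 0                         # participants' ages are reset
    return selected

rng = np.random.default_rng(0)
ages = np.zeros(10, dtype=int)
sizes = rng.integers(100, 1000, size=10).astype(float)
for j in range(5):
    print(j, sorted(agesel_round(ages, sizes, S=5, tau=3, rng=rng)))
```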


Data Availability

The EMNIST dataset can be downloaded from https://www.nist.gov/itl/products-and-services/emnist-dataset.

Notes

  1. More training rounds also imply more rounds of communication between the PS and the workers.

  2. Here we do not average over Monte Carlo runs, so that the effect of S on the fluctuation of the curves remains visible.

References

  1. Li M, Andersen DG, Park JW, Smola AJ, Su BY (2014) Scaling distributed machine learning with the parameter server. In: Proceedings of the USENIX OSDI, pp 583–598

  2. Lian X, Zhang C, Zhang H, Hsieh CJ, Zhang W, Liu J (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Proceedings of the NeurIPS, vol 30, pp 5336–5346

  3. Garcia Lopez P, Montresor A, Epema D, Datta A, Higashino T, Iamnitchi A et al (2015) Edge-centric computing: vision and challenges. ACM SIGCOMM Comput Commun Rev 45(5):37–42

  4. Hong K, Lillethun D, Ramachandran U, Ottenwälder B, Koldehofe B (2013) Mobile fog: a programming model for large-scale applications on the internet of things. In: Proceedings of the Second ACM SIGCOMM Workshop on Mobile Cloud Computing, pp 15–20

  5. Bonomi F, Milito R, Zhu J, Addepalli S (2012) Fog computing and its role in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, pp 13–16

  6. Deshpande A, Guestrin C, Madden SR, Hellerstein JM, Hong W (2005) Model-based approximate querying in sensor networks. VLDB J 14:417–443

  7. Dean J, Corrado GS, Monga R, Kai C, Ng AY (2012) Large scale distributed deep networks. In: Proceedings of the NeurIPS, vol 25, pp 1223–1231

  8. Yan N, Wang K, Pan C, Chai KK (2022) Performance analysis for channel-weighted federated learning in OMA wireless networks. IEEE Signal Process Lett 29:772–776

  9. Uddin MP, Xiang Y, Yearwood J, Gao L (2021) Robust federated averaging via outlier pruning. IEEE Signal Process Lett 29:409–413

  10. Han S-S, Kim Y-K, Jeon Y-B, Park J, Park D-S, Hwang D, Jeong C-S (2020) Distributed deep learning platform for pedestrian detection on IT convergence environment. J Supercomput 76(7):5460–5485

  11. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of the COMPSTAT, pp 177–186

  12. McMahan HB, Moore E, Ramage D, Hampson S, Arcas B (2017) Communication-efficient learning of deep networks from decentralized data. In: Proceedings of Artificial Intelligence and Statistics, pp 1273–1282

  13. Chen T, Guo Z, Sun Y, Yin W (2021) CADA: communication-adaptive distributed Adam. In: Proceedings of the ICAIS, pp 613–621

  14. Chen W, Horvath S, Richtarik P (2020) Optimal client sampling for federated learning. arXiv preprint arXiv:2010.13723

  15. Goetz J, Malik K, Bui D, Moon S, Liu H, Kumar A (2019) Active federated learning. arXiv preprint arXiv:1909.12641

  16. Zinkevich M, Weimer M, Smola AJ, Li L (2010) Parallelized stochastic gradient descent. Proc Neural Inf Process Syst 23:2595–2603

  17. Arjevani Y, Shamir O (2015) Communication complexity of distributed convex learning and optimization. In: Proceedings of Advances in Neural Information Processing Systems, vol 28, pp 1756–1764

  18. Zhou F, Cong G (2018) On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 3219–3227

  19. Wang J, Joshi G (2021) Cooperative SGD: a unified framework for the design and analysis of local-update SGD algorithms. J Mach Learn Res 22(1):9709–9758

  20. Woodworth B, Wang J, McMahan B, Srebro N (2018) Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In: Proceedings of the Neural Information Processing Systems, vol 31, pp 8505–8515

  21. Li X, Huang K, Yang W, Wang S, Zhang Z (2019) On the convergence of FedAvg on non-iid data. arXiv preprint arXiv:1907.02189

  22. Haddadpour F, Kamani MM, Mahdavi M, Cadambe VR (2019) Local SGD with periodic averaging: tighter analysis and adaptive synchronization. In: Proceedings of the Neural Information Processing Systems, vol 32, pp 11082–11094

  23. Wang S, Tuor T, Salonidis T, Leung KK, Makaya C, He T, Chan K (2019) Adaptive federated learning in resource constrained edge computing systems. IEEE J Sel Areas Commun 37(6):1205–1221

  24. Yu H, Yang S, Zhu S (2019) Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 5693–5700

  25. Khaled A, Mishchenko K, Richtárik P (2019) First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715

  26. Malinovskiy G, Kovalev D, Gasanov E, Condat L, Richtarik P (2020) From local SGD to local fixed-point methods for federated learning. In: International Conference on Machine Learning, pp 6692–6701

  27. Li T, Sahu AK, Zaheer M, Sanjabi M, Talwalkar A, Smith V (2020) Federated optimization in heterogeneous networks. In: Proceedings of Machine Learning and Systems, vol 2, pp 429–450

  28. Yang H, Fang M, Liu J (2021) Achieving linear speedup with partial worker participation in non-iid federated learning. arXiv preprint arXiv:2101.11203

  29. Ye Q, Zhou Y, Shi M, Lv J (2022) FLSGD: free local SGD with parallel synchronization. J Supercomput 78(10):12410–12433

  30. Ribero M, Vikalo H (2020) Communication-efficient federated learning via optimal client sampling. arXiv preprint arXiv:2007.15197

  31. Cho YJ, Gupta S, Joshi G, Yağan O (2020) Bandit-based communication-efficient client selection strategies for federated learning. In: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, pp 1066–1069

  32. Shah V, Wu X, Sanghavi S (2020) Choosing the sample with lowest loss makes SGD robust. In: International Conference on Artificial Intelligence and Statistics, pp 2120–2130

  33. Cho YJ, Wang J, Joshi G (2022) Towards understanding biased client selection in federated learning. In: International Conference on Artificial Intelligence and Statistics, pp 10351–10375

  34. Yin D, Pananjady A, Lam M, Papailiopoulos D, Ramchandran K, Bartlett P (2018) Gradient diversity: a key ingredient for scalable distributed learning. In: Proceedings of the ICAIS, pp 1998–2007

  35. Kaul S, Yates R, Gruteser M (2012) Real-time status: How often should one update? In: Proceedings of the IEEE INFOCOM, pp 2731–2735

  36. Ozfatura E, Buyukates B, Gündüz D, Ulukus S (2020) Age-based coded computation for bias reduction in distributed learning. In: Proceedings of the IEEE GLOBECOM, pp 1–6

  37. Yang HH, Arafa A, Quek TQS, Vincent Poor H (2020) Age-based scheduling policy for federated learning in mobile edge networks. In: Proceedings of the IEEE ICASSP, pp 8743–8747

  38. Yang HH, Liu Z, Quek TQ, Poor HV (2019) Scheduling policies for federated learning in wireless networks. IEEE Trans Commun 68(1):317–333

  39. Buyukates B, Ulukus S (2021) Timely communication in federated learning. In: Proceedings of the IEEE INFOCOM, pp 1–6

  40. Reddi S, Charles Z, Zaheer M, Garrett Z, Rush K, Konečný J et al (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295


Funding

This work was supported by the National Natural Science Foundation of China Grants No. 62101134 and No. 62071126, and the Innovation Program of Shanghai Municipal Science and Technology Commission Grant 20JC1416400.

Author information

Contributions

FZ and JZ came up with the idea; FZ wrote the manuscript; JZ and XW revised and polished the manuscript.

Corresponding author

Correspondence to Jingjing Zhang.


Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Lemma 1

With the L-smoothness of the objective function \({\mathcal {L}}\) in Assumption 1, taking expectation over all the randomness in \({\mathcal {L}}(\varvec{\theta }^{j+1})\), we have:

$$\begin{aligned} {\mathbb {E}}[{\mathcal {L}}(\varvec{\theta }^{j+1})]&\le {\mathcal {L}}(\varvec{\theta }^{j}) + \left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}[\varvec{\theta }^{j+1}-\varvec{\theta }^j] \right\rangle +\frac{L}{2}{\mathbb {E}}[\left\| \varvec{\theta }^{j+1} -\varvec{\theta }^j\right\| ^2]\nonumber \\&\overset{(a1)}{=}{\mathcal {L}}(\varvec{\theta }^{j}) + \left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}[g^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})-\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})] \right\rangle + \frac{L}{2}{\mathbb {E}}[\left\| g^j\right\| ^2]\nonumber \\&\overset{(b1)}{=}\ {\mathcal {L}}(\varvec{\theta }^{j}) -\eta U\left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j}) \right\| ^2 + \underbrace{\left\langle \nabla {\mathcal {L}} (\varvec{\theta }^{j}), {\mathbb {E}}[g^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})] \right\rangle }_{T_1} + \frac{L}{2}\underbrace{{\mathbb {E}}[\left\| g^j\right\| ^2]}_{T_2}, \end{aligned}$$
(A1)

where (a1) adds and subtracts \(\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})\) and uses the definitions \(g^j=\frac{1}{S}\sum _{m\in {\mathcal {M}}_U^j}g_m^j\) and \(g_m^j=-\eta \sum _{u=0}^{U-1}\frac{1}{B}\sum _{b=1}^B\nabla {\mathcal {L}}(\varvec{\theta }_{m}^{j,u}; z_{m,b}^{j,u})\), with \(p_m=N_m/N\) denoting the data fraction of worker m; and (b1) follows directly from (a1) by splitting the inner product.
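
As a side illustration of these definitions (not part of the proof), the NumPy sketch below computes a worker's local update \(g_m^j\) by running U local mini-batch SGD steps of size B, and the PS update \(g^j\) by averaging over the S selected workers. The quadratic per-sample loss and all names are our assumptions for illustration.

```python
import numpy as np

def local_update(theta, data, eta, U, B, rng):
    """g_m^j = -eta * sum_u (1/B) sum_b grad(theta_m^{j,u}; z), following the
    definition above; the per-sample loss 0.5*||theta - z||^2 is a stand-in
    for the worker's true local loss."""
    theta_local = theta.copy()
    g = np.zeros_like(theta)
    for _ in range(U):
        batch = data[rng.choice(len(data), size=B, replace=False)]
        grad = (theta_local - batch).mean(axis=0)   # stochastic mini-batch gradient
        g -= eta * grad
        theta_local -= eta * grad                   # local SGD step
    return g

rng = np.random.default_rng(1)
d, M, S = 4, 10, 5
theta = np.zeros(d)
workers = [rng.normal(loc=m, size=(200, d)) for m in range(M)]  # heterogeneous local data
selected = rng.choice(M, size=S, replace=False)
g_j = np.mean([local_update(theta, workers[m], eta=0.1, U=3, B=16, rng=rng)
               for m in selected], axis=0)          # g^j = (1/S) sum_{m in M_U^j} g_m^j
theta = theta + g_j                                 # PS applies the aggregated update
print(theta)
```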

We then bound the terms \(T_1\) and \(T_2\), respectively. For the term \(T_1\), with the definition of the weighted global update \({\bar{g}}^j=\sum _{m\in {\mathcal {M}}}p_m g_m^j\) and the unbiased global update \({\tilde{g}}^j=\frac{1}{M}\sum _{m\in {\mathcal {M}}} g_m^j\), we have

$$\begin{aligned} T_1&= \left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}[g^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})] \right\rangle \nonumber \\&\overset{(a2)}{=}\Bigg \langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}\Bigg [\frac{1}{S}\left( \sum _{m\in {\mathcal {M}}_U^j \backslash {\mathcal {A}}^j}g_m^j+\sum _{m\in {\mathcal {A}}^j}g_m^j\right) \nonumber \\&\quad +\frac{1}{S}\left( (S-A^j)\eta U\nabla {\mathcal {L}} (\varvec{\theta }^{j})+A^j\eta U\nabla {\mathcal {L}}( \varvec{\theta }^{j})\right) \Bigg ] \Bigg \rangle \nonumber \\&\overset{(b2)}{=}\left\langle \nabla {\mathcal {L}} (\varvec{\theta }^{j}), {\mathbb {E}}\left[ \frac{S-A^j}{S} \left( {\bar{g}}^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j}) \right) \right] \right\rangle \nonumber \\&\quad + \left\langle \nabla {\mathcal {L}} (\varvec{\theta }^{j}), {\mathbb {E}}\left[ \frac{A^j}{S} \left( {\tilde{g}}^j+\eta U\nabla {\mathcal {L}} (\varvec{\theta }^{j})\right) \right] \right\rangle \end{aligned}$$
(A2)

where (a2) splits the workers selected at round j into those selected by age, collected in the set \({\mathcal {A}}^j\) with \(|{\mathcal {A}}^j|=A^j=\min \{S,S^j\}\), and those selected by weights, collected in \({\mathcal {M}}_U^j\backslash {\mathcal {A}}^j\); (b2) holds because age-based selection amounts to unweighted random selection in expectation, yielding \({\tilde{g}}^j\), while weight-based selection yields \({\bar{g}}^j\) in expectation. The two terms in (A2) are then bounded separately.

To bound the first term, we have:

$$\begin{aligned}&\left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}\left[ \frac{S-A^j}{S}\left( {\bar{g}}^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})\right) \right] \right\rangle \nonumber \\&\quad \overset{(a3)}{=}\ \frac{S-A^j}{S}\left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}\left[ -\frac{1}{B}\sum _{m=1}^M\sum _{u=0}^{U-1}\sum _{b=1}^B\eta p_m \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u};z_{m,b}^{j,u}) +\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})\right] \right\rangle \nonumber \\&\quad \overset{(b3)}{=}\ \frac{S-A^j}{S}\left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}\left[ -\sum _{m=1}^M\sum _{u=0}^{U-1}\eta p_m\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u}) +\eta U\sum _{m=1}^M p_m\nabla {\mathcal {L}}(\varvec{\theta }^{j})\right] \right\rangle \nonumber \\&\quad \overset{(c3)}{=}\ \frac{S-A^j}{S}\left\langle \sqrt{\eta U} \nabla {\mathcal {L}}(\varvec{\theta }^{j}), -\frac{\sqrt{\eta }}{\sqrt{U}}{\mathbb {E}}\left[ \sum _{m=1}^M p_m \sum _{u=0}^{U-1}\left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u}) - \nabla {\mathcal {L}}(\varvec{\theta }^j)\right) \right] \right\rangle \nonumber \\&\quad \overset{(d3)}{\le }\ \frac{S-A^j}{S}\left( \frac{\eta U}{2} \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{\eta }{2U}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M p_m\sum _{u=0}^{U-1}\left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u}) - \nabla {\mathcal {L}}(\varvec{\theta }^j)\right) \right\| ^2\right] \right) \nonumber \\&\quad \overset{(e3)}{\le }\ \frac{S-A^j}{S}\left( \frac{\eta U}{2} \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{\eta }{2}\sum _{m=1}^M p_m\sum _{u=0}^{U-1}{\mathbb {E}}\left[ \left\| \left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u}) - \nabla {\mathcal {L}}(\varvec{\theta }^j)\right) \right\| ^2\right] \right) \nonumber \\&\quad \overset{(f3)}{\le }\ \frac{S-A^j}{S}\left( \frac{\eta U}{2} \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{\eta L^2}{2}\sum _{m=1}^M p_m\sum _{u=0}^{U-1}{\mathbb {E}}\left[ \left\| \varvec{\theta }_{m}^{j,u}-\varvec{\theta }^j\right\| ^2\right] \right) \nonumber \\&\quad \overset{(g3)}{\le }\ \frac{S-A^j}{S}\left( \eta U\left( \frac{1}{2}+15 U^2\eta ^2 L^2\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) \right) , \end{aligned}$$
(A3)

where (a3) is derived from the definition of \({\bar{g}}^j\); (b3) and (c3) follow from direct computation; (d3) uses the fact that \(\langle {\textbf{x}},{\textbf{y}}\rangle \le \frac{1}{2}[\Vert {\textbf{x}}\Vert ^2+\Vert {\textbf{y}}\Vert ^2]\); (e3) is due to Jensen's inequality and the Cauchy–Schwarz inequality; (f3) follows from Assumption 1; and (g3) uses the fact that \(\sum _{m=1}^M p_m= 1\) together with [40, Lemma 3], which proves that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \varvec{\theta }_{m}^{j,u} -\varvec{\theta }^j\right\| ^2\right] \le 5U\eta ^2 (\sigma _L^2+6U\sigma _G^2)+30U^2\eta ^2 \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2, \end{aligned}$$
(A4)

under the condition that \(\eta \le \frac{1}{8LU}\), where \(\sigma _L\) and \(\sigma _G\) are two constants defined in Assumption 2.

Likewise, the second term in (A2) can be bounded as follows, with \(p_m\) replaced by \(\frac{1}{M}\):

$$\begin{aligned}&\left\langle \nabla {\mathcal {L}}(\varvec{\theta }^{j}), {\mathbb {E}}\left[ \frac{A^j}{S}\left( {\tilde{g}}^j+\eta U\nabla {\mathcal {L}}(\varvec{\theta }^{j})\right) \right] \right\rangle \nonumber \\&\quad \le \frac{A^j}{S}\left( \eta U\left( \frac{1}{2}+15 U^2\eta ^2 L^2\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{5 U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) \right) , \end{aligned}$$
(A5)

Substituting (A3) and (A5) into (A2), we have

$$\begin{aligned} T_1&\le \eta U\left( \frac{1}{2}+15U^2\eta ^2L^2\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{5 U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) . \end{aligned}$$
(A6)

With \({\mathbb {I}}\{\cdot \}\) denoting the indicator function and \({\mathcal {A}}\backslash {\mathcal {B}}\) denoting the set of elements of \({\mathcal {A}}\) that are not in \({\mathcal {B}}\), the term \(T_2\) can be bounded as

$$\begin{aligned} T_2&= {\mathbb {E}}[\Vert g^j\Vert ^2]\nonumber \\&\overset{(a4)}{=}{\mathbb {E}}\left[ \left\| \frac{1}{S}\sum _{m\in {\mathcal {M}}_U^j}g_m^j\right\| ^2\right] \nonumber \\&\overset{(b4)}{=}\frac{1}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}g_m^j\right\| ^2\right] \nonumber \\&\overset{(c4)}{=}\ \frac{\eta ^2}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}\sum _{u=0}^{U-1}\left[ \frac{1}{B}\sum _{b=1}^B\left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u};z_{m,b}^{j,u})-\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right) \right] \right\| ^2\right] \nonumber \\&\qquad +\frac{\eta ^2}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}\sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\overset{(d4)}{=}\ \frac{\eta ^2}{S^2B^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}\sum _{u=0}^{U-1}\sum _{b=1}^B\left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u};z_{m,b}^{j,u})-\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right) \right\| ^2\right] \nonumber \\&\qquad +\frac{\eta ^2}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}\sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\overset{(e4)}{=}\ \frac{\eta ^2U}{SB}\sigma _L^2+\frac{\eta ^2}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m=1}^M {\mathbb {I}}\{m\in {\mathcal {M}}_U^j\}\sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\overset{(f4)}{=}\ \frac{\eta ^2U}{SB}\sigma _L^2+\frac{\eta ^2}{S^2}{\mathbb {E}}\left[ \left\| \sum _{m\in {\mathcal {A}}^j} \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})+\sum _{m\in {\mathcal {M}}_U^j\backslash {\mathcal {A}}^j} \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\overset{(g4)}{\le }\ \frac{\eta ^2U}{SB}\sigma _L^2+\frac{2A^j\eta ^2}{S^2}\sum _{m\in {\mathcal {A}}^j}{\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\qquad +\frac{2(S-A^j)\eta ^2}{S^2}\sum _{m\in {\mathcal {M}}_U^j\backslash {\mathcal {A}}^j} {\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] , \end{aligned}$$
(A7)

where (a4) is due to the definition of \(g^j\); (b4) follows directly from (a4) by writing the sum over the selected set with indicator functions; (c4) follows from the fact that \({\mathbb {E}}[\Vert x\Vert ^2]={\mathbb {E}}[\Vert x-{\mathbb {E}}[x]\Vert ^2]+\Vert {\mathbb {E}}[x]\Vert ^2\) and \({\mathbb {E}}[\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u};z_{m,b}^{j,u})]=\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\); (d4) follows from the Cauchy–Schwarz inequality; (e4) is due to the fact that \({\mathbb {E}}[\Vert x_1+\cdots +x_n\Vert ^2]={\mathbb {E}}[\Vert x_1\Vert ^2+\cdots +\Vert x_n\Vert ^2]\) if the \(x_i\)'s are independent with zero mean; (f4) is due to the definition that the subset of workers selected by ages at round j is \({\mathcal {A}}^j\) with \(|{\mathcal {A}}^j|=A^j=\min \{S,S^j\}\); and (g4) is due to the Cauchy–Schwarz inequality.
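
As a quick numerical sanity check of the decomposition used in (c4) (our addition, not part of the proof), the identity \({\mathbb {E}}[\Vert x\Vert ^2]={\mathbb {E}}[\Vert x-{\mathbb {E}}[x]\Vert ^2]+\Vert {\mathbb {E}}[x]\Vert ^2\) can be verified on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=(1_000_000, 4))   # i.i.d. samples of a random vector

lhs = (np.linalg.norm(x, axis=1) ** 2).mean()              # E[||x||^2]
mean = x.mean(axis=0)                                      # sample estimate of E[x]
rhs = (np.linalg.norm(x - mean, axis=1) ** 2).mean() + np.linalg.norm(mean) ** 2

print(lhs, rhs)   # the two values agree up to Monte Carlo error
```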

Next, we bound the term \({\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right]\) appearing in (A7) as follows:

$$\begin{aligned}&{\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\quad ={\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\left( \nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})-\nabla {\mathcal {L}}_m(\varvec{\theta }^j) +\nabla {\mathcal {L}}_m(\varvec{\theta }^j)-\nabla {\mathcal {L}}(\varvec{\theta }^j)+\nabla {\mathcal {L}}(\varvec{\theta }^j)\right) \right\| ^2\right] \nonumber \\&\quad \overset{(a5)}{\le }\ 3UL^2\sum _{u=0}^{U-1}{\mathbb {E}}[\Vert \varvec{\theta }_{m}^{j,u}-\varvec{\theta }^j\Vert ^2]+3U^2\sigma _G^2+3U^2\Vert \nabla {\mathcal {L}}(\varvec{\theta }^j)\Vert ^2\nonumber \\&\quad \overset{(b5)}{\le }\ 15U^3L^2\eta ^2(\sigma _L^2+6U\sigma _G^2)+(90U^4L^2\eta ^2+3U^2)\Vert \nabla {\mathcal {L}}(\varvec{\theta }^j)\Vert ^2+3U^2\sigma _G^2\nonumber \\&\quad \overset{(c5)}{=}C_1\Vert \nabla {\mathcal {L}}(\varvec{\theta }^j)\Vert ^2+C_2, \end{aligned}$$
(A8)

where (a5) is due to the Cauchy–Schwarz inequality and the bounded-variance assumption (Assumption 2); (b5) follows from (A4); and (c5) follows from defining \(C_1=90U^4L^2\eta ^2+3U^2\) and \(C_2=15U^3L^2\eta ^2(\sigma _L^2+6U\sigma _G^2)+3U^2\sigma _G^2\).

By substituting the upper bounds (A6) and (A7) on the terms \(T_1\) and \(T_2\) into (A1), we readily have:

$$\begin{aligned}&{\mathbb {E}}[{\mathcal {L}}(\varvec{\theta }^{j+1})]\nonumber \\&\quad \overset{(a6)}{\le }\ {\mathcal {L}}(\varvec{\theta }^{j}) -\eta U\left( \frac{1}{2}-15U^2\eta ^2 L^2\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2\nonumber \\&\qquad +\frac{L\eta ^2A^j}{S^2} \sum _{m\in {\mathcal {A}}^j}{\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] +\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\frac{\eta ^2UL}{2SB}\sigma _L^2\nonumber \\&\qquad +\frac{L\eta ^2(S-A^j)}{S^2}\sum _{m\in {\mathcal {M}}_U^j\backslash {\mathcal {A}}^j}{\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right] \nonumber \\&\quad \overset{(b6)}{\le }{\mathcal {L}}(\varvec{\theta }^{j}) -\eta U\left( \frac{1}{2}-15U^2\eta ^2 L^2\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2+\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) \nonumber \\&\qquad +\frac{L\eta ^2\left( (A^j)^2+(S-A^j)^2\right) }{S^2}\left( C_1\Vert \nabla {\mathcal {L}}(\varvec{\theta }^j)\Vert ^2+C_2\right) +\frac{\eta ^2UL}{2SB}\sigma _L^2\nonumber \\&\quad \overset{(c6)}{\le }{\mathcal {L}}(\varvec{\theta }^{j})-\eta U\left( \frac{1}{2}-15U^2\eta ^2 L^2-\frac{L\eta \left( 2(A^j)^2-2SA^j+S^2\right) }{S^2U}C_1\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2\nonumber \\&\qquad +\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\frac{\eta ^2UL}{2SB}\sigma _L^2+\frac{L\eta ^2\left( 2(A^j)^2-2SA^j+S^2\right) }{S^2}C_2\nonumber \\&\quad \overset{(d6)}{\le }{\mathcal {L}}(\varvec{\theta }^{j})-\eta U\left( \frac{1}{2}-15U^2\eta ^2 L^2-\frac{L\eta }{U}C_1\right) \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2\nonumber \\&\qquad +\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\frac{\eta ^2UL}{2SB}\sigma _L^2+\frac{L\eta ^2\left( 2(A^j)^2-2SA^j+S^2\right) }{S^2}C_2\nonumber \\&\quad \overset{(e6)}{\le }{\mathcal {L}}(\varvec{\theta }^{j})-c\eta U\left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 +\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\frac{\eta ^2UL}{2SB}\sigma _L^2\nonumber \\&\qquad +L\eta ^2C_2+\frac{2L\eta ^2(A^j)^2}{S^2}C_2-\frac{2L\eta ^2A^j}{S}C_2, \end{aligned}$$
(A9)

where (a6) comes from direct substitution; (b6) uses the result in (A8); (c6) follows from direct computation; (d6) uses the fact that \(0\le A^j\le S\); and (e6) follows from the fact that, for sufficiently small \(\eta\), there exists a constant c such that \(0<c<\frac{1}{2}-15U^2\eta ^2 L^2-L\eta (90U^3L^2\eta ^2+3U)\). The proof of Lemma 1 is then complete.

Appendix B: Proof of Theorem 1

With Lemma 1, we have

$$\begin{aligned} {\mathbb {E}}[{\mathcal {L}}(\varvec{\theta }^{j+1})]&\overset{(a8)}{\le }{\mathcal {L}}(\varvec{\theta }^{j})-c\eta U \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 + \frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) \nonumber \\&\quad +\frac{\eta ^2UL}{2SB}\sigma _L^2+\frac{L\eta ^2\left( 2(A^j)^2+S^2\right) }{S^2}C_2-\frac{2L\eta ^2A^j}{S}C_2\nonumber \\&\overset{(b8)}{\le }{\mathcal {L}}(\varvec{\theta }^{j})-c\eta U \left\| \nabla {\mathcal {L}}(\varvec{\theta }^{j})\right\| ^2 +\frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) \nonumber \\&\quad +\frac{\eta ^2UL}{2SB}\sigma _L^2+3L\eta ^2C_2-\frac{2L\eta ^2A^j}{S}C_2, \end{aligned}$$
(B10)

where (a8) is from (A9); and (b8) uses the fact that \(0\le A^j\le S\).

With the age-based mechanism, we denote by R the minimum number of rounds needed to traverse all the workers in \({\mathcal {M}}\), i.e., \(A^j+\cdots +A^{j+R-1}\ge M\). By rearranging the terms in (B10) and summing over \(j=0,\ldots ,R-1\), we have:

$$\begin{aligned} \frac{1}{R}{\mathbb {E}}\left[ \sum _{j=0}^{R-1}\left\| \nabla {\mathcal {L}} (\varvec{\theta }^j)\right\| ^2\right] \le \frac{{\mathcal {L}}(\varvec{\theta }^0)-{\mathcal {L}}(\varvec{\theta }^R)}{c\eta U R} + V_1, \end{aligned}$$
(B11)

where \(V_1=\frac{1}{c\eta U}\left( \frac{\eta ^2UL}{2SB}\sigma _L^2 + \frac{5U^2\eta ^3L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\left( 3L\eta ^2-\frac{2L\eta ^2M}{SR}\right) C_2\right)\).

Thus, when J is a multiple of R, we can then readily write

$$\begin{aligned} \frac{1}{J}{\mathbb {E}}\left[ \sum _{j=0}^{J-1}\left\| \nabla {\mathcal {L}} (\varvec{\theta }^j)\right\| ^2\right] \le \frac{{\mathcal {L}}(\varvec{\theta }^0)-{\mathcal {L}}^*}{c\eta U J} + V, \end{aligned}$$
(B12)

where we used Assumption 1 and \(V=\frac{1}{c}\left[ \frac{\eta L}{2SB}\sigma _L^2+\frac{5U\eta ^2L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\left( 3\eta L-\frac{2L\eta M}{SR}\right) \left( 15U^2L^2\eta ^2(\sigma _L^2+6U\sigma _G^2)+3U\sigma _G^2\right) \right]\). The proof is then complete.
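
For completeness, the step from (B11) to (B12) can be spelled out as follows (our expansion of the argument above): writing \(J=KR\) and applying (B11) to each block of R consecutive rounds yields

$$\begin{aligned} \frac{1}{J}{\mathbb {E}}\left[ \sum _{j=0}^{J-1}\left\| \nabla {\mathcal {L}}(\varvec{\theta }^j)\right\| ^2\right]&=\frac{1}{K}\sum _{k=0}^{K-1}\frac{1}{R}{\mathbb {E}}\left[ \sum _{j=kR}^{(k+1)R-1}\left\| \nabla {\mathcal {L}}(\varvec{\theta }^j)\right\| ^2\right] \\&\le \frac{1}{K}\sum _{k=0}^{K-1}\left( \frac{{\mathbb {E}}[{\mathcal {L}}(\varvec{\theta }^{kR})-{\mathcal {L}}(\varvec{\theta }^{(k+1)R})]}{c\eta U R}+V_1\right) \\&=\frac{{\mathcal {L}}(\varvec{\theta }^{0})-{\mathbb {E}}[{\mathcal {L}}(\varvec{\theta }^{J})]}{c\eta U J}+V_1\le \frac{{\mathcal {L}}(\varvec{\theta }^{0})-{\mathcal {L}}^*}{c\eta U J}+V_1, \end{aligned}$$

where the last inequality uses \({\mathcal {L}}(\varvec{\theta }^{J})\ge {\mathcal {L}}^*\); substituting \(C_2\) into \(V_1\) and simplifying gives the expression of V above.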

Appendix C: Additional experiments

In this section, we provide additional simulation results: the performance of AgeSel, FedAvg, OCS and RR on the CIFAR-10 dataset, and the performance of AgeSel under different choices of S, namely \(S=3\) and \(S=15\).

Figures 5 and 6 present the performance of the four schemes considered in the main text in terms of communication cost and training rounds. It can be clearly observed that on a dataset considerably larger than MNIST, namely CIFAR-10, AgeSel still achieves the best performance on both metrics.

Fig. 5 Comparison of FedAvg, OCS, RR and AgeSel in terms of training rounds with CIFAR-10

Fig. 6 Comparison of FedAvg, OCS, RR and AgeSel in terms of communication cost with CIFAR-10

Figures 7, 8, 9 and 10 test the performance of the four schemes under \(S=15\) and \(S=3\). Comparing across schemes, when \(S=15\), although the gap between the schemes shrinks, the proposed AgeSel still achieves the best performance in terms of communication cost and training rounds; when \(S=3\), all the schemes exhibit larger fluctuations, but AgeSel still requires the fewest training rounds and the lowest communication cost. Comparing across values of S, when \(S=15\), AgeSel needs more communication than with \(S=5\) to reach the same accuracy, owing to the larger number of selected workers; when \(S=3\), although AgeSel requires less communication than with \(S=5\), the variance is relatively large and the performance is less stable. To summarize, AgeSel is consistently the best algorithm regardless of the choice of S. As for the selection of S itself, if S is too large, the communication cost becomes high; if S is too small, the performance becomes unstable. A moderate value, here \(S=5\), therefore offers a good balance between communication cost and performance stability.

Fig. 7 Comparison of FedAvg, OCS, RR and AgeSel in terms of training rounds with \(S=15\)

Fig. 8 Comparison of FedAvg, OCS, RR and AgeSel in terms of communication cost with \(S=15\)

Fig. 9 Comparison of FedAvg, OCS, RR and AgeSel in terms of training rounds with \(S=3\)

Fig. 10 Comparison of FedAvg, OCS, RR and AgeSel in terms of communication cost with \(S=3\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhu, F., Zhang, J. & Wang, X. Communication-efficient local SGD with age-based worker selection. J Supercomput 79, 13794–13816 (2023). https://doi.org/10.1007/s11227-023-05190-7
