Abstract
A major bottleneck of distributed learning under the parameter server (PS) framework is the communication cost incurred by frequent bidirectional transmissions between the PS and the workers. To address this issue, local stochastic gradient descent (SGD) and worker selection have been exploited, reducing the communication frequency and the number of participating workers at each round, respectively. However, partial participation can be detrimental to the convergence rate, especially for heterogeneous local datasets. In this paper, to improve communication efficiency and speed up the training process, we develop a novel worker selection strategy named AgeSel. The key enabler of AgeSel is the use of workers' ages to balance their participation frequencies. The convergence of local SGD with the proposed age-based partial worker participation is rigorously established. Simulation results demonstrate that the proposed AgeSel strategy can significantly reduce both the number of training rounds needed to reach a target accuracy and the communication cost. The influence of the algorithm's hyper-parameter is also explored to demonstrate the benefit of age-based worker selection.
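The age-based selection described above can be sketched in code. The following is our own minimal illustration, not the authors' implementation: the names (`agesel_select`, `age_threshold`) and the exact tie-breaking are assumptions, consistent only with the description in Appendix A that at round \(j\), \(A^j=\min \{S,S^j\}\) workers are chosen by age and the remaining \(S-A^j\) by weights, with selected workers' ages reset and all others' ages incremented.

```python
import numpy as np

def agesel_select(ages, weights, S, age_threshold, rng):
    """One round of (hypothetical) age-based worker selection.

    Workers whose age (rounds since last participation) has reached the
    threshold form a priority pool; up to S of them are chosen uniformly
    at random, and any remaining slots are filled by weighted sampling
    over the other workers.
    """
    M = len(ages)
    stale = [m for m in range(M) if ages[m] >= age_threshold]
    A = min(S, len(stale))                       # A^j = min{S, S^j}
    by_age = list(rng.choice(stale, size=A, replace=False)) if A > 0 else []
    rest = [m for m in range(M) if m not in by_age]
    p = np.array([weights[m] for m in rest], dtype=float)
    p /= p.sum()                                 # normalize sampling weights
    by_weight = list(rng.choice(rest, size=S - A, replace=False, p=p))
    selected = by_age + by_weight
    # Age update: reset participants, increment everyone else.
    for m in range(M):
        ages[m] = 0 if m in selected else ages[m] + 1
    return selected
```

Run over many rounds, the age mechanism guarantees that no worker goes unselected indefinitely, which is what bounds the traversal length \(R\) used in the proof of Theorem 1.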
Data Availability
The EMNIST dataset can be downloaded from https://www.nist.gov/itl/products-and-services/emnist-dataset.
Notes
More training rounds also imply more rounds of communication between the PS and the workers.
Here we deliberately do not average over Monte Carlo runs, so that the effect of S on the fluctuation of the curves remains visible.
References
Li M, Andersen DG, Park JW, Smola AJ, Su BY (2014) Scaling distributed machine learning with the parameter server. In: Proceedings of the USENIX OSDI, pp 583–598
Lian X, Zhang C, Zhang H, Hsieh CJ, Zhang W, Liu J (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Proceedings of the NeurIPS, vol 30, pp 5336–5346
Garcia Lopez P, Montresor A, Epema D, Datta A, Higashino T, Iamnitchi A et al (2015) Edge-centric computing: vision and challenges. ACM SIGCOMM Comput Commun Rev 45(5):37–42
Hong K, Lillethun D, Ramachandran U, Ottenwälder B, Koldehofe B (2013) Mobile fog: a programming model for large-scale applications on the internet of things. In: Proceedings of the Second ACM SIGCOMM Workshop on Mobile Cloud Computing, pp 15–20
Bonomi F, Milito R, Zhu J, Addepalli S (2012) Fog computing and its role in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, pp 13–16
Deshpande A, Guestrin C, Madden SR, Hellerstein JM, Hong W (2005) Model-based approximate querying in sensor networks. VLDB J 14:417–443
Dean J, Corrado GS, Monga R, Kai C, Ng AY (2012) Large scale distributed deep networks. In: Proceedings of the NeurIPS, vol 25, pp 1223–1231
Yan N, Wang K, Pan C, Chai KK (2022) Performance analysis for channel-weighted federated learning in OMA wireless networks. IEEE Signal Process Lett 29:772–776
Uddin MP, Xiang Y, Yearwood J, Gao L (2021) Robust federated averaging via outlier pruning. IEEE Signal Process Lett 29:409–413
Han S-S, Kim Y-K, Jeon Y-B, Park J, Park D-S, Hwang D, Jeong C-S (2020) Distributed deep learning platform for pedestrian detection on it convergence environment. J Supercomput 76(7):5460–5485
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of the COMPSTAT, pp 177–186
McMahan HB, Moore E, Ramage D, Hampson S, Arcas B (2017) Communication-efficient learning of deep networks from decentralized data. In: Proceedings of Artificial Intelligence and Statistics, pp 1273–1282
Chen T, Guo Z, Sun Y, Yin W (2021) CADA: communication-adaptive distributed Adam. In: Proceedings of the ICAIS, pp 613–621
Chen W, Horvath S, Richtarik P (2020) Optimal client sampling for federated learning. arXiv preprint arXiv:2010.13723
Goetz J, Malik K, Bui D, Moon S, Liu H, Kumar A (2019) Active federated learning. arXiv preprint arXiv:1909.12641
Zinkevich M, Weimer M, Smola AJ, Li L (2010) Parallelized stochastic gradient descent. Proc Neural Inf Process Syst 23:2595–2603
Arjevani Y, Shamir O (2015) Communication complexity of distributed convex learning and optimization. In: Proceedings of Advances in Neural Information Processing Systems, vol 28, pp 1756–1764
Zhou F, Cong G (2018) On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 3219–3227
Wang J, Joshi G (2021) Cooperative SGD: a unified framework for the design and analysis of local-update SGD algorithms. J Mach Learn Res 22(1):9709–9758
Woodworth B, Wang J, McMahan B, Srebro N (2018) Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In: Proceedings of the Neural Information Processing Systems, vol 31, pp 8505–8515
Li X, Huang K, Yang W, Wang S, Zhang Z (2019) On the convergence of FedAvg on non-iid data. arXiv preprint arXiv:1907.02189
Haddadpour F, Kamani MM, Mahdavi M, Cadambe VR (2019) Local SGD with periodic averaging: tighter analysis and adaptive synchronization. In: Proceedings of the Neural Information Processing Systems, vol 32, pp 11082–11094
Wang S, Tuor T, Salonidis T, Leung KK, Makaya C, He T, Chan K (2019) Adaptive federated learning in resource constrained edge computing systems. IEEE J Sel Areas Commun 37(6):1205–1221
Yu H, Yang S, Zhu S (2019) Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 5693–5700
Khaled A, Mishchenko K, Richtárik P (2019) First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715
Malinovskiy G, Kovalev D, Gasanov E, Condat L, Richtarik P (2020) From local SGD to local fixed-point methods for federated learning. In: International Conference on Machine Learning, pp 6692–6701
Li T, Sahu AK, Zaheer M, Sanjabi M, Talwalkar A, Smith V (2020) Federated optimization in heterogeneous networks. In: Proceedings of Machine Learning and Systems, vol 2, pp 429–450
Yang H, Fang M, Liu J (2021) Achieving linear speedup with partial worker participation in non-iid federated learning. arXiv preprint arXiv:2101.11203
Ye Q, Zhou Y, Shi M, Lv J (2022) FLSGD: free local SGD with parallel synchronization. J Supercomput 78(10):12410–12433
Ribero M, Vikalo H (2020) Communication-efficient federated learning via optimal client sampling. arXiv preprint arXiv:2007.15197
Cho YJ, Gupta S, Joshi G, Yağan O (2020) Bandit-based communication-efficient client selection strategies for federated learning. In: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, pp 1066–1069
Shah V, Wu X, Sanghavi S (2020) Choosing the sample with lowest loss makes SGD robust. In: International Conference on Artificial Intelligence and Statistics, pp 2120–2130
Cho YJ, Wang J, Joshi G (2022) Towards understanding biased client selection in federated learning. In: International Conference on Artificial Intelligence and Statistics, pp 10351–10375
Yin D, Pananjady A, Lam M, Papailiopoulos D, Ramchandran K, Bartlett P (2018) Gradient diversity: a key ingredient for scalable distributed learning. In: Proceedings of the ICAIS, pp 1998–2007
Kaul S, Yates R, Gruteser M (2012) Real-time status: How often should one update? In: Proceedings of the IEEE INFOCOM, pp 2731–2735
Ozfatura E, Buyukates B, Gündüz D, Ulukus S (2020) Age-based coded computation for bias reduction in distributed learning. In: Proceedings of the IEEE GLOBECOM, pp 1–6
Yang HH, Arafa A, Quek TQS, Vincent Poor H (2020) Age-based scheduling policy for federated learning in mobile edge networks. In: Proceedings of the IEEE ICASSP, pp 8743–8747
Yang HH, Liu Z, Quek TQ, Poor HV (2019) Scheduling policies for federated learning in wireless networks. IEEE Trans Commun 68(1):317–333
Buyukates B, Ulukus S (2021) Timely communication in federated learning. In: Proceedings of the IEEE INFOCOM, pp 1–6
Reddi S, Charles Z, Zaheer M, Garrett Z, Rush K, Konečný J et al (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295
Funding
This work was supported by the National Natural Science Foundation of China Grants No. 62101134 and No. 62071126, and the Innovation Program of Shanghai Municipal Science and Technology Commission Grant 20JC1416400.
Author information
Authors and Affiliations
Contributions
FZ and JZ came up with the idea; FZ wrote the manuscript; JZ and XW revised and polished the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Lemma 1
With the L-smoothness of the objective function \({\mathcal {L}}\) in Assumption 1, taking expectation over all the randomness of \({\mathcal {L}}(\varvec{\theta }^{j+1})\) yields:
where (a1) is due to the definitions \(g^j=\frac{1}{S}\sum _{m\in {\mathcal {M}}_U^j}g_m^j\) and \(g_m^j=-\eta \sum _{u=0}^{U-1}\frac{1}{B}\sum _{b=1}^B\nabla {\mathcal {L}}(\varvec{\theta }_{m}^{j,u}; z_{m,b}^{j,u})\), with \(p_m=N_m/N\); and (b1) follows directly from (a1).
We then bound the terms \(T_1\) and \(T_2\), respectively. For the term \(T_1\), with the definition of the weighted global update \({\bar{g}}^j=\sum _{m\in {\mathcal {M}}}p_m g_m^j\) and the unbiased global update \({\tilde{g}}^j=\frac{1}{M}\sum _{m\in {\mathcal {M}}} g_m^j\), we have
where (a2) splits the selected workers at round j into the ones selected by ages in set \({\mathcal {A}}^j\) with \(|{\mathcal {A}}^j|=A^j=\min \{S,S^j\}\) and the ones selected by weights in set \({\mathcal {M}}_U^j\backslash {\mathcal {A}}^j\); (b2) is because the selection by ages is essentially unweighted random selection in expectation and the selection by weights is equivalent to \({\bar{g}}^j\) in expectation. The two terms in (A2) are then bounded separately.
To bound the first term, we have:
where (a3) is derived from the definition of \({\bar{g}}^j\); (b3) and (c3) come from direct computation; (d3) uses the fact that \(\langle {\textbf{x}},{\textbf{y}}\rangle \le \frac{1}{2}[\Vert {\textbf{x}}\Vert ^2+\Vert {\textbf{y}}\Vert ^2]\); (e3) is due to Jensen's inequality and the Cauchy–Schwarz inequality; (f3) follows from Assumption 1; and (g3) is due to the fact that \(\sum _{m=1}^M p_m= 1\) and to [40, Lemma 3], which proves that
under the condition that \(\eta \le \frac{1}{8LU}\), where \(\sigma _L\) and \(\sigma _G\) are two constants defined in Assumption 2.
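For reference, the two elementary inequalities invoked in steps (d3) and (e3) can be written out explicitly (for arbitrary vectors \({\textbf{x}}, {\textbf{y}}, {\textbf{x}}_1,\ldots ,{\textbf{x}}_n\)):

```latex
% Young-type bound used in (d3):
\langle \mathbf{x},\mathbf{y}\rangle \le \tfrac{1}{2}\left[\|\mathbf{x}\|^2+\|\mathbf{y}\|^2\right],
% Jensen / Cauchy--Schwarz bound used in (e3):
\Big\|\sum_{i=1}^{n} \mathbf{x}_i\Big\|^2 \le n \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 .
```

The second bound is what introduces the factors of \(U\) and \(M\) that appear in the constants of Lemma 1.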
Likewise, the second term in (A2) can be bounded as below, with \(p_m\) replaced by \(\frac{1}{M}\):
Substituting (A3) and (A5) into (A2), we have
With \({\mathbb {I}}\{\cdot \}\) denoting the indicator function and \({\mathcal {A}}\backslash {\mathcal {B}}\) denoting the complement of set \({\mathcal {B}}\) in set \({\mathcal {A}}\), the term \(T_2\) can be bounded as
where (a4) is due to the definition of \(g^j\); (b4) comes directly from (a3); (c4) follows from the fact that \({\mathbb {E}}[\Vert x\Vert ^2]={\mathbb {E}}[\Vert x-{\mathbb {E}}[x]\Vert ^2+\Vert {\mathbb {E}}[x]\Vert ^2]\) and \({\mathbb {E}}[\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u};z_{m,b}^{j,u})]=\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\); (d4) follows from the Cauchy–Schwarz inequality; (e4) is due to the fact that \({\mathbb {E}}[\Vert x_1+...+x_n\Vert ^2]={\mathbb {E}}[\Vert x_1\Vert ^2+...+\Vert x_n\Vert ^2]\) if the \(x_i\)'s are independent with zero mean; (f4) is due to the definition that the subset of workers selected by ages at round j is \({\mathcal {A}}^j\) with \(|{\mathcal {A}}^j|=A^j=\min \{S,S^j\}\); and (g4) is due to the Cauchy–Schwarz inequality.
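The two probabilistic identities used in (c4) and (e4) above can be stated compactly as follows:

```latex
% Variance decomposition used in (c4):
\mathbb{E}\big[\|x\|^2\big] = \mathbb{E}\big[\|x-\mathbb{E}[x]\|^2\big] + \big\|\mathbb{E}[x]\big\|^2,
% Identity used in (e4): for independent zero-mean x_1,\dots,x_n,
\mathbb{E}\Big[\Big\|\sum_{i=1}^{n} x_i\Big\|^2\Big] = \sum_{i=1}^{n} \mathbb{E}\big[\|x_i\|^2\big].
```

The first splits each stochastic gradient into its variance and its mean; the second removes the cross terms between independent mini-batch samples, which is what turns the sum over \(U\) local steps and \(B\) samples into the \(\frac{1}{SB}\sigma _L^2\) variance term of Lemma 1.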
Next, we bound the term \({\mathbb {E}}\left[ \left\| \sum _{u=0}^{U-1}\nabla {\mathcal {L}}_m(\varvec{\theta }_{m}^{j,u})\right\| ^2\right]\) in (A7) and (A3) as follows:
where (a5) is due to Cauchy-Schwarz inequality and the bounded variance assumption; (b5) follows from (A4); (c5) is due to the definition that \(C_1=90U^4L^2\eta ^2+3U^2\) and \(C_2=15U^3L^2\eta ^2(\sigma _L^2+6U\sigma _G^2)+3U^2\sigma _G^2\).
By substituting the upper bounds (A6), (A7) of the terms \(T_1\) and \(T_2\) in (A1), we readily have:
where (a6) comes from direct substitution; (b6) uses the result in (A8); (c6) follows from direct computation; (d6) uses the fact that \(0\le A^j\le S\); and (e6) follows from the fact that there exists a constant c such that \(0<c<\frac{1}{2}-15U^2\eta ^2 L^2-L\eta (90U^3L^2\eta ^2+3U)\). The proof of Lemma 1 is then complete.
Appendix B: Proof of Theorem 1
With Lemma 1, we have
where (a8) is from (A9); and (b8) uses the fact that \(0\le A^j\le S\).
With the age-based mechanism, denote by R the minimum number of rounds needed to traverse all the workers in \({\mathcal {M}}\), i.e., \(A^j+...+A^{j+R-1}\ge M\). By rearranging the terms in (B10) and summing over \(j=0,\ldots ,R-1\), we have:
where \(V_1=\frac{1}{c\eta U}\left(\frac{\eta ^2UL}{2SB}\sigma _L^2 + \frac{5U^2\eta ^3 L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\left(3L\eta ^2-\frac{2L\eta ^2 M}{SR}\right)C_2\right)\).
Thus, when J is a multiple of R, we can then readily write
where we used Assumption 1 and \(V=\frac{1}{c}\left[\frac{\eta L}{2SB}\sigma _L^2+\frac{5U\eta ^2 L^2}{2}\left( \sigma _L^2+6U\sigma _G^2\right) +\left(3\eta L-\frac{2L\eta M}{SR}\right)\left(15U^2 L^2\eta ^2(\sigma _L^2+6U\sigma _G^2)+3U\sigma _G^2\right)\right]\). The proof is then complete.
Appendix C: Additional experiments
In this section, we provide additional simulation results: we test the performance of AgeSel, FedAvg, OCS and RR on the CIFAR-10 dataset, and we test the performance of AgeSel under different choices of S, namely \(S=3\) and \(S=15\).
Figures 5 and 6 present the performance of the four schemes mentioned in the main text in terms of communication cost and training rounds. It can be clearly observed that on CIFAR-10, a dataset considerably more challenging than MNIST, AgeSel still achieves the best performance on both metrics.
Figures 7, 8, 9 and 10 test the performance of the four schemes under \(S=15\) and \(S=3\). Comparing across schemes, when \(S=15\), although the gap between the schemes narrows, the proposed AgeSel still achieves the best performance in terms of communication cost and training rounds; when \(S=3\), all the schemes experience larger fluctuations, but AgeSel still requires the fewest training rounds and the lowest communication cost. Comparing across values of S, when \(S=15\), AgeSel incurs a higher communication cost to reach the same accuracy as \(S=5\), owing to the larger number of selected workers; when \(S=3\), although AgeSel requires less communication than \(S=5\), the variance is relatively large and the performance is less stable. To summarize, AgeSel is consistently the best algorithm regardless of the choice of S. As for that choice, if S is too large, the communication cost is high; if S is too small, the performance is unstable. A moderate value, here \(S=5\), thus strikes a good balance between communication cost and performance stability.
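The trade-off behind the choice of S can be made concrete with a back-of-the-envelope accounting of communication cost. The sketch below is our own illustration (round counts and unit model size are invented for the example, not taken from the experiments): each round, the PS exchanges the model bidirectionally with S workers, so a larger S must save enough rounds to pay for its higher per-round cost.

```python
def communication_cost(rounds, S, model_size, bidirectional=True):
    """Total cost of a training run under a simple accounting model:
    each round, the PS sends the model to S workers and (if
    bidirectional) each of them returns an update of the same size."""
    per_round = S * model_size * (2 if bidirectional else 1)
    return rounds * per_round

# Illustrative (made-up) round counts: larger S converges in fewer
# rounds, yet can still cost more in total communication.
small = communication_cost(rounds=120, S=3, model_size=1.0)
mid = communication_cost(rounds=80, S=5, model_size=1.0)
large = communication_cost(rounds=60, S=15, model_size=1.0)
```

Under this toy accounting, \(S=15\) is the most expensive even though it needs the fewest rounds, mirroring the qualitative behavior reported in Figs. 7, 8, 9 and 10.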
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhu, F., Zhang, J. & Wang, X. Communication-efficient local SGD with age-based worker selection. J Supercomput 79, 13794–13816 (2023). https://doi.org/10.1007/s11227-023-05190-7