Abstract
Federated learning (FL) is an increasingly popular paradigm for collaboratively training high-quality machine learning (ML) models over massive amounts of data stored in geo-distributed datacenters. However, the communication cost of gradient aggregation during training is a primary bottleneck that impedes the adoption of FL, especially in cross-silo settings, because the available bandwidth of the inter-datacenter links connecting data silos is often very limited. To improve the training efficiency of cross-silo FL between datacenters, we propose TopoAdopt, a communication topology design for gradient aggregation that overcomes this bottleneck. TopoAdopt uses multiple aggregators to share the aggregation load and tree-based hierarchical aggregation to reduce bandwidth consumption from clients to aggregators. For better performance, it jointly optimizes the assignment of parameters among aggregators and the construction of the aggregation trees. We formulate this joint optimization as a mixed-integer nonlinear program and develop efficient algorithms that find satisfactory communication topologies in reasonable computational time. Experimental results show that TopoAdopt speeds up gradient aggregation completion time by up to 5.2× compared to existing solutions.
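To make the two ideas in the abstract concrete, the following is a minimal sketch (not the paper's implementation) of partitioned, tree-based gradient aggregation: the gradient vector is split into shards, one per aggregator, and each shard is summed up a per-aggregator aggregation tree so that every silo forwards only a single already-aggregated message upstream. All names, the 4-silo setup, and the example trees are illustrative assumptions; the paper's contribution is choosing the shard assignment and tree shapes jointly, which this sketch fixes by hand.

```python
def partition(num_params, num_aggregators):
    """Split parameter indices [0, num_params) into contiguous shards."""
    base, rem = divmod(num_params, num_aggregators)
    shards, start = [], 0
    for i in range(num_aggregators):
        end = start + base + (1 if i < rem else 0)
        shards.append((start, end))
        start = end
    return shards

def tree_aggregate(node, children, grads, lo, hi):
    """Sum the [lo, hi) gradient shard over the subtree rooted at `node`."""
    total = list(grads[node][lo:hi])
    for child in children.get(node, []):
        for j, v in enumerate(tree_aggregate(child, children, grads, lo, hi)):
            total[j] += v
    return total

# Example: 4 silos, 2 aggregators (silos 0 and 2 serve as tree roots).
grads = {i: [float(i + 1)] * 8 for i in range(4)}  # each silo's local gradient
shards = partition(8, 2)                            # [(0, 4), (4, 8)]
trees = {                                           # root -> child adjacency
    0: {0: [1, 2], 2: [3]},  # aggregator 0 collects shard 0 from all silos
    2: {2: [0, 3], 0: [1]},  # aggregator 2 collects shard 1 from all silos
}
aggregated = [0.0] * 8
for (lo, hi), root in zip(shards, trees):
    aggregated[lo:hi] = tree_aggregate(root, trees[root], grads, lo, hi)
# aggregated now holds the element-wise sum of all four silos' gradients.
```

In the hand-picked topology above, interior silos (e.g. silo 2 in aggregator 0's tree) pre-sum their children's shards before forwarding, which is what cuts the traffic on the bandwidth-limited links into each aggregator; TopoAdopt's MINLP searches over exactly these shard boundaries and tree structures.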
This work was funded by the National Natural Science Foundation of China (62102066), the Open Research Projects of Zhejiang Lab (No. 2022QA0AB02), and CNKLSTISS.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Luo, L., Yang, S., Feng, W., Yu, H., Sun, G., Lei, B. (2023). Optimizing Communication Topology for Collaborative Learning Across Datacenters. In: Quan, W. (eds) Emerging Networking Architecture and Technologies. ICENAT 2022. Communications in Computer and Information Science, vol 1696. Springer, Singapore. https://doi.org/10.1007/978-981-19-9697-9_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-9696-2
Online ISBN: 978-981-19-9697-9