Optimizing Communication Topology for Collaborative Learning Across Datacenters

  • Conference paper
  • In: Emerging Networking Architecture and Technologies (ICENAT 2022)

Abstract

Federated learning (FL) is emerging as an increasingly important paradigm for collaboratively training high-quality machine learning (ML) models over massive amounts of data stored in geo-distributed datacenters. However, the communication efficiency of gradient aggregation during training is a primary bottleneck that impedes the adoption of FL, especially in cross-silo settings, where the available bandwidth of the inter-datacenter links connecting data silos is often very limited. To improve the training efficiency of cross-silo FL between datacenters, we propose TopoAdopt, an efficient communication topology design for gradient aggregation. TopoAdopt uses multiple aggregators to share the aggregation load and tree-based hierarchical aggregation to reduce bandwidth consumption from clients to aggregators. For better performance, it jointly optimizes the assignment of parameters among aggregators and the construction of aggregation trees. We formulate this optimization problem as a mixed-integer nonlinear program and develop efficient algorithms that find satisfactory communication topologies in reasonable computational time. Experimental results show that TopoAdopt speeds up gradient aggregation by up to 5.2× compared to existing solutions.
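
To make the load-balancing half of this optimization concrete, here is a minimal sketch of assigning parameter shards to aggregator silos so that the slowest aggregator finishes as early as possible. Everything in it is an illustrative assumption rather than the paper's formulation: the cost model, the silo and bandwidth parameters, and the function names (aggregation_time, anneal_assignment) are hypothetical, and the search uses generic simulated annealing rather than TopoAdopt's actual algorithms. The construction of aggregation trees, which TopoAdopt optimizes jointly with this assignment, is deliberately left out for brevity.

    import math
    import random

    # Assumed model (not the paper's): num_silos datacenters, each holding
    # one client and able to act as an aggregator. A parameter shard p
    # assigned to aggregator a must be pushed to a by the other
    # (num_silos - 1) silos, so a's transfer volume grows with shard size
    # and fan-in. Round time is bounded by the most heavily loaded silo.

    def aggregation_time(shard_sizes, assignment, uplinks, num_silos):
        """Estimated completion time of one gradient-aggregation round."""
        load = [0.0] * num_silos
        for p, agg in enumerate(assignment):
            load[agg] += shard_sizes[p] * (num_silos - 1)
        return max(load[s] / uplinks[s] for s in range(num_silos))

    def anneal_assignment(shard_sizes, uplinks, num_silos,
                          iters=5000, temp=1.0, cooling=0.999):
        """Simulated-annealing search over the shard-to-aggregator mapping."""
        assignment = [random.randrange(num_silos) for _ in shard_sizes]
        cur = aggregation_time(shard_sizes, assignment, uplinks, num_silos)
        best, best_cost = list(assignment), cur
        for _ in range(iters):
            p = random.randrange(len(shard_sizes))
            old = assignment[p]
            assignment[p] = random.randrange(num_silos)  # propose a random move
            cost = aggregation_time(shard_sizes, assignment, uplinks, num_silos)
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature cools.
            if cost <= cur or random.random() < math.exp((cur - cost) / temp):
                cur = cost
                if cost < best_cost:
                    best, best_cost = list(assignment), cost
            else:
                assignment[p] = old  # reject the move
            temp *= cooling
        return best, best_cost

For example, with four silos whose uplink bandwidths differ by a factor of five, the search should steer the large shards away from the slowest uplink:

    random.seed(0)
    shards  = [1.0, 2.0, 0.5, 1.5]    # shard sizes (illustrative units)
    uplinks = [10.0, 5.0, 2.0, 8.0]   # per-silo uplink bandwidth (illustrative units)
    plan, completion = anneal_assignment(shards, uplinks, num_silos=4)
    print(plan, completion)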

This work was funded by the National Natural Science Foundation of China (62102066), the Open Research Projects of Zhejiang Lab (No. 2022QA0AB02), and CNKLSTISS.

Author information

Correspondence to Long Luo.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Luo, L., Yang, S., Feng, W., Yu, H., Sun, G., Lei, B. (2023). Optimizing Communication Topology for Collaborative Learning Across Datacenters. In: Quan, W. (ed.) Emerging Networking Architecture and Technologies. ICENAT 2022. Communications in Computer and Information Science, vol 1696. Springer, Singapore. https://doi.org/10.1007/978-981-19-9697-9_15

  • DOI: https://doi.org/10.1007/978-981-19-9697-9_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-9696-2

  • Online ISBN: 978-981-19-9697-9

  • eBook Packages: Computer Science, Computer Science (R0)
