Abstract
With the development of deep learning, DNN models have become increasingly complex. Large-scale model parameters improve the accuracy of DNN models and thereby advance the level of AI. However, they also pose severe challenges to the hardware training platform, because training a large model requires substantial computing and memory resources, which can easily exceed the capacity of a single accelerator. In addition, as academia and industry demand ever higher DNN model accuracy, the number of training iterations is also skyrocketing. Against this background, more accelerators are integrated into a hierarchical platform to conduct distributed training. In distributed training platforms, the computation of the DNN model and the communication of the intermediate parameters are handled by different hardware modules, so their degree of parallelism profoundly affects the training speed. In this work, based on the widely used hierarchical Torus-Ring training platform and the Ring All-Reduce collective communication algorithm, we improve the speed of distributed training by optimizing the parallelism of communication and computation. Specifically, based on an analysis of the distributed training process, we schedule computation and communication so that they execute simultaneously as much as possible. As a result, we reduce the communication exposure time under data parallelism and the computation exposure time under model parallelism. Compared with the previous work, the training speed (over 5 training iterations) of the ResNet-50 model and the Transformer model is increased by 23.77\(\%\)–25.64\(\%\) and 11.66\(\%\)–12.83\(\%\), respectively.
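To make the communication pattern concrete, the following is a minimal sketch (not the paper's implementation) of the Ring All-Reduce algorithm the abstract refers to, simulated on `n` logical workers. Each worker's gradient is split into `n` chunks; a reduce-scatter phase of `n-1` steps accumulates one fully summed chunk per worker, and an all-gather phase of `n-1` steps circulates the reduced chunks until every worker holds the complete sum. The function name and data layout are illustrative assumptions.

```python
def ring_all_reduce(grads):
    """Simulate Ring All-Reduce: grads[w] is worker w's gradient (a flat
    list whose length is a multiple of n). Returns the per-worker buffers,
    which all end up equal to the element-wise sum over workers."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "gradient length must split evenly into n chunks"
    c = size // n
    buf = [list(g) for g in grads]          # each worker's working buffer

    def sl(i):                              # slice covering chunk i (mod n)
        i %= n
        return slice(i * c, (i + 1) * c)

    # Phase 1: reduce-scatter. In step s, worker w sends chunk (w - s) to
    # worker (w + 1), which adds it to its own copy of that chunk. After
    # n-1 steps, worker w owns the fully reduced chunk (w + 1) mod n.
    for s in range(n - 1):
        snap = [b[:] for b in buf]          # model the simultaneous exchange
        for w in range(n):
            dst, i = (w + 1) % n, sl(w - s)
            buf[dst][i] = [a + b for a, b in zip(buf[dst][i], snap[w][i])]

    # Phase 2: all-gather. In step s, worker w forwards chunk (w + 1 - s)
    # to worker (w + 1), which overwrites its copy, until every worker
    # holds every reduced chunk.
    for s in range(n - 1):
        snap = [b[:] for b in buf]
        for w in range(n):
            dst, i = (w + 1) % n, sl(w + 1 - s)
            buf[dst][i] = snap[w][i]

    return buf
```

Each worker transfers `2*(n-1)` chunks in total, which is why Ring All-Reduce is bandwidth-optimal on a ring; the overlap optimization in this paper concerns when these transfers run relative to the accelerators' compute phases.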
This work is supported in part by the National Key R&D Project No. 2021YFB0300300, the NSFC (62172430), the NSF of Hunan Province 2021JJ10052, the STIP of Hunan Province 2022RC3065, and the Key Laboratory of Advanced Microprocessor Chips and Systems.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hou, X. et al. (2024). Optimizing the Parallelism of Communication and Computation in Distributed Training Platform. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14487. Springer, Singapore. https://doi.org/10.1007/978-981-97-0834-5_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0833-8
Online ISBN: 978-981-97-0834-5