Abstract
With the development of deep learning, DNN models have become increasingly complex. Large-scale model parameters improve the accuracy of DNN models and thereby advance the level of AI. However, they also pose severe challenges to the hardware training platform, because training a large model requires substantial computing and memory resources, which can easily exceed the capacity of a single accelerator. In addition, as academia and industry demand ever higher DNN model accuracy, the number of training iterations is also skyrocketing. Against this background, more accelerators are integrated into a hierarchical platform to conduct distributed training. In distributed training platforms, the computation of the DNN model and the communication of the intermediate parameters are handled by different hardware modules, so their degree of parallelism profoundly affects the training speed. In this work, based on the widely used hierarchical Torus-Ring training platform and the Ring All-Reduce collective communication algorithm, we improve the speed of distributed training by optimizing the parallelism of communication and computation. Specifically, based on an analysis of the distributed training process, we schedule computation and communication so that they execute simultaneously as much as possible. As a result, we reduce the communication exposure time under data parallelism and the computation exposure time under model parallelism. Compared with the previous work, the training speed (over 5 training iterations) of the ResNet-50 model and the Transformer model is increased by 23.77\(\%\)–25.64\(\%\) and 11.66\(\%\)–12.83\(\%\), respectively.
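To make the communication pattern concrete, the following is a minimal sketch (not the paper's implementation) of the Ring All-Reduce algorithm the abstract refers to, simulated on `n` logical workers. Each worker's gradient is split into `n` chunks; a reduce-scatter phase of `n-1` steps accumulates one fully summed chunk per worker, and an all-gather phase of `n-1` steps circulates the reduced chunks until every worker holds the complete sum. The function name and data layout are illustrative assumptions.

```python
def ring_all_reduce(grads):
    """Simulate Ring All-Reduce: grads[w] is worker w's gradient (a flat
    list whose length is a multiple of n). Returns the per-worker buffers,
    which all end up equal to the element-wise sum over workers."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "gradient length must split evenly into n chunks"
    c = size // n
    buf = [list(g) for g in grads]          # each worker's working buffer

    def sl(i):                              # slice covering chunk i (mod n)
        i %= n
        return slice(i * c, (i + 1) * c)

    # Phase 1: reduce-scatter. In step s, worker w sends chunk (w - s) to
    # worker (w + 1), which adds it to its own copy of that chunk. After
    # n-1 steps, worker w owns the fully reduced chunk (w + 1) mod n.
    for s in range(n - 1):
        snap = [b[:] for b in buf]          # model the simultaneous exchange
        for w in range(n):
            dst, i = (w + 1) % n, sl(w - s)
            buf[dst][i] = [a + b for a, b in zip(buf[dst][i], snap[w][i])]

    # Phase 2: all-gather. In step s, worker w forwards chunk (w + 1 - s)
    # to worker (w + 1), which overwrites its copy, until every worker
    # holds every reduced chunk.
    for s in range(n - 1):
        snap = [b[:] for b in buf]
        for w in range(n):
            dst, i = (w + 1) % n, sl(w + 1 - s)
            buf[dst][i] = snap[w][i]

    return buf
```

Each worker transfers `2*(n-1)` chunks in total, which is why Ring All-Reduce is bandwidth-optimal on a ring; the overlap optimization in this paper concerns when these transfers run relative to the accelerators' compute phases.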
This work is supported in part by the National Key R&D Project No. 2021YFB0300300, the NSFC (62172430), the NSF of Hunan Province 2021JJ10052, the STIP of Hunan Province 2022RC3065, and the Key Laboratory of Advanced Microprocessor Chips and Systems.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hou, X. et al. (2024). Optimizing the Parallelism of Communication and Computation in Distributed Training Platform. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14487. Springer, Singapore. https://doi.org/10.1007/978-981-97-0834-5_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0833-8
Online ISBN: 978-981-97-0834-5