
Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method

International Journal of Parallel Programming


In recent years, deep learning models have been successfully applied to large-scale data analysis tasks such as image classification, video captioning, and natural language processing. Such workloads rely on parallel computing to accelerate model training, and data parallelism has become the dominant approach because of its high throughput. Synchronous stochastic gradient descent (SGD) is a well-established way to ensure model convergence, but the overhead of gradient synchronization grows linearly with the number of workers, wasting a large amount of time. Although some efficiency-first asynchronous methods have been proposed, they cannot guarantee convergence in large-scale distributed training. To solve this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the new gradient is being synchronized, so that computation and synchronization overlap. Because updating with a stale gradient affects convergence, we also propose a novel adaptive exponential smoothing predicted gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to preserve normal convergence of the training process. Experiments show that our method speeds up training and achieves accuracy comparable to standard synchronous SGD. In addition, our method has better weak scalability than traditional synchronous SGD and the methods in previous related work. We apply our method to image recognition and video captioning applications on up to 12,288 cores of Tianhe II, with strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability on 128 nodes. We obtain 93.4% weak scaling efficiency on 64 nodes and 90.5% on 128 nodes.
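The abstract does not spell out the update rule, so the following is only a minimal single-process sketch of the idea under stated assumptions. Each step computes a fresh gradient, but the weights are updated with the gradient from the previous step, whose synchronization would have overlapped the current computation in a real system. An exponentially smoothed history gradient is then used as the update direction. The function name `pseudo_sync_train`, the toy quadratic loss, the simulated all-reduce, and in particular the cosine-based rule for the confidence coefficient `beta` are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def average(vectors):
    """Element-wise mean of equally sized vectors (stands in for an all-reduce)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def pseudo_sync_train(target, workers=4, steps=200, lr=0.1, noise=0.05, seed=0):
    """Minimize 0.5 * ||w - target||^2 using one-step-stale averaged gradients."""
    rng = random.Random(seed)
    w = [0.0] * len(target)
    in_flight = None  # gradient whose synchronization overlaps the current step
    history = None    # exponentially smoothed history gradient

    for _ in range(steps):
        # Every simulated worker computes a noisy gradient at the current weights.
        local = [
            [(wi - ti) + rng.gauss(0.0, noise) for wi, ti in zip(w, target)]
            for _ in range(workers)
        ]
        fresh = average(local)  # in a real system this would finish one step later

        if in_flight is not None:
            stale = in_flight  # synchronized gradient from the previous step
            if history is None:
                history = stale
            else:
                # ASSUMPTION (not from the paper): trust the history more when it
                # agrees in direction with the newest synchronized gradient.
                beta = 0.5 * max(0.0, cosine(history, stale))
                history = [beta * h + (1.0 - beta) * s
                           for h, s in zip(history, stale)]
            w = [wi - lr * hi for wi, hi in zip(w, history)]

        in_flight = fresh

    return w
```

Calling `pseudo_sync_train([1.0, -2.0, 0.5])` returns weights close to the target vector on this toy problem, which illustrates only that a one-step-stale update can still converge when the step size is small. It says nothing about the scaling results reported above.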



  1. In the following, SGD refers to SGD with momentum.
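For readers unfamiliar with the term, one common convention for a momentum SGD step can be sketched as below. The function name `sgd_momentum_step` and the default values of `lr` and `momentum` are illustrative assumptions. The article itself does not state which momentum formulation or hyperparameters it uses.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step on plain Python lists.

    The velocity accumulates past gradients, and the weights move along it:
        v <- momentum * v + grad
        w <- w - lr * v
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    new_w = [wi - lr * v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity


def minimize_quadratic(w0, steps=300):
    """Toy usage: minimize 0.5 * ||w||^2, whose gradient at w is w itself."""
    w, velocity = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, grad=w, velocity=velocity)
    return w
```

Other formulations scale the gradient by the learning rate before adding it to the velocity. For a constant learning rate the two produce the same weight trajectory.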




This research was supported by the Natural Science Foundation of China under Grant No. U1811464, and was also supported in part by the Guangdong Natural Science Foundation under Grant No. 2018B030312002, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, and in part by the CCF-Baidu Open Fund of 2021032.

Author information


Investigation, software, writing of the original draft, and editing were performed by YW. Conceptualization and methodology were carried out by YW and ZQ. Review and editing of the manuscript were performed by all authors.

Corresponding author

Correspondence to Nong Xiao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

This study does not contain any studies with human participants or animals performed by any of the authors.


About this article

Cite this article

Wen, Y., Qiu, Z., Zhang, D. et al. Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method. Int J Parallel Prog (2023).
