
More Effective Distributed Deep Learning Using Staleness Based Parameter Updating

  • Conference paper
  • In: Algorithms and Architectures for Parallel Processing (ICA3PP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11335)

Abstract

Deep learning has been widely applied for various purposes, especially big data analysis. However, the computation required for deep learning keeps growing larger and more complex. To accelerate the training of large-scale deep networks, various distributed parallel training protocols have been proposed. In this paper, we design a novel asynchronous training protocol, Weighted Asynchronous Parallel (WASP), to update neural network parameters more effectively. The core of WASP is “gradient staleness”, a metric based on parameter version numbers that is used to weight gradients and reduce the influence of stale parameters. Moreover, through periodic forced synchronization of parameters, WASP combines the advantages of synchronous and asynchronous training models and speeds up training with a rapid convergence rate. We conduct experiments with two classical convolutional neural networks, LeNet-5 and ResNet-101, on the Tianhe-2 supercomputer, and the results show that WASP achieves much higher acceleration than existing asynchronous parallel training protocols.

This research is partially supported by the National Key Research and Development Program of China (Nos. 2016YFB0200404 and 2018YFB0203803), the National Natural Science Foundation of China (No. U1711263), the MOE-CMCC Joint Research Fund of China (No. MCM20160104), and the Program of Science and Technology of Guangdong (No. 2015B010111001).
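The abstract describes the mechanism but not the exact update rule, so the following minimal Python sketch only illustrates the general idea of staleness-based weighting on a parameter server: each gradient is tagged with the parameter version it was computed from, and staler gradients are scaled down before being applied. The class name, the 1 / (staleness + 1) weight, and the sync_period barrier placeholder are assumptions made for illustration, not the WASP formulation from the paper.

```python
# Illustrative sketch only: the exact WASP weighting function is defined in the
# paper body, not in this abstract, so the 1 / (staleness + 1) factor below is
# an assumption used purely to demonstrate staleness-based parameter updating.
import numpy as np


class StalenessAwareServer:
    """Single-process mock of a parameter server that down-weights stale gradients."""

    def __init__(self, num_params, lr=0.01, sync_period=100):
        self.params = np.zeros(num_params)  # global model parameters
        self.lr = lr                        # learning rate
        self.version = 0                    # global parameter version number
        self.sync_period = sync_period      # forced-synchronization interval

    def pull(self):
        """A worker fetches the current parameters together with their version."""
        return self.params.copy(), self.version

    def push(self, grad, worker_version):
        """Apply a worker's gradient, scaled down according to its staleness."""
        staleness = self.version - worker_version  # versions elapsed since the pull
        weight = 1.0 / (staleness + 1)             # assumed weighting, not WASP's formula
        self.params -= self.lr * weight * grad
        self.version += 1
        if self.version % self.sync_period == 0:
            # Periodic forced synchronization: in a real deployment the server
            # would barrier here until every worker re-pulls the latest
            # parameters; in this single-process sketch it is a no-op.
            pass
        return self.version


# Tiny usage example with two simulated workers pushing gradients of
# different staleness.
server = StalenessAwareServer(num_params=4)
params_a, ver_a = server.pull()
params_b, ver_b = server.pull()
server.push(np.ones(4), ver_a)  # fresh gradient, staleness 0, full weight
server.push(np.ones(4), ver_b)  # now stale by 1 version, weight 1/2
print(server.params, server.version)
```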

Notes

  1. http://nscc-gz.cn/

Author information

Correspondence to Yan Ye.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Ye, Y., Chen, M., Yan, Z., Wu, W., Xiao, N. (2018). More Effective Distributed Deep Learning Using Staleness Based Parameter Updating. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol. 11335. Springer, Cham. https://doi.org/10.1007/978-3-030-05054-2_32

  • DOI: https://doi.org/10.1007/978-3-030-05054-2_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05053-5

  • Online ISBN: 978-3-030-05054-2

  • eBook Packages: Computer Science, Computer Science (R0)
