Optimal distributed parallel algorithms for deep learning framework Tensorflow

Abstract

Since its release, the Tensorflow framework has been widely used in many fields owing to its strengths in deep learning. However, it is still at an early stage: its native distributed implementation scales poorly to large models because it suffers from low utilization of multiple GPUs and slow distributed execution compared with running on a single machine. Reducing training time through parallel models is therefore of great significance. In view of this, we first provide an in-depth analysis of the implementation principles of Tensorflow and identify the bottlenecks of its native distributed parallel modes. Then, two optimal algorithms are designed and implemented based on the data parallelism and model parallelism modes of Tensorflow. For data parallelism, the proposed algorithm replaces the native linear execution mode with a pipeline execution mode; for model parallelism, the native random partitioning mode is replaced by our proposed greedy algorithm. Finally, we built a homogeneous distributed cluster and a heterogeneous distributed cluster to verify the effectiveness of the proposed algorithms. Through a number of comparative experiments, we show that the proposed optimal parallel algorithms reduce model training time by an average of 26.5% (an average 1.5x speedup over the native distributed algorithms) and improve cluster utilization while keeping the same accuracy level as native Tensorflow.
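
The two optimizations summarized above can be illustrated with short sketches. The first shows the general idea behind replacing a linear read-then-train loop with a pipelined one, using the standard tf.data prefetch mechanism; it is a minimal sketch of the technique, not the paper's implementation, and the names (make_dataset, the synthetic data, the toy model) are illustrative assumptions.

```python
# Minimal sketch of pipelined input handling (illustrative only, not the
# paper's algorithm). With prefetch, the next batch is prepared on the host
# while the current training step runs on the device, instead of the
# "linear" pattern of fully reading a batch before training on it.
import tensorflow as tf

def make_dataset(batch_size=64):
    # Synthetic data stands in for a real training set.
    x = tf.random.normal([1024, 32])
    y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(batch_size)
    # prefetch overlaps batch preparation with device compute,
    # so the accelerator does not sit idle between steps.
    return ds.prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(make_dataset(), epochs=1)
```

The second sketch contrasts random partitioning with a greedy, load-balancing heuristic over per-operation cost estimates. The paper's greedy algorithm partitions the Tensorflow model across devices; the version below is a deliberately simplified stand-in with hypothetical operation costs, shown only to make the greedy-versus-random distinction concrete.

```python
import random

def greedy_partition(op_costs, devices):
    """Assign each op to the currently least-loaded device,
    processing ops in descending order of estimated cost."""
    load = {d: 0.0 for d in devices}
    placement = {}
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)  # greedy choice: least-loaded device
        placement[op] = target
        load[target] += cost
    return placement, load

def random_partition(op_costs, devices, seed=0):
    """Baseline: assign each op to a uniformly random device."""
    rng = random.Random(seed)
    load = {d: 0.0 for d in devices}
    placement = {}
    for op, cost in op_costs.items():
        target = rng.choice(devices)
        placement[op] = target
        load[target] += cost
    return placement, load

if __name__ == "__main__":
    ops = {f"op{i}": c for i, c in enumerate([9.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])}
    devs = ["/gpu:0", "/gpu:1"]
    print("greedy load:", greedy_partition(ops, devs)[1])   # roughly balanced
    print("random load:", random_partition(ops, devs)[1])   # often skewed
```

A balanced partition keeps all devices busy for similar amounts of time, which is the same cluster-utilization goal the experiments target.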


Acknowledgements

This research is partially supported by the National Key Research and Development Program of China (No. 2018AAA0103203).

Author information

Corresponding author

Correspondence to Wenhong Tian.

Ethics declarations

Conflict of interest

The authors Yuanlun Xie, Majun He, Tingsong Ma, and Wenhong Tian declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xie, Y., He, M., Ma, T. et al. Optimal distributed parallel algorithms for deep learning framework Tensorflow. Appl Intell 52, 3880–3900 (2022). https://doi.org/10.1007/s10489-021-02588-9
