
Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment

  • Miro Hodak
  • Ajay Dholakia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11135)

Abstract

Tensorflow (TF) is a highly popular Deep Learning (DL) software framework. Neural network training, a critical part of the DL workflow, is a computationally intensive process that can take days or even weeks. Achieving faster training times is therefore an active area of research and practice. TF supports multi-GPU parallelization, both within a single machine and across multiple physical servers. However, the distributed case is difficult to set up, and consequently almost all published performance data comes from the single-machine use case. To fill this gap, we benchmark Tensorflow in a GPU-equipped distributed environment. Our work evaluates the performance of various hardware and software combinations. In particular, we examine several types of interconnect technologies to determine their impact on performance. Our results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide deep learning training performance comparable to that of specialized machines designed for AI workloads.
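
As an illustration of why the distributed case is harder to use than single-machine training, below is a minimal sketch of TensorFlow 1.x-era between-graph distributed training, the framework's native multi-node mechanism at the time of this work. The hostnames (node0, node1), ports, and the toy model are illustrative assumptions, not the authors' benchmark setup.

# Minimal sketch of TensorFlow 1.x native distributed (between-graph)
# training. Hostnames, ports, and the toy model are illustrative only.
# Each process must be launched separately with its own JOB_NAME and
# TASK_INDEX, which is part of what makes this mode cumbersome.
import tensorflow as tf

JOB_NAME = "worker"   # "ps" for parameter-server processes
TASK_INDEX = 0        # unique per process within its job

# Every process must agree on the full cluster layout.
cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                 # parameter server(s)
    "worker": ["node0:2223", "node1:2223"],   # GPU workers
})
server = tf.train.Server(cluster, job_name=JOB_NAME, task_index=TASK_INDEX)

if JOB_NAME == "ps":
    server.join()  # parameter servers just host variables
else:
    # Place variables on the ps job and computation on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % TASK_INDEX,
            cluster=cluster)):
        x = tf.random_normal([64, 1000])               # dummy input batch
        w = tf.get_variable("w", [1000, 10])           # shared variable on ps
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # MonitoredTrainingSession handles chief vs. non-chief initialization.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(TASK_INDEX == 0)) as sess:
        for _ in range(100):
            sess.run(train_op)

Every participating process must list the entire cluster and be started with the correct job name and task index, typically by hand or by a job scheduler; this per-process bookkeeping, on top of interconnect tuning, is much of the setup burden the paper's benchmarks navigate.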

Keywords

Tensorflow · Deep learning · GPU · Distributed computing · Performance

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Lenovo, Data Center Group, Morrisville, USA
