International Journal of Parallel Programming, Volume 46, Issue 4, pp 674–685

Improving the Performance of Distributed TensorFlow with RDMA

  • Chengfan Jia
  • Junnan Liu
  • Xu Jin
  • Han Lin
  • Hong An
  • Wenting Han
  • Zheng Wu
  • Mengxian Chi
Part of the following topical collections:
  1. Special issue on Network and Parallel Computing for New Architectures and Applications


TensorFlow is an open-source software library designed for deep learning using dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. Distributed TensorFlow uses gRPC to connect the nodes. However, when training tasks are deployed on high-performance computing (HPC) clusters, gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to a traditional TCP/IP network, but open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we obtain nearly a 6× performance improvement over the original gRPC-based distributed TensorFlow. TensorFlow with RDMA support also scales well as the training scale increases.


Distributed TensorFlow, RDMA, InfiniBand, Optimization



This research was conducted at the Advanced Computer System Architecture (ACSA) Laboratory of the University of Science and Technology of China and was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, China
