Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences

  • Kawthar Shafie Khorassani
  • Ching-Hsiang Chu
  • Hari Subramoni
  • Dhabaleswar K. Panda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11887)

Abstract

The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures is empowering the advancement of various High-Performance Computing (HPC) applications, from molecular dynamics simulation to deep learning training. GPU-aware Message Passing Interface (MPI) libraries are among the most efficient means of exploiting the computing power of GPU-enabled HPC systems at scale. However, there is a lack of thorough performance evaluations of GPU-aware MPI libraries that provide insight into the varying costs and benefits of using each one on GPU-enabled OpenPOWER systems. In this paper, we provide a detailed performance evaluation and analysis of point-to-point communication using various GPU-aware MPI libraries, including SpectrumMPI, OpenMPI+UCX, and MVAPICH2-GDR, on GPU-enabled OpenPOWER systems. We demonstrate that all three MPI libraries deliver approximately 95% of the achievable bandwidth for NVLink communication between two GPUs on the same socket. For inter-node communication, where the InfiniBand network dominates the peak bandwidth, MVAPICH2-GDR and SpectrumMPI attain approximately 99% of the achievable bandwidth, while OpenMPI delivers close to 95%. This evaluation is useful in determining which MPI library can provide the highest performance enhancement.
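
To make the measurement setting concrete, below is a minimal sketch of a GPU-aware point-to-point bandwidth test of the kind evaluated in the paper. It assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR, SpectrumMPI, or OpenMPI+UCX) that accepts GPU device pointers directly in MPI calls; the message size, iteration count, and timing loop are illustrative choices, not the exact benchmark configuration used in the evaluation.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Illustrative settings; a full benchmark sweeps message sizes. */
    #define MSG_SIZE (4 * 1024 * 1024)  /* 4 MB per message */
    #define ITERS    100

    int main(int argc, char **argv) {
        int rank;
        void *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the communication buffer in GPU device memory; a
         * GPU-aware MPI library moves it over NVLink (intra-node) or
         * InfiniBand (inter-node) without explicit host staging. */
        cudaMalloc(&d_buf, MSG_SIZE);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(d_buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("~%.2f MB/s unidirectional bandwidth\n",
                   (double)MSG_SIZE * ITERS / (elapsed * 1e6));

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Launched with two processes (e.g., mpirun -np 2), one per GPU, this exercises the same intra-node (NVLink) or inter-node (InfiniBand) paths whose achievable bandwidth is reported above; binding ranks to specific GPUs and sockets is left to the job launcher or to CUDA_VISIBLE_DEVICES.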

Keywords

OpenPOWER · MPI · GPU · NVLink · RDMA

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
