Abstract
The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures is empowering the advancement of various High-Performance Computing (HPC) applications, from molecular dynamics simulation to deep learning training. GPU-aware Message Passing Interface (MPI) libraries are among the most efficient means of exploiting the computing power of GPU-enabled HPC systems at scale. However, thorough performance evaluations of GPU-aware MPI libraries that provide insight into the varying costs and benefits of each on GPU-enabled OpenPOWER systems are lacking. In this paper, we provide a detailed performance evaluation and analysis of point-to-point communication using various GPU-aware MPI libraries, including SpectrumMPI, OpenMPI+UCX, and MVAPICH2-GDR, on GPU-enabled OpenPOWER systems. We demonstrate that all three MPI libraries deliver approximately 95% of achievable bandwidth for NVLink communication between two GPUs on the same socket. For inter-node communication, where the InfiniBand network determines the peak bandwidth, MVAPICH2-GDR and SpectrumMPI attain approximately 99% of achievable bandwidth, while OpenMPI delivers close to 95%. This evaluation helps determine which MPI library can provide the highest performance on a given GPU-enabled OpenPOWER system.
About this paper
Cite this paper
Khorassani, K.S., Chu, C.H., Subramoni, H., Panda, D.K. (2019). Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences. In: Weiland, M., Juckeland, G., Alam, S., Jagode, H. (eds.) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11887. Springer, Cham. https://doi.org/10.1007/978-3-030-34356-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34355-2
Online ISBN: 978-3-030-34356-9
eBook Packages: Computer Science, Computer Science (R0)