
Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12728)

Abstract

Due to the emergence of AMD GPUs and their adoption in upcoming exascale systems (e.g., Frontier), it is pertinent to have scientific applications and communication middleware ported to and optimized for these systems. The Radeon Open Compute (ROCm) platform is an open-source suite of libraries tailored toward writing high-performance software for AMD GPUs. GPU-aware MPI has been the de facto standard for accelerating HPC applications on GPU clusters. State-of-the-art GPU-aware MPI libraries have evolved over the years to support NVIDIA CUDA platforms. With the recent emergence of AMD GPUs, it is equally important to add support for the AMD ROCm platform; existing MPI libraries do not have native support for ROCm-aware communication. In this paper, we take up the challenge of designing a ROCm-aware MPI runtime within the MVAPICH2-GDR library. We design an abstract communication layer to interface with the CUDA and ROCm runtimes. We exploit hardware features such as PeerDirect, ROCm IPC, and large-BAR mapped memory to orchestrate efficient GPU-based communication, and we augment these mechanisms with software-based schemes that yield optimized communication performance. We evaluate the performance of MPI-level point-to-point and collective operations with our proposed ROCm-aware MPI library and Open MPI with UCX on a cluster of AMD GPUs. We demonstrate 3–6× and 2× higher bandwidth for intra- and inter-node communication, respectively. With the rocHPCG application, we demonstrate approximately 2.2× higher GFLOP/s. To the best of our knowledge, this is the first research work that studies the tradeoffs involved in designing a ROCm-aware MPI library for AMD GPUs.
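For concreteness, "ROCm-aware MPI" means that an application can pass HIP device pointers (allocated with hipMalloc) directly to MPI calls and let the library perform the GPU data movement internally. The following minimal sketch illustrates this usage model with standard MPI and HIP runtime calls; it is an illustrative example only and does not reproduce the MVAPICH2-GDR internals described in the paper.

```c
/* Minimal sketch of ROCm-aware MPI usage: device pointers are passed
 * directly to MPI point-to-point calls, with no explicit host staging.
 * Illustrative only; assumes an MPI library built with ROCm-aware support. */
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                 /* 1 Mi doubles per buffer */
    double *d_buf = NULL;
    hipMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        hipMemset(d_buf, 0, n * sizeof(double));
        /* The device pointer goes straight into MPI_Send. */
        MPI_Send(d_buf, (int)n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without GPU awareness, the application would have to copy the buffer to host memory before every send and back to the device after every receive; the designs studied in the paper move that work, and the choice among PeerDirect, IPC, and pipelined copies, into the MPI runtime.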
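ROCm IPC, one of the hardware/runtime features named above, lets a process map a peer process's device allocation into its own address space for direct intra-node GPU-to-GPU copies. The sketch below shows only the underlying HIP primitives (hipIpcGetMemHandle, hipIpcOpenMemHandle, hipIpcCloseMemHandle); how the handle is exchanged and when the copy is scheduled inside an MPI library are design decisions of the paper, and the two-rank, single-node setup here is an assumption made for illustration.

```c
/* Sketch of the ROCm IPC primitive behind intra-node GPU transfers.
 * Assumes exactly two MPI ranks running on the same node; the IPC handle
 * is exchanged over MPI as raw bytes purely for illustration. */
#include <hip/hip_runtime.h>
#include <mpi.h>

void ipc_copy_demo(int rank, size_t bytes) {
    if (rank == 0) {
        void *d_src = NULL;
        hipMalloc(&d_src, bytes);

        hipIpcMemHandle_t handle;
        hipIpcGetMemHandle(&handle, d_src);   /* export the allocation */
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);          /* peer has finished copying */
        hipFree(d_src);
    } else if (rank == 1) {
        hipIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        void *d_peer = NULL;                  /* rank 0's buffer, mapped locally */
        hipIpcOpenMemHandle(&d_peer, handle, hipIpcMemLazyEnablePeerAccess);

        void *d_dst = NULL;
        hipMalloc(&d_dst, bytes);
        hipMemcpy(d_dst, d_peer, bytes, hipMemcpyDeviceToDevice);

        hipIpcCloseMemHandle(d_peer);
        hipFree(d_dst);
        MPI_Barrier(MPI_COMM_WORLD);
    }
}
```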

This research is supported in part by NSF grants #1818253, #1854828, #1931537, #2007991, #2018627, and XRAC grant #NCR-130002.




Author information


Correspondence to Jahanzeb Hashmi, Ching-Hsiang Chu, Chen-Chun Chen, Hari Subramoni, or Dhabaleswar K. Panda.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Shafie Khorassani, K., Hashmi, J., Chu, C.-H., Chen, C.-C., Subramoni, H., Panda, D.K. (2021). Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences. In: Chamberlain, B.L., Varbanescu, A.-L., Ltaief, H., Luszczek, P. (eds.) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol. 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_7


  • DOI: https://doi.org/10.1007/978-3-030-78713-4_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78712-7

  • Online ISBN: 978-3-030-78713-4

