MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters

Wang, Hao; Potluri, Sreeram; Luo, Miao; Singh, Ashish Kumar; Sur, Sayantan; Panda, Dhabaleswar K.

doi:10.1007/s00450-011-0171-3

Special Issue Paper
Published: 12 April 2011

Volume 26, pages 257–266, (2011)

Computer Science - Research and Development

Hao Wang¹,
Sreeram Potluri¹,
Miao Luo¹,
Ashish Kumar Singh¹,
Sayantan Sur¹ &
Dhabaleswar K. Panda¹

Abstract

Data-parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) have seen a tremendous rise in their application to High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de facto parallel programming standard. Currently, data movement with the CUDA and MPI libraries is not integrated and is not as efficient as it could be. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner.

In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with a single MPI interface that can communicate to and from GPUs. Data movement between the GPU and the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37% and 30% improvement for the Scatter, Gather and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs. We observe up to 45% improvement for Put and Get operations.
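
The overlap of GPU copies and network transfers mentioned in the abstract can be illustrated with a toy cost model. The sketch below is not the MVAPICH2-GPU implementation and does not reproduce the paper's measurements. It assumes a message passes through three stages (device-to-host copy, network transfer, host-to-device copy), that each stage costs a fixed latency plus bytes divided by bandwidth, and that each stage handles one chunk at a time. The function names and the stage parameters are hypothetical.

```python
def staged_time(nbytes, stages):
    """Time to push the whole message through each stage in turn (no overlap).

    stages is a list of (latency, bandwidth) pairs, one per stage.
    """
    return sum(latency + nbytes / bandwidth for latency, bandwidth in stages)


def pipelined_time(nbytes, chunk, stages):
    """Time when the message is split into chunks that flow through the stages.

    A stage starts a chunk only after it has finished the previous chunk
    and the previous stage has finished the same chunk.
    """
    sizes = [chunk] * (nbytes // chunk)
    if nbytes % chunk:
        sizes.append(nbytes % chunk)  # last, shorter chunk

    done = [0.0] * len(stages)  # when each stage finished its latest chunk
    for size in sizes:
        ready = 0.0  # when this chunk left the previous stage
        for j, (latency, bandwidth) in enumerate(stages):
            start = max(ready, done[j])
            done[j] = start + latency + size / bandwidth
            ready = done[j]
    return done[-1]


if __name__ == "__main__":
    # Hypothetical stage costs: (latency in seconds, bandwidth in bytes/second).
    stages = [(10e-6, 5e9), (2e-6, 3e9), (10e-6, 5e9)]
    message = 4 * 1024 * 1024
    print("staged   :", staged_time(message, stages))
    for chunk in (64 * 1024, 256 * 1024, 1024 * 1024):
        print("pipelined:", chunk, pipelined_time(message, chunk, stages))
```

In this model, a single chunk the size of the whole message gives the same time as the staged transfer, and smaller chunks let the stages overlap at the price of paying the per-stage latency once per chunk, so the best chunk size depends on the assumed latencies and bandwidths.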

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
Hao Wang, Sreeram Potluri, Miao Luo, Ashish Kumar Singh, Sayantan Sur & Dhabaleswar K. Panda

Corresponding author

Correspondence to Hao Wang.

Additional information

This research is supported in part by U.S. Department of Energy grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; National Science Foundation grants #CCF-0833169, #CCF-0916302, #OCI-0926691 and #CCF-0937842; a grant from the Wright Center for Innovation, #WCI04-010-OSU-0; and grants from Intel, Mellanox, Cisco, QLogic, and Sun Microsystems.

About this article

Cite this article

Wang, H., Potluri, S., Luo, M. et al. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput Sci Res Dev 26, 257–266 (2011). https://doi.org/10.1007/s00450-011-0171-3

Published: 12 April 2011
Issue Date: June 2011
DOI: https://doi.org/10.1007/s00450-011-0171-3
