MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de facto parallel programming standard. Currently, data movement with CUDA and MPI libraries is not integrated and is not as efficient as it could be. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner.
In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with one MPI interface that can communicate to and from GPUs. Data movement from the GPU and over the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37% and 30% improvement for Scatter, Gather and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs. We observe up to 45% improvement for Put and Get operations.
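To illustrate why overlapping GPU copies with network transfers can reduce latency, the following is a minimal timing model, not the MVAPICH2 implementation. It assumes a three-stage path (device-to-host copy, network transfer, host-to-device copy), constant bandwidths, and no per-chunk startup cost. The function names, the chunking scheme, and all parameters are hypothetical and chosen only for illustration.

```python
def staged_time(n_bytes, copy_bw, net_bw):
    """Time when the three stages run back-to-back on the whole message.

    Models the non-integrated approach: copy GPU -> host, send over the
    network, then copy host -> GPU, with no overlap between stages.
    """
    return n_bytes / copy_bw + n_bytes / net_bw + n_bytes / copy_bw


def pipelined_time(n_bytes, chunk_bytes, copy_bw, net_bw):
    """Time when the message is split into chunks that flow through the
    three stages as a pipeline, so different chunks overlap in time.

    Simplifying assumption: each stage handles one chunk at a time and
    has no fixed startup overhead.
    """
    # Split the message into chunks; the last chunk may be smaller.
    sizes = [chunk_bytes] * (n_bytes // chunk_bytes)
    if n_bytes % chunk_bytes:
        sizes.append(n_bytes % chunk_bytes)

    # stage_free[i] is the time at which stage i finished its previous chunk.
    stage_free = [0.0, 0.0, 0.0]
    for size in sizes:
        costs = [size / copy_bw, size / net_bw, size / copy_bw]
        ready = 0.0  # time at which this chunk left the previous stage
        for stage, cost in enumerate(costs):
            start = max(ready, stage_free[stage])
            stage_free[stage] = start + cost
            ready = stage_free[stage]
    return stage_free[2]


if __name__ == "__main__":
    # Hypothetical numbers: 4 MiB message, 256 KiB chunks,
    # 5 GB/s copy bandwidth, 3 GB/s network bandwidth.
    n = 4 * 1024 * 1024
    print("staged   :", staged_time(n, 5e9, 3e9))
    print("pipelined:", pipelined_time(n, 256 * 1024, 5e9, 3e9))
```

With a single chunk the pipelined model reduces to the staged one; with smaller chunks the total time approaches that of the slowest stage. A real implementation also pays a fixed cost per chunk, which this sketch ignores, so in practice the chunk size is a trade-off rather than "smaller is always better".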
Keywords: MPI, Clusters, GPGPU, CUDA, InfiniBand