
Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters

International Journal of Parallel Programming

Abstract

Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance, since most communication in MPI applications is carried out by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging, due both to the complexity of communication protocols and to the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and we propose protocol customization, that is, using different protocols in different situations, to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that collectively provide high performance across all protocol invocation scenarios. Armed with this set of protocols, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that selects the appropriate protocol for each communication in an MPI application to maximize performance. We evaluate the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can outperform traditional communication schemes by a wide margin in many situations.
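
As a rough illustration of the customization idea, the C sketch below chooses between two RDMA-based rendezvous variants according to which process reaches the communication call first. The scenario names, protocol names, and the select_protocol function are hypothetical and stand in for the paper's actual protocol set and trace-driven toolkit; a real selector would also be driven by trace data, message size, and memory-registration cost.

    /* Illustrative sketch only: the scenario and protocol names below are
       hypothetical and do not reproduce the paper's actual protocols. */
    #include <stdio.h>
    #include <stddef.h>

    typedef enum {              /* which side reaches the communication call first */
        SENDER_ARRIVES_FIRST,
        RECEIVER_ARRIVES_FIRST
    } comm_scenario_t;

    typedef enum {              /* large-message protocols built on one-sided RDMA */
        RDMA_WRITE_RENDEZVOUS,  /* sender pushes data once the receive buffer is known */
        RDMA_READ_RENDEZVOUS    /* receiver pulls data from the sender's registered buffer */
    } protocol_t;

    /* Pick a protocol per communication, e.g. based on which process is
       expected (from traces) to arrive first for this send/receive pair. */
    static protocol_t select_protocol(comm_scenario_t scenario, size_t msg_size)
    {
        (void)msg_size;         /* a fuller selector would also weigh message size */
        return (scenario == RECEIVER_ARRIVES_FIRST)
                   ? RDMA_WRITE_RENDEZVOUS   /* receive buffer already posted: push */
                   : RDMA_READ_RENDEZVOUS;   /* sender ready first: let the receiver pull */
    }

    int main(void)
    {
        protocol_t p = select_protocol(RECEIVER_ARRIVES_FIRST, (size_t)1 << 20);
        printf("chosen protocol: %s\n",
               p == RDMA_WRITE_RENDEZVOUS ? "RDMA-write rendezvous"
                                          : "RDMA-read rendezvous");
        return 0;
    }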


Notes

  1. According to the MPI specification, MPI_Send blocks until the user buffer can be reused. The definition of \(SS\) and \(SW\) follows this convention.
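
As a minimal, self-contained illustration of this convention (not code from the paper), the fragment below sends a large message with MPI_Send; once the call returns, rank 0 may safely overwrite the buffer, even though the standard does not guarantee that rank 1 has received, or even posted, the matching receive at that point.

    /* Minimal illustration of the MPI_Send buffer-reuse guarantee described in
       the note above; this fragment is not taken from the paper. */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        static char buf[1 << 20];   /* a "large" 1 MB message buffer */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memset(buf, 'a', sizeof(buf));
            /* Blocks until buf may be reused; it does NOT imply that the
               receiver has received, or even posted, the matching MPI_Recv. */
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            memset(buf, 'b', sizeof(buf));   /* safe: buffer reuse is allowed here */
        } else if (rank == 1) {
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }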


Acknowledgments

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.

Author information


Corresponding author

Correspondence to Xin Yuan.


About this article

Cite this article

Gu, Z., Small, M., Yuan, X. et al. Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters. Int J Parallel Prog 41, 682–703 (2013). https://doi.org/10.1007/s10766-013-0242-0
