
Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters

International Journal of Parallel Programming

Abstract

Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance, since most communication in MPI applications is carried out by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging, due both to the complexity of communication protocols and to the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and we propose protocol customization, that is, using different protocols in different situations, to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that collectively provide high performance across all protocol invocation scenarios. Armed with this set of protocols, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that selects the appropriate protocol for each communication in an MPI application to maximize performance. We evaluate the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can outperform traditional communication schemes by a wide margin in many situations.
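
As a rough illustration of the customization idea, the C sketch below chooses between two RDMA-based rendezvous variants according to which process reaches the communication call first. The scenario names, protocol names, and the select_protocol function are hypothetical and stand in for the paper's actual protocol set and trace-driven toolkit; a real selector would also be driven by trace data, message size, and memory-registration cost.

    /* Illustrative sketch only: the scenario and protocol names below are
       hypothetical and do not reproduce the paper's actual protocols. */
    #include <stdio.h>
    #include <stddef.h>

    typedef enum {              /* which side reaches the communication call first */
        SENDER_ARRIVES_FIRST,
        RECEIVER_ARRIVES_FIRST
    } comm_scenario_t;

    typedef enum {              /* large-message protocols built on one-sided RDMA */
        RDMA_WRITE_RENDEZVOUS,  /* sender pushes data once the receive buffer is known */
        RDMA_READ_RENDEZVOUS    /* receiver pulls data from the sender's registered buffer */
    } protocol_t;

    /* Pick a protocol per communication, e.g. based on which process is
       expected (from traces) to arrive first for this send/receive pair. */
    static protocol_t select_protocol(comm_scenario_t scenario, size_t msg_size)
    {
        (void)msg_size;         /* a fuller selector would also weigh message size */
        return (scenario == RECEIVER_ARRIVES_FIRST)
                   ? RDMA_WRITE_RENDEZVOUS   /* receive buffer already posted: push */
                   : RDMA_READ_RENDEZVOUS;   /* sender ready first: let the receiver pull */
    }

    int main(void)
    {
        protocol_t p = select_protocol(RECEIVER_ARRIVES_FIRST, (size_t)1 << 20);
        printf("chosen protocol: %s\n",
               p == RDMA_WRITE_RENDEZVOUS ? "RDMA-write rendezvous"
                                          : "RDMA-read rendezvous");
        return 0;
    }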


Notes

  1. According to the MPI specification, MPI_Send blocks until the user buffer can be reused. The definition of \(SS\) and \(SW\) follows this convention.
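
As a minimal, self-contained illustration of this convention (not code from the paper), the fragment below sends a large message with MPI_Send; once the call returns, rank 0 may safely overwrite the buffer, even though the standard does not guarantee that rank 1 has received, or even posted, the matching receive at that point.

    /* Minimal illustration of the MPI_Send buffer-reuse guarantee described in
       the note above; this fragment is not taken from the paper. */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        static char buf[1 << 20];   /* a "large" 1 MB message buffer */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memset(buf, 'a', sizeof(buf));
            /* Blocks until buf may be reused; it does NOT imply that the
               receiver has received, or even posted, the matching MPI_Recv. */
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            memset(buf, 'b', sizeof(buf));   /* safe: buffer reuse is allowed here */
        } else if (rank == 1) {
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }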


Acknowledgments

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.

Author information


Corresponding author

Correspondence to Xin Yuan.


About this article

Cite this article

Gu, Z., Small, M., Yuan, X. et al. Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters. Int J Parallel Prog 41, 682–703 (2013). https://doi.org/10.1007/s10766-013-0242-0
