International Journal of Parallel Programming, Volume 41, Issue 5, pp 682–703

Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters

  • Zheng Gu
  • Matthew Small
  • Xin Yuan
  • Aniruddha Marathe
  • David K. Lowenthal


Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance, since most communications in MPI applications are performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity of communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations. We propose protocol customization, that is, using different protocols for different situations, to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can collectively provide high performance for all protocol invocation scenarios. Armed with this set of protocols, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the performance of the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can outperform traditional communication schemes by a large margin in many situations.


Keywords: MPI; Point-to-point communication; Protocol customization



This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Zheng Gu (1)
  • Matthew Small (1)
  • Xin Yuan (1)
  • Aniruddha Marathe (2)
  • David K. Lowenthal (2)

  1. Department of Computer Science, Florida State University, Tallahassee, USA
  2. Department of Computer Science, University of Arizona, Tucson, USA
