Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters

  • Conference paper
High Performance Computing (ISC High Performance 2015)


Several techniques have been proposed in the past for designing non-blocking collective operations on high-performance clusters. Some of them required a dedicated process or thread, or periodic probing, to progress the collective, while others needed specialized hardware solutions. The former approach, while applicable to any generic HPC cluster, had the drawback of stealing CPU cycles away from the compute task. The latter gave near perfect overlap but increased the total cost of the HPC installation due to the need for specialized hardware, and it also had other drawbacks that limited its applicability. Meanwhile, Remote Direct Memory Access (RDMA) technology and high-performance networks have been pushing the envelope of HPC performance to multi-petaflop levels. However, no scholarly work exists that explores the impact RDMA technology can have on the design of non-blocking collective primitives. In this paper, we take up this challenge and propose efficient designs of personalized non-blocking collective operations on top of the basic RDMA primitives. Our experimental evaluation shows that the proposed designs deliver near perfect overlap of computation and communication for personalized collective operations on modern HPC systems at scale. At the microbenchmark level, the proposed RDMA-Aware collectives deliver latency improvements of up to 89 times for MPI_Igatherv, 3.71 times for MPI_Ialltoall, and 3.23 times for MPI_Iscatter over the state-of-the-art designs. We also observe an improvement of up to 19% for the P3DFFT kernel at 8,192 cores on the Stampede supercomputing system at TACC.
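To make "near perfect overlap" concrete, the sketch below shows one common way of turning three timings into an overlap percentage: the share of the pure communication time that is hidden behind computation. It is an illustration only and not necessarily the metric used in the paper. The function name `overlap_percent` and its three parameters are hypothetical, and the timings are assumed to come from a benchmark that measures the collective alone, the compute phase alone, and the two together.

```python
def overlap_percent(t_pure_comm, t_compute, t_total):
    """Estimate how much of a non-blocking collective was hidden by compute.

    All names here are illustrative assumptions, not taken from the paper.
      t_pure_comm: time of the collective with no computation in between
      t_compute:   time of the compute phase on its own
      t_total:     time of initiating the collective, computing, and waiting
    Returns a value clamped to the range [0, 100].
    """
    if t_pure_comm <= 0:
        raise ValueError("t_pure_comm must be positive")
    # Communication time that was NOT hidden behind the compute phase.
    exposed = max(0.0, t_total - t_compute)
    pct = 100.0 * (1.0 - exposed / t_pure_comm)
    # Clamp so that measurement noise cannot push the result out of range.
    return max(0.0, min(100.0, pct))
```

With this definition, a run whose total time equals the compute time alone scores 100, and a run whose total time equals compute time plus the full communication time scores 0.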

This research is supported in part by National Science Foundation grants #CCF-1213084, #CNS-1419123, and #IIS-1447804.

Corresponding author

Correspondence to Hari Subramoni.

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Subramoni, H. et al. (2015). Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20118-4

  • Online ISBN: 978-3-319-20119-1

  • eBook Packages: Computer Science, Computer Science (R0)
