Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters

  • Hari Subramoni
  • Ammar Ahmad Awan
  • Khaled Hamidouche
  • Dmitry Pekurovsky
  • Akshay Venkatesh
  • Sourav Chakraborty
  • Karen Tomko
  • Dhabaleswar K. Panda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9137)


Several techniques have been proposed in the past for designing non-blocking collective operations on high-performance clusters. Some required a dedicated process/thread or periodic probing to progress the collective, while others relied on specialized hardware. The former approach, while applicable to any generic HPC cluster, has the drawback of stealing CPU cycles from the compute task. The latter delivers near-perfect overlap but increases the total cost of an HPC installation due to the need for specialized hardware, and has other drawbacks that limit its applicability. Meanwhile, Remote Direct Memory Access (RDMA) technology and high-performance networks have been pushing the envelope of HPC performance to multi-petaflop levels. However, no scholarly work has explored the impact RDMA technology can have on the design of non-blocking collective primitives. In this paper, we take up this challenge and propose efficient designs of personalized non-blocking collective operations on top of basic RDMA primitives. Our experimental evaluation shows that the proposed designs deliver near-perfect overlap of computation and communication for personalized collective operations on modern HPC systems at scale. At the microbenchmark level, the proposed RDMA-aware collectives deliver latency improvements of up to 89 times for MPI_Igatherv, 3.71 times for MPI_Ialltoall, and 3.23 times for MPI_Iscatter over state-of-the-art designs. We also observe an improvement of up to 19% for the P3DFFT kernel at 8,192 cores on the Stampede supercomputing system at TACC.


Keywords: Non-blocking collectives · Remote Direct Memory Access · HPC · InfiniBand
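The overlap the abstract describes follows the MPI-3 non-blocking collective pattern: initiate the personalized exchange, perform independent computation while the network (ideally with RDMA-driven progress) moves the data, then wait for completion. A minimal sketch of that pattern using MPI_Ialltoall is shown below; the buffer size and the dummy computation are illustrative, and the program requires an MPI installation and a launcher such as mpirun.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;  /* elements per peer (illustrative value) */
    double *sendbuf = malloc(sizeof(double) * count * size);
    double *recvbuf = malloc(sizeof(double) * count * size);
    for (int i = 0; i < count * size; i++)
        sendbuf[i] = rank + i;

    /* Initiate the personalized all-to-all exchange without blocking. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the ongoing exchange;
       with RDMA/hardware-driven progress, no MPI_Test polling is needed
       for the communication to advance. */
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += (double)i * 1e-9;

    /* Complete the collective before reading recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return (acc > 0.0) ? 0 : 1;
}
```

The degree of real overlap achieved by this pattern is exactly what the paper's designs target: with host-driven progress, the computation loop starves the collective; with RDMA-based progress, the exchange completes asynchronously while the CPU computes.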



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Hari Subramoni (1, corresponding author)
  • Ammar Ahmad Awan (1)
  • Khaled Hamidouche (1)
  • Dmitry Pekurovsky (2)
  • Akshay Venkatesh (1)
  • Sourav Chakraborty (1)
  • Karen Tomko (3)
  • Dhabaleswar K. Panda (1)

  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
  2. San Diego Supercomputer Center, San Diego, California, USA
  3. Ohio Supercomputer Center, Columbus, USA
