High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

Kandalla, Krishna; Subramoni, Hari; Tomko, Karen; Pekurovsky, Dmitry; Sur, Sayantan; Panda, Dhabaleswar K.

doi:10.1007/s00450-011-0170-4

Special Issue Paper
Published: 13 April 2011

Volume 26, pages 237–246 (2011)
Computer Science - Research and Development

Krishna Kandalla¹,
Hari Subramoni¹,
Karen Tomko²,
Dmitry Pekurovsky³,
Sayantan Sur¹ &
Dhabaleswar K. Panda¹

Abstract

Three-dimensional FFT is an important component of many scientific computing applications, ranging from fluid dynamics to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package that uses the Message Passing Interface (MPI) programming model. The performance and scalability of parallel 3D FFT are limited by the time spent in the all-to-all personalized exchange (MPI_Alltoall) operations, so hiding the latency of MPI_Alltoall is critical to scaling P3DFFT. The newest revision of MPI, MPI-3, is widely expected to provide support for non-blocking collective communication to enable latency hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface. Such an interface can be leveraged to design non-blocking collective operations. In this paper, we design a scalable, non-blocking all-to-all personalized exchange algorithm based on this network offload technology. To the best of our knowledge, this is the first paper to propose high-performance non-blocking algorithms for dense collective operations by leveraging InfiniBand's network offload features. We also redesign the P3DFFT library and a sample application kernel to overlap the Alltoall operations with application-level computation. We are able to scale our implementation of the non-blocking Alltoall operation to more than 512 processes, and we achieve near-perfect computation/communication overlap (99%). We also see an improvement of about 23% in the overall run-time of our modified P3DFFT compared to the default blocking version, and an improvement of about 17% compared to the host-based non-blocking Alltoall schemes.
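To make the communication pattern concrete, the following is a minimal single-process sketch of the data movement performed by an all-to-all personalized exchange. It is not the offload-based design described in the abstract and it does not use MPI: the function name `alltoall_exchange` and the list-of-lists buffer layout are assumptions chosen purely for illustration. It only shows the semantics of MPI_Alltoall, where the block that rank i addresses to rank j ends up as block i in rank j's receive buffer.

```python
from typing import List, TypeVar

T = TypeVar("T")


def alltoall_exchange(send_buffers: List[List[T]]) -> List[List[T]]:
    """Simulate the data movement of MPI_Alltoall inside one process.

    send_buffers[i][j] is the block that (hypothetical) rank i sends to rank j.
    The result recv satisfies recv[j][i] == send_buffers[i][j], i.e. the
    exchange is a transpose of the rank-by-rank block matrix.
    """
    nprocs = len(send_buffers)
    for rank, blocks in enumerate(send_buffers):
        if len(blocks) != nprocs:
            raise ValueError(
                f"rank {rank} has {len(blocks)} blocks, expected {nprocs}"
            )
    # Rank j receives, in sender order, the j-th block of every rank.
    return [
        [send_buffers[src][dst] for src in range(nprocs)]
        for dst in range(nprocs)
    ]


# Illustrative data: three ranks, each with one block per destination.
send = [
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2"],
    ["c0", "c1", "c2"],
]
recv = alltoall_exchange(send)
print(recv[0])  # ['a0', 'b0', 'c0']
```

Because every rank exchanges a distinct block with every other rank, the operation is dense, which is why its latency dominates at scale. In a real MPI-3 program the non-blocking form would typically be started with `MPI_Ialltoall`, followed by independent computation, and completed with `MPI_Wait`; how much of the exchange actually progresses during that computation depends on the MPI library and the network hardware.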

Author information

Authors and Affiliations

The Ohio State University, Columbus, OH, USA
Krishna Kandalla, Hari Subramoni, Sayantan Sur & Dhabaleswar K. Panda
The Ohio Supercomputer Center, Columbus, OH, USA
Karen Tomko
San Diego Supercomputer Center, San Diego, CA, USA
Dmitry Pekurovsky

Corresponding author

Correspondence to Krishna Kandalla.

Additional information

This research is supported in part by U.S. Department of Energy grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; National Science Foundation grants #CCF-0833169, #CCF-0916302, #OCI-0926691 and #CCF-0937842; a grant from the Wright Center for Innovation, #WCI04-010-OSU-0; and grants from Intel, Mellanox, Cisco, QLogic, and Sun Microsystems.

About this article

Cite this article

Kandalla, K., Subramoni, H., Tomko, K. et al. High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT. Comput Sci Res Dev 26, 237–246 (2011). https://doi.org/10.1007/s00450-011-0170-4

Issue Date: June 2011
DOI: https://doi.org/10.1007/s00450-011-0170-4
