Redesigning MPI shared memory communication for large multi-core architecture

Luo, Miao; Wang, Hao; Vienne, Jerome; Panda, Dhabaleswar K.

doi:10.1007/s00450-012-0210-8

Special Issue Paper
Published: 23 May 2012

Volume 28, pages 137–146, (2013)
Computer Science - Research and Development

Miao Luo¹, Hao Wang¹, Jerome Vienne¹ & Dhabaleswar K. Panda¹

Abstract

Modern multi-core platforms are evolving rapidly, with 32 or 64 cores per node. Sharing system resources can increase communication efficiency between processes on the same node, but it also increases contention for those resources. Currently, most MPI libraries are developed for systems with a relatively small number of cores per node. On emerging multi-core systems with hundreds of cores per node, existing shared memory mechanisms in MPI runtimes will suffer from scalability problems, which may limit the benefits gained from such systems. In this paper, we first analyze these problems and then propose a set of new schemes for small and large message transfer over shared memory. The “Shared Tail Cyclic Buffer” scheme reduces the number of read and write operations on shared control structures. The “State-Driven Polling” scheme optimizes message polling by dynamically adjusting the polling frequency for different communication pairs. The “On-Demand Global Shared Memory Pool” dynamically distributes runtime pinned-down memory to bring the benefits of pair-wise buffers to large message transfer and to optimize shared send buffer utilization without increasing total shared memory usage. In micro-benchmark evaluations, the new schemes deliver up to 26 % and 70 % improvement in point-to-point latency and bandwidth, respectively. For applications, the new schemes achieve 18 % improvement on the 64-core/node Bulldozer system for the Graph500 benchmark and up to 11 % improvement for the NAS benchmarks. In a 512-process evaluation on the 32-core Trestles system, the new schemes achieve 16 % improvement for the NAS CG benchmark.
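
To make the idea of a dynamically adjusted polling frequency concrete, the following is a minimal sketch and not the authors' implementation, whose details the abstract does not give. It assumes a simple back-off policy: a pair that delivers a message is polled on every pass, while a pair found empty has its polling interval doubled up to a cap. The names `PeerState`, `progress`, and `max_interval` are hypothetical, and a Python `deque` stands in for each pair's shared-memory queue.

```python
from collections import deque


class PeerState:
    """Polling state for one communication pair (hypothetical structure)."""

    def __init__(self, max_interval=8):
        self.queue = deque()      # stand-in for the pair's shared-memory queue
        self.interval = 1         # poll this pair every `interval` passes
        self.max_interval = max_interval
        self.countdown = 1        # passes remaining until the next poll
        self.polls = 0            # how many times the queue was actually checked


def progress(peers):
    """Run one pass of the progress loop and return the messages received."""
    received = []
    for rank, peer in peers.items():
        peer.countdown -= 1
        if peer.countdown > 0:
            continue              # this pair is not due for a poll yet
        peer.polls += 1
        if peer.queue:
            received.append((rank, peer.queue.popleft()))
            peer.interval = 1     # active pair: poll on every pass
        else:
            # Idle pair: back off (assumed policy: double, up to a cap).
            peer.interval = min(peer.interval * 2, peer.max_interval)
        peer.countdown = peer.interval
    return received
```

With one busy pair and one idle pair, the busy pair is checked on every pass while the idle pair is checked progressively less often. The cost of this assumed policy is that the first message on a previously idle pair may wait up to `max_interval` passes before it is noticed.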

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Miao Luo, Hao Wang, Jerome Vienne & Dhabaleswar K. Panda

Corresponding author

Correspondence to Miao Luo.

Additional information

This research is supported in part by U.S. Department of Energy grant #DE-FC02-06ER25755; National Science Foundation grants #CCF-0916302, #CCF-0937842 and #OCI-0926691; a grant from the Wright Center for Innovation #WCI04-010-OSU-0; and equipment donations from Intel, Mellanox, AMD, Appro, Chelsio, Dell, Microway, NVIDIA, QLogic, and Sun Microsystems.

About this article

Cite this article

Luo, M., Wang, H., Vienne, J. et al. Redesigning MPI shared memory communication for large multi-core architecture. Comput Sci Res Dev 28, 137–146 (2013). https://doi.org/10.1007/s00450-012-0210-8

Published: 23 May 2012
Issue Date: May 2013
DOI: https://doi.org/10.1007/s00450-012-0210-8
