Scalable PGAS collective operations in NUMA clusters

Abstract

The increasing number of cores per processor is making manycore-based systems pervasive. These systems involve multiple levels of memory in non-uniform memory access (NUMA) nodes and hierarchies of processor cores, connected by complex interconnects that must deliver an ever-growing amount of data to the processing elements. The key to providing data efficiently and scalably is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes increasingly important in these systems in order to avoid the unnecessary synchronization between pairs of processes that arises when collective operations are implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that deliver good performance and scalability in collective operations, based on hierarchical trees, overlapping one-sided communications, message pipelining, and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language that presents a shared memory view across the nodes for programmability while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), shows generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
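
As a rough illustration of the approach described above (a sketch, not the implementation evaluated in this work), the following UPC fragment combines the three ingredients mentioned in the abstract: a two-level hierarchical tree (root to node leaders, then leaders to the threads of their node), one-sided upc_memput transfers that require no matching call on the receiver, and per-chunk pipelining so that the inter-node transfer of one chunk overlaps with the intra-node distribution of the previous one. THREADS_PER_NODE, MSG_LEN and CHUNK are values assumed for the example, thread 0 is taken as the root, threads are assumed to be packed consecutively onto nodes, and the NUMA binding of threads and buffers is omitted.

    /* Sketch of a two-level, one-sided broadcast with per-chunk pipelining.
     * Assumptions (not taken from the paper): thread t runs on node
     * t / THREADS_PER_NODE, thread 0 is the root, and MSG_LEN and CHUNK
     * are arbitrary example sizes (MSG_LEN must not exceed UPC_MAX_BLOCK_SIZE). */
    #include <upc.h>

    #define THREADS_PER_NODE 8
    #define MSG_LEN (64 * 1024)
    #define CHUNK   (16 * 1024)          /* pipeline granularity */
    #define NCHUNKS (MSG_LEN / CHUNK)

    /* One MSG_LEN-byte block with affinity to each thread. */
    shared [MSG_LEN] char buf[THREADS * MSG_LEN];

    void bcast_tree(void)
    {
        /* Private pointer to the calling thread's own block (legal because
         * that block has affinity to this thread). */
        char *mine   = (char *)&buf[MYTHREAD * MSG_LEN];
        int   leader = (MYTHREAD / THREADS_PER_NODE) * THREADS_PER_NODE;

        /* Iteration i: the root pushes chunk i to the remote node leaders
         * while every leader fans chunk i-1 out inside its node, so the
         * inter-node put of one chunk overlaps with the intra-node copies
         * of the previous one. */
        for (int i = 0; i <= NCHUNKS; i++) {
            if (MYTHREAD == 0 && i < NCHUNKS)          /* level 1: root -> leaders */
                for (int l = THREADS_PER_NODE; l < THREADS; l += THREADS_PER_NODE)
                    upc_memput(&buf[l * MSG_LEN + i * CHUNK],
                               mine + i * CHUNK, CHUNK);

            if (MYTHREAD == leader && i > 0)           /* level 2: leader -> its node */
                for (int t = leader + 1; t < leader + THREADS_PER_NODE && t < THREADS; t++)
                    upc_memput(&buf[t * MSG_LEN + (i - 1) * CHUNK],
                               mine + (i - 1) * CHUNK, CHUNK);

            upc_barrier;   /* chunk i has now reached all node leaders */
        }
    }

A production version would derive the node layout from the hierarchy and NUMA information of the actual machine (for example via the binding facilities mentioned above) instead of the fixed THREADS_PER_NODE layout assumed here.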

Acknowledgments

This work was funded by Hewlett-Packard and partially supported by the Ministry of Science and Innovation of Spain under Project TIN2010-16735 and by the Galician Government (Xunta de Galicia, Spain) under the Consolidation Program of Competitive Research Groups and under grant CN2012/211, co-funded with FEDER funds. We gratefully thank Jim Bovay at HP for his valuable support, CESGA for providing access to the Finis Terrae, Superdome and SVG supercomputers, and Forschungszentrum Jülich for providing access to the JuRoPA and JUDGE supercomputers.

Author information

Corresponding author

Correspondence to Damián A. Mallón.

Cite this article

Mallón, D.A., Taboada, G.L., Teijeiro, C. et al. Scalable PGAS collective operations in NUMA clusters. Cluster Comput 17, 1473–1495 (2014). https://doi.org/10.1007/s10586-014-0377-9
