Scalable PGAS collective operations in NUMA clusters

Abstract

The increasing number of cores per processor is making manycore-based systems pervasive. These systems involve multiple levels of memory in non-uniform memory access (NUMA) nodes and hierarchies of processor cores, connected by complex interconnects that must deliver an ever-growing amount of data to the processing elements. The key to providing data efficiently and scalably is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes increasingly important in these systems in order to avoid the unnecessary synchronization between pairs of processes that arises when collective operations are implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that deliver good performance and scalability in collective operations, based on hierarchical trees, overlapping one-sided communications, message pipelining, and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language that presents a shared memory view across the nodes for programmability while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), shows generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
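
As a rough illustration of the approach described above (a sketch, not the implementation evaluated in this work), the following UPC fragment combines the three ingredients mentioned in the abstract: a two-level hierarchical tree (root to node leaders, then leaders to the threads of their node), one-sided upc_memput transfers that require no matching call on the receiver, and per-chunk pipelining so that the inter-node transfer of one chunk overlaps with the intra-node distribution of the previous one. THREADS_PER_NODE, MSG_LEN and CHUNK are values assumed for the example, thread 0 is taken as the root, threads are assumed to be packed consecutively onto nodes, and the NUMA binding of threads and buffers is omitted.

    /* Sketch of a two-level, one-sided broadcast with per-chunk pipelining.
     * Assumptions (not taken from the paper): thread t runs on node
     * t / THREADS_PER_NODE, thread 0 is the root, and MSG_LEN and CHUNK
     * are arbitrary example sizes (MSG_LEN must not exceed UPC_MAX_BLOCK_SIZE). */
    #include <upc.h>

    #define THREADS_PER_NODE 8
    #define MSG_LEN (64 * 1024)
    #define CHUNK   (16 * 1024)          /* pipeline granularity */
    #define NCHUNKS (MSG_LEN / CHUNK)

    /* One MSG_LEN-byte block with affinity to each thread. */
    shared [MSG_LEN] char buf[THREADS * MSG_LEN];

    void bcast_tree(void)
    {
        /* Private pointer to the calling thread's own block (legal because
         * that block has affinity to this thread). */
        char *mine   = (char *)&buf[MYTHREAD * MSG_LEN];
        int   leader = (MYTHREAD / THREADS_PER_NODE) * THREADS_PER_NODE;

        /* Iteration i: the root pushes chunk i to the remote node leaders
         * while every leader fans chunk i-1 out inside its node, so the
         * inter-node put of one chunk overlaps with the intra-node copies
         * of the previous one. */
        for (int i = 0; i <= NCHUNKS; i++) {
            if (MYTHREAD == 0 && i < NCHUNKS)          /* level 1: root -> leaders */
                for (int l = THREADS_PER_NODE; l < THREADS; l += THREADS_PER_NODE)
                    upc_memput(&buf[l * MSG_LEN + i * CHUNK],
                               mine + i * CHUNK, CHUNK);

            if (MYTHREAD == leader && i > 0)           /* level 2: leader -> its node */
                for (int t = leader + 1; t < leader + THREADS_PER_NODE && t < THREADS; t++)
                    upc_memput(&buf[t * MSG_LEN + (i - 1) * CHUNK],
                               mine + (i - 1) * CHUNK, CHUNK);

            upc_barrier;   /* chunk i has now reached all node leaders */
        }
    }

A production version would derive the node layout from the hierarchy and NUMA information of the actual machine (for example via the binding facilities mentioned above) instead of the fixed THREADS_PER_NODE layout assumed here.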

Acknowledgments

This work was funded by Hewlett-Packard and partially supported by the Ministry of Science and Innovation of Spain under Project TIN2010-16735 and by the Galician Government (Xunta de Galicia, Spain) under the Consolidation Program of Competitive Research Groups and under grant CN2012/211, co-funded with FEDER funds. We gratefully thank Jim Bovay at HP for his valuable support, CESGA for providing access to the Finis Terrae, Superdome and SVG supercomputers, and Forschungszentrum Jülich for providing access to the JuRoPA and JUDGE supercomputers.

Author information

Corresponding author

Correspondence to Damián A. Mallón.

Cite this article

Mallón, D.A., Taboada, G.L., Teijeiro, C. et al. Scalable PGAS collective operations in NUMA clusters. Cluster Comput 17, 1473–1495 (2014). https://doi.org/10.1007/s10586-014-0377-9
