Locality and Topology Aware Intra-node Communication among Multicore CPUs

  • Teng Ma
  • George Bosilca
  • Aurelien Bouteiller
  • Jack J. Dongarra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6305)


A major trend in HPC is the escalation toward manycore architectures, where systems are composed of shared-memory nodes featuring numerous processing units. Unfortunately, with scale comes complexity, here in the form of non-uniform memory access and deep cache hierarchies. For most HPC applications, harnessing the power of multicore processors is hindered by the topology-oblivious tuning of the MPI library. In this paper, we propose a framework to tune every type of shared-memory communication according to locality and topology. An implementation inside Open MPI is evaluated experimentally and demonstrates significant speedups compared to vanilla Open MPI and MPICH2.
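The framework itself is described in the full paper; as a rough sketch of the kind of locality classification such topology-aware tuning depends on, the C fragment below uses the hwloc library (1.x API) to find the deepest topology object shared by two cores, which a shared-memory transport could then map to a copy strategy. The function classify_pair and the returned labels are hypothetical illustrations, not the authors' implementation.

#include <hwloc.h>
#include <stdio.h>

/* Classify how close two cores are in the machine topology by finding
 * their deepest common ancestor object (hwloc 1.x object names).
 * classify_pair() and the returned labels are illustrative only. */
static const char *classify_pair(hwloc_topology_t topo, unsigned a, unsigned b)
{
    hwloc_obj_t core_a = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, a);
    hwloc_obj_t core_b = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, b);
    if (core_a == NULL || core_b == NULL)
        return "unknown";

    hwloc_obj_t anc = hwloc_get_common_ancestor_obj(topo, core_a, core_b);
    switch (anc->type) {
    case HWLOC_OBJ_CACHE:  return "shared-cache";   /* common L2/L3 cache */
    case HWLOC_OBJ_SOCKET: return "same-socket";    /* same socket, no shared cache */
    case HWLOC_OBJ_NODE:   return "same-numa-node";
    default:               return "cross-numa";     /* only the machine in common */
    }
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* A shared-memory transport could pick, per pair of ranks, a copy
     * protocol suited to this label (e.g. cache-friendly double buffering
     * for "shared-cache", NUMA-aware pipelining for "cross-numa"). */
    printf("cores 0 and 1: %s\n", classify_pair(topo, 0, 1));

    hwloc_topology_destroy(topo);
    return 0;
}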


Keywords: Shared Memory · Message Size · Collective Communication · Cache Hierarchy · Collective Algorithm



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Teng Ma¹
  • George Bosilca¹
  • Aurelien Bouteiller¹
  • Jack J. Dongarra¹
  1. Innovative Computing Laboratory, University of Tennessee Computer Science Department, Knoxville, USA
