MPC: A Unified Parallel Runtime for Clusters of NUMA Machines

  • Marc Pérache
  • Hervé Jourdren
  • Raymond Namyst
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5168)


Over the last decade, the Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed-memory architectures such as clusters. However, the architecture of cluster nodes is currently evolving from small symmetric shared-memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware. Although regular MPI implementations use numerous optimizations to achieve zero-copy, cache-oblivious data transfers within shared-memory nodes, they may prevent applications from exploiting most of the hardware’s performance, simply because the scheduling of heavyweight processes is not flexible enough to dynamically fit the underlying hardware topology. This explains why several research efforts have investigated hybrid approaches that mix message passing between nodes with memory sharing inside nodes, such as MPI+OpenMP solutions [1,2]. However, these approaches require considerable programming effort to adapt or rewrite existing MPI applications.

In this paper, we present the MultiProcessor Communications environment (MPC), which aims to provide programmers with an efficient runtime system for their existing MPI, POSIX Thread or hybrid MPI+Thread applications. The key idea is to use user-level threads instead of processes on multiprocessor cluster nodes in order to increase scheduling flexibility, better control memory allocation and optimize the scheduling of communication flows with other nodes. Most existing MPI applications can run over MPC without modification. We obtained substantial gains (up to 20%) by using MPC instead of a regular MPI runtime on several scientific applications.
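To make the key idea concrete, the following is a minimal sketch of MPI-style tasks running as threads inside a single process, so that a message between two tasks on the same node is handed over in shared memory rather than passed between heavyweight processes. It is not MPC's implementation or API: `ThreadComm`, `ring_task` and `run_ring` are hypothetical names introduced for illustration, and Python's `threading` module provides kernel-scheduled threads, used here only as a stand-in for the user-level threads the abstract describes.

```python
import queue
import threading


class ThreadComm:
    """Hypothetical in-process communicator: one mailbox per (dst, src) pair."""

    def __init__(self, size):
        self.size = size
        # _mailboxes[dst][src] holds messages sent from task src to task dst.
        self._mailboxes = [
            [queue.Queue() for _ in range(size)] for _ in range(size)
        ]

    def send(self, src, dst, payload):
        # The object reference is handed over directly; nothing is serialized.
        self._mailboxes[dst][src].put(payload)

    def recv(self, dst, src):
        # Blocks until task src has sent something to task dst.
        return self._mailboxes[dst][src].get()


def ring_task(comm, rank, results):
    """Each task sends its rank to its right neighbour and reads from its left."""
    right = (rank + 1) % comm.size
    left = (rank - 1) % comm.size
    comm.send(rank, right, rank)
    results[rank] = comm.recv(rank, left)


def run_ring(size):
    """Run `size` tasks as threads of one process and collect what each received."""
    comm = ThreadComm(size)
    results = [None] * size
    threads = [
        threading.Thread(target=ring_task, args=(comm, rank, results))
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ring(4))  # [3, 0, 1, 2]
```

Because all tasks share one address space, a runtime built this way is free to decide where each task runs and where its memory is allocated, which is the flexibility the abstract attributes to replacing processes with threads.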


Keywords (added by machine, not by the authors): Message Passing Interface, Collective Communication, Task Migration, Internal Thread, NUMA Node




References

  1. Cappello, F., Etiemble, D.: MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks. SuperComputing (2000)
  2. Smith, L., Bull, M.: Development of mixed mode MPI/OpenMP applications. Scientific Programming (2001)
  3. Van der Steen, A.: Overview of recent supercomputers (2006)
  4. Liu, J., Chandrasekaran, B., Jiang, J., Kini, S., Yu, W., Buntinas, D., Wyckoff, P., Panda, D.: Performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics (2003)
  5. Hoeflinger, J.: Extending OpenMP* to clusters (2006)
  6. Lee, J., Sato, M., Boku, T.: Design and implementation of OpenMPD: An OpenMP-like programming language for distributed memory systems. In: Chapman, B.M., Zheng, W., Gao, G.R., Sato, M., Ayguadé, E., Wang, D. (eds.) IWOMP 2007. LNCS, vol. 4935. Springer, Heidelberg (2008)
  7. Smith, L., Kent, P.: Development and performance of a mixed OpenMP/MPI quantum Monte Carlo code. Concurrency: Practice and Experience (2000)
  8. Kalé, L.: The virtualization model of parallel programming: runtime optimizations and the state of art. In: LACSI (2002)
  9. Huang, C., Lawlor, O., Kalé, L.V.: Adaptive MPI. In: Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (2003)
  10. Demaine, E.: A Threads-Only MPI implementation for the development of parallel programming. In: Proceedings of the 11th International Symposium on High Performance Computing Systems (1997)
  11. Pérache, M.: Contribution à l’élaboration d’environnements de programmation dédiés au calcul scientifique hautes performances. PhD thesis, Bordeaux 1 University (2006)
  12. Namyst, R.: PM2: un environnement pour une conception portable et une exécution efficace des applications parallèles irrégulières. PhD thesis, Lille 1 University (1997)
  13. Abt, B., Desai, S., Howell, D., Perez-Gonzalez, I., McCracken, D.: Next Generation POSIX Threading Project (2002)
  14. Berger, E., McKinley, K., Blumofe, R., Wilson, P.: Hoard: a scalable memory allocator for multithreaded applications. In: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX) (2000)
  15. Torrellas, J., Lam, M.S., Hennessy, J.L.: False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers (1994)
  16. Berger, E., Zorn, B., McKinley, K.: Composing high-performance memory allocators. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (2001)
  17. Del Pino, S., Despres, B., Have, P., Jourdren, H., Piserchia, P.F.: 3D finite volume simulation of acoustic waves in the earth atmosphere. Computers & Fluids (submitted)
  18. Jourdren, H.: HERA: a hydrodynamic AMR platform for multi-physics simulations. In: Adaptive mesh refinement - theory and applications, LNCSE (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Marc Pérache (1)
  • Hervé Jourdren (1)
  • Raymond Namyst (2)

  1. CEA/DAM Île de France, Bruyères-le-Châtel, Arpajon Cedex
  2. Laboratoire Bordelais de Recherche en Informatique 351, Talence cedex
