Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs

  • Nikolaos Drosinos
  • Nectarios Koziris
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2840)


Parallelizing nested-loop algorithms onto popular multi-level parallel architectures, such as clusters of SMPs, is not trivial, since the data dependencies in the algorithm impose severe restrictions on the task decomposition that can be applied. In this paper we propose three techniques for parallelizing such algorithms: pure MPI parallelization, fine-grain hybrid MPI/OpenMP parallelization, and coarse-grain hybrid MPI/OpenMP parallelization. We further apply an advanced hyperplane scheduling scheme that enables pipelined execution and the overlapping of communication with useful computation, leading to almost full CPU utilization. We implement the three variations and run a number of micro-kernel benchmarks to test the intuition that the hybrid programming model could exploit the characteristics of an SMP cluster more efficiently than the pure message-passing programming model. We conclude that the overall performance of each model depends on both the application and the hardware, and we propose some directions for improving the efficiency of the hybrid model.
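As a rough illustration of the hyperplane scheduling idea mentioned in the abstract, the sketch below tiles a two-dimensional iteration space and executes the tiles wavefront by wavefront. It is not the authors' implementation: the recurrence `a[i][j] = a[i-1][j] + a[i][j-1]`, the tile size, and the function names `sequential`, `compute_tile`, and `hyperplane_schedule` are all assumptions chosen for the example, and the overlap of communication with computation is not modelled. The point shown is only that tiles with the same value of `ti + tj` have no dependencies on each other, so they could in principle be assigned to different processes or threads in the same time step.

```python
def sequential(n, m):
    """Reference result: plain row-major execution of the nested loop."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def compute_tile(a, ti, tj, tile):
    """Execute all iterations of tile (ti, tj) in row-major order."""
    n, m = len(a), len(a[0])
    for i in range(max(1, ti * tile), min(n, (ti + 1) * tile)):
        for j in range(max(1, tj * tile), min(m, (tj + 1) * tile)):
            a[i][j] = a[i - 1][j] + a[i][j - 1]


def hyperplane_schedule(n, m, tile):
    """Run tiles in wavefront order: tile (ti, tj) runs at step ti + tj.

    Returns the computed array and, for each time step, the list of
    tiles executed in that step (hypothetical bookkeeping for the demo).
    """
    a = [[1] * m for _ in range(n)]
    tiles_i = (n + tile - 1) // tile
    tiles_j = (m + tile - 1) // tile
    steps = []
    for t in range(tiles_i + tiles_j - 1):
        # All tiles on the hyperplane ti + tj == t are mutually independent.
        wave = [(ti, t - ti) for ti in range(tiles_i) if 0 <= t - ti < tiles_j]
        for ti, tj in wave:  # this inner loop is the part that could run in parallel
            compute_tile(a, ti, tj, tile)
        steps.append(wave)
    return a, steps


if __name__ == "__main__":
    result, steps = hyperplane_schedule(7, 9, 3)
    print(result == sequential(7, 9))
    for t, wave in enumerate(steps):
        print(t, wave)
```

In this toy setting the schedule needs `tiles_i + tiles_j - 1` time steps instead of `tiles_i * tiles_j` sequential tile executions, which is the pipelining effect the abstract refers to.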


Hybrid Model, Message Passing Interface, Iteration Space, Programming Paradigm, Alternate Direction Implicit
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Nikolaos Drosinos (1)
  • Nectarios Koziris (1)
  1. School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens, Zografou, Athens, Greece
