The Journal of Supercomputing, Volume 35, Issue 1, pp 65–91

The Effect of Process Topology and Load Balancing on Parallel Programming Models for SMP Clusters and Iterative Algorithms

  • Nikolaos Drosinos
  • Nectarios Koziris


This article examines how process topology and load balancing affect several programming models for SMP clusters running iterative algorithms. Specifically, we consider nested loop algorithms with constant flow dependencies that can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models: a popular monolithic message-passing implementation and two hybrid ones that combine message passing with multi-threading. We conclude that selecting an appropriate mapping topology for the mesh of processes has a significant effect on overall performance, and we provide an algorithm that specifies such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load-balancing techniques for distributing computation among threads; these reduce the disadvantage of the master thread handling all inter-process communication, a restriction often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and evaluated experimentally. Finally, we provide an overall comparison of these parallel programming styles on SMP clusters, based on micro-kernel experiments.
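The two optimizations summarized above can be illustrated with a rough sketch. The abstract does not spell out the article's algorithms, so the code below is not the authors' method; it assumes a simple cost model in which the data exchanged along a split dimension is proportional to the area of the tile face normal to that dimension, and in which the master thread's communication time is a known constant expressed in units of computation. All function names and parameters (`factorizations`, `comm_cost`, `best_topology`, `balance_threads`, `dep_weights`, `master_comm`) are hypothetical.

```python
def factorizations(p, ndims):
    """Yield every ordered tuple of ndims positive integers whose product is p."""
    if ndims == 1:
        yield (p,)
        return
    for f in range(1, p + 1):
        if p % f == 0:
            for rest in factorizations(p // f, ndims - 1):
                yield (f,) + rest


def comm_cost(extents, grid, dep_weights):
    """Assumed cost model: for each dimension split across more than one
    process, charge the area of the per-process face normal to that
    dimension, scaled by a weight for the dependencies crossing it."""
    cost = 0.0
    for d, (procs, weight) in enumerate(zip(grid, dep_weights)):
        if procs > 1 and weight > 0:
            face = 1.0
            for k, (n, g) in enumerate(zip(extents, grid)):
                if k != d:
                    face *= n / g
            cost += weight * face
    return cost


def best_topology(extents, nprocs, dep_weights):
    """Pick the process grid with the lowest assumed communication cost."""
    return min(
        factorizations(nprocs, len(extents)),
        key=lambda grid: comm_cost(extents, grid, dep_weights),
    )


def balance_threads(total_work, nthreads, master_comm):
    """Static split in which the master thread (index 0) receives less
    computation, so that its computation plus communication time matches
    the computation time of every other thread."""
    if nthreads == 1:
        return [float(total_work)]
    per_worker = (total_work + master_comm) / nthreads
    master = per_worker - master_comm
    if master < 0:
        # Communication alone exceeds a fair share: master computes nothing.
        master = 0.0
        per_worker = total_work / (nthreads - 1)
    return [master] + [per_worker] * (nthreads - 1)


if __name__ == "__main__":
    print(best_topology((512, 512), 16, (1, 1)))   # (4, 4)
    print(best_topology((1024, 64), 16, (1, 1)))   # (16, 1)
    print(balance_threads(100, 4, 20))             # [10.0, 30.0, 30.0, 30.0]
```

Under this assumed model, a square iteration space with symmetric dependencies favors a square process mesh, while an elongated one favors splitting only the long dimension. The load-balancing helper shows why a uniform split penalizes hybrid models: with a uniform 25-unit share per thread, the master would finish after 45 units rather than 30.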


Keywords: parallel programming, high performance computing, SMP clusters, iterative algorithms, tiling, MPI, OpenMP, hybrid programming




Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  1. National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory, Athens, Greece