Cluster Computing, Volume 9, Issue 3, pp 257–279

Optimizing MPI collective communication by orthogonal structures

  • Matthias Kühnemann
  • Thomas Rauber
  • Gudula Rünger


MPI collective communication operations that distribute or gather data are used in many parallel applications from scientific computing, but they may lead to scalability problems because their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. On a Cray T3E, a significant performance improvement can also be obtained by a careful selection of the processor structure. The use of these optimized communication operations can significantly reduce the execution time of data parallel implementations of complex application programs without requiring any other change to the computation and communication structure. We present runtime functions for modeling two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
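The idea of a two-phase realization on an orthogonal processor structure can be pictured with a small simulation. The sketch below is illustrative only and is not taken from the article: it assumes a two-level grid of p1 × p2 processors numbered in row-major order, and the function names `orthogonal_groups` and `two_phase_bcast` as well as the row-then-column ordering are hypothetical choices. It uses no MPI library; plain Python lists stand in for the processor groups.

```python
def orthogonal_groups(p1, p2):
    """Arrange p1 * p2 ranks in a row-major grid (an assumed layout)
    and return the row groups and the column groups."""
    rows = [[r * p2 + c for c in range(p2)] for r in range(p1)]
    cols = [[r * p2 + c for r in range(p1)] for c in range(p2)]
    return rows, cols


def two_phase_bcast(p1, p2, root, value):
    """Simulate a broadcast realized in two phases on the grid.

    Phase 1: the root sends the value to all ranks in its own row group.
    Phase 2: every column group broadcasts from its member that lies in
             the root's row; the column groups are disjoint, so in a real
             system these broadcasts could proceed concurrently.
    Returns a dict mapping each rank to the value it holds afterwards.
    """
    rows, cols = orthogonal_groups(p1, p2)
    buf = {rank: None for rank in range(p1 * p2)}
    buf[root] = value
    root_row = root // p2

    # Phase 1: broadcast inside the root's row group.
    for rank in rows[root_row]:
        buf[rank] = buf[root]

    # Phase 2: broadcast inside each column group, starting from the
    # rank of that column that received the value in phase 1.
    for c, col in enumerate(cols):
        leader = root_row * p2 + c
        for rank in col:
            buf[rank] = buf[leader]
    return buf


if __name__ == "__main__":
    result = two_phase_bcast(p1=3, p2=4, root=5, value="payload")
    print(all(v == "payload" for v in result.values()))
```

In an actual MPI program, comparable row and column groups could be built with a communicator split such as MPI_Comm_split, with one collective call per phase; whether that matches the implementation evaluated in the article is not stated in the abstract.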


MPI communication operation; Orthogonal processor groups; Parallel application; Modeling of communication time





Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Matthias Kühnemann (1)
  • Thomas Rauber (2)
  • Gudula Rünger (1)

  1. Fakultät für Informatik, Technische Universität Chemnitz, Chemnitz, Germany
  2. Fakultät für Mathematik und Physik, Universität Bayreuth, Bayreuth, Germany
