Optimizing MPI collective communication by orthogonal structures
MPI collective communication operations to distribute or gather data are used by many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be reduced significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and on a Beowulf cluster with single-processor nodes, respectively. A significant performance improvement can also be obtained on a Cray T3E by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data-parallel implementations of complex application programs significantly without requiring any other change to their computation and communication structure. We present runtime functions for modeling the two-phase realizations and verify that these functions can predict the execution time both for communication operations in isolation and in the context of application programs.
Keywords: MPI communication operation · Orthogonal processor groups · Parallel application · Modeling of communication time
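To make the two-phase idea concrete, the following sketch shows how a broadcast over p = nrows · ncols processes can be restructured with row and column communicators obtained from MPI_Comm_split(). This is a minimal illustration only; the function name two_phase_bcast and the grid parameters nrows and ncols are assumptions for this sketch and do not correspond to the interface of the library discussed in the article.

```c
/* Minimal sketch of a two-phase broadcast over an orthogonal (row/column)
 * processor layout, assuming p = nrows * ncols processes and root rank 0.
 * The function name and grid parameters are illustrative, not taken from
 * the library described in the article. */
#include <mpi.h>

static void two_phase_bcast(void *buf, int count, MPI_Datatype type,
                            int nrows, int ncols, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int row = rank / ncols;   /* position in the logical nrows x ncols grid */
    int col = rank % ncols;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);  /* processes of one row    */
    MPI_Comm_split(comm, col, row, &col_comm);  /* processes of one column */

    /* Phase 1: the root (grid position (0,0)) broadcasts along its column,
     * so that every row holds one copy of the data at column 0. */
    if (col == 0)
        MPI_Bcast(buf, count, type, 0, col_comm);

    /* Phase 2: each column-0 process broadcasts concurrently along its row. */
    MPI_Bcast(buf, count, type, 0, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```

The first phase involves only the nrows processes of column 0, and the second phase proceeds concurrently within all rows of ncols processes each, so each phase operates on a much smaller group than the full set of p processes. Under a simple linear communication cost model (assumed here, not quoted from the article), such a realization would be described by a runtime function of the form T(p, m) ≈ log2(nrows)·(t_s + m·t_w) + log2(ncols)·(t_s + m·t_w), which is the kind of expression that can be fitted and used to predict execution times.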