Parallel Prefix (Scan) Algorithms for MPI
We describe and experimentally compare four theoretically well-known algorithms for the parallel prefix operation (scan, in MPI terms), and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm. Bidirectional interconnects can benefit from this implementation. We present results from a 32-node AMD cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. The doubly-pipelined algorithm is more than a factor of two faster than the straightforward binomial-tree algorithm found in many MPI implementations. However, because of its small constant factors, the simple linear pipeline algorithm is preferable for systems with a moderate number of processors. We also discuss adapting the algorithms to clusters of SMP nodes.
Keywords: Cluster of SMPs, collective communication, MPI implementation, prefix sum, pipelining
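To make the operation concrete: an inclusive scan such as `MPI_Scan` leaves on rank i the combination of the inputs of ranks 0 through i, taken in rank order. The following is a minimal single-process sketch in Python, not MPI code and not the paper's implementation. The function names `reference_scan` and `doubling_scan` are hypothetical, and the logarithmic-round doubling scheme is used only as an illustrative assumption of how a scan can finish in about log2(p) communication rounds; it is not claimed to be any of the algorithms compared in the paper.

```python
from operator import add


def reference_scan(values, op=add):
    """Sequential inclusive prefix: out[i] = values[0] op ... op values[i]."""
    out = []
    acc = None
    for i, v in enumerate(values):
        acc = v if i == 0 else op(acc, v)
        out.append(acc)
    return out


def doubling_scan(values, op=add):
    """Simulate a scan over p 'ranks' in ceil(log2(p)) rounds.

    In the round with distance d, rank i (for i >= d) combines the partial
    result held by rank i - d with its own. The lower-ranked operand is
    always placed on the left, so the operator only needs to be
    associative, not commutative.
    """
    p = len(values)
    result = list(values)
    rounds = 0
    dist = 1
    while dist < p:
        # Every message in a round carries the value from before the round.
        snapshot = list(result)
        for rank in range(dist, p):
            result[rank] = op(snapshot[rank - dist], snapshot[rank])
        dist *= 2
        rounds += 1
    return result, rounds


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    scanned, rounds = doubling_scan(data)
    print(scanned, rounds)
    # String concatenation is associative but not commutative.
    print(doubling_scan(list("abcde"), op=lambda a, b: a + b))
```

In this sketch a chain in which rank i waits for rank i - 1 would need p - 1 steps, while the doubling scheme needs only about log2(p) rounds. The round count alone does not capture the constant factors or the effect of pipelining large messages, which is what the abstract's comparison is about.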