Parallel Prefix (Scan) Algorithms for MPI

  • Peter Sanders
  • Jesper Larsson Träff
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4192)


We describe and experimentally compare four theoretically well-known algorithms for the parallel prefix operation (scan, in MPI terms), and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm that can exploit bidirectional interconnects. We present results from a 32-node AMD cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. The doubly-pipelined algorithm is more than a factor of two faster than the straightforward binomial-tree algorithm found in many MPI implementations. However, due to its small constant factors, the simple linear-pipeline algorithm is preferable for systems with a moderate number of processors. We also discuss adapting the algorithms to clusters of SMP nodes.
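To make the operation being compared concrete: MPI_Scan computes an inclusive prefix over one value per process, and the binomial-tree (recursive-doubling) variant mentioned above finishes in about log2(p) communication rounds. The following is a minimal sequential Python simulation of that round structure, not the paper's implementation; the function name and the round-by-round bookkeeping are illustrative assumptions.

```python
from itertools import accumulate

def simulated_binomial_scan(values, op=lambda a, b: a + b):
    """Illustrative simulation of a recursive-doubling (binomial-tree)
    inclusive scan: in round k (distance 2**k), each simulated process i
    folds in the partial result of process i - 2**k from the previous
    round. Completes in ceil(log2(p)) rounds for p processes."""
    p = len(values)
    result = list(values)
    k = 1  # communication distance doubles each round
    while k < p:
        prev = list(result)  # snapshot of the previous round's partials
        for i in range(k, p):
            result[i] = op(prev[i - k], prev[i])
        k *= 2
    return result

# One value per simulated process; the result matches a sequential
# inclusive prefix sum.
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(simulated_binomial_scan(data))   # [3, 4, 8, 9, 14, 23, 25, 31]
print(list(accumulate(data)))          # [3, 4, 8, 9, 14, 23, 25, 31]
```

Note that every process participates in every round, so the total work is O(p log p) combine steps; the pipelined algorithms discussed in the paper trade extra rounds for better bandwidth utilization on large vectors.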


Cluster of SMPs, collective communication, MPI implementation, prefix sum, pipelining





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Peter Sanders (1)
  • Jesper Larsson Träff (2)
  1. Universität Karlsruhe, Karlsruhe, Germany
  2. C&C Research Laboratories, NEC Europe Ltd., Sankt Augustin, Germany
