More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNCS, volume 3241)

Abstract

We present improved algorithms for global reduction operations for message-passing systems. Each of p processors has a vector of m data items, and we want to compute the element-wise "sum" of the p vectors under a given associative function. The result, which is also a vector of m items, is to be stored at either a given root processor (MPI_Reduce) or at all p processors (MPI_Allreduce). A further constraint is that the result must be computed in the same order, and with the same bracketing, for each data item and each processor. Both problems can be solved in O(m + log₂ p) communication and computation time. Such reduction operations are part of MPI (the Message Passing Interface), and the algorithms presented here achieve significant improvements over currently implemented algorithms for the important case where p is not a power of 2. Our algorithm requires ⌈log₂ p⌉ + 1 rounds – one round off from optimal – for small vectors. For large vectors twice the number of rounds is needed, but the communication and computation time is less than 3mβ and (3/2)mγ, respectively, an improvement over the 4mβ and 2mγ achieved by previous algorithms (with the message transfer time modeled as α + mβ, and the reduction-operation execution time as mγ). For p = 3 · 2^n and p = 9 · 2^n with small m ≤ b for some threshold b, and for p = q · 2^n with small q, our algorithm achieves the optimal ⌈log₂ p⌉ number of rounds.
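To make the non-power-of-two problem concrete, the following Python sketch simulates the conventional baseline scheme that the paper improves on (it is not the paper's own algorithm): the r = p − 2^n "extra" processes are first folded into partners in an additional round, a recursive-doubling butterfly then runs among the remaining power-of-two subset, and the result is finally copied back to the extras. For simplicity the sketch assumes a commutative operation such as addition; the paper's stricter requirement of identical order and bracketing on every processor needs more care than shown here.

```python
def allreduce_sim(vectors, op=lambda a, b: a + b):
    """Simulate an allreduce over p processes, each holding one vector.

    Baseline scheme for non-power-of-two p (an extra round before and
    after the butterfly); assumes `op` is associative and commutative.
    """
    p = len(vectors)
    n = 1 << (p.bit_length() - 1)   # largest power of two <= p
    r = p - n                        # number of "extra" processes
    data = [list(v) for v in vectors]

    # Extra round 1: processes n..p-1 fold their vectors into 0..r-1.
    for i in range(r):
        data[i] = [op(a, b) for a, b in zip(data[i], data[n + i])]

    # Recursive doubling (butterfly) among processes 0..n-1:
    # in round k, process i exchanges full vectors with i XOR 2^k.
    d = 1
    while d < n:
        new = [None] * n
        for i in range(n):
            partner = i ^ d
            new[i] = [op(a, b) for a, b in zip(data[i], data[partner])]
        data[:n] = new
        d <<= 1

    # Extra round 2: copy the final result back to the extras.
    for i in range(r):
        data[n + i] = list(data[i])
    return data
```

For p = 6, for example, the fold-in round combines ranks 4 and 5 into ranks 0 and 1, and the butterfly then needs ⌈log₂ 6⌉ = 3 more rounds among ranks 0–3; it is exactly these extra rounds and the doubled traffic for large m that the paper's algorithms reduce.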

Keywords

  • Message Passing Interface
  • Reduction Algorithm
  • Result Vector
  • Reduction Phase
  • Exchange Step

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rabenseifner, R., Träff, J.L. (2004). More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2004. Lecture Notes in Computer Science, vol 3241. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30218-6_13

  • DOI: https://doi.org/10.1007/978-3-540-30218-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23163-9

  • Online ISBN: 978-3-540-30218-6

  • eBook Packages: Springer Book Archive