An Improved Algorithm for (Non-commutative) Reduce-Scatter with an Application

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)

Abstract

The collective reduce-scatter operation in MPI performs an element-wise reduction using a given associative (and possibly commutative) binary operation of a sequence of m-element vectors, and distributes the result in blocks of size $$m_i$$ over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient, butterfly-like algorithms are well-known and implemented in good MPI libraries.
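
To make the operation concrete, the following Python sketch simulates it sequentially. It is an illustration only, not the implementation from the paper: the function names `reduce_scatter_reference` and `recursive_halving` are hypothetical, processors are modelled as list indices, and the second function is a textbook butterfly (recursive halving) that assumes a power-of-two number of processors, equal block sizes, and a commutative operation.

```python
from functools import reduce
import operator


def reduce_scatter_reference(vectors, counts, op):
    """Sequential reference semantics (hypothetical helper).

    vectors[r] is the m-element input of rank r, counts[i] is the block
    size m_i of rank i. Returns the list of result blocks, one per rank.
    """
    m = sum(counts)
    assert all(len(v) == m for v in vectors)
    # Reduce each element position in rank order 0, 1, ..., p-1.
    reduced = [reduce(op, column) for column in zip(*vectors)]
    blocks, start = [], 0
    for c in counts:
        blocks.append(reduced[start:start + c])
        start += c
    return blocks


def recursive_halving(vectors, block_size, op):
    """Simulated butterfly for p = 2^k, equal blocks, commutative op."""
    p = len(vectors)
    assert p >= 1 and p & (p - 1) == 0, "p must be a power of two"
    data = [list(v) for v in vectors]
    lo, hi = [0] * p, [p] * p      # block range [lo, hi) each rank still owns
    d = p // 2
    while d >= 1:
        updated = []
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            # Keep the lower half of the range if bit d of r is 0, else the upper half.
            a, b = (lo[r], mid) if r & d == 0 else (mid, hi[r])
            vec = list(data[r])
            for i in range(a * block_size, b * block_size):
                vec[i] = op(data[r][i], data[partner][i])
            updated.append((vec, a, b))
        for r, (vec, a, b) in enumerate(updated):
            data[r], lo[r], hi[r] = vec, a, b
        d //= 2
    # After log2(p) rounds rank r holds exactly block r.
    return [data[r][r * block_size:(r + 1) * block_size] for r in range(p)]


# Commutative case: both versions agree.
ints = [[10 * r + i for i in range(8)] for r in range(4)]
assert recursive_halving(ints, 2, operator.add) == \
    reduce_scatter_reference(ints, [2, 2, 2, 2], operator.add)

# Non-commutative case (string concatenation): the plain butterfly
# combines operands out of rank order.
strs = [[str(r)] * 4 for r in range(4)]
ref = reduce_scatter_reference(strs, [1, 1, 1, 1], operator.add)
bfly = recursive_halving(strs, 1, operator.add)
```

In the last example the reference result for rank 0 is `"0123"`, while the plain butterfly produces `"0213"`, which illustrates why the well-known algorithm is restricted to commutative operations.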

The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm to the case of non-commutative operations (which is also advantageous for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms to the irregular case where the sizes of the resulting blocks may differ greatly.

For p processors the algorithm requires $$\lceil \log_2 p \rceil + (\lceil \log_2 p \rceil - \lfloor \log_2 p \rfloor)$$ communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size m with $$m = \sum^{p-1}_{i=0}m_i$$ the total running time is $$O(\log p + m)$$, irrespective of whether the $$m_i$$ blocks are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other often used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.
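
As a quick check of the round count for the regular case, the formula above can be evaluated with integer arithmetic. This is a small sketch, and the function name `communication_rounds` is an assumption made for illustration.

```python
def communication_rounds(p):
    """Rounds for the regular case: ceil(log2 p) + (ceil(log2 p) - floor(log2 p))."""
    assert p >= 1
    floor_log = p.bit_length() - 1    # floor(log2 p)
    ceil_log = (p - 1).bit_length()   # ceil(log2 p)
    return ceil_log + (ceil_log - floor_log)


for p in (4, 5, 6, 8, 9):
    print(p, communication_rounds(p))
```

When p is a power of two, the floor and ceiling coincide and the count is just $$\log_2 p$$; otherwise exactly one extra round is added.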

Keywords

Correct Process, Improve Algorithm, Reduction Operation, Result Vector, Regular Case
These keywords were added by machine and not by the authors.
