# An Improved Algorithm for (Non-commutative) Reduce-Scatter with an Application

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)

## Abstract

The collective reduce-scatter operation in MPI performs an element-wise reduction using a given associative (and possibly commutative) binary operation of a sequence of m-element vectors, and distributes the result in mi sized blocks over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient, butterfly-like algorithms are well-known and implemented in good MPI libraries.

The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm also to the case of non-commutative operations (which is advantageous also for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms also to the irregular case where the size of the resulting blocks may differ extremely.

For p processors the algorithm requires ⌈log2p ⌉ + (⌈log2p ⌉ - $$\lfloor log_2p \rfloor$$) communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size m with $$m = \sum^{p-1}_{i=0}m_i$$ the total running time is O(log p + m), irrespective of whether the mi blocks are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other often used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.

## Preview

### References

1. 1.
Bernaschi, M., Iannello, G., Lauria, M.: Efficient implementation of reduce-scatter in MPI. Technical report, University of Napoli (1997)Google Scholar
2. 2.
Gołebiewski, M., Ritzdorf, H., Träff, J.L., Zimmermann, F.: The MPI/SX implementation of MPI for NEC’s SX-6 and other NEC platforms. NEC Research & Development 44(1), 69–74 (2003)Google Scholar
3. 3.
Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E., Nitzberg, B., Saphir, W., Snir, M.: MPI – The Complete Reference, 2nd edn. The MPI Extensions. MIT Press, Cambridge (1998)Google Scholar
4. 4.
Gropp, W.D., Ross, R., Miller, N.: Providing efficient I/O redundancy in MPI environments. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 77–86. Springer, Heidelberg (2004)
5. 5.
Iannello, G.: Efficient algorithms for the reduce-scatter operation in LogGP. IEEE Transactions on Parallel and Distributed Systems 8(9), 970–982 (1997)
6. 6.
Leighton, F.T.: Introduction to Parallel Algorithms and Architechtures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Francisco (1992)
7. 7.
Rabenseifner, R., Träff, J.L.: More efficient reduction algorithms for message-passing parallel systems. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 36–46. Springer, Heidelberg (2004)
8. 8.
Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI – The Complete Reference, 2nd edn. The MPI Core, vol. 1. MIT Press, Cambridge (1998)Google Scholar
9. 9.
Thakur, R., Gropp, W.D., Rabenseifner, R.: Improving the performance of collective operations in MPICH. International Journal on High Performance Computing Applications 19, 49–66 (2004)
10. 10.
Thakur, R., Gropp, W.D., Toonen, B.: Minimizing synchronization overhead in the implementation of MPI one-sided communication. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 57–67. Springer, Heidelberg (2004)
11. 11.
Träff, J.L.: Hierarchical gather/scatter algorithms with graceful degradation. In: International Parallel and Distributed Processing Symposium, IPDPS 2004 (2004)Google Scholar
12. 12.
Träff, J.L., Ritzdorf, H., Hempel, R.: The implementation of MPI-2 one-sided communication for the NEC SX-5. In: Supercomputing (2000), http://www.sc2000.org/proceedings/techpapr/index.htm#01