COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

Kabić, Marko; Pintarelli, Simon; Kozhevnikov, Anton; VandeVondele, Joost

doi:10.1007/978-3-030-78713-4_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12728)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Communication-avoiding algorithms for linear algebra have become increasingly popular, in particular on distributed-memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, which makes data reshuffling key to using them. For performance reasons, a straightforward all-to-all exchange must be avoided.

Here, we show that process relabeling (i.e. permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that an optimal relabeling can be found efficiently by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly optimised algorithm implements \(A=\alpha \cdot {\text {op}}(B) + \beta \cdot A,\ {\text {op}} \in \{{\text {transpose}}, {\text {conjugate-transpose}}, {\text {identity}}\}\) on distributed systems, where \(A\) and \(B\) are matrices with potentially different (distributed) layouts and \(\alpha , \beta \) are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. Moreover, our algorithm can be easily generalized to more generic problems, making it suitable for distributed Machine Learning applications. The implementation not only outperforms the best available ScaLAPACK redistribute and transpose routines several-fold, but can also handle more general matrix layouts; in particular, it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. In this way, we show that COSTA can unlock the full potential of recent linear algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.
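
To make the relabeling idea concrete, the following is a minimal sketch, not the COSTA implementation. It assumes a simplified cost model in which only the total volume of data leaving each process counts and the network is homogeneous; the abstract's heterogeneous case with per-node latency and bandwidth is not modelled. The names `volume`, `communication_volume` and `best_relabeling` are hypothetical, and an exhaustive search over permutations stands in for a proper Linear Assignment Problem solver such as the Hungarian method, so it is only usable for a handful of processes.

```python
from itertools import permutations

def communication_volume(volume, relabeling):
    """Data that must cross the network for a given relabeling.

    volume[i][j] is the amount of data that process i holds in the initial
    layout and that belongs to role j of the final layout (assumed model).
    relabeling[j] is the process that takes over role j.
    Data that a process already holds for its own role stays local.
    """
    total = sum(sum(row) for row in volume)
    local = sum(volume[relabeling[j]][j] for j in range(len(volume)))
    return total - local

def best_relabeling(volume):
    """Brute-force stand-in for a Linear Assignment Problem solver.

    Minimising the communicated volume is the same as maximising the volume
    kept local, i.e. a maximum weight bipartite perfect matching between
    processes and final-layout roles.
    """
    n = len(volume)
    return min(permutations(range(n)),
               key=lambda p: communication_volume(volume, p))

# Toy example with 3 processes: each process mostly holds data that the
# final layout assigns to a different role.
volume = [
    [1, 8, 1],
    [1, 1, 8],
    [8, 1, 1],
]
identity = (0, 1, 2)
best = best_relabeling(volume)
print(communication_volume(volume, identity))  # volume without relabeling
print(best, communication_volume(volume, best))  # volume with relabeling
```

In this toy instance the relabeling keeps the large blocks in place and only the small off-pattern entries are exchanged. For realistic process counts the exhaustive search would be replaced by a polynomial-time assignment solver.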

Author information

Authors and Affiliations

ETH Zürich, Zurich, Switzerland
Marko Kabić, Simon Pintarelli, Anton Kozhevnikov & Joost VandeVondele
Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Marko Kabić, Simon Pintarelli, Anton Kozhevnikov & Joost VandeVondele

Corresponding author

Correspondence to Marko Kabić.

Editor information

Editors and Affiliations

Hewlett Packard Enterprise, Seattle, WA, USA
Bradford L. Chamberlain
University of Amsterdam, Amsterdam, The Netherlands
Ana-Lucia Varbanescu
Extreme Computing Research Center, Thuwal Jeddah, Saudi Arabia
Hatem Ltaief
The University of Tennessee, Knoxville, Knoxville, TN, USA
Piotr Luszczek

About this paper

Cite this paper

Kabić, M., Pintarelli, S., Kozhevnikov, A., VandeVondele, J. (2021). COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling. In: Chamberlain, B.L., Varbanescu, AL., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_12

DOI: https://doi.org/10.1007/978-3-030-78713-4_12
Published: 17 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78712-7
Online ISBN: 978-3-030-78713-4
eBook Packages: Computer Science, Computer Science (R0)
