Abstract
Communication-avoiding algorithms for linear algebra have become increasingly popular, in particular for distributed-memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, which makes data reshuffling a prerequisite for using them. For performance reasons, a straightforward all-to-all exchange must be avoided.
Here, we show that process relabeling (i.e., permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that the optimal relabeling can be found efficiently by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly optimised algorithm implements \(A=\alpha \cdot {\text {op}}(B) + \beta \cdot A,\ {\text {op}} \in \{{\text {transpose}}, {\text {conjugate-transpose}}, {\text {identity}}\}\) on distributed systems, where A and B are matrices with potentially different (distributed) layouts and \(\alpha , \beta \) are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. Moreover, our algorithm can be easily generalized to more generic problems, making it suitable for distributed Machine Learning applications. The implementation is not only several times faster than the best available ScaLAPACK redistribute and transpose routines, but can also handle more general matrix layouts; in particular, it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. In this way, we show that COSTA can unlock the full potential of recent linear algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.
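The relabeling idea can be illustrated with a small sketch. It is not COSTA's implementation or API: the function name `best_relabeling`, the `volume` matrix and its values are hypothetical, the network is assumed homogeneous (cost equals the number of elements sent), and the Linear Assignment Problem is solved by exhaustive search over permutations rather than by a Hungarian-type solver, which is only feasible for a handful of processes.

```python
from itertools import permutations


def best_relabeling(volume):
    """Pick the process relabeling that keeps the most data local.

    volume[i][j] is the number of elements that process i holds in the
    initial layout and that belong to block j of the final layout
    (hypothetical input for this sketch).

    Returns (assignment, remote_volume), where assignment[j] is the
    process that takes over final block j and remote_volume is the
    number of elements that still have to be sent over the network.
    """
    n = len(volume)
    total = sum(sum(row) for row in volume)
    best_perm, best_local = None, -1
    # Exhaustive search over all n! relabelings: a stand-in for a real
    # assignment solver, maximising the total weight of the matching.
    for perm in permutations(range(n)):
        local = sum(volume[perm[j]][j] for j in range(n))
        if local > best_local:
            best_perm, best_local = perm, local
    return list(best_perm), total - best_local


# Toy example with 3 processes (made-up volumes).
volume = [
    [1, 8, 1],
    [2, 1, 7],
    [9, 0, 1],
]

# Without relabeling, process j owns final block j.
identity_remote = sum(sum(row) for row in volume) - sum(
    volume[i][i] for i in range(len(volume))
)
assignment, remote = best_relabeling(volume)

print("remote volume without relabeling:", identity_remote)
print("relabeling (block -> process):", assignment)
print("remote volume with relabeling:", remote)
```

In this toy case, most of each process's data already sits where some other label would need it, so permuting the labels in the final layout removes most of the traffic. A heterogeneous network would replace the plain element counts with weights that reflect per-link latency and bandwidth.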
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kabić, M., Pintarelli, S., Kozhevnikov, A., VandeVondele, J. (2021). COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling. In: Chamberlain, B.L., Varbanescu, A.L., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_12
DOI: https://doi.org/10.1007/978-3-030-78713-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78712-7
Online ISBN: 978-3-030-78713-4
eBook Packages: Computer Science (R0)