COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

  • Conference paper
High Performance Computing (ISC High Performance 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12728)

Abstract

Communication-avoiding algorithms for Linear Algebra have become increasingly popular, in particular for distributed memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, which makes data reshuffling key to using them. For performance reasons, a straightforward all-to-all exchange must be avoided.

Here, we show that process relabeling (i.e. permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that the optimal relabeling can be found efficiently by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly-optimised algorithm implements \(A=\alpha \cdot {\text {op}}(B) + \beta \cdot A,\ {\text {op}} \in \{{\text {transpose}}, {\text {conjugate-transpose}}, {\text {identity}}\}\) on distributed systems, where \(A\) and \(B\) are matrices with potentially different (distributed) layouts and \(\alpha , \beta \) are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. Moreover, our algorithm can be easily generalized to even more generic problems, making it suitable for distributed Machine Learning applications. The implementation not only outperforms the best available ScaLAPACK redistribute and transpose routines multiple times, but is also able to deal with more general matrix layouts; in particular, it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. In this way, we show that COSTA can be used to unlock the full potential of recent Linear Algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.
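As a rough illustration of the relabeling idea in the abstract, the following Python sketch treats relabeling as an assignment problem on a small, made-up volume matrix. Everything in it is an assumption made for illustration: the function name `best_relabeling`, the `volume` matrix, and the uniform per-element cost are hypothetical and do not come from the COSTA code base, and the exhaustive search over permutations stands in for a proper Linear Assignment Problem solver such as the Hungarian method.

```python
from itertools import permutations


def best_relabeling(volume):
    """Pick the relabeling that keeps the most data local.

    volume[i][j] is the amount of data that process i holds in the initial
    layout and that belongs to role j of the final layout (hypothetical input).
    Returns (sigma, remote): sigma[j] is the process that takes over role j,
    and remote is the volume that still has to cross the network.
    Brute force over all permutations, so only suitable for tiny examples.
    """
    n = len(volume)
    total = sum(map(sum, volume))
    best_sigma, best_local = None, -1
    for sigma in permutations(range(n)):
        # Data that process sigma[j] already owns for role j stays local.
        local = sum(volume[sigma[j]][j] for j in range(n))
        if local > best_local:
            best_sigma, best_local = sigma, local
    return best_sigma, total - best_local


# Made-up example with three processes: without relabeling, almost all
# data would have to move, because the diagonal entries are small.
volume = [
    [1, 8, 1],
    [1, 1, 8],
    [8, 1, 1],
]
identity_remote = sum(map(sum, volume)) - sum(volume[i][i] for i in range(3))
sigma, remote = best_relabeling(volume)
print(identity_remote, sigma, remote)
```

In this toy case, assigning each final-layout role to the process that already holds most of its data reduces the transferred volume from 27 to 6 units. The sketch assumes a homogeneous network; for the heterogeneous case mentioned in the abstract, the weights would presumably need to reflect per-link latency and bandwidth rather than raw volume.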

Author information

Corresponding author

Correspondence to Marko Kabić.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kabić, M., Pintarelli, S., Kozhevnikov, A., VandeVondele, J. (2021). COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling. In: Chamberlain, B.L., Varbanescu, A.L., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_12

  • DOI: https://doi.org/10.1007/978-3-030-78713-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78712-7

  • Online ISBN: 978-3-030-78713-4

  • eBook Packages: Computer Science, Computer Science (R0)
