COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 12728)


Communication-avoiding algorithms for Linear Algebra have become increasingly popular, in particular for distributed-memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, which makes data reshuffling key to using them. For performance reasons, a straightforward all-to-all exchange must be avoided.

Here, we show that process relabeling (i.e., permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that the optimal relabeling can be found efficiently by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly optimised algorithm implements \(A=\alpha \cdot {\text {op}}(B) + \beta \cdot A,\ {\text {op}} \in \{{\text {transpose}}, {\text {conjugate-transpose}}, {\text {identity}}\}\) on distributed systems, where \(A\) and \(B\) are matrices with potentially different (distributed) layouts and \(\alpha , \beta \) are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. Moreover, our algorithm can be easily generalized to even more generic problems, making it suitable for distributed Machine Learning applications. The implementation not only outperforms the best available ScaLAPACK redistribution and transpose routines several times over, but is also able to deal with more general matrix layouts; in particular, it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. In this way, we show that COSTA can unlock the full potential of recent Linear Algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.
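To make the relabeling idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical volume matrix `volume[i][j]` that holds the amount of data process `i` owns in the initial layout and that belongs to role `j` in the final layout, and it assumes a uniform cost per unit of data sent (the heterogeneous-network case would need weighted costs). For clarity it searches all permutations exhaustively, which is only feasible for a handful of processes; at scale a Linear Assignment Problem solver such as the Hungarian method would take its place. The names `comm_volume` and `best_relabeling` are illustrative only.

```python
from itertools import permutations


def comm_volume(volume, relabel):
    """Data that must cross the network if process i takes final-layout role relabel[i]."""
    total = sum(map(sum, volume))
    # Data already sitting on the process that will own it stays local.
    local = sum(volume[i][relabel[i]] for i in range(len(volume)))
    return total - local


def best_relabeling(volume):
    """Brute-force stand-in for an assignment solver: try every permutation of roles."""
    n = len(volume)
    return min(permutations(range(n)), key=lambda p: comm_volume(volume, p))


# Hypothetical 3-process example (made-up numbers, for illustration only).
volume = [
    [1, 8, 1],
    [2, 1, 7],
    [9, 1, 0],
]
identity = tuple(range(len(volume)))
best = best_relabeling(volume)

print("without relabeling:", comm_volume(volume, identity))
print("best relabeling:", best, "->", comm_volume(volume, best))
```

In this made-up example, keeping the original labels sends 28 of the 30 data units over the network, whereas the relabeling `(1, 2, 0)` sends only 6. Maximising the data that stays local is the same as finding a maximum weight perfect matching between processes and final-layout roles.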


  • Communication-optimal
  • Redistribution
  • Transpose
  • Perfect matching
  • Linear assignment
  • Random-phase approximation (RPA)
  • CP2K
  • Linear algebra

  • DOI: 10.1007/978-3-030-78713-4_12
  • Chapter length: 20 pages



Author information

Corresponding author

Correspondence to Marko Kabić.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kabić, M., Pintarelli, S., Kozhevnikov, A., VandeVondele, J. (2021). COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling. In: Chamberlain, B.L., Varbanescu, A.L., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78712-7

  • Online ISBN: 978-3-030-78713-4

  • eBook Packages: Computer Science, Computer Science (R0)