Tile Low-Rank GEMM Using Batched Operations on GPUs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11014)


Dense general matrix-matrix multiplication (GEMM) is a core operation of the Basic Linear Algebra Subprograms (BLAS) library and therefore often resides at the bottom of the traditional software stack for many scientific applications. In fact, chip manufacturers pay special attention to the GEMM kernel implementation, since this is precisely where most high-performance software libraries extract hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information of each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed for GPUs, so that the data sparsity structure of the matrix can be exploited while leveraging the underlying TLR compression format. The main idea consists of aggregating all operations into a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR-GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
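To make the idea concrete, the sketch below is a minimal illustration, not the paper's fused KBLAS kernel: the helper name tlr_gemm_batched and the tile layout are assumptions, and plain cuBLAS batched calls stand in for the single fused launch the paper describes. It shows how a batch of low-rank tiles A_i = U_i V_i^T, each stored only through its factors U_i and V_i, can be multiplied against dense blocks with two batched GEMM calls whose cost scales with the rank k rather than the full tile size.

```cpp
// Illustrative sketch (assumed layout, hypothetical helper name), using the
// real cuBLAS batched GEMM API. Each low-rank tile A_i (m x n) is stored as
// U_i (m x k) and V_i (n x k), so A_i * B_i = U_i * (V_i^T * B_i):
// two small GEMMs of rank k replace one large dense GEMM.
#include <cublas_v2.h>

void tlr_gemm_batched(cublasHandle_t handle,
                      int m, int n, int k, int nrhs, int batch,
                      const float* const dU[],  // device pointers to U_i (m x k)
                      const float* const dV[],  // device pointers to V_i (n x k)
                      const float* const dB[],  // device pointers to B_i (n x nrhs)
                      float* const dT[],        // workspace T_i (k x nrhs)
                      float* const dC[])        // output C_i (m x nrhs)
{
    const float one = 1.0f, zero = 0.0f;

    // Stage 1 (one batched launch): T_i = V_i^T * B_i, a k x nrhs intermediate.
    cublasSgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                       k, nrhs, n,
                       &one, dV, n, dB, n,
                       &zero, dT, k, batch);

    // Stage 2 (one batched launch): C_i += U_i * T_i, accumulating the result.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, nrhs, k,
                       &one, dU, m, dT, k,
                       &one, dC, m, batch);
}
```

Each of these small GEMMs has low arithmetic intensity on its own, so launching them one by one would be dominated by launch and data transfer overheads; batching them across all tiles, and ultimately fusing the two stages into a single kernel as the paper proposes, is what recovers GPU performance.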


Keywords: Hierarchical low-rank matrix computations · Matrix multiplication (GEMM) · High performance computing · GPU computing · KBLAS


Data Availability Statement and Acknowledgments

The datasets and code generated and/or analysed during the current study are available in the figshare repository [16]. We would like to acknowledge the Paris Observatory (LESIA, France) for giving us remote access to their Volta-based system, sponsored through a grant from project #671662 (Green Flash), funded by the European Commission under programme H2020-EU.1.2.2 and coordinated in H2020-FETHPC-2014.


References

1. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee
2. The NVIDIA CUDA Basic Linear Algebra Subroutines (CUBLAS)
3. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
4. Abdelfattah, A., et al.: High-performance tensor contractions for GPUs. Procedia Comput. Sci. 80, 108–118 (2016). International Conference on Computational Science 2016 (ICCS 2016), San Diego, California, USA, 6–8 June 2016
5. Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Performance, design, and autotuning of batched GEMM for GPUs. In: Kunkel, J.M., Balaji, P., Dongarra, J. (eds.) ISC High Performance 2016. LNCS, vol. 9697, pp. 21–38. Springer, Cham (2016)
6. Abdelfattah, A., Ltaief, H., Keyes, D.E., Dongarra, J.J.: Performance optimization of sparse matrix-vector multiplication for multi-component PDE-based applications using GPUs. Concurr. Comput.: Pract. Exp. 28(12), 3447–3465 (2016)
7. Agullo, E., et al.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)
8. Akbudak, K., Ltaief, H., Mikhalev, A., Keyes, D.: Tile low rank Cholesky factorization for climate/weather modeling applications on manycore architectures. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds.) ISC 2017. LNCS, vol. 10266, pp. 22–40. Springer, Cham (2017)
9. Akbudak, K., Ltaief, H., Mikhalev, A., Charara, A., Keyes, D.: Exploiting data sparsity for large-scale matrix computations. In: Aldinucci, M., et al. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. xx–yy. Springer, Cham (2018)
10. Ambikasaran, S., Darve, E.: An \(\mathscr{O}(N \log N)\) fast direct solver for partial hierarchically semiseparable matrices. J. Sci. Comput. 57(3), 477–501 (2013)
11. Amestoy, P.R., Ashcraft, C., Boiteau, O., Buttari, A., L'Excellent, J.Y., Weisbecker, C.: Improving multifrontal methods by means of block low-rank representations. SIAM J. Sci. Comput. 37(3), A1451–A1474 (2015)
12. Aminfar, A., Darve, E.: A fast sparse solver for Finite-Element matrices. arXiv:1403.5337 [cs.NA], pp. 1–25 (2014)
13. Börm, S.: Efficient numerical methods for non-local operators: \(\mathscr{H}^2\)-matrix compression, algorithms and analysis. EMS Tracts in Mathematics, vol. 14. European Mathematical Society (2010)
14. Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression. Parallel Comput. 74, 19–33 (2017)
15. Charara, A., Keyes, D., Ltaief, H.: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs. ACM Trans. Math. Softw. (2017, submitted, under review)
16. Charara, A., Keyes, D., Ltaief, H.: Software artifact for Euro-Par 2018: Tile Low-Rank GEMM Using Batched Operations on GPUs. figshare, code (2018)
17. Chávez, G., Turkiyyah, G., Zampini, S., Ltaief, H., Keyes, D.: Accelerated cyclic reduction: a distributed-memory fast solver for structured linear systems. Parallel Comput. 74, 65–83 (2017)
18. Dongarra, J., Du Croz, J., Hammarling, S., Hanson, R.J.: An extended set of Fortran basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1–17 (1988)
19. Dongarra, J., et al.: A proposed API for batched basic linear algebra subprograms. MIMS preprint, University of Manchester (2016)
20. Grasedyck, L., Hackbusch, W.: Construction and arithmetics of \(\mathscr{H}\)-matrices. Computing 70(4), 295–334 (2003)
21. Hackbusch, W.: A sparse matrix arithmetic based on \(\mathscr{H}\)-matrices. Part I: introduction to \(\mathscr{H}\)-matrices. Computing 62(2), 89–108 (1999)
22. Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis. Springer Series in Computational Mathematics, vol. 49. Springer, Heidelberg (2015)
23. Hackbusch, W., Börm, S.: Data-sparse approximation by adaptive \(\mathscr{H}^2\)-matrices. Computing 69(1), 1–35 (2002)
24. Hackbusch, W., Börm, S., Grasedyck, L.: HLib 1.4. Max-Planck-Institut, Leipzig (2012)
25. Hackbusch, W., Khoromskij, B., Sauter, S.: On \(\mathscr{H}^2\)-matrices. In: Bungartz, H.J., Hoppe, R.H.W., Zenger, C. (eds.) Lectures on Applied Mathematics, pp. 9–29. Springer, Heidelberg (2000)
26. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
27. Heinecke, A., Henry, G., Hutchinson, M., Pabst, H.: LIBXSMM: accelerating small matrix multiplications by runtime code generation. In: West, J., Pancake, C.M. (eds.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, 13–18 November 2016, p. 84. ACM (2016)
28. Kim, K., et al.: Designing vector-friendly compact BLAS and LAPACK kernels. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, pp. 55:1–55:12. ACM, New York (2017)
29. Kriemann, R.: \(\mathscr{H}\)-LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105–117 (2013)
30. Ltaief, H., et al.: Real-time massively distributed multi-object adaptive optics simulations for the European Extremely Large Telescope. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2018 (accepted)
31. North, G.R., Wang, J., Genton, M.G.: Correlation models for temperature fields. J. Clim. 24, 5850–5862 (2011)
32. Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Trans. Math. Softw. 42(4), 27:1–27:35 (2016)
33. Shi, Y., Niranjan, U.N., Anandkumar, A., Cecka, C.: Tensor contractions with extended BLAS kernels on CPU and GPU. In: HiPC, pp. 193–202. IEEE Computer Society (2016)
34. Tyrtyshnikov, E.: Mosaic-skeleton approximations. Calcolo 33(1), 47–57 (1996)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

1. Extreme Computing Research Center, Division of Computer, Electrical, and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Jeddah, Kingdom of Saudi Arabia