High-Performance Matrix-Matrix Multiplications of Very Small Matrices

  • Ian Masliah
  • Ahmad Abdelfattah
  • A. Haidar
  • S. Tomov
  • Marc Baboulin
  • J. Falcou
  • J. Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


The general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. However, GEMMs for small matrices (of sizes less than 32) are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures, a case that often occurs in applications such as big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases that obtain performance within 90% of the optimal, and we show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
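To make the batched setting concrete, the sketch below shows a naive reference for a batched GEMM over many small column-major matrices: one routine call performs C_i = alpha*A_i*B_i + beta*C_i for every matrix in the batch. The function name and pointer-array interface are hypothetical illustrations, not the paper's API; the optimized CPU/GPU kernels the paper describes replace this scalar inner loop with register-blocked, architecture-tuned code.

```c
#include <stddef.h>

/* Naive reference batched GEMM: for each batch entry b, compute
   C[b] = alpha * A[b] * B[b] + beta * C[b], where A[b], B[b], C[b]
   are n x n column-major matrices. Illustrative sketch only. */
static void sgemm_batched_ref(int n, float alpha,
                              const float *const *A,
                              const float *const *B,
                              float beta, float *const *C,
                              int batch)
{
    for (int b = 0; b < batch; ++b) {          /* one small GEMM per entry */
        for (int j = 0; j < n; ++j) {          /* column of C */
            for (int i = 0; i < n; ++i) {      /* row of C */
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)    /* dot product */
                    acc += A[b][i + k * n] * B[b][k + j * n];
                C[b][i + j * n] = alpha * acc + beta * C[b][i + j * n];
            }
        }
    }
}
```

Grouping the many small multiplications behind one call like this is what lets an implementation amortize launch and scheduling overhead, which dominates when each individual GEMM is tiny.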


Keywords: GEMM · Batched GEMM · Small matrices · HPC · Autotuning



This material is based in part upon work supported by the US NSF under Grants No. CSR 1514286 and ACI-1339822, NVIDIA, the Department of Energy, and in part by the Russian Scientific Foundation, Agreement N14-11-00190.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ian Masliah (2)
  • Ahmad Abdelfattah (1)
  • A. Haidar (1)
  • S. Tomov (1)
  • Marc Baboulin (2)
  • J. Falcou (2)
  • J. Dongarra (1, 3)

  1. Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
  2. University of Paris-Sud, Orsay, France
  3. University of Manchester, Manchester, UK
