Optimized Batched Linear Algebra for Modern Architectures

  • Jack Dongarra
  • Sven Hammarling
  • Nicholas J. Higham
  • Samuel D. Relton
  • Mawussi Zounon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10417)


Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, \(2\times 2\) for example, this design exhibits two main issues: because the matrices are too small, it fails to fully utilize both the vector units and the cache of modern architectures. Our approach to resolving this is as follows. Given a batch of small matrices spread throughout main memory, we first reorganize the elements of the matrices into a contiguous array using a block interleaved memory format. This allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache, and the solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from \(2 \times 2\) to \(32 \times 32\). Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
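The three steps described above (pack into an interleaved array, compute across the batch, unpack) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `interleave`, `deinterleave`, and `gemm_interleaved`, the `block` parameter, and the column-major ordering inside each block are all assumptions made for illustration, and the cache blocking described in the abstract is omitted. The point it shows is that, once the same entry of every matrix in a block is stored contiguously, the innermost loop runs over the batch index, which is the loop a compiled kernel could map onto vector lanes.

```python
def interleave(mats, n, block):
    """Pack n-by-n matrices (lists of rows) into a block-interleaved flat list.

    Assumed layout: matrices are grouped into blocks of `block` matrices;
    within a block, entry (i, j) of all matrices is stored contiguously,
    with entries ordered column by column. The last block is zero-padded.
    """
    nblocks = (len(mats) + block - 1) // block
    out = [0.0] * (nblocks * n * n * block)
    for b, mat in enumerate(mats):
        k, p = divmod(b, block)          # block number, position inside block
        base = k * n * n * block
        for j in range(n):
            for i in range(n):
                out[base + (j * n + i) * block + p] = mat[i][j]
    return out


def deinterleave(buf, count, n, block):
    """Convert an interleaved buffer back to a list of `count` matrices."""
    mats = []
    for b in range(count):
        k, p = divmod(b, block)
        base = k * n * n * block
        mats.append([[buf[base + (j * n + i) * block + p] for j in range(n)]
                     for i in range(n)])
    return mats


def gemm_interleaved(a, b, n, block):
    """Compute C = A * B for every matrix pair, directly on interleaved data."""
    c = [0.0] * len(a)
    for base in range(0, len(a), n * n * block):
        for j in range(n):
            for i in range(n):
                cij = base + (j * n + i) * block
                for l in range(n):
                    ail = base + (l * n + i) * block
                    blj = base + (j * n + l) * block
                    # Innermost loop walks contiguous memory across the batch:
                    # this is the cross-matrix loop a compiler could vectorize.
                    for p in range(block):
                        c[cij + p] += a[ail + p] * b[blj + p]
    return c


# Hypothetical example data: a batch of three 2-by-2 problems, block size 2.
A = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]]]
B = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 6.0], [7.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
C = deinterleave(gemm_interleaved(interleave(A, 2, 2), interleave(B, 2, 2), 2, 2),
                 3, 2, 2)
```

In a real kernel the block size would presumably be chosen to match the vector width of the target processor, and the packing and unpacking costs would need to be amortized over the computation; both points are assumptions here rather than statements from the paper.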



The authors would like to thank The University of Tennessee for the use of their computational resources. This research was funded in part by the European Union’s Horizon 2020 research and innovation programme under the NLAFET grant agreement No. 671633.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. University of Tennessee, Knoxville, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. School of Mathematics, The University of Manchester, Manchester, UK
