Batch QR Factorization on GPUs: Design, Optimization, and Tuning

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13350)

Abstract

QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems and least squares problems to eigenvalue problems and singular value decompositions, a high performance QR factorization is fundamental to computer simulations and many applications. More importantly, QR factorization of a batch of relatively small matrices has attracted considerable attention in sparse direct solvers and in low-rank approximations for hierarchical matrices. To address this demand, we present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adapts the algorithmic design to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. The performance results show multi-fold speedups against state-of-the-art libraries on the latest GPU architectures from both NVIDIA and AMD.
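To make the term "batch QR" concrete, the following is a minimal CPU sketch in plain Python, not the GPU kernels or the blocking strategy described in the paper. It factors each small matrix in a batch independently using unblocked Householder reflections, the same family of transformations LAPACK's QR routines build on. The function names `householder_qr` and `batch_qr` are hypothetical, and Q is formed explicitly here for readability, whereas LAPACK-style codes store the reflectors instead.

```python
import math


def householder_qr(a):
    """Return (q, r) with a = q * r for one m x n matrix given as a list of rows."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Build the Householder vector v that zeroes r[k+1:, k].
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero at and below the diagonal
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(t * t for t in v)
        # Apply H = I - 2 v v^T / (v^T v) from the left to rows k.. of r.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * r[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                r[k + i][j] -= s * v[i]
        # Accumulate q = q * H, which touches only columns k.. of q.
        for i in range(m):
            s = 2.0 * sum(q[i][k + l] * v[l] for l in range(m - k)) / vtv
            for l in range(m - k):
                q[i][k + l] -= s * v[l]
    return q, r


def batch_qr(batch):
    """Factor every matrix in the batch independently (hypothetical helper)."""
    return [householder_qr(a) for a in batch]


# Illustrative input: two small, unrelated matrices of different shapes.
batch = [
    [[4.0, 1.0], [3.0, 2.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]
factors = batch_qr(batch)
```

Because the matrices in the batch are independent, the outer loop in `batch_qr` is the part a GPU implementation would run in parallel; how the work inside each factorization is blocked and mapped to hardware is the subject of the paper and is not modeled here.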

Keywords

  • Batch linear algebra
  • QR factorization
  • GPU computing

Chapter
  • DOI: 10.1007/978-3-031-08751-6_5
  • Chapter length: 15 pages

Author information

Corresponding author

Correspondence to Ahmad Abdelfattah.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Abdelfattah, A., Tomov, S., Dongarra, J. (2022). Batch QR Factorization on GPUs: Design, Optimization, and Tuning. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. ICCS 2022. Lecture Notes in Computer Science, vol 13350. Springer, Cham. https://doi.org/10.1007/978-3-031-08751-6_5

  • DOI: https://doi.org/10.1007/978-3-031-08751-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08750-9

  • Online ISBN: 978-3-031-08751-6

  • eBook Packages: Computer Science, Computer Science (R0)