Batch QR Factorization on GPUs: Design, Optimization, and Tuning

  • Conference paper
  • In: Computational Science – ICCS 2022 (ICCS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13350)

Abstract

QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems and least squares problems to eigenvalue problems and singular value decompositions, a high performance QR factorization is fundamental to computer simulations and many applications. More importantly, the QR factorization of a batch of relatively small matrices has attracted considerable attention in sparse direct solvers and low-rank approximations for hierarchical matrices. To address this interest and demand, we present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adjusts various algorithmic designs to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. The performance results show multi-fold speedups against the state-of-the-art libraries on the latest GPU architectures from both NVIDIA and AMD.
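To make the term "batch QR" concrete, the following is a minimal reference sketch in NumPy of an unblocked Householder QR applied to every matrix in a batch at once. It is not the GPU implementation described in the paper and does not reproduce its multi-level blocking; the function name `batch_householder_qr` and the array layout `(batch, m, n)` are assumptions made only for illustration.

```python
import numpy as np

def batch_householder_qr(A):
    """Unblocked Householder QR for a batch of matrices (illustrative only).

    A has shape (batch, m, n) with m >= n. Returns (Q, R) where Q has
    shape (batch, m, m), R has shape (batch, m, n), and Q @ R == A.
    """
    R = np.array(A, dtype=np.float64, copy=True)
    batch, m, n = R.shape
    Q = np.broadcast_to(np.eye(m), (batch, m, m)).copy()
    for k in range(n):
        # Column k of every matrix, from the diagonal down: (batch, m-k).
        x = R[:, k:, k]
        norm_x = np.linalg.norm(x, axis=1)
        sign = np.where(x[:, 0] >= 0.0, 1.0, -1.0)
        # Householder vector v = x + sign(x_0) * ||x|| * e_0, then normalized.
        v = x.copy()
        v[:, 0] += sign * norm_x
        vnorm = np.linalg.norm(v, axis=1, keepdims=True)
        # An all-zero column gives v = 0, i.e. the identity reflector.
        v = np.divide(v, vnorm, out=np.zeros_like(v), where=vnorm > 0)
        # Apply H = I - 2 v v^T from the left to the trailing block of R.
        w = np.einsum("bi,bij->bj", v, R[:, k:, k:])
        R[:, k:, k:] -= 2.0 * np.einsum("bi,bj->bij", v, w)
        # Accumulate Q <- Q H so that Q @ R stays equal to A.
        u = np.einsum("bij,bj->bi", Q[:, :, k:], v)
        Q[:, :, k:] -= 2.0 * np.einsum("bi,bj->bij", u, v)
    return Q, R
```

Every step here operates on all matrices in the batch simultaneously, which is the property a GPU batch routine exploits; a tuned implementation would additionally block the reflectors to increase data reuse, as the abstract indicates.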

Author information

Corresponding author

Correspondence to Ahmad Abdelfattah.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Abdelfattah, A., Tomov, S., Dongarra, J. (2022). Batch QR Factorization on GPUs: Design, Optimization, and Tuning. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. ICCS 2022. Lecture Notes in Computer Science, vol 13350. Springer, Cham. https://doi.org/10.1007/978-3-031-08751-6_5

  • DOI: https://doi.org/10.1007/978-3-031-08751-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08750-9

  • Online ISBN: 978-3-031-08751-6

  • eBook Packages: Computer Science (R0)
