
Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors


We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. On a hybrid platform equipped with a graphics processor, this approach gains a performance advantage over the conventional factorization by reducing the amount of data transferred between the graphics accelerator and the main memory of the host. Our experiments show that, for tall-and-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, and that it remains very competitive for square matrices when memory transfers and CPU computations are the bottleneck of the Householder QR factorization.
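For orientation, the conventional Householder QR factorization that the abstract uses as its baseline can be sketched as follows. This is the textbook unblocked algorithm in plain Python, not the approximate-reflector method of the paper (whose details are not given in this excerpt). The function names `householder_qr` and `apply_q` are illustrative assumptions and do not correspond to MAGMA or to the authors' code.

```python
import math


def householder_qr(a):
    """Textbook (unblocked) Householder QR of an m x n matrix, m >= n.

    `a` is a list of rows. Returns (vs, r): the n unit-norm Householder
    vectors and the n x n upper-triangular factor R.
    NOTE: this is the conventional algorithm, not the paper's variant.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # work on a copy of the input
    vs = []
    for k in range(n):
        # Column k, from the diagonal entry downwards.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        v = x[:]
        # Pick the sign that avoids cancellation in the first entry.
        v[0] += math.copysign(norm_x, x[0])
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        if norm_v > 0.0:
            v = [vi / norm_v for vi in v]
            # Apply H = I - 2 v v^T to columns k..n-1 of the trailing block.
            for j in range(k, n):
                dot = sum(v[i] * r[k + i][j] for i in range(m - k))
                for i in range(m - k):
                    r[k + i][j] -= 2.0 * dot * v[i]
        vs.append(v)
    return vs, [row[:n] for row in r[:n]]


def apply_q(vs, b):
    """Compute Q @ b for an m-vector b using the stored reflectors."""
    y = b[:]
    # Q = H_0 H_1 ... H_{n-1}, so the last reflector is applied first.
    for k in reversed(range(len(vs))):
        v = vs[k]
        dot = sum(v[i] * y[k + i] for i in range(len(v)))
        for i in range(len(v)):
            y[k + i] -= 2.0 * dot * v[i]
    return y


# Small tall-and-skinny example: 4 rows, 2 columns.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
vs, r = householder_qr(a)
```

Each step here needs the norm of a full-height column before the reflector can be formed. On a hybrid CPU-GPU platform, steps like this are what can force data to move between the accelerator and host memory, and the abstract attributes the new method's advantage to reducing those transfers.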







This research was supported by the Project TIN2017-82972-R from the MINECO (Spain) and the EU H2020 Project 732631 “OPRECOMP. Open Transprecision Computing”.

Author information



Corresponding author

Correspondence to Andrés E. Tomás.


Cite this article

Tomás, A.E., Quintana-Ortí, E.S. Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors. J Supercomput 76, 8771–8786 (2020).

Keywords


  • QR factorization
  • Tall-and-skinny matrices
  • GPU
  • Householder vector
  • Look-ahead
  • High performance