Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics Processors

  • Andrés E. Tomás
  • Enrique S. Quintana-Ortí
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)


We present a method for the QR factorization of large tall-and-skinny matrices that combines block Gram-Schmidt with the Cholesky decomposition to factorize the column panels of the input matrix, overcoming the sequential nature of this operation. The method applies re-orthogonalization to obtain a satisfactory level of orthogonality in both the Gram-Schmidt process and the Cholesky QR.

Our approach has the additional benefit of enabling the introduction of a static look-ahead technique for computing the Cholesky decomposition on the CPU while the remaining operations (all Level-3 BLAS) are performed on the GPU.

In contrast with other specific factorizations for tall-skinny matrices, the novel method has the key advantage of not requiring any custom GPU kernels. This simplifies the implementation and favours portability to future GPU architectures.
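The two building blocks named in the abstract can be illustrated in a few lines. The sketch below, in NumPy rather than GPU code, shows a CholeskyQR2 kernel (Cholesky QR followed by one re-orthogonalization pass) and a block classical Gram-Schmidt loop that uses it on each column panel. The function names and panel width are ours, and the CPU/GPU split with look-ahead described above is omitted; this is a minimal numerical sketch, not the authors' implementation.

```python
import numpy as np

def cholesky_qr2(X):
    """Orthogonalize the columns of a tall-and-skinny X by two
    passes of Cholesky QR (the CholeskyQR2 scheme)."""
    def cholesky_qr(Y):
        G = Y.T @ Y                        # small n-by-n Gram matrix
        R = np.linalg.cholesky(G).T        # upper-triangular factor, G = R^T R
        Q = np.linalg.solve(R.T, Y.T).T    # Q = Y R^{-1} via triangular solve
        return Q, R
    Q1, R1 = cholesky_qr(X)
    Q, R2 = cholesky_qr(Q1)                # re-orthogonalization pass
    return Q, R2 @ R1

def block_gs_qr(A, b):
    """Block classical Gram-Schmidt over column panels of width b,
    with one re-orthogonalization sweep per panel; each panel is
    factorized with CholeskyQR2.  Returns A = Q R, R upper triangular."""
    m, n = A.shape
    Q = np.empty((m, n))
    R = np.zeros((n, n))
    for j in range(0, n, b):
        cols = slice(j, min(j + b, n))
        panel = A[:, cols].astype(float).copy()
        for _ in range(2):                 # projection + re-orthogonalization
            proj = Q[:, :j].T @ panel
            R[:j, cols] += proj
            panel -= Q[:, :j] @ proj
        Q[:, cols], R[cols, cols] = cholesky_qr2(panel)
    return Q, R
```

Note that every operation inside the loop is a matrix-matrix product, a small Cholesky factorization, or a triangular solve, which is what allows the full-scale method to run almost entirely as Level-3 BLAS on the GPU while only the small Cholesky factorizations go to the CPU.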

Our experiments show that, for tall-and-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, and it remains very competitive for square matrices when memory transfers and CPU computations are the bottleneck of Householder QR.


Keywords: QR factorization · Tall-and-skinny matrices · Graphics processing unit · Gram-Schmidt · Cholesky factorization · Look-ahead · High performance



This research was supported by the project TIN2017-82972-R from the MINECO (Spain), and the EU H2020 project 732631 “OPRECOMP. Open Transprecision Computing”.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Andrés E. Tomás (1, 2)
  • Enrique S. Quintana-Ortí (3)
  1. Dept. d’Enginyeria i Ciència dels Computadors, Universitat Jaume I, Castelló de la Plana, Spain
  2. Dept. de Sistemes Informàtics i Computació, Universitat Politècnica de València, València, Spain
  3. Dept. d’Informàtica de Sistemes i Computadors, Universitat Politècnica de València, València, Spain
