Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics Processors
We present a method for the QR factorization of large tall-and-skinny matrices that combines block Gram-Schmidt and the Cholesky decomposition to factorize the column panels of the input matrix, overcoming the sequential nature of this operation. The method applies re-orthogonalization in both the Gram-Schmidt process and the Cholesky QR to obtain a satisfactory level of orthogonality.
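The combination described above can be illustrated with a minimal NumPy sketch. It is one plausible reading of the abstract, not the authors' implementation: the function names, the `panel_width` parameter, and the choice of exactly two passes for each re-orthogonalization are assumptions made for illustration, and everything runs on the CPU.

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky QR pass: A = Q R, with R taken from the Cholesky factor of A^T A."""
    G = A.T @ A                      # Gram matrix (a SYRK/GEMM in BLAS terms)
    R = np.linalg.cholesky(G).T      # upper-triangular R with G = R^T R
    # Q = A R^{-1}; a general solve is used here for brevity,
    # where a tuned code would call a triangular solve (TRSM).
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R

def cholesky_qr2(A):
    """Cholesky QR applied twice (re-orthogonalization of the panel)."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

def block_gram_schmidt_qr(A, panel_width):
    """Block Gram-Schmidt over column panels; each panel is factorized with cholesky_qr2."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(0, n, panel_width):
        k = min(j + panel_width, n)
        W = A[:, j:k].astype(float)  # working copy of the current panel
        # Project out the already-computed columns twice (assumed re-orthogonalization).
        for _ in range(2):
            S = Q[:, :j].T @ W
            W -= Q[:, :j] @ S
            R[:j, j:k] += S
        Q[:, j:k], R[j:k, j:k] = cholesky_qr2(W)
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2000, 32))          # tall-and-skinny test matrix
    Q, R = block_gram_schmidt_qr(A, panel_width=8)
    print("residual:     ", np.linalg.norm(Q @ R - A))
    print("orthogonality:", np.linalg.norm(Q.T @ Q - np.eye(32)))
```

Every step except the small Cholesky factorization is a matrix-matrix product or a solve with a triangular matrix, which is consistent with the abstract's remark that the remaining operations are Level-3 BLAS.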
Our approach has the additional benefit of enabling the introduction of a static look-ahead technique for computing the Cholesky decomposition on the CPU while the remaining operations (all Level-3 BLAS) are performed on the GPU.
In contrast with other factorizations designed specifically for tall-skinny matrices, the new method has the key advantage of not requiring any custom GPU kernels. This simplifies the implementation and favours portability to future GPU architectures.
Our experiments show that, for tall-skinny matrices, the new approach outperforms the code in MAGMA by a large margin, while it is very competitive for square matrices when the memory transfers and CPU computations are the bottleneck of Householder QR.
Keywords: QR factorization, Tall-and-skinny matrices, Graphics processing unit, Gram-Schmidt, Cholesky factorization, Look-ahead, High-performance
This research was supported by the project TIN2017-82972-R from the MINECO (Spain), and the EU H2020 project 732631 “OPRECOMP. Open Transprecision Computing”.