Exploring Dual-Triangular Structure for Efficient R-Initiated Tall-Skinny QR on GPGPU

  • Nai-Yun Cheng
  • Ming-Syan Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11440)


The QR decomposition is one of the fundamental matrix decompositions in data mining. A particularly challenging case is the tall-and-skinny matrix, which has far more rows than columns. Tall-skinny QR has many applications, including Krylov subspace methods and some subspace projection methods, and it can also accelerate principal component analysis (PCA). Although algorithms such as TSQR and Cholesky QR have been proposed for computing QR decompositions of tall-and-skinny matrices, none of them is well suited to the GPGPU, which is increasingly widely used. In view of the limited memory of the GPGPU and the costly data transmission between CPU and GPGPU, we propose a novel R-initiated TSQR that makes tall-and-skinny QR on the GPGPU efficient. Specifically, our method is unique in that it uses Givens QR to exploit the dual-triangular (DT) structure of the submatrices that arise in TSQR, which significantly reduces the computation required. With the R-initiated approach, our method not only meets the memory limitation of the GPGPU but also avoids large amounts of data transmission. Theoretical results are derived that show the merit of the proposed method, and experimental results indicate that it significantly outperforms conventional TSQR.
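To make the dual-triangular idea concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' GPGPU implementation. It assumes that "dual-triangular" refers to the matrix obtained in a TSQR reduction step by stacking two n x n upper-triangular R factors, and it merges them with Givens rotations that touch only the structurally nonzero entries. The function names `givens` and `merge_triangular` are hypothetical and chosen only for illustration.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def merge_triangular(r1, r2):
    """Reduce the stacked matrix [r1; r2] to one upper-triangular factor.

    Assumption: r1 and r2 are both n x n upper triangular (lists of lists).
    Each entry of r2 on or above the diagonal is zeroed by rotating its row
    against the row of r1 whose diagonal sits in the same column, so entries
    that are already zero by structure are never read or written.
    """
    n = len(r1)
    top = [row[:] for row in r1]   # becomes the merged R factor
    bot = [row[:] for row in r2]   # is rotated to all zeros
    for i in range(n):             # each row of the lower triangle block
        for j in range(i, n):      # its nonzero columns, left to right
            c, s = givens(top[j][j], bot[i][j])
            # Columns before j are zero in both rows, so start at j.
            for k in range(j, n):
                t, b = top[j][k], bot[i][k]
                top[j][k] = c * t + s * b
                bot[i][k] = -s * t + c * b
    return top


# Usage: merge two small R factors, as a TSQR reduction step would.
r1 = [[3.0, 1.0], [0.0, 2.0]]
r2 = [[4.0, 2.0], [0.0, 1.0]]
merged = merge_triangular(r1, r2)
print(merged)
```

A dense QR of the stacked 2n x n matrix would treat every entry below the diagonal as potentially nonzero, while the loop above performs one rotation per nonzero entry of the lower block and shortens each rotation to the columns that can actually change. This is only meant to illustrate why such structure can reduce work; the actual savings and GPGPU mapping reported in the paper are not reproduced here.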


Keywords: TSQR, Tall-and-skinny matrix, QR decomposition



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. National Taiwan University, Taipei, Taiwan
