Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

  • Andrés Tomás
  • Zhaojun Bai
  • Vicente Hernández
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7851)


The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm using a distribution of the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12 core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general purpose GPU processors, where our implementation obtains up to 90 GFlops on a NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).

Topics. Parallel and Distributed Computing.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Bischof, C.H.: A parallel QR factorization algorithm with controlled local pivoting. SIAM J. Sci. Stat. Comput. 12, 36–57 (1991)MathSciNetMATHCrossRefGoogle Scholar
  2. 2.
    Chandrasekaran, S., Ipsen, I.C.F.: On rank-revealing factorisations. SIAM J. Matrix Anal. Appl. 15, 592–622 (1994)MathSciNetMATHCrossRefGoogle Scholar
  3. 3.
    Castaldo, A.M., Whaley, R.C.: Scaling LAPACK panel operations using parallel cache assignment. In: 15th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 223–231 (2010)Google Scholar
  4. 4.
    Drmač, Z., Bujanović, Z.: On the failure of rank-revealing QR factorization software – a case study. ACM Trans. Math. Softw. 35, 12:1–12:28 (2008)CrossRefGoogle Scholar
  5. 5.
    Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm I. SIAM J. Matrix Anal. Appl. 29, 1322–1342 (2008)MATHCrossRefGoogle Scholar
  6. 6.
    Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm II. SIAM J. Matrix Anal. Appl. 29, 1343–1362 (2008)MATHCrossRefGoogle Scholar
  7. 7.
    Golub, G.H.: Numerical methods for solving linear least squares problems. Numer. Math. 7, 206–216 (1965)MathSciNetMATHCrossRefGoogle Scholar
  8. 8.
    Gu, M., Eisenstat, S.: Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput. 17, 848–869 (1996)MathSciNetMATHCrossRefGoogle Scholar
  9. 9.
    Quintana-Orti, G., Sun, X., Bischof, C.H.: A BLAS-3 version of the QR factorization with column pivoting. SIAM J. Sci. Comput. 19, 1486–1494 (1998)MathSciNetMATHCrossRefGoogle Scholar
  10. 10.
    Schreiber, R., van Loan, C.: A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 53–57 (1989)MATHCrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Andrés Tomás
    • 1
  • Zhaojun Bai
    • 1
  • Vicente Hernández
    • 2
  1. 1.Department of Computer ScienceUniversity of CaliforniaDavisUSA
  2. 2.Dept. Sistemas Informáticos y ComputaciònUniversitat Politècnica de ValènciaValenciaSpain

Personalised recommendations