New serial and parallel recursive QR factorization algorithms for SMP systems

  • Erik Elmroth
  • Fred Gustavson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1541)


We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m=n increases from 100 to 1000. A successful parallel implementation based on dynamic load balancing is presented for a PowerPC 604 based IBM SMP node. For 2, 3, and 4 processors and m=n=2000, it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm.
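The recursive idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the pure recursion (not the hybrid variant or the parallel version), assumes m >= n, and uses a compact WY form Q = I - Y T Yᵀ to accumulate the Householder reflectors. All function names (`householder`, `recursive_qr`, `form_q`, `matmul`) are hypothetical, and the pure-Python `matmul` only stands in for a level 3 BLAS call.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    # Plain loops standing in for a level 3 BLAS matrix-matrix multiply.
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def householder(x):
    """Return (v, tau, beta) with (I - tau v v^T) x = beta e1 and v[0] = 1."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0, 0.0
    alpha = x[0]
    beta = -math.copysign(norm, alpha)
    v = [1.0] + [t / (alpha - beta) for t in x[1:]]
    return v, (beta - alpha) / beta, beta

def recursive_qr(A):
    """Factor an m-by-n matrix (m >= n >= 1) so that Q^T A = [R; 0],
    where Q = I - Y T Y^T. Returns (Y, T, R)."""
    n = len(A[0])
    if n == 1:
        # Base case: a single column needs one Householder reflector.
        v, tau, beta = householder([row[0] for row in A])
        return [[t] for t in v], [[tau]], [[beta]]
    n1 = n // 2
    n2 = n - n1
    A1 = [row[:n1] for row in A]
    A2 = [row[n1:] for row in A]
    # Factor the left half recursively.
    Y1, T1, R1 = recursive_qr(A1)
    # Update the right half: A2 <- Q1^T A2 = A2 - Y1 T1^T (Y1^T A2).
    W = matmul(transpose(T1), matmul(transpose(Y1), A2))
    A2 = sub(A2, matmul(Y1, W))
    R12, A22 = A2[:n1], A2[n1:]
    # Factor the trailing part of the right half recursively.
    Y2s, T2, R2 = recursive_qr(A22)
    Y2 = [[0.0] * n2 for _ in range(n1)] + Y2s
    # Coupling block of T: T3 = -T1 (Y1^T Y2) T2.
    P = matmul(T1, matmul(matmul(transpose(Y1), Y2), T2))
    T3 = [[-t for t in row] for row in P]
    Y = [r1 + r2 for r1, r2 in zip(Y1, Y2)]
    T = [a + b for a, b in zip(T1, T3)] + [[0.0] * n1 + row for row in T2]
    R = [a + b for a, b in zip(R1, R12)] + [[0.0] * n1 + row for row in R2]
    return Y, T, R

def form_q(Y, T):
    """Build the explicit m-by-m matrix Q = I - Y T Y^T (for checking only)."""
    m = len(Y)
    YTYt = matmul(matmul(Y, T), transpose(Y))
    return [[(1.0 if i == j else 0.0) - YTYt[i][j] for j in range(m)]
            for i in range(m)]
```

Calling `recursive_qr(A)` on a list-of-rows matrix returns `Y`, `T`, and the n by n upper triangular `R`; `form_q(Y, T)` is only useful for verifying the result on small inputs. In this sketch, every step above the single-column base case is expressed as matrix-matrix products, which is the sense in which the recursion replaces level 2 work with level 3 operations.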





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Erik Elmroth (1)
  • Fred Gustavson (2)
  1. Department of Computing Science and HPC2N, Umeå University, Umeå, Sweden
  2. IBM T.J. Watson Research Center, Yorktown Heights, U.S.A.
