Advertisement

Parallel Tiled QR Factorization for Multicore Architectures

  • Alfredo Buttari
  • Julien Langou
  • Jakub Kurzak
  • Jack Dongarra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4967)

Abstract

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. Compared to the standard approach, say with LAPACK, may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Pham, D., Asano, S., Bolliger, M., Day, M.N., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Suzuoki, M., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., Yazawa, K.: The design and implementation of a first-generation CELL processor. In: IEEE International Solid-State Circuits Conference, pp. 184–185 (2005)Google Scholar
  2. 2.
  3. 3.
    Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Croz, J.D., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK User’s Guide, 3rd edn. SIAM, Philadelphia (1999)zbMATHGoogle Scholar
  4. 4.
    Choi, J., Demmel, J., Dhillon, I., Dongarra, J., Ostrouchov, S., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications 97, 1–15 (1996), (also as LAPACK Working Note #95)Google Scholar
  5. 5.
    Kurzak, J., Dongarra, J.: Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178 (September 2006), Also available as UT-CS-06-581Google Scholar
  6. 6.
    Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: The impact of multicore on math software. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1–10. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  7. 7.
    Chan, E., Quintana-Orti, E.S., Quintana-Orti, G., van de Geijn, R.: Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In: SPAA 2007: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 116–125. ACM Press, New York (2007)CrossRefGoogle Scholar
  8. 8.
    Elmroth, E., Gustavson, F., Jonsson, I., Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46(1), 3–45 (2004)zbMATHCrossRefMathSciNetGoogle Scholar
  9. 9.
    Gustavson, F., Karlsson, L., Kågström, B.: Three algorithms for cholesky factorization on distributed memory using packed storage. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 550–559. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  10. 10.
    Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of linear equations on the CELL processor using Cholesky factorization. Technical Report UT-CS-07-596, Innovative Computing Laboratory, University of Tennessee Knoxville (April 2007)Google Scholar
  11. 11.
    Pham, D., Asano, S., Bolliger, M., Day, M.N., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Suzuoki, M., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., Yazawa, K.: The design and implementation of a first-generation CELL processor. In: IEEE International Solid-State Circuits Conference, pp. 184–185 (2005)zbMATHCrossRefMathSciNetGoogle Scholar
  12. 12.
    Dongarra, J.J., Hiromoto, R.E.: A collection of parallel linear equations routines for the Denelcor HEP 1(2), 133–142 (December 1984)Google Scholar
  13. 13.
    Agarwal, R.C., Gustavson, F.G.: Vector and parallel algorithms for cholesky factorization on ibm 3090. In: Supercomputing 1989: Proceedings of the 1989 ACM/IEEE conference on Supercomputing, pp. 225–233. ACM Press, New York (1989)CrossRefGoogle Scholar
  14. 14.
    Agarwal, R.C., Gustavson, F.G.: A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In: Proceedings of the IFIP WG 2.5 Working Group on Aspects of Computation on Asychronous Parallel Processors, Stanford CA, Augest 22-26,1988, North Holland, Amsterdam (1988)Google Scholar
  15. 15.
    Elmroth, E., Gustavson, F.G.: Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development 44(4), 605 (2000)CrossRefGoogle Scholar
  16. 16.
    Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)zbMATHGoogle Scholar
  17. 17.
    Stewart, G.W.: Matrix Algorithms, 1st edn., vol. 1. SIAM, Philadelphia (1998)zbMATHGoogle Scholar
  18. 18.
    Yip, E.L.: FORTRAN Subroutines for Out-of-Core Solutions of Large Complex Linear Systems. Technical Report CR-159142, NASA (November 1979)Google Scholar
  19. 19.
    Quintana-Orti, E., van de Geijn, R.: Updating an LU factorization with pivoting, Technical Report TR-2006-42, The University of Texas at Austin, Department of Computer Sciences (2006), FLAME Working Note 21Google Scholar
  20. 20.
    Gunter, B.C., van de Geijn, R.A.: Parallel out-of-core computation and updating of the QR factorization. ACM Trans. Math. Softw. 31(1), 60–78 (2005)zbMATHCrossRefGoogle Scholar
  21. 21.
    Berry, M.W., Dongarra, J.J., Kim, Y.: A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-hessenberg form. Parallel Comput. 21(8), 1189–1211 (1995)zbMATHCrossRefMathSciNetGoogle Scholar
  22. 22.
    Gustavson, F.G.: New generalized data structures for matrices lead to a variety of high performance algorithms. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 418–436. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  23. 23.
    Bischof, C., van Loan, C.: The WY representation for products of householder matrices. SIAM J. Sci. Stat. Comput. 8(1), 2–13 (1987)CrossRefGoogle Scholar
  24. 24.
    Schreiber, R., van Loan, C.: A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10(1), 53–57 (1989)zbMATHCrossRefGoogle Scholar
  25. 25.
    Bischof, C., van Loan, C.: The WY representation for products of householder matrices. SIAM J. Sci. Stat. Comput. 8(1), 2–13 (1987)CrossRefGoogle Scholar
  26. 26.
    Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: Parallel Tiled QR Factorization for Multicore Architectures. Technical Report UT-CS-07-598, University of Tennessee (2007), LAPACK Working Note 190Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Alfredo Buttari
    • 1
  • Julien Langou
    • 2
  • Jakub Kurzak
    • 1
  • Jack Dongarra
    • 1
    • 3
    • 4
  1. 1.Computer Science Dept.University of TennesseeKnoxvilleUSA
  2. 2.Department of Mathematical SciencesUniversity of Colorado at Denver and Health Sciences CenterColoradoUSA
  3. 3.Oak Ridge National LaboratoryOak RidgeUSA
  4. 4.University of ManchesterUK

Personalised recommendations