Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators

  • Jakub KurzakEmail author
  • Mark Gates
  • Ali Charara
  • Asim YarKhan
  • Ichitaro Yamazaki
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)


This work presents two implementations of linear solvers for distributed-memory machines with GPU accelerators—one based on the Cholesky factorization and one based on the LU factorization with partial pivoting. The routines are developed as part of the Software for Linear Algebra Targeting Exascale (SLATE) package, which represents a sharp departure from the traditional conventions established by legacy packages, such as LAPACK and ScaLAPACK. The article lays out the principles of the new approach, discusses the implementation details, and presents the performance results.


Linear algebra Distributed memory Linear systems of equations Cholesky factorization LU factorization GPU acceleration 



This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative.


  1. 1.
    Andersen, B.S., Gunnels, J.A., Gustavson, F., Wasniewski, J.: A recursive formulation of the inversion of symmetric positive definite matrices in packed storage data format. PARA 2, 287–296 (2002)zbMATHGoogle Scholar
  2. 2.
    Andersen, B.S., Waśniewski, J., Gustavson, F.G.: A recursive formulation of Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Softw. (TOMS) 27(2), 214–244 (2001)CrossRefGoogle Scholar
  3. 3.
    Blackford, L.S., et al.: ScaLAPACK Users’ Guide. SIAM, Philadelphia (1997)CrossRefGoogle Scholar
  4. 4.
    Castaldo, A., Whaley, C.: Scaling LAPACK panel operations using parallel cache assignment. In: ACM Sigplan Notices, vol. 45, pp. 223–232. ACM (2010)Google Scholar
  5. 5.
    Chan, E., van de Geijn, R., Chapman, A.: Managing the complexity of lookahead for LU factorization with pivoting. In: Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 200–208. ACM (2010)Google Scholar
  6. 6.
    Choi, J., Dongarra, J., Ostrouchov, S., Petitet, A., Walker, D., Whaley, C.: Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Sci. Program. 5(3), 173–184 (1996)Google Scholar
  7. 7.
    Dongarra, J., Faverge, M., Ltaief, H., Luszczek, P.: Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting. Concurr. Comput. Pract. Exp. 26(7), 1408–1431 (2014)CrossRefGoogle Scholar
  8. 8.
    Gates, M., et al.: SLATE working note 2: C++ API for BLAS and LAPACK. Technical report ICL-UT-17-03, Innovative Computing Laboratory, University of Tennessee, June 2017. Revision 03–2018Google Scholar
  9. 9.
    Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B., Ling, P.: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In: Kågström, B., Dongarra, J., Elmroth, E., Waśniewski, J. (eds.) PARA 1998. LNCS, vol. 1541, pp. 195–206. Springer, Heidelberg (1998). Scholar
  10. 10.
    Gustavson, F., Karlsson, L., Kågström, B.: Parallel and cache-efficient in-place matrix storage format conversion. ACM Trans. Math. Softw. (TOMS) 38(3), 17 (2012)CrossRefGoogle Scholar
  11. 11.
    Kurzak, J., Dongarra, J.: Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 147–156. Springer, Heidelberg (2007). Scholar
  12. 12.
    Sala, K., Teruel, X., Perez, J.M., Peña, A.J., Beltran, V., Labarta, J.: Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Comput. 85, 153–166 (2019)CrossRefGoogle Scholar
  13. 13.
    Sorin, D.J., Hill, M.D., Wood, D.A.: A primer on memory consistency and cache coherence. Synth. Lect. Comput. Arch. 6(3), 1–212 (2011)Google Scholar
  14. 14.
    Strazdins, P., et al.: A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization (1998)Google Scholar
  15. 15.
    Sukkari, D., Ltaief, H., Keyes, D.: A high performance QDWH-SVD solver using hardware accelerators. ACM Trans. Math. Softw. (TOMS) 43(1), 6 (2016)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.University of TennesseeKnoxvilleUSA
  2. 2.Oak Ridge National LaboratoryOak RidgeUSA
  3. 3.University of ManchesterManchesterUK

Personalised recommendations