Domain Overlap for Iterative Sparse Triangular Solves on GPUs

  • Hartwig Anzt (corresponding author)
  • Edmond Chow
  • Daniel B. Szyld
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computational Science and Engineering book series (LNCSE, volume 113)

Abstract

Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
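To illustrate the basic idea behind iterative triangular solves, the following is a minimal sketch of a (synchronous, dense) Jacobi iteration for a lower triangular system; the paper's actual method is a block-asynchronous GPU variant with directed overlap, which this sketch does not reproduce. The function name and the example matrix are illustrative, not from the paper. Because the Jacobi iteration matrix of a triangular system is strictly triangular (hence nilpotent), the iteration converges in at most n sweeps, and far fewer sweeps often give an acceptable approximation.

```python
import numpy as np

def jacobi_triangular_solve(L, b, sweeps):
    """Approximately solve L x = b for lower triangular L via Jacobi sweeps.

    Dense, synchronous sketch for illustration only; a practical
    implementation would use sparse storage and parallel (GPU) updates.
    """
    D = np.diag(L)            # diagonal of L
    x = b / D                 # initial guess: ignore off-diagonal entries
    for _ in range(sweeps):
        # x_{k+1} = D^{-1} (b - (L - D) x_k)
        x = (b - L @ x + D * x) / D
    return x

# Small illustrative system; exact solution is x = [1, 1, 1].
L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.5, 1.0, 4.0]])
b = np.array([2.0, 4.0, 5.5])
x = jacobi_triangular_solve(L, b, sweeps=3)   # converges exactly within n = 3 sweeps
```

For this 3-by-3 system the iteration reaches the exact solution after two sweeps, since the error is multiplied by a strictly lower triangular (nilpotent) matrix at each step.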

Acknowledgements

This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC-0012538 and DE-SC-0010042. Daniel B. Szyld was supported in part by the U.S. National Science Foundation under grant DMS-1418882. Support from NVIDIA is also gratefully acknowledged.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Hartwig Anzt (1), corresponding author
  • Edmond Chow (2)
  • Daniel B. Szyld (3)
  • Jack Dongarra (1)

  1. University of Tennessee, Knoxville, USA
  2. Georgia Institute of Technology, Atlanta, USA
  3. Temple University, Philadelphia, USA
