
Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs

  • Edmond Chow
  • Hartwig Anzt
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9137)

Abstract

This paper presents a GPU implementation of an asynchronous iterative algorithm for computing incomplete factorizations. Asynchronous algorithms, with their ability to tolerate memory latency, form an important class of algorithms for modern computer architectures. Our GPU implementation considers several non-traditional techniques that can be important for optimizing the convergence and data locality of asynchronous algorithms. These techniques include controlling the order in which variables are updated by controlling the execution order of thread blocks, exploiting cache reuse between thread blocks, and managing the amount of parallelism to control the convergence of the algorithm.
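
As a concrete illustration of the kind of fixed-point iteration the paper builds on, the sketch below shows one asynchronous sweep of a fine-grained incomplete LU factorization in CUDA, in the style of the Chow-Patel formulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the element list, the lpos/upos index maps, and the storage layout (L in CSR with the unit diagonal not stored; U in CSC with sorted indices, diagonal stored last in each column) are all illustrative. One thread updates one nonzero (i,j) of the sparsity pattern using whatever values of the other unknowns are currently in memory; the paper's techniques (ordering thread-block execution, cache reuse between blocks, limiting parallelism) would be layered on top of such a kernel, for example by reordering the element list.

    // Illustrative sketch, not the authors' code. One asynchronous sweep of
    // the fine-grained ILU fixed-point iteration: one thread per nonzero
    // (i,j) of the sparsity pattern. All array names are assumptions.
    __global__ void ilu_async_sweep(
        int           nnz,     // number of pattern entries to update
        const int    *elem_i,  // row index i of entry e
        const int    *elem_j,  // column index j of entry e
        const double *a_val,   // original matrix value a_ij of entry e
        const int    *lpos,    // e -> position of (i,j) in Lval (when i > j)
        const int    *upos,    // e -> position of (i,j) in Uval (when i <= j)
        const int *Lrow, const int *Lcol, double *Lval,  // L in CSR
        const int *Ucol, const int *Urow, double *Uval)  // U in CSC
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nnz) return;

        int i = elem_i[e], j = elem_j[e];
        int kmax = min(i, j);          // the sum runs over k < min(i,j)
        double s = a_val[e];

        // s = a_ij - sum_{k < min(i,j)} l_ik * u_kj : merge the sparse row
        // L(i,:) with the sparse column U(:,j). Other threads may overwrite
        // Lval/Uval while we read them; tolerating such stale or freshly
        // updated reads is exactly the asynchronous computational model.
        int pl = Lrow[i], le = Lrow[i + 1];
        int pu = Ucol[j], ue = Ucol[j + 1];
        while (pl < le && pu < ue) {
            int kl = Lcol[pl], ku = Urow[pu];
            if (kl == ku) {
                if (kl >= kmax) break;      // only strictly earlier k
                s -= Lval[pl] * Uval[pu];
                ++pl; ++pu;
            } else if (kl < ku) {
                ++pl;
            } else {
                ++pu;
            }
        }

        if (i > j)
            // strictly lower part: l_ij = s / u_jj, with u_jj read as the
            // last (diagonal) entry of column j of U
            Lval[lpos[e]] = s / Uval[Ucol[j + 1] - 1];
        else
            Uval[upos[e]] = s;              // upper part and diagonal: u_ij
    }

    // Host side (illustrative): a few sweeps, with no synchronization
    // beyond the implicit barrier at the end of each kernel launch:
    //   for (int sweep = 0; sweep < 5; ++sweep)
    //     ilu_async_sweep<<<(nnz + 255) / 256, 256>>>(...);

Repeating a small number of such kernel launches constitutes the fixed-point iteration; what makes the sweep asynchronous is that threads within a launch read partially updated Lval and Uval values rather than waiting on any element-level ordering.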

Keywords

Shared Memory · Thread Block · Residual Norm · Task List · Reuse Factor

Notes

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC-0012538 and DE-SC-0010042. Support from NVIDIA is also acknowledged.


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Georgia Institute of Technology, Atlanta, USA
  2. University of Tennessee, Knoxville, USA
