Skip to main content
Log in

TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs

  • Regular Paper
  • Published:
CCF Transactions on High Performance Computing Aims and scope Submit manuscript

Abstract

Sparse triangular solve (SpTRSV) is one of the most important level-2 kernels in sparse basic linear algebra subprograms (BLAS). Compared to another level-2 sparse BLAS kernel sparse matrix–vector multiplication (SpMV), SpTRSV is in general more difficult to find high parallelism on many-core processors, such as GPUs. Nowadays, much work focuses on reducing dependencies and synchronizations in the level-set and Sync-free algorithms for SpTRSV. However, there is less work that can make good use of sparse spatial structure for SpTRSV on GPUs. In this paper, we propose a tiled algorithm called TileSpTRSV for optimizing SpTRSV on GPUs through exploiting 2D spatial structure of sparse matrices. We design two algorithm implementations, i.e., TileSpTRSV_level-set and TileSpTRSV_sync-free, for TileSpTRSV on top of level-set and Sync-free algorithms, respectively. By testing 16 representative matrices on a latest NVIDIA GPU, the experimental results show that TileSpTRSV_level-set gives on average 5.29\(\times\) (up to 38.10\(\times\)), 5.33\(\times\) (up to 21.32\(\times\)) and 2.62\(\times\) (up to 12.87\(\times\)) speedups over cuSPARSE, Sync-free and Recblock algorithms on the 16 representative matrices, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

References

  • Ahmad, N., Yilmaz, B., Unat, D.: A split execution model for sptrsv. IEEE Trans. Parallel Distrib. Syst. 32(11), 2809–2822 (2021)

    Article  Google Scholar 

  • Anderson, E., Saad, Y.: Solving sparse triangular linear systems on parallel computers. Int. J. High Speed Comput. 1(1), 73–95 (1989)

    Article  MATH  Google Scholar 

  • Anzt, H., Chow, E., Dongarra, J.: Iterative sparse triangular solves for preconditioning. In: Euro-Par ’15. p 650–661 (2015)

  • Anzt, H., Chow, E., Szyld, D.B., et al.: Domain overlap for iterative sparse triangular solves on GPUs. Softw. Exascale Comput. SPPEXA 2013–2015, 527–545 (2016)

    MathSciNet  Google Scholar 

  • Anzt, H., Chow, E., Dongarra, J.: ParILUT—a new parallel threshold ILU factorization. SIAM J. Sci. Comput. 40(4), C503–C519 (2018a)

    Article  MathSciNet  MATH  Google Scholar 

  • Anzt, H., Huckle, T., Brackle, J., et al.: Incomplete sparse approximate inverses for parallel preconditioning. Parallel Comput. 71, 1–22 (2018b)

    Article  MathSciNet  Google Scholar 

  • Bradley, A.M.: A hybrid multithreaded direct sparse triangular solver. In: SIAM CSC workshop ’16, pp 13–22 (2016)

  • Buttari, A., Eijkhout, V., Langou, J., et al.: Performance optimization and modeling of blocked sparse kernels. Int. J. High Perform. Comput. Appl. 21(4), 467–484 (2007)

    Article  Google Scholar 

  • Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on gpus. In: PPoPP ’10, pp 115–126 (2010)

  • Davis, T.: Direct methods for sparse linear systems. Society for Industrial and Applied Mathematics (2006)

    Book  MATH  Google Scholar 

  • Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 11–125 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  • Duff, I.S., Erisman, A.M., Reid, J.K.: Direct methods for sparse matrices, 2nd edn. Oxford University Press, Inc, Oxford (2017)

    Book  MATH  Google Scholar 

  • Dufrechou, E., Ezzatti, P.: A new GPU algorithm to compute a level set-based analysis for the parallel solution of sparse triangular systems. In: IPDPS ’18, pp 920–929 (2018a)

  • Dufrechou, E., Ezzatti, P.: Solving sparse triangular linear systems in modern GPUs: a synchronization-free algorithm. In: PDP ’18, pp 196–203 (2018b)

  • Hou, K., Liu, W., Wang, H., et al. Fast segmented sort on GPUs. In: ICS ’17, pp 12:1–12:10 (2017)

  • Ji, H., Song, H., Lu, S., et al. Tilespmspv: a tiled algorithm for sparse matrix-sparse vector multiplication on gpus. In: ICPP ’22 (2022)

  • Kabir, H., Booth, J.D., Aupy, G., et al.: STS-k: A multilevel sparse triangular solution scheme for NUMA multicores. In: SC ’15, pp 55:1–55:11 (2015)

  • Li, X.S.: An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans. Math. Softw. 31(3), 302–325 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  • Li, R., Saad, Y.: GPU-accelerated preconditioned iterative linear solvers. J. Supercomput. 63(2), 443–466 (2013)

    Article  Google Scholar 

  • Liu, W.: Parallel and scalable sparse basic linear algebra subprograms. PhD thesis, University of Copenhagen (2015)

  • Liu, W., Li, A., Hogg, J., et al.: A synchronization-free algorithm for parallel sparse triangular solves. In: Euro-Par ’16, pp 617–630 (2016)

  • Liu, W., Vinter, B.: A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors. J. Parallel Distrib. Comput. 85(C), 47–61 (2015a)

    Article  Google Scholar 

  • Liu, W., Vinter, B.: CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication. In: ICS ’15, pp 339–350 (2015b)

  • Liu, W., Vinter, B.: Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Comput. 49(C), 179–193 (2015c)

    Article  MathSciNet  Google Scholar 

  • Liu, W., Li, A., Hogg, J.D., et al.: Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurr. Comput. Pract. Exp. 29(21), e4244 (2017)

    Article  Google Scholar 

  • Liu, J., He, X., Liu, W., et al.: Register-aware optimizations for parallel sparse matrix-matrix multiplication. Int. J. Parallel Program. 47, 403–417 (2019)

    Article  Google Scholar 

  • Lu, Z., Niu, Y., Liu, W.: Efficient block algorithms for parallel sparse triangular solve. In: ICPP ’20, pp 1–11 (2020)

  • Mayer, J.: Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86(4), 291–312 (2009)

    Article  MathSciNet  MATH  Google Scholar 

  • Naumov, M.: Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. Tech. rep, NVIDIA (2011)

  • Naumov, M., Castonguay, P., Cohen, J.: Parallel graph coloring with applications to the incomplete-LU factorization on the GPU. Nvidia White Paper (2015)

  • Niu, Y., Lu, Z., Dong, M., et al.: Tilespmv: a tiled algorithm for sparse matrix-vector multiplication on gpus. In: IPDPS ’21, IEEE, pp 68–78 (2021)

  • Niu, Y., Lu, Z., Ji, H., et al.: Tilespgemm: a tiled algorithm for parallel sparse general matrix-matrix multiplication on gpus. In: PPoPP ’22, pp 90–106 (2022)

  • Park, J., Smelyanskiy, M., Sundaram, N., et al.: Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In: ISC ’14, pp 124–140 (2014)

  • Saltz, J.H.: Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM J. Sci. Stat. Comput. 11(1), 123–144 (1990)

    Article  MathSciNet  MATH  Google Scholar 

  • Schreiber, R., Tang, W.P.: Vectorizing the conjugate gradient method. In: Proceedings of the Symposium on CYBER 205 Applications (1982)

  • Su, J., Zhang, F., Liu, W., et al.: CapelliniSpTRSV: a thread-level synchronization-free sparse triangular solve on GPUs. In: ICPP ’20 (2020)

  • Suchoski, B., Severn, C., Shantharam, M., et al.: Adapting sparse triangular solution to GPUs. In: ICPPW ’12, pp 140–148 (2012)

  • Vuduc, R., Kamil, S., Hsu, J., et al.: Automatic performance tuning and analysis of sparse triangular solve. In: ICS ’02 Workshop (2002)

  • Wang, X., Liu, W., Xue, W., et al.: SwSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: PPoPP ’18, p 338-353 (2018a)

  • Wang, X., Xu, P., Xue, W., et al.: A fast sparse triangular solver for structured-grid problems on sunway many-core processor SW26010. In: ICPP ’18 (2018b)

  • Wang, T., Li, W., Pei, H., et al.: Accelerating sparse lu factorization with density-aware adaptive matrix multiplication for circuit simulation. In: DAC ’23 (2023)

  • Xie, Z., Tan, G., Liu, W., et al.: IA-SpGEMM: An input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. In: ICS ’19, pp 94–105 (2019)

  • Xie, C., Chen, J., Firoz, J., et al.: Fast and scalable sparse triangular solver for multi-gpu based hpc architectures. In: ICPP ’21, pp 1–11 (2021)

  • Yan, S., Li, C., Zhang, Y., et al. (2014) yaspmv: yet another spmv framework on gpus. In: PPoPP ’14, pp 107–118 (2021)

  • Zhang, F., Su, J., Liu, W., et al.: Yuenyeungsptrsv: a thread-level and warp-level fusion synchronization-free sparse triangular solve. IEEE Trans. Parallel Distrib. Syst. 32(9), 2321–2337 (2021)

    Article  Google Scholar 

  • Zhao, J., Wen, Y., Luo, Y., et al.: Sflu: Synchronization-free sparse lu factorization for fast circuit simulation on gpus. In: DAC ’21, pp 37–42 (2021)

Download references

Acknowledgements

We deeply appreciate the invaluable comments from all the reviewers. We are also so grateful to Hemeng Wang for the help in the experimental test. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Weifeng Liu.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lu, Z., Liu, W. TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs. CCF Trans. HPC 5, 129–143 (2023). https://doi.org/10.1007/s42514-023-00151-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s42514-023-00151-1

Keywords

Navigation