TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs

Lu, Zhengyang; Liu, Weifeng

doi:10.1007/s42514-023-00151-1

Regular Paper
Published: 12 June 2023

Volume 5, pages 129–143, (2023)
CCF Transactions on High Performance Computing

Zhengyang Lu¹ &
Weifeng Liu¹

Abstract

Sparse triangular solve (SpTRSV) is one of the most important level-2 kernels in sparse basic linear algebra subprograms (BLAS). Compared with sparse matrix–vector multiplication (SpMV), another level-2 sparse BLAS kernel, SpTRSV is generally harder to parallelize efficiently on many-core processors such as GPUs. Much recent work focuses on reducing dependencies and synchronizations in the level-set and Sync-free algorithms for SpTRSV. However, little work makes good use of the spatial structure of sparse matrices for SpTRSV on GPUs. In this paper, we propose a tiled algorithm called TileSpTRSV that optimizes SpTRSV on GPUs by exploiting the 2D spatial structure of sparse matrices. We design two implementations, TileSpTRSV_level-set and TileSpTRSV_sync-free, built on top of the level-set and Sync-free algorithms, respectively. Experimental results on 16 representative matrices on a recent NVIDIA GPU show that TileSpTRSV_level-set achieves average speedups of 5.29\(\times\) (up to 38.10\(\times\)), 5.33\(\times\) (up to 21.32\(\times\)) and 2.62\(\times\) (up to 12.87\(\times\)) over the cuSPARSE, Sync-free and Recblock algorithms, respectively.
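To make the dependency problem mentioned in the abstract concrete, the following is a minimal sequential sketch of the classical level-set scheme for a lower-triangular system in CSR format. It is not the TileSpTRSV algorithm itself, whose tile layout is not described in this excerpt. The function names `level_sets` and `sptrsv_level_set` and the small 4×4 example matrix are illustrative assumptions; the sketch also assumes every row stores a nonzero diagonal entry.

```python
def level_sets(row_ptr, col_idx):
    """Group the rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on row j whenever L[i][j] != 0 with j < i, so its level is
    one more than the deepest level among the rows it depends on.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # off-diagonal entry: x[i] needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def sptrsv_level_set(row_ptr, col_idx, vals, b):
    """Solve L x = b one level at a time (assumes a stored nonzero diagonal)."""
    x = [0.0] * len(b)
    for rows in level_sets(row_ptr, col_idx):
        # Rows within one level are mutually independent; on a GPU they could
        # be processed in parallel, with a synchronization between levels.
        for i in rows:
            acc, diag = b[i], None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Hypothetical example matrix:
# L = [[2, 0, 0, 0],
#      [0, 4, 0, 0],
#      [1, 0, 1, 0],
#      [0, 2, 3, 5]]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]
b = [2.0, 8.0, 4.0, 33.0]

print(level_sets(row_ptr, col_idx))
print(sptrsv_level_set(row_ptr, col_idx, vals, b))
```

In this example rows 0 and 1 have no dependencies, row 2 waits on row 0, and row 3 waits on rows 1 and 2, so the solve needs three levels. The number of levels bounds the available parallelism, which is why reducing dependencies and synchronizations matters for SpTRSV on GPUs.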



Acknowledgements

We deeply appreciate the invaluable comments from all the reviewers. We are also grateful to Hemeng Wang for his help with the experimental tests. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415.

Author information

Authors and Affiliations

Super Scientific Software Laboratory, China University of Petroleum-Beijing, Beijing, China
Zhengyang Lu & Weifeng Liu

Corresponding author

Correspondence to Weifeng Liu.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, Z., Liu, W. TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs. CCF Trans. HPC 5, 129–143 (2023). https://doi.org/10.1007/s42514-023-00151-1

Received: 17 December 2022
Accepted: 03 May 2023
Published: 12 June 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s42514-023-00151-1
