Abstract
Sparse triangular solve (SpTRSV) is one of the most important level-2 kernels in sparse basic linear algebra subprograms (BLAS). Compared to another level-2 sparse BLAS kernel sparse matrix–vector multiplication (SpMV), SpTRSV is in general more difficult to find high parallelism on many-core processors, such as GPUs. Nowadays, much work focuses on reducing dependencies and synchronizations in the level-set and Sync-free algorithms for SpTRSV. However, there is less work that can make good use of sparse spatial structure for SpTRSV on GPUs. In this paper, we propose a tiled algorithm called TileSpTRSV for optimizing SpTRSV on GPUs through exploiting 2D spatial structure of sparse matrices. We design two algorithm implementations, i.e., TileSpTRSV_level-set and TileSpTRSV_sync-free, for TileSpTRSV on top of level-set and Sync-free algorithms, respectively. By testing 16 representative matrices on a latest NVIDIA GPU, the experimental results show that TileSpTRSV_level-set gives on average 5.29\(\times\) (up to 38.10\(\times\)), 5.33\(\times\) (up to 21.32\(\times\)) and 2.62\(\times\) (up to 12.87\(\times\)) speedups over cuSPARSE, Sync-free and Recblock algorithms on the 16 representative matrices, respectively.
Similar content being viewed by others
References
Ahmad, N., Yilmaz, B., Unat, D.: A split execution model for sptrsv. IEEE Trans. Parallel Distrib. Syst. 32(11), 2809–2822 (2021)
Anderson, E., Saad, Y.: Solving sparse triangular linear systems on parallel computers. Int. J. High Speed Comput. 1(1), 73–95 (1989)
Anzt, H., Chow, E., Dongarra, J.: Iterative sparse triangular solves for preconditioning. In: Euro-Par ’15. p 650–661 (2015)
Anzt, H., Chow, E., Szyld, D.B., et al.: Domain overlap for iterative sparse triangular solves on GPUs. Softw. Exascale Comput. SPPEXA 2013–2015, 527–545 (2016)
Anzt, H., Chow, E., Dongarra, J.: ParILUT—a new parallel threshold ILU factorization. SIAM J. Sci. Comput. 40(4), C503–C519 (2018a)
Anzt, H., Huckle, T., Brackle, J., et al.: Incomplete sparse approximate inverses for parallel preconditioning. Parallel Comput. 71, 1–22 (2018b)
Bradley, A.M.: A hybrid multithreaded direct sparse triangular solver. In: SIAM CSC workshop ’16, pp 13–22 (2016)
Buttari, A., Eijkhout, V., Langou, J., et al.: Performance optimization and modeling of blocked sparse kernels. Int. J. High Perform. Comput. Appl. 21(4), 467–484 (2007)
Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on gpus. In: PPoPP ’10, pp 115–126 (2010)
Davis, T.: Direct methods for sparse linear systems. Society for Industrial and Applied Mathematics (2006)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 11–125 (2011)
Duff, I.S., Erisman, A.M., Reid, J.K.: Direct methods for sparse matrices, 2nd edn. Oxford University Press, Inc, Oxford (2017)
Dufrechou, E., Ezzatti, P.: A new GPU algorithm to compute a level set-based analysis for the parallel solution of sparse triangular systems. In: IPDPS ’18, pp 920–929 (2018a)
Dufrechou, E., Ezzatti, P.: Solving sparse triangular linear systems in modern GPUs: a synchronization-free algorithm. In: PDP ’18, pp 196–203 (2018b)
Hou, K., Liu, W., Wang, H., et al. Fast segmented sort on GPUs. In: ICS ’17, pp 12:1–12:10 (2017)
Ji, H., Song, H., Lu, S., et al. Tilespmspv: a tiled algorithm for sparse matrix-sparse vector multiplication on gpus. In: ICPP ’22 (2022)
Kabir, H., Booth, J.D., Aupy, G., et al.: STS-k: A multilevel sparse triangular solution scheme for NUMA multicores. In: SC ’15, pp 55:1–55:11 (2015)
Li, X.S.: An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans. Math. Softw. 31(3), 302–325 (2005)
Li, R., Saad, Y.: GPU-accelerated preconditioned iterative linear solvers. J. Supercomput. 63(2), 443–466 (2013)
Liu, W.: Parallel and scalable sparse basic linear algebra subprograms. PhD thesis, University of Copenhagen (2015)
Liu, W., Li, A., Hogg, J., et al.: A synchronization-free algorithm for parallel sparse triangular solves. In: Euro-Par ’16, pp 617–630 (2016)
Liu, W., Vinter, B.: A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors. J. Parallel Distrib. Comput. 85(C), 47–61 (2015a)
Liu, W., Vinter, B.: CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication. In: ICS ’15, pp 339–350 (2015b)
Liu, W., Vinter, B.: Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Comput. 49(C), 179–193 (2015c)
Liu, W., Li, A., Hogg, J.D., et al.: Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurr. Comput. Pract. Exp. 29(21), e4244 (2017)
Liu, J., He, X., Liu, W., et al.: Register-aware optimizations for parallel sparse matrix-matrix multiplication. Int. J. Parallel Program. 47, 403–417 (2019)
Lu, Z., Niu, Y., Liu, W.: Efficient block algorithms for parallel sparse triangular solve. In: ICPP ’20, pp 1–11 (2020)
Mayer, J.: Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86(4), 291–312 (2009)
Naumov, M.: Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. Tech. rep, NVIDIA (2011)
Naumov, M., Castonguay, P., Cohen, J.: Parallel graph coloring with applications to the incomplete-LU factorization on the GPU. Nvidia White Paper (2015)
Niu, Y., Lu, Z., Dong, M., et al.: Tilespmv: a tiled algorithm for sparse matrix-vector multiplication on gpus. In: IPDPS ’21, IEEE, pp 68–78 (2021)
Niu, Y., Lu, Z., Ji, H., et al.: Tilespgemm: a tiled algorithm for parallel sparse general matrix-matrix multiplication on gpus. In: PPoPP ’22, pp 90–106 (2022)
Park, J., Smelyanskiy, M., Sundaram, N., et al.: Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In: ISC ’14, pp 124–140 (2014)
Saltz, J.H.: Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM J. Sci. Stat. Comput. 11(1), 123–144 (1990)
Schreiber, R., Tang, W.P.: Vectorizing the conjugate gradient method. In: Proceedings of the Symposium on CYBER 205 Applications (1982)
Su, J., Zhang, F., Liu, W., et al.: CapelliniSpTRSV: a thread-level synchronization-free sparse triangular solve on GPUs. In: ICPP ’20 (2020)
Suchoski, B., Severn, C., Shantharam, M., et al.: Adapting sparse triangular solution to GPUs. In: ICPPW ’12, pp 140–148 (2012)
Vuduc, R., Kamil, S., Hsu, J., et al.: Automatic performance tuning and analysis of sparse triangular solve. In: ICS ’02 Workshop (2002)
Wang, X., Liu, W., Xue, W., et al.: SwSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: PPoPP ’18, p 338-353 (2018a)
Wang, X., Xu, P., Xue, W., et al.: A fast sparse triangular solver for structured-grid problems on sunway many-core processor SW26010. In: ICPP ’18 (2018b)
Wang, T., Li, W., Pei, H., et al.: Accelerating sparse lu factorization with density-aware adaptive matrix multiplication for circuit simulation. In: DAC ’23 (2023)
Xie, Z., Tan, G., Liu, W., et al.: IA-SpGEMM: An input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. In: ICS ’19, pp 94–105 (2019)
Xie, C., Chen, J., Firoz, J., et al.: Fast and scalable sparse triangular solver for multi-gpu based hpc architectures. In: ICPP ’21, pp 1–11 (2021)
Yan, S., Li, C., Zhang, Y., et al. (2014) yaspmv: yet another spmv framework on gpus. In: PPoPP ’14, pp 107–118 (2021)
Zhang, F., Su, J., Liu, W., et al.: Yuenyeungsptrsv: a thread-level and warp-level fusion synchronization-free sparse triangular solve. IEEE Trans. Parallel Distrib. Syst. 32(9), 2321–2337 (2021)
Zhao, J., Wen, Y., Luo, Y., et al.: Sflu: Synchronization-free sparse lu factorization for fast circuit simulation on gpus. In: DAC ’21, pp 37–42 (2021)
Acknowledgements
We deeply appreciate the invaluable comments from all the reviewers. We are also so grateful to Hemeng Wang for the help in the experimental test. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, Z., Liu, W. TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs. CCF Trans. HPC 5, 129–143 (2023). https://doi.org/10.1007/s42514-023-00151-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s42514-023-00151-1