Abstract
Modern GPUs deliver high computing power at low cost, but programming them efficiently still requires much time and effort. Tridiagonal system solvers and scan operators are examples of widely used algorithms that can take advantage of these devices. In this article, one tridiagonal system solver and two scan primitive operators are implemented on CUDA GPUs, using a three-phase tuning strategy developed for this purpose. A performance analysis on two different CUDA GPU architectures shows a large improvement over the state of the art.
Notes
The BPLG library is available at http://bplg.des.udc.es/BPLib.zip.
Acknowledgements
This work is supported by the Ministry of Economy and Competitiveness of Spain, TIN2016-75845-P (AEI/FEDER, UE), by the Galician Government and FEDER funds under the Consolidation Program of Competitive Reference Groups (GRC2013-055) as well as under the Consolidation Programme of Competitive Research Units [Ref. R2014/049 and Ref. R2016/037]; and by the FPU Program of the Ministry of Education of Spain (FPU14/02801).
Cite this article
Diéguez, A.P., Amor, M. & Doallo, R. Parallel prefix operations on GPU: tridiagonal system solvers and scan operators. J Supercomput 75, 1510–1523 (2019). https://doi.org/10.1007/s11227-018-2676-z
DOI: https://doi.org/10.1007/s11227-018-2676-z