The Journal of Supercomputing, Volume 75, Issue 3, pp 1510–1523

Parallel prefix operations on GPU: tridiagonal system solvers and scan operators

  • Adrián P. Diéguez
  • Margarita Amor
  • Ramón Doallo

Abstract

Modern GPUs offer high computing power at low cost, but programming them efficiently still requires considerable time and effort. Tridiagonal system solvers and scan operators are examples of widely used algorithms that can take advantage of these devices. In this article, one tridiagonal system solver and two scan primitive operators are implemented on CUDA GPUs. To do so, a tuning strategy based on three phases is developed. Additionally, a performance analysis is carried out on two different CUDA GPU architectures, showing a large improvement over the state of the art.
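For readers unfamiliar with the scan primitive named in the abstract, the following is a minimal sketch of an inclusive scan (prefix sum) computed in doubling strides, the general pattern that parallel prefix methods build on. It is written in Python and simulated sequentially for readability. It is not the CUDA implementation evaluated in the article, and the function name `inclusive_scan` is purely illustrative.

```python
from operator import add


def inclusive_scan(values, op=add):
    """Inclusive scan in doubling strides, simulated sequentially.

    Illustrative only: on a GPU each position i would be handled by its
    own thread, and the copy `prev` stands in for the synchronization
    barrier between consecutive steps.
    """
    out = list(values)
    stride = 1
    while stride < len(out):
        prev = list(out)  # snapshot of the previous step's results
        for i in range(stride, len(out)):
            # Combine with the partial result `stride` positions to the left,
            # keeping left-to-right operand order for non-commutative ops.
            out[i] = op(prev[i - stride], prev[i])
        stride *= 2
    return out


print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

The loop runs about log2(n) steps, each of which could be executed fully in parallel, at the cost of more total operations than a sequential running sum. Any associative operator can be passed as `op`.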

Keywords

GPU · CUDA · Tuning · Tridiagonal systems · Scan

Notes

Acknowledgements

This work is supported by the Ministry of Economy and Competitiveness of Spain, TIN2016-75845-P (AEI/FEDER, UE), by the Galician Government and FEDER funds under the Consolidation Program of Competitive Reference Groups (GRC2013-055) as well as under the Consolidation Programme of Competitive Research Units [Ref. R2014/049 and Ref. R2016/037]; and by the FPU Program of the Ministry of Education of Spain (FPU14/02801).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Grupo de Arquitectura de Computadores (GAC), Facultade de Informática, Universidade da Coruña, A Coruña, Spain
