The Journal of Supercomputing, Volume 71, Issue 2, pp 369–390

Efficiently solving tri-diagonal system by chunked cyclic reduction and single-GPU shared memory

  • Di Zhao
  • Jinhang Yu


Tri-diagonal systems arise in dynamic problems such as fluid simulation, and solving them efficiently is important to the success of these applications. In this paper, we develop a chunked cyclic reduction solver that runs entirely in GPU shared memory, subject to the capacity constraint of that memory. Computational results show that the solver is highly efficient on an Nvidia TITAN with 48 KB of shared memory: it solves a tri-diagonal system with a 262,144-by-262,144 coefficient matrix in 1.768 ms. The results also show that the solver scales well with the size of the coefficient matrix and with the size of the reduced systems. Because it is built entirely on GPU shared memory, our solver may be faster than existing GPU solvers, owing to the efficiency of shared memory. Its solubility, however, is smaller than that of existing GPU solvers because of the capacity constraint of shared memory, where solubility means the largest coefficient matrix for which our solver can solve the tri-diagonal system.
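As a rough illustration of the underlying technique, the sketch below implements plain serial cyclic reduction in Python. It is not the authors' GPU implementation, and it does not reproduce their chunking scheme, which the abstract does not describe. The function names (`cyclic_reduction`, `max_rows_in_shared`) and the storage layout (four arrays `a`, `b`, `c`, `d` for the sub-diagonal, diagonal, super-diagonal, and right-hand side, with `a[0] = 0` and `c[n-1] = 0`) are assumptions made for this example. No pivoting is performed, so the sketch assumes a system such as a diagonally dominant one for which the divisions are safe.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    Assumes a[0] == 0, c[n-1] == 0, n >= 1, and nonzero pivots
    (for example, a diagonally dominant matrix).
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: eliminate the even-indexed unknowns so that the
    # odd-indexed unknowns form a tri-diagonal system of size n // 2.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            a_next, c_next, d_next = a[i + 1], c[i + 1], d[i + 1]
        else:
            # Last row has no neighbour below it.
            gamma = a_next = c_next = d_next = 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * a_next)
        rc.append(-gamma * c_next)
        rd.append(d[i] - alpha * d[i - 1] - gamma * d_next)

    # Solve the reduced system recursively.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: recover the even-indexed unknowns.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def max_rows_in_shared(capacity_bytes, bytes_per_value, arrays):
    """Rows that fit in a memory budget if each row stores `arrays` values.

    Hypothetical helper: the layout is an assumption, not taken from the paper.
    """
    return capacity_bytes // (bytes_per_value * arrays)


if __name__ == "__main__":
    n = 8
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [1.0] * n
    print(cyclic_reduction(a, b, c, d))
    print(max_rows_in_shared(48 * 1024, 4, 4))
```

Each reduction level halves the number of unknowns, and the updates within a level are independent of one another, which is what makes the method attractive for parallel hardware. Under the assumed layout of four single-precision arrays, a 48 KB budget holds 3072 rows at a time, which gives a sense of why a system with 262,144 unknowns has to be processed in chunks; the actual chunk size used by the authors is not stated in the abstract.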


GPU computing; Parallel algorithm; GPU shared memory; Chunked cyclic reduction; Tri-diagonal system



We thank Exxact Corporation for providing time on a Tesla K20 GPU through Nvidia's GPU Test Drive program. We thank the reviewers for their valuable suggestions.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Center for Cognitive and Brain Science, The Ohio State University, Columbus, USA
  2. College of Medicine, The Ohio State University, Columbus, USA
  3. Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, USA
