The Journal of Supercomputing, Volume 73, Issue 5, pp 1760–1781

A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems

  • Wangdong Yang
  • Kenli Li
  • Keqin Li


Solving block-tridiagonal systems is one of the key issues in numerical simulations of many scientific and engineering problems. For most block-tridiagonal matrices, the non-zero elements are concentrated in the blocks on the main diagonal, and the blocks above and below the main diagonal contain few non-zero elements. We therefore present a solving method that mixes direct and iterative methods. In our method, the submatrices on the main diagonal are solved by direct methods within each iteration. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the block-tridiagonal system of linear equations can be improved. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations with thread blocks on the GPU and stores the intermediate data in shared memory, which significantly reduces memory access latency. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by the simulation of the cloud-computing system. The computing performance of solving these block-tridiagonal systems of linear equations can be improved using our method.
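The idea of solving the diagonal blocks directly inside an outer iteration can be illustrated with a block-Jacobi sweep. The following is a minimal pure-Python sketch under that assumption; it is not the authors' GPU algorithm. The function names (`solve_dense`, `matvec`, `block_jacobi`), the block layout, and the small test system are hypothetical and chosen only for illustration. Each diagonal block is solved by Gaussian elimination, and the sparse off-diagonal blocks enter only through the right-hand side, so the per-block solves within one sweep are independent of each other.

```python
def solve_dense(a, rhs):
    """Direct solve of a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(rhs)
    # Augmented copy, so the caller's block is left untouched.
    m = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def matvec(a, v):
    """Dense matrix-vector product for one block."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def block_jacobi(diag, lower, upper, b, tol=1e-10, max_iter=500):
    """Block-Jacobi iteration for a block-tridiagonal system.

    diag[i], lower[i], upper[i] are the blocks of block-row i;
    lower[0] and upper[-1] are unused and may be None.
    """
    n = len(diag)
    x = [[0.0] * len(bi) for bi in b]
    for it in range(1, max_iter + 1):
        new_x = []
        for i in range(n):  # each block-row is independent within a sweep
            r = list(b[i])
            if i > 0:
                r = [ri - yi for ri, yi in zip(r, matvec(lower[i], x[i - 1]))]
            if i < n - 1:
                r = [ri - yi for ri, yi in zip(r, matvec(upper[i], x[i + 1]))]
            new_x.append(solve_dense(diag[i], r))  # direct solve of the diagonal block
        change = max(abs(p - q) for xi, yi in zip(new_x, x) for p, q in zip(xi, yi))
        x = new_x
        if change < tol:
            return x, it
    return x, max_iter


# Hypothetical test system: 3 block-rows of size 2, exact solution all ones.
D = [[4.0, 1.0], [1.0, 4.0]]   # dense diagonal block
E = [[0.1, 0.0], [0.0, 0.1]]   # sparse, weak off-diagonal block
diag = [D, D, D]
lower = [None, E, E]
upper = [E, E, None]
b = [[5.1, 5.1], [5.2, 5.2], [5.1, 5.1]]

solution, iterations = block_jacobi(diag, lower, upper, b)
print(solution, iterations)
```

Because the off-diagonal blocks in this example are small relative to the diagonal blocks, the iteration converges in a small number of sweeps; on a GPU, the loop over block-rows is the part that would map naturally onto independent thread blocks.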


Keywords: Block tridiagonal, Linear equations, Sparse matrix-vector multiplication, Solving



The authors thank the anonymous reviewers for their comments on the manuscript. The research was partially funded by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005 and 61432005), the National Natural Science Foundation of China (Grant Nos. 61370095, 61472124, and 61572175), and the Science and Technology Project of Hunan Province (Grant No. 2015SK20062).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. College of Information Science and Engineering, Hunan University, Changsha, China
  2. Department of Computer Science, State University of New York, New Paltz, USA
