
Optimal size of the block in block GMRES on GPUs: computational model and experiments

  • Original Paper
  • Published in: Numerical Algorithms

Abstract

The block version of GMRES (BGMRES) is most advantageous over its single right-hand side (RHS) counterpart when the cost of communication is high while the cost of floating point operations is not. This is precisely the situation on modern graphics processing units (GPUs), while it is generally not the case on traditional central processing units (CPUs). In this paper, experiments on both GPUs and CPUs compare the performance of BGMRES against GMRES as the number of RHS increases, with a particular focus on GPU performance. The experiments indicate that there are many cases in which BGMRES is slower than GMRES on CPUs, but faster on GPUs. Furthermore, when varying the number of RHS on the GPU, there is an optimal number of RHS at which BGMRES is most clearly advantageous over GMRES. A computational model for the GPU is developed using hardware-specific parameters, providing insight into how the qualitative behavior of BGMRES changes as the number of RHS increases; this model also helps explain the phenomena observed in the experiments.
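Why an optimal block size can exist at all can be illustrated with a simple roofline-style sketch. This is a hypothetical cost model with assumed constants and cost terms, not the model developed in the paper: the sparse matrix is read once per block product regardless of the number of RHS s, so its memory traffic is amortized over the block, while the dense orthogonalization work grows roughly quadratically in s.

```python
def bgmres_iter_time(s, nz, n, m, peak_flops, bandwidth):
    """Hypothetical per-iteration time for block size s (illustrative only).

    nz: nonzeros in the matrix, n: problem size, m: basis size,
    peak_flops: peak FP64 rate (flop/s), bandwidth: memory bandwidth (B/s).
    All cost constants below are assumed, not taken from the paper.
    """
    # SpMM: ~2*nz*s flops; the matrix (assumed CSR, ~12 bytes/nonzero)
    # is read once regardless of s, so only the vector traffic scales.
    spmm_flops = 2.0 * nz * s
    spmm_bytes = 12.0 * nz + 8.0 * n * 2 * s
    t_spmm = max(spmm_flops / peak_flops, spmm_bytes / bandwidth)
    # Block orthogonalization against m*s basis vectors: BLAS-3 work,
    # assumed ~4*n*m*s^2 flops, streaming the basis once per iteration.
    orth_flops = 4.0 * n * m * s * s
    orth_bytes = 8.0 * n * (m * s + 2 * s)
    t_orth = max(orth_flops / peak_flops, orth_bytes / bandwidth)
    return t_spmm + t_orth
```

With memory-bandwidth-limited parameters (e.g., V100-like peak 7 TF/s and 900 GB/s), the time *per RHS* first decreases with s as the SpMM traffic is amortized, then increases as the quadratic orthogonalization cost dominates, yielding an interior optimum, which is the qualitative behavior the paper's model explains.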


Data availability

As stated in Section 4, all test problems used matrices from the SuiteSparse library [6], which is publicly available.

Notes

  1. Alternatively, nz can represent the number of non-zeros needed in the application of the coefficient matrix and the preconditioner.

References

  1. Arafa, Y., Badawy, A.A., Chennupati, G., Santhi, N., Eidenbenz, S.: PPT-GPU: scalable GPU performance modeling. IEEE Comput. Archit. Lett. 18(1), 55–58 (2019)

  2. The Belos Project Team: The Belos Project Website. https://docs.trilinos.org/dev/packages/belos/doc/html/classBelos_1_1PseudoBlockGmresSolMgr.html. Accessed 2021 Dec 01

  3. Blackford, L.S., Demmel, J., Dongarra, J.J., Duff, I.S., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., Whaley, R.C.: An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28(2), 135–151 (2002)

  4. Carson, E., Lund, K., Rozloznik, M.: The stability of block variants of classical Gram-Schmidt. SIAM J. Matrix Anal. Appl. 42(3), 1365–1380 (2021)

  5. Dagum, L., Menon, R.: OpenMP : an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)

  6. Davis, T.A., Hu, Y.: The university of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1) (2011)

  7. Dongarra, J.J., Croz, J.D., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)

  8. Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014). Domain-Specific Languages and High-Level Frameworks for High-Performance Computing

  9. Gutknecht, M.H.: Block Krylov subspace methods for linear systems with multiple right-hand sides: an introduction. In: Siddiqi, A.H., Duff, I.S., Ole, C. (eds.) Modern Mathematical Models, Methods and Algorithms for Real World Systems, Chapter 10, pp. 420–447. Anamaya Publishers, New Delhi (2006)

  10. Gutknecht, M.H., Schmelzer, T.: Updating the QR decomposition of block tridiagonal and block Hessenberg matrices. Appl. Numer. Math. 58(6), 871–883 (2008)

  11. Hoemmen, M.: Communication-avoiding Krylov subspace methods. PhD thesis, EECS Department, University of California, Berkeley (2010)

  12. Intel. Intel math kernel library. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html. Accessed 2022 Apr 24

  13. Langou, J.: Solving large linear systems with multiple right-hand sides. PhD thesis, INSA de Toulouse (2003)

  14. Liesen, J., Strakoš, Z.: Krylov subspace methods. Principles and analysis. Numerical Mathematics and Scientific Computation, 1st edn. Oxford University Press, Oxford (2013)

  15. NVIDIA. cuBLAS documentation. https://docs.nvidia.com/cuda/cublas/index.html. Accessed 2021 Dec 08

  16. NVIDIA. cuSPARSE documentation. https://docs.nvidia.com/cuda/cusparse/index.html. Accessed 2021 Dec 08

  17. NVIDIA. NVIDIA V100 Tensor Core GPU. https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf. Accessed 2021 Apr 07

  18. O’Leary, D.P.: The block conjugate gradient algorithm and related methods. Linear Algebra Appl. 29, 293–322 (1980). Special Volume Dedicated to Alson S. Householder.

  19. Rashedi, S., Ebadi, G., Birk, S., Frommer, A.: On short recurrence Krylov type methods for linear systems with many right-hand sides. J. Comput. Appl. Math. 300, 18–29 (2016)

  20. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2003)

  21. Saad, Y., Schultz, M.H.: GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)

  22. Simoncini, V., Gallopoulos, E.: An iterative method for nonsymmetric systems with multiple right-hand sides. SIAM J. Sci. Comput. 16, 917–933 (1995)

  23. Simoncini, V., Gallopoulos, E.: Convergence properties of block GMRES and matrix polynomials. Linear Algebra Appl. 247, 97–119 (1996)

  24. Soodhalter, K.: Block Krylov subspace recycling for shifted systems with unrelated right-hand sides. SIAM J. Sci. Comput. 38, A302–A324 (2016)

  25. Vital, B.: Etude de quelques méthodes de résolution de problémes linéaires de grande taille sur multiprocesseur. PhD thesis, Université de Rennes I (1990)

  26. Yamazaki, I., Hoemmen, M., Luszczek, P., Dongarra, J.J.: Improving performance of GMRES by reducing communication and pipelining global collectives. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1118–1127 (2017)

Acknowledgements

We benefited from helpful comments and questions we received from Valeria Simoncini and Yousef Saad. We thank the two referees for their careful reading of the manuscript and their comments and questions which helped improve our presentation.

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the US Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. This work was in part supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. This research includes calculations carried out on HPC resources at Temple University supported in part by the National Science Foundation through major research instrumentation grant number 1625061 and by the US Army Research Laboratory under contract number W911NF-16-2-0189.

Author information

Corresponding author

Correspondence to Andrew J. Higgins.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Complete results

In the tables, entries marked ‘-’ indicate that the method failed to converge. All times are reported in seconds, and k is the restart parameter. Iteration counts reported for s×GMRES are the average number of iterations required per RHS until convergence. Data for GMRES-LI are included only for problems where the method converges for at least s = 5 RHS.

Table 5 FEM_3D_thermal2 CPU results
Table 6 FEM_3D_thermal2 GPU results
Table 7 stomach CPU results
Table 8 stomach GPU results
Table 9 cage13 CPU results
Table 10 cage13 GPU results
Table 11 thermomech_dM CPU results
Table 12 thermomech_dM GPU results
Table 13 Dubcova3 CPU results
Table 14 Dubcova3 GPU results
Table 15 CoupCons3D CPU results
Table 16 CoupCons3D GPU results
Table 17 CO CPU results
Table 18 CO GPU results
Table 19 hood CPU results
Table 20 hood GPU results
Table 21 shipsec8 CPU results
Table 22 shipsec8 GPU results
Table 23 SiO2 CPU results
Table 24 SiO2 GPU results

We note that BGMRES is often slower than s×GMRES when s is small, e.g., s = 1. This is due to the different methods used to solve the least squares problem in the two algorithms: BGMRES uses Householder reflections, which are advantageous for multiple RHS, while s×GMRES uses Givens rotations, which are designed to minimize work for a single RHS.
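The two least squares approaches can be sketched in NumPy. This is an illustrative comparison, not the paper's implementation; the matrix shapes and the use of NumPy's Householder-based QR for the block case are assumptions.

```python
import numpy as np

def givens_hessenberg_lstsq(H, b):
    """Solve min ||b - H y|| for an (m+1) x m upper Hessenberg H
    by eliminating the single subdiagonal with Givens rotations,
    as in standard single-RHS GMRES."""
    H = H.astype(float).copy()
    b = b.astype(float).copy()
    m = H.shape[1]
    for j in range(m):
        a, c0 = H[j, j], H[j + 1, j]
        r = np.hypot(a, c0)
        c, s = a / r, c0 / r
        G = np.array([[c, s], [-s, c]])  # zeroes H[j+1, j]
        H[j:j + 2, j:] = G @ H[j:j + 2, j:]
        b[j:j + 2] = G @ b[j:j + 2]
    # H[:m, :m] is now upper triangular; back substitution gives y
    y = np.linalg.solve(H[:m, :m], b[:m])
    return y, abs(b[m])  # solution and residual norm

def householder_block_lstsq(H, B):
    """Block case: H is an (m+s) x m band Hessenberg matrix (s
    subdiagonals), B holds s RHS columns. A Householder QR (NumPy's
    QR uses Householder reflections via LAPACK) processes all s
    columns of B at once with BLAS-3 work."""
    Q, R = np.linalg.qr(H, mode='complete')
    G = Q.T @ B
    m = H.shape[1]
    return np.linalg.solve(R[:m, :m], G[:m])
```

With one RHS, each Givens rotation touches only two rows, which is minimal work per column; with a block of s RHS, a Householder QR applies each reflector to all columns simultaneously, which is why the block variant pays off only once s is large enough.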

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Boman, E.G., Higgins, A.J. & Szyld, D.B. Optimal size of the block in block GMRES on GPUs: computational model and experiments. Numer Algor 92, 119–147 (2023). https://doi.org/10.1007/s11075-022-01439-z
