
Optimal size of the block in block GMRES on GPUs: computational model and experiments

  • Original Paper
  • Published in: Numerical Algorithms

Abstract

The block version of GMRES (BGMRES) is most advantageous over its single right-hand side (RHS) counterpart when the cost of communication is high while the cost of floating point operations is not. This is precisely the situation on modern graphics processing units (GPUs), while it is generally not the case on traditional central processing units (CPUs). In this paper, experiments on both GPUs and CPUs compare the performance of BGMRES against GMRES as the number of RHS increases, with a particular focus on GPU performance. The experiments indicate that there are many cases in which BGMRES is slower than GMRES on CPUs, but faster on GPUs. Furthermore, when varying the number of RHS on the GPU, there is an optimal number of RHS at which BGMRES is most clearly advantageous over GMRES. A computational model for the GPU is developed using hardware-specific parameters, providing insight into how the qualitative behavior of BGMRES changes as the number of RHS increases; this model also helps explain the phenomena observed in the experiments.
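Why an optimal block size can exist at all can be illustrated with a simple roofline-style sketch. This is a hypothetical cost model with assumed constants and cost terms, not the model developed in the paper: the sparse matrix is read once per block product regardless of the number of RHS s, so its memory traffic is amortized over the block, while the dense orthogonalization work grows roughly quadratically in s.

```python
def bgmres_iter_time(s, nz, n, m, peak_flops, bandwidth):
    """Hypothetical per-iteration time for block size s (illustrative only).

    nz: nonzeros in the matrix, n: problem size, m: basis size,
    peak_flops: peak FP64 rate (flop/s), bandwidth: memory bandwidth (B/s).
    All cost constants below are assumed, not taken from the paper.
    """
    # SpMM: ~2*nz*s flops; the matrix (assumed CSR, ~12 bytes/nonzero)
    # is read once regardless of s, so only the vector traffic scales.
    spmm_flops = 2.0 * nz * s
    spmm_bytes = 12.0 * nz + 8.0 * n * 2 * s
    t_spmm = max(spmm_flops / peak_flops, spmm_bytes / bandwidth)
    # Block orthogonalization against m*s basis vectors: BLAS-3 work,
    # assumed ~4*n*m*s^2 flops, streaming the basis once per iteration.
    orth_flops = 4.0 * n * m * s * s
    orth_bytes = 8.0 * n * (m * s + 2 * s)
    t_orth = max(orth_flops / peak_flops, orth_bytes / bandwidth)
    return t_spmm + t_orth
```

With memory-bandwidth-limited parameters (e.g., V100-like peak 7 TF/s and 900 GB/s), the time *per RHS* first decreases with s as the SpMM traffic is amortized, then increases as the quadratic orthogonalization cost dominates, yielding an interior optimum, which is the qualitative behavior the paper's model explains.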


Data availability

As stated in Section 4, all test problems used matrices from the SuiteSparse library [6], which is publicly available.

Notes

  1. Alternatively, nz can represent the number of non-zeros needed in the application of the coefficient matrix and the preconditioner.

References

  1. Arafa, Y., Badawy, A.A., Chennupati, G., Santhi, N., Eidenbenz, S.: PPT-GPU: scalable GPU performance modeling. IEEE Comput. Archit. Lett. 18(1), 55–58 (2019)

  2. The Belos Project Team: The Belos Project Website. https://docs.trilinos.org/dev/packages/belos/doc/html/classBelos_1_1PseudoBlockGmresSolMgr.html. Accessed 2021 Dec 01

  3. Blackford, L.S., Demmel, J., Dongarra, J.J., Duff, I.S., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., Whaley, R.C.: An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28(2), 135–151 (2002)

  4. Carson, E., Lund, K., Rozloznik, M.: The stability of block variants of classical Gram-Schmidt. SIAM J. Matrix Anal. Appl. 42(3), 1365–1380 (2021)

  5. Dagum, L., Menon, R.: OpenMP : an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)

  6. Davis, T.A., Hu, Y.: The university of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1) (2011)

  7. Dongarra, J.J., Croz, J.D., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)

  8. Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014). Domain-Specific Languages and High-Level Frameworks for High-Performance Computing

  9. Gutknecht, M.H.: Block Krylov subspace methods for linear systems with multiple right-hand sides: an introduction. In: Siddiqi, A.H., Duff, I.S., Ole, C. (eds.) Modern Mathematical Models, Methods and Algorithms for Real World Systems, Chapter 10, pp. 420–447. Anamaya Publishers, New Delhi (2006)

  10. Gutknecht, M.H., Schmelzer, T.: Updating the QR decomposition of block tridiagonal and block Hessenberg matrices. Appl. Numer. Math. 58(6), 871–883 (2008)

  11. Hoemmen, M.: Communication-avoiding Krylov subspace methods. PhD thesis, EECS Department, University of California, Berkeley (2010)

  12. Intel. Intel math kernel library. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html. Accessed 2022 Apr 24

  13. Langou, J.: Solving large linear systems with multiple right-hand sides. PhD thesis, INSA de Toulouse (2003)

  14. Liesen, J., Strakoš, Z.: Krylov subspace methods. Principles and analysis. Numerical Mathematics and Scientific Computation, 1st edn. Oxford University Press, Oxford (2013)

  15. NVIDIA. cuBLAS documentation. https://docs.nvidia.com/cuda/cublas/index.html. Accessed 2021 Dec 08

  16. NVIDIA. cuSPARSE documentation. https://docs.nvidia.com/cuda/cusparse/index.html. Accessed 2021 Dec 08

  17. NVIDIA. NVIDIA V100 Tensor Core GPU. https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf. Accessed 2021 Apr 07

  18. O’Leary, D.P.: The block conjugate gradient algorithm and related methods. Linear Algebra Appl. 29, 293–322 (1980). Special Volume Dedicated to Alson S. Householder.

  19. Rashedi, S., Ebadi, G., Birk, S., Frommer, A.: On short recurrence Krylov type methods for linear systems with many right-hand sides. J. Comput. Appl. Math. 300, 18–29 (2016)

  20. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2003)

  21. Saad, Y., Schultz, M.H.: GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)

  22. Simoncini, V., Gallopoulos, E.: An iterative method for nonsymmetric systems with multiple right-hand sides. SIAM J. Sci. Comput. 16, 917–933 (1995)

  23. Simoncini, V., Gallopoulos, E.: Convergence properties of block GMRES and matrix polynomials. Linear Algebra Appl. 247, 97–119 (1996)

  24. Soodhalter, K.: Block Krylov subspace recycling for shifted systems with unrelated right-hand sides. SIAM J. Sci. Comput. 38, A302–A324 (2016)

  25. Vital, B.: Etude de quelques méthodes de résolution de problémes linéaires de grande taille sur multiprocesseur. PhD thesis, Université de Rennes I (1990)

  26. Yamazaki, I., Hoemmen, M., Luszczek, P., Dongarra, J.J.: Improving performance of GMRES by reducing communication and pipelining global collectives. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1118–1127 (2017)

Acknowledgements

We benefited from helpful comments and questions we received from Valeria Simoncini and Yousef Saad. We thank the two referees for their careful reading of the manuscript and their comments and questions which helped improve our presentation.

Funding

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the US Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. This work was in part supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. This research includes calculations carried out on HPC resources at Temple University supported in part by the National Science Foundation through major research instrumentation grant number 1625061 and by the US Army Research Laboratory under contract number W911NF-16-2-0189.

Author information

Corresponding author

Correspondence to Andrew J. Higgins.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Complete results

In the tables, entries marked ‘-’ indicate that the method failed to converge. All times are reported in seconds, and k is the restart parameter. Iteration counts reported for s×GMRES are the average number of iterations required per RHS until convergence. Data for GMRES-LI are included only for problems where the method converges for at least s = 5 RHS.

Table 5 FEM_3D_thermal2 CPU results
Table 6 FEM_3D_thermal2 GPU results
Table 7 stomach CPU results
Table 8 stomach GPU results
Table 9 cage13 CPU results
Table 10 cage13 GPU results
Table 11 thermomech_dM CPU results
Table 12 thermomech_dM GPU results
Table 13 Dubcova3 CPU results
Table 14 Dubcova3 GPU results
Table 15 CoupCons3D CPU results
Table 16 CoupCons3D GPU results
Table 17 CO CPU results
Table 18 CO GPU results
Table 19 hood CPU results
Table 20 hood GPU results
Table 21 shipsec8 CPU results
Table 22 shipsec8 GPU results
Table 23 SiO2 CPU results
Table 24 SiO2 GPU results

We note that BGMRES is often slower than s×GMRES when s is small, e.g., s = 1. This is due to the different methods used to solve the least squares problem in the two algorithms: BGMRES uses Householder reflections, which are advantageous for multiple RHS, while s×GMRES uses Givens rotations, which are designed to minimize work for a single RHS.
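The two least squares approaches can be sketched in NumPy. This is an illustrative comparison, not the paper's implementation; the matrix shapes and the use of NumPy's Householder-based QR for the block case are assumptions.

```python
import numpy as np

def givens_hessenberg_lstsq(H, b):
    """Solve min ||b - H y|| for an (m+1) x m upper Hessenberg H
    by eliminating the single subdiagonal with Givens rotations,
    as in standard single-RHS GMRES."""
    H = H.astype(float).copy()
    b = b.astype(float).copy()
    m = H.shape[1]
    for j in range(m):
        a, c0 = H[j, j], H[j + 1, j]
        r = np.hypot(a, c0)
        c, s = a / r, c0 / r
        G = np.array([[c, s], [-s, c]])  # zeroes H[j+1, j]
        H[j:j + 2, j:] = G @ H[j:j + 2, j:]
        b[j:j + 2] = G @ b[j:j + 2]
    # H[:m, :m] is now upper triangular; back substitution gives y
    y = np.linalg.solve(H[:m, :m], b[:m])
    return y, abs(b[m])  # solution and residual norm

def householder_block_lstsq(H, B):
    """Block case: H is an (m+s) x m band Hessenberg matrix (s
    subdiagonals), B holds s RHS columns. A Householder QR (NumPy's
    QR uses Householder reflections via LAPACK) processes all s
    columns of B at once with BLAS-3 work."""
    Q, R = np.linalg.qr(H, mode='complete')
    G = Q.T @ B
    m = H.shape[1]
    return np.linalg.solve(R[:m, :m], G[:m])
```

With one RHS, each Givens rotation touches only two rows, which is minimal work per column; with a block of s RHS, a Householder QR applies each reflector to all columns simultaneously, which is why the block variant pays off only once s is large enough.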

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Boman, E.G., Higgins, A.J. & Szyld, D.B. Optimal size of the block in block GMRES on GPUs: computational model and experiments. Numer Algor 92, 119–147 (2023). https://doi.org/10.1007/s11075-022-01439-z
