
CPU vs. GPU - Performance comparison for the Gram-Schmidt algorithm


The Gram-Schmidt method is a classical method for determining QR decompositions that is commonly used in many applications in computational physics, such as the orthogonalization of quantum mechanical operators or Lyapunov stability analysis. In this paper, we discuss how well the Gram-Schmidt method performs on different hardware architectures, including both state-of-the-art GPUs and CPUs. We explain in detail how a smart interplay between hardware and software can be used to speed up these rather compute-intensive applications, and we discuss the benefits and disadvantages of several approaches. In addition, we compare some highly optimized standard routines of the BLAS libraries against our own optimized routines on both processor types. Particular attention was paid to the strongly hierarchical memory of modern GPUs and CPUs, which requires cache-aware blocking techniques for optimal performance. Our investigations show that the performance depends strongly on the employed algorithm and compiler, and somewhat less on the employed hardware. Remarkably, the performance of the NVIDIA CUDA BLAS routines improved significantly from CUDA 3.2 to CUDA 4.0. Still, BLAS routines tend to be slightly slower than manually optimized code on GPUs, whereas we were not able to outperform the BLAS routines on CPUs. Comparing optimized implementations on different hardware architectures, we find that an NVIDIA GeForce GTX580 GPU is about 50% faster than a corresponding Intel X5650 Westmere hexacore CPU. The self-written codes are included as supplementary material.
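For readers unfamiliar with the method, the following is a minimal, unoptimized sketch of a QR decomposition by Gram-Schmidt orthogonalization in plain Python. It is not the authors' code: the function name `gram_schmidt_qr` and the column-list data layout are assumptions made for illustration, and the sketch uses the modified Gram-Schmidt variant, whereas the abstract does not state which variant the paper implements. It uses no blocking and makes no attempt at cache awareness.

```python
import math


def gram_schmidt_qr(columns):
    """Factor A = Q R using modified Gram-Schmidt (illustrative sketch).

    `columns` holds the column vectors of A as lists of floats and is
    assumed to be linearly independent. Returns (q, r), where q is the
    list of orthonormal columns of Q and r is upper triangular.
    """
    n = len(columns)
    q = [list(col) for col in columns]  # working copy, orthogonalized in place
    r = [[0.0] * n for _ in range(n)]

    for i in range(n):
        # Normalize column i; its norm becomes the diagonal entry of R.
        r[i][i] = math.sqrt(sum(x * x for x in q[i]))
        if r[i][i] == 0.0:
            raise ValueError("columns are linearly dependent")
        q[i] = [x / r[i][i] for x in q[i]]

        # Remove the q_i component from every remaining column.
        for j in range(i + 1, n):
            r[i][j] = sum(a * b for a, b in zip(q[i], q[j]))        # dot product
            q[j] = [b - r[i][j] * a for a, b in zip(q[i], q[j])]    # vector update


    return q, r


if __name__ == "__main__":
    q, r = gram_schmidt_qr([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
    for row in r:
        print(row)
```

The inner loop consists of a dot product followed by a scaled vector subtraction, which correspond to the level-1 BLAS operations `dot` and `axpy`. Each touches whole vectors while doing little arithmetic per element, which suggests why memory access patterns matter as much as raw arithmetic throughput for this kind of algorithm.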


Author information



Corresponding author

Correspondence to T. Brandes.

About this article

Cite this article

Brandes, T., Arnold, A., Soddemann, T. et al. CPU vs. GPU - Performance comparison for the Gram-Schmidt algorithm. Eur. Phys. J. Spec. Top. 210, 73–88 (2012).


  • Graphic Processing Unit
  • European Physical Journal Special Topic
  • Shared Memory
  • Memory Bandwidth
  • Thread Block