
The European Physical Journal Special Topics, Volume 210, Issue 1, pp 73–88

CPU vs. GPU - Performance comparison for the Gram-Schmidt algorithm

  • T. Brandes
  • A. Arnold
  • T. Soddemann
  • D. Reith
Regular Article

Abstract

The Gram-Schmidt method is a classical method for determining QR decompositions, which is commonly used in many applications in computational physics, such as the orthogonalization of quantum mechanical operators or Lyapunov stability analysis. In this paper, we discuss how well the Gram-Schmidt method performs on different hardware architectures, including both state-of-the-art GPUs and CPUs. We explain in detail how a smart interplay between hardware and software can be used to speed up these rather compute-intensive applications, as well as the benefits and disadvantages of several approaches. In addition, we compare highly optimized standard routines from BLAS libraries against our own optimized routines on both processor types. Particular attention is paid to the strongly hierarchical memory of modern GPUs and CPUs, which requires cache-aware blocking techniques for optimal performance. Our investigations show that performance depends strongly on the employed algorithm and compiler, and somewhat less on the underlying hardware. Remarkably, the performance of the NVIDIA CUDA BLAS routines improved significantly from CUDA 3.2 to CUDA 4.0. Still, BLAS routines tend to be slightly slower than manually optimized code on GPUs, while we were not able to outperform the BLAS routines on CPUs. Comparing optimized implementations on different hardware architectures, we find that an NVIDIA GeForce GTX 580 GPU is about 50% faster than a corresponding Intel X5650 Westmere hexacore CPU. The self-written codes are included as supplementary material.
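The optimized implementations themselves are provided as supplementary material; the following minimal sketch merely illustrates the algorithm under discussion, a modified Gram-Schmidt QR factorization in plain C++. All names here (mgs, at) and choices (column-major storage, double precision, no handling of rank-deficient input) are illustrative assumptions, not the authors' code; the inner loops correspond to the BLAS level-1 kernels (nrm2, scal, dot, axpy) that a library-based variant would call.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Element (i, j) of an m x n matrix stored column-major in a flat vector.
inline double& at(std::vector<double>& a, std::size_t m,
                  std::size_t i, std::size_t j) { return a[j * m + i]; }

// Modified Gram-Schmidt: factor A = Q * R, overwriting the m x n matrix A
// with Q (orthonormal columns) and filling the n x n upper triangle of R.
// Rank-deficient input (a zero column) is not handled in this sketch.
void mgs(std::vector<double>& A, std::vector<double>& R,
         std::size_t m, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        // R(j,j) = ||a_j||_2  (cf. BLAS nrm2)
        double norm = 0.0;
        for (std::size_t i = 0; i < m; ++i)
            norm += at(A, m, i, j) * at(A, m, i, j);
        norm = std::sqrt(norm);
        at(R, n, j, j) = norm;

        // q_j = a_j / R(j,j)  (cf. BLAS scal)
        for (std::size_t i = 0; i < m; ++i)
            at(A, m, i, j) /= norm;

        // Orthogonalize the remaining columns against q_j:
        // R(j,k) = q_j . a_k (dot), then a_k -= R(j,k) * q_j (axpy).
        for (std::size_t k = j + 1; k < n; ++k) {
            double r = 0.0;
            for (std::size_t i = 0; i < m; ++i)
                r += at(A, m, i, j) * at(A, m, i, k);
            at(R, n, j, k) = r;
            for (std::size_t i = 0; i < m; ++i)
                at(A, m, i, k) -= r * at(A, m, i, j);
        }
    }
}
```

The modified variant updates each remaining column immediately after a new q_j is formed, which makes it numerically more robust than the classical formulation; a cache-aware blocked version, as discussed in the paper, would instead process panels of several columns at a time to reuse data in fast memory.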

Keywords

Graphics Processing Unit, European Physical Journal Special Topics, Shared Memory, Memory Bandwidth, Thread Block

These keywords were added by machine and not by the authors. This process is experimental, and the keywords may be updated as the learning algorithm improves.


Supplementary material

11734_2011_1715_MOESM1_ESM.zip (approximately 52.5 KB)


Copyright information

© EDP Sciences and Springer 2012

Authors and Affiliations

  1. Fraunhofer Institute SCAI, Schloss Birlinghoven, Sankt Augustin, Germany
  2. Institute for Computational Physics, University of Stuttgart, Stuttgart, Germany
