Performance Limitations for Sparse Matrix-Vector Multiplications on Current Multi-Core Environments

  • Gerald Schubert
  • Georg Hager
  • Holger Fehske


The increasing importance of multi-core processors calls for a reevaluation of established numerical algorithms with regard to their ability to profit from this new hardware concept. In order to optimize existing algorithms, detailed knowledge of the different performance-limiting factors is mandatory. In this contribution we investigate sparse matrix-vector multiplication (spMVM), the dominant operation in many sparse eigenvalue solvers. Two conceptually different storage schemes and computational kernels have been conceived in the past to target cache-based and vector architectures, respectively: compressed row storage (CRS) and jagged diagonal storage (JDS). Starting from a series of microbenchmarks that single out individual performance limitations, we apply the gained insight to optimize spMVM implementations and review serial and OpenMP-parallel performance on state-of-the-art multi-core systems.


Keywords: Access Pattern · Memory Bandwidth · Cache Line · Storage Scheme · Sparsity Pattern





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Regionales Rechenzentrum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  2. Institut für Physik, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany
