Performance Limitations for Sparse Matrix-Vector Multiplications on Current Multi-Core Environments
The increasing importance of multi-core processors calls for a reevaluation of established numerical algorithms in view of their ability to profit from this new hardware concept. To optimize existing algorithms, detailed knowledge of the different performance-limiting factors is essential. In this contribution we investigate sparse matrix-vector multiplication (MVM), which is the dominant operation in many sparse eigenvalue solvers. Two conceptually different storage schemes and computational kernels have been conceived in the past: compressed row storage, which targets cache-based architectures, and jagged diagonal storage, which targets vector architectures. Starting from a series of microbenchmarks that single out performance limitations, we apply the insight gained to optimize sparse MVM implementations, reviewing serial and OpenMP-parallel performance on state-of-the-art multi-core systems.
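To make the contrast between the two storage schemes concrete, the following is a minimal sketch of both kernels in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`spmv_crs`, `crs_to_jds`, `spmv_jds`) and array names (`val`, `col_idx`, `row_ptr`, `jd_ptr`, `perm`) are hypothetical, and the conversion routine is one common textbook way to build jagged diagonals.

```python
# Illustrative sketch only; all names are assumptions, not taken from the paper.

def spmv_crs(val, col_idx, row_ptr, x):
    """y = A*x with A in compressed row storage (CRS).

    The inner loop runs over the nonzeros of a single row, so it is short
    and accumulates into one scalar, with indirect access to x.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * x[col_idx[k]]
        y[i] = s
    return y


def crs_to_jds(val, col_idx, row_ptr):
    """Build jagged diagonal storage (JDS) from CRS arrays.

    Rows are sorted by decreasing number of nonzeros; the d-th jagged
    diagonal then holds the d-th nonzero of every row that has one.
    """
    n = len(row_ptr) - 1
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(n)]
    perm = sorted(range(n), key=lambda i: -lengths[i])  # stable sort
    max_len = lengths[perm[0]] if n else 0
    jd_val, jd_col, jd_ptr = [], [], [0]
    for d in range(max_len):
        for r in perm:
            if lengths[r] <= d:
                break  # all remaining rows are shorter still
            k = row_ptr[r] + d
            jd_val.append(val[k])
            jd_col.append(col_idx[k])
        jd_ptr.append(len(jd_val))
    return perm, jd_val, jd_col, jd_ptr


def spmv_jds(perm, jd_val, jd_col, jd_ptr, x):
    """y = A*x with A in JDS.

    The inner loop runs along a whole jagged diagonal, giving long loops
    with unit-stride access to the matrix data and the result vector.
    """
    n = len(perm)
    y_perm = [0.0] * n
    for d in range(len(jd_ptr) - 1):
        start = jd_ptr[d]
        for i in range(jd_ptr[d + 1] - start):
            y_perm[i] += jd_val[start + i] * x[jd_col[start + i]]
    # Undo the row permutation so y matches the original row order.
    y = [0.0] * n
    for i, r in enumerate(perm):
        y[r] = y_perm[i]
    return y
```

In the CRS kernel the inner loop length equals the number of nonzeros in one row, whereas in the JDS kernel it equals the length of a jagged diagonal, which is typically far longer. In this sketch, that difference in loop structure is what distinguishes the cache-oriented kernel from the vector-oriented one.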