Complexities of Performance Prediction for Bandwidth-Limited Loop Kernels on Multi-Core Architectures
The balance metric is a simple approach to estimating the performance of bandwidth-limited loop kernels. However, applying the method to modern multi-core architectures yields unsatisfactory results. This paper analyzes the influence of cache hierarchy design on performance predictions for bandwidth-limited loop kernels on current mainstream processors. We present a diagnostic model with improved predictive power that corrects the limitations of the simple balance metric. We also emphasize the importance of code execution overhead, even in bandwidth-bound situations.
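For orientation, the simple balance metric mentioned above can be sketched as follows. This is a minimal illustration of the metric as it is commonly defined (machine balance divided by code balance, capped at one), not the improved diagnostic model of the paper, which the abstract does not spell out. The function name, the parameter names, and the machine figures in the usage note are hypothetical.

```python
def predicted_performance(peak_flops, mem_bandwidth, code_balance):
    """Estimate loop-kernel performance with the simple balance metric.

    peak_flops    -- peak arithmetic performance in flop/s (assumed input)
    mem_bandwidth -- attainable memory bandwidth in bytes/s (assumed input)
    code_balance  -- data traffic of the loop in bytes per flop
    """
    # Machine balance: how many bytes the memory system can deliver per flop.
    machine_balance = mem_bandwidth / peak_flops
    # Fraction of peak the loop can reach; it can never exceed 1.
    lightspeed = min(1.0, machine_balance / code_balance)
    return lightspeed * peak_flops


if __name__ == "__main__":
    # Hypothetical machine: 10 Gflop/s peak, 10 GB/s memory bandwidth.
    # Kernel a[i] = b[i] + c[i] * d[i] in double precision: three loads and
    # one store (32 bytes) for 2 flops, i.e. 16 bytes/flop, ignoring any
    # additional write-allocate traffic.
    perf = predicted_performance(10e9, 10e9, 16.0)
    print(f"predicted: {perf / 1e9:.3f} Gflop/s")
```

With these assumed figures the kernel is limited to one sixteenth of peak. The estimate treats the memory interface as the only bottleneck, which is the simplification that the paper argues is insufficient on multi-core processors with deep cache hierarchies.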