Abstract
We discuss basic optimization and parallelization strategies for current cache-based microprocessors (Intel Itanium2, Intel Netburst and AMD64 variants) in single-CPU and shared memory environments. Using selected kernel benchmarks representing data-intensive applications, we focus on the effective bandwidths attainable, which are still suboptimal with current compilers. We stress the need for a subtle OpenMP implementation, even for simple benchmark programs, to exploit the high aggregate memory bandwidth now available on ccNUMA systems. If the quality of main memory access is the measure, classical vector systems such as the NEC SX6+ are still in a class of their own: they sustain the performance level of in-cache operations on modern microprocessors even with arbitrarily large data sets.
Copyright information
© 2006 Springer
About this paper
Cite this paper
Hager, G., Zeiser, T., Treibig, J., Wellein, G. (2006). Optimizing performance on modern HPC systems: learning from simple kernel benchmarks. In: Krause, E., Shokin, Y., Resch, M., Shokina, N. (eds) Computational Science and High Performance Computing II. Notes on Numerical Fluid Mechanics and Multidisciplinary Design, vol 91. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31768-6_23
DOI: https://doi.org/10.1007/3-540-31768-6_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31767-8
Online ISBN: 978-3-540-31768-5
eBook Packages: Engineering, Engineering (R0)