Optimizing performance on modern HPC systems: learning from simple kernel benchmarks

  • G. Hager
  • T. Zeiser
  • J. Treibig
  • G. Wellein
Part of the Notes on Numerical Fluid Mechanics and Multidisciplinary Design book series (NNFM, volume 91)


We discuss basic optimization and parallelization strategies for current cache-based microprocessors (Intel Itanium2, Intel Netburst and AMD64 variants) in single-CPU and shared-memory environments. Using selected kernel benchmarks representing data-intensive applications, we focus on the effective bandwidths attainable, which are still suboptimal with current compilers. We stress the need for a subtle OpenMP implementation, even for simple benchmark programs, to exploit the high aggregate memory bandwidth available nowadays on ccNUMA systems. If the quality of main memory access is the measure, classical vector systems such as the NEC SX6+ are still in a class of their own and are able to sustain the performance level of in-cache operations of modern microprocessors even with arbitrarily large data sets.
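To illustrate why even a simple kernel needs care on ccNUMA systems, the following is a minimal sketch and not the authors' benchmark code. It assumes a vector triad `a[i] = b[i] + c[i] * d[i]` as the kernel (the abstract does not name the kernels used) and an operating system with a first-touch page placement policy, where a memory page is mapped into the NUMA domain of the thread that first writes to it. The names `triad_data`, `triad_alloc`, `triad_run`, `triad_bytes` and `triad_free` are hypothetical. The key point is that the initialization loop uses the same static schedule as the compute loop, so each thread later works mostly on pages in its local memory.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical container for the four vectors of the triad kernel. */
typedef struct {
    double *a, *b, *c, *d;
    size_t n;
} triad_data;

void triad_free(triad_data *t) {
    free(t->a); free(t->b); free(t->c); free(t->d);
    t->a = t->b = t->c = t->d = NULL;
}

/* Allocate the vectors and initialize them in parallel.
 * Assumption: first-touch page placement. malloc only reserves address
 * space; the first write decides where each page lives. */
int triad_alloc(triad_data *t, size_t n) {
    t->n = n;
    t->a = malloc(n * sizeof(double));
    t->b = malloc(n * sizeof(double));
    t->c = malloc(n * sizeof(double));
    t->d = malloc(n * sizeof(double));
    if (!t->a || !t->b || !t->c || !t->d) {
        triad_free(t);
        return -1;
    }
    /* Same schedule as the compute loop below: each thread touches
     * the index range it will later work on. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i) {
        t->a[i] = 0.0;
        t->b[i] = 1.0;
        t->c[i] = 2.0;
        t->d[i] = 3.0;
    }
    return 0;
}

/* Run one sweep of the triad and return the elapsed wall-clock seconds. */
double triad_run(triad_data *t) {
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)t->n; ++i) {
        t->a[i] = t->b[i] + t->c[i] * t->d[i];
    }
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Bytes moved per sweep: three loads and one store of a double per
 * iteration. Write-allocate traffic for a[] is not counted here. */
double triad_bytes(size_t n) {
    return 4.0 * (double)sizeof(double) * (double)n;
}
```

Dividing `triad_bytes(n)` by the time returned from `triad_run` gives an effective bandwidth for one sweep. A single sweep is noisy, so a real measurement would repeat the kernel and choose `n` well beyond the cache size. Without an OpenMP flag (for example `-fopenmp` with GCC) the pragmas are ignored and the code runs serially. Replacing the parallel initialization with a serial loop would, under the first-touch assumption, place all pages in one NUMA domain and limit the parallel kernel to that domain's memory bandwidth.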


Keywords (machine-generated, not supplied by the authors): Cache Size; Memory Bandwidth; Cache Line; Loop Length; Cache Performance





Copyright information

© Springer 2006

Authors and Affiliations

  • G. Hager (1)
  • T. Zeiser (1)
  • J. Treibig (2)
  • G. Wellein (1)

  1. Regional Computing Centre Erlangen (RRZE), University of Erlangen-Nuremberg, Erlangen, Germany
  2. Chair of System Simulation (LSS), University of Erlangen-Nuremberg, Erlangen, Germany
