Analysis of Intel’s Haswell Microarchitecture Using the ECM Model and Microbenchmarks

  • Johannes Hofmann
  • Dietmar Fey
  • Jan Eitzinger
  • Georg Hager
  • Gerhard Wellein
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9637)


This paper presents an in-depth analysis of Intel’s Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, enhancements such as new and improved execution units, and improvements throughout the memory hierarchy. The Execution-Cache-Memory (ECM) diagnostic performance model is used together with a generic set of microbenchmarks to quantify the efficiency of the microarchitecture. The microbenchmarks are chosen so that they can serve as a blueprint for other streaming loop kernels.
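As a rough illustration of how an ECM-style prediction is assembled, the sketch below uses the formulation common in the ECM literature for Intel cores: runtime is the larger of the overlapping core time and the sum of the non-overlapping core time plus all data-transfer times. It is not taken from the paper. The function name and every cycle count are hypothetical placeholders, not measured Haswell values.

```python
from itertools import accumulate


def ecm_predictions(t_ol, t_nol, transfers):
    """Predict cycles per unit of work for data residing in each memory level.

    t_ol      -- core cycles that may overlap with data transfers
    t_nol     -- core cycles that may not overlap (e.g. L1 loads/stores)
    transfers -- cycles per inter-level transfer, ordered outwards from L1
                 (L1<-L2, L2<-L3, L3<-memory)
    """
    # Running sum of non-overlapping contributions: t_nol alone for
    # L1-resident data, then one more transfer term per level further out.
    cumulative = accumulate([t_nol, *transfers])
    # Assumption: overlapping core time hides behind the data-transfer sum.
    return [max(t_ol, t) for t in cumulative]


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
levels = ["L1", "L2", "L3", "Mem"]
prediction = ecm_predictions(t_ol=4, t_nol=2, transfers=[2, 4, 9])
for level, cycles in zip(levels, prediction):
    print(f"{level}: {cycles} cycles")
```

With these placeholder inputs the kernel is core-bound while data sits in L1 or L2 and becomes transfer-bound from L3 outwards, which is the kind of diagnosis the model is meant to support.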


Keywords: Intel Haswell, Architecture analysis, ECM model, Performance modeling



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Johannes Hofmann (1)
  • Dietmar Fey (1)
  • Jan Eitzinger (2)
  • Georg Hager (2)
  • Gerhard Wellein (2)

  1. Computer Architecture, University Erlangen–Nuremberg, Erlangen, Germany
  2. Erlangen Regional Computing Center (RRZE), University Erlangen–Nuremberg, Erlangen, Germany
