Evaluating Out-of-Order Engine Limitations Using Uop Flow Simulation

  • Vincent Palomares
  • David C. Wong
  • David J. Kuck
  • William Jalby
Conference paper


Out-of-order mechanisms in recent microarchitectures do a very good job of hiding latencies and improving performance. However, they come with limitations that are not easily modeled statically and are hard to quantify exactly even dynamically. This paper presents Uop Flow Simulation (UFS), a loop performance prediction technique that accounts for such restrictions by combining static analysis and cycle-driven simulation. UFS simulates the behavior of the execution pipeline when executing a loop. It handles instruction latencies, dependencies, out-of-order resource consumption and other low-level details while completely ignoring semantics. We use a UFS prototype to validate our approach on Sandy Bridge using loops from real-world HPC applications, showing that it is both accurate and very fast (reaching simulation speeds of hundreds of thousands of cycles per second).
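To make the idea of a semantics-free, cycle-driven simulation concrete, here is a minimal Python sketch. It is not the UFS prototype: the `Uop` class, the `simulate` and `cycles_per_iteration` functions, and the pipeline model (a single reservation station, in-order issue, oldest-first dispatch, no port or register-file modeling) are all simplifying assumptions made for illustration. The default widths and buffer size are illustrative values loosely based on commonly cited Sandy Bridge figures.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    """Hypothetical uop description: only timing data, no semantics."""
    latency: int          # cycles from dispatch to result availability (>= 1)
    deps: tuple = ()      # (producer_index, iteration_distance) pairs


def simulate(loop, iterations, issue_width=4, dispatch_width=6, rs_size=54):
    """Return the cycle at which each iteration's last result becomes available.

    Assumed toy model: uops enter a reservation station (RS) in program
    order, wait there until their sources are ready, then dispatch.
    """
    n = len(loop)
    total = n * iterations
    ready_at = {}     # (iteration, index) -> cycle the result is available
    rs = []           # uops waiting in the RS, oldest first
    next_issue = 0    # flat index of the next uop to enter the RS
    cycle = 0
    while next_issue < total or rs:
        # Dispatch stage: send ready uops to execution, oldest first.
        dispatched = 0
        still_waiting = []
        for it, idx in rs:
            uop = loop[idx]
            # Sources from before the first iteration are treated as ready.
            srcs = [(it - dist, p) for p, dist in uop.deps if it - dist >= 0]
            is_ready = all(ready_at.get(s, math.inf) <= cycle for s in srcs)
            if is_ready and dispatched < dispatch_width:
                ready_at[(it, idx)] = cycle + uop.latency
                dispatched += 1
            else:
                still_waiting.append((it, idx))
        rs = still_waiting
        # Issue stage: the front end stalls when the RS is full.
        issued = 0
        while issued < issue_width and next_issue < total and len(rs) < rs_size:
            rs.append(divmod(next_issue, n))
            next_issue += 1
            issued += 1
        cycle += 1
    return [max(ready_at[(it, i)] for i in range(n)) for it in range(iterations)]


def cycles_per_iteration(finish):
    """Average spacing between iteration completions (needs >= 2 iterations)."""
    return (finish[-1] - finish[0]) / (len(finish) - 1)


# A single uop with a 3-cycle latency that depends on its own previous iteration.
chain = [Uop(latency=3, deps=((0, 1),))]
print(cycles_per_iteration(simulate(chain, 10)))
```

In this toy model the loop-carried chain settles at 3 cycles per iteration, eight independent single-cycle uops are limited by the assumed 4-wide issue stage, and shrinking `rs_size` to 1 throttles the same loop to one uop per cycle, which is the kind of buffer-size effect the abstract alludes to.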


Keywords: Buffer Size, Resource Allocation Scheme, Simulated Cycle, Reservation Station, Simulation Speed



Acknowledgments

We would like to thank Gabriel Staffelbach (CERFACS) for having provided our laboratory with the AVBP application, as well as Ghislain Lartigue and Vincent Moureau (CORIA) for providing us with YALES2.

We would also like to thank Mathieu Tribalat (UVSQ) and Emmanuel Oseret (Exascale Computing Research) for performing and providing the in vivo measurements we used to validate UFS on the aforementioned applications.

This work has been carried out partly at Exascale Computing Research laboratory, thanks to the support of CEA, Intel, UVSQ, and by the PRiSM laboratory, thanks to the support of the French Ministry for Economy, Industry, and Employment through the COLOC project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the CEA, Intel, or UVSQ.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Vincent Palomares (1)
  • David C. Wong (2)
  • David J. Kuck (2)
  • William Jalby (1)
  1. University of Versailles Saint-Quentin-en-Yvelines, Versailles, France
  2. Intel Corporation, Champaign, USA
