Evaluating Out-of-Order Engine Limitations Using Uop Flow Simulation
Out-of-order mechanisms in recent microarchitectures are very effective at hiding latencies and improving performance. However, they come with limitations that are not easily modeled statically and are hard to quantify exactly even dynamically. This paper presents Uop Flow Simulation (UFS), a loop performance prediction technique that accounts for such restrictions by combining static analysis and cycle-driven simulation. UFS simulates the behavior of the execution pipeline when executing a loop. It handles instruction latencies, dependencies, out-of-order resource consumption and other low-level details while completely ignoring semantics. We use a UFS prototype to validate our approach on Sandy Bridge using loops from real-world HPC applications, showing it is both accurate and very fast (reaching simulation speeds of hundreds of thousands of simulated cycles per second).
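To make the idea of a semantics-free, cycle-driven loop simulation more concrete, here is a minimal toy sketch. It is not the UFS prototype described above. The names `Uop` and `simulate`, the single in-flight window standing in for out-of-order resources, the unlimited execution ports, and the in-order retirement policy are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Uop:
    """Hypothetical uop descriptor: only timing and dependencies, no semantics."""
    latency: int
    # Each dependency is (uop index in the loop body, iteration distance).
    # Distance 0 = same iteration (must point to an earlier uop),
    # distance 1 = previous iteration (loop-carried).
    deps: Tuple[Tuple[int, int], ...] = ()


def simulate(body: Sequence[Uop], iterations: int,
             window_size: int, issue_width: int) -> int:
    """Return the number of simulated cycles needed to retire all uops.

    Assumptions (illustrative only): one finite in-flight window,
    in-order issue and retirement, unlimited execution ports.
    """
    n = len(body)
    total = n * iterations
    done_at = [None] * total   # cycle at which each dynamic uop's result is ready
    next_issue = 0             # next dynamic uop to enter the window
    next_retire = 0            # oldest dynamic uop still in the window
    cycle = 0
    while next_retire < total:
        # Retire completed uops in program order, freeing window entries.
        while (next_retire < next_issue
               and done_at[next_retire] is not None
               and done_at[next_retire] <= cycle):
            next_retire += 1
        # Issue new uops while width and window capacity allow.
        issued = 0
        while (issued < issue_width and next_issue < total
               and next_issue - next_retire < window_size):
            next_issue += 1
            issued += 1
        # Start any waiting uop in the window whose inputs are ready.
        for d in range(next_retire, next_issue):
            if done_at[d] is not None:
                continue
            it, idx = divmod(d, n)
            ready = True
            for src, dist in body[idx].deps:
                src_it = it - dist
                if src_it < 0:
                    continue  # no producer before the first iteration
                s = src_it * n + src
                if done_at[s] is None or done_at[s] > cycle:
                    ready = False
                    break
            if ready:
                done_at[d] = cycle + body[idx].latency
        cycle += 1
    return cycle


# A loop-carried chain is bound by latency; independent long-latency
# uops are bound by how many of them fit in the window.
chain_cycles = simulate([Uop(3, ((0, 1),))], iterations=10,
                        window_size=64, issue_width=4)
small_window = simulate([Uop(10)], iterations=20, window_size=2, issue_width=4)
large_window = simulate([Uop(10)], iterations=20, window_size=20, issue_width=4)
```

Even in this reduced form, shrinking `window_size` lengthens the run for independent long-latency uops, which is the kind of out-of-order resource limit that a purely static latency and throughput analysis would miss.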
Keywords: Buffer Size; Resource Allocation Scheme; Simulated Cycle; Reservation Station; Simulation Speed
We would like to thank Gabriel Staffelbach (CERFACS) for having provided our laboratory with the AVBP application, as well as Ghislain Lartigue and Vincent Moureau (CORIA) for providing us with YALES2.
We would also like to thank Mathieu Tribalat (UVSQ) and Emmanuel Oseret (Exascale Computing Research) for performing and providing the in vivo measurements we used to validate UFS on the aforementioned applications.
This work has been carried out partly at Exascale Computing Research laboratory, thanks to the support of CEA, Intel, UVSQ, and by the PRiSM laboratory, thanks to the support of the French Ministry for Economy, Industry, and Employment through the COLOC project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the CEA, Intel, or UVSQ.