Node Performance and Energy Analysis with the Sniper Multi-core Simulator
Two major trends in high-performance computing, namely larger numbers of cores and growing on-chip cache capacity, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulation is therefore needed to explore large multi-core systems sufficiently within a limited simulation time budget. By combining accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy against simulation speed, which allows for longer application runs and covers a larger portion of the hardware design space. Sniper provides this balance, allowing long-running applications to be simulated much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. With advanced per-function visualization and coupled power and energy simulation, the Sniper multi-core simulator provides a fast and accurate way both to understand and to optimize software for current and future hardware systems.
Keywords: Cache Coherence, Simulation Speed, Interval Simulation, Barrier Synchronization, Branch Predictor
We thank Mathijs Rogiers for his invaluable work on the visualization features of Sniper and the anonymous reviewers for their valuable feedback. This work is supported by Intel and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT). Additional support is provided by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC Grant agreement no. 259295. Experiments were run on computing infrastructure at the ExaScience Lab, Leuven, Belgium; the Intel HPC Lab, Swindon, UK; and the VSC Flemish Supercomputer Center.