Accurate and Complete Hardware Profiling for OpenMP

Multiplexing Hardware Events Across Executions
  • Richard Neill
  • Andi Drebes
  • Antoniu Pop
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10468)

Abstract

Analyzing the behavior of OpenMP programs and their interaction with the hardware is essential for locating performance bottlenecks and identifying performance optimization opportunities. However, current architectures only provide a small number of dedicated registers to quantify hardware events, which strongly limits the scope of performance analyses. Hardware event multiplexing can help cover more events, but it incurs a significant loss of accuracy and introduces overheads that perturb the program's execution behavior. In this paper, we present an implementation of our technique for building a unique, coherent profile that contains all available hardware events from multiple executions of the same OpenMP program, each monitoring only a subset of the available hardware events. Reconciliation of the execution profiles relies on a new labeling scheme for OpenMP that uniquely identifies each dynamic unit of work across executions, even under dynamic scheduling across processing units. We show that our approach yields significantly better accuracy and lower monitoring overhead per execution than hardware event multiplexing.
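The core idea can be sketched as follows: each execution records counts for a subset of hardware events, keyed by a deterministic per-work-unit label that does not depend on which thread executed the work, so the per-execution profiles can later be merged into one coherent profile. This is a minimal, hypothetical illustration, not the paper's implementation; all names (`make_label`, `merge_profiles`, the event sets) are assumptions for the sake of the example.

```python
# Hypothetical sketch of cross-execution profile reconciliation:
# each run monitors only a subset of hardware events, and deterministic
# work-unit labels allow the runs to be merged into one coherent profile.
from collections import defaultdict

def make_label(region_id, iteration):
    """Deterministic label for a dynamic unit of work (e.g. a loop chunk).
    The label depends only on program structure, not on which thread or
    core happened to execute the chunk under dynamic scheduling."""
    return (region_id, iteration)

def merge_profiles(runs):
    """Combine per-execution profiles, each covering a subset of events,
    into a single profile keyed by work-unit label."""
    merged = defaultdict(dict)
    for run in runs:
        for label, counts in run.items():
            merged[label].update(counts)
    return dict(merged)

# Two executions of the same program: run A counts cycles and instructions,
# run B counts cache misses, for the same dynamic units of work.
run_a = {make_label(0, i): {"cycles": 1000 + i, "instructions": 800 + i}
         for i in range(4)}
run_b = {make_label(0, i): {"cache_misses": 10 + i} for i in range(4)}

profile = merge_profiles([run_a, run_b])
# Each work unit now carries all three events from the two executions.
print(profile[(0, 2)])
# → {'cycles': 1002, 'instructions': 802, 'cache_misses': 12}
```

The key property the labeling scheme must provide, and which this toy version only assumes, is that the same dynamic unit of work receives the same label in every execution, regardless of the runtime's thread-to-work assignment.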


Keywords

Performance analysis · Hardware events · Performance monitoring counters · OpenMP profiling


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. School of Computer Science, The University of Manchester, Manchester, UK