Hardware Performance Monitoring for the Rest of Us: A Position and Survey

  • Tipp Moseley
  • Neil Vachharajani
  • William Jalby
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6985)


Microprocessors continue to make great strides in performance and scalability, yet hardware performance monitoring (HPM) remains an area of dissatisfaction among those interested in better understanding the interactions of hardware and software. HPM technology has, at best, maintained the status quo for over a decade, though hope for better answers remains. As it stands, HPM is well suited to some purposes, and everyone else tries to make the most of what is available. HPM will never be everything to everyone, and as new features are added, new users will adopt them in unforeseen ways.


Keywords: Execution Trace; Binary Decision Diagram; Program Counter; Performance Tuning; Performance Counter



Copyright information

© IFIP International Federation for Information Processing 2011

Authors and Affiliations

  • Tipp Moseley (1, 2)
  • Neil Vachharajani (2, 3)
  • William Jalby (1)

  1. Université de Versailles Saint-Quentin-en-Yvelines, France
  2. Google Inc.
  3. Pure Storage Inc.
