PEBIL: binary instrumentation for practical data-intensive program analysis

Laurenzano, Michael A.; Peraza, Joshua; Carrington, Laura; Tiwari, Ananta; Ward, William A.; Campbell, Roy

doi:10.1007/s10586-013-0307-2

Published: 12 October 2013

Cluster Computing, volume 18, pages 1–14 (2015)

Michael A. Laurenzano^1,2, Joshua Peraza^2, Laura Carrington^2,3, Ananta Tiwari^2,3, William A. Ward Jr^4 & Roy Campbell^4

Abstract

To achieve a high level of performance, data-intensive programs, such as those for real-time processing of surveillance feeds from unmanned aerial vehicles, genomics sequence comparison, or large graph traversal, require the strategic application of multi/many-core processors and co-processors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). Program runtime behavior gathered through binary instrumentation helps with program and system design decisions because it enables inspection of the low-level interactions between a data-intensive program and a multi-core processor or many-core co-processor. This work details two novel mechanisms in the PEBIL binary instrumentation platform that make it well suited to analyzing data-intensive programs: (1) fast lookup of instrumentation thread-local storage (ITLS) and (2) fast enabling and disabling of instrumentation at runtime as a means of supporting sampling. These features are compared to those of two other popular binary instrumentation platforms, Pin and Dyninst, in both analytical and empirical terms for programs implemented using the popular but disparate parallelization models MPI and OpenMP. Empirical comparisons are made for two binary instrumentation applications that are critical to the analysis of data-intensive programs: basic block counting and memory address trace collection. The results show that PEBIL has the lowest overhead of the three platforms for basic block counting, introducing an average of 18 % extra runtime for MPI programs and 116 % for OpenMP programs, as opposed to 60 % (MPI) and 232 % (OpenMP) for Pin and 20 % (MPI) and 14743 % (OpenMP) for Dyninst. For memory address trace collection that uses the conventional optimization of sampling 10 % of a program's memory addresses to reduce processing time, PEBIL also introduces the lowest overheads: 144 % (MPI) and 222 % (OpenMP), compared to 313 % (MPI) and 360 % (OpenMP) with Pin and 1113 % (MPI) and 89075 % (OpenMP) with Dyninst.
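
The second mechanism, sampling by switching instrumentation on and off at runtime, can be illustrated with a minimal sketch. It is not PEBIL's implementation: the class name `SampledTracer` and the parameters `sample_len` and `period` are hypothetical, and the toggle here is an ordinary flag checked in Python, whereas a binary instrumentation tool would aim to skip the instrumentation code entirely while it is disabled. The sketch only shows which part of an address stream survives when 10 % of it is sampled.

```python
class SampledTracer:
    """Keeps the first `sample_len` addresses of every `period` addresses."""

    def __init__(self, sample_len: int, period: int) -> None:
        if not 0 < sample_len <= period:
            raise ValueError("need 0 < sample_len <= period")
        self.sample_len = sample_len
        self.period = period
        self.seen = 0          # addresses observed so far
        self.enabled = True    # stands in for "instrumentation is enabled"
        self.trace = []        # addresses actually recorded

    def on_memory_access(self, address: int) -> None:
        # Collection is on for the first `sample_len` accesses of each
        # period and off for the remainder of that period.
        self.enabled = (self.seen % self.period) < self.sample_len
        self.seen += 1
        if self.enabled:
            self.trace.append(address)

    def discarded_percent(self) -> float:
        """Share of the observed stream that sampling threw away."""
        if self.seen == 0:
            return 0.0
        return 100.0 * (self.seen - len(self.trace)) / self.seen


tracer = SampledTracer(sample_len=10, period=100)   # hypothetical 10 % sampling
for addr in range(1000):                             # stand-in address stream
    tracer.on_memory_access(addr)
```

With these assumed settings the tracer records 100 of the 1000 addresses, so `discarded_percent()` reports that 90 % of the stream was discarded.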

Notes

On x86_64, a running thread's unique identifier is stored at %fs:0x10. IDX can therefore be generated with a three-instruction sequence: a mov, then a shr, then an and.
A dead register is defined as a register that will be defined prior to being used by the program, implying that the program does not care about its contents.
In Figs. 8 through 16 overhead is expressed as a percentage of the runtime of the uninstrumented program, given on the y-axis.
In Figs. 8 through 16 sampling rate is expressed as the percentage of the address stream that is discarded by sampling, given on the x-axis.
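
The index computation described in the first note amounts to a shift followed by a mask. A minimal sketch, assuming a hypothetical shift of 12 bits and a 16-bit mask (the notes give neither value) and using made-up thread identifiers:

```python
SHIFT = 12        # assumed: number of low-order bits to drop
MASK = 0xFFFF     # assumed: keeps 16 bits, i.e. a table of 65536 slots


def itls_index(thread_id: int, shift: int = SHIFT, mask: int = MASK) -> int:
    """Mirror the mov / shr / and sequence: load the id, shift, then mask."""
    value = thread_id          # mov: read the thread identifier
    value >>= shift            # shr: discard the low-order bits
    value &= mask              # and: keep only enough bits to index a table
    return value


# Hypothetical thread identifiers, chosen only for illustration.
idx_a = itls_index(0x7F3A5C2D1700)
idx_b = itls_index(0x7F3A5BAD0700)
```

Here the two identifiers land in slots `0xC2D1` and `0xBAD0`. Two identifiers can in principle map to the same slot under such a scheme; how that case is handled is not covered in the notes.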

Acknowledgements

The authors acknowledge the support of this project by the DoD HPCMP's User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program (Contract No. GS04T09DBC0017 through High Performance Technologies, Inc.). This work was also supported in part by the U.S. Department of Energy Office of Science through the SciDAC award titled SUPER (Institute for Sustained Performance, Energy and Resilience).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA
Michael A. Laurenzano
EP Analytics, San Diego, CA, USA
Michael A. Laurenzano, Joshua Peraza, Laura Carrington & Ananta Tiwari
Performance Modeling and Characterization Laboratory, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, USA
Laura Carrington & Ananta Tiwari
High Performance Computing Modernization Program, United States Department of Defense, Lorton, VA, USA
William A. Ward Jr & Roy Campbell

Corresponding author

Correspondence to Michael A. Laurenzano.

About this article

Cite this article

Laurenzano, M.A., Peraza, J., Carrington, L. et al. PEBIL: binary instrumentation for practical data-intensive program analysis. Cluster Comput 18, 1–14 (2015). https://doi.org/10.1007/s10586-013-0307-2

Received: 16 January 2013
Accepted: 25 August 2013
Published: 12 October 2013
Issue Date: March 2015
DOI: https://doi.org/10.1007/s10586-013-0307-2
