PEBIL: binary instrumentation for practical data-intensive program analysis

Cluster Computing

Abstract

To achieve a high level of performance, data-intensive programs such as the real-time processing of surveillance feeds from unmanned aerial vehicles, genomics sequence comparison, or large graph traversal require the strategic application of multi/many-core processors and co-processors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program and system design decisions, program runtime behavior gathered through binary instrumentation is useful because it enables inspection of the low-level interactions between a data-intensive program and a multi-core processor or many-core co-processor. This work details two novel mechanisms in the PEBIL binary instrumentation platform that make it well suited for analyzing data-intensive programs: (1) support for fast lookup of instrumentation thread-local storage (ITLS) and (2) support for the fast enabling and disabling of instrumentation at runtime as a methodology for supporting sampling. These features are compared to two other popular binary instrumentation platforms, Pin and Dyninst, in both analytical and empirical terms for programs implemented using the popular but disparate parallelization models MPI and OpenMP. Empirical comparisons are made for two binary instrumentation applications that are critical to the analysis of data-intensive programs: basic block counting and memory address trace collection. These empirical results show that PEBIL introduces the lowest overhead of the three platforms for basic block counting, adding an average of 18% extra runtime for MPI programs and 116% for OpenMP programs, as opposed to 60% (MPI) and 232% (OpenMP) for Pin and 20% (MPI) and 14743% (OpenMP) for Dyninst.
For memory address trace collection that makes use of the conventional optimization of sampling 10% of the memory addresses of a program to reduce processing time, PEBIL also introduces the lowest overheads: 144% (MPI) and 222% (OpenMP), compared to 313% (MPI) and 360% (OpenMP) with Pin and 1113% (MPI) and 89075% (OpenMP) with Dyninst.

Notes

  1. In x86_64 a running thread’s unique identifier is stored in %fs:0x10. IDX can therefore be generated using a sequence which has a mov, a shr, then an and instruction.

  2. A dead register is defined as a register that will be defined prior to being used by the program, implying that the program does not care about its contents.

  3. In Figs. 8 through 16 overhead is expressed as a percentage of the runtime of the uninstrumented program, given on the y-axis.

  4. In Figs. 8 through 16 sampling rate is expressed as the percentage of the address stream that is discarded by sampling, given on the x-axis.
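The mov/shr/and sequence described in note 1 can be sketched in C as follows. This is a minimal illustration, not PEBIL's actual code: the shift amount `ITLS_SHIFT`, the table size `ITLS_SLOTS`, and the function name `itls_index` are hypothetical, since the note does not state them, and the `thread_id` argument stands in for the value that the instrumentation would load from %fs:0x10. The sketch also ignores what happens when two threads map to the same slot.

```c
#include <stdint.h>

/* Assumed values for illustration only; the note does not give them. */
#define ITLS_SHIFT 12u   /* hypothetical: number of low-order bits to discard */
#define ITLS_SLOTS 64u   /* hypothetical: table size, must be a power of two  */

/*
 * Map a thread identifier to a table index using the three steps
 * named in note 1. In instrumented code the identifier would come
 * from %fs:0x10; here it is passed in so the sketch stays portable.
 */
uint64_t itls_index(uint64_t thread_id)
{
    uint64_t idx = thread_id;      /* mov: copy the identifier          */
    idx >>= ITLS_SHIFT;            /* shr: drop the low-order bits      */
    idx &= (ITLS_SLOTS - 1u);      /* and: wrap into the table's range  */
    return idx;
}
```

With these assumed constants, an identifier of 0x1000 maps to slot 1, and so does 0x41000, which shows why a real implementation would need some way to detect or avoid collisions between threads.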

Acknowledgements

The authors acknowledge the support of this project by the DoD HPCMP’s User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program (Contract No: GS04T09DBC0017 through High Performance Technologies, Inc.). This work was also supported in part by the U.S. Department of Energy Office of Science through the SciDAC award titled SUPER (Institute for Sustained Performance, Energy and Resilience).

Author information

Corresponding author

Correspondence to Michael A. Laurenzano.

About this article

Cite this article

Laurenzano, M.A., Peraza, J., Carrington, L. et al. PEBIL: binary instrumentation for practical data-intensive program analysis. Cluster Comput 18, 1–14 (2015). https://doi.org/10.1007/s10586-013-0307-2

  • DOI: https://doi.org/10.1007/s10586-013-0307-2
