Abstract
To achieve high performance, data-intensive programs such as real-time processing of surveillance feeds from unmanned aerial vehicles, genomics sequence comparison, or large graph traversal require the strategic application of multi/many-core processors and co-processors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program and system design decisions, program runtime behavior gathered through binary instrumentation is useful because it enables inspection of the low-level interactions between a data-intensive program and a multi-core processor or many-core co-processor. This work details two novel mechanisms in the PEBIL binary instrumentation platform that make it well-suited for analyzing data-intensive programs: (1) support for fast lookup of instrumentation thread-local storage (ITLS) and (2) support for the fast enabling and disabling of instrumentation at runtime as a means of supporting sampling. These features are compared to two other popular binary instrumentation platforms, Pin and Dyninst, in both analytical and empirical terms for programs implemented using the popular but disparate parallelization models MPI and OpenMP. Empirical comparisons are made for two binary instrumentation applications that are critical to the analysis of data-intensive programs: basic block counting and memory address trace collection. The empirical results show that PEBIL introduces the lowest overhead of the three platforms for basic block counting, adding an average of 18% extra runtime for MPI programs and 116% for OpenMP programs, as opposed to 60% (MPI) and 232% (OpenMP) for Pin and 20% (MPI) and 14743% (OpenMP) for Dyninst. For memory address trace collection that uses the conventional optimization of sampling 10% of a program's memory addresses to reduce processing time, PEBIL also introduces the lowest overheads: 144% (MPI) and 222% (OpenMP), compared to 313% (MPI) and 360% (OpenMP) with Pin and 1113% (MPI) and 89075% (OpenMP) with Dyninst.
Notes
On x86_64, a running thread’s unique identifier is stored at %fs:0x10. IDX can therefore be generated with a three-instruction sequence: a mov, a shr, and then an and.
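As a rough illustration of the shift-and-mask computation in this note, the following C sketch maps a 64-bit thread identifier to a table index. The names itls_index, ITLS_SHIFT and ITLS_BITS, the shift amount, and the table size are all hypothetical and are not taken from PEBIL. The identifier is assumed to be already available as an integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameters chosen for illustration only; they are not
   PEBIL's actual values. */
enum { ITLS_SHIFT = 12, ITLS_BITS = 6 };   /* 2^6 = 64 table entries */

/* Map a thread identifier to a table index with one shift and one mask,
   mirroring the mov / shr / and sequence described in the note. */
static inline uint32_t itls_index(uint64_t thread_id)
{
    assert(ITLS_BITS < 32);                       /* keep the mask well-defined */
    uint64_t shifted = thread_id >> ITLS_SHIFT;   /* shr: drop low-order bits   */
    return (uint32_t)(shifted & ((1u << ITLS_BITS) - 1u));  /* and: keep ITLS_BITS bits */
}
```

Because distinct identifiers can map to the same index, a real table built this way would presumably need some way to detect or resolve collisions; the sketch does not attempt that.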
A dead register is defined as a register that will be defined prior to being used by the program, implying that the program does not care about its contents.
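The definition above can be sketched as a forward scan over straight-line code. This is a minimal illustration under stated assumptions, not PEBIL's analysis: the insn_t type, its bitmask encoding of registers, and the is_dead function are hypothetical, and the scan gives up conservatively when it reaches the end of the sequence without finding a definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction summary: one bit per register (up to 32). */
typedef struct {
    uint32_t reads;    /* registers the instruction uses    */
    uint32_t writes;   /* registers the instruction defines */
} insn_t;

/* Return true if `reg` is dead at the start of insns[0..n): it is
   defined before any instruction uses it. An instruction that both
   reads and writes the register counts as a use, so the register is live. */
static bool is_dead(const insn_t *insns, size_t n, unsigned reg)
{
    assert(reg < 32);
    uint32_t bit = 1u << reg;
    for (size_t i = 0; i < n; i++) {
        if (insns[i].reads & bit)
            return false;   /* used before being redefined: live */
        if (insns[i].writes & bit)
            return true;    /* redefined before any use: dead */
    }
    return false;           /* no definition seen: conservatively live */
}
```

Treating the unknown case as live is the safe choice for instrumentation, since wrongly assuming a register is dead would let inserted code clobber a value the program still needs.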
Acknowledgements
The authors acknowledge the support of this project by the DoD HPCMP’s User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program (Contract No: GS04T09DBC0017 through High Performance Technologies, Inc.). This work was also supported in part by the U.S. Department of Energy Office of Science through the SciDAC award titled SUPER (Institute for Sustained Performance, Energy and Resilience).
Cite this article
Laurenzano, M.A., Peraza, J., Carrington, L. et al. PEBIL: binary instrumentation for practical data-intensive program analysis. Cluster Comput 18, 1–14 (2015). https://doi.org/10.1007/s10586-013-0307-2
DOI: https://doi.org/10.1007/s10586-013-0307-2