LCPC 1993: Languages and Compilers for Parallel Computing, pp. 617–632
Trace size vs parallelism in trace-and-replay debugging of shared-memory programs
Abstract
Execution replay is a debugging strategy where a program is run repeatedly on an input that manifests bugs. Replaying nondeterministic parallel programs requires special tools; otherwise, successive runs (on the same input) can differ, making bugs impossible to track. These tools must trace an execution so it can be replayed. We present improvements over our past work on an adaptive tracing strategy for shared-memory programs. Our past approach makes run-time tracing decisions by detecting and tracing exactly the non-transitive dynamic data dependences among the execution's shared data. Tracing the non-transitive dependences provides sufficient information for a replay. In this paper we show that tracing exactly these dependences is not necessary. Instead, we present two algorithms that introduce and trace artificial dependences among some events that are actually independent. If no data dependence exists between two memory references during execution, we are free to artificially force them to execute in a specific order during replay. Artificial dependences reduce trace size, but they introduce additional event orderings that can reduce the replay's parallelism. One algorithm adds only dependences guaranteed not to be on the critical path, so it does not slow replay. The other algorithm adds as many dependences as possible, slowing replay but reducing trace size further. Experiments show that we can improve the already high trace reduction of our past technique by up to two more orders of magnitude without slowing replay. Our new techniques usually trace only 0.00025–0.2% of the shared-memory references, a reduction of 3–6 orders of magnitude over past approaches that trace every access.
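To make the notion of a non-transitive dependence concrete, the following is a minimal offline sketch in Python. It is not the paper's run-time algorithm; it assumes the whole execution is available as a graph, and the function name `dependences_to_trace`, the event labels, and the edge-list representation are all hypothetical. A data dependence is treated as needing a trace record only if it is not already implied by program order combined with the other dependences.

```python
from collections import defaultdict

def dependences_to_trace(program_order, data_deps):
    """Return the data dependences not implied by any other path.

    program_order and data_deps are lists of (earlier, later) event pairs
    that together form an acyclic graph (hypothetical representation).
    """
    succ = defaultdict(set)
    for a, b in list(program_order) + list(data_deps):
        succ[a].add(b)
    same_thread = set(program_order)

    def implied(src, dst):
        # Depth-first search from src to dst that skips the direct edge itself.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == (src, dst) or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return sorted(d for d in set(data_deps)
                  if d not in same_thread and not implied(*d))

# Two threads: one executes w1 then w2, the other executes r1 then r2.
order = [("w1", "w2"), ("r1", "r2")]
deps = [("w1", "r1"), ("w1", "r2"), ("w2", "r2")]
traced = dependences_to_trace(order, deps)
```

In this example the dependence from `w1` to `r2` is implied by the path `w1`, `r1`, `r2`, so only the other two dependences would need to be recorded. The artificial dependences described in the abstract go one step further by adding orderings between independent events so that even more of the remaining edges become implied.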
Keywords
Critical Path, Original Algorithm, Memory Reference, Modify Algorithm, Transitive Reduction