Efficient execution replay technique for distributed memory architectures

  • Eric Leu
  • André Schiper
  • Abdelwahab Zramdini
Systems Software
Part of the Lecture Notes in Computer Science book series (LNCS, volume 487)


Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. To solve this problem, a method called execution replay has been introduced, which guarantees that the reexecution of a program is equivalent to the initial execution. In this paper we present an execution replay technique for distributed memory architectures. In contrast to all other proposed approaches, our technique can handle non-blocking message passing primitives and can be adapted to any form of message passing communication. Since the technique is based on event numbering, we show how to bound these numbers and then analyse the influence of this bound on the amount of recorded information. The prototype implemented on an Intel iPSC/2 shows that the overhead due to recording control information is extremely low (about 1%).
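The general idea of replay driven by event numbers can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Process` class, the `run` helper, and the log format (a list of sender and event-number pairs) are hypothetical names introduced here. Event numbers are left unbounded for simplicity, whereas the paper discusses how to bound them. Only the message tags are logged, not the message contents, which is why the recorded control information can stay small.

```python
from collections import deque


class Process:
    """Simulated process (hypothetical): numbers its sends, logs its receive order."""

    def __init__(self, pid, replay_log=None):
        self.pid = pid
        self.send_count = 0   # per-process event counter (unbounded in this sketch)
        self.mailbox = []     # messages that have arrived but are not yet received
        self.log = []         # recorded control information: (sender, event_no)
        # When a log is supplied, the process runs in replay mode.
        self.replay_log = deque(replay_log) if replay_log is not None else None

    def send(self, dest, payload):
        # Tag every message with (sender id, sender's event number).
        self.send_count += 1
        dest.mailbox.append((self.pid, self.send_count, payload))

    def receive(self):
        if self.replay_log is None:
            # Record mode: deliver whichever message arrived first.
            msg = self.mailbox.pop(0)
        else:
            # Replay mode: deliver the message whose tag matches the next log
            # entry, whatever the arrival order. A real system would block
            # until that message arrives; here all messages are already present.
            wanted = self.replay_log.popleft()
            msg = next(m for m in self.mailbox if (m[0], m[1]) == wanted)
            self.mailbox.remove(msg)
        sender, event_no, payload = msg
        self.log.append((sender, event_no))   # log the tag only, not the payload
        return payload


def run(arrival_order, replay_log=None):
    """Two senders race towards one receiver; arrival_order models the race."""
    receiver = Process("R", replay_log)
    senders = {"A": Process("A"), "B": Process("B")}
    for name in arrival_order:
        senders[name].send(receiver, f"data from {name}")
    received = [receiver.receive() for _ in arrival_order]
    return received, receiver.log


# Initial execution: record the order in which messages were received.
first, log = run(["B", "A", "A"])
# Re-execution with a different arrival order, forced to follow the log.
second, _ = run(["A", "A", "B"], replay_log=log)
assert first == second
```

Without the log, the second arrival order would produce a different receive sequence; with it, the receiver observes the same sequence as in the recorded run.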





Copyright information

© Springer-Verlag Berlin Heidelberg 1991

Authors and Affiliations

  • Eric Leu (1)
  • André Schiper (1)
  • Abdelwahab Zramdini (1)
  1. Département d'Informatique, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
