Checkpoint/Restart-Enabled Parallel Debugging

  • Joshua Hursey
  • Chris January
  • Mark O’Connor
  • Paul H. Hargrove
  • David Lecomber
  • Jeffrey M. Squyres
  • Andrew Lumsdaine
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6305)


Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process by adding more processes that interact in dynamic ways over longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process and saves developers considerable time, but it requires parallel debuggers to cooperate with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship and discusses the application of this design to the GDB and DDT debuggers and to the Open MPI and BLCR projects.


Keywords: Message Passing Interface; Parallel Application; Debug Process; Message Passing Interface Process; Open Message Passing Interface





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Joshua Hursey (1)
  • Chris January (2)
  • Mark O’Connor (2)
  • Paul H. Hargrove (3)
  • David Lecomber (2)
  • Jeffrey M. Squyres (4)
  • Andrew Lumsdaine (1)

  1. Open Systems Laboratory, Indiana University
  2. Allinea Software Ltd.
  3. Lawrence Berkeley National Laboratory
  4. Cisco Systems, Inc.
