Application-Level Checkpointing Techniques for Parallel Programs

  • John Paul Walters
  • Vipin Chaudhary
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4317)

Abstract

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, for example to a filesystem. The checkpoint files can then be used to resume computation after the original process(es) fail, ideally with minimal loss of completed work. A checkpoint can be taken using a variety of techniques at every level of the system, from special hardware or architectural checkpointing features to modification of the user’s source code. This survey discusses the techniques used in application-level checkpointing, with special attention to checkpointing parallel and distributed applications.
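
As a rough illustration of the idea described above, the following Python sketch shows one way an application might save its own state to a file and resume from it after a failure. It is not taken from the survey: the function names (`save_checkpoint`, `load_checkpoint`, `run_sum`), the `fail_at` parameter used to simulate a crash, and the choice of `pickle` as the storage format are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename step means a crash during the write never leaves a
    half-written checkpoint in place of a good one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run_sum(n, path, interval=100, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    fail_at is a hypothetical hook for this example only: it raises an
    error at the given iteration to simulate a process failure.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Here the application itself decides what counts as its state (the loop index and running total) and when to save it, which is the defining trait of the application-level approach. If a run fails partway through, calling `run_sum` again with the same path repeats only the iterations since the last checkpoint.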

Keywords

Data Conversion, Shared Memory System, Memory Pool, Memory Allocator, Thread Migration
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • John Paul Walters (1)
  • Vipin Chaudhary (2)
  1. Institute for Scientific Computing, Wayne State University
  2. Department of Computer Science and Engineering, University at Buffalo, The State University of New York
