The Journal of Supercomputing, Volume 65, Issue 3, pp 1302–1326

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

  • Ifeanyi P. Egwutuoha
  • David Levy
  • Bran Selic
  • Shiping Chen


In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in particular because they are the techniques most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.


High Performance Computing (HPC), Checkpoint/restart, Fault tolerance, Clusters, Reliability, Performance



Copyright information

© Springer Science+Business Media New York 2013

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Authors and Affiliations

  • Ifeanyi P. Egwutuoha (1), email author
  • David Levy (1)
  • Bran Selic (1)
  • Shiping Chen (2)
  1. School of Electrical & Information Engineering, The University of Sydney, Sydney, Australia
  2. Information Engineering Laboratory, CSIRO ICT Centre, Sydney, Australia
