The Journal of Supercomputing, Volume 65, Issue 3, pp 1302–1326

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

  • Ifeanyi P. Egwutuoha
  • David Levy
  • Bran Selic
  • Shiping Chen

DOI: 10.1007/s11227-013-0884-0

Cite this article as:
Egwutuoha, I.P., Levy, D., Selic, B. et al. J Supercomput (2013) 65: 1302. doi:10.1007/s11227-013-0884-0

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed because they are the techniques most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.

Keywords

High Performance Computing (HPC); Checkpoint/restart; Fault tolerance; Clusters; Reliability; Performance

Copyright information

© Springer Science+Business Media New York 2013

Open Access: This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Authors and Affiliations

  • Ifeanyi P. Egwutuoha (1)
  • David Levy (1)
  • Bran Selic (1)
  • Shiping Chen (2)

  1. School of Electrical & Information Engineering, The University of Sydney, Sydney, Australia
  2. Information Engineering Laboratory, CSIRO ICT Centre, Sydney, Australia