A Distributed and Replicated Service for Checkpoint Storage

  • Fatiha Bouabache
  • Thomas Herault
  • Gilles Fedak
  • Franck Cappello


As High Performance platforms (clusters, grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the most reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high MTBF of specific machines to store the checkpoint images.

This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more checkpoint servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to determine which one should be used in the implementation.
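The core idea of replicated checkpoint storage can be illustrated with a minimal sketch: each checkpoint image is written to several distinct servers, so that it remains retrievable as long as at least one replica survives. The class below is a hypothetical toy model, assuming an in-memory store and a simple hash-based placement heuristic; it does not reproduce the paper's actual protocol, server selection, or consistency rules.

```python
import hashlib

class CheckpointStore:
    """Toy model of replicated checkpoint storage (illustrative only).

    Each 'server' is a dict; a checkpoint image is written to
    `replication_factor` distinct servers so it survives the
    failure of up to replication_factor - 1 of them.
    """

    def __init__(self, num_servers, replication_factor):
        assert 0 < replication_factor <= num_servers
        self.servers = [dict() for _ in range(num_servers)]
        self.k = replication_factor

    def _replica_set(self, key):
        # Deterministic placement: hash the key, then take k
        # consecutive servers (a common placement heuristic,
        # assumed here for simplicity).
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        start = h % len(self.servers)
        return [(start + i) % len(self.servers) for i in range(self.k)]

    def store(self, key, image):
        # Replicate the checkpoint image on every server of its replica set.
        for s in self._replica_set(key):
            self.servers[s][key] = image

    def fail(self, server_id):
        # Simulate a crash: the server loses all stored checkpoints.
        self.servers[server_id] = None

    def retrieve(self, key):
        # Any surviving replica is sufficient to restore the image.
        for s in self._replica_set(key):
            if self.servers[s] is not None and key in self.servers[s]:
                return self.servers[s][key]
        raise KeyError(f"all replicas of {key!r} lost")
```

With a replication factor of 3, the image survives the loss of two of its three replica servers; losing all three (e.g. a whole administrative domain) still loses the checkpoint, which is why placement across domains matters.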


Keywords: high performance computing, fault tolerance, replication, rollback recovery





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Fatiha Bouabache
    • 1
  • Thomas Herault
    • 2
  • Gilles Fedak
    • 2
  • Franck Cappello
    • 2
  1. Laboratoire de Recherche en Informatique, Université Paris Sud-XI, France
  2. INRIA Futurs / Laboratoire de Recherche en Informatique, Université Paris Sud-XI, France
