A Distributed and Replicated Service for Checkpoint Storage

Bouabache, Fatiha; Herault, Thomas; Fedak, Gilles; Cappello, Franck

doi:10.1007/978-0-387-78448-9_24

Abstract

As High Performance platforms (clusters, grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high mean time between failures (MTBF) of specific machines to store the checkpoint images.

This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more checkpoint servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.
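
The abstract does not describe the protocol itself, so the following is only a minimal sketch of the general idea of replicated checkpoint storage, not the authors' method. It assumes in-memory servers, a fixed replication factor, and a simple round-robin placement rule; the names `CheckpointServer`, `ReplicatedCheckpointStore`, `store`, and `fetch` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointServer:
    """Hypothetical in-memory checkpoint server that can be marked as failed."""
    name: str
    alive: bool = True
    images: dict = field(default_factory=dict)  # (rank, version) -> bytes


class ReplicatedCheckpointStore:
    """Illustrative store that keeps each checkpoint image on `replicas` servers."""

    def __init__(self, servers, replicas):
        if not 1 <= replicas <= len(servers):
            raise ValueError("replicas must be between 1 and the number of servers")
        self.servers = list(servers)
        self.replicas = replicas

    def _targets(self, rank):
        # Assumed placement rule: start at (rank mod N) and take the next
        # `replicas` servers in round-robin order.
        n = len(self.servers)
        start = rank % n
        return [self.servers[(start + i) % n] for i in range(self.replicas)]

    def store(self, rank, version, image):
        """Write the image to every live target; return the number of copies written."""
        written = 0
        for server in self._targets(rank):
            if server.alive:
                server.images[(rank, version)] = image
                written += 1
        return written

    def fetch(self, rank, version):
        """Return the image from the first live replica holding it, or None."""
        for server in self._targets(rank):
            if server.alive and (rank, version) in server.images:
                return server.images[(rank, version)]
        return None
```

With four servers and `replicas=2`, a checkpoint stored for rank 5 lands on the second and third servers in the list; marking one of them as failed still lets `fetch` recover the image from the other, while losing both makes `fetch` return `None`. A real protocol would also have to handle re-replication after a failure and consistency between copies, which this sketch leaves out.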

Author information

Authors and Affiliations

Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 Orsay, France
Fatiha Bouabache
INRIA Futurs/Laboratoire de Recherche en Informatique, Universite Paris Sud-XI, 91405 Orsay, France
Thomas Herault, Gilles Fedak & Franck Cappello

About this chapter

Cite this chapter

Bouabache, F., Herault, T., Fedak, G., Cappello, F. (2008). A Distributed and Replicated Service for Checkpoint Storage. In: Making Grids Work. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-78448-9_24

DOI: https://doi.org/10.1007/978-0-387-78448-9_24
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-78447-2
Online ISBN: 978-0-387-78448-9
eBook Packages: Computer Science, Computer Science (R0)
