
A Distributed and Replicated Service for Checkpoint Storage

  • Chapter
Making Grids Work

Abstract

As high-performance platforms (clusters, grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high mean time between failures (MTBF) of specific machines to store the checkpoint images.

This paper introduces a new protocol that ensures checkpoint storage reliability even if one or more Checkpoint Servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.
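
The general idea of replicated checkpoint storage can be illustrated with a minimal Python sketch. It is not the protocol described in the chapter, whose details are not given in this abstract. It only shows, under simplifying assumptions, how writing each checkpoint image to several servers lets recovery succeed after some of them fail. The class `CheckpointServer` and the functions `store_replicated` and `fetch_any` are hypothetical names introduced for illustration, and the servers are simulated in memory.

```python
class CheckpointServer:
    """Hypothetical in-memory checkpoint server (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True   # flipped to False to simulate a failure
        self.images = {}    # checkpoint key -> checkpoint image (bytes)

    def store(self, key, image):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.images[key] = image

    def fetch(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.images[key]  # raises KeyError if this server has no copy


def store_replicated(servers, key, image, replicas):
    """Write the image to `replicas` live servers and return their names."""
    stored = []
    for server in servers:
        if len(stored) == replicas:
            break
        try:
            server.store(key, image)
            stored.append(server.name)
        except ConnectionError:
            continue  # skip failed servers and try the next one
    if len(stored) < replicas:
        raise RuntimeError("not enough live servers for the requested replication")
    return stored


def fetch_any(servers, key):
    """Return the image from the first live server that holds a copy."""
    for server in servers:
        try:
            return server.fetch(key)
        except (ConnectionError, KeyError):
            continue
    raise LookupError(f"no live replica holds checkpoint {key!r}")


if __name__ == "__main__":
    pool = [CheckpointServer(f"cs{i}") for i in range(3)]
    store_replicated(pool, "rank0-step10", b"image-bytes", replicas=2)
    pool[0].alive = False                      # one checkpoint server fails
    print(fetch_any(pool, "rank0-step10"))     # the surviving replica answers
```

In this sketch a checkpoint stored with `replicas=2` survives the loss of any single server that holds it. How replicas are placed across administrative domains, and how they are kept consistent, is exactly what a real replication strategy has to decide and is left out here.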

Copyright information

© 2008 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Bouabache, F., Herault, T., Fedak, G., Cappello, F. (2008). A Distributed and Replicated Service for Checkpoint Storage. In: Making Grids Work. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-78448-9_24

  • DOI: https://doi.org/10.1007/978-0-387-78448-9_24

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-78447-2

  • Online ISBN: 978-0-387-78448-9

  • eBook Packages: Computer Science, Computer Science (R0)
