A Flexible Checkpoint/Restart Model in Distributed Systems

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6067)

Abstract

Large-scale applications running on new computing platforms with thousands of processors face reliability problems. The failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable execution are based on fault-tolerance mechanisms. Coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through mathematical analysis of reliability, we apply this new model to compute the optimal interval between checkpoints in order to minimize the average completion time. The model's independence from the type of failure law makes it completely flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% in some cases and the total overhead by up to a factor of 4 in some cases. Finally, we report experiments based on simulations for random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
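
To make the trade-off behind the checkpoint interval concrete, the following is a minimal sketch and not the model proposed in the paper. It uses Young's classical first-order approximation, sqrt(2 × checkpoint cost × MTBF), as a reference interval, and a simple Monte Carlo loop to estimate the average completion time when failures follow an exponential law (Poisson process) or a Weibull law. All function names, parameter values, and simplifications (failures form a renewal process, no failure occurs during a restart, every segment including the last ends with a checkpoint) are assumptions made for illustration.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion(work, interval, checkpoint_cost, restart_cost,
                        draw_failure, rng):
    """Simulate one execution and return its total wall-clock time.

    draw_failure(rng) returns the time until the next failure.
    Assumption: a failure loses the current segment, costs restart_cost,
    and the failure clock is redrawn afterwards (renewal process).
    """
    elapsed = 0.0
    done = 0.0
    next_failure = draw_failure(rng)
    while done < work:
        segment = min(interval, work - done)
        need = segment + checkpoint_cost  # compute, then save the state
        if next_failure >= need:
            elapsed += need
            next_failure -= need
            done += segment
        else:
            # Work since the last checkpoint is lost; roll back and restart.
            elapsed += next_failure + restart_cost
            next_failure = draw_failure(rng)
    return elapsed


def mean_completion(work, interval, checkpoint_cost, restart_cost,
                    draw_failure, runs=2000, seed=0):
    """Average completion time over independent simulated runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += simulate_completion(work, interval, checkpoint_cost,
                                     restart_cost, draw_failure, rng)
    return total / runs


def exponential_law(mtbf):
    """Inter-failure times of a Poisson process with the given mean."""
    return lambda rng: rng.expovariate(1.0 / mtbf)


def weibull_law(mtbf, shape):
    """Weibull inter-failure times whose mean equals mtbf."""
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


if __name__ == "__main__":
    # Hypothetical values: 5 time units per checkpoint, MTBF of 1000.
    cost, mtbf, work = 5.0, 1000.0, 10000.0
    tau = young_interval(cost, mtbf)
    for name, law in [("exponential", exponential_law(mtbf)),
                      ("weibull(0.7)", weibull_law(mtbf, 0.7))]:
        for interval in (tau / 4, tau, 4 * tau):
            avg = mean_completion(work, interval, cost, 10.0, law, runs=500)
            print(f"{name:13s} interval={interval:7.1f} mean time={avg:9.1f}")
```

Sweeping the interval around the reference value shows the two competing costs: short intervals pay for many checkpoints, while long intervals lose more work at each failure. Swapping the failure law only requires passing a different `draw_failure` function, which loosely mirrors the idea of a model that does not depend on a specific law.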

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bouguerra, MS., Gautier, T., Trystram, D., Vincent, JM. (2010). A Flexible Checkpoint/Restart Model in Distributed Systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2009. Lecture Notes in Computer Science, vol 6067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14390-8_22

  • DOI: https://doi.org/10.1007/978-3-642-14390-8_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14389-2

  • Online ISBN: 978-3-642-14390-8

  • eBook Packages: Computer Science, Computer Science (R0)
