A Flexible Checkpoint/Restart Model in Distributed Systems

Bouguerra, Mohamed-Slim; Gautier, Thierry; Trystram, Denis; Vincent, Jean-Marc

doi:10.1007/978-3-642-14390-8_22

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6067)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

Large-scale applications running on new computing platforms with thousands of processors face reliability problems: the failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable execution are based on fault-tolerance mechanisms, and coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of the coordinated checkpoint/restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through a mathematical analysis of reliability, we apply the model to compute the optimal interval between checkpoints so as to minimize the average completion time. Because the model is independent of the type of failure law, it is completely flexible. We show that the model may be used to reduce the checkpoint rate by up to 20% in some cases, and the total overhead by up to a factor of 4 in some cases. Finally, we report experiments based on simulations with random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.
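
The trade-off described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model, which is not reproduced on this page. It uses the classical first-order approximation of the optimal interval, sqrt(2 × checkpoint cost × MTBF), as a baseline, plus a small simulator of periodic checkpointing that accepts any failure law. All function and parameter names are hypothetical. The sketch assumes that failures form a renewal process restarted after each recovery, that no failure occurs during a restart, and that the final segment of work is not followed by a checkpoint.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Classical first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion_time(work, interval, checkpoint_cost, restart_cost,
                             draw_ttf, rng):
    """Simulate one run with periodic checkpoints and return its wall-clock time.

    draw_ttf(rng) returns a time to failure, so any failure law can be plugged in.
    """
    elapsed = 0.0  # wall-clock time spent so far
    done = 0.0     # useful work already saved by a checkpoint
    next_failure = draw_ttf(rng)
    while done < work:
        segment = min(interval, work - done)
        # Assumption: the last segment is not followed by a checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if elapsed + cost <= next_failure:
            # Segment and its checkpoint completed before the next failure.
            elapsed += cost
            done += segment
        else:
            # Failure: the unsaved segment is lost and a restart is paid.
            elapsed = next_failure + restart_cost
            next_failure = elapsed + draw_ttf(rng)
    return elapsed


def mean_completion_time(runs, seed, **kwargs):
    """Average the simulated completion time over several independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **kwargs) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    mtbf, cost = 50.0, 1.0

    def exponential(rng):
        # Poisson failure process: exponential inter-failure times.
        return rng.expovariate(1.0 / mtbf)

    for interval in (2.0, young_interval(cost, mtbf), 40.0):
        avg = mean_completion_time(
            runs=2000, seed=1, work=100.0, interval=interval,
            checkpoint_cost=cost, restart_cost=2.0, draw_ttf=exponential)
        print(f"interval={interval:6.2f}  mean completion time={avg:8.2f}")
```

To try a Weibull law instead, pass a function such as `lambda rng: rng.weibullvariate(scale, shape)` as `draw_ttf`. The closed-form baseline assumes exponential failures, so under other laws the interval is better chosen by sweeping candidate values through the simulator.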


Author information

Authors and Affiliations

Grenoble University, ZIRST 51, avenue Jean Kuntzmann, 38330, MONTBONNOT SAINT MARTIN, France
Mohamed-Slim Bouguerra, Denis Trystram & Jean-Marc Vincent
INRIA Rhone-Alpes, 655 avenue de l’Europe Montbonnot-Saint-Martin, 38334, SAINT ISMIER, France
Mohamed-Slim Bouguerra & Thierry Gautier

Editor information

Editors and Affiliations

Institute of Computational and Information Sciences, Czestochowa University of Technology,
Roman Wyrzykowski
Department of Electrical Engineering and Computer Science, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Institute of Computer and Information Science, Czestochowa University of Technology, Dabrowskiego 73, PL-42-200, Czestochowa, Poland
Konrad Karczewski
Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800, Kongens Lyngby, Denmark
Jerzy Wasniewski

Cite this paper

Bouguerra, MS., Gautier, T., Trystram, D., Vincent, JM. (2010). A Flexible Checkpoint/Restart Model in Distributed Systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2009. Lecture Notes in Computer Science, vol 6067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14390-8_22

DOI: https://doi.org/10.1007/978-3-642-14390-8_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14389-2
Online ISBN: 978-3-642-14390-8
eBook Packages: Computer Science, Computer Science (R0)
