Checkpointing Models for Tasks with Widely Different Processing Times

Ezhilchelvan, Paul; Mitrani, Isi

doi:10.1007/978-3-031-25049-1_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13659)

Included in the following conference series:

European Workshop on Performance Engineering

Abstract

A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy aiming to protect against possibly lengthy recovery periods is in operation. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.
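
The trade-off described in the abstract can be explored with a small Monte Carlo sketch. It is not the chapter's queueing model or its exact solution: it simulates a single job in isolation, and every name and parameter (`work`, `interval`, `checkpoint_cost`, `failure_rate`, `mean_repair`, `recovery_cost`) is a hypothetical choice made for illustration. Breakdown and repair times are assumed exponential, and breakdowns are assumed not to occur during repair or recovery.

```python
import random


def simulate_completion_time(work, interval, checkpoint_cost, failure_rate,
                             mean_repair, recovery_cost, rng):
    """Simulate the time to finish one job needing `work` units of service.

    Assumptions (illustrative, not taken from the chapter):
    - `interval` > 0 is the amount of work done between checkpoints;
    - breakdowns arrive at exponential intervals with rate `failure_rate`
      while the server is working or saving a checkpoint;
    - a repair takes an exponential time with mean `mean_repair` (> 0),
      followed by a fixed `recovery_cost` to reload the last checkpoint;
    - no breakdowns occur during repair or recovery;
    - the final segment ends with job completion, so it needs no checkpoint.
    """
    elapsed = 0.0
    remaining = work  # work not yet protected by a checkpoint or completed
    while remaining > 0:
        segment = min(interval, remaining)
        is_last = segment == remaining
        needed = segment + (0.0 if is_last else checkpoint_cost)
        # Memorylessness of the exponential distribution lets us redraw
        # the time to the next breakdown at the start of every attempt.
        if failure_rate > 0:
            time_to_failure = rng.expovariate(failure_rate)
        else:
            time_to_failure = float("inf")
        if time_to_failure >= needed:
            elapsed += needed
            remaining -= segment
        else:
            # Progress since the last checkpoint is lost.
            elapsed += time_to_failure
            elapsed += rng.expovariate(1.0 / mean_repair) + recovery_cost
    return elapsed


def mean_completion_time(runs, seed, **params):
    """Average the simulated completion time over `runs` independent jobs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **params) for _ in range(runs))
    return total / runs
```

Calling `mean_completion_time` for several values of `interval`, with the other parameters held fixed, gives a rough picture of when frequent checkpoints pay off and when their overhead outweighs the protection they provide. Setting `interval` at or above `work` corresponds to running without checkpoints.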

Author information

Authors and Affiliations

School of Computing, Newcastle University, Newcastle upon Tyne, NE4 5TG, UK
Paul Ezhilchelvan & Isi Mitrani

Corresponding author

Correspondence to Paul Ezhilchelvan.

Editor information

Editors and Affiliations

Miguel Hernandez University, Elche, Spain
Katja Gilly
Newcastle University, Newcastle upon Tyne, UK
Nigel Thomas

About this paper

Cite this paper

Ezhilchelvan, P., Mitrani, I. (2023). Checkpointing Models for Tasks with Widely Different Processing Times. In: Gilly, K., Thomas, N. (eds) Computer Performance Engineering. EPEW 2022. Lecture Notes in Computer Science, vol 13659. Springer, Cham. https://doi.org/10.1007/978-3-031-25049-1_7

DOI: https://doi.org/10.1007/978-3-031-25049-1_7
Published: 25 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25048-4
Online ISBN: 978-3-031-25049-1
eBook Packages: Computer Science, Computer Science (R0)
