Abstract
A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy aiming to protect against possibly lengthy recovery periods is in operation. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.
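The trade-off behind the choice of checkpointing interval can be illustrated with a much simpler calculation than the queueing model analysed in the chapter. The sketch below is not the authors' model. It considers a single job in isolation and assumes Poisson breakdowns at rate `xi`, a fixed checkpoint cost `c`, a mean combined repair and recovery time `r`, no breakdowns during repair or recovery, and a restart from the most recent checkpoint after each breakdown. All function and parameter names are hypothetical. Under these assumptions, the expected time to complete an uninterrupted stretch of length s is (1/xi + r)(e^{xi s} - 1).

```python
import math


def segment_time(s, xi, r):
    """Expected time to finish a stretch of length s that must run without
    interruption, given Poisson breakdowns at rate xi and a mean downtime r
    (repair plus recovery) after each breakdown. Illustrative assumption,
    not the chapter's queueing model."""
    return (1.0 / xi + r) * math.expm1(xi * s)


def expected_completion_time(x, tau, xi, c, r):
    """Expected completion time of one job of length x when a checkpoint of
    cost c is taken after every tau units of work. tau=None means no
    checkpoints. The final segment is not followed by a checkpoint."""
    if tau is None or tau >= x:
        return segment_time(x, xi, r)
    n_full = math.ceil(x / tau) - 1      # segments that end with a checkpoint
    last = x - n_full * tau              # work left in the final segment
    return n_full * segment_time(tau + c, xi, r) + segment_time(last, xi, r)


if __name__ == "__main__":
    xi, c, r = 0.01, 1.0, 5.0            # hypothetical parameter values
    for x, tau in [(1000.0, 20.0), (5.0, 1.0)]:
        with_cp = expected_completion_time(x, tau, xi, c, r)
        without_cp = expected_completion_time(x, None, xi, c, r)
        print(f"x={x}, tau={tau}: with={with_cp:.2f}, without={without_cp:.2f}")
```

With these hypothetical parameters, checkpointing shortens the expected completion time of the long job and lengthens that of the short job, which is the kind of contrast the numerical experiments in the chapter examine in a full queueing setting.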
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ezhilchelvan, P., Mitrani, I. (2023). Checkpointing Models for Tasks with Widely Different Processing Times. In: Gilly, K., Thomas, N. (eds) Computer Performance Engineering. EPEW 2022. Lecture Notes in Computer Science, vol 13659. Springer, Cham. https://doi.org/10.1007/978-3-031-25049-1_7
DOI: https://doi.org/10.1007/978-3-031-25049-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25048-4
Online ISBN: 978-3-031-25049-1
eBook Packages: Computer Science, Computer Science (R0)