Abstract
A task with ideal execution time T, such as the execution of a computer program or the transmission of a file on a data link, may fail. A number of protocols for failure recovery have been suggested and analyzed, in particular RESUME, REPLACE and RESTART. We consider here RESTART, with particular emphasis on checkpointing, where the task is split into K subtasks by inserting K − 1 checkpoints. If a failure occurs between two checkpoints, the task needs to be restarted from the last checkpoint only. Various models are considered: the task may have a fixed (T ≡ t) or a random length, and the spacing of checkpoints may be equidistant, non-equidistant or random. The emphasis is on tail asymptotics for the total task time X, in the same vein as the study of Asmussen et al. [5] on simple RESTART. For a fixed task length (T ≡ t) and certain types of failure mechanism, for example Poisson, the conclusion of the study is clear and not unexpected: to minimize large delays, the essential quantity to control is the maximal distance between checkpoints, which should be made as small as possible. For random unbounded task length, the effect of checkpointing is limited in the sense that the tail of X remains power-like, as for simple RESTART (K = 1).
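The RESTART-with-checkpoints mechanism described in the abstract can be illustrated by a small simulation. The following is only a minimal sketch under simplifying assumptions that are not taken from the paper: failures form a Poisson process with rate `failure_rate`, recovery after a failure is instantaneous, and inserting a checkpoint costs nothing. The function names are hypothetical. Under these assumptions the mean total time for K equidistant checkpoints has the standard closed form K(e^{μt/K} − 1)/μ, which the simulation can be compared against.

```python
import math
import random


def segment_time(length, failure_rate, rng):
    """Time to finish one segment under RESTART with Poisson failures.

    Assumption: after a failure the segment restarts immediately
    from its beginning (zero repair time).
    """
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= length:
            # No failure before the segment completes.
            return elapsed + length
        # Work done since the last checkpoint is lost; try again.
        elapsed += time_to_failure


def total_task_time(segment_lengths, failure_rate, rng):
    """Total time X for a task split into segments by checkpoints."""
    return sum(segment_time(s, failure_rate, rng) for s in segment_lengths)


def mean_total_time(segment_lengths, failure_rate, n_runs=20000, seed=0):
    """Monte Carlo estimate of E[X] for the given checkpoint spacing."""
    rng = random.Random(seed)
    total = sum(
        total_task_time(segment_lengths, failure_rate, rng)
        for _ in range(n_runs)
    )
    return total / n_runs


def exact_mean_equidistant(t, k, failure_rate):
    """Closed-form E[X] for K equidistant segments (same assumptions)."""
    return k * (math.exp(failure_rate * t / k) - 1.0) / failure_rate


if __name__ == "__main__":
    t, mu = 3.0, 1.0
    for k in (1, 3, 6):
        est = mean_total_time([t / k] * k, mu)
        print(k, round(est, 3), round(exact_mean_equidistant(t, k, mu), 3))
```

Passing a list of segment lengths rather than K alone makes it easy to try non-equidistant spacings as well, for instance to see how a single long segment affects the largest observed values of X.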
References
Andersen, L.N., Asmussen, S.: Parallel computing, failure recovery and extreme values. J. Statist. Theory. Pract. 2, 279–292 (2008)
Asmussen, S.: Applied Probability and Queues, 2nd edn. Springer (2003)
Asmussen, S.: Importance sampling for failure probabilities in computing and data transmission. J. Appl. Probab. 46, 768–790 (2009)
Asmussen, S., Glynn, P.W.: Stochastic Simulation: Algorithms and Analysis. Springer (2007)
Asmussen, S., Fiorini, P., Lipsky, L., Rolski, T., Sheahan, R.: On the distribution of total task times for tasks that must restart from the beginning if failure occurs. Math. Oper. Res. 33, 932–944 (2008)
Asmussen, S., Klüppelberg, C., Sigman, K.: Sampling at a subexponential time, with queueing applications. Stoch. Proc. Appl. 79, 265–286 (1999)
Bingham, N.H., Goldie, C.M., Teugels, J.L.: Regular Variation. Cambridge University Press (1987)
Bobbio, A., Trivedi, K.: Computation of the distribution of the completion time when the work requirement is a PH random variable. Stochastic Models 6, 133–150 (1990)
Castillo, X., Siewiorek, D.P.: A performance-reliability model for computing systems. In: Proc. FTCS-10, Silver Spring, MD, pp. 187–192. IEEE Computer Soc. (1980)
Chimento Jr., P.F., Trivedi, K.S.: The completion time of programs on processors subject to failure and repair. IEEE Trans. on Computers 42(1) (1993)
Chlebus, B.S., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing 14, 49–64 (2001)
David, H.A.: Order Statistics. Wiley (1970)
De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. In: Proc. 13th ACM PODC, pp. 161–172 (1994)
Embrechts, P., Maejima, M., Teugels, J.L.: Asymptotic behaviour of compound distributions. Astin Bulletin 15(1) (1985)
Feller, W.: An Introduction to Probability Theory and its Applications II, 2nd edn. Wiley (1971)
Fisher, R.A.: Tests of significance in harmonic analysis. Proc. Roy. Soc. A 125, 54–59
Gut, A.: Stopped Random Walks. Springer (1988)
Jelenković, P., Tan, J.: Can retransmissions of superexponential documents cause subexponential delays? In: Proceedings of IEEE INFOCOM 2007, Anchorage, pp. 892–900 (2007)
Jelenković, P., Tan, J.: Dynamic packet fragmentation for wireless channels with failures. In: Proc. MobiHoc 2008, Hong Kong, May 26-30 (2008)
Jelenković, P., Tan, J.: Characterizing heavy-tailed distributions induced by retransmissions. Adv. Appl. Probab. 45(1) (2013)
Kartashov, N.V.: A uniform asymptotic renewal theorem. Th. Probab. Appl. 25, 589–592 (1980)
Kartashov, N.V.: Equivalence of uniform renewal theorems and their criteria. Teor. Veroyatnost. i Mat. Statist. 27, 51–60 (1982) (in Russian)
Kulkarni, V., Nicola, V., Trivedi, K.: On modeling the performance and reliability of multimode systems. The Journal of Systems and Software 6, 175–183 (1986)
Kulkarni, V., Nicola, V., Trivedi, K.: The completion time of a job on a multimode system. Adv. Appl. Probab. 19, 932–954 (1987)
Lipsky, L.: Queueing Theory. A Linear Algebraic Approach, 2nd edn. Springer (2008)
Lipsky, L., Doran, D., Gokhale, S.: Checkpointing for the RESTART problem in Markov networks. J. Appl. Probab. 48A, 195–207 (2011)
Müller, A., Stoyan, D.: Comparison Methods for Stochastic Models and Risks. Wiley (2002)
Nicola, V.F.: Checkpointing and the modeling of program execution time. In: Lyu, M.R. (ed.) Software Fault Tolerance, ch. 7, pp. 167–188 (1995)
Nicola, V.F., Martini, R., Chimento, P.F.: The completion time of a job in a failure environment and partial loss of work. In: Proceedings of the 2nd International Conference on Mathematical Methods in Reliability (MMR 2000), Bordeaux, pp. 813–816 (2000)
Tantawi, A.N., Rutschitzka, M.: Performance analysis of checkpointing strategies. ACM Trans. Comp. Syst. 2, 123–144 (1984)
Wang, M., Woodroofe, M.: A uniform renewal theorem. Sequential Anal. 15, 21–36 (1996)
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Asmussen, S., Lipsky, L., Thompson, S. (2014). Checkpointing in Failure Recovery in Computing and Data Transmission. In: Sericola, B., Telek, M., Horváth, G. (eds) Analytical and Stochastic Modeling Techniques and Applications. ASMTA 2014. Lecture Notes in Computer Science, vol 8499. Springer, Cham. https://doi.org/10.1007/978-3-319-08219-6_18
DOI: https://doi.org/10.1007/978-3-319-08219-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08218-9
Online ISBN: 978-3-319-08219-6
eBook Packages: Computer Science (R0)