Checkpointing in Failure Recovery in Computing and Data Transmission

Asmussen, Søren; Lipsky, Lester; Thompson, Stephen

doi:10.1007/978-3-319-08219-6_18


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8499)

Included in the following conference series:

International Conference on Analytical and Stochastic Modeling Techniques and Applications


Abstract

A task with ideal execution time T, such as the execution of a computer program or the transmission of a file on a data link, may fail. A number of protocols for failure recovery have been suggested and analyzed, in particular RESUME, REPLACE and RESTART. We consider here RESTART with particular emphasis on checkpointing, where the task is split into K subtasks by inserting K − 1 checkpoints. If a failure occurs between two checkpoints, the task needs to be restarted from the last checkpoint only. Various models are considered: the task may have a fixed (T ≡ t) or a random length, and the spacing of checkpoints may be equidistant, non-equidistant or random. The emphasis here is on tail asymptotics for the total task time X, in the same vein as the study of Asmussen et al. [5] on simple RESTART. For a fixed task length (T ≡ t) and certain types of failure mechanism, for example Poisson, the conclusion of the study is clear and not unexpected: the essential thing to control for minimizing large delays is making the maximal distance between checkpoints as small as possible. For random unbounded task length, it is seen that the effect of checkpointing is limited in the sense that the tail of X remains power-like, as for simple RESTART (K = 1).
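
To make the RESTART-with-checkpoints mechanism concrete, the following is a minimal simulation sketch for the fixed-length case with Poisson failures. It is an illustration only, not the authors' method: the function names, the failure rate, and the simplifying assumptions (no checkpoint overhead, no repair time, failures detected instantly) are all hypothetical choices made here. Under these assumptions the mean time to finish one segment of length s with failure rate μ is (e^{μs} − 1)/μ, which follows from the memoryless property of the exponential distribution, and the sketch compares that formula with simulated averages.

```python
import math
import random


def segment_time(length, rate, rng):
    """Time to finish one segment under RESTART.

    Assumption: failures arrive as a Poisson process with the given rate,
    and every failure sends the work back to the start of the segment.
    """
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(rate)
        if time_to_failure >= length:
            # No failure before the segment ends: the attempt succeeds.
            return elapsed + length
        # The attempt failed; the time spent on it is lost.
        elapsed += time_to_failure


def total_time(segments, rate, rng):
    """Total task time X when checkpoints split the task into segments."""
    return sum(segment_time(s, rate, rng) for s in segments)


def expected_segment_time(length, rate):
    """Mean segment time under the assumptions above: (e^{rate*length} - 1) / rate."""
    return (math.exp(rate * length) - 1.0) / rate


def equidistant(t, k):
    """Segment lengths for K equidistant subtasks (K - 1 checkpoints)."""
    return [t / k] * k


def simulate_mean(segments, rate, runs=20000, seed=1):
    """Monte Carlo estimate of the mean total task time."""
    rng = random.Random(seed)
    return sum(total_time(segments, rate, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    t, rate = 4.0, 1.0  # hypothetical task length and failure rate
    for k in (1, 2, 4):
        segs = equidistant(t, k)
        exact = sum(expected_segment_time(s, rate) for s in segs)
        print(k, round(exact, 3), round(simulate_mean(segs, rate), 3))
```

Because the mean segment time is convex in the segment length, an uneven split such as [3, 1] has a larger mean than the even split [2, 2] of the same task. This mean-value comparison is consistent with, though much weaker than, the abstract's tail result that the maximal distance between checkpoints is the quantity to control.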


References

1. Andersen, L.N., Asmussen, S.: Parallel computing, failure recovery and extreme values. J. Statist. Theory. Pract. 2, 279–292 (2008)
2. Asmussen, S.: Applied Probability and Queues, 2nd edn. Springer (2003)
3. Asmussen, S.: Importance sampling for failure probabilities in computing and data transmission. J. Appl. Probab. 46, 768–790 (2009)
4. Asmussen, S., Glynn, P.W.: Stochastic Simulation: Algorithms and Analysis. Springer (2007)
5. Asmussen, S., Fiorini, P., Lipsky, L., Rolski, T., Sheahan, R.: On the distribution of total task times for tasks that must restart from the beginning if failure occurs. Math. Oper. Res. 33, 932–944 (2008)
6. Asmussen, S., Klüppelberg, C., Sigman, K.: Sampling at a subexponential time, with queueing applications. Stoch. Proc. Appl. 79, 265–286 (1999)
7. Bingham, N.H., Goldie, C.M., Teugels, J.L.: Regular Variation. Cambridge University Press (1987)
8. Bobbio, A., Trivedi, K.: Computation of the distribution of the completion time when the work requirement is a PH random variable. Stochastic Models 6, 133–150 (1990)
9. Castillo, X., Siewiorek, D.P.: A performance-reliability model for computing systems. In: Proc. FTCS-10, Silver Spring, MD, pp. 187–192. IEEE Computer Soc. (1980)
10. Chimento Jr., P.F., Trivedi, K.S.: The completion time of programs on processors subject to failure and repair. IEEE Trans. on Computers 42(1) (1993)
11. Chlebus, B.S., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing 14, 49–64 (2001)
12. David, H.A.: Order Statistics. Wiley (1970)
13. De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. In: Proc. 13th ACM PODC, pp. 161–172 (1994)
14. Embrechts, P., Maejima, M., Teugels, J.L.: Asymptotic behaviour of compound distributions. Astin Bulletin 15(1) (1985)
15. Feller, W.: An Introduction to Probability Theory and its Applications II, 2nd edn. Wiley (1971)
16. Fisher, R.A.: Tests of significance in harmonic analysis. Proc. Roy. Soc. A 125, 54–59
17. Gut, A.: Stopped Random Walks. Springer (1988)
18. Jelenković, P., Tan, J.: Can retransmissions of superexponential documents cause subexponential delays? In: Proceedings of IEEE INFOCOM 2007, Anchorage, pp. 892–900 (2007)
19. Jelenković, P., Tan, J.: Dynamic packet fragmentation for wireless channels with failures. In: Proc. MobiHoc 2008, Hong Kong, May 26–30 (2008)
20. Jelenković, P., Tan, J.: Characterizing heavy-tailed distributions induced by retransmissions. Adv. Appl. Probab. 45(1) (2013)
21. Kartashov, N.V.: A uniform asymptotic renewal theorem. Th. Probab. Appl. 25, 589–592 (1980)
22. Kartashov, N.V.: Equivalence of uniform renewal theorems and their criteria. Teor. Veroyatnost. i Mat. Statist. 27, 51–60 (1982) (in Russian)
23. Kulkarni, V., Nicola, V., Trivedi, K.: On modeling the performance and reliability of multimode systems. The Journal of Systems and Software 6, 175–183 (1986)
24. Kulkarni, V., Nicola, V., Trivedi, K.: The completion time of a job on a multimode system. Adv. Appl. Probab. 19, 932–954 (1987)
25. Lipsky, L.: Queueing Theory. A Linear Algebraic Approach, 2nd edn. Springer (2008)
26. Lipsky, L., Doran, D., Gokhale, S.: Checkpointing for the RESTART problem in Markov networks. J. Appl. Probab. 48A, 195–207 (2011)
27. Müller, A., Stoyan, D.: Comparison Methods for Stochastic Models and Risks. Wiley (2002)
28. Nicola, V.F.: Checkpointing and the modeling of program execution time. In: Lyu, M.R. (ed.) Software Fault Tolerance, ch. 7, pp. 167–188 (1995)
29. Nicola, V.F., Martini, R., Chimento, P.F.: The completion time of a job in a failure environment and partial loss of work. In: Proceedings of the 2nd International Conference on Mathematical Methods in Reliability (MMR 2000), Bordeaux, pp. 813–816 (2000)
30. Tantawi, A.N., Rutschitzka, M.: Performance analysis of checkpointing strategies. ACM Trans. Comp. Syst. 2, 123–144 (1984)
31. Wang, M., Woodroofe, M.: A uniform renewal theorem. Sequential Anal. 15, 21–36 (1996)


Author information

Authors and Affiliations

Dept. of Mathematics, Aarhus University, Ny Munkegade, 8000, Aarhus C, Denmark
Søren Asmussen
Dept. of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269-2155, USA
Lester Lipsky & Stephen Thompson


Editor information

Editors and Affiliations

INRIA, Campus de Beaulieu, 35042, Rennes, Cedex, France
Bruno Sericola
BME-HIT, Magyar tudósok krt. 2, 1117, Budapest, Hungary
Miklós Telek
BME-HIT, Magyar tudósok krt. 2, Hungary
Gábor Horváth


About this paper

Cite this paper

Asmussen, S., Lipsky, L., Thompson, S. (2014). Checkpointing in Failure Recovery in Computing and Data Transmission. In: Sericola, B., Telek, M., Horváth, G. (eds) Analytical and Stochastic Modeling Techniques and Applications. ASMTA 2014. Lecture Notes in Computer Science, vol 8499. Springer, Cham. https://doi.org/10.1007/978-3-319-08219-6_18


DOI: https://doi.org/10.1007/978-3-319-08219-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08218-9
Online ISBN: 978-3-319-08219-6
eBook Packages: Computer Science, Computer Science (R0)
