Checkpointing in Failure Recovery in Computing and Data Transmission

  • Conference paper
Analytical and Stochastic Modeling Techniques and Applications (ASMTA 2014)

Abstract

A task with ideal execution time T, such as the execution of a computer program or the transmission of a file on a data link, may fail. A number of protocols for failure recovery have been suggested and analyzed, in particular RESUME, REPLACE and RESTART. We consider here RESTART, with particular emphasis on checkpointing, where the task is split into K subtasks by inserting K − 1 checkpoints. If a failure occurs between two checkpoints, the task needs to be restarted from the last checkpoint only. Various models are considered: the task may have a fixed (T ≡ t) or a random length, and the spacing of checkpoints may be equidistant, non-equidistant or random. The emphasis here is on tail asymptotics for the total task time X, in the same vein as the study of Asmussen et al. [5] on simple RESTART. For a fixed task length (T ≡ t) and certain types of failure mechanism, for example Poisson, the conclusion of the study is clear and not unexpected: to minimize large delays, the essential thing is to make the maximal distance between checkpoints as small as possible. For random unbounded task length, it is seen that the effect of checkpointing is limited in the sense that the tail of X remains power-like, as for simple RESTART (K = 1).
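
The RESTART-with-checkpoints mechanism described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's method, only a minimal simulation under stated assumptions: failures form a Poisson process with rate mu, restarting and checkpointing cost no time, and the task length is fixed. The function names (simulate_total_time, expected_total_time, equidistant) and the parameter values are hypothetical and chosen for illustration. The closed-form mean used as a sanity check, the sum over subtasks of (exp(mu·s) − 1)/mu, is the standard result for exponential failure times with zero repair time.

```python
import math
import random


def simulate_total_time(segments, failure_rate, rng):
    """Draw one sample of the total task time X under RESTART with checkpoints.

    segments: lengths of the K subtasks (their sum is the ideal task time t).
    failure_rate: rate mu of the assumed Poisson failure process.
    Assumption: restarting from a checkpoint costs no extra time.
    """
    total = 0.0
    for length in segments:
        while True:
            time_to_failure = rng.expovariate(failure_rate)
            if time_to_failure >= length:
                total += length  # subtask completed before the next failure
                break
            total += time_to_failure  # work since the last checkpoint is lost
    return total


def expected_total_time(segments, failure_rate):
    """Closed-form mean of X for exponential failures and zero repair time."""
    return sum(math.expm1(failure_rate * s) / failure_rate for s in segments)


def equidistant(t, k):
    """Split a task of length t into k subtasks of equal length."""
    return [t / k] * k


if __name__ == "__main__":
    rng = random.Random(2014)
    t, mu, n = 3.0, 1.0, 10000
    layouts = {
        "K=1 (simple RESTART)": equidistant(t, 1),
        "K=3 equidistant": equidistant(t, 3),
        "K=3 non-equidistant": [2.0, 0.5, 0.5],
    }
    for name, segs in layouts.items():
        samples = sorted(simulate_total_time(segs, mu, rng) for _ in range(n))
        mean = sum(samples) / n
        q99 = samples[int(0.99 * n)]  # empirical 99% quantile as a tail proxy
        print(f"{name}: mean={mean:.2f} "
              f"(theory {expected_total_time(segs, mu):.2f}), q99={q99:.2f}")
```

Comparing the two K = 3 layouts gives a rough feel for the abstract's point that, for fixed task length and Poisson failures, the longest gap between checkpoints is what matters most for large delays. An empirical quantile is only a crude proxy for the tail asymptotics the paper studies.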

References

  1. Andersen, L.N., Asmussen, S.: Parallel computing, failure recovery and extreme values. J. Statist. Theory. Pract. 2, 279–292 (2008)

  2. Asmussen, S.: Applied Probability and Queues, 2nd edn. Springer (2003)

  3. Asmussen, S.: Importance sampling for failure probabilities in computing and data transmission. J. Appl. Probab. 46, 768–790 (2009)

  4. Asmussen, S., Glynn, P.W.: Stochastic Simulation: Algorithms and Analysis. Springer (2007)

  5. Asmussen, S., Fiorini, P., Lipsky, L., Rolski, T., Sheahan, R.: On the distribution of total task times for tasks that must restart from the beginning if failure occurs. Math. Oper. Res. 33, 932–944 (2008)

  6. Asmussen, S., Klüppelberg, C., Sigman, K.: Sampling at a subexponential time, with queueing applications. Stoch. Proc. Appl. 79, 265–286 (1999)

  7. Bingham, N.H., Goldie, C.M., Teugels, J.L.: Regular Variation. Cambridge University Press (1987)

  8. Bobbio, A., Trivedi, K.: Computation of the distribution of the completion time when the work requirement is a PH random variable. Stochastic Models 6, 133–150 (1990)

  9. Castillo, X., Siewiorek, D.P.: A performance-reliability model for computing systems. In: Proc. FTCS-10, Silver Spring, MD, pp. 187–192. IEEE Computer Soc. (1980)

  10. Chimento Jr., P.F., Trivedi, K.S.: The completion time of programs on processors subject to failure and repair. IEEE Trans. on Computers 42(1) (1993)

  11. Chlebus, B.S., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing 14, 49–64 (2001)

  12. David, H.A.: Order Statistics. Wiley (1970)

  13. De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. In: Proc. 13th ACM PODC, pp. 161–172 (1994)

  14. Embrechts, P., Maejima, M., Teugels, J.L.: Asymptotic behaviour of compound distributions. Astin Bulletin 15(1) (1985)

  15. Feller, W.: An Introduction to Probability Theory and its Applications II, 2nd edn. Wiley (1971)

  16. Fisher, R.A.: Tests of significance in harmonic analysis. Proc. Roy. Soc. A 125, 54–59 (1929)

  17. Gut, A.: Stopped Random Walks. Springer (1988)

  18. Jelenković, P., Tan, J.: Can retransmissions of superexponential documents cause subexponential delays? In: Proceedings of IEEE INFOCOM 2007, Anchorage, pp. 892–900 (2007)

  19. Jelenković, P., Tan, J.: Dynamic packet fragmentation for wireless channels with failures. In: Proc. MobiHoc 2008, Hong Kong, May 26-30 (2008)

  20. Jelenković, P., Tan, J.: Characterizing heavy-tailed distributions induced by retransmissions. Adv. Appl. Probab. 45(1) (2013)

  21. Kartashov, N.V.: A uniform asymptotic renewal theorem. Th. Probab. Appl. 25, 589–592 (1980)

  22. Kartashov, N.V.: Equivalence of uniform renewal theorems and their criteria. Teor. Veroyatnost. i Mat. Statist. 27, 51–60 (1982) (in Russian)

  23. Kulkarni, V., Nicola, V., Trivedi, K.: On modeling the performance and reliability of multimode systems. The Journal of Systems and Software 6, 175–183 (1986)

  24. Kulkarni, V., Nicola, V., Trivedi, K.: The completion time of a job on a multimode system. Adv. Appl. Probab. 19, 932–954 (1987)

  25. Lipsky, L.: Queueing Theory. A Linear Algebraic Approach, 2nd edn. Springer (2008)

  26. Lipsky, L., Doran, D., Gokhale, S.: Checkpointing for the RESTART problem in Markov networks. J. Appl. Probab. 48A, 195–207 (2011)

  27. Müller, A., Stoyan, D.: Comparison Methods for Stochastic Models and Risks. Wiley (2002)

  28. Nicola, V.F.: Checkpointing and the modeling of program execution time. In: Lyu, M.R. (ed.) Software Fault Tolerance, ch. 7, pp. 167–188 (1995)

  29. Nicola, V.F., Martini, R., Chimento, P.F.: The completion time of a job in a failure environment and partial loss of work. In: Proceedings of the 2nd International Conference on Mathematical Methods in Reliability (MMR 2000), Bordeaux, pp. 813–816 (2000)

  30. Tantawi, A.N., Rutschitzka, M.: Performance analysis of checkpointing strategies. ACM Trans. Comp. Syst. 2, 123–144 (1984)

  31. Wang, M., Woodroofe, M.: A uniform renewal theorem. Sequential Anal. 15, 21–36 (1996)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Asmussen, S., Lipsky, L., Thompson, S. (2014). Checkpointing in Failure Recovery in Computing and Data Transmission. In: Sericola, B., Telek, M., Horváth, G. (eds) Analytical and Stochastic Modeling Techniques and Applications. ASMTA 2014. Lecture Notes in Computer Science, vol 8499. Springer, Cham. https://doi.org/10.1007/978-3-319-08219-6_18

  • DOI: https://doi.org/10.1007/978-3-319-08219-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08218-9

  • Online ISBN: 978-3-319-08219-6

  • eBook Packages: Computer Science, Computer Science (R0)