Checkpointing Models for Tasks with Widely Different Processing Times

  • Conference paper
Computer Performance Engineering (EPEW 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13659))

Abstract

A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy aiming to protect against possibly lengthy recovery periods is in operation. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.
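The trade-off behind the choice of checkpointing interval can be sketched with a much simpler model than the one analysed in the paper. The sketch below is an illustration under stated assumptions, not the authors' queueing model: it considers a single job in isolation (no queue), breakdowns arriving as a Poisson process with rate `xi` while the server is working or checkpointing, a fixed mean `downtime` (repair plus recovery) during which no further breakdowns occur, a checkpoint cost `c` after every `tau` units of work, and no checkpoint after the final segment. Under these assumptions the expected time to get through an uninterrupted stretch of length `s` is `(1/xi + downtime) * (exp(xi * s) - 1)`. All function and parameter names are hypothetical.

```python
import math


def segment_time(s, xi, downtime):
    """Expected time to complete a stretch of length s that restarts
    from its beginning after every breakdown (assumed Poisson, rate xi)."""
    if xi == 0:
        return s  # no breakdowns: the stretch takes exactly s
    return (1.0 / xi + downtime) * math.expm1(xi * s)


def expected_completion_time(x, tau, c, xi, downtime):
    """Expected completion time of a job of length x when a checkpoint
    costing c is taken after every tau units of work (none after the
    final segment)."""
    n = max(1, math.ceil(x / tau))   # number of work segments
    last = x - (n - 1) * tau         # length of the final segment
    return ((n - 1) * segment_time(tau + c, xi, downtime)
            + segment_time(last, xi, downtime))


def best_interval(x, c, xi, downtime, candidates):
    """Pick the candidate interval with the smallest expected time."""
    return min(candidates,
               key=lambda tau: expected_completion_time(x, tau, c, xi, downtime))
```

With `xi = 0.05`, `downtime = 1` and `c = 0.5`, a job of length 100 finishes much sooner in expectation with `tau = 5` than with no checkpoints (`tau = 100`), whereas a job of length 1 is slowed down by checkpointing every 0.25 units. This mirrors, in a toy setting, the abstract's point that checkpoints help under some conditions and not under others.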

Author information

Corresponding author

Correspondence to Paul Ezhilchelvan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ezhilchelvan, P., Mitrani, I. (2023). Checkpointing Models for Tasks with Widely Different Processing Times. In: Gilly, K., Thomas, N. (eds) Computer Performance Engineering. EPEW 2022. Lecture Notes in Computer Science, vol 13659. Springer, Cham. https://doi.org/10.1007/978-3-031-25049-1_7

  • DOI: https://doi.org/10.1007/978-3-031-25049-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25048-4

  • Online ISBN: 978-3-031-25049-1

  • eBook Packages: Computer Science, Computer Science (R0)
