Estimating Checkpointing, Rollback and Recovery Overheads

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2918))

Abstract

There are several schemes for checkpointing and rollback recovery. In this paper, we analyze some of these schemes under a stochastic model. We derive expressions for the average cost of checkpointing, rollback recovery, message logging, and piggybacking on application messages, for both synchronous and asynchronous checkpointing. For quasi-synchronous checkpointing, we show that in a system with n processes the upper and lower bounds on selective message logging are O(n²) and O(n), respectively.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mandal, P.S., Mukhopadhyaya, K. (2003). Estimating Checkpointing, Rollback and Recovery Overheads. In: Das, S.R., Das, S.K. (eds) Distributed Computing - IWDC 2003. IWDC 2003. Lecture Notes in Computer Science, vol 2918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24604-6_6

  • DOI: https://doi.org/10.1007/978-3-540-24604-6_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20745-0

  • Online ISBN: 978-3-540-24604-6

  • eBook Packages: Springer Book Archive
