Estimating Checkpointing, Rollback and Recovery Overheads

Mandal, Partha Sarathi; Mukhopadhyaya, Krishnendu

doi:10.1007/978-3-540-24604-6_6

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2918)

Abstract

Several schemes exist for checkpointing and rollback recovery. In this paper, we analyze some of these schemes under a stochastic model. We derive expressions for the average cost of checkpointing, rollback recovery, message logging, and piggybacking on application messages in both synchronous and asynchronous checkpointing. For quasi-synchronous checkpointing, we show that in a system with n processes the upper and lower bounds on selective message logging are O(n²) and O(n), respectively.

Author information

Authors and Affiliations

Advanced Computing and Microelectronics Unit, Indian Statistical Institute, 203, B T Road, Kolkata, 700108, India
Partha Sarathi Mandal & Krishnendu Mukhopadhyaya

Editor information

Editors and Affiliations

Computer Science Department SUNY at Stony Brook, NY 11794-4400, Stony Brook, USA
Samir R. Das
Department of Computer Science and Engineering, University of Texas at Arlington, P.O. Box 19015, TX 76019, Arlington, USA
Sajal K. Das

About this paper

Cite this paper

Mandal, P.S., Mukhopadhyaya, K. (2003). Estimating Checkpointing, Rollback and Recovery Overheads. In: Das, S.R., Das, S.K. (eds) Distributed Computing - IWDC 2003. IWDC 2003. Lecture Notes in Computer Science, vol 2918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24604-6_6

DOI: https://doi.org/10.1007/978-3-540-24604-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20745-0
Online ISBN: 978-3-540-24604-6
eBook Packages: Springer Book Archive
