Abstract
Several schemes exist for checkpointing and rollback recovery. In this paper, we analyze some of these schemes under a stochastic model. We derive expressions for the average cost of checkpointing, rollback recovery, message logging, and piggybacking on application messages in synchronous as well as asynchronous checkpointing. For quasi-synchronous checkpointing, we show that in a system with n processes, the upper and lower bounds on selective message logging are O(n²) and O(n), respectively.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mandal, P.S., Mukhopadhyaya, K. (2003). Estimating Checkpointing, Rollback and Recovery Overheads. In: Das, S.R., Das, S.K. (eds) Distributed Computing - IWDC 2003. IWDC 2003. Lecture Notes in Computer Science, vol 2918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24604-6_6
DOI: https://doi.org/10.1007/978-3-540-24604-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20745-0
Online ISBN: 978-3-540-24604-6
eBook Packages: Springer Book Archive