An Efficient and Scalable Checkpointing and Recovery Algorithm for Distributed Systems
In this paper, we describe an efficient coordinated checkpointing and recovery algorithm that works even when channels are non-FIFO and messages may be lost. Nodes are assumed to be autonomous and do not block while taking checkpoints. Based on its local conditions, any process can ask the previous coordinator for permission to initiate a new checkpoint. Allowing multiple initiators avoids the bottleneck of a single initiator, while the algorithm permits only one instance of the checkpointing process at any given time, which removes much of the overhead usually associated with multiple concurrent initiators in distributed algorithms.
Keywords: Control Message, Recovery Algorithm, Message Overhead, Receiver Process, Application Message
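The permission scheme outlined in the abstract can be illustrated with a small sketch. The abstract gives no protocol details, so everything here is an assumption made for illustration: the names `Process`, `CheckpointSystem`, `request_permission`, and `run_checkpoint` are hypothetical, and the sketch runs sequentially in one address space, so it does not model non-FIFO channels, message loss, or non-blocking checkpointing. It shows only the two properties stated above: any process may become the initiator by asking the previous coordinator, and at most one checkpointing instance runs at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    state: int = 0                      # stand-in for application state
    checkpoints: list = field(default_factory=list)


class CheckpointSystem:
    """Hypothetical single-instance, multi-initiator checkpointing sketch."""

    def __init__(self, n):
        self.procs = [Process(pid=i) for i in range(n)]
        self.coordinator = 0            # assumed: process 0 coordinated the "previous" round
        self.active = False             # True while a checkpointing instance is running
        self.round = 0

    def request_permission(self, requester):
        # The requester asks the previous coordinator for permission.
        # Assumption: permission is denied while another instance is running.
        if self.active:
            return False
        self.active = True
        self.coordinator = requester    # the requester becomes the new coordinator
        return True

    def run_checkpoint(self, initiator):
        if not self.request_permission(initiator):
            return False
        self.round += 1
        for p in self.procs:
            # Each process records its state tagged with the round number.
            p.checkpoints.append((self.round, p.state))
        self.active = False             # instance finished; a new one may start
        return True


system = CheckpointSystem(3)
system.procs[1].state = 5
granted = system.run_checkpoint(initiator=2)
```

In this sketch the coordinator role simply moves to whichever process was last granted permission, which is one plausible reading of "request the previous coordinator"; the paper itself may handle concurrent requests and denial differently.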