An improved logging and checkpointing scheme for recoverable distributed shared memory
A distributed shared memory (DSM) system turns an existing network of workstations into a powerful shared-memory parallel computer that can deliver superior price/performance. However, as more workstations join the system and execution times grow longer, the probability of faults increases, and a fault could render the system useless. Several checkpointing and logging schemes have been proposed to let a DSM system continue working after transient failures. With checkpoints, processes do not have to roll back to the beginning of execution; they only need to roll back to the latest checkpoint. Logging is introduced to further reduce rollback propagation to other, related processes. Although logging makes rollback propagation unnecessary, it introduces overhead of its own: if every read/write operation had to be logged, the logging overhead would be prohibitive. Moreover, some previously proposed logging methods can result in incorrect recovery when processes synchronize using barriers. In this paper, we propose a novel logging scheme that greatly reduces the amount of logging by logging only the pages that are invalidated rather than all the pages accessed. The performance of the proposed scheme is analyzed using extensive simulation. Compared with two previously proposed schemes, our new logging scheme shows superior performance in various cases.
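The difference between logging every page accessed and logging only invalidated pages can be illustrated with a toy model. The sketch below is not the scheme from the paper; it only counts log entries under the two policies for a made-up trace. The `Process` class, the `fetch` and `invalidate` operations, and the trace format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy model of one DSM node (hypothetical, for illustration only)."""
    cache: dict = field(default_factory=dict)          # page id -> contents held locally
    log_on_access: list = field(default_factory=list)  # policy A: log every fetched page
    log_on_invalidate: list = field(default_factory=list)  # policy B: log invalidated pages

    def fetch(self, page, contents):
        # A remote page arrives. Policy A records it immediately.
        self.cache[page] = contents
        self.log_on_access.append((page, contents))

    def invalidate(self, page):
        # The local copy is discarded. Policy B records the copy only now,
        # because until this point it was still available in local memory.
        if page in self.cache:
            self.log_on_invalidate.append((page, self.cache.pop(page)))


def run(trace):
    """Replay a trace and return (entries under policy A, entries under policy B)."""
    p = Process()
    for op, page, *rest in trace:
        if op == "fetch":
            p.fetch(page, rest[0])
        elif op == "invalidate":
            p.invalidate(page)
    return len(p.log_on_access), len(p.log_on_invalidate)


# Made-up trace: three pages fetched, one invalidated and fetched again.
trace = [
    ("fetch", 1, "a"), ("fetch", 2, "b"), ("fetch", 3, "c"),
    ("invalidate", 2),
    ("fetch", 2, "b2"),
]
print(run(trace))  # (4, 1)
```

In this trace, the access-based policy writes four log entries while the invalidation-based policy writes one. How large the gap is in practice depends on how often pages are invalidated relative to how often they are fetched.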