Abstract
Fault tolerance is essential in cluster computing, and many well-known cluster-computing systems implement it with checkpoint/restart mechanisms. However, existing checkpointing algorithms cannot restore the state of a file system when a program is rolled back, so existing fault-tolerant systems place many restrictions on file access. This paper presents the SCR algorithm, which is based on atomic operations and consistent scheduling and can restore file system state. SCR classifies file system calls into idempotent and non-idempotent operations: a non-idempotent operation modifies the state of the file system, while an idempotent operation does not. The algorithm tracks changes to file system state by logging to disk each non-idempotent operation issued by a user program, together with the information needed to undo it. When a program is rolled back to a checkpoint, SCR reverts the file system to its state at the time of that last checkpoint. With SCR, user programs may use any file operation.
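The logging idea described in the abstract can be illustrated with a minimal sketch. It is not the SCR implementation: the class `UndoLog`, its method names, the backup-file naming, and the record format are all hypothetical, and the sketch operates on whole files at user level rather than intercepting system calls. It only shows the general pattern of recording undo information before each modifying operation and replaying it in reverse on rollback.

```python
import os
import shutil
import tempfile


class UndoLog:
    """Hypothetical undo log; an illustration, not the paper's data structure."""

    def __init__(self, backup_dir):
        self.backup_dir = backup_dir
        self.entries = []  # undo records made since the last checkpoint

    def checkpoint(self):
        # Records older than the newest checkpoint are no longer needed.
        self.entries.clear()

    def _save_copy(self, path):
        # Preserve the pre-modification contents so they can be restored.
        backup = os.path.join(self.backup_dir, f"undo_{len(self.entries)}")
        shutil.copy2(path, backup)
        return backup

    def read_file(self, path):
        # Does not modify the file system, so nothing is logged.
        with open(path) as f:
            return f.read()

    def write_file(self, path, data):
        # Modifying operation: log how to undo it, then perform it.
        if os.path.exists(path):
            self.entries.append(("restore", path, self._save_copy(path)))
        else:
            self.entries.append(("delete", path, None))
        with open(path, "w") as f:
            f.write(data)

    def remove_file(self, path):
        self.entries.append(("restore", path, self._save_copy(path)))
        os.remove(path)

    def rename_file(self, src, dst):
        # Simplifying assumption: dst does not already exist.
        self.entries.append(("rename", dst, src))
        os.rename(src, dst)

    def rollback(self):
        # Undo the logged operations in reverse order.
        while self.entries:
            action, path, extra = self.entries.pop()
            if action == "restore":
                shutil.copy2(extra, path)
            elif action == "delete":
                if os.path.exists(path):
                    os.remove(path)
            elif action == "rename":
                os.rename(path, extra)


def demo():
    with tempfile.TemporaryDirectory() as work, \
            tempfile.TemporaryDirectory() as backups:
        log = UndoLog(backups)
        a = os.path.join(work, "a.txt")
        b = os.path.join(work, "b.txt")
        with open(a, "w") as f:
            f.write("v1")
        log.checkpoint()                      # state to return to
        log.write_file(a, "v2")               # overwrite an existing file
        log.write_file(b, "new")              # create a new file
        log.rename_file(a, os.path.join(work, "c.txt"))
        log.remove_file(b)
        log.rollback()                        # revert to the checkpoint
        return sorted(os.listdir(work)), log.read_file(a)
```

Calling `demo()` performs four modifying operations after a checkpoint and then rolls back; the working directory again contains only `a.txt` with its original contents. A real implementation would also need to keep the log itself on stable storage and handle partial writes, which this sketch does not attempt.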
Additional information
Project supported by NNSFC under grant No.69673012.
About this article
Cite this article
Wei, X., Ju, J. SCR algorithm: Saving/restoring states of file systems. J. Comput. Sci. & Technol. 15, 393–400 (2000). https://doi.org/10.1007/BF02948877
DOI: https://doi.org/10.1007/BF02948877