Abstract
Cluster systems are becoming more prevalent, and users are beginning to demand that they be reliable. Most clusters, however, have been designed for high performance and provide little or no reliability. To address this, this report examines how a recovery facility, based on either a centralised or a distributed approach, could be implemented in a cluster that is supported by a checkpointing facility. Such a recovery facility can then recover failed user processes by using checkpoints of those processes taken during failure-free execution.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maloney, A., Goscinski, A. (2005). A Comparative Study at the Logical Level of Centralised and Distributed Recovery in Clusters. In: Hobbs, M., Goscinski, A.M., Zhou, W. (eds) Distributed and Parallel Computing. ICA3PP 2005. Lecture Notes in Computer Science, vol 3719. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564621_13
DOI: https://doi.org/10.1007/11564621_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29235-7
Online ISBN: 978-3-540-32071-5
eBook Packages: Computer Science (R0)