Optimal recovery schemes in fault tolerant distributed computing

Klonowska, Kamilla; Lennerstad, Håkan; Lundberg, Lars; Svahnberg, Charlie

doi:10.1007/s00236-005-0161-7

Published: May 2005

Volume 41, pages 341–365, (2005)

Acta Informatica

Abstract.

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. The redistribution is determined by the recovery scheme. The recovery scheme is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for a maximum of twenty-one computers in the distributed system or cluster.
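
To make the worst-case criterion concrete, the following is a minimal illustrative sketch, not the method of the article. It assumes that each computer starts with one unit of load (one process) and that the recovery list of computer i is obtained by adding each entry of a sequence `offsets` to i modulo n; the abstract does not state how the sequence is turned into recovery lists, so this mapping is an assumption. The names `max_load` and `worst_case_load` are hypothetical. The sketch deliberately uses the brute-force enumeration that the article avoids.

```python
from itertools import combinations


def max_load(n, offsets, failed):
    """Largest number of processes on any running computer.

    Assumption (not from the article): computer i starts with one process,
    and its recovery list is [(i + d) % n for d in offsets].
    """
    failed = set(failed)
    # Every running computer keeps its own process.
    load = {c: 1 for c in range(n) if c not in failed}
    for home in failed:
        # Move the process to the first running computer in the recovery list.
        for d in offsets:
            target = (home + d) % n
            if target not in failed:
                load[target] += 1
                break
        else:
            raise ValueError(f"no running computer in recovery list of {home}")
    return max(load.values())


def worst_case_load(n, offsets, k):
    """Maximum of max_load over every set of k failed computers (requires k < n)."""
    return max(max_load(n, offsets, f) for f in combinations(range(n), k))


if __name__ == "__main__":
    # Two candidate sequences for n = 4 computers with k = 2 crashes.
    print(worst_case_load(4, [1, 2, 3], 2))  # 3
    print(worst_case_load(4, [1, 3, 2], 2))  # 2
```

Under these assumptions the sequence 1, 3, 2 keeps the most loaded computer at two processes for every pair of crashes, while 1, 2, 3 allows three. Enumerating all failure sets grows combinatorially with n, which is consistent with the abstract's remark that a reformulation and branch-and-bound are needed for larger systems.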

Author information

Authors and Affiliations

School of Engineering, Blekinge Institute of Technology, 372 25, Ronneby, Sweden
Kamilla Klonowska, Håkan Lennerstad, Lars Lundberg & Charlie Svahnberg

Corresponding author

Correspondence to Kamilla Klonowska.

Additional information

Received: 26 May 2004, Published online: 14 March 2005

About this article

Cite this article

Klonowska, K., Lennerstad, H., Lundberg, L. et al. Optimal recovery schemes in fault tolerant distributed computing. Acta Informatica 41, 341–365 (2005). https://doi.org/10.1007/s00236-005-0161-7

Issue Date: May 2005
DOI: https://doi.org/10.1007/s00236-005-0161-7
