Optimal recovery schemes in fault tolerant distributed computing

Acta Informatica

Abstract.

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. The redistribution is determined by the recovery scheme. The recovery scheme is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for a maximum of twenty-one computers in the distributed system or cluster.
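
As a rough illustration of the quantity being minimized, the following sketch evaluates the worst-case maximal load of one candidate recovery scheme by plain brute force. It is not the authors' method, which avoids brute force through a reformulation and branch-and-bound. The model is an assumption made for this example only: each computer starts with one unit of load, and the load of a failed computer `i` moves to the first surviving computer among `(i + offset) mod n`, with offsets tried in the listed order. The names `max_load`, `worst_case_load`, and `offsets` are hypothetical, and how the article derives such offsets from its integer sequence is not stated in the abstract.

```python
from itertools import combinations


def max_load(n, offsets, down):
    """Largest load on any surviving computer once the computers in `down` fail.

    Assumed model (not taken from the article): every computer starts with
    load 1, and a failed computer i hands its load to the first surviving
    computer (i + off) % n, scanning `offsets` in order.
    """
    down = set(down)
    load = {c: 1 for c in range(n) if c not in down}
    for failed in down:
        for off in offsets:
            target = (failed + off) % n
            if target not in down:
                load[target] += 1
                break
        else:
            # No offset reaches a surviving computer.
            raise ValueError("offsets exhausted without a surviving target")
    return max(load.values())


def worst_case_load(n, offsets, k):
    """Maximal load under the most unfavorable set of k simultaneous failures."""
    return max(max_load(n, offsets, down) for down in combinations(range(n), k))


if __name__ == "__main__":
    # Four computers, two crashes: the order of the offsets matters.
    print(worst_case_load(4, [1, 2, 3], 2))  # prints 3
    print(worst_case_load(4, [1, 3, 2], 2))  # prints 2
```

Under these assumptions, the offset order `[1, 3, 2]` keeps the worst case at 2 for two crashes among four computers, while `[1, 2, 3]` lets a single survivor reach load 3. The check enumerates every failure set, so its cost grows combinatorially with n, which is consistent with the abstract's remark that the search has high complexity.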

Author information

Corresponding author

Correspondence to Kamilla Klonowska.

Additional information

Received: 26 May 2004, Published online: 14 March 2005

About this article

Cite this article

Klonowska, K., Lennerstad, H., Lundberg, L. et al. Optimal recovery schemes in fault tolerant distributed computing. Acta Informatica 41, 341–365 (2005). https://doi.org/10.1007/s00236-005-0161-7

  • DOI: https://doi.org/10.1007/s00236-005-0161-7
