Acta Informatica, Volume 41, Issue 6, pp 341–365

Optimal recovery schemes in fault tolerant distributed computing

  • Kamilla Klonowska
  • Håkan Lennerstad
  • Lars Lundberg
  • Charlie Svahnberg


Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among them. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. This redistribution is determined by the recovery scheme, which is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load, even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, in which brute-force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for distributed systems or clusters with at most twenty-one computers.
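As a rough illustration of the quantity the abstract describes (the load on the most heavily loaded computer under the least favorable set of crashes), the following Python sketch evaluates a given sequence by plain brute force. It is not the reformulation or the branch-and-bound search used in the paper. The model is an assumption made for this example: each computer starts with one unit of load, and the load of a crashed computer c moves to the first running computer among (c + s_1) mod n, (c + s_1 + s_2) mod n, and so on. The names `max_load` and `worst_case_load` are hypothetical.

```python
from itertools import accumulate, combinations


def max_load(n, seq, down):
    """Largest load on any running computer for one set of crashed computers.

    Assumed model (not taken from the paper): every computer carries one
    unit of load, and the load of crashed computer c goes to the first
    running computer among (c + s_1) % n, (c + s_1 + s_2) % n, ...
    """
    offsets = list(accumulate(seq))  # cumulative jumps along the recovery list
    load = {c: 1 for c in range(n) if c not in down}
    for c in down:
        for off in offsets:
            target = (c + off) % n
            if target not in down:
                load[target] += 1
                break
        else:
            # The sequence never reached a running computer.
            raise ValueError("recovery list exhausted")
    return max(load.values())


def worst_case_load(n, seq, k):
    """Maximal load over every way of crashing k of the n computers (k < n)."""
    return max(
        max_load(n, seq, set(down)) for down in combinations(range(n), k)
    )


if __name__ == "__main__":
    # Two candidate sequences for n = 4 computers with k = 2 crashes.
    print(worst_case_load(4, [1, 2, 3], 2))
    print(worst_case_load(4, [1, 1, 1], 2))
```

Under this assumed model, the sequence [1, 1, 1] can pile the load of two neighbouring crashed computers onto the same survivor, whereas [1, 2, 3] spreads it out. Enumerating all crash combinations grows quickly with n, which is why the abstract stresses that brute-force testing is avoided.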







Copyright information

© Springer-Verlag Berlin/Heidelberg 2005

Authors and Affiliations

  • Kamilla Klonowska (1)
  • Håkan Lennerstad (1)
  • Lars Lundberg (1)
  • Charlie Svahnberg (1)

  1. School of Engineering, Blekinge Institute of Technology, Ronneby, Sweden
