Simulating crash failures with many faulty processors (extended abstract)

  • Rida Bazzi
  • Gil Neiger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 647)

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.

Copyright information

© Springer-Verlag Berlin Heidelberg 1992

Authors and Affiliations

  • Rida Bazzi (1)
  • Gil Neiger (1)

  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
