Challenges and Directions in Fault-Tolerant Computing

Abstract

Two decades of theoretical and experimental work and numerous recent successful applications have established fault tolerance as a standard objective in computer system design. As with the objective of correctness, and in contrast to the objective of high speed, satisfaction of fault-tolerance requirements cannot be demonstrated by testing alone, but requires formal analysis. Most of the work in fault tolerance has been concerned with developing effective design techniques. Recent work on reliability modeling and formal proof of fault-tolerant design and implementation is laying a foundation for a more rigorous design discipline. The scope of concerns has also expanded to include any source of computer unreliability, such as design mistakes in software, hardware, or at any system level.

Current art is barely able to keep up with the rapid pace of computer technology, the stresses of new applications and the new expansion in scope of concerns. Particular challenges lie in coping with the imperfections of the ultrasmall, i.e., high-density VLSI, and the ultralarge, i.e., large software systems. It is clear that fault tolerance cannot be “added” to a design and must be integrated with other design objectives. Simultaneous demands in future systems for high performance, high security, high evolvability and high fault tolerance will require new theoretical models of computer systems and a much closer integration of practical design techniques.

Copyright information

© 1986 Plenum Press, New York

About this chapter

Cite this chapter

Goldberg, J. (1986). Challenges and Directions in Fault-Tolerant Computing. In: Güth, R. (eds) Computer Systems for Process Control. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-2237-5_3

  • DOI: https://doi.org/10.1007/978-1-4613-2237-5_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4612-9311-8

  • Online ISBN: 978-1-4613-2237-5

  • eBook Packages: Springer Book Archive
