Challenges and Directions in Fault-Tolerant Computing

  • J. Goldberg

Abstract

Two decades of theoretical and experimental work and numerous recent successful applications have established fault tolerance as a standard objective in computer system design. As with the objective of correctness, and in contrast to the objective of high speed, satisfaction of fault-tolerance requirements cannot be demonstrated by testing alone, but requires formal analysis. Most of the work in fault tolerance has been concerned with developing effective design techniques. Recent work on reliability modeling and formal proof of fault-tolerant design and implementation is laying a foundation for a more rigorous design discipline. The scope of concerns has also expanded to include any source of computer unreliability, such as design mistakes in software, in hardware, or at any other system level.

Current art is barely able to keep up with the rapid pace of computer technology, the stresses of new applications, and the recent expansion in the scope of concerns. Particular challenges lie in coping with the imperfections of the ultrasmall, i.e., high-density VLSI, and the ultralarge, i.e., large software systems. It is clear that fault tolerance cannot be “added” to a design but must be integrated with other design objectives. Simultaneous demands in future systems for high performance, high security, high evolvability, and high fault tolerance will require new theoretical models of computer systems and a much closer integration of practical design techniques.

Copyright information

© Plenum Press, New York 1986

Authors and Affiliations

  • J. Goldberg, SRI International, Menlo Park, USA
