Reliable computing systems

  • B. Randell
Chapter 3: Issues and Results in the Design of Operating Systems
Part of the Lecture Notes in Computer Science book series (LNCS, volume 60)

Abstract

The paper presents an analysis of the various problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and techniques of fault tolerance. Topics covered include (i) differing types of reliability requirement, (ii) forms of protective redundancy in hardware and software systems, (iii) methods of structuring the activity of a system, using atomic actions, so as to limit information flow, (iv) error detection techniques, (v) strategies for locating and dealing with faults, and for assessing the damage they have caused, and (vi) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception and compensation. A set of appendices provide summary descriptions and analyses of a number of computing systems that have been specifically designed with the aim of achieving very high reliability.

Keywords

fault tolerance; failure; error; fault; system structure; hardware/software reliability

Copyright information

© Springer-Verlag Berlin Heidelberg 1978

Authors and Affiliations

  • B. Randell (1)
  1. University of Newcastle upon Tyne, Newcastle upon Tyne, England
