
On-Line Fault Monitoring

Journal of Electronic Testing

Abstract

Sequoia's fault-tolerant computers were designed subject to some rather rigid constraints: no single hardware malfunction can generate an undetected error; an integrated circuit is a “black box” that can fail in arbitrary ways, affecting an arbitrary subset of input and output signals; faults can be transient or intermittent with arbitrary durations and repetition intervals. Moreover, the incremental hardware used to achieve these goals was to be kept to a minimum. The resulting computers do, to a very large extent, satisfy these constraints. To achieve this, a combination of fault-monitoring techniques was used, including: bit and nibble error-correcting and error-detecting codes; byte parity codes with orthogonal partitioning; cyclic-residue codes on I/O data transfers; codes designed to protect against address counter overruns on I/O transfers; and lossless control-signal compactors. The nature of and rationale for these various fault monitors are described, as well as the analytical and testing techniques used to estimate the resulting coverage.
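To make one of the listed techniques concrete, the following is a minimal sketch of a cyclic-residue check on a block of I/O data. The abstract does not say which generator polynomial or word size Sequoia used, so the CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF) and the function names `crc16_ccitt`, `attach_residue`, and `verify_block` are assumptions chosen purely for illustration.

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16 (CCITT-FALSE parameters, assumed for illustration)."""
    crc = init
    for byte in data:
        # Fold the next byte into the high-order bits of the register.
        crc ^= byte << 8
        for _ in range(8):
            # Shift left one bit; if a 1 falls off the top, subtract
            # (XOR) the generator polynomial.
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def attach_residue(block: bytes) -> tuple[bytes, int]:
    """Sender side: compute the residue that travels with the block."""
    return block, crc16_ccitt(block)


def verify_block(block: bytes, residue: int) -> bool:
    """Receiver side: recompute the residue and compare."""
    return crc16_ccitt(block) == residue


if __name__ == "__main__":
    block, residue = attach_residue(b"I/O transfer payload")
    print(verify_block(block, residue))        # intact block

    # Flip a single bit to mimic a transient fault on the transfer path.
    corrupted = bytearray(block)
    corrupted[3] ^= 0x04
    print(verify_block(bytes(corrupted), residue))
```

A code of this kind detects every single-bit error and every error burst no longer than the residue width, which is one reason cyclic codes are a common choice for block transfers; the coverage achieved in the actual design is what the paper itself estimates.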




Cite this article

Stiffler, J. On-Line Fault Monitoring. Journal of Electronic Testing 12, 21–27 (1998). https://doi.org/10.1023/A:1008201032535


  • DOI: https://doi.org/10.1023/A:1008201032535
