Abstract
Sequoia's fault-tolerant computers were designed subject to some rather rigid constraints: no single hardware malfunction can generate an undetected error; an integrated circuit is a “black box” that can fail in arbitrary ways, affecting an arbitrary subset of input and output signals; and faults can be transient or intermittent, with arbitrary durations and repetition intervals. Moreover, the incremental hardware used to achieve these goals was to be kept to a minimum. The resulting computers do, to a very large extent, satisfy these constraints. To achieve this, a combination of fault-monitoring techniques was used, including bit and nibble error-correcting and error-detecting codes; byte parity codes with orthogonal partitioning; cyclic-residue codes on I/O data transfers; codes designed to protect against address counter overruns on I/O transfers; and lossless control-signal compactors. The nature of and rationale for these various fault monitors are described, as well as the analytical and testing techniques used to estimate the resulting coverage.
Stiffler, J. On-Line Fault Monitoring. Journal of Electronic Testing 12, 21–27 (1998). https://doi.org/10.1023/A:1008201032535
DOI: https://doi.org/10.1023/A:1008201032535