Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures
Dependable computer architectures used in critical embedded real-time applications have successfully employed Byzantine resilience techniques to tolerate physical, internal, operational faults. The dominant cause of failure of a correctly designed Byzantine resilient computer today is the common-mode failure, i.e., the nearly simultaneous failure of multiple redundant copies, generally due to a single cause. Unlike independent hardware faults, for which theoretically rigorous fault tolerance solutions have been implemented, the sources of common-mode failures are so diverse that numerous disparate techniques are required to predict, avoid, remove, and tolerate them.
This paper describes the technical approach being used to reduce the probability of common-mode failure in the Draper Fault Tolerant Parallel Processor (FTPP), which has been designed for critical embedded real-time applications. It begins by placing common-mode failures in the context of overall impairments to dependability, to clarify their importance relative to other failure sources. The FTPP's approach to tolerating independent hardware faults is briefly motivated and described. The overall strategy for common-mode failure reduction comprises three major areas: common-mode failure avoidance, removal, and tolerance. For fault avoidance, a novel integrated formal methods and VHDL design methodology has been developed and applied. Common-mode fault tolerance techniques include a combination of on-line checking of the timing and functional behavior of operating system and application tasks, use of a formally verified system diagnosis processor to diagnose overall system health, and system-wide recovery actions. Techniques for reducing the probability of common-mode failure due to performance timing faults are also discussed.
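The distinction between independent faults and common-mode failures can be illustrated with a minimal sketch. It is not the FTPP's mechanism: it uses a plain majority voter over redundant channel outputs, which is a deliberate simplification of Byzantine resilient agreement (the latter additionally requires at least 3f+1 participants and rounds of interactive exchange to tolerate f arbitrary faults). The function name `majority_vote` and the sample values are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value held by a strict majority of channels, or None.

    Hypothetical helper for illustration; not taken from the FTPP design.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Assume the correct result is 42 and there are four redundant channels.

# Independent fault: one channel produces a wrong value and is outvoted.
independent = majority_vote([42, 42, 7, 42])

# Common-mode failure: a single cause (for example, a shared design error)
# corrupts every channel identically, so the voter agrees on a wrong value.
common_mode = majority_vote([41, 41, 41, 41])

# No strict majority: the voter can only report disagreement.
split = majority_vote([1, 2, 3, 4])

print(independent, common_mode, split)
```

In this sketch the voter masks the independent fault but cannot detect the common-mode case, because all redundant copies agree. This is why redundancy alone is insufficient and additional avoidance, removal, and tolerance techniques such as those listed above are needed.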
Keywords: Common-mode failure tolerance, formal methods, VHDL, automated design tools, Byzantine resilience