Overview
Fault tolerance is essential in safety-critical real-time systems: without it, a single component failure can lead to a catastrophic system failure. This chapter starts with an explanation of the concepts of failure, error, and fault. It then turns to error detection, which requires knowledge about the intended behavior of a system. This knowledge can stem either from a priori established regularity constraints and known properties of the correct behavior of a computation, or from a comparison of the results computed by two redundant channels. Techniques for detecting timing errors and value errors are discussed.
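The two sources of knowledge named above can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the range and rate-of-change limits, and the tolerance parameters are all hypothetical, chosen only to show the shape of an a priori check, a two-channel comparison, and a timing check.

```python
def plausible(value: float, previous: float,
              lo: float, hi: float, max_step: float) -> bool:
    """A priori value check (assumed constraints): the sample must lie
    inside a known range and must not jump by more than max_step."""
    return lo <= value <= hi and abs(value - previous) <= max_step


def channels_agree(a: float, b: float, tolerance: float = 0.0) -> bool:
    """Comparison check: two redundant channels computed the same result.
    tolerance = 0.0 demands exact agreement; a nonzero tolerance is an
    assumption for channels that are not bit-identical."""
    return abs(a - b) <= tolerance


def on_time(arrival: float, expected: float, window: float) -> bool:
    """Timing check: the result arrived within +/- window of the instant
    at which it was expected (all times in the same unit)."""
    return abs(arrival - expected) <= window
```

A value error shows up as a failed `plausible` or `channels_agree` check, a timing error as a failed `on_time` check. The a priori check needs no redundancy but only catches violations of the stated constraints, whereas the comparison check needs a second channel but requires no knowledge of what the correct result looks like.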
In a distributed system, a node is an appropriate unit of failure. A node implements a self-contained function so that the established architectural principle “form follows function” can be maintained even in a failure scenario. The node implementation must map all internal node failures into simple external failure modes. The problem of node failure detection and membership in event-triggered and time-triggered architectures is examined. A set of replica-determinate nodes is grouped together to form a fault-tolerant unit (FTU) that masks a failure of one of its nodes. Two different types of fault-tolerant units are introduced, and the problem of reintegrating a node into an operating cluster is taken up. The key issue is to find a reintegration point where the h-state (history state) of the node is minimal. Different techniques for h-state minimization are discussed.
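As an illustration of how an FTU can mask the failure of one node, the sketch below assumes a unit of three replicas whose results pass through a majority voter. The abstract does not say which two FTU types the chapter introduces, so this configuration, the function name `majority_vote`, and the use of `None` for a silent node are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value delivered by a strict majority of the replicas.

    A replica that delivered nothing is represented by None (assumed
    convention). Returns None when no value reaches a majority.
    """
    present = [r for r in results if r is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    # The majority is taken over all replicas, not only the responding ones.
    return value if count > len(results) // 2 else None
```

For example, `majority_vote([42, 42, 7])` and `majority_vote([42, None, 42])` both yield `42`, so one wrong or silent node stays hidden from the consumer. Exact-match voting of this kind only works if correct replicas produce identical results, which is why the nodes of an FTU must be replica determinate.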
The final section discusses the utility of design diversity in the implementation of safety-critical systems. An industrial example of a fail-safe system that uses design diversity to increase the safety of the application is described.
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
(2002). Fault Tolerance. In: Real-Time Systems. The International Series in Engineering and Computer Science, vol 395. Springer, Boston, MA. https://doi.org/10.1007/0-306-47055-1_6
DOI: https://doi.org/10.1007/0-306-47055-1_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-9894-3
Online ISBN: 978-0-306-47055-4
eBook Packages: Springer Book Archive