While Table 3 discusses several technology levels in a system or component, the focus is on the hardware (electronics) and software levels. The lowest level is largely the continuous domain where the rules and laws of material science apply. In general, this domain is well understood and applying design and safety margins mitigates most safety risks. In addition, components in this domain often exhibit graceful degradation, a property that inherently contributes to safety. This even applies to the semiconductor materials used for developing programmable chips.
The levels related to the environment and to the user/operator of a system mostly concern external factors that can create hazardous situations. Hence these must be considered when developing the system, and they play an important role in the HARA. However, because these factors are external and often unique to every system, the potential for reuse (except, for example, in identifying reusable patterns and scenarios) is limited.
In this paper, the focus is on how a component or subsystem can be reused in the context of a safety-critical application. This is mostly an issue in the hardware and software domains because these technology domains are characterized by very large state spaces. In addition, as mentioned before, such systems will often operate in a dynamic and reconfigurable way. Moreover, a component developed in these discrete technologies can, practically speaking, fail in a single instant in time. To mitigate these risks, the ARRL levels explicitly take into account the fault behavior as well as the desired state after a fault has occurred. This results in derived requirements for the architecture of the component, for the contract it carries and for the evidence that supports it. The evidence will therefore also relate to the process followed to develop the component. To clarify the ARRL levels, a more visual representation is used and discussed below.
The ARRL component view
Figure 2 illustrates the generic view of a component. It is seen as a functional block that accepts input vectors, processes them and generates output vectors. In the general sense, the processing can be seen as the transfer function of the component. While the latter terminology is mostly used in the continuous domain, in the discrete domain the transfer function is often a state machine or a collection of concurrent state machines. Important for the ARRL view is that the processing function is not linked directly to the inputs and outputs but only via component interfaces that operate as guards.
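As an illustration of this view, the following minimal sketch (ours, not from the paper) models a component whose transfer function can only be exercised through guard interfaces on the inputs and outputs; all names, such as `GuardedComponent`, are illustrative.

```python
# Minimal sketch of the generic ARRL component view: the processing
# (transfer) function is never invoked directly, only via guard
# interfaces on the inputs and outputs. All names are illustrative.

class GuardedComponent:
    def __init__(self, transfer, input_guard, output_guard):
        self.transfer = transfer          # e.g. a state machine step
        self.input_guard = input_guard    # validates input vectors
        self.output_guard = output_guard  # validates output vectors

    def step(self, inputs):
        if not self.input_guard(inputs):
            raise ValueError("input vector rejected by guard")
        outputs = self.transfer(inputs)
        if not self.output_guard(outputs):
            raise ValueError("output vector rejected by guard")
        return outputs

# A trivial transfer function with guards on the legal ranges.
component = GuardedComponent(
    transfer=lambda x: 2 * x,
    input_guard=lambda x: 0 <= x <= 100,
    output_guard=lambda y: 0 <= y <= 200,
)
print(component.step(21))  # -> 42
```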
An illustrated ARRL-1 component
As ARRL-0 provides no assurance at all about behavior, we can gracefully skip this level and start with ARRL-1 (Fig. 3). Such a component can only be partially “trusted”, i.e. as far as it was tested. The uncertainty relates to unanticipated input values, doubts about whether the input/output guards are complete, remaining errors in the processing function and invalid assumptions (e.g. erroneous requirements [19]); hence there can be unanticipated output values. In other words, while a test report provides some evidence, the absence of errors is not guaranteed, and therefore an ARRL-1 component cannot be used as such for safety-critical systems.
An illustrated ARRL-2 component
An ARRL-2 component (Fig. 4) covers the holes left at the ARRL-1 level. To reach completeness in the absence of errors, we first of all assume that the underlying hardware (at the material level) does not introduce any faults from which errors can result. We therefore speak of “logical correctness” in the absence of faults. This level can only be reached if there is formal evidence supporting such a claim. At the hardware level, this means for example extensive design verification, extensive testing and even burn-in of components to find any design- or production-related issues. At the software level we could require formal proof that no remaining errors exist. If that is not practical, formal evidence might also result from “proven in use” arguments, whereby stress testing can be mandatory. The latter are weaker arguments than those provided by formal techniques, but even when formal techniques are used, one can never be 100 % sure, because even formal models can contain errors; nevertheless, they generally increase the confidence significantly. Such errors can further be mitigated by additional process steps (like reviews, continuous integration and validation), but in essence the residual errors should have a probability that is as low as practically feasible, so that in practice the component can be considered error-free and hence fully trustworthy, at least as long as no faults induce errors.
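While ARRL-2 calls for formal evidence, even a lightweight executable contract makes the “logical correctness in absence of faults” claim precise enough to check. The sketch below (illustrative names; not a substitute for formal proof) states pre- and postconditions as assertions:

```python
# Sketch of an executable pre/postcondition contract. This is not the
# formal evidence ARRL-2 asks for, but it shows how a correctness claim
# can be stated precisely enough to be checked mechanically.

def saturating_add(a: int, b: int, limit: int = 255) -> int:
    # Precondition: inputs are within the specified domain.
    assert 0 <= a <= limit and 0 <= b <= limit, "precondition violated"
    result = min(a + b, limit)
    # Postcondition: output stays within the legal range.
    assert 0 <= result <= limit, "postcondition violated"
    return result

print(saturating_add(200, 100))  # -> 255 (clamped to the limit)
```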
An illustrated ARRL-3 component
An ARRL-3 component (Fig. 5) first of all inherits the properties of ARRL-2. This means that its behavior is logically correct with respect to its specifications in the absence of faults. ARRL-3 additionally introduces the following:
- Faults (by default induced by the hardware or by the environment) are detected.
- Faulty input values are remapped to a valid range (e.g. by clamping), whereby a valid-range value is one that is part of the logically correct behavior.
- Two processing units are used. These can be identical or dissimilar, as long as faults are detected before the component can propagate them as erroneous values to other components.
- Faults induced in the components are detected by comparison at the outputs.
- The output values are kept within a legal range, hence faulty values will not result in error propagation that can generate errors downstream in the system.
Note that the above does not exclude more sophisticated approaches. Certain faults induced in each sub-unit, typically transient faults, can be detected and corrected locally so that the output remains valid. The second processing unit can also be very different and only act as a monitor (which assumes that faults are independent in time and space). Common mode failures remain a risk.
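A minimal sketch of such an ARRL-3 style component, assuming illustrative names and ranges, could look as follows:

```python
# Minimal sketch of an ARRL-3 style component: inputs are clamped to a
# valid range, two (possibly dissimilar) processing units run in
# parallel, and their outputs are compared before anything leaves the
# component. On a mismatch the component signals a fault instead of
# propagating an erroneous value. All names are illustrative.

def clamp(value, low, high):
    """Remap a possibly faulty value onto the valid range."""
    return max(low, min(high, value))

class ChannelMismatch(Exception):
    """Raised when the two processing units disagree (fault detected)."""

def arrl3_step(raw_input, unit_a, unit_b, low=0, high=100):
    x = clamp(raw_input, low, high)
    out_a, out_b = unit_a(x), unit_b(x)
    if out_a != out_b:
        # Fail silent: detect the fault, do not propagate a value.
        raise ChannelMismatch(f"channels disagree: {out_a} != {out_b}")
    return clamp(out_a, low, high)  # keep the output in its legal range

# Identical units here; a dissimilar monitor would work the same way.
print(arrl3_step(250, lambda x: x // 2, lambda x: x // 2))  # -> 50
```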
An illustrated ARRL-4 component
ARRL-3 components detect failures and prevent error propagation, but a failure still results in the system losing its intended functionality. This is because the redundancy is too low to reconstruct the correct state of the system. An ARRL-4 component addresses this issue by applying N-out-of-M (\(N < M\), \(N \ge 2\)) voting, applied to the inputs as well as to the outputs. This safeguards the functionality provided at the ARRL-3 level and is a crude form of graceful degradation. The solution also assumes independence of faults in the M “channels”, whereby most common mode failures are mitigated. This boundary condition often implies that no state information (such as introduced by the power supply) may propagate to another channel.
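A minimal sketch of the voting principle (ours, not the paper's design) is the classic 2-out-of-3 majority vote:

```python
# Sketch of 2-out-of-3 majority voting as used at ARRL-4: three
# independent channels compute the result and the voter masks a single
# faulty channel, preserving the function. Illustrative only; a real
# voter must itself be trustworthy (see the composition rules below).

from collections import Counter

def vote_2oo3(a, b, c):
    counts = Counter([a, b, c])
    value, n = counts.most_common(1)[0]
    if n < 2:
        raise RuntimeError("no majority: more than one channel failed")
    return value

# One channel delivers a corrupted value; the majority masks it.
print(vote_2oo3(42, 42, 17))  # -> 42
```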
Note that while the diagram uses a coarse-grained representation, some systems apply this principle at the micro-level. For example, radiation-hardened processors can be designed to tolerate Single Event Upsets by applying triplication and voting at the gate level. This does not address all common mode failures (like power supply issues), but often such a component can still be classified as an ARRL-4 component (Fig. 6) (implying, in this example, that the power supply is very trustworthy).
An illustrated ARRL-5 component
An ARRL-4 component provides continuity of its functionality but can still fail due to residual common mode failures. Most of these residual common mode failures are process related. Typical examples are specifications that are incomplete, or that are wrong due to misinterpretation. Another class of failures can be time dependent. To mitigate the resulting risks, diversity is used. This can cover using completely different technologies, different teams, different algorithms, and even time shifting or orthogonal placement of the sub-components to reduce the influence of externally induced magnetic fields. In the figure (Fig. 7) this is visualized by using different colors.
This diversity technique is an underlying principle in most safety engineering processes, for example the requirement that tests be done by people other than those who developed the item. It also has the consequence that such an architecture works with a minimum of asynchronicity, whereby the subcomponents “handshake” (within a time window), which is only possible if the sub-components can be trusted in the sense of ARRL-2 or ARRL-3.
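As a loose illustration of diverse channels agreeing within a time window, consider the following sketch. The two “designs” are merely two different algorithms for the same specification, and the window check is simplistic; in a real system the channels would run concurrently on dissimilar hardware.

```python
# Sketch of design diversity at ARRL-5: two channels built differently
# (here, two algorithms for the same specification) must agree within
# a time window before the result is released. Illustrative only; real
# diversity spans technology, teams and tooling, not just code.

import time

def mean_iterative(xs):
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)

def mean_builtin(xs):
    return sum(xs) / len(xs)

def diverse_handshake(xs, window_s=0.1):
    deadline = time.monotonic() + window_s
    a = mean_iterative(xs)      # channel A (first design)
    b = mean_builtin(xs)        # channel B (independent design)
    if time.monotonic() > deadline:
        raise TimeoutError("channels did not agree within the window")
    if abs(a - b) > 1e-9:       # tolerance for numerical diversity
        raise RuntimeError("diverse channels disagree")
    return a

print(diverse_handshake([1.0, 2.0, 3.0, 4.0]))  # -> 2.5
```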
Rules of composition (non-exhaustive)
A major advantage of the ARRL criterion is that we can now define a simple rule for composing safety-critical systems. We use here an approximate mapping to the different SIL definitions by taking into account the recommended architecture for reaching a certain SIL level.
“A system can only reach a certain SIL level if all its components are at least of the same ARRL level or if they are arranged into a whole that exhibits a higher ARRL level due to the application of a fault tolerant architecture.”
The following side-conditions apply:
- The composition rule defines a necessary condition, not a sufficient condition. Application-specific layers must also meet the ARRL criterion.
- ARRL-4 components can be composed out of ARRL-3 components using redundancy. This requires an additional ARRL-4 voting component.
- ARRL-3 components can be composed using ARRL-2 components (using at least two, whereby the second instance acts as a monitor; a sketch follows below).
- All interfaces and interactions need to have the same ARRL level as the components.
- Error propagation is to be prevented; hence a partitioning architecture (using a distributed hardware and concurrent software architecture) is a must.
- ARRL-5 requires an assessment of the certification of independent development and, when applied to software components, a certified absence of correlated errors.
- A benefit of the approach is that it leaves less room for ad hoc, often questionable and difficult-to-verify decompositions of SIL levels. While this might increase the initial cost, it will likely be cost-efficient over the lifespan of a given technology and reduce the overall development cost.
Figure 8 illustrates this for a (simplified) 2-out-of-3 voter. Note that the crossbar also implements an ARRL-4 architecture.
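As a minimal illustration of the monitor-based side-condition above (an ARRL-3 component composed from two ARRL-2 components), consider the following sketch; the wrapper and all names are ours, not from the paper.

```python
# Sketch of the composition rule in code: an ARRL-3 component built
# from two ARRL-2 components, the second acting as a monitor that
# checks the first. The wrapper fails silent on disagreement.

class MonitoredComponent:
    """ARRL-3 style wrapper around a main unit and a monitor unit."""

    def __init__(self, main, monitor):
        self.main = main        # ARRL-2 component doing the work
        self.monitor = monitor  # independent ARRL-2 check

    def step(self, x):
        y = self.main(x)
        if not self.monitor(x, y):
            raise RuntimeError("monitor rejected the output (fail silent)")
        return y

# Main unit squares its input; the monitor re-checks the relation.
square = MonitoredComponent(
    main=lambda x: x * x,
    monitor=lambda x, y: y == x * x and y >= 0,
)
print(square.step(7))  # -> 49
```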
The role of formal methods
ARRL-2 introduces the need for formal correctness. This might lead to the conclusion that ARRL-2 makes the use of formal techniques mandatory and that these techniques guarantee correctness. This view needs further nuance.
In recent years, formal methods have been gaining attention. This is partly driven by the fact (and the awareness) that testing and verification can never provide complete coverage of all possible errors, in particular for discrete systems and specifically for software. This is problematic because safety and security issues often concern so-called “corner cases” that do not manifest themselves very often. Formal methods, however, have the potential to cover all cases, either by using formal model checkers (which automatically verify all possible states of the model) or by formal proofs (based on mathematical reasoning). In general we can further distinguish two domains: the numerical accuracy and stability domain, and the event domain, in which the state space itself is verified. Often the same techniques cannot be applied to both.
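As a toy illustration of the exhaustive coverage that distinguishes model checking from testing, the following sketch (ours; real model checkers are far more sophisticated in both algorithms and input languages) enumerates every reachable state of a small state machine and checks an invariant in each one:

```python
# Tiny illustration of what a model checker does: exhaustively explore
# every reachable state of a small state machine and check an invariant
# in each one, so corner cases cannot be missed the way they can be
# with testing. Brute-force search, for illustration only.

def reachable_states(initial, transitions):
    seen, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Model: a counter that wraps at 4; invariant: it never reaches 7.
transitions = lambda s: [(s + 1) % 5]
states = reachable_states(0, transitions)
assert all(s != 7 for s in states), "invariant violated"
print(sorted(states))  # -> [0, 1, 2, 3, 4]
```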
Practice has shown that using formal methods can greatly increase the trustworthiness of a system or component. Often it will lead to the discovery of logical errors and incomplete assumptions about the system. Another benefit of using formal methods during the design phase is that it helps in finding cleaner, more orthogonal architectures that have the benefit of less complexity and hence provide a higher level of trustworthiness as well as efficiency [20]. One can therefore be tempted to say that formal methods not only provide correctness (in the sense of the ARRL-2 criterion) but also assist in finding more efficient solutions.
Formal methods are however not sufficient and are certainly not a replacement for testing and verification. Formal methods imply the development of a (more abstract) model, and this model, too, cannot cover all aspects of the system, especially non-functional ones. It might even be incomplete or wrong if based on wrong assumptions (e.g. on how to interpret the system’s requirements). Formal methods also suffer from complexity barriers, typically manifested as a state-space explosion that makes their use impractical. The latter, however, is a strong argument for developing a composable architecture that uses small but well-proven trustworthy components, as advocated by the ARRL criterion. At the same time, the ARRL criterion shows that formal models must also model the additional functionality that each ARRL level requires. This is in line with what John Rushby puts forward in his paper [21], where he outlines a formally driven methodology for safe reuse of components by taking the environment into account.
The other element is that practice has shown that developing a trustworthy system also requires a well-managed engineering process in which the human factor plays a crucial role [7]. Moreover, processes driven by short iteration cycles, whereby each cycle ends with a validation or a (partial) integration, have proven to be more cost-efficient as well as more trustworthy, with fewer residual issues. Formal methods are therefore not a miracle cure. Their use is part of a larger process that aims at reaching trustworthiness. The benefit of using formal methods early in the design phase is that they contribute to reducing the state space at an early stage, so that the cost and effort of fixing issues discovered later in the process are much reduced. In the context of the ARRL criterion they increase the assurance level considerably because of the completeness of the verification, a goal that is only marginally reachable by testing alone.
Applying ARRL on a component
Texas Instruments offers an ARM-based microcontroller (MCU) with a specific architecture aimed at supporting embedded safety-critical applications (Fig. 9) [22]. The MCU has many features that support this claim. The most important one is that the ARM CPU adopts an ARRL-3 architecture whereby both CPU cores are lock-stepped. In case of a difference between the two CPUs, the MCU is halted. To mitigate common mode failures, a time delay of two clock pulses is used, and in addition the two cores are rotated by 90\(^\circ \) to reduce e.g. electromagnetic disturbances. In addition, Memory Protection Units (MPUs) allow the programmer to partition the software into isolated memory blocks. The chip also has quite a number of additional safety (or rather: reliability) features. For example, most memory has error-correcting logic to handle bit errors.
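The delayed-lockstep idea can be illustrated with a small simulation. The sketch below is our cycle-level simplification, not TI's implementation: a checker core consumes the same input stream a fixed number of cycles behind the main core, so that a single transient disturbance is unlikely to corrupt both results identically, and any divergence halts the system.

```python
# Simplified cycle-level model of delayed lockstep: the checker core
# executes the same input stream `delay` cycles behind the main core
# and results for the same input are compared. Illustrative only.

from collections import deque

def delayed_lockstep(inputs, core, delay=2):
    pending = deque()          # main-core results awaiting comparison
    delayed = deque(inputs)    # checker core consumes the same inputs
    for i, x in enumerate(inputs):
        pending.append(core(x))
        if i >= delay:         # checker runs `delay` cycles behind
            expected = core(delayed.popleft())
            if pending.popleft() != expected:
                raise RuntimeError("lockstep mismatch: MCU halted")
    return "ok"

print(delayed_lockstep([1, 2, 3, 4, 5], core=lambda x: x + 1))  # -> ok
```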
At first sight the MCU could be classified as an ARRL-3 component because the processing cores are configured in lockstep mode. However, the chip also has a large number of peripherals present in a single instance on the chip. While some (but not all) have parity bits, they can most likely be classified as ARRL-2 components. In addition, the chip has a programmable timer block (with its own small controller) that is not protected against faults at all. Note that this is deduced from the publicly available documentation; further information might have an impact on these conclusions.
What can we conclude from this, admittedly superficial, exercise? First of all, while the MCU core processor can be classified as ARRL-3, most of the peripherals are ARRL-2 or even ARRL-1. Hence the whole MCU, even though it supports safety-critical applications better than most off-the-shelf MCUs, is still an ARRL-2 component, unless some of the peripherals are left unused or their faults are mitigated at the software level. Secondly, ARRL components must carry a contract and the supporting evidence. Even if the documentation supplied by the manufacturer is extensive, it is not in a form that allows a definite conclusion to be drawn. This is in line with the requirements of safety standards, whereby extensive process evidence as well as supporting documentation is required to qualify or certify a system or sub-system.
The example also clearly shows that from the ARRL-3 level onwards it becomes difficult to develop software components in isolation from the hardware they run on (ARRL-2 level software is assumed to be perfectly error-free in the absence of hardware faults). This is because the additional fault handling at ARRL-3, compared with ARRL-2, is hardware- and often application-specific. Nevertheless, it is partially possible by strictly specifying the boundary values that are valid for the software component. The errors resulting from hardware faults can then be trapped in the interface layer, which itself can be considered a software component that will often make use of the underlying hardware support.
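The interface-layer idea can be sketched as follows; the sensor range, names and exception type are illustrative assumptions, not taken from the paper.

```python
# Sketch of the interface layer described above: a thin software
# component that validates boundary values before they reach an
# ARRL-2 software component, trapping errors that stem from hardware
# faults. The range and names are illustrative assumptions.

SENSOR_RANGE = (0.0, 5.0)   # legal voltage range for the input

class HardwareFaultTrapped(Exception):
    pass

def interface_layer(raw_value, component):
    low, high = SENSOR_RANGE
    if not (low <= raw_value <= high):
        # Out-of-range value: treat it as a hardware-induced error and
        # trap it here instead of handing it to the component.
        raise HardwareFaultTrapped(f"{raw_value} outside {SENSOR_RANGE}")
    return component(raw_value)

# The wrapped component may assume its preconditions hold (ARRL-2).
print(interface_layer(2.5, lambda v: v * 100))  # -> 250.0
```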
SIL and ARRL are complementary
The ARRL level criterion is not a replacement for the SIL level criterion. It is complementary, in the same sense that the HARA and the FMEA are complementary (Fig. 10). The HARA is applied top-down, whereby the system is considered in its environment, including the possible interactions with a user or operator. The goal of the HARA is to find the situations whereby a hazard can result in a safety risk. The outcome is essentially a number of safety measures that must be part of the system design, without necessarily prescribing how these are to be implemented.
The FMEA takes a complementary approach once the implementation architecture has been selected. FMEA aims at identifying the faults that are likely to result in errors ultimately leading to a system failure whereby a safety risk can be encountered. Hence the HARA and the FMEA meet in the middle, confirming each other's findings.
By introducing the ARRL criterion we take a first step towards making the process more normative and generic; of course this is still a tentative step, because it will require validation in real test cases. The SIL is a top-level requirement decomposed into normal-case requirements (ARRL-1 and -2) and fault-case requirements (ARRL-3, -4 and -5). From a functional point of view, all ARRL levels provide the same functionality but with different degrees of assurance, and hence of trustworthiness from the point of view of the user. Concretely, different ARRL levels do not modify the functional requirements and specifications of the components. The normative ARRL requirements result in additional functional specifications and corresponding functional support that assure that faults do not result in the functional specifications being jeopardized.

The non-functional specifications might be impacted as well. For example, the additional functionality will require more resources (e.g. memory, energy and CPU cycles) and is likely to increase the cost price of the system. However, it provides a way to reuse components with less effort from one domain to another in a product family. For example, a computer module (specified for compatible environmental conditions) can be reused between different domains. The same applies to software components. However, this requires that the components are more completely specified than is now often the case. ARRL components carry a contract and the supporting evidence that they will meet this contract given a specific set of fault conditions. Note that when using formal methods, each of these ARRL levels also requires different formal models: the higher-level ARRL models must model the fault behavior in conjunction with the normal behavior, just like invariants are part of the formal models.

By defining a composition rule of ARRL components to achieve a certain level of safety, we now also define safety in a quasi-domain-independent way, simplifying the safety engineering process. Note however that any safety-critical system still has an application-specific part that must be developed to meet the same ARRL level in order to reach the required SIL.
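To make the notion of a component carrying its contract and evidence more tangible, here is a hypothetical, machine-readable record; all field names and values are our illustrative assumptions.

```python
# Sketch of a "contract plus evidence" record of the kind an ARRL
# component would carry: the functional specification is the same at
# every level, but the fault conditions covered and the supporting
# evidence differ. All fields are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class ARRLContract:
    component: str
    arrl_level: int
    functional_spec: str               # same across ARRL levels
    fault_conditions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

contract = ARRLContract(
    component="brake_controller",
    arrl_level=3,
    functional_spec="output = f(pedal_position), 10 ms deadline",
    fault_conditions=["single transient fault detected, fail silent"],
    evidence=["test report", "formal model of comparator", "FMEA"],
)
print(contract.arrl_level)  # -> 3
```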
An ARRL inspired process flow
We can now also define an ARRL-inspired process flow. It is strictly top-down for the requirements engineering part, and bottom-up for developing the architecture. The reader should note that such a process is not limited to safety engineering but rather treats safety engineering as a special case of systems engineering in general.
The flow is shown, in simplified form, in Table 5. For simplicity, we merged the ARRL-1 and -2 levels, as de facto ARRL-1 provides very little assurance in terms of safety.
Table 5 An ARRL driven process flow