In this section, we describe how CPSDebug works with the help of the case study introduced in Sect. 3. Figure 5 illustrates the main steps of the workflow. Briefly, the workflow starts from a target CPS model and a test suite with some passing and failing test cases and produces a failure explanation for each failing test case. The workflow consists of three sequential phases:
(i) Testing, which exercises the previously instrumented CPS model with the available test cases to collect information about its behavior, both for passing and failing executions;

(ii) Mining, which mines properties from the traces produced by passing test cases (intuitively, these properties capture the expected behavior of the model);

(iii) Explaining, which uses the mined properties to analyze the traces produced by failing test cases and to generate failure explanations, including information about the root events responsible for the failure and their propagation.
Testing
CPSDebug starts by instrumenting the CPS model. This pre-processing step is performed before testing the model and allows CPSDebug to log the internal signals of the model. Model instrumentation is defined inductively on the hierarchical structure of the Simulink/Stateflow model and is performed in a bottom-up fashion. For every signal variable of real, Boolean, or enumeration type, CPSDebug assigns a unique name to the variable and makes the simulation engine log its values. CPSDebug similarly instruments look-up tables and state machines. Each look-up table is associated with a dedicated variable whose simulation trace reports, at every point in time, the unique index of the cell exercised by the input. Each state machine is associated with two dedicated variables, one reporting the transitions taken and one reporting the locations visited during the simulation. We denote by V the set of all instrumented model variables.
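The actual instrumentation relies on the logging facilities of the Simulink simulation engine. Purely as an illustration of the bottom-up, path-based naming scheme described above, the following Python sketch walks a toy block hierarchy and collects a set V of uniquely named loggable variables; the `Block` structure and the type tags are assumptions made for this example and are not part of CPSDebug.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LOGGABLE_TYPES = {"real", "bool", "enum"}  # signal types that get logged

@dataclass
class Block:
    """Toy stand-in for a Simulink block: typed output signals plus sub-blocks."""
    name: str
    signals: Dict[str, str] = field(default_factory=dict)   # signal name -> type
    children: List["Block"] = field(default_factory=list)

def instrument(block: Block, prefix: str = "") -> Dict[str, str]:
    """Walk the hierarchy bottom-up and assign a unique, path-based name
    to every loggable signal (the set V of instrumented variables)."""
    path = f"{prefix}/{block.name}" if prefix else block.name
    instrumented: Dict[str, str] = {}
    for child in block.children:                  # recurse into sub-blocks first
        instrumented.update(instrument(child, path))
    for sig, sig_type in block.signals.items():
        if sig_type in LOGGABLE_TYPES:
            instrumented[f"{path}.{sig}"] = sig_type
    return instrumented

# Example: two nested blocks; V contains three uniquely named variables.
model = Block("Sensors", {"LI_pos_fail": "bool"},
              [Block("LeftInner", {"pos": "real", "mode": "enum"})])
print(instrument(model))
```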
The first step of the testing phase, namely Model Simulation, runs the available test cases \(\{w_I^k | 1 \le k \le n\}\) against the instrumented version of the simulation model under analysis. The number of available test cases varies from case to case; in our case study, the test suite included \(n=150\) tests.
The result of the model simulation consists of one simulation trace \(w^k\) for each test case \(w_I^k\). The trace \(w^k\) stores, for every instrumented variable \(v \in V\), the sequence \(w^k_v\) of (simulation time, value) pairs collected during simulation.
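For concreteness, a trace \(w^k\) can be viewed as a mapping from each instrumented variable to its timed sequence of samples. The sketch below shows one possible in-memory representation; the variable names and values are illustrative and not taken from the actual model.

```python
from typing import Dict, List, Tuple

# w^k: for every instrumented variable v in V, the sequence w^k_v of
# (simulation time, value) pairs recorded during one simulation run.
Trace = Dict[str, List[Tuple[float, float]]]

w_k: Trace = {
    "Sensors.LI_pos_fail": [(0.0, 0.0), (2.0, 1.0), (4.0, 1.0)],
    "Actuator.mode":       [(0.0, 2.0), (2.0, 3.0), (4.0, 3.0)],
}

# w^k_v for a single variable v
w_k_v = w_k["Sensors.LI_pos_fail"]
print(w_k_v)
```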
To determine the nature of each trace, we transform the informal model specification, which is typically provided in the form of free text, into an STL formula \(\phi \) that can be automatically evaluated by a monitor. CPSDebug checks every trace \(w^k\), \(1 \le k \le n\), against the STL formula \(\phi \) and labels the trace with a pass verdict P if \(w^k\) satisfies \(\phi \) and with a fail verdict F otherwise. In our case study, the STL formula 1 in Sect. 3 labeled 149 traces as passing and 1 trace as failing.
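CPSDebug delegates this check to an off-the-shelf STL monitor. As a simplified illustration only (not the monitor actually used), the sketch below evaluates an invariant of the form "always p" over a discretely sampled signal and assigns the P/F verdict; the predicate and the sample values are invented for the example.

```python
from typing import Callable, List, Tuple

Signal = List[Tuple[float, float]]  # (simulation time, value) samples

def always(pred: Callable[[float], bool], signal: Signal) -> bool:
    """Discrete-time check of an STL invariant 'always pred' over one signal."""
    return all(pred(value) for _, value in signal)

def verdict(pred: Callable[[float], bool], signal: Signal) -> str:
    """Label a trace P (pass) if it satisfies the property, F (fail) otherwise."""
    return "P" if always(pred, signal) else "F"

# Hypothetical requirement: the actuator position error stays below 0.5 cm.
error_signal: Signal = [(0.0, 0.1), (1.0, 0.2), (2.0, 0.7), (3.0, 0.3)]
print(verdict(lambda v: abs(v) < 0.5, error_signal))  # -> "F"
```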
Mining
In the mining phase, CPSDebug selects the traces labeled with a pass verdict P and exploits them for property mining.
Prior to property inference, CPSDebug performs several intermediate steps that facilitate the mining task. First, CPSDebug uses cross-correlation to reduce the set of variables V to a subset \({\hat{V}}\) of significant variables. Intuitively, the presence of two highly correlated variables implies that one variable adds little information on top of the other, and thus the analysis may focus on one of them only. The approach initializes \({\hat{V}}=V\) and then computes the cross-correlation coefficient between every pair of logged variables on the data obtained from the passing traces. The cross-correlation coefficient \(P(v_1,v_2)\) between two variables \(v_1\) and \(v_2\) is computed with the Pearson method, i.e., \(P(v_1,v_2) = \frac{cov(v_1,v_2)}{\sigma _{v_{1}}\sigma _{v_{2}}}\), which is defined in terms of the covariance of \(v_1\) and \(v_2\) and their standard deviations. Whenever the cross-correlation coefficient between two variables is higher than 0.99, that is, \(P(v_1, v_2)>0.99\), CPSDebug non-deterministically removes one of the two variables (and its associated traces) from further analysis, that is, \({\hat{V}}\) is updated to \({\hat{V}} \setminus \{v_1\}\). In our case study, \(|V| = 361\) and \(|{\hat{V}}| = 121\), resulting in a reduction of 240 variables.
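A minimal sketch of this reduction step, assuming every variable's samples from the passing traces have already been aligned into equal-length arrays (numpy's corrcoef yields the Pearson coefficient):

```python
import numpy as np
from typing import Dict, Set

def reduce_variables(data: Dict[str, np.ndarray], threshold: float = 0.99) -> Set[str]:
    """Drop one variable from every pair whose Pearson cross-correlation
    exceeds the threshold; returns the reduced variable set V_hat."""
    names = list(data)
    kept = set(names)                                     # V_hat initialized to V
    for i, v1 in enumerate(names):
        if v1 not in kept:
            continue
        for v2 in names[i + 1:]:
            if v2 not in kept:
                continue
            corr = np.corrcoef(data[v1], data[v2])[0, 1]  # Pearson P(v1, v2)
            if corr > threshold:
                kept.discard(v2)                          # v2 adds little over v1
    return kept

# Toy data: y is perfectly correlated with x and is therefore removed.
x = np.linspace(0.0, 1.0, 100)
data = {"x": x, "y": 2.0 * x + 1.0, "z": np.sin(10.0 * x)}
print(reduce_variables(data))                             # -> {'x', 'z'}
```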
In the next step, CPSDebug associates each variable \(v \in {\hat{V}}\) with (1) its domain D and (2) its parent Simulink block B. We denote by \(V_{D,B} \subseteq {\hat{V}}\) the set \(\{v_{1}, \ldots , v_{n} \}\) of variables with domain D associated with block B. CPSDebug collects all observations \({\overline{v}}_1 \ldots {\overline{v}}_n\) from all samples in all traces associated with variables in \(V_{D,B}\) and uses the Daikon function \(D(V_{D,B}, {\overline{v}}_1 \ldots {\overline{v}}_n)\) and the TkT automata learning engine to infer a set of properties \(\{p_{1}, \ldots , p_{k}\}\) related to the block B and the domain D. As mentioned in Sect. 2.3, TkT is used to monitor stateful components, which means that TkT infers properties only for state variables. Running property mining per model block and model domain avoids (1) a combinatorial explosion of learned properties and (2) learning properties between incompatible domains.
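The grouping itself is straightforward; the sketch below illustrates how the variables in \({\hat{V}}\) could be partitioned by (domain, block) before handing each group to a property miner. The `mine_invariants` function is a placeholder for the actual call to Daikon (or to TkT for state variables), not a real API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Metadata for each variable in V_hat: (domain, parent Simulink block).
VarInfo = Tuple[str, str]

def group_by_domain_and_block(meta: Dict[str, VarInfo]) -> Dict[VarInfo, List[str]]:
    """Partition V_hat into the sets V_{D,B} of variables sharing a domain and a block."""
    groups: Dict[VarInfo, List[str]] = defaultdict(list)
    for var, (domain, block) in meta.items():
        groups[(domain, block)].append(var)
    return groups

def mine_invariants(variables: List[str]) -> List[str]:
    """Placeholder for the per-group Daikon/TkT invocation (hypothetical)."""
    return [f"property over {', '.join(variables)}"]

meta = {
    "LI_pos_fail": ("bool", "Sensors"),
    "LO_pos_fail": ("bool", "Sensors"),
    "mode":        ("enum", "LeftOuterHydraulicActuator"),
}
for (domain, block), variables in group_by_domain_and_block(meta).items():
    print(domain, block, mine_invariants(variables))
```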
Finally, CPSDebug collects all the learned properties from all the blocks and domains. Each Daikon property p is transformed into an STL assertion of the form \({{\,\mathrm{\Box }\,}}p\). The TkT properties are in the form of timed automata describing the behavior of the state variables and need no transformation for further use.
In our case study, Daikon returned 96 behavioral properties involving 121 variables, and TkT returned 20 timed automata, one automaton per state variable. Hence, from the Daikon properties, CPSDebug generated an STL property \(\psi \) with 96 temporal assertions, i.e., \(\psi = [\psi _1 \, \psi _2 \, ... \, \psi _{96}]\). Equations 2 and 3 show two examples of behavioral properties inferred from our case study by Daikon and translated to STL. Variables \(mode \), \(LI\_pos\_fail \), and \(LO\_pos\_fail \) denote the internal signals Mode, Left Inner Position Failure, and Left Outer Position Failure from the aircraft position control Simulink model. The first property states that the Mode signal is always in state 2 (Passive) or 3 (Standby), while the second property states that the Left Inner Position Failure is always encoded the same way as the Left Outer Position Failure.
$$\begin{aligned}&\varphi _{1} \equiv {{\,\mathrm{\Box }\,}}(mode \in \{2, 3\}) \end{aligned}$$
(2)
$$\begin{aligned}&\quad \varphi _{2} \equiv {{\,\mathrm{\Box }\,}}(LI\_pos\_fail == LO\_pos\_fail ) \end{aligned}$$
(3)
Table 1 Internal signals that violate at least one learned invariant and the Simulink blocks to which they belong

Table 2 Scope reduction and cause detection
Explaining
This phase analyzes a trace w collected from a failing execution and produces a failure explanation. The Monitoring step analyzes the trace w.r.t. the mined properties and returns the signals that violate the properties and the time intervals in which the properties are violated. CPSDebug subsequently labels with F (fail) the internal signals involved in the violated properties and with P (pass) the remaining signals from the trace. To each fail-annotated signal, CPSDebug also assigns the violation time intervals of the corresponding violated properties returned by the monitoring tool and TkT.
In our case study, the analysis of the left inner and the left outer sensor failure resulted in the violation of 17 mined properties involving 19 internal signals.
For each internal signal, there can be several fail-annotated signal instances, each with a different violation time interval. CPSDebug selects the instance that occurs first in time and ignores all the others. In this way, CPSDebug focuses on the events that first cause observable misbehaviors, in order to get closer to the root cause of the failure.
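A possible realization of this selection step, assuming each fail-annotated instance carries the start of its violation time interval (the data below is illustrative only):

```python
from typing import Dict, List, Tuple

# Fail-annotated instances: signal name -> list of violation intervals (start, end).
violations: Dict[str, List[Tuple[float, float]]] = {
    "LI_pos_fail": [(2.0, 2.3), (4.1, 4.5)],
    "mode":        [(4.0, 4.2)],
}

def first_violation(instances: Dict[str, List[Tuple[float, float]]]) -> Dict[str, Tuple[float, float]]:
    """Keep, for every signal, only the violation interval that starts earliest."""
    return {signal: min(intervals, key=lambda iv: iv[0])
            for signal, intervals in instances.items()}

print(first_violation(violations))   # LI_pos_fail keeps (2.0, 2.3)
```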
Table 1 summarizes the set of property-violating signals, the blocks they belong to, and the time at which each signal first violated a property in our case study. We can observe that the 17 signals participating in the violation of at least one mined property belong to only 5 different Simulink blocks. In addition, we can see that all the violations naturally cluster around two time instants, 2 seconds and 4 seconds. This suggests that CPSDebug can effectively isolate in space and time a limited number of events likely responsible for the failure. Figure 6 illustrates the timed automaton inferred by TkT for the variable mode in the Left Outer Hydraulic Actuator block in AECS. In Table 2, we observe that all the faults detected by TkT are also captured by Daikon, since no guards are violated in this case study. However, when faults that violate time guards are present in a model, TkT is able to capture them because it infers time bounds. This can be observed in Table 2 for the ATCS example, where TkT captures the guard violation; since the properties mined by Daikon capture the qualitative behavior and not the timing, Daikon does not capture the faulty guard in that example.
The Clustering & Mapping step then (1) clusters the resulting fail-annotated signal instances by their violation time intervals and (2) maps them to the corresponding model blocks, that is, to the model blocks that have some of the fail-annotated signal instances as internal signals.
Finally, CPSDebug generates failure explanations that capture how the fault originated and propagated in space and time. In particular, a failure explanation is a sequence of snapshots of the system, one for each cluster of property violations. Each snapshot reports (1) the mean time, which approximates when the violations represented in the cluster occurred, (2) the model blocks \(\{B_1,...,B_p\}\) that originate the violations reported in the cluster, (3) the properties violated by the cluster, representing the reason why the cluster of anomalies exists, and (4) the internal signals that participate in the violations of the properties associated with the cluster. Intuitively, a snapshot represents a new relevant state of the system, and the sequence shows how the execution progresses from the violation of the set of properties to the final violation of the specification. The engineer is expected to exploit the sequence of snapshots to understand the failure, and the first snapshot to localize the root cause of the problem.

Figure 7 shows the first snapshot of the failure explanation that CPSDebug generated for the case study. We can see that the explanation of the failure at time 2 involves the Sensors block and propagates to the Signal conditioning and failures and Controller blocks. By opening the Sensors block, we can immediately see that the sensor measuring the left inner position of the actuator is marked as a possible cause of the failure. Going one level below, we can see that the signal \(s_{252}\) produced by \(LI\_pos\_fail \) is suspicious: indeed, the fault was injected exactly in that block at time 2. It is not a surprise that the malfunctioning of the sensor measuring the left inner position of the actuator affects the Signal conditioning and failures block (the block that detects whether a sensor fails) and the Controller block. However, at time 2 the failure of one sensor does not yet affect the correctness of the overall system; hence, the STL specification is not yet violated. The second snapshot (not shown here) generated by CPSDebug reveals that the sensor measuring the left outer position of the actuator fails at time 4. The redundancy mechanism is not able to cope with multiple sensor faults; hence, anomalies manifest in the observable behavior. From this sequence of snapshots, the engineer can conclude that the problem lies in the failure of the two sensors, one measuring the left inner position and the other measuring the left outer position of the actuator, which stop functioning at times 2 and 4, respectively.
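A snapshot can be represented as a simple record; the sketch below assembles one from a cluster of violations, with invented field contents mirroring items (1) to (4) above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Set

@dataclass
class Snapshot:
    time: float            # (1) mean violation time of the cluster
    blocks: Set[str]       # (2) model blocks originating the violations
    properties: List[str]  # (3) violated properties explaining the anomalies
    signals: List[str]     # (4) internal signals involved in the violations

def make_snapshot(cluster: Dict[str, float],
                  block_of: Dict[str, str],
                  violated_props: Dict[str, List[str]]) -> Snapshot:
    """Build one snapshot from a cluster of fail-annotated signal instances."""
    signals = sorted(cluster)
    return Snapshot(
        time=mean(cluster.values()),
        blocks={block_of[s] for s in signals},
        properties=sorted({p for s in signals for p in violated_props.get(s, [])}),
        signals=signals,
    )

snap = make_snapshot(
    {"LI_pos_fail": 2.0, "LI_cond": 2.1},
    {"LI_pos_fail": "Sensors", "LI_cond": "Signal conditioning and failures"},
    {"LI_pos_fail": ["always (LI_pos_fail == LO_pos_fail)"]},
)
print(snap)
```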