Conformance-Based Doping Detection for Cyber-Physical Systems

We present a novel and generalised notion of doping cleanness for cyber-physical systems that allows for perturbing the inputs and observing the perturbed outputs both in the time– and value–domains. We instantiate our definition using existing notions of conformance for cyber-physical systems. We show that our generalised definitions are essential in a data-driven method for doping detection and apply our definitions to a case study concerning diesel emission tests.


Introduction
System doping, in our terminology, is an intentional intervention causing a change in the system's normal behaviour against the interests of the user or other stakeholders (such as the society at large). Examples of system doping are widespread and range from vendors' enforcing a monopoly on chargers and spare parts (by checking for and refusing third-party chargers and spare parts, respectively) to tampering with exhaust emission in order to detect and pass emission tests. Doping can be the result of embedding a piece of code or smuggling a piece of electronic circuit into the system and it can be caused by the original developers or by hackers. Software and system doping has been studied in the past couple of years and rigorous theories for it have been developed [8,9,15]. These theories were subsequently adopted in order to detect doping, or formally, to check system cleanness [10,32] (corresponding to the absence of doping).
In the present paper, we extend the theory of doping to the setting of cyber-physical systems (CPS) by exploiting the notions of conformance testing for CPS [1,17,33]. The existing theories of software doping define doping in terms of drastic deviations in output as a result of minor deviations in input, where the term "deviation" refers to differences in validity of propositions or values of variables. However, the current notions fall short of properly dealing with retiming and delays, which are commonly present in the signals of CPS. We observe that this is an essential aspect of detecting doping for cyber-physical systems: often the traces to be tested for doping have subtly different timing behaviour, e.g., due to measurement and calibration errors or due to slight deviations of human actors from the planned scenarios. Insufficient treatment of retiming and delays can lead both to false negatives, i.e., missed cases of doping, and to false positives, i.e., spurious reports of doping.
To address these issues, we exploit the notion of conformance to devise a general theory of being clean from doping and instantiate that theory with some existing notions of conformance for hybrid systems. We show how these notions can account for retiming and lead to more precise notions of cleanness.
We illustrate the usefulness of our theory by an empirical analysis of diesel engine exhaust emissions in the context of one of the official test cycles, the New European Driving Cycle (NEDC) [42]. In particular, we show that catering for retiming is essential in effectively exploiting the actual driving cycles for performing doping analysis. We thus demonstrate that our new theory remedies a major shortcoming in the existing notions from the literature.
To facilitate the presentation, we use throughout the remainder of this paper the following simple running example, which is inspired by our case study.
Example 1. Figure 1.(a) shows two test cycles (evolution of speed over time), designed to detect whether the exhaust emission control of a particular vehicle is doped. The test cycle $i_{st}$, depicted with a black solid line, is the standard one prescribed by the (fictitious) official regulation, while test cycle $i_{dev}$, depicted by a red dotted line, is a slight deviation thereof. If the exhaust emissions measured during the test cycle $i_{dev}$ turn out to be significantly higher than the ones measured in test cycle $i_{st}$, then we can conclude that the exhaust emission system is potentially doped, since it appears tailored to the standard test cycle. Figure 1.(b) addresses a notorious problem of testing cars: a human tester is supposed to drive the car as just described; however, she can do so only up to a certain precision. Assume her driving of $i_{dev}$ exhibits a slight time shift $\tau$ relative to the test cycle, as in $i_{ddev}$, while $i_{st}$ is being driven as intended.
The result of a test is the emission footprint measured at the exhaust pipe of the car. Figure 1.(c) and Fig. 1.(d) show two different possible test results (obtained from different cars) for the scenario in Fig. 1.(b). Intuitively, the footprints in Fig. 1.(c) provide significant evidence for doping: a slightly different test cycle has resulted in a significantly larger footprint. However, due to the time shift on the input side (Fig. 1.(b)), the point-wise difference of the two driven test cycles has grown very large. As we show in the remainder of this paper, the existing theory of doping fails to detect such clear evidence, due to the minor delay during the execution of the driving cycle. The emission footprint in Fig. 1.(d) is another (synthetic) example of a significant deviation which cannot be detected for the input in Fig. 1.(b) using existing theories; this latter footprint sheds some light on the intricate design decisions in the theory we develop in this paper.
The contributions of this paper can be summarized as follows:
- We define a general notion of conformance that can express different ways of comparing execution traces by allowing deviations both in value and in time.
- We define a general notion of cleanness for hybrid systems, and show that it subsumes the existing notion of robust cleanness [15].
- We demonstrate the usefulness of the proposed generic framework by applying it to software doping tests in the automotive domain, where we show that the new cleanness definition is able to flag a case of software doping that goes unnoticed when robust cleanness is used.

Related Work
The term "software doping" was coined around 2015 [30] in media uncovering the diesel exhaust emissions scandal. An informal problem formulation [8] pointed out the general phenomenon of intentionally added hidden software behaviour, which is not in the interest of the consumer. Shortly after, this observation was complemented by a set of formal cleanness definitions [15], laying the theoretical foundations upon which formal methods to detect such software behaviour can be built. With these definitions it is possible to detect both missing functionality and undesired existing functionality. The definitions support both sequential programs and nondeterministic reactive programs. To check satisfaction of the definitions, it is necessary to compare two (or more) execution traces of the same system. Such properties are called hyperproperties [13] (whereas classical properties are trace properties). Tool support for analysing hyperproperties typically requires high computational effort [12,25]. There exist several temporal logics for analysing satisfaction of trace properties of various kinds of systems, one of them being Linear Temporal Logic (LTL) [39] for systems producing outputs in discrete time steps and properties that do not consider the time passing between outputs. LTL has been extended to the logic HyperLTL, which can express hyperproperties by allowing explicit quantification of execution traces in front of an LTL formula [12]. Tools for model checking of boolean circuits, satisfiability checking, and monitoring of HyperLTL specifications have been developed [6,11,21-25,29].
Signal Temporal Logic (STL) [36] is an extension of LTL that adds support for time constraints and real-valued signals. Tools exist that automatically try to falsify STL formulas [7,18]. There has been an extension of STL to HyperSTL in a similar fashion as it was done for HyperLTL [37]. The syntax of HyperSTL, however, is not able to express the cleanness definitions (for deterministic systems) in a way that allows (efficient) falsification. Robust cleanness is defined for distance functions on inputs and outputs [15]. When used with temporal logics the distance functions are restricted to those compatible with the logics. To be fully independent, robust cleanness analysis has been embedded into the theory of model-based testing [10] with input-output conformance [40,41].
Notions of conformance for discrete event systems have been discussed for more than half a century. The earliest work on this topic dates back to the 1960s, when researchers studied model-based testing of digital circuits using Finite State Machine models [31,35]. Concurrency theory contributed ideas to this field, such as decoupling inputs and outputs (i.e., removing the assumption that they are synchronised) and observing failures to engage in a communication (and more specifically quiescence) [16,40]. A theory of conformance testing for systems with continuous dynamics was developed by Michiel van Osch [38]; this theory did not gain much popularity in practice, partly because of its insufficient treatment of approximation (e.g., differences in values and retiming). Pappas and Girard [27,28] proposed the use of Metric Bisimulation for conformance checking in dynamical systems and Pappas and Fainekos [20] developed a falsification framework for the same purpose. This research led to two notions of conformance used in the present paper, namely hybrid conformance by Abbas and Fainekos [1] and Skorokhod conformance by Deshmukh, Majumdar, and Prabhu [17].

Preliminaries
Semantic Domain. In this section, we provide definitions regarding the semantic domain, conformance, and robust cleanness. We begin with the definition of our semantic domain, called generalised timed traces (GTTs) [26]. This definition subsumes both discrete-time state sequences and continuous-time trajectories. A generalised timed trace is a function with a discrete or continuous domain (called the time domain) and a co-domain that is a metric space. Intuitively, a generalised timed trace maps each element of its time domain to a state. We require the set of possible states to be a metric space, since we study conformance notions that compare traces based on the distance between their states. For a GTT $\mu : T \to Y$ and time $t_0 \in T$, we denote by $\mu[\ldots t_0]$ the prefix of $\mu$ up to $t_0$, i.e., the restriction $\mu|_{\{t \in T : t \le t_0\}}$; likewise, by $\mu[t_s \ldots t_e]$ we denote the restriction $\mu|_{\{t \in T : t_s \le t \le t_e\}}$.
A hybrid system is a mapping from generalised (input) timed traces to sets of generalised (output) timed traces.

Definition 2.
A $Y$-valued hybrid system is a function $H : GTT(Y) \to \mathcal{P}(GTT(Y))$ such that for all $\mu \in GTT(Y)$ and all $\mu' \in H(\mu)$ it holds that $\mathrm{dom}(\mu') = \mathrm{dom}(\mu)$. We define $\mathcal{H}(Y)$ to be the set of all $Y$-valued hybrid systems.
In addition, we distinguish deterministic hybrid systems whose output values range over singleton sets only. In what follows, we identify deterministic hybrid systems with functions of the type GTT (Y) → GTT (Y).
For simplicity, we assume that the input and output domain are defined on the same metric spaces. The generalisation to different spaces is straightforward.

Conformance Relations.
Recently, a number of notions of conformance for cyber-physical systems have been proposed [3,33]. It turns out that these notions, two of which are quoted below, can provide a rigorous basis for doping detection.
Note that throughout the paper, the variables $\tau$ and $\varepsilon$ (with possible subscripts) always range over non-negative real numbers.
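Definition 3. Consider two generalised timed traces $\mu_1 : T_1 \to Y$ and $\mu_2 : T_2 \to Y$, and let $d_Y$ denote the metric on $Y$. We say that $\mu_1$ and $\mu_2$ are:
- trace conformant with tolerance threshold $\varepsilon$ for signal value, notation $\mathrm{TraceConf}_{\varepsilon}(\mu_1, \mu_2)$, if $T_1 = T_2$ and $d_Y(\mu_1(t), \mu_2(t)) \le \varepsilon$ for all $t \in T_1$;
- hybrid conformant with tolerance thresholds $\tau$ and $\varepsilon$, notation $\mathrm{HybridConf}_{\tau,\varepsilon}(\mu_1, \mu_2)$, if for every $t_1 \in T_1$ there is a $t_2 \in T_2$ with $|t_2 - t_1| \le \tau$ and $d_Y(\mu_1(t_1), \mu_2(t_2)) \le \varepsilon$, and for every $t_2 \in T_2$ there is a $t_1 \in T_1$ with $|t_2 - t_1| \le \tau$ and $d_Y(\mu_1(t_1), \mu_2(t_2)) \le \varepsilon$;
- Skorokhod conformant with tolerance thresholds $\tau$ and $\varepsilon$, notation $\mathrm{SkorConf}_{\tau,\varepsilon}(\mu_1, \mu_2)$, if $T_1$ and $T_2$ are intervals and there is a strictly increasing continuous bijection $r : T_1 \to T_2$, called retiming, such that $|r(t) - t| \le \tau$ and $d_Y(\mu_1(t), \mu_2(r(t))) \le \varepsilon$ for all $t \in T_1$.
We show in the proposition below, and also in our generalisation results in Sect. 4, that these notions are closely related. However, they also have some fundamental differences, which can be illustrated using the example in Fig. 1.
Example 2. Consider again the example shown in Fig. 1. We can see that in Fig. 1.(d), although the rising and falling signals are reversed in the two trajectories, they are still hybrid conformant, because hybrid conformance disregards the order. However, Skorokhod conformance requires an order-preserving retiming, and hence distinguishes these two trajectories. On the other hand, such a retiming exists, e.g., for $i_{st}$ and $i_{ddev}$ in Fig. 1.(b), witnessing their Skorokhod conformance.
We shall use the following notation. We write $\mathit{Conf}_1 \Rightarrow \mathit{Conf}_2$ whenever for all $\mu_1 : T_1 \to Y$ and $\mu_2 : T_2 \to Y$, $\mathit{Conf}_1(\mu_1, \mu_2)$ implies $\mathit{Conf}_2(\mu_1, \mu_2)$.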

Proposition 1.
For any $\tau, \varepsilon \in \mathbb{R}_{\ge 0}$, the following relations hold:
Robust Cleanness. We shall now state the original definition of robust cleanness from [15], adapted to our framework of hybrid systems. It is based on Definition 7 and Proposition 19 from [15]; the phrasing below abstracts from the so-called parameters of interest and standard inputs. Moreover, it is cast in the setting of generalised timed traces rather than discrete-step programs, and stated using trace conformance with different thresholds for inputs and outputs, $\kappa_I$ and $\kappa_O$. Intuitively, a hybrid system is robustly clean if for every pair of input prefixes on which no difference in the inputs exceeding $\kappa_I$ has occurred so far (i.e., all sub-prefixes are trace conformant), the corresponding sets of output prefixes are also conformant with respect to $\kappa_O$. As we consider nondeterministic systems, Hausdorff distance is used to compare sets of outputs (see [15] for details).

Definition 4. A hybrid system H is robustly clean, denoted RobustClean
Note that in the above definition we do not require that $\mathrm{dom}(i_1) = \mathrm{dom}(i_2)$. In practice, robust cleanness is typically applied to pairs of traces that are both defined over $\mathbb{N}$. Here, however, for the sake of generality we impose no such restriction. In particular, when the time domains of two traces are different, for example disjoint, the predicate RobustClean trivially evaluates to true.
Example 3. Consider the traces depicted in Fig. 1. The input prefixes $i_{st}$ and $i_{ddev}$ are given in Fig. 1.(b), and the corresponding pair of outputs is shown in Fig. 1.(c). Due to the time shift, the point-wise difference between the two inputs is large (cf. Example 1). Thus, the left-hand side of the implication in Definition 4, instantiated with $\kappa_I = \kappa_O = \varepsilon$, does not hold for any $t$. Hence, regardless of the outputs, this pair of inputs satisfies the condition of RobustClean($\varepsilon$, $\varepsilon$), and, if these are the only traces in a hybrid system $H$, then we can conclude that $H$ is RobustClean($\varepsilon$, $\varepsilon$).
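The effect described in Example 3 can be reproduced on sampled data. The following is a minimal sketch rather than part of the formal development: it assumes that traces are finite samples stored as Python dictionaries from time points to real values, that the metric is the absolute difference, and that hybrid conformance is read in its usual (τ, ε)-closeness sense. The helper names `trace_conf`, `hybrid_conf` and `ramp` are hypothetical.

```python
def trace_conf(trace1, trace2, eps):
    """Trace conformance: same time domain, point-wise distance at most eps."""
    if trace1.keys() != trace2.keys():
        return False
    return all(abs(trace1[t] - trace2[t]) <= eps for t in trace1)


def hybrid_conf(trace1, trace2, tau, eps):
    """Hybrid conformance on samples: every point of either trace has a partner
    in the other trace within tau in time and eps in value."""
    def covered(a, b):
        return all(
            any(abs(tb - ta) <= tau and abs(vb - va) <= eps for tb, vb in b.items())
            for ta, va in a.items()
        )
    return covered(trace1, trace2) and covered(trace2, trace1)


def ramp(delay):
    """Toy speed profile (assumption, not a real test cycle): accelerate by
    10 km/h per second up to 50 km/h, starting at `delay`, sampled every 0.5 s."""
    return {k * 0.5: min(50.0, max(0.0, 10.0 * (k * 0.5 - delay))) for k in range(41)}


i_st = ramp(2.0)       # input driven as intended
i_shifted = ramp(3.0)  # same input, driven one second late

print(trace_conf(i_st, i_shifted, eps=2.0))            # premise of robust cleanness
print(hybrid_conf(i_st, i_shifted, tau=1.0, eps=2.0))  # tolerating a 1 s shift
```

In this toy setting the delayed ramp is not trace conformant for ε = 2, since the point-wise difference reaches 10 during acceleration, so robust cleanness never compares the outputs. With a time tolerance of one second the same two inputs are hybrid conformant.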

Conformance-Based Cleanness
We now define a general notion of conformance-based cleanness and provide two instantiations based on the conformance notions defined in the previous section. The need for considering disturbance in time as well as in value is motivated by our running example from Fig. 1. One of the challenges in performing doping tests for cyber-physical systems is that in such systems timing is rarely perfectly precise, due to imprecision in measurements, or caused by the interaction with the physical world. As illustrated in Example 1, for instance, when checking for software doping in a car [10], the input to the system is the value of the car's speed over time, which is under the control of a driver, and can thus vary from one execution to the other, even if the driver is trying to execute the same input sequence. Clearly, those variations can be in value, as well as in time.
Example 4. Consider again the inputs $i_{st}$ and $i_{ddev}$ from Fig. 1.(b), which differ by a slight time shift. Without allowing for deviations in time when comparing these input sequences, they will be considered too different, and as a result their respective exhaust emission outputs will be excluded from the comparison when checking for doping according to Definition 4, even if the NOx emission values in the corresponding outputs $H(i_{st}(t))$ and $H(i_{ddev}(t))$ are vastly different, as depicted in Fig. 1.(c). This results in a false negative, i.e., failing to detect a clearly doped system.
In the above example, we demonstrated that not accounting for timing disturbances when relating input trajectories can result in false negatives in doping detection. Dually, using the traditional comparison for output traces can result in false positives by requiring overly strict matching of outputs.
The above example motivates the need to account for timing deviations in trajectories. Intuitively, for input trajectories this relaxation results in considering more traces as conforming, and thus enforcing more comparisons when checking if a system is clean. For output trajectories this means relaxing the conformance requirement by considering two output sequences as conforming even if their values are not perfectly aligned in time. Furthermore, different types of timing deviations need to be considered in different scenarios, for example, depending on whether the order in which values occur is important or not.
Example 5. Consider the testing workflow from Example 1 and Fig. 1, where inputs $i_{st}$ and $i_{ddev}$ are passed to a car. In the second experiment, depicted in Fig. 1.(d), the car outputs $o(i_{st})$ and $o(i_{ddev})$, which are hybrid conformant for $\varepsilon$ and $\tau$. Hence this observation of the system is classified as clean under hybrid output conformance. However, the output $o(i_{ddev})$ is clearly suspicious, as the values in $o(i_{ddev})$ and $o(i_{st})$ are reversed. This motivates considering conformance notions that require retimings to be order-preserving. Indeed, using Skorokhod conformance we can detect that the system is doped.
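The difference between order-insensitive and order-preserving comparison in Example 5 can be illustrated on sampled outputs. The sketch below is an assumption-laden approximation: traces are lists of (time, value) samples, and `monotone_conf` searches for a monotone coupling of samples (in the style of a discrete Fréchet computation) as a stand-in for an order-preserving retiming. It is not the Skorokhod conformance checking procedure of [17], and all names and sample values are hypothetical.

```python
def hybrid_conf(a, b, tau, eps):
    """Order-insensitive check: every sample needs some partner within (tau, eps)."""
    def covered(x, y):
        return all(
            any(abs(ty - tx) <= tau and abs(vy - vx) <= eps for ty, vy in y)
            for tx, vx in x
        )
    return covered(a, b) and covered(b, a)


def monotone_conf(a, b, tau, eps):
    """Order-preserving check: is there a monotone coupling of the samples,
    from the first pair to the last pair, whose matched samples are all
    within tau in time and eps in value?"""
    n, m = len(a), len(b)

    def close(i, j):
        return abs(a[i][0] - b[j][0]) <= tau and abs(a[i][1] - b[j][1]) <= eps

    reach = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if not close(i, j):
                continue
            if i == 0 and j == 0:
                reach[i][j] = True
            else:
                reach[i][j] = bool(
                    (i > 0 and reach[i - 1][j])
                    or (j > 0 and reach[i][j - 1])
                    or (i > 0 and j > 0 and reach[i - 1][j - 1])
                )
    return reach[n - 1][m - 1]


# Toy emission footprints: a small peak followed by a large one ...
o_st = [(0, 0.0), (1, 1.0), (2, 5.0), (3, 0.0), (4, 0.0)]
# ... the same peaks in reversed order ...
o_rev = [(0, 0.0), (1, 5.0), (2, 1.0), (3, 0.0), (4, 0.0)]
# ... and the original footprint delayed by one time unit.
o_late = [(0, 0.0), (1, 0.0), (2, 1.0), (3, 5.0), (4, 0.0)]

print(hybrid_conf(o_st, o_rev, tau=1, eps=0.5))
print(monotone_conf(o_st, o_rev, tau=1, eps=0.5))
print(monotone_conf(o_st, o_late, tau=1, eps=0.5))
```

On these samples the reversed footprint passes the order-insensitive check but admits no monotone coupling, whereas the merely delayed footprint does, which mirrors the intuition behind preferring order-preserving retimings for outputs.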
The above examples show that in order to be useful in a diverse set of applications, a software cleanness theory should allow for using a variety of conformance notions. To this end, we next take a more general view on conformance notions, in order to be able to develop a generic conformance-based cleanness framework.
So far, we have defined three specific notions of conformance which either coincide, or are closely inspired by ones that have appeared in the literature. In order to define a general framework for cleanness, we also wish to treat notions of conformance in a more generic manner. To this end, we propose an abstract definition of conformance predicates. As conformance predicates admit variations in time, as well as in value, our definition is based on retimings, a device that will play a key role in the context of this work. In its general form a retiming is a pair of functions between two time domains. Intuitively, given two GTTs, a retiming will define a mapping from points in each of the traces to points in the other trace. Note that in general the mappings are not required to be injective; this way we can cater for notions of conformance allowing for the so-called local disorder phenomenon (in particular hybrid conformance -see Proposition 2).

Definition 5.
A retiming is a pair of functions between two time domains, i.e., a pair of the form $(r_1, r_2)$, where $r_1 : T_1 \to T_2$ and $r_2 : T_2 \to T_1$, with time domains $T_1, T_2 \subseteq \mathbb{R}_{\ge 0}$. Given two time domains $T_1$ and $T_2$, we denote the set of all retimings between $T_1$ and $T_2$ by $RET(T_1, T_2)$.
Retiming is explicitly present in the definition of Skorokhod conformance; there, each Skorokhod retiming is required to be a strictly increasing continuous bijection. We can express a Skorokhod retiming $r$ as an instance of our definition as the pair $(r, r^{-1})$. In fact, one can also define hybrid conformance, as well as a whole class of conformance notions, using a suitable family of retimings.
A family of retimings $\mathit{Ret}$ can be further constrained by $\tau$ to a subset $\mathit{Ret}_\tau$ of $\mathit{Ret}$ containing only functions that shift time by at most $\tau$ time units. In order to use a family of retimings for concrete sequences $\mu_1$ and $\mu_2$, it is necessary to consider only functions that match the domains of the sequences. This leads to a generic notion of conformance associated with a given family of retimings $\mathit{Ret}$, a given time threshold $\tau$ and a given value threshold $\varepsilon$.

Definition 6. Let Ret be a family of retimings, and let
A conformance notion with time threshold $\tau$ and value threshold $\varepsilon$ induced by $\mathit{Ret}$ is a predicate $\mathit{Conf}^{\mathit{Ret}}_{\tau,\varepsilon}$ on pairs of GTTs such that, for $\mu_1$ :
Using the above definition, we can easily express the specific notions of conformance defined in the previous section by selecting a suitable family of retimings. Definition 6 also enables us to define other notions of conformance, such as, for instance, a "shift conformance", which, intuitively, shifts all time points by a given constant $c \in \mathbb{R}$, i.e., $\mathit{Ret}_c = \{(r, r^{-1}) \mid r(t) = t + c\}$.
Next, we define a generic notion of cleanness, parametrised by conformance predicates for the input and for the output traces. Instantiating these predicates with existing or new conformance notions, yields different conformance-based notions of cleanness that can capture a variety of cleanness specifications.
We now extend the notion of robust cleanness [15] to allow for "small" variations in time, in addition to the variations in value. To this end, the new notion makes use of two conformance predicates, one that postulates when two input traces should be considered close enough, and another one that specifies when two output traces are close enough.
Our starting point, the notion of robust cleanness in Definition 4, is based on the comparison of matching prefixes of a pair of input traces and the corresponding prefixes of the associated output traces. As we now want to accommodate distance in time, we (1) compare prefixes using a conformance relation, and (2) allow for variation in the length of the compared prefixes that is within the corresponding time-distance threshold. More precisely, when comparing two prefixes, we allow for discarding start and end segments of length at most $\tau$. This intuition is formalized by the predicate PrefConf for relaxed comparison of GTT prefixes using a notion of conformance Conf with tolerance threshold $\tau$ for time disturbance. We use cascaded notation to define PrefConf as a higher-order function taking Conf as its first argument. The predicate PrefConf compares two prefixes $\mu_1$ and $\mu_2$ by requiring that there exist traces $\mu_1[t^s_1 \ldots t^e_1]$ and $\mu_2[t^s_2 \ldots t^e_2]$ obtained from them that are conformant with respect to Conf. These traces are obtained by possibly removing a sub-prefix of length at most $\tau$, and/or removing or adding a suffix of length at most $\tau$.

Definition 7. Let
Conf be a notion of conformance on GTTs with tolerance threshold $\tau$ for time disturbance. For any pair of GTTs $\mu_1 : T_1 \to Y$, $\mu_2 : T_2 \to Y$, and $t \in T = T_1 \cup T_2$, the predicate PrefConf is defined as:
The predicate PrefConf provides a generic notion of prefix-conformance. By instantiating it with conformance relations $\mathit{Conf}_I$ and $\mathit{Conf}_O$ for input and output traces respectively, we define the notion of $(\mathit{Conf}_I, \mathit{Conf}_O)$-cleanness. For deterministic systems, $(\mathit{Conf}_I, \mathit{Conf}_O)$-cleanness requires that for all pairs of input prefixes for which all sub-prefixes are prefix-conformant w.r.t. $\mathit{Conf}_I$, the corresponding pair of output prefixes is prefix-conformant w.r.t. $\mathit{Conf}_O$.

Definition 8. A deterministic system H is (Conf
The above definition naturally generalises to nondeterministic hybrid systems, by comparing sets of possible output prefixes using Hausdorff distance as in [15].

Definition 9. A system H is (Conf
Robust cleanness [15] can now be formulated as conformance-based cleanness, which establishes that $(\mathit{Conf}_I, \mathit{Conf}_O)$-cleanness is a generalisation. Using hybrid conformance, we define hybrid-conformance cleanness, and similarly, plugging in Skorokhod conformance, we define Skorokhod-conformance cleanness. Formally:
We will now establish some key relations between the cleanness notions defined previously. We begin by lifting the implication between conformance relations to implication between cleanness notions defined using those relations.

Proposition 3. Suppose that
The proposition above has two important corollaries. The first one explains the relationships between the original robust cleanness, and notions of cleanness based on Skorokhod conformance and hybrid conformance, in particular stating the conservative generalisation property for the latter notions. The second corollary compares cleanness notions with different conformance thresholds.
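Example 6. Consider the testing workflow in Fig. 1. The inputs passed to a car are $i_{st}$ and $i_{ddev}$, depicted in Fig. 1.(b). One of the test results is presented in Fig. 1.(c). The inputs are conformant up to the time shift, yet the corresponding outputs differ significantly. Hence, the system tested in Fig. 1.(c) is not HybridClean($\varepsilon$, $\tau$, $\varepsilon$, $\tau$).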
We now discuss testing and falsification of conformance-based cleanness. For systems with discrete time domains the existing methods for verifying [15] or testing [10] robust cleanness can be readily applied.
In the case of hybrid cleanness, existing methods for testing hybrid conformance, such as [2] and [4] can be extended to testing and falsification of hybrid cleanness of hybrid systems consisting of traces with finite time domains. Methods for checking Skorokhod conformance were presented in [17]. Due to the quantification over all time-points t in our Definition 8 and Definition 9, it is not clear how to directly extend them to testing Skorokhod cleanness.

Case Study
In this section we evaluate the proposed notion of hybrid cleanness in the context of doping detection in relation to the recent Diesel Emissions Scandal.
Conducting software doping tests for cyber-physical systems has a range of applications. A prominent example is the body of recent work [8-10,14,15,32,34] that gives insights into the Diesel Emissions Scandal. This is a world-wide scandal in which millions of diesel cars have been equipped with defeat devices that reduce the effectiveness of emission cleaning systems during real-world usage, in contrast to the regulator-defined driving scenarios on a chassis dynamometer, where the amount of emitted pollutants is well below the applicable limits.
Assuming the existence of a contract that formalizes when software is considered to be doped, recent work demonstrates how doping tests can be generated automatically and how the characteristic challenges arising with these kinds of tests can be tackled [10]. A major challenge is the distortion of inputs that can occur during test execution. As doping tests have to be conducted on the final product, i.e., a vehicle such as a passenger car, a human driver has to provide the inputs to the car by driving it. It is far from trivial to provide the inputs exactly as defined by the test case. Official regulations, that define the approval process for new car models, precisely specify test cycles for which they allow tolerances in the input of up to 2 km/h (in car speed). But even driving a car within this tolerance requires a very experienced driver. To strengthen the position of consumers against manufacturers, it is necessary to allow manufacturer-independent methods to check the compliance of a car model with the applicable regulations, i.e., the absence of defeat devices. These methods are supposed to require a reasonable amount of effort, and training a driver over months so that she has enough experience to stay within the tolerance of 2 km/h is way beyond reasonable effort. This means that the responsibility for accounting for the driver's imprecision must be shifted to the techniques for checking for software doping.
In this section we give a short summary of recent doping tests with a diesel car and demonstrate how the theory developed in this paper addresses the above challenge. More precisely, it allows us to overcome the imprecise timing leading to minor input distortions, by appropriately accounting for the effect of retiming on the input value error. We further show how, using our theory, one of the tests reveals strong indications of a defeat device in the car under test, despite a very inexperienced driver conducting the test. This doping detection would not have been possible using the cleanness notions existing prior to this work.
Physical Set-Up of the Experiment. Before a car model can be sold, it must meet the requirements defined in the official regulations. The type approval procedure requires the car to be placed on a chassis dynamometer. Cars have to follow certain standardized test cycles, each defined as a function from time to speed. One of the test cycles involved in the diesel scandal was the New European Driving Cycle (NEDC) [42], shown in Fig. 2. For the tests here, we consider the speed of the car as input, since this is the parameter defining a test cycle. The total amount of NO and NO2 (abbreviated as NOx) is the only output of interest.
The car under test is a Nissan NV200 Evalia, with Renault 1.5 dci (110hp) diesel engine and approved w.r.t. regulation Euro 6b. The test set-up is shown in Fig. 2.
In order to perform a check for defeat devices using a cleanness test, we consider, in addition to the original NEDC, two manually synthesized tests. These test cycles, denoted PowerNEDC and SineNEDC, were proposed in previous work [10] and are defined as follows. PowerNEDC is based on the NEDC but slightly deviates from it by enforcing higher accelerations (1.5 m/s² instead of 0.94 m/s²) after 56 s, 251 s, 446 s and 641 s. The maximum input deviation from NEDC is $\kappa_I$ = 10 km/h. SineNEDC is defined as the NEDC superimposed by a sine curve, formally $\mathrm{SineNEDC}(t) = \max\{0, \mathrm{NEDC}(t) + 5 \sin(0.5t)\}$, with a maximum input deviation from NEDC of $\kappa_I$ = 5 km/h. These test cycles are defined by specifying the input value (the car's speed) in each second. Both test cases are shown by the red dotted lines in Fig. 3.
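The construction of SineNEDC from a base cycle is simple enough to sketch directly. The snippet below applies the formula max{0, NEDC(t) + 5 sin(0.5t)} to a base cycle given as one speed value per second. The base cycle used here, `toy_cycle`, is a hypothetical placeholder and not the actual NEDC, and the function name `sine_cycle` is likewise an assumption made for illustration.

```python
import math


def sine_cycle(base_cycle):
    """Superimpose 5*sin(0.5*t) on a base cycle sampled once per second
    (speeds in km/h), clamping negative speeds to zero."""
    return [max(0.0, v + 5.0 * math.sin(0.5 * t)) for t, v in enumerate(base_cycle)]


# Hypothetical stand-in for a driving cycle: idle, 15 km/h, 32 km/h, idle.
toy_cycle = [0.0] * 10 + [15.0] * 20 + [32.0] * 20 + [0.0] * 10

sine = sine_cycle(toy_cycle)

# Largest point-wise deviation of the derived cycle from its base cycle.
max_dev = max(abs(s - b) for s, b in zip(sine, toy_cycle))
print(f"maximum deviation from base cycle: {max_dev:.2f} km/h")
```

Because the sine term has amplitude 5 and clamping at zero can only reduce the deviation, the point-wise deviation from the base cycle never exceeds 5 km/h, in line with the stated value of the input threshold for SineNEDC.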
Conformance-Based Cleanness Tests for NEDC. We have applied our theory of conformance-based cleanness to check for doping, i.e., the presence of a defeat device, in the car under test. For this, we have at our disposal the raw data obtained from three test drives: (1) Test drive dNEDC is the result of NEDC cycle driven by a human driver. It serves as the reference behaviour of the car, to which we will compare the executions of the other two test cycles. (2) Test drive dPowerNEDC is the trajectory that is produced as the result of a human driver driving PowerNEDC. (3) Test drive dSineNEDC is the trajectory that is produced as the result of a human driver driving SineNEDC.
The values of the actual sequences of inputs executed by driving the car are sampled in steps of 0.05 s. As mentioned earlier, the human in the loop makes testing considerably more challenging. The maximum deviation of inputs compared to the test specification is just below 10 km/h for NEDC, almost 12 km/h for PowerNEDC, and approaches 16 km/h for SineNEDC. This shows that the perturbation introduced by the human driver is clearly noticeable. The amount of NOx measured for dNEDC is 180 mg/km; for dPowerNEDC and dSineNEDC the measurements revealed 204 mg/km and 584 mg/km, respectively.
In order to detect doping (by falsifying cleanness), the input sequences of dPowerNEDC and dSineNEDC each have to be compared to dNEDC, and if the input sequences in the corresponding pair are conforming, then the respective outputs (the total NOx emission values) have to be checked for conformance.
As we want our doping tests to be as strict as possible, we identify hybrid conformance $\mathrm{HybridConf}_{\tau_I,\varepsilon_I}$, i.e., the weakest of the conformance relations discussed in Sect. 3, as the most suitable conformance relation for the comparison of input traces. As the outputs are just single values, the choice of output conformance relation is immaterial in this case, so we take $\mathrm{HybridConf}_{0,\varepsilon_O}$.
Formally, we consider the deterministic hybrid system $H$ defined by the input GTTs dNEDC, dPowerNEDC, and dSineNEDC, and check whether $H$ is HybridClean($\tau_I$, $\varepsilon_I$, 0, $\varepsilon_O$) for given values of $\tau_I$, $\varepsilon_I$ and $\varepsilon_O$.
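The resulting check for one pair of test drives can be sketched as follows. This is a simplified illustration under stated assumptions: it compares complete test drives only (the output being a single NOx value per drive) and therefore ignores the quantification over prefixes in Definitions 8 and 9. The input samples and all thresholds below are hypothetical; only the NOx totals of 180, 204 and 584 mg/km are taken from the measurements reported above.

```python
def hybrid_conf(a, b, tau, eps):
    """Hybrid conformance on (time, value) samples."""
    def covered(x, y):
        return all(
            any(abs(ty - tx) <= tau and abs(vy - vx) <= eps for ty, vy in y)
            for tx, vx in x
        )
    return covered(a, b) and covered(b, a)


def falsifies_cleanness(in_ref, in_test, nox_ref, nox_test, tau_in, eps_in, eps_out):
    """A pair of drives witnesses doping if the inputs are conformant
    but the scalar outputs differ by more than eps_out."""
    inputs_close = hybrid_conf(in_ref, in_test, tau_in, eps_in)
    outputs_close = abs(nox_ref - nox_test) <= eps_out
    return inputs_close and not outputs_close


# Hypothetical speed samples (seconds, km/h); real drives are far longer.
in_ref = [(0, 0.0), (1, 10.0), (2, 20.0), (3, 20.0)]
in_test = [(0, 0.0), (1, 8.0), (2, 18.0), (3, 22.0)]

# NOx totals in mg/km as reported for dNEDC, dPowerNEDC and dSineNEDC.
# tau_in, eps_in and eps_out are illustrative values, not taken from the study.
for label, nox in [("dPowerNEDC", 204.0), ("dSineNEDC", 584.0)]:
    verdict = falsifies_cleanness(in_ref, in_test, 180.0, nox,
                                  tau_in=1.0, eps_in=3.0, eps_out=100.0)
    print(label, "falsifies cleanness:", verdict)
```

The structure makes the role of the input relation explicit: a weaker input conformance admits more pairs of drives into the comparison, which is why the weakest relation yields the strictest doping test.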
The driver's imprecision has a significant effect on the values in the input sequences and their timing. This can lead to dismissing pairs of sequences if they are incorrectly deemed too far apart, and thus missing some indications of doping. For instance, a too strict comparison of dSineNEDC to dNEDC will dismiss this pair of executions; however, the measured NO x emission during the dSineNEDC drive is three times more than the one measured during dNEDC.
Testing HybridClean(τ_I, ε_I, 0, ε_O) allows us to perform a realistic comparison by taking into account the two possible sources of driving errors: the over- or undershooting of the speed, and the timing offsets, where the driver accelerates or decelerates too fast or too slowly. In comparison, prior doping tests based on Robust Cleanness considered only the former, i.e., the point-wise offset in speed. As we demonstrate, depending on the specified value threshold, there are cases when this is insufficient to identify doping. Indeed, looking into the official regulations, we can see that they allow for a timing variation of one second [19,42]. Thus, essentially, the regulations allow for hybrid conformance with τ_I = 1 s.

Hybrid Cleanness Testing.
In order to test HybridClean(τ_I, ε_I, 0, ε_O) we have to examine the conformance relations HybridConf_{τ_I, ε_I}(dNEDC, dPowerNEDC) and HybridConf_{τ_I, ε_I}(dNEDC, dSineNEDC) between the corresponding input sequences. Recall that since the output of the system measured in each test is the total amount of NOx emitted during the test, i.e., a single value for the whole execution, timing plays no role when quantifying the value error for the output.
In order to evaluate the power of using hybrid cleanness for detecting doping, we consider different values for ε_I and τ_I, and perform two types of analysis of the results of testing HybridClean(τ_I, ε_I, 0, ε_O), which we describe below.
Effect of τ_I on the Minimal ε_I for Which Inputs are Conforming. First, we fix a maximum value that we allow for the time offset τ_I. For this τ_I we analyse our dataset to find the minimal ε_I such that, for the combination of τ_I and ε_I, the input traces under consideration satisfy hybrid conformance. For τ_I = 0 we get exactly the ε_I for which the two traces are trace conformant. Table 1 (left side) shows the computed ε_I values for τ_I = 0, 0.5, 1, 2, 5, 10.
As expected, when we increase τ_I, the minimal ε_I decreases. At some point (at τ_I = 2 for PowerNEDC and τ_I = 5 for SineNEDC) the decrease in the value error slows down notably. This happens because the error is only partially caused by the incorrect timing of the driver.
From the values reported in Table 1 (left) we see that if, for example, we allow a time deviation of τ_I = 1 s for the input, as per the official regulation, and set ε_I = 15 km/h, then both HybridConf_{τ_I, ε_I}(dNEDC, dPowerNEDC) and HybridConf_{τ_I, ε_I}(dNEDC, dSineNEDC) are true, while for τ_I = 0 both are false. Thus, under hybrid conformance these pairs of traces will be considered in the cleanness test, while under trace conformance they will be dismissed. Since the difference between the outputs measured during dSineNEDC and during dNEDC is vast, we establish that HybridClean(1, 15, 0, 180) does not hold.
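The output check behind this verdict can be spelled out with the NOx totals reported earlier, assuming (as seems natural for single-valued outputs) that output conformance with time tolerance 0 reduces to an absolute difference bounded by the output tolerance of 180 mg/km:

```latex
\begin{align*}
\bigl|\mathrm{NO}_x(d_{\mathrm{SineNEDC}}) - \mathrm{NO}_x(d_{\mathrm{NEDC}})\bigr|
  &= |584 - 180| = 404 > 180 = \varepsilon_O, \\
\bigl|\mathrm{NO}_x(d_{\mathrm{PowerNEDC}}) - \mathrm{NO}_x(d_{\mathrm{NEDC}})\bigr|
  &= |204 - 180| = 24 \le 180 = \varepsilon_O.
\end{align*}
```

Under this reading, only the dSineNEDC pair witnesses the violation; the dPowerNEDC pair has conforming inputs and conforming outputs.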
Effect of ε_I on the Minimal τ_I for Which Inputs are Conforming. Second, we fix the maximum value error ε_I and examine what minimal τ_I results in a combination of τ_I and ε_I for which the analysed data is hybrid conformant. For the synthesized test cases we study the error tolerance ε_I set to the respective input thresholds κ_I. As discussed above, this is 10 km/h for PowerNEDC and 5 km/h for SineNEDC. We also consider the scenario where the error tolerance allowed by the official regulation for the test cycle is added, that is, we also consider ε_I = κ_I + 2 km/h. The two rightmost columns of Table 1 show the time shifts necessary to achieve these value errors. As is apparent, they reduce by approximately 84% and 94% when adding the error tolerance of 2 km/h. These values for τ_I give us the minimal tolerance threshold for time for which HybridClean(τ_I, ε_I, 0, 180) is violated in H for the given ε_I; the value of ε_O is fixed at 180 mg/km according to the standard [10].
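The second analysis is the dual search: fix the value tolerance and look for the smallest time tolerance. A minimal sketch, under the same closeness-style reading of hybrid conformance on uniformly sampled traces of equal length, could look as follows; `conforms` and `min_time_shift` are hypothetical names, and the time tolerance is only searched on the sampling grid.

```python
from typing import Optional, Sequence


def conforms(a: Sequence[float], b: Sequence[float], window: int, eps: float) -> bool:
    """Check closeness with a time tolerance of `window` samples and value tolerance `eps`."""
    def covered(x: Sequence[float], y: Sequence[float]) -> bool:
        # Every sample of x needs a partner in y within the window and within eps.
        return all(
            any(abs(v - w) <= eps for w in y[max(0, i - window): i + window + 1])
            for i, v in enumerate(x)
        )
    return covered(a, b) and covered(b, a)


def min_time_shift(ref: Sequence[float], other: Sequence[float], eps: float,
                   step: float = 0.05) -> Optional[float]:
    """Smallest time tolerance (a multiple of `step`) making the traces conformant.

    Returns None if no time tolerance suffices for the given `eps`.
    """
    if len(ref) != len(other):
        raise ValueError("traces must have the same number of samples")
    for window in range(len(ref)):
        if conforms(ref, other, window, eps):
            return window * step
    return None
```

Because enlarging the window can only add candidate partners, conformance is monotone in the time tolerance, so the linear scan could be replaced by a binary search on long traces.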
Evaluation and Discussion. The analysis of the data shows that it is indeed necessary to consider not only a deviation in value but also deviations in timing, especially when the quality of the studied driving tests suffers from human-caused input distortions. In terms of the theory established in this paper, this means that in scenarios like this one, employing HybridClean is more adequate than using prior notions such as RobustClean; without it, the cases of doping we have detected would go unnoticed. Allowing a retiming of up to 10.8 s (for PowerNEDC) and of 4.05 s (for SineNEDC) makes both inputs conformant to the NEDC input, so we are able to detect the violation of hybrid cleanness witnessed by SineNEDC for the specified value error tolerance. While these time deviations appear large given the test cycle timeline, they are acceptable when we recall that the tests are executed by human drivers.
If, on the other hand, we want to restrict the tolerance in time to one second, both tests can be considered for hybrid cleanness with value error tolerances of 12.41 km/h for PowerNEDC and 14.24 km/h for SineNEDC.
This demonstrates how conformance-based cleanness notions like HybridClean allow us, to some extent, to account for human-caused errors related to timing.
Finally, while hybrid cleanness is arguably the appropriate notion for the case study considered here, our generic theory of conformance-based cleanness allows for using other conformance notions as appropriate for the CPS under test.

Conclusions
In this paper, we presented a theory of doping detection and cleanness based on the notions of conformance for cyber-physical systems. Our new notion accounts for possible "deviations" of the system output, upon "perturbing" its inputs, both in time and in values. Both notions of "deviation" and "perturbation" turn out to be expressible using a generic notion of retiming. We instantiate our definition with specific notions of retiming from the conformance testing literature. We apply our notions to a case study from the automotive domain and demonstrate how our generalised notions are useful in using actual driving cycles for doping detection according to the New European Driving Cycle (NEDC) [42].
We intend to turn our theory into an automatic tool for doping detection, using hybrid systems models. We intend to use the HyConf tool [4] as the starting point and use our search-based testing implementation in HyConf [5] to automate the process of test-case generation and test-case selection. Once this process is automated, one can generate test-cases that can go beyond a specific standard and detect intelligent defeat devices that cheat the standards and the tests prescribed by them.
We also intend to organise widespread experiments regarding emission detection to put our theory into practice. Our experimental set-up involves instrumenting a large number of cars using low-cost equipment, constructing models of emission behaviour, and generating realistic driving scenarios that are more likely to detect doping.