Software Doping Analysis for Human Oversight

This article introduces a framework that is meant to assist in mitigating societal risks that software can pose. Concretely, this encompasses facets of software doping as well as unfairness and discrimination in high-risk decision-making systems. The term software doping refers to software that contains surreptitiously added functionality that is against the interest of the user. Prominent examples of software doping are the tampered emission cleaning systems that were found in millions of cars around the world when the diesel emissions scandal surfaced. The first part of this article combines the formal foundations of software doping analysis with established probabilistic falsification techniques to arrive at a black-box analysis technique for identifying undesired effects of software. We apply this technique to emission cleaning systems in diesel cars but also to high-risk systems that evaluate humans in a possibly unfair or discriminating way. We demonstrate how our approach can assist humans in the loop in making better informed and more responsible decisions. This is to promote effective human oversight, which will be a central requirement enforced by the European Union's upcoming AI Act. We complement our technical contribution with a juridically, philosophically, and psychologically informed perspective on the potential problems caused by such systems.


Introduction
Software is the main driver of innovation of our times. Software-defined systems are permeating our communication, perception, and storage technology as well as our personal interactions with technical systems at an unprecedented pace. "Software-defined everything" is among the hottest buzzwords in IT today [76,119].
At the same time, we are doomed to trust these systems, despite being unable to inspect or look inside the software we are facing: the owners of the physical hull of 'everything' are typically not the ones owning the software defining 'everything', nor will they have the right to look at what and how 'everything' is defined. This is because commercial software is typically protected by intellectual property rights of the software manufacturer. This prohibits any attempt to disassemble the software or to reconstruct its inner workings, even though it is the very software that is forecast to be defining 'everything'. The use of machine-learnt software components amplifies the problem considerably by adding opacity of its own kind. Since the commercial interests of software manufacturers are seldom aligned with the interests of end users, the promise of 'software-defined everything' might well become a dystopia from the perspective of individual digital sovereignty. In this article, we address two of the most pressing incarnations of problematic software behaviour.

Diesel emissions scandal
A massive example of software-defined collective damage is the diesel emissions scandal. Over a period of more than 10 years, millions of diesel-powered cars have been equipped with illegal software that altogether polluted the environment for the sake of commercial advantages of the car manufacturers. At its core, this was made possible by the fact that only a single, precisely defined test setup was put in place for checking conformance with exhaust emissions regulations. This made it a trivial software engineering task to identify the test particularities and to turn off emission cleaning outside these particular conditions. This is an archetypal instance of software doping.
Software doping can be formally characterised as a violation of a cleanness property of a program [10,31]. A detailed and comparative account of meaningful cleanness definitions related to software doping is available [15, Chapter 3]. One cleanness notion that has proven suitable to detect diesel emissions doping is robust cleanness [15,18]. It is based on the assumption that there is some well-defined and agreed standard input/output behaviour of the system, which the definition extends to the vicinity around the inputs and outputs close to the standard behaviour. The precise specification of "vicinity" and of "standard behaviour" is assumed to be part of a contract between software manufacturer and user. That contract entails the standard behaviour, distance functions for input and output values, and distance thresholds to define the input and output vicinity, respectively. With this, a system behaviour is considered clean if its output (is or) stays in the output vicinity of the standard, unless the input (is or) moves outside the standard's input vicinity.
Example 1 Every car model that is to enter the market in the European Union (and other countries) must be compliant with local regulations. Common to all of these regulations is that the homologation process requires executing a test under precisely defined lab conditions, carried out on a chassis dynamometer. In this test, the car has to follow a speed profile, which is called a test cycle in regulations. At the time when the diesel scandal surfaced, the New European Driving Cycle (NEDC) [126] was the single test cycle used in the European Union. It has by now been replaced by the Worldwide harmonized Light vehicles Test Cycle (WLTC) [122] in many countries. We refer to previous work for more details [15,18,21]. From a perspective of fraud prevention, having only a single test cycle is a major weakness of the homologation procedure. Robust cleanness can overcome this problem. It admits the consideration of driving profiles that stay in the bounded vicinity of one of several standardised test cycles (i.e., NEDC as well as WLTC), while enforcing bounds on the deviations regarding exhaust emission.

Discrimination mitigation
Another set of exemplary scenarios we consider in this article are high-risk AI systems, that is, systems empowered by AI technology whose functioning may introduce risks to health, safety, or fundamental rights of human individuals. The European Union is currently developing the AI Act [39,40] that sets out to mitigate many of the risks that such systems pose. Application areas of concern include credit approval [93], decisions on visa applications [82], admissions to higher education [26,131], screening of individuals in predictive policing [57], selection in HR [90,91,92], judicial decisions (as with COMPAS [3,29,33,70]), tenant screening [113], and more. In many of these areas, there are legitimate interests and valid reasons for using well-understood AI technology, although the risks associated with its use are manifold to date.
It is widely recognised that discrimination by unfair classification and regression models is one particularly important risk. As a result, a colourful zoo of different operationalisations of unfairness has emerged [94,129], which should be seen less as a set of competing approaches and more as mutually complementary [51]. At the same time, a consensus is emerging that human oversight is an important piece of the puzzle for mitigating and minimising societal risks of AI [58,81,127]. Accordingly, that principle made it into recent drafts of legislation including the European AI Act [39,40] and certain US state laws [130].
The generic approach we develop for software-doping analysis turns out to be powerful enough to provide automated assistance for human overseers of high-risk AI systems. Apart from spelling out the necessary refocusing, we illustrate the challenge that our work helps to overcome by an exemplary, albeit hypothetical, admission system for higher education (inspired by [26,131]).
Example 2 A large university assigns scores to applicants aiming to enter its computer science PhD program. The scores are computed using an automated, model-based procedure P which is based on three data points: the position of the applicant's last graduate institution in an official, subject-specific ranking, the applicant's most recent grade point average (GPA), and their score in a subject-specific standardised test taken as part of the application procedure. The system then automatically computes a score for the candidate based on an estimation of how successful it expects them to be as a student. A dedicated university employee, Unica, is in charge of overseeing the individual outcomes of P and is supposed to detect cases where the output of P is or appears flawed. The university pays special attention to fairness in the scoring procedure, so Unica has to watch out for any signs of potential unfairness. Unica is supposed to desk-reject candidates whose scores are below a certain, predefined threshold, unless she finds problems with P's scoring. Without any additional support, Unica, as human overseer in the loop, must manually check all cases for signs of unfairness as they are processed. This can be a tedious, complicated, and error-prone task and as such constitutes an impediment to the assumed scalability of the automated scoring process to high numbers of applicants. Therefore, she at least requires tool support that assists her in detecting when something is off about the scoring of individual applicants.
This support can be made real by exploiting the technical contributions of this article, in terms of a runtime monitor that provides automated assistance for human oversight and is itself based on the probabilistic falsification technique we develop. As we will explain, func-cleanness, a variant of cleanness, is a suitable basis for rolling out runtime monitors for such high-risk systems that are able to detect and flag discrimination or unfair treatment of humans.
The contributions made by this article are threefold.
Detecting software doping using probabilistic falsification. The paper starts off by developing the theory of robust cleanness and func-cleanness. We provide characterisations in the temporal logics HyperSTL and STL, which are then used for an adaptation of existing probabilistic falsification techniques [1,48]. Altogether, this reduces the problem of software doping detection to the problem of falsifying the logical characterisation of the respective cleanness definition.
Falsification-based test input generation. Recent work [18] proposes a formal framework for robust cleanness testing, with the ambition of making it usable in practice, namely for emissions tests conducted with a real diesel car on a chassis dynamometer. However, that approach leaves open how to perform test input selection in a meaningful manner. The probabilistic falsification technique presented in this article attacks this shortcoming. It supports the testing procedure by guiding it towards test inputs that make the robust cleanness tests likely to fail.
Promoting effective human oversight. We discuss and demonstrate how the technical contributions of this paper contribute to effective human oversight of high-risk systems, as required by the current proposal of the AI Act. The hypothetical university admission scenario introduced above will serve as a demonstrator for shedding light on the applicability of our approach as well as the principles behind it. On a technical level, we provide a runtime monitor for individual fairness based on probabilistic falsification of func-cleanness. On a conceptual level, we consider it important to clarify which duties come with the usage of such a system: from a legal perspective, particularly considering the AI Act, substantiated by considering the ethical dimension from a philosophical perspective, and from a psychological perspective, particularly deliberating on how the overseeing can become effective.
This paper is based on a conference publication [16]. Relative to that paper, the development of the theory here is more complete and now includes temporal logic characterisations for func-cleanness. On the conceptual side, this article adds a principled analysis of the applicability of func-cleanness to effective human oversight, spelled out in the setting of admission to higher education. We live up to the societal complexity of this new example and provide an interdisciplinary situation analysis and an interdisciplinary assessment of our proposed solution. Accordingly, although the technical realisation is based on the probabilistic falsification approach outlined in this article, our solution is substantially more thoughtful than a naive instantiation of the falsification framework.
This article is structured as follows. Section 2 provides the preliminaries for the contributions in this article. Section 3 develops the theoretical foundations necessary to use the concept of probabilistic falsification with robust cleanness and func-cleanness. Section 4 demonstrates how the probabilistic falsification approach can be combined with the previously proposed testing approach [18] for robust cleanness, with a focus on tampered emission cleaning systems of diesel cars. Section 5 develops the technical realisation of a fairness monitor based on func-cleanness for high-risk systems. Section 6 evaluates the fairness monitor from the perspective of the disciplines philosophy, psychology, and law. Finally, Section 7 summarises the contributions of this article and discusses limitations of our approaches. The appendix of this article contains additional technical details, proofs, and further philosophical and juridical explanations.

Software Doping
After early informal characterisations of software doping [10,12], D'Argenio et al. [31] propose a collection of formal definitions that specify when a software is clean. The authors call a software doped (w.r.t. a cleanness definition) whenever it does not satisfy such cleanness definition. We focus on robust cleanness and func-cleanness in this article [31].
We define by $\mathbb{R}_{\geq 0} := \{x \in \mathbb{R} \mid x \geq 0\}$ the set of non-negative real numbers, by $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$ the set of extended reals [102], and by $\overline{\mathbb{R}}_{\geq 0} := \mathbb{R}_{\geq 0} \cup \{\infty\}$ the set of the non-negative extended real numbers. We say that a function $d : X \times X \to \overline{\mathbb{R}}_{\geq 0}$ is a distance function if and only if it satisfies $d(x, x) = 0$ and $d(x, y) = d(y, x)$ for all $x, y \in X$. We let $\sigma[k]$ denote the $k$th literal of the finite or infinite word $\sigma$.

Reactive Execution Model
We can view a nondeterministic reactive program as a function $S : In^\omega \to 2^{(Out^\omega)}$ perpetually mapping inputs In to sets of outputs Out [31]. To formally model contracts that specify the concrete configuration of robust cleanness or func-cleanness, we denote by $StdIn \subseteq In^\omega$ the input space of the system designated to define the standard behaviour, and by $d_{In} : (In \times In) \to \overline{\mathbb{R}}_{\geq 0}$ and $d_{Out} : (Out \times Out) \to \overline{\mathbb{R}}_{\geq 0}$ distance functions on inputs, respectively outputs.
For robust cleanness, we additionally consider two constants $\kappa_i, \kappa_o \in \mathbb{R}_{\geq 0}$. $\kappa_i$ defines the maximum distance that a non-standard input may have to a standard input to be considered in the cleanness evaluation. For this evaluation, $\kappa_o$ defines the maximum distance between two outputs such that they are still considered sufficiently close. Intuitively, the contract defines tubes around standard inputs and their outputs. For example, in Figure 1, $i$ is a standard input and $d_{In}$ and $\kappa_i$ implicitly define a $2\kappa_i$ wide tube around $i$. Every input $i'$ that is within this tube will be evaluated on its outputs. Similarly, $d_{Out}$ and $\kappa_o$ define a tube around each of the outputs of $i$. An output for $i'$ that is within this tube satisfies the robust cleanness condition. Together, the above objects constitute a formal contract $C = \langle StdIn, d_{In}, d_{Out}, \kappa_i, \kappa_o \rangle$.
Robust cleanness is composed of two separate definitions called l-robust cleanness and u-robust cleanness. Assuming a fixed standard behaviour of a system, l-robust cleanness imposes a lower bound on the non-standard outputs that a system must exhibit, while u-robust cleanness imposes an upper bound. Such lower and upper bound considerations are necessary because of the potential nondeterministic behaviour of the system; for deterministic systems the two notions coincide. We remark that in this article we are using past-forgetful distance functions and the trace integral variants of robust cleanness and func-cleanness (see Biewer [15, Chapter 3]).
Definition 1 A nondeterministic reactive system $S$ is robustly clean w.r.t. contract $C = \langle StdIn, d_{In}, d_{Out}, \kappa_i, \kappa_o \rangle$ if for every standard input $i \in StdIn$ and input sequence $i' \in In^\omega$ it is the case that
1. for every $o \in S(i)$, there exists $o' \in S(i')$, such that for every index $k \in \mathbb{N}$, if $d_{In}(i[j], i'[j]) \leq \kappa_i$ for all $j \leq k$, then $d_{Out}(o[k], o'[k]) \leq \kappa_o$,
2. for every $o' \in S(i')$, there exists $o \in S(i)$, such that for every index $k \in \mathbb{N}$, if $d_{In}(i[j], i'[j]) \leq \kappa_i$ for all $j \leq k$, then $d_{Out}(o[k], o'[k]) \leq \kappa_o$.
We will in the following refer to Definition 1.1 for l-robust cleanness and Definition 1.2 for u-robust cleanness. Intuitively, l-robust cleanness enforces that whenever an input $i'$ remains within $\kappa_i$ vicinity around the standard input $i$, then for every standard output $o \in S(i)$, there must be a non-standard output $o' \in S(i')$ that is in $\kappa_o$ proximity of $o$.
Referring to Figure 1, every $i'$ in the tube around $i$ must produce for every standard output $o \in S(i)$ at least one output $o' \in S(i')$ that resides in the $\kappa_o$-tube around $o$. In other words, for non-standard inputs the system must not lose behaviour that it can exhibit for a standard input in $\kappa_i$ proximity.
For u-robust cleanness the standard and non-standard output switch roles. It enforces that whenever an input $i'$ remains within $\kappa_i$ vicinity around the standard input $i$, then for every output $o' \in S(i')$ the system can exhibit for this non-standard input, there must be a standard output $o \in S(i)$ that is in $\kappa_o$ proximity of $o'$. Referring to Figure 1, every $i'$ in the tube around $i$ must only produce outputs $o' \in S(i')$ that are in the $\kappa_o$-tube of at least one $o \in S(i)$. In other words, for non-standard inputs within $\kappa_i$ proximity of a standard input the system must not introduce new behaviour, i.e., it must not exhibit an output that is further than $\kappa_o$ away from the set of standard outputs.
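The two bounds can be made concrete on a toy system. The following minimal sketch assumes a finite system given as a dictionary from input words to lists of output words, all truncated to the same finite length; the names `respects_tube`, `l_robustly_clean` and `u_robustly_clean` as well as the example data are illustrative assumptions and not part of the framework itself.

```python
from typing import Callable, Dict, List, Tuple

Word = Tuple[float, ...]                 # finite prefix of an input or output word
Dist = Callable[[float, float], float]   # distance on single input/output values


def respects_tube(i: Word, i2: Word, o: Word, o2: Word,
                  d_in: Dist, d_out: Dist, kappa_i: float, kappa_o: float) -> bool:
    """Outputs must stay kappa_o-close for as long as inputs stay kappa_i-close."""
    for k in range(len(i)):
        if d_in(i[k], i2[k]) > kappa_i:
            return True    # inputs left the tube: no obligation at k or later
        if d_out(o[k], o2[k]) > kappa_o:
            return False   # inputs still close, but outputs too far apart
    return True


def l_robustly_clean(S: Dict[Word, List[Word]], std_in: List[Word],
                     d_in: Dist, d_out: Dist, kappa_i: float, kappa_o: float) -> bool:
    """Lower bound: every standard output has a close counterpart for i2."""
    return all(
        any(respects_tube(i, i2, o, o2, d_in, d_out, kappa_i, kappa_o) for o2 in S[i2])
        for i in std_in for i2 in S for o in S[i]
    )


def u_robustly_clean(S: Dict[Word, List[Word]], std_in: List[Word],
                     d_in: Dist, d_out: Dist, kappa_i: float, kappa_o: float) -> bool:
    """Upper bound: every output for i2 is close to some standard output."""
    return all(
        any(respects_tube(i, i2, o, o2, d_in, d_out, kappa_i, kappa_o) for o in S[i])
        for i in std_in for i2 in S for o2 in S[i2]
    )


def absdiff(a: float, b: float) -> float:
    return abs(a - b)


# Hypothetical system: the non-standard input may produce one extra, far-away output.
S = {
    (1.0, 2.0): [(10.0, 20.0)],
    (1.5, 2.5): [(10.5, 20.5), (50.0, 60.0)],
}
std_in = [(1.0, 2.0)]
l_clean = l_robustly_clean(S, std_in, absdiff, absdiff, 1.0, 1.0)
u_clean = u_robustly_clean(S, std_in, absdiff, absdiff, 1.0, 1.0)
```

In this toy system no standard behaviour is lost, so the lower bound holds, whereas the additional output `(50.0, 60.0)` is new behaviour outside the output tube and violates the upper bound.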
A generalisation of robust cleanness is func-cleanness. A cleanness contract for func-cleanness replaces the constants $\kappa_i$ and $\kappa_o$ by a function $f : \overline{\mathbb{R}}_{\geq 0} \to \overline{\mathbb{R}}_{\geq 0}$ inducing a dynamic threshold for output distances based on the distance between the inputs producing such outputs.
Definition 2 A nondeterministic reactive system $S$ is func-clean w.r.t. contract $C = \langle StdIn, d_{In}, d_{Out}, f \rangle$ if for every standard input $i \in StdIn$ and input sequence $i' \in In^\omega$ it is the case that
1. for every $o \in S(i)$, there exists $o' \in S(i')$, such that for every index $k \in \mathbb{N}$, $d_{Out}(o[k], o'[k]) \leq f(d_{In}(i[k], i'[k]))$,
2. for every $o' \in S(i')$, there exists $o \in S(i)$, such that for every index $k \in \mathbb{N}$, $d_{Out}(o[k], o'[k]) \leq f(d_{In}(i[k], i'[k]))$.
We will in the following refer to Definition 2.1 for l-func-cleanness and Definition 2.2 for u-func-cleanness.
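For the fairness monitor in Section 5 we will use a simpler variant of func-cleanness for deterministic sequential programs. Since such a program P is deterministic, the lower and upper bound requirements coincide, yielding the following simplified definition.
Definition 3 A deterministic sequential program $P : In \to Out$ is func-clean w.r.t. contract $C = \langle StdIn, d_{In}, d_{Out}, f \rangle$ if for every standard input $i \in StdIn$ and input $i' \in In$, it is the case that $d_{Out}(P(i), P(i')) \leq f(d_{In}(i, i'))$.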
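As a hedged illustration of the deterministic case, the sketch below reads func-cleanness as the requirement that the output distance between a standard input and any other input never exceeds f applied to their input distance. The scoring function, the distance functions, the choice of f and the applicant data are all hypothetical and only loosely inspired by Example 2; in particular, treating one ranking position like one GPA point is a simplification made for brevity.

```python
from typing import Callable, List, Tuple

Applicant = Tuple[float, float, float]   # (ranking position, GPA, test score)


def score(applicant: Applicant) -> float:
    """Hypothetical scoring procedure with a hidden jump at ranking position 100."""
    rank, gpa, test = applicant
    penalty = 30.0 if rank > 100 else 0.0
    return 10.0 * gpa + 0.5 * test - penalty


def d_in(a: Applicant, b: Applicant) -> float:
    """Assumed input distance: sum of componentwise absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def d_out(s: float, t: float) -> float:
    """Assumed output distance: absolute difference of scores."""
    return abs(s - t)


def f(distance: float) -> float:
    """Assumed contract function: scores may differ by at most 12 per unit of input distance."""
    return 12.0 * distance


def violations(P: Callable[[Applicant], float],
               std_inputs: List[Applicant],
               candidates: List[Applicant]) -> List[Tuple[Applicant, Applicant]]:
    """Return all pairs (i, i2) with d_out(P(i), P(i2)) > f(d_in(i, i2))."""
    return [(i, i2) for i in std_inputs for i2 in candidates
            if d_out(P(i), P(i2)) > f(d_in(i, i2))]


actual = (100.0, 3.5, 80.0)
neighbours = [(101.0, 3.5, 80.0), (100.0, 4.0, 80.0)]
flagged = violations(score, [actual], neighbours)
```

Here only the neighbour that differs by a single ranking position is flagged: its score drops by 30 although f permits a difference of at most 12, while the higher GPA changes the score by 5 against an allowance of 6.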

Mixed-IO System Model
The reactive execution model above has the strict requirement that for every input, the system produces exactly one output. Recent work [17,18] instead considers mixed-IO models, where a program L ⊆ (In ∪ Out) ω is a subset of traces containing both inputs and outputs, but without any restriction on the order or frequency in which inputs and outputs appear in the trace. In particular, they are not required to strictly alternate (but they may, and in this way the reactive execution model can be considered a special case [15]). A particularity of this model is the distinct output symbol δ for quiescence, i.e., the absence of an output. For example, finite behaviour can be expressed by adding infinitely many δ symbols to a finite trace.
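The new system model induces consequences regarding cleanness contracts. Every mixed-IO trace is projected into an input, respectively output domain. The set of input symbols contains one additional element $-_i$, which indicates that in the respective step an output was produced, while masking the concrete output. Similarly, the set of output symbols contains the additional element $-_o$ to mask a concrete input symbol. Projection on inputs $\downarrow_i : (In \cup Out)^\omega \to (In \cup \{-_i\})^\omega$ and projection on outputs $\downarrow_o : (In \cup Out)^\omega \to (Out \cup \{-_o\})^\omega$ are defined for all traces $\sigma \in (In \cup Out)^\omega$ and $k \in \mathbb{N}$ as follows:
$\sigma{\downarrow_i}[k] := \sigma[k]$ if $\sigma[k] \in In$, and $\sigma{\downarrow_i}[k] := -_i$ otherwise; $\sigma{\downarrow_o}[k] := \sigma[k]$ if $\sigma[k] \in Out$, and $\sigma{\downarrow_o}[k] := -_o$ otherwise.
The distance functions $d_{In}$ and $d_{Out}$ apply on input and output symbols or their respective masks, i.e., they are functions $(In \cup \{-_i\}) \times (In \cup \{-_i\}) \to \overline{\mathbb{R}}_{\geq 0}$ and $(Out \cup \{-_o\}) \times (Out \cup \{-_o\}) \to \overline{\mathbb{R}}_{\geq 0}$, respectively.
Finally, instead of a set of standard inputs StdIn, we evaluate mixed-IO system cleanness w.r.t. a set of standard behaviour $Std \subseteq L$. Thus, not only inputs, but also outputs can be defined as standard behaviour, and for an input, one of its outputs can be considered in Std while a different output can be excluded from Std. As a consequence, the set Std is specific for some mixed-IO system L, because Std is useful only if $Std \subseteq L$. To emphasise this difference we will call the tuple $C = \langle Std, d_{In}, d_{Out}, \kappa_i, \kappa_o \rangle$ (cleanness) context (instead of cleanness contract). Robust cleanness of mixed-IO systems w.r.t. such a context is defined below [18].
Definition 4 A mixed-IO system $L \subseteq (In \cup Out)^\omega$ is robustly clean w.r.t. context $C = \langle Std, d_{In}, d_{Out}, \kappa_i, \kappa_o \rangle$ if and only if $Std \subseteq L$ and for all $\sigma \in Std$ and $\sigma' \in L$,
1. there exists $\sigma'' \in L$ with $\sigma'{\downarrow_i} = \sigma''{\downarrow_i}$, such that for every index $k \in \mathbb{N}$ it holds that whenever $d_{In}(\sigma{\downarrow_i}[j], \sigma'{\downarrow_i}[j]) \leq \kappa_i$ for all $j \leq k$, then $d_{Out}(\sigma{\downarrow_o}[k], \sigma''{\downarrow_o}[k]) \leq \kappa_o$,
2. there exists $\sigma'' \in Std$ with $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$, such that for every index $k \in \mathbb{N}$ it holds that whenever $d_{In}(\sigma{\downarrow_i}[j], \sigma'{\downarrow_i}[j]) \leq \kappa_i$ for all $j \leq k$, then $d_{Out}(\sigma'{\downarrow_o}[k], \sigma''{\downarrow_o}[k]) \leq \kappa_o$.
We will in the following refer to Definition 4.1 for l-robust cleanness and Definition 4.2 for u-robust cleanness. Definition 4 universally quantifies a standard trace $\sigma$. For l-robust cleanness, the universal quantification of $\sigma'$ effectively only quantifies an input sequence; the input projection for the existentially quantified $\sigma''$ must match the projection for $\sigma'$. The remaining parts of the definition are conceptually identical to their reactive systems counterpart in Definition 1.1. For u-robust cleanness, the existentially quantified trace $\sigma''$ is obtained from the set Std, in contrast to l-robust cleanness, where $\sigma''$ can be any arbitrary trace of L. This is necessary because u-robust cleanness is defined w.r.t. a cleanness context; from knowing that $\sigma \in Std$ is a standard trace and by enforcing that $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$ we cannot conclude that also $\sigma'' \in Std$.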
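The masking projections can be illustrated on a finite trace prefix. In the sketch below, symbols are tagged pairs, the two masks and the quiescence symbol are represented by strings, and the distance on possibly masked symbols is one hypothetical choice (zero for identical symbols, infinite when a number is compared with a mask or with quiescence); none of these representation choices are prescribed by the model.

```python
import math
from typing import List, Tuple, Union

MASK_I = "-i"         # replaces an output when projecting on inputs
MASK_O = "-o"         # replaces an input when projecting on outputs
QUIESCENCE = "delta"  # output symbol for the absence of an output

Value = Union[float, str]
Symbol = Tuple[str, Value]   # ("in", value) or ("out", value)


def project_inputs(trace: List[Symbol]) -> List[Value]:
    """Keep input values, mask every output position."""
    return [value if kind == "in" else MASK_I for kind, value in trace]


def project_outputs(trace: List[Symbol]) -> List[Value]:
    """Keep output values, mask every input position."""
    return [value if kind == "out" else MASK_O for kind, value in trace]


def masked_distance(a: Value, b: Value) -> float:
    """Assumed distance on values or masks; satisfies d(x, x) = 0 and symmetry."""
    if a == b:
        return 0.0
    if isinstance(a, str) or isinstance(b, str):
        return math.inf   # a number against a mask or quiescence
    return abs(a - b)


# Inputs and outputs happen to alternate here, but the model does not require it.
trace = [("in", 50.0), ("out", 0.2), ("in", 52.0), ("out", QUIESCENCE)]
inputs = project_inputs(trace)
outputs = project_outputs(trace)
```

Both projections have the same length as the original trace, which is what allows input and output distances to be compared index by index.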
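Definition 5 shows the definition of func-cleanness of mixed-IO systems.
Definition 5 A mixed-IO system $L \subseteq (In \cup Out)^\omega$ is func-clean w.r.t. context $C = \langle Std, d_{In}, d_{Out}, f \rangle$ if and only if $Std \subseteq L$ and for all $\sigma \in Std$ and $\sigma' \in L$,
1. there exists $\sigma'' \in L$ with $\sigma'{\downarrow_i} = \sigma''{\downarrow_i}$, such that for every index $k \in \mathbb{N}$, $d_{Out}(\sigma{\downarrow_o}[k], \sigma''{\downarrow_o}[k]) \leq f(d_{In}(\sigma{\downarrow_i}[k], \sigma'{\downarrow_i}[k]))$,
2. there exists $\sigma'' \in Std$ with $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$, such that for every index $k \in \mathbb{N}$, $d_{Out}(\sigma'{\downarrow_o}[k], \sigma''{\downarrow_o}[k]) \leq f(d_{In}(\sigma{\downarrow_i}[k], \sigma'{\downarrow_i}[k]))$.
We will in the following refer to Definition 5.1 for l-func-cleanness and Definition 5.2 for u-func-cleanness.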

HyperLTL
Linear Temporal Logic (LTL) [95] is a popular formalism to reason about properties of traces. A trace is an infinite word where each literal is a subset of AP, the set of atomic propositions. We interpret programs as circuits encoded as sets C ⊆ (2 AP ) ω of such traces. LTL provides expressive means to characterise sets of traces, often called trace properties. For some set of traces T , a trace property defines a subset of T (for which the property holds), whereas a hyperproperty defines a set of subsets of T (constituting combinations of traces for which the property holds). In this way it specifies which traces are valid in combination with one another. Many temporal logics have been extended to corresponding hyperlogics supporting the specification of hyperproperties.
HyperLTL [30] is such a temporal logic for the specification of hyperproperties of reactive systems. It extends LTL with trace quantifiers and trace variables that make it possible to refer to multiple traces within a logical formula. A HyperLTL formula is defined by the following grammar, where $\pi$ is drawn from a set $\mathcal{V}$ of trace variables and $a$ from the set AP:
$\psi ::= \exists\pi.\, \psi \mid \forall\pi.\, \psi \mid \phi \qquad \phi ::= a_\pi \mid \neg\phi \mid \phi \vee \phi \mid \mathsf{X}\,\phi \mid \phi\,\mathsf{U}\,\phi$
The quantifiers $\exists$ and $\forall$ quantify existentially and universally, respectively, over the set of traces. For example, the formula $\forall\pi.\, \exists\pi'.\, \phi$ means that for every trace $\pi$ there exists another trace $\pi'$ such that $\phi$ holds over the pair of traces. To account for distinct valuations of atomic propositions across distinct traces, the atomic propositions are indexed with trace variables: for some atomic proposition $a \in AP$ and some trace variable $\pi \in \mathcal{V}$, $a_\pi$ states that $a$ holds in the initial position of trace $\pi$. The temporal operators and Boolean connectives are interpreted as usual for LTL. Further operators are derivable: $\Diamond\phi \equiv \mathsf{true}\,\mathsf{U}\,\phi$ enforces $\phi$ to eventually hold in the future, $\Box\phi \equiv \neg\Diamond\neg\phi$ enforces $\phi$ to always hold, and the weak-until operator $\phi\,\mathsf{W}\,\phi' \equiv \phi\,\mathsf{U}\,\phi' \vee \Box\phi$ allows $\phi$ to always hold as an alternative to the obligation for $\phi'$ to eventually hold.

HyperLTL Characterisations of Cleanness
D'Argenio et al. [31] assume distinct sets of atomic propositions to encode inputs and outputs. That is, they assume $AP = AP_i \cup AP_o$, where $AP_i$ and $AP_o$ are the atomic propositions that define the input values and, respectively, the output values. Thus, in the context of Boolean circuit encodings of programs, we take $In = 2^{AP_i}$ and $Out = 2^{AP_o}$. We capture the following natural correspondence between reactive programs and Boolean circuits: a circuit C can be interpreted as a function $\hat{S} : In^\omega \to 2^{(Out^\omega)}$, where
$t \in C$ if and only if $(t{\downarrow_{AP_o}}) \in \hat{S}(t{\downarrow_{AP_i}})$. (1)
In the HyperLTL formulas below, non-atomic propositions occur for convenience. Their semantics is encoded by atomic propositions and Boolean connectives according to a Boolean encoding of inputs and outputs. We refer to the original work for the details [31, Table 1]. Further, we assume that there is a quantifier-free HyperLTL formula $StdIn_\pi$ that can check whether the trace represented by trace variable $\pi$ is in the set of standard inputs $StdIn \subseteq In^\omega$. That is, $StdIn_\pi$ should be defined such that for every trace $t \in C$ it holds that $\{\pi := t\} \models_C StdIn_\pi$ if and only if $(t{\downarrow_{AP_i}}) \in StdIn$.
Proposition 1 shows HyperLTL formulas for l-robust cleanness and u-robust cleanness, respectively.
Proposition 1 Let C be a set of infinite traces over $2^{AP}$, let $\hat{S}$ be the reactive system constructed from C according to Equation (1), and let $\mathcal{C} = \langle StdIn, d_{In}, d_{Out}, \kappa_i, \kappa_o \rangle$ be a contract for robust cleanness. Then $\hat{S}$ is l-robustly clean w.r.t. $\mathcal{C}$ if and only if C satisfies the HyperLTL formula
$\forall\pi_1.\, \forall\pi_2.\, \exists\pi'_2.\ StdIn_{\pi_1} \rightarrow \big( \Box(i_{\pi_2} = i_{\pi'_2}) \wedge ((d_{Out}(o_{\pi_1}, o_{\pi'_2}) \leq \kappa_o)\ \mathsf{W}\ (d_{In}(i_{\pi_1}, i_{\pi'_2}) > \kappa_i)) \big)$,
and $\hat{S}$ is u-robustly clean w.r.t. $\mathcal{C}$ if and only if C satisfies the HyperLTL formula
$\forall\pi_1.\, \forall\pi_2.\, \exists\pi'_1.\ StdIn_{\pi_1} \rightarrow \big( \Box(i_{\pi_1} = i_{\pi'_1}) \wedge ((d_{Out}(o_{\pi'_1}, o_{\pi_2}) \leq \kappa_o)\ \mathsf{W}\ (d_{In}(i_{\pi'_1}, i_{\pi_2}) > \kappa_i)) \big)$.
The first quantifier (for $\pi_1$) in both formulas implicitly quantifies the standard input $i$ and the second quantifier (for $\pi_2$) implicitly quantifies the second input $i'$. Due to the potential nondeterminism in the behaviour of the system, the third, existential quantifier (for $\pi'_2$, respectively $\pi'_1$) is necessary. While the formula for l-robust cleanness has the universal quantification on the outputs of the program that takes the standard input $i$ and the existential quantification on the output for $i'$, the formula for u-robust cleanness works the other way around. Thus, the formulas capture the $\forall\exists$ alternation in Definition 1. The weak until operator $\mathsf{W}$ has exactly the behaviour necessary to represent the interaction between the distances of inputs and the distances of outputs.
The HyperLTL formulas for func-cleanness are given below.
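Proposition 2 Let C be a set of infinite traces over $2^{AP}$, let $\hat{S}$ be the reactive system constructed from C according to Equation (1), and let $\mathcal{C} = \langle StdIn, d_{In}, d_{Out}, f \rangle$ be a contract for func-cleanness. Then $\hat{S}$ is l-func-clean w.r.t. $\mathcal{C}$ if and only if C satisfies the HyperLTL formula
$\forall\pi_1.\, \forall\pi_2.\, \exists\pi'_2.\ StdIn_{\pi_1} \rightarrow \big( \Box(i_{\pi_2} = i_{\pi'_2}) \wedge \Box(d_{Out}(o_{\pi_1}, o_{\pi'_2}) \leq f(d_{In}(i_{\pi_1}, i_{\pi'_2}))) \big)$,
and $\hat{S}$ is u-func-clean w.r.t. $\mathcal{C}$ if and only if C satisfies the HyperLTL formula
$\forall\pi_1.\, \forall\pi_2.\, \exists\pi'_1.\ StdIn_{\pi_1} \rightarrow \big( \Box(i_{\pi_1} = i_{\pi'_1}) \wedge \Box(d_{Out}(o_{\pi'_1}, o_{\pi_2}) \leq f(d_{In}(i_{\pi'_1}, i_{\pi_2}))) \big)$.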

Signal Temporal Logic
LTL enables reasoning over traces $\sigma \in (2^{AP})^\omega$ for which it is necessary to encode values using the atomic propositions in AP. Each literal in a trace represents a discrete time step of an underlying model. Thus, $\sigma$ can equivalently be viewed as a function $\mathbb{N} \to 2^{AP}$. One extension of LTL is Signal Temporal Logic (STL) [32,74], which instead is used for reasoning over real-valued signals that may change in value along an underlying continuous time domain. In this article, we generalise the original work and use generalised timed traces (GTTs) [52], which, for some value domain X and time domain T, define traces as functions $T \to X$. The time domain T can be either $\mathbb{N}$ (discrete-time) or $\mathbb{R}_{\geq 0}$ (continuous-time). For the value domain we will use vectors of real values $X = \mathbb{R}^n$ for some $n > 0$ or, to express mixed-IO traces, the set $X = In \cup Out$. STL formulas can express properties of systems modelled as sets $M \subseteq (T \to X)$ of traces by making the atomic properties refer to booleanisations of the signal values. The syntax of the variant of STL that we use in this article is as follows, where $f \in X \to \mathbb{R}$:
$\phi ::= \top \mid f > 0 \mid \neg\phi \mid \phi \wedge \phi \mid \phi\,\mathsf{U}\,\phi$
STL replaces atomic propositions by threshold predicates of the form $f > 0$, which hold if and only if function $f$ applied to the trace value at the current time returns a positive value. The Boolean operators and the Until operator $\mathsf{U}$ are very similar to those of HyperLTL. The Next operator $\mathsf{X}$ is not part of STL, because "next" is without precise meaning in continuous time. The definitions of the derived operators $\Diamond$, $\Box$ and $\mathsf{W}$ are the same as for HyperLTL. Formally, the Boolean semantics of an STL formula $\phi$ at time $t \in T$ for a trace $w \in T \to X$ is defined inductively:
$w, t \models \top$
$w, t \models f > 0$ iff $f(w(t)) > 0$
$w, t \models \neg\phi$ iff $w, t \not\models \phi$
$w, t \models \phi \wedge \psi$ iff $w, t \models \phi$ and $w, t \models \psi$
$w, t \models \phi\,\mathsf{U}\,\psi$ iff there is $t' \geq t$ with $w, t' \models \psi$ and $w, t'' \models \phi$ for all $t'' \in [t, t')$
A system M satisfies a formula $\phi$, denoted $M \models \phi$, if and only if for every $w \in M$ it holds that $w, 0 \models \phi$.

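Algorithm 1 Monte-Carlo Markov Chain falsification (listing lost in extraction; only its closing lines "7: end if" and "8: end while" survive)
Beyond the Boolean semantics, STL has a quantitative semantics $\rho$ that assigns a real value to a formula: a positive value implies that the formula holds, a negative value that it does not hold. For an STL formula $\phi$, trace $w$, and time $t$, the quantitative semantics is defined inductively:
$\rho(\top, w, t) = \infty$
$\rho(f > 0, w, t) = f(w(t))$
$\rho(\neg\phi, w, t) = -\rho(\phi, w, t)$
$\rho(\phi \wedge \psi, w, t) = \min(\rho(\phi, w, t), \rho(\psi, w, t))$
$\rho(\phi\,\mathsf{U}\,\psi, w, t) = \sup_{t' \geq t} \min\big(\rho(\psi, w, t'), \inf_{t'' \in [t, t')} \rho(\phi, w, t'')\big)$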
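To make the robustness value concrete, here is a minimal sketch of a quantitative semantics over finite, discrete-time traces. It assumes the customary clauses (predicate value, negation as sign flip, conjunction as minimum, until as a maximum over minima), which may differ in detail from the definition used in this article; the tuple encoding of formulas and all function names are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple, Any

# Formulas are nested tuples:
#   ("true",), ("pred", f), ("not", p), ("and", p, q), ("until", p, q)
Formula = Tuple[Any, ...]


def rho(phi: Formula, w: Sequence[float], t: int) -> float:
    """Robustness of phi on the finite discrete-time trace w at time t."""
    op = phi[0]
    if op == "true":
        return math.inf
    if op == "pred":                      # threshold predicate f > 0
        return phi[1](w[t])
    if op == "not":
        return -rho(phi[1], w, t)
    if op == "and":
        return min(rho(phi[1], w, t), rho(phi[2], w, t))
    if op == "until":
        best = -math.inf
        for t2 in range(t, len(w)):
            # phi[1] has to hold at all times in [t, t2); empty range gives +inf
            before = min((rho(phi[1], w, t3) for t3 in range(t, t2)), default=math.inf)
            best = max(best, min(rho(phi[2], w, t2), before))
        return best
    raise ValueError(f"unknown operator: {op}")


def eventually(p: Formula) -> Formula:
    return ("until", ("true",), p)


def always(p: Formula) -> Formula:
    return ("not", eventually(("not", p)))


# Hypothetical requirement: the signal always stays below 5, i.e. 5 - x > 0.
below_five = always(("pred", lambda x: 5.0 - x))
r_ok = rho(below_five, [1.0, 4.0, 3.0], 0)
r_bad = rho(below_five, [1.0, 6.0, 3.0], 0)
```

The first trace satisfies the requirement with margin 1, the second violates it by 1, so the sign of the value separates satisfaction from violation while its magnitude indicates how far the trace is from the boundary.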

Robustness and Falsification
The value of the quantitative semantics can serve as a robustness estimate and as such be used to search for a violation of the property at hand, i.e., to falsify it. The robustness of STL formula $\phi$ is its quantitative value at time 0, that is, $R_\phi(w) := \rho(\phi, w, 0)$. So, falsifying a formula $\phi$ for a system M boils down to a search problem with the goal condition $R_\phi(w) < 0$. Successful falsification algorithms solve this problem by understanding it as the optimisation problem $\min_{w \in M} R_\phi(w)$. Algorithm 1 [1,86] sketches an algorithm for Monte-Carlo Markov Chain falsification, which is based on acceptance-rejection sampling [28].
The inputs to the algorithm are an initial trace w and a computable robustness function R. Robustness computation for STL formulas has been addressed in the literature [32,48]; we omit this discussion here. The third input PS is a proposal scheme that proposes a new trace to the algorithm based on the previous one (line 2). The parameter β (used in line 3) can be adjusted during the search and is a means to avoid being trapped in local minima, which would prevent finding a global minimum.
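The search loop can be sketched as follows. This is not a transcription of Algorithm 1: the sketch assumes a Metropolis-style acceptance rule in which a proposal that lowers the robustness is always accepted and one that raises it is accepted with probability exp(-β · increase). The names `falsify`, `robustness` and `propose`, the iteration budget, and the toy property are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Optional

Trace = List[float]


def falsify(w: Trace,
            robustness: Callable[[Trace], float],
            propose: Callable[[Trace, random.Random], Trace],
            beta: float = 1.0,
            max_iter: int = 10_000,
            seed: int = 0) -> Optional[Trace]:
    """Search for a trace with negative robustness; None if the budget runs out."""
    rng = random.Random(seed)
    r = robustness(w)
    for _ in range(max_iter):
        if r < 0:
            return w                              # property falsified
        candidate = propose(w, rng)               # proposal scheme PS
        r_new = robustness(candidate)
        # Always accept improvements; accept deteriorations with a probability
        # that shrinks with beta, which lets the search escape local minima.
        alpha = 1.0 if r_new <= r else math.exp(-beta * (r_new - r))
        if rng.random() <= alpha:
            w, r = candidate, r_new
    return w if r < 0 else None


def robustness(w: Trace) -> float:
    """Toy robustness of 'the signal always stays below 5'."""
    return min(5.0 - x for x in w)


def propose(w: Trace, rng: random.Random) -> Trace:
    """Toy proposal scheme: perturb one randomly chosen sample with Gaussian noise."""
    k = rng.randrange(len(w))
    return [x + rng.gauss(0.0, 1.0) if j == k else x for j, x in enumerate(w)]


witness = falsify([0.0, 0.0, 0.0], robustness, propose)
```

A larger β makes the search greedier, a smaller β makes it more exploratory; in a black-box setting, evaluating `robustness` would involve running the system under test on the proposed input.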
Notably, there exists prior work by Nguyen et al. [87] that discusses an extension of STL to HyperSTL though using a non-standard semantic underpinning. In this context, they present a falsification approach restricted to the fragment "t-HyperSTL" where, according to the authors, "a nesting structure of temporal logic formulas involving different traces is not allowed". Therefore, none of our cleanness definitions belongs to this fragment.

Logical Characterisation of Mixed-IO Cleanness
In this section we provide a temporal logic characterisation for robust cleanness and func-cleanness for mixed-IO systems. For this, we propose a HyperSTL semantics (different to that of [87]) and propose HyperSTL formulas for robust cleanness and func-cleanness. We explain how these formulas can be applied to mixed-IO traces and prove that the characterisation is correct. Furthermore, for the special case that Std is a finite set, we reformulate the HyperSTL formulas characterising the u-cleannesses as equivalent STL formulas.

Hyperlogics over Continuous Domains
Previous work [87] extends STL to HyperSTL echoing the extension of LTL to HyperLTL. We use a similar HyperSTL syntax in this article: The meaning of the universal and existential quantifier is as for HyperLTL. In contrast to HyperLTL (and to the existing definition of HyperSTL), we consider it insufficient to allow propositions to refer to only a single trace. In HyperLTL atomic propositions of individual traces can be compared by means of the Boolean connectives. To formulate thresholds for real values, however, we feel the need to allow real values from multiple traces to be combined in the function f , and thus to appear as arguments of f . Hence, in our semantics of HyperSTL, f > 0 holds if and only if the result of f , applied to all traces quantified over, is greater than 0. For this to work formally, the arity of function f is the number m of traces quantified over at the occurrence of f > 0 in the formula, so f : X m → R. A trace assignment [30] Π : V → M is a partial function assigning traces of M to variables. Let Π[π := w] denote the same function as Π, except that π is mapped to trace w. The Boolean semantics of HyperSTL is defined below. Definition 6 Let ψ be a HyperSTL formula, t ∈ T a time point, M ⊆ (T → X) a set of GTTs, and Π a trace assignment. Then, the Boolean semantics for M, Π, t |= ψ is defined inductively: A system M satisfies a formula ψ if and only if M, ∅, 0 |= ψ. The quantitative semantics for HyperSTL is defined below: Definition 7 Let ψ be a HyperSTL formula, t ∈ T a time point, M ⊆ (T → X) a set of GTTs, and Π a trace assignment. Then, the quantitative semantics for ρ(ψ, M, Π, t) is defined inductively:
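A minimal sketch of how such a semantics might be evaluated for a finite set of equal-length, discrete-time traces: it assumes the common quantitative reading in which a universal quantifier takes the minimum and an existential quantifier the maximum over all traces of M, and in which a threshold predicate applies its function to the current values of several quantified traces at once. The combinator names and the example formula are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence

Trace = Sequence[float]
Assignment = Dict[str, Trace]                       # trace assignment Pi
Rho = Callable[[List[Trace], Assignment, int], float]


def pred(f: Callable[..., float], variables: List[str]) -> Rho:
    """Threshold predicate f > 0 whose arguments come from several traces."""
    return lambda M, Pi, t: f(*(Pi[v][t] for v in variables))


def always(body: Rho) -> Rho:
    """Globally operator on finite traces: minimum over the remaining time points."""
    return lambda M, Pi, t: min(body(M, Pi, t2) for t2 in range(t, len(M[0])))


def forall(var: str, body: Rho) -> Rho:
    """Universal trace quantifier: worst case over all traces of M."""
    return lambda M, Pi, t: min(body(M, {**Pi, var: w}, t) for w in M)


def exists(var: str, body: Rho) -> Rho:
    """Existential trace quantifier: best case over all traces of M."""
    return lambda M, Pi, t: max(body(M, {**Pi, var: w}, t) for w in M)


# Hypothetical hyperproperty: for every trace p1 there is a trace p2 that always
# stays above p1 - 0.5, written as the binary predicate p2 - p1 + 0.5 > 0.
M = [(0.0, 0.0, 0.0), (0.5, 1.0, 0.5)]
formula = forall("p1", exists("p2", always(pred(lambda a, b: b - a + 0.5, ["p1", "p2"]))))
value = formula(M, {}, 0)
```

Starting from the empty assignment mirrors the satisfaction condition M, ∅, 0 |= ψ; the predicate here has arity two because two traces are quantified at its occurrence.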

HyperSTL Characterisation
The HyperLTL characterisations in Section 2.2.1 assume the system to be a subset of (2^AP)^ω and work with distances between traces by means of a Boolean encoding into atomic propositions. By using HyperSTL, we can characterise cleanness for systems that are representable as subsets of (T → X).
We can take the HyperLTL formulas from Propositions 1 and 2 and transform them into HyperSTL formulas by applying simple syntactic changes. We obtain formula (2) for l-robust cleanness, formula (3) for u-robust cleanness, formula (4) for l-func-cleanness, and, finally, formula (5) for u-func-cleanness. The quantifiers remain unchanged relative to the formulas in Propositions 1 and 2. The formulas use generic projection functions ↓ i : X → In and ↓ o : X → Out to extract the input values and output values, respectively, from a trace. To apply the formulas, these functions must be instantiated with functions for the concrete instantiation of the value domain X of the traces to be analysed. For example, for In = R^m, Out = R^l, and M ⊆ (T → R^(m+l)), the projections could be defined for every w = (s 1 , . . . , s m , s m+1 , . . . , s m+l ) as w↓ i = (s 1 , . . . , s m ) and w↓ o = (s m+1 , . . . , s m+l ). The input equality requirement for two traces π and π ′ is ensured by globally enforcing eq(π↓ i , π ′ ↓ i ) ≤ 0. Here, eq is a generic function that returns zero if its arguments are identical and a positive value otherwise. It must be instantiated for concrete value domains. For example, eq((s 1 , . . . , s m ), (s ′ 1 , . . . , s ′ m )) could be defined as the sum of the componentwise distances Σ 1≤i≤m |s i − s ′ i |. Finally, in the above formulas we perform simple arithmetic operations to match the syntactic requirements of HyperSTL.
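The example instantiation of the projections and of eq for In = R^m and Out = R^l can be sketched in a few lines of Python. This is only an illustration under the assumption that a sample of a trace is stored as a flat tuple of m + l numbers; the function names project_inputs, project_outputs, and eq_sum are hypothetical and not part of the original formalisation.

```python
from typing import Sequence, Tuple


def project_inputs(sample: Sequence[float], m: int) -> Tuple[float, ...]:
    """Input projection: the first m components of a sample."""
    return tuple(sample[:m])


def project_outputs(sample: Sequence[float], m: int) -> Tuple[float, ...]:
    """Output projection: the remaining components of a sample."""
    return tuple(sample[m:])


def eq_sum(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of componentwise distances; zero iff both vectors are identical."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


# One sample with m = 2 input components and l = 3 output components.
sample = (1.0, 2.0, 3.0, 4.0, 5.0)
inputs = project_inputs(sample, 2)
outputs = project_outputs(sample, 2)
```

The input equality requirement then amounts to checking eq_sum(...) ≤ 0 at every time point of the two traces being compared.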
Formulas (3) and (5) are prepared to express u-robust cleanness, respectively u-func-cleanness w.r.t. both cleanness contracts or cleanness contexts. That is, we assume the existence of a function Std π that returns a positive value if and only if the trace assigned to π encodes a standard input (when considering cleanness contracts) or encodes an input and output that constitute a standard behaviour (when considering cleanness contexts). Explicitly requiring that π ′ 1 represents a standard behaviour echoes the setup in Definitions 4.2 and 5.2.
We remark that for encoding Std π , due to the absence of the Next-operator in HyperSTL, it might be necessary to add a clock signal s(t) = t to traces in a preprocessing step.
The HyperSTL formulas ψ l-rob and ψ u-rob reason about sets of traces. For example, the set M = {w 0 , w 1 , w A , w B } satisfies both formulas. If both π 1 and π 2 represent standard traces, then π 1 ↓ i = π 2 ↓ i , because w 0 ↓ i = w 1 ↓ i , and the formulas hold for π ′ 2 = π 1 , respectively π ′ 1 = π 2 . Otherwise, assume that π 1 represents w 0 and π 2 represents w B (the reasoning for other combinations of traces is similar).
First considering ψ l-rob , we pick w A for π ′ 2 . We get that the inputs of π 2 and π ′ 2 are identical and, thus, eq(π 2 ↓ i , π ′ 2 ↓ i ) = 0. At time steps 0 ≤ t ≤ 3, the distance between the outputs |w 0 ↓o(t) − w A ↓o(t)| is at most κo. Hence, the left operand of W holds and the formula is satisfied for t ≤ 3.
Hence, the right operand of the W operator holds and ψ l-rob is satisfied also for t ≥ 3. Notice that if we removed w A from M, then M would violate ψ l-rob , because there is no possible choice for π ′ 2 that has the same inputs as w B and where the output distances to w 0 are below the κo threshold.
To satisfy ψ u-rob , we pick w 1 for π ′ 1 . The reasoning why the formula holds for this choice is analogous to ψ l-rob . Notice that if we add the trace w to M, then ψ u-rob is violated. Concretely, π 2 could represent w ; then, whether we pick w 0 or w 1 for π ′ 1 , we eventually get outputs that violate the κo constraint, while the κ i constraint is always satisfied. For example, if we compare w and w 1 , then we have for all time steps t ≤ 3 that Hence, at t = 3 the left and right operand of W are false, so ψ u-rob is violated.

Correctness under Mixed-IO Interpretation
Mixed-IO signals are defined in the discrete time domain N and value domain In ∪ Out. The abstract functions ↓ i and ↓ o can be defined in the same way as the syntactically identical projection functions for mixed-IO models defined in Section 2.1. The function eq(i 1 , i 2 ) can be defined using the distance function d In and some arbitrarily small ε > 0: eq(i 1 , i 2 ) = 0 if i 1 = i 2 , and eq(i 1 , i 2 ) = d In (i 1 , i 2 ) + ε otherwise. In the second clause of the above definition we add some positive value ε to the result of d In , because d In (i 1 , i 2 ) could be 0 even if i 1 ≠ i 2 . For the correctness of the above HyperSTL formulas, however, it is crucial that eq(i 1 , i 2 ) = 0 if and only if i 1 = i 2 . For good performance of the falsification algorithm, we will nevertheless want to make use of d In if i 1 ≠ i 2 . Proposition 3 shows that HyperSTL formulas (2) and (3) under the mixed-IO interpretation outlined above indeed characterise l-robust cleanness and u-robust cleanness. Proposition 4 shows the same for func-cleanness.
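The role of ε in the definition of eq can be illustrated with a small Python sketch. It assumes that inputs can be compared with == and that d In is given as a Python callable; the helper make_eq and the toy distance d_first are hypothetical names introduced only for this illustration.

```python
from typing import Any, Callable


def make_eq(d_in: Callable[[Any, Any], float],
            epsilon: float = 1e-6) -> Callable[[Any, Any], float]:
    """Build eq from an input distance d_in and a small positive epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    def eq(i1: Any, i2: Any) -> float:
        # First clause: identical inputs yield exactly zero.
        if i1 == i2:
            return 0.0
        # Second clause: distinct inputs yield a strictly positive value,
        # even when d_in itself cannot tell them apart.
        return d_in(i1, i2) + epsilon

    return eq


# Toy distance that only looks at the first component, so it can be 0
# for two different inputs.
def d_first(a, b):
    return abs(a[0] - b[0])


eq = make_eq(d_first, epsilon=0.5)
```

Because the result still grows with d In for distinct inputs, a falsifier that minimises a robustness value keeps receiving a useful gradient, which is the motivation given above for not using a plain 0/1 indicator.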

STL Characterisation for Finite Standard Behaviour
In many practical settings (when the different standard behaviours are spelled out upfront explicitly, as in NEDC and WLTC), it can be assumed that the number of distinct standard behaviours Std is finite (while there are infinitely many possible behaviours in M). Finiteness of Std makes it possible to remove by enumeration the quantifiers from the u-robust cleanness and u-func-cleanness HyperSTL formulas. This opens the way to work with the STL fragment of HyperSTL, after proper adjustments. In the following, we assume that the set Std = {w 1 , . . . , w c } is an arbitrary standard set with c unique standard traces, where every w k : T → X uses the same time domain T and value domain X.
To encode the HyperSTL formulas (3) and (5) in STL, we use the concept of self-composition, which has proven useful for the analysis of hyperproperties [9,50]. We concatenate a trace under analysis w : T → X and the standard traces w 1 , . . . , w c into the composed trace w + = (w, w 1 , . . . , w c ), and denote by M • Std = {(w, w 1 , . . . , w c ) | w ∈ M} the system in which every trace in M is composed with the standard traces in Std. For every w + ∈ M • Std, we will in the following STL formula write w to mean the projection of w + to the trace w, and we write w k , for 1 ≤ k ≤ c, to mean the projection of w + to the kth standard trace.
The theorem for u-func-cleanness is analogous to Theorem 5.
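The self-composition step can be made concrete with a short Python sketch. It assumes finite traces of equal length that are stored as lists, which is a simplification of the signals T → X used in the text; the names compose and project are hypothetical.

```python
from typing import List, Sequence, Tuple


def compose(trace: Sequence, std_traces: Sequence[Sequence]) -> List[Tuple]:
    """Pair every sample of the trace with the samples of all standard traces."""
    length = len(trace)
    if any(len(std) != length for std in std_traces):
        raise ValueError("all traces must share the same time domain")
    return [(trace[t],) + tuple(std[t] for std in std_traces)
            for t in range(length)]


def project(composed: Sequence[Tuple], k: int) -> List:
    """k = 0 yields the trace under analysis, k >= 1 the k-th standard trace."""
    return [sample[k] for sample in composed]


w = [0, 1, 2]
std = [[1, 2, 3], [0, 1, 2]]
w_plus = compose(w, std)
```

An STL monitor can then be run on the single composed trace w_plus, reading the component with index 0 as w and the component with index k as the kth standard trace.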

Example 4
We consider the robust cleanness context C = ⟨Std, d In , d Out , κ i , κo⟩ where Std = {w 1 , w 2 } contains the two standard traces w 1 = 1 i 2 i 3 i 7o 0 i δ ω and w 2 = 0 i 1 i 2 i 3 i 6o δ ω . We here decorate inputs with index i and outputs with index o, i.e., w 1 describes a system receiving the three inputs 1, 2, and 3, then producing the output 7, and finally receiving input 0 before entering quiescence. We take otherwise.
The contractual value thresholds are assumed to be κ i = 1 and κo = 6. Assume we are observing the trace w = 0 i 1 i 2 i 6o 0 i δ ω to be monitored with STL formula φ u-rob (from Lemma 10). First notice, that for combinations of a and b in For the first conjunct, the input distance between inputs in w and w 1 is always 1 at positions 1 to 3, it is 0 at position 4 (because - i is compared to - i ), and remains 0 in position 5 and beyond. Thus, d In (w 1 ↓ i , w↓ i ) − κ i is always at most 0, and the right-hand side of the W operator is always false. Consequently, by definition of W, the left operand of W must always hold, i.e., d Out (w 1 ↓o, w↓o) must always be less than or equal to 6. This is the case for w 1 and w: at all positions except for 4, -o is compared to -o (or δ to δ), so the difference is 0, and at position 4, the distance of 6 and 7 is 1.
For the second W-formula, w is compared to w 2 . These two traces are comparable only to a limited extent: the order of input and output is altered at the last two positions of the signals before quiescence. Hence, the right operand of W is true at position 4, and the formula holds for the remaining trace. For positions 1 to 3, the input distances are 0, because the input values are identical. At these positions, the left operand must hold. The values are input values, so -o is compared to -o at each position. This distance is defined to be 0, so it holds that −6 ≤ 0, and the formula is satisfied. Since both formulas hold, the conjunction of both holds, too, and trace w is qualified as robustly clean. There could however be other system traces not considered in this example, that overall could violate robust cleanness of the system.
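The two W-subformulas discussed in Example 4 can be replayed mechanically. The following Python sketch is not the full formula φ u-rob of Lemma 10; it only checks, per standard trace, that the output distance stays within κo until the input distance exceeds κ i . The encoding of symbols and the distance function are assumptions chosen to match the reasoning in the example: a missing input or output is represented by a placeholder, two placeholders (or two δ) have distance 0, and a value compared with a placeholder has infinite distance.

```python
import math

DASH = None        # placeholder "-" for a position without input/output
DELTA = "delta"    # quiescence


def inp(v):
    return ("i", v)


def out(v):
    return ("o", v)


def proj_in(sym):
    """Input projection: inputs keep their value, everything else is a placeholder."""
    return sym[1] if sym != DELTA and sym[0] == "i" else DASH


def proj_out(sym):
    """Output projection: outputs keep their value, delta stays delta."""
    if sym == DELTA:
        return DELTA
    return sym[1] if sym[0] == "o" else DASH


def dist(a, b):
    """Assumed distance: |a - b| on numbers, 0 on equal placeholders, else infinity."""
    if a == b and (a is DASH or a == DELTA):
        return 0.0
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return float(abs(a - b))
    return math.inf


def weak_until_holds(w_std, w, kappa_i, kappa_o):
    """(d_Out - kappa_o <= 0) W (d_In - kappa_i > 0) on a finite prefix."""
    for s, x in zip(w_std, w):
        if dist(proj_in(s), proj_in(x)) - kappa_i > 0:
            return True    # right operand holds: obligation is released
        if dist(proj_out(s), proj_out(x)) - kappa_o > 0:
            return False   # left operand violated before release
    return True


w1 = [inp(1), inp(2), inp(3), out(7), inp(0), DELTA]
w2 = [inp(0), inp(1), inp(2), inp(3), out(6), DELTA]
w = [inp(0), inp(1), inp(2), out(6), inp(0), DELTA]
clean = all(weak_until_holds(std, w, 1, 6) for std in (w1, w2))
```

Under these assumptions the check succeeds for both standard traces, in line with the conclusion of the example; the comparison with w2 is released at position 4, where an input meets an output.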

Restriction of input space
Robust cleanness puts semantic requirements on fragments of a system's input space, outside of which the system's behaviour remains unspecified. Typically, the fragment of the input space covered is rather small. To falsify the STL formula φ u-rob from Lemma 10, the falsifier has two challenging tasks. First, it has to find a way to stay in the relevant input space, i.e., select inputs with a distance of at most κ i from the standard behaviour. Only if this is assured can it search for an output large enough to violate the κ o requirement. In this, a large robustness estimate provided by the quantitative semantics of STL cannot serve as an indicator for deciding whether an input is too far off or whether an output stays too close to the standard behaviour. We can improve the efficiency of the falsification process significantly by narrowing upfront the input space the falsifier uses.
In practice, test execution traces will always be finite. In previous real-life doping tests, test execution lengths have been bounded by some constant B ∈ N [18], i.e., systems are represented as sets of finite traces M ⊆ (In ∪ Out) B (which for formality reasons each can be considered suffixed with δ ω ). In this bounded horizon, we can provide a predicate discriminating between relevant and irrelevant input sequences. Formally, the restriction to the relevant input space fragment of a system M ⊆ (In ∪ Out) B is given by the set In Std,κ i , which contains the input sequences that stay within distance κ i of the standard behaviour. There are rare cases in which this optimisation may prevent the falsifier from finding a counterexample. This is only the case if there is an input prefix leading to a violation of the formula for which there is no suffix such that the whole trace satisfies the κ i constraint. Below is a pathological example in which this could make a difference.
Example 5 Apart from NOx emissions, NEDC (and WLTC) tests are used to measure fuel consumption. Consider a contract similar to the contracts above, but with fuel rate as the output quantity. Assuming a "normal" fuel rate behaviour during the standard test, there might be a test within a reasonable κ i distance in which an excessive amount of fuel is wasted. Then, the fuel tank might run empty before the intended end of the test, which therefore could not be finished within the κ i distance, because speed would be constantly 0 at the end. The actually driven test is not in set In Std,κ i , but there is a prefix within κ i distance that violates the robust cleanness property.
Notably, there may be additional techniques to reduce the size of the input space. For example, if the next input symbol depends on the history of inputs, this constraint could be considered in the proposal scheme.
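The restriction of the input space discussed in this section can be sketched as a membership predicate plus a sampler that never leaves the relevant fragment. The sketch assumes a single standard input sequence of non-negative speed values and the absolute difference as input distance; clamping at 0 and the uniform proposal are illustrative choices, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def in_restricted_space(inputs: Sequence[float],
                        std_inputs: Sequence[float],
                        kappa_i: float,
                        tol: float = 1e-9) -> bool:
    """True iff every input stays within kappa_i of the standard input."""
    return len(inputs) == len(std_inputs) and all(
        abs(x - s) <= kappa_i + tol for x, s in zip(inputs, std_inputs))


def sample_restricted(std_inputs: Sequence[float],
                      kappa_i: float,
                      rng: random.Random) -> List[float]:
    """Draw one input sequence from the kappa_i band around the standard.

    Speeds are clamped at 0, which keeps them inside the band as long as
    the standard speeds themselves are non-negative.
    """
    return [max(0.0, s + rng.uniform(-kappa_i, kappa_i)) for s in std_inputs]


rng = random.Random(0)
std_cycle = [0.0, 10.0, 50.0, 120.0]
candidate = sample_restricted(std_cycle, 15.0, rng)
```

A falsifier that proposes inputs only through such a sampler no longer has to spend iterations on finding the relevant input space and can concentrate on the output threshold.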

Supervision of Diesel Emission Cleaning Systems
The severity of the diesel emissions scandal showed that the regulations alone are insufficient to prevent car manufacturers from implementing tampered (or doped) emission cleaning systems. Recent work [18] shows that robust cleanness is a suitable means to extend the precisely defined behaviour of cars for the NEDC to test cycles within a κ i range around the NEDC. To demonstrate the usefulness of robust cleanness, the essential details of the emission testing scenario were modelled: the set of inputs is the set of speed values, and an output value represents the amount of emissions, in particular the nitric oxide (NO x ) emissions, measured at the exhaust pipe of a car. The distance functions are the absolute differences of speed and NO x values, respectively, and the standard behaviour is the singleton set that contains a trace that consists of the inputs that define the test cycle followed by the average amount of NO x gas measured during the test. Thus, formally, we get In = R, Out = R, Std = {NEDC · o}, and d In and d Out as defined in Example 4 [18]. The STL formulas developed in the previous section, combined with the probabilistic falsification approach, give rise to further improvements to the existing testing-based work [18] on diesel doping detection.
To use the falsification algorithm in Algorithm 1, we implement the restriction of the input space to In {NEDC·o},κ i as explained in Section 3. With this restriction the STL formula φ u-rob from Lemma 10 can be simplified to formula (7), which globally requires d Out (w 1 ↓ o , w↓ o ) − κ o ≤ 0 for the single standard trace w 1 = NEDC · o. This is because the conjunction and disjunction over standard traces becomes obsolete for only a single standard trace. For the same reason, the requirement (eq(w a ↓ i , w b ↓ i ) ≤ 0) becomes obsolete, as the compared traces are always identical. In the W subformula, the right proposition is always false, because of the restricted input space. We implemented Algorithm 1 for the robustness computation according to formula (7).
In practice, running tests like NEDC with real cars is a time-consuming and expensive endeavour. Furthermore, rental companies usually prohibit carrying out tests on chassis dynamometers with rented cars. On the other hand, car emission models for simulation are not available to the public, and models provided by the manufacturer cannot be considered trustworthy. To carry out our experiments, we instead use an approximation technique that estimates the amount of NO x emissions of a car along a certain trajectory based on data recorded during previous trips with the same car, sampled at a frequency of 1 Hz (one sample per second). Notably, these trips do not need to have much in common with the trajectory to be approximated. A trip is represented as a finite sequence ϑ ∈ (R × R × R) * of triples, where each such triple (v, a, n) represents the speed, the acceleration, and the (absolute) amount of NO x emitted at a particular time instant in the sample. Speed and acceleration can be considered as the main parameters influencing the instant emission of NO x . This is, for instance, reflected in the regulation [66,122] where the decisive quantities to validate test routes for real-world driving emissions tests on public roads are speed and acceleration.
A recording D is the union of finitely many trips ϑ. We can turn such a recording into a predictor P of the NO x values given pairs of speed and acceleration as follows: the amount of NO x assigned to a pair (v, a) is the average of all NO x values seen in the recording D for v ± ℓ and a ± ℓ, with 0 ≤ ℓ ≤ 2. To overcome measurement inaccuracies and to increase the robustness of the approximated emissions, the speed and acceleration may deviate by up to 2 km/h and 2 m/s², respectively. This tolerance is adopted from the official NEDC regulation [126], which allows up to 2 km/h of deviations while driving the NEDC.
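A predictor of this kind can be approximated in a few lines of Python. The sketch below reads the description as a window average: it averages the NO x values of all recorded samples whose speed and acceleration both lie within the tolerance of the queried pair. This reading, the handling of queries without any matching sample (returning None), and the name make_predictor are assumptions made for illustration and are not taken from the original implementation.

```python
from statistics import mean
from typing import Callable, Iterable, Optional, Tuple

Sample = Tuple[float, float, float]  # (speed in km/h, acceleration in m/s^2, NOx)


def make_predictor(recording: Iterable[Sample],
                   tolerance: float = 2.0
                   ) -> Callable[[float, float], Optional[float]]:
    """Turn a recording into a function (v, a) -> estimated NOx value."""
    samples = list(recording)

    def predict(v: float, a: float) -> Optional[float]:
        # Collect NOx values of all samples close to the queried (v, a).
        matches = [n for (v_r, a_r, n) in samples
                   if abs(v_r - v) <= tolerance and abs(a_r - a) <= tolerance]
        # No recorded sample nearby: the predictor has no estimate.
        return mean(matches) if matches else None

    return predict


recording = [(50.0, 0.0, 10.0), (51.0, 1.0, 20.0), (80.0, 0.0, 100.0)]
predict = make_predictor(recording)
```

Applying such a predictor to every (speed, acceleration) pair of a test cycle and aggregating the results gives an emission estimate for the whole cycle without driving it.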
To demonstrate the practical applicability of our implementation of Algorithm 1 and our NO x approximation, we report here on two experiments with an Audi A6 Avant Diesel admitted in June 2020 and with its successor admitted in 2021. We will refer to the former as car A20 and to the latter as car A21. We used the app LolaDrives to perform in total six low-cost RDE tests (two with A20 and four with A21) and recorded the data received from the cars' diagnosis ports. The raw data is available on Zenodo [14]. Using the emissions predictor proposed above we estimate that for an NEDC test A20 emits 86 mg/km of NO x and that A21 emits 9 mg/km. Car A20 has previously been falsified w.r.t. the RDE specification. Neither A20 nor A21 has been falsified w.r.t. robust cleanness.
Before turning to falsification, we spell out meaningful contexts for robust cleanness. We identified suitable In, Out, Std, d In , and d Out at the beginning of the section. For κ i , it turned out that κ i = 15 km/h is a reasonable choice, as it leaves enough flexibility for human-caused driving mistakes and intended deviations [18]. The threshold for NO x emissions under lab conditions is 80 mg/km. The emission limits for RDE tests depend on the admission date of the car. Cars admitted in 2020 or earlier must emit 168 mg/km at most, and cars admitted later must adhere to the limit of 120 mg/km. For our experiments, we use κ o = 88 mg/km for A20 and κ o = 40 mg/km for A21 to have the same tolerances as for RDE tests. Effectively, the upper threshold for A20 is 86 + 88 = 174 mg/km, and for A21 the limit is 9 + 40 = 49 mg/km. Notice that for software doping analysis, the output observed for a certain standard behaviour and the constant κ o define the effective threshold; this threshold is typically different from the thresholds defined by the regulation.
We modified Algorithm 1 by adding a timeout condition: if the algorithm is not able to find a falsifying counterexample within 3,000 iterations, it terminates and returns both the trace for which the smallest robustness has been observed and its corresponding robustness value. Hence, if falsification of robust cleanness for a system is not possible, the algorithm outputs an upper bound on how robustly the system satisfies robust cleanness.
For the concrete case of the diesel emissions, the robustness value during the first 1180 inputs (sampled from the restricted input space In Std,κ i ) is always κ o . When the NEDC output o NEDC and the non-standard output o are compared, the robustness value is κ o − |o NEDC − o| (cf. eq. (7), the quantitative semantics of STL, and the definition of d Out ). Hence, for test cycles with small robustness values, we get NO x emissions o that are either very small or very large compared to o NEDC . We ran the modified Algorithm 1 on A20 and A21 for the contexts defined above. For A20, it found a robustness value of −8, i.e., it was able to falsify robust cleanness relative to the assumed contract and found a test cycle for which NO x emissions of 182 mg/km are predicted. The test cycle is shown in Figure 2. For A21, the smallest robustness estimate found (even after 100 independent executions of the algorithm) was 38, i.e., A21 is predicted to satisfy robust cleanness with a very high robustness estimate. The corresponding test cycle is shown in Figure 3.
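The robustness value κ o − |o NEDC − o| and the iteration budget can be combined in a small Python sketch. The search loop below is a plain random search with a budget, used here only as a stand-in; it is not the probabilistic falsification procedure of Algorithm 1. The callables sample_cycle and predict_output are hypothetical placeholders for a proposal scheme and an emissions predictor.

```python
from typing import Callable, Optional, Sequence, Tuple


def robustness(o_std: float, o: float, kappa_o: float) -> float:
    """Robustness of the output comparison: kappa_o - |o_std - o|."""
    return kappa_o - abs(o_std - o)


def falsify(sample_cycle: Callable[[], Sequence[float]],
            predict_output: Callable[[Sequence[float]], float],
            o_std: float,
            kappa_o: float,
            max_iterations: int = 3000
            ) -> Tuple[Optional[Sequence[float]], float]:
    """Return the cycle with the smallest robustness seen within the budget."""
    best_cycle, best_rob = None, float("inf")
    for _ in range(max_iterations):
        cycle = sample_cycle()
        rob = robustness(o_std, predict_output(cycle), kappa_o)
        if rob < best_rob:
            best_cycle, best_rob = cycle, rob
        if rob < 0:
            break  # negative robustness: counterexample found
    return best_cycle, best_rob


# Values reported for A20: 86 mg/km on the NEDC, 182 mg/km on the found cycle.
a20 = robustness(86, 182, 88)
```

With the numbers reported for A20, the sketch reproduces the robustness value of −8; a non-negative result returned after the budget is exhausted corresponds to the upper bound described above.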

On Doping Tests for Cyber-physical Systems
The proposed probabilistic falsification approach to find instances of software doping needs several hundreds of iterations. This is problematic for testing real-world cyber-physical systems (CPS) to which inputs cannot be passed in an automated way. To conduct a test with a car, for example, the input to the system is a test cycle that is passed to the vehicle by driving it. Notably, we consider here the scenario that the CPS is tested by an entity that is different We propose the following integrated testing approach for effective doping tests of cyber-physical systems. The big picture is provided in Figure 4. In a first step, the CPS is used under real-world conditions without enforcing any specific constraints on the inputs to the system. For all executions, the inputs and outputs are recorded. So, essentially, the system can be used as it is needed by the user, but all interactions with it are recorded. From these recordings, a model can be learned that for arbitrary inputs (whether they were covered in the recorded data or not) predicts the output of the system. Such learning can be as simple as using statistics as we did for the emissions example above, or as complex as using deep neural nets. For the learned model, the probabilistic falsification algorithm computes a test input that falsifies it; inputs to this model can be passed automatically and an output is produced almost instantly. The resulting input serves as an input for the real CPS. If the prediction was correct, the real system is falsified, too. If it was incorrect, the learned model can be refined and the process starts again.
For diesel emissions, the first part of this integrated testing approach has been carried out as part of the work reported in this article. We leave the second part (evaluating the generated test traces from Figures 2 and 3 with a real car) for future work.
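The integrated testing approach can be summarised as a loop over learning, falsifying the learned model, and confirming the result on the real system. The following Python sketch only fixes this control flow; every callable passed to it (learn_model, falsify_model, run_on_real_system, is_violation) is a hypothetical placeholder, and the bound on the number of rounds is an added assumption to guarantee termination.

```python
from typing import Any, Callable, List, Optional, Tuple

Recording = List[Tuple[Any, Any]]  # pairs of (input, observed output)


def integrated_doping_test(recordings: Recording,
                           learn_model: Callable[[Recording], Callable],
                           falsify_model: Callable[[Callable], Optional[Any]],
                           run_on_real_system: Callable[[Any], Any],
                           is_violation: Callable[[Any, Any], bool],
                           max_rounds: int = 5) -> Optional[Any]:
    """Return a test input confirmed on the real system, or None."""
    for _ in range(max_rounds):
        model = learn_model(recordings)        # learn from recorded usage
        candidate = falsify_model(model)       # cheap: runs on the model only
        if candidate is None:
            return None                        # model could not be falsified
        observed = run_on_real_system(candidate)  # expensive: real test
        if is_violation(candidate, observed):
            return candidate                   # prediction confirmed
        # Prediction was wrong: keep the new observation and refine.
        recordings = recordings + [(candidate, observed)]
    return None
```

Only the call to run_on_real_system involves the physical system, so the number of costly real tests is bounded by the number of rounds rather than by the number of falsification iterations.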

Technical Context
Software doping theory provides a formal basis for enlarging the requirements on vehicle exhaust emissions beyond too narrow lab test conditions. That conceptual limitation has by now been addressed by the official authorities responsible for car type approval [122,125]: The old NEDC-based test procedure is replaced by the newer Worldwide Harmonised Light Vehicles Test Procedure (WLTP), which is deemed to be more realistic. WLTP replaces the NEDC test by a new WLTC test, but WLTC still is just a single test scenario. In addition, WLTP embraces so-called Real Driving Emissions (RDE) tests to be conducted on public roads. A recently launched mobile phone app [19,21], LolaDrives, harnesses runtime monitoring technology for making low-cost RDE tests accessible to everyone.
Learning or approximating the behaviour of a system under test has been studied intensively. Meinke and Sindhu [80] were among the first to present a testing approach incrementally learning a Kripke structure representing a reactive system. Volpato and Tretmans [128] propose a learning approach which gradually refines an under- and over-approximation of an input-output transition system representing the system under test. The correctness of this approach needs several assumptions, e.g., an oracle indicating when, for some trace, all outputs, which extend the trace to a valid system trace, have been observed.

Individual Fairness of Systems Evaluating Humans
Example 2 introduces a new application domain for cleanness definitions. Unica uses an AI system that is supposed to assist her with the selection of applicants for a hypothetical university. Cleanness of such a system can be related to the fair treatment of the humans that are evaluated by it. A usable fairness analysis can happen no later than at runtime, since Unica needs to make a timely decision on whether to include the applicant in further considerations. We describe technical measures that help in mitigating this challenge by providing her with information from an individual fairness analysis in a suitable, purposeful, expedient way. To this end, we propose a formal definition for individual fairness extending the one by [34] and based on func-cleanness. We develop a runtime monitor that analyses every output of P immediately after P's decision, which strategically searches for unfair treatment of a particular individual by comparing them to relevant hypothetical alternative individuals so as to provide a fairness assessment in a timely manner. Much like P is to support Unica, AI systems (in the broadest sense of the word) more and more often support human decision makers. Undoubtedly, such systems should be compliant with applicable law (such as the future European AI Act [39,40] or the Washington State facial recognition law [130]) and ought to minimise any risks to health, safety or fundamental rights. Sometimes, we cannot mitigate all these risks in advance by technical measures, and some risk mitigation requires trade-off decisions involving features that are either impossible or difficult to operationalise and formalise. This is why it is essential that a human effectively oversees the system (which is also emphasised by several institutions such as UNESCO [127] and the European High Level Expert Group [58]).
Effective human oversight, however, is only possible with the appropriate technical measures that allow human overseers to better understand the system at runtime [69]. From a technical point of view, this raises the pressing question of what such technical measures can and ought to look like to actually enable humans to live up to these responsibilities. Our contribution is intended to bridge the gap between the normative expectations of law and society and the current reality of technological design.

Positioning within Related Research Topics
Our contribution draws on and adds to three vibrant topics of current research, namely Explainable AI (XAI), AI fairness, and discrimination.

XAI
Many of the most successful AI systems today are some kind of black boxes [11]. Accordingly, the field of "Explainable AI" [53] focuses on the question of how to provide users (and possibly other stakeholders) with more information via several key perspicuity properties [115] of these systems and their outputs to make them understand these systems and their outputs in ways necessary to meet various desiderata [5,27,68,72,83,89]. The concrete expectations and promises associated with various XAI methods are manifold. Among them are enabling warranted trust in systems [61,64,100,109], increasing human-system decision-making performance [67], for instance through increasing human situation awareness when operating systems [107], enabling responsible decision-making and effective human oversight [13,78,112], as well as identifying and reducing discrimination [72]. It often remains unclear what kind of explanations are generated by the various explainability methods and how they are meant to contribute to the fulfilment of the desiderata, even though these questions have become the subject of systematic and interdisciplinary research [68,69].
Our approach can be taxonomised along at least two different distinctions [69,84,99,100,114]: First, it is model-agnostic (not model-specific), i.e., it is not tailored to a particular class of models but operates on observable behaviour, namely the inputs and outputs of the model. Second, our method is a local method (not global), i.e., it is meant to shed light on certain outputs rather than the system as a whole.
With regard to fairness, there are two distinctions that are especially relevant to our work. First, one distinction is made between individual fairness, i.e., that similar individuals are treated similarly [34], and group fairness, i.e., that there is adequate group parity [22]. Measures of individual fairness are often close to the Aristotelian dictum to treat like cases alike [6,7]. In a sense, operationalisations of individual fairness are robustness measures [23,116], but instead of requiring robustness with respect to noise or adversarial attacks, measures of individual fairness, such as the one by Dwork et al. [34], call for robustness with respect to highly context-dependent differences between representations of human individuals. Second, recent work from the field of law [129] suggests differentiating between bias preserving and bias transforming fairness metrics. Bias preserving fairness metrics seek to avoid adding new bias. For such metrics, historic performances are the benchmarks for models, with equivalent error rates for each group being a constraint. In contrast, bias transforming metrics do not accept existing bias as a given or neutral starting point, but aim at adjustment. Therefore, they require making a "positive normative choice" [129], i.e., actively deciding which biases the system is allowed to exhibit, and which it must not exhibit. Over the years, many concrete approaches have been suggested to foster different kinds of fairness in artificial systems, especially in AI-based ones [72,79,94,129,132]. Yet, to the best of our knowledge, an approach like ours is still missing. One of the approaches that is closest to ours, namely that by John et al. [63], is not local and therefore not suitable for runtime monitoring. Also, it is not model-agnostic. So, to the best of our knowledge, our approach provides a new contribution to the debate on unfairness detection.
It is important to recognise that our approach can only be understood as part of a more holistic approach to preventing or reducing unfairness. After all, there are many sources of unfairness [8] (also see Figure 5 and Appendix B). Therefore, not every technical measure is able to detect every kind of unfairness, and eliminating one source of unfairness might not be sufficient to eliminate all unfairness. Our approach tackles only unfairness introduced by the system, but not other kinds of unfairness.

Discrimination
We understand discrimination as dissimilar treatment of similar cases or similar treatment of dissimilar cases without justifying reason. This is a definition that can also be found in the law [43, §43]. Our work is exclusively focused on discrimination qua dissimilar treatment of similar cases. Discrimination requires a thoughtful and largely not formalisable consideration of "justifying reason". However, we will exploit the relation of discrimination and fairness: Unfairness in a system can arguably be a good proxy of discrimination -even though not every unfair treatment by a system necessarily constitutes discrimination (especially not in the legal sense). Thus, a tool that highlights cases of unfairness in a system can be highly instrumental in detecting discriminatory features of a system. It is not viable, though, to let such a tool rule out unfair treatment fully automatically without human oversight, since there could be justifying reason to treat two similar inputs in a dissimilar way.

Individual Fairness
Unica from Example 2 should be able to detect individual unfairness. An operationalisation thereof by Dwork et al. [34] is based on the Lipschitz condition to enforce that similar individuals are treated similarly. To measure similarity, they assume the existence of an input distance function d In and an output distance function d Out . This assumption is very similar to the one that we implicitly made in the previous sections for robust cleanness and func-cleanness. However, in the case of the fair treatment of humans, finding reasonable distance functions is more challenging than it was for the examples in the previous sections. Dwork et al. assume that both distance functions perfectly measure distances between individuals and between outputs of the system, respectively, but admit that in practice these distance functions are only approximations of a ground truth at best. They suggest that distance measures might be learned, but there is no one-size-fits-all approach to selecting distance measures. Indeed, obtaining such distance metrics is a topic of active research [60,85,133]. Additionally, the Lipschitz condition assumes a Lipschitz constant L to establish a linear constraint between input and output distances. Lipschitz-fairness comes with some restrictions that limit its suitability for practical application:
d In -d Out relation: High-risk systems are typically complex systems and ask for more complex fairness constraints than the linearly bounded output distances provided by the Lipschitz condition. For example, using the Lipschitz condition prevents us from allowing small local jumps in the output and at the same time forbidding jumps of the same rate of increase over larger ranges of the input space (also see supplementary material in Section A).
Input relevance: The condition quantifies over the entire input domain of a program. This overlooks two things. First, it is questionable whether each input in such a domain is plausible as a representation for a real-world individual. But whether a system is unfair for two implausible and purely hypothetical inputs is largely irrelevant in practice. Second, it also ignores that mere potential unfair treatment is at most a threat, not necessarily already a harm [104]. Therefore, even with a restriction to only plausible applicants, the analysis might take into account more inputs than needed for many real-world applications. What is important in practice is the ability to determine whether actual applicants are treated unfairly, and for this it is often not necessary to look at the entire input domain.
Monitorability: In a monitoring scenario with the Lipschitz condition in place, a fixed input i 1 must be compared to potentially all other inputs i 2 . Since the input domain of the system can be arbitrarily large, the Lipschitz condition is not yet suitable for monitoring in practice (for a related point see John et al. [63]).
We propose a notion of individual fairness that is based on Definition 3. Instead of cleanness contracts we consider here fairness contracts, which are tuples $F = \langle d_{In}, d_{Out}, f \rangle$ containing input and output distance functions and the function $f$ relating input distances and output distances. Notably, the set of standard inputs StdIn known from cleanness contracts is not part of a fairness contract; it is unknown what qualifies an input to be 'standard' in the context of fairness analyses. Still, our fairness definition evaluates fairness for a set of individuals $I \subseteq In$ (e.g., a set of applicants), which has conceptual similarities to the set StdIn. A fairness contract specifies certain fairness parameters for a concrete context or situation. Such parameters should generally not already include $I$ to avoid introducing new unfairness through the monitor by tailoring it to specific inputs individually or by treating certain inputs differently from others. Func-fairness can thus be defined as follows:

Definition 9 A deterministic sequential program $P : In \to Out$ is func-fair for a set $I \subseteq In$ of actual inputs w.r.t. a fairness contract $F = \langle d_{In}, d_{Out}, f \rangle$, if and only if for every $i \in I$ and $i' \in In$, $d_{Out}(P(i), P(i')) \le f(d_{In}(i, i'))$.
The idea behind func-fairness is that every individual in set $I$ is compared to potential other inputs in the domain of P. These other inputs do not necessarily need to be in $I$, nor do these inputs need to have "physical counterparts" in the real world. Driven by the insights of the Input relevance restriction of Lipschitz-fairness, we explicitly distinguish inputs in the following: we call inputs that are given to P by a user actual inputs, denoted $i_a$, and call the inputs to which such $i_a$ are compared synthetic inputs, denoted $i_s$. Actual inputs are typically inputs that have a real-world counterpart, while this might or might not be true for synthetic inputs. At first glance, an alternative to using synthetic inputs is to use only actual inputs, e.g., to compare every actual input with every other actual input in $I$. For example, for a university admission, all applicants could be compared to every other applicant. However, this would heavily rely on contingencies: the detection of unfair treatment of an applicant would depend on whether they were lucky enough that, coincidentally, another candidate has also applied who aids in unveiling the system's unfairness towards them. Instead, func-fairness prefers to over-approximate the set of plausible inputs that actual inputs are compared to rather than under-approximating it by comparing only to other inputs in $I$. This way, the attention of the human exercising oversight of the system might be drawn to cases that are actually not unfair, but as a competent human in the loop, they will most likely be able to judge that the input was compared to an implausible counterpart. This will usually enable more effective human oversight than an under-approximation that fails to alert the human to unfair cases.
Notice that func-fairness is a conservative extension of Lipschitz-fairness. With I = In and f (x) = L·x, func-fairness mimics Lipschitz-fairness. Wachter et al. [129] classify the Lipschitz-fairness of Dwork et al. [34] as bias-transforming. As we generalise this and introduce no element that has to be regarded as bias-preserving, our approach arguably is bias-transforming, too.
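The reduction to Lipschitz-fairness can be spelled out by substituting $I = In$ and $f(x) = L \cdot x$ into Definition 9. The display below is only this instantiation, with $L$ the Lipschitz constant:

\[
\forall i \in In,\ \forall i' \in In:\quad d_{Out}(P(i), P(i')) \;\le\; f(d_{In}(i, i')) \;=\; L \cdot d_{In}(i, i').
\]

The right-hand inequality is the Lipschitz condition for the pair $(d_{In}, d_{Out})$, so every Lipschitz-fairness requirement is expressible as a func-fairness requirement, while other choices of $I$ and $f$ go beyond it.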
Func-fairness, with its function $f$, provides a powerful tool to model complex fairness constraints. How such an $f$ is defined has a profound impact on the quality of the fairness analysis. A full discussion of which types of functions make a good $f$ goes beyond the scope of this article. A suitable choice for $f$ and the distance functions $d_{In}$ and $d_{Out}$ heavily depends on the context in which fairness is analysed; there is no one-size-fits-all solution. Func-fairness makes this explicit with the formal fairness contract $F = \langle d_{In}, d_{Out}, f \rangle$.

Fairness Monitoring
We develop a probabilistic-falsification-based fairness monitor that, given a set of actual inputs, searches for a synthetic counterexample to falsify a system P w.r.t. a fairness contract F. To this end, it is necessary to provide a quantitative description of func-fairness that satisfies the characteristics of a robustness estimate. We call this description the fairness score. For an actual input $i_a$ and a synthetic input $i_s$ we define the fairness score as $F(i_a, i_s) := f(d_{In}(i_a, i_s)) - d_{Out}(P(i_a), P(i_s))$. $F$ is indeed a robustness estimate function: if $F(i_a, i_s)$ is non-negative, then $d_{Out}(P(i_a), P(i_s)) \le f(d_{In}(i_a, i_s))$, and if it is negative, then $d_{Out}(P(i_a), P(i_s)) \not\le f(d_{In}(i_a, i_s))$. For a set of actual inputs $I$, the definition generalises to $F(I, i_s) := \min\{F(i_a, i_s) \mid i_a \in I\}$, i.e., the overall fairness score is the minimum of the concrete fairness scores of the inputs in $I$. Notice that $R_I(i_s) := F(I, i_s)$ is essentially the quantitative interpretation of $\varphi_{\text{u-func}}$ (from Lemma 11) after simplifications attributed to the fact that P is a sequential and deterministic program (cf. Definition 2.2 vs. Definition 3).
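As an illustration of the fairness score, the following minimal Python sketch computes the score for one pair of inputs and its minimum over a set of actual inputs. The function names, the one-dimensional toy program, and the choice $f(d) = 2d$ are assumptions made purely for this example and are not taken from the article.

```python
from typing import Callable, Iterable, Tuple


def fairness_score(program: Callable, d_in: Callable, d_out: Callable,
                   f: Callable, i_a, i_s) -> float:
    """Score for one pair: f(d_In(i_a, i_s)) - d_Out(P(i_a), P(i_s))."""
    return f(d_in(i_a, i_s)) - d_out(program(i_a), program(i_s))


def set_fairness_score(program: Callable, d_in: Callable, d_out: Callable,
                       f: Callable, actual_inputs: Iterable, i_s) -> Tuple[float, object]:
    """Minimum score over all actual inputs, with the input attaining it."""
    return min(
        ((fairness_score(program, d_in, d_out, f, i_a, i_s), i_a)
         for i_a in actual_inputs),
        key=lambda pair: pair[0],
    )


# Hypothetical toy setting: a program with a jump at 0.5 and f(d) = 2 * d.
def toy_program(x: float) -> float:
    return 0.0 if x < 0.5 else 1.0


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


def linear_bound(d: float) -> float:
    return 2.0 * d


score, witness = set_fairness_score(
    toy_program, abs_dist, abs_dist, linear_bound, [0.1, 0.4], 0.6
)
print(score, witness)  # negative score: the pair (0.4, 0.6) violates the bound
```

In this toy setting the actual input 0.1 is far enough from the synthetic input 0.6 for the jump to be tolerated, whereas 0.4 is not, so the minimum is attained at 0.4.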
Algorithm 2 shows FairnessMonitor, which builds on Algorithm 1 to search for the minimal fairness score in a system P for fairness contract F. The algorithm stores fairness scores in triples that also contain the two inputs for which the fairness score was computed. The minimum in a set of such triples is defined by the function ξ-min that returns the triple with the smallest fairness score of all triples in the set. The first line of FairnessMonitor initialises the variable $i_s$ with an arbitrary actual input from $I$. For this value of $i_s$, the algorithm checks the corresponding fairness scores for all actual inputs $i_a \in I$ and stores the smallest one. In line 3, the globally smallest fairness score triple is initialised. In line 5 it uses the proposal scheme to get the next synthetic input $i'_s$. Line 6 is similar to line 2: for the newly proposed $i'_s$ it finds the smallest fairness score, stores it, and updates the global minimum if it found a smaller fairness score (line 7). Lines 8-13 come from Algorithm 1. The only difference is that in addition to $i_s$ we also store the fairness score ξ. Line 4 of Algorithm 2 differs from Algorithm 1 by terminating the falsification process after a timeout occurs (similar to the adaptation of Algorithm 1 in Section 4). Hence, the algorithm does not (exclusively) aim to falsify the fairness property, but aims at minimising the fairness score; even if the fair treatment of the inputs in $I$ cannot be falsified in a reasonable amount of time, we still learn how robustly they are treated fairly, i.e., how far the least fairly treated individual in $I$ is away from being treated unfairly. After the timeout occurs, the algorithm returns the triple with the overall smallest seen fairness score $\xi_{min}$, together with the actual input $i_1$ and the synthetic input $i_2$ for which $\xi_{min}$ was found. In case $\xi_{min}$ is negative, $i_2$ is a counterexample for P being func-fair.

Algorithm 2 FairnessMonitor, with ξ-min $S = (\xi, i_1, i_2)$ only if $(\xi, i_1, i_2) \in S$ and for all $(\xi', i'_1, i'_2) \in S$, $\xi' \ge \xi$
Falsification Parameters: PS: Proposal scheme, β: Temperature parameter
Input: System $P : In \to Out$, fairness contract $F = \langle d_{In}, d_{Out}, f \rangle$, and set of actual inputs $I$
Output: A minimal fairness score triple from $\mathbb{R} \times I \times In$
...
10: if $r \le \alpha$ then
...
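The prose description of FairnessMonitor can be turned into a small runnable sketch. The Python code below is not the article's Algorithm 2. It follows the line-by-line description given in the text and assumes a Metropolis-style acceptance step (accept a proposal if a uniform random number r is at most α = exp(−β·Δξ)) for the part inherited from Algorithm 1. The function names, the Gaussian proposal scheme, the parameter defaults, and the toy program are all hypothetical.

```python
import math
import random
import time


def fairness_monitor(program, d_in, d_out, f, actual_inputs, propose,
                     beta=10.0, timeout_s=0.2, rng=None):
    """Search for a small fairness score; returns (xi_min, i_1, i_2)."""
    rng = rng or random.Random(0)

    def min_triple(i_s):
        # Smallest fairness score over all actual inputs for a fixed i_s.
        return min(
            ((f(d_in(i_a, i_s)) - d_out(program(i_a), program(i_s)), i_a, i_s)
             for i_a in actual_inputs),
            key=lambda triple: triple[0],
        )

    i_s = actual_inputs[0]              # start from an arbitrary actual input
    current = min_triple(i_s)           # score of the current synthetic input
    best = current                      # globally smallest triple seen so far
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:  # stop on timeout, not on falsification
        candidate = min_triple(propose(i_s, rng))
        if candidate[0] < best[0]:
            best = candidate
        # Assumed acceptance rule: always accept improvements, otherwise
        # accept with probability alpha = exp(-beta * delta).
        delta = candidate[0] - current[0]
        if delta <= 0 or rng.random() <= math.exp(-beta * delta):
            i_s, current = candidate[2], candidate
    return best


# Hypothetical toy setting: one-dimensional inputs in [0, 1].
def toy_program(x):
    return 0.0 if x < 0.5 else 1.0


def abs_dist(a, b):
    return abs(a - b)


def bound(d):
    return 2.0 * d


def gaussian_proposal(i_s, rng):
    # Perturb the current synthetic input and clip it to the input domain.
    return min(1.0, max(0.0, i_s + rng.gauss(0.0, 0.1)))


result = fairness_monitor(toy_program, abs_dist, abs_dist, bound,
                          [0.4], gaussian_proposal)
print(result)
```

Because the loop ends on a timeout, a non-negative result only reports the smallest score seen so far; it does not establish func-fairness.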
FairnessMonitor implements a sound F-unfairness detection as stated in Proposition 7. However, it is not complete, i.e., it is not generally the case that P is func-fair for $I$ if $\xi_{min}$ is non-negative. It may happen that there is a counterexample, but FairnessMonitor did not succeed in finding it before the timeout. This is analogous to results obtained for model-agnostic robust cleanness analysis [18].

Proposition 7 Let $P : In \to Out$ be a deterministic sequential program, $F = \langle d_{In}, d_{Out}, f \rangle$ a fairness contract, and $I$ a set of actual inputs. Further, let $(\xi_{min}, i_1, i_2)$ be the result of FairnessMonitor(P, F, I). If $\xi_{min}$ is negative, then P is not func-fair for $I$ w.r.t. F.

Moreover, FairnessMonitor circumvents major restrictions of Lipschitz-fairness:

$d_{In}$-$d_{Out}$-relation: Func-fairness defines constraints between input and output distances by means of a function $f$, which also allows complex fairness constraints to be expressed. For a more elaborate discussion, see Section A.

Input relevance: Func-fairness explicitly distinguishes between actual and synthetic inputs. This way, func-fairness acknowledges a possible obstacle of the fairness theory when it comes to a real-world usage of the analysis, namely that only some elements of the system's input domain might be plausible and that usually only few of them become actual inputs that have to be monitored for unfairness.

Monitorability: FairnessMonitor demonstrates that func-fairness is monitorable. It resolves the quantification over In by means of probabilistic falsification, using the robustness estimate function $F$ as defined above.

Towards func-fairness in the loop
If a high-risk system is in operation, a human in the loop must oversee that the system functions correctly and that its outputs are fair. To do this, the human needs real-time fairness information. Figure 6 shows how this can be achieved by coupling the system P and the FairnessMonitor in Algorithm 2 in a new system called FairnessAwareSystem. FairnessAwareSystem is sketched in Algorithm 3. Intuitively, the FairnessAwareSystem is a higher-order program that is parameterised with the original program P and the fairness contract F. When instantiated with these parameters, the program takes arbitrary (actual) inputs $i_a$ from In. In the first step, it does a fairness analysis using FairnessMonitor with arguments P, F, and $\{i_a\}$. To make fairness scores comparable, FairnessAwareSystem normalises the fairness score ξ received from FairnessMonitor by dividing it by the output distance limit $f(d_{In}(i_a, i_s))$. For fair outputs, the score will be between 0 (almost unfair) and 1 (as fair as possible). Outputs that are not func-fair are accompanied by a negative score representing how much the limit $f(d_{In}(i_a, i_s))$ is exceeded. A fairness score of $-n$ means that the output distance of $P(i_a)$ and $P(i_s)$ is $n + 1$ times as high as that limit. Finally, FairnessAwareSystem returns the triple with P's output for $i_a$, the normalised fairness score, and the synthetic input with its output witnessing the fairness score.
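The normalisation step can be illustrated with a short Python sketch. It is not the article's Algorithm 3: the function names, the exhaustive grid search that stands in for FairnessMonitor, and the toy program are assumptions for illustration, and the case of a zero output distance limit is deliberately left unhandled.

```python
def normalise(xi, limit):
    """Divide the fairness score by the output distance limit f(d_In(i_a, i_s))."""
    if limit == 0:
        # Not covered by this sketch; a real tool needs a policy for this case.
        raise ValueError("output distance limit is zero")
    return xi / limit


def fairness_aware_system(program, d_in, f, monitor, i_a):
    """Return P(i_a), the normalised score, and the witnessing synthetic input
    together with its output."""
    xi, _, i_s = monitor(i_a)           # monitor is run on the singleton {i_a}
    limit = f(d_in(i_a, i_s))
    return program(i_a), normalise(xi, limit), i_s, program(i_s)


# Hypothetical toy setting with a jump at 0.5 and f(d) = 2 * d.
def toy_program(x):
    return 0.0 if x < 0.5 else 1.0


def abs_dist(a, b):
    return abs(a - b)


def bound(d):
    return 2.0 * d


def grid_monitor(i_a):
    # Stand-in for FairnessMonitor: exhaustive search over a coarse grid.
    grid = [k / 100 for k in range(101)]
    return min(
        ((bound(abs_dist(i_a, i_s))
          - abs_dist(toy_program(i_a), toy_program(i_s)), i_a, i_s)
         for i_s in grid),
        key=lambda triple: triple[0],
    )


result = fairness_aware_system(toy_program, abs_dist, bound, grid_monitor, 0.4)
print(result)
```

In this toy run the limit is about 0.2 and the output distance is 1, so the normalised score is about −4, matching the reading that the output distance is n + 1 = 5 times the limit.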

Interpretation of monitoring results
Especially when FairnessAwareSystem finds a violation of func-fairness, the suitable interpretation of and appropriate response to the normalised fairness score prove to be a non-trivial matter that requires expertise.

Example 6 Instead of using P from Example 2 on its own, Unica now uses FairnessAwareSystem with a suitable fairness contract (which fairness contracts are suitable is an open research problem, see Limitations & Challenges in Section 7) and thereby receives a fairness score along with P's verdict on each applicant. If the fairness score is negative, she can also take into account the information on the synthetic counterpart returned by FairnessAwareSystem. Among the 4096 applicants for the PhD program, the monitoring assigns a negative fairness score to three candidates: Alexa, who received a low score, Eugene, who was scored very highly, and John, who got an average score. According to their scoring, Alexa would be desk-rejected, while Eugene and John would be considered further.
Alexa's synthetic counterpart, let's call him Syntbad, is ranked much higher than Alexa. In fact, he is ranked so high that Syntbad would not be desk-rejected. Unica compares Alexa and Syntbad and finds that they only differ in one respect: Syntbad's graduate university is the one in the official ranking that is immediately below the one that Alexa attended. Unica does some research and finds that Alexa's institution is predominantly attended by People of Colour, while this is not the case for Syntbad's institution. Therefore, FairnessAwareSystem helped Unica not only to find an unfair treatment of Alexa, but also to uncover a case of potential racial discrimination.

Fig. 7: Exemplary illustration of configurations of an input (red cross) and its synthetic counterparts (grey circles) and the synthetic counterpart with the minimal fairness score (blue polygon); with a two-dimensional input space (grid) and a one-dimensional output. Panels: (a) case of unfairness where input is treated worse than relevant counterpart; (b) case of unfairness where input is treated better than relevant counterpart; (c) case of no detected unfairness.
John's counterpart, Synclair, is ranked much lower than him. Unica manually inspects John's previous institution (an infamous online university), his GPA of 1.8, and his test result with only 13%. She finds that this very much suggests that John will not be a successful PhD candidate and desk-rejects him. Therefore, Unica has successfully used FairnessAwareSystem to detect a fault in scoring system P whereby John would have been treated unfairly in a way that would have been to his advantage.
Eugene received a top score, but his synthetic counterpart, Syna, received only an average one. Unica suspects that Eugene was ranked too highly given his graduate institution, GPA, and test score. However, as he would not have been desk-rejected either way, nothing changes for Eugene, and the unfairness he was subject to has no effect on him.
The cases of John and Eugene share similarities with the configuration in (b) in Figure 7, the one of Alexa with (a), and the ones of all other 4093 candidates with (c).
If our monitor finds only a few problematic cases in a (sufficiently large and diverse) set of inputs, our monitoring helps Unica from our running example by drawing her attention to cases that require special attention. Thereby, individuals who are judged by the system have a better chance of being treated fairly, since even rare instances of unfair treatment are detected. If, on the other hand, the number of problematic cases found is large, or Unica finds especially concerning cases or patterns, this can point to larger issues within the system. In these cases, Unica should take appropriate steps and make sure that the system is no longer used until clarity is established why so many violations or concerning patterns are found. If the system is found to be systematically unfair, it should arguably be removed from the decision process. A possible conclusion could also be that the system is unsuitable for certain use cases, e.g., for the use on individuals from a particular group. Accordingly, it might not have to be removed altogether but only needs to be restricted such that problematic use cases are avoided. In any case, significant findings should also be fed back to developers or deployers of the potentially problematic system. A fairness monitoring such as in FairnessAwareSystem or a fairness analysis as in FairnessMonitor could also be useful to developers, regulating authorities, watchdog organisations, or forensic analysts as it helps them to check the individual fairness of a system in a controlled environment.

Interdisciplinary Assessment of Fairness Monitoring
Regulations for car-related emissions have been in force for a considerable amount of time; thus, their legal interpretation is mostly clear. In the case of human oversight of AI systems, the AI Act is new and parts of it are legally ambiguous. This raises the question of whether our approach meets requirements that go beyond pre-theoretical deliberations. Even though comprehensive analyses would go far beyond the scope of this paper, we will nevertheless assess some key normative aspects in philosophical and legal terms, and also briefly turn to the related empirical aspects, especially from psychology.

Psychological assessment
Fairness monitoring promises various advantages in terms of human-system interaction in application contexts -provided it is extended by an adequate user interface -which call for empirical tests and studies. We will only discuss a possible benefit that closely aligns with the current draft of the AI Act: our approach may support effective human oversight. Two central aspects of effective oversight are situation awareness and warranted trust. Our method highlights unfairness in outputs which can be expected to increase users' situation awareness (i.e., "the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning and the projection of their status in the near future" [36, p. 36]), which is a variable central for effective oversight [37]. In the minimal case, this allows users to realise that something requires their attention and that they should check the outputs for plausibility and adequacy. In the optimal case and after some experience with the monitor, it may even allow users to predict instances where a system will produce potentially unfair outputs. In any case, the monitoring should enable them to understand limitations of the system and to feed back their findings to developers who can improve the system. This leads us to warranted trust, which includes that users are able to adequately judge when to rely on system outputs and when to reject them [61,71]. Building warranted trust strongly depends on users being able to assess system trustworthiness in the given context of use [71,108]. According to their theoretical model on trust in automation, Lee and See [71] propose that trustworthiness relates to different facets of which performance (e.g., whether the system performs reliably with high accuracy) and process (e.g., knowing how the system operates and whether the system's decision-processes help to fulfil the trustor's goals) are especially relevant in our case. 
Specifically, fairness monitoring should enable users to more accurately judge system performance (e.g., by revealing possible issues with system outputs) and system processes (e.g., whether the system's decision logic was appropriate). In line with Lee and See's propositions, this should provide a foundation for users to be better able to judge system trustworthiness and should thus be a promising means to promote warranted trust. In consequence, our monitoring provides a needed addition to high-risk use contexts of AI because it offers information enabling humans to more adequately use AI-based systems in the sense of possibly better human-system decision performance and with respect to user duties as described in the AI Act.

Philosophical assessment
More effective oversight promises more informed decision-making. This, in turn, enables morally better decisions and outcomes, since humans can morally ameliorate outcomes in terms of fairness and can see to it that moral values are promoted. Also, fairness monitoring helps in safeguarding fundamental democratic values if it is applied to potentially unfair systems which are used in certain societal institutions of a high-risk character such as courts or parliaments. It could, for example, make AI-aided court decisions more transparent and promote equality before the law. However, since our approach requires finding context-appropriate and morally permissible parameters for F, moral requirements arise to enable the finding of such parameters. This not only affects, e.g., developers of such systems, but also those who are in a position to enforce that adequate parameters are chosen, such as governmental authorities, supervising institutions or certifiers. Apart from that, various parties have arguably a legitimate interest in adequately ascribing moral responsibility for the outcomes of certain decisions to human deciders [13] -regardless of whether the decision making process is supported by a system. Adequately ascribing moral responsibility is not always possible, though. One precondition for moral responsibility is that the agent had sufficient epistemic access to the consequences of their doing [88,117], i.e., that they have enough and sufficiently well justified beliefs about the results of their decision. Someone overseeing a university selection process (like Unica) should, for example, have sufficiently well justified beliefs that, at the very least, their decisions do not result in more unfairness in the world. If the admission process is supported by a black-box system, though, Unica cannot be expected to have any such beliefs since she lacks insight in the fairness of the system. 
Therefore, adequate responsibility ascription is usually not possible in this scenario. Our monitoring alleviates this problem by providing the decider with better epistemic access to the fairness of the system. FairnessAwareSystem helps in making Unica's role in the decision process significant and not only that of a mere button-pusher. FairnessAwareSystem makes it possible for her to fulfil some of the responsibilities and duties plausibly associated with her role. For example, she can now be realistically expected to not only detect, but resolve at least some cases of apparent unfairness competently (although she may need additional information to do so). In this respect, she should not be 'automated away' (cf. [77]).

Legal assessment
A central legislative debate of our time is how to counter the risks AI systems can pose to the health and safety or fundamental rights of natural persons. Protective measures must be taken at various levels: First, before being permitted on the market, it must be ensured ex ante that such high-risk AI-systems are in conformity with mandatory requirements 9 regarding safety and human rights. This means in particular that the selection of the properties which a system should exhibit requires a positive normative choice and should not simply replicate biases present in the status quo [129]. In addition, AI-systems must be designed and developed in such a way that natural persons can oversee their functioning. For this purpose, it is necessary for the provider to identify appropriate human oversight measures before its placing on the market or putting into service. In particular, such measures should guarantee that the natural persons to whom human oversight has been assigned have the necessary competence, training and authority to carry out that role [39, recital 48][40, Art. 14 (5)].
Second, during runtime, the proper functioning of high-risk AI systems, which have been placed on the market lawfully, must be ensured. To achieve this goal, a bundle of different measures is needed, ranging from legal obligations to implement and perform meaningful oversight mechanisms to user training and awareness in order to counteract 'automation bias'. Furthermore, the AI Act proposal requires deployers to inform the provider or distributor and suspend the use of the system when they have identified any serious incidents or any malfunctioning [39,40,Art. 29(4)].
Third, and ex post, providers must act and take the necessary corrective actions as soon as they become aware, e.g. through information provided by the deployer, that the high-risk system does not (or no longer) meet the legal requirements [39,40,Art. 16(g)]. To this end, they must establish and document a system of monitoring that is proportionate to the type of AI technology and the risks of the high-risk AI system [39,40,Art. 61(1)].
Fairness monitoring can be helpful in all three of the above respects. Therefore, we argue that there is even a legal obligation to use technical measures such as the method presented in this paper if this is the only way to ensure effective human oversight.

Conclusion & Future Work
This article brings together software doping theory and probabilistic falsification techniques. To this end, it proposes a suitable HyperSTL semantics and characterises robust cleanness and func-cleanness as HyperSTL formulas and, for the special case of finite standard behaviour, STL formulas. Software doping techniques have been extensively applied to the tampered diesel emission cleaning systems; this article continues this path of research by demonstrating how testing of real cars can become more effective. For the first time, we apply software doping techniques to high-risk (AI) systems. We propose a runtime fairness monitor to promote effective human oversight of high-risk systems. The development of this monitor is complemented by an interdisciplinary evaluation from a psychological, philosophical, and legal perspective.

Limitations & Challenges
A challenge to those employing robust cleanness or func-cleanness analysis is the selection of suitable parameters, especially d In , d Out , and f or κ i and κ o . Because of their high degree of context sensitivity, there are no paradigmatic candidates for them that one can default to. Instead, they have to be carefully selected with the concrete system, the structure of input data and the situation of use in mind.
Reasonable choices for robust cleanness analysis of diesel emissions have been proposed in recent work [18,20]. With respect to individual fairness analysis, the potential systems to which FairnessAwareSystem or FairnessMonitor can be applied are still too diverse to give recommendations for the contract parameters. A further technical limitation is that $f$, $d_{In}$, and $d_{Out}$ must be computable.
With a particular regard to fairness analysis, we identify also non-technical limitations. As seen in Figure 5, our fairness monitoring aims to uncover a particular kind of unfairness, namely individual unfairness that originates from within the system. This excludes all kinds of group unfairness as well as unfairness from sources other than the system. Another limitation is the human's competence to interpret the system outputs. Even though this is not a limitation that is inherent to our approach, it nevertheless will arguably be relevant in some practical cases, and an implementation of the monitoring always has to happen with the human in mind. For example, the design of the tool should avoid creating the false impression that the system is proven to be fair for an individual if no counterexample has been found. Interpretations like this could lead to inflated judgements of system trustworthiness and eventually to overtrusting system outputs [108,110]. Also, it might be reasonable to limit access to the monitoring results: if individuals who are processed by the system have full access to their fairness analysis, they could use this to 'game' the system, i.e. they could use the synthetic inputs to slightly modify their own input such that they receive a better outcome. While more transparency for the user is generally desirable, this has to be kept in mind to avoid introducing new unfairness on a meta-level.

Future Work
The probabilistic falsification technique we use in this article can be seen as a modular framework that consists of several interchangeable components. One of these components is the optimisation technique used to find the input with minimal robustness value. Algorithm 1 uses a simulated annealing technique [28,105], but other techniques have been proposed for temporal logic falsification, too [4,106]. We want to further look into such alternative optimisation techniques and to evaluate if they offer benefits w.r.t. cleanness falsification.
Finally, the fairness monitoring approach has been presented using a toy example. It is not claimed to be readily applicable to real-life scenarios. Besides the future work that has already been mentioned throughout the paper, we are planning on various extensions of our approach, and are working on an implementation that will allow us to integrate the monitoring into a real system. Moreover, we plan to test the possible benefits and shortcomings of the approach in user studies where decision-makers are tasked to make hiring decisions with and without the fairness monitoring approach. Further work will encompass activities such as the improvement and embedding of the algorithm FairnessAwareSystem into a proper tool that can be used by non-computer-scientists, and the extension of the monitoring technique to cover more types of unfairness. For example, logging the output of the fairness monitor could be used to identify groups that are especially likely to be treated unfairly by the system: The individual fairness verdicts provided by FairnessAwareSystem and FairnessMonitor may also be logged and considered for further fairness assessments or other means of quality assurance of system P. Statistical analysis might unveil that individuals of certain groups are treated unfairly more frequently than individuals from other groups. Depending on the distinguishing features of the evaluated group, this can uncover problems in P, especially if protected attributes, such as gender, race, age, etc., are taken into account. Thereby, system fairness can be assessed for protected attributes without including them in the input of P, which should generally be avoided, and even without disclosing them to the human in the loop. By evaluating the monitoring logs from sufficiently many diverse runs of FairnessAwareSystem, our local method can be lifted such that it resembles a global method for many practical applications, i.e., we can make statistical statements about the general fairness of P. Such an evaluation can also be used to extract prototypes and counterexamples in the spirit of Been et al. [65] illustrating the tendency to judge unfairly. This is an interesting combination of individual and group fairness that we want to look into further. Other insights from the research on reactive systems [18,20,31] can potentially be used to further enrich the monitoring. Finally, various disciplines have to join forces to resolve highly interdisciplinary questions such as what constitutes reasonable and adequate choices for $f$, $d_{In}$, and $d_{Out}$ in given contexts of application.
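The log-based group analysis mentioned above could, for instance, start from a per-group rate of negative fairness scores. The following Python sketch is only an illustration under stated assumptions: the log format (pairs of an input identifier and a fairness score), the separately held mapping from identifiers to groups, and all names are hypothetical and not part of the article.

```python
from collections import defaultdict


def unfair_rate_by_group(log, group_of):
    """Fraction of logged runs with a negative fairness score, per group.

    log      -- iterable of (input_id, fairness_score) pairs
    group_of -- mapping from input_id to a group label; it is kept apart
                from the inputs that are given to the monitored system P
    """
    counts = defaultdict(lambda: [0, 0])    # group -> [negative, total]
    for input_id, score in log:
        group = group_of[input_id]
        counts[group][0] += 1 if score < 0 else 0
        counts[group][1] += 1
    return {group: negative / total
            for group, (negative, total) in counts.items()}


# Hypothetical monitoring log and group assignment.
example_log = [("a", -0.2), ("b", 0.5), ("c", -1.0), ("d", 0.3)]
example_groups = {"a": "g1", "b": "g1", "c": "g2", "d": "g1"}
rates = unfair_rate_by_group(example_log, example_groups)
print(rates)
```

Such rates are descriptive only; whether a difference between groups is statistically meaningful would need a proper test and sufficiently many diverse runs.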
Competing interests. The authors have no competing interests to declare that are relevant to the content of this article.

Appendix A Technical Appendix
This appendix illustrates that func-fairness is more expressive than Lipschitz-fairness and why this is useful. For this, we use as a toy example a very simple, hypothetical HR scoring system that aggregates five scores given to the candidates. We remark that the whole scenario, including the implementation of the system and the choice of distance functions and $f$, is likely not applicable to real-life situations; everything is picked so that our explanations are understandable.
Suppose that certain qualities and characteristics of the applicants are prescored by other systems on a scale from 0 to 100 %, where 0 means that the candidate is utterly unsuitable for the job in a certain regard, while a scoring of 100 % means that the candidate is perfect for the job in this regard. In particular, we will assume that the following marks are given to each applicant: an education mark for how well they are academically suitable for the job, an experience mark for how well their previous work experience fits the job, a personality mark for their personal and social skills, a mental ability mark for what is colloquially referred to as an applicant's general intelligence, and, finally, a skill mark that tracks the special skills that applicants have which might be beneficial for the job, such as their knowledge of foreign languages.
The system P that is of interest for us in this example is the one that aggregates all of these marks and gives out an overall score of how well the candidate is suited for the job. The human responsible for the hiring process can use this in her hiring decision, e.g., she can focus on the top-scoring candidates and choose among them.
Let $M = [0, 1] \subseteq \mathbb{R}$ be the reals between 0 and 1. Each of the five marks mentioned above is a real number from the set $M$. The input domain $In = M^5$ of the sketched HR system consists of tuples of five marks. The output of the system is the overall suitability score of an applicant, which is also a value from $M$. The distance between two inputs is defined as the Euclidean distance, normalised to a value between 0 and 1, i.e.,

$$d_{In}\big((ed_1, ex_1, pe_1, in_1, sk_1), (ed_2, ex_2, pe_2, in_2, sk_2)\big) = \sqrt{\frac{(ed_1 - ed_2)^2 + (ex_1 - ex_2)^2 + (pe_1 - pe_2)^2 + (in_1 - in_2)^2 + (sk_1 - sk_2)^2}{5}},$$

where $ed$ represents the education mark, $ex$ the experience mark, $pe$ the personality mark, $in$ the mental ability mark, and $sk$ the skill mark of an applicant. The distance between two outputs, $d_{Out}(o_1, o_2) = |o_1 - o_2|$, is the absolute difference between the overall scores $o_1$ and $o_2$. Note that output distances are also values between 0 and 1.
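The two distance functions can be sketched in a few lines of Python. This is only an illustration under the assumption that the normalisation divides the sum of squared differences by the number of marks (which is consistent with the value of roughly 0.0045 computed for John and Synthia in this appendix); the function names `d_in` and `d_out` are ours.

```python
import math
from typing import Sequence


def d_in(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalised Euclidean distance between two mark tuples.

    Assumption: the squared differences are averaged over the number of
    marks before taking the root, so the result lies in [0, 1] whenever
    all marks lie in [0, 1].
    """
    if len(a) != len(b):
        raise ValueError("mark tuples must have the same length")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared / len(a))


def d_out(o1: float, o2: float) -> float:
    """Absolute difference between two overall scores."""
    return abs(o1 - o2)


# Extreme cases: the two most distant inputs have distance 1,
# identical inputs have distance 0.
worst = (0.0, 0.0, 0.0, 0.0, 0.0)
best = (1.0, 1.0, 1.0, 1.0, 1.0)
print(d_in(worst, best))  # 1.0
print(d_in(best, best))   # 0.0
```

The extreme cases confirm that the normalisation keeps input distances within $M$, just like output distances.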
Our scoring system is a function $P : M^5 \to M$. We will assume here that $P$ is defined as the sum of five subscoring systems, one for each of the five input marks, each computing a value between 0 and 0.2. Then, $P((ed, ex, pe, in, sk)) := P_{ed}(ed) + P_{ex}(ex) + P_{pe}(pe) + P_{in}(in) + P_{sk}(sk)$.
Let $P_{ed}$, $P_{ex}$, $P_{pe}$ and $P_{in}$ be defined according to the plot shown in Fig. A1 a). With an increasing mark, these subscores increase up to an input mark of 0.8, whereafter the applicant becomes overqualified and the subscore slowly decreases. $P_{sk}$ is depicted in Fig. A1 b): the skill mark is less important; however, a minimum amount of skills is required for the job. Hence, there is a jump of the skill score at a skill mark of roughly 0.19.

Let John be an applicant with $ed = ex = pe = in = 0.5$ and a skill mark of $sk = 0.2$, which maps to a skill score on the plateau after the jump. The subscores for the education, experience, personality and mental ability marks are 0.12 each. The skill score computed for John is 0.05. Hence, John's overall score is $P(\mathrm{John}) = 4 \cdot 0.12 + 0.05 = 0.53$. Let Synthia be a synthetic applicant with the same marks as John, except for the skill mark, which is 0.19 in Synthia's case. As depicted in Fig. A1 b), the skill subscore for skill mark 0.19 is 0.02: Synthia is at the plateau right before the jump of the skill score. Her overall score is $P(\mathrm{Synthia}) = 4 \cdot 0.12 + 0.02 = 0.50$. The input distance between John and Synthia is $d_{In}(\mathrm{John}, \mathrm{Synthia}) = \sqrt{0.01^2 / 5} \approx 0.0045$ and the output distance is $d_{Out}(\mathrm{John}, \mathrm{Synthia}) = |0.53 - 0.50| = 0.03$.

If we use Lipschitz-fairness, the Lipschitz constant $L$ must satisfy $L \cdot 0.0045 \geq 0.03$, i.e., it must be at least roughly $L = 6.7$, to allow the small jump in the skill subscoring function. We argue that small jumps like those in the skill subscore are normal behaviour and, hence, fair. Assume for the remainder of this example that we use Lipschitz-fairness with $L = 6.7$.
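The arithmetic of the John and Synthia example can be retraced with the following short sketch. It hard-codes the subscores that the text reads off Fig. A1 (0.12, 0.05 and 0.02) rather than implementing the plotted subscoring functions, and it assumes that the input distance is the Euclidean distance divided by $\sqrt{5}$; the variable names are ours.

```python
import math

# Marks in the order (ed, ex, pe, in, sk), as given in the example.
john = (0.5, 0.5, 0.5, 0.5, 0.20)
synthia = (0.5, 0.5, 0.5, 0.5, 0.19)

# Overall scores, using the subscores stated in the text.
score_john = 4 * 0.12 + 0.05     # skill subscore after the jump
score_synthia = 4 * 0.12 + 0.02  # skill subscore before the jump

# Normalised Euclidean input distance (assumed normalisation: divide by 5
# under the root) and absolute output distance.
d_in = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, synthia)) / 5)
d_out = abs(score_john - score_synthia)

# Smallest Lipschitz constant that tolerates this pair.
min_lipschitz = d_out / d_in

print(round(d_in, 4), round(d_out, 2), round(min_lipschitz, 2))
# 0.0045 0.03 6.71
```

Without intermediate rounding the bound is $3\sqrt{5} \approx 6.71$; the value 6.7 used in the text results from dividing 0.03 by the rounded input distance 0.0045.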
Consider now a slightly modified variant $P'$ of $P$. $P'$ is as $P$ but uses a different subscoring function $P'_{sk}$ for the skill score. Fig. A1 c) shows the skill subscoring function for $P'$. $P'_{sk}$ has a jump at skill mark 0.13 that is significantly larger than that in $P_{sk}$. We assume in this example that such a big jump is

Func-fairness is different in this regard. Function $f$ receives the input distance and can freely define a bound on output distances based on the input distance. Indeed, the concrete $f$ on the right overcomes the problem observed in the example. It makes a case distinction on the magnitude of the input distance. For input distances up to 0.01, $f$ effectively applies Lipschitz-fairness with $L = 8$ to allow small jumps. For input distances between 0.01 and 0.1, $f$ behaves like Lipschitz-fairness for $L = 4$, and for larger input distances, it enforces $L = 2$. In all cases we add 0.001 to the result to avoid $f$ becoming zero (see footnote 7 on page 33 in the main paper). Applying func-fairness with $C = \langle d_{In}, d_{Out}, f \rangle$ to $P$, the combination of John and Synthia (and hence the small jump of the skill score function) is not highlighted by FairnessAwareSystem, i.e., it is correctly detected as func-fair. Applied to $P'$, however, John and Synclair fall into the second case in the definition of $f$, but, as the emulated Lipschitz condition with $L = 4$ is violated, FairnessAwareSystem likely finds a negative fairness score, i.e., $P'$ is not func-fair w.r.t. John.

We remark that we propose this $f$ for purely illustrative purposes. For real-world examples, $f$ should be more sophisticated. Finding a suitable $f$ can be a non-trivial task which hinges on various aspects that are crucial for the fairness evaluation in a given context. Clearly, the $P$ and $f$ provided in this illustration are toy examples that are probably inappropriate for real-world usage.
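The piecewise function $f$ described above can be written down directly. The sketch below makes two assumptions that the text does not fix: the boundary values 0.01 and 0.1 belong to the lower case, and the fairness score of a pair is the bound $f(d_{In})$ minus the observed output distance, so that a negative score indicates a violation. The names `f` and `fairness_score` are ours, and the last pair of distances is hypothetical rather than taken from the example.

```python
import math


def f(d: float) -> float:
    """Bound on the output distance for a given input distance d."""
    if d <= 0.01:      # small input distances: emulate L = 8
        slope = 8.0
    elif d <= 0.1:     # medium input distances: emulate L = 4
        slope = 4.0
    else:              # large input distances: emulate L = 2
        slope = 2.0
    return slope * d + 0.001  # offset keeps f strictly positive


def fairness_score(d_in_value: float, d_out_value: float) -> float:
    """Assumed score: allowed output distance minus observed one."""
    return f(d_in_value) - d_out_value


# John vs. Synthia under P: distances as computed in the example.
d_in_js = 0.01 / math.sqrt(5)
d_out_js = 0.03
score_js = fairness_score(d_in_js, d_out_js)
print(score_js > 0)  # True: the small jump is tolerated

# Hypothetical pair in the second case (0.01 < d <= 0.1) whose output
# distance exceeds the emulated L = 4 bound.
score_hyp = fairness_score(0.03, 0.2)
print(score_hyp < 0)  # True: flagged as a violation
```

Because the slope decreases with growing input distance, $f$ tolerates the small jump between very similar applicants while remaining stricter than $L = 6.7$ for applicants that are further apart.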

A.1 Proofs
In this section, we will provide proofs for most of the propositions and theorems in the main paper. First, we show the correctness of the HyperSTL characterisations of robust cleanness and func-cleanness.
We first provide a lemma that unfolds the globally (□) and weak until (W) operators such that the timing constraints encoded by these operators become explicit.
Lemma 8 Let σ : T → X be a trace with T = N or T = R ≥0 and let ϕ and ψ be STL formulas. Then the following equivalences hold.
Proof We prove the two statements separately.
1. Using the definition of the derived operators ◇ and □, we get that σ, 0 |= □ϕ holds if and only if σ, 0 |= ¬(⊤ U ¬ϕ) holds. Using the (Boolean) semantics of STL, we get that this is equivalent to ¬(∃t ≥ 0. σ, t |= ¬ϕ ∧ ∀t′ < t. σ, t′ |= ⊤). After simple logical operations, we get that this is equivalent to ∀t ≥ 0. σ, t |= ϕ as required.
2. Using 1, the definition of W, the (Boolean) semantics of STL, and considering that T = N, we get that σ, 0 |= ϕ W ψ if and only if (∃t ∈ N. σ, t |= ψ ∧ ∀t′ < t. σ, t′ |= ϕ) or (∀t ∈ N. σ, t |= ϕ). We denote this proposition as V. It is easy to see that the right operand of the equivalence to prove can be rewritten to ∀t ∈ N. (∃t′ ≤ t. σ, t′ |= ψ) ∨ σ, t |= ϕ. We denote this proposition as W and must show that V ⇒ W and W ⇒ V. To prove that V implies W, we distinguish two cases.
• For the first case, assume that the left operand of the disjunction in V holds, i.e., there is some t ∈ N, such that σ, t |= ψ ∧ ∀t ′ < t. σ, t ′ |= ϕ.
To prove that W implies V, let PV = {t ∈ N | σ, t |= ψ} be the set of all time points at which ψ holds. If PV is the empty set, it follows immediately from W that ∀t ∈ N. σ, t |= ϕ and that, hence, V holds. If PV is not empty, let t = min PV be the smallest time in PV (the minimum always exists, because T = N). Then, obviously, ∃t ∈ N. σ, t |= ψ. To show that V holds, it suffices to show that ∀t′ < t. σ, t′ |= ϕ. This follows from W, because t is the smallest time at which σ, t |= ψ holds and, therefore, for every t′ < t it does not hold that σ, t′ |= ψ. □

Lemma 9 is specific to the HyperSTL formula (3); it converts it into a first-order logic formula.
Proof Using Lemma 8.1, Lemma 8.2, and Definition 6, we get that holds for Π = {π := w, π′ := w′, π′′ := w′′}. Using the constraint under which Stdπ must be modelled, and by further applying Definition 6 and basic logical operations, we get that the above proposition is equivalent to Finally, after carefully reordering premises, we get that the above holds if and only if

We omit the lemma analogous to Lemma 9 that reformulates formula (2) as a first-order characterisation. The proof for Proposition 3 further transforms the first-order characterisations of formulas (2) and (3) to show that they indeed match the definitions of l-robust cleanness and u-robust cleanness.
Proposition 3 Let L ⊆ N → (In ∪ Out) be a mixed-IO system and C = ⟨Std, d In , d Out , κ i , κo⟩ a contract or context for robust cleanness with Std ⊆ L. Further, let Stdπ be a quantifier-free HyperSTL subformula, such that L, {π := w}, 0 |= Stdπ if and only if w ∈ Std. Then, L is l-robustly clean w.r.t. C if and only if L, ∅, 0 |= ψ l-rob , and L is u-robustly clean w.r.t. C if and only if L, ∅, 0 |= ψ u-rob .
Proof We prove the correctness for l-robust cleanness and u-robust cleanness separately and begin with u-robust cleanness. Using Lemma 9, we get that holds if and only if After applying simple logical operations and using that eq(i 1 , i 2 ) = 0 if and only if i 1 = i 2 , we get that this is equivalent to , w 2 ↓o[t]) ≤ κo , which, since we assumed Std ⊆ L, is equivalent to the definition of u-robust cleanness for mixed-IO systems.
The proof for l-robust cleanness is analogous. □

We recapitulate the proposition similar to Proposition 3 for func-cleanness. The proof for Proposition 4 is conceptually similar to the one for Proposition 3. The only difference is that instead of the reasoning about the W construct, the globally enforced relation between output distances and the result of f must be proven equivalent in the HyperSTL formulas and func-cleanness. We omit the proofs here.
Next, we show the correctness of the STL characterisations, i.e., we will prove the correctness of Theorems 5 and 6. We do so by first establishing a connection between the HyperSTL and the STL characterisations. Proof Using Lemma 9 we get that Since Std = {w 1 , . . . , wc}, we can replace the universal and existential quantifiers over Std by a conjunction, respectively disjunction, over the standard traces [103]. We instantiate the universal quantifier for w ′′ with w and get that 1≤a≤c 1≤b≤c From the Boolean semantics of STL and by replacing all traces w, respectively w k , by the corresponding w + -projections, we get the equivalent proposition 1≤a≤c 1≤b≤c With the Boolean semantics of STL and Lemma 8.1 and 8.2 we get the equivalent statement that

□
We are now able to prove Theorem 5.
Theorem 5 Let L ⊆ N → (In ∪ Out) be a mixed-IO system and C = ⟨Std, d In , d Out , κ i , κo⟩ a context for robust cleanness with finite standard behaviour Std = {w 1 , . . . , wc} ⊆ L. Then, L is u-robustly clean w.r.t. C if and only if (L • Std) |= φ u-rob , where

Proof The theorem follows from Proposition 3 and Lemma 10. □

To prove Theorem 6, we establish the following lemma, which is analogous to Lemma 10, up to u-func-cleanness replacing u-robust cleanness. The proof for Lemma 11 is, up to the different reasoning for □, identical to that of Lemma 10. We omit it here.
Proof The theorem follows from Proposition 4 and Lemma 11. □

Appendix B Fairness Pipeline
As explained in Section 2 in the main paper, it is important to recognise that there are many sources of unfairness [8]. Section B shows a more detailed version of Figure 5 in the main paper. Not every technical measure is able to detect every kind of unfairness and eliminating one source of unfairness might not be sufficient to eliminate all unfairness.
World. There can be unfairness in the world that leads to individuals already having worse (or better) starting conditions than others and subsequently having a lower (or higher) chance that the final decision is made in their favour. For example, an individual could be systematically excluded from certain societal resources (e.g., girls who are excluded from education in Afghanistan under the Taliban), which puts these individuals at a disadvantage.

Input data. The input data or its collection, representation or selection could be problematic and lead to unfairness [137]. If, for example, crucial information is left out in the input data or data is aggregated in unsuitable ways, individuals could face an outcome that is unwarranted by the factual situation.

System (and training data). The system itself can introduce new unfairness. Among other things, this can come about by erroneous algorithms or (in the case of a trained model) by problematic training data, e.g., if a certain group of individuals is not properly represented [118].

Output. The human decider can fail to interpret the output properly [136,138], which can lead to further unfairness. They could, for example, lack knowledge of the limitations of the system or fail to take into account that the system output is subject to some systematic uncertainty.

Decision. The human decider can make an unfair decision even in the face of a fair system output and an adequate interpretation thereof, for example if they have conscious or subconscious bias against certain groups [135].
Unfairness in any part of the chain can arguably perpetuate or reinforce unfairness in the world.
In the main paper, we propose a runtime monitoring technique that aims to uncover individual unfairness introduced by the system. By focusing on the system and its input-output relation only, we can say that the system is unfair without having to say anything about the degree of fairness with which an individual is treated in other respects in the decision process. It especially allows us to say that a system output is unfair, even though the outcome of the overall decision process is not. It may, for example, be that the system unfairness is 'cancelled out' by something else that is hidden from the system: an applicant with a stellar-looking CV might be treated unfairly by the system because of their age, but not hiring them is not unfair because they are known to have forged their diploma. Cases like this, however, do not make the unfairness introduced by the system any less problematic.

C.1 EU Anti-Discrimination Law
Anti-discrimination is a principle deeply rooted in EU law. It is enshrined in Art. 21 of the Charter of Fundamental Rights (CFR) [46], which prohibits "[a]ny discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation" as well as "any discrimination based on grounds of nationality". According to Art. 51 CFR, the addressees of this fundamental right are the EU and its institutions, bodies, offices and agencies as well as the Member States, insofar as they implement Union law. They are directly bound by Art. 21 above all in their legislative activities, but also in their executive and judicial measures. In contrast, private individuals are not directly bound by Art. 21 CFR, but they may be bound by regulations implementing this provision. However, according to recent European Court of Justice (ECJ) case law, Art. 21 CFR is directly applicable as a result of Directives, such as Directive 2000/78/EC [120] establishing a general framework for equal treatment in employment and occupation [44, § 76]. Apart from this, while Art. 21 CFR stipulates a general prohibition of any unjustified discrimination, the more specific secondary legislation applicable to private actors prohibits discrimination only in certain sensitive areas and only with regard to certain protected attributes. Correspondingly, private actors may not discriminate against certain persons, to name just a few examples, in employment relationships [120], in cases of abuse of a dominant market position [47, Art. 102] or also in so-called mass transactions, i.e., contracts that are typically concluded without regard to the person on comparable terms in a large number of cases [121].
In contrast, discriminating in other legal relationships or on other grounds such as local origin (as opposed to ethnic origin), or a person's financial situation is not generally prohibited. The rationale behind these "discriminatory standards of anti-discrimination law" [56, 111,124] is the principle of private (or personal) autonomy, and more specifically freedom of contract as one of its manifestations, which govern legal transactions between private individuals [73]. According to this principle, individuals are free to shape their legal relationships according to their own preferences and ideas, however irrational or socially unacceptable they may be. In essence, this also includes a right to discriminate against others. This freedom to autonomously form legal relations is only constrained where this is stipulated by anti-discrimination legislation for policy reasons.
When using an AI system to recruit candidates, developers and deployers have to make sure that the system with its parameters complies with the legal requirements set by anti-discrimination law. This means in particular that the selection of the properties which a classifier should exhibit requires a positive normative choice and should not simply replicate biases present in the status quo [129]. However, the risks associated with deploying such systems in an HR context (such as a malfunction remaining undetected due to the system's opacity, a huge practical relevance of biased outputs due to the systems' scalability, or the human operator's tendency to over-rely on the output produced by the AI system ("automation bias")) raise the question whether it can still be deemed normatively acceptable that the EU legal framework turns a blind eye to certain forms of discrimination. Furthermore, the principle of private autonomy as a rationale for justifying the freedom to discriminate against others is only valid with regard to humans' wilful actions, but not to algorithm-generated output. We are not advocating for abolishing the existing balance between private autonomy (freedom of contract) and the prohibition to discriminate. Humans should therefore still be permitted to differentiate on grounds that are not caught by anti-discrimination law. However, there is no reason to grant the "right to discriminate" also to a non-human system that has merely "learned" this discrimination. In this respect, it seems justified to apply different standards for algorithms with regard to the prohibition of discrimination than for human decisions. With regard to an AI system's decision metrics, therefore, it should be considered to expand the secondary legal framework to include a broad prohibition of discrimination.
This would not mean that all discrimination would be unlawful, since objectively justified unequal treatment is, after all, permissible, but it would shift the focus to the question of objective justification [45]. Another legal challenge that will become even more pressing with the advent of technical decision systems is how to detect and prove prohibited discrimination. This is because the prohibition of discrimination resulting from various legal regulations in certain, especially sensitive, areas, such as human resources, presupposes that a difference in treatment is recognised in the first place. The recognition of discrimination is therefore not only in the interest of the decision-maker, who is threatened with sanctions in the event of a violation of the prohibition of discrimination. Rather, it is also essential for the discriminated party to prove the discrimination. Insofar as a legal claim follows from prohibited discrimination, the principle applies that the person who invokes the legal claim must prove the facts giving rise to the claim. Especially when complex algorithms are used, however, it is likely to be extremely difficult to provide corresponding circumstantial evidence. According to the case law of the ECJ, however, the burden of proof is reversed if the party who has prima facie been discriminated against would otherwise have no effective means of enforcing the prohibition of discrimination [41,42]. Monitoring, as described here, would therefore be a suitable means of providing the "prima facie" evidence necessary for shifting the burden of proof.

C.2 Discrimination and the GDPR
There has recently been discussion on whether and to what extent data protection law contains obligations for non-discriminating data processing, or whether the scope of protection of data protection law is thereby overstretched. There is no explicit prohibition of discrimination in the General Data Protection Regulation (GDPR). According to Article 1(2), however, the GDPR is intended to protect the fundamental rights and freedoms of natural persons. This is aimed in particular at their right to protection of personal data (Article 8 CFR), but not exclusively so. Thus, the broad and non-restrictive reference to fundamental rights also encompasses all other fundamental rights, including the right to non-discrimination (Article 21 CFR) [38]. This is reflected, for example, in the higher level of protection for data with an increased potential for discrimination, the so-called special categories of personal data under Article 9 GDPR. The GDPR can also be interpreted as granting a "preventive protection against discrimination", namely when discrimination is made impossible from the outset, in that the data-processing agencies cannot gain knowledge of characteristics susceptible to discrimination in the first place, i.e., when any respective data processing is forbidden [25]. Any processing of personal data must furthermore comply with the processing principles set out in Article 5 GDPR, including the fairness principle ('personal data shall be processed fairly') set out in Article 5(1)(a). While transparency obligations were formerly read into this principle when the Data Protection Directive was in effect, the regulatory content of the fairness principle has been highly disputed since it was split off into a separate processing principle. But since discriminatory data processing can hardly be described as fair, a prohibition of discrimination can be linked to the fairness principle [55,75].
However, the concrete scope of the fairness principle clearly goes beyond the understanding of fairness in the context of technical systems on which this paper is based.
Specifically for the HR context, there are discrimination-sensitive regulations in the GDPR. Article 9 GDPR makes the processing of special categories of data, i.e., sensitive data and data susceptible to discrimination, subject to particularly strict authorisation criteria, which should in practice rarely be met in recruitment situations. On the one hand, processing for recruitment purposes, i.e., prior to the establishment of an employment relationship, is rarely necessary in order to exercise certain rights and obligations under employment law (Art. 9(2)(b) GDPR), and on the other hand, explicit consent (Art. 9(2)(a) GDPR) will often lack the necessary voluntariness due to the specifics of job application situations and the power imbalances inherent in them. The prohibition of processing sensitive data may be problematic in cases where the link to sensitive data is strictly necessary to detect discriminatory effects. For high-risk systems, Art. 10(5) of the AI Regulation Proposal therefore provides for a new permissive clause: 'To the extent that it is strictly necessary for the purposes of ensuring bias monitoring, detection and correction, ... the providers of such systems may process special categories of personal data' while ensuring appropriate safeguards for the fundamental rights of natural persons.
With regard to the processing of non-sensitive personal data, however, the opening clause in Art. 88(1) GDPR allows Member States to adopt more specific rules for processing for recruitment purposes, whereby, according to paragraph 2, suitable and specific measures must be ensured to safeguard the fundamental rights of the data subject. These requirements can be met by state-of-the-art monitoring tools. The national regulations cannot be discussed in depth here. For Germany, for example, Section 26 of the Federal Data Protection Act (BDSG) stipulates that personal data may only be processed for recruitment purposes if this is necessary, i.e., if the data processing is required for the decision on recruitment. In any case, data processing may not be necessary if the characteristics depicted in the data may not be taken into account in the hiring decision, for example due to anti-discrimination law [101].