Run-Time Optimization for Learned Controllers Through Quantitative Games
Abstract
A controller is a device that interacts with a plant. At each time point, it reads the plant’s state and issues commands with the goal that the plant operates optimally. Constructing optimal controllers is a fundamental and challenging problem. Machine-learning techniques have recently been successfully applied to train controllers, yet they have limitations. Learned controllers are monolithic and hard to reason about. In particular, it is difficult to add features without retraining, to guarantee any level of performance, and to achieve acceptable performance when encountering untrained scenarios. These limitations can be addressed by deploying quantitative run-time shields that serve as a proxy for the controller. At each time point, the shield reads the command issued by the controller and may choose to alter it before passing it on to the plant. We show how optimal shields that interfere as little as possible while guaranteeing a desired level of controller performance can be generated systematically and automatically using reactive synthesis. First, we abstract the plant by building a stochastic model. Second, we consider the learned controller to be a black box. Third, we measure controller performance and shield interference by two quantitative run-time measures that are formally defined using weighted automata. Then, the problem of constructing a shield that guarantees maximal performance with minimal interference is the problem of finding an optimal strategy in a stochastic 2-player game “controller versus shield” played on the abstract state space of the plant with a quantitative objective obtained from combining the performance and interference measures. We illustrate the effectiveness of our approach by automatically constructing lightweight shields for learned traffic-light controllers in various road networks.
The shields we generate avoid liveness bugs, improve controller performance in untrained and changing traffic situations, and add features to learned controllers, such as giving priority to emergency vehicles.
1 Introduction
A challenge in controller synthesis is that, since the number of possible plant readings is huge, it is computationally demanding to find an optimal command, given a plant state. Machine learning is a prominent approach to make decisions based on large amounts of collected data [28, 37]. It is widely successful in practice and plays an integral part in the design process of various systems. Machine learning has been successfully applied to train controllers [15, 33, 34] and specifically controllers for traffic control [20, 35, 39].
A shortcoming of machine-learning techniques is that the controllers that are produced are black-box devices that are hard to reason about and modify without a complete retraining. It is thus challenging, for example, to obtain worst-case guarantees about the controller, which is particularly important in safety-critical settings. Attempts to address this problem come from both the formal methods community [46], where verification of learned systems is extensively studied [24, 29], and the machine-learning community, where guarantees are added during the training process using reward engineering [13, 18] or by modifying the exploration process [11, 19, 38]. Both approaches require expertise in the respective field and suffer from limitations such as scalability for the first, and intricacy and robustness issues for the second. Moreover, both techniques were mostly studied for safety properties.
Another shortcoming of machine-learning techniques is that they require expertise and a fine-tuning of parameters. It is difficult, for example, to train controllers that are robust to changes in plant behavior; e.g., a controller that has been trained on uniform traffic congestion may meet rush-hour traffic, which can be significantly different and can cause poor performance. Also, it is challenging to add features to a controller without retraining, which is both costly and time-consuming. These can include permanent features, e.g., priority to public transport, or temporary changes, e.g., changes due to an accident or construction. Again, since the training process is intricate, adding features during training can have unexpected effects.
In this work, we use quantitative shields to deal with the limitations of learned or any other black-box controllers. A shield serves as a proxy between the controller and the plant. At each point in time, as before, the controller reads the state of the plant and issues a command. Rather than directly feeding the command to the plant, the shield first reads it along with an abstract plant state. The shield can then choose to keep the controller’s command or alter it, before issuing the command to the plant. The concept of shields was first introduced in [30], where shields for safety properties were considered with a qualitative notion of interference: a shield is only allowed to interfere when a controller error occurs, which is only well-defined when considering safety properties. We elaborate on other shield-like approaches in Sect. 1.1.
Our goal is to automatically synthesize shields that optimize quantitative measures for black-box controllers. We are interested in synthesizing lightweight shields. We assume that the controller performs well on average, but has no worst-case guarantees. When combining the shield and the controller, intuitively, the controller should be active for the majority of the time and the shield should intervene only when required. We formalize the plant behavior as well as the interference cost using quantitative measures. Unlike safety objectives, where it is clear when a shield must interfere, with quantitative objectives, a non-interference typically does not have a devastating effect. It is thus challenging to decide, at each time point, whether the shield should interfere or not; the shield needs to balance the cost of interfering with the decrease in performance of not interfering. Automatic synthesis of shields is thus natural in this setting.
We elaborate on the two quantitative measures we define. The interaction between the plant, controller, and shield gives rise to an infinite sequence over \(C \times \varGamma \times \varGamma \), where C is a set of plant states and \(\varGamma \) is a set of allowed actions. A triple \(\langle c, \gamma _1, \gamma _2 \rangle \) means that the plant is in state c, the controller issues command \(\gamma _1\), and the shield (possibly) alters it to \(\gamma _2\). We use weighted automata to assign costs to infinite traces, which have proven to be a convenient, flexible, and robust quantitative specification language [14]. Our behavioral score measures the performance of the plant and it is formally given by a weighted automaton that assigns scores to traces over \(C \times \varGamma \). Boolean properties are a special case, which include safety properties, e.g., “an emergency vehicle should always get a green light”, and liveness, e.g., “a car waiting in a queue eventually gets the green light”. An example of a quantitative score is the long-run average of the waiting times of the vehicles in the city. A second score measures the interference of a shield with a controller. It is given by a weighted automaton over the alphabet \(\varGamma \times \varGamma \). A simple example of an interference score charges the shield 1 for every change of action and charges 0 when no change is made. Then, the score of an infinite trace can be phrased as the ratio of the time that the shield interferes. Using weighted automata we can specify more involved scores such as different charges for different types of alterations or even charges that depend on the past, e.g., altering the controller’s command twice in a row is not allowed.
Given a probabilistic plant model and a formal specification of behavioral and interference scores, the problem of synthesizing an optimal shield is well-defined and can be solved by game theory. While the game-based techniques we use are those of discrete-event controller synthesis [3] in a stochastic setting with quantitative objectives, our setup is quite different. In traditional controller synthesis, there are two entities: the controller and the adversarial plant. The goal is to synthesize a controller offline. In our setting, there are three entities: the plant, whose behavior we model probabilistically, the controller, which we treat as a black box and model as an adversary, and the shield, which we synthesize. Note that the shield’s synthesis procedure is done offline but it makes online decisions when it operates together with the controller and plant. Our plant model is formally given by a Markov decision process (MDP), a standard model in which lack of knowledge about the plant is captured using probability (see Fig. 1 and details in Example 1). The game is played on the MDP by two players, a shield and a controller, where the quantitative objective is given by the two scores. An optimal shield is then extracted from an optimal strategy for the shield player. The game we construct admits memoryless optimal strategies, thus the size of the shield is proportional to the size of the abstraction of the plant. In addition, it is implemented as a lookup table for actions in every state. Thus, the runtime overhead is a table lookup and hence negligible.
We experiment with our framework by constructing shields for traffic lights in a network of roads. Our experimental results illustrate the usefulness of the framework. We construct shields that consistently improve the performance of controllers, especially when exhibiting behavior that they are not trained on, but, more surprisingly, also when exhibiting trained behavior. We show that the use of a shield reduces variability in performance among various controllers; thus, when using a shield, the choice of the parameters used in the training phase becomes less critical. We show how a shield can be used to add the functionality of prioritizing public transport as well as local fairness to a controller, both without retraining the controller. In addition, we illustrate how shields can add worst-case guarantees on liveness without a costly verification of the controller.
1.1 Related Work
A shield-like approach to adding safety to systems is called runtime assurance [47], and has applications, for example, in control of robotics [41] and drones [12]. In this framework, a switching mechanism alternates between running a high-performance system and a provably safe one. These works differ from ours since they consider safety specifications. As mentioned earlier, a challenge with quantitative specifications is that, unlike safety specifications, a non-interference typically does not have a devastating effect, thus it is not trivial to decide when and to what extent to interfere.
Another line of work is runtime enforcement, where an enforcer monitors a program that outputs events and can either terminate the program once it detects an error [45], or alter the event in order to guarantee, for example, safety [21], richer qualitative objectives [16], or privacy [26, 49]. The similarity between an enforcer and a shield lies in their ability to alter events. The settings are quite different, however, since the enforced program is not reactive whereas we consider a plant that receives commands.
Recently, formal approaches were proposed in order to restrict the exploration of the learning agent such that a set of logical constraints is always satisfied. This method can support other properties beyond safety, e.g., probabilistic computation tree logic (PCTL) [25, 36], linear temporal logic (LTL) [1], or differential dynamic logic [17]. To the best of our knowledge, quantitative specifications have not yet been considered. Unlike these approaches, we consider the learned controller as a black box, thus our approach is particularly suitable for machine-learning non-experts.
While MDPs and partially observable MDPs have been widely studied in the literature w.r.t. quantitative objectives [27, 43], our framework requires the interaction of two players (the shield and the black-box controller), and we use a game-theoretic framework with quantitative objectives for our solution.
2 Definitions and Problem Statement
2.1 Plants, Controllers, and Shields
The interaction with a plant over a concrete set of states C is carried out using two functionalities: Plant.GetState returns the plant’s current state and Plant.IssueCommand issues an action from a set \(\varGamma \). Once an action is issued, the plant updates its state according to some unknown transition function. At each point in time, the controller reads the state of the plant and issues a command. Thus, it is a function from a history in \((C \times \varGamma )^* \cdot C\) to \(\varGamma \).
Informally, a shield serves as a proxy between the controller and the plant. At each time point, it reads the controller’s issued action and can choose an alternative action to issue to the plant. We are interested in lightweight shields that add little or no overhead to the controller, thus the shield must be defined w.r.t. an abstraction of the plant, which we define formally below.
Abstraction. An abstraction is a Markov decision process (MDP, for short) \(\mathcal{A}= \langle \varGamma , A, a_0, \delta \rangle \), where \(\varGamma \) is a set of actions, A is a set of abstract plant states, \(a_0 \in A\) is an initial state, and \(\delta : A \times \varGamma \rightarrow [0,1]^A\) is a probabilistic transition function, i.e., for every \(a \in A\) and \(\gamma \in \varGamma \), we have \(\sum _{a' \in A} \delta (a, \gamma )(a') = 1\). The probabilities in the abstraction model our lack of knowledge of the plant, and we assume that they reflect the behavior exhibited by the plant. A policy f is a function from a finite history of states in \(A^*\) to the next action in \(\varGamma \), thus it gives rise to a probabilistic distribution \(\mathcal{D}(f)\) over infinite sequences over A.
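As a concrete illustration, an abstraction MDP can be stored directly as a lookup structure; the following Python sketch (the names `Mdp` and `validate` are ours, not the paper's) checks that every \(\delta (a,\gamma )\) is indeed a probability distribution over A:

```python
from dataclasses import dataclass

@dataclass
class Mdp:
    actions: list   # Gamma
    states: list    # A
    initial: object # a0 in A
    delta: dict     # (a, gamma) -> {a': probability}

def validate(mdp, tol=1e-9):
    """Check that every delta(a, gamma) is a probability distribution over A."""
    for (a, gamma), dist in mdp.delta.items():
        assert a in mdp.states and gamma in mdp.actions
        assert all(a2 in mdp.states for a2 in dist)
        assert abs(sum(dist.values()) - 1.0) < tol
    return True

# A toy two-state abstraction with a single action.
toy = Mdp(actions=["NS"], states=["low", "high"], initial="low",
          delta={("low", "NS"): {"low": 0.75, "high": 0.25},
                 ("high", "NS"): {"low": 0.5, "high": 0.5}})
```

A sketch only; a real abstraction, such as the junction of Example 1, would enumerate all vectors in \(\{0,\ldots ,k\}^4\) as states.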
Example 1
Consider a plant that represents a junction with four incoming directions (see Fig. 1). We describe an abstraction \(\mathcal{A}\) for the junction that specifies how many cars are waiting in each queue, where we cut off the count at a parameter \(k \in \mathbb {N}\). Formally, an abstract state is a vector in \(\{ 0,\ldots , k \}^4\), where the indices respectively represent the North, East, South, and West queues. The larger k is, the closer the abstraction is to the concrete plant. The set of possible actions represents the possible light directions in the junction \(\{ \text {NS}, \text {EW} \}\). The abstract transitions estimate the plant behavior, and we describe them in two steps. Consider an abstract state \(a = (a_1, a_2, a_3, a_4)\) and suppose the issued action is \(\text {NS}\), where the case of \(\text {EW}\) is similar. We allow a car to cross the junction from each of the North and South queues and decrease the two queues. Let \(a' = (\max \{ 0, a_1 - 1 \}, a_2, \max \{ 0, a_3 - 1 \}, a_4)\). Next, we probabilistically model incoming cars to the queues as follows. Consider a vector \(\langle i_1, i_2, i_3, i_4 \rangle \in \{ 0,1 \}^4\) that represents incoming cars to the queues. Let \(a''\) be such that, for \(1 \le j \le 4\), we add \(i_j\) to the jth queue and trim at k, thus \(a''_j = \min \{ a'_j + i_j, k \}\). Then, in \(\mathcal{A}\), when performing action \(\text {NS}\) in a, we move to \(a''\) with the uniform probability 1/16.\(\square \)
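The two-step transition of Example 1 can be sketched in a few lines; this is a minimal illustration (the function name `successors` and the cut-off `K = 5` are our choices):

```python
import itertools

K = 5  # queue cut-off parameter k

def successors(a, action):
    """Successor distribution of the junction abstraction: one car crosses
    from each green direction, then each of the 16 incoming-car vectors
    is added (trimmed at K) with uniform probability 1/16."""
    a = list(a)
    green = (0, 2) if action == "NS" else (1, 3)  # N,S or E,W queue indices
    for j in green:
        a[j] = max(0, a[j] - 1)
    dist = {}
    for inc in itertools.product((0, 1), repeat=4):
        succ = tuple(min(a[j] + inc[j], K) for j in range(4))
        dist[succ] = dist.get(succ, 0.0) + 1.0 / 16
    return dist

d = successors((3, 0, 1, 2), "NS")
```

Note that when no trimming occurs, the 16 incoming-car vectors yield 16 distinct successors, each with probability 1/16; near the cut-off k, distinct vectors can merge into the same abstract state, and their probabilities add up.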
2.2 Quantitative Objectives for Shields
We are interested in two types of performance measures for shields. The behavioral measure quantifies the quality of the plant’s behavior when operated with a controller and shield. The interference measure quantifies the degree to which a shield interferes with the controller. Formally, we need to specify values for infinite sequences, and we use weighted automata, which are a convenient model to express such values.
Weighted Automata. A weighted automaton is a function from infinite strings to values. Technically, a weighted automaton is similar to a standard automaton, except that the transitions are labeled, in addition to letters, with numbers (weights). Unlike standard automata, in which a run is either accepting or rejecting, a run in a weighted automaton has a value. We focus on limit-average automata, in which the value of a run is the limit average of the weights it traverses. Formally, a weighted automaton is \(\mathcal{W}= \langle \varSigma , Q, q_0, \varDelta , cost \rangle \), where \(\varSigma \) is a finite alphabet, Q is a finite set of states, \(\varDelta \subseteq (Q \times \varSigma \times Q)\) is a deterministic transition relation, i.e., for every \(q \in Q\) and \(\sigma \in \varSigma \), there is at most one \(q' \in Q\) with \(\varDelta (q, \sigma , q')\), and \(cost: \varDelta \rightarrow \mathbb {Q}\) specifies costs for transitions. A run of \(\mathcal{W}\) on an infinite word \(\sigma = \sigma _1, \sigma _2,\ldots \) is \(r = r_0,r_1,\ldots \in Q^\omega \) such that \(r_0 = q_0\) and, for \(i \ge 1\), we have \(\varDelta (r_{i-1}, \sigma _i, r_i)\). Note that \(\mathcal{W}\) is deterministic, so there is at most one run on every word. The value that \(\mathcal{W}\) assigns to \(\sigma \) is \(\liminf _{n \rightarrow \infty } \frac{1}{n} \sum _{i=1}^n cost(r_{i-1}, \sigma _i,r_i)\).
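The value definition can be illustrated on a finite prefix; a minimal sketch (the names `run_average`, `delta`, and `cost` are ours) runs a deterministic weighted automaton and returns the average cost over the prefix, which approximates the limit-average value of a long word:

```python
def run_average(q0, delta, cost, word):
    """Run a deterministic weighted automaton on a finite word and
    return the average of the traversed transition costs."""
    q, total = q0, 0.0
    for sigma in word:
        q_next = delta[(q, sigma)]       # deterministic: at most one successor
        total += cost[(q, sigma, q_next)]
        q = q_next
    return total / len(word)

# One-state automaton over {a, b} charging 1 for every letter b.
delta = {(0, "a"): 0, (0, "b"): 0}
cost = {(0, "a", 0): 0.0, (0, "b", 0): 1.0}
v = run_average(0, delta, cost, "abba")  # 2 b's out of 4 letters
```

On an infinite word, the limit inferior of these prefix averages is the value assigned by \(\mathcal{W}\).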
Behavioral Score. A behavioral score measures the quality of the behavior that the plant exhibits. It is given by a weighted automaton over the alphabet \(A \times \varGamma \), thus it assigns real values to infinite sequences over \(A \times \varGamma \). In our experiments, we use a concrete behavioral score, which assigns values to infinite sequences over \(C \times \varGamma \). We compare the performance of the plant with various controllers and shields w.r.t. the concrete score rather than the abstract score. With a weighted automaton we can express costs that change over time: for example, we can penalize traffic lights that change frequently.
Interference Score. The second score we consider measures the interference of the shield with the controller. An interference score is given by a weighted automaton over the alphabet \(\varGamma \times \varGamma \). With a weighted automaton we can express costs that change over time: for example, interfering once costs 1 and any successive interference costs 2, thus we reward the shield for short interferences.
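The past-dependent interference score mentioned above (a first interference costs 1, each immediately repeated interference costs 2) corresponds to a two-state weighted automaton; a small sketch under these assumptions (the function name is ours):

```python
def interference_cost(trace):
    """Escalating interference score: a first interference costs 1,
    each immediately following interference costs 2; the score of the
    trace is the average cost per step."""
    state, total = "idle", 0
    for gamma_ctrl, gamma_shield in trace:
        if gamma_ctrl == gamma_shield:
            state = "idle"               # no alteration: cost 0
        else:
            total += 1 if state == "idle" else 2
            state = "interfering"
    return total / len(trace)

# Two consecutive interferences (1 + 2), a pause (0), one more (1).
trace = [("NS", "EW"), ("NS", "EW"), ("NS", "NS"), ("EW", "NS")]
```

The two automaton states ("idle" and "interfering") carry exactly the memory of whether the previous step was an interference, which is what makes the escalating charge expressible.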
From Shields and Controllers to Policies. Consider an abstraction MDP \(\mathcal{A}\). To ensure worstcase guarantees, we treat the controller as an adversary for the shield. Let \({\textsc {Shield}} \) be a shield with memory set M and initial memory state \(m_0\). Intuitively, we find a policy in \(\mathcal{A}\) that represents the interaction of \({\textsc {Shield}} \) with a controller that maximizes the cost incurred. Formally, an abstract controller is a function \(\chi : A^* \rightarrow \varGamma \). The interaction between \({\textsc {Shield}} \) and \(\chi \) gives rise to a policy \(pol({\textsc {Shield}}, \chi )\) in \(\mathcal{A}\), which, recall, is a function from \(A^*\) to \(\varGamma \). We define \(pol({\textsc {Shield}}, \chi )\) inductively as follows. Consider a history \(\pi \in A^*\) that ends in \(a \in A\), and suppose the current memory state of \({\textsc {Shield}} \) is \(m \in M\). Let \(\gamma = \chi (\pi )\) and let \(\langle \gamma ', m' \rangle = {\textsc {Shield}} (\gamma , a, m)\). Then, the action that the policy \(pol({\textsc {Shield}}, \chi )\) assigns is \(\gamma '\), and we update the memory state to be \(m'\).
Problem Definition: Quantitative Shield Synthesis. Consider an abstraction MDP \(\mathcal{A}\), a behavioral score \({\textsc {Beh}} \) and an interference score \({\textsc {Int}} \), both given as weighted automata, and a factor \(\lambda \in [0,1]\) with which we weigh the two scores. Our goal is to find an optimal shield w.r.t. these inputs, as we define below. Consider a shield \({\textsc {Shield}} \) with memory set M. Let X be the set of abstract controllers. For \({\textsc {Shield}} \) and \(\chi \in X\), let \(\mathcal{D}({\textsc {Shield}}, \chi )\) be the probability distribution over \(A \times \varGamma \times \varGamma \) that the policy \(pol({\textsc {Shield}}, \chi )\) gives rise to. The value of \({\textsc {Shield}} \), denoted \(val({\textsc {Shield}})\), is \(\sup _{\chi \in X} \mathbb {E}_{r \sim \mathcal{D}({\textsc {Shield}}, \chi )}[\lambda \cdot {\textsc {Int}} (r) + (1-\lambda ) \cdot {\textsc {Beh}} (r)]\). An optimal shield is a shield whose value is \(\inf _{{\textsc {Shield}}} val({\textsc {Shield}})\).
Remark 1
(Robustness and flexibility). The problem definition we consider allows quantitative optimization of shields w.r.t. two dimensions of quantitative measures. Earlier works have considered shields, but mainly with respect to Boolean measures in both dimensions. For example, in [30], shields for safety behavioral measures were constructed with a Boolean notion of interference, as well as a Boolean notion of shield correctness. In contrast, we allow quantitative objectives in both dimensions, which presents a much more general and robust framework. For example, the first measure of correctness can be quantitative and minimize the error rate, and the second measure can allow shields to correct but minimize the long-run average interference. Both of the above allow the shield to be flexible. Moreover, tuning the parameter \(\lambda \) allows a flexible trade-off between the two.
We allow a robust class of quantitative specifications using weighted automata, which have already been established as a robust specification framework. Any automata model can be used in the framework, not necessarily the ones we use here. For example, weighted automata that discount the future or process only finite words are suitable for planning purposes [32]. Our framework is thus a robust and flexible one for quantitative shield synthesis. \(\square \)
2.3 Examples
In Remark 1 we already discussed the flexibility of the framework. We now present concrete examples of instantiations of the optimization problem above on our running example, which illustrate how quantitative shields can be used to cope with limitations of learned controllers.
Dealing with Unexpected Plant Behavior; Rush-Hour Traffic. Consider the abstraction described in Example 1, where each abstract state is a 4-dimensional vector that represents the number of waiting cars in each direction. The behavioral score we use is called the max queue. It charges an abstract state \(a \in \{ 0,\ldots , k \}^4\) with the size of the maximal queue, no matter what the issued action is, thus \(cost_{\textsc {Beh}} (a) = \max _{i \in \{ 1,2,3,4 \}} a_i\). A shield that minimizes the max-queue cost will prioritize the direction with the largest queue. For the interference score, we use a score that we call the basic interference score; we charge the shield 1 whenever it changes the controller’s action and otherwise we charge it 0, and take the long-run average of the costs. Recall that in the construction in Example 1, we chose uniformly at random the vector of incoming cars. Here, in order to model rush-hour traffic, we use a different distribution, where we let \(p_j\) be the probability that a car enters the jth queue. Then, the probability of a vector \(\langle i_1, i_2, i_3, i_4 \rangle \in \{ 0,1 \}^4\) is \(\prod _{1 \le j \le 4} (p_j\cdot i_j + (1-p_j)\cdot (1-i_j))\). To model a higher load traveling on the North-South route, we increase \(p_1\) and \(p_3\) beyond 0.5.
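The rush-hour arrival distribution is a product of four independent Bernoulli choices; a minimal sketch (the function name `arrival_prob` is illustrative):

```python
def arrival_prob(inc, p):
    """Probability of the incoming-car vector inc in {0,1}^4 when a car
    enters queue j independently with probability p[j]."""
    prob = 1.0
    for i_j, p_j in zip(inc, p):
        prob *= p_j * i_j + (1 - p_j) * (1 - i_j)
    return prob
```

With \(p = (0.5, 0.5, 0.5, 0.5)\) every vector gets the uniform probability 1/16, recovering Example 1; raising \(p_1\) and \(p_3\) skews the arrivals toward the North-South route.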
Weighing Different Goals; Local Fairness. Suppose the controller is trained to maximize the number of cars passing a city. Thus, it aims to maximize the speed of the cars in the city and prioritizes highways over farm roads. A secondary objective for a controller is to minimize local queues. Rather than adding this objective in the training phase, which can have an unexpected outcome, we can add a local shield for each junction. To synthesize the shield, we use the same abstraction and basic interference score as in the above. The behavioral score we use charges an abstract state \(a \in \{ 0,\ldots , k \}^4\) with the difference \((a_1 + a_3) - (a_2+a_4)\), thus the greater the inequality between the two waiting directions, the higher the cost.
Adding Features to the Controller; Prioritizing Public Transport. Suppose a controller is trained to increase throughput in a junction. After the controller is trained, a designer wants to add a functionality to the controller that prioritizes buses over personal vehicles. That is, if a bus is waiting in the North direction, and no bus is waiting in either the East or West directions, then the light should be North-South, and the other cases are similar. The abstraction we use is simpler than the ones above since we only differentiate between a case in which a bus is present or not, thus the abstract states are \(\{ 0,1 \}^4\), where the indices represent the directions clockwise starting from North. Let \(\gamma = \text {NS}\). The behavioral cost of a state a is 1 when \(a_2=a_4=0\) and (\(a_1=1\) or \(a_3=1\)). The interference score we use is the basic one. A shield guarantees that, in the long run, the specification is essentially never violated.
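One plausible reading of this behavioral score charges the shield whenever the issued light contradicts a waiting bus with no cross-traffic bus; the following sketch makes that reading explicit (the symmetric treatment of the EW case is our assumption, matching the "other cases are similar" remark):

```python
def bus_priority_cost(a, action):
    """Charge 1 when a bus waits only on the North-South route but the
    light is not NS, and symmetrically for East-West; 0 otherwise."""
    n, e, s, w = a  # bus-present flags, clockwise from North
    if (n or s) and not (e or w):
        return 0 if action == "NS" else 1
    if (e or w) and not (n or s):
        return 0 if action == "EW" else 1
    return 0  # no bus, or buses on both routes: no charge
```

A shield minimizing the long-run average of this cost will flip the light whenever a bus is left waiting without competition.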
3 A Game-Theoretic Approach to Quantitative Shield Synthesis
In order to synthesize optimal shields we construct a two-player stochastic game [10], where we associate Player \(2\) with the shield and Player \(1\) with the controller. The game is defined on top of an abstraction and the players’ objectives are given by the two performance measures. We first formally define stochastic games, then we construct the shield synthesis game, and finally show how to extract a shield from a strategy for Player \(2\).
Stochastic Graph Games. The game is played on a graph by placing a token on a vertex and letting the players move it throughout the graph. For ease of presentation, we fix the order in which the players move: first, Player \(1\), then Player \(2\), and then “Nature”, i.e., the next vertex is chosen randomly. Edges have costs, which, again for convenience, appear only on edges following Player \(2\) moves. Formally, a two-player stochastic graph game is \(\langle V_1, V_2, V_N, E, \Pr , cost \rangle \), where \(V= V_1 \cup V_2 \cup V_N\) is a finite set of vertices that is partitioned into three sets, for \(i \in \{ 1,2 \}\), Player \(i\) controls the vertices in \(V_i\) and “Nature” controls the vertices in \(V_N\), \(E \subseteq (V_1 \times V_2) \cup (V_2 \times V_N)\) is a set of deterministic edges, \(\Pr :V_N \times V_1 \rightarrow [0,1]\) is a probabilistic transition function, and \(cost:(V_2 \times V_N) \rightarrow \mathbb {Q}\). Suppose the token reaches \(v \in V\). If \(v \in V_i\), for \(i \in \{ 1,2 \}\), then Player \(i\) chooses the next position of the token \(u \in V\), such that E(v, u). If \(v \in V_N\), then the next position is chosen randomly; namely, the token moves to \(u \in V\) with probability \(\Pr [v,u]\).
The game is a zero-sum game; Player \(1\) tries to maximize the expected long-run average of the accumulated costs, and Player \(2\) tries to minimize it. A strategy for Player \(i\), for \(i\in \{ 1,2 \}\), is a function that takes a history in \(V^*\cdot V_i\) and returns the next vertex to move the token to. The games we consider admit memoryless optimal strategies, thus it suffices to define a Player \(i\) strategy as a function from \(V_i\) to V. We associate a payoff with two strategies \(f_1\) and \(f_2\), which we define next. Given \(f_1\) and \(f_2\), it is not hard to construct a Markov chain \(\mathcal{M}\) with states \(V_N\) and with weights on the edges: for \(v,u \in V_N\), the probability of moving from v to u in \(\mathcal{M}\) is \(\Pr _\mathcal{M}[v,u] = \sum _{w \in V_1: f_2(f_1(w)) = u} \Pr [v,w]\) and the cost of the edge is \(cost_\mathcal{M}(v,u) = \sum _{w \in V_1: f_2(f_1(w)) = u} \Pr [v,w] \cdot cost(f_1(w), u)\). The stationary distribution \(s_v\) of a vertex \(v \in V_N\) in \(\mathcal{M}\) is a well-known concept [43] and it intuitively measures the long-run average time that is spent in v. The payoff w.r.t. \(f_1\) and \(f_2\), denoted \(\text {payoff}(f_1, f_2)\), is \(\sum _{v,u \in V_N} s_v \cdot \Pr _\mathcal{M}[v,u] \cdot cost_\mathcal{M}(v,u)\). The payoff of a strategy is the payoff it guarantees against any strategy of the other player, thus \(\text {payoff}(f_1) = \inf _{f_2} \text {payoff}(f_1, f_2)\). A strategy is optimal for Player \(1\) if it achieves the optimal payoff, thus f is optimal if \(\text {payoff}(f) = \sup _{f_1} \text {payoff}(f_1)\). The definitions for Player \(2\) are dual.
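Once two memoryless strategies are fixed, evaluating the payoff reduces to a stationary-distribution computation on the induced Markov chain; a dependency-free sketch on a toy two-state chain (the names `stationary` and `mean_payoff` and the matrices are ours, and power iteration is just one convenient way to approximate the distribution):

```python
def stationary(P, iters=5000):
    """Approximate the stationary distribution of an ergodic Markov chain
    by power iteration on the row-stochastic matrix P."""
    n = len(P)
    s = [1.0 / n] * n
    for _ in range(iters):
        s = [sum(s[v] * P[v][u] for v in range(n)) for u in range(n)]
    return s

def mean_payoff(P, C):
    """Long-run average cost: sum over v,u of s_v * P[v][u] * C[v][u]."""
    s = stationary(P)
    n = len(P)
    return sum(s[v] * P[v][u] * C[v][u] for v in range(n) for u in range(n))

P = [[0.5, 0.5], [0.25, 0.75]]  # toy induced chain
C = [[0.0, 1.0], [1.0, 0.0]]    # edge costs
```

For this chain the stationary distribution is \((1/3, 2/3)\) and the mean payoff evaluates to \(1/3\).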
The Game Construction. Given an abstraction \(\mathcal{A}= \langle \varGamma , A, a_0, \delta \rangle \), an interference automaton \({\textsc {Int}} \) with states \(Q_1\) and transitions \(\varDelta _{\textsc {Int}} \), a behavioral automaton \({\textsc {Beh}} \) with states \(Q_2\) and transitions \(\varDelta _{\textsc {Beh}} \), and a factor \(\lambda \), we construct the game \(\mathcal{G}_{\mathcal{A},{\textsc {Beh}}, {\textsc {Int}},\lambda }\) over the state space \(S = A \times Q_1 \times Q_2\) as follows:
\(V_1 = S\),

\(V_2 = S \times \varGamma \),

\(V_N = S\times \varGamma \times \{ N \}\), where the purpose of N is to differentiate between the vertices,

\(E(s, \langle s, \gamma \rangle )\) for \(s \in S \text { and } \gamma \in \varGamma \), and \(E(\langle s, \gamma \rangle , \langle s', \gamma ',N \rangle )\) for \(s = \langle a, q_1, q_2 \rangle \in S, \gamma , \gamma ' \in \varGamma , \text { and } s' = \langle a, q'_1, q'_2 \rangle \in S \text { s.t. } \varDelta _{\textsc {Int}} (q_1, \langle \gamma , \gamma ' \rangle , q'_1) \text { and } \varDelta _{\textsc {Beh}} (q_2, \langle a,\gamma ' \rangle , q'_2)\),

\(\Pr [\langle \langle a,q_1, q_2 \rangle , \gamma ,N \rangle , \langle a', q_1, q_2 \rangle ] = \delta (a, \gamma )(a')\), and

for \(s = \langle a,q_1,q_2 \rangle \) and \(s' = \langle a, q'_1, q'_2 \rangle \) as in the above, we have \(cost(\langle s, \gamma \rangle ,\)\( \langle s', \gamma ',N \rangle ) = \) \(\lambda \cdot cost_{\textsc {Int}} (q_1, \langle \gamma , \gamma ' \rangle , q'_1) + (1-\lambda ) \cdot cost_{\textsc {Beh}} (q_2, \langle a, \gamma ' \rangle , q'_2)\).
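The vertex sets above grow as a product of the abstraction and the two automata; a small enumeration sketch (the function name and the toy sizes are ours) makes the blow-up explicit:

```python
import itertools

def build_vertices(A, Q_int, Q_beh, Gamma):
    """Vertex sets of the synthesis game over S = A x Q_int x Q_beh:
    Player 1 (controller) owns S, Player 2 (shield) owns S x Gamma,
    and Nature owns S x Gamma x {N}."""
    S = list(itertools.product(A, Q_int, Q_beh))
    V1 = S
    V2 = [(s, g) for s in S for g in Gamma]
    VN = [(s, g, "N") for s in S for g in Gamma]
    return V1, V2, VN

# Toy instance: 2 abstract states, 1 interference state, 2 behavioral states.
V1, V2, VN = build_vertices(A=[0, 1], Q_int=["q"], Q_beh=["p0", "p1"],
                            Gamma=["NS", "EW"])
```

With \(|A| = n_1\), \(|Q_1| = n_2\), \(|Q_2| = n_3\), the game has \(n_1 n_2 n_3\) Player-1 vertices and \(n_1 n_2 n_3 \cdot |\varGamma |\) vertices for each of Player 2 and Nature, matching the shield size discussed in Remark 2.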
Theorem 1
Given an abstraction \(\mathcal{A}\), weighted automata \({\textsc {Beh}} \) and \({\textsc {Int}} \), and a factor \(\lambda \), the game \(\mathcal{G}_{\mathcal{A},{\textsc {Beh}}, {\textsc {Int}},\lambda }\) admits optimal memoryless strategies. Let f be an optimal memoryless strategy for Player \(2\). The shield \({\textsc {Shield}} _f\) is an optimal shield w.r.t. \(\mathcal{A}\), \({\textsc {Beh}} \), \({\textsc {Int}} \), and \(\lambda \).
Remark 2
(Shield size). Recall that a shield is a function \({\textsc {Shield}}: A \times \varGamma \times M \rightarrow \varGamma \times M\), which we store as a table. The size of the shield is the size of the domain, namely the number of entries in the table. Given an abstraction with \(n_1\) states, a set of possible commands \(\varGamma \), and weighted automata with \(n_2\) and \(n_3\) states, the size of the shield we construct is \(n_1 \cdot n_2 \cdot n_3 \cdot |\varGamma |\).\(\square \)
Remark 3
Our construction of the game can be seen as a two-step procedure: we construct a stochastic game with two mean-payoff objectives, a.k.a. a two-dimensional game, where the shield player’s goal is to minimize both the behavioral and interference scores separately. We then reduce the game to a “one-dimensional” game by weighing the scores with the parameter \(\lambda \). We perform this reduction for several reasons. First, while multi-dimensional quantitative objectives have been studied in several cases, such as MDPs [4, 6, 7] and special problems of stochastic games (e.g., almost-sure winning) [2, 5, 8], there is no general algorithmic solution known for stochastic games with two-dimensional objectives. Second, even for non-stochastic games with two-dimensional quantitative objectives, infinite memory is required in general [48]. Finally, in our setting, the parameter \(\lambda \) provides a meaningful trade-off: it can be associated with how well we value the quality of the controller. If the controller is of poor quality, then we charge the shield less for interference and set \(\lambda \) to be low. On the other hand, for a high-quality controller, we charge the shield more for interferences and set a high value for \(\lambda \).\(\square \)
4 Case Study
We experiment with our framework by designing quantitative shields for traffic-light controllers that are trained using reinforcement learning (RL). We illustrate the usefulness of shields in dealing with the limitations of RL as well as in providing an intuitive framework to complement RL techniques.
Traffic Simulation. All experiments were conducted using the traffic simulator “Simulation of Urban MObility” (SUMO, for short) [31] v0.22 via the SUMO Python API. Incoming traffic in the cities is chosen randomly. The simulations were executed on a desktop computer with a 4 x 2.70 GHz Intel Core i7-7500U CPU and 7.7 GB of RAM, running Ubuntu 16.04.
The Traffic-Light Controller. We use RL to train a city-wide traffic-signal controller. Intuitively, the controller is aware of the cars waiting at each junction, and its actions constitute a light assignment to all the junctions. We train a controller using a deep convolutional Q-network [37]. In most of the networks we test, there are two controlled junctions. The input vector to the neural network is a 16-dimensional vector, with 8 dimensions per junction. For each junction, the first four components state the number of cars approaching the junction and the last four components state the accumulated waiting time of the cars in each one of the lanes. For example, in Fig. 1, the first four components are (3, 6, 3, 1); thus, the controller’s state is not trimmed at 5. The controller is trained to minimize both the number of cars waiting in the queues and the total waiting time. For each junction i, the controller can choose to set the light to be either \(\text {NS}_i\) or \(\text {EW}_i\); thus, the set of possible actions is \(\varGamma = \{\text {NS}_1 \text {NS}_2, \text {EW}_1 \text {NS}_2,\text {NS}_1 \text {EW}_2,\text {NS}_1 \text {EW}_2\}\).
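The observation encoding described above can be sketched as follows; the function and field names are illustrative, not taken from the implementation.

```python
import numpy as np

# For each of the two junctions: 4 queue lengths, then 4 accumulated
# waiting times, concatenated into one 16-dimensional input vector.

def encode_state(junctions):
    """junctions: list of (queue_lengths, waiting_times), 4 values each."""
    parts = []
    for queues, waits in junctions:
        parts.extend(queues)
        parts.extend(waits)
    return np.asarray(parts, dtype=float)

state = encode_state([
    ((3, 6, 3, 1), (12.0, 40.0, 9.0, 2.0)),  # junction 1 (counts as in Fig. 1)
    ((0, 2, 5, 1), (0.0, 7.0, 30.0, 3.0)),   # junction 2 (invented values)
])
assert state.shape == (16,)
assert state[1] == 6  # raw count: the controller's state is not trimmed
```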
We use a network consisting of 4 layers: the input layer is a convolutional layer with 16 nodes, and the first and second hidden layers consist of 604 and 1166 nodes, respectively. The output layer consists of 4 neurons with linear activation functions, each representing one of the actions listed in \(\varGamma \). The Q-learning uses the learning rate \(\alpha =0.001\) and the discount factor 0.95 for the Q-update, and an \(\epsilon \)-greedy exploration policy. The artificial neural network is built on an open-source implementation^{1} using Keras [9]; additional optimized functionality was provided by the NumPy [40] library. We train for 100 training epochs, where each epoch is 1500 seconds of simulated traffic, plus 2000 additional seconds in which no new cars are introduced. The total training time of the agent is roughly 1.5 hours. While the RL procedure we use is a simple one, it is inspired by standard approaches to learning traffic controllers and produces controllers that perform relatively well even without a shield.
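The deep Q-network itself is not reproduced here; as a hedged illustration, the sketch below shows the tabular analogue of the described update (with \(\alpha = 0.001\) and discount 0.95) and the \(\epsilon\)-greedy exploration policy. All names are ours.

```python
import random

ALPHA, GAMMA = 0.001, 0.95  # learning rate and discount factor as stated

def q_update(Q, s, a, reward, s_next, actions):
    """Standard Q-learning update on a dict-backed Q-table."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)

def epsilon_greedy(Q, s, actions, eps, rng=random):
    if rng.random() < eps:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit

Q = {}
q_update(Q, "s0", "NS", reward=-4.0, s_next="s1", actions=["NS", "EW"])
assert abs(Q[("s0", "NS")] - ALPHA * (-4.0)) < 1e-12
assert epsilon_greedy(Q, "s0", ["NS", "EW"], eps=0.0) == "EW"
```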
The Shield. We synthesize a “local” shield for a single junction and copy the shield to each junction in the city. Recall that the first step in constructing the synthesis game is to construct an abstraction of the plant, which intuitively represents the information according to which the shield makes its decisions. The abstraction we use is described in Example 1; each state is a 4-dimensional vector of integers in \(\{ 0,\ldots , k \}\), which represents an abstraction of the number of waiting cars in each direction, cut off at \(k \in \mathbb {N}\). As elaborated in the example, when a shield assigns a green light to a direction, we evict a car from the two respective queues, and select the incoming cars uniformly at random. Regarding objectives, in most of our experiments, the behavioral score we use charges an abstract state \(a \in \{ 0,\ldots , k \}^4\) with \(|(a_1 + a_3) - (a_2+a_4)|\); thus, the shield aims to balance the total number of waiting cars per direction. The interference score we use charges the shield 1 for altering the controller’s action.
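The abstraction and behavioral score above can be sketched as follows. The single uniform arrival per step is our simplification of the example's arrival model, and all names are illustrative.

```python
import random

K = 6  # queues are cut off at k

def behavioral_score(a):
    """Charge |(a1 + a3) - (a2 + a4)|: imbalance between the directions."""
    return abs((a[0] + a[2]) - (a[1] + a[3]))

def step(a, green_ns, rng=random):
    """One abstract transition under a green light for NS or EW."""
    a = list(a)
    served = (0, 2) if green_ns else (1, 3)
    for q in served:                       # evict one car per served queue
        a[q] = max(a[q] - 1, 0)
    arrival = rng.randrange(4)             # incoming car chosen uniformly
    a[arrival] = min(a[arrival] + 1, K)    # cut off at k
    return tuple(a)

assert behavioral_score((2, 0, 3, 1)) == 4   # (2+3) - (0+1)
s = step((2, 0, 3, 1), green_ns=True, rng=random.Random(0))
assert all(0 <= q <= K for q in s)
```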
Since we use simple automata for objectives, the size of the shields we use is \(|A| \cdot |\varGamma |\), where \(|\varGamma | = 2\). In our experiments, we cut off the queues at \(k=6\), which results in a shield of size 2592. The synthesis procedure’s running time is in the order of minutes. We have already pointed out that we are interested in small, lightweight shields, and this is indeed what we construct. In terms of absolute size, the shield takes \({\sim }60\,\text {KB}\) versus the controller, which takes \({\sim }3\,\text {MB}\): a difference of two orders of magnitude.
Our synthesis procedure includes solving a stochastic mean-payoff game. The complexity of solving such games is an interesting combinatorial problem in NP and coNP (thus unlikely to be NP-hard) for which the existence of a polynomial-time algorithm is a major long-standing open problem. The current best-known algorithms are exponential, and even for special cases like turn-based deterministic mean-payoff games or turn-based stochastic games with reachability objectives, no polynomial-time algorithms are known. The algorithm we implemented is the strategy iteration algorithm [22, 23], in which one starts with a strategy and iteratively improves it, where each iteration requires polynomial time. While the algorithm’s worst-case complexity is exponential, in practice it has been widely observed to terminate after a small number of iterations.
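The strategy iteration for the full stochastic mean-payoff game is involved; as an illustrative sketch of the same improve-until-stable idea, the following implements its single-player, discounted analogue (Howard's policy iteration [23] for an MDP). The toy MDP and names are ours, not from the paper.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P[a][s, s']: transition probs; R[a][s]: rewards. Returns (policy, values)."""
    n = R[0].shape[0]
    policy = np.zeros(n, dtype=int)
    while True:
        # Evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = np.array([P[policy[s]][s] for s in range(n)])
        r_pi = np.array([R[policy[s]][s] for s in range(n)])
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Improvement: switch to a better action wherever one exists.
        q = np.array([R[a] + gamma * P[a] @ v for a in range(len(P))])
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Two-state toy MDP: action 1 is strictly better in both states.
P = [np.array([[1.0, 0.0], [1.0, 0.0]]),
     np.array([[0.0, 1.0], [0.0, 1.0]])]
R = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
policy, v = policy_iteration(P, R)
assert list(policy) == [1, 1]
```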
Evaluating Performance. Throughout all our experiments, we use a unified and concrete measure of performance: the total waiting time of the cars in the city. Our assumption is that minimizing this measure is the main objective of the designer of the city’s traffic-light system. While performance is part of the objective function when training the controller, the other components of the objective are used in order to improve training. Similarly, the behavioral measure we use when synthesizing shields is chosen heuristically in order to construct shields that improve concrete performance.
Performance. We observe that performance, as a function of \(\lambda \), follows a curve. When \(\lambda \) is small, altering commands is cheap, the shield intervenes more frequently, and performance drops. This performance drop is expected: the shield is a simple device, and the quality of its routing decisions cannot compete with the trained controller. The drop is also encouraging, since it illustrates that our experimental setting is interesting. Surprisingly, we observe that the curve is in fact a parabola: for some values, e.g., \(\lambda = 0.4\), the shield improves the performance of the controller. We find it unexpected that the shield improves performance even when observing trained behavior; this performance increase is observed more significantly in the next experiments.
Rush-Hour Traffic. In Fig. 3, we use a shield to add robustness to a controller for behavior it was not trained for; we see a more significant performance gain in this experiment. We use the controller from the previous experiment, which was trained for uniform car arrival, and simulate it in a network with “rush-hour” traffic, which we model by significantly increasing the traffic load in the North-South direction. We synthesize shields that prefer to evict traffic from the North-South queues over the East-West queues. We achieve this by altering the objective in the stochastic game: we charge the shield a greater penalty for cars waiting in the North-South queues than in the other queues. For most values of \(\lambda \) below 0.7, we see a performance gain. Note that the performance of the controller with no shield is depicted on the far right, where the shield is trivial. An alternative approach would have been to alter the probabilities in the abstraction, but we found that altering the weights results in a better performance gain.
Reducing Variability. Machine-learning techniques are intricate and require expertise and a fine tuning of parameters. This set of experiments shows how the use of shields reduces the variability of the controllers and, as a result, reduces the importance of choosing optimal parameters in the training phase. We fix one of the shields from the first experiment, with \(\lambda =0.4\). We observe performance in a city with various controllers, trained with varying training parameters, when the controllers are run with and without the shield and on various traffic conditions that sometimes differ from the ones they were trained on.
In Fig. 5, we depict the performance of two controllers in various traffic settings. We observe, in these two controllers as well as the others, that operating with a shield consistently improves performance. The plots illustrate the unexpected behavior of machine-learning techniques: e.g., when run without a shield, Controller-0.9 outperforms Controller-0.65 in all settings, even in the 0.65 setting on which Controller-0.65 was trained. Thus, a designer who expects a traffic flow of 0.65 would be better off training with a traffic flow of 0.9. A shield improves performance and thus reduces the importance of which training data to use.
Measuring Variability. In Fig. 4, we depict the variability in performance between the controllers: the higher the variability, the more important it is to choose the right parameters when training the controller. Formally, let \(R = \{ 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 \}\). For \(r,k \in R\), we let \(\text {Perf}(r,k)\) denote the performance (total waiting time) when Controller-r observes bulk traffic k. For each \(k \in R\), we depict \(\max _{r\in R} \text {Perf}(r,k) - \min _{r' \in R} \text {Perf}(r',k)\), when operating with and without a shield.
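The variability measure above is a max-minus-min spread per traffic setting. A minimal sketch, under the hypothetical assumption that Perf is available as a table `perf[(r, k)]` of total waiting times:

```python
R = [0.65, 0.7, 0.75, 0.8, 0.85, 0.9]

def variability(perf, k):
    """Spread max_r Perf(r, k) - min_r Perf(r, k) for one traffic setting k."""
    vals = [perf[(r, k)] for r in R]
    return max(vals) - min(vals)

# Toy data (invented): Controller-r's waiting time on traffic 0.65.
perf = {(r, 0.65): 100 + 40 * abs(r - 0.9) for r in R}
assert abs(variability(perf, 0.65) - 10.0) < 1e-9  # 40 * (0.9 - 0.65)
```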
Overcoming Liveness Bugs. Finding bugs in learned controllers is a challenging task. Shields bypass the need to find bugs, since they treat the controller as a black box and correct its behavior. We illustrate their usefulness in dealing with liveness bugs. In the same network as in the previous setting, we experiment with a controller whose training process lacked variability. In Fig. 6, we depict the light configuration on the main road throughout the experiment; the horizontal axis represents time, red means a red light for the main road, and dually for green. Initially, the controller performs well, but roughly halfway through the simulation it hits a bad state, after which the light stays red. The shield, with only a few interferences, which are represented by dots, manages to recover the controller from its stuck state. In Fig. 7, we depict the number of waiting cars in the city, which clearly skyrockets once the controller gets stuck. It is evident that initially the controller performs well. This point highlights how difficult it is to recognize that a controller has a bug: to catch such a bug, a designer would need to find the right simulation and run it halfway through before the bug appears.
One way to regain liveness would be to synthesize a shield for the qualitative property “each direction eventually gets a green light”. Instead, we use a shield that is synthesized for the quantitative specification as in the previous experiment. The shield, with a total of only 20 alterations, recovers the controller from the bad state in which it is stuck, and traffic flows correctly.
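The qualitative alternative mentioned above can be monitored by bounding the number of consecutive red steps. This sketch is purely illustrative of that property (the experiment instead reuses the quantitative shield), and the bound and encoding are ours.

```python
def violates_liveness(light_trace, bound):
    """light_trace: one 'r'/'g' character per step for the main road.
    Returns True if the road stays red for more than `bound` steps."""
    run = 0
    for c in light_trace:
        run = run + 1 if c == "r" else 0
        if run > bound:
            return True
    return False

assert violates_liveness("ggrrrrg", 3)       # four consecutive reds
assert not violates_liveness("grgrgr", 3)    # the light keeps alternating
```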
Adding Functionality: Prioritizing Public Transport. Learned controllers are monolithic. Adding functionality to a controller requires complete retraining, which is time-consuming, computationally costly, and requires care; changes in the objective can cause unexpected side effects on performance. We illustrate how, using a shield, we can add the functionality of prioritizing public transport to an existing controller.
The abstraction over which the shield is constructed differs slightly from the one used in the other experiments. The abstract state space is the same, namely four-dimensional vectors, though we interpret the entries as the positions of a bus in the respective queues. For example, the state (0, 3, 0, 1) represents no bus in the North queue and a bus waiting, third in line, in the East queue. Outgoing edges from an abstract state also differ, as they account, probabilistically, for vehicles that might enter the queues between buses. For the behavioral score, we charge an abstract state with the sum of its entries; thus, the shield is charged whenever buses are waiting, and it aims to evict them from the queues as soon as possible.
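The bus-priority behavioral score above amounts to summing the bus positions; a minimal sketch (the function name is ours):

```python
def bus_behavioral_score(a):
    """a: 4-dim vector of bus positions per queue (0 = no bus).
    The shield is charged the sum, so waiting buses should be evicted fast."""
    return sum(a)

assert bus_behavioral_score((0, 3, 0, 1)) == 4  # buses third and first in line
assert bus_behavioral_score((0, 0, 0, 0)) == 0  # no buses waiting: no charge
```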
In Fig. 8, we depict the performance of all vehicles, and of buses only, as a function of the weighing factor \(\lambda \). The result of this experiment is positive: the predicted behavior is observed. Indeed, when \(\lambda \) is small, interferences are cheap, which increases bus performance at the expense of the general performance. The experiment illustrates that the parameter \(\lambda \) is a convenient means to control the degree of prioritization of buses.
5 Discussion and Future Work
We suggest a framework for automatically synthesizing quantitative runtime shields to cope with the limitations of machine-learning techniques. We show how shields can increase robustness to untrained behavior, deal with liveness bugs without verification, add features without retraining, and decrease the variability of performance due to changes in the training parameters, which is especially helpful for machine-learning non-experts. We use weighted automata to evaluate controller and shield behavior, and construct a game whose solution is an optimal shield w.r.t. a weighted specification and a plant abstraction. The framework is robust and can be applied in any setting where learned or other black-box controllers are used.
We list several directions for further research. In this work, we make no assumptions on the controller and treat it adversarially. Since the controller might have bugs, modelling it as adversarial is reasonable. However, it is also a crude abstraction, since typically the objectives of the controller and the shield are similar. For future work, we plan to study ways to model the spectrum between cooperative and adversarial controllers, together with solution concepts for the games that they give rise to.
In this work, we make no assumptions on the relationship between the plant and the abstraction. While the constructed shields are optimal w.r.t. the given abstraction, the scores they guarantee w.r.t. the abstraction do not imply performance guarantees on the plant. To produce performance guarantees on the concrete plant, we need guarantees on the relationship between the plant and its abstraction. For future work, we plan to study the addition of such guarantees and how they affect the quality measures.
References
1. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: AAAI. AAAI Press (2018)
2. Basset, N., Kwiatkowska, M.Z., Wiltsche, C.: Compositional strategy synthesis for stochastic games with multiple objectives. Inf. Comput. 261(Part), 536–587 (2018)
3. Bloem, R., Chatterjee, K., Jobstmann, B.: Graph games and reactive synthesis. In: Clarke, E., Henzinger, T., Veith, H., Bloem, R. (eds.) Handbook of Model Checking, pp. 921–962. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-10575-8_27
4. Brázdil, T., Brozek, V., Chatterjee, K., Forejt, V., Kucera, A.: Two views on multiple mean-payoff objectives in Markov decision processes. Log. Methods Comput. Sci. 10(1) (2014). https://lmcs.episciences.org/1156
5. Chatterjee, K., Doyen, L.: Perfect-information stochastic games with generalized mean-payoff objectives. In: Proceedings of the 31st LICS, pp. 247–256 (2016)
6. Chatterjee, K., Kretínská, Z., Kretínský, J.: Unifying two views on multiple mean-payoff objectives in Markov decision processes. Log. Methods Comput. Sci. 13(2) (2017). https://lmcs.episciences.org/3757
7. Chatterjee, K., Majumdar, R., Henzinger, T.A.: Markov decision processes with multiple objectives. In: Proceedings of the 23rd STACS, pp. 325–336 (2006)
8. Chen, T., Forejt, V., Kwiatkowska, M.Z., Simaitis, A., Trivedi, A., Ummels, M.: Playing stochastic games precisely. In: Proceedings of the 23rd CONCUR, pp. 348–363 (2012)
9. Chollet, F.: Keras (2015). https://github.com/fchollet/keras
10. Condon, A.: On algorithms for simple stochastic games. In: Proceedings of the DIMACS, pp. 51–72 (1990)
11. Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., Tassa, Y.: Safe exploration in continuous action spaces. CoRR, abs/1801.08757 (2017). arXiv:1801.08757
12. Desai, A., Ghosh, S., Seshia, S.A., Shankar, N., Tiwari, A.: SOTER: programming safe robotics system using runtime assurance. CoRR, abs/1808.07921 (2018). arXiv:1808.07921
13. Dewey, D.: Reinforcement learning and the reward engineering principle. In: 2014 AAAI Spring Symposium Series (2014)
14. Droste, M., Kuich, W., Vogler, H.: Handbook of Weighted Automata. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01492-5
15. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, 19–24 June 2016, pp. 1329–1338 (2016)
16. Falcone, Y., Mounier, L., Fernandez, J.-C., Richier, J.-L.: Runtime enforcement monitors: composition, synthesis, and enforcement abilities. Formal Methods Syst. Des. 38(3), 223–262 (2011)
17. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward safe control through proof and learning. In: AAAI. AAAI Press (2018)
18. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)
19. Geibel, P.: Reinforcement learning for MDPs with constraints. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 646–653. Springer, Heidelberg (2006). https://doi.org/10.1007/11871842_63
20. Genders, W., Razavi, S.: Asynchronous n-step Q-learning adaptive traffic signal control. J. Intell. Transp. Syst. 23(4), 319–331 (2019)
21. Hamlen, K.W., Morrisett, J.G., Schneider, F.B.: Computability classes for enforcement mechanisms. ACM Trans. Program. Lang. Syst. 28(1), 175–205 (2006)
22. Hoffman, A.J., Karp, R.M.: On nonterminating stochastic games. Manag. Sci. 12(5), 359–370 (1966)
23. Howard, R.A.: Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960)
24. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Proceedings of the 29th CAV, pp. 3–29 (2017)
25. Jansen, N., Könighofer, B., Junges, S., Bloem, R.: Shielded decision-making in MDPs. CoRR, arXiv:1807.06096 (2018)
26. Ji, Y., Lafortune, S.: Enforcing opacity by publicly known edit functions. In: 56th IEEE Annual Conference on Decision and Control, CDC 2017, Melbourne, Australia, 12–15 December 2017, pp. 4866–4871 (2017)
27. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1), 99–134 (1998)
28. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. JAIR 4, 237–285 (1996)
29. Katz, G., Barrett, C.W., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: an efficient SMT solver for verifying deep neural networks. In: Proceedings of the 29th CAV, pp. 97–117 (2017)
30. Könighofer, B., Alshiekh, M., Bloem, R., Humphrey, L., Könighofer, R., Topcu, U., Wang, C.: Shield synthesis. Formal Methods Syst. Des. 51(2), 332–361 (2017)
31. Krajzewicz, D., Erdmann, J., Behrisch, M., Bieker, L.: Recent development and applications of SUMO - Simulation of Urban MObility. Int. J. Adv. Syst. Meas. 5(3&4), 128–138 (2012)
32. Lahijanian, M., Almagor, S., Fried, D., Kavraki, L.E., Vardi, M.Y.: This time the robot settles for a cost: a quantitative approach to temporal logic planning with partial satisfaction. In: Proceedings of the 29th AAAI, pp. 3664–3671 (2015)
33. Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37(4–5), 421–436 (2018)
34. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
35. Mannion, P., Duggan, J., Howley, E.: An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: McCluskey, T.L., Kotsialos, A., Müller, J.P., Klügl, F., Rana, O., Schumann, R. (eds.) Autonomic Road Transport Support Systems. AS, pp. 47–66. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-25808-9_4
36. Mason, G., Calinescu, R., Kudenko, D., Banks, A.: Assured reinforcement learning with formally verified abstract policies. In: Proceedings of the 9th International Conference on Agents and Artificial Intelligence, ICAART 2017, Porto, Portugal, 24–26 February 2017, vol. 2, pp. 105–117 (2017)
37. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
38. Moldovan, T.M., Abbeel, P.: Safe exploration in Markov decision processes. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 – July 1, 2012 (2012)
39. Mousavi, S.S., Schukat, M., Howley, E.: Traffic light control using deep policy-gradient and value-function-based reinforcement learning. IET Intell. Transp. Syst. 11(7), 417–423 (2017)
40. Oliphant, T.E.: Guide to NumPy, 2nd edn. CreateSpace Independent Publishing Platform, USA (2015)
41. Phan, D., Yang, J., Grosu, R., Smolka, S.A., Stoller, S.D.: Collision avoidance for mobile robots with limited sensing and limited information about moving obstacles. Formal Methods Syst. Des. 51(1), 62–86 (2017)
42. Pnueli, A., Rosner, R.: On the synthesis of a reactive module. In: Conference Record of the Sixteenth Annual ACM Symposium on Principles of Programming Languages, Austin, Texas, USA, 11–13 January 1989, pp. 179–190 (1989)
43. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons Inc., New York (2005)
44. Ramadge, P.J., Wonham, W.M.: Supervisory control of a class of discrete-event processes. SIAM J. Control Optim. 25(1), 206–230 (1987)
45. Schneider, F.B.: Enforceable security policies. ACM Trans. Inf. Syst. Secur. 3(1), 30–50 (2000)
46. Seshia, S.A., Sadigh, D.: Towards verified artificial intelligence. CoRR, arXiv:1606.08514 (2016)
47. Sha, L.: Using simplicity to control complexity. IEEE Softw. 18(4), 20–28 (2001)
48. Velner, Y., Chatterjee, K., Doyen, L., Henzinger, T.A., Rabinovich, A.M., Raskin, J.-F.: The complexity of multi-mean-payoff and multi-energy games. Inf. Comput. 241, 177–196 (2015)
49. Wu, Y., Raman, V., Rawlings, B.C., Lafortune, S., Seshia, S.A.: Synthesis of obfuscation policies to ensure privacy and utility. J. Autom. Reasoning 60(1), 107–131 (2018)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.