Safety-Aware Apprenticeship Learning

Apprenticeship learning (AL) is a class of "learning from demonstrations" techniques where the reward function of a Markov Decision Process (MDP) is unknown to the learning agent and the agent has to derive a good policy by observing an expert's demonstrations. In this paper, we study the problem of how to make AL algorithms inherently safe while still meeting their learning objective. We consider a setting where the unknown reward function is assumed to be a linear combination of a set of state features, and the safety property is specified in Probabilistic Computation Tree Logic (PCTL). By embedding probabilistic model checking inside AL, we propose a novel counterexample-guided approach that can ensure both safety and performance of the learned policy. We demonstrate the effectiveness of our approach on several challenging AL scenarios where safety is essential.


Introduction
The rapid progress of artificial intelligence (AI) comes with a growing concern over its safety when deployed in real-life systems and situations. As highlighted in [3], if the objective function of an AI agent is wrongly specified, then maximizing that objective function may lead to harmful results. In addition, the objective function or the training data may focus only on accomplishing a specific task and ignore other aspects (such as safety constraints) of the environment. In this paper, we propose a novel framework that combines explicit safety specification with learning from data. We consider safety specifications captured in Probabilistic Computation Tree Logic (PCTL) and show how probabilistic model checking can be used to ensure both safety and performance of a learning algorithm known as apprenticeship learning (AL).
We consider the formulation of apprenticeship learning by Abbeel and Ng [1]. The concept of AL is closely related to reinforcement learning (RL), where an agent learns what actions to take in an environment (known as a policy) by maximizing some notion of long-term reward. However, in AL, the agent is not given the reward function, but instead has to first recover it from a set of expert demonstrations via a technique called inverse reinforcement learning [15]. The formulation assumes that the reward function is expressible as a linear combination of known state features. An expert demonstrates the task by maximizing this reward function and the agent tries to derive a policy that can match the feature expectations of the expert's demonstrations. Apprenticeship learning can also be viewed as an instance of the class of techniques known as "learning from demonstrations" (LfD). The main issue with LfD in general is that the expert often can only demonstrate how the task works but not how the task may fail. This is because failure may cause irrecoverable damage to the system, such as crashing a vehicle. This lack of "negative examples" causes a heavy bias in how the learning agent constructs the reward surrogate, and thus can lead to a policy that is entirely unaware of unsafe parts of the environment.
arXiv:1710.07983v1 [cs.AI] 22 Oct 2017
The key idea of this paper is to incorporate formal verification in apprenticeship learning. We are inspired by the line of work on formal inductive synthesis [10] and counterexample-guided inductive synthesis [19]. Our approach is also similar in spirit to the recent work on safety-constrained reinforcement learning [11]. However, our approach uses the results of model checking in a novel way. We consider explicit specification expressed in probabilistic computation tree logic (PCTL). We employ a verification-in-the-loop approach by embedding PCTL model checking as a safety checking mechanism inside the learning phase of AL. In particular, when a learnt policy does not satisfy the PCTL formula, we leverage counterexamples generated by the model checker to steer the policy search in AL. In essence, counterexample generation can be viewed as supplementing negative examples for the learner. Thus, the learner will seek to find a policy that can not only imitate the expert's demonstrations but also stay away from the failure scenarios (as captured by the counterexamples).
In short, we make the following contributions in this paper.
- We propose a novel counterexample-guided framework for ensuring the safety of apprenticeship learning.
- This framework enables the combination of probabilistic model checking with the optimization-based approach of apprenticeship learning.
- We demonstrate that our approach can guarantee safety for a set of benchmarks and attain comparable or even better performance compared to just using apprenticeship learning.
The rest of the paper is organized as follows. Section 2 reviews background information on apprenticeship learning and PCTL model checking. Section 3 defines the safety-aware apprenticeship learning problem and gives an overview of our approach. Section 4 illustrates the counterexample-guided learning framework. Section 5 describes the proposed algorithm in detail. Section 6 presents a set of experimental results demonstrating the effectiveness of our approach. Section 7 discusses related work. Section 8 concludes and offers future directions.

Preliminaries
Definition 1. A Markov Decision Process (MDP) is a tuple M = (S, A, T, γ, s_0, R), where S is a finite set of states; A is a set of actions; T : S × A × S → [0, 1] is a transition function describing the probability of transitioning from one state s to another state s' by taking action a in state s; R : S → ℝ is a reward function which maps each state s to a real number indicating the reward of being in state s; s_0 ∈ S is the initial state; γ ∈ [0, 1) is a discount factor which describes how future rewards attenuate when a sequence of transitions is made.
Given a policy π : S → A for an MDP, an agent can decide which action to perform in each state. A trajectory τ = s_0 → s_1 → s_2 → ... is a sequence of states obtained by starting from the initial state s_0 and following policy π.
Definition 2. An optimal policy π for an MDP M = (S, A, T, γ, s_0, R) satisfies the Bellman optimality condition [4]:
$$V_\pi(s) = R(s) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_\pi(s'), \qquad \pi(s) \in \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\, V_\pi(s')$$
Inverse reinforcement learning aims at recovering the reward function R when only a partially known MDP M\R = (S, A, T, γ, s_0) without the reward function R and a set of m trajectories {τ_0, τ_1, ..., τ_{m−1}} from an expert are given. Apprenticeship learning assumes that the reward function is a linear combination R(s) = ω^T f(s), where f : S → [0, 1]^k is a vector of features over states and ω ∈ ℝ^k is a weight vector that satisfies ||ω||_2 ≤ 1. f is known by both the expert and the learning agent. Given some weight vector ω, we can derive the corresponding reward function R and solve for the optimal policy π for ω. The value of the initial state s_0 can be expressed as
$$V_\pi(s_0) = E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big] = \omega^T E\Big[\sum_{t=0}^{\infty} \gamma^t f(s_t) \,\Big|\, \pi\Big] = \omega^T \mu_\pi$$
Let µ_π = E[Σ_t γ^t f(s_t) | π] be the expected features of policy π. It can be computed by using the Monte Carlo method, value iteration or linear programming [1,4]. Define
$$\hat{\mu}_E = \frac{1}{m} \sum_{i=0}^{m-1} \sum_{t=0}^{\infty} \gamma^t f\big(s_t^{(\tau_i)}\big)$$
as the expected features of the expert's m trajectories, where s_t^{(τ_i)} is the t-th state of trajectory τ_i. If m is large enough, the expected features of the expert's policy µ_E can be approximated by µ̂_E.
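As an illustration of the empirical feature expectations described above, the following is a minimal Python sketch that averages the discounted feature sums over a set of finite trajectories. The function name, the integer state encoding and the two indicator features in the toy usage are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Callable, List, Sequence


def empirical_feature_expectations(
    trajectories: Sequence[Sequence[int]],
    features: Callable[[int], Sequence[float]],
    gamma: float,
) -> List[float]:
    """Average the discounted feature sums over all given trajectories."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    k = len(features(trajectories[0][0]))
    total = [0.0] * k
    for tau in trajectories:
        discount = 1.0  # gamma^t, starting at t = 0
        for state in tau:
            f = features(state)
            for j in range(k):
                total[j] += discount * f[j]
            discount *= gamma
    m = len(trajectories)
    return [x / m for x in total]


# Toy usage with hypothetical indicator features over states 0, 1, 2.
def toy_features(s: int) -> List[float]:
    return [1.0 if s == 2 else 0.0, 1.0 if s == 1 else 0.0]


mu_hat = empirical_feature_expectations([[0, 1, 2], [0, 2, 2]], toy_features, gamma=0.5)
```

Because the trajectories are finite, the sketch truncates the infinite discounted sum at the trajectory length, which is the usual practical approximation.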
Following [1], a policy π* is ε-close to π_E if its expected features µ_π* satisfy ||µ̂_E − µ_π*||_2 ≤ ε. The algorithm proposed in [1] searches for such a policy by gathering candidate policies over iterations. It starts with a random policy π_0 and its expected features µ_0. Assuming that by iteration t, t policies and their corresponding expected features {µ_0, µ_1, ..., µ_{t−1}} have been found, the algorithm obtains a weight vector ω by solving the optimization problem (2).
$$\delta = \max_{\omega} \min_{i \in \{0, \ldots, t-1\}} \omega^T (\hat{\mu}_E - \mu_i) \quad \text{s.t.} \quad \|\omega\|_2 \le 1 \tag{2}$$
If δ ≤ ε, the algorithm terminates and outputs a policy π_i, with expected features µ_i, from those gathered so far. Otherwise, it uses ω to find the corresponding optimal policy π_t as well as its expected features µ_t.
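The step of turning a weight vector ω into an optimal policy and its expected features can be sketched as below. This is only an illustrative sketch: it uses plain value iteration with a fixed number of sweeps, a hypothetical nested-list encoding of the transition function, and a two-state toy MDP, none of which come from the paper.

```python
from typing import List, Sequence, Tuple

# T[s][a] is a list of (next_state, probability) pairs (assumed encoding).
Transitions = List[List[List[Tuple[int, float]]]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def optimal_policy(T: Transitions, feats, omega, gamma: float, sweeps: int = 500) -> List[int]:
    """Value iteration for the reward R(s) = omega . f(s), then a greedy policy."""
    n = len(T)
    reward = [dot(omega, feats[s]) for s in range(n)]
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            reward[s] + gamma * max(sum(p * V[s2] for s2, p in T[s][a]) for a in range(len(T[s])))
            for s in range(n)
        ]
    return [
        max(range(len(T[s])), key=lambda a: sum(p * V[s2] for s2, p in T[s][a]))
        for s in range(n)
    ]


def feature_expectations(T: Transitions, feats, policy, gamma: float, s0: int, sweeps: int = 500):
    """Iterate mu(s) = f(s) + gamma * sum_s' P(s, s') mu(s') under a fixed policy."""
    n, k = len(T), len(feats[0])
    M = [[0.0] * k for _ in range(n)]
    for _ in range(sweeps):
        M = [
            [feats[s][j] + gamma * sum(p * M[s2][j] for s2, p in T[s][policy[s]]) for j in range(k)]
            for s in range(n)
        ]
    return M[s0]


# Toy MDP: state 0 can stay (action 0) or move to the absorbing state 1 (action 1).
T_toy = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)], [(1, 1.0)]]]
feats_toy = [[1.0, 0.0], [0.0, 1.0]]
pi_toy = optimal_policy(T_toy, feats_toy, omega=[0.0, 1.0], gamma=0.9)
mu_toy = feature_expectations(T_toy, feats_toy, pi_toy, gamma=0.9, s0=0)
```

In the toy MDP the reward sits in state 1, so the greedy policy moves there from state 0, and the expected features from state 0 are approximately [1, 9].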

Probabilistic Model Checking
Given a policy π, a set AP of atomic propositions and a labelling function L that assigns atomic propositions to states, an MDP M = (S, A, T, γ, s_0, R) becomes a discrete-time Markov chain (DTMC) M_π = (S, P, s_0, AP, L), where P : S × S → [0, 1] is the probability of transitioning from a state s to another state s' in one step. Using probabilistic model checking techniques, properties such as the probability of reaching certain states in M_π can be analyzed. Questions such as "given a policy, what is the probability that the agent reaches the unsafe state?" and "is the probability that the agent reaches the unsafe area within 10 steps smaller than p*?" can be formally specified and evaluated.
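The kind of question quoted above, the probability of reaching an unsafe state within a bounded number of steps, can be answered on a small DTMC by backward iteration. The sketch below is illustrative only and is not how a model checker such as PRISM is implemented; the data layout, function names and the three-state toy chain are assumptions.

```python
from typing import List, Set, Tuple


def induced_dtmc(T: List[List[List[Tuple[int, float]]]], policy: List[int]) -> List[List[float]]:
    """Row s of the DTMC is the MDP transition distribution of action policy[s]."""
    n = len(T)
    P = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for s2, p in T[s][policy[s]]:
            P[s][s2] += p
    return P


def bounded_reach_probability(P: List[List[float]], unsafe: Set[int], s0: int, steps: int) -> float:
    """Probability of visiting an unsafe state within `steps` transitions from s0."""
    n = len(P)
    x = [1.0 if s in unsafe else 0.0 for s in range(n)]
    for _ in range(steps):
        x = [
            1.0 if s in unsafe else sum(P[s][s2] * x[s2] for s2 in range(n))
            for s in range(n)
        ]
    return x[s0]


# Toy chain: state 0 is the start, state 1 is unsafe, state 2 is an absorbing goal.
T_toy = [[[(0, 0.5), (1, 0.1), (2, 0.4)]], [[(1, 1.0)]], [[(2, 1.0)]]]
P_toy = induced_dtmc(T_toy, policy=[0, 0, 0])
prob = bounded_reach_probability(P_toy, unsafe={1}, s0=0, steps=2)
is_safe = prob <= 0.2  # check against a hypothetical threshold p* = 0.2
```

Here the unsafe state is reached with probability 0.1 in the first step and 0.5 · 0.1 in the second, so the bounded reachability probability is 0.15.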
Probabilistic Computation Tree Logic (PCTL) [7] allows for probabilistic quantification of described properties. The basic element of PCTL is the atomic proposition. The syntax of a PCTL property of a DTMC includes state formulas and path formulas [12]. A state formula φ asserts a PCTL property of a single state s ∈ S, while a path formula ψ asserts a PCTL property of a trajectory.
$$\varphi ::= \mathrm{true} \mid l_i \mid \neg l_i \mid l_i \wedge l_j \mid P_{\bowtie p^*}[\psi]$$
$$\psi ::= X\varphi \mid \varphi_1\, U^{\le k} \varphi_2 \mid \varphi_1\, U \varphi_2$$
where l_i, l_j ∈ AP are atomic propositions; ⋈ ∈ {≤, ≥, <, >}; P_{⋈p*}[ψ] means that the probability of generating a trajectory that satisfies formula ψ is ⋈ p*. Xφ asserts that the next state after the initial state in the trajectory satisfies φ; φ_1 U^{≤k} φ_2 asserts that φ_2 is satisfied in at most k transitions and all preceding states satisfy φ_1; φ_1 U φ_2 asserts that φ_2 will eventually be satisfied and all preceding states satisfy φ_1.
The semantics of PCTL is defined by a satisfaction relation |= between a formula and a DTMC:
s |= φ iff state s satisfies state formula φ;
τ |= ψ iff trajectory τ satisfies path formula ψ.
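A counterexample of a state formula φ = P_{≤p*}[ψ] for state s_0 in DTMC M_π is a set cex of trajectories that start from s_0 and satisfy ψ, such that the sum of their probabilities exceeds p*. There can be different counterexamples for one formula. Let P(Γ) = Σ_{τ∈Γ} P(τ) be the sum of probabilities of all trajectories in one set Γ. Assume that all counterexamples for formula φ are gathered in a set CEX_φ = {cex | P(cex) > p*}. A minimal counterexample is a cex ∈ CEX_φ with the fewest trajectories, i.e. |cex| ≤ |cex'| for all cex' ∈ CEX_φ. There can be multiple minimal counterexamples in CEX_φ. By converting DTMC M_π into a weighted digraph, a counterexample can be found by solving a k shortest paths (KSP) problem or a hop-constrained KSP (HKSP) problem [6].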
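To make the idea of a smallest counterexample concrete, the following Python sketch enumerates paths from the initial state to an unsafe state in order of decreasing probability and stops once their total mass exceeds the bound. It is a simple best-first search in the spirit of the hop-constrained shortest-path formulation, not the algorithm of [6] nor the implementation of any particular tool; the function name and the toy chain are assumptions.

```python
import heapq
from typing import List, Set, Tuple


def counterexample(
    P: List[List[float]], unsafe: Set[int], s0: int, max_hops: int, p_star: float
) -> Tuple[List[Tuple[List[int], float]], float]:
    """Collect the most probable s0-to-unsafe paths until their mass exceeds p_star.

    Returns (paths, mass). If mass <= p_star, the bound is not violated within
    max_hops and the returned paths do not form a counterexample.
    """
    heap = [(-1.0, [s0])]  # max-heap on path probability via negated keys
    paths: List[Tuple[List[int], float]] = []
    mass = 0.0
    while heap and mass <= p_star:
        neg_p, path = heapq.heappop(heap)
        last = path[-1]
        if last in unsafe:
            # Extending a path never increases its probability, so complete
            # paths are popped in order of non-increasing probability.
            paths.append((path, -neg_p))
            mass += -neg_p
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop bound reached without hitting an unsafe state
        for s2, p in enumerate(P[last]):
            if p > 0.0:
                heapq.heappush(heap, (neg_p * p, path + [s2]))
    return paths, mass


# Toy chain: state 0 is the start, state 1 is unsafe, state 2 is an absorbing goal.
P_toy = [[0.5, 0.1, 0.4], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cex_paths, cex_mass = counterexample(P_toy, unsafe={1}, s0=0, max_hops=3, p_star=0.12)
```

With the hypothetical bound p* = 0.12, the two most probable unsafe paths 0 → 1 and 0 → 0 → 1 already carry more than p* of probability mass, so they form a counterexample with two trajectories.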

Problem Formulation and Overview
We will first analyze the safety issues in apprenticeship learning with a grid-world example. Then we will define the safety-aware apprenticeship learning (SafeAL) problem and give intuitions on how we solve it.
Assume that there are some unsafe states in an MDP\R M = (S, A, T, γ, s_0). A safety issue means that an agent following a learnt policy has a higher probability of entering those unsafe states than it should. There are multiple reasons that can give rise to this issue. First, it is possible that the expert policy itself has a high probability of reaching the unsafe states. Second, human experts often tend to perform only successful demonstrations that do not highlight the unwanted situations [18]. This lack of negative examples in the training set results in the agent being unaware of the existence of those unsafe states.
We use an 8 × 8 grid-world navigation example, shown in Fig. 1, to illustrate this problem. The agent starts from the upper-left corner and moves from cell to cell until reaching the lower-right corner or a maximal step length t < 64. Meanwhile, there are several cells labelled as unsafe, enclosed by the red dashed lines shown near the upper-right and lower-left corners. These are regions that the agent should avoid. In each time step, the agent can choose to stay in the current cell or move to an adjacent cell. Due to the stochasticity of the system, it has a 30% chance of moving randomly instead of moving according to its decision. The expert knows the goal area, the unsafe area and the reward mapping on all states as shown in Fig. 1(a). For each state s ∈ S, the feature vector f(s) consists of 4 feature functions f_i(s), i = 0, 1, 2, 3. All of them are radial basis functions which respectively depend on the squared Euclidean distances between s and the 4 states with the highest or lowest rewards as shown in Fig. 1(a). In addition, a specification Φ formalized in PCTL is given to capture the safety requirement.
$$\Phi ::= P_{\le p^*}\big[\,\mathrm{true}\ U^{\le t}\ \mathrm{unsafe}\,\big] \tag{9}$$
In Eq. 9, p* is the upper bound of the probability of reaching an unsafe state within t = 64 steps and is set to be 0.5 initially.
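The stochastic dynamics of the grid world can be sketched as a transition table. The paper does not spell out the neighbourhood or how a random move is drawn, so the sketch below makes explicit assumptions: moves are to the four orthogonally adjacent cells, a random move is uniform over the five actions (including staying), and a move off the grid leaves the agent in place.

```python
from typing import Dict, List, Tuple

N = 8  # grid side length
# Assumed action set: stay, up, down, left, right (row delta, column delta).
ACTIONS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
SLIP = 0.3  # probability of moving randomly instead of as intended


def move(cell: int, delta: Tuple[int, int]) -> int:
    """Apply a move to a cell index; moves off the grid leave the cell unchanged."""
    r, c = divmod(cell, N)
    r2, c2 = r + delta[0], c + delta[1]
    if 0 <= r2 < N and 0 <= c2 < N:
        return r2 * N + c2
    return cell


def build_transitions() -> List[List[List[Tuple[int, float]]]]:
    """Return T with T[s][a] = sorted list of (next_state, probability) pairs."""
    T = []
    for s in range(N * N):
        per_action = []
        for a in ACTIONS:
            dist: Dict[int, float] = {}
            target = move(s, a)
            dist[target] = dist.get(target, 0.0) + (1.0 - SLIP)  # intended move
            for b in ACTIONS:  # uniformly random move
                t = move(s, b)
                dist[t] = dist.get(t, 0.0) + SLIP / len(ACTIONS)
            per_action.append(sorted(dist.items()))
        T.append(per_action)
    return T


T_grid = build_transitions()
```

For instance, from the upper-left corner (cell 0) the action "right" reaches cell 1 with probability 0.7 + 0.06 = 0.76 under these assumptions.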
We illustrate two scenarios in this simple example. The first simulates a setting with abundant expert demonstrations, i.e. µ_E is directly generated from the optimal policy π_E with respect to the predetermined ω_E which results in the reward mapping in Fig. 1(a). In this case, the AL algorithm can accurately recover π_E. The model checking result shows that the probability of reaching an unsafe state by following the learnt policy, or the expert policy π_E itself, is 0.117. Hence, the specification is satisfied in this scenario. In the second scenario, which is more realistic, the expert follows π_E but only performs a limited number of demonstrations, which are all successful and safe. Two representative trajectories (in blue) are shown in Fig. 1(b) and Fig. 1(c); in total, 10,000 trajectories were used as expert demonstrations. The reward function that induces the learnt policy in this scenario is shown in Fig. 1(d). Observe that only the goal area has been learnt whereas the agent is oblivious to the unsafe regions (treating them in the same way as other black cells). Indeed, the probability of reaching an unsafe state within 64 steps with this policy turns out to be 0.980 (thus violating the safety requirement by a large margin). To make matters worse, the value of p* may be decided or revised after a policy has been learnt. In that case, even the original expert policy π_E may be unsafe, e.g., when p* = 0.1. Thus, we need to adapt the AL algorithm to account for this additional safety requirement.
Definition 3. The safety-aware apprenticeship learning (SafeAL) problem is, given an MDP\R, a set of m trajectories {τ_0, τ_1, ..., τ_{m−1}} demonstrated by an expert, and a specification Φ, to learn a policy π that satisfies Φ and is ε-close to the expert policy π_E.

Counterexample-Guided Learning
We propose a counterexample-guided learning framework that can utilize information from both expert demonstrations and the verifier. The proposed framework is illustrated in Fig. 2.

Fig. 2. The counterexample-guided learning framework. Given an initial policy π, a specification Φ and a learning objective (error) ε, the framework iterates between a verifier and a learner to search for a policy π* that is both safe and satisfactory in terms of the learning objective. One invariant that this framework maintains is that all the π_i's in the candidate policy set satisfy Φ.
Similar to the counterexample-guided inductive synthesis (CEGIS) paradigm [19], our framework consists of a verifier and a learner. The verifier checks whether a learnt policy satisfies the specification Φ, and the framework maintains a set of candidate policies and a set of counterexamples. Different from CEGIS, our framework considers not only functional correctness, e.g., safety, but also performance (as captured by the learning objective). Starting from an initial policy, each time the learner learns a new policy, the verifier checks if the specification is satisfied. If so, this policy is added to the candidate set; otherwise the oracle generates a (minimal) counterexample and adds it to the counterexample set. During the learning phase, the learner uses both the counterexample set and the candidate set to find a policy that is close to the (unknown) expert policy and away from the counterexamples. When a learnt policy is ε-close to the expert policy and satisfies the specification, it is produced as the final output.
For the grid-world example introduced in Section 3, when p* = 0.05 (thus presenting a stricter safety requirement compared to the expert policy π_E), our approach produces a policy with probability 0.042 of reaching an unsafe state in 64 steps. The corresponding inferred reward mapping is shown in Fig. 3(a), where the goal area and the unsafe regions are well represented, as in the ground-truth reward.
Learning from a (minimal) counterexample CEX_π is similar to learning from expert demonstrations. The basic principle of the apprenticeship learning algorithm proposed in [1] is to find a weight vector ω under which the expected reward of π_E maximally outperforms any mixture of the policies found so far. In each iteration the optimal solution ω can be seen as the normal vector of the hyperplane ω^T(µ − µ_E) = 0 that has the maximal distance δ to the convex hull of {µ_0, µ_1, ..., µ_{t−1}}. In the 2D feature space in Fig. 3(b), we show an example of the hyperplane and convex hull.
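Dually, we want to maximize the distance from counterexample features µ_CEX to candidate features. Let Π denote the set of expected features of the candidate policies, and define Π_CEX as the set of expected feature counts of the counterexample trajectory sets, Π_CEX = {µ_CEX0, µ_CEX1, µ_CEX2, ...}. Maximizing the distance between the convex hulls of Π_CEX and Π is equivalent to maximizing the distance between the parallel support hyperplanes of Π_CEX and Π as shown in Fig. 3(c). The optimization function is given in Eq. 10.
$$\delta = \max_{\omega} \min_{\mu \in \Pi,\ \mu_{CEX} \in \Pi_{CEX}} \omega^T (\mu - \mu_{CEX}) \quad \text{s.t.} \quad \|\omega\|_2 \le 1 \tag{10}$$
To attain good performance similar to that of the expert, we still want to learn from µ_E. Thus, the overall problem can be formulated as a multi-objective optimization problem that combines (2) and (10) into (12), where the weight k ∈ [0, 1] balances the two objectives.
$$\max_{\omega} \min_{\mu, \mu' \in \Pi,\ \mu_{CEX} \in \Pi_{CEX}} \Big( k\, \omega^T (\hat{\mu}_E - \mu) + (1 - k)\, \omega^T (\mu' - \mu_{CEX}) \Big) \quad \text{s.t.} \quad \|\omega\|_2 \le 1 \tag{12}$$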

An Algorithm for SafeAL
In this section, we describe our safety-aware apprenticeship learning (SafeAL) algorithm in detail. Combining policy verification, counterexample generation and the apprenticeship learning algorithm, our approach searches for a policy iteration by iteration and keeps gathering candidate policies that satisfy the specification.
Π_S, the set of expected features of the policies verified to be safe, is used in the optimization problem (14) to maximize the chance of satisfying the specification Φ. Additional constraints (17)(18) are added in order to ensure the dominance of the optimal solution with respect to µ, µ_S, µ_CEX. The ω in the optimal solution is used to generate an optimal policy π. This can be done with algorithms such as policy iteration. Then we use a probabilistic model checker, such as PRISM [12], to check if π satisfies Φ. If it does, then it is added to Π_S. For the learning objective, we check the expected features of π with ||µ_E − µ_π||_2 to see if it is the closest one to the expert policy so far. If it is, then π is considered the best candidate policy found and we set π* = π. If π does not satisfy Φ, we use a counterexample generator, such as COMICS [9], to generate a minimal counterexample for π and add its expected features µ_CEXπ to Π_CEX. π is in turn compared with π*. Only when ||µ_E − µ_π||_2 ≥ ||µ_E − µ_π*||_2, i.e. when π is no closer to the expert than π*, is µ_π added to Π. This constraint is added so that the search space for π* is not narrowed down by a performant but unsafe policy.
Algorithm 1 describes our overall algorithm for the SafeAL problem. It starts by adding a zero vector to the sets Π and Π_S at initialization. In each iteration, the weight k is updated and adapted to the newly learnt policy. Once a safe policy is learnt, the multi-objective optimization problem is biased to learn from the expert by setting k = 1. Otherwise it tries to satisfy the safety specification by adding weight to the (10) component. We use two parameters inf and sup to indicate the lower and upper bounds of k such that k ∈ [inf, sup]. sup = 1.0 is constant while inf is initialized to 0 but can be updated later. Assuming that in iteration i a policy π^(i) is found, if π^(i) is verified by the model checker to satisfy the specification Φ, then inf is updated and set to the current k and k itself is updated and set to sup. If π^(i) does not satisfy Φ, k is reduced to α · inf + (1 − α)k where α ∈ (0, 1) is a step length parameter. When |k − inf| is smaller than some error bound ε and inf = sup, the algorithm converges. Note that when inf = sup = k = 1, we recover the original apprenticeship learning. During the iterations, we keep updating the best learnt policy so far so that the final output policy is closest to µ_E among all learnt policies.
Algorithm 1
  α ∈ (0, 1) ← step length
  inf = 0, sup = 1, k = α(inf + sup)
  Initialize:
    π^(0), µ^(0) ← randomly generated policy and its expected features
    Model check if π^(0) satisfies Φ
    If True: add µ^(0) to Π, Π_S (the current policy already satisfies Φ)
    If False: µ_CEX ← expected features of a counterexample of π^(0); add µ_CEX to Π_CEX
    Go to Iteration
  Iteration:
    In iteration i ≥ 1
    Solve for ω in (14) with constraints (15)(16)(17)(18)
    π^(i) ← optimal policy for ω
    Model check if π^(i) satisfies Φ
    If True:
      µ^(i) ← expected features of π^(i)
      Add µ^(i) to Π, Π_S
      If µ^(i) is the closest to µ_E so far, set π* = π^(i), µ* = µ^(i)
      inf = k, k = sup
    If False:
      µ^(i) ← expected features of π^(i)
      µ_CEX ← expected features of a minimal counterexample of π^(i); add µ_CEX to Π_CEX
      If ||µ_E − µ^(i)||_2 ≥ ||µ_E − µ*||_2, then add µ^(i) to Π
      k = α · inf + (1 − α)k
    Go to next iteration
Theorem 1. If the initial policy is safe, then the algorithm is guaranteed to output a safe policy. In addition, the expected features of the output policy are at the same or a shorter distance from the expert's features than those of the initial policy.
Proof sketch. Only policies satisfying the specification are added to the candidate set. In addition, among these candidates, the policy π that has the smallest ||µ_E − µ_π||_2 is produced as the final output. Convergence of the algorithm is proven at the end of this section.

Remark 2.
While the theorem requires the initial policy to be safe, a naively safe policy can often be obtained easily, e.g., the policy is to stay in the current cell for every cell in the grid-world example. In practice, as indicated in our experiments, a random initial policy often suffices for obtaining a safe and performant policy at the end. We also note that AL itself does not guarantee safety for the final output even if we initialize it with a safe policy.
This algorithm is guaranteed to converge. When inf = sup = 1, the algorithm reduces to an iteration of AL. So we only need to consider convergence to the condition |k − inf| ≤ ε. After inf is assigned inf_t, k is updated by k ← α · inf_t + (1 − α)k in every iteration until either the algorithm converges because |k − inf_t| ≤ ε or inf_t is updated to inf_{t+1} > inf_t. In the former case, each update of k, with k' denoting the updated value, satisfies the following.
$$k' - \mathit{inf}_t = \alpha \cdot \mathit{inf}_t + (1 - \alpha) k - \mathit{inf}_t = (1 - \alpha)(k - \mathit{inf}_t)$$
Hence the gap k − inf_t shrinks by a factor of (1 − α) per iteration, and for any ε it takes at most ⌈log_{1−α}(ε / (k − inf_t))⌉ iterations before the algorithm converges or finds a safe policy. In the latter case, where a safe policy is found and inf_t is updated to inf_{t+1} > inf_t + ε, treat each such update as a time point. Since inf is non-decreasing and bounded above by sup = 1, it eventually converges to some inf*, and the following inequality holds for the number N of time points.
$$N \varepsilon < \sum_{t=0}^{N-1} (\mathit{inf}_{t+1} - \mathit{inf}_t) \le \mathit{inf}^* \le 1$$
Therefore, it takes at most 1/ε time points for inf to be updated from 0 to 1. Between each two consecutive time points, it takes at most ⌈log_{1−α}(ε / (k − inf))⌉ iterations to converge or to reach the next time point. After inf = 1, the algorithm becomes apprenticeship learning. The convergence of apprenticeship learning is proved in [1].
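The geometric shrinking of the gap between k and inf under the update k ← α · inf + (1 − α)k can be checked numerically. The sketch below counts the updates needed until |k − inf| ≤ ε and compares the count with the logarithmic bound; the function names and the parameter values are chosen for illustration only.

```python
import math


def iterations_until_close(k: float, inf: float, alpha: float, eps: float) -> int:
    """Count updates k <- alpha*inf + (1-alpha)*k until |k - inf| <= eps."""
    count = 0
    while abs(k - inf) > eps:
        k = alpha * inf + (1.0 - alpha) * k
        count += 1
    return count


def log_bound(k: float, inf: float, alpha: float, eps: float) -> float:
    """log base (1 - alpha) of eps / (k - inf); requires k > inf and 0 < alpha < 1."""
    return math.log(eps / (k - inf), 1.0 - alpha)


# Hypothetical parameters: start at k = sup = 1 with inf = 0.
n_updates = iterations_until_close(k=1.0, inf=0.0, alpha=0.5, eps=0.01)
bound = log_bound(k=1.0, inf=0.0, alpha=0.5, eps=0.01)
```

With α = 0.5 the gap halves in every update, so the number of updates equals the logarithmic bound rounded up to the next integer.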

Experiments
We evaluate our algorithm on three case studies: (1) navigation in a grid world, (2) cart pole, and (3) mountain car. The cart pole environment and the mountain car environment are obtained from OpenAI Gym. In all three cases, we show that our algorithm can guarantee the safety of the learnt policy and, moreover, that the learnt policy has comparable or even better performance compared with the one directly learnt by apprenticeship learning.

Navigation in Grid World
Our first experiment is an extension of the 2D navigation task in Section 3. Specifications with different values of the safety threshold p* are given. Fig. 4 shows the different reward mappings that induce the policies learnt by our algorithm. As the safety threshold (value of p*) decreases, the algorithm tries to find a weight vector that assigns low rewards to the unsafe states and the states around them, so that the agent has a lower probability of moving into the unsafe areas. However, as a result, the agent also focuses more on avoiding the unsafe areas than on actually reaching the goal area. In essence, we trade off performance with safety.

Cart Pole Environment
In the grid world, it is hard to evaluate the performance that the agent may need to sacrifice for safety. In this experiment with the cart pole environment, the goal is to keep the pendulum on a cart from falling over for as long as possible. In each time step, the agent can either push the cart to the left or to the right. The position, velocity and angle of the cart and the pole are continuous values and observable, but the actual dynamics of the system are unknown. A maneuver is deemed unsafe if the pole angle is more than ±20.9° or the cart's horizontal position is more than ±2.4. The maximum step length is set to t = 200. Thus, the safety requirement can be formalized in PCTL as an upper bound p* on the probability that an unsafe maneuver occurs within t steps.
In Table 1, for different safety thresholds p*, the learnt policies are compared in terms of the model checking results on the PCTL property, the average number of steps the agent can hold in 5000 rounds (the higher the better), and the average rate at which the agent violates the safety constraints in 5000 rounds. Safety threshold 1.0 in the first row is equivalent to having no safety constraint. The policy in this row is basically the one directly learnt with AL. When the safety threshold ranges from 0.44 to 0.15, there is little degradation in performance. In fact, finding a policy that keeps the cart position and pole angle under safe conditions can help maintain the stability of the whole system. For the learnt policy at p* = 0.2, its performance is even higher than that of the original policy learnt via AL. On the other hand, when the threshold becomes too low, e.g., p* = 0.05, the performance drops significantly. In this case, the agent becomes so conservative that it would rather let the pendulum fall than risk pushing the cart into unsafe conditions.

Mountain Car Environment
Our third experiment uses the mountain car environment from OpenAI Gym. In this environment, a car starts from the bottom area of the valley between two mountains and tries to reach the mountaintop on the right in as few time steps as possible. In each time step the car can perform one of three actions: going left, turning off the engine, and going right. The agent fails when the step length reaches the maximum, i.e. t = 200. The velocity and position of the car are continuous values and observable while the exact dynamics are unknown. The car cannot get enough momentum to reach the top by just accelerating from the start. It has to move back and forth in the valley to gather momentum. To represent safety, additional rules are added so that the car should not exceed some speed limit when it is close to the left mountaintop or the right mountaintop (in case there is a cliff on the other side of the mountaintop). In this experiment, the expert feature is generated by sampling trajectories that reach the mountaintop without any violation of the speed rules. We formalize the safety requirement in PCTL as an upper bound p* on the probability that the car violates these speed rules within t steps.
Fig. 6. The mountain car environment. (a) The original mountain car environment without any safety concern. (b) The mountain car environment with traffic rules: when the distance from the car to the left edge or the right edge is shorter than 0.01, the speed of the car should be lower than 0.06.
In Table 2, for different safety thresholds, the learnt policies are compared in terms of the model checking results on the PCTL property, the average number of steps the agent takes to reach the goal in 5000 rounds, and the average number of times the agent violates the safety property in 5000 rounds. The average number of steps that the expert takes during demonstration is 162. In the first row, the policy learnt without any safety constraint has the highest probability of going over the speed limit, and its performance is also the worst. This policy makes the car speed up all the way towards the left mountaintop to maximize its potential energy, which is actually a waste of steps. In contrast, when the safety threshold is 0.08, the agent learns to slow down earlier on the left hillside. This not only helps to keep the car safe but also saves time. However, if the safety threshold drops to 1.0e-5, the agent needs to sacrifice all its performance in exchange for safety.

Related Work
A taxonomy of AI safety problems is given in [3], where the issues of misspecified objective or reward and insufficient or poorly curated training data were highlighted. There have been several efforts attempting to address these issues, such as [14] and [8]. In particular, the latter work proposes to add a safety constraint, which is evaluated by the amount of damage, to the optimization problem so that the optimal policy can maximize the return without violating the limit on the expected damage. An obvious shortcoming of this approach is that actual failures will have to occur to properly assess damage.
Formal methods have been applied to the problem of AI safety. In [5], the authors propose to combine machine learning and reachability analysis for dynamical models to achieve high performance and guarantee safety. In this work, we focus on probabilistic models which are natural to many modern machine learning methods. In [17], the authors propose to use formal specification to synthesize a control policy for reinforcement learning. They consider formal specifications captured in Linear Temporal Logic (LTL), which is suitable for partial MDPs with unknown probabilities, whereas we consider PCTL which matches well with the underlying probabilistic model. Recently, the problem of safe reinforcement learning was explored in [2], where a monitor (called a shield) is used to enforce temporal logic properties either during the learning phase or the execution phase of the reinforcement learning algorithm. The shield provides a list of safe actions each time the agent makes a decision so that the temporal property is preserved. In [11], the authors also propose an approach for controller synthesis for reinforcement learning. In this case, an SMT solver is used to find a scheduler (policy) for the synchronous product of an MDP and a DTMC so that it satisfies both a probabilistic reachability property and an expected cost property.
Another approach that leverages PCTL model checking is proposed in [13]. A so-called abstract Markov decision process (AMDP) model of the environment is first built and PCTL model checking is then used to check the satisfiability of the safety specification. Our work is similar to these in spirit in the application of formal methods. However, while the concept of AL is closely related to reinforcement learning, an agent in the AL paradigm needs to learn a policy from demonstrations without knowing the reward function.
A distinguishing characteristic of our method is the tight integration of formal verification in learning from data (apprenticeship learning in particular). Among imitation or apprenticeship learning methods, margin-based algorithms [1], [15], [16] try to maximize the margin between the expert's policy and all learnt policies until the one with the smallest margin is produced. The apprenticeship learning algorithm proposed by Abbeel and Ng [1] was largely motivated by the support vector machine (SVM), in that the features of the expert demonstrations are maximally separated from the features of all other candidate policies. Our algorithm makes use of this observation when using counterexamples to steer the policy search process. Recently, the idea of learning from failed demonstrations has started to emerge. In [18], the authors propose an IRL algorithm that can learn from both successful and failed demonstrations. It does so by reformulating the maximum entropy algorithm in [20] to find a policy that maximally deviates from the failed demonstrations while approaching the successful ones as much as possible. However, this entropy-based method requires obtaining many failed demonstrations and can be very costly in practice.
Finally, our approach is inspired by the work on formal inductive synthesis [10] and counterexample-guided inductive synthesis (CEGIS) [19]. These frameworks typically combine a constraint-based synthesizer with a verification oracle. In each iteration, the agent refines her hypothesis (i.e. generates a new candidate solution) based on counterexamples provided by the oracle. Our approach can be viewed as an extension of CEGIS to an area of machine learning where the objective is not just functional correctness but also meeting certain learning criteria.

Conclusion and Future work
We propose a counterexample-guided approach for combining probabilistic model checking with apprenticeship learning to ensure safety of the learning outcome. Our approach makes novel use of the counterexamples to steer the policy search process by formulating the problem as a multi-objective optimization problem. Our experiments indicate that the proposed approach can guarantee safety and retain performance for a set of benchmarks including examples drawn from OpenAI Gym. In the future, we would like to explore other imitation or apprenticeship learning algorithms and extend our techniques to those settings.