Safety-Aware Apprenticeship Learning
Abstract
Apprenticeship learning (AL) is a class of Learning from Demonstration techniques in which the reward function of a Markov Decision Process (MDP) is unknown to the learning agent, and the agent has to derive a good policy by observing an expert's demonstrations. In this paper, we study the problem of making AL algorithms inherently safe while still meeting their learning objective. We consider a setting where the unknown reward function is assumed to be a linear combination of a set of state features, and the safety property is specified in Probabilistic Computation Tree Logic (PCTL). By embedding probabilistic model checking inside AL, we propose a novel counterexample-guided approach that can ensure safety while retaining performance of the learnt policy. We demonstrate the effectiveness of our approach on several challenging AL scenarios where safety is essential.
1 Introduction
The rapid progress of artificial intelligence (AI) comes with a growing concern over its safety when deployed in real-life systems and situations. As highlighted in [3], if the objective function of an AI agent is wrongly specified, then maximizing that objective function may lead to harmful results. In addition, the objective function or the training data may focus only on accomplishing a specific task and ignore other aspects of the environment, such as safety constraints. In this paper, we propose a novel framework that combines explicit safety specification with learning from data. We consider safety specifications expressed in Probabilistic Computation Tree Logic (PCTL) and show how probabilistic model checking can be used to ensure safety while retaining the performance of a learning algorithm known as apprenticeship learning (AL).
We consider the formulation of apprenticeship learning by Abbeel and Ng [1]. The concept of AL is closely related to reinforcement learning (RL), where an agent learns what actions to take in an environment (known as a policy) by maximizing some notion of long-term reward. In AL, however, the agent is not given the reward function, but instead has to first estimate it from a set of expert demonstrations via a technique called inverse reinforcement learning [18]. The formulation assumes that the reward function is expressible as a linear combination of known state features. An expert demonstrates the task by maximizing this reward function, and the agent tries to derive a policy that matches the feature expectations of the expert's demonstrations. Apprenticeship learning can also be viewed as an instance of the class of techniques known as Learning from Demonstration (LfD). One issue with LfD is that the expert often can only demonstrate how the task works, not how it may fail, because failure may cause irrecoverable damage to the system, such as crashing a vehicle. In general, this lack of "negative examples" can heavily bias how the learning agent constructs the reward estimate. In fact, even if all the demonstrations are safe, the agent may still end up learning an unsafe policy.
The key idea of this paper is to incorporate formal verification in apprenticeship learning. We are inspired by the line of work on formal inductive synthesis [10] and counterexample-guided inductive synthesis [22]. Our approach is also similar in spirit to the recent work on safety-constrained reinforcement learning [11]. However, our approach uses the results of model checking in a novel way. We consider safety specifications expressed in probabilistic computation tree logic (PCTL). We employ a verification-in-the-loop approach by embedding PCTL model checking as a safety checking mechanism inside the learning phase of AL. In particular, when a learnt policy does not satisfy the PCTL formula, we leverage counterexamples generated by the model checker to steer the policy search in AL. In essence, counterexample generation can be viewed as supplying negative examples for the learner. Thus, the learner tries to find a policy that not only imitates the expert's demonstrations but also stays away from the failure scenarios captured by the counterexamples.
In summary, we make the following contributions in this paper.

We propose a novel framework for incorporating formal safety guarantees in Learning from Demonstration.

We develop a novel algorithm called Counterexample-Guided Apprenticeship Learning (CEGAL) that combines probabilistic model checking with the optimization-based approach of apprenticeship learning.

We demonstrate that our approach can guarantee safety for a set of case studies and attain performance comparable to that of using apprenticeship learning alone.
The rest of the paper is organized as follows. Section 2 reviews background information on apprenticeship learning and PCTL model checking. Section 3 defines the safety-aware apprenticeship learning problem and gives an overview of our approach. Section 4 illustrates the counterexample-guided learning framework. Section 5 describes the proposed algorithm in detail. Section 6 presents a set of experimental results demonstrating the effectiveness of our approach. Section 7 discusses related work. Section 8 concludes and offers future directions.
2 Preliminaries
2.1 Markov Decision Process and Discrete-Time Markov Chain
A Markov Decision Process (MDP) is a tuple \(M = (S,A,T,\gamma ,s_0,R)\), where S is a finite set of states; A is a set of actions; \(T: S\times A\times S\rightarrow [0, 1]\) is a transition function describing the probability of transitioning from one state \(s\in S\) to another state by taking action \(a\in A\) in state s; \(R: S\rightarrow \mathbb {R}\) is a reward function which maps each state \(s\in S\) to a real number indicating the reward of being in state s; \(s_0\in S\) is the initial state; \(\gamma \in [0, 1)\) is a discount factor which describes how future rewards attenuate as a sequence of transitions is made. A deterministic and stationary (or memoryless) policy \(\pi : S \rightarrow A\) for an MDP M is a mapping from states to actions, i.e. the policy deterministically selects the action to take based solely on the current state. In this paper, we restrict ourselves to deterministic and stationary policies. A policy \(\pi \) for an MDP M induces a Discrete-Time Markov Chain (DTMC) \(M_{\pi }=(S, T_\pi , s_0)\), where \(T_\pi :S\times S\rightarrow [0, 1]\) gives the probability of transitioning from one state to another in one step. A trajectory \(\tau = s_0\xrightarrow {T_\pi (s_0, s_1)>0} s_1\xrightarrow {T_\pi (s_1, s_2)>0} s_2\ldots \) is a sequence of states with \(s_i\in S\). The accumulated reward of \(\tau \) is \(\sum _{i=0}^{\infty } \gamma ^i R(s_i)\). The value function \(V_\pi : S\rightarrow \mathbb {R}\) measures the expected accumulated reward \(E[\sum _{i=0}^{\infty }\gamma ^i R(s_i)]\) obtained by starting from a state s and following policy \(\pi \). An optimal policy \(\pi \) for an MDP M is one that maximizes the value function [4].
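To make the definitions above concrete, here is a minimal sketch on a hypothetical 3-state, 2-action MDP. The transition matrix, rewards, and discount are invented for illustration; only the value-iteration recurrence and the induced-DTMC construction come from the text.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions; T[a][s][s'] are transition probs.
T = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([0.0, 1.0, 5.0])   # state rewards R(s)
gamma = 0.9                     # discount factor

# Value iteration: V(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s')
V = np.zeros(3)
for _ in range(500):
    V = R + gamma * np.max(T @ V, axis=0)

pi = np.argmax(T @ V, axis=0)    # deterministic stationary policy pi(s)
T_pi = T[pi, np.arange(3), :]    # transition matrix of the induced DTMC M_pi
```

Each row of `T_pi` is a probability distribution over successor states, which is exactly the \(T_\pi \) of the induced DTMC.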
2.2 Apprenticeship Learning via Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) aims at recovering the reward function R of \(M\backslash R = (S, A, T, \gamma , s_0)\) from a set of m trajectories \(\varGamma _E=\{\tau _0, \tau _1, \ldots , \tau _{m-1}\}\) demonstrated by an expert. Apprenticeship learning (AL) [1] assumes that the reward function is a linear combination of state features, i.e. \(R(s) = \omega ^Tf(s)\), where \(f : S \rightarrow [0,\ 1]^k\) is a vector of known features over states S and \(\omega \in \mathbb {R}^k\) is an unknown weight vector that satisfies \(\Vert \omega \Vert _2\le 1\). The expected features of a policy \(\pi \) are the expected values of the cumulative discounted state features f(s) obtained by following \(\pi \) on M, i.e. \(\mu _\pi = E[\sum ^\infty _{t=0} \gamma ^t f(s_t) \mid \pi ]\). Let \(\mu _{E}\) denote the expected features of the unknown expert's policy \(\pi _E\). \(\mu _{E}\) can be approximated by the empirical expected features of the expert's m demonstrated trajectories, \(\hat{\mu }_E=\frac{1}{m} \sum _{\tau \in \varGamma _E}\sum _{t=0}^{\infty }\gamma ^t f(s_{t})\), if m is large enough. With a slight abuse of notation, we also use \(\mu _\varGamma \) to denote the expected features of a set of paths \(\varGamma \). Given an error bound \(\epsilon \), a policy \(\pi ^*\) is defined to be \(\epsilon \)-close to \(\pi _E\) if its expected features \(\mu _{\pi ^*}\) satisfy \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2 \le \epsilon \). The expected features of a policy can be calculated using Monte Carlo methods, value iteration or linear programming [1, 4].
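The empirical feature expectations \(\hat{\mu }_E\) can be computed directly from the demonstrated trajectories. The sketch below assumes a hypothetical discrete state space and an invented two-dimensional feature map `feature`; only the discounted-sum formula itself comes from the text.

```python
import numpy as np

gamma = 0.99

def feature(s):
    # Hypothetical indicator features f(s) in [0,1]^2 for integer states.
    return np.array([float(s == 0), float(s == 2)])

def empirical_feature_expectations(trajectories, gamma):
    """hat{mu}_E = (1/m) * sum over tau of sum over t of gamma^t f(s_t)."""
    mu = np.zeros_like(feature(trajectories[0][0]), dtype=float)
    for tau in trajectories:
        mu += sum(gamma**t * feature(s) for t, s in enumerate(tau))
    return mu / len(trajectories)

demos = [[0, 1, 2, 2], [0, 0, 1, 2]]   # two hypothetical demonstrations
mu_E = empirical_feature_expectations(demos, gamma)
```

The same routine, run on trajectories sampled from a candidate policy, gives a Monte Carlo estimate of \(\mu _\pi \) for the \(\epsilon \)-closeness test.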
2.3 PCTL Model Checking
Consider a PCTL safety formula \(\phi = P_{\le p^*}[\psi ]\) over a DTMC \(M_\pi \), and let \(\varGamma _\psi \) denote the set of trajectories of \(M_\pi \) that satisfy the path formula \(\psi \), each trajectory \(\tau \) occurring with probability \(P(\tau )\). A counterexample to \(\phi \) is a set \(cex\subseteq \varGamma _\psi \) that satisfies \(\sum _{\tau \in cex}P(\tau )> p^*\). Let \(\mathbb {P}(\varGamma ) = \sum _{\tau \in \varGamma }P(\tau )\) be the sum of the probabilities of all trajectories in a set \(\varGamma \). Let \(CEX_{\phi }\subseteq 2^{\varGamma _\psi }\) be the set of all counterexamples to a formula \(\phi \), such that \(\forall cex\in CEX_{\phi },\ \mathbb {P}(cex)> p^*\) and \(\forall \varGamma \in 2^{\varGamma _\psi }\backslash CEX_{\phi },\ \mathbb {P}(\varGamma )\le p^*\). A minimal counterexample is a set \(cex\in CEX_{\phi }\) such that \(\forall cex'\in CEX_{\phi },\ |cex|\le |cex'|\). By converting the DTMC \(M_\pi \) into a weighted directed graph, a counterexample can be found by solving a k-shortest paths (KSP) problem or a hop-constrained KSP (HKSP) problem [6]. Alternatively, counterexamples can be found by using Satisfiability Modulo Theories (SMT) solving or mixed integer linear programming to determine minimal critical subsystems that capture the counterexamples in \(M_\pi \) [23].
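As an illustration of counterexample generation by path enumeration, the sketch below uses naive bounded enumeration on an invented 4-state DTMC, a stand-in for the k-shortest-paths machinery of [6]: it collects the highest-probability paths reaching an unsafe state until their total mass exceeds \(p^*\).

```python
import numpy as np

# Hypothetical 4-state DTMC; state 0 is initial, state 3 is the unsafe target.
T = np.array([
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.0],   # safe sink
    [0.0, 0.0, 0.0, 1.0],   # unsafe sink
])
p_star = 0.3                 # threshold of P<=p* [F unsafe]

def unsafe_paths(T, init=0, target=3, horizon=5):
    """Enumerate bounded paths from init that first reach target, with probs."""
    paths, stack = [], [([init], 1.0)]
    while stack:
        path, prob = stack.pop()
        s = path[-1]
        if s == target:
            paths.append((path, prob))
            continue
        if len(path) > horizon:
            continue
        for s2 in range(len(T)):
            if T[s][s2] > 0 and s2 != s:    # skip sink self-loops
                stack.append((path + [s2], prob * T[s][s2]))
    return paths

# Greedy counterexample: take highest-probability paths until mass exceeds p*.
cex, mass = [], 0.0
for path, prob in sorted(unsafe_paths(T), key=lambda x: -x[1]):
    if mass > p_star:
        break
    cex.append(path)
    mass += prob
```

Here `cex` is a (not necessarily minimal) counterexample: a set of unsafe paths whose combined probability witnesses the violation of \(\phi \).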
A policy can also be synthesized by solving the objective \(\min _{\pi }\ P_{=?}[\psi ]\) for an MDP M. This problem can be solved by linear programming or policy iteration (or value iteration for step-bounded reachability) [14].
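A minimal value-iteration sketch for the objective \(\min _{\pi } P_{=?}[\psi ]\) with an unbounded reachability property, on a hypothetical 3-state MDP; this is an illustration only, not a substitute for the LP and policy-iteration methods of [14].

```python
import numpy as np

# Hypothetical MDP with unsafe state 2; T[a][s][s'] are transition probs.
T = np.array([
    [[0.8, 0.1, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
unsafe = 2

# Value iteration for min_pi P[F unsafe]:
#   x(s) = 1 if s is unsafe, else min_a sum_s' T(s,a,s') x(s')
x = np.zeros(3)
x[unsafe] = 1.0
for _ in range(1000):
    y = np.min(T @ x, axis=0)
    y[unsafe] = 1.0
    x = y

pi_safe = np.argmin(T @ x, axis=0)   # memoryless policy minimizing reach prob
```

In this toy model the minimizing policy avoids the unsafe state with probability 1 (for example, action 0 keeps state 1 in place forever), so the fixpoint `x` is zero on the safe states.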
3 Problem Formulation and Overview
Definition 1
The safety-aware apprenticeship learning (SafeAL) problem is, given an \(MDP\backslash R\), a set of m trajectories \(\{\tau _0, \tau _1,\ldots , \tau _{m-1}\}\) demonstrated by an expert, and a specification \(\varPhi \), to learn a policy \(\pi \) that satisfies \(\varPhi \) and is \(\epsilon \)-close to the expert policy \(\pi _E\).
Remark 1
We note that a solution may not always exist for the SafeAL problem. While the decision problem of checking whether a solution exists is of theoretical interest, in this paper, we focus on tackling the problem of finding a policy \(\pi \) that satisfies a PCTL formula \(\varPhi \) (if \(\varPhi \) is satisfiable) and whose performance is as close to that of the expert's as possible, i.e. we relax the condition on \(\mu _{\pi }\) being \(\epsilon \)-close to \(\mu _E\).
4 A Framework for SafetyAware Learning
5 Counterexample-Guided Apprenticeship Learning
Algorithm 1 describes CEGAL in detail. With a constant \(sup=1\) and a variable \(inf\in [0, sup]\) as the upper and lower bounds respectively, the learner determines the value of k within [inf, sup] in each iteration depending on the outcome of the verifier, and uses k when solving (10) in line 27. Like most nonlinear optimization algorithms, this algorithm requires an initial guess, namely an initial safe policy \(\pi _0\) that makes \(\varPi _S\) nonempty. A good initial candidate is the maximally safe policy, obtained for example using PRISM-games [15]. Without loss of generality, we assume this policy satisfies \(\varPhi \). Suppose in iteration i, the intermediate policy \(\pi _i\) learnt by the learner in iteration \(i-1\) is verified to satisfy \(\varPhi \); then we increase inf to \(inf=k\) and reset k to \(k=sup\) as shown in line 22. If \(\pi _i\) does not satisfy \(\varPhi \), then we reduce k to \(k=\alpha \cdot inf + (1 - \alpha )k\) as shown in line 26, where \(\alpha \in (0, 1)\) is a step length parameter. If \(k-inf\le \sigma \) and \(\pi _i\) still does not satisfy \(\varPhi \), the algorithm chooses from \(\varPi _S\) the best safe policy \(\pi ^*\), i.e. the one with the smallest margin to \(\pi _E\), as shown in line 24. If \(\pi _i\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), the algorithm outputs \(\pi _i\) as shown in line 19. When \(\pi _i\) satisfies \(\varPhi \) and \(inf = sup = k = 1\), solving (10) is equivalent to solving (1) as in the original AL algorithm.
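The search over k described above can be summarized by the following skeleton, where `learn`, `verify`, and `distance` are hypothetical stand-ins for solving (10), PCTL model checking, and computing \(\Vert \mu _E - \mu _{\pi }\Vert _2\). The update rules for inf and k follow the text; everything else is a simplification of Algorithm 1, not its full pseudocode.

```python
def cegal(pi_0, learn, verify, distance,
          eps=1e-1, sigma=1e-5, alpha=0.5, max_iter=50):
    """Sketch of CEGAL's outer loop over the trade-off parameter k."""
    assert verify(pi_0), "initial policy must satisfy Phi"
    safe_set = [pi_0]                     # Pi_S: safe policies found so far
    inf, sup = 0.0, 1.0
    k = sup
    pi = pi_0
    for _ in range(max_iter):
        pi = learn(pi, k)                 # solve (10) with current k
        if verify(pi):                    # PCTL model check against Phi
            if distance(pi) <= eps:
                return pi                 # safe and eps-close: done
            safe_set.append(pi)
            inf, k = k, sup               # raise lower bound, restart at sup
        else:
            if k - inf <= sigma:          # interval exhausted: fall back
                break
            k = alpha * inf + (1 - alpha) * k   # shrink k toward inf
    return min(safe_set, key=distance)    # best safe policy found
```

A toy run with stub functions (policies modeled as numbers, "safe" meaning at most 0.7, the expert at 1.0) terminates with a safe policy no worse than the initial one, matching the guarantee of Theorem 1.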
Remark 2
The initial policy \(\pi _0\) does not have to be maximally safe, although such a policy can be used to check whether \(\varPhi \) is satisfiable at all. Naively safe policies often suffice for obtaining a safe and performant output at the end. Such a policy can be obtained easily in many settings, e.g., in the grid-world example one safe policy is simply to stay in the initial cell. In either case, \(\pi _0\) typically has very low performance, since satisfying \(\varPhi \) is its only requirement.
Theorem 1
Given an initial policy \(\pi _0\) that satisfies \(\varPhi \), Algorithm 1 is guaranteed to output a policy \(\pi ^*\) such that (1) \(\pi ^*\) satisfies \(\varPhi \), and (2) the performance of \(\pi ^*\) is at least as good as that of \(\pi _0\) when compared to \(\pi _E\), i.e. \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\).
Proof Sketch. The first part of the guarantee can be proven by case splitting. Algorithm 1 outputs \(\pi ^*\) either when \(\pi ^*\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), or when \(k-inf\le \sigma \) in some iteration. In the first case, \(\pi ^*\) clearly satisfies \(\varPhi \). In the second case, \(\pi ^*\) is selected from the set \(\varPi _S\), which contains all the policies found to satisfy \(\varPhi \) so far, so \(\pi ^*\) satisfies \(\varPhi \). For the second part of the guarantee, the initial policy \(\pi _0\) is the final output \(\pi ^*\) if \(\pi _0\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\). Otherwise, \(\pi _0\) is added to \(\varPi _S\) since it satisfies \(\varPhi \). If \(k-inf\le \sigma \) in some iteration, then the final output is \(\pi ^*=\mathop {argmin}\limits _{{\pi }\in \varPi _S}\Vert \mu _E - \mu _{{\pi }}\Vert _2\), so it must satisfy \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\) because \(\pi _0\in \varPi _S\). If instead a learnt policy \(\pi ^*\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), then Algorithm 1 outputs \(\pi ^*\) without adding it to \(\varPi _S\). Since \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \epsilon \) while \(\Vert \mu _E - \mu _{{\pi }}\Vert _2>\epsilon \) for all \({\pi }\in \varPi _S\), it follows that \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\).
Discussion. In the worst case, CEGAL returns the initial safe policy. This can happen when a policy that simultaneously satisfies \(\varPhi \) and is \(\epsilon \)-close to the expert's demonstrations does not exist. Compared to AL, which offers no safety guarantee, and to synthesizing the maximally safe policy, which typically has very poor performance, CEGAL provides a principled way of guaranteeing safety while retaining performance.
In each iteration, the algorithm first solves a second-order cone programming (SOCP) problem (10) to learn a policy. SOCP problems can be solved in polynomial time by interior-point (IP) methods [12]. PCTL model checking for DTMCs runs in time linear in the size of the formula and polynomial in the size of the state space [7]. Counterexample generation can be done either by enumerating paths using a k-shortest paths algorithm or by determining a critical subsystem using an SMT formulation or mixed integer linear programming (MILP) [23]. For the k-shortest-paths-based algorithm, enumerating a large number of paths (i.e. a large k) can be computationally expensive when \(p^*\) is large. This can be alleviated by using a smaller \(p^*\) during the computation, which is equivalent to considering only the paths that have high probabilities.
6 Experiments
We evaluate our algorithm on three case studies: (1) grid world, (2) cart-pole, and (3) mountain-car. The cart-pole environment^{1} and the mountain-car environment^{2} are obtained from OpenAI Gym. All experiments are carried out on a quad-core i7-7700K processor running at 3.6 GHz with 16 GB of memory. Our prototype tool is implemented in Python^{3}. The parameters are \(\gamma =0.99\), \(\epsilon =10\), \(\sigma =10^{-5}\), \(\alpha =0.5\), and the maximum number of iterations is 50. For the OpenAI Gym experiments, in each step the agent sends an action to the OpenAI environment, and the environment returns an observation and a reward (0 or 1). We show that our algorithm can guarantee safety while retaining the performance of the learnt policy compared with using AL alone.
6.1 Grid World
Table 1. Average runtime per iteration in seconds.
Size  Num. of states  Compute \(\pi \)  Compute \(\mu \)  MC  Cex 

\(8 \times 8\)  64  0.02  0.02  1.39  0.014 
\(16 \times 16\)  256  0.05  0.05  1.43  0.014 
\(32 \times 32\)  1024  0.07  0.08  3.12  0.035 
\(64 \times 64\)  4096  6.52  25.88  22.877  1.59 
6.2 CartPole from OpenAI Gym
Table 2. In the cart-pole environment, higher average steps mean better performance. The safest policy is synthesized using PRISM-games.
MC Result  Avg. Steps  Num. of Iters  

AL  49.1%  165  2 
Safest Policy  0.0%  8  N.A. 
\(p^*\) = 30%  17.2%  121  10 
\(p^*\) = 25%  9.3%  136  14 
\(p^*\) = 20%  17.2%  122  10 
\(p^*\) = 15%  6.9%  118  22 
\(p^*\) = 10%  7.2%  136  22 
\(p^*\) = 5%  0.04%  83  50 
We used 2000 demonstrations, in each of which the pole is held upright for all 200 steps without violating any of the safety conditions. The safest policy synthesized by PRISM-games is used as the initial safe policy. We also compare the policies learned by CEGAL for different safety thresholds \(p^*\). In Table 2, the policies are compared in terms of the model checking result ('MC Result') for the PCTL property in (14) on the constructed MDP, the average number of steps ('Avg. Steps') that a policy (executed in the OpenAI environment) can hold the pole up across 5000 rounds (the higher the better), and the number of iterations ('Num. of Iters') it takes for the algorithm to terminate (either converging to an \(\epsilon \)-close policy, terminating due to \(\sigma \), or stopping after 50 iterations). The policy in the first row is the result of using AL alone, which has the best performance but also a 49.1% probability of violating the safety requirement. The safest policy, shown in the second row, is always safe but has almost no performance at all: it simply lets the pole fall and thus never risks moving the cart out of the range \([-0.3, 0.3]\). In contrast, the policies learnt using CEGAL always satisfy the safety requirement. From \(p^*\) = 30% down to 10%, the performance of the learnt policy is comparable to that of the AL policy. However, when the safety threshold becomes very low, e.g., \(p^*\) = 5%, the performance of the learnt policy drops significantly. This reflects the phenomenon that the tighter the safety condition, the less room the agent has to maneuver to achieve good performance.
6.3 MountainCar from OpenAI Gym
Table 3. In the mountain-car environment, lower average steps mean better performance. The safest policy is synthesized via PRISM-games.
MC Result  Avg. steps  Num. of Iters  

Policy Learnt via AL  69.2%  54  50 
Safest Policy  0.0%  Fail  N.A. 
\(p^*\) = 60%  43.4%  57  9 
\(p^*\) = 50%  47.2%  55  17 
\(p^*\) = 40%  29.3%  61  26 
\(p^*\) = 30%  18.9%  64  17 
\(p^*\) = 20%  4.9%  Fail  40 
7 Related Work
A taxonomy of AI safety problems is given in [3], which highlights the issues of misspecified objectives or rewards and insufficient or poorly curated training data. There have been several attempts to address these issues from different angles. The problem of safe exploration is studied in [8, 17]. In particular, the latter work proposes to add a safety constraint, evaluated as an amount of damage, to the optimization problem so that the optimal policy can maximize the return without violating the limit on the expected damage. An obvious shortcoming of this approach is that actual failures have to occur in order to properly assess damage.
Formal methods have been applied to the problem of AI safety. In [5], the authors propose to combine machine learning and reachability analysis for dynamical models to achieve high performance while guaranteeing safety. In this work, we focus on probabilistic models, which are natural in many modern machine learning methods. In [20], the authors propose to use formal specifications to synthesize a control policy for reinforcement learning. They consider specifications captured in Linear Temporal Logic, whereas we consider PCTL, which matches better with the underlying probabilistic model. Recently, the problem of safe reinforcement learning was explored in [2], where a monitor (called a shield) is used to enforce temporal logic properties either during the learning phase or the execution phase of the reinforcement learning algorithm. The shield provides a list of safe actions each time the agent makes a decision, so that the temporal property is preserved. In [11], the authors also propose an approach for controller synthesis in reinforcement learning. In this case, an SMT solver is used to find a scheduler (policy) for the synchronous product of an MDP and a DTMC such that it satisfies both a probabilistic reachability property and an expected cost property. Another approach that leverages PCTL model checking is proposed in [16]: a so-called abstract Markov decision process (AMDP) model of the environment is first built, and PCTL model checking is then used to check whether the safety specification is satisfied. Our work is similar to these in spirit in its application of formal methods. However, while the concept of AL is closely related to reinforcement learning, an agent in the AL paradigm needs to learn a policy from demonstrations without knowing the reward function a priori.
A distinguishing characteristic of our method is the tight integration of formal verification with learning from data (apprenticeship learning in particular). Among imitation or apprenticeship learning methods, margin-based algorithms [1, 18, 19] try to maximize the margin between the expert's policy and all learnt policies until the one with the smallest margin is produced. The apprenticeship learning algorithm proposed by Abbeel and Ng [1] was largely motivated by support vector machines (SVMs), in that the feature expectations of the expert's demonstrations are maximally separated from those of all other candidate policies. Our algorithm makes use of this observation when using counterexamples to steer the policy search process. Recently, the idea of learning from failed demonstrations has started to emerge. In [21], the authors propose an IRL algorithm that can learn from both successful and failed demonstrations. It reformulates the maximum entropy algorithm of [24] to find a policy that maximally deviates from the failed demonstrations while approaching the successful ones as closely as possible. However, this entropy-based method requires obtaining many failed demonstrations, which can be very costly in practice.
Finally, our approach is inspired by the work on formal inductive synthesis [10] and counterexample-guided inductive synthesis (CEGIS) [22]. These frameworks typically combine a constraint-based synthesizer with a verification oracle. In each iteration, the agent refines her hypothesis (i.e. generates a new candidate solution) based on counterexamples provided by the oracle. Our approach can be viewed as an extension of CEGIS where the objective is not just functional correctness but also meeting certain learning criteria.
8 Conclusion and Future Work
We propose a counterexample-guided approach that combines probabilistic model checking with apprenticeship learning to ensure the safety of the apprenticeship learning outcome. Our approach makes novel use of counterexamples to steer the policy search process by reformulating the feature matching problem into a multi-objective optimization problem that additionally takes safety into account. Our experiments indicate that the proposed approach can guarantee safety and retain performance for a set of benchmarks, including examples drawn from OpenAI Gym. In the future, we would like to explore other imitation or apprenticeship learning algorithms and extend our techniques to those settings.
Footnotes
 1.
 2.
 3.
 4.
The MDP is built from sampled data. The feature vector of each state contains 30 radial basis functions, which depend on the squared Euclidean distances between the current state and 30 other states that are uniformly distributed in the state space.
 5.
The MDP is built from sampled data. The feature vector of each state contains 2 exponential functions and 18 radial basis functions, the latter depending on the squared Euclidean distances between the current state and 18 other states that are uniformly distributed in the state space.
 6.
AL did not converge to an \(\epsilon \)-close policy within 50 iterations in this case.
Notes
Acknowledgement
This work is funded in part by the DARPA BRASS program under agreement number FA8750-16-C-0043 and by NSF grant CCF-1646497.
References
 1. Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004, p. 1. ACM, New York (2004)
 2. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. CoRR, abs/1708.08611 (2017)
 3. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. CoRR, abs/1606.06565 (2016)
 4. Bellman, R.: A Markovian decision process. Indiana Univ. Math. J. 6, 15 (1957)
 5. Gillulay, J.H., Tomlin, C.J.: Guaranteed safe online learning of a bounded system. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2979–2984. IEEE (2011)
 6. Han, T., Katoen, J.P., Berteun, D.: Counterexample generation in probabilistic model checking. IEEE Trans. Softw. Eng. 35(2), 241–257 (2009)
 7. Hansson, H., Jonsson, B.: A logic for reasoning about time and reliability. Formal Aspects Comput. 6(5), 512–535 (1994)
 8. Held, D., McCarthy, Z., Zhang, M., Shentu, F., Abbeel, P.: Probabilistically safe policy transfer. CoRR, abs/1705.05394 (2017)
 9. Jansen, N., Ábrahám, E., Scheffler, M., Volk, M., Vorpahl, A., Wimmer, R., Katoen, J.P., Becker, B.: The COMICS tool – computing minimal counterexamples for discrete-time Markov chains. CoRR, abs/1206.0603 (2012)
 10. Jha, S., Seshia, S.A.: A theory of formal synthesis via inductive learning. Acta Informatica 54(7), 693–726 (2017)
 11. Junges, S., Jansen, N., Dehnert, C., Topcu, U., Katoen, J.P.: Safety-constrained reinforcement learning for MDPs. In: Chechik, M., Raskin, J.F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 130–146. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_8
 12. Kuo, Y.J., Mittelmann, H.D.: Interior point methods for second-order cone programming and OR applications. Comput. Optim. Appl. 28(3), 255–285 (2004)
 13. Kwiatkowska, M., Norman, G., Parker, D.: PRISM: probabilistic symbolic model checker. In: Field, T., Harrison, P.G., Bradley, J., Harder, U. (eds.) TOOLS 2002. LNCS, vol. 2324, pp. 200–204. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-46029-2_13
 14. Kwiatkowska, M., Parker, D.: Automated verification and strategy synthesis for probabilistic systems. In: Van Hung, D., Ogawa, M. (eds.) ATVA 2013. LNCS, vol. 8172, pp. 5–22. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02444-8_2
 15. Kwiatkowska, M., Parker, D., Wiltsche, C.: PRISM-games: verification and strategy synthesis for stochastic multi-player games with multiple objectives. Int. J. Softw. Tools Technol. Transfer 20, 195–210 (2017)
 16. Mason, G.R., Calinescu, R.C., Kudenko, D., Banks, A.: Assured reinforcement learning for safety-critical applications. In: Doctoral Consortium at the 10th International Conference on Agents and Artificial Intelligence. SciTePress (2017)
 17. Moldovan, T.M., Abbeel, P.: Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810 (2012)
 18. Ng, A.Y., Russell, S.J.: Algorithms for inverse reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, ICML 2000, pp. 663–670. Morgan Kaufmann Publishers Inc., San Francisco (2000)
 19. Ratliff, N.D., Bagnell, J.A., Zinkevich, M.A.: Maximum margin planning. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 729–736. ACM, New York (2006)
 20. Sadigh, D., Kim, E.S., Coogan, S., Sastry, S.S., Seshia, S.A.: A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. CoRR, abs/1409.5486 (2014)
 21. Shiarlis, K., Messias, J., Whiteson, S.: Inverse reinforcement learning from failure. In: Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, pp. 1060–1068. International Foundation for Autonomous Agents and Multiagent Systems (2016)
 22. Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., Saraswat, V.: Combinatorial sketching for finite programs. SIGOPS Oper. Syst. Rev. 40(5), 404–415 (2006)
 23. Wimmer, R., Jansen, N., Ábrahám, E., Becker, B., Katoen, J.P.: Minimal critical subsystems for discrete-time Markov models. In: Flanagan, C., König, B. (eds.) TACAS 2012. LNCS, vol. 7214, pp. 299–314. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28756-5_21
 24. Ziebart, B.D., Maas, A., Bagnell, J.A., Dey, A.K.: Maximum entropy inverse reinforcement learning. In: Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI 2008, vol. 3, pp. 1433–1438. AAAI Press (2008)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.