Explicit Explore, Exploit, or Escape ($E^4$): near-optimal safety-constrained reinforcement learning in polynomial time

In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We then discuss $E^4$ as a practical algorithmic framework, including robust-constrained offline optimisation algorithms, the design of uncertainty sets for the transition dynamics of unknown states, and how to further leverage empirical observations and prior knowledge to relax some of the worst-case assumptions underlying the theory.


Introduction
As machine learning methods are increasingly deployed in real-world applications, their safety is of foremost importance. Reinforcement learning (RL) is a particularly relevant example since in RL, an agent explores an initially unknown environment through observation, action, and reward. Among the various challenges to safe artificial intelligence (see Amodei et al. (2016); Everitt, Lea and Hutter (2018)), safe exploration and robustness to mismatches between the training and application environment are of particular importance when applying RL algorithms in safety-critical missions. Unsafe exploration may occur when an RL system explores a new region of the state space and, in doing so, performs actions with disastrous consequences. A mismatch between training and application environment may happen when the model within a model-based RL system is incorrect for some states, which can result in the system learning behaviours that are safe in simulation but dangerous in the real world. More generally, desirable behaviour in a complex world comes with various safety constraints, such as avoiding damage or long-term wear-and-tear, or following legal and social norms to avoid harm. RL research has traditionally been studied from two angles, namely model-free and model-based RL, within the framework of Markov decision processes (MDPs). In model-free RL (e.g. Rummery and Niranjan (1994); Watkins and Dayan (1992)), the RL agent has no model of the environment and only learns the long-term cumulative reward associated with the action taken in a given state. Deep model-free RL (e.g. Mnih et al. (2015); Schulman, Wolski, Dhariwal, Radford and Klimov (2017)) additionally incorporates the expressive power of deep neural networks for high performance across a variety of simulation environments. Such methods do not consider safety throughout the exploration process, and are therefore likely to violate safety criteria. In model-based RL (e.g. 
Brafman and Tennenholtz (2002); Kearns and Singh (2002); Strehl, Li and Littman (2006)), the RL agent learns a model of the environment, including the transitions between states and the reward function, which can then be used to compute the long-term cumulative reward. Since learning in the real world may come with high-cost failures, a key benefit of model-based RL is the use of offline optimisation, which can sample trajectories from the environment model without requiring too many real-world samples and, therefore, failures. Consequently, model-based RL provides improved sample complexity and may also be beneficial for safety. Before model-based RL can be used in long-term safety-critical applications, at least four requirements must be satisfied, namely constraint-satisfaction, safe exploration, robustness to model errors, and applicability in non-episodic environments.
Constraint-satisfaction means that the agent must not violate any of the safety constraints defined by the user. The main framework for dealing with such constraints in this general manner is that of constrained Markov decision processes (CMDPs) (Altman, 1999), which aims to optimise the long-term reward subject to constraints over a long-term constraint-cost function; in particular, the RL agent is given a maximum budget of cumulative constraint-cost which cannot be exceeded. Traditionally, CMDPs are solved by linear programming, and more recently by policy optimisation methods for high-dimensional control problems (Achiam, Held, Tamar & Abbeel, 2017).
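As a concrete illustration of the CMDP objective, the sketch below evaluates the discounted value and constraint-cost of a fixed policy on a toy two-state CMDP by iterating the Bellman backup, and then checks feasibility against the budget d. All numbers are invented for illustration; this is a minimal sketch, not the paper's solution method.

```python
def evaluate(P, signal, policy, gamma, iters=1000):
    """Iteratively evaluate the expected discounted sum of a per-step signal
    (reward or constraint-cost) for a fixed deterministic policy, via
    repeated Bellman backups on a tabular model."""
    n = len(signal)
    v = [0.0] * n
    for _ in range(iters):
        v = [signal[s][policy[s]]
             + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][policy[s]]))
             for s in range(n)]
    return v

# Toy two-state, two-action CMDP; all numbers are hypothetical.
P = [[[0.9, 0.1], [0.1, 0.9]],   # P[s][a] = distribution over next states
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.0, 2.0]]     # mean rewards r(s, a)
c = [[0.0, 1.0], [0.5, 0.2]]     # mean constraint-costs c(s, a)
gamma, d = 0.9, 3.0
policy = [0, 1]                  # deterministic policy: action index per state

V = evaluate(P, r, policy, gamma)
C = evaluate(P, c, policy, gamma)
feasible = all(cost <= d for cost in C)   # does the policy respect budget d?
```

A constrained solver would search over policies for the feasible one with the highest value; here we only evaluate a given policy.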
Safe exploration means that the agent must not venture too far into unknown or dangerous states and must ensure a timely return to safe known states. Within CMDPs, there is a variety of approaches to safe return based on a few additional assumptions. For control problems where the aim is to stay stable in a particular goal state, Lyapunov functions can be used with additional assumptions of continuity on the policy and environment dynamics (Berkenkamp, Turchetta, Schoellig & Krause, 2017; Chow, Nachum, Duenez-Guzman & Ghavamzadeh, 2018). An approach for general CMDPs is to define a supervising agent which can return the agent safely (Turchetta, Kolobov, Shah, Krause & Agarwal, 2020). However, this defers the problem of safe exploration to the supervising agent, which needs to be trained on realistic data. A model-free approach for safe exploration involves imposing an entropy constraint on policies to ensure exploration (Yang, Simao, Tindemans & Spaan, 2021). As an alternative to the CMDP framework, one can also consider safety in terms of ensuring that performance improves on a known policy (Garcelon, Ghavamzadeh, Lazaric & Pirotta, 2020; Thomas, Theocharous & Ghavamzadeh, 2015); in this case safe exploration can be ensured using off-policy evaluation (Thomas et al., 2015), which samples only from a known baseline policy, or by ensuring that online exploration monotonically improves performance (Garcelon et al., 2020).
Robustness to model errors means that the agent must be prepared for the worst case when its model is inaccurate. This setting has been considered primarily within the framework of robust Markov decision processes, which accounts for uncertainty in the transition dynamics model (Iyengar, 2005; Nilim & Ghaoui, 2005; Wiesemann, Kuhn & Rustem, 2013) and which has been integrated into CMDPs by Russel, Benosman and Van Baar (2020). Others additionally incorporate uncertainty over the reward model with applications to non-stationary MDPs (Jaksch, Ortner & Auer, 2010; Lecarpentier & Rachelson, 2019), although these approaches are in the unconstrained setting and the optimistic bias in Jaksch et al. (2010) is likely to be unsafe. A setting with both reward and constraint-cost uncertainty has also been considered, but this assumes known transition dynamics (Zheng & Ratliff, 2020).
Applicability in non-episodic (also called "continuing") environments means that the agent will be embedded in a long-term environment with no resets. The potential for continued sequential dependencies makes non-episodic environments challenging; it is therefore no surprise that non-episodic environments became a topic of recent benchmark formulations (see Naik, Abbas, White and Sutton (2021); Platanios, Saparov and Mitchell (2020)). In safety-critical settings, the continued lifetime means that any failure is unforgiving, unlike in episodic video games. Therefore, the above-mentioned challenge of returnability becomes critically important in such environments. Somewhat similar to Turchetta et al. (2020), Eysenbach, Gu, Ibarz and Levine (2018) define a separate policy for resetting, with the main aim to abort and return to an initial state, mimicking the episodic setting without manual resets being required. Similar principles may be applied to the safety-constrained setting as well, where one could potentially target return to a larger set of safe and known states.
With respect to the above requirements, Explicit Explore or Exploit ($E^3$) (Kearns & Singh, 2002), a model-based RL algorithm for near-optimal RL with polynomial sample and time complexity guarantees, provides a unique starting point for safe model-based RL. $E^3$ approximates the MDP with which the agent is interacting, and the algorithm accounts for model errors and the resulting value estimate errors by distinguishing between known states, which have been estimated correctly, and unknown states, which have not been estimated correctly. $E^3$ also provides a natural way to deal with exploration and exploitation in a non-episodic environment, namely by alternating limited-step trajectories, each of which is long enough to assess value function statistics correctly, and then explicitly choosing either an exploration policy or an exploitation policy. The exploitation policy is chosen when an optimal policy is available from the given starting state of the trajectory, and an exploration policy is chosen otherwise, in an attempt to find an unknown state. Recent work (Henaff, 2019) has also shown the practical feasibility of the approach for continuous state spaces, comparing favourably to state-of-the-art deep RL.
What is currently missing from $E^3$ is a suitable way to deal with constraint-satisfaction across the known and unknown states, as well as a suitable method for providing a safe return, or "escape", from the unknown states back to the known states. By considering an "escape" policy in addition to the exploration and exploitation policies, this paper formulates an algorithm called Explicit Explore, Exploit, or Escape ($E^4$; see Figure 1). $E^4$ extends the $E^3$ algorithm to satisfy safety constraints throughout the lifetime of the RL agent through the following algorithmic contributions.
• Using the CMDP rather than the MDP framework, $E^4$ additionally models another stream of reinforcement signals called constraint-costs, which are constrained to a long-term budget and which represent the cost of state-action pairs.
• A correction is formulated for the constraint-cost budget in offline optimisation to account for the potential model errors in known states.
• Using a specialised "escape policy", optimised to return to the known states as quickly as possible, $E^4$ halts the balanced wandering behaviour of $E^3$ as soon as there is a risk of exceeding the budgetary constraint in unknown states.
• Analytical formulas are derived to determine an allowed pseudo-budget in known and unknown states to ensure the budget over the full CMDP is not exceeded.
• To ensure the escape policy returns without exceeding the pseudo-budget in unknown states, even under worst-case assumptions, we propose four possible approaches, making modifications to existing robust and constrained optimisation algorithms:
  - using the robust CMDP policy gradient (Russel et al., 2020) as is in unknown states and optionally in known states;
  - using the robust linear programming technique by Zheng and Ratliff (2020), which is applicable to episodic and undiscounted CMDPs with known transition dynamics, in the known states, by accounting for the discount factor and by considering a limited-step trajectory within the full lifetime of the $E^4$ agent;
  - incorporating an uncertainty over transition dynamics into the constraints of constrained DP (Altman, 1998); and
  - incorporating constraints into robust DP (Iyengar, 2005; Nilim & Ghaoui, 2005) by reformulating the CMDP as a Lagrangian MDP (Taleghan & Dietterich, 2018).
Applying these algorithmic principles, we demonstrate that $E^4$ finds a near-optimal policy within the set of policies that satisfies the budgetary constraints. Doing so yields a similar sample complexity as $E^3$ but adds time complexity due to robust optimisation. Compared to the works mentioned above, the key distinctive features of $E^4$ are 1) safety-constraint satisfaction throughout the lifetime; 2) the explicit explore, exploit, or escape structure; 3) modelling the transition dynamics, reward function, constraint-cost function, and their uncertainty for robust offline optimisation; and 4) the aim to find a near-optimal constrained policy, rather than to improve on a known policy.

Preliminaries and definitions
The framework for $E^4$ is based on the CMDP task-modelling framework. In CMDPs, at each time step, an agent receives a state, performs an action, and receives a reward and a constraint-cost; the goal of the agent is to maximise the long-term cumulative reward whilst not exceeding a pre-defined budget of constraint-cost. Formally, a CMDP is a tuple $\langle S, A, P^*, r, c, d, \gamma \rangle$, where $S$ is the finite state space of size $S$; $A$ is the finite action space of size $A$; $P^* : S \times A \to \Delta_S$ is the true transition dynamics of the environment, with $\Delta_S = \{p \in \mathbb{R}^S_{\geq 0} : p^T \mathbf{1} = 1\}$ being the probability simplex over $S$; $r(s,a)$ is the average reward obtained for performing action $a \in A$ in state $s \in S$; $c(s,a)$ is the average constraint-cost for performing action $a \in A$ in state $s \in S$; $d$ is the budget on the expected asymptotic constraint-cost (see below); and $\gamma \in [0,1)$ is the discount factor used for computing the long-term value and constraint-cost function based on the expected cumulative reward and constraint-cost, respectively.

Figure 1: Diagram illustrating $E^4$. $E^4$ repeatedly takes limited-step trajectories in either known or unknown states. (a) The "exploitation policy" has the aim of solving the CMDP within the known states. If from a known state it has near-optimal constrained limited-step performance, then this policy is applied. (b) From another state, the exploitation policy may not be near-optimal; in this case an "exploration policy" aims to reach the unknown states, in an attempt to make them known. (c) Once in the unknown states, a balanced wandering behaviour ensures that actions are taken equally often for each unknown state, thereby yielding reliable statistics for different actions. Once there is an observed future risk of constraint-violation, an "escape policy" goes back to the known states as quickly as possible. (d) At the final stages of the $E^4$ algorithm, nearly all states have been visited frequently with balanced action-visitations, making them known. This then allows finding a near-optimal exploitation policy for the full CMDP.

When at time $t$ the RL agent is presented with a state $s_t$, its objective within the CMDP is to find a policy $\pi : S \to A$ that maximises the expected asymptotic value, $V^\pi(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k}) \mid \pi\right]$, while the expected asymptotic constraint-cost, defined as $C^\pi(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k c(s_{t+k}, a_{t+k}) \mid \pi\right]$, must satisfy $C^\pi(s_t) \leq d$. Moreover, assumptions on the CMDP are three-fold. First, the reward distribution for a state-action pair $(s,a) \in S \times A$ has mean $r(s,a) \in [0, r_{\max}]$ and variance $\mathrm{Var}_r(s) \leq \mathrm{Var}^r_{\max}$. Second, the constraint-cost distribution for a state-action pair $(s,a) \in S \times A$ has mean $c(s,a) \in [0, c_{\max}]$ and variance $\mathrm{Var}_c(s) \leq \mathrm{Var}^c_{\max}$. Third, a general upper bound for the expected T-step value is denoted by $G^r_{\max}(T) = \sum_{t=0}^{\infty} \gamma^t r_{\max} = \frac{r_{\max}}{1-\gamma}$, while analogously the general upper bound for the expected T-step constraint-cost is denoted by $G^c_{\max}(T) = \sum_{t=0}^{\infty} \gamma^t c_{\max} = \frac{c_{\max}}{1-\gamma}$. In practice, the above-mentioned maximal variance, reward, value, etc. do not need to be known exactly; any upper bound suffices, although tighter bounds improve the results. Knowledge of $r_{\max}$ and $c_{\max}$ is a basic requirement since the other maxima can be upper-bounded from them. To give a few examples of when $r_{\max}$ and $c_{\max}$ can be known exactly or upper-bounded, one can consider applications where rewards or constraint-costs represent a limited resource, such as energy, food, or money, or a physical force, such as friction or torque. A further property of the CMDP being studied is that it has a limited "diameter". Completely analogous to the diameter for MDPs (Jaksch et al., 2010), we define the diameter of a CMDP $M$ as the maximal expected number of actions needed to move from one arbitrary state $s \in S$ to any other $s' \in S$ under the best possible stationary deterministic policy $\pi : S \to A$ for the choice of $(s, s')$, or $D(M) = \max_{s \neq s'} \min_{\pi} \mathbb{E}\left[W(s'|s, M, \pi)\right]$, where $W(s'|s, M, \pi)$ is the number of actions from $s$ to $s'$ given the model $M$ and policy $\pi$.
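For intuition on the diameter, the following sketch computes it for the special case of deterministic transition dynamics, where the expected number of actions between two states reduces to a shortest path and the minimisation over policies to a breadth-first search. The four-state ring is a made-up example.

```python
from collections import deque

def diameter_deterministic(succ):
    """Diameter of an MDP with deterministic transitions: the maximum over
    ordered pairs (s, s') of the length of the shortest action sequence
    from s to s', found by breadth-first search from every state.
    succ[s][a] is the successor of action a in state s; all states are
    assumed reachable from all others."""
    n = len(succ)
    diam = 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diam = max(diam, max(dist.values()))
    return diam

# Hypothetical 4-state ring: action 0 moves clockwise, action 1 stays put.
succ = [[1, 0], [2, 1], [3, 2], [0, 3]]
D = diameter_deterministic(succ)
```

For stochastic dynamics the diameter instead requires solving a stochastic shortest-path problem per state pair; the BFS special case is only meant to convey the definition.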
While the above functions span an infinite horizon, it is often of interest to compute their T-step approximation to make decisions based on limited data. The corresponding T-step approximations instead sum over $T$ steps and are denoted by $V^\pi(s_t, T)$ and $C^\pi(s_t, T)$, respectively. From the class of policies $\Pi$, the class of constrained policies for any state $s \in S$, a budget $d$, and a number of time steps $T$ is denoted as $\Pi_c(s, d, T) = \{\pi \in \Pi : C^\pi(s, T) \leq d\}$. Denoting $\Pi_c(s, d)$ for $T \to \infty$, $\Pi_c(s, d) \neq \emptyset$ by assumption, and therefore any other choice of $T$ also yields $\Pi_c(s, d, T) \neq \emptyset$. An optimal constrained policy for a T-step trajectory in $M$ is then defined as $\pi^* = \arg\max_{\pi \in \Pi_c(s,d,T)} V^\pi(s, T)$, where for the asymptotic case of course $T \to \infty$. In addition to the expected asymptotic constraint-cost, which computes the expectation over different possible trajectories, the analysis also uses a path-based constraint-cost and value, which compute the cost for a particular state-action trajectory and are denoted for a T-step path $p$ as $C(p) = \sum_{t=0}^{T-1} \gamma^t c(s_t, a_t)$ and $V(p) = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$, respectively. With $M$ being the CMDP to be solved, this paper takes a model-based approach in which an approximate model $\hat{M} = \langle S, A, \hat{P}, \hat{r}, \hat{c}, d', \gamma \rangle$ is continually improved by sampling from $M$, with hat notation emphasising that the model and its components are estimated by the sample mean, and where a potentially lower budget $d'$, $c_{\max} \leq d' \leq d$, is used when the model is induced over a subset of the full CMDP (see Section 3.2); the relation with $c_{\max}$ is an assumption of the algorithmic approach to ensure at least one action can be performed before potentially exceeding $d'$. For states that have not yet been visited frequently, we formulate an uncertainty set (or ambiguity set), $\mathcal{P}$, as the set of transition dynamics models consistent with the samples from (or prior knowledge on) the true but unknown transition dynamics, $P^*$.
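The path-based quantities are straightforward to compute; a minimal sketch with a hypothetical cost table and toy numbers:

```python
def path_cost(path, c, gamma):
    """Path-based discounted constraint-cost C(p) of a concrete T-step
    state-action path p = [(s_0, a_0), ..., (s_{T-1}, a_{T-1})]."""
    return sum(gamma**t * c[s][a] for t, (s, a) in enumerate(path))

c = [[0.0, 1.0], [0.5, 0.2]]          # hypothetical cost table c[s][a]
cost = path_cost([(0, 1), (1, 0)], c, gamma=0.5)
```

The expected asymptotic constraint-cost is the expectation of this quantity over trajectories as T grows to infinity.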
For example, one can define a confidence interval around $\hat{P}_{s,a}$, the estimated (also called "nominal") transition dynamics model, with a budget $\psi_{s,a}$ for state-action pair $(s,a)$: $\mathcal{P}_{s,a} = \{P \in \Delta_S : \|P - \hat{P}_{s,a}\|_1 \leq \psi_{s,a}\}$. Within the states that have not been frequently visited yet, the uncertainty set is then used to optimise the constraints robustly, that is, with $\max_{P \in \mathcal{P}} \hat{C}^\pi(s_t) \leq d$, where $\hat{C}^\pi$ indicates the estimate of the expected asymptotic constraint-cost, with more precise definitions and notations to follow.
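The inner maximisation of an expected cost over such an uncertainty set has a simple solution when the set is an L1-ball: shift up to $\psi/2$ of probability mass from the lowest-cost next states onto the highest-cost one, in the spirit of Nilim and El Ghaoui (2005). A sketch, assuming the L1-ball form of the uncertainty set:

```python
def worst_case_expectation(p_hat, v, psi):
    """Inner problem of robust optimisation over the L1-ball uncertainty set
    P = {p : ||p - p_hat||_1 <= psi}: maximise the expected cost p.v by
    moving up to psi/2 of probability mass onto the highest-cost next state
    and draining it from the lowest-cost ones."""
    order = sorted(range(len(p_hat)), key=lambda i: v[i])  # ascending cost
    p = list(p_hat)
    imax = order[-1]
    p[imax] = min(1.0, p_hat[imax] + psi / 2.0)
    excess = sum(p) - 1.0
    for i in order:                      # drain from cheapest states first
        if excess <= 1e-12:
            break
        if i == imax:
            continue
        take = min(p[i], excess)
        p[i] -= take
        excess -= take
    return sum(pi * vi for pi, vi in zip(p, v)), p

val, p = worst_case_expectation(p_hat=[0.5, 0.5], v=[0.0, 10.0], psi=0.4)
```

Repeating this per state-action pair inside a value-iteration backup yields a robust Bellman operator.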

Main theorem
With the preliminary definitions in mind, this section states and proves the main theorem, which postulates the existence of an algorithm, namely $E^4$, which explores a CMDP safely within the constraint-cost budget and finds a near-optimal constrained policy within polynomial time.
Theorem 1: There exists an algorithm ($E^4$) that outputs with probability $1 - \delta$ a near-optimal constrained policy $\pi$ for $s$ with $V^\pi(s) \geq V^{\pi^*}(s) - \epsilon$ and $C^\pi(s) \leq d$, with sample complexity and time complexity polynomial in $1/\epsilon$, $1/\delta$, $S$, the horizon time $1/(1-\gamma)$, $r_{\max}$, and $c_{\max}$. Moreover, the non-stationary policy induced by the exploration process of $E^4$ yields expected asymptotic constraint-cost $C^{\pi_n}(s_t) \leq d$ with probability at least $1 - (UA\delta_\psi + \delta)$.
The $E^4$ algorithm is based on the model-based CMDP framework presented in the previous section. The Constrained Simulation Lemma (Section 3.1) shows that a sufficiently accurate CMDP model of the reward function, constraint-cost function, and transition dynamics allows, with high probability, an $\epsilon$-correct value function and constraint-cost function approximation. Moreover, it is sufficient to visit each state-action pair a number of times that is polynomial in $S$, $T$, $1/(1-\gamma)$, $r_{\max}$, $1/\epsilon$, and $1/\delta$ to obtain such an accurate CMDP model. This leads to a natural definition of "known states", which have been visited often enough for each action and thereby have accurate models, and "unknown states", which have not been visited often enough. Each of the two cases can then be treated as a different induced CMDP (see Section 3.2). Therefore, when starting a limited T-step trajectory from a state $s \in S$, the agent's strategy for known states differs from that in the unknown states.
Starting from a known state, the agent may already have a near-optimal policy available. If so, then this policy is best exploited for the following T-step trajectory. If not, then the model must be further improved by exploring unknown states. This is formalised by the l-safe Explore-or-Exploit Lemma (Section 3.3). This lemma shows that for policies satisfying a given constraint budget within the known states, there either exists an "exploitation policy" which is a near-optimal constrained policy, or there exists an "exploration policy" which with high probability finds an unknown state. The l-safe Explore-or-Exploit Lemma then shows how to define a correction to the original budget to account for the approximation error of the model, resulting in a high-probability guarantee for satisfaction of the original constraint across the known states.
When exploring unknown states, the agent must attempt to make these states known by performing many state-action visitations. A key challenge is that this risks exceeding the constraint-cost budget because the agent has little knowledge of the unknown states. With worst-case assumptions on the transition dynamics and constraint-costs, the Safe Escape Lemma (Section 3.4) provides a high-probability guarantee for an optimised "escape policy" to find a path from the unknown states to the known states whilst satisfying the constraint-cost budget defined in the unknown states. Since the escape policy will not select each action equally frequently from a given state, it provides no guarantee on making states known. To safely gain as much knowledge of the unknown states as possible, $E^4$ performs a balanced wandering behaviour, which selects the least-taken action for a given state, as long as timely escape is ensured by the escape policy, as formalised by the Safe Balanced Wandering Lemma (Section 3.5).
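Balanced wandering itself is a one-line rule: in an unknown state, take the action tried least often so far. A minimal sketch (the visitation-count table is hypothetical):

```python
def balanced_wandering_action(counts, s):
    """Balanced wandering: in an unknown state s, take the action that has
    been tried least often so far (ties broken by the lowest action index),
    so that all actions of s approach the known-state threshold together."""
    return min(range(len(counts[s])), key=lambda a: (counts[s][a], a))

counts = [[3, 1, 2]]                      # hypothetical visitation counts n(s, a)
a = balanced_wandering_action(counts, 0)
```

In the full algorithm this rule is overridden by the escape policy as soon as continuing to wander risks exceeding the unknown-state budget.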
To integrate the results for the known and unknown states, the Escape Budget Lemma (Section 3.6) shows how to set the budget for the unknown states, $d''$, to satisfy the budget on the full CMDP ($d$) given a budget within the known states, $d'$; this also helps to define the conditions for constraint-satisfiability. The remainder of the section (Section 3.8) then puts all the lemmas together and demonstrates the validity of Theorem 1.

Constrained Simulation Lemma
Since the model is estimated based on the sample mean, repeatedly visiting the same state allows estimating the transition dynamics, the reward function, and the constraint-cost function. Therefore, the true CMDP $M$ can be approximated by a simulation model $\hat{M}$ given a sufficient number of samples. If the simulation model is sufficiently accurate, the expected asymptotic value and constraint-cost of the CMDP can be approximated with high accuracy, a finding formalised by the Constrained Simulation Lemma (Lemma 2). The set of "known states" is then defined as those states which have sufficient visitations and are therefore approximated correctly.
The Constrained Simulation Lemma relies on the relation between the expected asymptotic and T-step value and constraint-cost. Specifically, the following lemma shows that for sufficiently large $T$, the expected asymptotic value and constraint-cost are $\epsilon$-close to their T-step counterparts.
Lemma 1 (Constrained T-step Estimation Lemma): Let $M$ be the CMDP, let $\pi$ be a policy in $M$, and let $T \geq \frac{1}{1-\gamma} \ln\left(\frac{\max(r_{\max}, c_{\max})}{\epsilon(1-\gamma)}\right)$. Then for any state $s \in S$ we have (a) $V^\pi(s, T) \leq V^\pi(s) \leq V^\pi(s, T) + \epsilon$; and (b) $C^\pi(s, T) \leq C^\pi(s) \leq C^\pi(s, T) + \epsilon$.

Proof of (a): Define the MDP $M^- = \langle S, A, r, \gamma, P \rangle$ with the same state space, action space, reward function, discount factor, and transition dynamics as $M$. Let $\pi$ be any policy in $M^-$. The result follows directly from the original T-step Estimation Lemma (see Lemma 2 in Kearns and Singh (2002)), as this holds for any policy in any MDP. In short, the first inequality $V^\pi(s, T) \leq V^\pi(s)$ follows from rewards being non-negative and the expected asymptotic value additionally considering time steps $t = T+1, \dots, \infty$; the second inequality follows after requiring $\gamma^T \frac{r_{\max}}{1-\gamma} \leq \epsilon$ (the maximal remaining reward not accounted for after time step $T$) and then solving for $T$.

Proof of (b): The proof is completely analogous, as the transition dynamics and policy are the same while the constraint-cost function is analogously bounded by $c_{\max}$.

The basic reasoning behind the Constrained Simulation Lemma is that, if an estimator $\hat{M}$ of the true CMDP $M$ is sufficiently accurate, then $\hat{M}$ can be used as a simulation model to simulate realistic trajectories of $M$; with offline optimisation algorithms, this can then yield nearly correct estimates of the expected asymptotic value and the expected asymptotic constraint-cost. To develop this reasoning, a suitable definition for the accuracy of such a simulation model is formalised below.
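The bound on $T$ can be computed directly; the helper below is a sketch that returns the smallest integer $T$ satisfying the lemma's condition, using the paper's notation:

```python
import math

def epsilon_horizon(gamma, r_max, c_max, eps):
    """Smallest integer T satisfying the condition of the Constrained T-step
    Estimation Lemma:
    T >= (1 / (1 - gamma)) * ln(max(r_max, c_max) / (eps * (1 - gamma)))."""
    return math.ceil(math.log(max(r_max, c_max) / (eps * (1.0 - gamma)))
                     / (1.0 - gamma))

T = epsilon_horizon(gamma=0.9, r_max=1.0, c_max=1.0, eps=0.1)
```

Any such $T$ guarantees that the discounted tail beyond step $T$ contributes at most $\epsilon$ to the value and the constraint-cost.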
With this definition in mind, the Constrained Simulation Lemma states that as long as $T$ is chosen according to the $\epsilon$-horizon time of $M$ and the simulation $\hat{M}$ is $\alpha$-correct, the simulated value and constraint-costs are $\epsilon$-correct with respect to the true expected asymptotic value and constraint-cost. The proof extends the original Simulation Lemma (see Lemma 4 in Kearns and Singh (2002)) to include the constraint-costs. As it is extensive and otherwise analogous, it is provided in the appendix.
Having established the relation between the expected asymptotic constraint-cost and value on the one hand and the $\alpha$-approximation on the other hand, the following lemma provides the number of samples needed for an accurate estimate of the expected asymptotic constraint-cost and value with error of at most $\epsilon$.

Proof:
The proof is based on Hoeffding's inequality and Chernoff bounds (see Appendix A for details).
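For intuition, the Hoeffding part of such an argument gives a sample size of the familiar form $m \geq \frac{\mathrm{range}^2}{2\alpha^2}\ln(2/\delta)$ for estimating a bounded mean to accuracy $\alpha$. The sketch below shows this generic bound, not the paper's exact constant, which also involves Chernoff bounds and union bounds over states and actions:

```python
import math

def hoeffding_samples(value_range, alpha, delta):
    """Number of i.i.d. samples m such that, by Hoeffding's inequality, the
    sample mean of a [0, value_range]-bounded random variable deviates from
    its true mean by more than alpha with probability at most delta:
    m >= value_range**2 * ln(2 / delta) / (2 * alpha**2)."""
    return math.ceil(value_range**2 * math.log(2.0 / delta) / (2.0 * alpha**2))

m = hoeffding_samples(value_range=1.0, alpha=0.1, delta=0.05)
```

The required visitation count per state-action pair scales in the same way, with $\alpha$ set from the target error $\epsilon$ via the Simulation Lemma.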
Following the above lemma, the notion of "known states" can now be defined directly from Eq. 5. Intuitively, contrasting known versus unknown states makes it possible to distinguish regions for which the expected asymptotic value and constraint-cost are estimated correctly from regions for which they are not. This contrast between known and unknown states will be critical for providing safety guarantees for the RL algorithm later on.
Definition 2 (Known states $K$ and unknown states $U$): a state $s \in S$ is called known if $\forall a \in A : n(s,a) \geq m_{known}$, where $m_{known} \in \mathbb{N}$ is $O\left((STG/\epsilon)^4 \mathrm{Var}_{\max} \ln(1/\delta)\right)$ following Eq. 5. Alternatively, $s \in S$ is called unknown if $\exists a \in A : n(s,a) < m_{known}$. Known states are collectively referred to as $K = \{s \in S \mid \forall a \in A : n(s,a) \geq m_{known}\}$ while unknown states are referred to as $U = \{s \in S \mid \exists a \in A : n(s,a) < m_{known}\}$.
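Definition 2 translates directly into code: maintain visitation counts $n(s,a)$ and partition states by the threshold $m_{known}$. A minimal sketch with a hypothetical count table:

```python
def partition_states(counts, m_known):
    """Definition 2: a state is known when every action has been tried at
    least m_known times; otherwise it is unknown."""
    K = {s for s, cs in enumerate(counts) if all(n >= m_known for n in cs)}
    U = set(range(len(counts))) - K
    return K, U

counts = [[5, 5], [5, 2], [0, 0]]   # hypothetical n(s, a) table
K, U = partition_states(counts, m_known=5)
```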

Induced CMDPs
Since only the known states are modelled to a sufficient accuracy, it is useful to consider CMDP simulations over the known states only. The following defines the notion of an "induced CMDP", which limits the original CMDP to a subset of the original state space (e.g., the known states). Induced CMDPs yield two useful results. First, since an induced CMDP is itself a CMDP as in the Constrained Simulation Lemma, an analogous result holds for induced CMDPs over the set of known states (see Lemma 4); that is, an estimated induced CMDP is $\epsilon$-correct with probability at least $1 - \delta$ when compared to the true induced CMDP. Second, the expected T-step value and constraint-cost within induced CMDPs provide lower bounds for the expected T-step value and constraint-cost, respectively, in the original CMDP (see Lemma 5).
Definition 3: Given a CMDP $M$, define for a subset of states $S' \subset S$ an induced CMDP $M_{S'}$, which is equal to $M$ in all but the following respects:
• its state space is $S' \cup \{s_0\}$, where $s_0$ is a terminal state of the CMDP;
• transitions to states $s \in S \setminus S'$ are redirected to the terminal state $s_0$, which terminates the episode and yields terminal reward $r(s_0) = 0$ and terminal constraint-cost $c(s_0) = 0$;
• for each $(s, a) \in S' \times A$, rewards and constraint-costs always yield their means $r(s,a)$ and $c(s,a)$ deterministically, with zero variance.
When a CMDP is induced over the known states K (see Definition 2), the resulting CMDP is denoted as M K and is called the known-state CMDP of M .
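Definition 3 can be realised mechanically: re-index the kept states and redirect all outgoing probability mass that leaves the subset to a zero-reward, zero-cost absorbing terminal state. A sketch with a hypothetical two-state CMDP:

```python
def induce(P, r, c, keep):
    """Build the induced CMDP over the state subset `keep` (Definition 3):
    transitions leaving `keep` are redirected to an absorbing terminal
    state s0 (the last index) with zero reward and zero constraint-cost."""
    idx = {s: i for i, s in enumerate(sorted(keep))}
    n, term = len(idx), len(idx)            # terminal state index
    nA = len(P[0])
    P2 = [[[0.0] * (n + 1) for _ in range(nA)] for _ in range(n + 1)]
    r2 = [[0.0] * nA for _ in range(n + 1)]
    c2 = [[0.0] * nA for _ in range(n + 1)]
    for s in keep:
        for a in range(nA):
            r2[idx[s]][a], c2[idx[s]][a] = r[s][a], c[s][a]
            for s2, p in enumerate(P[s][a]):
                tgt = idx[s2] if s2 in keep else term
                P2[idx[s]][a][tgt] += p     # mass leaving `keep` goes to s0
    for a in range(nA):                     # terminal state absorbs
        P2[term][a][term] = 1.0
    return P2, r2, c2

# Hypothetical two-state CMDP, inducing over the subset {0}.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.0, 2.0]]
c = [[0.0, 1.0], [0.5, 0.2]]
P2, r2, c2 = induce(P, r, c, keep={0})
```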
In the following, whenever a value function, constraint-cost, or other quantity is computed in an induced CMDP $M_{S'}$, this will be denoted by the conditioning operator $\cdot\,|M_{S'}$; for example, $V^\pi(s|M_K)$ denotes the expected asymptotic value of $\pi$ in the known-state CMDP. For brevity and consistency with the earlier sections, when the conditioning is on the full CMDP of interest, $M$, the conditioning is omitted; for example, $V^\pi(s)$ denotes the expected asymptotic value of $\pi$ in the full CMDP of interest. Now the so-called Induced Simulation Lemma applies the Constrained Simulation Lemma to known-state CMDPs.
Lemma 4 (Induced Simulation Lemma): Let $M$ be the CMDP of interest, let $M_K$ be its known-state CMDP with estimate $\hat{M}_K$, and let $T \geq \frac{1}{1-\gamma}\ln\left(\frac{\max(r_{\max}, c_{\max})}{\epsilon(1-\gamma)}\right)$. Then, for any policy $\pi$ and any state $s \in K$, with probability $1 - \delta$, we have $|\hat{V}^\pi(s|\hat{M}_K) - V^\pi(s|M_K)| \leq \epsilon$ and $|\hat{C}^\pi(s|\hat{M}_K) - C^\pi(s|M_K)| \leq \epsilon$.

Proof:
The proof follows directly from the Constrained Simulation Lemma (see Lemma 2) and the Known State Lemma (see Lemma 3).
As one is interested in what happens in the "real world", the lemma below relates the induced CMDP to the original CMDP $M$.

Lemma 5: Let $M_{S'}$ be the CMDP induced over a subset $S' \subset S$, let $\pi$ be a policy in $M$, and let $s \in S'$. Then $V^\pi(s, T|M_{S'}) \leq V^\pi(s, T)$ and $C^\pi(s, T|M_{S'}) \leq C^\pi(s, T)$.

Proof: Let $\pi$ be a policy in $M$ and let $s \in S'$. Then we have, analogously to the original lemma, the following results, because $\pi$ stops at a time $T_{stop} \leq T$ and rewards and constraint-costs are non-negative. To illustrate, for the value function we have $V^\pi(s, T|M_{S'}) = \mathbb{E}\left[\sum_{t=0}^{T_{stop}-1} \gamma^t r(s_t, a_t)\right] \leq \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)\right] = V^\pi(s, T)$, due to rewards being non-negative and the paths being terminated earlier, at a time $T_{stop} \leq T$, upon transitioning into $S \setminus S'$ (i.e., reaching the terminal state $s_0$). The result for the constraint-cost is completely analogous.

l-safe Explore-or-Exploit lemma
Following the definition of induced CMDPs, this section provides results for exploration and exploitation within the known states. For any induced CMDP with a satisfiable constraint-cost budget, the Constrained Explore-or-Exploit Lemma (Lemma 6) proves the existence of either an exploitation policy, which solves the CMDP within the known states near-optimally, or an exploration policy, which gives a high-probability guarantee of visiting an "unknown state". This enables the l-safe Explore-or-Exploit Lemma (Lemma 7), which provides safety, i.e. guarantees on constraint-satisfaction, within the known states by defining a lower budget $l$ that accounts for worst-case modelling errors (hence the term l-safe). The budget $l$ is obtained by subtracting the model errors of Lemma 1 and Lemma 4 from the budget used in the Constrained Explore-or-Exploit Lemma.
Based on defining an induced CMDP over a subset of states $S'$ (e.g. the known states, in which case $S' = K$), the Constrained Explore-or-Exploit Lemma proves either the existence of an optimal constrained policy on the full CMDP or of a constrained policy that finds a state not in $S'$ within $T$ steps with high probability.
Lemma 6 (Constrained Explore-or-Exploit Lemma): Let $M$ be any CMDP, let $S'$ be any subset of states $S' \subset S$, and let $M_{S'}$ be the induced CMDP over $S'$ with a given budget $d$. For any $s \in S'$, any $T$, and any $\epsilon \geq 0$, either a) there exists a policy $\pi$ in $M_{S'}$ for which $V^\pi(s, T|M_{S'}) \geq V^{\pi^*}(s, T) - \epsilon$, where $\pi^* = \arg\max_{\pi \in \Pi_c(s,d,T)} V^\pi(s, T)$ is the optimal constrained T-step policy, and which satisfies $C^\pi(s, T|M_{S'}) \leq d$; or b) there exists a policy $\pi$ in $M_{S'}$ which reaches the terminal state $s_0$ (corresponding to $S \setminus S'$) in at most $T$ steps with probability $p > \epsilon/G^r_{\max}(T)$, and which satisfies $C^\pi(s, T|M_{S'}) \leq d$.

Proof:
The proof and the lemma are analogous to the original Explore-or-Exploit Lemma, except that the optimal constrained policy, rather than the optimal policy, is considered (see Appendix B).

Now by integrating the Constrained Simulation Lemma and the Constrained
Explore-or-Exploit Lemma, a characterisation of the safety can be provided in terms of the probability of constraint-satisfaction during exploitation or exploitation in the known-state CMDP. This safety guarantee, based on the estimated known-state CMDP, is provided by the l-safe Explore-or-Exploit Lemma below.
Lemma 7 l-safe Explore-or-Exploit Lemma: Let T be at least the ε-horizon time. Furthermore, let M_K be the known-state CMDP and M̂_K its estimate. Then with probability 1 − δ, a policy π that satisfies Ĉ_π(s, T |M̂_K) ≤ l for s ∈ K will satisfy C_π(s|M_K) ≤ d.

Proof:
Due to the Induced Simulation Lemma and the Constrained T-step Estimation Lemma, we have for any state s ∈ S, with probability 1 − δ, that C_π(s|M_K) ≤ Ĉ_π(s, T |M̂_K) + 2ε ≤ l + 2ε = d.

Safe escape lemma
In contrast to the known states, any model of the unknown states is inaccurate, making it difficult to guarantee safe return to the known states before exceeding the constraint-cost budget. To overcome this challenge, we construct a "Worst-case Escape CMDP", a CMDP induced over the unknown states that rewards escaping back to the known states before exceeding a constraint-cost budget and makes worst-case assumptions on the constraint-cost. Additionally incorporating worst-case transition dynamics through the use of an uncertainty set, the resulting robust CMDP provides a probabilistic guarantee for safety within the unknown states. The proof relies on robust constraint-satisfiability, a condition that is further discussed in Section 4.3.
Definition 4 Worst-case Escape CMDP Given a CMDP M, the Worst-case Escape CMDP M_U induced on U is equivalent to the Induced CMDP on U except that c(s) = c_max for all s ∈ U.
Intuitively, the policy that optimises a Worst-case Escape CMDP escapes to the known states before violating the constraint-cost budget under worst-case assumptions.
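To make the construction concrete, the induced Worst-case Escape CMDP can be sketched for a tabular setting; the data structures and names below are illustrative assumptions, not the paper's implementation:

```python
def make_worst_case_escape_cmdp(states, known, reward, c_max):
    """Build the induced Worst-case Escape CMDP on U = states \\ known.

    Entering any known state leads to the absorbing terminal state s0
    (reward 0, cost 0); every unknown state is assigned the worst-case
    constraint-cost c_max, regardless of its true (unknown) cost.
    """
    unknown = set(states) - set(known)

    def r(s):
        return reward(s) if s in unknown else 0.0   # terminal: r(s0) = 0

    def c(s):
        return c_max if s in unknown else 0.0       # worst case on U

    return unknown, r, c
```

An escape policy optimised on this CMDP is rewarded for reaching the terminal state (a known state) while its predicted constraint-cost pessimistically assumes c_max per step spent in U.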
The following result shows that a policy that is optimised robustly on an estimated Worst-case Escape CMDP will be able to escape the unknown states with high probability while also satisfying the constraint-cost budget within both the (non-estimated) Worst-case Escape CMDP and the full CMDP.
Lemma 8 Worst-case Escape Lemma Let M be the full CMDP, M_U be the Worst-case Escape CMDP on U, and M̂_U its estimate. Let d′ be the constraint-cost budget within the Worst-case Escape CMDP M_U. Let T′ = inf{T ∈ N : there exists a policy π with max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′} and T ≥ T′. Moreover, let P : S × A → Δ_S be an uncertainty set, such that for each state-action pair (s, a) ∈ S × A, the true transition dynamics model P*_{s,a} is contained in P_{s,a} with probability at least 1 − δ_ψ. If π is a policy in M̂_U that satisfies the constraint of the Worst-case Escape CMDP robustly, that is, max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′ for the initial state s ∈ U, then with probability at least 1 − UAδ_ψ, π in M_U meets the constraint-cost budget C_{P*,π}(s|M_U) ≤ d′ and takes at most T′ steps in U.

Proof:
Let π be a policy in M̂_U that satisfies the constraint on a state s ∈ U robustly, i.e. max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′. Due to the union bound, we have with probability at least 1 − UAδ_ψ that P* ∈ P (i.e. the true transition dynamics model is contained in the uncertainty set), where U = |U|. Therefore, with constraint-cost estimates equal to their true values c_max for all steps taken in the worst-case CMDP, we have with probability at least 1 − UAδ_ψ that C_{P*,π}(s, T |M_U) ≤ max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′. Since the budget is not exceeded and T′ is the infimum of the set {T ∈ N : there exists a policy π with max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′}, the policy takes at most T′ steps in U.
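Lemma 8 leaves open how to construct P_{s,a}. One common and simple choice, used here purely as an illustrative assumption rather than the paper's prescription, is an L1 ball around the empirical transition estimate whose radius comes from a Hoeffding-style concentration bound, combined with the union bound over all UA state-action pairs:

```python
import math

def l1_radius(n_visits, n_states, delta_psi):
    """Radius of an L1 uncertainty ball around the empirical transition
    estimate, via a Hoeffding-style concentration bound: with probability
    at least 1 - delta_psi, ||P_hat - P*||_1 <= radius."""
    if n_visits == 0:
        return 2.0  # vacuous: the L1 diameter of the probability simplex
    return math.sqrt((2.0 * n_states * math.log(2.0)
                      + 2.0 * math.log(1.0 / delta_psi)) / n_visits)

def failure_prob(n_unknown, n_actions, delta_psi):
    """Union bound over all (s, a) pairs in U x A, as in Lemma 8:
    all sets contain the truth with probability >= 1 - U*A*delta_psi."""
    return n_unknown * n_actions * delta_psi
```

The radius shrinks as O(1/sqrt(n)) with the visit count, so balanced wandering directly tightens the uncertainty set for the least-visited actions.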

Balanced wandering
To be able to make an unknown state known, each action must be performed often enough, following the Known State Lemma (Lemma 3). In E 3 , this is achieved through balanced wandering.
Definition 5 For a given state s ∈ S, balanced wandering takes the action that has been performed the least often, a* = arg min_{a∈A} n(s, a).
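Balanced wandering amounts to a one-line action-selection rule over visit counts; a minimal sketch, where the count table `n` is a hypothetical name for illustration:

```python
from collections import defaultdict

n = defaultdict(int)  # n[(s, a)]: visit count per state-action pair

def balanced_wander(s, actions):
    """Balanced wandering: in state s, take the least-tried action,
    a* = argmin_a n(s, a), breaking ties by action order."""
    return min(actions, key=lambda a: n[(s, a)])
```

After each transition the agent increments `n[(s, a)]`, so repeated visits spread experience evenly over the actions, as required by the Known State Lemma.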
In E^4, taking many arbitrary actions with balanced wandering may exceed the constraint-cost budget. How can the agent perform balanced wandering while still achieving a safe escape? The answer is to perform balanced wandering for as long as the predicted constraint-cost of subsequently following the escape policy allows doing so, as formalised by the Safe Balanced Wandering Lemma.
Lemma 9 Safe Balanced Wandering Lemma Let M_U be the Worst-case Escape CMDP on U, and M̂_U its estimate. Let d′ be the constraint-cost budget within the Worst-case Escape CMDP M_U. Let T′ = inf{T ∈ N : there exists a policy π with max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′} and T ≥ T′. Let P : S × A → Δ_S be an uncertainty set, such that for each state-action pair (s, a) ∈ S × A, the true transition dynamics model P*_{s,a} is contained in P_{s,a} with probability at least 1 − δ_ψ. Let π be a policy in M̂_U that satisfies the constraint of the Worst-case Escape CMDP robustly, that is, max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′ for the initial state s ∈ U. Let µ be a strategy that performs balanced wandering but switches to policy π at the first time t for which C(p_t) + γ^t max_{P∈P} max_{a_t∈A} Σ_{s_{t+1}∈S} P_{s_t,a_t}(s_{t+1}) Ĉ_{P,π}(s_{t+1}, T |M̂_U) ≥ d′ − c_max, where p_t = {s, s_1, . . . , s_{t−1}} is the t-step path taken so far from s. Then with probability at least 1 − UAδ_ψ, µ is safe, in the sense that C_{P*,µ}(s, T |M_U) ≤ d′.
Proof: Let t ∈ {0, . . . , T − 1} be the first step for which the switching condition holds. Since one step expends at most c_max and the condition did not hold at step t − 1, it follows that C(p_t) plus the worst-case predicted escape cost under π remains at most d′, and therefore, by Lemma 8, with probability 1 − UAδ_ψ the constraint-cost accumulated along p_t and the subsequent escape under π stays within d′. Since this holds for any path p_t from s, it also holds in expectation over t-step paths from s, such that with probability 1 − UAδ_ψ, C_{P*,µ}(s, T |M_U) ≤ d′.
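The switching rule of µ can be sketched as follows; `worst_case_escape_cost` is a hypothetical helper (an assumption of this sketch) returning the robust predicted escape cost max_{P∈P} Σ_{s′} P_{s,a}(s′) Ĉ_{P,π}(s′, T |M̂_U) after taking action a in state s:

```python
def safe_balanced_wandering_step(t, path_cost, s, actions, gamma,
                                 d_prime, c_max, worst_case_escape_cost):
    """One decision of the strategy mu from Lemma 9 (illustrative sketch).

    Keep wandering while the cost of the path so far plus the discounted,
    robust predicted cost of escaping after one more (worst) action stays
    below d' - c_max; otherwise switch to the escape policy."""
    predicted = path_cost + gamma ** t * max(
        worst_case_escape_cost(s, a) for a in actions)
    if predicted >= d_prime - c_max:
        return 'escape'
    return 'wander'
```

The c_max margin reserves budget for the one further step that may be taken before the switch becomes effective.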

Escape budget lemma
Having introduced how to use induced CMDPs for safety within known and unknown states, this section now turns to the full CMDP M: that is, when following T-step policies on induced CMDPs M_K and M_U, does this still guarantee safety across the entire CMDP? To resolve this question, the following lemma, called the Escape Budget Lemma, formulates an "escape budget" d′ ≤ d, the highest constraint-cost budget in M_U that still provides a probabilistic safety guarantee over the entire CMDP M. First follows a definition of safe return states, which are states from which a T-step trajectory is known to stay in K when following some policy π that satisfies a much lower budget but has no requirements on the value (and therefore typically yields a poor value).
Definition 6 A d_s-safe return state is a known state s ∈ K for which there exists a stationary policy π_{d_s} : S → A that takes at least T steps from s in K with probability 1 − δ and that satisfies C_{π_{d_s}}(s, T |γ = 1) ≤ d_s − ε. This policy π_{d_s} is called a d_s-safe return policy and the set of such states is denoted by K_{d_s}.
While the CMDP of interest, M, has γ ∈ [0, 1), the above definition is based on γ = 1 to account for the worst-case impact of the safe return on future constraint-satisfaction. Since T is the ε-horizon time, this implies that any non-stationary policy π_n which first applies π_{d_s} for T steps and then follows any policy in M will yield C_{π_n}(s) ≤ d_s for any starting state s ∈ K_{d_s}.
The Escape Budget Lemma takes place in a setting where there is a cycle alternating between three policies:
• an exploration policy in M_K, which aims to find an unknown state;
• a safe balanced wandering policy in M_U, which performs balanced wandering followed by an escape back to the known states, in an attempt to make states known; and
• a safe return policy in M_K, which acts for T steps within K with low expected T-step constraint-cost, to ensure long-term constraint-satisfaction despite the constraint-costs of exploration attempts.
This alternation defines a non-stationary policy π = {π_i}_{i=0}^∞, where i indicates the particular policy being used: i = 0, 3, 6, . . . index exploration policies, i = 1, 4, 7, . . . index safe balanced wandering policies, and i = 2, 5, 8, . . . index safe return policies. Within this setting, the lemma determines the available escape budget based on the initial path cost in the known states, the worst-case expected cost in the unknown states (d′), and the worst-case expected asymptotic constraint-cost of the safe return policy (d_s).
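The indexing convention can be made concrete with a small helper (illustrative only):

```python
def active_policy(i):
    """Phase i of the non-stationary policy pi = {pi_i}: exploration for
    i = 0, 3, 6, ..., safe balanced wandering for i = 1, 4, 7, ...,
    and d_s-safe return for i = 2, 5, 8, ..."""
    return ('explore', 'wander', 'return')[i % 3]

def escape_visitation(i):
    """Index of the escape budget d_{i//3} in effect at phase i, i.e. the
    number of completed visitations to M_U before this phase."""
    return i // 3
```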
Lemma 10 Escape Budget Lemma Let M be the full CMDP, M_K be the known-state CMDP, M_U be the Worst-case Escape CMDP, and let K_{d_s} = K for some level d_s ≤ d. Let p be a path from s ∈ K to a starting state s′ ∈ U. Let d′ ≤ d − ε be the budget in M_K and d_i be the "escape" budget in M_U at the i'th visitation to M_U. Let π = {π_i}_{i=0}^∞ be a non-stationary policy, defined on M_K with expected asymptotic cost C_{π_i}(s|M_K) ≤ d′ for i = 0, 3, 6, . . ., on M_U with expected asymptotic cost C_{π_i}(s|M_U) ≤ d_{i//3} for i = 1, 4, 7, . . ., and on M_K as a d_s-safe return policy for i = 2, 5, 8, . . . Define the following requirements: (a) Diameter requirement: the diameter satisfies D^+ ≤ T_min ≤ T_{i//3}; (b) Known-state requirements: a bound relating the cost of any recent path p from M_K to an unknown state to the available budget; and (c), (d): upper bounds involving the escape budget d_{i//3} and the safe-return level d_s. Then these requirements are not conflicting and imply C_π(s_t) ≤ d for all s_t ∈ S at all times t. Moreover, the budget allows at least one time step of balanced wandering before escape.

Proof:
The proof will first show that the requirements (a), (b), (c), and (d) do not conflict with each other. Then, the proof selects an arbitrary known state s_t at any time t and shows how the requirements yield C_π(s_t) ≤ d. Next, the proof analogously selects an unknown state s_t at any time t and shows how the requirements yield C_π(s_t) ≤ d. Finally, the proof shows that at least one time step of balanced wandering is allowed due to the definition of D^+.
0) Requirements are not conflicting. Requirements (b) and (c) are not conflicting with each other: since they are both upper bounds, one can set d_{i//3} to satisfy both; it remains to show that (a) does not conflict with this setting of d_{i//3}. First, note that the worst case, where the discount factor is γ = 1 and where d_s − ε + d′ is expended in the known states and d_s is expended from a safe return state, yields the lower bound for d_{i//3}, which is consistent with requirement (a).
1) Known states. Let s ∈ K. If the agent is at the starting state of its d_s-safe return, then the proof is finished by definition of d_s ≤ d. Otherwise, the agent is either somewhere along the trajectory of the d_s-safe return policy or along the trajectory of the exploration policy. In both cases, the agent forms a path p_k from s to a starting unknown state s_u ∈ U, then follows a safe balanced wandering policy that escapes to a d_s-safe return state s_k with probability 1 − UAδ_ψ, and then from s_k takes T steps using a d_s-safe return policy. Denoting the length of the path p_k as T_k, the cost of the path in the known states as C(p_k), the cost of the path in the unknown states as C(p_u), and the expected asymptotic constraint-cost from the safe return state as C_π(s_k), the statement to prove is that these contributions together remain within d. By definition of the diameter, requirement (a) ensures that there exists an escape policy π which satisfies C_π(s|M_U) ≤ d_{i//3} from any s ∈ U.
Therefore, setting π_i = µ, where µ is a safe balanced wandering policy for the budget d_{i//3}, yields the required bound, where the last step follows from requirement (b). Since the inequality holds for any path p_k, it also holds in expectation over paths.
2) Unknown states. Let s ∈ U. The agent forms a path p_u from s to a starting known state s_k ∈ K. Denoting the length of the path as T_u, the statement to prove follows from the cost of the path C(p_u) and the expected asymptotic cost over the terminal state in K. Note that T_u ≥ 1 and, via requirements (a) and (b), C_π(s|M_U) = C_{π_i}(s|M_U) ≤ d_{i//3} for some stationary policy π_i, and C_π(s_k) ≤ d_s. Further filling in requirement (c) yields the statement to prove.
3) At least one step of balanced wandering. Let j ∈ {0, 1, . . . }. As shown in part 0) of the proof, T_j ≥ D^+ = D(M) + 1. By definition of the diameter, an escape policy can return to the known states within at most D(M) time steps. Therefore, at least one step of balanced wandering can be performed without exceeding the budget d_j.
To guarantee C_π(s_t) ≤ d for all t in the current trajectory, it is required to perform the computation of requirement (b) over all known states recently visited and take the minimum, and then again the minimum between that result and the quantity in requirement (c). For such a computation, one can take the last T steps within K and subtract their constraint-cost for the computation of d′. Note that for unknown states, one only needs to perform (c) on starting states, because it assumes the worst case c_max for all coming T steps, whereas intermediate unknown states will only yield c_max for a smaller number of steps.
The Escape Budget Lemma hereby provides a general strategy for determining a desired escape budget. Below is an example of a high constraint-cost trajectory in K and the resulting budget d′. The trajectory shown is shorter than usual purely for demonstration purposes; any trajectory (except the initial i = 0 trajectory) would involve at least T steps in the known states, due to the d_s-safe policy taking T steps prior to the exploration policy. Following escape, the safe return policy will take T steps in the known states, where it yields C_π(s, T |γ = 1) ≤ d_s − ε = 4.0, before an exploration policy starts a new exploration attempt.

Simulated Budget Satisfaction Lemma
Previously, the l-safe Explore-or-Exploit Lemma (Lemma 7) showed that, if a policy yields (d − 2ε)-safety for the expected T-step constraint-cost in a simulation of the known-state CMDP, then it also yields d-safety for the expected asymptotic constraint-cost in the real known-state CMDP. The converse does not follow automatically: if a policy π yields d-safety for the expected asymptotic constraint-cost in the real known-state CMDP, then (d − 2ε)-safety is not guaranteed for the expected T-step constraint-cost in a simulation of the known-state CMDP, making it possible that the constraint cannot be satisfied in simulation. The following lemma provides conditions on the real-world CMDP under which the (d − 2ε)-safety constraint can be satisfied in simulation.
Lemma 11 Simulated Budget Satisfaction Lemma Let M be a full CMDP, M_K be a known-state CMDP induced over a set of known states K in M, s ∈ K, and let π be a policy that satisfies π ∈ Π_c(s, l − ε) over M, where l = d − 2ε. Further, let M̂_K be an estimate of M_K. Then π will satisfy Ĉ_π(s, T |M̂_K) ≤ l.

Proof:
Note that C_π(s|M_K) ≤ C_π(s) due to constraint-costs being positive. Therefore, by the Constrained T-step Estimation Lemma (Lemma 1) and the Constrained Simulation Lemma (Lemma 2), Ĉ_π(s, T |M̂_K) ≤ C_π(s, T |M_K) + ε ≤ C_π(s) + ε ≤ l − ε + ε = l. Thereby, Ĉ_π(s, T |M̂_K) ≤ l.

Proof of Theorem 1
With all the lemmas in place, the following presents the step-by-step proof of Theorem 1.
(a) Returning a near-optimal constrained policy within polynomial time. First, we prove the statement that for ε ≥ 0, E^4 outputs with probability at least 1 − δ a near-optimal constrained policy π for s, with V_π(s) ≥ V_{π*}(s) − ε and C_π(s) ≤ d, with sample complexity and time complexity polynomial in 1/ε, 1/δ, S, the horizon time 1/(1 − γ), r_max, and c_max. Select s ∈ S and ε ≥ 0 arbitrarily. Let m_known be O((STG/ε)^4 Var_max ln(1/δ′)), where δ′ = δ/4. Define T as the ε/4-horizon time. If s ∈ K, then by assumption Π_c(s, d, T) = {π ∈ Π : C_π(s, T) ≤ d} ≠ ∅. By Lemma 6, either there exists a near-optimal constraint-satisfying policy π ∈ Π_c(s, d, T) from s with V_π(s, T |M_K) ≥ V_{π*}(s, T) − ε/2, or there exists a policy π′ ∈ Π_c(s, d, T) that finds an unknown state with probability at least ε/(2G_{r_max}(T)), regardless of the choice of T and ε. If V̂_π(s, T |M̂_K) ≥ V_{π*}(s) − ε/2, then due to Lemma 5, Lemma 1, and Lemma 2, it follows with probability 1 − δ′ that the constraint-cost bound holds and that V_π(s, T |M_K) ≥ V̂_π(s, T |M̂_K) − ε/4 ≥ V_{π*}(s) − 3ε/4 ≥ V_{π*}(s, T) − 3ε/4, which implies offline optimisation will find an ε-optimal exploitation policy π, and the proof is finished in this case. Otherwise, if V̂_π(s, T |M̂_K) < V_{π*}(s) − ε/2, this implies there exists an exploration policy that can be found by E^4, since V_π(s, T |M_K) ≤ V̂_π(s, T |M̂_K) + ε/4 < V_{π*}(s) − 3ε/4 ≤ V_{π*}(s, T) − ε/2. E^4 starts an exploration attempt using π′, which takes a T-step trajectory to find an unknown state, based on the exploration known-state CMDP. Such attempts may fail repeatedly but have a success probability of at least ε/(2G_{r_max}(T)). Upon success, the algorithm performs balanced wandering in M_U. This cycle of attempted explorations is repeated and, due to the repeated visitation, states will become known after a number of visitations m_known = O((STG/ε)^4 Var_max log(1/δ′)).
In the worst case, the algorithm must make all the states known before finding a near-optimal constrained exploitation policy for s, requiring up to S·m_known steps of balanced wandering. Also in the worst case, due to T ≥ D(M) + 1 following the requirements of Lemma 10, there is only one step of balanced wandering per successful exploration attempt. With this in mind, a Chernoff bound analysis (see Appendix C) shows that, with probability 1 − δ′, the total number of exploration attempts before making all states known is bounded by O((G_{r_max}(T)/ε) S m_known ln(S/δ′)). Since each T-step trajectory takes at most T actions, the number of actions taken by E^4 before a near-optimal policy can be returned is bounded by O(T (G_{r_max}(T)/ε) S m_known log(S/δ′)).
Therefore, with T ≥ (1/(1 − γ)) ln(max(r_max, c_max)/(ε(1 − γ))) and m_known = O((STG/ε)^4 Var_max log(1/δ′)), the sample complexity to output a near-optimal constrained policy for state s ∈ S is polynomial in 1/ε, 1/δ, S, the horizon time 1/(1 − γ), r_max, and c_max. Since offline optimisation is repeated at every attempted exploration, the time complexity of E^4 is the above bound multiplied by Opt, where Opt refers to the time complexity of the offline optimisation. Given the above, all that remains to be shown is that 1) Opt is polynomial-time; and 2) offline optimisation converges to the global optimum. These statements are demonstrated for different offline optimisation algorithms in Section 4.1. Summing the three observed failure probabilities, the above results hold with probability at least 1 − δ.
(b) Satisfying constraints at all times. Next, we prove the statement that at any time t, the non-stationary policy π_n = {π_i}_i induced by the exploration process of E^4 yields C_{π_n}(s_t) ≤ d, as specified in the CMDP objective, with probability at least 1 − (UAδ_ψ + δ).
Select an arbitrary time point t in the RL agent's lifetime and observe the state s_t, let ε > 0, let d′ be the known-state budget, and let l = d − 2ε. Moreover, it is assumed that Π_c(s_t, l − ε, T) = {π ∈ Π : C_π(s_t, T) ≤ l − ε} ≠ ∅ and Π_c(s_t, d_s − ε, T) = {π ∈ Π : C_π(s_t, T) ≤ d_s − ε} ≠ ∅, and that the optimal value within Π_c(s_t, l − ε, T) is ε-close to the optimal value within Π_c(s_t, d, T). If s_t ∈ K, then either the agent is performing a d_s-safe return or the agent is doing exploration/exploitation. If the agent is at the starting state of its d_s-safe return, then the proof is finished by definition of d_s (see Definition 6). If not, the exploration or exploitation policy will be activated after fewer than T steps of d_s-safe return. The setting Π_c(s_t, l − ε, T) ≠ ∅ implies that the offline optimisation constraints are satisfiable at level l in M̂_K with probability 1 − δ′. By Lemma 7, any policy π with Ĉ_π(s_t, T |M̂_K) ≤ l yields C_π(s_t|M_K) ≤ d, as required. By Lemma 6, either the exploitation policy π ∈ Π_c(s_t, d′, T) exists or the exploration policy π ∈ Π_c(s_t, d′, T) exists (or both). By assumption, the optimal value within Π_c(s_t, d, T) is ε-close to that within Π_c(s_t, l − ε, T), and consequently to that within Π_c(s_t, d′, T) as well. Combining the above, either a) E^4 will be able to find a policy π ∈ Π_c(s_t, d′, T) that is near-optimal from s_t, or b) E^4 finds a policy π which has C_π(s_t, T |M̂_K) ≤ d′ and which, with probability p > ε/G_{r_max}(T), reaches an unknown state in T steps. If the agent stays within K, then the proof for the known states is complete, since then C_π(s_t) = C_π(s_t|M_K) ≤ d′ ≤ d.
Otherwise, by Lemma 10, the setting of d′ implies that the non-stationary policy π_n, which alternates between l-safe exploration/exploitation, d′-safe balanced wandering, and d_s-safe return, yields C_{π_n}(s_t) ≤ d, provided that the agent does not exceed the budget d′ in the unknown states (see the following paragraph).
If s_t ∈ U, then the E^4 agent will use the safe balanced wandering policy µ until returning to the known states. Let P be an uncertainty set such that 1) for all (s, a) ∈ S × A, P_{s,a} contains P*_{s,a} with probability at least 1 − δ_ψ, and 2) its worst-transition diameter is at most T − 1 (see Eq. 20). From the definition of T′ = inf{T ∈ N : there exists a policy π with max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′}, there exists a policy π (the escape policy) that satisfies max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′.

By Lemma 8, the above implies C_{P*,π}(s, T |M_U) ≤ max_{P∈P} Ĉ_{P,π}(s, T |M̂_U) ≤ d′ with probability at least 1 − UAδ_ψ. (Note that the last assumption of the previous paragraph is not needed for safety throughout exploration; instead, its purpose is to be able to return an ε-optimal exploitation policy despite the difference between d′ and d. If the assumption does not hold on the initial known-state budget d′, one potential way to realise it practically is to reset d′ to be close to d after most states have been made known, thereby returning the exploitation policy only when the known-state budget is sufficiently close to d.)
Now consider the trajectory within the unknown states. µ first performs steps of balanced wandering until taking one more step would not allow the escape policy π to return safely, i.e. as soon as C(p_t) + γ^t max_{P∈P} max_{a_t∈A} Σ_{s_{t+1}∈S} P_{s_t,a_t}(s_{t+1}) Ĉ_{P,π}(s_{t+1}, T |M̂_U) ≥ d′ − c_max. By the proof of Lemma 9, the total cost within the unknown states is at most d′. This constraint-satisfaction implies that C_{π_n}(s_t) ≤ d when exploring or exploiting from s_t ∈ K. Similarly, due to requirement (c) in Lemma 10, it follows that for s_t ∈ U, C_{π_n}(s_t) ≤ d. Further, the safe return policy may fail to stay within K with probability δ′. Therefore, combining all of the previous yields C_{π_n}(s_t) ≤ d with probability at least 1 − (UAδ_ψ + δ).

Explicit Explore, Exploit, or Escape algorithm
With the above theory in mind, E^4 is now further developed as a practical algorithmic framework, with a discussion of different prototype implementations. Within the main loop of the algorithm, three different policies are optimised: an exploration policy, an exploitation policy, and an escape policy. We discuss a range of offline methods to optimise these policies, including policy gradient, linear programming, and dynamic programming approaches (Section 4.1). The general flow of how these policies fit together is then discussed in Section 4.2, with the complete algorithm summarised in Algorithm 1. The algorithm and theory rely on two conditions for robust constraint-satisfiability, namely that the diameter of the CMDP must be limited and that tight uncertainty sets must be available which still capture the true transition dynamics with high probability. These conditions are discussed along with a variety of example uncertainty sets (Section 4.3).

Offline optimisation
The algorithm starts by performing offline optimisation of three different policies, the exploration policy, the exploitation policy, and the escape policy, each of which is defined over a different induced CMDP. The remainder of this subsection provides four possible approaches for offline optimisation in E^4, comparing their time complexity, applicability, and scalability. Using one of these algorithms fulfils the two remaining conditions on offline optimisation in Section 3, namely that 1) the algorithm is polynomial-time; and 2) the algorithm converges to the global optimum. Therefore, using these algorithms, E^4 yields near-optimal constrained policies in polynomial time.

Robust-constrained policy gradient
A first proposed approach to optimising these policies is based on an existing solution, namely the RCMDP policy gradient (Russel et al., 2020), in which distributional robustness is integrated into CMDPs. This is proposed for the Worst-case Escape CMDP but can optionally also be used for the known-state CMDPs (see Algorithm 1). As Russel et al. (2020) note, one can incorporate the arg max over either the value function or the constraint-cost, and they choose the value function; in E^4, safety is the main concern, and therefore robustness is incorporated by an arg max over the constraint-cost. The robust CMDP objective is solved by computing the saddle point of the Lagrangian for a given budget d, where M̂_S is an induced CMDP over S′ ⊂ S, P is the uncertainty set, and P̄ = arg max_{P∈P} Ĉ_{π_θ,P}(s, T |M̂_S). Substituting f(θ) = V̂_{π_θ,P̄}(s|M̂_S) and g(θ) = d − Ĉ_{π_θ,P̄}(s, T |M̂_S), the aim is to find the policy π_θ such that the gradient of the Lagrangian is a null-vector; that is, ∇_θ L(λ, π_θ) = ∇_θ f(θ) + λ∇_θ g(θ) = 0 and ∇_λ L(λ, π_θ) = g(θ) = 0.
To optimise the above objective, sampling of limited-step trajectories is repeated for a large number of independent iterations, starting from a randomly selected state in the subset of the state space over which the CMDP is induced (see Algorithm 2). Based on the large number of trajectories collected, one then performs gradient descent in λ and gradient ascent in θ. A benefit of this approach is its scalability to a high number of parameters. A common wisdom is that gradient descent methods do not perform well in multi-modal and non-convex landscapes, where they may get stuck in local optima. Similarly, the theory behind the RCMDP policy gradient only provides a convergence result for a local optimum (Russel et al., 2020; Russel, Benosman, Van Baar & Corcodel, 2021). However, with sufficient over-parametrisation (more precisely, with a number of neurons n_h polynomial in the number of samples n_ξ, the number of hidden layers l, and the inverse ν^{−1} of the maximal distance between data points), a neural network can learn the global optimum to arbitrary accuracy within a polynomial number of samples (Allen-Zhu, Li & Song, 2019). In addition to general results for the L2-norm, Allen-Zhu et al. (2019) provide similar results for arbitrary Lipschitz-continuous loss functions; one such result states that if the loss function is σ-gradient dominant (see Zhou and Liang (2017) for its definition), then gradient descent finds, with probability 1 − e^{−Ω(ln²(n_ξ))}, an ε-optimal parameter vector within a number of iterations I polynomial in n_h, l, ν^{−1}, σ^{−1}, and ε^{−1}. Per iteration, all operations have polynomial time complexity: look-ups, list appending, random number generation, gradient computation, division, and the forward pass of a neural network. Therefore, the resulting algorithm has overall polynomial time complexity and converges with high probability to the global optimum.
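A minimal numpy sketch of one saddle-point update on the Lagrangian L(λ, θ) = f(θ) + λ g(θ) with g(θ) = d − Ĉ; the learning rates and the way the gradients are estimated from worst-case-model rollouts are assumptions of this sketch, not the cited method's exact procedure:

```python
import numpy as np

def primal_dual_step(theta, lam, grad_value, grad_cost, budget,
                     cost_estimate, lr_theta=1e-2, lr_lam=1e-2):
    """One saddle-point update on L(lam, theta) = V(theta) + lam*(d - C(theta)).

    grad_value and grad_cost stand for sampled gradient estimates of V-hat
    and C-hat under the worst-case model in the uncertainty set (how they
    are estimated is outside this sketch). Gradient ascent in theta,
    descent in lam, with lam projected onto lam >= 0."""
    g = budget - cost_estimate                       # g(theta) = d - C-hat
    theta = theta + lr_theta * (grad_value - lam * grad_cost)
    lam = max(0.0, lam - lr_lam * g)
    return theta, lam
```

When the constraint is violated (Ĉ > d, so g < 0), the multiplier λ grows and increasingly penalises the constraint-cost in the policy update.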

Robust linear programming
A second and novel approach is a robust variant of linear programming (LP) for RCMDPs, extending the traditional LP framework of Altman (1998). Note that Zheng and Ratliff (2020) have previously used a "robust" version of LP in the context of UCRL, which estimates an upper confidence bound on the cost. This section proposes a similar variant of robust linear programming, with the key differences that 1) our version is discounted and non-episodic, for an extension of E^3, whereas their version is based on an undiscounted, episodic framework, for an extension of UCRL; 2) separate exploration, exploitation, and escape policies are optimised; and 3) the error term appears in the budget (right-hand side) as opposed to the constraint-cost, although this is merely a superficial difference. A downside of this approach is that, similar to Zheng and Ratliff (2020), the formulation does not account for uncertainty in the transition dynamics. The agent may therefore get stuck in unknown states for longer than anticipated.
The approach uses the occupation measure (Altman, 1999), which can be interpreted as the total proportion of discounted time spent in a particular state-action pair. The occupation measure allows formulating the asymptotic constraint-cost as a simple weighted sum of the immediate costs. For a CMDP M̂_S, a policy π, and a state-action pair (s, a), the T-step estimate of the occupation measure f_π(s, a) is the expected discounted number of visits to (s, a) within T steps, and a policy can be recovered from it by normalisation, π(a|s) ∝ f_π(s, a). Now the value and constraint-cost functions can be expressed in terms of f_π, namely V̂_π(s, T |M̂_S) = Σ_{(s,a)} f_π(s, a) r̂(s, a|M̂_S) and Ĉ_π(s, T |M̂_S) = Σ_{(s,a)} f_π(s, a) ĉ(s, a|M̂_S).
For the exploitation and exploration policies in E^4, this yields a linear programming problem over the occupation measure, maximising f^⊤ r̂ subject to the budget and flow-conservation constraints, where f, r̂ ∈ R^{SA}. For a Worst-case Escape CMDP, the robust LP is not recommended due to requiring known transition dynamics, but if the constraint-cost estimate is set to c_max, one may optimise the analogous linear programming problem. Various general polynomial-time algorithms exist for linear programming, including the ellipsoid method (Khachiyan, 1979) and the projective method (Karmarkar, 1984), which require O(n^6 L) and O(n^{3.5} L) operations, respectively, on O(L)-digit numbers. Following Karmarkar (1984), interior point methods have been further developed, providing ε-optimal guarantees with much improved time complexity (see Potra and Wright (2000) for a variety of algorithms).
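For a small tabular CMDP, the occupation-measure LP can be solved directly with an off-the-shelf solver. The sketch below uses `scipy.optimize.linprog` and the flow-conservation constraints of the standard discounted occupation-measure LP (Altman, 1999); it is a toy illustration, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, c, mu0, gamma, d):
    """Occupation-measure LP for a tabular CMDP (illustrative sketch).

    P[s, a, s']: transition probabilities; r, c: reward and constraint-cost
    per (s, a); mu0: initial state distribution; d: constraint budget.
    Solves  max_f f.r  s.t.  f >= 0,  f.c <= d,
            sum_a f(s',a) - gamma * sum_{s,a} P(s'|s,a) f(s,a) = mu0(s').
    Returns the occupation measure f and the induced policy pi(a|s)."""
    S, A, _ = P.shape
    # One flow-conservation equality row per state s'.
    A_eq = np.zeros((S, S * A))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] = float(sp == s) - gamma * P[s, a, sp]
    res = linprog(-r.ravel(),                      # maximise f.r
                  A_ub=c.ravel()[None, :], b_ub=[d],
                  A_eq=A_eq, b_eq=mu0, bounds=(0, None))
    f = res.x.reshape(S, A)
    pi = f / np.maximum(f.sum(axis=1, keepdims=True), 1e-12)
    return f, pi
```

For example, in a single-state CMDP with a rewarding-but-costly action and a free action, the solver spends exactly the budget d on the costly action and the rest of the discounted occupancy on the free one.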

Dynamic programming approaches
Dynamic programming (DP), as used in the original E^3, provides convergence to the optimal value and a time complexity of O(S²T). In E^4, DP cannot be directly applied due to the robust constrained setting, and there is currently no suitable algorithm for constrained-robust DP. We propose two possible approaches but leave a full analysis for future research. The first approach starts with constrained DP (Altman, 1998) and then incorporates the uncertainty set, while the second approach starts with robust DP (Iyengar, 2005; Nilim & Ghaoui, 2005) and incorporates constraints by reformulating the CMDP as a Lagrangian MDP (Taleghan & Dietterich, 2018).

Dual linear program
The technique by Altman (1998) goes as follows when applied to a CMDP M̂_S, which has transition dynamics P̂, constraint-cost function ĉ, and reward function r̂. Note that the solution to the unconstrained DP problem can be rewritten as an LP. For constrained DP, the value can be defined based on the min-max of the Lagrangian, where min and max can be interchanged as strong duality holds without requiring Slater's condition. Representing L(λ, π_θ) in terms of the immediate rewards and constraint-costs results in an LP. In E^4, the transition dynamics are uncertain; adding the constraints for all transition dynamics in the uncertainty set and using the estimators yields the corresponding robust LP.

Robust dynamic programming with Lagrangian MDP
Another proposed approach for dynamic programming in E^4 is to apply robust DP (Iyengar, 2005; Nilim & Ghaoui, 2005) to a Lagrangian MDP (Taleghan & Dietterich, 2018). Robust DP models the value using a worst-case transition P ∈ Δ_S over the values of the following states, V̂(·, T |M̂_S) ∈ R^S. To solve the "inner problem", max_{P∈P_{s,a}} P^⊤ V̂(·, T |M̂_S), one uses a bisection algorithm, yielding time complexity O(S log(G/ε_s)), where G is an upper bound on the value function and ε_s is the desired accuracy of the approximation. The overall problem (Eq. 19) gives ε-optimal guarantees in polynomial time complexity O(TS² log(1/ε)²), adding only an O(log(1/ε)²) factor compared to traditional DP. To make robust DP work for E^4, one can then construct a Lagrangian MDP, which redefines the reward function as a linear combination of the original reward function and the constraint-cost function, such that the value is similar to the Lagrangian in Eq. 9.
A downside of using robust DP is the requirement of (s, a)-rectangular uncertainty sets (see Section 4.3.2). More general uncertainty sets can be converted into (s, a)-rectangular format by projection onto a larger rectangular set (Nilim & Ghaoui, 2005), but this results in a higher upper bound on the worst-case constraint-cost.
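Nilim & Ghaoui's bisection targets likelihood-based uncertainty sets; for an (s, a)-rectangular L1 ball, the inner problem max_{P∈P_{s,a}} P^⊤ V̂ even has an exact greedy solution, sketched below as an alternative illustration (an assumption of this sketch, not the paper's algorithm):

```python
import numpy as np

def inner_worst_case(p_hat, v, radius):
    """Inner problem of robust DP for an L1 ball of size `radius` around
    the empirical estimate p_hat: max_{||P - p_hat||_1 <= radius} P^T v.

    Exact greedy solution: move up to radius/2 probability mass from the
    states with the lowest values of v onto the state with the highest."""
    k = int(np.argmax(v))
    p = p_hat.astype(float).copy()
    budget = min(radius / 2.0, 1.0 - p[k])  # mass that can move onto k
    p[k] += budget
    for i in np.argsort(v):                 # drain lowest-v states first
        if i == k:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
        if budget <= 1e-12:
            break
    return p
```

The result stays on the simplex and within L1 distance `radius` of p_hat, which is what makes the greedy exchange argument optimal here.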

Exploration, Exploitation and Escape policies
The three different kinds of policies are optimised over different CMDPs introduced earlier in Section 3. This section discusses how these policies are used, what they represent, and when they are activated.
Due to the Constrained Explore-or-Exploit Lemma (Lemma 6), for any satisfiable budget $l$ and any $\epsilon \ge 0$, from a starting state $s$, either a) there exists a policy $\pi \in M_S$ for which $V^\pi(s, T|M_K) \ge V^{\pi^*}(s, T) - \epsilon$, where $\pi^* = \arg\max_\pi V^\pi(s)$ s.t. $C^\pi(s) \le l$ for all $s \in S$; or b) there exists a policy $\pi$ in $M_K$ which reaches the terminal state $s_0$, representing $S \setminus K$, in at most $T$ steps with probability $p > \epsilon/G_{r_{\max}}(T)$. The estimated exploitation and exploration policies represent cases a) and b), respectively, performing $l$-safe exploitation and $l$-safe exploration. The exploitation policy is ideally activated whenever it is known that $\hat{V}^\pi(s|M_K) \ge V^{\pi^*}(s) - \epsilon$. In practice, this knowledge is often not available. However, one may use a strategy similar to what has been proposed for the exploitation and exploration policies in $E^3$ (Kearns & Singh, 2002); that is, one activates the exploration policy first and continues it as long as $p > \epsilon/G_{r_{\max}}(T)$ remains likely, and then, as soon as $p < \epsilon/G_{r_{\max}}(T)$ with high probability, one activates the exploitation policy, which is then guaranteed to exist by Lemma 6.
With the above in mind, the different policies and their corresponding induced CMDPs can now be discussed. First, the $l$-safe exploitation policy has the aim of solving a CMDP induced over the known states, $M_K$, differing from $M$ only in the sense that: 1) it terminates with $r(s_0) = 0$ and $c(s_0) = 0$ as soon as it reaches the terminal state $s_0$, which is when an unknown state in $U$ is entered; and 2) the allowed constraint-cost budget is more limited, namely it is set to $l = d - 2\epsilon$ following Lemma 7. Second, the $l$-safe exploration policy has the aim of solving a CMDP induced over the known states, $M'_K$. $M'_K$ differs from $M_K$ in the sense that it terminates with $r(s_0) = r_{\max}$ as soon as it reaches the terminal state $s_0$, which is when an unknown state in $U$ is entered, and that it receives $r(s) = 0$ for all $s \in K$. Third, the $d_u$-safe escape policy has the aim of solving a CMDP induced over the unknown states, $M_U$. It differs from $M$ in the sense that: a) it terminates with $r(s_0) = 0$ and $c(s_0) = 0$ as soon as it reaches the terminal state $s_0$, which is when a known state in $K$ is entered; and b) the allowed constraint-cost budget $d_u$ is set according to Lemma 10. Finally, to ensure constraint-satisfaction, Lemma 10 requires an additional $T$-step trajectory of safe return within the known states, a trajectory which yields a cumulative constraint-cost of at most $d_s - \epsilon$. Such a safe return policy can similarly be formulated as a CMDP (e.g. by rewarding 1 for known states and 0 for unknown states) but is often readily available from domain knowledge; for example, domains such as those in Example 2 include an action which makes the agent stay in a particular state or set of states for extended periods of time.

Before the main loop, $\epsilon > 0$ is chosen, and the $\epsilon$-horizon time $T \leftarrow \frac{1}{1-\gamma}\ln\left(\frac{\max(r_{\max}, c_{\max})}{\epsilon(1-\gamma)}\right)$ and safety budget $l \leftarrow d - 2\epsilon$ are chosen accordingly. At the start of each iteration of the main loop, exploration and exploitation policies are optimised. Via Lemma 7, $l$-safe exploration or exploitation policies will with high probability be constraint-satisfying despite the $T$-step estimated constraint-cost in $\hat{M}_K$ differing from the asymptotic true constraint-cost in $M_K$. Once an unknown state is visited, the unknown-state budget $d_u$ and the escape horizon are set according to Lemma 10. As long as the different policies satisfy their respective constraint-cost budgets within their induced CMDPs, $l$ for the exploration and exploitation policies and $d_u$ for the escape policy, the overall non-stationary policy applying these policies sequentially will satisfy the constraint-cost in the full CMDP $M$.
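The scheduling of the three policies described above can be sketched as follows. The names `choose_mode`, `p_reach_unknown`, and `eps_over_G` are hypothetical: they stand in for the estimated probability of reaching an unknown state under the exploration policy and the threshold $\epsilon/G_{r_{\max}}(T)$.

```python
def choose_mode(s, K, p_reach_unknown, eps_over_G):
    """Select which policy to run from state s, following the scheduling above:
    escape when outside the known set; explore while reaching an unknown state
    is still likely; exploit once p < eps / G_rmax(T) with high probability."""
    if s not in K:
        return "escape"          # optimise on the Worst-case Escape CMDP
    if p_reach_unknown > eps_over_G:
        return "explore"         # l-safe exploration on the exploration CMDP
    return "exploit"             # l-safe exploitation on the known-state CMDP
```

Each returned mode corresponds to offline robust-constrained optimisation on the respective induced CMDP before any action is taken in the environment.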

Robust constraint-satisfiability
For $E^4$ to work as intended, the Worst-case Escape CMDP must have robust constraint-satisfiability. That is, when selecting the worst-case transitions from an uncertainty set $\mathcal{P}$, there must be a policy $\pi$ that satisfies $\max_{P \in \mathcal{P}} \hat{C}^{P,\pi}(s, T|M_U) \le d_u$. This depends critically on two factors. First, constraint-satisfiability depends on the CMDP of interest. The parameter of the CMDP that we analyse here is the diameter; if this parameter is limited, then the constraints are satisfiable, both in simulation and in the real-world CMDP. Second, when performing offline optimisation, robust constraint-satisfiability further depends on the uncertainty set. If the uncertainty set is too broad or does not include the true transition dynamics, then constraint-satisfiability cannot be guaranteed except for trivial constraints (e.g. where every path returns to the known states before the budget $d_u$ is exceeded). Because the transition dynamics model is not known in the unknown states, we discuss how the uncertainty set can be formed based on prior knowledge and statistical inference.

The diameter of the CMDP
For constraint-satisfiability in the Worst-case Escape CMDP $M_U$, the diameter must satisfy $D(M_U) \le T_u$, where $T_u$ is the escape horizon set according to Lemma 10. In this case, the policy can escape back to $K$ within at most $T_u$ steps, before potentially exceeding the budget (in expectation). To ensure that at least one step of balanced wandering can be taken per attempted exploration, in line with the theory in Section 3 and Algorithm 1, the diameter must satisfy $D(M_U) \le T_u - 1$. The robust optimisation further implies that the worst-case diameter over the uncertainty set (which may be restricted to the entries $P_{s,a}$ with $s \in U$ for making the assumption) must also be at most $T_u - 1$. Note that this "worst-transition diameter" differs from the worst-case diameter of Garcelon et al. (2020) in that the worst case is taken over the uncertain transition dynamics rather than the policy, and that the best policy is taken rather than the worst-case policy.

[Algorithm 1: the $E^4$ algorithm (Explicit Explore, Exploit, or Escape for CMDPs). It requires a starting set of known states $K$, known-state budget $d$, safe return budget $d_s$, learning rate schedules $\eta_1$ and $\eta_2$, and $\epsilon > 0$. It starts from a random $s \in K$, defines $l \leftarrow d - 2\epsilon$, performs $l$-safe exploration or exploitation within the known states (sampling next states $s \sim P^*_{s,a}$, with deterministic cost and reward in the induced CMDP), and optimises the escape policy against worst-case transitions $s \sim P_{s,a}$ from the uncertainty set.]
The assumptions on the diameter provide a worst-case guarantee. In practice, many domains (e.g. Example 2) have properties related to reversibility, such that the number of steps taken in unknown states relates to the number of steps needed to escape back to the known states. In such cases, the diameter assumption can be significantly relaxed. In addition, more informative metrics than the diameter could potentially improve the budgetary requirements specified in Section 10.
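For intuition on the quantity being bounded, a deterministic simplification of the diameter can be computed by breadth-first search. This sketch assumes a deterministic `next_state` function (the true diameter $D(M_U)$ uses expected hitting times under the best policy for stochastic dynamics, which this does not capture):

```python
from collections import deque

def diameter_upper_bound(next_state, states, actions):
    """Diameter of a deterministic MDP: the largest, over ordered state pairs
    (s0, s'), of the fewest steps needed to reach s' from s0.
    Returns infinity if some state is unreachable."""
    dia = 0
    for s0 in states:
        dist = {s0: 0}
        queue = deque([s0])
        while queue:
            s = queue.popleft()
            for a in actions:
                t = next_state(s, a)
                if t not in dist:
                    dist[t] = dist[s] + 1
                    queue.append(t)
        if len(dist) < len(states):
            return float("inf")   # unreachable state: unbounded diameter
        dia = max(dia, max(dist.values()))
    return dia
```

In a reversible domain, this kind of bound on the unknown region would directly certify the requirement $D(M_U) \le T_u - 1$ discussed above.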

Uncertainty sets
For unknown states, no samples are given at the start of the algorithm, implying that uncertainty sets constructed from their state-action visitations are too large to provide safety guarantees and to make efficient use of exploration attempts. This section discusses how to construct narrow uncertainty sets to obtain safe and efficient behaviour within the unknown states. We present existing uncertainty sets, including the (s, a)-rectangular uncertainty sets (see e.g., Russel et al. (2020); Wiesemann et al. (2013)), and the factor matrix representation (Goyal & Grand-Clement, 2018), as well as the use of expert knowledge to provide tight uncertainty sets. A gridworld example is then given as an illustration (see Example 2).

(s,a)-rectangular sets
The most well-studied class of uncertainty set is the $(s,a)$-rectangular set, which defines a plausible interval for each $(s, a) \in S \times A$. An advantage of $(s,a)$-rectangular sets is that they provide various polynomial-time results for robust optimisation (Nilim & Ghaoui, 2005; Wiesemann et al., 2013). A simple example is the set based on the L1-norm (Russel et al., 2020), which defines $\mathcal{P}_{s,a} = \{P \in \Delta^S : ||P - \hat{P}_{s,a}||_1 \le \psi_{s,a}\}$ for all $(s, a) \in S \times A$. Traditionally, $(s,a)$-rectangular uncertainty sets are formed based on Hoeffding's inequality, defining the budget $\psi_{s,a} = \sqrt{\frac{2}{n(s,a)} \ln\left(\frac{SA\,2^S}{\delta_\psi}\right)}$ based on the number of visitations $n(s,a)$ of the state-action pairs, for failure rate $\delta_\psi$. This will unfortunately not work when the number of visitations is small (or even zero) in unknown states, since then the uncertainty set is prohibitively broad. However, a useful alternative for the unknown states is the Bayesian credible region (Russel & Petrik, 2019), which defines the uncertainty set by finding the tightest budget with low posterior belief of failure, $\psi^* = \min\left\{\psi \in \mathbb{R}_+ : \mathbb{P}\left(||P^*_{s,a} - \hat{P}_{s,a}||_1 > \psi_{s,a}\right) < \delta_\psi\right\}$.
The posterior $\mathbb{P}$ allows injecting prior knowledge via posterior sampling, either analytically (e.g. via a Dirichlet prior) or via simulation methods (e.g. Markov chain Monte Carlo sampling).
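A minimal sketch of the Bayesian credible region idea, assuming a symmetric Dirichlet prior and a sample-quantile approximation of the tightest budget (the helper name and defaults are hypothetical):

```python
import numpy as np

def credible_region_budget(counts, prior=1.0, delta=0.05, n_samples=2000, seed=0):
    """Draw transition vectors P from a Dirichlet posterior (visit counts plus
    a symmetric prior) and return the (1 - delta) quantile of ||P - P_hat||_1,
    i.e. an estimate of the tightest psi whose posterior failure probability
    stays below delta."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(counts, dtype=float) + prior
    p_hat = alpha / alpha.sum()                 # posterior-mean point estimate
    samples = rng.dirichlet(alpha, size=n_samples)
    l1 = np.abs(samples - p_hat).sum(axis=1)    # L1 deviation of each sample
    return float(np.quantile(l1, 1.0 - delta))
```

As expected, the budget shrinks as visitation counts grow, while states with few or no visits fall back on the prior and receive a broad (but still bounded) budget.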
r-rectangular sets based on factor matrix

More efficient than $(s,a)$-rectangular sets are r-rectangular sets based on a factor matrix representation, which allows different state-action pairs to be treated as correlated. An uncertainty set $\mathcal{P} \subset \mathbb{R}^{S \times A \times S}$ is generated using a factor matrix $W = \{w_1, \ldots, w_r\}$, a convex, compact subset of $\mathbb{R}^{r \times S}$, and a fixed set of coefficients $u^i_{s,a}$ for all $i \in \{1, \ldots, r\}$ and all $(s, a) \in S \times A$. Specifically, one has $P_{s,a} = \sum_{i=1}^r u^i_{s,a} w_i$ for all $(s, a) \in S \times A$. The resulting representation is flexible; for example, if $r = SA$ and $u^i_{s,a}$ is the indicator of $i = (s,a)$, one recovers the $(s,a)$-rectangular case. The factor matrix can be estimated via non-negative matrix factorisation (Xu & Yin, 2013), although currently this requires a relatively accurate estimate $\hat{P}_{s,a}$ (Goyal & Grand-Clement, 2018). In addition, factorisation has been used to provide $E^3$ with scalability to large or continuous state spaces (Henaff, 2019); as a result, factorisation methods seem promising for improving the scalability of $E^4$.
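A minimal numeric check of the factor-matrix representation (the `factored_transitions` helper is hypothetical, shown only to verify that convex combinations of factor distributions are valid transition rows):

```python
import numpy as np

def factored_transitions(U, W):
    """r-rectangular representation P_{s,a} = sum_i u^i_{s,a} w_i: rows of
    U (SA x r) are convex weights over r factors, rows of W (r x S) are
    distributions over next states, so every row of U @ W is itself a
    distribution over next states."""
    return np.asarray(U) @ np.asarray(W)
```

Varying the factors $w_i$ within their convex set moves all state-action rows together, which is exactly the correlation between state-action pairs that $(s,a)$-rectangular sets cannot express.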
Sets based on expert knowledge: action models and local inference

When a domain expert has a high-precision model for the transition dynamics of unknown states, much tighter uncertainty sets can be formed, which is especially useful in the early stages of the $E^4$ algorithm, when the visitations are too few to provide reliable statistics. As an illustration, we consider the case where an agent is located in a discrete state space organised along Cartesian coordinates and its available actions are moves within a local neighbourhood, as is typical in many robotics applications. The expert then formulates a set of probabilistic models, each of which is valid locally in a subset of the state space (e.g. due to the position in the landscape).
The expert first formulates $n$ action models $g_i(s, a)$, $i = 1, \ldots, n$, which output the "typical" next state following action $a$ in state $s$. Then, for each $i$, a probability distribution $P_{g_i,\tau,N}$ is formed by assigning high probability $1 - \tau$ to $s' = g_i(s, a)$ and probability $\tau/N$ to a local neighbourhood $N(s'|s, a)$ (e.g. all states $s_e \in S$ with $||s_e - s'||_2 \le 2$, representing plausible action outcomes), where $N$ is the size of the neighbourhood. One may additionally iterate the definition of $P_{g,\tau,N}(s, a, s')$ over different neighbourhoods and different error rates if these are uncertain. For simplicity, we illustrate this using only the error rate in Example 2; that is, we fix the neighbourhood and define $\mathcal{P}_{g_i} = \{P_{g_i,\tau,N} : \tau \le 0.1\}$.
The resulting uncertainty set, $\{\mathcal{P}_{g_1}, \ldots, \mathcal{P}_{g_n}\}$, is typically small when significant domain knowledge exists. However, the size of this uncertainty set can be further reduced by eliminating models with low probability. For this purpose, one can consider the probability of the recent path $p$ under model $g_i$, $\mathbb{P}_{P_{g_i}}(p)$. Alternatively, the expert defines a transfer probability $\rho(g_j|s, a, g_i)$, reflecting the probability of model $g_j$ being valid after taking action $a$ in state $s$ for which $g_i$ was valid. Note that the dependencies on $s$, $a$, and/or $g_i$ can be dropped in the case of statistical independence.
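The construction of $P_{g,\tau,N}$ above can be sketched as follows; `g` and `neighbourhood` are hypothetical callables standing in for an expert's action model and its set of plausible outcomes:

```python
def action_model_distribution(s, a, g, neighbourhood, tau):
    """P_{g,tau,N}: probability 1 - tau on the 'typical' next state g(s, a)
    and tau/N on each of the N other states in a local neighbourhood of
    plausible outcomes."""
    typical = g(s, a)
    others = [t for t in neighbourhood(s, a) if t != typical]
    dist = {t: tau / len(others) for t in others}
    dist[typical] = 1.0 - tau
    return dist
```

Sweeping `tau` over $[0, 0.1]$ then generates the uncertainty subset $\mathcal{P}_{g_i}$ used in Example 2.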
Example 2. Let $S$ be a discrete set of Cartesian xy-coordinates in a 10-by-10 gridworld ($S = 100$). Let $A$ be the set of moves within a Von Neumann neighbourhood, with actions {north, west, south, east} moving one step in the corresponding direction and the action stay remaining in the same cell ($A = 5$). The optimal constrained policy is to cycle around the bounds of the gridworld without hitting the wall. If any wall is hit, the constraint-cost is 1 and the agent remains in the same state; otherwise the constraint-cost is 0. Hitting the wall repeatedly in quick succession, yielding an asymptotic constraint-cost significantly larger than $d = 10$ (assuming $\gamma = 0.99$), is known to risk damaging the agent. Actions fail stochastically with probability $\tau = 0.05$. Initially, $K = \{(1, 1), (2, 1), \ldots, (9, 1)\}$ and $U = S \setminus K$ are the known and unknown states, respectively. The agent is aware of being initialised at the south-west wall.
The expert knows there are walls but only knows opposite ends are at least 5 steps away from each other. For the interior of the gridworld, $g_1(s, a) := co(s) + co(a)$, where $co$ gives the coordinates of the state and of the action (e.g. one step north is given by $(0, 1)$). For the north, west, south, and east bounds of the gridworld, respectively, the action models $i = 2, \ldots, 5$ make the corresponding action a null move (e.g., $g_2(s, \text{north}) := co(s)$). Each model is error-free, i.e. $\tau = 0$. For hitting the wall, the activation of the partial uncertainty set depends on the action and the previously valid $g_i$. For example, if $a = \text{east}$, then $\rho(g_5|a) = 1$ (east bound) except when $g_3$ (west bound) was previously active.
The uncertainty set is constructed from the uncertainty subsets, $\{\mathcal{P}_{g_1}, \ldots, \mathcal{P}_{g_5}\}$, where $\mathcal{P}_{g_i} = \{P_{g_i,\tau,N} : \tau \le 0.1\}$. These subsets are sufficient to model hitting the wall from the different bounds of the gridworld with the given error rate. For example, if $s = (10, 1) \in U$ and the next action is east, the worst-case transition within the uncertainty set is hitting the wall with 100% probability ($\tau = 0$), a scenario considered by action model $g_5$. For other actions, the worst-case transition within the uncertainty set is hitting the wall with 10% probability ($\tau = 0.10$). These worst cases over-estimate the constraint-cost when compared to the true $\tau = 0.05$, which would yield only a 5% probability.
The current budget for unknown states is $d_u = 5$, allowing for 5 steps in unknown states with $c_{\max} = 1$, even though the diameter of the full gridworld is significantly higher. From the first unknown state $(10, 1)$, the agent will behave as follows. Given the recent trajectory, the agent has the south and east bounds as active models, $\{g_4, g_5\}$, due to taking more than 5 actions east from the known south-west walls. After offline optimisation iterations on the Worst-case Escape CMDP with uncertainty set $\{\mathcal{P}_{g_4}, \mathcal{P}_{g_5}\}$, the agent performs balanced wandering for a few steps and then escapes back to the known state $(9, 1)$. The agent then performs $T$ actions of stay to ensure safe return.
After much exploration of the gridworld, eventually the agent makes all states known except a few at the north bound. Now the uncertainty set is narrowed down to $\mathcal{P}_{g_2} = \{P_{g_2,\tau,N} : \tau \le 0.1\}$ and, moreover, there are different routes to escape. Once the agent is optimised for this uncertainty set, it can therefore make much quicker progress with balanced wandering, and these remaining states quickly become known. Once the bounds are known, a near-optimal exploitation policy can be found for the entire CMDP.

Practical considerations
Setting $m_{known}$ in practice

In general, the required number of samples for a state to be known is $O\left(\left(\frac{STG}{\epsilon}\right)^4 \mathrm{Var}_{\max} \ln\frac{1}{\delta}\right)$. In practice, $m_{known}$ needs to be set to a fixed value, which requires uncovering constants hidden by the big-O notation as well as replacing asymptotic or otherwise impractical assumptions with practical ones. First, in the Constrained Simulation Lemma, obtaining the constant $K_1$ in $\alpha = K_1 (\epsilon/(STG))^2$ requires solving exactly for the two conditions specified in addition to $\alpha = O(\epsilon)$, plugging in exact values of $\epsilon$, $T$, $G_{r_{\max}}(T)$, and $G_{c_{\max}}(T)$. $G$ will be equal to $G_{r_{\max}}(T)$ or $G_{c_{\max}}(T)$; since these have analogous requirements, let $G = G_{r_{\max}}(T)$. For the first requirement, note that for most practical settings $\alpha \le 1$, and set $K_1 \le 1/144$; additionally assuming $S \ge 4$ yields the second requirement. Then, combining the three conditions following from Hoeffding's inequality in the Known State Lemma while plugging in $K_1$, summing over actions (modelled as a constant with relatively low value), and substituting $M = \max(\{r_{\max}, c_{\max}, 1\})$ appropriately instead of $\mathrm{Var}_{\max}$ (as $\epsilon \to 0$ does not hold in practice), yields a general formula for $m_{known}$ (Eq. 24). Note that Eq. 24 implies $E^4$ is particularly suitable for domains with a limited state space. Naturally, the failure rate $\delta$ should at all times be very limited, but to reduce $m_{known}$ one can dynamically change the other parameters, starting with a relatively high $\epsilon$ and decreasing it later in the lifetime. This would also decrease $T$ and $G$, allowing the algorithm to reach "satisfactory" performance levels across a large set of states more quickly.
Given certain assumptions, one may be able to significantly reduce $m_{known}$ by using a different concentration inequality. For example, one may use Bernstein's inequality if the variances of rewards and costs are known and relatively low compared to both $r_{\max}$ and $c_{\max}$. For the reward function, define the random variable $X = |r(s, a) - r_t|/m$, where $r_t$ is the observed reward at time $t$ provided $s_t = s$ and $a_t = a$. Then, defining $V = \max(\mathrm{Var}(X), \mathrm{Var}_{\max})$ and summing over actions provides a decrease by a factor $\frac{1}{4}\alpha^2 M^2 / (V + \alpha/3)$. As an example, plugging in $M = \max(r_{\max}, c_{\max}) = 100$, $V = 1$, and $\alpha = 0.10$, a 24-fold improvement can be observed.
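For illustration, the standard (unscaled) Hoeffding and Bernstein sample sizes can be compared as below. The exact constants in the text use a different scaling of the random variable, so this sketch only illustrates the variance saving, not the precise factor derived above:

```python
import math

def m_hoeffding(M, alpha, delta):
    """Samples m so that the empirical mean of a [0, M]-valued variable is
    within alpha of its true mean with probability 1 - delta (Hoeffding)."""
    return math.ceil(M**2 / (2 * alpha**2) * math.log(2 / delta))

def m_bernstein(M, V, alpha, delta):
    """Bernstein's inequality: far fewer samples when the variance V << M^2,
    since the bound depends on V + M * alpha / 3 rather than M^2."""
    return math.ceil(2 * (V + M * alpha / 3) / alpha**2 * math.log(2 / delta))
```

With a large range but small variance (e.g. $M = 100$, $V = 1$, $\alpha = 0.1$), the Bernstein requirement is orders of magnitude below the Hoeffding one.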
Finally, it is worth noting that in many settings, the reward and constraint-cost functions are a deterministic, rather than stochastic, function of the state-action pair. In such cases, a single sample suffices to know the average reward and constraint-cost for a given state-action pair, and therefore the number of samples for a state to be known depends only on the requirement for the transition dynamics.

From worst-case to practical assumptions

In deriving the number of actions that need to be taken to reach a near-optimal policy for the entire CMDP, many worst-case assumptions were made. With more practical assumptions, one can obtain improved results for the sample complexity. First, the worst-case number of exploration attempts, $N = O\left(\frac{G_{r_{\max}}(T)}{\epsilon} \ln(S/\delta)\, S\, m_{known}\right)$, can be improved by noting that one does not always need to make all states known before having the $\epsilon$-optimal solution, yielding some $S' < S$. Second, both the number of exploration attempts and $m_{known}$ can be improved by estimating $G_{r_{\max}}(T)$ and $G_{c_{\max}}(T)$ via empirical trajectories instead of using a generic upper bound. Third, the cost of $c_{\max}$ at all time steps in the Worst-case Escape CMDP could be improved to $\hat{c}(s, a) + \alpha_u(s, a)$, where $\alpha_u(s, a)$ is the approximation error for state-action pairs that have been visited already (but are not yet known). For example, one could set $\alpha_u(s, a)$ based on the standard error, $\alpha_u(s, a) = \sqrt{\mathrm{Var}^{c}_{\max}/n(s,a)}$, such that more balanced wandering steps can be taken. Note that such changes to the Worst-case Escape CMDP would also allow the application of $E^4$ to CMDPs with a larger diameter. Fourth, using a separate $\alpha$ (and/or $\alpha_u$) for the reward, constraint-cost, and transition dynamics functions may account for the range and variance of these different random variables. Fifth, in practice, there will be an initial set of known states; this further reduces the number of states that need to be made known to a number $S'' < S' < S$.
Sixth, many of the settings from Lemma 10 can be significantly optimised in practice: (a) the diameter requirement can be relaxed in domains such as Example 2; (b) the current path as well as predictions of future constraint-cost may indicate when the $T$-step safe return trajectory is not needed; (c) the bounds on the budget can be improved by using the specific setting of $\gamma$ of the target CMDP instead of the worst-case setting of $\gamma = 1$; and (d) the known-state and unknown-state budgets can take further information into account, such as the current number of known and unknown states. Finally, the use of adaptive sampling, which has been used in prior work on $E^3$ (Domingo, 1999), may also be beneficial for practical $E^4$ implementations.
Can one use $E^4$ for constraints on individual paths?

A particular trajectory $p$ starting in a state $s \in S$ may yield $C(p) > d$ even if $C^\pi(s) \le d$, since $C^\pi$ is formulated in expectation over paths. While $E^4$ has not been designed for constraints on individual paths, if the user desires probabilistic guarantees on the constraint-cost of paths with the same level $d$, one can form a confidence interval based on the standard deviation $S_C$, which, if unknown, can be upper-bounded by $\sqrt{\mathrm{Var}^{c}_{\max}}$. Then the budget can be reformulated as $d_p \leftarrow d - 3 S_C$ such that the budget of interest $d$ is rarely exceeded. For example, if the asymptotic constraint-cost of paths is normally distributed with mean $d_p$ and standard deviation $S_C$, then there is at most a 0.1% probability that any such path $p$ has constraint-cost $C(p) > d$. More generally, if one has no information on the distribution, one may use concentration inequalities to provide upper bounds on failure probabilities. For example, with the Chebyshev-Cantelli inequality (Cantelli, 1928), the same example yields a failure probability of at most $1/(1 + 3^2) = 10\%$. When $d - 3 S_C$ is not positive (and greater than $2 c_{\max}$), an alternative safe exploration algorithm would be required, one which is explicitly formulated to handle the probability of individual failures. In $E^4$, the reasoning is that having $C(p) > d$ is dangerous but usually not catastrophic, while in the above setting $C(p) > d$ is always catastrophic. To the authors' best knowledge, handling individual failures in this manner has not been done before as this requirement is too strict, but solving this exciting challenge would often be useful in safety-critical settings.
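The budget reformulation above can be sketched as follows (a hypothetical `path_budget` helper):

```python
def path_budget(d, s_c, k=3.0):
    """Reformulate the expected-cost budget as d_p = d - k * S_C so that a
    path exceeding d requires a k-sigma deviation; the Chebyshev-Cantelli
    inequality bounds that probability by 1 / (1 + k^2), distribution-free."""
    d_p = d - k * s_c
    fail_prob = 1.0 / (1.0 + k**2)
    return d_p, fail_prob
```

With $d = 10$ and $S_C = 1$, this yields $d_p = 7$ and a distribution-free failure bound of 10%, matching the Cantelli computation in the text.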

Conclusion
Safety is of critical concern in real world applications, and especially in RL, where an agent interacts with an initially unknown environment. Unlike game environments, failures in real world safety-critical applications end the lifetime of the RL agent with serious cost to the owner of the agent and, potentially, society at large. Therefore, instead of model-free RL algorithms, which are formulated with this trial-and-error setting in mind, a model-based RL approach may be more suitable for safety-critical applications. Model-based RL algorithms such as $E^3$ learn near-optimal policies with polynomial sample and time complexity, making them an attractive option for learning in the real world, as the model can be used to perform offline computations without requiring too much real world trial-and-error. This paper integrates $E^3$ into a constrained Markov decision process framework to show that -- as long as the constraints are satisfiable due to a limited diameter of the CMDP -- there exists an algorithm ($E^4$) which with high probability finds a near-optimal constrained policy within polynomial time. The $E^4$ algorithm combines the use of an explicit exploration and an explicit exploitation policy with an additional escape policy that provides a safe return for states not reliably known by the model. The algorithm is attractive for safety-critical settings, not only due to the offline computations, but also because of the non-episodic setting in which there is only a single lifetime. The simulation model allows anticipating constraint-satisfaction failures, and where there is uncertainty about the true environment, distributional robustness is used to ensure the worst-case scenario does not violate the constraints. Beyond theoretical results supporting the framework, a discussion highlights offline optimisation algorithms and shows how to formulate uncertainty sets for unknown states based on prior knowledge, empirical inference, or a combination thereof.
Further practical considerations discussed include relaxing the worst-case assumptions that underlie the theory, aspects of the domain and the implementation that affect sample efficiency, and the practical interpretation of the various components of the algorithm. A few exciting research directions emerge from this paper, including the development of scalable methods for constrained robust offline optimisation, defining uncertainty sets for unknown states, as well as the explicit use of exploration, exploitation, and escape policies within other model-based or even model-free RL algorithms. Given that the explicit separation of exploration and exploitation that characterised E 3 has been taken up by recent works in different learning settings, including hard-exploration (Ecoffet, Huizinga, Lehman, Stanley & Clune, 2021) and meta-learning (Liu, Raghunathan, Liang & Finn, 2020), E 4 may provide unique perspectives on how to solve such problems safely.

Declarations
This version of the article has been accepted for publication after peer review but is not the Version of Record and does not reflect post-acceptance improvements or any corrections.

Constrained Simulation Lemma. Let $\hat{M}$ be an $\alpha$-approximation of $M$ with $\alpha = O\left((\epsilon/(STG))^2\right)$, where $G = \max(G_{r_{\max}}(T), G_{c_{\max}}(T))$, and let $\pi$ be a policy in $M$. Then for any state $s$: (a) $V^\pi(s) - \epsilon \le \hat{V}^\pi(s) \le V^\pi(s) + \epsilon$; (b) $C^\pi(s) - \epsilon \le \hat{C}^\pi(s) \le C^\pi(s) + \epsilon$.

Proof:
Let $\pi$ be any policy in $M$. Define a $\beta$-small transition as a transition for which $P^*_{s,a}(s') \le \beta$. After $T$ time steps, there is a probability of at most $TS\beta$ that a $\beta$-small transition is crossed (since at any one time step the probability is at most $S\beta$). The proof distinguishes between case A), in which the path traversed includes at least one $\beta$-small transition, and case B), in which it does not.

Case A: at least one $\beta$-small transition. If there are $\beta$-small transitions, then: 1) $T$-step walks of $\pi$ from $s$ that cross at least one $\beta$-small transition contribute at most $TS\beta G_{r_{\max}}(T)$ (in expectation) to $V^\pi(s, T)$ and at most $TS\beta G_{c_{\max}}(T)$ (in expectation) to $C^\pi(s, T)$. 2) Since $\hat{M}$ is an $\alpha$-approximation of $M$, $\beta$-small transitions satisfy $\hat{P}_{s,a}(s') \le \beta + \alpha$. Therefore, $T$-step walks of $\pi$ from $s$ that cross at least one $\beta$-small transition contribute at most $TS(\alpha + \beta) G_{r_{\max}}(T)$ (in expectation) to $\hat{V}^\pi(s, T)$ and at most $TS(\alpha + \beta) G_{c_{\max}}(T)$ (in expectation) to $\hat{C}^\pi(s, T)$. These also upper-bound the contribution of such walks to the discrepancy between $\hat{M}$ and $M$. The remainder of the proof thus consists of choosing an appropriate $\beta$ and then solving for $\alpha$ to bound the discrepancy to the desired $\epsilon$ in the following equation: $(\alpha + 2\beta) T S G_{c_{\max}}(T) \le \epsilon/4$.
In the general case, we then solve the approximation for paths with no $\beta$-small transitions and add the above-mentioned $\epsilon/4$ to account for the maximal contribution of paths with $\beta$-small transitions.
A similar argument follows for the value function. For all state-action pairs $(s, a) \in S \times A$, we have $r(s, a) - \alpha \le \hat{r}(s, a) \le r(s, a) + \alpha$, and therefore $V(p) - T\alpha \le \hat{V}^\pi(p) \le V(p) + T\alpha$ holds for any path $p$. Since we took an arbitrary path without $\beta$-small transitions, taking the expectation over such paths yields the same inequality, while we can additionally account for $\epsilon/4$ as the added contribution of $T$-step walks with $\beta$-small transitions. Converting this to a multiplicative approximation over the expected $T$-step value yields $(1 - \Delta)^T (V^\pi(s, T) - T\alpha) - \epsilon/4 \le \hat{V}^\pi(s, T) \le (1 + \Delta)^T (V^\pi(s, T) + T\alpha) + \epsilon/4$.
The lower and upper bounds above together imply the result, provided $(1 + \Delta)^T$ is bounded by a constant and $\alpha T = O(\epsilon)$.
Analogous results follow for the constraint-cost, concluding the proof.
For any $p_k$-type path, we have $\mathbb{P}_{\pi,P}[p_k] = \mathbb{P}_{\pi,P_{M_S}}[p_k]$ and $V(p_k) = V(p_k|M_S) \le V(s, T|M_S)$, where the equalities follow from the paths containing only known states (since for the induced model the transition dynamics are the same, and rewards outside $S$ are not considered) and the inequality follows from positive rewards and from $V(s, T|M_S)$ additionally considering $p_u$-type paths, if any. Therefore, $\sum_{p_k} \mathbb{P}_{\pi,P}[p_k] V^\pi(p_k) = \sum_{p_k} \mathbb{P}_{\pi,P_{M_S}}[p_k] V^\pi(p_k|M_S) \le V^\pi(s, T|M_S) < V^{\pi^*}(s, T) - \epsilon$. Therefore, since $V^{\pi^*}(s, T) = V^\pi(s, T) = \sum_{p_k} \mathbb{P}_{\pi,P}[p_k] V^\pi(p_k) + \sum_{p_u} \mathbb{P}_{\pi,P}[p_u] V^\pi(p_u)$, this implies that $\sum_{p_u} \mathbb{P}_{\pi,P}[p_u] V^\pi(p_u) > \epsilon$.
Since $V^\pi(p) \le G_{r_{\max}}(T)$ for any $T$-step path, it follows that $\sum_{p_u} \mathbb{P}_{\pi,P}[p_u] > \epsilon / G_{r_{\max}}(T)$. Therefore, condition b) is satisfied.