
1 Introduction

Deep reinforcement learning (RL) is a promising approach for synthesizing controllers [19] to govern cyber-physical systems such as autonomous vehicles. State-of-the-art RL algorithms can autonomously acquire motor skills through trial and error, whether in simulated environments or in unknown terrains, thus circumventing the need for laborious manual engineering. However, most RL algorithms perform a significant number of exploratory steps during training that can lead to dangerous behavior. In many real-world scenarios where high assurance is crucial, it becomes imperative for the RL agent to behave safely during environment interactions, even during training when the agent is not yet optimal [37, 43].


To facilitate safe exploration, it is essential to have a mechanism that determines the safety of executing an action in a given environment state. Several existing approaches utilize prior knowledge about system dynamics [5, 6, 52] to make such assessments. When the environment dynamics are not known a priori, existing safe RL methods utilize learned predictors in the form of neural networks [1, 7, 15, 47] to predict the safety implications of a particular control action. Training these neural predictors may require numerous potentially unsafe environment interactions.

There are also model-based safe RL techniques that leverage learned environment models in unknown environments to filter out unsafe actions [4, 34]. In the recent CRABS framework [34], a barrier certificate and a model of the environment dynamics are co-trained in conjunction with a controller. The learned neural barrier certificate serves as a predictive tool to assess whether a control action from the policy aligns with the safety requirement. In cases where it does not, a safeguard policy, trained on the environment model, is executed. While rooted in formal methods concepts, CRABS cannot rigorously verify the accuracy of a learned barrier certificate. This challenge arises from the fact that both the certificate and the underlying environment models are deep neural networks, making formal verification a complex task. Another recent work, SPICE [4], uses weakest preconditions [16] to generate, from a learned environment model, a predicate that decides if an action is safe to take at the current environment state with respect to a short time horizon H. However, H cannot be extended to cover the entire horizon of an RL task, primarily because of the inherent challenge in constructing precise weakest precondition transformers for neural networks. As a result, although grounded in Hoare logic, SPICE still suffers from notable safety violations in its environment exploration.

We present VELM, a model-based safe reinforcement learning framework that engages in formally verified safe exploration through learned environment models, covering the entire horizon of an RL task. VELM learns a symbolic environment model linking the system’s future states, past states, and the controller’s actions. Most non-linear control systems are characterized by dynamics dictated by mathematical equations involving operators such as trigonometric functions (like sine and cosine) and power functions. By leveraging this prior knowledge of common operators that could appear in environment dynamics, VELM searches for a symbolic environment model in the space of interpretable mathematical expressions using symbolic regression techniques. Symbolic regression methods have demonstrated remarkable extrapolation capabilities in recent studies and have proven valuable across diverse domains including physics [10, 28, 30]. More importantly, unlike neural environment models, symbolic environment models are conducive to long-horizon reachability analysis, enabling the computation of the reachable set of a control system across the entire task horizon. VELM leverages this capability to establish a safe exploration regime for verified safe learning.

VELM can be instantiated on top of any model-based reinforcement learning algorithm. It involves a two-step procedure repeated until convergence: (a) interact with the true environment to collect a dataset of environment transitions and learn from the data an environment state transition model F (i.e. a function that maps the current state \(s_t\) and action \(a_t\) to the next state \(s_{t+1}\)) and (b) derive a controller \(\pi \) from this learned model. In each learning iteration, VELM aims to ensure that the data collection process of using the current controller \(\pi \) to interact with the true environment in step (a) is safe. One way to do so is by verifying the safety of \(\pi \) according to the learned model. However, conducting reachability analysis of neural networks in a closed-loop control system remains a challenging research problem [27]. Alternatively, VELM considers \(\pi \) as an oracle and derives a much simpler and verification-friendly time-varying linear controller \(\pi '\) that approximates the policy actions executed by \(\pi \) at each time step within the RL task horizon. While alternative methods such as approximating a neural controller as a polynomial function exist [48], our objective is to achieve a balance between expressiveness and verifiability. Time-varying linear controllers provide computational efficiency for reachability analysis, making verification over learned models feasible in a learning loop. VELM solves a constrained optimization problem aimed at optimizing the behavior of \(\pi '\) to closely match that of \(\pi \) while simultaneously ensuring that \(\pi '\) can be formally verified as safe for the learned environment model. Leveraging \(\pi '\) as a reference, VELM computes a safety shield that restricts the neural policy \(\pi \) to explore solely within the state space where \(\pi '\) is verified as safe in the learned model. The shield intervenes whenever the neural policy \(\pi \) proposes a potentially unsafe control action that could result in a next state outside the safe state space. It then substitutes this action with a safe alternative provided by \(\pi '\). The environment state transition model F is repeatedly updated during the learning process using data safely collected with the shielded neural controller. The computation of the shield is accordingly repeated, leading to a more refined shield with each update to the controller.

While prior work has explored shielding for safe RL, these approaches require a calibrated piecewise linear dynamics model [5, 52] or an abstract model of the agent’s safe behavior [24], whereas VELM automatically learns a dynamics model and a safe shielding policy. Adapting these techniques to learned environment models that evolve across training iterations is challenging, given the inherent difficulty of approximating nonlinear models as piecewise linear functions. Compared with SPICE [4], VELM is computationally efficient as it only computes a shield once per policy, while SPICE requires calling a QP (Quadratic Programming) procedure at every timestep.

Across a suite of challenging continuous control benchmarks, VELM exhibits reward performance comparable to fully neural approaches and significantly fewer safety violations during training compared to state-of-the-art safe RL techniques.

In summary, this paper makes the following contributions:

  • We propose a novel approach for model-based safe reinforcement learning. Our approach learns an environment model as a symbolic formula and constructs a shielding layer to confine an RL agent to explore within a state space formally verified as safe for the learned model, thereby enhancing the overall safety profile of the RL system.

  • We present VELM as an efficient instantiation of this approach. The experiment results show that VELM offers much greater safety than prior model-based safe RL approaches without suffering a loss in reward performance.

2 Problem Setup

Safety Specification. We define a safety specification as a logical formula specifying the safe states of a control system.

Definition 1

(Safety Specification). A safety specification \(\varphi \) is a quantifier-free Boolean combination of linear inequalities over the environment state variables x:

$$\begin{aligned} \varphi \ {::=}\ \mathcal {A} \cdot x \le b \ \mid \ \varphi \wedge \varphi \ \mid \ \varphi \vee \varphi \end{aligned}$$

A state \(s \in {S}\) satisfies a safety specification \(\varphi \), denoted as \(s \models \varphi \), iff \(\varphi (s)\) is true.

MDP. We formalize an RL system as a Markov decision process (MDP). Specifically, an MDP is a structure \({M}[\cdot ] = (S, A, P, R, S_0, H, \cdot )\) where S is an infinite set of continuous real-vector environment states, i.e., valuations of the state variables \(x_1,x_2,\ldots ,x_n\) of dimension n (\({S} \subseteq \mathbb R^{n}\)), and A is a set of continuous real-vector control actions, i.e., valuations of the action variables \(u_1,u_2,\ldots ,u_m\) of dimension m. \(R: S \times A \rightarrow \mathbb {R}\) is a reward function that returns the immediate reward after the transition from an environment state \(s \in S\) with action \(a \in A\). \(P(s_{t+1}\ \vert \ s_t, a_t)\) is an (unknown) probabilistic state transition function where \(s_{t+1}, s_t \in S\) and \(a_t \in A\) and t is a time step index. \(S_0\) is a set of initial states. H is the time horizon of the control task (i.e. the maximum number of timesteps of a trajectory). An MDP \({M}[\cdot ]\) is parameterized with an (unknown) controller.

Controller (Policy). A controller is a stochastic function \(\pi : S \rightarrow A\) mapping each state to a distribution over actions. We explicitly model the deployment of a (learned) controller \(\pi \) in \({M}[\cdot ]\) as a closed-loop system \(M[\pi ]\). \(M[\pi ]\) generates trajectories (or rollouts) \(\zeta = s_0,a_0,s_1,a_1,\ldots ,a_{H-1},s_H\) where \(s_0 \in S_0\), each \(a_t \sim \pi (s_t)\), and each \(s_{t+1} \sim P(\cdot \ \vert \ s_t,a_t)\). Given a discount factor \(0 \le \beta < 1\), the long-term reward of a policy \(\pi \) is \(R(\pi ) = \mathbb {E}_{(\zeta = s_0, a_0, \ldots , s_{H}) \sim M[\pi ]}[\sum ^{H}_{t=0} \beta ^t R(s_t, a_t)]\).

Problem Formulation. The goal of reinforcement learning is to find a policy \(\pi ^*= \mathop {\mathrm {arg\,max}}\limits _{\pi } R(\pi )\). To achieve this goal, the learning process of (model-free or model-based) reinforcement learning algorithms progressively refines and optimizes policies \(\pi _0, \pi _1, \ldots , \pi _T\) over successive iterations. At each iteration, the current policy is evaluated, and adjustments are made to improve its performance. This learning process continues until the policy converges to the optimal policy \(\pi ^*\). Given a bound \(\delta \), we define safe exploration as a learning process \(\pi _0, \pi _1, \ldots , \pi _T\) such that

$$\begin{aligned} \pi _T = \pi ^*\ \text {and}\ \forall 1 \le j \le T,\ 0 \le t \le H.\ P_{\zeta \sim \pi _j, s_t \in \zeta }(\lnot \varphi (s_t)) < \delta \end{aligned}$$
(1)

Essentially, the end goal is for the final policy \(\pi _T\) in the sequence to optimize long-term rewards, while each intermediate policy (excluding \(\pi _0\)) is constrained to a limited probability \(\delta \) of unsafe behavior. This definition does not place safety constraints on \(\pi _0\) as the environment dynamics is not known and hence \(\pi _0\) can exhibit arbitrary (unsafe) behavior.

3 Verified Exploration Through Learned Models

The Main Algorithm. Our overall framework, Verified Exploration through Learned Models (VELM), employs a learned environment model to facilitate safety analysis during the training phase. Akin to existing model-based safe RL techniques [4, 5, 12, 26, 34], VELM utilizes the learned environment model to delineate safety regions for the underlying control policy. While VELM can also be applied to safe model-based planning, a policy is in general more efficient than a planner. The primary training procedure is outlined in Algorithm 1. It operates within an unknown environment M and takes as input a safety property \(\varphi \). The algorithm concurrently learns an environment model represented as an MDP \(\hat{M}\) and a stochastic neural control policy \(\pi _\texttt {NN}\). The algorithm maintains a dataset D of observed environment transitions, each a tuple \((s_t,a_t,s_{t+1})\) of the current state, the action taken, and the resulting next state. This dataset is acquired by interaction with the real environment M (Line 9). Subsequently, VELM utilizes this dataset to learn a symbolic environment model \(\hat{M}\) (Line 10) and optimizes the neural policy \(\pi _\texttt {NN}\) on this learned model via any model-free RL algorithm of the user’s choice (Line 11). Notably, VELM uses a shielded policy \(\pi _S\) for exploring the real environment to construct D. \(\pi _S\) takes a state s at a timestep t as input and generates a safe action for the RL agent to execute at t. This is necessary because directly executing the neural controller \(\pi _\texttt {NN}\) in the real environment M could result in safety violations. The shielded policy \(\pi _S\) is constructed in Line 7 via the Shield procedure (Algorithm 2). This procedure leverages reachability analysis on the learned environment model \(\hat{M}\) to establish a safe exploration regime covering the entire task horizon. \(\pi _S\) constrains \(\pi _\texttt {NN}\) to explore the real environment only within the established safe region.

Algorithm 1. The main VELM training procedure.
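To make the control flow concrete, the following is a minimal Python-style sketch of this training loop as we read it from the description above. All helper callables (init_policy, shield, rollout, learn_model, train_policy) are hypothetical stand-ins for the corresponding steps of Algorithm 1, not the authors’ implementation.

```python
def velm_train(env, phi, init_policy, shield, rollout, learn_model,
               train_policy, num_iters, horizon):
    """Sketch of VELM's main loop (Algorithm 1); all helpers are caller-supplied."""
    dataset = []            # observed transitions (s_t, a_t, s_{t+1})
    model = None            # learned symbolic environment model M_hat
    pi_nn = init_policy()   # stochastic neural controller pi_NN

    for _ in range(num_iters):
        # Build the shielded policy pi_S from the current model; pi_0 runs
        # unshielded because no model (hence no verified safe region) exists yet.
        pi_s = pi_nn if model is None else shield(model, pi_nn, phi)

        # (a) Safely collect real-environment data with the shielded policy (Line 9).
        dataset += rollout(env, pi_s, horizon)

        # Fit/refresh the symbolic transition model M_hat on the dataset (Line 10).
        model = learn_model(dataset)

        # (b) Improve pi_NN on the learned model with a model-free RL step (Line 11).
        pi_nn = train_policy(model, pi_nn)

    return pi_nn
```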

In the following, we describe in detail the procedures to learn symbolic environment models and construct shielded policies for verified safe exploration.

3.1 Symbolic Environment Models

The LearnModel procedure (Line 10 in Algorithm 1) follows the conventional model-based RL framework [25] to learn an environment MDP model \(\hat{M}[\cdot ] = (S, A, {F}, R, S_0, H, \cdot )\) where \({F}: S \times A \rightarrow S\) is learned using the dataset D to approximate the unknown probabilistic state transition P in the real environment. VELM distinguishes itself from existing methods by learning a symbolic environment state transition function F instead of a deep neural network model.

Given a dataset \(D = \{(s_t, a_t, s_{t+1})\}\) of real environment state transitions, the LearnModel procedure learns an approximate model f of the environment’s dynamics to fit D:

$$\begin{aligned} f = \mathop {\textrm{argmin}}\limits _{f \in \mathcal {F}_\alpha } \mathbb {E}_{(s_t,a_t,s_{t+1}) \in D} \Vert f(s_t, a_t) - s_{t+1}\Vert \end{aligned}$$
(2)

where \(\mathcal {F}_\alpha \) is a family of expressions that can be articulated using the grammar outlined in Fig. 1. This grammar accommodates common mathematical operators such as trigonometric functions. The metavariables x and n represent state variables and constants respectively. The symbolic function f establishes the relation between the next state \(s_{t+1}\) and the system’s past state \(s_t\), as well as the controller’s action \(a_t\).

Fig. 1. Context-free grammar for defining state-transition functions.
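Figure 1 itself is not reproduced here. For illustration only, a plausible form of such a grammar, assuming the operators named in the text (arithmetic, power, and trigonometric functions), is:

$$\begin{aligned} e\ {::=}\ x\ \mid \ n\ \mid \ e + e\ \mid \ e - e\ \mid \ e \times e\ \mid \ e \,/\, e\ \mid \ e^{n}\ \mid \ \sin (e)\ \mid \ \cos (e) \end{aligned}$$

where x ranges over state and action variables and n over numeric constants; the exact production rules in the paper’s Fig. 1 may differ.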

Why Symbolic Environment Models? First, we observe that the dynamics of non-linear control systems often follow mathematical equations. Second, symbolic environment models are suitable for long-horizon reachability analysis to verify the safety of a control system. In contrast, performing reachability analysis over neural network models suffers from large accumulation errors arising from over-approximation [27].

To infer a symbolic formula f to fit D in Eq. 2, the LearnModel procedure employs off-the-shelf symbolic regression techniques [10]. Symbolic regression is a machine learning approach that learns the governing formulas of data. As demonstrated in recent studies [10, 28, 30], symbolic regression exhibits excellent extrapolation capabilities and has already proved useful in a variety of domains such as physics. VELM uses it to search over the space of mathematical expressions by manipulating the operators, constants, and variables in the grammar depicted in Fig. 1.

Nondeterministic Environment Model. It is important to note that VELM does not directly use the deterministic function f as the state transition function F for learned models \(\hat{M}[\cdot ] = (S, A, {F}, R, S_0, H, \cdot )\). In cases where control environments are stochastic (common in RL tasks), deterministic state transition functions are not adequate. For stochastic environments, we aim to bound the deviation between f and the real environment. We identify \(\epsilon \) such that for all \(s_t\) and \(a_t\), \(\Vert f(s_t, a_t) - s_{t+1}\Vert \le \epsilon \) where \(s_{t+1} \sim P(\cdot | s_t, a_t)\) is sampled from the true environment transition at \(s_t\) by taking action \(a_t\). We then express the state transition function of a learned model \(\hat{M}[\cdot ]\) as a nondeterministic function:

$$\begin{aligned} {F}(s_t, a_t) = f(s_t, a_t) + [-\epsilon , \epsilon ] \end{aligned}$$

When used for simulation, F generates a next state at time step t by adding an error vector uniformly sampled from \([-\epsilon , \epsilon ]\) to the result of \(f(s_t, a_t)\). When used for reachability analysis and verification, we consider all possible error terms within \([-\epsilon , \epsilon ]\) as an overapproximation to account for the worst-case deviation.

Fig. 2. Executing a random policy on the real CartPole environment and a learned model.

In practice, we estimate \(\epsilon \) from data and choose the tightest \(\epsilon \) such that \(\forall (s_t,a_t,s_{t+1}) \in D.\ \Vert f(s_t, a_t) - s_{t+1}\Vert \le \epsilon \). Given f, with sufficient data in D, the model learning procedure LearnModel returns a model that is close to the actual environment with high probability \(1 - \delta _M\). That is, for all \(s \in S, a \in A\),

$$\begin{aligned} \text {Pr}_{s' \sim P (\cdot \vert s, a)} \big [ s' \not \in {F}(s, a) \big ] < \delta _M \end{aligned}$$

In this paper, we learn F as a discrete-time dynamics model. With an Ordinary Differential Equation solver, we could also leverage symbolic regression to learn a more accurate continuous-time dynamics model; we leave this for future work.
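As a concrete illustration of this construction, the sketch below (using NumPy, with a fitted symbolic model f supplied as a Python callable) estimates the error bound from the dataset and wraps f into the nondeterministic transition F. For simplicity it estimates a per-dimension bound rather than the norm bound in the text; the names and structure are our own, not the paper’s implementation.

```python
import numpy as np

def estimate_epsilon(f, dataset):
    """Tightest per-dimension bound on |f(s_t, a_t) - s_{t+1}| over the dataset D."""
    residuals = np.abs(np.stack([f(s, a) - s_next for (s, a, s_next) in dataset]))
    return residuals.max(axis=0)          # one epsilon entry per state dimension

def make_nondeterministic_model(f, epsilon, seed=0):
    """F(s, a) = f(s, a) + [-eps, eps]: sample it for simulation, bound it for analysis."""
    rng = np.random.default_rng(seed)

    def simulate(s, a):
        # Add an error vector drawn uniformly from [-epsilon, epsilon].
        return f(s, a) + rng.uniform(-epsilon, epsilon)

    def bounds(s, a):
        # Interval over-approximation used for reachability and verification.
        center = f(s, a)
        return center - epsilon, center + epsilon

    return simulate, bounds
```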

Example 1

Consider the classic CartPole environment [8]. The system’s state is described by \((x, \dot{x}, \theta , \dot{\theta })\) where x (resp. \(\dot{x}\)) denotes the position (resp. speed) of the cart along the x-axis and \(\theta \) (resp. \(\dot{\theta }\)) is the angle (resp. angular velocity) of the pole with respect to the cart. The goal is to balance the pole straight up and bound the deviation of the cart. VELM learns the following equation to describe the state transition function of the system using the Operon [9] symbolic regression tool where u represents the control action (we ignore \(\epsilon \) for simplicity):

[Learned symbolic state-transition equations for the CartPole system.]

Figure 2 depicts the rollouts in the real environment and simulated in the learned model by executing a random policy from (0,0,0,0). The learned model can reasonably capture real trajectories within a small error bound.

3.2 Shielding for Verified Safe Exploration

With a learned environment model \(\hat{M}[\cdot ]\), under the assumption of its high-probability approximate accuracy, the verification of a neural controller \(\pi _\texttt {NN}\) can be directly pursued through reachability analysis over the closed-loop neural network controlled system \(\hat{M}[\pi _{\texttt {NN}}]\) (NNCS). However, the verification of NNCS remains a significant challenge in the research literature [27].

Time-varying Linear Controllers. VELM instead distills a neural controller \(\pi _\texttt {NN}\) into a time-varying linear controller that is as similar as possible to \(\pi _\texttt {NN}\). Simultaneously, this process ensures that the safety of the time-varying linear controller can be formally verified with respect to the learned model \(\hat{M}[\cdot ]\) and a safety property \(\varphi \). Principally, a time-varying linear controller can provide an accurate local approximation of a neural controller at each time step (if the time step is small) and incurs a much-reduced verification cost owing to the linearity of the representation. A time-varying linear controller \(\pi _\theta (s, t)\) with trainable parameters \(\theta \) for a time horizon H (\(0 \le t < H\)) can be expressed mathematically as:

$$\begin{aligned} \pi _\theta (s, t) = {\theta _k(t)}^T \cdot s + \theta _b(t) \end{aligned}$$

\(\pi _\theta (s, t)\) generates the control input at time t when observing the current environment state s at t. The time-varying nature of the controller is captured by the gain matrix \(\theta _k(t)\) and the bias term \(\theta _b(t)\), both of which depend on t, reflecting the dynamic adjustments in the control strategy over different time instances. The overall objective of distilling \(\pi _\texttt {NN}\) into a time-varying linear controller \(\pi _\theta \) is:

$$\begin{aligned} \min _\theta \,& \mathbb {E}_{s_0,s_1,\ldots ,s_H \sim \hat{M}[\pi _\theta ]}\ \Vert \pi _\theta (s_t, t) - \pi _\texttt {NN}(s_t)\Vert _2 \nonumber \\ & \text {subject to {Verify}}(\hat{M}, \pi _\theta , \varphi ) = \text {True} \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _2\) is a loss function using the \(L^2\) norm.
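Concretely, a time-varying linear controller is just one gain matrix and one bias vector per timestep. The NumPy sketch below shows this parameterization; it is illustrative, not the paper’s implementation.

```python
import numpy as np

class TimeVaryingLinearController:
    """pi_theta(s, t) = theta_k(t)^T s + theta_b(t), for t = 0, ..., H - 1."""

    def __init__(self, horizon, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One (action_dim x state_dim) gain matrix and one bias vector per timestep.
        self.K = 0.01 * rng.standard_normal((horizon, action_dim, state_dim))
        self.b = np.zeros((horizon, action_dim))

    def __call__(self, s, t):
        return self.K[t] @ s + self.b[t]

    def get_flat_params(self):
        # Flattened theta = (theta_k, theta_b), e.g. for the random-search updates
        # used later when estimating gradients of the verification loss.
        return np.concatenate([self.K.ravel(), self.b.ravel()])

    def set_flat_params(self, theta):
        k_size = self.K.size
        self.K = theta[:k_size].reshape(self.K.shape)
        self.b = theta[k_size:].reshape(self.b.shape)
```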

Verifying Time-varying Linear Controllers. VELM verifies the safety of a time-varying linear controller \(\pi _\theta \) for a learned model \(\hat{M}[\cdot ]\) using abstract interpretation. While other approaches exist, such as synthesizing barrier certificates for controller verification, these techniques have difficulty handling non-polynomial system dynamics. VELM soundly performs reachability analysis to over-approximate the set of reachable states of a control system at each timestep:

Definition 2

(Symbolic Rollouts). Given an environment model \(\hat{M}[\pi ] = ({S}, {A}, {F}, R, S_0, H, \pi )\) deployed with a controller \(\pi \), an abstract domain \(\mathcal {D}\), and an abstract transformer \(F^{\mathcal {D}}\) for the state transition function F over \(\mathcal {D}\), a symbolic rollout of \(\hat{M}[\pi ]\) over \(\mathcal {D}\) is \(\zeta ^\mathcal {D}= S^\mathcal {D}_0, S^\mathcal {D}_1, \ldots , S^\mathcal {D}_H\) where \(S^\mathcal {D}_0 = \alpha (S_0)\) is the abstraction of the initial states \(S_0\) in \(\mathcal {D}\) and \(\alpha \) is the abstraction function of \(\mathcal {D}\). Each symbolic state \(S^\mathcal {D}_{t}\) over-approximates the set of states reachable from an initial state in \(S_0\) at timestep t. We have \(S^\mathcal {D}_{t+1} = F^{\mathcal {D}}\big ( S^\mathcal {D}_{t}, A^\mathcal {D}_{t} \big )\) where \(A^\mathcal {D}_{t}\) over-approximates the set of actions at t. \(\gamma \) is the concretization function of \(\mathcal {D}\) for obtaining the set of concrete states represented by an abstract state \(S^\mathcal {D}_t\).

The abstract interpreter \(F^{\mathcal {D}}\) in VELM uses Taylor Model (TM) flowpipes as the abstract domain \(\mathcal {D}\) to reason about the safety of \(\hat{M}[\pi _\theta ]\). For reachability analysis of \(\hat{M}[\pi _\theta ]\) at each timestep t, VELM takes the TM flowpipe \(S^\mathcal {D}_{t}\) over-approximating the reachable states of \(\hat{M}[\pi _\theta ]\) at timestep t. To obtain a TM representation of the output set of the time-varying linear controller \(\pi _\theta \) at timestep t, VELM uses TM arithmetic to evaluate a TM flowpipe \({A}^\mathcal {D}_{t}\) for \(\pi _\theta (s, t) = {\theta _k(t)}^T \cdot s + \theta _b(t)\) over all states \(s \in {S}^\mathcal {D}_{t}\). The resulting TM representation \({A}^\mathcal {D}_{t}\) over-approximates the controller’s output at timestep t. Finally, we use Flow\(^*\) [11] to construct the TM flowpipe over-approximation \(S^\mathcal {D}_{t+1}\) of all reachable states at timestep \(t+1\) by reachability analysis over the state transition function, \(S^\mathcal {D}_{t+1} = F^\mathcal {D}(S^\mathcal {D}_t, A^\mathcal {D}_t)\). To verify \(\hat{M}[\pi _\theta ]\) against a safety property \(\varphi \), VELM uses Flow\(^*\) to check that for each abstract state \(S^\mathcal {D}_t\) in the symbolic rollout of \(\hat{M}[\pi _\theta ]\), the concretized states in \(\gamma (S^\mathcal {D}_t)\) do not violate \(\varphi \).
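VELM’s analysis relies on Flow\(^*\)’s Taylor-model arithmetic, which we do not reproduce. As a much simpler illustration of the same reachability idea, the toy sketch below propagates axis-aligned boxes through learned dynamics that happen to be affine, \(x_{t+1} = A x_t + B u_t + [-\epsilon , \epsilon ]\), under a time-varying linear controller. It is a stand-in for intuition only and is far less precise than TM flowpipes on nonlinear models.

```python
import numpy as np

def box_reach_affine(A, B, K, b, eps, lo0, hi0, horizon):
    """Sound box propagation for x_{t+1} = A x_t + B u_t + [-eps, eps]
    with the time-varying linear controller u_t = K[t] x_t + b[t]."""
    lo, hi = np.asarray(lo0, float), np.asarray(hi0, float)
    boxes = [(lo.copy(), hi.copy())]          # boxes[t] over-approximates states at t
    for t in range(horizon):
        M = A + B @ K[t]                      # closed-loop state matrix at time t
        c = B @ b[t]                          # constant offset from the controller bias
        Mp, Mn = np.maximum(M, 0.0), np.minimum(M, 0.0)
        # Exact image of a box under an affine map, computed per coordinate,
        # then bloated by the model error interval [-eps, eps].
        lo, hi = Mp @ lo + Mn @ hi + c - eps, Mp @ hi + Mn @ lo + c + eps
        boxes.append((lo.copy(), hi.copy()))
    return boxes
```

Checking a linear safety constraint against each box then amounts to the interval computation described in Sect. 3.3.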

Verified Shielding. The safety of a distilled controller \(\pi _\theta \) does not imply that its oracle neural controller \(\pi _\texttt {NN}\) is safe. For safe exploration using \(\pi _\texttt {NN}\), VELM constructs a shield for \(\pi _\texttt {NN}\) based on \(\pi _\theta \). The high-level algorithm for shield synthesis is presented in Algorithm 2.

Given a learned environment model \(\hat{M}[\cdot ]\), a neural controller \(\pi _\texttt {NN}\), and a safety specification \(\varphi \), at Line 2, Algorithm 2 invokes Approximate to construct a distillation of \(\pi _\texttt {NN}\) as a time-varying linear controller \(\pi _\theta \). We describe Approximate in detail in Algorithm 3 and Sect. 3.3. At Line 3, Algorithm 2 uses the symbolic rollout \(\zeta ^\mathcal {D}= S^\mathcal {D}_0, S^\mathcal {D}_1, \ldots , S^\mathcal {D}_H\) of \(\hat{M}[\pi _\theta ]\) to derive the reachable set of states of \(\pi _\theta \) for the learned environment model \(\hat{M}[\cdot ]\). As this reachable set of states has been verified safe for \(\hat{M}[\pi _\theta ]\), the shield constrains \(\pi _\texttt {NN}\) to only explore within the reachable set \(\cup _{0 \le i \le H} \gamma (S^\mathcal {D}_i)\) to remain safe. Algorithm 2 returns a shield \(\pi _\mathcal {S}\) for \(\pi _\texttt {NN}\) in the form of a lambda function that takes as inputs an environment state \(s_t\) at time step t and the time step t itself. The following lemma shows that, assuming the learned model soundly approximates the unknown state transition distribution P of the real environment (Sect. 3.1), the shield is provably safe.

Algorithm 2. The Shield procedure.
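A minimal sketch of the shielded policy that Algorithm 2 returns, following the case split referenced below in the proof of Lemma 1 (Lines 5 and 6 of Algorithm 2). The `reach` objects stand for the concretized verified regions \(\gamma (S^\mathcal {D}_0), \ldots , \gamma (S^\mathcal {D}_H)\) with hypothetical `contains` / `contains_box` predicates, and `F_bounds` returns the interval \({F}(s, a)\); all names are illustrative.

```python
def make_shield(pi_nn, pi_theta, F_bounds, reach, horizon):
    """Shield pi_S(s, t): keep pi_NN's action when its one-step reachable set
    stays inside a verified region; otherwise fall back to pi_theta."""
    def pi_s(s, t):
        a_nn = pi_nn(s)
        lo, hi = F_bounds(s, a_nn)            # interval enclosing F(s, pi_NN(s))
        # Line 5: accept pi_NN if F(s, a_nn) lies inside some gamma(S^D_i)
        # with i <= t + 1, which preserves the invariant in Lemma 1.
        for i in range(min(t + 1, horizon) + 1):
            if reach[i].contains_box(lo, hi):
                return a_nn
        # Line 6: otherwise use the verified linear controller, indexed by the
        # latest verified region that contains the current state.
        i = max(j for j in range(horizon + 1) if reach[j].contains(s))
        return pi_theta(s, i)
    return pi_s
```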

Lemma 1

Assume a learned environment model \(\hat{M}[\cdot ] = (S,A,F,R,S_0,H,\cdot )\) is a sound nondeterministic approximation of the true environment: \(\forall s \in S, a \in A.\ s' \sim P(\cdot | s, a) \Rightarrow s' \in {F}(s, a)\). Given a safety property \(\varphi \), a neural policy \(\pi _\texttt {NN}\), and its shield \(\pi _\mathcal {S} = \text {Shield}(\hat{M}[\cdot ],\ \pi _\texttt {NN},\ \varphi )\), for any rollout \(s_0,a_0,s_1,\ldots ,s_H\) collected by \(\pi _\mathcal {S}\) in the true environment where \(s_0 \in S_0\), \(a_t = \pi _\mathcal {S}(s_t, t)\), and \(s_{t+1} \sim P(\cdot | s_t, a_t)\), we have \(s_t \models \varphi \) (i.e. \(s_t\) is safe) for all \(0 \le t \le H\).

Proof

Since \(\pi _\mathcal {S} = \text {Shield}(\hat{M}[\cdot ],\ \pi _\texttt {NN},\ \varphi )\), there exists a \(\pi _\theta \) (Line 2 in Algorithm 2) whose symbolic rollout \(S^\mathcal {D}_0, S^\mathcal {D}_1, \ldots , S^\mathcal {D}_H\) is verified safe with respect to \(\varphi \) (Line 3 of Algorithm 2). We show that for all \(0 \le t \le H\), we have \(\bigvee _{0 \le i \le t} s_t \in \gamma (S^\mathcal {D}_i)\). This invariant implies that \(s_t\) is safe. We prove the invariant by induction. When \(t = 0\), the invariant holds as \(s_0 \in \gamma (S^\mathcal {D}_0)\) by construction. Given an \(s_t\) that satisfies the invariant, if \(\exists 0 \le {{i}} \le t+1. \ {F}(s_t, \pi _\texttt {NN}(s_t)) \subset \gamma (S^\mathcal {D}_{{i}})\) (Line 5), then \(a_t = \pi _\texttt {NN}(s_t)\) and by assumption \(s_{t+1} \sim P(\cdot | s_t, a_t) \in {F}(s_t, a_t) \subset \gamma (S^\mathcal {D}_{{i}})\), so the invariant holds on \(s_{t+1}\) in this case. Otherwise (Line 6), \(a_t = \pi _\theta (s_t, i)\) where \({{i}} = \max {\big (\{{{i}}\ \vert \ s_t \in \gamma (S^\mathcal {D}_{{i}})\}\big )}\). Such an i must exist because \(s_t\) satisfies the invariant. Since \(s_{t+1} \sim P(\cdot | s_t, a_t) \in {F}(s_t, a_t)\), and the soundness of the abstract interpreter \(F^\mathcal {D}\) ensures that \({F}(s_t, a_t) \subseteq \gamma (S^\mathcal {D}_{i+1})\) whenever \(s_t \in \gamma (S^\mathcal {D}_{{i}})\), the invariant holds on \(s_{t+1}\) in this case as well. By induction, the invariant is true for all \(0 \le t \le H\).

Theorem 1

(Shield (Algorithm 2) is probabilistically safe). For a learned environment model \(\hat{M}[\cdot ] = (S, A, F, R, S_0, H, \cdot )\), let \(\delta _M\) be the probability bound of the model: Pr\(_{s' \sim P (\cdot \vert s, a)} \big [ s' \not \in {F}(s, a) \big ] < \delta _M\). Given a safety property \(\varphi \), a neural policy \(\pi _\texttt {NN}\), and its shield \(\pi _\mathcal {S} = \text {Shield}(\hat{M}[\cdot ],\ \pi _\texttt {NN},\ \varphi )\), for any rollout \(s_0,a_0,s_1,\ldots ,s_H\) collected by \(\pi _\mathcal {S}\) in the true environment where \(s_0 \in S_0\), \(a_t = \pi _\mathcal {S}(s_t, t)\), and \(s_{t+1} \sim P(\cdot | s_t, a_t)\), we have \(s_t \models \varphi \) (i.e. \(s_t\) is safe) with probability at least \((1-\delta _M)^t\) for all \(0 \le t \le H\).

Proof

By Lemma 1, if \(s_{t+1} \in {F}(s_t, a_t)\), then \(s_{t+1}\) is safe for all \(0 \le t < H\). By assumption, at each time step, we have \(s_{t+1} \in {F}(s_t, a_t)\) with probability at least \(1-\delta _M\). After t time steps, the probability that the assumption is valid is at least \((1-\delta _M)^t\), which means that \(s_t\) is safe with probability at least \((1-\delta _M)^t\).
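For intuition, a small worked instance with illustrative numbers (not taken from the paper): if the learned model errs with probability at most \(\delta _M = 10^{-4}\) per step and the task horizon is \(H = 100\), then every visited state is safe with probability at least

$$\begin{aligned} (1-\delta _M)^{H} = (1 - 10^{-4})^{100} \approx e^{-0.01} \approx 0.990. \end{aligned}$$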

Fig. 3. Executing a shielded neural policy in ACC. The green region denotes the safe space verified on a learned model. The yellow regions denote the control steps where intervention takes place. (Color figure online)

We can relate the probability guarantee in Theorem 1 to our overall objective in Eq. 1 by bounding \(\delta _M < 1 - (1 - \delta )^{1/H}\), which ensures \((1-\delta _M)^t \ge 1-\delta \) for all \(t \le H\). This theorem illustrates that VELM only allows for safety violations when there is an inaccuracy in the environment model. In contrast, existing approaches to safe exploration are susceptible to safety violations stemming from both modeling inaccuracies and actions that are not safe even with respect to the environment model. For example, SPICE [4] applies weakest precondition generation from safety constraints to a linearization of the learned environment model to determine safe control actions. However, this linearization process introduces substantial approximation errors, compromising the safety of the computed actions on the learned environment model. CRABS [34] uses neural networks to represent environment models and control barrier certificates that identify safe exploration regions. However, a neural control barrier certificate may converge to a suboptimal model, and CRABS has no procedure to rigorously guarantee its correctness. This may result in delayed or absent intervention for unsafe behaviors.

Example 2

Consider an adaptive cruise control (ACC) system [5]. The goal is to control an ego car to closely follow a lead car without collision. The lead car can apply acceleration to itself at any time. Figure 3 shows the rollouts (blue) of a shielded neural controller \(\pi _\texttt {NN}\) in the real environment. The x-axis shows the distance to the lead car while the y-axis shows the relative velocities of the two cars. The rollouts start by accelerating to close the gap with the lead car and subsequently decelerating to prevent a collision. The green region denotes the reachable set of a distilled time-varying linear controller \(\pi _{\theta }\) verified as safe on a learned model of the ACC environment. The yellow regions indicate interventions where \(\pi _{\theta }\) constrains \(\pi _\texttt {NN}\) to stay within the safe region. Without such intervention, the neural controller alone would fail to decelerate rapidly enough and crash into the lead car (the dashed line on the right side). At times, \(\pi _{\theta }\) needs to intervene well before the final steps to ensure the feasibility of avoiding a crash later.

3.3 Neural Controller Approximation

This section formalizes the Approximate procedure invoked by Algorithm 2 (Line 2) for distilling a neural controller \(\pi _\texttt {NN}\) to a time-varying linear controller \(\pi _\theta \) that can be verified safe according to a learned environment model.

Minimizing the gap between \(\pi _\theta \) and a (fixed) neural controller \(\pi _\texttt {NN}\) as two functions can be straightforwardly achieved by optimizing \(\theta \) through gradient descent. However, a binary verification result (true or false) does not offer guidance on how \(\theta \) should be optimized to ensure that \(\pi _\theta \) can be verified safe. Following previous research [45], when facing verification failures, our approach utilizes verification feedback, indicating the extent of safety violations, to guide the optimization process for \(\pi _\theta \). We first formalize the concept of safety violation within the concrete environment state space and then lift it to abstract state spaces.

Definition 3

(State Safety Loss Function). For a safety specification \(\varphi \) over states \(s \in S\), we define a non-negative loss function \(\mathcal {L}(s, \varphi )\) such that \(\mathcal {L}(s, \varphi ) = 0\) iff s satisfies \(\varphi \), i.e. \(s \models \varphi \). We define \(\mathcal {L}(s, \varphi )\) recursively, based on the possible shapes of \(\varphi \) (Definition 1):

  • \(\mathcal {L}(s, \mathcal {A} \cdot x \le b) := \max (\mathcal {A} \cdot s - b, 0)\)

  • \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

  • \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

Notice that \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) = 0\) iff \(\mathcal {L}(s, \varphi _1) = 0\) and \(\mathcal {L}(s, \varphi _2) = 0\), and similarly \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) = 0 \) iff \(\mathcal {L}(s, \varphi _1) = 0\) or \(\mathcal {L}(s, \varphi _2) = 0\).
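Definition 3 translates directly into code. The sketch below assumes the specification is given as a small expression tree with the three node kinds of Definition 1 (a linear atom, a conjunction, or a disjunction); the encoding is our own.

```python
import numpy as np

def safety_loss(s, phi):
    """L(s, phi) from Definition 3. phi is a nested tuple: ('atom', A, b) for
    A @ x <= b (A may stack several rows), ('and', p1, p2), or ('or', p1, p2)."""
    kind = phi[0]
    if kind == 'atom':
        _, A, b = phi
        # Largest violation among the linear inequalities, clipped at 0 when satisfied.
        return float(np.max(np.maximum(np.atleast_1d(A @ s - b), 0.0)))
    if kind == 'and':
        return max(safety_loss(s, phi[1]), safety_loss(s, phi[2]))
    if kind == 'or':
        return min(safety_loss(s, phi[1]), safety_loss(s, phi[2]))
    raise ValueError(f"unknown specification node: {kind}")
```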

We extend the safety loss definition (Definition 3) to the abstract state space employed in a verification procedure.

Definition 4

(Abstract State Safety Loss Function). Given an abstract state \(S^\mathcal {D}\) and a safety specification \(\varphi \), we define an abstract safety loss function:

$$ \mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = \max _{s \in \gamma (S^{\mathcal {D}})} \mathcal {L}(s, \varphi ) $$

It quantifies the worst-case safety loss of \(\varphi \) across all concrete states encompassed by \(S^\mathcal {D}\). For an abstract domain \(\mathcal {D}\), we typically can approximate the concretization of an abstract state \(\gamma (S^{\mathcal {D}})\) using a tight interval \(\gamma _I(S^\mathcal {D})\). For example, it is straightforward to represent Taylor model flowpipes as intervals in Flow\(^*\). Based on the potential structure of \(\varphi \), we redefine \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi )\) as:

  • \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \mathcal {A} \cdot x \le b) := \max _{s \in \gamma _I(S^{\mathcal {D}})}\big (\max (\mathcal {A} \cdot s - b, 0)\big )\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)

By definition, we have \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = 0 \implies \forall s \in \gamma _I(S^\mathcal {D}).\ s \models \varphi \).
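When the interval approximation \(\gamma _I(S^\mathcal {D})\) is a box \([l, u]\), the inner maximization for a linear atom has a standard closed form (stated here for completeness; it follows from interval arithmetic):

$$\begin{aligned} \max _{s \in [l, u]} \big (\mathcal {A} \cdot s - b\big ) = \mathcal {A}^{+} \cdot u + \mathcal {A}^{-} \cdot l - b, \quad \text {where}\ \mathcal {A}^{+} = \max (\mathcal {A}, 0),\ \mathcal {A}^{-} = \min (\mathcal {A}, 0), \end{aligned}$$

so \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \mathcal {A} \cdot x \le b) = \max \big (\mathcal {A}^{+} \cdot u + \mathcal {A}^{-} \cdot l - b,\ 0\big )\), which is cheap to evaluate at every timestep.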

We further lift the definition of safety loss over abstract states (Definition 4) to the symbolic rollout of an MDP (Definition 2).

Definition 5

(Symbolic Rollout Safety Loss). Given an environment model \(\hat{M}[\pi _\theta ]\) and a safety specification \(\varphi \), assuming the symbolic rollout of \(\hat{M}[\pi _\theta ]\) over an abstract domain \(\mathcal {D}\) is \(\zeta ^\mathcal {D}_{0:H} = S^\mathcal {D}_0, \ldots , S^\mathcal {D}_{H}\), we define an abstract safety loss function to measure the degree to which \(\varphi \) is violated by \(\hat{M}[\pi _\theta ]\):

$$ \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ],\ \varphi )\ =\ \mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:H},\ \varphi )\ =\ \max _{0 \le i \le H}(\mathcal {L}_\mathcal {D}(S^\mathcal {D}_i, \varphi )) $$

Definition 5 enables a quantitative metric for the safety loss of a controller \(\pi _\theta \) in the abstract state space of a safety verifier. By definition, we have

$$\begin{aligned} \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) = 0 \implies \hat{M}[\pi _\theta ] \models \varphi . \end{aligned}$$

We rewrite the overall objective of distilling a neural controller \(\pi _\texttt {NN}\) into a time-varying linear controller \(\pi _\theta \) in Eq. 3 as:

$$\begin{aligned} \min _\theta \, & \mathbb {E}_{s_0,s_1,\ldots ,s_H \sim \hat{M}[\pi _\theta ]}\ \Vert \pi _\theta (s_t, t) - \pi _\texttt {NN}(s_t)\Vert _2 \nonumber \\ & \text {subject to } \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) = 0 \end{aligned}$$
(4)

The objective described in Eq. 4 frames a constrained optimization problem. To address this, we employ Lagrangian optimization, which provides a principled way to seamlessly incorporate the verification constraint (\(\mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) = 0\)) into the distillation objective. We introduce a Lagrangian function that incorporates a Lagrange multiplier \(\lambda \) to account for constraint violation and transform Eq. 4 into an unconstrained optimization problem:

$$\begin{aligned} L(\theta , \lambda ) = \mathcal {L}_S(\pi _\theta , \pi _\texttt {NN}) + \lambda \cdot \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) \end{aligned}$$

where \(\mathcal {L}_S(\pi _\theta , \pi _\texttt {NN}) = \mathbb {E}_{s_0,s_1,\ldots ,s_H \sim \hat{M}[\pi _\theta ]}\ \Vert \pi _\theta (s_t, t) - \pi _\texttt {NN}(s_t)\Vert _2\). VELM seeks to optimize the primal parameter \(\theta \) and the Lagrange multiplier \(\lambda \) to minimize the function \(L(\theta , \lambda )\), effectively reducing both the \(L^2\) loss for the distillation objective and safety violations for the verification constraint.

Algorithm 3 outlines the procedure for distilling \(\pi _\texttt {NN}\) to \(\pi _\theta \). It iteratively performs the following two gradient-based update rules to minimize \(L(\theta , \lambda )\):

$$\begin{aligned} \theta \leftarrow \, & \theta - \eta _\theta \cdot \big (\nabla _\theta \mathcal {L}_S(\pi _\theta , \pi _\texttt {NN}) + \lambda \cdot \nabla _\theta \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi )\big )\\ \lambda \leftarrow \, & \lambda + \eta _\lambda \cdot \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) \end{aligned}$$

where \(\eta _\theta \) is a learning rate for \(\theta \) and \(\eta _\lambda \) is a learning rate for \(\lambda \). The Lagrange multiplier \(\lambda \) is increased during the optimization process to penalize deviations from satisfying the verification constraint. As such, even though the verification procedure may introduce approximation errors, VELM can reduce this error by conducting optimization in the abstract state space [45]. VELM repeats the iterative update until the distillation loss (\(\ell _S\)) converges and the safety violation loss (\(\ell _D\)) converges to 0 (Line 8).

Algorithm 3. The Approximate procedure for distilling \(\pi _\texttt {NN}\) into \(\pi _\theta \).

Gradient Estimation for \(\boldsymbol{\mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi )}\). Deriving the gradients of the verification constraint \(\mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi )\) directly poses a challenge, as it requires the verification procedure to be differentiable, which is not practical. To address this obstacle, following prior research [45], VELM estimates the gradients of \(\mathcal {L}_\mathcal {D}\) through random search [36]. In each training iteration, given a closed-loop environment \(\hat{M}[\pi _\theta ]\), we generate perturbed systems \(\hat{M}[\pi _{\theta +\nu \omega }]\) and \(\hat{M}[\pi _{\theta -\nu \omega }]\) by adding sampled Gaussian noise \(\omega \) to the current controller \(\pi _\theta \)’s parameters \(\theta \) in both directions. Here, \(\nu \) is a small positive real number. By assessing the abstract safety losses of the symbolic rollouts of \(\hat{M}[\pi _{\theta +\nu \omega }]\) and \(\hat{M}[\pi _{\theta -\nu \omega }]\), we update \(\theta \) using a finite-difference approximation along an unbiased estimator of the gradient:

$$\begin{aligned} \nabla _\theta \mathcal {L}_\mathcal {D}(\hat{M}[\pi _\theta ], \varphi ) \leftarrow \frac{1}{N}\sum ^N_{k=1}\frac{\mathcal {L}_\mathcal {D}(\hat{M}[\pi _{\theta +\nu \omega _k}],\ \varphi ) - \mathcal {L}_\mathcal {D}(\hat{M}[\pi _{\theta -\nu \omega _k}],\ \varphi )}{\nu }\, \omega _k \end{aligned}$$
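Putting the Lagrangian updates and this gradient estimator together, the distillation loop looks roughly as follows. This is a sketch of Algorithm 3 under our reading; `grad_distill` (gradient of \(\mathcal {L}_S\)) and `abstract_safety_loss` (evaluating \(\mathcal {L}_\mathcal {D}\) by running the verifier) are hypothetical callables, and all hyperparameters are illustrative.

```python
import numpy as np

def approximate(theta, grad_distill, abstract_safety_loss, nu=0.01, n_samples=16,
                eta_theta=1e-3, eta_lam=1e-2, max_iters=500, seed=0):
    """Lagrangian distillation of pi_NN into pi_theta with random-search gradients
    for the (non-differentiable) verification loss L_D."""
    rng = np.random.default_rng(seed)
    lam = 1.0
    for _ in range(max_iters):
        # Two-sided finite-difference estimate of grad_theta L_D.
        grad_ld = np.zeros_like(theta)
        for _ in range(n_samples):
            omega = rng.standard_normal(theta.shape)
            diff = (abstract_safety_loss(theta + nu * omega)
                    - abstract_safety_loss(theta - nu * omega))
            grad_ld += (diff / nu) * omega
        grad_ld /= n_samples

        # Primal descent on theta; dual ascent on the multiplier lambda.
        theta = theta - eta_theta * (grad_distill(theta) + lam * grad_ld)
        lam = lam + eta_lam * abstract_safety_loss(theta)

        if abstract_safety_loss(theta) == 0.0:
            break   # verified safe; the paper additionally waits for L_S to converge
    return theta
```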

Performance Guarantees. We conclude the technical section by discussing the reward performance of VELM. One important concern is whether shielding a neural control policy hinders the RL algorithm’s ability to learn the optimal policy. Previous studies [4, 5, 44] have established the following regret bound concerning the reward performance of a shielded policy for safe exploration compared to the optimal policy that does not seek to restrict safety violations during the learning process. Let \(\pi ^i_S = \text {Shield}(\hat{M}^i[\cdot ], \pi ^i_\texttt {NN}, \varphi )\) for \(1 \le i \le T\) be a sequence of policies learned in Algorithm 1 where \(\varphi \) is the safety property, \(\hat{M}^i[\cdot ]\) and \(\pi ^i_\texttt {NN}\) are the learned environment model and neural controller at the \(i^\text {th}\) iteration. Introduce a safety indicator Z that takes the value 1 when \({\pi ^i_S}(s, t) = \pi ^i_\texttt {NN}(s)\) and 0 otherwise, and let \(\xi = \mathbb {E}[1 - Z]\) be the frequency with which \({\pi ^i_S}\) intervenes in neural policy controls. Assume the reward function is Lipschitz on the controller parameter space and let \(L_R\) be the corresponding Lipschitz constant. Let \(\beta \) and \(\tau ^2\) be the bias and variance in the gradient estimate that is incurred due to sampling. Let \(\epsilon _\mathcal {S}\) be an upper bound on the imprecision incurred by distilling \(\pi ^i_\texttt {NN}\) to a time-varying linear controller. Let \(\epsilon _m\) be an upper bound for the Kullback-Leibler divergence between the learned environment model and the true environment dynamics at all time steps. Let \(\epsilon _\pi \) be an upper bound on the total variational divergence between the policy used to gather data and the policy being trained at all time steps. Set the learning rate \(\eta \) of the RL algorithm for updating \(\pi ^i_\texttt {NN}\) as \(\sqrt{\frac{1}{\tau ^2}(\frac{1}{T} + \epsilon _\mathcal {S})}\). Assuming \(\pi ^*\) is the (unknown) optimal safe control policy, we have the following regret bound [4, 5, 44] for Algorithm 1: \(R(\pi ^*) - \mathbb {E}\big [\frac{1}{T}\sum ^T_{i=1}R(\pi ^i_S)\big ] = O\big ( \sqrt{\frac{1}{\tau ^2}(\frac{1}{T} + \epsilon _{\mathcal {S}})} + \beta + L_R \cdot \xi + \epsilon _m + \epsilon _\pi \big )\). VELM does not impose a significant penalty on the agent’s reward performance for achieving safety, as the regret bound becomes tighter when the frequency \(\xi \) of interventions in the neural controller’s decisions decreases during training. As the environment model improves during training (i.e. \(\epsilon _m\) and \(\epsilon _\pi \) decrease), the controller converges to higher rewards. The remaining terms are associated with the standard error of using sampling to approximate policy update gradients.

Fig. 4. Rewards for all the tools throughout the training phase. The solid curve represents the mean across 5 random seeds. The shaded area indicates the standard deviation.

Fig. 5. Cumulative safety violations for all the tools throughout the training phase. The solid curve represents the mean across 5 random seeds. The shaded area indicates the range between the minimum and maximum values.

4 Experiments

In our implementation of VELM, we use SAC [23], a state-of-the-art reinforcement learning algorithm, as the base algorithm to optimize neural network controllers. We build the abstract interpreter for reachability analysis of a time-varying linear controller against a learned model on top of Flow\(^*\) [11] for reasoning about nonlinear state transition functions. We use Operon [9] to learn a symbolic environment model for the LearnModel procedure at Line 10 in Algorithm 1. In the implementation, we invoke the LearnModel procedure only when the existing environment model is invalid for the newly collected trajectories from the real environment. Recall that our learned model is nondeterministic (Sect. 3.1). Given a current state, it outputs a range for the next state. If the actual next state is not within the range, we consider the model invalid (i.e., \(\exists t.\ s_{t+1} \not \in F(s_t, a_t)\)). This strategy significantly accelerates the learning process.

Baselines. We compared VELM with three baselines: SAC, SPICE [4], and MBPPO-Lagrangian [26]. The SAC baseline acts as an upper bound on reward performance since the agent does not need to explicitly handle safety constraints. The other safe RL baselines are relevant because they are all model-based, like VELM. However, they all use neural networks for learning environment state transition dynamics. SPICE applies weakest precondition generation from safety constraints to a linearized form of learned environment models to ascertain safe control actions for shielding. The linearization step may introduce approximation errors. MBPPO-Lagrangian finds a safety-constraint-satisfying policy by using the Lagrangian method to reduce cumulative safety violations over trajectories executed on the learned model. This method does not consider shielding to ensure safe exploration. We also tried to use CRABS [34] as another model-based safe-learning baseline. In addition to a neural environment model, CRABS uses another neural network to learn a control barrier certificate that identifies a safe region on the neural environment model for shielding. However, we found that CRABS is excessively time-consuming to execute, completing only an average of 10 episodes within a day. Therefore, we have excluded CRABS from the results presented in this section. In summary, these baselines suffer from safety violations stemming from both (1) environment modeling imprecision and (2) control policies that are not safe even with respect to the environment model. VELM eliminates the second source of errors. Our experiments aim to answer the following question: how does the performance of VELM compare to representative baseline approaches on metrics such as rewards, number of unsafe steps, and overall efficiency?

Benchmarks. We used the benchmarks considered in related work. Pendulum, ACC, Obstacle, Obstacle2, Road2D, and CarRacing are taken from the SPICE benchmarks [4, 5]. In Road2D, an autonomous vehicle is controlled to reach a designated destination while adhering to a specified speed limit. Obstacle and Obstacle2 pose a challenge for a 2D robot to reach a specified goal while avoiding an obstacle. In Obstacle, the obstruction is positioned to the side, affecting the agent only during exploration, without cutting the shortest path to the goal. In Obstacle2, the obstruction is placed between the initial region and the goal region, requiring the learned controller to navigate around it. In the Pendulum task, the objective is to maintain a pendulum in the upright position. The goal of ACC (adaptive cruise control) is to closely follow a leading vehicle without collision, with the lead car selecting acceleration randomly from a truncated normal distribution at each time step. The CarRacing environment is similar to Obstacle2, but the goal is to reach a goal region on the opposite side of the obstacle and then return to the initial region. This requires the agent to complete a loop around the obstacle to fulfill the objective. CartPole is from OpenAI Gym [8]. The nonlinear benchmarks CartPoleMove and CartPoleSwing are taken from CRABS [34]. The CartPoleMove task is challenging as high-reward policies must carefully explore near the safety boundary. The user-specified safety set is \(\{(x,\theta ): |\theta | \le \theta _\texttt {max} = 0.2, |x| \le 0.9\}\) where x is the cart horizontal position and \(\theta \) is the pole angle. \(\theta _\texttt {max}\) corresponds to approximately 11\(^\circ \). The reward function of the task is \(r(s,a) = x^2\). Consequently, the optimal policy must delicately move the cart and pole toward the boundary of the unsafe regions while remaining safe. Similarly, the CartPoleSwing task is also a high-risk, high-reward environment. The reward function is \(r(s,a) = \theta ^2\) and the user-specified safety set is \(\{(x,\theta ): |\theta | \le \theta _\texttt {max} = 1.5, |x| \le 0.9\}\). The optimal policy therefore swings the pole back and forth to angles close to 90\(^{\circ }\) while preventing it from falling. LALO20 is a challenging 7-dimensional nonlinear benchmark modeling a molecular network taken from prior work [48]. This task is difficult because the initial states are situated near the boundary of the unsafe region.

Results. We report the mean reward performance of the learned controllers as well as cumulative safety violations over time during training of each benchmark for VELM and each baseline in Fig. 4 and Fig. 5. The shield intervention rates of VELM and SPICE are listed in Table 1. These results are averaged over 5 random seeds.

Table 1. Comparison of Shield Intervention Rates between VELM and SPICE.

Figure 5 demonstrates that VELM exhibits superior safety performance as it experiences a significantly lower frequency of unsafe steps compared to the baseline methods. Except for the initial controller \(\pi _0\), the controllers learned by VELM demonstrate nearly zero safety violations when interacting with the real environment in training. SPICE accumulates safety violations more quickly compared to VELM. Overall, VELM achieves a 99.7% reduction in unsafe steps compared to SPICE. SPICE incurs significantly more safety violations in highly nonlinear environments such as CartPole. This suggests that the model linearization step in SPICE introduces significant approximation errors, resulting in either unnecessary interventions or a lack of intervention when there is truly unsafe behavior. This kind of approximation error also limits SPICE to a bounded-time analysis that determines potential safety violations only within the next few time steps (5, as recommended in SPICE [4]). VELM instead can predict the long-term safety of an action far into the future. For example, on CartPole, the average shield intervention rate for SPICE over all the rollouts in the real environment is 81%, while VELM only has an intervention rate of 12%. Similarly, VELM is safer than MBPPO-Lagrangian in every benchmark. As can be seen from Fig. 5, MBPPO-Lagrangian continues to accumulate more safety violations over time than VELM. Principally, in contrast to VELM, MBPPO-Lagrangian seeks to limit safety violations in expectation and does not assure safety for all visited states.

Table 2. Training time in seconds for model, network and shield updates

Figure 4 also demonstrates that in most cases, VELM attains comparable (or slightly superior) reward performance to SAC. SPICE imposes a substantial penalty on reward performance compared to SAC. This is because SPICE in general exhibits significantly higher shield intervention rates than VELM. As discussed in the performance guarantee analysis in Sect. 3, frequent shield interventions hinder the RL algorithm from converging to the optimal policy. LALO20 is the only benchmark on which VELM does not achieve reward performance comparable to SAC. This is because, in this benchmark, the average shield intervention rate for VELM over all the rollouts in the real environment is relatively high at 49%. However, VELM achieves nearly 0 safety violations during learning. The modest performance penalty is an acceptable trade-off for safety. Although SPICE also achieves almost 0 safety violations on LALO20, its shield intervention rate is 79%, preventing the neural policy from achieving high reward performance.

We present the execution times for each component of VELM across all benchmarks in Table 2, averaged over five random seeds. The Network column in the table reports the time spent training a neural network controller using the base RL algorithm. The Model and Shield columns report the time spent on learning a symbolic environment model and constructing a formally verified shield, respectively. On average, VELM dedicates approximately 9% of its execution time to model learning and 28% to shield construction. This modest overhead is justified by the substantial safety guarantees provided.

5 Related Work

Prior safe RL work considers constrained Markov decision processes (CMDPs), where observed safety violations should be bounded. Lagrangian methods are widely used to solve CMDPs, with the Lagrange multiplier controlled adaptively [41] or by PID [40]. Trust region methods [1, 47, 51] project the current control policy to a feasible safe space around it in each learning iteration. The goal is to bound the number of safety violations under a threshold in expectation, while VELM aims to ensure safety for all visited states. Combining these methods with learning a dynamics model can further improve their data efficiency [26, 50]. There also exist works that learn conservative safety critics to conservatively estimate the long-term safety cost of taking a particular action in a particular state and use these critics for safe exploration and policy optimization [7, 46, 49]. However, training neural safety critics may require numerous potentially unsafe environment interactions. VELM instead uses symbolic reachability analysis over learned environment models to identify safe regions of the state space. Other approaches involve pre-training a policy in a simpler environment and fine-tuning it in a more challenging setting [39], or leveraging existing offline data and co-training a recovery policy [42]. Integrating VELM with pretraining and offline data is an interesting avenue for future research.

Another research direction explores Lyapunov functions and barrier certificates. The work in [6] uses Lyapunov functions to identify regions of attraction where safe operation is guaranteed for discretized deterministic control systems, provided that certain Lipschitz continuity conditions hold. However, this method requires access to system dynamics models. Additionally, a neural network controller may not exhibit Lipschitz continuity with a reasonable coefficient. In [13], it is shown that Lyapunov functions can be co-learned with controllers for discrete action spaces. This work was extended to continuous action spaces by utilizing the Deterministic Policy Gradient theorem [38]. The work by Chow et al. [14] projects control actions to guarantee a decrease in the Lyapunov function after each timestep. In contrast, Donti et al. [17] construct sets of stabilizing actions using a Lyapunov function and then project actions onto this set. A handcrafted barrier function is leveraged in [12] to ensure safe exploration in reinforcement learning. A line of research, exemplified by a prior study [48], focuses on verifying an RL controller upon convergence against safety and reachability properties by inferring barrier certificates, but does not address safety during training. Combining VELM with such work is promising for future investigation.

Model-based safe reinforcement learning approaches ensure the safety of an RL agent through a model of its environment. When a pre-established model of environmental dynamics is available, a safety shield and a backup controller can be constructed from the model using formal methods to regulate agent behavior [3]. To enforce the safety of a deep neural network controller, the backup controller is run in tandem with the neural controller [2, 5, 18, 20, 21, 22, 24, 29, 31, 52]. Whenever the neural controller is about to leave the provably safe state space governed by the backup controller, the backup controller overrides the potentially unsafe neural actions to keep the neural controller within the certified safe space. When environment dynamics models are not known a priori, several works [26, 32, 33, 35] maintain a learned environment model and employ various statistical techniques to devise a policy that is likely to be safe according to the model. This gives rise to two sources of unsafe conduct: the policy may be unsafe with respect to the model, or the model may provide an imprecise depiction of the environment. VELM addresses the first source of error by assuring that control policies are safe within the confines of an environment model. REVEL [5] involves an iterative learning approach where a neural policy is trained, potentially resulting in unsafe behavior. Subsequently, the learned neural policy is distilled into a piecewise linear policy. Automatic verification is then applied to certify the piecewise linear policy, a process akin to constructing a barrier function. First, this certification method assumes a calibrated dynamics model, whereas VELM learns the dynamics model. Second, the verification algorithm in REVEL requires a piecewise linear environment model to be manually constructed to approximate the calibrated dynamics model, a condition that is not practical in VELM (learned environment models evolve across learning iterations). CRABS [34] iteratively learns a barrier certificate, a dynamics model, and a control policy, where the barrier certificate, learned via adversarial training, ensures the policy’s safety assuming the learned dynamics model. Yet, formally verifying the correctness of the barrier certificate faces challenges as both the certificate and the underlying environment model are complex, deep neural network models. SPICE [4] determines action safety using weakest preconditions derived from a learned neural environment model within a short time horizon H. However, extending H to cover the entire horizon of an RL task faces challenges due to the difficulty of constructing precise weakest precondition transformers for neural networks and the accumulation of approximation errors inherent in linearizing a neural environment model. Instead, VELM conducts formally verified exploration for RL agents, covering the entire horizon of an RL task through learned environment models.

6 Conclusion

In summary, we present VELM, a novel framework for ensuring verified safe exploration in model-based reinforcement learning. VELM learns environment models as symbolic formulas. Through formal reachability analysis over learned models, VELM constructs an online shielding layer that acts as a safeguard, confining RL agent exploration to a state space verified as safe in the learned model. The results of our experiments in various RL environments, alongside comparisons with state-of-the-art safe RL techniques, highlight the efficacy of VELM in significantly mitigating safety violations during online exploration while maintaining strong learning performance. VELM thus establishes a foundation for building trustworthy and secure RL systems capable of navigating complex environments while adhering to stringent safety constraints.