# Verifiably Safe Off-Model Reinforcement Learning

## Abstract

The desire to use reinforcement learning in safety-critical settings has inspired a recent interest in formal methods for learning algorithms. Existing formal methods for learning and optimization primarily consider the problem of constrained learning or constrained optimization. Given a single correct model and associated safety constraint, these approaches guarantee efficient learning while provably avoiding behaviors outside the safety constraint. Acting well given an accurate environmental model is an important pre-requisite for safe learning, but is ultimately insufficient for systems that operate in complex heterogeneous environments. This paper introduces verification-preserving model updates, the first approach toward obtaining formal safety guarantees for reinforcement learning in settings where multiple possible environmental models must be taken into account. Through a combination of inductive data and deductive proving with design-time model updates and runtime model falsification, we provide a first approach toward obtaining formal safety proofs for autonomous systems acting in heterogeneous environments.

## 1 Introduction

The desire to use reinforcement learning in safety-critical settings has inspired several recent approaches toward obtaining formal safety guarantees for learning algorithms. Formal methods are particularly desirable in settings such as self-driving cars, where testing alone cannot guarantee safety [22]. Recent examples of work on formal methods for reinforcement learning algorithms include justified speculative control [14], shielding [3], logically constrained learning [17], and constrained Bayesian optimization [16]. Each of these approaches provides formal safety guarantees for reinforcement learning and/or optimization algorithms by stating assumptions and specifications in a formal logic, generating monitoring conditions based upon specifications and environmental assumptions, and then leveraging these monitoring conditions to constrain the learning/optimization process to a known-safe subset of the state space.

Existing formal methods for learning and optimization consider the problem of constrained learning or constrained optimization [3, 14, 16, 17]. They address the question: assuming we have a single accurate environmental model with a given specification, how can we learn an efficient control policy respecting this specification?

Correctness proofs for control software in a single well-modeled environment are necessary but not sufficient for ensuring that reinforcement learning algorithms behave safely. Modern cyber-physical systems must perform a large number of subtasks in many different environments and must safely cope with situations that are not anticipated by system designers. These design goals motivate the use of reinforcement learning in safety-critical systems. Although some formal methods suggest ways in which formal constraints might be used to inform control even when modeling assumptions are violated [14], none of these approaches provide formal safety guarantees when environmental modeling assumptions are violated.

Holistic approaches toward safe reinforcement learning should provide formal guarantees even when a single, a priori model is not known at design time. We call this problem *verifiably safe off-model learning*. In this paper we introduce a first approach toward obtaining formal safety proofs for off-model learning. Our approach consists of two components: (1) a model synthesis phase that constructs a set of candidate models together with provably correct control software, and (2) a runtime model identification process that selects between available models at runtime in a way that preserves the safety guarantees of all candidate models.

Model update learning is initialized with a set of models. These models consist of a set of differential equations that model the environment, a control program for selecting actuator inputs, a safety property, and a formal proof that the control program constrains the overall system dynamics in a way that correctly ensures the safety property is never violated.

Instead of requiring the existence of a single accurate initial model, we introduce *model updates* as syntactic modifications of the differential equations and control logic of the model. We call a model update *verification-preserving* if there is a corresponding modification to the formal proof establishing that the modified control program continues to constrain the system of differential equations in a way that preserves the original model’s safety properties.

Verification-preserving model updates are inspired by the fact that different parts of a model serve different roles. The continuous portion of a model is often an assumption about how the world behaves, and the discrete portion of a model is derived from these equations and the safety property. For this reason, many of our updates inductively synthesize ODEs (i.e., in response to data from previous executions of the system) and then deductively synthesize control logic from the resulting ODEs and the safety objective.

Our contributions enabling verifiably safe off-model learning include: **(1)** A set of verification-preserving model updates (VPMUs) that systematically update differential equations, control software, and safety proofs in a way that preserves verification guarantees while taking into account possible deviations between an initial model and future system behavior. **(2)** A reinforcement learning algorithm, called model update learning (\(\mu \)learning), that explains how to transfer safety proofs for a set of feasible models to a learned policy. The learned policy will actively attempt to falsify models at runtime in order to reduce the safety constraints on actions. These contributions are evaluated on a set of hybrid systems control tasks. Our approach uses a combination of program repair, system identification, offline theorem proving, and model monitors to obtain formal safety guarantees for systems in which a single accurate model is not known at design time. This paper fully develops an approach based on an idea that was first presented in an invited vision paper on Safe AI for CPS by the authors [13].

The approach described in this paper is model-based but does not assume that a single correct model is known at design time. Model update learning allows for the possibility that all we can know at design time is that there are many feasible models, one of which might be accurate. Verification-preserving model updates then explain how a combination of data and theorem proving can be used at design time to enrich the set of feasible models.

We believe there is a rich space of approaches toward safe learning in-between model-free reinforcement learning (where formal safety guarantees are unavailable) and traditional model-based learning that assumes the existence of a single ideal model. This paper provides a first example of such an approach by leveraging inductive data and deductive proving at both design time and runtime.

The remainder of this paper is organized as follows. We first review the logical foundations underpinning our approach. We then introduce verification-preserving model updates and discuss how experimental data may be used to construct a set of explanatory models for the data. After discussing several model updates, we introduce the \(\mu \)learning algorithm that selects between models at runtime. Finally, we discuss case studies that validate both aspects of our approach. We close with a discussion of related work.

## 2 Background

This section reviews existing approaches toward safe on-model learning and discusses the fitness of each approach for obtaining guarantees about off-model learning. We then introduce the specification language and logic used throughout the rest of this paper.

Alshiekh et al. and Hasanbeig et al. propose approaches toward safe reinforcement learning based on Linear Temporal Logic [3, 17]. Alshiekh et al. synthesize monitoring conditions based upon a safety specification and an environmental abstraction. In this formalism, the goal of off-model learning is to systematically expand the environmental abstraction based upon both design-time insights about how the system’s behavior might change over time and based upon observed data at runtime. Jansen et al. extend the approach of Alshiekh et al. by observing that constraints should adapt whenever runtime data suggests that a safety constraint is too restrictive to allow progress toward an over-arching objective [20]. Herbert et al. address the problem of safe motion planning by using offline reachability analysis of pursuit-evasion games to pre-compute an overapproximate monitoring condition that then constrains online planners [9, 19].

The above-mentioned approaches have an implicit or explicit environmental model. Even when these environmental models are accurate, reinforcement learning is still necessary because these models focus exclusively on safety and are often nondeterministic. Resolving this nondeterminism in a way that is not only safe but is also effective at achieving other high-level objectives is a task that is well-suited to reinforcement learning.

We are interested in how to provide formal safety guarantees even when there is not a single accurate model available at design time. Achieving this goal requires two novel contributions. We must first find a way to generate a robust set of feasible models given some combination of an initial model and data on previous runs of the system (because formal safety guarantees are stated with respect to a model). Given such a set of feasible models, we must then learn how to safely identify which model is most accurate so that the system is not over-constrained at runtime.

To achieve these goals, we build on the safe learning work for a single model by Fulton et al. [14]. We choose this approach as a basis for verifiably safe learning because we are interested in safety-critical systems that combine discrete and continuous dynamics, because we would like to produce explainable models of system dynamics (e.g., systems of differential equations as opposed to large state machines), and, most importantly, because our approach requires the ability to systematically modify a model together with that model’s safety proof.

Following [14], we recall Differential Dynamic Logic [26, 27], a logic for verifying properties about safety-critical hybrid systems control software, the ModelPlex synthesis algorithm in this logic [25], and the KeYmaera X theorem prover [12] that will allow us to systematically modify models and proofs together.

Table 1. Hybrid programs

Program statement | Meaning |
---|---|
\(\alpha ;\beta \) | Sequentially composes \(\beta \) after \(\alpha \). |
\(\alpha \cup \beta \) | Executes either \(\alpha \) or \(\beta \) nondeterministically. |
\(\alpha ^*\) | Repeats \(\alpha \) zero or more times nondeterministically. |
\(x := \theta \) | Evaluates term \(\theta \) and assigns result to variable \(x\). |
\(x := *\) | Nondeterministically assigns an arbitrary real value to \(x\). |
\(\{x_1' = \theta _1, \ldots , x_n' = \theta _n \ \& \ F\}\) | Continuous evolution for any duration within domain \(F\). |
\(?F\) | Aborts if formula \(F\) is not true. |

*Hybrid Program Semantics.* The semantics of the hybrid programs described by Table 1 are given in terms of transitions between states [27, 28], where a state *s* assigns a real number *s*(*x*) to each variable *x*. We use \(s[\![t]\!]\) to refer to the value of a term *t* in a state *s*. The semantics of a program \(\alpha \), written \([\![\alpha ]\!]\), is the set of pairs \((s_1, s_2)\) for which state \(s_2\) is reachable by running \(\alpha \) from state \(s_1\). For example, \([\![x := t_1 \cup x := t_2]\!]\) is:

\[\{(s_1, s_2) \mid s_1 = s_2 \text { except } s_2(x) = s_1[\![t_1]\!]\} \cup \{(s_1, s_2) \mid s_1 = s_2 \text { except } s_2(x) = s_1[\![t_2]\!]\}\]

For a hybrid program \(\alpha \) and state *s*, \([\![\alpha ]\!](s)\) is the set of all states *t* such that \((s, t) \in [\![\alpha ]\!]\).

*Differential Dynamic Logic.* Differential dynamic logic (\(\mathsf {d}\mathcal {L}\)) [26, 27, 28] is the dynamic logic of hybrid programs. The logic associates with each hybrid program \(\alpha \) modal operators \([\alpha ]\) and \(\langle \alpha \rangle \), which express state reachability properties of \(\alpha \). The formula \([\alpha ] \phi \) states that the formula \(\phi \) is true in *all* states reachable by the hybrid program \(\alpha \), and the formula \(\langle \alpha \rangle \phi \) expresses that the formula \(\phi \) is true after *some* execution of \(\alpha \). The \(\mathsf {d}\mathcal {L}\) formulas are generated by the grammar

\[\phi ::= \theta _1 \backsim \theta _2 \mid \lnot \phi \mid \phi \wedge \psi \mid \phi \vee \psi \mid \phi \rightarrow \psi \mid \phi \leftrightarrow \psi \mid \forall x\, \phi \mid \exists x\, \phi \mid [\alpha ]\phi \mid \langle \alpha \rangle \phi \]

where \(\theta _i\) are arithmetic expressions over the reals, \(\phi \) and \(\psi \) are formulas, \(\alpha \) ranges over hybrid programs, and \(\backsim \) is a comparison operator \(=,\ne ,\ge ,>,\le ,<\). The quantifiers quantify over the reals. We denote by \(s \models \phi \) the fact that formula \(\phi \) is true in state *s*; e.g., we denote by \(s \models [\alpha ]\phi \) the fact that \((s, t) \in [\![\alpha ]\!]\) implies \(t \models \phi \) for all states *t*. Similarly, \(\vdash \phi \) denotes the fact that \(\phi \) has a proof in \(\mathsf {d}\mathcal {L}\). When \(\phi \) is true in every state (i.e., valid) we simply write \(\models \phi \).

### Example 1

*(Safety specification for straight-line car model).*

\[v \ge 0 \wedge A > 0 \rightarrow \left[ \left\{ \{a := A \cup a := 0\};\ \{p' = v,\ v' = a\} \right\}^* \right] v \ge 0\]

This formula states that if a car begins with a non-negative velocity, then it will also always have a non-negative velocity after repeatedly choosing new acceleration (*A* or 0), or coasting and moving for a nondeterministic period of time.
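The property in Example 1 can be spot-checked by simulation. The sketch below assumes the dynamics \(p' = v, v' = a\) with the controller choosing \(a \in \{A, 0\}\); the function name `simulate_car`, the position variable `p`, and the sampling of durations from \([0, 1]\) are assumptions made for illustration. Sampling executions is only a sanity check and is no substitute for the formal proof.

```python
import random
from typing import List


def simulate_car(v0: float, A: float, steps: int, rng: random.Random) -> List[float]:
    """Run the control loop `steps` times and return the velocity after each step.

    Each iteration picks a in {A, 0} and then follows p' = v, v' = a
    for a sampled duration, using the closed-form solution of the ODE.
    """
    p, v = 0.0, v0
    velocities = [v]
    for _ in range(steps):
        a = rng.choice([A, 0.0])       # ctrl: a := A  or  a := 0
        t = rng.uniform(0.0, 1.0)      # nondeterministic duration, sampled here
        p += v * t + 0.5 * a * t * t   # exact solution for position
        v += a * t                     # exact solution for velocity
        velocities.append(v)
    return velocities
```

Because \(A\) is non-negative and durations are non-negative, velocity never decreases along any sampled run, which matches the safety claim \(v \ge 0\).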

Throughout this paper, we will refer to sets of actions. An **action** is simply the effect of a loop-free deterministic discrete program without tests. For example, the programs \(a{:=}A\) and \(a {:=} 0\) are the actions available in the above program. Notice that **actions** can be equivalently thought of as mappings from variables to terms. We use the term action to refer to both the mappings themselves and the hybrid programs whose semantics correspond to these mappings. For an action *u*, we write *u*(*s*) to mean the effect of taking action *u* in state *s*; i.e., the unique state *t* such that \((s, t) \in [\![u]\!]\).

*ModelPlex.* Safe off-model learning requires noticing when a system deviates from model assumptions. Therefore, our approach depends upon the ability to check, at runtime, whether the current state of the system can be explained by a hybrid program.

The KeYmaera X theorem prover implements the ModelPlex algorithm [25]. For a given \(\mathsf {d}\mathcal {L}\) specification, ModelPlex constructs a correctness proof for monitoring conditions expressed as a formula of quantifier-free real arithmetic. The monitoring condition is then used to extract provably correct monitors that check whether observed transitions comport with modeling assumptions. ModelPlex can produce monitors that enforce models of control programs as well as monitors that check whether the model’s ODEs comport with observed state transitions.

ModelPlex *controller monitors* are boolean functions that return false if the controller portion of a hybrid systems model has been violated. A *controller monitor* for a model \(\{\texttt {ctrl}; \texttt {plant}\}^*\) is a function \(\texttt {cm} : \mathcal {S} \times \mathcal {A} \rightarrow \mathbb {B}\) from states \(\mathcal {S}\) and actions \(\mathcal {A}\) to booleans \(\mathbb {B}\) such that if \(\texttt {cm}(s,a)\) then \((s, a(s)) \in [\![\texttt {ctrl}]\!]\). We sometimes also abuse notation by using controller monitors as an implicit filter on \(\mathcal {A}\); i.e., \(\texttt {cm} : \mathcal {S} \rightarrow \wp (\mathcal {A})\) such that \(a \in \texttt {cm}(s)\) iff \(\texttt {cm}(s,a)\) is true.

ModelPlex also produces *model monitors*, which check whether the model is accurate. A *model monitor* for a safety specification \(\phi \rightarrow [\alpha ^*]\psi \) is a function \(\texttt {mm} : \mathcal {S} \times \mathcal {S} \rightarrow \mathbb {B}\) such that \((s_0, s) \in [\![\alpha ]\!]\) if \(\texttt {mm}(s_0, s)\). For the sake of brevity, we also define \(\texttt {mm} : \mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \mathbb {B}\) as the model monitor applied after taking an action (\(a \in \mathcal {A}\)) in a state and then following the plant in a model of form \(\alpha \equiv \texttt {ctrl};\texttt {plant}\). Notice that if the model has this canonical form and if \(\texttt {mm}(s, a, a(s))\) for an action *a*, then \(\texttt {cm}(s, a(s))\).
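To make the two monitor types concrete, the sketch below hand-writes a controller monitor and a model monitor for a time-aware straight-line car model. It is an illustration only: real ModelPlex monitors are synthesized and proved correct by KeYmaera X, whereas the `State` class, the constant `A`, the actions `accelerate` and `coast`, and the numeric tolerance `tol` are all assumptions introduced here.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class State:
    p: float  # position
    v: float  # velocity
    a: float  # acceleration chosen by the controller
    t: float  # local clock, reset by every control step (time-aware model)


Action = Callable[[State], State]
A = 2.0  # hypothetical maximum acceleration


def accelerate(s: State) -> State:
    return replace(s, a=A, t=0.0)


def coast(s: State) -> State:
    return replace(s, a=0.0, t=0.0)


def cm(s: State, u: Action) -> bool:
    """Controller monitor: is (s, u(s)) a transition of ctrl = (a := A  or  a := 0)?"""
    post = u(s)
    untouched = (post.p, post.v, post.t) == (s.p, s.v, 0.0)
    return untouched and post.a in (A, 0.0)


def mm(s: State, u: Action, nxt: State, tol: float = 1e-9) -> bool:
    """Model monitor: is nxt explained by ctrl followed by p' = v, v' = a, t' = 1?"""
    if not cm(s, u):
        return False
    post, dt = u(s), nxt.t
    return (
        dt >= 0.0
        and nxt.a == post.a
        and abs(nxt.v - (post.v + post.a * dt)) <= tol
        and abs(nxt.p - (post.p + post.v * dt + 0.5 * post.a * dt * dt)) <= tol
    )
```

An observation whose velocity did not change although the controller chose `accelerate` fails `mm`, which is the signal used later to discard a model.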

The KeYmaera X system is a theorem prover [12] that provides a language called Bellerophon for scripting proofs of \(\mathsf {d}\mathcal {L}\) formulas [11]. Bellerophon programs, called tactics, construct proofs of \(\mathsf {d}\mathcal {L}\) formulas. This paper proposes an approach toward updating models in a way that preserves safety proofs. Our approach simultaneously changes a system of differential equations, control software expressed as a discrete loop-free program, and the formal proof that the controller properly selects actuator values such that desired safety constraints are preserved throughout the flow of a system of differential equations.

## 3 Verification-Preserving Model Updates

A *verification-preserving model update* (VPMU) is a transformation of a hybrid program accompanied by a proof that the transformation preserves key safety properties [13]. VPMUs capture situations in which a model and/or a set of data can be updated in a way that captures possible runtime behaviors which are not captured by an existing model.

### Definition 1

**(VPMU).** A *verification-preserving model update* is a mapping which takes as input an initial \(\mathsf {d}\mathcal {L}\) formula \(\varphi \) with an associated Bellerophon tactic \(\texttt {e}\) that is a proof of \(\varphi \), and produces as output a new \(\mathsf {d}\mathcal {L}\) formula \(\psi \) and a new Bellerophon tactic \(\texttt {f}\) such that \(\texttt {f}\) is a proof of \(\psi \).

Before discussing our VPMU library, we consider how a set of feasible models computed using VPMUs can be used to provide verified safety guarantees for a family of reinforcement learning algorithms. The primary challenge is to maintain safety with respect to all feasible models while also avoiding overly conservative monitoring constraints. We address this challenge by falsifying some of these models at runtime.

## 4 Verifiably Safe RL with Multiple Models

VPMUs may be applied whenever system designers can characterize likely ways in which an existing model will deviate from reality. Although applying model updates at runtime is possible and sometimes makes sense, model updates are easiest to apply at design time because of the computational overhead of computing both model updates and corresponding proof updates. This section introduces model update learning, which explains how to take a set of models generated using VPMUs at design time to provide safety guarantees at runtime.

Model update learning is based on a simple idea: begin with a set of *feasible models* and act safely with respect to all feasible models. Whenever a model does not comport with observed dynamics, the model becomes infeasible and is therefore removed from the set of feasible models. We introduce two variations of \(\mu \)learning: a basic algorithm that chooses actions without considering the underlying action space, and an algorithm that prioritizes actions that rule out feasible models (adding an *eliminate* choice to the classical explore/exploit tradeoff [32]).

All \(\mu \)learning algorithms use monitored models; i.e., models equipped with ModelPlex controller monitors and model monitors.

### Definition 2

**(Monitored Model).** A **monitored model** is a tuple (*m*, *cm*, *mm*) such that *m* is a \(\mathsf {d}\mathcal {L}\) formula of the form

\[\texttt {init} \rightarrow [\{\texttt {ctrl}; \texttt {plant}\}^*]\,\texttt {safe}\]

where \(\texttt {ctrl}\) is a loop-free program, the entire formula *m* contains exactly one modality, and the formulas *cm* and *mm* are the control monitor and model monitor corresponding to *m*, as defined in Sect. 2.

Monitored models may have a continuous action space because of both tests and the nondeterministic assignment operator. We sometimes introduce additional assumptions on the structure of the monitored models. A monitored model over a finite action space is a monitored model where \([\![\texttt {ctrl}]\!](s)\) is finite for all \(s \in \mathcal {S}\). A time-aware monitored model is a monitored model whose differential equations contain a local clock which is reset at each control step.

Model update learning, or \(\mu \)learning, leverages verification-preserving model updates to maintain safety while selecting an appropriate environmental model. We now state and prove key safety properties about the \(\mu \)learning algorithm.

### Definition 3

**(**\(\mu \)**learning Process).** A learning process \(P_M\) for a finite set of monitored models *M* is defined as a tuple of countable sequences \((\mathbf U , \mathbf S , \mathbf {Mon})\) where \(\mathbf U \) are actions in a finite set of actions \(\mathcal {A}\) (i.e., mappings from variables to values), elements of the sequence \(\mathbf S \) are states, and \(\mathbf {Mon}\) are monitored models with \(\mathbf {Mon}_0 = M\). Let \(\texttt {specOK}_m(\mathbf U , \mathbf S , i) \equiv \texttt {mm}(\mathbf S _{i-1}, \mathbf U _{i-1}, \mathbf S _i) \rightarrow \texttt {cm}(\mathbf S _i, \mathbf U _i)\) where \(\texttt {cm}\) and \(\texttt {mm}\) are the monitors corresponding to the model *m*. Let *specOK* always return true for \(i=0\).

A \(\mu \)**learning process** is a learning process satisfying the following additional conditions: **(a)** action availability: in each state \(\mathbf S _i\) there is at least one action *u* such that for all \(m \in \mathbf {Mon}_i\), \(u \in \texttt {specOK}_m(\mathbf U ,\mathbf S ,i)\); **(b)** actions are safe for all feasible models: \(\mathbf U _{i+1} \in \{u \in \mathcal {A} \,|\, \forall (m,\texttt {cm},\texttt {mm}) \in \mathbf {Mon}_i , \texttt {cm}(\mathbf S _{i}, u) \}\); **(c)** feasible models remain in the feasible set: if \((\varphi , \texttt {cm}, \texttt {mm}) \in \mathbf {Mon}_i\) and \(\texttt {mm}(\mathbf S _{i}, \mathbf U _{i}, \mathbf S _{i+1})\) then \((\varphi , \texttt {cm}, \texttt {mm}) \in \mathbf {Mon}_{i+1}\).

Note that \(\mu \)learning processes are defined over an environment \(E : \mathcal {A} \times \mathcal {S} \rightarrow \mathcal {S}\) that determines the sequences \(\mathbf U \) and \(\mathbf S \)^{1}, so that \(\mathbf S _{i+1} = E(\mathbf U _i, \mathbf S _i)\). In our algorithms, the set \(\mathbf {Mon}_{i}\) never retains elements that are inconsistent with the observed dynamics at the previous state. We refer to the set of models in \(\mathbf {Mon}_i\) as the set of feasible models for the \(i^{\text {th}}\) state in a \(\mu \)learning process.

Notice that the safe actions constraint is not effectively checkable without extra assumptions on the range of parameters. Two canonical choices are discretizing options for parameters or including an effective identification process for parameterized models.

Our safety theorem focuses on time-aware \(\mu \)learning processes, i.e., those whose models are all time-aware; similarly, a *finite action space* \(\mu \)*learning process* is a \(\mu \)learning process in which all models \(m\in M\) have a finite action space. The basic correctness property for a \(\mu \)learning process is the safe reinforcement learning condition: the system never takes unsafe actions.

### Definition 4

**(**\(\mu \)**learning process with an accurate model).** Let \(P_M = (\mathbf S , \mathbf U , \mathbf {Mon})\) be a \(\mu \)learning process. Assume there is some element \(m^* \in \mathbf {Mon}_0\) with the following properties. First, \(m^* \equiv (\texttt {init} \rightarrow [\{\texttt {ctrl}; \texttt {plant}\}^*]\,\texttt {safe})\). Second, \(\vdash m^*\). Third, \((s, u(s)) \in [\![\texttt {ctrl}]\!]\) implies \((u(s), E(s, u)) \in [\![\texttt {plant}]\!]\) for a mapping \(E : \mathcal {S} \times \mathcal {A} \rightarrow \mathcal {S}\) from states and actions to new states called environment. When only one element of \(\mathbf {Mon}_0\) satisfies these properties we call that element \(m^*\) the *distinguished* and/or *accurate* model and say that the process \(P_M\) is *accurately modeled with respect to E*.

We will often elide the environment *E* for which the process \(P_M\) is accurate when it is obvious from context.

### Theorem 1

**(Safety).** If \(P_M\) is a \(\mu \)learning process with an accurate model, then \(\mathbf S _i \models \texttt {safe}\) for all \(0< i < |\mathbf S |\).

Listing 1.1 presents the \(\mu \)learning algorithm. The inputs are: **(a)** a set \(\texttt {M}\) of models, each with an associated function \(\texttt {m.models} : \mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \mathbb {B}\) that evaluates its model monitor on the given previous state, action, and next state, and a method \(\texttt {m.safe} : \mathcal {S} \times \mathcal {A} \rightarrow \mathbb {B}\) that evaluates its controller monitor; **(b)** an action space \(\texttt {A}\) and an initial state \(\texttt {init}\in S\); **(c)** an environment function \(\texttt {env} : \mathcal {S} \times \mathcal {A} \rightarrow \mathcal {S} \times \mathbb {R}\) that computes state updates and rewards in response to actions; and **(d)** a function \(\texttt {choose} : \wp (\mathcal {A})\rightarrow \mathcal {A}\) that selects an action from a set of available actions, together with a function \(\texttt {update}\) that updates a table or approximation. Our approach is generic and works for any reinforcement learning algorithm; therefore, we leave these functions abstract. The algorithm augments an existing reinforcement learning algorithm, defined by \(\texttt {update}\) and \(\texttt {choose}\), by restricting the action space at each step so that actions are only taken if they are safe with respect to *all* feasible models. The feasible model set is updated at each control step by removing models that are in conflict with observed data.
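The loop just described can be sketched in a few lines. This is a minimal illustration written from the prose description, not the authors' listing: the input names `M`, `A`, `init`, `env`, `choose`, and `update` follow the text, while the `Model` class, the `steps` bound, the argument order of `update`, and the toy one-dimensional environment are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Model:
    name: str
    safe: Callable[[int, int], bool]         # controller monitor cm(s, u)
    models: Callable[[int, int, int], bool]  # model monitor mm(s, u, s_next)


def mu_learn(M: Sequence[Model], A: Sequence[int], init: int,
             env: Callable[[int, int], Tuple[int, float]],
             choose: Callable[[List[int]], int],
             update: Callable[[int, int, int, float], None],
             steps: int) -> Tuple[int, List[Model]]:
    feasible, s = list(M), init
    for _ in range(steps):
        # Only actions accepted by the controller monitor of every feasible model.
        safe_actions = [u for u in A if all(m.safe(s, u) for m in feasible)]
        if not safe_actions:
            break  # action availability is assumed by the definition; stop otherwise
        u = choose(safe_actions)
        nxt, reward = env(s, u)
        update(s, u, nxt, reward)
        # Discard every model whose model monitor rejects the observed transition.
        feasible = [m for m in feasible if m.models(s, u, nxt)]
        s = nxt
    return s, feasible


# Toy setting (hypothetical): move right on a line without reaching x = 10.
slow = Model("slow", lambda s, u: s + u < 10, lambda s, u, n: n == s + u)
fast = Model("fast", lambda s, u: s + 2 * u < 10, lambda s, u, n: n == s + 2 * u)
final_state, remaining = mu_learn(
    [slow, fast], [1, 2], 0,
    env=lambda s, u: (s + u, 1.0),       # the real dynamics match `slow`
    choose=max, update=lambda *args: None, steps=20)
```

In this toy run the `fast` model is falsified by the first observed transition, after which only the less conservative constraint of `slow` restricts the action set.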

### 4.1 Active Verified Model Update Learning

Removing models from the set of possible models relaxes the monitoring condition, allowing less conservative and more accurate control decisions. Therefore, this section introduces an active learning refinement of the \(\mu \)learning algorithm that prioritizes taking actions that help rule out models \(m \in M\) that are not \(m^*\). Instead of choosing a random safe action, \(\mu \)learning prioritizes actions that differentiate between available models. We begin by explaining what it means for an algorithm to perform good experiments.

### Definition 5

**(Active Experimentation).** A \(\mu \)learning process with an accurate model \(m^*\) has *locally active experimentation* provided that: if \(|\mathbf {Mon}_{i}|>1\) and there exists an action *a* that is safe for all feasible models (see Definition 3) in state \(s_i\) such that taking action *a* results in the removal of some model *m* from the model set^{2}, then \(|\mathbf {Mon}_{i+1}| < |\mathbf {Mon}_{i}|\). Experimentation is \(\mathtt {er}\)-*active* if the following holds: whenever there exists an action *a* that is safe for all feasible models (see Definition 3) in state \(s_i\) and taking action *a* results in the removal of some model *m* from the model set, then \(|\mathbf {Mon}_{i+1}| < |\mathbf {Mon}_{i}|\) with probability \(0< \texttt {er} < 1\).

### Definition 6

**(Distinguishing Actions).** Consider a \(\mu \)learning process \((\mathbf U ,\mathbf S ,\mathbf {Mon})\) with an accurate model \(m^*\) (see Definition 4). An action *a* *distinguishes m from* \(m^*\) if \(a = \mathbf U _i\), \(m \in \mathbf {Mon}_i\) and \(m \not \in \mathbf {Mon}_{i+1}\) for some \(i>0\).

The *active* \(\mu \)*learning algorithm* uses model monitors to select distinguishing actions, thereby performing active experiments which winnow down the set of feasible models. The inputs to active-\(\mu \)learn are the same as those to Listing 1.1 with two additions: **(1)** models are augmented with an additional prediction method \(\texttt {p}\) that returns the model’s prediction of the next state given the current state, a candidate action, and a time duration; **(2)** an elimination rate \(\texttt {er}\) is introduced, which plays a role similar to the classical explore-exploit rate except that we are now deciding whether to insist on choosing a good experiment. The active-\(\mu \)learn algorithm is guaranteed to make some progress toward winnowing down the feasible model set whenever \(0< \texttt {er} < 1\).
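One way to read the role of the prediction method and the elimination rate is sketched below. This is a hypothetical rendering, not the authors' algorithm: an action is treated as a candidate experiment when the feasible models disagree about the predicted next state, and with probability `er` the choice is restricted to such actions. The names `PredictiveModel`, `is_experiment`, and `choose_active`, and the disagreement criterion itself, are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PredictiveModel:
    name: str
    p: Callable[[float, float, float], float]  # p(state, action, duration) -> predicted state


def is_experiment(models: Sequence[PredictiveModel], s: float, u: float, dt: float) -> bool:
    """True if at least two feasible models predict different next states for u."""
    return len({m.p(s, u, dt) for m in models}) > 1


def choose_active(models: Sequence[PredictiveModel], safe_actions: List[float],
                  s: float, dt: float, er: float,
                  choose: Callable[[List[float]], float],
                  rng: random.Random) -> float:
    """With probability er insist on a distinguishing action, else defer to `choose`."""
    experiments = [u for u in safe_actions if is_experiment(models, s, u, dt)]
    if experiments and rng.random() < er:
        return choose(experiments)
    return choose(safe_actions)
```

The caller is expected to pass only actions that already passed the controller monitors of all feasible models, so restricting to experiments never weakens the safety constraint.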

### Theorem 2

Let \(P_M = (\mathbf S , \mathbf U , \mathbf {Mon})\) be a finite action space \(\mu \)learning process with an accurate model \(m^*\). Then \(m^* \in \mathbf {Mon}_i\) for all \(0 \le i \le |\mathbf {Mon}|\).

### Theorem 3

Let \(P_M\) be a finite action space \(\texttt {er}\)-active \(\mu \)learning process under environment *E* and with an accurate model \(m^*\). Consider any model \(m \in \mathbf {Mon}_0\) such that \(m \not = m^*\). If every state *s* has an action \(a_s\) that is safe for all models and distinguishes *m* from \(m^*\), then \(\lim _{i \rightarrow \infty } \text {Pr}(m \not \in \mathbf {Mon}_i) = 1\).

### Corollary 1

Let \(P_M = (\mathbf S , \mathbf U , \mathbf {Mon})\) be a finite action space \(\texttt {er}\)-active \(\mu \)learning process under environment *E* and with an accurate model \(m^*\). If *each* model \(m \in \mathbf {Mon}_0 \setminus \{m^*\}\) has in each state *s* an action \(a_s\) that is safe for all models and distinguishes *m* from \(m^*\), then \(\mathbf {Mon}\) converges to \(\{m^*\}\) a.s.

Although locally active experimentation is not strong enough to ensure that \(P_M\) eventually converges to a minimal set of models^{3}, our experimental validation demonstrates that this heuristic is nonetheless effective on some representative examples of model update learning problems.

## 5 A Model Update Library

So far, we have established how to obtain safety guarantees for reinforcement learning algorithms given a set of formally verified \(\mathsf {d}\mathcal {L}\) models. We now turn our attention to the problem of generating such a set of models by systematically modifying \(\mathsf {d}\mathcal {L}\) formulas and their corresponding Bellerophon tactical proof scripts. This section introduces five generic model updates that provide a representative sample of the kinds of computations that can be performed on models and proofs to predict and account for runtime model deviations^{4}.

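The **Parameter Instantiation** update replaces parameters with concrete values. For example, consider a program *p* modeling a car whose acceleration depends upon both a known control input *accel* and parametric values for maximum braking force \(-B\) and maximum acceleration *A*. Its proof is a Bellerophon tactic. The update replaces *B* and *A* with concrete values in both the model and the tactic.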
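At the syntactic level, instantiating parameters amounts to applying the same substitution to the model and to its proof script. The sketch below treats both as plain strings, which is an assumption made for illustration: the `VerifiedModel` class, the function `instantiate_parameters`, and the example strings are hypothetical and are not KeYmaera X or Bellerophon syntax. A real update would also re-check the resulting tactic in the prover.

```python
import re
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VerifiedModel:
    formula: str  # model text (placeholder syntax)
    tactic: str   # proof script text (placeholder syntax)


def instantiate_parameters(vm: VerifiedModel, values: Mapping[str, float]) -> VerifiedModel:
    """Replace each parameter name by a concrete value in both model and tactic."""
    def substitute(text: str) -> str:
        for name, value in values.items():
            # Word boundaries keep `A` from matching inside longer identifiers.
            text = re.sub(rf"\b{re.escape(name)}\b", f"({value})", text)
        return text
    return VerifiedModel(substitute(vm.formula), substitute(vm.tactic))
```

Applying the identical substitution to both components is what keeps the updated tactic aligned with the updated model.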

The **Automatic Parameter Instantiation** update improves the basic parameter instantiation update by automatically detecting which variables are parameters and then constraining instantiation of parameters by identifying relevant initial conditions.

The **Replace Worst-Case Bounds with Approximations** update improves models designed for the purpose of safety verification. Often a variable occurring in the system is bounded above (or below) by its worst-case value. Worst-case analyses are sufficient for establishing safety but are often overly conservative. The approximation model update replaces worst-case bounds with approximate bounds obtained via series expansions. The proof update then introduces a tactic on each branch of the proof that establishes that our approximations are upper/lower bounds.

Models often assume perfect sensing and actuation. A common way of robustifying a model is to add a piecewise constant noise term to the system’s dynamics. Doing so while maintaining safety invariants requires also updating the controller so that safety envelope computations incorporate this noise term. The **Add Disturbance Term** update introduces noise terms to differential equations, systematically updates controller guards, and modifies the proof accordingly.

Uncertainty in object classification is naturally modeled in terms of sets of feasible models. In the simplest case, a robot might need to avoid an obstacle that is either static, moves in a straight line, or moves sinusoidally. Our generic model update library contains an update that changes the model by making a static point (*x*, *y*) dynamic. For example, one such update introduces the equations \(\{x'=-y, y'=x\}\) to a system of differential equations in which the variables *x*, *y* do not have differential equations. The controller is updated so that any statements about separation between (*a*, *b*) and (*x*, *y*) require global separation of (*a*, *b*) from the circle on which (*x*, *y*) moves. The proof is also updated by prepending a sequence of differential cuts that characterize circular motion to the first occurrence of a differential tactic on each branch.
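The role of the circular-motion invariant can be illustrated numerically. The following is a minimal sketch, not the update's implementation: it assumes the rotation field \(x'=-y, y'=x\) (the sign convention is chosen so that \(x^2+y^2\) is conserved), and the function names are hypothetical. It checks that the derivative of \(x^2+y^2\) along the field vanishes, which is the fact a differential cut would establish, and computes the separation of a point (*a*, *b*) from the circle on which (*x*, *y*) moves.

```python
import math

def circular_field(x, y):
    """Assumed dynamics x' = -y, y' = x (rotation about the origin)."""
    return -y, x

def lie_derivative_of_radius_sq(x, y):
    """Derivative of x^2 + y^2 along the field: 2*x*x' + 2*y*y'."""
    dx, dy = circular_field(x, y)
    return 2 * x * dx + 2 * y * dy

def rk4_step(x, y, dt):
    """One classical Runge-Kutta step for the field above."""
    k1 = circular_field(x, y)
    k2 = circular_field(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
    k3 = circular_field(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
    k4 = circular_field(x + dt * k3[0], y + dt * k3[1])
    x_next = x + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
    y_next = y + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return x_next, y_next

def simulate(x, y, dt, steps):
    """Integrate the obstacle position for a number of steps."""
    for _ in range(steps):
        x, y = rk4_step(x, y, dt)
    return x, y

def distance_to_circle(a, b, x, y):
    """Distance from (a, b) to the origin-centered circle through (x, y)."""
    return abs(math.hypot(a, b) - math.hypot(x, y))
```

Because the obstacle never leaves its circle, a guard phrased in terms of `distance_to_circle` holds for every future obstacle position, which is why the updated controller can require global separation from the circle rather than from the current point.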

Model updates also provide a framework for characterizing algorithms that combine model identification and controller synthesis. One example is our synthesis algorithm for systems whose ODEs have solutions in a decidable fragment of real arithmetic (a subset of linear ODEs). Unlike other model updates, we do not assume that any initial model is provided; instead, we learn a model (and associated control policy) entirely from data. The **Learn Linear Dynamics** update takes as input: (1) data from previous executions of the system, and (2) a desired safety constraint. From these two inputs, the update computes a set of differential equations \(\texttt {odes}\) that comport with prior observations, a corresponding controller \(\texttt {ctrl}\) that enforces the desired safety constraint with corresponding initial conditions \(\texttt {init}\), and a Bellerophon tactic \(\texttt {prf}\) which proves that the resulting system satisfies the safety constraint. Computing the model requires an exhaustive search of the space of possible ODEs followed by a computation of a safe control policy using solutions to the resulting ODEs. Once a correct controller is computed, the proof proceeds by symbolically decomposing the control program and solving the ODEs on each resulting control branch. The full mechanism is beyond the scope of this paper but explained in detail elsewhere [10, Chapter 9].
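The search step can be pictured with a deliberately small example. The sketch below is an assumption-laden toy, not the algorithm of [10, Chapter 9]: it restricts the candidate space to systems of the form \(pos' = vel, vel' = acc\) with a constant `acc` drawn from a finite candidate set, and keeps every candidate whose closed-form (polynomial) solution reproduces an observed trace. The function names, the trace format, and the tolerance are all hypothetical.

```python
def solve(pos0, vel0, acc, t):
    """Closed-form solution of pos' = vel, vel' = acc after time t."""
    return pos0 + vel0 * t + 0.5 * acc * t * t, vel0 + acc * t

def consistent_accelerations(trace, candidates, tol=1e-6):
    """Exhaustively test each candidate acc against a trace.

    trace is a list of (time, pos, vel) samples; the first sample is
    treated as the initial condition.
    """
    t0, pos0, vel0 = trace[0]
    keep = []
    for acc in candidates:
        matches = True
        for t, pos, vel in trace[1:]:
            pred_pos, pred_vel = solve(pos0, vel0, acc, t - t0)
            if abs(pred_pos - pos) > tol or abs(pred_vel - vel) > tol:
                matches = False
                break
        if matches:
            keep.append(acc)
    return keep

# Hypothetical observations generated by a system with acc = -2.
trace = [(0.0, 0.0, 10.0), (1.0, 9.0, 8.0), (2.0, 16.0, 6.0)]
feasible = consistent_accelerations(trace, range(-5, 6))
```

Restricting attention to ODEs with polynomial solutions is what keeps the comparison against data, and the later safety computation, expressible in real arithmetic.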

*Significance of Selected Updates.* The updates described in this section demonstrate several possible modes of use for VPMUs and \(\mu \)learning. VPMUs can update existing models to account for systematic modeling errors (e.g., missing actuator noise or changes in the dynamical behavior of obstacles). VPMUs can automatically optimize control logic in a proof-preserving fashion. VPMUs can also be used to generate accurate models and corresponding controllers from experimental data made available at design time, without access to any prior model of the environment.

## 6 Experimental Validation

The \(\mu \)learning algorithms introduced in this paper are designed to answer the following question: given a set of possible models that contains the one true model, how can we *safely* perform a set of experiments that allow us to efficiently discover a minimal safety constraint? In this section we present two experiments which demonstrate the use of \(\mu \)learning in safety-critical settings. Overall, these experiments empirically validate our theorems by demonstrating that \(\mu \)learning processes with accurate models do not violate safety constraints.

Our simulations use a conservative discretization of the hybrid systems models, and we translated monitoring conditions by hand into Python from ModelPlex’s C output. Although we evaluate our approach in a research prototype implemented in Python for the sake of convenience, there is a verified compilation pipeline for these models that eliminates the uncertainty introduced by discretization and hand-translation [7].

**Adaptive Cruise Control.** Adaptive Cruise Control (ACC) is a common feature in new cars. ACC systems change the speed of the car in response to changes in the speed of traffic in front of the car; e.g., if the car in front of an ACC-enabled car begins slowing down, then the ACC system will decelerate to match the velocity of the leading car. Our first set of experiments considers a simple linear model of ACC in which the acceleration set-point is perturbed by an unknown parameter *p*; i.e., the relative position of the two vehicles is determined by the equations \(\text {pos}_{\text {rel}}' = \text {vel}_{\text {rel}}, \text {vel}_{\text {rel}}' = \text {acc}_{\text {rel}} \).

[…] **Add Noise Term** and **Parameter Instantiation** updates; we outperform in terms of both avoiding unsafe states and in terms of cumulative reward. These two updates allow us to insert a multiplicative noise term *p* into these equations, synthesize a provably correct controller, and then choose the correct value for this noise term at runtime. Unlike [14], \(\mu \)learning avoids all safety violations. The graph in Fig. 1 compares the Justified Speculative Control (JSC) approach of [14] to our approach in terms of cumulative reward; in addition to substantially outperforming the JSC algorithm of [14], \(\mu \)learning also avoids 204 more crashes throughout a 1,000-episode training process.
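The interplay between a set of candidate models and runtime observations can be sketched as follows. This is a simplified illustration and not the prototype used in the experiments: it assumes that the parameter enters as \(\text {vel}_{\text {rel}}' = p \cdot \text {acc}_{\text {rel}}\), that the acceleration is constant over one control cycle of length `dt`, and that safety means the predicted gap stays above a hypothetical `min_gap`. An action is allowed only if it is safe under every model that has not yet been falsified, and a model is discarded once its prediction disagrees with the observed next state.

```python
def predict(state, acc_rel, p, dt):
    """Exact one-cycle solution of pos' = vel, vel' = p * acc_rel."""
    pos_rel, vel_rel = state
    a = p * acc_rel
    return pos_rel + vel_rel * dt + 0.5 * a * dt * dt, vel_rel + a * dt

def falsify(models, state, acc_rel, observed, dt, tol=1e-6):
    """Keep only the candidate values of p that explain the observation."""
    return [
        p for p in models
        if all(abs(u - v) <= tol
               for u, v in zip(predict(state, acc_rel, p, dt), observed))
    ]

def safe_actions(models, state, actions, dt, min_gap):
    """Actions whose predicted gap exceeds min_gap under every model."""
    return [
        a for a in actions
        if all(predict(state, a, p, dt)[0] > min_gap for p in models)
    ]

# Hypothetical scenario: three candidate values of p, true value 1.5.
models = [0.5, 1.0, 1.5]
state = (10.0, 0.0)
actions = [-4.0, 0.0, 2.0]
allowed_before = safe_actions(models, state, actions, dt=1.0, min_gap=8.5)
observed = predict(state, 0.0 + -2.0, 1.5, 1.0)  # what the true system does
remaining = falsify(models, state, -2.0, observed, dt=1.0)
```

In this toy scenario the action set is most restrictive while all three candidates are still feasible; as candidates are falsified, the set of actions that are safe under every remaining model can only grow.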

**A Hierarchical Problem.** Model update learning can be extended to provide formal guarantees for hierarchical reinforcement learning algorithms [6]. If each feasible model *m* corresponds to a subtask, and if all states satisfying termination conditions for subtask \(m_i\) are also safe initial states for any subtask \(m_j\) reachable from \(m_i\), then \(\mu \)learning directly supports safe hierarchical reinforcement learning by re-initializing *M* to the initial (maximal) model set whenever reaching a termination condition for the current subtask.

We implemented a variant of \(\mu \)learning that performs this re-initialization and validated this algorithm in an environment where a car must first navigate an intersection containing another car and then must avoid a pedestrian in a crosswalk (as illustrated in Fig. 1). In the crosswalk case, the pedestrian at \((ped_x, ped_y)\) may either continue to walk along a sidewalk indefinitely or may enter the crosswalk at some point between \(c_{min} \le ped_y \le c_{max}\) (the boundaries of the crosswalk). This case study demonstrates that safe hierarchical reinforcement learning is simply safe \(\mu \)learning with safe model re-initialization.

## 7 Related Work

Related work falls into three broad categories: safe reinforcement learning, runtime falsification, and program synthesis.

Our approach toward safe reinforcement learning differs from existing approaches that do not include a formal verification component (e.g., as surveyed by García and Fernández [15] and the SMT-based constrained learning approach of Junges et al. [21]) because we focus on *verifiably* safe learning; i.e., instead of relying on oracles or conjectures, constraints are derived in a provably correct way from formally verified safety proofs. The difference between verifiably safe learning and safe learning is significant, and is equivalent to the difference between verified and unverified software. Unlike most existing approaches, our safety guarantees apply to both the learning process and the final learned policy.

Section 2 discusses how our work relates to the few existing approaches toward *verifiably* safe reinforcement learning. Unlike those approaches [3, 14, 17, 20], as well as work on model checking and verification for MDPs [18], we introduce an approach toward verifiably safe off-model learning. Our approach is the first to combine model synthesis at design time with model falsification at runtime so that safety guarantees capture a wide range of possible futures instead of relying on a single accurate environmental model. Safe off-model learning is an important problem because autonomous systems must be able to cope with unanticipated scenarios. Ours is the first approach toward verifiably safe off-model learning.

Several recent papers focus on providing safety guarantees for model-free reinforcement learning. Trust Region Policy Optimization [31] defines safety as monotonic policy improvement, a much weaker notion of safety than the constraints guaranteed by our approach. Constrained Policy Optimization [1] extends TRPO with guarantees that an agent nearly satisfies safety constraints during learning. Brázdil et al. [8] give probabilistic guarantees by performing a heuristic-driven exploration of the model. Our approach is model-based instead of model-free, and instead of focusing on learning safely without a model we focus on identifying accurate models from data obtained both at design time and at runtime. Learning concise dynamical systems representations has one substantial advantage over model-free methods: safety guarantees are stated with respect to an explainable model that captures the safety-critical assumptions about the system’s dynamics. Synthesizing explainable models is important because safety guarantees are always stated with respect to a model; therefore, engineers must be able to understand inductively synthesized models in order to understand what safety properties their systems do (and do not) ensure.

Akazaki et al. propose an approach, based on deep reinforcement learning, for efficiently discovering defects in models of cyber-physical systems with specifications stated in signal temporal logic [2]. Model falsification is an important component of our approach; however, unlike Akazaki et al., we also propose an approach toward obtaining more robust models and explain how runtime falsification can be used to obtain safety guarantees for off-model learning.

Our approach includes a model synthesis phase that is closely related to program synthesis and program repair algorithms [23, 24, 29]. Relative to work on program synthesis and repair, VPMUs are unique in several ways. We are the first to explore *hybrid* program repair. Our approach combines program verification with mutation. We treat programs as *models* in which one part of the model is varied according to interactions with the environment and another part of the model is systematically derived (together with a correctness proof) from these changes. This separation of the dynamics into inductively synthesized models and deductively synthesized controllers enables our approach toward using programs as representations of dynamic safety constraints during reinforcement learning.

Although we are the first to explore hybrid program repair, several researchers have explored the problem of synthesizing hybrid systems from data [5, 30]. This work is closely related to our **Learn Linear Dynamics** update. Sadraddini and Belta provide formal guarantees for data-driven model identification and controller synthesis [30]. Relative to this work, our **Learn Linear Dynamics** update is continuous-time and synthesizes a computer-checked correctness proof, but does not consider the full class of linear ODEs. Unlike Asarin et al. [5], our full set of model updates is sometimes capable of synthesizing nonlinear dynamical systems from data (e.g., the static \(\rightarrow \) circular update) and produces computer-checked correctness proofs for permissive controllers.

## 8 Conclusions

This paper introduces an approach toward verifiably safe off-model learning that uses a combination of design-time verification-preserving model updates and runtime model update learning to provide safety guarantees even when there is no single accurate model available at design time. We introduced a set of model updates that capture common ways in which models can deviate from reality, and introduced an update that is capable of synthesizing ODEs and provably correct controllers without access to an initial model. Finally, we proved safety and efficiency theorems for active \(\mu \)learning and evaluated our approach on some representative examples of hybrid systems control tasks. Together, these contributions constitute a first approach toward verifiably safe off-model learning.

## Footnotes

- 1. Throughout the paper, we denote by **S** a specific sequence of states and by \(\mathcal {S}\) the set of all states.
- 2. We say that taking action \(a_i\) in state \(s_i\) results in the removal of a model *m* from the model set if \(m \in \mathbf {Mon}_i\) but \(m \not \in \mathbf {Mon}_{i+1}\).
- 3. […] with the parameters \(F=0\), \(F=5\), and \(F=x\) are a counterexample [10, Section 8.4.4].
- 4. Extended discussion of these model updates is available in [10, Chapters 8 and 9].

## References

- 1. Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Proceedings of Machine Learning Research, vol. 70, pp. 22–31. PMLR (2017)
- 2. Akazaki, T., Liu, S., Yamagata, Y., Duan, Y., Hao, J.: Falsification of cyber-physical systems using deep reinforcement learning. In: Havelund, K., Peleska, J., Roscoe, B., de Vink, E. (eds.) FM 2018. LNCS, vol. 10951, pp. 456–465. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-95582-7_27
- 3. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018). AAAI Press (2018)
- 4. Alur, R., Courcoubetis, C., Henzinger, T.A., Ho, P.-H.: Hybrid automata: an algorithmic approach to the specification and verification of hybrid systems. In: Grossman, R.L., Nerode, A., Ravn, A.P., Rischel, H. (eds.) HS 1991–1992. LNCS, vol. 736, pp. 209–229. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-57318-6_30
- 5. Asarin, E., Bournez, O., Dang, T., Maler, O., Pnueli, A.: Effective synthesis of switching controllers for linear systems. Proc. IEEE **88**(7), 1011–1025 (2000)
- 6. Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst. **13**(1–2), 41–77 (2003)
- 7. Bohrer, B., Tan, Y.K., Mitsch, S., Myreen, M.O., Platzer, A.: VeriPhy: verified controller executables from verified cyber-physical system models. In: Grossman, D. (ed.) Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018), pp. 617–630. ACM (2018)
- 8. Brázdil, T., et al.: Verification of Markov decision processes using learning algorithms. In: Cassez, F., Raskin, J.-F. (eds.) ATVA 2014. LNCS, vol. 8837, pp. 98–114. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11936-6_8
- 9. Fridovich-Keil, D., Herbert, S.L., Fisac, J.F., Deglurkar, S., Tomlin, C.J.: Planning, fast and slow: a framework for adaptive real-time safe trajectory planning. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 387–394 (2018)
- 10. Fulton, N.: Verifiably safe autonomy for cyber-physical systems. Ph.D. thesis, Computer Science Department, School of Computer Science, Carnegie Mellon University (2018)
- 11. Fulton, N., Mitsch, S., Bohrer, B., Platzer, A.: Bellerophon: tactical theorem proving for hybrid systems. In: Ayala-Rincón, M., Muñoz, C.A. (eds.) ITP 2017. LNCS, vol. 10499, pp. 207–224. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66107-0_14
- 12. Fulton, N., Mitsch, S., Quesel, J.-D., Völp, M., Platzer, A.: KeYmaera X: an axiomatic tactical theorem prover for hybrid systems. In: Felty, A.P., Middeldorp, A. (eds.) CADE 2015. LNCS (LNAI), vol. 9195, pp. 527–538. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21401-6_36
- 13. Fulton, N., Platzer, A.: Safe AI for CPS (invited paper). In: IEEE International Test Conference (ITC 2018) (2018)
- 14. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward safe control through proof and learning. In: McIlraith, S., Weinberger, K. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), pp. 6485–6492. AAAI Press (2018)
- 15. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. **16**, 1437–1480 (2015)
- 16. Ghosh, S., Berkenkamp, F., Ranade, G., Qadeer, S., Kapoor, A.: Verifying controllers against adversarial examples with Bayesian optimization. CoRR abs/1802.08678 (2018)
- 17. Hasanbeig, M., Abate, A., Kroening, D.: Logically-correct reinforcement learning. CoRR abs/1801.08099 (2018)
- 18. Henriques, D., Martins, J.G., Zuliani, P., Platzer, A., Clarke, E.M.: Statistical model checking for Markov decision processes. In: QEST, pp. 84–93. IEEE Computer Society (2012). https://doi.org/10.1109/QEST.2012.19
- 19. Herbert, S.L., Chen, M., Han, S., Bansal, S., Fisac, J.F., Tomlin, C.J.: FaSTrack: a modular framework for fast and guaranteed safe motion planning. In: IEEE Annual Conference on Decision and Control (CDC)
- 20. Jansen, N., Könighofer, B., Junges, S., Bloem, R.: Shielded decision-making in MDPs. CoRR abs/1807.06096 (2018)
- 21. Junges, S., Jansen, N., Dehnert, C., Topcu, U., Katoen, J.-P.: Safety-constrained reinforcement learning for MDPs. In: Chechik, M., Raskin, J.-F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 130–146. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_8
- 22. Kalra, N., Paddock, S.M.: Driving to Safety: How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability? RAND Corporation, Santa Monica (2016)
- 23. Kitzelmann, E.: Inductive programming: a survey of program synthesis techniques. In: Schmid, U., Kitzelmann, E., Plasmeijer, R. (eds.) AAIP 2009. LNCS, vol. 5812, pp. 50–73. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-11931-6_3
- 24. Le Goues, C., Nguyen, T., Forrest, S., Weimer, W.: Genprog: a generic method for automatic software repair. IEEE Trans. Softw. Eng. **38**(1), 54–72 (2012)
- 25. Mitsch, S., Platzer, A.: ModelPlex: verified runtime validation of verified cyber-physical system models. Form. Methods Syst. Des. **49**(1), 33–74 (2016). Special issue of selected papers from RV’14
- 26. Platzer, A.: Differential dynamic logic for hybrid systems. J. Autom. Reas. **41**(2), 143–189 (2008)
- 27. Platzer, A.: Logics of dynamical systems. In: LICS, pp. 13–24. IEEE (2012)
- 28. Platzer, A.: A complete uniform substitution calculus for differential dynamic logic. J. Autom. Reas. **59**(2), 219–266 (2017)
- 29. Rothenberg, B.-C., Grumberg, O.: Sound and complete mutation-based program repair. In: Fitzgerald, J., Heitmeyer, C., Gnesi, S., Philippou, A. (eds.) FM 2016. LNCS, vol. 9995, pp. 593–611. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48989-6_36
- 30. Sadraddini, S., Belta, C.: Formal guarantees in data-driven model identification and control synthesis. In: Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (HSCC 2018), pp. 147–156 (2018)
- 31. Schulman, J., Levine, S., Abbeel, P., Jordan, M.I., Moritz, P.: Trust region policy optimization. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), JMLR Workshop and Conference Proceedings, vol. 37, pp. 1889–1897 (2015)
- 32. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.