
1 Introduction

Controller search is commonly used to design controllers that govern cyber-physical systems such as autonomous vehicles, where high assurance is particularly important. Reinforcement Learning (RL) of neural network controllers is a promising approach for controller search [19]. State-of-the-art RL algorithms can learn motor skills autonomously through trial and error in simulated or even unknown environments, thus avoiding tedious manual engineering. However, well-trained neural network controllers may still be unsafe since the RL algorithms do not provide any formal guarantees on safety. A learned controller may fail occasionally but catastrophically, and debugging these failures can be challenging [46].

Guaranteeing the correctness of an RL controller is therefore important. Principally, given an environment model, the correctness of a controller can be verified by reachability analysis over a closed-loop system that combines the environment model and the controller. Indeed, the use of formal verification techniques to aid the design of reliable learning-enabled autonomous systems has risen rapidly over the last few years [17, 18, 28, 41, 43]. A natural follow-up question is: when verification fails, can we exploit verification feedback in the form of counterexamples to synthesize a verifiably correct controller? This turns out to be a very challenging task for the following reasons.

Fig. 1.
figure 1

An oscillator programmatic controller and its reachability analysis. In Fig. 1b, the red region represents the oscillator unsafe set \((-0.3, -0.25) \times (0.2, 0.35)\), and the blue region depicts the target set \([-0.05, 0.05]\times [-0.05,0.05]\). The initial state set of the oscillator is \([-0.51, -0.49]\times [0.49,0.51]\).

Verification Scalability. A counterexample-guided controller synthesizer has to iteratively conduct reachability analysis and controller optimization as each iteration may discover a new counterexample. However, repeatedly calculating the reachable set of a nonlinear system controlled by a neural network controller over a long horizon is computationally challenging. For example, consider designing a controller for the Van der Pol’s oscillator system [49]. The oscillator is a 2-dimensional non-linear system whose state transition can be expressed by the following ordinary differential equations:

$$\begin{aligned} \dot{x_1} = x_2\quad&\quad \dot{x_2} = (1-x_1^2)x_2 - x_1 + u \end{aligned}$$
(1)

where \((x_1, x_2)\) are the system state variables and u is the control action variable. A feedback controller \(\pi (x_1, x_2)\) measures the current system state and then manipulates the control input u as needed to drive the system toward its target. The initial set of the control system is \((x_1, x_2) \in [-0.51, -0.49]\times [0.49,0.51]\). As depicted in Fig. 1b, the controlled system is expected to reach the target region in blue while avoiding the obstacle region in red within 120 timesteps (i.e. control steps). In our experience, even for this simple example, using Verisig [28] and ReachNN\(^*\) [18] (two state-of-the-art verification tools for neural network controlled systems) to calculate the reachable set of a simple 2-layer neural network feedback controller \(\pi _\textit{NN}(x_1, x_2)\) takes more than 100 seconds each. Repeatedly conducting reachability analysis of a complex neural network controller within a counterexample-guided learning loop is even more costly.
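For concreteness, the closed-loop rollout described above can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions: the explicit Euler sub-stepping and the placeholder zero controller are ours for illustration only, not part of VEL or its verification pipeline.

```python
import numpy as np

def oscillator_dynamics(x, u):
    """Van der Pol oscillator ODE (Equation 1): x1' = x2, x2' = (1 - x1^2) x2 - x1 + u."""
    x1, x2 = x
    return np.array([x2, (1.0 - x1 ** 2) * x2 - x1 + u])

def rollout(controller, x0, delta=0.05, steps=120, substeps=50):
    """Simulate the closed-loop system: the controller is applied once per delta-period,
    and the ODE is integrated with simple Euler sub-steps in between."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = xs[-1]
        u = controller(x)                       # control action held for one timestep
        h = delta / substeps
        for _ in range(substeps):
            x = x + h * oscillator_dynamics(x, u)
        xs.append(x)
    return np.array(xs)

if __name__ == "__main__":
    zero_controller = lambda x: 0.0             # placeholder; replaced by a learned controller
    traj = rollout(zero_controller, x0=[-0.5, 0.5])
    in_target = np.all(np.abs(traj[-1]) <= 0.05)
    print("final state:", traj[-1], "reached target:", in_target)
```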

Recently, programmatic controllers have emerged as a promising solution to the lack of interpretability in deep reinforcement learning [27, 38, 44, 47] by training controllers as programs. A programmatic controller for the oscillator environment, learned by a programmatic reinforcement learning algorithm [38], is depicted in Fig. 1a. We depict the decision boundary of the program's conditional statement (\(28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16 = 0\)) in Fig. 1b. The program can be interpreted as a decomposition of the reach-avoid learning problem into two sub-problems: the linear controller in the else branch of the program first pushes the system away from the obstacle, and then the linear controller in the then branch takes over to make the system reach the target. As we show in this paper, the compact and structured representation of a programmatic controller makes it amenable to off-the-shelf hybrid or continuous system reachability tools, e.g. [10, 20]. Compared with verifying a deep neural network controller, reasoning about a programmatic controller is more tractable. However, the question remains when verification fails: rather than retraining a new controller, how can we leverage verification feedback to construct a verifiably correct controller?
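In its deployed form, this controller is simply a guarded choice between two affine sub-controllers. The sketch below is a direct Python transcription; the branch coefficients are taken from the controller semantics spelled out in Sec. 4.1, and the hard conditional shown here is what training replaces with the smoothed semantics of Sec. 3.

```python
def oscillator_controller(x1, x2):
    """Programmatic controller of Fig. 1a: a conditional over two affine sub-controllers.
    Branch coefficients follow the controller semantics given in Sec. 4.1."""
    if 28.33 * x1 + 4.23 * x2 + 4.16 > 0:       # decision boundary shown in Fig. 1b
        return 6.79 * x1 - 8.56 * x2 + 0.35     # then-branch: steer toward the target region
    else:
        return 11.01 * x1 - 13.50 * x2 + 8.71   # else-branch: push away from the obstacle first
```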

Proof Space Optimization. The other main challenge of verification-guided controller synthesis is that when verification fails, the counterexample path may provide little help or may even be spurious due to approximation errors. This is because reachability analyses typically overapproximate the true reachable sets using a computationally convenient representation such as polytopes [20] or Taylor models [10]. This overapproximation leads to quick error accumulation over time, known as the wrapping effect. Even a well-trained controller may fail verification because of approximation errors. For example, we adapted a state-of-the-art reachability analyzer Flow\(^*\) [10] to conduct reachability analysis of the closed-loop system composed of the programmatic controller in Fig. 1a and the oscillator environment (Equation 1), computing a reachable state set for each time interval within the episode horizon (the controller is applied to generate a control action at the start of each time interval). The result is depicted in Fig. 1b. Although the programmatic controller empirically succeeds in reaching the goal in extensive test simulations, the reachability analysis cannot determine whether the target region can always be reached, as it computes an ever-expanding reachable region that may be an overestimation caused by over-approximation.

We hypothesize that verification failures can be caused by (1) true counterexamples of unsafe states, (2) states introduced by approximation errors, and (3) states occurring within the time interval of each control step (RL algorithms only sample states at the start and the end of a time interval). The latter two kinds of states cannot be observed by an RL algorithm during training in the concrete system state space. Thus, counterexample-guided controller synthesis may not work well if counterexamples are in the form of paths within the concrete state space.

To address this challenge, we propose synthesizing controllers in the proof space of a reachability analyzer. Controller synthesis in the proof space is critical to learning a verified controller because it can leverage verification feedback on either true unsafe counterexample states or approximation errors introduced by the verification procedure to search for a provably correct controller. A counterexample detected by a reachability analyzer is a symbolic rollout of abstract states of the closed-loop system that combines a (fixed) environment model and a (parameterized) programmatic controller. An abstract state (e.g. depicted as a green region in Fig. 1b) at a timestep over-approximates the set of concrete states reachable during the time interval of the timestep. VEL quantifies the safety and reachability property violation by the abstract states, e.g. there is an abstract loss between the over-approximated abstract state at the last control step and the target region. The loss approximates the worst-case reachability loss of any concrete state subsumed by the abstraction. We introduce lightweight gradient-descent style optimization algorithms that optimize controller parameters to effectively minimize the amount of correctness property violation to zero and thereby refute any verification counterexamples.

Contributions. The main contribution of this paper is twofold. First, we present an efficient controller synthesis approach that integrates formal verification within a programmatic controller learning loop. Second, instead of synthesizing a programmatic controller from concrete state and action samples, we optimize the controller using symbolic rollouts with abstract states obtained by reachability analysis in the verification proof space. We implement the proposed ideas in a tool called VEL and present a detailed experimental study over a range of reinforcement learning systems. Our experiments demonstrate the benefits of integrating formal verification as part of the training objective and using verification feedback for controller synthesis.

2 Problem Setup

Environment Models. An environment is a structure \(M^\delta [\cdot ] = ({S}, {A}, F:\{{S} \times {A} \rightarrow {S}\}, R:\{{S} \times {A} \rightarrow \mathbb {R}\}, \cdot )\) where S is an infinite set of continuous real-vector environment states which are valuations of the state variables \(x_1,x_2,\ldots ,x_n\) of dimension n (\({S} \subseteq \mathbb R^{n}\)); and A is a set of continuous real-vector control actions which are valuations of the action variables \(u_1,u_2,\ldots ,u_m\) of dimension m. F is a state transition function that emits the next environment state given a current state s and an agent action a. We assume that F is defined by an ordinary differential equation (ODE) in the form of \(\dot{x} = f(x, u)\) and the function \(f : \mathbb {R}^n \times \mathbb {R}^m \rightarrow \mathbb {R}^n\) is Lipschitz continuous in x and continuous in u. R(s, a) is the immediate reward after a transition from an environment state \(s \in S\) with action \(a \in A\). An environment \(M^\delta [\cdot ]\) is parameterized with an (unknown) controller.

Controllers. An agent uses a controller to interact with an environment \(M^\delta [\cdot ]\). We explicitly model the deployment of a (learned) controller \(\pi : \{{S} \rightarrow {A}\}\) in \(M^\delta [\cdot ]\) as a closed-loop system \(M^\delta [\pi ]\). The controller \(\pi \) determines which action the agent ought to take in a given environment state. Specifically, it is invoked once per timestep, i.e. every \(\delta \) time period. \(\pi \) reads the environment state \(s_i = s(i \delta )\) at time \(t = i \delta \) (\(i = 0, 1, 2, \ldots \)), i.e. at timestep i, and computes a control action \(a_i = a(i \delta ) = \pi (s(i \delta ))\). The environment then evolves following the ODE \(\dot{x} = f(x, a(i \delta ))\) within the time period \([i \delta ,(i+1) \delta ]\) and reaches the state \(s_{i+1} = s((i+1) \delta )\) at the next timestep \(i+1\). In the oscillator example from Sec. 1, the duration \(\delta \) of a timestep is 0.05s and the time horizon is 6s (i.e. 120 timesteps).

For environment simulation, given a set of initial states \(S_0\), we assume the existence of a flow function \(\phi (s_0, t) : S_0 \times \mathbb {R}^+ \rightarrow S\) that maps some initial state \(s_0\) to the environment state \(\phi (s_0, t)\) at time t where \(\phi (s_0, 0) = s_0\). We note that \(\phi \) is the solution of the ODE \(\dot{x} = f(x, a(i \delta ))\) in the state transition function F during the time period \([i \delta ,(i+1) \delta ]\) and \(a(i \delta ) = \pi (\phi (s_0, i \delta ))\).

Reinforcement Learning (RL). Given a set of initial states \(S_0\) and a time horizon \(T\delta \) (\(T > 0\)) with \(\delta \) as the duration of a timestep, a T-timestep rollout \(\zeta \) of a controller \(\pi \) is denoted as \((\zeta = s_0, a_0, s_1, \ldots , s_{T}) \sim \pi \) where \(s_i = s(i \delta )\) and \(a_i = a(i \delta )\) are the environment state and the action taken at timestep i such that \(s_0 \in S_0\), \(s_{i+1} = F(s_i, a_i)\), and \(a_i = \pi (s_i)\). The aggregate reward of \(\pi \) is

$$\begin{aligned} J^R(\pi ) = \mathbb {E}_{(\zeta = s_0, a_0, \ldots , s_{T}) \sim \pi }[\sum ^{T}_{i=0} \beta ^i R(s_i, a_i)] \end{aligned}$$
(2)

where \(\beta \) is the reward discount factor (\(0 < \beta \le 1\)). Controller search via RL aims to produce a controller \(\pi \) that maximizes \(J^R(\pi )\).
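As a small illustration of Equation 2, the expectation can be estimated by averaging discounted returns over sampled rollouts; `sample_rollout` below is a hypothetical helper that simulates one rollout and returns its reward sequence.

```python
def discounted_return(rewards, beta=0.99):
    """Discounted return of one rollout: sum_{i=0}^{T} beta^i * R(s_i, a_i)  (Equation 2)."""
    return sum((beta ** i) * r for i, r in enumerate(rewards))

def estimate_objective(sample_rollout, num_rollouts=100, beta=0.99):
    """Monte-Carlo estimate of J^R(pi): average the discounted return over sampled rollouts."""
    returns = [discounted_return(sample_rollout(), beta) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)
```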

Controller Correctness Specification. A correctness specification of a controller is a logical formula specifying whether any rollout \(\zeta \) of the controller accomplishes the task without violating safety properties and reachability properties. To define safety and reachability over rollouts, the user first specifies a set of atomic predicates over environment states s.

Definition 1 (Predicates)

A predicate \(\varphi \) is a quantifier-free Boolean combination of linear inequalities over the environment state variables x:

  • \(\langle \varphi \rangle \)  ::= \(\langle {P}\rangle \) | \(\varphi \) \(\wedge \) \(\varphi \) | \(\varphi \) \(\vee \) \(\varphi \);

  • \(\langle {P}\rangle \)  ::= \(\mathcal {A} \cdot x \le b\) where \(\mathcal {A} \in \mathbb R^{\vert x \vert }, \, b \in \mathbb R\);

A state \(s \in {S}\) satisfies a predicate \(\varphi \), denoted as \(s\,\models \,\varphi \), iff \(\varphi (s)\) is true.
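As an illustration of Definition 1, the predicate language can be encoded as a small Python AST together with an evaluator for \(s\,\models \,\varphi \). The class names below are ours (not VEL's internal representation) and are reused by the later loss-function sketches.

```python
from dataclasses import dataclass
from typing import Sequence, Union
import numpy as np

@dataclass
class Atom:
    """Linear atom  A . x <= b  over the state variables (Definition 1)."""
    A: Sequence[float]
    b: float

@dataclass
class And:
    left: "Predicate"
    right: "Predicate"

@dataclass
class Or:
    left: "Predicate"
    right: "Predicate"

Predicate = Union[Atom, And, Or]

def satisfies(s, phi) -> bool:
    """Evaluate s |= phi for a concrete state s (a real vector)."""
    if isinstance(phi, Atom):
        return float(np.dot(phi.A, s)) <= phi.b
    if isinstance(phi, And):
        return satisfies(s, phi.left) and satisfies(s, phi.right)
    return satisfies(s, phi.left) or satisfies(s, phi.right)   # Or
```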

The correctness requirement of a controller extends from predicates over environment states s to specifications over controller rollouts \(\zeta \).

Definition 2 (Rollout Specifications)

The syntax of our correctness specifications for RL controllers is defined as:

$$\begin{aligned} \psi \,\,{:}{:}\!\!= \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2 \end{aligned}$$

In a rollout specification, \(\varphi _\textit{I}\ \texttt {reach}\ \varphi _1\) enforces reachability: starting from an initial state that satisfies \(\varphi _\textit{I}\), the controlled agent should eventually reach some goal state on which the predicate \(\varphi _1\) evaluates to true. For instance, the agent should reach a goal region from an initial state. The constraint \(\texttt {ensuring}\ \varphi _2\) additionally enforces safety: any rollout of the controller should only visit safe states on which the predicate \(\varphi _2\) evaluates to true. For example, the agent should remain within a safety boundary or avoid any obstacles throughout a rollout. Formally, the semantics of a rollout specification \(\psi \) is defined as follows:

$$ \llbracket \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2 \rrbracket (\zeta _{0:T})\ =\ \varphi _1(s_T)\ \wedge \ (\forall \ 0 \le i \le T.\ \varphi _2(s_i)) $$

where \(\zeta _{0:T} = s_0, s_1, \ldots , s_T\) is a rollout such that \(\varphi _\texttt {I}(s_0)\) holds and \(T > 0\) denotes the total number of timesteps. Our specification implicitly requires that if the target region is reached before timestep T of a rollout, the controlled agent must still be in the target region at the end of the rollout.

Given a time horizon \(T\delta \) (\(T > 0\)), a controller \(\pi \) is correct for an environment \(M^\delta [\cdot ]\) with respect to a rollout specification \(\psi \,\,{:}{:}\!\!= \varphi _\texttt {I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\) iff for any rollout \(\zeta _{0:T} = s_0, s_1, \ldots s_{T-1}, s_{T}\) of \(M^\delta [\pi ]\) such that \(\varphi _\texttt {I}(s_0)\) holds, \(\llbracket \psi \rrbracket (\zeta _{0:T})\) is true. Notice that this definition does not consider any states of the continuous environment occurring within the time period of a timestep.
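Putting Definitions 1 and 2 together, checking a concrete rollout against a rollout specification combines a reachability check on the final state with a safety check on every visited state. The sketch below reuses the `satisfies` helper and predicate classes from the previous sketch and assumes the rollout's initial state already satisfies \(\varphi _\texttt {I}\).

```python
def satisfies_rollout_spec(rollout, phi_reach, phi_safe):
    """[[ phi_I reach phi_1 ensuring phi_2 ]](zeta_{0:T})  (Definition 2).
    rollout is the list of concrete states s_0, ..., s_T; satisfies() is from
    the predicate sketch after Definition 1."""
    s_T = rollout[-1]
    return satisfies(s_T, phi_reach) and all(satisfies(s_i, phi_safe) for s_i in rollout)
```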

Example 1

Continuing the oscillator example, assume the oscillator's initial state \((x_1, x_2)\) is drawn from \([-0.51, -0.49]\times [0.49,0.51]\). The initial state constraint is specified as:

$$\begin{aligned} \varphi _\texttt {I}(x_1, x_2) \equiv -0.51 \le x_1 \le -0.49 \wedge 0.49 \le x_2 \le 0.51 \end{aligned}$$

The unsafe set of the oscillator is \((-0.3, -0.25) \times (0.2, 0.35)\) (depicted as the red region in Fig. 1b). The safety property \(\varphi _{\textit{safe}}\) of the system is specified as:

$$\begin{aligned} \varphi _{\textit{safe}}(x_1, x_2) \equiv x_1 \le -0.3 \vee x_1 \ge -0.25 \vee x_2 \le 0.2 \vee x_2 \ge 0.35 \end{aligned}$$

For this example, the target region is \([-0.05, 0.05]\times [-0.05,0.05]\) (the blue region in Fig. 1b). The reachability property \(\varphi _{\textit{reach}}\) of the system is specified as:

$$\begin{aligned} \varphi _{\textit{reach}}(x_1, x_2) \equiv -0.05 \le x_1 \le 0.05 \wedge -0.05 \le x_2 \le 0.05 \end{aligned}$$

The target region should be eventually reached by the end of a control episode while avoiding the unsafe state region. We express the rollout specification as:

$$\begin{aligned} \varphi _\textit{I}(x_1,x_2)\ \texttt {reach}\ \varphi _{\textit{reach}}(x_1, x_2)\ \texttt {ensuring}\ \varphi _{\textit{safe}}(x_1, x_2) \end{aligned}$$

The following specification formulates that a desired controller stabilizes the oscillator around the target region over an infinite time horizon:

$$\begin{aligned} \varphi _\textit{reach}(x_1,x_2)\ \texttt {reach}\ \varphi _{\textit{reach}}(x_1, x_2)\ \texttt {ensuring}\ \varphi _{\textit{safe}}(x_1, x_2) \end{aligned}$$
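Using the predicate encoding sketched after Definition 1 (our illustrative representation, not VEL's), the Example 1 predicates can be written down directly; note that lower bounds such as \(x_1 \ge -0.25\) are rewritten as \(-x_1 \le 0.25\) to fit the \(\mathcal {A} \cdot x \le b\) atom form.

```python
# Atoms are A . x <= b over x = (x1, x2); Atom/And/Or are from the earlier sketch.
phi_init = And(And(Atom([1, 0], -0.49), Atom([-1, 0], 0.51)),     # -0.51 <= x1 <= -0.49
               And(Atom([0, 1], 0.51), Atom([0, -1], -0.49)))     #  0.49 <= x2 <=  0.51
phi_safe = Or(Or(Atom([1, 0], -0.3), Atom([-1, 0], 0.25)),        # x1 <= -0.3 or x1 >= -0.25
              Or(Atom([0, 1], 0.2), Atom([0, -1], -0.35)))        # x2 <=  0.2 or x2 >=  0.35
phi_reach = And(And(Atom([1, 0], 0.05), Atom([-1, 0], 0.05)),     # -0.05 <= x1 <= 0.05
                And(Atom([0, 1], 0.05), Atom([0, -1], 0.05)))     # -0.05 <= x2 <= 0.05
```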

3 Programmatic Controllers

Programmatic controllers have emerged as a promising solution to address the lack of interpretability in deep reinforcement learning [8, 27, 38, 47] by learning controllers as programs. This paper focuses on programmatic controllers structured as differentiable programs [38].

Our programmatic controllers follow the high-level context-free grammar depicted in Fig. 2, where E is the start symbol and \(\theta \) represents the real-valued parameters of the program. The nonterminals E and B stand for program expressions that evaluate to action values in \(\mathbb {R}^m\) and Booleans, respectively, where m is the action dimension size, \(\theta _1 \in \mathbb {R}\) and \(\theta _2 \in \mathbb {R}^{n}\). We represent a state input to a programmatic controller as \(s=\{x_1:\nu _1,x_2:\nu _2,\ldots ,x_n:\nu _n\}\) where n is the state dimension size and \(\nu _i=s[x_i]\) is the value of \(x_i\) in s. As usual, the unbound variables in \(\mathcal {X} = [x_1, x_2, \ldots , x_n]\) are assumed to be input variables (i.e., state variables). C is a low-level affine controller that can be invoked by a programmatic controller, where \(\theta _3, \theta _c \in \mathbb {R}^m, \theta _4 \in \mathbb {R}^{m \cdot n}\) are controller parameters. Notice that C can be as simple as some (learned) constants \(\theta _c\).

Fig. 2.
figure 2

A context-free grammar for programmatic controllers.

The semantics of a programmatic controller in E is mostly standard and given by a function \(\llbracket E \rrbracket (s)\), defined for each language construct. For example, \(\llbracket x_i \rrbracket (s)=s[x_i]\) reads the value of a variable \(x_i\) in a state s. A controller may use an if-then-else branching construct. To avoid discontinuities for differentiability, we interpret its semantics in terms of a smooth approximation:

$$\begin{aligned} \llbracket {\textbf {if}}\ {}&B\ {\textbf {then}}\ C\ {\textbf {else}}\ E \rrbracket (s) = \sigma (\llbracket B \rrbracket (s)) \cdot \llbracket C \rrbracket (s) + (1 - \sigma (\llbracket B \rrbracket (s))) \cdot \llbracket E \rrbracket (s) \end{aligned}$$
(3)

where \(\sigma \) is the sigmoid function. Thus, any controller programmed in this grammar is a differentiable program. During execution, a programmatic controller invokes a set of low-level affine controllers under different environment conditions, according to the activation of the B conditions in the program.
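A direct Python transcription of the smoothed semantics in Equation 3, instantiated for the oscillator controller of Fig. 1a, is sketched below; the coefficients are those given in Sec. 4.1, and NumPy is used only for the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothed_if(b_value, then_value, else_value):
    """Smoothed if-then-else semantics of Equation 3:
    sigma([[B]](s)) * [[C]](s) + (1 - sigma([[B]](s))) * [[E]](s)."""
    g = sigmoid(b_value)
    return g * then_value + (1.0 - g) * else_value

def oscillator_controller_smoothed(x1, x2):
    """Differentiable form of the Fig. 1a controller used during training."""
    b = 28.33 * x1 + 4.23 * x2 + 4.16
    return smoothed_if(b,
                       6.79 * x1 - 8.56 * x2 + 0.35,     # then-branch affine controller
                       11.01 * x1 - 13.50 * x2 + 8.71)   # else-branch affine controller
```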

Programmatic Reinforcement Learning. We use the programmatic reinforcement learning algorithm [38] to learn a programmatic controller. Compared with other programmatic reinforcement learning approaches [27, 47], this algorithm stands out by jointly learning both program structures and program parameters. Empirical results show that learned programmatic controllers achieve comparable or even better reward performance than deep neural networks [38].

4 Proof Space Optimization

The main challenge of using a verification procedure to guide controller synthesis is that verifiers are in general incomplete. When verification fails, the system under verification does not necessarily have a true counterexample: the verifier may introduce states caused by over-approximation errors, as is common in reachability analysis. Even a well-trained controller may fail verification because of approximation errors. In our context, for soundness, reachability analysis of continuous or hybrid systems additionally takes into account environment states occurring within the time interval of a timestep. Both of these kinds of states cannot be observed by RL agents during training in the concrete state space, which underscores the importance of controller optimization in the proof space of verification. In the following, Sec. 4.1 defines a verification procedure for environment models governed by programmatic controllers. Sec. 4.2 encodes verification feedback as a loss function of controller parameters over the verification proof space. Finally, Sec. 4.3 defines an optimization procedure that iteratively minimizes the loss function for correct-by-construction controller synthesis.

4.1 Controller Verification

We formalize controller synthesis as a verification-based controller optimization problem. A synthesized controller \(\pi \) is certified by a formal verifier against an environment model \(M^\delta [\cdot ]\) and a rollout specification \(\psi \) (Definition 2). The verifier returns true if \(\pi \) can be verified correct.

Reinforcement learning algorithms typically discretize a continuous environment model \(M^\delta [\cdot ]\) to sample environment states every \(\delta \) time period (as a timestep) for controller learning (Sec. 2). For soundness, in verification our approach instead considers all states reachable by the original continuous system. Formally, given a set of initial states \(S_0\), we use \(S_i\) (\(i > 0\)) to represent the set of reachable concrete states during the time interval of \([(i-1) \delta ,\ i \delta ]\):

$$\begin{aligned} S_{i} = \{\phi (s_0, t)\ \vert \ s_0 \in S_0,\ t \in [(i-1) \delta ,\ i \delta ]\} \end{aligned}$$

where \(\phi \) is the flow function for environment state transition defined in Sec. 2. Our algorithm uses abstract interpretation to soundly approximate the set of reachable states \(S_{i}\) at each time step by reachability analysis.

Definition 3 (Symbolic Rollouts)

Given an environment model \(M^\delta [\pi ] = ({S}, {A}, F, R, \pi )\) deployed with a controller \(\pi \), a set of initial states \(S_0\), and an abstract domain \(\mathcal {D}\), a symbolic rollout of \(M^\delta [\pi ]\) over \(\mathcal {D}\) is \(\zeta ^\mathcal {D}= S^\mathcal {D}_0, S^\mathcal {D}_1, \ldots \) where \(S^\mathcal {D}_0 = \alpha (S_0)\) is the abstraction of the initial states \(S_0\) in \(\mathcal {D}\). Each symbolic state \(S^\mathcal {D}_{i} = F^{\mathcal {D}}[\pi ]\big ( S^\mathcal {D}_{i-1} \big )\) over-approximates \(S_i\), the set of states reachable from the initial set \(S_0\) during the time interval \([(i-1) \delta , i \delta ]\) of the timestep i. \(F^{\mathcal {D}}\) is an abstract transformer for \(M^\delta [\pi ]\)'s state transition function F.

Our implementation of the abstract interpreter \(F^{\mathcal {D}}\) is based on Flow\(^*\) [10], a reachability analyzer for continuous or hybrid systems, where the abstract domain \(\mathcal {D}\) is Taylor Model (TM) flowpipes. Formally, for reachability computation at each timestep i (where \(i > 0\)), we first use Flow\(^*\) to evaluate the TM flowpipe \(\hat{S}_{i-1}\) for the reachable set of states at time \(t = (i-1)\delta \). To obtain a TM representation for the output set of the programmatic controller at timestep i, we use TM arithmetic to evaluate a TM flowpipe \(\hat{A}_{i-1}\) for \(\llbracket \pi \rrbracket (s)\) for all states \(s \in \hat{S}_{i-1}\). Here \(\llbracket \pi \rrbracket \) encodes the semantics of \(\pi \) (Equation 3). For example, the semantics of the oscillator controller in Fig. 1a is:

$$\begin{aligned}&\sigma (28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16) \times (6.79 x_1 ~ - ~ 8.56 x_2 ~ + ~ 0.35) \\&\, + (1 - \sigma (28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16)) \times (11.01 x_1 ~ - ~ 13.50 x_2 ~ + ~ 8.71) \end{aligned}$$

where the sigmoid function \(\sigma \) can be handled by TM arithmetic. The resulting TM representation \(\hat{A}_{i-1}\) can be viewed as an overapproximation of the controller’s output at timestep i. Finally, we use Flow\(^*\) to construct the TM flowpipe overapproximation \(S^\mathcal {D}_{i}\) for all reachable states during the time period at timestep i by reachability analysis over the ODE dynamics of the transition function \(\dot{x} = f(x, a)\) for \(\delta \) time period with initial state \(x(0) \in \hat{S}_{i-1}\) and the control action \(a \in \hat{A}_{i-1}\).
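To make the structure of a symbolic rollout concrete, the sketch below uses interval boxes as a deliberately coarse stand-in for Taylor-model flowpipes: because the smoothed program outputs a convex combination of its two affine branches, the interval hull of the branch ranges already encloses the controller output over a state box, while the ODE step is left as an abstract `step_box` callable (in VEL this role is played by Flow\(^*\); we do not model its API here).

```python
import numpy as np

def interval_affine(lo, hi, w, c):
    """Tight bounds of the affine map w . x + c over the state box [lo, hi]."""
    lo, hi, w = (np.asarray(v, float) for v in (lo, hi, w))
    return (c + float(np.sum(np.where(w >= 0, w * lo, w * hi))),
            c + float(np.sum(np.where(w >= 0, w * hi, w * lo))))

def abstract_controller_output(lo, hi):
    """Enclose the Fig. 1a controller's output over a state box. The smoothed program
    (Equation 3) outputs a convex combination of its two affine branches, so the interval
    hull of the two branch ranges is a sound, if coarse, stand-in for the Taylor-model
    enclosure that Flow* computes with TM arithmetic."""
    t_lo, t_hi = interval_affine(lo, hi, [6.79, -8.56], 0.35)
    e_lo, e_hi = interval_affine(lo, hi, [11.01, -13.50], 8.71)
    return min(t_lo, e_lo), max(t_hi, e_hi)

def symbolic_rollout(step_box, s0_lo, s0_hi, steps):
    """Skeleton of a symbolic rollout S^D_0, S^D_1, ... (Definition 3). step_box plays the
    role of the abstract transformer F^D[pi]: given a state box, it must return a box that
    over-approximates all states reachable during the next delta-period."""
    boxes = [(np.asarray(s0_lo, float), np.asarray(s0_hi, float))]
    for _ in range(steps):
        boxes.append(step_box(*boxes[-1]))
    return boxes
```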

Verification Procedure. Given a closed-loop system \(M^\delta [\pi ]\), a time horizon \(T\delta \) (\(T > 0\)), and a rollout specification \(\psi \,\,{:}{:}\!\!= \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\), we obtain the symbolic rollout of \(M^\delta [\pi ]\) as \(\zeta ^\mathcal {D}_{0:T} = S^\mathcal {D}_0, S^\mathcal {D}_{1}, \ldots , S^\mathcal {D}_{T}\) where \(S^\mathcal {D}_0\) is the abstraction of all states in \(\varphi _\textit{I}\) in the abstract domain \(\mathcal {D}\). For formal verification, we extend the semantics definition of the rollout specification \(\llbracket \psi \rrbracket \) over concrete rollouts (Definition 2) to support symbolic rollouts. Formally, \(\llbracket \psi \rrbracket (\zeta ^\mathcal {D}_{0:T})\) holds iff:

$$ \forall s \in \gamma (S^\mathcal {D}_{T}).\ \varphi _1(s)\ \ \wedge \ \ \forall \ 0 \le i \le T.\ \forall s \in \gamma (S^\mathcal {D}_{i}).\ \varphi _2(s) $$

where \(\gamma \) is the concretization function of the abstract domain \(\mathcal {D}\). The closed-loop system \(M^\delta [\pi ]\) satisfies \(\psi \), denoted as \(M^\delta [\pi ]\,\models \,\psi \), iff \(\llbracket \psi \rrbracket (\zeta ^\mathcal {D}_{0:T})\) holds. The abstract domain \(\mathcal {D}\) is the proof space of controller verification.

Example 2

To verify the closed-loop system composed of the oscillator ODE in Eq. 1 and the learned controller in Fig. 1a, we conducted reachability analysis to overapproximate the reachable state set during the time period of each timestep within the episode horizon. The resulting TM flowpipes are depicted as a sequence of green regions in Fig. 1b. The verification procedure cannot guarantee that the target will eventually be reached due to the approximation errors.

4.2 Correctness Property Loss in the Proof Space

To facilitate controller optimization in the presence of verification failures, our approach measures the amount of correctness property violation as verification feedback. To this end, we first define correctness property violation over the concrete environment state space and then lift this definition to the proof space of controller verification.

We note that a controller rollout that fails correctness property verification violates desired properties at some states. The following definition characterizes a correctness loss function to quantify the correctness property violation of a state.

Definition 4 (State Correctness Loss Function)

For a predicate \(\varphi \) over states \(s \in S\), we define a non-negative loss function \(\mathcal {L}(s, \varphi )\) such that \(\mathcal {L}(s, \varphi ) = 0\) iff s satisfies \(\varphi \), i.e. \(s\,\models \,\varphi \). We define \(\mathcal {L}(s, \varphi )\) recursively, based on the possible shapes of \(\varphi \) (Definition 1):

  • \(\mathcal {L}(s, \mathcal {A} \cdot x \le b) := \max (\mathcal {A} \cdot s - b, 0)\)

  • \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

  • \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

Notice that \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) = 0\) iff \(\mathcal {L}(s, \varphi _1) = 0\) and \(\mathcal {L}(s, \varphi _2) = 0\), and similarly \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) = 0\) iff \(\mathcal {L}(s, \varphi _1) = 0\) or \(\mathcal {L}(s, \varphi _2) = 0\).
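Definition 4 translates directly into a recursive function over the predicate AST sketched earlier (our illustrative encoding); zero loss coincides with satisfaction.

```python
import numpy as np

def state_loss(s, phi):
    """State correctness loss L(s, phi) of Definition 4 over the Atom/And/Or
    predicate classes from the earlier sketch: zero iff s |= phi."""
    if isinstance(phi, Atom):
        return max(float(np.dot(phi.A, s)) - phi.b, 0.0)
    if isinstance(phi, And):
        return max(state_loss(s, phi.left), state_loss(s, phi.right))
    return min(state_loss(s, phi.left), state_loss(s, phi.right))   # Or
```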

Our objective is to use verification feedback to improve controller safety. To this end, we lift the correctness loss function over concrete states (Definition 4) to an abstract correctness loss function over abstract states.

Definition 5 (Abstract State Correctness Loss Function)

Given an abstract state \(S^\mathcal {D}\) and a predicate \(\varphi \), we define an abstract correctness loss function:

$$ \mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = \max _{s \in \gamma (S^{\mathcal {D}})} \mathcal {L}(s, \varphi ) $$

where \(\gamma \) is the concretization function of the abstract domain \(\mathcal {D}\). The abstract correctness loss function applies \(\gamma \) to obtain all concrete states represented by an abstract state \(S^\mathcal {D}\). It measures the worst-case correctness loss of \(\varphi \) among all concrete states subsumed by \(S^\mathcal {D}\). Given an abstract domain \(\mathcal {D}\), we can usually approximate the concretization of an abstract state \(\gamma (S^{\mathcal {D}})\) with a tight interval \(\gamma _I(S^\mathcal {D})\). As exemplified in Fig. 1b, it is straightforward to represent Taylor model flowpipes as intervals in Flow\(^*\). Based on the possible shape of \(\varphi \), we redefine \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi )\) as:

  • \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \mathcal {A} \cdot x \le b) := \max _{s \in \gamma _I(S^{\mathcal {D}})}\big (\max (\mathcal {A} \cdot s - b, 0)\big )\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)
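With the interval concretization \(\gamma _I\), the worst case of a linear atom over a box is attained at a single corner selected by the signs of \(\mathcal {A}\), which makes Definition 5 cheap to evaluate. Below is a sketch over the earlier predicate classes, with abstract states represented as `(lo, hi)` box bounds (again an illustrative encoding, not VEL's internals).

```python
import numpy as np

def abstract_loss(lo, hi, phi):
    """Abstract correctness loss L_D(S^D, phi) of Definition 5, with the abstract state
    concretized to the interval box [lo, hi] via gamma_I."""
    if isinstance(phi, Atom):
        A = np.asarray(phi.A, float)
        worst = np.where(A >= 0, hi, lo)               # box corner maximizing A . s
        return max(float(np.dot(A, worst)) - phi.b, 0.0)
    if isinstance(phi, And):
        return max(abstract_loss(lo, hi, phi.left), abstract_loss(lo, hi, phi.right))
    return min(abstract_loss(lo, hi, phi.left), abstract_loss(lo, hi, phi.right))  # Or
```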

Theorem 1 (Abstract State Correctness Loss Function Soundness)

Given an abstract state \(S^\mathcal {D}\) and a predicate \(\varphi \), we have:

$$ \mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = 0 \implies \forall s \in \gamma _I(S^\mathcal {D}).\ s\, \models \,\varphi . $$

We further lift the definition of the correctness loss function over abstract states (Definition 5) to a correctness loss function over symbolic rollouts.

Definition 6 (Symbolic Rollout Correctness Loss)

Given a rollout specification \(\psi := \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\) and a symbolic rollout \(\zeta ^\mathcal {D}_{0:T} = S^\mathcal {D}_0, \ldots , S^\mathcal {D}_{T}\) where \(S^\mathcal {D}_0\) is the abstraction of all states in \(\varphi _\textit{I}\) in the abstract domain \(\mathcal {D}\), we define an abstract correctness loss function \(\mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T},\ \psi )\) measuring the degree to which the rollout specification is violated:

$$ \mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T},\ \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2)\ =\ \max (\mathcal {L}_\mathcal {D}(S^\mathcal {D}_T, \varphi _1),\ \max _{0<i\le T}(\mathcal {L}_\mathcal {D}(S^\mathcal {D}_i, \varphi _2))) $$

Definition 6 enables a quantitative metric for the correctness loss of a controller in the verification proof space. Given a closed-loop system \(M^\delta [\pi ]\), a time horizon \(T\delta \), a rollout specification \(\psi \), and the corresponding symbolic rollout \(\zeta ^\mathcal {D}_{0:T}\) of \(M^\delta [\pi ]\), the correctness loss of \(M^\delta [\pi ]\) with respect to \(\psi \), denoted as \(\mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi )\), is defined over the symbolic rollout, i.e. \(\mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi )\) = \(\mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T}, \psi )\).
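Definition 6 then reduces to a maximum over the per-step abstract losses. A sketch over the box representation and `abstract_loss` helper used above:

```python
def rollout_loss(boxes, phi_reach, phi_safe):
    """Symbolic rollout correctness loss of Definition 6 over interval boxes (lo, hi)
    for S^D_0, ..., S^D_T: reachability is charged on the last abstract state and
    safety on every abstract state after the initial one."""
    lo_T, hi_T = boxes[-1]
    reach = abstract_loss(lo_T, hi_T, phi_reach)
    safe = max(abstract_loss(lo, hi, phi_safe) for lo, hi in boxes[1:])
    return max(reach, safe)
```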

Example 3

In Fig. 1b, there is a correctness loss (depicted as a red arrow) between the abstract state at the last timestep of the oscillator symbolic rollout and the desired reachable region \(\varphi _{\textit{reach}}\) defined in Example 1. We characterize it as an abstract state correctness loss. The whole symbolic rollout has the same correctness loss with respect to the rollout specification defined in Example 1.

Theorem 2 (Symbolic Rollout Correctness Soundness)

Given an environment \(M^\delta [\cdot ]\) deployed with a controller \(\pi \) and a rollout specification \(\psi \), we have

$$\begin{aligned} \mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi ) = 0 \implies M^\delta [\pi ]\,\models \,\psi . \end{aligned}$$

4.3 Controller Synthesis

The unique feature of our controller synthesis algorithm is that it leverages verification feedback on either true unsafe states or overapproximation errors introduced by verification to search for a provably correct controller.

Controller Synthesis in the Proof Space. We denote a programmatic controller \(\pi \) with trainable parameters \(\theta \) (e.g. from the grammar in Fig. 2) as \(\pi _\theta \). Given a closed-loop system \(M^\delta [\pi _\theta ]\), the correctness loss function \(\mathcal {L}_\mathcal {D}(M^\delta [\pi _\theta ], \psi )\) is essentially a function of \(\pi _\theta \)'s parameters \(\theta \). To reduce the correctness loss of \(\pi _\theta \) over the proof space \(\mathcal {D}\), we leverage a gradient-descent style optimization to update \(\theta \) by taking steps proportional to the negative of the gradient of \(\mathcal {L}_\mathcal {D}(M^\delta [\pi _\theta ], \psi )\) at \(\theta \). As opposed to standard gradient descent optimization, we optimize \(\pi _\theta \) based on symbolic rollouts in the proof space \(\mathcal {D}\), using the abstract interpreter (i.e. Flow\(^*\)) directly for verification-guided controller updates.

Algorithm 1. Verification-guided controller synthesis in the proof space.

Black-box Gradient Estimation. Directly deriving the gradients of \(\mathcal {L}_\mathcal {D}\), however, requires the controller verification procedure be differentiable, which is not supported by reachability analyzers such as Flow\(^*\). To overcome this challenge, our algorithm effectively estimates the gradients of \(\mathcal {L}_\mathcal {D}\) based on random search [34]. Given a closed-loop environment \(M^\delta [\pi _\theta ]\), at each training iteration, we obtain perturbed systems \(M^\delta [\pi _{\theta +\nu \omega }]\) and \(M^\delta [\pi _{\theta -\nu \omega }]\) where we add sampled Gaussian noise \(\omega \) to the current controller \(\pi _\theta \)’s parameters \(\theta \) in both directions and \(\nu \) is a small positive real number. By evaluating the abstract correctness losses of the symbolic rollouts of \(M^\delta [\pi _{\theta +\nu \omega }]\) and \(M^\delta [\pi _{\theta -\nu \omega }]\), we update \(\theta \) with a finite difference approximation along an unbiased estimator of the gradient:

$$\begin{aligned} \nabla _\theta \mathcal {L}_\mathcal {D}\leftarrow \frac{1}{N}\sum ^N_{k=1}\frac{\mathcal {L}_\mathcal {D}(M^\delta [\pi _{\theta +\nu \omega _k}],\ \psi ) - \mathcal {L}_\mathcal {D}(M^\delta [\pi _{\theta -\nu \omega _k}],\ \psi )}{\nu }\, \omega _k \end{aligned}$$

We update controller parameters \(\theta \) as follows where \(\eta \) is a learning rate:

$$\begin{aligned} \theta \leftarrow \theta - \eta \cdot \nabla _\theta \mathcal {L}_\mathcal {D}\end{aligned}$$

Our high-level controller synthesis algorithm is depicted in Algorithm 1. The algorithm takes as input an environment model \(M^\delta [\cdot ]\), a rollout specification \(\psi \), and a programmatic controller \(\pi \) learned using the programmatic reinforcement learning technique [38]. When verification fails (line 4), it uses the correctness loss of the symbolic rollout of \(M^\delta [\pi ]\) for optimization (lines 8-9). The algorithm repeatedly performs the gradient-based update until a verified controller is synthesized. As the controller verification procedure is undecidable in general, it is possible that Algorithm 1 converges with a nonzero correctness loss. Our empirical results in Sec. 5 demonstrate that the algorithm works well in practice.
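A compact sketch of Algorithm 1 with the random-search gradient estimator above is given below. `rollout_loss_of` is an assumed black-box callable that performs reachability analysis of \(M^\delta [\pi _\theta ]\) and returns the symbolic-rollout correctness loss \(\mathcal {L}_\mathcal {D}\) (in VEL this wraps Flow\(^*\)); the hyperparameter defaults are placeholders rather than the values used in the paper.

```python
import numpy as np

def synthesize(theta, rollout_loss_of, nu=0.01, eta=0.1, N=8, max_iters=200, rng=None):
    """Verification-guided controller synthesis (sketch of Algorithm 1).
    theta is the numpy parameter vector of the programmatic controller pi_theta."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_iters):
        if rollout_loss_of(theta) == 0.0:          # zero abstract loss => verified (Theorem 2)
            return theta
        grad = np.zeros_like(theta)
        for _ in range(N):                         # finite-difference gradient estimate
            omega = rng.standard_normal(theta.shape)
            diff = rollout_loss_of(theta + nu * omega) - rollout_loss_of(theta - nu * omega)
            grad += (diff / nu) * omega
        theta = theta - eta * (grad / N)           # gradient-descent style parameter update
    return theta                                   # may still be unverified after max_iters
```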

5 Experimental Results

We have implemented the verification-guided controller synthesis technique in Algorithm 1 in a tool called VEL (VErification-based Learning) [50]. Given an environment and a rollout specification \(\psi \) (Definition 2), VEL uses the programmatic reinforcement learning algorithm [38] to learn a programmatic controller \(\pi \). The controller \(\pi \) is trained to satisfy the safety and reachability requirements set by \(\psi \). We do so by shaping a reward function that is consistent with \(\psi \): this function rewards actions leading to goal states and penalizes actions leading to unsafe states. As the RL algorithm does not provide any correctness guarantees and the verification procedure may introduce large approximation errors, even well-trained controllers may fail verification. In case of verification failures, VEL applies Algorithm 1 to optimize \(\pi \) based on the verification feedback.

Table 1. Benchmark Rollout Specifications (\(\mathcal {T}\) represents True).
Table 2. Experiment Results. Depth shows the height of the abstract syntax tree of a programmatic controller. T.T shows the overall execution time of VEL including both the time for reachability analysis and verification-guided controller synthesis. V.T measures only the verification time for the final controller. If a controller can be verified directly without verification-guided optimization, the value of T.T is empty. The execution times for ReachNN\(^*\) and Verisig measure the cost of verifying a neural network controlled system (NNCS). The size notation (\(n \times k\)) indicates a neural network (with sigmoid activations) with n hidden layers and k neurons per layer. If a property could not be verified, it is marked as Unknown. N/A means that the tool is not applicable to a benchmark.

We evaluated VEL on several nonlinear continuous or hybrid systems taken from the literature. These are problems that are widely used for evaluating state-of-the-art verification tools for learning-enabled cyber-physical systems. Benchmarks B1 - B5 were introduced by [18]; adaptive cruise control (ACC) was presented in [43]; mountain car (MC) and quadrotor with model-predictive control (QMPC) were introduced by [28]; Pendulum and CartPole were taken from [29]; Tora and Unicyclecar were presented in the ARCH-COMP21 competition on formal verification of Artificial Intelligence and Neural Network Control Systems (AINNCS). We present the dynamics and the detailed description of each benchmark in [50]. The rollout specifications (Definition 2) are depicted in Table 1. The specifications define for each benchmark the initial states, the goal regions to reach, and the safety properties describing the safety boundary or the obstacles to avoid. On three benchmarks we verify the controller correctness over an infinite horizon. For the classic control problem Pendulum, to verify that the pendulum does not fall in an infinite time horizon, the rollout specification requires that any rollout starting from the region \(x_1,x_2 \in [-0.1, 0.1]\) (representing pendulum angle and angular velocity) eventually turns back to it and any rollout states must be safe (including those that temporarily leave this region). Similarly, Tora models a moving cart attached to a wall with a spring. On Tora\(_\texttt {inf}\), we prove that the controller for the arm of the cart connecting to the spring can stabilize the cart over an infinite horizon while maintaining safety around the origin. On Oscillator\(_\texttt {inf}\), we verify that the controller can stabilize the oscillator around a target region over an infinite horizon while the process of reaching the target region from the initial states is safe.

The experimental results are given in Table 2. VEL synthesized provably correct programmatic controllers for all the benchmarks. Table 2 shows the total time spent on each benchmark (T.T) as well as the verification time of the final controller (V.T). Half of the benchmarks can be directly verified with the initial programmatic controller (in Table 2, T.T for these benchmarks is empty as they only need one pass of verification in V.T). The other half must go through the verification-guided controller learning loop due to approximation errors in verification although these controllers achieved satisfactory test performance. We depict the learning performance of VEL on these benchmarks in Fig. 3 averaged over 5 random seeds. The results show that VEL can robustly and reliably reduce the correctness loss over symbolic rollouts (i.e. the verification feedback) to zero.

Table 2 also shows the results of verifying the benchmarks as neural network controlled systems (NNCS) using two state-of-the-art verification tools ReachNN\(^*\) [18] and Verisig [28] where the controllers are trained as neural networks. We note that VEL is designed for programmatic controllers and uniquely has a verification-guided learning loop. Here our intention is not to compare the tools' performance. Instead, Table 2 demonstrates that integrating verification in training loops for programmatic controllers is more tractable than for neural network controllers. It shows that programmatic controller verification (column V.T) has a much lower computation cost compared to verifying neural network controllers using ReachNN\(^*\) and Verisig, except for MountainCar. When ReachNN\(^*\) and Verisig produce Unknown, the tools are not able to verify the rollout specification due to the large estimated approximation errors in verification. On Tora, ReachNN\(^*\) spent over 13000s to produce imprecise flowpipes with large approximation errors that cannot be used for verification. In this case, repeatedly conducting neural network controller verification in a learning loop is computationally infeasible. On the other hand, VEL makes verification-guided controller synthesis feasible as evidenced in Table 2 and Fig. 3. It efficiently uses the programmatic controller verification feedback to reduce the correctness loss over the abstraction of controller reachable states to 0 in the verification proof space (even if the abstraction may introduce approximation errors).

Fig. 3.
figure 3

Learning Performance of Verification-guided Controller Synthesis on B1, UnicycleCar, QMPC, Oscillator, ACC, and Tora\(_\texttt {inf}\). The y-axis records the correctness loss of symbolic rollouts over abstract states. The results are averaged over 5 random seeds. VEL reliably reduces the symbolic rollout correctness loss to zero across the learning loop iterations (the x axis) for each benchmark.

6 Related Work

Robust Machine Learning. Our work on using abstract interpretation [14] for controller synthesis is inspired by the recent advances in verifying neural network robustness, e.g. [5, 23, 40, 51]. These approaches apply abstract interpretation to relax nonlinearity of activation functions in neural networks into convex representations, based on linear approximation [39, 40, 51, 52, 55] or interval approximation [26, 35]. Since the abstractions are differentiable, neural networks can be optimized toward tighter concretized bounds to improve verified robustness [7, 33, 35, 48, 55]. Principally, abstract interpretation can be used to verify the reachability properties of nonlinear dynamical systems [4, 30, 37]. Recent work [13, 17, 18, 28, 29, 41, 43] has achieved initial results on verifying neural network controlled autonomous systems by conducting reachability analysis. However, these approaches do not attempt to leverage verification feedback for controller synthesis within a learning loop, partially because of the high computation demand of repeatedly verifying neural network controllers. VEL demonstrates the substantial benefits of using verification feedback in a proof space for learning correct-by-construction programmatic controllers. Related works [16, 25] conduct trajectory planning from temporal logic specifications but do not provide formal correctness guarantees. Extending VEL to support richer logic specifications is left for future work.

Safe Reinforcement Learning. Safe reinforcement learning is a fundamental problem in machine learning [36, 45]. Most safe RL algorithms formulate a constrained optimization problem by specifying safety constraints as cost functions in addition to reward functions [1, 9, 15, 31, 42, 53, 54]. Their goal is to train a controller that maximizes the accumulated reward while bounding the aggregate safety violation under a threshold. However, aggregate safety costs do not support reachability constraints in the safe RL context. In contrast, VEL ensures that a learned controller is formally verified correct and can better handle reachability constraints beyond safety. Model-based safe learning is combined with formal verification in [22], where an environment model is updated as learning progresses to take into account the deviations between the model and the actual system behavior. We leave combining VEL with model-based learning to future work.

Safe Shielding. The general idea of shielding is to use a backup controller to enforce the safety of a deep neural network controller [3]. The backup controller is less performant than the neural controller but is safe by construction using formal methods. The backup controller runs in tandem with the neural controller. Whenever the neural controller is about to leave the provably safe state space governed by the backup controller, the backup controller overrides the potentially unsafe neural actions to keep the neural controller within the certified safe space [2, 6, 11, 21, 22, 24, 32, 56]. In contrast, VEL directly integrates formal verification into controller learning loops to ensure that learned controllers are correct-by-construction and hence eliminates the need for shielding.

7 Conclusion

We present VEL, which bridges formal verification and synthesis for learning correct-by-construction programmatic controllers. VEL integrates formal verification into a controller learning loop to enable counterexample-guided controller optimization. VEL encodes verification feedback as a loss function of the parameters of a programmatic controller over the verification proof space. Its optimization procedure iteratively reduces correctness violations arising from both true counterexamples and overapproximation errors caused by abstraction. Our experiments demonstrate that controller updates based on verification feedback can lead to provably correct programmatic controllers. For future work, we plan to extend VEL to support controller safety during exploration in noisy environments. When a worst-case environment model is provided, this can be achieved by repeatedly leveraging the verification feedback on safety violations to project a controller back onto the verified safe space [12] after each reinforcement learning step taken on the parameter space of the controller.