
1 Introduction

Controller search is commonly used to design controllers that govern cyber-physical systems such as autonomous vehicles, where high assurance is particularly important. Reinforcement Learning (RL) of neural network controllers is a promising approach for controller search [19]. State-of-the-art RL algorithms can learn motor skills autonomously through trial and error in simulated or even unknown environments, thus avoiding tedious manual engineering. However, well-trained neural network controllers may still be unsafe since the RL algorithms do not provide any formal guarantees on safety. A learned controller may fail occasionally but catastrophically, and debugging these failures can be challenging [46].

Guaranteeing the correctness of an RL controller is therefore important. Principally, given an environment model, the correctness of a controller can be verified by reachability analysis over a closed-loop system that combines the environment model and the controller. Indeed, the use of formal verification techniques to aid the design of reliable learning-enabled autonomous systems has risen rapidly over the last few years [17, 18, 28, 41, 43]. A natural follow-up question is: when verification fails, can we exploit verification feedback in the form of counterexamples to synthesize a verifiably correct controller? This turns out to be a very challenging task for the following reasons.

Fig. 1.
figure 1

An oscillator programmatic controller and its reachability analysis. In Fig. 1b, the red region represents the oscillator unsafe set \((-0.3, -0.25) \times (0.2, 0.35)\), and the blue region depicts the target set \([-0.05, 0.05]\times [-0.05,0.05]\). The initial state set of the oscillator is \([-0.51, -0.49]\times [0.49,0.51]\).

Verification Scalability. A counterexample-guided controller synthesizer has to iteratively conduct reachability analysis and controller optimization as each iteration may discover a new counterexample. However, repeatedly calculating the reachable set of a nonlinear system controlled by a neural network controller over a long horizon is computationally challenging. For example, consider designing a controller for the Van der Pol’s oscillator system [49]. The oscillator is a 2-dimensional non-linear system whose state transition can be expressed by the following ordinary differential equations:

$$\begin{aligned} \dot{x_1} = x_2\quad&\quad \dot{x_2} = (1-x_1^2)x_2 - x_1 + u \end{aligned}$$
(1)

where \((x_1, x_2)\) are the system state variables and u is the control action variable. A feedback controller \(\pi (x_1, x_2)\) measures the current system state and then manipulates the control input u as needed to drive the system toward its target. The initial set of the control system is \((x_1, x_2) \in [-0.51, -0.49]\times [0.49,0.51]\). As depicted in Fig. 1b, the controlled system is expected to reach the target region in blue while avoiding the obstacle region in red within 120 timesteps (i.e. control steps). In our experience, even for this simple example, using Verisig [28] and ReachNN\(^*\) [18] (two state-of-the-art verification tools for neural network controlled systems) to calculate the reachable set of a simple 2-layer neural network feedback controller \(\pi _\textit{NN}(x_1, x_2)\) takes more than 100 seconds each. Repeatedly conducting reachability analysis of a complex neural network controller within a counterexample-guided learning loop is even more costly.
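For concreteness, the closed-loop rollout described above can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions: the explicit Euler sub-stepping and the placeholder zero controller are ours for illustration only, not part of VEL or its verification pipeline.

```python
import numpy as np

def oscillator_dynamics(x, u):
    """Van der Pol oscillator ODE (Equation 1): x1' = x2, x2' = (1 - x1^2) x2 - x1 + u."""
    x1, x2 = x
    return np.array([x2, (1.0 - x1 ** 2) * x2 - x1 + u])

def rollout(controller, x0, delta=0.05, steps=120, substeps=50):
    """Simulate the closed-loop system: the controller is applied once per delta-period,
    and the ODE is integrated with simple Euler sub-steps in between."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = xs[-1]
        u = controller(x)                       # control action held for one timestep
        h = delta / substeps
        for _ in range(substeps):
            x = x + h * oscillator_dynamics(x, u)
        xs.append(x)
    return np.array(xs)

if __name__ == "__main__":
    zero_controller = lambda x: 0.0             # placeholder; replaced by a learned controller
    traj = rollout(zero_controller, x0=[-0.5, 0.5])
    in_target = np.all(np.abs(traj[-1]) <= 0.05)
    print("final state:", traj[-1], "reached target:", in_target)
```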

Recently, programmatic controllers have emerged as a promising solution to the lack of interpretability in deep reinforcement learning [27, 38, 44, 47] by training controllers as programs. A programmatic controller for the oscillator environment, learned by a programmatic reinforcement learning algorithm [38], is depicted in Fig. 1a. We depict the decision boundary of the program's conditional statement (\(28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16 = 0\)) in Fig. 1b. The program can be interpreted as a decomposition of the reach-avoid learning problem into two sub-problems: the linear controller in the else branch of the program first pushes the system away from the obstacle, and then the linear controller in the then branch takes over to make the system reach the target. As we show in this paper, the compact and structured representation of a programmatic controller makes it amenable to off-the-shelf hybrid or continuous system reachability tools, e.g. [10, 20]. Compared with verifying a deep neural network controller, reasoning about a programmatic controller is more tractable. However, the question remains when verification fails: rather than retraining a new controller, how can we leverage verification feedback to construct a verifiably correct controller?
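In its deployed form, this controller is simply a guarded choice between two affine sub-controllers. The sketch below is a direct Python transcription; the branch coefficients are taken from the controller semantics spelled out in Sec. 4.1, and the hard conditional shown here is what training replaces with the smoothed semantics of Sec. 3.

```python
def oscillator_controller(x1, x2):
    """Programmatic controller of Fig. 1a: a conditional over two affine sub-controllers.
    Branch coefficients follow the controller semantics given in Sec. 4.1."""
    if 28.33 * x1 + 4.23 * x2 + 4.16 > 0:       # decision boundary shown in Fig. 1b
        return 6.79 * x1 - 8.56 * x2 + 0.35     # then-branch: steer toward the target region
    else:
        return 11.01 * x1 - 13.50 * x2 + 8.71   # else-branch: push away from the obstacle first
```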

Proof Space Optimization. The other main challenge of verification-guided controller synthesis is that when verification fails, the counterexample path may provide little help or may even be spurious due to approximation errors. This is because reachability analyses typically overapproximate the true reachable sets using a computationally convenient representation such as polytopes [20] or Taylor models [10]. This overapproximation leads to quick error accumulation over time, known as the wrapping effect. Even a well-trained controller may fail verification because of approximation errors. For example, we adapted a state-of-the-art reachability analyzer Flow\(^*\) [10] to conduct reachability analysis of the closed-loop system composed of the programmatic controller in Fig. 1a and the oscillator environment (Equation 1), computing a reachable state set for each time interval within the episode horizon (the controller is applied to generate a control action at the start of each time interval). The result is depicted in Fig. 1b. Although the programmatic controller empirically succeeds in reaching the goal in extensive test simulations, the reachability analysis cannot determine whether the target region can always be reached, as it computes an ever-expanding reachable region that may be an overestimation caused by over-approximation.

We hypothesize that verification failures can be caused by (1) true counterexamples of unsafe states, (2) states introduced by approximation errors, and (3) states occurring within the time interval of each control step (RL algorithms only sample states at the start and the end of a time interval). The latter two kinds of states cannot be observed by an RL algorithm during training in the concrete system state space. Thus, counterexample-guided controller synthesis may not work well if counterexamples are in the form of paths within the concrete state space.

To address this challenge, we propose synthesizing controllers in the proof space of a reachability analyzer. Controller synthesis in the proof space is critical to learning a verified controller because it can leverage verification feedback on either true unsafe counterexample states or approximation errors introduced by the verification procedure to search for a provably correct controller. A counterexample detected by a reachability analyzer is a symbolic rollout of abstract states of the closed-loop system that combines a (fixed) environment model and a (parameterized) programmatic controller. An abstract state (e.g. depicted as a green region in Fig. 1b) at a timestep over-approximates the set of concrete states reachable during the time interval of the timestep. VEL quantifies the safety and reachability property violation by the abstract states, e.g. there is an abstract loss between the over-approximated abstract state at the last control step and the target region. The loss approximates the worst-case reachability loss of any concrete state subsumed by the abstraction. We introduce lightweight gradient-descent style optimization algorithms that optimize controller parameters to effectively minimize the amount of correctness property violation to zero and thereby refute any verification counterexamples.

Contributions. The main contribution of this paper is twofold. First, we present an efficient controller synthesis approach that integrates formal verification within a programmatic controller learning loop. Second, instead of synthesizing a programmatic controller from concrete state and action samples, we optimize the controller using symbolic rollouts with abstract states obtained by reachability analysis in the verification proof space. We implement the proposed ideas in a tool called VEL and present a detailed experimental study over a range of reinforcement learning systems. Our experiments demonstrate the benefits of integrating formal verification as part of the training objective and using verification feedback for controller synthesis.

2 Problem Setup

Environment Models. An environment is a structure \(M^\delta [\cdot ] = ({S}, {A}, F:\{{S} \times {A} \rightarrow {S}\}, R:\{{S} \times {A} \rightarrow \mathbb {R}\}, \cdot )\) where S is an infinite set of continuous real-vector environment states which are valuations of the state variables \(x_1,x_2,\ldots ,x_n\) of dimension n (\({S} \subseteq \mathbb R^{n}\)); and A is a set of continuous real-vector control actions which are valuations of the action variables \(u_1,u_2,\ldots ,u_m\) of dimension m. F is a state transition function that emits the next environment state given a current state s and an agent action a. We assume that F is defined by an ordinary differential equation (ODE) in the form of \(\dot{x} = f(x, u)\) and the function \(f : \mathbb {R}^n \times \mathbb {R}^m \rightarrow \mathbb {R}^n\) is Lipschitz continuous in x and continuous in u. R(s, a) is the immediate reward after a transition from an environment state \(s \in S\) with action \(a \in A\). An environment \(M^\delta [\cdot ]\) is parameterized with an (unknown) controller.

Controllers. An agent uses a controller to interact with an environment \(M^\delta [\cdot ]\). We explicitly model the deployment of a (learned) controller \(\pi : \{{S} \rightarrow {A}\}\) in \(M^\delta [\cdot ]\) as a closed-loop system \(M^\delta [\pi ]\). The controller \(\pi \) determines which action the agent ought to take in a given environment state. Specifically, it is invoked once per timestep, i.e. every \(\delta \) time period. \(\pi \) reads the environment state \(s_i = s(i \delta )\) at time \(t = i \delta \) (\(i = 0, 1, 2, \ldots \)), i.e. at timestep i, and computes a control action \(a_i = a(i \delta ) = \pi (s(i \delta ))\). The environment then evolves following the ODE \(\dot{x} = f(x, a(i \delta ))\) within the time period \([i \delta ,(i+1) \delta ]\) and reaches the state \(s_{i+1} = s((i+1) \delta )\) at the next timestep \(i+1\). In the oscillator example from Sec. 1, the duration \(\delta \) of a timestep is 0.05s and the time horizon is 6s (i.e. 120 timesteps).

For environment simulation, given a set of initial states \(S_0\), we assume the existence of a flow function \(\phi (s_0, t) : S_0 \times \mathbb {R}^+ \rightarrow S\) that maps some initial state \(s_0\) to the environment state \(\phi (s_0, t)\) at time t where \(\phi (s_0, 0) = s_0\). We note that \(\phi \) is the solution of the ODE \(\dot{x} = f(x, a(i \delta ))\) in the state transition function F during the time period \([i \delta ,(i+1) \delta ]\) and \(a(i \delta ) = \pi (\phi (s_0, i \delta ))\).

Reinforcement Learning (RL). Given a set of initial states \(S_0\) and a time horizon \(T\delta \) (\(T > 0\)) with \(\delta \) as the duration of a timestep, a T-timestep rollout \(\zeta \) of a controller \(\pi \) is denoted as \((\zeta = s_0, a_0, s_1, \ldots , s_{T}) \sim \pi \) where \(s_i = s(i \delta )\) and \(a_i = a(i \delta )\) are the environment state and the action taken at timestep i such that \(s_0 \in S_0\), \(s_{i+1} = F(s_i, a_i)\), and \(a_i = \pi (s_i)\). The aggregate reward of \(\pi \) is

$$\begin{aligned} J^R(\pi ) = \mathbb {E}_{(\zeta = s_0, a_0, \ldots , s_{T}) \sim \pi }[\sum ^{T}_{i=0} \beta ^i R(s_i, a_i)] \end{aligned}$$
(2)

where \(\beta \) is the reward discount factor (\(0 < \beta \le 1\)). Controller search via RL aims to produce a controller \(\pi \) that maximizes \(J^R(\pi )\).
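As a small illustration of Equation 2, the expectation can be estimated by averaging discounted returns over sampled rollouts; `sample_rollout` below is a hypothetical helper that simulates one rollout and returns its reward sequence.

```python
def discounted_return(rewards, beta=0.99):
    """Discounted return of one rollout: sum_{i=0}^{T} beta^i * R(s_i, a_i)  (Equation 2)."""
    return sum((beta ** i) * r for i, r in enumerate(rewards))

def estimate_objective(sample_rollout, num_rollouts=100, beta=0.99):
    """Monte-Carlo estimate of J^R(pi): average the discounted return over sampled rollouts."""
    returns = [discounted_return(sample_rollout(), beta) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)
```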

Controller Correctness Specification. A correctness specification of a controller is a logical formula specifying whether any rollout \(\zeta \) of the controller accomplishes the task without violating safety properties and reachability properties. To define safety and reachability over rollouts, the user first specifies a set of atomic predicates over environment states s.

Definition 1 (Predicates)

A predicate \(\varphi \) is a quantifier-free Boolean combination of linear inequalities over the environment state variables x:

  • \(\langle \varphi \rangle \)  ::= \(\langle {P}\rangle \) | \(\varphi \) \(\wedge \) \(\varphi \) | \(\varphi \) \(\vee \) \(\varphi \);

  • \(\langle {P}\rangle \)  ::= \(\mathcal {A} \cdot x \le b\) where \(\mathcal {A} \in \mathbb R^{\vert x \vert }, \, b \in \mathbb R\);

A state \(s \in {S}\) satisfies a predicate \(\varphi \), denoted as \(s\,\models \,\varphi \), iff \(\varphi (s)\) is true.
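As an illustration of Definition 1, the predicate language can be encoded as a small Python AST together with an evaluator for \(s\,\models \,\varphi \). The class names below are ours (not VEL's internal representation) and are reused by the later loss-function sketches.

```python
from dataclasses import dataclass
from typing import Sequence, Union
import numpy as np

@dataclass
class Atom:
    """Linear atom  A . x <= b  over the state variables (Definition 1)."""
    A: Sequence[float]
    b: float

@dataclass
class And:
    left: "Predicate"
    right: "Predicate"

@dataclass
class Or:
    left: "Predicate"
    right: "Predicate"

Predicate = Union[Atom, And, Or]

def satisfies(s, phi) -> bool:
    """Evaluate s |= phi for a concrete state s (a real vector)."""
    if isinstance(phi, Atom):
        return float(np.dot(phi.A, s)) <= phi.b
    if isinstance(phi, And):
        return satisfies(s, phi.left) and satisfies(s, phi.right)
    return satisfies(s, phi.left) or satisfies(s, phi.right)   # Or
```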

The correctness requirement of a controller extends from predicates over environment states s to specifications over controller rollouts \(\zeta \).

Definition 2 (Rollout Specifications)

The syntax of our correctness specifications for RL controllers is defined as:

$$\begin{aligned} \psi \,\,{:}{:}\!\!= \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2 \end{aligned}$$

In a rollout specification, \(\varphi _\textit{I}\ \texttt {reach}\ \varphi _1\) enforces reachability: starting from an initial state that satisfies \(\varphi _\textit{I}\), the controlled agent should eventually reach some goal state on which the predicate \(\varphi _1\) evaluates to true. For instance, the agent should reach a goal region from an initial state. The constraint \(\texttt {ensuring}\ \varphi _2\) additionally enforces safety: any rollout of the controller should only visit safe states on which the predicate \(\varphi _2\) evaluates to true. For example, the agent should remain within a safety boundary or avoid any obstacles throughout a rollout. Formally, the semantics of a rollout specification \(\psi \) is defined as follows:

$$ \llbracket \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2 \rrbracket (\zeta _{0:T})\ =\ \varphi _1(s_T)\ \wedge \ (\forall \ 0 \le i \le T.\ \varphi _2(s_i)) $$

where \(\zeta _{0:T} = s_0, s_1, \ldots , s_T\) is a rollout such that \(\varphi _\texttt {I}(s_0)\) holds and \(T > 0\) denotes the total number of timesteps. Our specification implicitly requires that if the target region is reached before timestep T of a rollout, the controlled agent must still be in the target region at the end of the rollout.

Given a time horizon \(T\delta \) (\(T > 0\)), a controller \(\pi \) is correct for an environment \(M^\delta [\cdot ]\) with respect to a rollout specification \(\psi \,\,{:}{:}\!\!= \varphi _\texttt {I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\) iff for any rollout \(\zeta _{0:T} = s_0, s_1, \ldots s_{T-1}, s_{T}\) of \(M^\delta [\pi ]\) such that \(\varphi _\texttt {I}(s_0)\) holds, \(\llbracket \psi \rrbracket (\zeta _{0:T})\) is true. Notice that this definition does not consider any states of the continuous environment occurring within the time period of a timestep.
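Putting Definitions 1 and 2 together, checking a concrete rollout against a rollout specification combines a reachability check on the final state with a safety check on every visited state. The sketch below reuses the `satisfies` helper and predicate classes from the previous sketch and assumes the rollout's initial state already satisfies \(\varphi _\texttt {I}\).

```python
def satisfies_rollout_spec(rollout, phi_reach, phi_safe):
    """[[ phi_I reach phi_1 ensuring phi_2 ]](zeta_{0:T})  (Definition 2).
    rollout is the list of concrete states s_0, ..., s_T; satisfies() is from
    the predicate sketch after Definition 1."""
    s_T = rollout[-1]
    return satisfies(s_T, phi_reach) and all(satisfies(s_i, phi_safe) for s_i in rollout)
```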

Example 1

Continuing the oscillator example, assume the oscillator's initial state \((x_1, x_2)\) is drawn from \([-0.51, -0.49]\times [0.49,0.51]\). The initial state constraint is specified as:

$$\begin{aligned} \varphi _\texttt {I}(x_1, x_2) \equiv -0.51 \le x_1 \le -0.49 \wedge 0.49 \le x_2 \le 0.51 \end{aligned}$$

The unsafe set of the oscillator is \((-0.3, -0.25) \times (0.2, 0.35)\) (depicted as the red region in Fig. 1b). The safety property \(\varphi _{\textit{safe}}\) of the system is specified as:

$$\begin{aligned} \varphi _{\textit{safe}}(x_1, x_2) \equiv x_1 \le -0.3 \vee x_1 \ge -0.25 \vee x_2 \le 0.2 \vee x_2 \ge 0.35 \end{aligned}$$

For this example, the target region is \([-0.05, 0.05]\times [-0.05,0.05]\) (the blue region in Fig. 1b). The reachability property \(\varphi _{\textit{reach}}\) of the system is specified as:

$$\begin{aligned} \varphi _{\textit{reach}}(x_1, x_2) \equiv -0.05 \le x_1 \le 0.05 \wedge -0.05 \le x_2 \le 0.05 \end{aligned}$$

The target region should be eventually reached by the end of a control episode while avoiding the unsafe state region. We express the rollout specification as:

$$\begin{aligned} \varphi _\textit{I}(x_1,x_2)\ \texttt {reach}\ \varphi _{\textit{reach}}(x_1, x_2)\ \texttt {ensuring}\ \varphi _{\textit{safe}}(x_1, x_2) \end{aligned}$$

The following specification formulates that a desired controller stabilizes the oscillator around the target region over an infinite time horizon:

$$\begin{aligned} \varphi _\textit{reach}(x_1,x_2)\ \texttt {reach}\ \varphi _{\textit{reach}}(x_1, x_2)\ \texttt {ensuring}\ \varphi _{\textit{safe}}(x_1, x_2) \end{aligned}$$
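Using the predicate encoding sketched after Definition 1 (our illustrative representation, not VEL's), the Example 1 predicates can be written down directly; note that lower bounds such as \(x_1 \ge -0.25\) are rewritten as \(-x_1 \le 0.25\) to fit the \(\mathcal {A} \cdot x \le b\) atom form.

```python
# Atoms are A . x <= b over x = (x1, x2); Atom/And/Or are from the earlier sketch.
phi_init = And(And(Atom([1, 0], -0.49), Atom([-1, 0], 0.51)),     # -0.51 <= x1 <= -0.49
               And(Atom([0, 1], 0.51), Atom([0, -1], -0.49)))     #  0.49 <= x2 <=  0.51
phi_safe = Or(Or(Atom([1, 0], -0.3), Atom([-1, 0], 0.25)),        # x1 <= -0.3 or x1 >= -0.25
              Or(Atom([0, 1], 0.2), Atom([0, -1], -0.35)))        # x2 <=  0.2 or x2 >=  0.35
phi_reach = And(And(Atom([1, 0], 0.05), Atom([-1, 0], 0.05)),     # -0.05 <= x1 <= 0.05
                And(Atom([0, 1], 0.05), Atom([0, -1], 0.05)))     # -0.05 <= x2 <= 0.05
```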

3 Programmatic Controllers

Programmatic controllers have emerged as a promising solution to address the lack of interpretability in deep reinforcement learning [8, 27, 38, 47] by learning controllers as programs. This paper focuses on programmatic controllers structured as differentiable programs [38].

Our programmatic controllers follow the high-level context-free grammar depicted in Fig. 2, where E is the start symbol and \(\theta \) represents the real-valued parameters of the program. The nonterminals E and B stand for program expressions that evaluate to action values in \(\mathbb {R}^m\) and Booleans, respectively, where m is the action dimension size, \(\theta _1 \in \mathbb {R}\) and \(\theta _2 \in \mathbb {R}^{n}\). We represent a state input to a programmatic controller as \(s=\{x_1:\nu _1,x_2:\nu _2,\ldots ,x_n:\nu _n\}\) where n is the state dimension size and \(\nu _i=s[x_i]\) is the value of \(x_i\) in s. As usual, the unbound variables in \(\mathcal {X} = [x_1, x_2, \ldots , x_n]\) are assumed to be input variables (i.e., state variables). C is a low-level affine controller that can be invoked by a programmatic controller, where \(\theta _3, \theta _c \in \mathbb {R}^m, \theta _4 \in \mathbb {R}^{m \cdot n}\) are controller parameters. Notice that C can be as simple as some (learned) constants \(\theta _c\).

Fig. 2.
figure 2

A context-free grammar for programmatic controllers.

The semantics of a programmatic controller in E is mostly standard and given by a function \(\llbracket E \rrbracket (s)\), defined for each language construct. For example, \(\llbracket x_i \rrbracket (s)=s[x_i]\) reads the value of a variable \(x_i\) in a state s. A controller may use an if-then-else branching construct. To avoid discontinuities for differentiability, we interpret its semantics in terms of a smooth approximation:

$$\begin{aligned} \llbracket {\textbf {if}}\ {}&B\ {\textbf {then}}\ C\ {\textbf {else}}\ E \rrbracket (s) = \sigma (\llbracket B \rrbracket (s)) \cdot \llbracket C \rrbracket (s) + (1 - \sigma (\llbracket B \rrbracket (s))) \cdot \llbracket E \rrbracket (s) \end{aligned}$$
(3)

where \(\sigma \) is the sigmoid function. Thus, any controller programmed in this grammar is a differentiable program. During execution, a programmatic controller invokes a set of low-level affine controllers under different environment conditions, according to the activation of the B conditions in the program.
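A direct Python transcription of the smoothed semantics in Equation 3, instantiated for the oscillator controller of Fig. 1a, is sketched below; the coefficients are those given in Sec. 4.1, and NumPy is used only for the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothed_if(b_value, then_value, else_value):
    """Smoothed if-then-else semantics of Equation 3:
    sigma([[B]](s)) * [[C]](s) + (1 - sigma([[B]](s))) * [[E]](s)."""
    g = sigmoid(b_value)
    return g * then_value + (1.0 - g) * else_value

def oscillator_controller_smoothed(x1, x2):
    """Differentiable form of the Fig. 1a controller used during training."""
    b = 28.33 * x1 + 4.23 * x2 + 4.16
    return smoothed_if(b,
                       6.79 * x1 - 8.56 * x2 + 0.35,     # then-branch affine controller
                       11.01 * x1 - 13.50 * x2 + 8.71)   # else-branch affine controller
```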

Programmatic Reinforcement Learning. We use the programmatic reinforcement learning algorithm [38] to learn a programmatic controller. Compared with other programmatic reinforcement learning approaches [27, 47], this algorithm stands out by jointly learning both program structures and program parameters. Empirical results show that learned programmatic controllers achieve comparable or even better reward performance than deep neural networks [38].

4 Proof Space Optimization

The main challenge of using a verification procedure to guide controller synthesis is that verifiers are in general incomplete. When verification fails, the system under verification does not necessarily have a true counterexample: the verifier may introduce states caused by over-approximation errors, as is common in reachability analysis. Even a well-trained controller may fail verification because of approximation errors. In our context, for soundness, reachability analysis of continuous or hybrid systems additionally takes into account environment states occurring within the time interval of a timestep. Both of these kinds of states cannot be observed by RL agents during training in the concrete state space, which underscores the importance of controller optimization in the proof space of verification. In the following, Sec. 4.1 defines a verification procedure for environment models governed by programmatic controllers. Sec. 4.2 encodes verification feedback as a loss function of controller parameters over the verification proof space. Finally, Sec. 4.3 defines an optimization procedure that iteratively minimizes the loss function for correct-by-construction controller synthesis.

4.1 Controller Verification

We formalize controller synthesis as a verification-based controller optimization problem. A synthesized controller \(\pi \) is certified by a formal verifier against an environment model \(M^\delta [\cdot ]\) and a rollout specification \(\psi \) (Definition 2). The verifier returns true if \(\pi \) can be verified correct.

Reinforcement learning algorithms typically discretize a continuous environment model \(M^\delta [\cdot ]\) to sample environment states every \(\delta \) time period (as a timestep) for controller learning (Sec. 2). For soundness, in verification our approach instead considers all states reachable by the original continuous system. Formally, given a set of initial states \(S_0\), we use \(S_i\) (\(i > 0\)) to represent the set of reachable concrete states during the time interval of \([(i-1) \delta ,\ i \delta ]\):

$$\begin{aligned} S_{i} = \{\phi (s_0, t)\ \vert \ s_0 \in S_0,\ t \in [(i-1) \delta ,\ i \delta ]\} \end{aligned}$$

where \(\phi \) is the flow function for environment state transition defined in Sec. 2. Our algorithm uses abstract interpretation to soundly approximate the set of reachable states \(S_{i}\) at each time step by reachability analysis.

Definition 3 (Symbolic Rollouts)

Given an environment model \(M^\delta [\pi ] = ({S}, {A}, F, R, \pi )\) deployed with a controller \(\pi \), a set of initial states \(S_0\), and an abstract domain \(\mathcal {D}\), a symbolic rollout of \(M^\delta [\pi ]\) over \(\mathcal {D}\) is \(\zeta ^\mathcal {D}= S^\mathcal {D}_0, S^\mathcal {D}_1, \ldots \) where \(S^\mathcal {D}_0 = \alpha (S_0)\) is the abstraction of the initial states \(S_0\) in \(\mathcal {D}\). Each symbolic state \(S^\mathcal {D}_{i} = F^{\mathcal {D}}[\pi ]\big ( S^\mathcal {D}_{i-1} \big )\) over-approximates \(S_i\), the set of states reachable from the initial set \(S_0\) during the time interval \([(i-1) \delta , i \delta ]\) of the timestep i. \(F^{\mathcal {D}}\) is an abstract transformer for \(M^\delta [\pi ]\)'s state transition function F.

Our implementation of the abstract interpreter \(F^{\mathcal {D}}\) is based on Flow\(^*\) [10], a reachability analyzer for continuous or hybrid systems, where the abstract domain \(\mathcal {D}\) is Taylor Model (TM) flowpipes. Formally, for reachability computation at each timestep i (where \(i > 0\)), we first use Flow\(^*\) to evaluate the TM flowpipe \(\hat{S}_{i-1}\) for the reachable set of states at time \(t = (i-1)\delta \). To obtain a TM representation for the output set of the programmatic controller at timestep i, we use TM arithmetic to evaluate a TM flowpipe \(\hat{A}_{i-1}\) for \(\llbracket \pi \rrbracket (s)\) for all states \(s \in \hat{S}_{i-1}\). Here \(\llbracket \pi \rrbracket \) encodes the semantics of \(\pi \) (Equation 3). For example, the semantics of the oscillator controller in Fig. 1a is:

$$\begin{aligned}&\sigma (28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16) \times (6.79 x_1 ~ - ~ 8.56 x_2 ~ + ~ 0.35) \\&\, + (1 - \sigma (28.33 x_1 ~ + ~ 4.23 x_2 ~ + ~ 4.16)) \times (11.01 x_1 ~ - ~ 13.50 x_2 ~ + ~ 8.71) \end{aligned}$$

where the sigmoid function \(\sigma \) can be handled by TM arithmetic. The resulting TM representation \(\hat{A}_{i-1}\) can be viewed as an overapproximation of the controller’s output at timestep i. Finally, we use Flow\(^*\) to construct the TM flowpipe overapproximation \(S^\mathcal {D}_{i}\) for all reachable states during the time period at timestep i by reachability analysis over the ODE dynamics of the transition function \(\dot{x} = f(x, a)\) for \(\delta \) time period with initial state \(x(0) \in \hat{S}_{i-1}\) and the control action \(a \in \hat{A}_{i-1}\).
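To make the structure of a symbolic rollout concrete, the sketch below uses interval boxes as a deliberately coarse stand-in for Taylor-model flowpipes: because the smoothed program outputs a convex combination of its two affine branches, the interval hull of the branch ranges already encloses the controller output over a state box, while the ODE step is left as an abstract `step_box` callable (in VEL this role is played by Flow\(^*\); we do not model its API here).

```python
import numpy as np

def interval_affine(lo, hi, w, c):
    """Tight bounds of the affine map w . x + c over the state box [lo, hi]."""
    lo, hi, w = (np.asarray(v, float) for v in (lo, hi, w))
    return (c + float(np.sum(np.where(w >= 0, w * lo, w * hi))),
            c + float(np.sum(np.where(w >= 0, w * hi, w * lo))))

def abstract_controller_output(lo, hi):
    """Enclose the Fig. 1a controller's output over a state box. The smoothed program
    (Equation 3) outputs a convex combination of its two affine branches, so the interval
    hull of the two branch ranges is a sound, if coarse, stand-in for the Taylor-model
    enclosure that Flow* computes with TM arithmetic."""
    t_lo, t_hi = interval_affine(lo, hi, [6.79, -8.56], 0.35)
    e_lo, e_hi = interval_affine(lo, hi, [11.01, -13.50], 8.71)
    return min(t_lo, e_lo), max(t_hi, e_hi)

def symbolic_rollout(step_box, s0_lo, s0_hi, steps):
    """Skeleton of a symbolic rollout S^D_0, S^D_1, ... (Definition 3). step_box plays the
    role of the abstract transformer F^D[pi]: given a state box, it must return a box that
    over-approximates all states reachable during the next delta-period."""
    boxes = [(np.asarray(s0_lo, float), np.asarray(s0_hi, float))]
    for _ in range(steps):
        boxes.append(step_box(*boxes[-1]))
    return boxes
```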

Verification Procedure. Given a closed-loop system \(M^\delta [\pi ]\), a time horizon \(T\delta \) (\(T > 0\)), and a rollout specification \(\psi \,\,{:}{:}\!\!= \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\), we obtain the symbolic rollout of \(M^\delta [\pi ]\) as \(\zeta ^\mathcal {D}_{0:T} = S^\mathcal {D}_0, S^\mathcal {D}_{1}, \ldots , S^\mathcal {D}_{T}\) where \(S^\mathcal {D}_0\) is the abstraction of all states in \(\varphi _\textit{I}\) in the abstract domain \(\mathcal {D}\). For formal verification, we extend the semantics definition of the rollout specification \(\llbracket \psi \rrbracket \) over concrete rollouts (Definition 2) to support symbolic rollouts. Formally, \(\llbracket \psi \rrbracket (\zeta ^\mathcal {D}_{0:T})\) holds iff:

$$ \forall s \in \gamma (S^\mathcal {D}_{T}).\ \varphi _1(s)\ \ \wedge \ \ \forall \ 0 \le i \le T.\ \forall s \in \gamma (S^\mathcal {D}_{i}).\ \varphi _2(s) $$

where \(\gamma \) is the concretization function of the abstract domain \(\mathcal {D}\). The closed-loop system \(M^\delta [\pi ]\) satisfies \(\psi \), denoted as \(M^\delta [\pi ]\,\models \,\psi \), iff \(\llbracket \psi \rrbracket (\zeta ^\mathcal {D}_{0:T})\) holds. The abstract domain \(\mathcal {D}\) is the proof space of controller verification.

Example 2

To verify the closed-loop system composed of the oscillator ODE in Eq. 1 and the learned controller in Fig. 1a, we conducted reachability analysis to overapproximate the reachable state set during the time period of each timestep within the episode horizon. The resulting TM flowpipes are depicted as a sequence of green regions in Fig. 1b. The verification procedure cannot guarantee that the target will eventually be reached due to the approximation errors.

4.2 Correctness Property Loss in the Proof Space

To facilitate controller optimization in the presence of verification failures, our approach measures the amount of correctness property violation as verification feedback. To this end, we first define correctness property violation over the concrete environment state space and then lift this definition to the proof space of controller verification.

We note that a controller rollout that fails correctness property verification violates desired properties at some states. The following definition characterizes a correctness loss function to quantify the correctness property violation of a state.

Definition 4 (State Correctness Loss Function)

For a predicate \(\varphi \) over states \(s \in S\), we define a non-negative loss function \(\mathcal {L}(s, \varphi )\) such that \(\mathcal {L}(s, \varphi ) = 0\) iff s satisfies \(\varphi \), i.e. \(s\,\models \,\varphi \). We define \(\mathcal {L}(s, \varphi )\) recursively, based on the possible shapes of \(\varphi \) (Definition 1):

  • \(\mathcal {L}(s, \mathcal {A} \cdot x \le b) := \max (\mathcal {A} \cdot s - b, 0)\)

  • \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

  • \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}(s, \varphi _1), \mathcal {L}(s, \varphi _2))\)

Notice that \(\mathcal {L}(s, \varphi _1 \wedge \varphi _2) = 0\) iff \(\mathcal {L}(s, \varphi _1) = 0\) and \(\mathcal {L}(s, \varphi _2) = 0\), and similarly \(\mathcal {L}(s, \varphi _1 \vee \varphi _2) = 0\) iff \(\mathcal {L}(s, \varphi _1) = 0\) or \(\mathcal {L}(s, \varphi _2) = 0\).
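Definition 4 translates directly into a recursive function over the predicate AST sketched earlier (our illustrative encoding); zero loss coincides with satisfaction.

```python
import numpy as np

def state_loss(s, phi):
    """State correctness loss L(s, phi) of Definition 4 over the Atom/And/Or
    predicate classes from the earlier sketch: zero iff s |= phi."""
    if isinstance(phi, Atom):
        return max(float(np.dot(phi.A, s)) - phi.b, 0.0)
    if isinstance(phi, And):
        return max(state_loss(s, phi.left), state_loss(s, phi.right))
    return min(state_loss(s, phi.left), state_loss(s, phi.right))   # Or
```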

Our objective is to use verification feedback to improve controller safety. To this end, we lift the correctness loss function over concrete states (Definition 4) to an abstract correctness loss function over abstract states.

Definition 5 (Abstract State Correctness Loss Function)

Given an abstract state \(S^\mathcal {D}\) and a predicate \(\varphi \), we define an abstract correctness loss function:

$$ \mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = \max _{s \in \gamma (S^{\mathcal {D}})} \mathcal {L}(s, \varphi ) $$

where \(\gamma \) is the concretization function of the abstract domain \(\mathcal {D}\). The abstract correctness loss function applies \(\gamma \) to obtain all concrete states represented by an abstract state \(S^\mathcal {D}\). It measures the worst-case correctness loss of \(\varphi \) among all concrete states subsumed by \(S^\mathcal {D}\). Given an abstract domain \(\mathcal {D}\), we can usually approximate the concretization of an abstract state \(\gamma (S^{\mathcal {D}})\) with a tight interval \(\gamma _I(S^\mathcal {D})\). As exemplified in Fig. 1b, it is straightforward to represent Taylor model flowpipes as intervals in Flow\(^*\). Based on the possible shape of \(\varphi \), we redefine \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi )\) as:

  • \(\mathcal {L}_\mathcal {D}(S^\mathcal {D}, \mathcal {A} \cdot x \le b) := \max _{s \in \gamma _I(S^{\mathcal {D}})}\big (\max (\mathcal {A} \cdot s - b, 0)\big )\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \wedge \varphi _2) := \max (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)

  • \(\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1 \vee \varphi _2) := \min (\mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _1), \mathcal {L}_\mathcal {D}(S^{\mathcal {D}}, \varphi _2))\)
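With the interval concretization \(\gamma _I\), the worst case of a linear atom over a box is attained at a single corner selected by the signs of \(\mathcal {A}\), which makes Definition 5 cheap to evaluate. Below is a sketch over the earlier predicate classes, with abstract states represented as `(lo, hi)` box bounds (again an illustrative encoding, not VEL's internals).

```python
import numpy as np

def abstract_loss(lo, hi, phi):
    """Abstract correctness loss L_D(S^D, phi) of Definition 5, with the abstract state
    concretized to the interval box [lo, hi] via gamma_I."""
    if isinstance(phi, Atom):
        A = np.asarray(phi.A, float)
        worst = np.where(A >= 0, hi, lo)               # box corner maximizing A . s
        return max(float(np.dot(A, worst)) - phi.b, 0.0)
    if isinstance(phi, And):
        return max(abstract_loss(lo, hi, phi.left), abstract_loss(lo, hi, phi.right))
    return min(abstract_loss(lo, hi, phi.left), abstract_loss(lo, hi, phi.right))  # Or
```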

Theorem 1 (Abstract State Correctness Loss Function Soundness)

Given an abstract state \(S^\mathcal {D}\) and a predicate \(\varphi \), we have:

$$ \mathcal {L}_\mathcal {D}(S^\mathcal {D}, \varphi ) = 0 \implies \forall s \in \gamma _I(S^\mathcal {D}).\ s\, \models \,\varphi . $$

We further lift the definition of the correctness loss function over abstract states (Definition 5) to a correctness loss function over symbolic rollouts.

Definition 6 (Symbolic Rollout Correctness Loss)

Given a rollout specification \(\psi := \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2\) and a symbolic rollout \(\zeta ^\mathcal {D}_{0:T} = S^\mathcal {D}_0, \ldots , S^\mathcal {D}_{T}\) where \(S^\mathcal {D}_0\) is the abstraction of all states in \(\varphi _\textit{I}\) in the abstract domain \(\mathcal {D}\), we define an abstract correctness loss function \(\mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T},\ \psi )\) measuring the degree to which the rollout specification is violated:

$$ \mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T},\ \varphi _\textit{I}\ \texttt {reach}\ \varphi _1\ \texttt {ensuring}\ \varphi _2)\ =\ \max (\mathcal {L}_\mathcal {D}(S^\mathcal {D}_T, \varphi _1),\ \max _{0<i\le T}(\mathcal {L}_\mathcal {D}(S^\mathcal {D}_i, \varphi _2))) $$

Definition 6 enables a quantitative metric for the correctness loss of a controller in the verification proof space. Given a closed-loop system \(M^\delta [\pi ]\), a time horizon \(T\delta \), a rollout specification \(\psi \), and the corresponding symbolic rollout \(\zeta ^\mathcal {D}_{0:T}\) of \(M^\delta [\pi ]\), the correctness loss of \(M^\delta [\pi ]\) with respect to \(\psi \), denoted as \(\mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi )\), is defined over the symbolic rollout, i.e. \(\mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi )\) = \(\mathcal {L}_\mathcal {D}(\zeta ^\mathcal {D}_{0:T}, \psi )\).
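Definition 6 then reduces to a maximum over the per-step abstract losses. A sketch over the box representation and `abstract_loss` helper used above:

```python
def rollout_loss(boxes, phi_reach, phi_safe):
    """Symbolic rollout correctness loss of Definition 6 over interval boxes (lo, hi)
    for S^D_0, ..., S^D_T: reachability is charged on the last abstract state and
    safety on every abstract state after the initial one."""
    lo_T, hi_T = boxes[-1]
    reach = abstract_loss(lo_T, hi_T, phi_reach)
    safe = max(abstract_loss(lo, hi, phi_safe) for lo, hi in boxes[1:])
    return max(reach, safe)
```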

Example 3

In Fig. 1b, there is a correctness loss (depicted as a red arrow) between the abstract state at the last timestep of the oscillator symbolic rollout and the desired reachable region \(\varphi _{\textit{reach}}\) defined in Example 1. We characterize it as an abstract state correctness loss. The whole symbolic rollout has the same correctness loss with respect to the rollout specification defined in Example 1.

Theorem 2 (Symbolic Rollout Correctness Soundness)

Given an environment \(M^\delta [\cdot ]\) deployed with a controller \(\pi \) and a rollout specification \(\psi \), we have

$$\begin{aligned} \mathcal {L}_\mathcal {D}(M^\delta [\pi ], \psi ) = 0 \implies M^\delta [\pi ]\,\models \,\psi . \end{aligned}$$

4.3 Controller Synthesis

The unique feature of our controller synthesis algorithm is that it leverages verification feedback on either true unsafe states or overapproximation errors introduced by verification to search for a provably correct controller.

Controller Synthesis in the Proof Space. We denote a programmatic controller \(\pi \) with trainable parameters \(\theta \) (e.g. from the grammar in Fig. 2) as \(\pi _\theta \). Given a closed-loop system \(M^\delta [\pi _\theta ]\), the correctness loss function \(\mathcal {L}_\mathcal {D}(M^\delta [\pi _\theta ], \psi )\) is essentially a function of \(\pi _\theta \)'s parameters \(\theta \). To reduce the correctness loss of \(\pi _\theta \) over the proof space \(\mathcal {D}\), we leverage a gradient-descent style optimization to update \(\theta \) by taking steps proportional to the negative of the gradient of \(\mathcal {L}_\mathcal {D}(M^\delta [\pi _\theta ], \psi )\) at \(\theta \). As opposed to standard gradient descent optimization, we optimize \(\pi _\theta \) based on symbolic rollouts in the proof space \(\mathcal {D}\), using the abstract interpreter (i.e. Flow\(^*\)) directly for verification-guided controller updates.

Algorithm 1. Verification-guided controller synthesis in the proof space.

Black-box Gradient Estimation. Directly deriving the gradients of \(\mathcal {L}_\mathcal {D}\), however, requires the controller verification procedure be differentiable, which is not supported by reachability analyzers such as Flow\(^*\). To overcome this challenge, our algorithm effectively estimates the gradients of \(\mathcal {L}_\mathcal {D}\) based on random search [34]. Given a closed-loop environment \(M^\delta [\pi _\theta ]\), at each training iteration, we obtain perturbed systems \(M^\delta [\pi _{\theta +\nu \omega }]\) and \(M^\delta [\pi _{\theta -\nu \omega }]\) where we add sampled Gaussian noise \(\omega \) to the current controller \(\pi _\theta \)’s parameters \(\theta \) in both directions and \(\nu \) is a small positive real number. By evaluating the abstract correctness losses of the symbolic rollouts of \(M^\delta [\pi _{\theta +\nu \omega }]\) and \(M^\delta [\pi _{\theta -\nu \omega }]\), we update \(\theta \) with a finite difference approximation along an unbiased estimator of the gradient:

$$\begin{aligned} \nabla _\theta \mathcal {L}_\mathcal {D}\leftarrow \frac{1}{N}\sum ^N_{k=1}\frac{\mathcal {L}_\mathcal {D}(M^\delta [\pi _{\theta +\nu \omega _k}],\ \psi ) - \mathcal {L}_\mathcal {D}(M^\delta [\pi _{\theta -\nu \omega _k}],\ \psi )}{\nu }\, \omega _k \end{aligned}$$

We update controller parameters \(\theta \) as follows where \(\eta \) is a learning rate:

$$\begin{aligned} \theta \leftarrow \theta - \eta \cdot \nabla _\theta \mathcal {L}_\mathcal {D}\end{aligned}$$

Our high-level controller synthesis algorithm is depicted in Algorithm 1. The algorithm takes as input an environment model \(M^\delta [\cdot ]\), a rollout specification \(\psi \), and a programmatic controller \(\pi \) learned using the programmatic reinforcement learning technique [38]. When verification fails (line 4), it uses the correctness loss of the symbolic rollout of \(M^\delta [\pi ]\) for optimization (lines 8-9). The algorithm repeatedly performs the gradient-based update until a verified controller is synthesized. As the controller verification procedure is undecidable in general, it is possible that Algorithm 1 converges with a nonzero correctness loss. Our empirical results in Sec. 5 demonstrate that the algorithm works well in practice.
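A compact sketch of Algorithm 1 with the random-search gradient estimator above is given below. `rollout_loss_of` is an assumed black-box callable that performs reachability analysis of \(M^\delta [\pi _\theta ]\) and returns the symbolic-rollout correctness loss \(\mathcal {L}_\mathcal {D}\) (in VEL this wraps Flow\(^*\)); the hyperparameter defaults are placeholders rather than the values used in the paper.

```python
import numpy as np

def synthesize(theta, rollout_loss_of, nu=0.01, eta=0.1, N=8, max_iters=200, rng=None):
    """Verification-guided controller synthesis (sketch of Algorithm 1).
    theta is the numpy parameter vector of the programmatic controller pi_theta."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_iters):
        if rollout_loss_of(theta) == 0.0:          # zero abstract loss => verified (Theorem 2)
            return theta
        grad = np.zeros_like(theta)
        for _ in range(N):                         # finite-difference gradient estimate
            omega = rng.standard_normal(theta.shape)
            diff = rollout_loss_of(theta + nu * omega) - rollout_loss_of(theta - nu * omega)
            grad += (diff / nu) * omega
        theta = theta - eta * (grad / N)           # gradient-descent style parameter update
    return theta                                   # may still be unverified after max_iters
```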

5 Experimental Results

We have implemented the verification-guided controller synthesis technique in Algorithm 1 in a tool called VEL (VErification-based Learning) [50]. Given an environment and a rollout specification \(\psi \) (Definition 2), VEL uses the programmatic reinforcement learning algorithm [38] to learn a programmatic controller \(\pi \). The controller \(\pi \) is trained to satisfy the safety and reachability requirements set by \(\psi \). We do so by shaping a reward function that is consistent with \(\psi \): this function rewards actions leading to goal states and penalizes actions leading to unsafe states. As the RL algorithm does not provide any correctness guarantees and the verification procedure may introduce large approximation errors, even well-trained controllers may fail verification. In case of verification failures, VEL applies Algorithm 1 to optimize \(\pi \) based on the verification feedback.

Table 1. Benchmark Rollout Specifications (\(\mathcal {T}\) represents True).
Table 2. Experiment Results. Depth shows the height of the abstract syntax tree of a programmatic controller. T.T shows the overall execution time of VEL including both the time for reachability analysis and verification-guided controller synthesis. V.T measures only the verification time for the final controller. If a controller can be verified directly without verification-guided optimization, the value of T.T is empty. The execution times for ReachNN\(^*\) and Verisig measure the cost of verifying a neural network controlled system (NNCS). The size notation (\(n \times k\)) indicates a neural network (with sigmoid activations) with n hidden layers and k neurons per layer. If a property could not be verified, it is marked as Unknown. N/A means that the tool is not applicable to a benchmark.

We evaluated VEL on several nonlinear continuous or hybrid systems taken from the literature. These are problems that are widely used for evaluating state-of-the-art verification tools for learning-enabled cyber-physical systems. Benchmarks B1 - B5 were introduced by [18]; adaptive cruise control (ACC) was presented in [43]; mountain car (MC) and quadrotor with model-predictive control (QMPC) were introduced by [28]; Pendulum and CartPole were taken from [29]; Tora and Unicyclecar were presented in the ARCH-COMP21 competition on formal verification of Artificial Intelligence and Neural Network Control Systems (AINNCS). We present the dynamics and the detailed description of each benchmark in [50]. The rollout specifications (Definition 2) are depicted in Table 1. The specifications define for each benchmark the initial states, the goal regions to reach, and the safety properties describing the safety boundary or the obstacles to avoid. On three benchmarks we verify the controller correctness over an infinite horizon. For the classic control problem Pendulum, to verify that the pendulum does not fall in an infinite time horizon, the rollout specification requires that any rollout starting from the region \(x_1,x_2 \in [-0.1, 0.1]\) (representing pendulum angle and angular velocity) eventually turns back to it and any rollout states must be safe (including those that temporarily leave this region). Similarly, Tora models a moving cart attached to a wall with a spring. On Tora\(_\texttt {inf}\), we prove that the controller for the arm of the cart connecting to the spring can stabilize the cart over an infinite horizon while maintaining safety around the origin. On Oscillator\(_\texttt {inf}\), we verify that the controller can stabilize the oscillator around a target region over an infinite horizon while the process of reaching the target region from the initial states is safe.

The experimental results are given in Table 2. VEL synthesized provably correct programmatic controllers for all the benchmarks. Table 2 shows the total time spent on each benchmark (T.T) as well as the verification time of the final controller (V.T). Half of the benchmarks can be directly verified with the initial programmatic controller (in Table 2, T.T for these benchmarks is empty as they only need one pass of verification in V.T). The other half must go through the verification-guided controller learning loop due to approximation errors in verification although these controllers achieved satisfactory test performance. We depict the learning performance of VEL on these benchmarks in Fig. 3 averaged over 5 random seeds. The results show that VEL can robustly and reliably reduce the correctness loss over symbolic rollouts (i.e. the verification feedback) to zero.

Table 2 also shows the results of verifying the benchmarks as neural network controlled systems (NNCS) using two state-of-the-art verification tools ReachNN\(^*\) [18] and Verisig [28] where the controllers are trained as neural networks. We note that VEL is designed for programmatic controllers and uniquely has a verification-guided learning loop. Here our intention is not to compare the tools' performance. Instead, Table 2 demonstrates that integrating verification in training loops for programmatic controllers is more tractable than for neural network controllers. It shows that programmatic controller verification (column V.T) has a much lower computation cost compared to verifying neural network controllers using ReachNN\(^*\) and Verisig, except for MountainCar. When ReachNN\(^*\) and Verisig produce Unknown, the tools are not able to verify the rollout specification due to the large estimated approximation errors in verification. On Tora, ReachNN\(^*\) spent over 13000s to produce imprecise flowpipes with large approximation errors that cannot be used for verification. In this case, repeatedly conducting neural network controller verification in a learning loop is computationally infeasible. On the other hand, VEL makes verification-guided controller synthesis feasible as evidenced in Table 2 and Fig. 3. It efficiently uses the programmatic controller verification feedback to reduce the correctness loss over the abstraction of controller reachable states to 0 in the verification proof space (even if the abstraction may introduce approximation errors).

Fig. 3.
figure 3

Learning Performance of Verification-guided Controller Synthesis on B1, UnicycleCar, QMPC, Oscillator, ACC, and Tora\(_\texttt {inf}\). The y-axis records the correctness loss of symbolic rollouts over abstract states. The results are averaged over 5 random seeds. VEL reliably reduces the symbolic rollout correctness loss to zero across the learning loop iterations (the x axis) for each benchmark.

6 Related Work

Robust Machine Learning. Our work on using abstract interpretation [14] for controller synthesis is inspired by the recent advances in verifying neural network robustness, e.g. [5, 23, 40, 51]. These approaches apply abstract interpretation to relax nonlinearity of activation functions in neural networks into convex representations, based on linear approximation [39, 40, 51, 52, 55] or interval approximation [26, 35]. Since the abstractions are differentiable, neural networks can be optimized toward tighter concretized bounds to improve verified robustness [7, 33, 35, 48, 55]. Principally, abstract interpretation can be used to verify the reachability properties of nonlinear dynamical systems [4, 30, 37]. Recent work [13, 17, 18, 28, 29, 41, 43] has achieved initial results on verifying neural network controlled autonomous systems by conducting reachability analysis. However, these approaches do not attempt to leverage verification feedback for controller synthesis within a learning loop, partially because of the high computation demand of repeatedly verifying neural network controllers. VEL demonstrates the substantial benefits of using verification feedback in a proof space for learning correct-by-construction programmatic controllers. Related works [16, 25] conduct trajectory planning from temporal logic specifications but do not provide formal correctness guarantees. Extending VEL to support richer logic specifications is left for future work.

Safe Reinforcement Learning. Safe reinforcement learning is a fundamental problem in machine learning [36, 45]. Most safe RL algorithms formulate a constrained optimization problem by specifying safety constraints as cost functions in addition to reward functions [1, 9, 15, 31, 42, 53, 54]. Their goal is to train a controller that maximizes the accumulated reward while bounding the aggregate safety violation under a threshold. However, aggregate safety costs do not support reachability constraints in the safe RL context. In contrast, VEL ensures that a learned controller is formally verified correct and can better handle reachability constraints beyond safety. Model-based safe learning is combined with formal verification in [22], where an environment model is updated as learning progresses to take into account the deviations between the model and the actual system behavior. We leave combining VEL with model-based learning to future work.

Safe Shielding. The general idea of shielding is to use a backup controller to enforce the safety of a deep neural network controller [3]. The backup controller is less performant than the neural controller but is safe by construction using formal methods. The backup controller runs in tandem with the neural controller. Whenever the neural controller is about to leave the provably safe state space governed by the backup controller, the backup controller overrides the potentially unsafe neural actions to keep the neural controller within the certified safe space [2, 6, 11, 21, 22, 24, 32, 56]. In contrast, VEL directly integrates formal verification into controller learning loops to ensure that learned controllers are correct-by-construction and hence eliminates the need for shielding.

7 Conclusion

We present VEL, which bridges formal verification and synthesis for learning correct-by-construction programmatic controllers. VEL integrates formal verification into a controller learning loop to enable counterexample-guided controller optimization. VEL encodes verification feedback as a loss function of the parameters of a programmatic controller over the verification proof space. Its optimization procedure iteratively reduces correctness violations arising from both true counterexamples and overapproximation errors caused by abstraction. Our experiments demonstrate that controller updates based on verification feedback can lead to provably correct programmatic controllers. For future work, we plan to extend VEL to support controller safety during exploration in noisy environments. When a worst-case environment model is provided, this can be achieved by repeatedly leveraging the verification feedback on safety violations to project a controller back onto the verified safe space [12] after each reinforcement learning step taken on the parameter space of the controller.