
1 Introduction

The design of control and decision-making software for autonomous systems is a key part of many industrial applications, such as unmanned aerial vehicles, ground vehicles and general robots, and has therefore attracted continued attention over the last decade [7, 9, 12, 14]. Among the many research efforts in this field, a highly challenging problem is controller synthesis, i.e., building control systems that guarantee safety and reachability simultaneously. As an emerging approach, machine learning has been developed to tackle this problem in recent years. Several existing techniques focus on learning a control policy from user-defined reward/cost functions that encode the required properties. A typical way is to use the framework of reinforcement learning (RL), which evaluates and improves the controller’s performance by interacting with environments and systems. Because of its strong ability to deal with nonlinear and/or uncertain (or nondeterministic) dynamical systems of high dimension, as well as the universal approximation power of deep neural networks, RL-based controller synthesis has been extensively studied, and substantial progress has been made by different research teams [22, 23]. However, formal reasoning about the required properties of such DNN-controlled dynamical systems is an arduous and challenging problem, which keeps the practical use of RL limited. For safety/reachability verification of the system under the learned controller, one main approach is to compute the reachable sets of the system [8, 13, 30], which requires approximating the solutions of the system’s ODEs, so the scalability of these approaches is largely restricted. Another major approach is certificate synthesis via solving the associated SMT problems [6, 16, 31], which also has limited scalability due to the complexity of symbolic computation in general-purpose SMT solvers. In this paper, we utilize the advantage of RL to train an elaborately designed hybrid controller, which makes the system easier to verify against safety and goal-reaching requirements while maintaining controllability.

Our proposed hybrid controller takes the form of a low-degree polynomial plus a relatively small neural network, called a polynomial-DNN controller. The learning-based synthesis of the polynomial-DNN controller is divided into the following four phases: (1) we first train a well-performing DNN controller by RL for the safety and goal-reaching requirements; (2) we then fit the trained DNN roughly by a polynomial with a prescribed low degree bound, which forms one part of the hybrid structure; (3) we construct a small, specially structured neural network (NN) with the square activation function on the hidden layer and \(\tanh \) on the output layer as the supplement for the polynomial part, and subsequently distill an initial polynomial-DNN controller from the original DNN controller; (4) finally, we fine-tune the distilled controller via RL to obtain a well-performing polynomial-DNN controller.

Thanks to the hybrid form consisting of a polynomial and a small NN with the special structure, the obtained hybrid controller is easier to verify while maintaining its expressiveness and flexibility, for two main reasons: (1) regarding verification efficiency, the original DNN is fitted by a low-degree polynomial through coarse approximation, which is easy to obtain and significantly reduces the difficulty of formal verification; (2) the NN part compensates for the controller performance loss caused by the coarse polynomial approximation. Benefiting from this structure, the system with the polynomial-DNN controller can be equivalently transformed into a polynomial form via system recasting, which makes post-verification easily solvable.

The necessity of a polynomial-DNN type controller can be explained as follows. Transforming a DNN into polynomial form enables the application of efficient polynomial solving techniques for formal verification, but there is no guarantee that a polynomial of a specified degree bound can fit a DNN with high accuracy; meanwhile, the approximation and the corresponding verification problem become considerably more complicated as the degree of the polynomial increases, which may also cause the verification to fail. Therefore, we resort to a low-degree polynomial approximation and simultaneously retrain a small NN to compensate for the loss of accuracy, since a rough approximating polynomial alone cannot replace the whole DNN controller, and verification may fail for the system controlled by the polynomial part only. The hybrid controller thus balances richness of expressiveness against ease of formal verification. To check the effectiveness of the proposed approach, we have evaluated the hybrid controller synthesis on a set of commonly used benchmark examples. To summarize, the main contributions of this paper are as follows:

  • We propose a method to synthesize a hybrid polynomial-DNN controller subject to reach-avoid constraints, via RL combined with low-degree polynomial fitting and distillation-based retraining, which not only maintains good control performance but also makes post-verification solvable.

  • We carefully design a residual network as compensation for the polynomial part of the controller. The particular form of the residual network’s derivative allows us to cast the differential equations of the control system into an equivalent polynomial form, which is conducive to formal verification.

  • We carry out a detailed experimental evaluation on a set of benchmarks to demonstrate the effectiveness of our approach, and show the necessity of such a hybrid controller form through ablation studies.

1.1 Related Works

Several research works focus on controller synthesis for the safety requirement, in which a typical way is to use reinforcement learning or supervised learning to build an overall learning framework for synthesizing safety certificates (such as control barrier functions, CBFs) [1, 26,27,28,29].

For the goal-reaching requirement, most existing works concentrate on building controllers that drive the system to reach a specified set within a time bound [8, 11, 13, 30]. Others focus on synthesizing a control policy that makes the system asymptotically converge to a specified goal state set, which is called the stability requirement. Generating Lyapunov function certificates is a practical routine in this respect [3,4,5, 15, 25].

In fact, learning a reach-avoid controller, namely one for both safety and goal-reaching requirements, is a much more complicated problem. An example was given in [10], where a correct-by-construction controller consisting of a reference controller and a tracking controller was successfully built to drive the actual trajectory to follow the reference trajectory, with different reference controllers pre-designed for different scenarios.

Recently, a new learning-based approach was implemented in [17], where the safe and goal-reaching policy is constructed by jointly learning two additional certificate functions using supervised learning. Notice that there is a risk of synthesizing false certificates, as the certificate constraints are only satisfied at the sampled points. Although one can perform posterior formal verification to overcome this weakness, it would be difficult to do the verification with several DNNs in the system. By comparison, our synthesized hybrid polynomial-DNN controller has clear advantages for formal verification.

2 Preliminaries

\(\mathbf {Notations.}\) Let \(\mathbb {R}[{\textbf{x}}]\) denote the ring of polynomials with coefficients in \(\mathbb {R}\) over variables \({\textbf{x}}= [x_{1}, x_{2},\ldots , x_{n}]^T\), and \(\mathbb {R}[{\textbf{x}}]^n\) the set of n-dimensional polynomial vectors. Let \(\varSigma [{\textbf{x}}]\subset {\mathbb R}[{\textbf{x}}]\) be the set of SOS polynomials. The distance from \({\textbf{x}}\) to a set S is defined by \({\Vert {{\textbf{x}}}\Vert }_S={\inf _{s\in {S}}{ \Vert {\textbf{x}}-s \Vert _2}}\). A continuous function \(\alpha :[0,a) \rightarrow [0,+\infty )\) for some \(a>0\) is said to belong to class \(\mathcal {K}\) if it is strictly increasing and satisfies \(\alpha (0) = 0\). A continuous function \(\beta :(-b,c)\rightarrow (-\infty ,+\infty )\) for some \(b,c>0\) is said to belong to extended class \(\mathcal {K}\) if it is strictly increasing and satisfies \(\beta (0)=0\). A continuous function \( \gamma : [0, c) \times [0, \infty ) \rightarrow [0,+\infty )\) for some \(c > 0\) belongs to class \(\mathcal{K}\mathcal{L}\) if, for each fixed s, the mapping \(\gamma (r, s)\) belongs to class \(\mathcal {K}\) with respect to r, and, for each fixed r, the mapping \(\gamma (r, s)\) is decreasing with respect to s and \(\gamma (r, s) \rightarrow 0\) as \(s\rightarrow \infty \).

This section formulates the safety and goal-reaching controller synthesis problem. A controlled continuous dynamical system is modeled by first-order ordinary differential equations

$$\begin{aligned} \dot{\textbf{x}} =\textbf{f}(\textbf{x}, \textbf{u}), \quad \text {with} \,\, \textbf{u}= \textbf{k}(\textbf{x}), \end{aligned}$$
(1)

where \({\textbf{x}}\in \varPsi \subseteq \mathbb R^n\) are the system states, \(\textbf{u}\in U \subseteq \mathbb R^m\) are the control inputs, and \({\textbf{f}}\in {\mathbb {R}[{\textbf{x}}]^{n}}\) is the vector field defined on the state space \(D\subseteq {\mathbb {R}^{n}}\).

Assume \({\textbf{f}}\) satisfies the local Lipschitz condition, which ensures (1) has a unique solution \({\textbf{x}}(t, {\textbf{x}}_0)\) in D for every initial state \({\textbf{x}}_0\in D\) at \(t=0\). A dynamical system is equipped with a domain \(\varPsi \subset D\) and an initial set \(\varTheta \subset \varPsi \), represented as a triple \(\mathcal {C}\doteq ({\textbf{f}}, \varPsi , \varTheta )\). Given a prespecified unsafe region \(X_u\subset D\), we say that the system \(\mathcal {C}\) is safe if no trajectory starting from \(\varTheta \) can evolve into the unsafe region \(X_u\), a property widely investigated in safety-critical applications.

Definition 1 (Safety)

For a controlled constrained continuous dynamical system (CCDS) \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) and a given unsafe region \(X_u\), the system is safe if for all \(\textbf{x}_0 \in \varTheta \), there does not exist \(t_1 > 0\) such that

$$ \forall t \in [0, t_1). \textbf{x}(t, \textbf{x}_0) \in \varPsi \ \,\,\, \textrm{and} \,\, \textbf{x}(t_1, \textbf{x}_0)\in X_u. $$

Another important property that has received much attention is goal-reaching, which generalizes stability.

Definition 2 (Goal-reaching)

Given a controlled CCDS \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) and a set of goal states \(X_g \subset D\), the system \(\mathcal {C}\) is goal-reaching with respect to the goal set \(X_g\), if there exists a \(\mathcal{K}\mathcal{L}\)-function \(\gamma \) such that for any \({\textbf{x}}_0 \in \varTheta \),

$$\begin{aligned} \Vert {\textbf{x}}(t)\Vert _{X_g} \le \gamma (\Vert {\textbf{x}}(0)\Vert _{X_g}, t)\quad \text {for all } t \ge 0. \end{aligned}$$

Definition 3 (Safe and Goal-reaching Controller Synthesis)

Given a controlled CCDS \(\mathcal {C}=(\textbf{f}, \varPsi ,\) \(\varTheta )\) with \(\textbf{f}\) defined by (1), an unsafe set \(X_u\) and a goal set \(X_g\), design a locally Lipschitz continuous feedback control law \(\textbf{k}\) such that the closed-loop system \(\mathcal {C}\) with \(\textbf{f}= \textbf{f}(\textbf{x}, \textbf{k}(\textbf{x}))\) is both safe and goal-reaching as per Definitions 1 and 2.

The concept of barrier certificates plays an important role in safety verification of continuous systems. The essential idea is to use the zero level set of a barrier certificate \(B({\textbf{x}})\) as a barrier to separate all the reachable states from the unsafe region. The following concept of barrier certificate, adapted from [24], can be used to guarantee the safety of a given controlled CCDS.

Theorem 1

 [24] Consider a controlled CCDS \(\mathcal {C}=({\textbf{f}},\varPsi ,\varTheta )\) with \(\textbf{f}\) defined by (1), a feedback control law \({\textbf{u}}={\textbf{k}}({\textbf{x}})\), and an unsafe region \(X_u\). Suppose there exists a real-valued function \(B: \varPsi \rightarrow {\mathbb R}\) satisfying the following conditions:

(i):

\(B({\textbf{x}}) \ge 0\quad \forall {\textbf{x}}\in \varTheta \),

(ii):

\(B({\textbf{x}})<0\quad \forall {\textbf{x}}\in X_u\),

(iii):

\(B({\textbf{x}})=0\Rightarrow \mathcal {L}_f B({\textbf{x}})>0\quad \forall {\textbf{x}}\in \varPsi \),

where \(\mathcal {L}_f B({\textbf{x}})\) denotes the Lie-derivative of \(B({\textbf{x}})\) along the vector field \({\textbf{f}}({\textbf{x}})\), i.e., \(\mathcal {L}_f B({\textbf{x}})=\sum _{i=1}^n\frac{\partial B}{\partial x_i} \cdot f_{i}({\textbf{x}})\), then \(B({\textbf{x}})\) is a barrier certificate for the closed-loop system \(\mathcal {C}\) with the control law \({\textbf{k}}({\textbf{x}})\), and the safety of system \(\mathcal {C}\) is guaranteed.
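As a toy illustration (an example we add here, not taken from the benchmarks), consider the one-dimensional closed-loop system \(\dot{x} = 2 - x\) on \(\varPsi = [-1, 3]\) with \(\varTheta = [1.5, 2.5]\) and \(X_u = [-1, 0]\). The linear function

$$B(x) = x - 0.5$$

satisfies all three conditions: \(B \ge 1 > 0\) on \(\varTheta \), \(B \le -0.5 < 0\) on \(X_u\), and at its only zero \(x = 0.5 \in \varPsi \) we have \(\mathcal {L}_f B(0.5) = 2 - 0.5 = 1.5 > 0\); hence \(B\) is a barrier certificate and no trajectory starting from \(\varTheta \) can reach \(X_u\).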

For the goal-reaching controller design, we use a more general Lyapunov-like function which is introduced by the following definition.

Definition 4 (Lyapunov-like function)

Given a continuous system \(\mathcal {C}=({\textbf{f}}, \varPsi ,\varTheta )\), and the set of goal states \(X_g\subseteq \varPsi \), a continuous differentiable real-valued function \(V: \varPsi \rightarrow {\mathbb R}\) is said to be a Lyapunov-like function if

(i):

\(\{{\textbf{x}}| V({\textbf{x}})\le {0}\} \ne \emptyset \) and \(\{{\textbf{x}}| V({\textbf{x}})\le {0}\}\subseteq {X_{g}}\),

(ii):

\(\mathcal {L}_f V({\textbf{x}}) \le -{\beta }(V({\textbf{x}})) \quad \forall {{\textbf{x}}}\in {\varPsi },\)

where \(\beta \) is some extended class \(\mathcal {K}\) function, and \(\mathcal {L}_f V({\textbf{x}})=\sum _{i=1}^n\frac{\partial V}{\partial x_i} \cdot f_{i}({\textbf{x}})\).

As mentioned in [17], the above Lyapunov-like function is more general than the classic one used in [3, 4, 21, 25]: it does not require \(\mathcal {L}_f V({\textbf{x}})\) to be negative definite everywhere, that is, \(\mathcal {L}_f V({\textbf{x}}) > 0\) may happen on \(\{{\textbf{x}}| V ({\textbf{x}}) < 0\}\), which makes the condition less restrictive.
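For a toy illustration (again an example we add here), take the closed-loop system \(\dot{x} = -x\) on \(\varPsi = [-1, 1]\) with goal set \(X_g = \{x \,|\, |x| \le 0.1\}\). The function

$$V(x) = x^2 - 0.01$$

is a Lyapunov-like function with \(\beta (s) = 2s\): the sublevel set \(\{V \le 0\} = [-0.1, 0.1]\) is nonempty and contained in \(X_g\), and \(\mathcal {L}_f V(x) = -2x^2 \le -2(x^2 - 0.01) = -\beta (V(x))\) on \(\varPsi \). Note that at \(x = 0\) we have \(V(0) = -0.01 < 0\) and \(\mathcal {L}_f V(0) = 0\), which condition (ii) still permits since \(-\beta (V(0)) = 0.02 > 0\).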

Theorem 2

For a controlled CCDS \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) with \(\textbf{f}\) defined by (1) and a set of goal states \(X_g\subseteq \varPsi \), if \(V({\textbf{x}})\) is a Lyapunov-like function as in Definition 4, then the system under \({\textbf{u}}= \textbf{k}({\textbf{x}})\) is goal-reaching with respect to \(X_g\).

Combining Theorem 1 and Theorem 2, we obtain the assertion that the existence of a barrier certificate and a Lyapunov-like function guarantees that the closed-loop system under the control law is both safe and goal-reaching. Hereafter, we refer to both barrier and Lyapunov-like functions as certificate functions for simplicity.

3 Hybrid Polynomial-DNN Controller Training

For the safe and goal-reaching controller synthesis problem, we design an easy-to-verify control policy with the aid of reinforcement learning (RL), based on barrier certificate and Lyapunov-like function generation. It is hard for a controller with a simple structure to guarantee safe and goal-reaching behaviors for large-scale systems; conversely, controllers with complex structures allow more flexible system behaviors, but reach-avoid verification of a system with such a complex controller requires much more computational effort. To make verification tractable, we propose a method to learn a controller with a special structure, the hybrid polynomial-DNN controller, which is easily verifiable and can be tailored to the safety and goal-reaching requirements. Specifically, this hybrid controller consists of a polynomial and a small neural network with a single hidden layer. Notably, it is expected to exhibit behavior similar to the original complex DNN controller, but is much easier to verify thanks to its special structure, which will be elaborated in Sect. 4.

To achieve this, we adopt a low-degree polynomial to roughly approximate the DNN. Then we fix the structure of a small neural network and append it to the low-degree polynomial to construct a hybrid controller, which is retrained using RL. To accelerate the retraining, we use knowledge distillation to obtain an initialization of the NN part of the hybrid controller. In summary, the learning-based synthesis of the hybrid controller is divided into the following three stages, as shown in Fig. 1.

Fig. 1. The diagram of the training framework.

  • Train a deep neural network controller via RL. Based on reinforcement learning, we train a deep neural network (DNN) controller for the given control system directly. Briefly, the RL procedure continuously uses the current controller to drive the system by interacting with the environment, and updates the relevant parameters of the controller by rewarding and penalizing. Through sufficient simulation and training, we expect to obtain a DNN controller that enables the system behavior to avoid the unsafe set and reach the specified target set with high probability.

  • Fit the DNN controller by a polynomial and distill a residual network measuring the fitting error. From the DNN controller learned in the previous step, we reconstruct a hybrid controller consisting of a polynomial and a small neural network with a single hidden layer. Specifically, we approximate the trained DNN controller with an appropriate polynomial by a sampling-based method. The approximating polynomial serves as the main component of the hybrid controller. We further capture the error between the original DNN and the polynomial approximation by distillation learning, which yields a small neural network as a refinement module.

  • Generate and retrain a hybrid controller by fine-tuning a small neural network from the distilled network. We construct a special small NN with square and \(\tanh \) activation functions on the hidden and output layers respectively, which helps to transform the hard verification problem into a tractable polynomial one. Finally, we retrain the hybrid controller consisting of the polynomial part and the small NN template, fine-tuning the small network initialized by the result of the distillation learning.

3.1 Training Well-Performing DNN Controllers Using RL

As illustrated in Fig. 1, the RL method is applied to train a well-performing controller, so that the system is able to avoid obstacles and reach the goal region within the time bound.

We construct the reward function by encoding the desired behaviors of the closed-loop system under the DNN controller, namely avoidance of the unsafe region and reachability of the goal region. The designed reward should guide RL towards an ideal controller under which all trajectories of the closed-loop system starting from the initial set \(\varTheta \) never evolve into the unsafe region \(X_u\) and eventually reach the desired region \(X_g\). The reward function design therefore concerns two aspects: rewarding behaviors that stay away from the unsafe region, and rewarding behaviors that approach the goal region. For the safety requirement, the reward function should penalize behaviors approaching \(X_u\). Thus, this reward component can be defined as a negated Gaussian-like function of the system state, centered at the center of \(X_u\) and scaled by its radius,

$$\begin{aligned} reward_{u}({\textbf{x}}_t)=-e^{-\frac{1}{2}\sum _{i=1}^n(\frac{x_{i}(t)-x_{u}^{i}}{\rho ^{i}_{u}})^{2}} \end{aligned}$$

where \({\textbf{x}}_{u} = (x_{u}^{1}, \ldots , x_{u}^{n})\in X_u \subset D\) is the center of \(X_u\) and \(\rho _u = (\rho _u^1, \ldots , \rho _u^n)\) is the radius of \(X_u\). Similarly, the reward for the goal-reaching purpose can be defined as a Gaussian-like function centered at the goal region,

$$\begin{aligned} reward_{g}({\textbf{x}}_t)=e^{-\frac{1}{2}\sum _{i=1}^n(\frac{x_{i}(t)-x_{g}^{i}}{\rho ^{i}_{g}})^{2}} \end{aligned}$$

where \({\textbf{x}}_{g}= (x_{g}^{1}, \ldots , x_{g}^{n})\) and \(\rho _g\) are the center and the radius of \(X_g\), respectively. The entire reward function consists of the above two components, i.e.

$$\begin{aligned} reward({\textbf{x}}_t)=\lambda \cdot reward_{g}({\textbf{x}}_t)+(1-\lambda )\cdot reward_{u}({\textbf{x}}_t), \end{aligned}$$

to achieve the task of safety and goal reachability, where \(0<\lambda <1\) is the parameter to control the weights between \(reward_{g}({\textbf{x}}_t)\) and \(reward_{u}({\textbf{x}}_t)\).
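For concreteness, the reward above can be sketched directly in code; the snippet below is a minimal NumPy version in which the function and argument names (e.g. `make_reward`, `x_u`, `rho_u`) are our own placeholders rather than part of the paper's implementation.

```python
import numpy as np

def make_reward(x_u, rho_u, x_g, rho_g, lam=0.5):
    """Reward encoding 'stay away from X_u' and 'approach X_g'.

    x_u, rho_u : center and per-dimension radius of the unsafe set X_u
    x_g, rho_g : center and per-dimension radius of the goal set X_g
    lam        : weight between goal and safety terms (0 < lam < 1)
    """
    x_u, rho_u = np.asarray(x_u), np.asarray(rho_u)
    x_g, rho_g = np.asarray(x_g), np.asarray(rho_g)

    def reward(x_t):
        x_t = np.asarray(x_t)
        r_u = -np.exp(-0.5 * np.sum(((x_t - x_u) / rho_u) ** 2))  # penalize proximity to X_u
        r_g = np.exp(-0.5 * np.sum(((x_t - x_g) / rho_g) ** 2))   # reward proximity to X_g
        return lam * r_g + (1.0 - lam) * r_u

    return reward
```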

The remaining problem is to train the controller via RL. Here we use Deep Deterministic Policy Gradient (DDPG) [20], a popular RL approach suited for continuous control applications. The DDPG algorithm combines value-based and policy-based methods, and is made up of two neural networks: the critic network and the actor network.

To train the desired controller, we first sample a set of initial states from \(\varTheta \). For each sampled initial state \({\textbf{x}}_0\), the current controller \({\textbf{u}}_{RL}\) yields an associated trajectory that avoids the unsafe region, recorded as a discrete-time state sequence \(\{ {\textbf{x}}_0, {\textbf{x}}_1,\cdots , {\textbf{x}}_t, \cdots , {\textbf{x}}_m\}\), and the transition tuples \(({\textbf{x}}_t,{\textbf{x}}_{t+1},{\textbf{u}}_t,reward({\textbf{x}}_t))\) are collected to form a replay buffer. Every few time steps, a batch of data is sampled from the replay buffer to update the parameters of the critic and actor networks, and the new controller is then used to simulate trajectories and collect new data, until the controller behaves well, as sketched below.
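A minimal sketch of this data-collection loop is given below, assuming a forward-Euler discretization of (1) and a generic DDPG agent exposing `act` and `update` methods; all names are illustrative placeholders, not the actual training code.

```python
import random
from collections import deque
import numpy as np

def collect_and_train(f, agent, sample_theta, reward, dt=0.01, horizon=3000,
                      n_traj=250, batch_size=64, update_every=10):
    """Roll out the current policy, store transitions, and update the DDPG agent."""
    buffer = deque(maxlen=1_000_000)                 # replay buffer of transition tuples
    for _ in range(n_traj):
        x = sample_theta()                           # sample an initial state from Theta
        for t in range(horizon):
            u = agent.act(x)                         # current controller u_RL(x)
            x_next = x + dt * f(x, u)                # forward-Euler step of dx/dt = f(x, u)
            buffer.append((x, x_next, u, reward(x))) # (x_t, x_{t+1}, u_t, reward(x_t))
            x = x_next
            if len(buffer) >= batch_size and t % update_every == 0:
                agent.update(random.sample(buffer, batch_size))  # critic/actor update
    return agent
```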

3.2 Polynomial Approximation

Following the RL training process in Sect. 3.1, one typically adopts a complex DNN structure to obtain a well-performing DNN controller. For safety-critical systems, the properties of such synthesized controllers, such as safety and goal-reaching, need to be formally guaranteed. However, verifying specified properties of the closed-loop system under the trained DNN controller is challenging due to the controller's complexity. One could approximate the trained DNN with a high-degree polynomial of extremely high precision and take it as the controller candidate to be verified via polynomial constraint solving. However, the corresponding verification problem with such a high-degree polynomial controller incurs an unbearably high computational cost, as explained in the experiment section.

Based on the DNN controller \({\textbf{u}}_{RL}\) trained through RL, we construct an easily verifiable controller with a hybrid form, which can keep the system safe and goal-reaching. We first roughly approximate \({\textbf{u}}_{RL}\) by a low-degree polynomial, denoted by \(p({\textbf{x}})\), as one part. Afterwards, we retrain a small NN with one hidden layer, denoted by \(k({\textbf{x}})\), as compensation for the approximation error between \({\textbf{u}}_{RL}\) and \(p({\textbf{x}})\). The hybrid polynomial-DNN controller is then \( p({\textbf{x}}) + k({\textbf{x}})\). The main task of this subsection is to obtain the approximating polynomial \(p({\textbf{x}})\) from sampled points.

Concretely, a real coefficient vector \({\textbf{c}}\) is used to parameterize a polynomial \(p({\textbf{x}},{\textbf{c}})\) of a given degree d, i.e., \(p({\textbf{x}},{\textbf{c}})=\sum _{j}c_{j}b_{j}({\textbf{x}})\), where the \(b_j({\textbf{x}})\) are monomials of total degree \(\le d\). Given the sampled points, we obtain the coefficient vector \({\textbf{c}}^*\) by solving a least squares problem. The polynomial \(p({\textbf{x}}, {\textbf{c}}^*)\), denoted by \(p({\textbf{x}})\) for brevity, is the approximation of \({\textbf{u}}_{RL}({\textbf{x}})\) on \(\varPsi \). The residual function \(r({\textbf{x}})\) denotes the error between the approximating polynomial \(p({\textbf{x}})\) and the DNN controller, i.e., \(r({\textbf{x}})={\textbf{u}}_{RL}({\textbf{x}})-p({\textbf{x}}).\)
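A minimal sketch of this sampling-based fit is shown below; it assumes the trained controller is available as a callable `u_RL`, enumerates the monomial basis up to total degree d, and solves a standard least-squares problem (all names are illustrative).

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_basis(n, d):
    """All exponent tuples of total degree <= d in n variables."""
    basis = []
    for deg in range(d + 1):
        for combo in combinations_with_replacement(range(n), deg):
            exps = [0] * n
            for i in combo:
                exps[i] += 1
            basis.append(tuple(exps))
    return basis

def fit_polynomial(u_RL, samples, d):
    """Least-squares fit p(x) ~ u_RL(x) over sampled states (one row per sample)."""
    n = samples.shape[1]
    basis = monomial_basis(n, d)
    A = np.column_stack([np.prod(samples ** np.array(e), axis=1) for e in basis])
    y = np.array([u_RL(x) for x in samples])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return c, basis        # p(x) = sum_j c[j] * prod_i x_i**basis[j][i]
```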

Having \(p({\textbf{x}})\), we cannot simply take it as the controller, because the error \(r({\textbf{x}})\) between \({\textbf{u}}_{RL}({\textbf{x}})\) and \(p({\textbf{x}})\) cannot be ignored. To account for this, we compensate for the error by fitting the residual function \(r({\textbf{x}})\), i.e., by retraining a hybrid controller \(p({\textbf{x}}) + k({\textbf{x}}| \theta ')\) to rectify the system behavior, where \(\theta '\) denotes the parameters of the NN part to be learned.

3.3 Training the Residual Controller

In this part, we retrain a residual network to compensate for the difference between the system behavior guided by the polynomial part \(p({\textbf{x}})\) and that guided by the original DNN controller \({\textbf{u}}_{RL}\).

The Structure of the Residual Network. We design a special neural network as the compensation so that the resulting verification problem is tractable. As illustrated in Fig. 2, a typical DNN has a layered architecture and can be represented as a composition of its L layers: \(k({\textbf{x}}|\theta ') = l_{L}\circ l_{L-1}\circ \cdots \circ l_{1}({\textbf{x}})\), where \(l_{i}({\textbf{x}}) = \sigma _i(W_i{\textbf{x}}+b_i)\) is parameterized by a weight matrix \(W_i\) and a bias vector \(b_i\), and all the parameters are denoted by \(\theta '\) for brevity. This work takes \(\sigma _i\) to be the square activation on the hidden layers and the \(\tanh \) activation function on the output layer L, as shown in Fig. 2. This special setting has two advantages: i) the output is normalized to the range \([-1, 1]\), which helps the training process converge; ii) the control system with an NN controller of this type can be transformed into a polynomial form by system recasting (cf. Sect. 4.1 for more details). Regarding ii), we introduce a new variable \(x_{n+1}\) to represent the NN output, i.e., \(x_{n+1}:= \tanh (h({\textbf{x}}))\), where \(h({\textbf{x}}):=l_{L-1}\circ \cdots \circ l_{1}({\textbf{x}})\) denotes the polynomial part of the NN. The key observation that allows us to transform the system with this NN controller into an equivalent polynomial system is that the derivative of the special NN can be expressed as

$$\begin{aligned} \dot{x}_{n+1} = (1-x_{n+1}^2) \dot{h}. \end{aligned}$$
(2)

We construct such a small NN with a single hidden layer because a simple network of this kind, added to the polynomial part as compensation, is sufficient to control the systems well.

Fig. 2. Structure of the small neural network in the hybrid controller.
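For illustration, the small network of Fig. 2 can be sketched in PyTorch as follows; the hidden width of 30 matches the instance in Example 1 but is otherwise a free hyperparameter, and the class name is our own.

```python
import torch
import torch.nn as nn

class ResidualNet(nn.Module):
    """Small NN k(x) = tanh(h(x)) with a single square-activated hidden layer,
    so that h(x) is a polynomial in x and eq. (2) applies."""
    def __init__(self, n_in, hidden=30, n_out=1):
        super().__init__()
        self.lin1 = nn.Linear(n_in, hidden)
        self.lin2 = nn.Linear(hidden, n_out)

    def h(self, x):
        # polynomial part: affine map, elementwise square, then another affine map
        return self.lin2(self.lin1(x) ** 2)

    def forward(self, x):
        return torch.tanh(self.h(x))
```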

The Residual Controller Training. We then retrain the hybrid controller \(p({\textbf{x}}) + k({\textbf{x}}| \theta ')\) using the RL technique described in the previous subsection. To improve training efficiency, knowledge distillation is used to obtain the initialization of the NN part \(k({\textbf{x}}| \theta ')\). This is achieved by regarding the residual function \(r({\textbf{x}})\) as the teacher network and distilling its knowledge into a small model (the student network). The learned student network realizes the knowledge transfer from the teacher network and provides the initial values of \(k({\textbf{x}}| \theta ')\) for further training.
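A minimal sketch of this distillation step is given below, assuming `u_RL`, the polynomial `p`, and the sampled `states` are available as PyTorch callables/tensors; it simply regresses the student network on the teacher signal \(r({\textbf{x}})\).

```python
import torch

def distill_residual(u_RL, p, states, net, epochs=200, lr=1e-3):
    """Initialize the student k(x|theta') by regressing the residual r(x)."""
    with torch.no_grad():
        target = u_RL(states) - p(states)      # teacher signal r(x) on the samples
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        loss = loss_fn(net(states), target)    # student k(x|theta') vs. residual
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net                                 # initialization for RL fine-tuning
```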

We reiterate that the purpose of constructing a hybrid controller by adding \(k({\textbf{x}}| \theta ')\) to the polynomial part \(p({\textbf{x}})\) is to make the hybrid controller drive the system to perform as expected through this compensation. We achieve this not by training \(k({\textbf{x}}| \theta ')\) to satisfy \({\textbf{u}}_{RL} = p({\textbf{x}}) + k({\textbf{x}}| \theta ')\), but by requiring that the controller \(p({\textbf{x}}) + k({\textbf{x}}| \theta ')\) drive the following closed-loop system to be safe and goal-reaching: \( \dot{{\textbf{x}}}=f({\textbf{x}}, p({\textbf{x}}) + k({\textbf{x}}| \theta ')).\)

We need to train the hybrid controller \(p({\textbf{x}})+k({\textbf{x}}| \theta ')\) for the above system to obtain the parameters \(\theta '\). Using the learned parameters of the student network from the knowledge distillation as the initialization of \(k({\textbf{x}}| \theta ')\), we simulate the system to collect a dataset of sampled trajectories, and use the DDPG algorithm to achieve the control objective of safety and goal-reaching, following the reward design elaborated in Sect. 3.1. Once the training is completed, we obtain the desired hybrid polynomial-DNN controller \(u({\textbf{x}}) = p({\textbf{x}}) + k({\textbf{x}})\), where \(p({\textbf{x}})\) is the polynomial part and \(k({\textbf{x}})\) is the small neural network.

4 Reach-Avoid Verification with Lyapunov-Like Functions and Barrier Certificates Generation

To ensure the safety and goal-reaching properties of the specified control system under the synthesized controller, a relaxed surrogate is to generate a Lyapunov-like function and a barrier certificate, as stated in Theorem 1 and Theorem 2. To make the computation tractable, the basic idea is to translate the problem of producing barrier certificates and Lyapunov-like functions into solvable polynomial optimization problems. Specifically, we first transform the ODEs \({\textbf{f}}\) of the CCDS through system recasting; we then abstract the initial set \(\varTheta \), the unsafe region \(X_u\), the goal set \(X_g\) and the system domain \(\varPsi \) by polynomial expressions. Finally, we establish the polynomial optimization problems arising from the constraints on barrier certificates and Lyapunov-like functions, and solve them to produce a barrier certificate and a Lyapunov-like function, which guarantee the safety and goal-reaching properties of the system with the hybrid controller, respectively. Notably, the Sum-of-Squares (SOS) relaxation technique is applied to encode each polynomial optimization problem as an SOS program involving bilinear matrix inequality (BMI) constraints.

4.1 Constructing Polynomial Simulations of the Controller Network

In the following, we assume without loss of generality that the control input \({\textbf{u}}\) is one-dimensional, for ease of presentation. Consider a controlled CCDS \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) with \(\textbf{f}\) defined by (1), an unsafe set \(X_u\) and a goal set \(X_g\). Suppose the hybrid controller learned for the safety and goal-reaching requirements is \(u({\textbf{x}}) = p({\textbf{x}}) + k({\textbf{x}})\). Here \(k({\textbf{x}})\) is a small neural network with the square function as its activation in the hidden layer and \(\tanh \) in the output layer, i.e., \(k({\textbf{x}}) = \tanh (h({\textbf{x}}))\), where h is a polynomial, in fact the composition of affine functions and the square function. We replace the non-polynomial term occurring in the controller part of the vector field \({\textbf{f}}({\textbf{x}},{\textbf{u}})\) by introducing \(x_{n+1} = \tanh (h({\textbf{x}}))\). Then \(\dot{\textbf{x}} =\textbf{f}(\textbf{x}, \textbf{u})\) is transformed into a polynomial system:

$$\begin{aligned} \left\{ \begin{array}{l} \dot{\textbf{x}} =\textbf{f}(\textbf{x}, p({\textbf{x}})+x_{n+1}),\\ \dot{x}_{n+1} =(1-x_{n+1}^2)\dot{h}({\textbf{x}}). \end{array} \right. \end{aligned}$$
(3)

For simplicity, we denote (3) as \(\hat{{\textbf{f}}}\in {\mathbb {R}[{\textbf{x}}]^{n+1}}\).
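The recasting into (3) is mechanical and can be sketched with symbolic differentiation; the snippet below is an illustrative SymPy version (not the paper's toolchain), where `f` is given as expressions in the state variables `xs` and a control symbol `u`.

```python
import sympy as sp

def recast(f, u, p, h, xs):
    """Recast dx/dt = f(x, u) with u = p(x) + tanh(h(x)) into the
    polynomial system (3) by introducing x_{n+1} = tanh(h(x))."""
    x_np1 = sp.Symbol('x_np1')
    f_hat = [fi.subs(u, p + x_np1) for fi in f]            # dx_i/dt with u -> p(x) + x_{n+1}
    h_dot = sum(sp.diff(h, xi) * fi for xi, fi in zip(xs, f_hat))
    f_hat.append(sp.expand((1 - x_np1**2) * h_dot))        # eq. (2): dx_{n+1}/dt
    return f_hat, x_np1

# Example 1 (Sect. 5): xs = [x, y, z], u a symbol, f = [z + 8*y, -y + z, -z - x**2 + u]
```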

Besides the vector field, we also need to transform \(\varTheta \), \(\varPsi \), \(X_u\), \(X_g\) because of the newly introduced variable. For instance, the initial set becomes \(\bar{\varTheta }:=\{({\textbf{x}},x_{n+1}) \in {\mathbb R}^{n+1}\,|\,{\textbf{x}}\in \varTheta ,\, x_{n+1} = \tanh (h({\textbf{x}}))\}\), which can be abstracted by a polynomial inclusion. Concretely, we first compute a hyper-rectangle \(I:=\{{\textbf{x}}\in {\mathbb R}^{n}| \wedge l_i\le x_i\le u_i\}\) as an over-approximation of the bounded compact set \(\varTheta \) through interval analysis, then compute a Taylor model for the term \(\tanh (h({\textbf{x}}))\) on I and obtain \(p_1({\textbf{x}})-\delta _1\le x_{n+1}\le p_1({\textbf{x}})+\delta _1\), which yields the polynomial abstraction \(\hat{\varTheta }\) of \(\bar{\varTheta }\). For brevity, let \(\hat{{\textbf{x}}}\) denote the variable vector extended with \(x_{n+1}\), i.e., \(\hat{{\textbf{x}}}=({\textbf{x}},x_{n+1})=(x_1,\ldots ,x_n,x_{n+1})^{T}\). The other sets \(\varPsi \), \(X_u\), \(X_g\) are handled in the same manner, yielding the associated polynomial abstractions \(\hat{\varPsi }\), \(\hat{X}_u\), \(\hat{X}_g\). These polynomial abstractions can be written as follows

$$\begin{aligned} \left\{ \begin{array}{l} \hat{\varTheta }:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,{\textbf{x}}\in \varTheta , \,\, |x_{n+1} - p_1({\textbf{x}})|\le \delta _1\}, \\ \hat{\varPsi }:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\, {\textbf{x}}\in \varPsi , \,\, |x_{n+1} - p_2({\textbf{x}})|\le \delta _2\}, \\ \hat{X}_u:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,{\textbf{x}}\in X_u,\,\, |x_{n+1} - p_3({\textbf{x}})|\le \delta _3\}, \\ \hat{X}_g:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\, {\textbf{x}}\in X_g, \, \, |x_{n+1} - p_4({\textbf{x}})|\le \delta _4\}. \end{array} \right. \end{aligned}$$
(4)

Finally, we obtain a polynomial CCDS \(\hat{\mathcal {C}}=(\hat{\textbf{f}}, \hat{\varPsi }, \hat{\varTheta })\). Therefore, if \({\textbf{x}}(t)\) is a trajectory of system (1) within domain specified by \(\varPsi \) starting from some initial state \({\textbf{x}}(t_0)\in \varTheta \), then \(\hat{{\textbf{x}}}(t)\) is the trajectory of system (3) within the relaxed domain specified by \(\hat{\varPsi }\) starting from the initial state \(\hat{{\textbf{x}}}(t_0)\in \hat{\varTheta }\) with \(x_{n + 1}(t_0) = \tanh (h({\textbf{x}}(t_0)))\).
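The error bounds \(\delta _i\) in (4) must come from a sound procedure such as Taylor models with interval remainders; purely as a quick sanity check of a candidate pair \((p_1, \delta _1)\), one can estimate the gap on random samples as sketched below (this is not a rigorous bound, and all names are placeholders).

```python
import numpy as np

def estimate_gap(tanh_h, p1, box, n_samples=100_000, seed=0):
    """Empirically estimate max |tanh(h(x)) - p1(x)| over the hyper-rectangle
    `box` = [(l_1, u_1), ..., (l_n, u_n)].  NOT a rigorous bound."""
    rng = np.random.default_rng(seed)
    lo = np.array([l for l, _ in box])
    hi = np.array([u for _, u in box])
    xs = rng.uniform(lo, hi, size=(n_samples, len(box)))
    gaps = np.abs(np.array([tanh_h(x) - p1(x) for x in xs]))
    return gaps.max()
```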

Theorem 3

If the controlled CCDS \(\hat{\mathcal {C}}=(\hat{\textbf{f}}, \hat{\varPsi }, \hat{\varTheta })\) with \(\hat{{\textbf{f}}}\) defined by (3) and with \(\hat{\varTheta }\), \(\hat{\varPsi }\), and \(\hat{X}_u\) defined by (4) is safe, then the original CCDS \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) with the given unsafe set \(X_u\) is safe. Moreover, if \(B(\hat{{\textbf{x}}})\) is a barrier certificate of \(\hat{\mathcal {C}}\) w.r.t. \(\hat{X}_u\), then \(B({\textbf{x}}, \tanh (h({\textbf{x}})))\) is also a barrier certificate of \(\mathcal {C}\) w.r.t. \(X_u\).

Proof

Without loss of generality, assume that \({\textbf{x}}(t), t>0\), is a trajectory of the controlled CCDS \(\mathcal {C}\) starting from the initial state \({\textbf{x}}(t_0)\in {\varTheta }\); then \(\hat{{\textbf{x}}}(t)\) with \(x_{n + 1}(t) = \tanh (h({\textbf{x}}(t)))\) is a trajectory of \(\hat{\mathcal {C}}\) starting from the initial state \(\hat{{\textbf{x}}}(t_0)\in \hat{\varTheta }\). The safety of \(\hat{\mathcal {C}}\) means that no trajectory of \(\hat{\mathcal {C}}\) starting from \(\hat{\varTheta }\) can reach any unsafe state specified by \(\hat{X}_u\), which implies that no trajectory of \(\mathcal {C}\) starting from \({\textbf{x}}(t_0)\) can reach any state specified by \(X_u\). Furthermore, the vector field \(\hat{{\textbf{f}}}\) is obtained from \({\textbf{f}}\) by an equivalent transformation, and \(\hat{\varTheta }\), \(\hat{\varPsi }\) and \(\hat{X}_u\) are the associated polynomial abstractions. Therefore, \(B({\textbf{x}},\tanh (h({\textbf{x}})))\) is a barrier certificate of the CCDS \(\mathcal {C}\).

Theorem 4

If controlled CCDS \(\hat{\mathcal {C}}=(\hat{\textbf{f}}, \hat{\varPsi }, \hat{\varTheta })\) with \(\hat{{\textbf{f}}}\) defined by (3) and with \(\hat{\varTheta }\), \(\hat{\varPsi }\) and \(\hat{X}_g\) defined by (4) is goal-reaching, then the original CCDS \(\mathcal {C}=(\textbf{f}, \varPsi , \varTheta )\) with the given goal set \(X_g\) is goal-reaching. Moreover, if \(V(\hat{{\textbf{x}}})\) is a Lyapunov-like function of \(\hat{\mathcal {C}}\) w.r.t. \(\hat{X}_g\), then \(V({\textbf{x}}, \tanh (h({\textbf{x}})))\) is the Lyapunov-like function of \(\mathcal {C}\) w.r.t. \(X_g\).

Proof

Suppose the CCDS \(\mathcal {C}\) is not goal-reaching for the given goal set \(X_g\). Then there exist \(\epsilon >0\) and \({\textbf{x}}_0\in \varTheta \) such that \(\Vert {\textbf{x}}(t)\Vert _{X_g}> \epsilon \) for all \(t>0\). The state \(\hat{{\textbf{x}}}(t)\in \hat{\varPsi }\) with \(x_{n + 1}(t) = \tanh (h({\textbf{x}}(t)))\), starting from the initial state \(\hat{{\textbf{x}}}(t_0)\), then satisfies

$$\begin{aligned} \Vert \hat{{\textbf{x}}}(t)\Vert _{\hat{X}_g} > \epsilon , \end{aligned}$$
(5)

because, according to (4), \(\hat{X}_g\) is obtained just by introducing a new variable without changing the projection onto the first n dimensions, i.e., \(X_g\). By the theorem's assumption, the CCDS \(\hat{\mathcal {C}}\) is goal-reaching, so there exists \(T>0\) such that \(\Vert \hat{{\textbf{x}}}(t)\Vert _{\hat{X}_g}<\epsilon \) for all \(t\ge T\), which contradicts (5). Similarly to Theorem 3, \(V({\textbf{x}},\tanh (h({\textbf{x}})))\) is a Lyapunov-like function of \(\mathcal {C}\) w.r.t. \(X_g\). This completes the proof.

4.2 Producing Barrier Certificate and Lyapunov-Like Function

For simplicity, hereafter we denote \(\hat{\varTheta }\), \(\hat{\varPsi }\), \(\hat{X}_u\) and \(\hat{X}_g\) as follows.

$$\begin{aligned} \left\{ \begin{array}{l} \hat{\varTheta }:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,\wedge _{i=1}^{m_1} g_{i}(\hat{{\textbf{x}}})\ge 0\}, \,\quad \hat{\varPsi }:=\{ \hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,\wedge _{j=1}^{m_2} h_{j}(\hat{{\textbf{x}}})\ge 0\}, \\ \hat{X}_u:=\{\hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,\wedge _{k=1}^{m_3} q_{k}(\hat{{\textbf{x}}})\ge 0 \}, \,\quad \hat{X}_g:=\{ \hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\,|\,\wedge _{\ell =1}^{m_4} s_{\ell }({\hat{\textbf{x}}})\ge 0 \}. \end{array} \right. \end{aligned}$$

Barrier Certificate Generation. Assume that the barrier function \(B(\hat{{\textbf{x}}})\) is a polynomial of degree at most d, whose coefficients form a vector space of dimension \(s(d)=\left( {\begin{array}{c}n+1+d\\ d\end{array}}\right) \) with the canonical basis \((\hat{{\textbf{x}}}^{\alpha })\) of monomials. The coefficients are unknown; denote by \({\textbf{b}}=(b_{\alpha })\in {\mathbb R}^{s(d)}\) the coefficient vector of \(B(\hat{{\textbf{x}}})\), and write

$$B(\hat{{\textbf{x}}},{\textbf{b}})=\sum _{\alpha \in {\mathbb N}_{d}^{n+1}} b_{\alpha } \hat{{\textbf{x}}}^{\alpha } =\sum _{\alpha \in {\mathbb N}_{d}^{n+1}} b_{\alpha }\, x_1^{\alpha _1}x_2^{\alpha _2}\cdots x_n^{\alpha _n}x_{n+1}^{\alpha _{n+1}},$$

in the canonical basis. As stated in Theorem 1 and Theorem 3, the controlled CCDS \(\mathcal {C}\) is safe under the designed controller if there exists such a barrier certificate \(B(\hat{{\textbf{x}}},{\textbf{b}})\) for the CCDS \(\hat{\mathcal {C}}\). Determining the existence of a barrier certificate \(B( \hat{{\textbf{x}}},{\textbf{b}})\) can be represented as the following feasibility problem.

$$\begin{aligned} \left\{ \begin{array}{l@{}l} \text {find} &{} \quad {\textbf{b}}\,\,\, \\ \text {s.t.} &{}B(\hat{{\textbf{x}}},{\textbf{b}}) \ge 0, \,\,\,\forall \hat{{\textbf{x}}} \in \hat{\varTheta }, \\ &{} \mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{B}(\hat{{\textbf{x}}},{\textbf{b}})>0, \,\,\,\forall \hat{{\textbf{x}}}\in \hat{\varPsi } \text { and } B(\hat{{\textbf{x}}},{\textbf{b}})=0,\\ &{} B(\hat{{\textbf{x}}},{\textbf{b}}) < 0, \,\,\,\forall \hat{{\textbf{x}}} \in \hat{X}_u. \\ \end{array}\right. \end{aligned}$$
(6)

Moreover, Sum-of-Squares (SOS) relaxation technique is applied to encode the optimization problem (6) as an SOS program. Given a basic semi-algebraic set \({\mathbb K}\) defined by: \({\mathbb K}=\{ \hat{{\textbf{x}}} \in {\mathbb R}^{n+1}\, | \, g_{1}(\hat{{\textbf{x}}})\ge 0,\ldots , g_{s}(\hat{{\textbf{x}}})\ge 0\}, \) where \(g_{i}(\hat{{\textbf{x}}})\in {\mathbb R}[\hat{{\textbf{x}}}], 1\le i\le s\), a sufficient condition for the nonnegativity of the given polynomial \(f(\hat{{\textbf{x}}})\) on the semi-algebraic set \({\mathbb K}\) is provided as

$$\begin{aligned} f(\hat{{\textbf{x}}})=\sigma _{0}(\hat{{\textbf{x}}})+\sum _{i=1}^{s}\sigma _{i}(\hat{{\textbf{x}}})g_{i}(\hat{{\textbf{x}}}), \,\, \end{aligned}$$
(7)

where \(\sigma _{i}(\hat{{\textbf{x}}}) \in \varSigma [\hat{{\textbf{x}}}]_{d} , \,\, 0\le i \le s\). Thus, the representation (7) ensures that the polynomial \(f(\hat{{\textbf{x}}})\) is nonnegative on the given semi-algebraic set \({\mathbb K}\).

Observe from (6) that the polynomial \(\mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{B}(\hat{{\textbf{x}}},{\textbf{b}})\) involves an uncertain variable \(\varepsilon \), introduced by the controller abstraction, ranging over \([-\mu ^*, \mu ^*]\); this range can be written as \(\hat{h}(\varepsilon )\ge 0\) with

$$\hat{h}(\varepsilon ):=(\varepsilon +\mu ^*)(\mu ^*-\varepsilon ).$$

Thus, the problem (6) can be transformed into the following optimization problem through SOS relaxation

$$\begin{aligned} {\displaystyle \left\{ \begin{array}{l@{}l} \text {find} &{} \quad {\textbf{b}}\,\,\, \\ \text {s.t.}&{} B(\hat{{\textbf{x}}},{\textbf{b}})-\sum _i\sigma _i(\hat{{\textbf{x}}})g_i(\hat{{\textbf{x}}}) \in \varSigma [\hat{{\textbf{x}}}], \\ &{}\mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{B}(\hat{{\textbf{x}}},{\textbf{b}})-\lambda (\hat{{\textbf{x}}})B(\hat{{\textbf{x}}},{\textbf{b}}) -\sum _{j}\phi _{j}(\hat{{\textbf{x}}})h_{j}(\hat{{\textbf{x}}})-\nu (\hat{{\textbf{x}}},\varepsilon )\hat{h}(\varepsilon )-\epsilon \in \varSigma [\hat{{\textbf{x}}}],\\ &{} -B(\hat{{\textbf{x}}},{\textbf{b}})-\epsilon '-\sum _{j} \kappa _{j}(\hat{{\textbf{x}}})q_{j}(\hat{{\textbf{x}}}) \in \varSigma [\hat{{\textbf{x}}}], \\ \end{array}\right. } \end{aligned}$$
(8)

where \(\epsilon , \epsilon '>0\), the entries of \(\sigma _i(\hat{{\textbf{x}}})\), \(\phi _{j}(\hat{{\textbf{x}}})\), \(\kappa _{j}(\hat{{\textbf{x}}})\in \varSigma [\hat{{\textbf{x}}}]\), \(\nu (\hat{{\textbf{x}}},\varepsilon ) \in \varSigma [\hat{{\textbf{x}}},\varepsilon ]\), and \(\lambda (\hat{{\textbf{x}}})\in {\mathbb R}[\hat{{\textbf{x}}}]\). Note that \(\epsilon , \epsilon '\) are needed to ensure the strict positivity/negativity required by the second and third constraints in (6). The feasibility of the constraints in (8) is sufficient for the feasibility of the constraints in (6).

Investigating (8), the product of undetermined coefficients from \(\lambda (\hat{{\textbf{x}}})\) and \(B(\hat{{\textbf{x}}},{\textbf{b}})\) in the second constraint turns the problem into a bilinear matrix inequality (BMI) problem, which can be solved by calling the Matlab package PENBMI [18].

Remark that the existence of a feasible solution \({\textbf{b}}^{*}\) to problem (8) implies that the system is guaranteed to be safe under the designed controller \(u({\textbf{x}}) = p({\textbf{x}})+k({\textbf{x}})\).
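Independently of the BMI solver, a candidate \(B\) can be cheaply falsified on sampled states before attempting (8); the sketch below only checks the conditions of (6) on samples (with `lieB` standing for the Lie derivative as a callable) and is no substitute for the SOS certificate.

```python
import numpy as np

def check_barrier_on_samples(B, lieB, theta_pts, unsafe_pts, domain_pts, tol=1e-6):
    """Sample-based falsification of the conditions in (6)."""
    ok_init = all(B(x) >= -tol for x in theta_pts)      # B >= 0 on samples of Theta_hat
    ok_unsafe = all(B(x) < 0 for x in unsafe_pts)       # B < 0 on samples of X_u_hat
    # near the zero level set inside the domain, the Lie derivative must be positive
    boundary = [x for x in domain_pts if abs(B(x)) <= 1e-3]
    ok_lie = all(lieB(x) > 0 for x in boundary)
    return ok_init and ok_unsafe and ok_lie
```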

Lyapunov-like Function Computation. We further want to guarantee that the learned controller is not only safe but also goal-reaching, in the sense of driving the system to converge to the specified goal set. As stated in Theorem 2, the existence of a Lyapunov-like function suffices to prove that the system's behaviors asymptotically converge to the specified goal set \(X_g\). In a similar manner, we first formalize the goal-reaching verification for system \(\mathcal {C}\) through Theorem 2 and Theorem 4. Assume that the Lyapunov-like function \(V(\hat{{\textbf{x}}})\) is a polynomial of degree at most \(d'\), whose coefficients form a vector space of dimension \(s(d')=\left( {\begin{array}{c}n+1+d'\\ d'\end{array}}\right) \) with the canonical basis \((\hat{{\textbf{x}}}^{\alpha })\) of monomials. We introduce the coefficient parameters of \(V(\hat{{\textbf{x}}})\) as the vector \({\textbf{v}}=(v_{\alpha })\in {\mathbb R}^{s(d')}\), and write

$$V(\hat{{\textbf{x}}},{\textbf{v}})=\sum _{\alpha \in {\mathbb N}_{d'}^{n+1}} v_{\alpha } \hat{{\textbf{x}}}^{\alpha } =\sum _{\alpha \in {\mathbb N}_{d'}^{n+1}} v_{\alpha }\, x_1^{\alpha _1}x_2^{\alpha _2}\cdots x_{n+1}^{\alpha _{n+1}},$$

in the canonical basis. By Theorem 4, showing that the controlled CCDS \(\mathcal {C}\) is goal-reaching under the designed controller reduces to showing that the CCDS \(\hat{\mathcal {C}}\) is goal-reaching, which holds if there exists such a Lyapunov-like function \(V(\hat{{\textbf{x}}},{\textbf{v}})\). The existence of a Lyapunov-like function can be established by solving the following feasibility problem:

$$\begin{aligned} \left\{ \begin{array}{l@{}l} \text {find} &{} \quad {\textbf{v}}\,\,\, \\ \text {s.t.}~ &{} ~\emptyset \ne \{\hat{{\textbf{x}}}:V(\hat{{\textbf{x}}},{\textbf{v}})\le {0}\}\subseteq {\hat{X}_{g}}, \\ &{} \mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{V}(\hat{{\textbf{x}}},{\textbf{v}})\le -{\beta }(V(\hat{{\textbf{x}}},{\textbf{v}})), \,\,\,\forall {\hat{{\textbf{x}}}}\in {\hat{\varPsi }}.\\ \end{array}\right. \end{aligned}$$
(9)

Similarly, the uncertain variable \(\varepsilon \) ranging over \([-\mu , \mu ]\), which enters \(\mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{V}(\hat{{\textbf{x}}},{\textbf{v}})\) through the controller \({\textbf{u}}\), is encoded by \(\hat{h}(\varepsilon )\ge 0\) with \(\hat{h}(\varepsilon ):=(\varepsilon +\mu )(\mu -\varepsilon )\). For the given goal set \(\hat{X}_g\), the constraint \( \{\hat{{\textbf{x}}}:V(\hat{{\textbf{x}}},{\textbf{v}})\le {0}\}\ne \emptyset \) can be encoded by \(V(\hat{{\textbf{x}}}_0,{\textbf{v}})\le {0}\) for a point \(\hat{{\textbf{x}}}_0\in \hat{X}_g\).

Based on the above encodings, problem (9) can be transformed into the following constrained polynomial optimization problem

$$\begin{aligned} {\displaystyle \left\{ \begin{array}{l@{}l} \text {find} &{} \quad {\textbf{v}}\,\,\, \\ \text {s.t.}~&{} ~s_i(\hat{{\textbf{x}}}) + \sigma '_i(\hat{{\textbf{x}}})V(\hat{{\textbf{x}}},{\textbf{v}}) \in \varSigma [\hat{{\textbf{x}}}], \\ &{} - \mathcal {L}_{{\textbf{f}}_{\textbf{u}}}{V}(\hat{{\textbf{x}}},{\textbf{v}}) - \beta (V(\hat{{\textbf{x}}},{\textbf{v}})) -\sum _{j}\phi '_{j}(\hat{{\textbf{x}}})h_{j}(\hat{{\textbf{x}}})-\nu '(\hat{{\textbf{x}}},\varepsilon )\hat{h}(\varepsilon ) \in \varSigma [\hat{{\textbf{x}}}],\\ &{} - V(\hat{{\textbf{x}}}_0,{\textbf{v}}) \in \varSigma [\hat{{\textbf{x}}}], \\ \end{array}\right. } \end{aligned}$$
(10)

where \(1\le i \le m_4\), \(1\le j \le m_2\), the entries of \(\sigma '_i(\hat{{\textbf{x}}})\), \(\phi '_{j}(\hat{{\textbf{x}}})\) \(\in \varSigma [\hat{{\textbf{x}}}]\), and \(\nu '(\hat{{\textbf{x}}},\varepsilon ) \in \varSigma [\hat{{\textbf{x}}},\varepsilon ]\). For simplicity, we take the extended class \(\mathcal {K}\) function \(\beta (\cdot )\) to be \(\beta (x) = x\) or \(\beta (x) =r \cdot x\) with \(r > 0\).

In summary, the safety and goal-reaching verification problem is transformed into the BMI problems (8) and (10) in the parameters \({\textbf{b}}\) and \({\textbf{v}}\). A solution \({\textbf{b}}^{*}\) to problem (8) yields a barrier certificate \(B(\hat{{\textbf{x}}},{\textbf{b}}^*)\), meaning that the closed-loop system under the designed controller \(u({\textbf{x}})=p({\textbf{x}})+k({\textbf{x}})\) is safe. A solution \({\textbf{v}}^{*}\) to (10) produces a Lyapunov-like function \(V(\hat{{\textbf{x}}},{\textbf{v}}^*)\), meaning that the system asymptotically converges to the specified goal set \(X_g\).
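An analogous sample-based falsification can be applied to a candidate \(V\) before or after solving (10), taking \(\beta (x)=r\cdot x\); again this is only a sanity check under assumed callables (`V`, `lieV`, `in_goal`), not a proof.

```python
def check_lyapunov_on_samples(V, lieV, goal_point, domain_pts, in_goal, r=1.0):
    """Sample-based falsification of the conditions encoded in (9)/(10), beta(s) = r*s."""
    ok_nonempty = V(goal_point) <= 0.0                                  # {V <= 0} nonempty
    ok_sublevel = all(V(x) > 0 for x in domain_pts if not in_goal(x))   # {V <= 0} inside X_g_hat
    ok_decay = all(lieV(x) <= -r * V(x) for x in domain_pts)            # L_f V <= -beta(V) on Psi_hat
    return ok_nonempty and ok_sublevel and ok_decay
```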

5 Experiments

In this section we first present a nonlinear system to illustrate our approach, and then report an experimental evaluation of our method over a set of benchmark examples, comparing it with two other potential methods. All experiments are conducted on a 3.2GHz AMD Ryzen 7 3700X CPU under Windows 10 with 16GB RAM.

Example 1

[Academic 3D Model [6]] Consider the following continuous dynamical system as the plant:

$$\begin{aligned} \begin{bmatrix} \dot{x}\\ \dot{y}\\ \dot{z}\\ \end{bmatrix} = \begin{bmatrix} z+8y\\ -y+z\\ -z-x^2+u \end{bmatrix}. \end{aligned}$$

The system domain is \(\varPsi =\{{\textbf{x}}= (x,y,z)^T \in \mathbb {R}^3 \,|\, -5 \le x,y,z \le 5\}.\) Our goal is to design a control law \(u=p({\textbf{x}})+k({\textbf{x}})\) such that all trajectories of the closed-loop system under u starting from the initial set

$$\varTheta =\{{\textbf{x}}\in \mathbb {R}^3 \,| (x+0.75)^2+(y+1)^2+(z+0.4)^2\le 0.35^2\}$$

will never enter the unsafe region

$$X_u=\{{\textbf{x}}\in \mathbb {R}^3 \,|(x+0.3)^2+(y+0.36)^2+(z-0.2)^2 \le 0.30^2\},$$

and eventually enter the goal set \(X_g=\{{\textbf{x}}\in \mathbb {R}^3 \,|x^2+y^2+z^2\le 0.1^2\}.\)

For the controller learning process, we train different NN structures with increasing depth and width as controller templates until a desired controller is obtained. We eventually obtained a DNN controller with 5 hidden layers of 128 neurons each, while smaller sizes failed. Based on this learned DNN controller, we construct a hybrid controller for the system. The polynomial part \(p({\textbf{x}})\) is obtained by the sampling-based method as follows:

$$\begin{aligned} \begin{array}{ll} p({\textbf{x}})=0.125&{}-3.333x-5.726y-10.669z+1.911x^2+1.212xy\\ &{}+2.138x z-1.332y^2-10.07y z-12.952z^2. \end{array} \end{aligned}$$

The hybrid controller is then constructed as \(p({\textbf{x}}) + k({\textbf{x}}|\theta ')\), where \( k({\textbf{x}}|\theta ')\) is a small NN with one hidden layer. After retraining with \(p({\textbf{x}}) + k({\textbf{x}}|\theta ')\) plugged into the system, we obtain the NN part with one hidden layer containing 30 neurons.

Fig. 3. Phase portrait of the system in Example 1. Subfigure (a) shows the zero level set of the barrier certificate \(B({\textbf{x}})\) (the blue surface) separating the unsafe region \(X_u\) (the red ball) from the initial set \(\varTheta \) (the yellow ball). Subfigure (b) shows that all trajectories (in different colors) from \(\varTheta \) (the yellow ball) reach \(X_g\) (the green ball). (Color figure online)

Under the hybrid controller \(p({\textbf{x}})+k({\textbf{x}})\), the controlled system can be verified to satisfy the safety and goal-reaching properties by the following barrier certificate \(B({\textbf{x}}, \tanh (h({\textbf{x}})))\) and Lyapunov-like function \(V({\textbf{x}},\tanh (h({\textbf{x}})))\) respectively,

$$\begin{aligned} \left\{ \begin{array}{ll} B=&{}0.641x^2 - 0.143xy + 0.554y^2+\cdots +0.004\tanh (h({\textbf{x}}))-0.353z+0.061,\\ V=&{}-0.09x^2 - 0.311xy+\cdots + 0.0123\tanh (h({\textbf{x}}))- 0.033x-0.024z-0.01, \end{array} \right. \end{aligned}$$

where \(h({\textbf{x}}) = 2.248x^2 + 0.962xy +\cdots - 0.389z + 9.051\).

Figure 3(a) shows the zero level set of the barrier certificate (in blue), which separates \(X_u\) (the red ball) from all trajectories starting from \(\varTheta \) (the yellow ball), and Fig. 3(b) shows simulated trajectories of the system converging to the goal set \(X_g\) (the green ball) under the learned hybrid controller. Therefore, we conclude that the system is guaranteed to be safe and goal-reaching from the initial set under our learned hybrid controller.

Although a DNN policy obtained by RL may appear to work well in many applications, it is difficult to assert any strong and provable claims about its correctness, since the neurons, layers, weights and biases are far removed from the intent of the actual controller. As found in [32], the state-of-the-art neural network verifiers are ineffective for verification of a neural controller over an infinite time horizon with complex system dynamics. The idea is therefore to learn a controller that admits formal reasoning about the specified properties. In the following, we conduct the research experiments stated below:

RE1 : Explore directly learning a polynomial controller to control the system and guarantee its safety and goal-reaching requirements.

From the verification point of view, one may consider directly learning a polynomial controller to control the system (without appealing to a neural policy at all), using reinforcement learning to synthesize its unknown parameters. So the experiment first tried training the controller network with the commonly used square activation function. Training on a data set of 250 trajectories with 3000 data points each was unsuccessful for various network structures (of up to 5 layers and 250 neurons), i.e., the system still misbehaves when simulated under the trained polynomial controller. As mentioned in [32], Zhu et al. found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to their specification with either reinforcement learning (e.g. policy gradient) or random search was unsuccessful because of undesirable overfitting, even for an example as simple as the inverted pendulum.

RE2 : Explore the effects of using just a polynomial or a small NN to imitate the original DNN to avoid the hybrid form.

Our method uses RL to obtain a well-performing DNN controller in general form, and then, with the guidance of the learned DNN, designs a hybrid controller that is verifiable for the safety and goal-reaching properties. The next experiment shows the performance of the hybrid controller synthesis and compares its verification performance with two other RL-guided controller synthesis methods:

(RE2-1) Obtain a polynomial controller by imitating and abstracting the trained DNN controller; under this abstracted polynomial controller, the resulting verification of the control system can naturally be encoded as a polynomial constraint solving problem;

(RE2-2) Abstract the DNN controller based on knowledge distillation to obtain a small network with a simple structure, which is expected to maintain the safety and goal-reaching of the original network (on the data set) [11]. Since the posterior verification cannot avoid approximating the neural network with a polynomial, and the upper bound of the approximation error is positively related to the Lipschitz constant, the distilled small network is expected to make the verification succeed thanks to its smaller Lipschitz constant.

Table 1. Performance Evaluation

We present a detailed experimental evaluation on a set of benchmarks in Table 1. The origins of these 10 widely used examples are provided in the first column; \(n_{\textbf{x}}\) and \(d_{\textbf{f}}\) denote the number of state variables and the maximal degree of the polynomials (or of the polynomial abstraction by Taylor model for non-polynomial systems) in the vector fields. The examples have dimension up to 7. \(u_0({\textbf{x}})\) denotes the network structure of the DNN controller synthesized by RL directly. For example, the trained DNN controller for \(C_1\) has 4 hidden layers with 128 neurons each. Here, all DNNs use ReLU activation functions, except for \(\tanh \) on the output layer.

Table 1 shows the performance of the three controller synthesis methods guided by the well-trained DNN \(u_0({\textbf{x}})\), i.e., the hybrid controller design, the polynomial controller by imitation (denoted Poly.), and the NN controller by distillation (denoted Distil.). The verification for all these methods is carried out by generating certificate functions; the time costs are recorded as \(T_H\), \(T_P\) and \(T_D\) respectively when both a barrier certificate and a Lyapunov-like function have been obtained, and the degrees of the obtained certificate functions are recorded as \(d_B\), \(d_V\); otherwise, '\(\times \)' marks a failure to compute any barrier certificate or Lyapunov-like function within the degree bound of 6 and the time bound of 3 hours.

In our hybrid controller design method (i.e., Hyb. design), we uniformly choose \(p({\textbf{x}})\) of degree 2 and \(k({\textbf{x}})\) with one single hidden layer shown in column \(k({\textbf{x}})\). \(d_B\) and \(d_V\) denote the degrees of the computed certificates of barrier function \(B(\hat{{\textbf{x}}})\) and Lyapunov-like function \(V(\hat{{\textbf{x}}})\) respectively. \(T_{H}\) in the last column denotes the verification time cost.

The column Poly. exhibits the results of the method described in (RE2-1) on the benchmarks, intended to further explain the necessity of a hybrid-form controller. As an ablation study, we use only polynomial approximations of the original DNNs as surrogate controllers and carry out certificate-based verification on them. To preserve the control effect, we increase the degree bound of the polynomial templates to 8 to ensure a high-precision approximation. \(d_P\) denotes the lowest degree of the polynomial surrogate controllers that pass verification and \(T_P\) denotes the corresponding time cost; '\(\times \)' means that no such controller is found. The column Distil. provides the results of the method in (RE2-2) on the benchmarks. In this ablation study, we distill simpler NNs with a single hidden layer from the original DNNs and verify the specified properties using the distilled NN controllers. This process is repeated with the number of hidden neurons of the distilled NNs ranging from 20 up to 50, until one satisfying the specified properties is obtained, whose verification time cost is reported as \(T_D\), or no such simpler NN is found, denoted by '\(\times \)' in \(T_D\).

For all 10 examples, we have successfully verified the safety and goal-reaching properties of the synthesized hybrid controllers via certificate generation, while the methods based on polynomial surrogate controllers (Poly.) and distilled NN controllers (Distil.) succeed on 5 and 4 benchmarks, respectively. Moreover, for some examples, the Hyb. design method finds barrier certificates and Lyapunov-like functions of lower degrees. Consequently, the BMI problems have fewer decision variables than for the other methods, which contributes to the effectiveness of the verification procedure.

We compare the efficiency of the methods in terms of the time spent in the verification process on the successful examples. On average, \(T_P\) is 4.3 to 9.5 times \(T_H\) on the 5 cases where Poly. succeeds. Meanwhile, \(T_D\) is about 8.18 seconds on average, 1.46 times more than \(T_H\), on the four cases where Distil. succeeds. Comparing \(T_H\) with \(T_P\) and \(T_D\), we conclude that verification of the hybrid controllers is much more efficient.

To summarize, Table 1 shows that all the synthesized hybrid controllers have been efficiently verified to make the systems safe and goal-reaching on a set of commonly used benchmark examples, which demonstrates that our hybrid polynomial-DNN controller synthesis method is quite promising.

6 Conclusion

This paper has presented an approach to synthesize hybrid polynomial-DNN controllers for nonlinear systems such that the closed-loop system is both well-performing and easily verified against the required properties. Our approach integrates low-degree polynomial fitting and knowledge distillation into the RL-based construction process. Thanks to the special structure of the hybrid controller, the controlled system can be transformed into a polynomial form. An SOS relaxation based method is applied to generate barrier certificates and Lyapunov-like functions, which verify the safety and goal-reaching properties of the nonlinear control systems equipped with our synthesized hybrid controllers. Extensive experiments consistently demonstrate the effectiveness and scalability of the proposed approach.