
1 Introduction

Learning-based control and verification of learned controllers. Learning-based control and reinforcement learning (RL) have empirically demonstrated enormous potential for solving highly non-linear control tasks. However, their deployment in safety-critical scenarios such as autonomous driving or healthcare requires safety assurances. Most safety-aware RL algorithms optimize expected reward while only empirically trying to maximize the probability of safety. This, together with the non-explainable nature of neural network controllers obtained via deep RL, raises questions about the trustworthiness of learning-based methods for safety-critical applications [9, 27]. To that end, formal verification of learned controllers, as well as learning-based control with formal safety guarantees, have become very active research topics.

Learning certificate functions. A classical approach to formally proving properties of dynamical systems is to compute a certificate function. A certificate function [26] is a function that assigns real values to system states and whose defining conditions imply satisfaction of the property. Thus, in order to prove the property of interest, it suffices to compute a certificate function for that property. For instance, Lyapunov functions [46] and barrier functions [50] are standard certificate functions for proving reachability of some target set and avoidance of some unsafe set of system states, respectively, when the system dynamics are deterministic. While both Lyapunov and barrier functions are well-studied concepts in dynamical systems theory, early methods for their computation either required designing the certificates by hand or used computationally intractable numerical procedures. A more recent approach reduces certificate computation to a semi-definite programming problem by using sum-of-squares (SOS) techniques [33, 37, 49]. However, this approach applies only to polynomial systems and to the computation of polynomial certificate functions, and is not applicable to systems with general non-linearities. Moreover, SOS methods do not scale well with the dimension of the system.

Learning-based methods are a promising approach to overcome these limitations and they have received much attention in recent years. These methods jointly learn a neural network control policy and a neural network certificate function, e.g. a Lyapunov function [3, 17, 18, 53] or a barrier function [1, 38, 52, 58], depending on the property of interest. The neural network certificate is then formally verified, ensuring that these methods provide formal guarantees. Both learning and verification procedures developed for verifying neural network certificates are not restricted to polynomial dynamical systems. See [26] for an overview of existing learning-based control methods that learn a certificate function to verify a system property in deterministic dynamical systems.

Prior works – deterministic dynamical systems. While the above works present significant advancements in learning-based control and verification of dynamical systems, they are predominantly restricted to deterministic dynamical systems. In other words, they assume that they have access to the exact dynamics function according to which the system evolves. However, for most control tasks, the underlying models used by control methods are imperfect approximations of real systems inferred from observed data. Thus, control and verification methods should also account for model uncertainty due to the noise in observed data and the approximate nature of model inference.

This survey – stochastic dynamical systems. In this work, we survey recent developments in learning-based methods for control and verification of discrete-time stochastic dynamical systems, based on [44, 68]. Stochastic dynamical systems use probability distributions to quantify and model uncertainty. In stochastic dynamical systems, given a property of interest and a probability parameter \(p\in [0,1]\), the goal is to learn a control policy and a formal certificate which guarantees that the system under the learned policy satisfies the property of interest with probability at least p.

Supermartingale certificate functions. Lyapunov functions and barrier functions can be used to prove properties of deterministic dynamical systems; however, they are not applicable to stochastic dynamical systems and do not allow reasoning about the probability with which a property is satisfied. Instead, the learning-based methods of [44, 68] use supermartingale certificate functions to formally prove properties of stochastic systems. Supermartingales are a class of stochastic processes whose value does not increase in expectation at any time step [66]. Their nice convergence properties and concentration bounds allow their use in designing certificate functions for stochastic dynamical systems. In particular, ranking supermartingales (RSMs) [15, 44] were used to verify probability 1 reachability, and stochastic barrier functions (SBFs) [50] were used to verify safety with a specified probability \(p\in [0,1]\). Reach-avoid supermartingales (RASMs) [68] unify and extend these two concepts and were used to verify reach-avoidance properties, i.e. conjunctions of reachability and safety properties, with a specified probability \(p\in [0,1]\). We define and compare these concepts in Section 3.

Fig. 1. Schematic illustration of the learner-verifier loop.

Learner-verifier framework for stochastic dynamical systems. In Section 4, we then present a learner-verifier framework of [44, 68] for learning-based control and for the verification of learned controllers in stochastic dynamical systems in a counterexample guided inductive synthesis (CEGIS) fashion [55]. The algorithm jointly learns a neural network control policy and a neural network supermartingale certificate function. It consists of two modules – the learner, which learns a policy and a supermartingale certificate function candidate, and the verifier, which then formally verifies the candidate supermartingale certificate function. If the verification step fails, the verifier computes counterexamples and passes them back to the learner, which tries to learn a new candidate. This loop is repeated until a candidate is successfully verified, see Fig. 1.

This framework builds on the existing learner-verifier methods for learning-based control in deterministic dynamical systems [2, 18, 26]. However, the extension of this framework to stochastic dynamical systems and the synthesis of supermartingale certificate functions is far from straightforward. In particular, the methods of [2, 18] use knowledge of the deterministic dynamics function to reduce the verification task to a decision procedure and use an off-the-shelf solver. However, verifying the expected decrease condition of supermartingale certificates by reduction to a decision procedure would require computing a closed-form expression for the expected value of a neural network function over a probability distribution and providing it to the solver. It is not clear how such a closed-form expression could be computed, and it is not even known whether one exists in the general case.

This challenge is solved by using a method for efficient computation of tight upper and lower bounds on the expected value of a neural network function. The verifier module then verifies the expected decrease condition by discretizing the state space and formally verifying a slightly stricter condition at the discretization points by using the computed expected value bounds. By carefully choosing the mesh of the discretization and adding an additional error term, we obtain a sound verification method applicable to general Lipschitz continuous systems. The expected value bound computation for neural network functions relies on interval arithmetic and abstract interpretation, and since it is of independent interest, we discuss it in detail in Section 5. We are not aware of any existing methods that tackle this problem.

Extension to general stochastic certificates. We conclude this survey with a discussion of possible extensions of the learner-verifier framework in Section 6 and of related work in Section 7.

2 Preliminaries

We consider discrete-time stochastic dynamical systems defined via

$$\begin{aligned} \textbf{x}_{t+1} = f(\textbf{x}_t, \textbf{u}_t, \omega _t),\,\textbf{x}_0\in \mathcal {X}_0. \end{aligned}$$

The function \(f:\mathcal {X}\times \mathcal {U}\times \mathcal {N}\rightarrow \mathcal {X}\) is the dynamics function of the system and \(t\in \mathbb {N}_0\) is the time index. We use \(\mathcal {X}\subseteq \mathbb {R}^m\) to denote the system state space, \(\mathcal {U}\subseteq \mathbb {R}^n\) the control action space and \(\mathcal {N}\subseteq \mathbb {R}^p\) the stochastic disturbance space. For each \(t\in \mathbb {N}_0\), \(\textbf{x}_t\in \mathcal {X}\) is the state of the system, \(\textbf{u}_t\in \mathcal {U}\) is the action and \(\omega _t\in \mathcal {N}\) is the stochastic disturbance vector at time t. The set \(\mathcal {X}_0\subseteq \mathcal {X}\) is the set of initial states. In each time step, \(\textbf{u}_t\) is chosen according to a control policy \(\pi :\mathcal {X}\rightarrow \mathcal {U}\), i.e. \(\textbf{u}_t = \pi (\textbf{x}_t)\), and \(\omega _t\) is sampled according to some specified probability distribution d over \(\mathbb {R}^p\). The dynamics function f, control policy \(\pi \) and probability distribution d together define a stochastic feedback loop system.

A trajectory of the system is a sequence \((\textbf{x}_t,\textbf{u}_t,\omega _t)_{t\in \mathbb {N}_0}\) such that, for each \(t\in \mathbb {N}_0\), we have \(\textbf{u}_t=\pi (\textbf{x}_t)\), \(\omega _t\in \textsf{support}(d)\) and \(\textbf{x}_{t+1}=f(\textbf{x}_t,\textbf{u}_t,\omega _t)\). For each initial state \(\textbf{x}_0\in \mathcal {X}\), the system induces a Markov process. This gives rise to the probability space over the set of all trajectories of the system that start in \(\textbf{x}_0\) [51]. We denote the probability measure and the expectation in this probability space by \(\mathbb {P}_{\textbf{x}_0}\) and \(\mathbb {E}_{\textbf{x}_0}\), respectively.
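To make the setup concrete, the following Python sketch simulates trajectories of a toy stochastic feedback loop system. The dynamics f, the policy \(\pi \) and the disturbance distribution d are illustrative placeholders of our own, not taken from [44, 68]:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):
    """Hypothetical dynamics: a damped 2-D linear system with additive noise."""
    A = np.array([[1.0, 0.1], [0.0, 0.9]])
    B = np.array([0.0, 0.1])
    return A @ x + B * u + w

def pi(x):
    """Hypothetical linear feedback policy u = pi(x)."""
    return -np.array([1.0, 1.5]) @ x

def sample_disturbance():
    """d is a product of independent univariate distributions (two uniforms)."""
    return rng.uniform(-0.01, 0.01, size=2)

def trajectory(x0, horizon=100):
    """Roll out (x_t) with u_t = pi(x_t) and x_{t+1} = f(x_t, u_t, w_t)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(horizon):
        x = xs[-1]
        xs.append(f(x, pi(x), sample_disturbance()))
    return np.array(xs)

print(trajectory([1.0, 0.0])[-1])  # state after 100 steps, close to the origin
```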

Assumptions. We assume that \(\mathcal {X}\subseteq \mathbb {R}^m\), \(\mathcal {X}_0\subseteq \mathbb {R}^m\), \(\mathcal {U}\subseteq \mathbb {R}^n\) and \(\mathcal {N}\subseteq \mathbb {R}^p\) are all Borel-measurable. This is necessary for the probability space of the set of all system trajectories starting in some initial state to be mathematically well-defined. We also assume that \(\mathcal {X}\subseteq \mathbb {R}^m\) is compact (i.e. closed and bounded) and that the dynamics function f is Lipschitz continuous, which are common assumptions in control theory. Finally, we assume that the probability distribution d is a product of independent univariate probability distributions, which is necessary for efficient sampling and expected value computation.

2.1 Brief Overview of Martingale Theory

In this subsection, we provide a brief overview of definitions and results from martingale theory that lie at the core of formal reasoning about supermartingale certificate functions. We assume that the reader is familiar with the mathematical definitions of probability space, measurability and random variables, see [66] for the necessary background. The results in this subsection will help in building an intuition on supermartingale certificate functions, but omitting them would not prevent the reader from following the rest of this paper.

Probability space. A probability space is a triple \((\varOmega ,\mathcal {F},\mathbb {P})\) where \(\varOmega \) is a sample space, \(\mathcal {F}\) is a sigma-algebra over \(\varOmega \) and \(\mathbb {P}\) is a probability measure, which is required to satisfy the Kolmogorov axioms [66]. A random variable is a function \(X:\varOmega \rightarrow \mathbb {R}\) that is \(\mathcal {F}\)-measurable. We use \(\mathbb {E}[X]\) to denote the expected value of X. A (discrete-time) stochastic process is a sequence \((X_i)_{i=0}^{\infty }\) of random variables in \((\varOmega ,\mathcal {F},\mathbb {P})\).

Conditional expectation. Let X be a random variable in a probability space \((\varOmega ,\mathcal {F},\mathbb {P})\). Given a sub-\(\sigma \)-algebra \(\mathcal {F}'\subseteq \mathcal {F}\), a conditional expectation of X given \(\mathcal {F}'\) is an \(\mathcal {F}'\)-measurable random variable Y such that, for each \(A\in \mathcal {F}'\), we have

$$\begin{aligned} \mathbb {E}[X\cdot \mathbb {I}(A)]=\mathbb {E}[Y\cdot \mathbb {I}(A)]. \end{aligned}$$

Here, \(\mathbb {I}(A):\varOmega \rightarrow \{0,1\}\) is an indicator function of A defined via \(\mathbb {I}(A)(\omega )=1\) if \(\omega \in A\), and \(\mathbb {I}(A)(\omega )=0\) if \(\omega \not \in A\). Intuitively, conditional expectation of X given \(\mathcal {F}'\) is an \(\mathcal {F}'\)-measurable random variable that behaves like X whenever its expected value is taken over an event in \(\mathcal {F}'\). Conditional expectation of a random variable X given \(\mathcal {F}'\) is guaranteed to exist if X is real-valued and nonnegative [66]. Moreover, for any two conditional expectations Y and \(Y'\) of X given \(\mathcal {F}'\), we have that \(\mathbb {P}[Y= Y']=1\). Therefore, the conditional expectation is almost-surely unique and we may pick one such random variable as a canonical conditional expectation and denote it by \(\mathbb {E}[X\mid \mathcal {F}']\).

Supermartingales. Let \((\varOmega ,\mathcal {F},\mathbb {P})\) be a probability space and \(\mathcal {F}_0\subseteq \mathcal {F}_1\subseteq \dots \subseteq \mathcal {F}\) be an increasing sequence of sub-\(\sigma \)-algebras of \(\mathcal {F}\) with respect to inclusion. A nonnegative supermartingale with respect to \((\mathcal {F}_i)_{i=0}^\infty \) is a stochastic process \((X_i)_{i=0}^{\infty }\) such that each \(X_i\) is \(\mathcal {F}_i\)-measurable, and \(X_i(\omega )\ge 0\) and \(\mathbb {E}[X_{i+1}\mid \mathcal {F}_i](\omega ) \le X_i(\omega )\) hold for each \(\omega \in \varOmega \) and \(i\ge 0\). Intuitively, the second condition says that the expected value of \(X_{i+1}\), given the history up to time i, cannot exceed the value of \(X_i\). This condition is formalized by using conditional expectation.

The following two results will be key technical ingredients in our design of supermartingale certificate functions. The first theorem shows that nonnegative supermartingales have nice convergence properties and converge almost-surely to some finite value. The second theorem bounds the probability that the value of the supermartingale ever exceeds some threshold, and it will allow us to bound from above the probability of occurrence of some bad event.

Theorem 1

(Supermartingale convergence theorem [66]). Let \((X_i)_{i=0}^{\infty }\) be a nonnegative supermartingale with respect to \((\mathcal {F}_i)_{i=0}^\infty \). Then, there exists a random variable \(X_{\infty }\) in \((\varOmega ,\mathcal {F},\mathbb {P})\) to which the supermartingale converges with probability 1, i.e. \(\mathbb {P}[\lim _{i\rightarrow \infty }X_i=X_{\infty }]=1\).

Theorem 2

([41]). Let \((X_i)_{i=0}^{\infty }\) be a nonnegative supermartingale with respect to \((\mathcal {F}_i)_{i=0}^\infty \). Then, for every real \(\lambda >0\), we have \(\mathbb {P}[ \sup _{i\ge 0}X_i \ge \lambda ] \le \mathbb {E}[X_0] / \lambda \).
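As a quick numerical illustration of both theorems (our example, not part of the surveyed material), consider the process \(X_{i+1} = X_i \cdot U_i\) with \(U_i\) uniform on [0, 2]. Then \(\mathbb {E}[X_{i+1}\mid \mathcal {F}_i] = X_i\), so \((X_i)_{i=0}^\infty \) is a nonnegative martingale and in particular a nonnegative supermartingale. Since \(\mathbb {E}[\log U_i] = \log 2 - 1 < 0\), the paths converge to 0 almost surely, consistent with Theorem 1, while Theorem 2 bounds the probability that a path ever reaches twice its initial value:

```python
import numpy as np

rng = np.random.default_rng(1)
x0, lam, runs, steps = 1.0, 2.0, 20_000, 200

# X_{i+1} = X_i * U_i with U_i ~ Uniform[0, 2]: E[X_{i+1} | F_i] = X_i.
factors = rng.uniform(0.0, 2.0, size=(runs, steps))
paths = x0 * np.cumprod(factors, axis=1)

# Theorem 1: E[log U_i] < 0, so almost every path converges to 0.
print("fraction of paths near 0 at the end:", np.mean(paths[:, -1] < 1e-6))

# Theorem 2: P[sup_i X_i >= lam] <= E[X_0] / lam = 0.5 here.
sup = np.maximum(x0, paths.max(axis=1))
print("empirical P[sup X_i >= 2]:", np.mean(sup >= lam))
print("Theorem 2 bound:", x0 / lam)
```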

2.2 Problem Statement

We now formally define the properties and control tasks that we focus on in this work. In what follows, let \(\mathcal {X}_t,\mathcal {X}_u\subseteq \mathcal {X}\) be disjoint Borel-measurable sets and \(p\in [0,1]\) be a lower bound on the probability with which the system under the learned controller needs to satisfy the property:

  • Reachability. Let \(\text {Reach}(\mathcal {X}_t) = \{ (\textbf{x}_t,\textbf{u}_t,\omega _t)_{t\in \mathbb {N}_0} \mid \exists t\in \mathbb {N}_0.\, \textbf{x}_t\in \mathcal {X}_t\}\) be the set of all trajectories that reach the target set \(\mathcal {X}_t\). The goal is to learn a control policy under which the system reaches \(\mathcal {X}_t\) with probability at least p, i.e. \(\mathbb {P}_{\textbf{x}_0}[ \text {Reach}(\mathcal {X}_t)] \ge p\) holds for every initial state \(\textbf{x}_0\in \mathcal {X}_0\).

  • Safety (or avoidance). Let \(\text {Safe}(\mathcal {X}_u) = \{ (\textbf{x}_t,\textbf{u}_t,\omega _t)_{t\in \mathbb {N}_0} \mid \forall t\in \mathbb {N}_0.\, \textbf{x}_{t}\not \in \mathcal {X}_u\}\) be the set of all trajectories that never visit the unsafe set \(\mathcal {X}_u\). The goal is to learn a control policy under which the system stays away from \(\mathcal {X}_u\) with probability at least p, i.e. \(\mathbb {P}_{\textbf{x}_0}[ \text {Safe}(\mathcal {X}_u)] \ge p\) holds for every initial state \(\textbf{x}_0\in \mathcal {X}_0\).

  • Reach-avoidance. Let \(\text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u) = \{ (\textbf{x}_t,\textbf{u}_t,\omega _t)_{t\in \mathbb {N}_0} \mid \exists t\in \mathbb {N}_0.\, \textbf{x}_t\in \mathcal {X}_t\wedge (\forall t'\le t.\, \textbf{x}_{t'}\not \in \mathcal {X}_u)\}\) be the set of all trajectories that reach \(\mathcal {X}_t\) without reaching \(\mathcal {X}_u\). The goal is to learn a control policy under which the system reaches \(\mathcal {X}_t\) while staying away from \(\mathcal {X}_u\) with probability at least p, i.e. \(\mathbb {P}_{\textbf{x}_0}[ \text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u)] \ge p\) holds for every initial state \(\textbf{x}_0\in \mathcal {X}_0\).
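Before turning to certificates, note that these probabilities can always be estimated, though not certified, by straightforward simulation. The sketch below estimates \(\mathbb {P}_{\textbf{x}_0}[\text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u)]\) by Monte Carlo over a truncated horizon, reusing the hypothetical toy system from Section 2; the target and unsafe sets are likewise illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, u, w):  # hypothetical dynamics, as in the earlier simulation sketch
    A = np.array([[1.0, 0.1], [0.0, 0.9]])
    return A @ x + np.array([0.0, 0.1]) * u + w

def pi(x):       # hypothetical linear feedback policy
    return -np.array([1.0, 1.5]) @ x

def in_target(x):   # X_t: a small ball around the origin
    return np.linalg.norm(x) <= 0.2

def in_unsafe(x):   # X_u: first coordinate too large in absolute value
    return abs(x[0]) >= 3.0

def reach_avoid(x0, horizon=200):
    """One rollout: True iff X_t is reached before X_u. Truncating at a finite
    horizon under-approximates the infinite-horizon probability."""
    x = np.asarray(x0, dtype=float)
    for _ in range(horizon):
        if in_unsafe(x):
            return False
        if in_target(x):
            return True
        x = f(x, pi(x), rng.uniform(-0.01, 0.01, size=2))
    return False

est = np.mean([reach_avoid([1.0, 0.0]) for _ in range(10_000)])
print(f"Monte Carlo estimate of the reach-avoid probability: {est:.3f}")
```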

3 Supermartingale Certificate Functions

We now overview three classes of supermartingale certificate functions that formally prove reachability, safety and reach-avoidance properties. Supermartingale certificate functions do not refer to a single class of certificate functions. Rather, we use this term to refer to all certificate functions that exhibit a supermartingale-like behavior and can formally verify properties in stochastic dynamical systems. In what follows, we assume that the control policy \(\pi \) is fixed. In the following section, we will then present a learner-verifier framework for jointly learning a control policy and a supermartingale certificate function.

RSMs for probability 1 reachability. We start with ranking supermartingales (RSMs), which can prove probability 1 reachability of some target set \(\mathcal {X}_t\). Intuitively, an RSM is a continuous function that maps system states to nonnegative real values and is required to strictly decrease in expectation by some \(\epsilon >0\) in every time step until the target \(\mathcal {X}_t\) is reached. Due to the strict expected decrease as well as the Supermartingale Convergence Theorem (Theorem 1), one can show that the existence of an RSM guarantees that the system under policy \(\pi \) reaches \(\mathcal {X}_t\) with probability 1. RSMs can be viewed as a stochastic extension of Lyapunov functions. Note that RSMs can only be used to prove probability 1 reachability, but cannot be used to reason about probabilistic reachability. RSMs were originally used for proving almost-sure termination in probabilistic programs [15] and were used to certify probability 1 reachability in stochastic dynamical systems in [44].

Definition 1

(Ranking supermartingales [44]). Let \(\mathcal {X}_t\subseteq \mathcal {X}\) be a target set. A continuous function \(V:\mathcal {X}\rightarrow \mathbb {R}\) is a ranking supermartingale (RSM) with respect to \(\mathcal {X}_t\) if it satisfies:

  1. Nonnegativity condition. \(V(\textbf{x}) \ge 0\) for each \(\textbf{x}\in \mathcal {X}\).

  2. Expected Decrease condition. There exists \(\epsilon >0\) such that, for each \(\textbf{x}\in \mathcal {X}\backslash \mathcal {X}_t\), we have \(V(\textbf{x}) \ge \mathbb {E}_{\omega \sim d}[V(f(\textbf{x},\pi (\textbf{x}),\omega ))] + \epsilon \).

Theorem 3

([44]). Suppose that there exists an RSM with respect to \(\mathcal {X}_t\). Then, for every \(\textbf{x}_0\in \mathcal {X}_0\), we have \(\mathbb {P}_{\textbf{x}_0}[ \text {Reach}(\mathcal {X}_t)] = 1\).
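As a concrete example (ours, not from [44]), consider the 1-D system \(x_{t+1} = 0.5 x_t + \omega _t\) with \(\omega _t\) uniform on \([-0.1, 0.1]\), state space \(\mathcal {X}=[-10,10]\) and target set \(\mathcal {X}_t=[-1,1]\). The function \(V(x)=|x|\) is an RSM: for \(|x|>1\) we have \(\mathbb {E}[V(0.5x+\omega )] \le 0.5|x| + \mathbb {E}[|\omega |] = 0.5|x| + 0.05 \le V(x) - 0.45\), so the Expected Decrease condition holds, e.g., with \(\epsilon =0.4\). The sketch below checks this empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.4
V = np.abs  # candidate RSM V(x) = |x| for the toy 1-D system

for x in np.linspace(1.01, 10.0, 20):    # states in X \ X_t (x > 1 by symmetry)
    w = rng.uniform(-0.1, 0.1, size=100_000)
    exp_next = np.mean(V(0.5 * x + w))   # Monte Carlo estimate of E[V(f(x, w))]
    assert exp_next <= V(x) - eps, (x, exp_next)

print("Expected Decrease condition holds empirically with eps =", eps)
```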

SBFs for probabilistic safety. On the other hand, stochastic barrier functions (SBFs) can prove probabilistic safety. Given an unsafe set \(\mathcal {X}_u\) and probability \(p \in [0,1)\), an SBF is also a continuous function mapping system states to nonnegative real values, which must not increase in expectation at any time step. However, unlike for RSMs, the expected decrease need not be strict and there is no target set. In addition, its initial value must be at most 1, whereas its value over the unsafe set must be at least \(1/(1-p)\). Thus, for the system under policy \(\pi \) to violate the safety constraint, the value of the SBF needs to increase from at most 1 to at least \(1/(1-p)\) even though it does not increase in expectation. The probability of this event can be bounded from above and shown to be at most \(1-p\) by using Theorem 2. We highlight the assumption that \(p<1\), which is necessary for the Safety condition to be well-defined. As the name suggests, SBFs are a stochastic extension of barrier functions.

Definition 2

(Stochastic barrier functions [50]). Let \(\mathcal {X}_u\subseteq \mathcal {X}\) be an unsafe set and \(p\in [0,1)\). A continuous function \(V:\mathcal {X}\rightarrow \mathbb {R}\) is a stochastic barrier function (SBF) with respect to \(\mathcal {X}_u\) and p if it satisfies:

  1. Nonnegativity condition. \(V(\textbf{x}) \ge 0\) for each \(\textbf{x}\in \mathcal {X}\).

  2. Initial condition. \(V(\textbf{x}) \le 1\) for each \(\textbf{x}\in \mathcal {X}_0\).

  3. Safety condition. \(V(\textbf{x}) \ge \frac{1}{1-p}\) for each \(\textbf{x}\in \mathcal {X}_u\).

  4. Expected Decrease condition. For each \(\textbf{x}\in \mathcal {X}\), if \(V(\textbf{x}) \le \frac{1}{1-p}\) then \(V(\textbf{x}) \ge \mathbb {E}_{\omega \sim d}[V(f(\textbf{x},\pi (\textbf{x}),\omega ))]\).

Theorem 4

([50]). Suppose that there exists an SBF with respect to \(\mathcal {X}_u\) and p. Then, for every \(\textbf{x}_0\in \mathcal {X}_0\), we have \(\mathbb {P}_{\textbf{x}_0}[ \text {Safe}(\mathcal {X}_u)] \ge p\).

RASMs for probabilistic reach-avoidance. Finally, reach-avoid supermartingales (RASMs) unify and extend RSMs and SBFs in the sense that they allow simultaneous reasoning about reachability and safety, thus proving a conjunction of these properties, i.e. reach-avoid properties. Let \(\mathcal {X}_t\) and \(\mathcal {X}_u\) be disjoint target and unsafe sets and let \(p\in [0,1)\). Similarly to SBFs, an RASM is a continuous nonnegative function which is required to be initially at most 1 but needs to attain a value of at least \(1/(1-p)\) for the unsafe region to be reached. On the other hand, similarly to RSMs, it is required to strictly decrease in expectation by \(\epsilon >0\) at every time step until either the target set \(\mathcal {X}_t\) or a state in which the value is at least \(1/(1-p)\) is reached. Thus, RASMs can be viewed as a stochastic extension of both Lyapunov functions and barrier functions, combining the strict decrease of Lyapunov functions with the level-set reasoning of barrier functions.

Definition 3

(Reach-avoid supermartingales [68]). Let \(\mathcal {X}_t\subseteq \mathcal {X}\) and \(\mathcal {X}_u\subseteq \mathcal {X}\) be a target set and an unsafe set, respectively, and let \(p\in [0,1]\) be a probability threshold. Suppose that either \(p<1\) or that \(p=1\) and \(\mathcal {X}_u=\emptyset \). A continuous function \(V:\mathcal {X}\rightarrow \mathbb {R}\) is a reach-avoid supermartingale (RASM) with respect to \(\mathcal {X}_t\), \(\mathcal {X}_u\) and p if it satisfies:

  1. Nonnegativity condition. \(V(\textbf{x}) \ge 0\) for each \(\textbf{x}\in \mathcal {X}\).

  2. Initial condition. \(V(\textbf{x}) \le 1\) for each \(\textbf{x}\in \mathcal {X}_0\).

  3. Safety condition. \(V(\textbf{x}) \ge \frac{1}{1-p}\) for each \(\textbf{x}\in \mathcal {X}_u\).

  4. Expected Decrease condition. There exists \(\epsilon >0\) such that, for each \(\textbf{x}\in \mathcal {X}\backslash \mathcal {X}_t\) at which \(V(\textbf{x}) \le \frac{1}{1-p}\), we have \(V(\textbf{x}) \ge \mathbb {E}_{\omega \sim d}[V(f(\textbf{x},\pi (\textbf{x}),\omega ))] + \epsilon \).

Theorem 5

([68]). Suppose that there exists an RASM with respect to \(\mathcal {X}_t\), \(\mathcal {X}_u\) and p. Then, for every \(\textbf{x}_0\in \mathcal {X}_0\), we have \(\mathbb {P}_{\textbf{x}_0}[ \text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u)] \ge p\).
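Before presenting the learner-verifier algorithm, the following sketch shows how the four conditions of Definition 3 translate into executable checks over a finite sample of states. This is only a necessary sanity check, not a proof: finitely many samples miss the states in between, and Monte Carlo estimation of the expectation is unsound, which is precisely what the verifier of Section 4 remedies. All argument names are our own:

```python
import numpy as np

def check_rasm_on_samples(V, f, pi, sample_w, xs, in_init, in_target, in_unsafe,
                          p, eps, n_mc=10_000):
    """Sampling-based sanity check of Definition 3. `xs` is a finite sample of
    the state space X, the `in_*` predicates test membership in X_0, X_t, X_u,
    and `sample_w()` draws one disturbance from d."""
    thresh = 1.0 / (1.0 - p)
    for x in xs:
        if V(x) < 0:                                  # 1. Nonnegativity
            return False
        if in_init(x) and V(x) > 1.0:                 # 2. Initial
            return False
        if in_unsafe(x) and V(x) < thresh:            # 3. Safety
            return False
        if not in_target(x) and V(x) <= thresh:       # 4. Expected Decrease
            exp_next = np.mean([V(f(x, pi(x), sample_w())) for _ in range(n_mc)])
            if exp_next > V(x) - eps:
                return False
    return True
```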

Note that RASMs indeed unify and generalize the definitions of RSMs and SBFs. First, by setting \(\mathcal {X}_u=\emptyset \) and \(p=1\) (so \(1/(1-p)=\infty \)), RASMs reduce to RSMs, as the Initial condition can then be enforced without loss of generality by rescaling. Second, by setting \(\mathcal {X}_t=\emptyset \), RASMs reduce to SBFs. In this case, the Expected Decrease condition is strengthened, as it requires strict decrease by \(\epsilon >0\). However, the proof of Theorem 5 which we outline below also implies Theorem 4, and \(\epsilon >0\) is only necessary for reasoning about the reachability of \(\mathcal {X}_t\).

We also note that RASMs strictly extend the applicability of RSMs, since RASMs can be used to prove reachability with any lower bound \(p\in [0,1]\) on probability and not only probability 1 reachability. Indeed, if we set \(\mathcal {X}_u=\emptyset \) and \(p\in [0,1]\), then in order to prove reachability of \(\mathcal {X}_t\) with probability at least p, the RASM is required to strictly decrease in expectation by \(\epsilon >0\) until either \(\mathcal {X}_t\) is reached or the RASM value exceeds \(1/(1-p)\) (with \(1/(1-p)=\infty \) if \(p=1\)).

In the rest of this section, we outline the proof of Theorem 5 that was presented in [68]. This proof also implies Theorem 3 and Theorem 4. We do this to highlight the connection of RSMs, SBFs and RASMs to the mathematical notion of supermartingale processes, and to illustrate the tools from martingale theory that are used in proving soundness of supermartingale certificate functions, as we envision that they may be useful in designing supermartingale certificate functions for more general classes of properties.

Proof

(proof sketch of Theorem 5). Here we outline the main ideas behind the proof, and for the full proof we refer the reader to [68]. Let \(\textbf{x}_0\in \mathcal {X}_0\). We need to show that \(\mathbb {P}_{\textbf{x}_0}[ \text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u)] \ge p\). To do this, we consider the probability space \((\varOmega _{\textbf{x}_0},\mathcal {F}_{\textbf{x}_0},\mathbb {P}_{\textbf{x}_0})\) of trajectories that start in \(\textbf{x}_0\) and for each time step \(t\in \mathbb {N}_0\) define a random variable in this probability space via

$$\begin{aligned} X_t(\rho ) = {\left\{ \begin{array}{ll} V(\textbf{x}_t), &{}\text {if } \textbf{x}_i\not \in \mathcal {X}_t\text { and } V(\textbf{x}_i)< \frac{1}{1-p} \text { for each } 0\le i \le t\\ 0, &{}\text {if } \textbf{x}_i\in \mathcal {X}_t\text { for some }0\le i\le t, V(\textbf{x}_j) < \frac{1}{1-p} \text { for each }0\le j\le i\\ \frac{1}{1-p}, &{}\text {otherwise} \end{array}\right. } \end{aligned}$$

for each trajectory \(\rho =(\textbf{x}_t,\textbf{u}_t,\omega _t)_{t\in \mathbb {N}_0}\in \varOmega _{\textbf{x}_0}\). Hence, \((X_t)_{t=0}^\infty \) defines a stochastic process whose value at each time step equals the value of V at the current system state, unless either the target set \(\mathcal {X}_t\) has been reached, after which all future values of the process are set to 0, or a state in which V exceeds \(1/(1-p)\) has been reached, after which all future values of the process are set to \(1/(1-p)\). It can be shown that \((X_t)_{t=0}^\infty \) is a nonnegative supermartingale in \((\varOmega _{\textbf{x}_0},\mathcal {F}_{\textbf{x}_0},\mathbb {P}_{\textbf{x}_0})\). This claim can be proved by using the Nonnegativity and the Expected Decrease conditions of RASMs. Here we do not yet need that the expected decrease is strict, i.e. \(\epsilon \ge 0\) in the Expected Decrease condition of RASMs is sufficient.

Since \((X_t)_{t=0}^\infty \) is a nonnegative supermartingale, substituting \(\lambda =1/(1-p)\) into the inequality in Theorem 2 shows that

$$\begin{aligned} \mathbb {P}_{\textbf{x}_0}\Big [ \sup _{i\ge 0}X_i \ge \frac{1}{1-p} \Big ] \le (1-p)\cdot \mathbb {E}_{\textbf{x}_0}[X_0] \le 1-p. \end{aligned}$$

The second inequality follows since \(X_0(\rho ) = V(\textbf{x}_0)\le 1\) for every \(\rho \in \varOmega _{\textbf{x}_0}\) by the Initial condition of RASMs. Hence, by the Safety condition of RASMs it follows that the system under policy \(\pi \) reaches the unsafe set \(\mathcal {X}_u\) with probability at most \(1-p\). Note that here we can already conclude the claim of Theorem 4.

Finally, as \((X_t)_{t=0}^\infty \) is a nonnegative supermartingale, by Theorem 1 its value converges with probability 1. One can then prove that this value has to be either 0 or \(\ge 1/(1-p)\) by using the fact that the expected decrease in the Expected Decrease condition of RASMs is strict. But we showed above that a state in which V is \(\ge 1/(1-p)\) is reached with probability at most \(1-p\). Hence, the probability that the system under policy \(\pi \) reaches the target set \(\mathcal {X}_t\) without reaching the unsafe set \(\mathcal {X}_u\) is at least p, i.e. \(\mathbb {P}_{\textbf{x}_0}[ \text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u) ] \ge p\).    \(\square \)

4 Learner-Verifier Framework for Stochastic Systems

We now present the learner-verifier framework of [44, 68] for the learning-based control and verification of learned controllers in stochastic dynamical systems. We focus on the probabilistic reach-avoid problem, assume that we are given a target set \(\mathcal {X}_t\), unsafe set \(\mathcal {X}_u\) and a probability parameter \(p\in [0,1]\), and learn a control policy \(\pi \) and an RASM which certifies that \(\mathbb {P}_{\textbf{x}_0}[\text {ReachAvoid}(\mathcal {X}_t,\mathcal {X}_u)] \ge p\) for all \(\textbf{x}_0\in \mathcal {X}_0\). The algorithm for learning RSMs and SBFs can be obtained analogously, since we showed that RASMs unify and generalize RSMs and SBFs.

The algorithm behind the learner-verifier framework consists of two modules – the learner, which learns a neural network control policy \(\pi _\theta \) and a neural network supermartingale certificate function \(V_\nu \), and the verifier, which then formally verifies the learned candidate function. If the verification step fails, the verifier produces counterexamples that are passed back to the learner to fine-tune its loss function. Here, \(\theta \) and \(\nu \) are vectors of neural network parameters. The loop is repeated until either a certificate function is successfully verified, or some specified timeout is reached. By incorporating feedback from the verifier, the learner is able to tune the policy and the certificate function towards ensuring that the resulting policy meets the desired reach-avoid specification.

Applications. As outlined above, the learner-verifier framework can be used for learning-based control with formal guarantees that a property of interest is satisfied by jointly learning a control policy and a supermartingale certificate function for the property. On the other hand, it can also be used to formally verify a previously learned control policy by fixing policy parameters and only learning a supermartingale certificate function. Finally, if one uses a different method to learn a policy that turns out to violate the desired property, one can use the learner-verifier framework to fine-tune an unsafe policy towards repairing it and obtaining a safe policy for which a supermartingale certificate function certifies that the property of interest is satisfied.

4.1 Algorithm Initialization

As mentioned in Section 1, the key challenge for the verifier is to check the Expected Decrease condition of supermartingale certificates. Our algorithm solves this challenge by discretizing the state space and verifying a slightly stricter condition at the discretization vertices, which we show to imply the Expected Decrease condition over the whole region required by Definition 3. On the other hand, learning two neural networks in parallel while simultaneously optimizing several objectives can be unstable due to inherent dependencies between the two networks. Thus, proper initialization of the networks is important. We allow all neural network architectures so long as all activation functions are continuous. Furthermore, we apply the softplus activation function to the output neuron of \(V_\nu \), in order to ensure that the value of \(V_\nu \) is always nonnegative.

Discretization. A discretization \(\tilde{\mathcal {X}}\) of \(\mathcal {X}\) with mesh \(\tau >0\) is a set of states such that, for every \(\textbf{x}\in \mathcal {X}\), there exists a state \(\tilde{\textbf{x}}\in \tilde{\mathcal {X}}\) such that \(||\textbf{x}-\tilde{\textbf{x}}||_1<\tau \). The algorithm takes the mesh \(\tau \) as a parameter and computes a finite discretization \(\tilde{\mathcal {X}}\) with mesh \(\tau \) by simply taking a hyper-rectangular grid with a sufficiently small cell size. Since \(\mathcal {X}\) is compact, this yields a finite discretization.
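A minimal sketch of such a discretization for a hyper-rectangular state space \(\mathcal {X}=[lo,hi]^m\) (an illustrative assumption): a grid with cell side s places every point within \(L^1\) distance \(m\cdot s/2\) of its nearest vertex, so any \(s < 2\tau /m\) yields mesh \(\tau \):

```python
import itertools
import numpy as np

def discretize(lo, hi, m, tau):
    """Hyper-rectangular grid over [lo, hi]^m whose vertices form a
    discretization with mesh tau in the L1 norm: with cell side s, every
    point of the box lies within L1 distance m*s/2 < tau of some vertex."""
    s = (2.0 * tau / m) * 0.999          # cell side, strictly below 2*tau/m
    ticks = np.arange(lo, hi + s, s)     # grid coordinates along one axis
    return np.array(list(itertools.product(ticks, repeat=m)))

grid = discretize(-1.0, 1.0, m=2, tau=0.1)
print(len(grid), "discretization vertices")
```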

Network initialization. The policy network \(\pi _\theta \) is initialized by running proximal policy optimization (PPO) [54] on the Markov decision process (MDP) defined by the stochastic dynamical system with the reward function \(r_t = \mathbb {I}[\mathcal {X}_t](\textbf{x}_t) - \mathbb {I}[\mathcal {X}_u](\textbf{x}_t)\).

The discretization \(\tilde{\mathcal {X}}\) is used to define three sets of states which are then used by the learner to initialize the certificate network \(V_\nu \) and to which counterexamples computed by the verifier will be added later. In particular, the algorithm initializes \(C_{\text {init}}= \tilde{\mathcal {X}}\cap \mathcal {X}_0\), \(C_{\text {unsafe}}=\tilde{\mathcal {X}}\cap \mathcal {X}_u\) and \(C_{\text {decrease}}=\tilde{\mathcal {X}}\cap (\mathcal {X}\backslash \mathcal {X}_t)\).

4.2 The Learner module

The Learner updates the parameters \(\theta \) of the policy and \(\nu \) of the neural network certificate function candidate \(V_\nu \), with the objective that the candidate satisfy the supermartingale certificate conditions. The parameter updates happen incrementally via gradient descent of the form \(\theta \leftarrow \theta - \alpha \frac{\partial \mathcal {L}(\theta , \nu )}{\partial \theta }\) and \(\nu \leftarrow \nu - \alpha \frac{\partial \mathcal {L}(\theta , \nu )}{\partial \nu }\), where \(\alpha >0\) is the learning rate and \(\mathcal {L}\) is a loss function that corresponds to a differentiable optimization objective of the supermartingale certificate conditions. Ideally, the global minimum of \(\mathcal {L}\) should correspond to a policy \(\pi \) and a neural network \(V_\nu \) that fulfill all certificate conditions. In practice, however, due to the non-convexity of the network \(V_\nu \), gradient descent is not guaranteed to converge to the global minimum. As a result, the learner is not monotone, i.e. a new iteration does not guarantee improvement over the previous iteration. The training process usually applies a fixed number of gradient descent iterations or, alternatively, continues until a certain threshold on the loss value is achieved.

Loss functions. The particular type of loss function \(\mathcal {L}\) depends on the type of supermartingale certificate function that should be learned by the network, but is of the general form

$$\begin{aligned} \mathcal {L}(\theta , \nu ) = \mathcal {L}_{\text {Certificate}}(\theta ,\nu ) + \lambda \cdot \big (\mathcal {L}_{\text {Lipschitz}}(\theta ) + \mathcal {L}_{\text {Lipschitz}}(\nu )\big ), \end{aligned}$$
(1)

where \(\mathcal {L}_{\text {Certificate}}\) is the specification-specific loss. The auxiliary loss terms \(\mathcal {L}_{\text {Lipschitz}}\) regularize the training to obtain networks \(\pi _\theta \) and \(V_\nu \) with a low upper bound on their Lipschitz constants. The purpose of this regularization is that networks with a low Lipschitz upper bound are easier for the verifier module to check, i.e. they admit a coarser discretization grid. The value of \(\lambda >0\) controls the strength of the applied regularization. The regularization loss is based on the upper bound derived in [57] and is defined as

$$\begin{aligned} \mathcal {L}_{\text {Lipschitz}}(\theta ) = \max \Big \{L_{V_{\theta }} - \frac{\delta }{\tau \cdot (L_f \cdot (L_\pi + 1) + 1)}, 0 \Big \}. \end{aligned}$$
(2)

In the case of a reach-avoid specification, the RASM certificate loss is

$$\begin{aligned} \mathcal {L}_{\text {Certificate}}(\theta ,\nu ) = \mathcal {L}_{\text {Expected}}(\theta ,\nu ) + \mathcal {L}_{\text {Unsafe}}(\nu ) + \mathcal {L}_{\text {Init}}(\nu ), \end{aligned}$$
(3)

with

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{\text {Expected}}(\theta ,\nu ) = \frac{1}{|C_{\text {decrease}}|} \cdot \sum _{\textbf{x}\in C_{\text {decrease}}} \max \Big \{ \frac{1}{N}\sum _{i=1}^{N} V_{\nu }\big (f(\textbf{x},\pi _\theta (\textbf{x}),\omega _i)\big ) - V_{\nu }(\textbf{x}) + \tau \cdot K,\, 0\Big \} \\&\mathcal {L}_{\text {Init}}(\nu ) = \max _{\textbf{x} \in C_{\text {init}}} \{V_\nu (\textbf{x})-1, 0 \} \\&\mathcal {L}_{\text {Unsafe}}(\nu ) = \max _{\textbf{x} \in C_{\text {unsafe}}}\Big \{\frac{1}{1-p}-V_\nu (\textbf{x}),0 \Big \}, \end{aligned} \end{aligned}$$

where \(\omega _1,\dots ,\omega _N \sim d\) are i.i.d. disturbance samples used to estimate the expected value and \(K = L_V \cdot (L_f \cdot (L_\pi + 1) + 1)\) is the constant defined in Section 4.3.

The sets \(C_{\text {decrease}}\), \(C_{\text {init}}\) and \(C_{\text {unsafe}}\) are the training sets for the Expected Decrease, Initial and Safety RASM conditions, respectively. Each of the three sets is initialized with a coarse discretization of the state space in order to guide training toward a correct RASM already in the first loop iteration. In subsequent calls to the learner, these sets are extended with counterexamples computed by the verifier. In [68] it was shown that, if \(V_\nu \) is an RASM and satisfies all conditions checked by the verifier below, then \(\mathcal {L}_{\text {Certificate}}(\theta ,\nu )\rightarrow 0\) as the number of samples N used to estimate expected values in \(\mathcal {L}_{\text {Expected}}(\theta ,\nu )\) increases.
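A minimal PyTorch rendering of the certificate loss (3) under several assumptions of ours: `V` and `policy` are `torch.nn.Module`s with `V` returning shape (batch, 1), `dynamics` is a batched differentiable model of f, and `sample_w(n)` returns an (n, noise_dim) tensor of i.i.d. draws from d; this is a sketch, not the implementation of [68]:

```python
import torch

def rasm_loss(V, policy, dynamics, sample_w, C_decrease, C_init, C_unsafe,
              p, tau, K, N=16):
    """Differentiable RASM certificate loss following eq. (3). The C_* sets
    are (batch, state_dim) tensors of training states."""
    # Expected Decrease: E[V(f(x, pi(x), w))] - V(x) + tau*K <= 0 on C_decrease,
    # with the expectation estimated from N disturbance samples.
    x = C_decrease
    w = sample_w(N)                                       # (N, noise_dim)
    nxt = torch.stack([dynamics(x, policy(x), wi.expand(len(x), -1))
                       for wi in w])                      # (N, batch, state_dim)
    exp_next = V(nxt.flatten(0, 1)).view(N, len(x)).mean(0)
    l_expected = torch.relu(exp_next - V(x).squeeze(-1) + tau * K).mean()

    l_init = torch.relu(V(C_init) - 1.0).max()                  # V <= 1 on X_0
    l_unsafe = torch.relu(1.0 / (1.0 - p) - V(C_unsafe)).max()  # V >= 1/(1-p) on X_u
    return l_expected + l_init + l_unsafe
```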

4.3 The Verifier module

Verification task. The verifier now formally checks whether the learned RASM candidate \(V_{\nu }\) satisfies the four RASM defining conditions in Definition 3. Since we applied the softplus activation function to the output neuron of \(V_\nu \), we know that the Nonnegativity condition is satisfied by default. Thus, the verifier only needs to check the Initial, Safety and Expected Decrease conditions in Definition 3.

Expected Decrease condition. To check the Expected Decrease condition, we utilize the fact that the dynamics function f is Lipschitz continuous and that the state space \(\mathcal {X}\) is compact to show that it suffices to check a slightly stricter condition at the discretization points. Let \(L_f\) be a Lipschitz constant of f. Since \(\pi _\theta \) and \(V_\nu \) are continuous functions defined over the compact domain \(\mathcal {X}\), we know that they are also Lipschitz continuous. Let \(L_\pi \) and \(L_V\) be their Lipschitz constants. We assume that \(L_f\) is provided to the algorithm, and use the method of [57] for computing neural network Lipschitz constants to compute \(L_\pi \) and \(L_V\).

To verify the Expected Decrease condition, the verifier collects a subset \(\tilde{\mathcal {X}}_e \subseteq \tilde{\mathcal {X}}\) of all discretization vertices whose adjacent grid cells contain a non-target state and over which \(V_{\nu }\) attains a value that is smaller than \(\frac{1}{1-p}\). To compute this set, the algorithm first collects all grid cells that intersect \(\mathcal {X}\backslash \mathcal {X}_t\). For each collected cell, it then uses interval arithmetic abstract interpretation (IA-AI) [24, 30] to propagate interval bounds across neural network layers towards bounding from below the minimal value that \(V_{\nu }\) attains over the cell. Finally, it adds to \(\tilde{\mathcal {X}}_e\) vertices of those cells at which the computed lower bound is less than \(1/(1-p)\).

Finally, the verifier checks if the following condition is satisfied at each \(\tilde{\textbf{x}}\in \tilde{\mathcal {X}_e}\)

$$\begin{aligned} \mathbb {E}_{\omega \sim d}\Big [ V_{\nu } \Big ( f(\tilde{\textbf{x}}, \pi _{\theta }(\tilde{\textbf{x}}), \omega ) \Big ) \Big ] < V_{\nu }(\tilde{\textbf{x}}) - \tau \cdot K, \end{aligned}$$
(4)

where \(K=L_V \cdot (L_f \cdot (L_\pi + 1) + 1)\). Note that this condition is a strengthened version of the Expected Decrease condition, where instead of strict decrease by an arbitrary \(\epsilon >0\) we require strict decrease by at least \(\tau \cdot K\), which depends on the discretization mesh \(\tau \) and the Lipschitz constants of f, \(\pi _\theta \) and \(V_\nu \). To compute \(\mathbb {E}_{\omega \sim d}[ V_{\nu } ( f(\tilde{\textbf{x}}, \pi _{\theta }(\tilde{\textbf{x}}), \omega ) ) ]\) in eq. (4), we cannot simply evaluate the expected value at state \(\tilde{\textbf{x}}\) by substituting \(\tilde{\textbf{x}}\) into some expression, as we do not know a closed-form expression for the expected value of a neural network function. Instead, the algorithm uses the method of [44] to compute upper and lower bounds on the expected value of a neural network function, which we describe in Section 5. The upper bound is then plugged into eq. (4).
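The resulting check at a single discretization vertex can be sketched as follows; `expected_value_upper_bound` is a hypothetical stand-in for the bound of Section 5:

```python
def strengthening_constant(L_V, L_f, L_pi):
    """K = L_V * (L_f * (L_pi + 1) + 1), the constant used in eq. (4)."""
    return L_V * (L_f * (L_pi + 1) + 1)

def decrease_holds_at_vertex(V, x_tilde, tau, K, expected_value_upper_bound):
    """Soundly check eq. (4) at one vertex: since the helper is assumed to
    upper-bound E_{w~d}[V(f(x, pi(x), w))], a True answer implies that the
    true expectation also satisfies the strict decrease by tau * K."""
    return expected_value_upper_bound(x_tilde) < V(x_tilde) - tau * K
```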

If no violations of eq. (4) are found, the verifier concludes that the Expected Decrease condition is satisfied. Otherwise, for any counterexample \(\tilde{\textbf{x}}\) to eq. (4), the algorithm checks if \(\tilde{\textbf{x}}\in \mathcal {X}\backslash \mathcal {X}_t\) and \(V_{\nu }(\tilde{\textbf{x}})<1/(1-p)\), and if so adds it to the counterexample set \(C_{\text {decrease}}\).

Initial and safety conditions. The Initial and Safety conditions are checked using IA-AI. To check the Initial condition, the verifier collects the set \(\text {Cells}_{\mathcal {X}_0}\) of all grid cells that intersect the initial set \(\mathcal {X}_0\), and for each cell in \(\text {Cells}_{\mathcal {X}_0}\) checks if

$$\begin{aligned} \sup _{\textbf{x}\,\in \,\text {cell}}V_{\nu }(\textbf{x}) > 1. \end{aligned}$$
(5)

The supremum is bounded from above via IA-AI by propagating interval bounds across neural network layers. If no violations are found, the verifier concludes that \(V_\nu \) satisfies the Initial condition. Otherwise, vertices of any grid cells which are counterexamples to eq. (5) and which are contained in \(\mathcal {X}_0\) are added to \(C_{\text {init}}\). Analogously, to check the Safety condition, the verifier collects the set \(\text {Cells}_{\mathcal {X}_u}\) of all grid cells that intersect the unsafe set \(\mathcal {X}_u\), and for each cell checks if

$$\begin{aligned} \inf _{\textbf{x}\,\in \,\text {cell}}V_{\nu }(\textbf{x}) < \frac{1}{1-p}. \end{aligned}$$
(6)

If no violations are found, the verifier concludes that \(V_\nu \) satisfies the Safety condition. Otherwise, vertices of any grid cells which are counterexamples to eq. (6) and which are contained in \(\mathcal {X}_u\) are added to \(C_{\text {unsafe}}\).
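All three checks ultimately rest on soundly bounding a neural network's output over a box. The following is a minimal sketch of interval bound propagation for a fully connected network with ReLU hidden layers, in the spirit of IA-AI [24, 30] (a simplified illustration, not the exact procedure of the surveyed works; the softplus output activation of \(V_\nu \) is monotone, so applying it to the returned bounds preserves soundness):

```python
import numpy as np

def interval_bounds(weights, biases, lo, hi):
    """Propagate the box [lo, hi] through an MLP with ReLU hidden layers and
    return sound elementwise bounds on the output over the box. For an affine
    layer Wx + b, splitting W = W_pos + W_neg gives the interval
    lower = W_pos @ lo + W_neg @ hi + b, upper = W_pos @ hi + W_neg @ lo + b."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
        lo, hi = (W_pos @ lo + W_neg @ hi + b,
                  W_pos @ hi + W_neg @ lo + b)
        if i < len(weights) - 1:              # ReLU (monotone) on hidden layers
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# eq. (5): report a violation of the Initial condition if hi > 1 on the cell;
# eq. (6): report a violation of the Safety condition if lo < 1/(1-p).
```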

Algorithm output and correctness. If all three checks are successful and no counterexample is found, the algorithm concludes that \(\pi _\theta \) guarantees reach-avoidance with probability at least p and outputs the policy \(\pi _\theta \). Otherwise, it proceeds to the next learner-verifier iteration, where the computed counterexamples are added to the sets \(C_{\text {init}}\), \(C_{\text {unsafe}}\) and \(C_{\text {decrease}}\) to be used by the learner. The following theorem establishes correctness of the verifier module; its proof can be found in [68].

Theorem 6

([68]). Suppose that the verifier verifies that the certificate \(V_{\nu }\) satisfies eq. (4) for each \(\tilde{\textbf{x}}\in \tilde{\mathcal {X}_e}\), eq. (5) for each \(\text {cell}\in \text {Cells}_{\mathcal {X}_0}\) and eq. (6) for each \(\text {cell}\in \text {Cells}_{\mathcal {X}_u}\). Then the function \(V_{\nu }\) is an RASM for the system with respect to \(\mathcal {X}_t\), \(\mathcal {X}_u\) and p.

Optimizations. The verification task can be made more efficient by a discretization refinement procedure. In particular, the verifier may start with a coarse grid and decompose grid cells on demand into a finer discretization whenever the check of some RASM condition fails. This procedure can be applied recursively to further refine cells of the decomposed grid that cannot be verified. If the recursion encounters a grid cell that violates eq. (4) even for \(\tau =0\), the refinement procedure terminates unsuccessfully and reports the cell's center point as a counterexample to the RASM condition. This optimization with a maximum recursion depth of 1 was applied in [68].

5 Bounding Expected Values of Neural Networks

We now present the method for computing upper and lower bounds on the expected value of a neural network function over a given probability distribution. We are not aware of any existing methods for solving this problem, so we believe that this result is of independent interest.

To define the setting of the problem at hand, let \(\textbf{x}\in \mathcal {X}\subseteq \mathbb {R}^m\) be a system state and suppose that we want to compute upper and lower bounds on the expected value \(\mathbb {E}_{\omega \sim d}[ V ( f(\textbf{x}, \pi (\textbf{x}), \omega ) )]\). Here d is the probability distribution over the stochastic disturbance space \(\mathcal {N}\subseteq \mathbb {R}^p\) from which the stochastic disturbance is sampled independently at each time step. As noted in Section 2, we assume that d is a product of independent univariate probability distributions. Alternatively, the method is also applicable if the support of d is bounded.

The method first partitions the stochastic disturbance space \(\mathcal {N}\subseteq \mathbb {R}^p\) into finitely many cells \(\text {cell}(\mathcal {N}) = \{\mathcal {N}_1,\dots ,\mathcal {N}_{k}\}\). Let \(\textrm{maxvol}=\max _{\mathcal {N}_i\in \text {cell}(\mathcal {N})}\textsf{vol}(\mathcal {N}_i)\) and \(\textrm{minvol}=\min _{\mathcal {N}_i\in \text {cell}(\mathcal {N})}\textsf{vol}(\mathcal {N}_i)\) denote the maximal and the minimal volume of any cell in the partition with respect to the Lebesgue measure over \(\mathbb {R}^p\), respectively. Also, for each \(\omega \in \mathcal {N}\), let \(F(\omega ) = V( f(\textbf{x}, \pi (\textbf{x}), \omega ))\). The upper and the lower bounds on the expected value are computed as follows:

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\omega \sim d}\Big [ V \Big ( f(\textbf{x}, \pi (\textbf{x}), \omega ) \Big ) \Big ] \le \sum _{\mathcal {N}_i\in \text {cell}(\mathcal {N})} \textrm{maxvol} \cdot \sup _{\omega \in \mathcal {N}_i} F(\omega ), \\&\mathbb {E}_{\omega \sim d}\Big [ V \Big ( f(\textbf{x}, \pi (\textbf{x}), \omega ) \Big ) \Big ] \ge \sum _{\mathcal {N}_i\in \text {cell}(\mathcal {N})} \textrm{minvol} \cdot \inf _{\omega \in \mathcal {N}_i} F(\omega ). \end{aligned} \end{aligned}$$

Each supremum (resp. infimum) in the sum is then bounded from above (resp. from below) via interval arithmetic abstract interpretation by using the method of [30].

If the support of d is bounded, then no further adjustments are needed. However, if the support of d is unbounded, \(\textrm{maxvol}\) and \(\textrm{minvol}\) may not be finite. In this case, since we assume that d is a product of univariate distributions, the method first applies the probability integral transform [48] to each univariate probability distribution in d in order to reduce the problem to the case of a probability distribution of bounded support.
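A sketch of the bound computation, assuming for simplicity that d is uniform over \([0,1]^p\) — the case to which the probability integral transform reduces the general problem — so that all cells of a uniform grid partition have equal volume, which then also equals their probability mass. The helper `F_bounds_over_cell` is a hypothetical stand-in that soundly bounds F over a cell, e.g. by composing interval propagation through f, \(\pi \) and V:

```python
import numpy as np

def expected_value_bounds(F_bounds_over_cell, k, p_dim):
    """Bounds on E_{w ~ Uniform[0,1]^p_dim}[F(w)] via a uniform partition of
    the disturbance space into k**p_dim equal-volume cells.
    `F_bounds_over_cell(lo, hi)` must return sound (inf, sup) bounds of F
    over the cell [lo, hi]."""
    vol = (1.0 / k) ** p_dim               # volume = probability of each cell
    edges = np.linspace(0.0, 1.0, k + 1)
    lower = upper = 0.0
    for idx in np.ndindex(*([k] * p_dim)):
        cell_lo = np.array([edges[i] for i in idx])
        cell_hi = np.array([edges[i + 1] for i in idx])
        inf_F, sup_F = F_bounds_over_cell(cell_lo, cell_hi)
        lower += vol * inf_F
        upper += vol * sup_F
    return lower, upper
```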

6 Discussion on Extension to General Certificates

The focus of this survey has primarily been on three concrete classes of supermartingale certificate functions in stochastic systems, namely RSMs, SBFs and RASMs, and on the learner-verifier framework for their computation. For each class of supermartingale certificate functions, the learner module encodes the defining conditions of the certificate as a differentiable loss function whose minimization leads to a candidate certificate function. The verifier module then formally checks whether the defining conditions of the certificate function are satisfied. These checks are performed by discretizing the state space and using interval arithmetic abstract interpretation together with the previously discussed method for computing bounds on expected values of neural network functions.

It should be noted that the design of both the learner and the verifier modules was not specifically tailored to any of the three certificate functions. Rather, both the learner and the verifier follow very general design principles that we envision are applicable to more general classes of certificate functions. In particular, we hypothesize that as long as the state space of the system is compact and a certificate function can be defined in terms of

  • exact and expected value evaluations of Lipschitz continuous functions, and

  • inequalities between such evaluations imposed over state space regions,

then the learner-verifier framework in Section 4 may present a promising approach to learning and verifying the certificate function. In particular, the learner-verifier framework presents a natural candidate for automating the computation of any supermartingale certificate function that may be designed for other properties in the future. Furthermore, while RSMs, SBFs and RASMs exhibit a supermartingale-like behavior which is fundamental for their soundness, the learner-verifier framework does not rely on this behavior. Hence, we envision that the learner-verifier framework could also be used to compute other classes of stochastic certificate functions.

Even more generally, note that all certificate functions that we have considered so far are of the type \(\mathcal {X}\rightarrow \mathbb {R}\). One could also consider extensions of the learner-verifier framework to learning certificate functions of different datatypes. For instance, the work [43] uses a learner-verifier framework to learn an inductive transition invariant of type \(\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\) that certifies safety in deterministic systems. On the other hand, lexicographic ranking supermartingales are a multidimensional generalization of RSMs of type \(\mathcal {X}\rightarrow \mathbb {R}^k\) that provide a more efficient and compositional approach to proving probability 1 termination in probabilistic programs [5, 22]. Studying possible extensions of the learner-verifier framework for stochastic systems to learn certificate functions with different domains and codomains is a very interesting direction for future work.

7 Related Work

Existing learning-based methods for learning and verification of certificate functions in deterministic and stochastic systems have been discussed in Section 1. In this section, we overview some other existing methods for verification and control of stochastic dynamical systems, as well as some other uses of martingale theory in stochastic system verification.

Abstraction-based methods. Another class of approaches to stochastic dynamical system control with formal safety guarantees are abstraction-based methods [14, 25, 42, 56, 60, 63]. These methods consider finite-time horizon systems and approximate them via a finite-state Markov decision process (MDP). The control problem is then solved for the obtained MDP, and the computed policy is used to derive a policy for the original stochastic dynamical system. The key difference in applicability between abstraction-based methods and our framework is that abstraction-based methods consider finite-time horizon systems, whereas we consider infinite-time horizon systems.

Safe control via shielding. Shielding is an RL framework that ensures safety in the context of avoidance of unsafe regions by computing two control policies – the main policy that optimizes the expected reward, and the backup policy that the system falls back to whenever the safety constraint may be violated [7, 29, 36].

Constrained MDPs. A standard approach to safe RL is to solve constrained MDPs (CMDPs) [8, 28] which impose hard constraints on expected cost for one or more auxiliary cost functions. Several efficient RL algorithms for solving CMDPs have been proposed [4, 59], however their constraints are only satisfied in expectation, hence constraint satisfaction is not formally guaranteed.

RL reward specification and neurosymbolic methods. There are several works on solving model-free RL tasks under logic specifications. In particular, several works propose methods for designing reward functions that encode temporal logic specifications [6, 12, 13, 31, 32, 34, 39, 40, 45]. Formal methods have also been used for extraction of interpretable policies [35, 61, 62] and safe RL [10, 11, 67].

Deterministic systems with stochastic controllers. Another way to give rise to a stochastic dynamical system is to consider a dynamical system with deterministic dynamics function and use a stochastic controller, which helps in quantifying uncertainty in the controller’s prediction. Formal verification of deterministic dynamical systems with Bayesian neural network controllers has been considered in [43]. In particular, this work also uses a learner-verifier method to learn an inductive invariant for the deterministic system which formally proves safety.

Supermartingales for probabilistic program analysis. Supermartingales have also been used for the analysis of probabilistic programs (PPs). In particular, RSMs were originally introduced in the setting of PPs to prove almost-sure termination [15] and have since been extensively used, see e.g. [5, 19, 20, 22, 47]. The work [1] proposed a learner-verifier method to learn RSMs in PPs. Supermartingales were also used for safety [21, 23, 64], cost [65] and recurrence and persistence [16] analysis in PPs.

8 Conclusion

This paper presents a framework for learning-based control with formal reachability, safety and reach-avoidance guarantees in stochastic dynamical systems. We present a learner-verifier framework in which a neural network control policy is learned together with a neural network certificate function that formally proves that the property of interest holds with at least some desired probability \(p\in [0,1]\). For certification, we use supermartingale certificate functions. The learner module encodes the defining certificate function conditions into a differentiable loss function which is then minimized to learn a candidate certificate function. The verifier then formally verifies the candidate by using interval arithmetic abstract interpretation and a novel method for computing bounds on expected values of neural networks.

The learner-verifier framework presented in this work opens several interesting directions for future work. The first is the design of supermartingale certificates for more general properties of stochastic systems, and the use of our learner-verifier framework for their computation. The second is to study and understand the general class of certificate functions in stochastic systems that the learner-verifier framework can compute, possibly going beyond supermartingale certificate functions. Finally, on the practical side, an avenue for future work is to explore methods for reducing the computational cost of the framework, as well as extensions that can handle more complex and higher dimensional systems.