figure a

1 Introduction

Probabilistic programs are programs whose execution is affected by random variables [17, 19, 23, 29, 36]. Randomness in programs may emerge from numerous sources, such as uncertain external inputs, hardware random number generators, or the (probabilistic) abstraction of pseudo-random generators, and is intrinsic in quantum programs [34]. Notable exemplars are randomised algorithms, cryptographic protocols, simulations of stochastic processes, and Bayesian inference [7, 33]. Verification questions for probabilistic programs require reasoning about the probabilistic nature of their executions in order to appropriately characterise properties of interest. For instance, consider the following question, corresponding to the program in Fig. 1: will an ambitious marble collector eventually gather any arbitrarily large amounts of red and blue marbles? Intuitively, the question has an affirmative answer regardless of the initially established target amounts, since there is always a chance of collecting a marble of either color. Notice that, if the probabilistic choice is replaced with non-determinism, as often happens in software verification, an adversary may exclusively draw one color of marble and make the program run forever. The question that matches the original intuition is whether the expected number of steps to termination is finite; this is the positive almost-sure termination (PAST) question [8, 10, 13, 19, 27].

Fig. 1.
figure 1

The ambitious marble collector (the variables red and blue are initialised non-deterministically).

Probabilistic termination analysis is typically mechanised through the automated synthesis of ranking supermartingales (RSMs), which are functions of the program variables whose value (i) decreases in expectation by a discrete amount across every loop iteration and (ii) is always bounded from below; an RSM formally witnesses that a program is PAST [10, 13]. Early techniques for discovering RSMs reduced the synthesis problem from the source code of the program into constraint solving [10]. These methods have lent themselves to various generalisations, including polynomial programs, programs with non-determinism, lexicographic and modular termination arguments, and persistence properties [2, 14,15,16, 20, 25]. Recently, for special classes of probabilistic programs or term rewriting systems, novel automated proof techniques that leverage computer algebra systems and satisfiability modulo theories (SMT) have been introduced [5, 6, 38, 39, 41]. All the above methods are sound and, under specific assumptions, complete; they represent the state of the art for the class of programs they have been designed for. However, their assumptions are often too restrictive for the analysis of many simple programs. In particular, to the best of our knowledge, none can identify an RSM for the program in Fig. 1. For this simple program, it is easy to argue that the expected output of the neural network depicted in Fig. 2 decreases after every iteration of the loop and that it is always non-negative (see Ex. 1). As such, this neural network is an appropriate RSM for the program.

Fig. 2.
figure 2

A neural ranking supermartingale for the program in Fig. 1.

We present a novel method for discovering RSMs using machine learning together with SMT solving. We introduce the neural ranking supermartingale (NRSM) model, which lets a neural network mimic a supermartingale over sampled execution traces from a program. We train an NRSM using standard optimisation algorithms over a loss function that makes the neural network decrease—in average—across sampled iterations. We phrase the certification problem into that of computing a counterexample for the NRSM. To do so, we encode the neural network together with the expected value of the program variables; then, we use an SMT solver for verifying that the expected output of the network decreases along every execution. If the solver falsifies the NRSM, then it provides a counterexample that we use to guide a resampling of the execution traces; with this new data we retrain the neural network and repeat verification in a counterexample-guided inductive synthesis (CEGIS) fashion, until the SMT solver determines that no counterexample exists [4, 44]. In the latter case, the solver has certified the generated NRSM; our method thus produces a sound PAST proof or runs indefinitely. Our procedure does not return for programs that are not PAST and may, in general, not return for some PAST instances. However, we experimentally demonstrate that, in practice, our method succeeds over a broad range of PAST benchmarks within a few CEGIS iterations. Previously, machine learning has been applied to the termination analysis of deterministic programs and to the stability analysis of dynamical systems [1, 12, 21, 24, 28, 30,31,32, 42, 43, 45]; our method is the first machine learning approach for probabilistic termination analysis.

Our approach builds upon two key observations. First, the average of expressions along execution traces statistically approximates their true expected value. Thanks to this, we obtain a machine learning model for guessing RSM candidates that only requires execution traces and is thus agnostic to the source code. Second, solving the problem of checking an RSM is simpler than solving the entire termination analysis problem. Reasoning about source code is entirely delegated to the checking phase which, as such, supports programs that are out of reach to the available probabilistic termination analysers.

We experimentally demonstrate that our method is effective over many programs with linear and polynomial expressions, with both discrete and continuous distributions. This includes joint distributions, state-dependent distributions, distributions whose parameters are in turn random (hierarchical models), and distributions with undefined moments (e.g., the Cauchy distribution). We compare our method with a tool based on Farkas’ lemma and with the tools Amber and Absynth [2, 39, 41]; whilst our software prototype is slower than these alternatives, it covers the widest range of benchmark single-loop programs.

Summarising, our contribution is fivefold. First, we present the first machine learning method for the termination analysis of probabilistic programs. Second, we introduce a loss function for training neural networks to behave as ranking supermartingales over execution traces. Third, we show an approach to verify the validity of ranking supermartingales using SMT solving, which applies to a wide variety of single-loop probabilistic programs. Fourth, we experimentally demonstrate over multiple baselines and newly-defined benchmarks the practical efficacy of our method. Fifth, we built a software prototype for evaluating our method.

Fig. 3.
figure 3

Syntax of loop-free probabilistic programs.

2 Termination Analysis of Probabilistic Programs

We treat the termination analysis of single-loop probabilistic programs. We consider an imperative language that includes C-like arithmetic and Boolean expressions, and sequential and conditional composition of commands [13, 17, 19, 23].

Syntax. A grammar for this language is shown in Fig. 3. We analyse single-loop programs of the form

$$\begin{aligned} \begin{array}{l} \texttt {while}~G~\texttt {do}\\ \quad U\\ \texttt {od} \end{array} \end{aligned}$$

where the loop guard G is a Boolean expression and the update statement U is a command. Variables are real-valued and can be either assigned to arithmetic expressions using the usual = operator, or sampled from probability distributions using the \(\boldsymbol{\sim }\) operator. Probability distributions, which can be either discrete or continuous, take not only parameters that are constant, and thus known at compile time, but also parameters that depend on other variables, and thus determined only at run time. In other words, distributions may depend on the current state of the program, which is a random variable. Also, they may depend on other random variables; as such, distributions may be multi-variate, resulting from models with coupled and hierarchically-structured variables.

Semantics. The operational semantics of a probabilistic program induces a probability space over runs, together with a stochastic process [13]. A state of the process is an element of , that is, a valuation of the variables in the program. The space of outcomes \(\varOmega _{\mathsf {run}}\) of a program is the set of runs. A run is a possibly infinite sequence of variable valuations (taken at the beginning of every loop iteration). This comes with a \(\sigma \)-algebra \(\mathcal {F}\) of measurable subsets of \(\varOmega _{\mathsf {run}}\). Initial states are chosen non-deterministically and, thereafter, the process is purely probabilistic. Every initial state determines a unique probability measure \(\mathbb {P}^{(x_0)} :\mathcal {F} \rightarrow [0,1]\), namely a probability measure conditional on the state \(x_0\). The associated stochastic process is , where \(X^{(x_0)}_t\) is a random vector representing the state at the t-th step, initialised as \(X^{(x_0)}_0 = x_0\). Given an initial condition \(x_0\) and a solution process \(X^{(x_0)}\), the associated termination time is a random variable \(T^{(x_0)}\) denoting the length of an execution, which takes values in .

Positive Almost-Sure Termination. Runs are probabilistic and thus also the notion of termination requires a quantitative semantics. The termination question is generalised to the notions of almost-sure and positive almost-sure termination. Almost-sure termination (AST) indicates whether the joint probability of all runs that do not terminate is zero; positive almost-sure termination (PAST), which is stronger, indicates whether the expected number of steps to termination is finite. Formally, a probabilistic program terminates positively almost-surely if \(\mathbb {E}[T^{(x_0)}] < \infty \) for all . Notably, this implies that the program also terminates almost-surely, that is, \(\mathbb {P}[T^{(x_0)} < \infty ] = 1\) for all . We provide conditions ensuring that probabilistic programs are PAST and, consequently, that they are AST. Notice that the converse may not be true, that is, there exist programs that are AST but not PAST. Our method addresses the PAST question only, by building upon the theory of ranking supermartingales [10].

Ranking Supermartingales. A scalar stochastic process \(\{M_t\}\) is an RSM if, for some \(\epsilon > 0\) and lower bound ,

$$\begin{aligned} \mathbb {E}\left[ M_{t+1} \mid M_t = m_t, \dots , M_0 = m_0\right] \le m_t - \epsilon \end{aligned}$$

and \(M_t \ge K\) for all \(t \ge 0\). In other words, this a process whose values are bounded from below and whose expected value decreases by a discrete amount at each step of the program. We prove that a program is PAST by mapping \(X^{(x_0)}\) into an RSM. Our goal is finding a function such that, for every initial condition \(x_0\), it satisfies the following two properties:

  1. (i)

    \(\mathbb {E}[\eta (X^{(x_0)}_{t+1}) \mid X^{(x_0)}_t = x] \le \eta (x) - \epsilon \) for all \(x \in I\) and

  2. (ii)

    \(\eta (x) \ge K\) for all \(x \in I\),

where is some sufficiently strong loop invariant that can be the loop guard or, possibly, a stronger condition. Function \(\eta \) maps the entire stochastic process into an RSM. For this reason, we call \(\eta \) an RSM for the program.

Example 1

Consider the ambitious marble collector problem from Fig. 1. An RSM for this program is a function \(\eta \) mapping variables red and blue to . Rephrasing condition (i) over this program, \(\eta \) is required to satisfy

$$\begin{aligned} 0.01 \cdot \eta (\mathtt{red} - 1, \mathtt{blue}) + 0.99 \cdot \eta (\mathtt{red}, \mathtt{blue} - 1) \le \eta (\mathtt{red}, \mathtt{blue}) - \epsilon , \end{aligned}$$

for all that satisfy \(\mathtt{red}> 0 \vee \mathtt{blue} > 0\), that is, the loop guard. So, for example, function \(\eta (\mathtt{red},\mathtt{blue}) = \mathtt{red} + \mathtt{blue}\) satisfies this condition; however, it may take any negative value over the arguments \(\mathtt{red}\) and \(\mathtt{blue}\) such that \(\mathtt{red}> 0 \vee \mathtt{blue} > 0\), thus violating condition (ii). By contrast, the neural network in Fig. 2 succeeds at satisfying both conditions. In fact, the network realises function \(\eta (\mathtt{red},\mathtt{blue}) = \max \{\mathtt{red}, 0\} + \max \{\mathtt{blue}, 0\}\), which satisfies Eq. (2) and is bounded from below by zero.    \(\square \)

3 Training Neural Ranking Supermartingales

Our framework synthesises RSMs by learning from program execution traces. We define a loss function, that measures the number of sampled program transitions that do not satisfy the RSM conditions. Applying gradient-descent optimisation to the loss function guides the parameters to values at which the candidate’s value decreases, on average, across sampled program transitions. Since the learner does not require the underlying program (only execution traces), the learner is agnostic to the structure of program expressions, and the cost of evaluating the loss function does not scale with the size of the program.

figure b

A dataset of sampled transitions is produced using an instrumented program interpreter (Algorithm 1). At a program state p, the interpreter runs the loop body m times to sample successor states \(P^\prime \), where m is a branching factor hyper-parameter, before resuming execution from an arbitrarily chosen successor. The dataset S consists of the union of pairs \((p, P^\prime )\) generated by the interpreter.

Fig. 4.
figure 4

Neural ranking supermartingale architecture.

The loss function is used to optimise the parameters of an NRSM, whose architecture is shown in Fig. 4. This is a neural network with n inputs, one output neuron, and one hidden layer. The hidden layer has h neurons, each of which applies an activation function f to a weighted sum of its inputs. In our experiments, the activation function f is either \(f(x) = x^2\) or \(f(x) = \mathrm {ReLU}(x)\), where \(\mathrm {ReLU}(x) = \max \{x,0\}\).

Therefore, we employ either of the two following functional templates, defined over the learnable parameters \(w_{i, j}\) and \(b_i\):

  • Sum of ReLU (SOR):

    $$\begin{aligned} \eta (x_1, \ldots , x_n)&= \sum _{i = 1}^{h} \mathrm {ReLU}\left( \sum _{j = 1}^{n} w_{i, j} x_j + b_i \right) ; \end{aligned}$$
  • Sum of Squares (SOS):

    $$\begin{aligned} \eta (x_1, \ldots , x_n)&= \sum _{i = 1}^{h} \left( \sum _{j = 1}^{n} w_{i, j} x_j + b_i \right) ^2. \end{aligned}$$

These choices of activation mean that our NRSMs are restricted to non-negative outputs, and therefore satisfy condition (ii) by construction. The learner therefore needs to find parameters that satisfy condition (i), which requires \(\eta \) to decrease in expectation by at least some positive constant \(\epsilon > 0\).

The role of the loss function is to allow the learner parameters to be optimised such that the NRSM decreases, on average, across sampled transitions. That is, the loss function evaluates the number of sampled transitions for which the NRSM does not satisfy the RSM condition (i), and the lower its value, the more the neural network behaves like an RSM.

Concretely, the loss associated with a state p and its successors \(P^\prime \) is:

$$\begin{aligned} L(p, P^\prime )&= \text {softplus}\left( \mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )] - \eta \left( p\right) + \epsilon \right) , \end{aligned}$$

where \(\text {softplus}(x) = \ln (1+e^x)\), and \(\mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )]\) is the average of \(\eta \) over the sampled successor states \(p^\prime \) from \(P^\prime \).

We then train an NRSM by solving the following optimisation problem:

$$\begin{aligned} \min \frac{1}{|S|} \sum _{(p, P^\prime ) \in S} L(p, P^\prime ), \end{aligned}$$

which aims to minimise the average loss over all sampled transitions in the dataset S, over the trainable weights and biases . This objective is non-convex and non-linear, and we resort to gradient-based optimisation (see Sect. 6).

The softplus in Eq. (5) forces the parameters to satisfy condition (i) uniformly across all sampled transitions in the dataset, rather than decreasing by a large amount in expectation over some transitions at the expense of failing to decrease sufficiently quickly for others. Furthermore, for NRSMs of SOR form we replace the ReLU activation function by softplus, to help gradient descent converge faster. Softplus approximates the ReLU function, and has the same asymptotic behaviour, but results in an NRSM that is differentiable w.r.t. the network parameters at all inputs, unlike ReLU [22, p.193]. However, since softplus is a transcendental function, we revert back to using a simpler ReLU activation when verifying an SOR candidate.

Fig. 5.
figure 5

CEGIS architecture for the adversarial training of NRSM.

A CEGIS loop integrates the learner and verifier (Fig. 5). The dataset S sampled by the interpreter is used to train an NRSM candidate \(\eta \) according to Eq. (6). The verifier checks whether \(\eta \) satisfies condition (i), concluding either that the program is PAST, or producing a counterexample program state \(x_\mathsf{cex}\) for which \(\eta \) does not satisfy (i). The interpreter generates new traces, starting at \(x_\mathsf{cex}\), forcing it to explore parts of the state space over which the NRSM fails to decrease sufficiently in expectation.

4 Verifying Ranking Supermartingales by SMT Solving

Fig. 6.
figure 6

Verifier architecture.

To verify an NRSM we must check that it decreases in expectation by at least some constant (condition (i)). Condition (ii) is satisfied by construction because the network’s output is non-negative for every input, leaving only condition (i) to verify. The architecture of the verifier is depicted in Fig. 6. First, a program (GU) is translated into an equivalent logical formulation denoted by \(\bar{G}\) and \(\bar{U}\) (‘Encode’ block), which are used to construct a closed-form term \(\mathbb {E}[\bar{\eta }]\) for the NRSM’s expected value at the end of the loop body (‘Marginalise’ block). Secondly, given an NRSM \(\eta \), its parameters are rounded and encoded as a logical term \(\bar{\eta }\) (‘Round’ block). Then, the satisfiability of the following formula is decided using SMT solving:

$$\begin{aligned} \bar{G}(x_1 \ldots x_n) \wedge \mathbb {E}[\bar{\eta }](x_1 \ldots x_n) > \bar{\eta }(x_1 \ldots x_n) - \epsilon . \end{aligned}$$

This is the dual satisfiability problem for the validity problem associated with condition (i) on page 5. If Eq. (7) is unsatisfiable, then \(\bar{\eta }\) is a valid RSM and we conclude the program is PAST. Otherwise, the solver yields a counterexample state .

Fig. 7.
figure 7

Quantifier-free first-order logic formulae.

The rounding strategy (‘Round’ block) provides multiple candidates to the verifier by adding i.i.d. noise to parameters and rounding them to various precisions. Setting parameters that are numerically very small to zero is useful since learning that a parameter should be exactly zero could require an unbounded number of samples; rounding provides a pragmatic way of making this work in practice. If none of the generated candidates are valid NRSMs, all counterexamples are passed back to the interpreter which generates more transition samples for the learner (Fig. 5).

Notice that, if a program’s guard predicate is not strong enough to allow a valid RSM to be verified as such, the CEGIS loop will run indefinitely. In general, stronger supporting loop invariants may need to be provided.

4.1 From Programs to Symbolic Store Trees

We now introduce a translation from a loop-free probabilistic program to a symbolic store tree (Fig. 8), a datastructure representing the distribution over program states at the end of a loop iteration as a function of the variable valuation at its start. Marginalising out the probabilistic choices made in the loop yields the NRSM expectation \(\mathbb {E}[\bar{\eta }]\).

Fig. 8.
figure 8

Symbolic store tree.

Fig. 9.
figure 9

Translation from a loop-free command to a symbolic store tree.

This requires a form of symbolic execution. We represent program states symbolically using symbolic stores, denoted \(\varSigma \) (Fig. 8), which map program variables to probabilistic terms. A probabilistic term \(\pi \) can be either a first-order logic term (Fig. 7) representing an arithmetic expression, or a placeholder for a probability distribution whose parameters are terms (allowing them to be functions of the program state). Finally, symbolic store trees \(\sigma \) (Fig. 8) represent the set of control-flow paths through the loop body, arising from if-statements; it is a binary tree with symbolic stores at the leaves, and internal nodes labelled by logical formulae over program variables.

Figure 9 defines a translation from an initial symbolic store tree and command to a new symbolic store tree characterising the distribution over states after executing the command. At the top level, we provide the command G (the loop body) and the initial symbolic store \(\{x_1' \mapsto x_1, \ldots , x_n' \mapsto x_n\}\), where primed variables represent the variable valuation at the end of the iteration, whereas unprimed variables represent the variable valuation at the beginning of the loop.

The first four cases of Fig. 9 define the translation of arithmetic expressions (to terms) and Boolean expressions (to formulae), by replacing program syntax with the corresponding logical operators.

The next four cases define the translation of commands. skip leaves the symbolic store unchanged. For deterministic assignments, the right hand side of the assignment is translated in the current symbolic store and bound to the variable. Sequential composition involves translating the first command, and translating the second command in the resulting store tree. A conditional statement creates a new node in the symbolic store tree that selects between the two recursively-translated branches, based on the formula derived from the guard predicate. These rules assume the store tree to be a leaf-level symbolic store, because the next rule handles the case where the initial symbolic store tree is a node. Finally, if the command is a probabilistic assignment, we translate the parameters to terms, and bind the resulting probabilistic term to a freshly generated symbol. This allows variables to be overwritten by multiple probabilistic sampling operations in the body of the loop. The mapping of variables to distributions in leaf-level stores defines the probability density over particular probabilistic choices.

Example 2

Figure 10 is the store tree produced for the ambitious marble collector program (Fig. 1). Each leaf-level store in the program’s store tree corresponds to a particular control-flow path through the loop body. The interpretation of a symbolic store tree is that if we fix the outcomes of the probabilistic sampling operations performed by the loop body, then the state of the variables at the end of the iteration is determined by the predicates labelling the internal nodes.

Fig. 10.
figure 10

A store tree for the program in Fig. 1.

4.2 Marginalisation

To construct the closed-form logical term representing the NRSM’s expected value at the end of an iteration, the probabilistic choices in the symbolic store tree must be marginalised out. If the program is limited to discrete random variables with finite support, we automatically marginalise the random choices by enumeration (for both SOR- and SOS-form NRSMs), as illustrated by Ex. 3.

Example 3

The ambitious marble collector program of Fig. 1, yields the symbolic store tree of Fig. 10. Suppose we want to marginalise the NRSM:

$$\begin{aligned} \eta (\mathtt{red}, \mathtt{blue}) = \mathrm {ReLU}(w_{1,1} \cdot \mathtt{red}&+ w_{1,2} \cdot \mathtt{blue} + b_1) \nonumber \\&\;\;\; + \mathrm {ReLU}(w_{2,1} \cdot \mathtt{red} + w_{2,2} \cdot \mathtt{blue} + b_2) \end{aligned}$$

with respect to this symbolic store tree. We first apply the encoding of the NRSM to each leaf-level symbolic store of Fig. 10, and enumerate the possible choices for the probabilistic choices (which in this example is limited to \(\nu \in \{0, 1\}\)), using the bindings of \(\nu \) to distributions in leaf-level stores to compute the probability mass of each choice. After resolving the predicates for each choice of \(\nu \), this yields:

$$\begin{aligned} 0.01 \cdot \eta (\texttt {red} - 1, \texttt {blue}) + 0.99 \cdot \eta (\texttt {red}, \texttt {blue}-1). \end{aligned}$$

The term (9) is then provided as the value of the NRSM’s expectation to the verifier.    \(\square \)

If the program samples from continuous distributions, we marginalise SOS-form NRSMs (but not SOR-form NRSMs) by substituting symbolic moments for a set of supported built-in distributions, including Gaussian, MultivariateGaussian, and Exponential, though could include any distribution whose closed-form symbolic moments are available. Example 4 provides an example. This strategy is general enough to support a wide variety of programs, including those of Sect. 5. If a sampling distribution lacks symbolic moments, the cumulative distribution function can also be utilised, which is illustrated in the slicedcauchy case study (Fig. 15).

Example 4

Consider an NRSM \(\eta (\mathtt{x}) = (w \mathtt{x} + b)^2\) and a symbolic store tree \(\texttt {node}(p = 1, \sigma _1, \sigma _2)\) where \(\sigma _1 = \{x\mapsto x + v, v \mapsto \texttt {Exp}(\lambda ), p \mapsto \texttt {Bernoulli}(3/4)\}\) and \(\sigma _2 = \{ x \mapsto x- v, v \mapsto \texttt {Exp}(\lambda ), p\mapsto \texttt {Bernoulli}(3/4) \}\). \(\texttt {Exp(}\lambda \texttt {)}\) denotes the exponential distribution with parameter \(\lambda \), with pdf denoted \(p_{\texttt {Exp(}\lambda \texttt {)}}(v)\). We apply \(\eta \) to each leaf-level symbolic store, and marginalise the probabilistic choices. We marginalise p first by enumerating over its possible values, and then marginalise v. There are no dependencies between the distributions in this example, so the order in which they are marginalised does not matter.

$$\begin{aligned} \int _0^\infty \left( \frac{3}{4}\eta (x + v) + \frac{1}{4} \eta (x - v) \right) p_{\texttt {Exp(}\lambda \texttt {)}}(v)\mathrm {d}v. \end{aligned}$$

The result of marginalisation is a closed-form expression for Eq. (10). Note that since

$$\begin{aligned} \eta (x+v)&= w^2 v^2 + 2(wx+b)wv + (wx + b)^2 \end{aligned}$$

and \(\int _0^\infty v^n p_{\texttt {Exp(}\lambda \texttt {)}}(v) \mathrm {d}v = \frac{n!}{\lambda ^n}\), we use linearity of integration to perform the following simplification, by substituting expressions for the moments of v in terms of the parameter \(\lambda \):

$$\begin{aligned} \int _0^\infty \eta (x+v)p_{\texttt {Exp(}\lambda \texttt {)}}(v) \mathrm {d}v&= \frac{2w^2}{\lambda ^2} + \frac{2(wx+b)w}{\lambda } + (wx + b)^2, \end{aligned}$$

which is used to reduce Eq. (10) to a closed form. This is the method used to perform marginalisation for several case studies, including crwalk, gaussrw and expdistrw.    \(\square \)

Notably, our verifier requires the expected value of the RSM to be computed (or soundly approximated) in closed form. We automate marginalisation for discrete distributions of finite support, but require manual intervention for continuous distributions. Nevertheless, our learning component is automated in both cases. Characterising the space of programs with continuous distributions that admit fully automated verification of an RSM is an open question.

5 Case Studies

Existing tools for synthesising RSMs reduce the problem to constraint-solving [2, 10, 11, 14], which can limit the generality of the synthesis framework. For instance, methods that convert the RSM constraints into a linear program using Farkas’ lemma can only handle programs with affine arithmetic, and can only synthesise linear/affine (lexicographic) RSMs [2, 10]. A second restriction of existing approaches is that they typically require the moments of distributions to be compile-time constants. This rules out programs whose distributions are determined at runtime, such as hierarchical and state-dependent distributions. Since the loss function of Eq. (6) only requires execution traces, our learner is agnostic to the structure of program expressions, imposing minimal restrictions on the kinds of expressions that can occur, or the kinds of distributions that can be sampled from. This allows us to learn RSMs for a wider class of programs compared to existing tools, as we will illustrate in this section using a number of case studies.

5.1 Non-linear Program Expressions and NRSMs

Many simple programs do not admit linear or polynomial RSMs, such as Fig. 1. Since the program cannot be encoded as a prob-solvable loop (due to the disjunctive guard predicate which cannot be replaced by a polynomial inequality), it cannot be handled by another recent tool, Amber [39]. However, this program admits the following piecewise-linear NRSM:

$$\begin{aligned} \text {ReLU}(0 \cdot \mathtt{red} + 1 \cdot \mathtt{blue} + 11) + \text {ReLU}(1 \cdot \mathtt{red} + 0 \cdot \mathtt{blue} + 11), \end{aligned}$$

whose parameters are learnt by our method, within the first CEGIS iteration.

Fig. 11.
figure 11

Probabilistic factorial (probfact).

Similarly, we learn the piecewise-linear NRSM:

$$\begin{aligned} \text {ReLU}(-1 \cdot \mathtt{i} + 0 \cdot \mathtt{s} + 12) + \text {ReLU}(0 \cdot \mathtt{i} + 0 \cdot \mathtt{s} + 9) \end{aligned}$$

for the program in Fig. 11, which contains a bilinear assignment (cf. multiplication of s and i on line 3), so this program is not supported by [2]. The conjunction in the guard means it is not supported by Amber, either.

Fig. 12.
figure 12

Random walk with correlated variables (crwalk).

5.2 Multivariate and Hierarchical Distributions

Figure 12 is a random walk that samples from a multivariate Gaussian distribution, with zero mean, unit variances, and correlation sampled uniformly in the range \(\left[ -\frac{1}{2}, 1 \right] \). The MultivariateGaussian of line 4 is an instance of a hierarchical distribution, having parameters that are random variables. This program also contains a non-linear (polynomial) expression that updates the value of x. For crwalk we learn an SOS-form NRSM:

$$\begin{aligned} (0.1 \cdot \mathtt{x} - 47.2)^2, \end{aligned}$$

proving this program is PAST. To verify this, the NRSM expectation is computed via the symbolic moments of the multivariate Gaussian distribution, given its covariance matrix (line 3), and then marginalising w.r.t. rho (again, using the moments of the uniform distribution over \(\left[ -\frac{1}{2}, 1\right] \)). Unfortunately, it is challenging to translate many simple programs containing hierarchical distributions into ones that can be handled by existing tools. For instance, although it is possible to simulate sampling from a bivariate Gaussian of arbitrary correlation by sampling from independent standard Gaussian distributions, this would involve computing a non-polynomial function of the correlation. Similarly, for the program in Fig. 14 (further discussed below), if a variable is exponentially distributed, \(X \boldsymbol{\sim }\texttt {Exponential}(1)\), then \(\frac{X}{\lambda } \boldsymbol{\sim }\texttt {Exponential}(\lambda )\), providing a way of simulating an exponential distribution with arbitrary parameter \(\lambda \). However, this again requires a non-polynomial program expression (i.e. the reciprocal of \(\lambda \)) when \(\lambda \) is part of the program state and not a constant, and therefore out of scope for methods that restrict program expressions to being linear/polynomial.

5.3 State-Dependent Distributions and Non-Linear Expectations

Fig. 13.
figure 13

Gaussian random walk with time-varying and coupled noise (gaussrw).

Once we allow hierarchical distributions, it is natural to consider state-dependent distributions, i.e. distributions whose parameters depend on the program state rather than being sampled from other distributions. As an example, consider the program in Fig. 13 (a 2-dimensional Gaussian random walk with state-dependent moments). This is unsupported by existing tools because the mean of the Gaussian is a non-polynomial function of the program state. However, after defining the function \(\sqrt{1+\texttt {x}^2}\) by means of the following polynomial logical inequalities:

$$\begin{aligned} \texttt {mu\_x}^2&= 1 + \texttt {x}^2 \end{aligned}$$
$$\begin{aligned} \texttt {mu\_x}&\ge 1 \end{aligned}$$

(similarly for mu_y), we express the expected value of an SOS-form NRSM in terms of symbolic moments mu_x, etc. Since these moments are state-dependent, we cannot marginalise them out as in the hierarchical case. Instead we perform non-deterministic abstraction, providing inequalities \(\frac{1}{10} \le \mathtt{vx}, \mathtt{vy} \le 2\) and \(-1 \le \texttt {rho} \le 1\) as further verifier assumptions.

Fig. 14.
figure 14

State-dependent exponential random walk (expdistrw).

Even if program expressions are linear, the presence of state-dependent distributions can result in a non-linear verification problem, if the moments are themselves non-linear functions of the program variables. For instance, the program in Fig. 14 represents a 1-dimensional random walk, with steps sampled from an exponential distribution. Since the \(n^\text {th}\) moment of \(\texttt {Exponential(}\lambda \texttt {)}\) is \(\frac{n!}{\lambda ^n}\), the expectation of an SOS-form NRSM is non-polynomial but still expressible in the theory of non-linear real arithmetic (see Ex. 4). For expdistrw we learn

$$\begin{aligned} (0.1 \cdot \mathtt{x} - 3.3)^2, \end{aligned}$$

whereas for gaussrw in Fig. 13 we learn

$$\begin{aligned} (0 \cdot \mathtt{x} -1 \cdot \mathtt{y} + 11)^2 + (0 \cdot \mathtt{x} + 0 \cdot \mathtt{y} + 8)^2. \end{aligned}$$

We translate the program in Fig. 14 for Amber by replacing the update for \(\lambda \) by instead sampling it uniformly from \(\left[ 1, 10 \right] \). Amber correctly identifies the program is AST, and that \((10 - \mathtt{x})\) is a supermartingale expression (note, not an RSM), though does not report that the program is PAST (answering “maybe”).

5.4 Undefined Moments

The ability to evaluate the cumulative distribution function (CDF) of a sampled distribution could be useful in marginalisation, even if the moments of the sampled distribution are undefined or not known analytically to infinite precision. An example is Fig. 15: the program samples from the standard Cauchy distribution, for which all moments are undefined. Since the sampled value is only used to determine which branch of a conditional is taken, the RSM expectation is well defined, and can be expressed in terms of the standard Cauchy CDF. Namely, the if-branch is taken with probability \(q = 1 - \left( \frac{1}{\pi }\text {arctan}(10) + \frac{1}{2}\right) \). This equation is not expressible using polynomials; so we perform a sound approximation by introducing a new variable that is quantified over a small interval surrounding a finite precision approximation to q. This allows us to learn and verify the SOR-form NRSM:

$$\begin{aligned} \text {ReLU}(1.2 \cdot \mathtt{x}+ 9.1). \end{aligned}$$
Fig. 15.
figure 15

Sliced Cauchy distribution (slicedcauchy).

For our experimental evaluation (Sect. 6) we create a modified version of each of the six case studies described in this section, as follows:

  • program marbles3 is a generalisation of marbles to three marble types, instead of two;

  • probfact2 uses 5/8 as the Bernoulli parameter, rather than 3/4;

  • crwalk2 samples rho from a Beta(1, 3) distribution, instead of a uniform distribution over \(\left[ -\frac{1}{2}, 1\right] \);

  • expdistrw2 samples from an exponential distribution, where parameter \(\texttt {lambda}\) is replaced by \(\texttt {lambda*lambda}\);

  • gaussrw2 uses \([3 + 1/(1-\mathtt{x}), 3 + 1/(1-\mathtt{y})]^T\) for its mean vector, instead of \([\sqrt{1 + \mathtt{x}^2}, \sqrt{1 + \mathtt{y}^2}]^T\); and

  • slicedcauchy2 has a loop guard of \(\mathtt{x} < 10\), instead of \(\mathtt{x} > 0\), and swaps the two branches of the conditional.

5.5 Rare Transitions

A limitation of relying on a sampled transition dataset to learn NRSM parameters is we rely on the average \(\mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )]\) in Eq. (5) being accurate (see Sect. 3). This assumption is challenged by programs that have certain control-flow paths of very low probability, which are unlikely to be sampled by the interpreter. For example, in the context of the ambitious marble collector (Fig. 1), Fig. 16 shows that when the probability of obtaining a red marble decreases below \(2^{-7}\), our success rate drops. This is because a lower probability makes the corresponding control-flow path rarer in the dataset, to the point where the expected value of the NRSM cannot be estimated accurately.

Fig. 16.
figure 16

Success rate and execution times for the ambitious marble collector program (Fig. 1), where p is the probability of taking the if-branch. Success rate refers to the fraction of 10 executions that succeeded in finding an NRSM before a timeout of 300 s. Execution times show the median time with the error bar ranging between the minimum and maximum times of the 10 executions.

6 Experimental Results

We built a prototype implementation of our framework (in Python) and present experimental results for benchmarks adapted from previous work, as well as our own case studies (from Sect. 5). The case studies illustrate programs for which our framework synthesises an RSM, yet existing tools cannot prove to be PAST.

The learner is implemented with Jax [9]. To train NRSMs, we use AdaGrad [18] for gradient-based optimisation, with a learning rate of \(10^{-2}\). Parameters are initialised by sampling from Gaussian distributions: weight parameters are sampled from a zero-mean Gaussian, whereas the bias parameters are sampled either from a Gaussian with mean 10 (for SOR candidates) or mean 0 (for SOS candidates). We verify the NRSMs using the SMT solver Z3 [26, 40]. The outcomes are obtained on the following platform: macOS Catalina version 10.15.4, 8 GB RAM, Intel Core i5 CPU 2.4 GHz QuadCore, 64-bit.

As mentioned in Sect. 4, the verifier checks a candidate NRSM over states satisfying the loop predicate, which characterises the set of reachable states. For our experiments, we manually provide the NRSM expectation, and augment the guard predicate with additional invariants where necessary. We generate outcomes using two different rounding strategies (Sect. 4): an “aggressive” rounding strategy which generated between 80 and 120 candidates per CEGIS iteration, and a “weaker” rounding strategy producing between 15 to 25 candidates per CEGIS iteration. The outcomes in Table 1 used the aggressive rounding strategy.

Table 1. Experimental results over existing (top section) and newly added benchmarks (bottom section); (c) indicates the benchmark uses continuous distributions, (d) indicates it only uses discrete distributions. All reported times are in seconds, oot indicates time-out after 300 s, n/a indicates the tool terminated without definite answer, and—indicates the benchmark is unsupported. Our method is run 10 times with different seeds; the overall success rate is reported. Runtimes of interpretation, training, verification phases, and # of CEGIS iterations refer to the run with median total runtime.

Benchmarks from Previous Work. We run our prototype on single-loop programs from the WTC benchmark suite [3], augmented with probabilistic branching and assignments [2]. These correspond to the programs in the first section of Table 1. We perturb assignment statements by adding noise sampled from a discrete uniform distribution of support \(\{-2, 2\}\), or a continuous uniform distribution on the interval \([-2,2]\). The while loops are also made probabilistic; with probability 1/2 the loop is executed, and with the remaining probability a skip command is executed.

We compare our framework against three existing tools. The first is Amber [39]: where possible, we translate instances from the WTC suite into the language of Amber, but this is not possible for some programs where the loop predicate is a logical conjunction or disjunction of predicates (indicated by dashes in Table 1). Second, we compare against a tool for synthesising affine lexicographic RSMs (referred to as Farkas’ lemma) for affine programs (i.e. containing only linear expressions), based on reduction to linear programming via Farkas’ lemma [2]. This is applicable to probabilistic programs with nested-loops, unlike our method. However, since it is limited to affine programs and affine lexicographic RSMs, it is not able to analyse all the programs we consider (again, indicated by dashes in Table 1). The third tool is Absynth [41], for which we are able to encode all programs that were limited to discrete random variables.

The experimental results (Table 1) show that for all the WTC benchmarks our approach has a success rate of at least 8/10, and is able to synthesise an RSM within 2 iterations (for the seed that results in median total execution time). For 15 of the 18 WTC benchmarks no full CEGIS iterations are required. As expected our approach, particularly the learning component, is much slower than all three tools. However, our framework has broader applicability, as illustrated with the next set of experiments.

Newly Defined Case Studies. The examples in the second section of Table 1 (from Sect. 5) are not proven PAST by any of the three tools. Our approach is able to do so with a success rate of at least 9/10, under the “aggressive” rounding strategy. Of the new examples, marbles3 (Sect. 5) requires the longest time, since we use an NRSM with \(h=3\) ReLU nodes (see Sect. 3), and six of the nine parameters must be brought sufficiently close to zero to learn a valid RSM. For gaussrw/gaussrw2, we find it necessary to set an SMT solver time limit within the CEGIS loop (of 200 ms for gaussrw, and 5 s for gaussrw2), such that candidates taking longer than this to verify are skipped. The fact that these examples are harder to verify is unsurprising, given that they give rise to non-polynomial decision problems, containing equationally defined rational expressions. In comparing the two rounding strategies, we find that using the “aggressive” strategy tends to result in fewer CEGIS iterations, reducing the learner time, while increasing the verifier time: this is to be expected, since a larger number of candidates needs to be checked in each CEGIS iteration.

7 Conclusion

We have presented the first machine learning method for the termination analysis of probabilistic programs. We have introduced a loss function for training neural networks so that they behave as RSMs over sampled execution traces; our training phase is agnostic to the program and thus easily portable to different programming languages. Reasoning about the program code is entirely delegated to our checking phase which, by SMT solving over a symbolic encoding of program and neural network, verifies whether the neural network is a sound RSM. Upon a positive answer, we have formally certified that the program is PAST; upon a negative answer, we obtain a counterexample that we use to resample traces and repeat training in a CEGIS loop. Our procedure runs indefinitely for programs that are not PAST, as these necessarily lack a ranking supermartingale, and may run indefinitely for some PAST programs. Nevertheless, we have experimentally demonstrated over several PAST benchmarks that our method is effective in practice and covers a broad range of programs w.r.t. existing tools.

Our method naturally generalises to deeper networks, but whether these are necessary in practice remains an open question; notably, neural networks with one hidden layer were sufficient to solve our examples. We have exclusively tackled the PAST question, and techniques for almost-sure (but not necessarily PAST) termination and non-termination exist [16, 37, 39]. Our results pose the basis for future research in machine learning (and CEGIS) for the formal verification of probabilistic programs. Different verification questions will require different learning models. Our approach lends itself to extensions toward probabilistic safety, exploiting supermartingale inequalities, and towards the non-termination question, using repulsing supermartingales [16]. Adapting our method to termination analysis with infinite expected time is also a matter for future investigation [37]. Moreover, we have exclusively considered purely probabilistic single-loop programs: generalisations to programs with non-determinism, arbitrary control-flow, and concurrency are material for future work [15, 20, 35].