Abstract
We present the first machine learning approach to the termination analysis of probabilistic programs. Ranking supermartingales (RSMs) prove that probabilistic programs halt, in expectation, within a finite number of steps. While previously RSMs were directly synthesised from source code, our method learns them from sampled execution traces. We introduce the neural ranking supermartingale: we let a neural network fit an RSM over execution traces and then we verify it over the source code using satisfiability modulo theories (SMT); if the latter step produces a counterexample, we generate from it new sample traces and repeat learning in a counterexample-guided inductive synthesis loop, until the SMT solver confirms the validity of the RSM. The result is thus a sound witness of probabilistic termination. Our learning strategy is agnostic to the source code and its verification counterpart supports the widest range of probabilistic single-loop programs that any existing tool can handle to date. We demonstrate the efficacy of our method over a range of benchmarks that include linear and polynomial programs with discrete, continuous, state-dependent, multivariate, and hierarchical distributions, as well as distributions with undefined moments.
1 Introduction
Probabilistic programs are programs whose execution is affected by random variables [17, 19, 23, 29, 36]. Randomness in programs may emerge from numerous sources, such as uncertain external inputs, hardware random number generators, or the (probabilistic) abstraction of pseudorandom generators, and is intrinsic in quantum programs [34]. Notable exemplars are randomised algorithms, cryptographic protocols, simulations of stochastic processes, and Bayesian inference [7, 33]. Verification questions for probabilistic programs require reasoning about the probabilistic nature of their executions in order to appropriately characterise properties of interest. For instance, consider the following question, corresponding to the program in Fig. 1: will an ambitious marble collector eventually gather arbitrarily large amounts of red and blue marbles? Intuitively, the question has an affirmative answer regardless of the initially established target amounts, since there is always a chance of collecting a marble of either colour. Notice that, if the probabilistic choice is replaced with nondeterminism, as often happens in software verification, an adversary may exclusively draw one colour of marble and make the program run forever. The question that matches the original intuition is whether the expected number of steps to termination is finite; this is the positive almost-sure termination (PAST) question [8, 10, 13, 19, 27].
Probabilistic termination analysis is typically mechanised through the automated synthesis of ranking supermartingales (RSMs), which are functions of the program variables whose value (i) decreases in expectation by a discrete amount across every loop iteration and (ii) is always bounded from below; an RSM formally witnesses that a program is PAST [10, 13]. Early techniques for discovering RSMs reduced the synthesis problem from the source code of the program into constraint solving [10]. These methods have lent themselves to various generalisations, including polynomial programs, programs with nondeterminism, lexicographic and modular termination arguments, and persistence properties [2, 14,15,16, 20, 25]. Recently, for special classes of probabilistic programs or term rewriting systems, novel automated proof techniques that leverage computer algebra systems and satisfiability modulo theories (SMT) have been introduced [5, 6, 38, 39, 41]. All the above methods are sound and, under specific assumptions, complete; they represent the state of the art for the class of programs they have been designed for. However, their assumptions are often too restrictive for the analysis of many simple programs. In particular, to the best of our knowledge, none can identify an RSM for the program in Fig. 1. For this simple program, it is easy to argue that the expected output of the neural network depicted in Fig. 2 decreases after every iteration of the loop and that it is always nonnegative (see Ex. 1). As such, this neural network is an appropriate RSM for the program.
We present a novel method for discovering RSMs using machine learning together with SMT solving. We introduce the neural ranking supermartingale (NRSM) model, which lets a neural network mimic a supermartingale over sampled execution traces from a program. We train an NRSM using standard optimisation algorithms over a loss function that makes the neural network decrease, on average, across sampled iterations. We phrase the certification problem as that of computing a counterexample for the NRSM. To do so, we encode the neural network together with the expected value of the program variables; then, we use an SMT solver for verifying that the expected output of the network decreases along every execution. If the solver falsifies the NRSM, then it provides a counterexample that we use to guide a resampling of the execution traces; with this new data we retrain the neural network and repeat verification in a counterexample-guided inductive synthesis (CEGIS) fashion, until the SMT solver determines that no counterexample exists [4, 44]. In the latter case, the solver has certified the generated NRSM; our method thus produces a sound PAST proof or runs indefinitely. Our procedure does not terminate for programs that are not PAST and may, in general, fail to terminate for some PAST instances. However, we experimentally demonstrate that, in practice, our method succeeds over a broad range of PAST benchmarks within a few CEGIS iterations. Previously, machine learning has been applied to the termination analysis of deterministic programs and to the stability analysis of dynamical systems [1, 12, 21, 24, 28, 30,31,32, 42, 43, 45]; our method is the first machine learning approach for probabilistic termination analysis.
Our approach builds upon two key observations. First, the average of expressions along execution traces statistically approximates their true expected value. Thanks to this, we obtain a machine learning model for guessing RSM candidates that only requires execution traces and is thus agnostic to the source code. Second, checking a candidate RSM is simpler than solving the entire termination analysis problem. Reasoning about source code is entirely delegated to the checking phase which, as such, supports programs that are out of reach for the available probabilistic termination analysers.
We experimentally demonstrate that our method is effective over many programs with linear and polynomial expressions, with both discrete and continuous distributions. This includes joint distributions, statedependent distributions, distributions whose parameters are in turn random (hierarchical models), and distributions with undefined moments (e.g., the Cauchy distribution). We compare our method with a tool based on Farkas’ lemma and with the tools Amber and Absynth [2, 39, 41]; whilst our software prototype is slower than these alternatives, it covers the widest range of benchmark singleloop programs.
Summarising, our contribution is fivefold. First, we present the first machine learning method for the termination analysis of probabilistic programs. Second, we introduce a loss function for training neural networks to behave as ranking supermartingales over execution traces. Third, we show an approach to verify the validity of ranking supermartingales using SMT solving, which applies to a wide variety of single-loop probabilistic programs. Fourth, we experimentally demonstrate the practical efficacy of our method over multiple baselines and newly defined benchmarks. Fifth, we built a software prototype for evaluating our method.
2 Termination Analysis of Probabilistic Programs
We treat the termination analysis of single-loop probabilistic programs. We consider an imperative language that includes C-like arithmetic and Boolean expressions, and sequential and conditional composition of commands [13, 17, 19, 23].
Syntax. A grammar for this language is shown in Fig. 3. We analyse single-loop programs of the form \(\texttt {while}\ (G)\ \{\, U \,\}\),
where the loop guard G is a Boolean expression and the update statement U is a command. Variables are real-valued and can be either assigned to arithmetic expressions using the usual = operator, or sampled from probability distributions using the \(\boldsymbol{\sim }\) operator. Probability distributions, which can be either discrete or continuous, take not only parameters that are constant, and thus known at compile time, but also parameters that depend on other variables, and thus determined only at run time. In other words, distributions may depend on the current state of the program, which is a random variable. Also, they may depend on other random variables; as such, distributions may be multivariate, resulting from models with coupled and hierarchically structured variables.
Semantics. The operational semantics of a probabilistic program induces a probability space over runs, together with a stochastic process [13]. A state of the process is an element of \(\mathbb {R}^n\), that is, a valuation of the variables in the program. The space of outcomes \(\varOmega _{\mathsf {run}}\) of a program is the set of runs. A run is a possibly infinite sequence of variable valuations (taken at the beginning of every loop iteration). This comes with a \(\sigma \)-algebra \(\mathcal {F}\) of measurable subsets of \(\varOmega _{\mathsf {run}}\). Initial states are chosen nondeterministically and, thereafter, the process is purely probabilistic. Every initial state determines a unique probability measure \(\mathbb {P}^{(x_0)} :\mathcal {F} \rightarrow [0,1]\), namely a probability measure conditional on the state \(x_0\). The associated stochastic process is \(X^{(x_0)} = \{X^{(x_0)}_t\}_{t \in \mathbb {N}}\), where \(X^{(x_0)}_t\) is a random vector representing the state at the t-th step, initialised as \(X^{(x_0)}_0 = x_0\). Given an initial condition \(x_0\) and a solution process \(X^{(x_0)}\), the associated termination time is a random variable \(T^{(x_0)}\) denoting the length of an execution, which takes values in \(\mathbb {N} \cup \{\infty \}\).
Positive Almost-Sure Termination. Runs are probabilistic and thus also the notion of termination requires a quantitative semantics. The termination question is generalised to the notions of almost-sure and positive almost-sure termination. Almost-sure termination (AST) indicates whether the joint probability of all runs that do not terminate is zero; positive almost-sure termination (PAST), which is stronger, indicates whether the expected number of steps to termination is finite. Formally, a probabilistic program terminates positively almost-surely if \(\mathbb {E}[T^{(x_0)}] < \infty \) for all \(x_0 \in \mathbb {R}^n\). Notably, this implies that the program also terminates almost-surely, that is, \(\mathbb {P}[T^{(x_0)} < \infty ] = 1\) for all \(x_0 \in \mathbb {R}^n\). We provide conditions ensuring that probabilistic programs are PAST and, consequently, that they are AST. Notice that the converse may not be true, that is, there exist programs that are AST but not PAST. Our method addresses the PAST question only, by building upon the theory of ranking supermartingales [10].
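The PAST question can be probed empirically by simulation. The following is a minimal sketch of our own (not the paper's tooling): it estimates \(\mathbb{E}[T^{(x_0)}]\) for the marble-collector loop of Fig. 1, assuming the loop body decrements red or blue with probability 1/2 each.

```python
import random

def termination_time(red, blue, rng):
    """Simulate one run of the marble-collector loop (Fig. 1) and
    return the number of iterations until the guard fails."""
    t = 0
    while red > 0 or blue > 0:
        if rng.random() < 0.5:
            red -= 1
        else:
            blue -= 1
        t += 1
    return t

rng = random.Random(0)
runs = [termination_time(5, 5, rng) for _ in range(2000)]
# Every run needs at least red + blue = 10 steps; the sample mean is a
# statistical estimate of E[T], which PAST asserts is finite.
mean_t = sum(runs) / len(runs)
```

A finite sample mean is of course only evidence, not a proof; the RSMs below turn this intuition into a sound certificate.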
Ranking Supermartingales. A scalar stochastic process \(\{M_t\}\) is an RSM if, for some \(\epsilon > 0\) and lower bound \(K\),
$$\begin{aligned} \mathbb {E}[M_{t+1} \mid M_0, \ldots , M_t] \le M_t - \epsilon \end{aligned}$$(1)
and \(M_t \ge K\) for all \(t \ge 0\). In other words, this is a process whose values are bounded from below and whose expected value decreases by a discrete amount at each step of the program. We prove that a program is PAST by mapping \(X^{(x_0)}\) into an RSM. Our goal is finding a function \(\eta :\mathbb {R}^n \rightarrow \mathbb {R}\) such that, for every initial condition \(x_0\), it satisfies the following two properties:

(i)
\(\mathbb {E}[\eta (X^{(x_0)}_{t+1}) \mid X^{(x_0)}_t = x] \le \eta (x) - \epsilon \) for all \(x \in I\) and

(ii)
\(\eta (x) \ge K\) for all \(x \in I\),
where \(I \subseteq \mathbb {R}^n\) is some sufficiently strong loop invariant that can be the loop guard or, possibly, a stronger condition. Function \(\eta \) maps the entire stochastic process into an RSM. For this reason, we call \(\eta \) an RSM for the program.
Example 1
Consider the ambitious marble collector problem from Fig. 1. An RSM for this program is a function \(\eta \) mapping variables red and blue to \(\mathbb {R}\). Rephrasing condition (i) over this program, \(\eta \) is required to satisfy
$$\begin{aligned} \tfrac{1}{2}\, \eta (\mathtt {red} - 1, \mathtt {blue}) + \tfrac{1}{2}\, \eta (\mathtt {red}, \mathtt {blue} - 1) \le \eta (\mathtt {red}, \mathtt {blue}) - \epsilon \end{aligned}$$(2)
for all \((\mathtt {red}, \mathtt {blue})\) that satisfy \(\mathtt{red}> 0 \vee \mathtt{blue} > 0\), that is, the loop guard. So, for example, function \(\eta (\mathtt{red},\mathtt{blue}) = \mathtt{red} + \mathtt{blue}\) satisfies this condition; however, it may take any negative value over the arguments \(\mathtt{red}\) and \(\mathtt{blue}\) such that \(\mathtt{red}> 0 \vee \mathtt{blue} > 0\), thus violating condition (ii). By contrast, the neural network in Fig. 2 succeeds at satisfying both conditions. In fact, the network realises function \(\eta (\mathtt{red},\mathtt{blue}) = \max \{\mathtt{red}, 0\} + \max \{\mathtt{blue}, 0\}\), which satisfies Eq. (2) and is bounded from below by zero. \(\square \)
3 Training Neural Ranking Supermartingales
Our framework synthesises RSMs by learning from program execution traces. We define a loss function that measures the number of sampled program transitions that do not satisfy the RSM conditions. Applying gradient-descent optimisation to the loss function guides the parameters to values at which the candidate's value decreases, on average, across sampled program transitions. Since the learner does not require the underlying program (only execution traces), it is agnostic to the structure of program expressions, and the cost of evaluating the loss function does not scale with the size of the program.
A dataset of sampled transitions is produced using an instrumented program interpreter (Algorithm 1). At a program state p, the interpreter runs the loop body m times to sample successor states \(P^\prime \), where m is a branching factor hyperparameter, before resuming execution from an arbitrarily chosen successor. The dataset S consists of the union of pairs \((p, P^\prime )\) generated by the interpreter.
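The instrumented interpreter can be sketched as follows. This is our own minimal reading of Algorithm 1, not the paper's implementation; the loop body of Fig. 1 is assumed to decrement red or blue with probability 1/2 each, and all function names are ours.

```python
import random

def sample_transitions(step, guard, init, m=4, max_steps=200, seed=1):
    """Instrumented interpreter (in the spirit of Algorithm 1): at each
    visited state p, draw m successor states P' by running the loop body
    m times, record the pair (p, P'), then resume from one successor.
    `step` runs the loop body once; `guard` tests the loop condition."""
    rng = random.Random(seed)
    dataset, p = [], init
    for _ in range(max_steps):
        if not guard(p):
            break
        succs = [step(p, rng) for _ in range(m)]   # branching factor m
        dataset.append((p, succs))
        p = rng.choice(succs)   # continue from an arbitrary successor
    return dataset

# Loop body and guard of the marble program (Fig. 1), assumed fair coin.
step = lambda p, rng: (p[0] - 1, p[1]) if rng.random() < 0.5 else (p[0], p[1] - 1)
guard = lambda p: p[0] > 0 or p[1] > 0
S = sample_transitions(step, guard, init=(5, 5))
```

Each dataset entry pairs one visited state with its m sampled successors, which is exactly the shape the loss function below consumes.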
The loss function is used to optimise the parameters of an NRSM, whose architecture is shown in Fig. 4. This is a neural network with n inputs, one output neuron, and one hidden layer. The hidden layer has h neurons, each of which applies an activation function f to a weighted sum of its inputs. In our experiments, the activation function f is either \(f(x) = x^2\) or \(f(x) = \mathrm {ReLU}(x)\), where \(\mathrm {ReLU}(x) = \max \{x,0\}\).
Therefore, we employ either of the two following functional templates, defined over the learnable parameters \(w_{i, j}\) and \(b_i\):

Sum of ReLU (SOR):
$$\begin{aligned} \eta (x_1, \ldots , x_n)&= \sum _{i = 1}^{h} \mathrm {ReLU}\left( \sum _{j = 1}^{n} w_{i, j} x_j + b_i \right) ; \end{aligned}$$(3) 
Sum of Squares (SOS):
$$\begin{aligned} \eta (x_1, \ldots , x_n)&= \sum _{i = 1}^{h} \left( \sum _{j = 1}^{n} w_{i, j} x_j + b_i \right) ^2. \end{aligned}$$(4)
These choices of activation mean that our NRSMs are restricted to nonnegative outputs, and therefore satisfy condition (ii) by construction. The learner therefore needs to find parameters that satisfy condition (i), which requires \(\eta \) to decrease in expectation by at least some positive constant \(\epsilon > 0\).
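The two templates of Eqs. (3) and (4) are small enough to write out directly. The sketch below is our own plain-Python rendering of the forward passes, with weights `W` and biases `b` chosen by hand for illustration:

```python
def nrsm_sor(x, W, b):
    """Sum-of-ReLU template, Eq. (3): sum_i ReLU(sum_j w_ij x_j + b_i)."""
    return sum(max(sum(w_ij * xj for w_ij, xj in zip(wi, x)) + bi, 0.0)
               for wi, bi in zip(W, b))

def nrsm_sos(x, W, b):
    """Sum-of-squares template, Eq. (4): sum_i (sum_j w_ij x_j + b_i)^2."""
    return sum((sum(w_ij * xj for w_ij, xj in zip(wi, x)) + bi) ** 2
               for wi, bi in zip(W, b))

# With hand-picked weights the SOR template realises the RSM of Ex. 1,
# eta(red, blue) = max(red, 0) + max(blue, 0):
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
```

Both templates are nonnegative for every input and parameter choice, which is what discharges condition (ii) by construction.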
The role of the loss function is to allow the parameters to be optimised such that the NRSM decreases, on average, across sampled transitions. That is, the loss function measures the number of sampled transitions for which the NRSM does not satisfy the RSM condition (i): the lower its value, the more the neural network behaves like an RSM.
Concretely, the loss associated with a state \(p\) and its successors \(P^\prime \) is
$$\begin{aligned} L(p, P^\prime ) = \text {softplus}\left( \mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )] - \eta (p) + \epsilon \right) , \end{aligned}$$(5)
where \(\text {softplus}(x) = \ln (1+e^x)\), and \(\mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )]\) is the average of \(\eta \) over the sampled successor states \(p^\prime \) from \(P^\prime \).
We then train an NRSM by solving the following optimisation problem:
$$\begin{aligned} \min _{\theta } \ \frac{1}{|S|} \sum _{(p, P^\prime ) \in S} L(p, P^\prime ), \end{aligned}$$(6)
which aims to minimise the average loss over all sampled transitions in the dataset \(S\), over the trainable weights and biases \(\theta \). This objective is nonconvex and nonlinear, and we resort to gradient-based optimisation (see Sect. 6).
The softplus in Eq. (5) forces the parameters to satisfy condition (i) uniformly across all sampled transitions in the dataset, rather than decreasing by a large amount in expectation over some transitions at the expense of failing to decrease sufficiently quickly for others. Furthermore, for NRSMs of SOR form we replace the ReLU activation function by softplus, to help gradient descent converge faster. Softplus approximates the ReLU function, and has the same asymptotic behaviour, but results in an NRSM that is differentiable w.r.t. the network parameters at all inputs, unlike ReLU [22, p.193]. However, since softplus is a transcendental function, we revert back to using a simpler ReLU activation when verifying an SOR candidate.
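The per-transition loss of Eq. (5) can be sketched as follows; this is our own minimal reading, with the empirical mean over the sampled successors standing in for the true expectation, and \(\epsilon = 1/2\) chosen for illustration:

```python
import math

def softplus(x):
    """softplus(x) = ln(1 + e^x), a smooth surrogate for max(x, 0)."""
    return math.log(1.0 + math.exp(x))

def transition_loss(eta, p, succs, eps=0.5):
    """Loss in the spirit of Eq. (5): softplus of the amount by which
    condition (i) is violated on the sampled successors of state p."""
    exp_next = sum(eta(q) for q in succs) / len(succs)
    return softplus(exp_next - eta(p) + eps)

eta = lambda s: max(s[0], 0.0) + max(s[1], 0.0)
# For the marble program, the successors of (5, 5) are (4, 5) and (5, 4):
# the empirical expectation drops by 1, the violation is -1 + eps = -0.5,
# and the loss is softplus(-0.5), a small value.
loss = transition_loss(eta, (5.0, 5.0), [(4.0, 5.0), (5.0, 4.0)])
```

Because softplus is strictly positive, every transition contributes some gradient, which is what pushes the decrease to hold uniformly rather than only on average over the dataset.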
A CEGIS loop integrates the learner and verifier (Fig. 5). The dataset S sampled by the interpreter is used to train an NRSM candidate \(\eta \) according to Eq. (6). The verifier checks whether \(\eta \) satisfies condition (i), concluding either that the program is PAST, or producing a counterexample program state \(x_\mathsf{cex}\) for which \(\eta \) does not satisfy (i). The interpreter generates new traces, starting at \(x_\mathsf{cex}\), forcing it to explore parts of the state space over which the NRSM fails to decrease sufficiently in expectation.
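The control flow of Fig. 5 can be condensed into a short skeleton. This is a schematic of our own; the toy learner and verifier below exist only to exercise the loop and do not reflect the paper's components:

```python
def cegis(sample, train, verify, init_states, max_iters=10):
    """Schematic CEGIS loop (cf. Fig. 5): sample transitions, train a
    candidate NRSM, verify it; on failure, resample starting from the
    counterexample state returned by the verifier."""
    dataset, starts = [], list(init_states)
    for _ in range(max_iters):
        for s in starts:
            dataset += sample(s)
        candidate = train(dataset)
        ok, cex = verify(candidate)
        if ok:
            return candidate       # verified: sound PAST witness
        starts = [cex]             # guide resampling with the counterexample
    return None                    # no conclusion within the budget

# Toy instantiation, only to exercise the control flow: the "learner"
# returns the dataset size and the "verifier" accepts it once it is >= 3.
sample = lambda s: [(s, [s])]
train = lambda data: len(data)
verify = lambda c: (c >= 3, 0)
result = cegis(sample, train, verify, init_states=[0, 1])
```

Note that the dataset is cumulative: counterexample-driven samples are added to, not substituted for, the earlier traces.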
4 Verifying Ranking Supermartingales by SMT Solving
To verify an NRSM we must check that it decreases in expectation by at least some constant (condition (i)). Condition (ii) is satisfied by construction because the network's output is nonnegative for every input, leaving only condition (i) to verify. The architecture of the verifier is depicted in Fig. 6. First, a program (G, U) is translated into an equivalent logical formulation denoted by \(\bar{G}\) and \(\bar{U}\) ('Encode' block), which are used to construct a closed-form term \(\mathbb {E}[\bar{\eta }]\) for the NRSM's expected value at the end of the loop body ('Marginalise' block). Secondly, given an NRSM \(\eta \), its parameters are rounded and encoded as a logical term \(\bar{\eta }\) ('Round' block). Then, the satisfiability of the following formula is decided using SMT solving:
$$\begin{aligned} \bar{G} \wedge \mathbb {E}[\bar{\eta }] > \bar{\eta } - \epsilon . \end{aligned}$$(7)
This is the dual satisfiability problem for the validity problem associated with condition (i). If Eq. (7) is unsatisfiable, then \(\bar{\eta }\) is a valid RSM and we conclude the program is PAST. Otherwise, the solver yields a counterexample state \(x_\mathsf{cex}\).
The rounding strategy (‘Round’ block) provides multiple candidates to the verifier by adding i.i.d. noise to parameters and rounding them to various precisions. Setting parameters that are numerically very small to zero is useful since learning that a parameter should be exactly zero could require an unbounded number of samples; rounding provides a pragmatic way of making this work in practice. If none of the generated candidates are valid NRSMs, all counterexamples are passed back to the interpreter which generates more transition samples for the learner (Fig. 5).
Notice that, if a program’s guard predicate is not strong enough to allow a valid RSM to be verified as such, the CEGIS loop will run indefinitely. In general, stronger supporting loop invariants may need to be provided.
4.1 From Programs to Symbolic Store Trees
We now introduce a translation from a loop-free probabilistic program to a symbolic store tree (Fig. 8), a data structure representing the distribution over program states at the end of a loop iteration as a function of the variable valuation at its start. Marginalising out the probabilistic choices made in the loop yields the NRSM expectation \(\mathbb {E}[\bar{\eta }]\).
This requires a form of symbolic execution. We represent program states symbolically using symbolic stores, denoted \(\varSigma \) (Fig. 8), which map program variables to probabilistic terms. A probabilistic term \(\pi \) can be either a first-order logic term (Fig. 7) representing an arithmetic expression, or a placeholder for a probability distribution whose parameters are terms (allowing them to be functions of the program state). Finally, symbolic store trees \(\sigma \) (Fig. 8) represent the set of control-flow paths through the loop body, arising from if-statements; a store tree is a binary tree with symbolic stores at the leaves, and internal nodes labelled by logical formulae over program variables.
Figure 9 defines a translation from an initial symbolic store tree and command to a new symbolic store tree characterising the distribution over states after executing the command. At the top level, we provide the command U (the loop body) and the initial symbolic store \(\{x_1' \mapsto x_1, \ldots , x_n' \mapsto x_n\}\), where primed variables represent the variable valuation at the end of the iteration, whereas unprimed variables represent the variable valuation at the beginning of the loop.
The first four cases of Fig. 9 define the translation of arithmetic expressions (to terms) and Boolean expressions (to formulae), by replacing program syntax with the corresponding logical operators.
The next four cases define the translation of commands. skip leaves the symbolic store unchanged. For deterministic assignments, the right-hand side of the assignment is translated in the current symbolic store and bound to the variable. Sequential composition involves translating the first command, and translating the second command in the resulting store tree. A conditional statement creates a new node in the symbolic store tree that selects between the two recursively-translated branches, based on the formula derived from the guard predicate. These rules assume the store tree to be a leaf-level symbolic store, because the next rule handles the case where the initial symbolic store tree is a node. Finally, if the command is a probabilistic assignment, we translate the parameters to terms, and bind the resulting probabilistic term to a freshly generated symbol. This allows variables to be overwritten by multiple probabilistic sampling operations in the body of the loop. The mapping of variables to distributions in leaf-level stores defines the probability density over particular probabilistic choices.
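The translation rules can be sketched on a toy command language. The sketch below is our own simplification of Fig. 9: stores map variables to term strings, trees are tagged tuples, and the command constructors (`skip`, `assign`, `sample`, `seq`, `ite`) are our own toy syntax rather than the paper's grammar.

```python
import itertools
import re

_fresh = itertools.count()

def subst(expr, store):
    """Replace whole-word program variables in a term string by their
    current (string-valued) bindings in the symbolic store."""
    def repl(m):
        t = store.get(m.group(0))
        return f"({t})" if isinstance(t, str) else m.group(0)
    return re.sub(r"[A-Za-z_]\w*", repl, expr)

def translate(cmd, tree):
    """Translate a toy command against a symbolic store tree, in the
    spirit of Fig. 9. A tree is ('leaf', store) or ('node', phi, t1, t2)."""
    if tree[0] == "node":                    # push the command to the leaves
        _, phi, t1, t2 = tree
        return ("node", phi, translate(cmd, t1), translate(cmd, t2))
    store = dict(tree[1])
    kind = cmd[0]
    if kind == "skip":
        return ("leaf", store)
    if kind == "assign":                     # x = e
        _, x, e = cmd
        store[x] = subst(e, store)
        return ("leaf", store)
    if kind == "sample":                     # x ~ Dist(e): bind fresh symbol
        _, x, dist, e = cmd
        sym = f"nu{next(_fresh)}"
        store[sym] = (dist, subst(e, store)) # distribution of the choice
        store[x] = sym
        return ("leaf", store)
    if kind == "seq":                        # c1 ; c2
        return translate(cmd[2], translate(cmd[1], tree))
    if kind == "ite":                        # if g then c1 else c2
        _, g, c1, c2 = cmd
        return ("node", subst(g, store),
                translate(c1, tree), translate(c2, tree))
    raise ValueError(f"unknown command {kind!r}")

# One iteration of the marble loop (Fig. 1), assuming a fair coin choice.
body = ("seq", ("sample", "coin", "Bernoulli", "1/2"),
        ("ite", "coin = 1",
         ("assign", "red", "red - 1"),
         ("assign", "blue", "blue - 1")))
tree = translate(body, ("leaf", {"red": "red", "blue": "blue"}))
```

The resulting tree has one internal node (the coin test) and two leaf stores, one per control-flow path, matching the shape of Fig. 10.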
Example 2
Figure 10 is the store tree produced for the ambitious marble collector program (Fig. 1). Each leaf-level store in the program's store tree corresponds to a particular control-flow path through the loop body. The interpretation of a symbolic store tree is that, if we fix the outcomes of the probabilistic sampling operations performed by the loop body, then the state of the variables at the end of the iteration is determined by the predicates labelling the internal nodes.
4.2 Marginalisation
To construct the closed-form logical term representing the NRSM's expected value at the end of an iteration, the probabilistic choices in the symbolic store tree must be marginalised out. If the program is limited to discrete random variables with finite support, we automatically marginalise the random choices by enumeration (for both SOR- and SOS-form NRSMs), as illustrated by Ex. 3.
Example 3
The ambitious marble collector program of Fig. 1 yields the symbolic store tree of Fig. 10. Suppose we want to marginalise the NRSM:
with respect to this symbolic store tree. We first apply the encoding of the NRSM to each leaf-level symbolic store of Fig. 10, and enumerate the possible outcomes of the probabilistic choices (in this example limited to \(\nu \in \{0, 1\}\)), using the bindings of \(\nu \) to distributions in the leaf-level stores to compute the probability mass of each choice. After resolving the predicates for each choice of \(\nu \), this yields:
The term (9) is then provided as the value of the NRSM’s expectation to the verifier. \(\square \)
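Marginalisation by enumeration is, concretely, a probability-weighted sum over the resolved control-flow paths. The sketch below is our own, with exact rationals and the fair-coin marble loop of Fig. 1 as the running assumption:

```python
from fractions import Fraction

def marginalise_enum(eta, state, choices):
    """Marginalise finitely-supported probabilistic choices by
    enumeration: sum eta over every resolved control-flow path, where
    `choices` is a list of (probability, successor function) pairs."""
    return sum(p * eta(f(state)) for p, f in choices)

eta = lambda s: max(s[0], 0) + max(s[1], 0)
# For the marble program: nu in {0, 1} with probability 1/2 each selects
# which counter is decremented in the corresponding leaf-level store.
choices = [(Fraction(1, 2), lambda s: (s[0] - 1, s[1])),
           (Fraction(1, 2), lambda s: (s[0], s[1] - 1))]
exp_eta = marginalise_enum(eta, (5, 5), choices)  # exactly 9
```

At state (5, 5) the expectation is exactly 9 = η(5, 5) − 1, so condition (i) holds there with margin 1.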
If the program samples from continuous distributions, we marginalise SOS-form NRSMs (but not SOR-form NRSMs) by substituting symbolic moments for a set of supported built-in distributions, including Gaussian, MultivariateGaussian, and Exponential; this could include any distribution whose closed-form symbolic moments are available. Example 4 illustrates this strategy, which is general enough to support a wide variety of programs, including those of Sect. 5. If a sampling distribution lacks symbolic moments, the cumulative distribution function can also be utilised, as illustrated in the slicedcauchy case study (Fig. 15).
Example 4
Consider an NRSM \(\eta (\mathtt{x}) = (w \mathtt{x} + b)^2\) and a symbolic store tree \(\texttt {node}(p = 1, \sigma _1, \sigma _2)\) where \(\sigma _1 = \{x\mapsto x + v, v \mapsto \texttt {Exp}(\lambda ), p \mapsto \texttt {Bernoulli}(3/4)\}\) and \(\sigma _2 = \{ x \mapsto x - v, v \mapsto \texttt {Exp}(\lambda ), p\mapsto \texttt {Bernoulli}(3/4) \}\). \(\texttt {Exp(}\lambda \texttt {)}\) denotes the exponential distribution with parameter \(\lambda \), with pdf denoted \(p_{\texttt {Exp(}\lambda \texttt {)}}(v)\). We apply \(\eta \) to each leaf-level symbolic store, and marginalise the probabilistic choices. We marginalise p first by enumerating over its possible values, and then marginalise v. There are no dependencies between the distributions in this example, so the order in which they are marginalised does not matter.
The result of marginalisation is a closed-form expression for Eq. (10). Note that since
and \(\int _0^\infty v^n p_{\texttt {Exp(}\lambda \texttt {)}}(v) \mathrm {d}v = \frac{n!}{\lambda ^n}\), we use linearity of integration to perform the following simplification, by substituting expressions for the moments of v in terms of the parameter \(\lambda \):
which is used to reduce Eq. (10) to a closed form. This is the method used to perform marginalisation for several case studies, including crwalk, gaussrw and expdistrw. \(\square \)
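The moment identity that drives Ex. 4, \(\int _0^\infty v^n p_{\texttt {Exp}(\lambda )}(v)\,\mathrm {d}v = n!/\lambda ^n\), can be sanity-checked numerically. The sketch below is our own, using a plain trapezoidal rule over a truncated domain (the tail beyond the cutoff is negligible for the chosen \(\lambda \)):

```python
import math

def exp_moment_numeric(n, lam, upper=60.0, steps=200_000):
    """Numerically integrate v^n * lam * exp(-lam * v) over [0, upper]
    with the trapezoidal rule, approximating the n-th moment of
    Exponential(lam), whose closed form is n! / lam^n."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        v = i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid endpoint weights
        total += w * v ** n * lam * math.exp(-lam * v)
    return total * h

lam = 2.0
m1 = exp_moment_numeric(1, lam)   # closed form: 1! / 2^1 = 0.5
m2 = exp_moment_numeric(2, lam)   # closed form: 2! / 2^2 = 0.5
```

In the actual verifier the moments are substituted symbolically, so the resulting expectation stays inside nonlinear real arithmetic; the numeric check here only illustrates why the substitution is sound.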
Notably, our verifier requires the expected value of the RSM to be computed (or soundly approximated) in closed form. We automate marginalisation for discrete distributions of finite support, but require manual intervention for continuous distributions. Nevertheless, our learning component is automated in both cases. Characterising the space of programs with continuous distributions that admit fully automated verification of an RSM is an open question.
5 Case Studies
Existing tools for synthesising RSMs reduce the problem to constraint solving [2, 10, 11, 14], which can limit the generality of the synthesis framework. For instance, methods that convert the RSM constraints into a linear program using Farkas' lemma can only handle programs with affine arithmetic, and can only synthesise linear/affine (lexicographic) RSMs [2, 10]. A second restriction of existing approaches is that they typically require the moments of distributions to be compile-time constants. This rules out programs whose distributions are determined at run time, such as hierarchical and state-dependent distributions. Since the loss function of Eq. (6) only requires execution traces, our learner is agnostic to the structure of program expressions, imposing minimal restrictions on the kinds of expressions that can occur, or the kinds of distributions that can be sampled from. This allows us to learn RSMs for a wider class of programs than existing tools, as we illustrate in this section using a number of case studies.
5.1 Nonlinear Program Expressions and NRSMs
Many simple programs do not admit linear or polynomial RSMs, such as the program in Fig. 1. Since this program cannot be encoded as a Prob-solvable loop (due to the disjunctive guard predicate, which cannot be replaced by a polynomial inequality), it cannot be handled by another recent tool, Amber [39]. However, this program admits the following piecewise-linear NRSM:
whose parameters are learnt by our method, within the first CEGIS iteration.
Similarly, we learn the piecewise-linear NRSM:
for the program in Fig. 11, which contains a bilinear assignment (cf. multiplication of s and i on line 3), so this program is not supported by [2]. The conjunction in the guard means it is not supported by Amber, either.
5.2 Multivariate and Hierarchical Distributions
Figure 12 is a random walk that samples from a multivariate Gaussian distribution, with zero mean, unit variances, and correlation sampled uniformly in the range \(\left[ \frac{1}{2}, 1 \right] \). The MultivariateGaussian of line 4 is an instance of a hierarchical distribution, having parameters that are random variables. This program also contains a nonlinear (polynomial) expression that updates the value of x. For crwalk we learn an SOS-form NRSM:
proving this program is PAST. To verify this, the NRSM expectation is computed via the symbolic moments of the multivariate Gaussian distribution, given its covariance matrix (line 3), and then marginalising w.r.t. rho (again, using the moments of the uniform distribution over \(\left[ \frac{1}{2}, 1\right] \)). Unfortunately, it is challenging to translate many simple programs containing hierarchical distributions into ones that can be handled by existing tools. For instance, although it is possible to simulate sampling from a bivariate Gaussian of arbitrary correlation by sampling from independent standard Gaussian distributions, this would involve computing a non-polynomial function of the correlation. Similarly, for the program in Fig. 14 (further discussed below), if a variable is exponentially distributed, \(X \boldsymbol{\sim }\texttt {Exponential}(1)\), then \(\frac{X}{\lambda } \boldsymbol{\sim }\texttt {Exponential}(\lambda )\), providing a way of simulating an exponential distribution with arbitrary parameter \(\lambda \). However, this again requires a non-polynomial program expression (i.e. the reciprocal of \(\lambda \)) when \(\lambda \) is part of the program state and not a constant, and is therefore out of scope for methods that restrict program expressions to being linear/polynomial.
5.4 State-Dependent Distributions and Non-Linear Expectations
Once we allow hierarchical distributions, it is natural to consider state-dependent distributions, i.e. distributions whose parameters depend on the program state rather than being sampled from other distributions. As an example, consider the program in Fig. 13 (a 2-dimensional Gaussian random walk with state-dependent moments). This is unsupported by existing tools because the mean of the Gaussian is a non-polynomial function of the program state. However, after defining the function \(\sqrt{1+\texttt {x}^2}\) by means of the polynomial logical inequalities
$$\begin{aligned} \mathtt {mu\_x} \ge 0 \wedge \mathtt {mu\_x}^2 = 1 + \mathtt {x}^2 \end{aligned}$$
(similarly for mu_y), we express the expected value of an SOS-form NRSM in terms of symbolic moments mu_x, etc. Since these moments are state-dependent, we cannot marginalise them out as in the hierarchical case. Instead we perform nondeterministic abstraction, providing the inequalities \(\frac{1}{10} \le \mathtt{vx}, \mathtt{vy} \le 2\) and \(-1 \le \texttt {rho} \le 1\) as further verifier assumptions.
Even if program expressions are linear, the presence of state-dependent distributions can result in a nonlinear verification problem, if the moments are themselves nonlinear functions of the program variables. For instance, the program in Fig. 14 represents a 1-dimensional random walk, with steps sampled from an exponential distribution. Since the \(n^\text {th}\) moment of \(\texttt {Exponential(}\lambda \texttt {)}\) is \(\frac{n!}{\lambda ^n}\), the expectation of an SOS-form NRSM is non-polynomial but still expressible in the theory of nonlinear real arithmetic (see Ex. 4). For expdistrw we learn
whereas for gaussrw in Fig. 13 we learn
We translate the program in Fig. 14 for Amber by replacing the update for \(\lambda \) by instead sampling it uniformly from \(\left[ 1, 10 \right] \). Amber correctly identifies that the program is AST, and that \((10 - \mathtt{x})\) is a supermartingale expression (note, not an RSM), though it does not report that the program is PAST (answering “maybe”).
5.4 Undefined Moments
The ability to evaluate the cumulative distribution function (CDF) of a sampled distribution can be useful in marginalisation, even if the moments of the sampled distribution are undefined or not known analytically to infinite precision. An example is Fig. 15: the program samples from the standard Cauchy distribution, for which all moments are undefined. Since the sampled value is only used to determine which branch of a conditional is taken, the RSM expectation is well defined, and can be expressed in terms of the standard Cauchy CDF. Namely, the if-branch is taken with probability \(q = 1 - \left( \frac{1}{\pi }\arctan (10) + \frac{1}{2}\right) \). This quantity is not expressible using polynomials, so we perform a sound approximation by introducing a new variable that is quantified over a small interval surrounding a finite-precision approximation to q. This allows us to learn and verify the SOR-form NRSM:
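The branch probability q can be evaluated numerically from the standard Cauchy CDF \(F(t) = \frac{1}{\pi}\arctan(t) + \frac{1}{2}\). A small sketch of this computation and of an enclosing interval (the tolerance is illustrative, not the one used in our implementation):

```python
import math

def cauchy_cdf(t):
    """CDF of the standard Cauchy distribution."""
    return math.atan(t) / math.pi + 0.5

# Probability of the branch corresponding, per the formula above,
# to the sampled value exceeding 10: q = 1 - F(10).
q = 1.0 - cauchy_cdf(10.0)
print(f"q = {q:.6f}")

# Sound enclosure: quantify a fresh variable over a small interval
# around the finite-precision value (illustrative tolerance).
eps = 1e-6
q_lo, q_hi = q - eps, q + eps
```

The verifier then only needs the polynomial fact \(q_{lo} \le q \le q_{hi}\), avoiding the transcendental arctangent.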
For our experimental evaluation (Sect. 6) we create a modified version of each of the six case studies described in this section, as follows:

– program marbles3 is a generalisation of marbles to three marble types, instead of two;

– probfact2 uses 5/8 as the Bernoulli parameter, rather than 3/4;

– crwalk2 samples rho from a Beta(1, 3) distribution, instead of a uniform distribution over \(\left[ \frac{1}{2}, 1\right] \);

– expdistrw2 samples from an exponential distribution, where parameter \(\texttt {lambda}\) is replaced by \(\texttt {lambda*lambda}\);

– gaussrw2 uses \([3 + 1/(1-\mathtt{x}), 3 + 1/(1-\mathtt{y})]^T\) for its mean vector, instead of \([\sqrt{1 + \mathtt{x}^2}, \sqrt{1 + \mathtt{y}^2}]^T\); and

– slicedcauchy2 has a loop guard of \(\mathtt{x} < 10\), instead of \(\mathtt{x} > 0\), and swaps the two branches of the conditional.
5.5 Rare Transitions
A limitation of relying on a sampled transition dataset to learn NRSM parameters is that the average \(\mathbb {E}_{p^\prime \sim P^\prime }[\eta (p^\prime )]\) in Eq. (5) must be accurate (see Sect. 3). This assumption is challenged by programs that have certain control-flow paths of very low probability, which are unlikely to be sampled by the interpreter. For example, in the context of the ambitious marble collector (Fig. 1), Fig. 16 shows that when the probability of obtaining a red marble decreases below \(2^{-7}\), our success rate drops. This is because a lower probability makes the corresponding control-flow path rarer in the dataset, to the point where the expected value of the NRSM cannot be estimated accurately.
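The effect can be quantified directly: with branch probability p, a dataset of N sampled transitions contains only about Np examples of the rare path. A minimal sketch with illustrative numbers (not our exact sampling procedure):

```python
import random

def rare_branch_hits(p, num_transitions, seed=0):
    """Count how many sampled transitions take a branch of probability p."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(num_transitions))

p = 2 ** -7   # branch probability at which our success rate starts to drop
N = 10_000    # illustrative dataset size
hits = rare_branch_hits(p, N)
print(f"expected ~{N * p:.0f} rare transitions, observed {hits}")
```

With roughly 78 rare transitions in 10,000 samples, the conditional average over the rare path carries far more variance than the rest of the dataset, which is what degrades the NRSM estimate.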
6 Experimental Results
We built a prototype implementation of our framework (in Python) and present experimental results for benchmarks adapted from previous work, as well as our own case studies (from Sect. 5). The case studies illustrate programs for which our framework synthesises an RSM, but which existing tools cannot prove to be PAST.
The learner is implemented with JAX [9]. To train NRSMs, we use AdaGrad [18] for gradient-based optimisation, with a learning rate of \(10^{-2}\). Parameters are initialised by sampling from Gaussian distributions: weight parameters are sampled from a zero-mean Gaussian, whereas the bias parameters are sampled either from a Gaussian with mean 10 (for SOR candidates) or mean 0 (for SOS candidates). We verify the NRSMs using the SMT solver Z3 [26, 40]. The outcomes are obtained on the following platform: macOS Catalina version 10.15.4, 8 GB RAM, Intel Core i5 CPU 2.4 GHz Quad-Core, 64-bit.
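For reference, AdaGrad accumulates squared gradients and scales each step by their square root. A toy sketch on a one-dimensional quadratic (illustrative hyperparameters, not our actual training setup):

```python
import math

def adagrad_minimise(grad, w0, lr=1.0, steps=500, eps=1e-8):
    """Minimise a 1-D objective with AdaGrad: per-step scale lr/sqrt(G)."""
    w, G = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        G += g * g                       # accumulate squared gradients
        w -= lr * g / (math.sqrt(G) + eps)
    return w

# Minimise (w - 3)^2; its gradient is 2(w - 3).
w_star = adagrad_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"w converges to about {w_star:.3f}")
```

The adaptive denominator means parameters with consistently large gradients take progressively smaller steps, which stabilises training of the NRSM loss.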
As mentioned in Sect. 4, the verifier checks a candidate NRSM over states satisfying the loop predicate, which characterises the set of reachable states. For our experiments, we manually provide the NRSM expectation, and augment the guard predicate with additional invariants where necessary. We generate outcomes using two different rounding strategies (Sect. 4): an “aggressive” rounding strategy which generated between 80 and 120 candidates per CEGIS iteration, and a “weaker” rounding strategy producing between 15 and 25 candidates per CEGIS iteration. The outcomes in Table 1 used the aggressive rounding strategy.
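The rounding strategies can be thought of as enumerating nearby low-precision candidates for each learned parameter; a hypothetical sketch of such a scheme (our implementation's exact rounding rules may differ):

```python
from fractions import Fraction
from itertools import product

def rounding_candidates(weights, max_denominators=(1, 2, 4, 8)):
    """Enumerate candidate parameter vectors obtained by rounding each
    learned weight to a nearby fraction with a small denominator."""
    per_weight = [
        sorted({Fraction(w).limit_denominator(d) for d in max_denominators})
        for w in weights
    ]
    return list(product(*per_weight))

learned = [0.498, 1.003]   # illustrative learned NRSM parameters
candidates = rounding_candidates(learned)
print(f"{len(candidates)} candidates, e.g. {candidates[0]}")
```

A more aggressive strategy widens `max_denominators` (or perturbs each weight before rounding), which yields more candidates per iteration and shifts work from the learner to the verifier, matching the trade-off we observe.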
Benchmarks from Previous Work. We run our prototype on single-loop programs from the WTC benchmark suite [3], augmented with probabilistic branching and assignments [2]. These correspond to the programs in the first section of Table 1. We perturb assignment statements by adding noise sampled from a discrete uniform distribution with support \(\{-2, 2\}\), or a continuous uniform distribution on the interval \([-2, 2]\). The while loops are also made probabilistic: with probability 1/2 the loop is executed, and with the remaining probability a skip command is executed.
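As an example of this perturbation, a deterministic countdown loop becomes a probabilistic program of the following shape; a sketch interpreter (illustrative, not a WTC benchmark verbatim):

```python
import random

def run_perturbed_countdown(x0=10.0, max_steps=100_000, seed=0):
    """Simulate: while x > 0, with prob. 1/2 do x = x - 1 + Uniform(-2, 2),
    otherwise skip. Returns the number of loop iterations until exit."""
    rng = random.Random(seed)
    x, steps = x0, 0
    while x > 0 and steps < max_steps:
        if rng.random() < 0.5:
            x = x - 1 + rng.uniform(-2.0, 2.0)  # perturbed assignment
        # else: skip command
        steps += 1
    return steps

lengths = [run_perturbed_countdown(seed=s) for s in range(100)]
print(f"mean steps to termination: {sum(lengths) / len(lengths):.1f}")
```

The zero-mean noise preserves the expected drift of \(-1\) per executed body, and the probabilistic skip halves it to \(-1/2\) per iteration, so the perturbed program remains PAST.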
We compare our framework against three existing tools. The first is Amber [39]: where possible, we translate instances from the WTC suite into the language of Amber, but this is not possible for some programs where the loop predicate is a logical conjunction or disjunction of predicates (indicated by dashes in Table 1). Second, we compare against a tool for synthesising affine lexicographic RSMs (referred to as Farkas’ lemma) for affine programs (i.e. containing only linear expressions), based on reduction to linear programming via Farkas’ lemma [2]. This is applicable to probabilistic programs with nested loops, unlike our method. However, since it is limited to affine programs and affine lexicographic RSMs, it is not able to analyse all the programs we consider (again, indicated by dashes in Table 1). The third tool is Absynth [41], for which we are able to encode all programs that were limited to discrete random variables.
The experimental results (Table 1) show that for all the WTC benchmarks our approach has a success rate of at least 8/10, and is able to synthesise an RSM within 2 iterations (for the seed that results in median total execution time). For 15 of the 18 WTC benchmarks no full CEGIS iterations are required. As expected, our approach, particularly the learning component, is much slower than all three tools. However, our framework has broader applicability, as illustrated with the next set of experiments.
Newly Defined Case Studies. The examples in the second section of Table 1 (from Sect. 5) are not proven PAST by any of the three tools. Our approach is able to do so with a success rate of at least 9/10, under the “aggressive” rounding strategy. Of the new examples, marbles3 (Sect. 5) requires the longest time, since we use an NRSM with \(h=3\) ReLU nodes (see Sect. 3), and six of the nine parameters must be brought sufficiently close to zero to learn a valid RSM. For gaussrw/gaussrw2, we find it necessary to set an SMT solver time limit within the CEGIS loop (of 200 ms for gaussrw, and 5 s for gaussrw2), such that candidates taking longer than this to verify are skipped. The fact that these examples are harder to verify is unsurprising, given that they give rise to non-polynomial decision problems, containing equationally defined rational expressions. In comparing the two rounding strategies, we find that using the “aggressive” strategy tends to result in fewer CEGIS iterations, reducing the learner time, while increasing the verifier time: this is to be expected, since a larger number of candidates needs to be checked in each CEGIS iteration.
7 Conclusion
We have presented the first machine learning method for the termination analysis of probabilistic programs. We have introduced a loss function for training neural networks so that they behave as RSMs over sampled execution traces; our training phase is agnostic to the program and thus easily portable to different programming languages. Reasoning about the program code is entirely delegated to our checking phase which, by SMT solving over a symbolic encoding of program and neural network, verifies whether the neural network is a sound RSM. Upon a positive answer, we have formally certified that the program is PAST; upon a negative answer, we obtain a counterexample that we use to resample traces and repeat training in a CEGIS loop. Our procedure runs indefinitely for programs that are not PAST, as these necessarily lack a ranking supermartingale, and may run indefinitely for some PAST programs. Nevertheless, we have experimentally demonstrated over several PAST benchmarks that our method is effective in practice and covers a broader range of programs than existing tools.
Our method naturally generalises to deeper networks, but whether these are necessary in practice remains an open question; notably, neural networks with one hidden layer were sufficient to solve our examples. We have exclusively tackled the PAST question, and techniques for almost-sure (but not necessarily PAST) termination and non-termination exist [16, 37, 39]. Our results pose the basis for future research in machine learning (and CEGIS) for the formal verification of probabilistic programs. Different verification questions will require different learning models. Our approach lends itself to extensions toward probabilistic safety, exploiting supermartingale inequalities, and towards the non-termination question, using repulsing supermartingales [16]. Adapting our method to termination analysis with infinite expected time is also a matter for future investigation [37]. Moreover, we have exclusively considered purely probabilistic single-loop programs: generalisations to programs with nondeterminism, arbitrary control-flow, and concurrency are material for future work [15, 20, 35].
References
Abate, A., Ahmed, D., Giacobbe, M., Peruffo, A.: Formal synthesis of Lyapunov neural networks. IEEE Control. Syst. Lett. 5(3), 773–778 (2021)
Agrawal, S., Chatterjee, K., Novotný, P.: Lexicographic ranking supermartingales: an efficient approach to termination of probabilistic programs. Proc. ACM Program. Lang. 2(POPL), 34:1–34:32 (2018)
Alias, C., Darte, A., Feautrier, P., Gonnord, L.: Multidimensional rankings, program termination, and complexity bounds of flowchart programs. In: Cousot, R., Martel, M. (eds.) Static Analysis, pp. 117–133. Springer, Berlin, Heidelberg (2010)
Alur, R., et al.: Syntax-guided synthesis. In: FMCAD, pp. 1–8. IEEE (2013)
Avanzini, M., Dal Lago, U., Yamada, A.: On probabilistic term rewriting. Sci. Comput. Program. 185, 102338 (2020)
Avanzini, M., Moser, G., Schaper, M.: A modular cost analysis for probabilistic programs. Proc. ACM Program. Lang. 4(OOPSLA), 172:1–172:30 (2020)
Batz, K., Kaminski, B.L., Katoen, J.P., Matheja, C.: How long, O Bayesian network, will I sample thee? In: Ahmed, A. (ed.) ESOP 2018. LNCS, vol. 10801, pp. 186–213. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89884-1_7
Bournez, O., Garnier, F.: Proving positive almost-sure termination. In: Giesl, J. (ed.) RTA 2005. LNCS, vol. 3467, pp. 323–337. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-32033-3_24
Bradbury, J., et al.: JAX: composable transformations of Python+NumPy programs (2018). http://github.com/google/jax
Chakarov, A., Sankaranarayanan, S.: Probabilistic program analysis with martingales. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 511–526. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-8_34
Chakarov, A., Voronin, Y.L., Sankaranarayanan, S.: Deductive proofs of almost sure persistence and recurrence properties. In: Chechik, M., Raskin, J.F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 260–279. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_15
Chang, Y., Roohi, N., Gao, S.: Neural Lyapunov control. In: NeurIPS, pp. 3240–3249 (2019)
Chatterjee, K., Fu, H., Novotný, P.: Termination analysis of probabilistic programs with martingales. In: Barthe, G., Katoen, J.P., Silva, A. (eds.) Foundations of Probabilistic Programming, pp. 221–258. Cambridge University Press (2020)
Chatterjee, K., Fu, H., Goharshady, A.K.: Termination analysis of probabilistic programs through Positivstellensatz’s. In: Chaudhuri, S., Farzan, A. (eds.) CAV 2016. LNCS, vol. 9779, pp. 3–22. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41528-4_1
Chatterjee, K., Fu, H., Novotný, P., Hasheminezhad, R.: Algorithmic analysis of qualitative and quantitative termination problems for affine probabilistic programs. ACM Trans. Program. Lang. Syst. 40(2), 7:1–7:45 (2018)
Chatterjee, K., Novotný, P., Zikelic, D.: Stochastic invariants for probabilistic termination. In: POPL, pp. 145–160. ACM (2017)
Dahlqvist, F., Silva, A.: Semantics of probabilistic programming: a gentle introduction. In: Barthe, G., Katoen, J.P., Silva, A. (eds.) Foundations of Probabilistic Programming, pp. 1–42. Cambridge University Press (2020)
Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. In: COLT, pp. 257–269. Omnipress (2010)
Fioriti, L.M.F., Hermanns, H.: Probabilistic termination: Soundness, completeness, and compositionality. In: POPL, pp. 489–501. ACM (2015)
Fu, H., Chatterjee, K.: Termination of nondeterministic probabilistic programs. In: Enea, C., Piskac, R. (eds.) VMCAI 2019. LNCS, vol. 11388, pp. 468–490. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11245-5_22
Giacobbe, M., Kroening, D., Parsert, J.: Neural termination analysis. CoRR abs/2102.03824 (2021)
Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT Press (2016)
Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: FOSE, pp. 167–181. ACM (2014)
Heizmann, M., Hoenicke, J., Podelski, A.: Termination analysis by learning terminating programs. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 797–813. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08867-9_53
Huang, M., Fu, H., Chatterjee, K., Goharshady, A.K.: Modular verification for almost-sure termination of probabilistic programs. Proc. ACM Program. Lang. 3(OOPSLA), 129:1–129:29 (2019)
Jovanović, D., de Moura, L.: Solving non-linear arithmetic. In: Gramlich, B., Miller, D., Sattler, U. (eds.) IJCAR 2012. LNCS (LNAI), vol. 7364, pp. 339–354. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31365-3_27
Kaminski, B.L., Katoen, J.P., Matheja, C.: On the hardness of analyzing probabilistic programs. Acta Informatica 56(3), 255–285 (2018). https://doi.org/10.1007/s00236-018-0321-1
Kapinski, J., Deshmukh, J.V., Sankaranarayanan, S., Aréchiga, N.: Simulationguided Lyapunov analysis for hybrid dynamical systems. In: HSCC, pp. 133–142. ACM (2014)
Kozen, D.: Semantics of probabilistic programs. J. Comput. Syst. Sci. 22(3), 328–350 (1981)
Kura, S., Unno, H., Hasuo, I.: Decision tree learning in CEGISbased termination analysis. In: CAV (2021)
Le, T.C., Antonopoulos, T., Fathololumi, P., Koskinen, E., Nguyen, T.: DynamiTe: Dynamic termination and non-termination proofs. Proc. ACM Program. Lang. 4(OOPSLA), 189:1–189:30 (2020)
Lee, W., Wang, B.Y., Yi, K.: Termination analysis with algorithmic learning. In: Madhusudan, P., Seshia, S.A. (eds.) CAV 2012. LNCS, vol. 7358, pp. 88–104. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31424-7_12
Lee, W., Yu, H., Rival, X., Yang, H.: Towards verified stochastic variational inference for probabilistic programs. Proc. ACM Program. Lang. 4(POPL), 16:1–16:33 (2020)
Li, Y., Ying, M.: Algorithmic analysis of termination problems for quantum programs. Proc. ACM Program. Lang. 2(POPL), 35:1–35:29 (2018)
Lin, A.W., Rümmer, P.: Liveness of randomised parameterised systems under arbitrary schedulers. In: Chaudhuri, S., Farzan, A. (eds.) CAV 2016. LNCS, vol. 9780, pp. 112–133. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41540-6_7
McIver, A., Morgan, C.: Abstraction, Refinement and Proof for Probabilistic Systems. Monographs in Computer Science. Springer, Berlin (2005)
McIver, A., Morgan, C., Kaminski, B.L., Katoen, J.: A new proof rule for almost-sure termination. Proc. ACM Program. Lang. 2(POPL), 33:1–33:28 (2018)
Meyer, F., Hark, M., Giesl, J.: Inferring expected runtimes of probabilistic integer programs using expected sizes. In: TACAS 2021. LNCS, vol. 12651, pp. 250–269. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72016-2_14
Moosbrugger, M., Bartocci, E., Katoen, J.P., Kovács, L.: Automated termination analysis of polynomial probabilistic programs. In: ESOP 2021. LNCS, vol. 12648, pp. 491–518. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72019-3_18
de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24
Ngo, V.C., Carbonneaux, Q., Hoffmann, J.: Bounded expectations: resource analysis for probabilistic programs. In: PLDI, pp. 496–512. ACM (2018)
Nori, A.V., Sharma, R.: Termination proofs from tests. In: ESEC/SIGSOFT FSE, pp. 246–256. ACM (2013)
Richards, S.M., Berkenkamp, F., Krause, A.: The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems. In: CoRL. Proceedings of Machine Learning Research, vol. 87, pp. 466–476. PMLR (2018)
SolarLezama, A., Tancau, L., Bodík, R., Seshia, S.A., Saraswat, V.A.: Combinatorial sketching for finite programs. In: ASPLOS, pp. 404–415. ACM (2006)
Urban, C., Gurfinkel, A., Kahsai, T.: Synthesizing ranking functions from bits and pieces. In: Chechik, M., Raskin, J.F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 54–70. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_4
Acknowledgments
This work was in part supported by a partnership between Aerospace Technology Institute (ATI), Department for Business, Energy & Industrial Strategy (BEIS) and Innovate UK under project HICLASS (113213), by the Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership, by the Department of Computer Science Scholarship, University of Oxford, and by the DeepMind Computer Science Scholarship.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2021 The Author(s)
Abate, A., Giacobbe, M., Roy, D. (2021). Learning Probabilistic Termination Proofs. In: Silva, A., Leino, K.R.M. (eds) Computer Aided Verification. CAV 2021. Lecture Notes in Computer Science, vol. 12760. Springer, Cham. https://doi.org/10.1007/978-3-030-81688-9_1
DOI: https://doi.org/10.1007/978-3-030-81688-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81687-2
Online ISBN: 978-3-030-81688-9