Probabilistic Programming Inference via Intensional Semantics
Abstract
We define a new denotational semantics for a first-order probabilistic programming language in terms of probabilistic event structures. This semantics is intensional, meaning that the interpretation of a program contains information about its behaviour throughout execution, rather than a simple distribution on return values. In particular, occurrences of sampling and conditioning are recorded as explicit events, partially ordered according to the data dependencies between the corresponding statements in the program.
This interpretation is adequate: we show that the usual measure-theoretic semantics of a program can be recovered from its event structure representation. Moreover it can be leveraged for MCMC inference: we prove correct a version of single-site Metropolis-Hastings with incremental recomputation, in which the proposal kernel takes into account the semantic information in order to avoid performing some of the redundant sampling.
Keywords
Probabilistic programming · Denotational semantics · Event structures · Bayesian inference
1 Introduction
Probabilistic programming languages [8] were put forward as promising tools for practitioners of Bayesian statistics. By extending traditional programming languages with primitives for sampling and conditioning, they allow the user to express a wide class of statistical models, and provide a simple interface for encoding inference problems. Although the subject of active research, it is still notoriously difficult to design inference methods for probabilistic programs which perform well for the full class of expressible models.
One popular inference technique, proposed by Wingate et al. [21], involves adapting well-known Monte Carlo Markov chain methods from statistics to probabilistic programs, by manipulating program traces. One such method is the Metropolis-Hastings algorithm, which relies on a key proposal step: given a program trace x (a sequence \(x_1, \dots , x_n\) of random choices with their likelihood), a proposal for the next trace sample is generated by choosing \(i \in \{1, \dots , n\}\) uniformly, resampling \(x_i\), and then continuing to execute the program, only performing additional sampling for those random choices not appearing in x. The variables already present in x are not resampled: only their likelihood is updated according to the new value of \(x_i\). Likewise, some conditioning statements must be re-evaluated in case the corresponding weight is affected by the change to \(x_i\).
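The proposal step above can be sketched in a few lines of Python. This is our own illustrative reconstruction, not the algorithm of [21] verbatim: the `propose` function and the representation of a trace as a map from choice names to values are hypothetical simplifications (a real implementation would also track likelihoods).

```python
import random

def propose(program, trace):
    """Single-site proposal (a sketch): pick one recorded random choice
    uniformly, discard its value, and re-run the program, reusing the old
    trace wherever possible."""
    site = random.choice(list(trace))       # 1. pick a random choice uniformly
    kept = dict(trace)
    del kept[site]                          # 2. force this site to be resampled

    new_trace = {}
    def sample(name, draw):
        # reuse the old value when this choice already appears in the trace,
        # otherwise draw a fresh value (`draw` is a zero-argument sampler)
        new_trace[name] = kept[name] if name in kept else draw()
        return new_trace[name]

    program(sample)                         # 3. re-execute the program
    return new_trace

# A toy generative model: x depends on mu, mirroring a data dependency.
def model(sample):
    mu = sample("mu", lambda: random.gauss(0.0, 1.0))
    x = sample("x", lambda: random.gauss(mu, 1.0))
    return x

random.seed(0)
t0 = {}
model(lambda name, draw: t0.setdefault(name, draw()))   # build an initial trace
t1 = propose(model, t0)
```

Note that `propose` walks the whole program even when the resampled site influences nothing downstream; this is exactly the redundancy discussed next.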
Observe that there is some redundancy in this process, since the update above only affects variables and observations whose density directly depends on the value of \(x_i\). This may significantly affect performance: to solve an inference problem one must usually perform a large number of proposal steps. To overcome this problem, some recent implementations, notably [12, 25], make use of incremental recomputation, whereby some of the redundancy can be avoided via a form of static analysis. However, as pointed out by Kiselyov [13], establishing the correctness of such implementations is tricky.
Here we address this by introducing a theoretical framework in which to reason about data dependencies in probabilistic programs. Specifically, our first contribution is to define a denotational semantics for a first-order probabilistic language, in terms of graph-like structures called event structures [22]. In event structures, computational events are partially ordered according to the dependencies between them; additionally they can be equipped with quantitative information to represent probabilistic processes [16, 23]. This semantics is intensional, unlike most existing semantics for probabilistic programs, in which the interpretation of a program resembles a probability distribution on output values. We relate our approach to a measure-theoretic semantics [18] through an adequacy result.
Our second contribution is the design of a Metropolis-Hastings algorithm which exploits the event structure representation of the program at hand. Some of the redundancy in the proposal step of the algorithm is avoided by taking into account the extra dependency information given by the semantics. We provide a proof of correctness for this algorithm, and argue that an implementation is realistically achievable: we show in particular that all graph structures involved and the associated quantitative information admit a finite, concrete representation.
Outline of the Paper. In Sect. 2 we give a short introduction to probabilistic programming. We define our main language of study and its measure-theoretic semantics. In Sect. 3.1, we introduce MCMC methods and the Metropolis-Hastings algorithm in the context of probabilistic programming. We then motivate the need for intensional semantics in order to capture data dependency. In Sect. 4 we define our interpretation of programs and prove adequacy. In Sect. 5 we define an updated version of the algorithm, and prove its correctness. We conclude in Sect. 6.
The proofs of the statements are detailed in the technical report [4].
2 Probabilistic Programming
In this section we motivate the need for capturing data dependency in probabilistic programs. Let us start with a brief introduction to probabilistic programming – a more comprehensive account can be found in [8].
2.1 Conditioning and Posterior Distribution
Let us introduce the problem of inference in probabilistic programming from the point of view of programming language theory.
We consider a first-order programming language enriched with a real number type \(\mathbb R\) and a primitive \(\texttt {sample}\) for drawing random values from a given family of standard probability distributions. The language is idealised, but it is assumed that an implementation of the language comprises built-in sampling procedures for those standard distributions. Thus, repeatedly running the program Open image in new window returns a sequence of values approaching the true uniform distribution on [0, 1].
Measure Theory. Because this work makes heavy use of probability theory, we start with a brief account of measure theory. A standard textbook for this is [1]. Recall that a measurable space is a set X equipped with a \(\sigma \)-algebra \(\varSigma _X\): a set of subsets of X containing \(\emptyset \) and closed under complements and countable unions. Elements of \(\varSigma _X\) are called measurable sets. A measure on X is a function \(\mu : \varSigma _X \rightarrow [0, \infty ]\), such that \(\mu (\emptyset )= 0\) and, for any countable family \(\{U_i\}_{i \in I}\) of pairwise disjoint measurable sets, \(\mu (\bigcup _{i \in I} U_i) = \sum _{i \in I} \mu (U_i)\).
An important example is that of the set \(\mathbb {R}\) of real numbers, whose \(\sigma \)-algebra \(\varSigma _\mathbb {R}\) is generated by the intervals [a, b), for \(a, b \in \mathbb {R}\) (in other words, it is the smallest \(\sigma \)-algebra containing those intervals). The Lebesgue measure on \((\mathbb {R}, \varSigma _\mathbb {R})\) is the (unique) measure \(\lambda \) assigning \(b-a\) to every interval [a, b) (with \(a \le b\)).
Given measurable spaces \((X, \varSigma _X)\) and \((Y, \varSigma _Y)\), a function \(f : X \rightarrow Y\) is measurable if for every \(U \in \varSigma _Y\), \(f^{-1} U \in \varSigma _X\). A measurable function \(f : X \rightarrow [0, \infty ]\) can be integrated against a measure \(\lambda \) on X: given \(U \in \varSigma _X\) the integral \(\int _{U} f \,\mathrm {d}\lambda \) is a well-defined element of \([0, \infty ]\); indeed the map \(\mu : U \mapsto \int _{U} f \,\mathrm {d}\lambda \) is a measure on X, and f is said to be a density for \(\mu \) (with respect to \(\lambda \)). The precise definition of the integral is standard but slightly more involved; we omit it.
We identify the following important classes of measures: a measure \(\mu \) on \((X, \varSigma _X)\) is a probability measure if \(\mu (X) = 1\). It is finite if \(\mu (X) < \infty \), and it is s-finite if \(\mu = \sum _{i \in I} \mu _i\), a pointwise, countable sum of finite measures.
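For instance (a standard example, not from the text), the Lebesgue measure \(\lambda \) is not finite, since \(\lambda (\mathbb R) = \infty \), but it is s-finite: it decomposes as a countable pointwise sum of finite measures,

```latex
\lambda \;=\; \sum_{n \in \mathbb{Z}} \lambda_n,
\qquad
\lambda_n(U) := \lambda\bigl(U \cap [n, n+1)\bigr),
\qquad
\lambda_n(\mathbb{R}) = 1 < \infty .
```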

Countable products and coproducts of measurable spaces are defined as follows:
\(\varSigma _{\prod _{i \in I} X_i}\) is generated by \(\{ \prod _{i \in I} U_i \mid U_i \in \varSigma _{X_i} \text { for all } i \}\), and

\(\varSigma _{\coprod _{i \in I} X_i}\) is generated by \(\{ \{i\} \times U_i \mid i \in I \text { and } U_i \in \varSigma _{X_i} \}\).
The measurable spaces in this paper all belong to a well-behaved subclass: call \((X, \varSigma _X)\) a standard Borel space if it is either countable and discrete (i.e. all \(U \subseteq X\) are in \(\varSigma _X\)), or measurably isomorphic to \((\mathbb {R}, \varSigma _\mathbb {R})\). Note that standard Borel spaces are closed under countable products and coproducts, and that in a standard Borel space all singletons are measurable.
2.2 A First-Order Probabilistic Programming Language

f ranges over measurable functions \(\llbracket {A}\rrbracket \rightarrow \llbracket {B}\rrbracket \), where A and B are types;

d ranges over a family of parametric distributions over the reals, i.e. measurable functions \(\mathbb {R}^n \times \mathbb {R}\rightarrow \mathbb {R}\), for some \(n \in \mathbb N\), such that for every \(\mathbf r \in \mathbb {R}^n\), \(\int d(\mathbf r, x)\,\mathrm {d}x = 1\). For the purposes of this paper we ignore all issues related to invalid parameters, arising from e.g. a call to \(\texttt {gaussian}\) with standard deviation \(\sigma = 0\). (An implementation could, say, choose to behave according to an alternative distribution in this case.)

The usual product projections \(\pi _i : \llbracket {\prod _{i \in I} A_i}\rrbracket \rightarrow \llbracket {A_i}\rrbracket \) and coproduct injections \(\iota _i : \llbracket {A_i}\rrbracket \rightarrow \llbracket {\coprod _{i \in I} A_i}\rrbracket \);

The operators \(+, \times : \mathbb {R}^2 \rightarrow \mathbb {R}\),

The tests, e.g. Open image in new window ,

The constant functions Open image in new window of the form \(()\mapsto a\) for some \(a \in \llbracket A \rrbracket \).
Examples for d include Open image in new window , Open image in new window , ...
2.3 Measure-Theoretic Semantics of Programs
We now define a semantics of probabilistic programs using the measure-theoretic concept of kernel, which we define shortly. The content of this section is not new: using kernels as semantics for probabilistic programs was originally proposed in [14], while the (more recent) treatment of conditioning (score) via s-finite kernels is due to Staton [18]. Intuitively, kernels provide a semantics of open terms \(\varGamma \vdash M: A\) as measures on \(\llbracket {A}\rrbracket \) varying according to the values of variables in \(\varGamma \).
Formally, a kernel from \((X, \varSigma _X)\) to \((Y, \varSigma _Y)\) is a function \(k : X \times \varSigma _Y \rightarrow [0, \infty ]\) such that for each \(x \in X\), \(k(x, \cdot )\) is a measure, and for each \(U \in \varSigma _Y\), \(k(\cdot , U)\) is measurable. (Here the \(\sigma \)-algebra \(\varSigma _{[0, \infty ]}\) is the restriction of that of \(\mathbb {R}+ \{\infty \}\).) We say k is finite (resp. probabilistic) if each \(k(x, \cdot )\) is a finite (resp. probability) measure, and it is s-finite if it is a countable pointwise sum \(\sum _{i \in I} k_i\) of finite kernels. We write \(k : X \rightsquigarrow Y\) when k is an s-finite kernel from X to Y.
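As a concrete instance (our own illustration, not from the paper), the Gaussian kernel \(k(x, U) = \int _U d(x, y)\,\mathrm dy\), with d the unit-variance normal density centred at x, is probabilistic: fixing x gives a probability measure, and fixing an interval gives a measurable function of x. On intervals it can be computed directly from the normal CDF:

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_kernel(x, a, b):
    """k(x, [a, b)): mass that the unit-variance Gaussian centred at x
    assigns to the interval [a, b).  For fixed x this is a probability
    measure in the interval; for a fixed interval, a function of x."""
    return normal_cdf(b - x) - normal_cdf(a - x)

total = gaussian_kernel(0.0, -1e9, 1e9)   # k(x, R) = 1: a probabilistic kernel
half = gaussian_kernel(0.0, 0.0, 1e9)     # by symmetry, k(0, [0, inf)) = 1/2
```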
A term \(\varGamma \vdash M : A\) will denote an sfinite kernel \(\llbracket {M}\rrbracket : \llbracket {\varGamma }\rrbracket \rightsquigarrow \llbracket {A}\rrbracket \), where the context \(\varGamma = x_1 : A_1, \dots , x_n: A_n\) denotes the product of its components: \(\llbracket {\varGamma }\rrbracket = \llbracket {A_1}\rrbracket \times \dots \times \llbracket {A_n}\rrbracket \).

\(\llbracket {()}\rrbracket \) is the lifting of \(\llbracket {\varGamma }\rrbracket \rightarrow 1 : x \mapsto ()\).

Open image in new window is \( \llbracket N \rrbracket \circ \llbracket M \rrbracket \)

\( \llbracket f\, M \rrbracket = f^\dagger \circ \llbracket M \rrbracket \)

Open image in new window , the Dirac distribution \( \delta _x(X) = 1\) if \(x \in X\) and zero otherwise.

Open image in new window where \(\mathtt {sam}_d : \mathbb {R} ^n \rightsquigarrow \mathbb {R} \) is given by \(\mathtt {sam}_d(\mathbf r, X) = \int _{x \in X} d(\mathbf r, x)\mathrm {d}x\).

Open image in new window where Open image in new window is \(\mathtt {sco}(r, X) = r \cdot \delta _{()}(X)\).

Open image in new window : this is welldefined since the \(\prod X_i\) generate the measurable sets of the product space.

Open image in new window where Open image in new window maps \(( \gamma , \{i\} \times X)\) to \( \llbracket N_i \rrbracket ( \gamma , X)\).
We observe that when M is a program making no use of conditioning (i.e. a generative model), the kernel \( \llbracket M \rrbracket \) is probabilistic:
Lemma 1
For \( \varGamma \vdash M : A\) without scores, Open image in new window for each \( \gamma \in \llbracket \varGamma \rrbracket \).
2.4 Exact Inference
Note that a kernel \(1 \rightsquigarrow \llbracket {A}\rrbracket \) is the same as a measure on \(\llbracket {A}\rrbracket \). Given a closed program \(\vdash M: A\), the measure \(\llbracket {M}\rrbracket \) is a combination of the prior (occurrences of sample) and the likelihood (score). Because score can be called on arbitrary arguments, it may be the case that the measure of the total space (that is, the coefficient Open image in new window , often called the model evidence) is 0 or \(\infty \).
3 Approximate Inference via Intensional Semantics
3.1 An Introduction to Approximate Inference
In this section we describe the Metropolis-Hastings (MH) algorithm for approximate inference in the context of probabilistic programming. Metropolis-Hastings is a generic algorithm to sample from a probability distribution D on a measurable state space \(\mathbb X\), of which we know the density Open image in new window up to some normalising constant.
MH is part of a family of inference algorithms called Monte Carlo Markov chain methods, in which the posterior distribution is approximated by a sequence of samples generated using a Markov chain.
Formally, the MH algorithm defines a Markov chain M on the state space \(\mathbb X\), that is, a probabilistic kernel \(M: \mathbb X \rightsquigarrow \mathbb X\). The correctness of the MH algorithm is expressed in terms of convergence: for almost all \(x \in \mathbb X\), the distribution \(M^n(x, \cdot )\) converges to D as n goes to infinity, where \(M^n\) is the n-fold iteration \(M \circ \ldots \circ M\). Intuitively, this means that iterated sampling from M gets closer to D as the number of iterations grows.
The MH algorithm is itself parametrised by a Markov chain, referred to as the proposal kernel \(P: \mathbb {X} \rightsquigarrow \mathbb {X}\): for each sampled value \(x\in \mathbb X\), a proposed value for the next sample is drawn according to \(P(x, \cdot )\). Note that correctness only holds under certain assumptions on P.
The MH algorithm assumes that we know how to sample from P, and that its density is known, i.e. there is a function Open image in new window such that \(p(x, \cdot )\) is the density of the distribution \(P(x, \cdot )\). Given a current state x, one step of the algorithm proceeds as follows:
 1.
Sample a new state \(x'\) from the distribution \(P(x, \cdot )\).
 2.
Compute the acceptance ratio of \(x'\) with respect to x:$$ \alpha (x, x') = \min \left( 1, \frac{d(x') \times p(x', x)}{d(x) \times p(x, x')}\right) $$
 3.
With probability \(\alpha (x, x')\), return the new sample \(x'\), otherwise return the input state x.
The formula for \( \alpha (x, x')\) is known as the Hastings acceptance ratio and is key to the correctness of the algorithm.
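As an illustration, one MH transition can be sketched in Python; this is our own sketch, not the paper's implementation. Here `d`, `sample_p`, and `p` stand for the target density, the proposal sampler, and the proposal density from above; the toy instance uses a symmetric Gaussian random walk, for which the proposal densities cancel in the ratio.

```python
import math
import random

def mh_step(x, d, sample_p, p):
    """One Metropolis-Hastings transition: `d` is the (unnormalised) target
    density, `sample_p(x)` draws a proposal from P(x, .), and `p(x, y)` is
    the density of P(x, .) at y."""
    x_new = sample_p(x)                              # 1. propose a new state
    alpha = min(1.0, (d(x_new) * p(x_new, x))        # 2. Hastings acceptance
                   / (d(x) * p(x, x_new)))           #    ratio
    return x_new if random.random() < alpha else x   # 3. accept or reject

# Toy instance: the target is a standard Gaussian known up to a constant,
# and the proposal is a symmetric Gaussian random walk of step 0.5.
d = lambda y: math.exp(-y * y / 2.0)
p = lambda y, z: math.exp(-(z - y) ** 2 / (2 * 0.25))
sample_p = lambda y: random.gauss(y, 0.5)

random.seed(1)
x = 10.0                     # start far away from the mode of the target
for _ in range(5000):
    x = mh_step(x, d, sample_p, p)
```

After a few thousand iterations the chain has moved from the distant starting point into the bulk of the target distribution, illustrating the convergence property stated above.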
Very little is assumed of P, which makes the algorithm very flexible; but of course the convergence rate may vary depending on the choice of P. We give a more formal description of MH in Sect. 5.2.
Single-Site MH and Incremental Recomputation. To apply this algorithm to probabilistic programming, we need a proposal kernel. Given a program M, the execution traces of M form a measurable set \(\mathbb X_M\). In this setting the proposal is given by a kernel \(\mathbb X_M \rightsquigarrow \mathbb X_M\) which, given a trace x, proceeds as follows:
 1.
Select uniformly one of the random choices s encountered in x.
 2.
Sample a new value for this instruction.
 3.
Re-execute the program M from that point onwards, with this new value for s, only ever resampling a variable when the corresponding instruction did not already appear in x.
Observe that there is some redundancy in this process: in the final step, the entire program has to be explored even though only a subset of the random choices will be reevaluated. Some implementations of Trace MH for probabilistic programming make use of incremental recomputation.
We propose in this paper to statically compile a program M to an event structure \(G_M\) which makes explicit the probabilistic dependencies between events, thus avoiding unnecessary sampling.
3.2 Capturing Probabilistic Dependencies Using Event Structures
Consider the program depicted in Fig. 1 in which we are interested in learning the parameters \(\mu \) and \(\sigma \) of a Gaussian distribution from which we have observed two data points, say \(v_1\) and \(v_2\). For \(i = 1, 2\) the function Open image in new window expresses a soft constraint; it can be understood as indicating how much the sampled value of \(x_i\) matches the observed value \(v_i\).
A proposal step following the single-site kernel may choose to resample \(\mu \); then it must run through the entire trace, checking for potential dependencies on \(\mu \), even though in this case none of the other variables need to be resampled.
So we argue that viewing a program as a tree of traces is not the most appropriate in this context: we propose instead to compile a program into a partially ordered structure reflecting the probabilistic dependencies.
With our approach, the example above would yield the partial order displayed below on the right-hand side. The nodes on the first line correspond to the samples for \(\mu \) and \(\sigma \), and those on the second line to \(x_1\) and \(x_2\). This provides an accurate account of the probabilistic dependencies: whenever \(e \le e'\) (where \(\le \) is the reflexive, transitive closure of Open image in new window ), it is the case that \(e'\) depends on e.
We represent this information by enriching the partial order with a conflict relation, indicating when two actions are in different branches of a conditional statement. The resulting structure is depicted on the right. Combining partial order and conflict in this way can be conveniently formalised using event structures [22]:
Definition 1
An event structure is a triple \((E, \le , \#)\), where \(\le \) is a partial order on the set E of events and \(\#\) is an irreflexive, symmetric binary relation on E, such that:
for every \(e \in E\), the set \([e] = \{e' \in E \mid e' \le e\}\) is finite, and

if \(e \# e'\) and \(e' \le e''\), then \(e \# e''\).
From the partial order \( \le \), we extract immediate causality Open image in new window : Open image in new window when \(e < e'\) with no events in between; and from the conflict relation, we extract minimal conflict Open image in new window : Open image in new window when \(e \# e'\) and there are no other conflicts in \([e] \cup [e']\). In pictures we draw Open image in new window and Open image in new window rather than \(\le \) and \(\#\).
A subset \(x\subseteq E\) is a configuration of E if it is down-closed (if \(e' \le e \in x\) then \(e' \in x\)) and conflict-free (if \(e, e' \in x\) then \(\lnot (e \# e')\)). So in this framework, configurations correspond exactly to partial execution traces of E.
The configuration [e] is the causal history of e; we also write [e) for \([e] \setminus \{e\}\). We write \(\mathscr {C}(E)\) for the set of all finite configurations of E, a partial order under inclusion. A configuration x is maximal if it is maximal in \(\mathscr {C}(E)\): for every \(x' \in \mathscr {C}(E)\), if \(x \subseteq x'\) then \(x = x'\). We use the notation Open image in new window , and in that case we say \(x'\) covers x.
An event structure is confusion-free if minimal conflict is transitive, and if any two events \(e, e'\) in minimal conflict satisfy \([e) = [e')\).
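These definitions are straightforward to prototype. The following is our own self-contained sketch (not from the paper): a finite event structure given by immediate causes and a conflict relation, with a check of the two configuration conditions, instantiated on the Gaussian example above and on a small conditional.

```python
from itertools import combinations

class EventStructure:
    """A finite event structure: events, causality given via immediate
    causes (e depends on each event in causes[e]), and a symmetric
    conflict relation given as a set of unordered pairs."""
    def __init__(self, causes, conflict):
        self.events = set(causes)
        self.causes = causes
        self.conflict = {frozenset(pair) for pair in conflict}

    def history(self, e):
        """[e]: the set of events below e, including e itself."""
        h = {e}
        for c in self.causes[e]:
            h |= self.history(c)
        return h

    def is_configuration(self, x):
        """x must be down-closed and conflict-free."""
        down_closed = all(self.history(e) <= set(x) for e in x)
        conflict_free = all(frozenset((e, f)) not in self.conflict
                            for e, f in combinations(x, 2))
        return down_closed and conflict_free

# The Gaussian example of Sect. 3.2: mu and sigma are minimal events,
# and x1, x2 each depend on both.
es = EventStructure(
    causes={"mu": [], "sigma": [],
            "x1": ["mu", "sigma"], "x2": ["mu", "sigma"]},
    conflict=set())

# Conflict marks the two branches of a conditional as mutually exclusive.
es2 = EventStructure(
    causes={"c": [], "t": ["c"], "f": ["c"]},
    conflict={("t", "f")})
```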
Definition 2
A dependency graph over \( \varGamma \vdash B\) is an event structure G along with a labelling map Open image in new window where any two events \(s, s' \in G\) labelled \(\textsf {Rtn}\, \) are in conflict, and all maximal configurations of G are of the form [r] for \(r \in G\) a return event.
The condition on return events ensures that in any configuration of G there is at most one return event. Events of G are called static events.
We use dependency graphs as a causal representation of programs, reflecting the dependency between different parts of the program. In what follows we enrich this representation with runtime information in order to keep track of the dataflow of the program (in Sect. 3.3), and the associated distributions (in Sect. 3.4).
3.3 Runtime Values and Dataflow Graphs
We have seen how data dependency can be captured by representing a program P as a dependency graph \(G_P\). But observe that this graph does not give any runtime information about the data in P; every event \(s \in G_P\) only carries a label \(\textsf {lbl}(s)\) indicating the class of action it belongs to. (For an event labelled \(\textsf {Rd}\, a\, \), G does not specify the value at a; whereas at runtime this will be filled by an element of \( \llbracket A \rrbracket \) where A is the type of a.)
Such runtime events organise themselves in an event structure \(E_P\), labelled over \(\mathscr {L}^{\texttt {run}}_{\varGamma \vdash B}\), the runtime graph of P. Runtime graphs are in general uncountable, and so difficult to represent pictorially. It can be done in some simple, finite cases: the graph for Open image in new window is depicted on the right. Recall that in dependency graphs conflict was used to represent conditional branches; here instead conflict is used to keep disjoint the possible outcomes of the same static event. (Necessarily, this static event must be a sample or a read, since other actions (return, score) are deterministic.)
Intuitively one can project runtime events to static events by erasing the runtime information; this suggests the existence of a function Open image in new window . This function will turn out to satisfy the axioms of a rigid map of event structures:
Definition 3
A rigid map of event structures \(\pi : E \rightarrow G\) is a function from events of E to events of G such that:
it preserves configurations: for every \(x \in \mathscr {C}(E)\), \(\pi x \in \mathscr {C}(G)\)

it is locally injective: for every \(x \in \mathscr {C}(E)\) and \(e, e' \in x\), if \(\pi (e) = \pi (e')\) then \(e = e'\).

it preserves dependency: if \(e \le _E e'\) then \(\pi (e) \le _G \pi (e').\)
In general \( \pi \) is not injective, since many runtime events may correspond to the same static event – in that case however the axioms will require them to be in conflict. The last condition in the definition ensures that all causal dependencies come from G.
Given Open image in new window we define the possible runtime values for x as the set Open image in new window of functions mapping \(s \in x\) to a runtime value in Open image in new window ; in other words Open image in new window . A configuration \(x'\) of \(E_P\) can be viewed as a trace over \( \pi _P \, x'\); hence Open image in new window is the set of traces of P over x. We can now define dataflow graphs:
Definition 4

\( \pi _S\) is a rigid map and Open image in new window
 for each \(x \in \mathscr {C}(G_S)\), the following function is injective

if \(e, e' \in E_S\) with Open image in new window then \(\pi e = \pi e'\), and moreover e and \(e'\) are either both sample or both read events.
As mentioned above, maximal configurations of \(E_P\) correspond to total traces of P, and will be the states of the Markov chain in Sect. 5. By the second axiom, they can be seen as pairs Open image in new window . Because of the third axiom, \(E_S\) is always confusion-free.
Measurable Fibres. Rigid maps are convenient in this context because they allow for reasoning about program traces by organising them as fibres. The key property we rely on is the following:
Lemma 2
If \(\pi : E \rightarrow G\) is a rigid map of event structures, then the induced map \(\pi : \mathscr {C}(E) \rightarrow \mathscr {C}(G)\) is a discrete fibration: that is, for every \(y \in \mathscr {C}(E)\), if \(x \subseteq \pi y\) for some \(x \in \mathscr {C}(G)\), then there is a unique \(y' \in \mathscr {C}(E)\) such that \(y' \subseteq y\) and \(\pi y' = x\).
This enables an essential feature of our approach: given a configuration x of the dataflow graph G, the fibre \(\pi ^{-1}\{x\}\) over it contains all the (possibly partial) program traces over x, i.e. those whose path through the program corresponds to that of x. Additionally the lemma implies that every pair of configurations \(x, x' \in \mathscr {C}(G)\) such that \(x \subseteq x'\) induces a restriction map \(r_{x, x'} : \pi ^{-1}\{x'\} \rightarrow \pi ^{-1}\{x\}\), whose action on a program trace over \(x'\) is to return its prefix over x.
Although there is no measure-theoretic structure in the definition of dataflow graphs, we can recover it: for every \(x \in \mathscr {C}(G_S)\), the fibre \(\pi _S^{-1}\{x\}\) can be equipped with the \(\sigma \)-algebra induced from Open image in new window via \(q_x\); it is generated by the sets \(q_x^{-1} {U}\) for Open image in new window .
It is easy to check that this makes the restriction map \(r_{x, x'} : \pi _S^{-1}\{x'\} \rightarrow \pi _S^{-1}\{x\}\) measurable for each pair \(x, x'\) of configurations with \(x \subseteq x'\). (Note that this makes \(\mathbf S\) a measurable event structure in the sense of [16].) Moreover, the map Open image in new window , mapping \(x' \in \pi _S^{-1}\{x\}\) to \(\mathbf q(\textsf {lbl}(s'))\) for \(s'\) the unique antecedent by \( \pi _S\) of s in \(x'\), is also measurable.
We will also make use of the following result:
Lemma 3
3.4 Quantitative Dataflow Graphs
We can finally introduce the last bit of information we need about programs in order to perform inference: the probabilistic information. So far, in a dataflow graph, we know when the program is sampling, but not from which distribution. This is resolved by adding for each sample event s in the dependency graph a kernel \(k_s : \pi ^{-1}\{[s)\} \rightsquigarrow \pi ^{-1}\{[s]\}\). Given a trace x over [s), \(k_s\) specifies a probability distribution according to which x will be extended to a trace over [s]. This distribution must of course have support contained in the set \(r_{[s), [s]}^{-1}\{x\}\) of traces over [s] of which x is a prefix; this is the meaning of the technical condition in the definition below.
Definition 5
This axiom stipulates that any extension \(x' \in \pi _S^{-1}\{[s]\}\) of \(x \in \pi _S^{-1}\{[s)\}\) drawn by \(k_s\) must contain x; in effect \(k_s\) only samples the runtime value for s.

If s is a sample event, \(k_s^{S[ \gamma ]} = k_s^S\)
 If s is a read on a : A, any \(x \in \pi ^{-1}\{[s)\}\) has runtime information \(q_{[s)}(x)\) in Open image in new window which can be extended to Open image in new window by mapping s to \( \gamma (a)\):$$k^{S[ \gamma ]}_s(x, q_{[s]}^{-1}U) = \delta _{q_{[s)}(x)[s:= \gamma (a)]}(U)$$

If s is a return or a score event: any \(x \in \pi ^{-1}\{[s)\}\) has at most one extension to \(o(x) \in \pi ^{-1}\{[s]\}\) (because return and score events cannot be involved in a minimal conflict): \(k^{S[ \gamma ]}_s (x, q_{[s]}^{-1}(U)) = \delta _{q_{[s]}(o(x))}(U).\) If o(x) does not exist, we let \(k_s^{S[ \gamma ]}(x, X) = 0\).
From this definition we derive:
Lemma 4
If Open image in new window and Open image in new window are concurrent extensions of x (i.e. \(s_1\) and \(s_2\) are not in conflict), then \(k^{S[ \gamma ]} _{x_1, s_2} \circ k^{S[ \gamma ]} _{x, s_1} = k^{S[ \gamma ]} _{x_2, s_1} \circ k^{S[ \gamma ]} _{x, s_2}\).
Lemma 5
\(\textsf {kernel}(\mathbf S)\) is an s-finite kernel \( \llbracket \varGamma \rrbracket \rightsquigarrow \llbracket B \rrbracket \).
4 Programs as Labelled Event Structures
We now detail our interpretation of programs as quantitative dataflow graphs. Our interpretation is given by induction, similarly to the measuretheoretic interpretation given in Sect. 2.3, in which composition of kernels plays a central role. In Sect. 4.1, we discuss how to compose quantitative dataflow graphs, and in Sect. 4.2, we define our interpretation.
4.1 Composition of Probabilistic Event Structures
Consider two quantitative dataflow graphs, S on \( \varGamma \vdash A\), and T on \( \varGamma , a:A \vdash B\) where a does not occur in \( \varGamma \). In what follows we show how they can be composed to form a quantitative dataflow graph \(T \odot ^{a}_{} S\) on \( \varGamma \vdash B\).
Unlike in the kernel model of Sect. 2.3, we will need two notions of composition. The first one is akin to the usual sequential composition: actions in T must wait on S to return before they can proceed. The second is closer to parallel composition: actions on T which do not depend on a read of the variable a can be executed in parallel with S. The latter composition is used to interpret the let construct. In Open image in new window , we want all the probabilistic actions or reads on other variables which do not depend on the value of a to be in parallel with M. However, in a program such as Open image in new window we do not want any actions of \(N_i\) to start before the selected branch is known, i.e. before the return value of M is known.
The two compositions \(S \odot ^{a}_{\text {par}} T\) and \(S \odot ^{a}_{\text {seq}} T\) are two instances of the same construction, parametrised by a set of labels \(D \subseteq \mathscr {L}^{\texttt {run}}_{\varGamma , a: A \vdash B}\). Informally, D specifies which events of T are to depend on the return value of S in the resulting composition graph. It is natural to assume in particular that D contains all reads on a, and all return events.
 1.
if \(\textsf {lbl}(y)\) intersects D, then x contains a return event
 2.
for all \(t \in y\) with label \(\textsf {Rd}\, a\, v\), there exists an event \(s \in x\) labelled \(\textsf {Rtn}\, v\).

Events: Open image in new window ;

Causality: \( \le _S\ \cup \ \{ ((x, t), (x', t')) \mid x \subseteq x' \wedge t \le t' \}\ \cup \ \{ (s, (x, t)) \mid s \in x \}\);
 Conflict: the symmetric closure of
Lemma 6
Lemma 7
Adding Probability. At this point we have defined all the components of dataflow graphs \(S \odot ^{a}_{D} T\) and \(S \cdot _D T\). We proceed to make them quantitative.
Observe first that each sampling event of \(G_{S \cdot _D T}\) (or equivalently of \(G_{S \odot ^{a}_{D} T}\) – sampling events are never hidden) corresponds either to a sampling event of \(G_S\), or to an event (x, t) where t is a sampling event of \(G_T\). We consider both cases to define a family of kernels \((k_s^{S \cdot _D T})\) between the fibres of \(S\cdot _D T\). This will in turn induce a family \((k_s^{S \odot ^{a}_{D} T})\) on \(S\odot ^{a}_{D} T\).
 If s is a sample event of \(G_S\), we use the isomorphisms \(\varphi _{[s)}\) and \(\varphi _{[s]}\) of Lemma 7 to define:$$k_s^{S \odot ^{a}_{D} T} (v, X) = k_s^S ( \varphi ^{-1}_{[s)}\, v, \varphi ^{-1}_{[s]} X).$$
 If s corresponds to (x, t) for t a sample event of \(G_T\), then for every \(X_x \in \varSigma _{ \pi _S^{-1}\{x\}} \) and \(X_t \in \varSigma _{ \pi ^{-1}_T\{[t]\}}\) we define$$k_{(x, t)}^{S \odot ^{a}_{D} T} ( \langle x', y' \rangle , \varphi ^{-1}_{x, [t]}( X_x \times X_t )) = \delta _{x'}(X_x) \times k_t^T(y', X_t).$$By Lemma 7, the sets \(\varphi ^{-1}_{x, [t]}( X_x \times X_t )\) form a basis for the \(\sigma \)-algebra of the corresponding fibre, so that this definition determines the entire kernel.
So we have defined a kernel \(k_s^{S\cdot _D T}\) for each sample event s of \(G_{S\cdot _D T}\). We move to the composition \((S \odot ^{a}_{D} T).\) Recall that the causal history of a configuration Open image in new window is the set [z], a configuration of \(G_{S \cdot _D T}\). We see that hiding does not affect the fibre structure:
Lemma 8
For any Open image in new window , there is a measurable isomorphism \( \psi _z: \pi _{S \odot ^{a}_{D} T}^{-1}\{z\} \cong \pi _{S \cdot _D T}^{-1}\{[z]\}\).
Lemma 9
\({S}\odot ^{a}_{D} T := (G_{S \odot ^{a}_{D} T}, E_{S\odot ^{a}_{D} T}, \pi _{S\odot ^{a}_{D} T}, (k_s^{S\odot ^{a}_{D} T}))\) is a quantitative dataflow graph on \( \varGamma \vdash B\).
4.2 Interpretation of Programs
We now describe how to interpret programs of our language using quantitative dataflow graphs. To do so we follow the same pattern as for the measure-theoretic interpretation given in Sect. 2.3.
Adequacy of Composition. We now prove that our interpretation is adequate with respect to the measure-theoretic semantics described in Sect. 2.3. Given any subset \(D \subseteq {\mathscr {L}^{\texttt {static}}_{ \varGamma , a: A \vdash B}}\) containing returns and reads on a, we show that the composition \({S}\odot ^{a}_{D} T\) does implement the composition of kernels:
Theorem 1
From this result, we can deduce that the semantics in terms of quantitative dataflow graphs is adequate with respect to the measure-theoretic semantics:
Theorem 2
For every term \( \varGamma \vdash M: A\), \(\textsf {kernel}( \llbracket {M}\rrbracket _{\mathcal G}) = \llbracket {M}\rrbracket \).
5 An Inference Algorithm
In this section, we exploit the intensional semantics defined above to define a Metropolis-Hastings inference algorithm. We start, in Sect. 5.1, by giving a concrete presentation of those quantitative dataflow graphs arising as the interpretation of probabilistic programs; we argue this makes them well-suited for manipulation by an algorithm. Then, in Sect. 5.2, we give a more formal introduction to Metropolis-Hastings sampling than that given in Sect. 3. Finally, in Sect. 5.3, we build the proposal kernel on which our implementation relies, and conclude.
5.1 A Concrete Presentation of Probabilistic Dataflow Graphs
Quantitative dataflow graphs as presented in the previous sections are not easy to handle within an algorithm: among other things, the runtime graph has an uncountable set of events. In this section we show that some dataflow graphs, in particular those needed for modelling programs, admit a finite representation.
Recovering Fibres. Consider a dataflow graph \(\mathbf S = (E_S, G_S, \pi _S)\) on \(\varGamma \vdash B\). It follows from Lemma 3 that the fibre structure of \(\mathbf S\) is completely determined by the spaces \(\pi _S^{-1}\{[s]\}\), for \(s \in G_S\), so we focus on trying to give a simplified representation for those spaces.
In fact this structure is all we need in order to describe a dataflow graph:
Lemma 10
Adding Probabilities. To add probabilities, we simply equip each sample event s of \(G_S\) with a density function Open image in new window .
Definition 6
A concrete quantitative dataflow graph is a tuple Open image in new window where \(d_s(x, \cdot )\) is normalised.
Lemma 11
Any concrete quantitative dataflow graph \(\mathcal S\) unfolds to a quantitative dataflow graph Open image in new window .
We see now that the quantitative dataflow graphs arising as the interpretation of a program must be the unfolding of a concrete quantitative dataflow graph:
Lemma 12
For any concrete quantitative dataflow graphs \(\mathcal S\) on \( \varGamma \vdash A\) and \(\mathcal T\) on \( \varGamma , a: A \vdash B\), Open image in new window is the unfolding of a concrete quantitative dataflow graph. It follows that for any program \( \varGamma \vdash M : B\), Open image in new window is the unfolding of a concrete quantitative dataflow graph.
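For intuition, a concrete quantitative dataflow graph as in Definition 6 might be rendered as a finite record of events, dependencies, conflicts and densities. The following Python sketch is purely illustrative: the field and function names are our own, not from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Tuple

@dataclass
class ConcreteDataflowGraph:
    # Finite set of event names, each tagged 'sample', 'score' or 'return'.
    events: Dict[str, str]
    # Immediate causal dependencies: event -> events it depends on.
    deps: Dict[str, FrozenSet[str]]
    # Pairwise conflict relation (events in incompatible branches).
    conflict: FrozenSet[Tuple[str, str]]
    # For each sample event s, a normalised density d_s(x, v), where x
    # assigns values to the dependencies of s and v is the sampled value.
    density: Dict[str, Callable[[dict, float], float]] = field(default_factory=dict)

    def is_configuration(self, xs: FrozenSet[str]) -> bool:
        """Down-closed, conflict-free sets of events are configurations."""
        down_closed = all(self.deps[e] <= xs for e in xs)
        conflict_free = all((a, b) not in self.conflict
                            for a in xs for b in xs)
        return down_closed and conflict_free
```

Since the event set is finite, such a record is directly manipulable by an inference algorithm, in contrast with the uncountable runtime graph it unfolds to.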
5.2 Metropolis-Hastings
We will use terms of our language to describe computable Markov chains, taking mild liberties with syntax. We assume in particular that programs may call each other as subroutines (this can be done via substitutions), and that manipulating finite structures is computable and thus representable in the language.
In words, the Markov chain works as follows: given a start state x, it generates a proposal for the next state \(x'\) using P. It then computes an acceptance ratio \(\alpha \), which is the probability with which the new sample will be accepted: the return state will then either be the original x or \(x'\), accordingly.
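The transition just described can be sketched in Python (this is a generic illustration, not the paper's implementation; the names `mh_step` and `propose` are ours). Note that the acceptance ratio only involves ratios of the target density d, so d need not be normalised:

```python
import random

def mh_step(x, propose, p, d):
    """One Metropolis-Hastings transition.

    propose : draws a candidate x' from the proposal kernel P(x, .)
    p       : proposal density, p(x, x')
    d       : unnormalised target density
    """
    x_new = propose(x)
    # Acceptance ratio: the normalising constant of d cancels.
    ratio = (d(x_new) * p(x_new, x)) / (d(x) * p(x, x_new))
    alpha = min(1.0, ratio)
    # Accept the proposal with probability alpha, else keep x.
    return x_new if random.random() < alpha else x
```

Iterating `mh_step` from an initial state yields a chain whose distribution, under conditions such as those of Theorem 3 below, converges to the normalisation of d.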
Assuming P and p satisfy a number of conditions, the algorithm is correct:
Theorem 3
 1.
Strong irreducibility: for all \(x \in \mathbb A\) and \(X \in \varSigma _{\mathbb A}\) such that \(D(X) \ne 0\) and \(d(x) > 0\), there exists \(n \in \mathbb {N} \) such that Open image in new window .
 2.
 3.
If \(d(x) > 0\) and \(p(x, y) > 0\) then \(d(y) > 0\).
 4.
If \(d(x) > 0\) and \(d(y) > 0\), then \(p(x, y) > 0\) iff \(p(y, x) > 0\).
Then, the limit of \(\texttt {MH}(P, p, d)\) for any initial state \(x \in \mathbb A\) with \(d(x) > 0\) is equal to D, the distribution obtained after normalising d.
5.3 Our Proposal Kernel
Lemma 13
For all \(X \in \varSigma _{\mathscr {C}(E_S)}\), \( \displaystyle {\mu ^{\mathcal S} (X) = \int _{y \in X} d_S(y) \mathrm {d}y.}\)
Note that \(d_S(x, q)\) is easy to compute, but it is not normalised. Computing the normalising factor is in general intractable, but the MetropolisHastings algorithm does not require the density to be normalised.
Accordingly, we focus on designing a MetropolisHastings algorithm for sampling values in Open image in new window following the (unnormalised) density \(d_S\). We start by defining a proposal kernel for this algorithm.
To avoid overburdening the notation, we will no longer distinguish between a type and its denotation. Since \(G_S\) is finite, it can be represented by a type, and so can Open image in new window . Moreover, Open image in new window is a subset of \(\sum _{x \in \mathscr {C}(G_S)} \mathscr {Q}(x)\) which is also representable as the type of pairs Open image in new window . Operations on \(G_S\) and related objects are all computable and measurable so we can directly use them in the syntax. In particular, we will make use of the function Open image in new window which for each configuration Open image in new window returns \((1, \mathfrak s)\) if there exists Open image in new window with \(o_s(q\vert _{[s)})\) defined, and \((2, *)\) if (x, q) is maximal.

Pick a sample event \(s \in x\) uniformly at random from the set of sample events of x.

Construct Open image in new window .

Return a maximal extension \((x', q')\) of \((x_0, q\vert _{x_0})\), sampling only those sample events of \(x'\) which are not in x.
The last step follows the single-site MH principle: sample events in \(x \cap x'\) have already been evaluated in x, and are not updated. However, events which are in \(x' \setminus x\) belong to conditional branches not explored in x; they must be sampled.

If s is not a sample event (since S is closed, it must be a return or a score event), we use the function \(o_s\).

If s is a sample event occurring in x, we use the value in q.

If s is a sample event not occurring in x, we sample a value for it.
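The proposal steps and the three evaluation cases above can be sketched as follows. This is a simplified rendering under our own naming conventions: `run` stands for re-executing the program against a value oracle, so the restriction to the causal history of s is performed implicitly by re-execution rather than on an explicit graph.

```python
import random

def single_site_proposal(samplers, run, x, q):
    """One single-site proposal over a configuration (x, q).

    samplers : sample event -> function drawing a fresh value
    run      : executes the program, asking `value_for` for the value
               of each sample event it reaches; returns the maximal
               configuration reached and its sample values
    x, q     : current configuration and its sample values
    """
    # 1. Pick a sample event s of x uniformly at random.
    s = random.choice(sorted(e for e in x if e in samplers))

    # 2./3. Resample s; reuse q for sample events already in x
    # (they are not updated); sample fresh values only for events
    # of x' \ x, which lie in branches not explored by x.
    def value_for(t):
        if t == s:
            return samplers[s]()      # the resampled site
        if t in q:
            return q[t]               # already evaluated in x
        return samplers[t]()          # newly explored branch
    return run(value_for)
```

Return and score events carry no random choice, so `run` would evaluate them directly, corresponding to the use of \(o_s\) in the first case above.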
Theorem 4
The Markov chain \(P_S\) and density p satisfy the hypotheses of Theorem 3; as a result, for any Open image in new window the distribution Open image in new window tends to \(\mu _{\text {norm}}^P\) as n goes to infinity.
One can thus sample from Open image in new window using the algorithm above, keeping only the return value of the obtained configuration.
Let us restate the key advantage of our approach: with access to the data dependency information, completing a proposal requires fewer steps in general, because at each proposal step only a portion of the graph needs exploring.
6 Conclusion
Related Work. There are numerous approaches to the semantics of programs with random choice. Among those concerned with statistical applications of probabilistic programming are Staton et al. [18, 19], Ehrhard et al. [7], and Dahlqvist et al. [6]. A game semantics model was announced in [15].
The work of Ścibior et al. [17] was influential in suggesting a denotational approach for proving correctness of inference, in the framework of quasi-Borel spaces [9]. It is not clear however how one could reason about data dependencies in this framework, because of the absence of explicit causal information.
Hur et al. [11] give a proof of correctness for Trace MCMC using new forms of operational semantics for probabilistic programs. This method is extended to higher-order programs with soft constraints in Borgström et al. [2]. However, these approaches do not consider incremental recomputation.
To the best of our knowledge, this is the first work addressing formal correctness of incremental recomputation in MCMC. However, methods exist which take advantage of data dependency information to improve the performance of each proposal step in “naive” Trace MCMC. We mention in particular the work on slicing by Hur et al. [10]; other approaches include [5, 24]. In the present work we claim no immediate improvement in performance over these techniques, but only a mathematical framework for reasoning about the structures involved.
It is worth remarking that our event structure representation is reminiscent of the graphical model representations made explicit in some languages. Indeed, for a first-order language such as the one of this paper, Bayesian networks can directly be used as a semantics, see [20]. We claim that the alternative view offered by event structures will allow for an easier extension to higher-order programs, using ideas from game semantics.
Perspectives. This is the start of an investigation into intensional semantics for probabilistic programs. Note that the framework of event structures is very flexible and the semantics presented here is by no means the only possible one. Additionally, though the present work only treats the case of a first-order language, we believe that building on recent advances in probabilistic concurrent game semantics [3, 16] (from which the present work draws much inspiration), we can extend the techniques of this paper to arbitrary higher-order probabilistic programs with recursion.
Acknowledgements
We thank the anonymous referees for helpful comments and suggestions. We also thank Ohad Kammar for suggesting the idea of using causal structures for reasoning about data dependency in this context. This work has been partially sponsored by: EPSRC EP/K034413/1, EP/K011715/1, EP/L00058X/1, EP/N027833/1, EP/N028201/1, and an EPSRC PhD studentship.
References
1. Billingsley, P.: Probability and Measure. John Wiley & Sons, New York (2008)
2. Borgström, J., Lago, U.D., Gordon, A.D., Szymczak, M.: A lambda-calculus foundation for universal probabilistic programming. In: ACM SIGPLAN Notices, vol. 51, pp. 33–46. ACM (2016)
3. Castellan, S., Clairambault, P., Paquet, H., Winskel, G.: The concurrent game semantics of probabilistic PCF. In: 2018 33rd Annual ACM/IEEE Symposium on Logic in Computer Science (LICS). ACM/IEEE (2018)
4. Castellan, S., Paquet, H.: Probabilistic programming inference via intensional semantics. Technical report (2019). http://iso.mor.phis.me/publis/esop19.pdf
5. Chen, Y., Mansinghka, V., Ghahramani, Z.: Sublinear approximate inference for probabilistic programs. stat 1050, 6 (2014)
6. Dahlqvist, F., Danos, V., Garnier, I., Silva, A.: Borel kernels and their approximation, categorically. arXiv preprint arXiv:1803.02651 (2018)
7. Ehrhard, T., Pagani, M., Tasson, C.: Measurable cones and stable, measurable functions: a model for probabilistic higher-order programming, vol. 2, pp. 59:1–59:28 (2018)
8. Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: Proceedings of the Future of Software Engineering, pp. 167–181. ACM (2014)
9. Heunen, C., Kammar, O., Staton, S., Yang, H.: A convenient category for higher-order probability theory. In: LICS 2017, Reykjavik, pp. 1–12 (2017)
10. Hur, C.K., Nori, A.V., Rajamani, S.K., Samuel, S.: Slicing probabilistic programs. In: ACM SIGPLAN Notices, vol. 49, pp. 133–144. ACM (2014)
11. Hur, C.K., Nori, A.V., Rajamani, S.K., Samuel, S.: A provably correct sampler for probabilistic programs. In: LIPIcs–Leibniz International Proceedings in Informatics, vol. 45. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2015)
12. Kiselyov, O.: Probabilistic programming language and its incremental evaluation. In: Igarashi, A. (ed.) APLAS 2016. LNCS, vol. 10017, pp. 357–376. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47958-3_19
13. Kiselyov, O.: Problems of the lightweight implementation of probabilistic programming. In: Proceedings of Workshop on Probabilistic Programming Semantics (2016)
14. Kozen, D.: Semantics of probabilistic programs. J. Comput. Syst. Sci. 22(3), 328–350 (1981)
15. Ong, L., Vákár, M.: S-finite kernels and game semantics for probabilistic programming. In: POPL 2018 Workshop on Probabilistic Programming Semantics (PPS) (2018)
16. Paquet, H., Winskel, G.: Continuous probability distributions in concurrent games. Electr. Notes Theor. Comput. Sci. 341, 321–344 (2018)
17. Ścibior, A., et al.: Denotational validation of higher-order Bayesian inference. In: Proceedings of the ACM on Programming Languages, vol. 2(POPL), p. 60 (2017)
18. Staton, S.: Commutative semantics for probabilistic programming. In: Yang, H. (ed.) ESOP 2017. LNCS, vol. 10201, pp. 855–879. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-54434-1_32
19. Staton, S., Yang, H., Wood, F.D., Heunen, C., Kammar, O.: Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints. In: Proceedings of LICS 2016, New York, NY, USA, July 5–8, 2016, pp. 525–534 (2016)
20. van de Meent, J.W., Paige, B., Yang, H., Wood, F.: An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756 (2018)
21. Wingate, D., Stuhlmüller, A., Goodman, N.: Lightweight implementations of probabilistic programming languages via transformational compilation. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 770–778 (2011)
22. Winskel, G.: Event structures. In: Brauer, W., Reisig, W., Rozenberg, G. (eds.) ACPN 1986. LNCS, vol. 255, pp. 325–392. Springer, Heidelberg (1987). https://doi.org/10.1007/3-540-17906-2_31
23. Winskel, G.: Distributed probabilistic and quantum strategies. Electr. Notes Theor. Comput. Sci. 298, 403–425 (2013)
24. Wu, Y., Li, L., Russell, S., Bodik, R.: Swift: compiled inference for probabilistic programming languages. arXiv preprint arXiv:1606.09242 (2016)
25. Yang, L., Hanrahan, P., Goodman, N.: Generating efficient MCMC kernels from probabilistic programs. In: Artificial Intelligence and Statistics, pp. 1068–1076 (2014)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.