Abstract
We show that a measure-based denotational semantics for probabilistic programming is commutative.
The idea underlying probabilistic programming languages (Anglican, Church, Hakaru, etc.) is that programs express statistical models as a combination of prior distributions and likelihood of observations. The product of prior and likelihood is an unnormalized posterior distribution, and the inference problem is to find the normalizing constant. One common semantic perspective is thus that a probabilistic program is understood as an unnormalized posterior measure, in the sense of measure theory, and the normalizing constant is the measure of the entire semantic domain.
A programming language is said to be commutative if only data flow is meaningful; control flow is irrelevant, and expressions can be reordered. It has been unclear whether probabilistic programs are commutative, because it is well-known that Fubini–Tonelli theorems for reordering integration fail in general. We show that probabilistic programs are in fact commutative, by characterizing the measures/kernels that arise from programs as ‘s-finite’, i.e. sums of finite measures/kernels.
The result is of theoretical interest, but also of practical interest, because program transformations based on commutativity help with symbolic inference and can improve the efficiency of simulation.
1 Introduction
The key idea of probabilistic programming is that programs describe statistical models. Programming language theory can give us tools to build and analyze the models. Recall Bayes’ law: the posterior probability is proportional to the product of the likelihood of observed data and the prior probability.
One way to understand a probabilistic program is that it describes the measure that is the product of the likelihood and the prior. This product is typically not a probability measure, it does not sum to one. The inference problem is to find the normalizing constant so that we can find (or approximate) the posterior probability measure.
A probabilistic programming language is an ML-like programming language with three special constructs, corresponding to the three terms in Bayes’ law:

sample, which draws from a prior distribution, which may be discrete (like a Bernoulli distribution) or continuous (like a Gaussian distribution);

score, or observe, which records the likelihood of a particular observed data point, sometimes called ‘soft conditioning’;

normalize, which finds the normalization constant and the posterior probability distribution.
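Operationally (and only as an intuition for the measure semantics developed below), these three constructs can be simulated by likelihood weighting: each run draws from the priors and accumulates a weight at each score, and averaging the weights over many runs approximates the normalizing constant. A minimal Python sketch, in which all function names are ours:

```python
import random

def run_weighted(model):
    """Run `model` once; it receives `sample` and `score` callbacks and
    returns a value. The accumulated weight is prior x likelihood."""
    state = {"w": 1.0}
    def sample(draw):        # draw: () -> a value from a prior
        return draw()
    def score(likelihood):   # multiply in a likelihood factor
        state["w"] *= likelihood
    value = model(sample, score)
    return value, state["w"]

def normalize(model, runs=100_000):
    """Approximate the normalizing constant and the posterior mean of
    the (real-valued) return value by likelihood weighting."""
    total_w = total_wv = 0.0
    for _ in range(runs):
        v, w = run_weighted(model)
        total_w += w
        total_wv += w * v
    return total_w / runs, total_wv / total_w
```

For a fair-coin model that scores 2.0 on heads and 1.0 on tails, the estimated normalizing constant tends to \(0.5\times 2+0.5\times 1=1.5\).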
The implementation of normalize typically involves simulation. One hope is that we can use program transformations to improve the efficiency of this simulation, or even to symbolically calculate the normalizing constant. We turn to some transformations of this kind in Sect. 4.1. But a very first program transformation is to reorder the lines of a program, as long as the data dependencies are preserved, e.g.
where x not free in u, y not free in t. This is known as commutativity. For example, in a traditional programming language with memory, this transformation is valid provided t and u reference different locations. In probabilistic programming, a fundamental intuition is that programs are stateless. From a practical perspective, it is essential to be able to reorder lines and so access more sophisticated program transformations (e.g. Sect. 4.1); reordering lines can also affect the efficiency of simulation. The main contribution of this paper is the result:
Theorem 4 (Sect. 4.2). The commutativity Eq. (2) is always valid in probabilistic programs.
1.1 A First Introduction to Probabilistic Programming
To illustrate the key ideas of probabilistic programming, consider the following simple problem, which we explain in English and then specify as a probabilistic program.

1. A telephone operator has forgotten what day it is.
2. He receives on average ten calls per hour in the week and three calls per hour at the weekend.
3. He observes four calls in a given hour.
4. What is the probability that it is a week day?
We describe this as a probabilistic program as follows:
Lines 2–5 describe the combination of the likelihood and the prior. First, on line 2, we sample from the prior: the chance that it is a week day is \(\frac{5}{7}\). On line 3, we set the rate of calls, depending on whether it is a week day. On line 4 we record the observation that four calls were received when the rate was r, using the Poisson distribution. For a discrete distribution, the likelihood is the probability of the observation point, which for the Poisson distribution with rate r is \(r^4e^{-r}/4!\).
We thus find a semantics for lines 2–5, an unnormalized posterior measure on \(\{\mathsf {true},\mathsf {false}\}\), by considering the only two paths through the program, depending on the outcome of the Bernoulli trial.

The Bernoulli trial (line 2) produces \(\mathsf {true}\) with prior probability \(\frac{5}{7}\) (it is a week day), and then the rate is 10 (line 3) and so the likelihood of the data is \(10^4e^{-10}/4!\approx 0.019\) (line 4). So the unnormalized posterior probability of \(\mathsf {true}\) is \(\frac{5}{7}\times 0.019\approx 0.014\) (prior\(\times \)likelihood).

The Bernoulli trial produces \(\mathsf {false}\) with prior probability \(\frac{2}{7}\) (it is the weekend), and then the likelihood of the observed data is \(3^4e^{-3}/4!\approx 0.168\); so the unnormalized posterior measure of \(\mathsf {false}\) is \(\frac{2}{7}\times 0.168\approx 0.048\).
The measure \((\mathsf {true}\mapsto 0.014,\mathsf {false}\mapsto 0.048\)) is not a probability measure because it doesn’t sum to 1. To build a probability measure we divide by \(0.014+0.048=0.062\), to get a posterior probability measure \((\mathsf {true}\mapsto 0.22,\mathsf {false}\mapsto 0.78)\). The normalizing constant, 0.062, is sometimes called model evidence; it is an indication of how well the data fits the model.
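This two-path calculation is easy to check by direct enumeration; the following sketch (function names ours) reproduces the numbers above:

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    # likelihood of observing k events under a Poisson with this rate
    return rate**k * exp(-rate) / factorial(k)

prior = {True: 5 / 7, False: 2 / 7}   # week day vs weekend
rate = {True: 10.0, False: 3.0}       # calls per hour

# unnormalized posterior: prior x likelihood of the four observed calls
unnorm = {d: prior[d] * poisson_pmf(4, rate[d]) for d in prior}
evidence = sum(unnorm.values())       # model evidence, about 0.062
posterior = {d: unnorm[d] / evidence for d in unnorm}
```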
Next we consider a slightly different problem. Rather than observing four calls in a given hour, suppose the telephone operator merely observes that the time between two given calls is 15 min. We describe this as a probabilistic program as follows:
The difference here is that the observation is from the exponential distribution (\( exp (r)\)), which is a continuous distribution. In Bayesian statistics, the likelihood of a continuous distribution is taken to be the value of the probability density function at the observation point. The density function of the exponential distribution \( exp (r)\) with rate r is \((x\mapsto re^{-rx})\). So if the decay rate is 10, the likelihood of time \(\frac{15}{60}\) is \(10e^{-2.5}\approx 0.82\), and if the decay rate is 3, the likelihood is \(3e^{-0.75}\approx 1.42\). We thus find that the unnormalized posterior measure of \(\mathsf {true}\) is \(\frac{5}{7}\times 0.82\approx 0.586\) (prior\(\times \)likelihood), and the unnormalized posterior measure of \(\mathsf {false}\) is \(\frac{2}{7}\times 1.42\approx 0.405\). In this example, the model evidence is \(0.586+0.405\approx 0.991\). We divide by this to find the normalized posterior, which is \((\mathsf {true}\mapsto 0.592,\mathsf {false}\mapsto 0.408)\).
In these simple examples, there are only two paths through the program. In general the prior may be a continuous distribution over an uncountable set, such as the uniform distribution on an interval, in which case a simulation can only find an approximate normalizing constant. Suppose that the telephone operator does not know what time it is, but knows a function \(f:[0,24)\rightarrow (0,\infty )\) mapping each time of day to the average call rate. Then by solving the following problem, he can ascertain a posterior probability distribution for the current time.
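A simulation for a model like this can be sketched by likelihood weighting over the continuous prior. Here the rate function f is invented for illustration (the operator's actual f is not specified):

```python
import random
from math import exp, factorial

def f(t):
    # hypothetical average call rate at time-of-day t (invented here)
    return 2.0 + 8.0 * (9 <= t < 17)   # busier during office hours

def poisson_pmf(k, rate):
    return rate**k * exp(-rate) / factorial(k)

def weighted_samples(n, observed_calls=4):
    """Draw times from the uniform prior on [0, 24) and weight each by
    the likelihood of the observation; the mean weight approximates the
    normalizing constant, and the weighted times approximate the
    posterior over the current time."""
    out = []
    for _ in range(n):
        t = random.uniform(0.0, 24.0)             # sample the prior
        out.append((t, poisson_pmf(observed_calls, f(t))))
    return out
```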
Although simulation might only be approximate, we can give a precise semantics to the language using measure theory. In brief,

programs of type \(\mathbb {A}\) are interpreted as measures on \(\mathbb {A}\), and more generally expressions of type \(\mathbb {A}\) with free variables in \(\varGamma \) are measure kernels \(\varGamma \rightsquigarrow \mathbb {A}\);

sampling from a prior describes a probability measure;

observations are interpreted by multiplying the measure of a path by the likelihood of the data;

sequencing is Lebesgue integration: \(\mathsf {let}\, x=t\,\mathsf {in}\, u\approx \int t(\mathrm {d}x)\, u\);

normalization finds the measure of the whole space, the normalizing constant.
To put it another way, the programming language is a language for building measures. For full details, see Sect. 3.2.
1.2 Commutativity and Infinite Measures
If, informally, sequencing is integration, then commutativity laws such as (2) amount to changing the order of integration, e.g.
A first nontrivial fact of measure theory is Fubini’s theorem: for finite measures, Eq. (4) holds. However, commutativity theorems like this do not hold for arbitrary infinite measures. In fact, if we deal with arbitrary infinite measures, we do not even know whether sequencing \(\int t(\mathrm {d}x)\, v\) is a genuine measure kernel. As we will show, for the measures that are definable in our language, sequencing is well defined, and commutativity does hold. But let us first emphasize that infinite measures appear to be unavoidable because

there is no known useful syntactic restriction that enforces finite measures;

a program with finite measure may have a subexpression with infinite measure, and this can be useful.
To illustrate these points, consider the following program, a variation on (3).
Here \( gauss (0,1)\) is the standard Gaussian distribution with mean 0 and standard deviation 1; recall that its density f is \(f(x)=\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}}\). The illustration on the right shows the unnormalized posterior for (5) as the observed data goes from \(d=0.1\) (blue dotted line) to \(d=0\) (red straight line). Notice that at \(d=0\), the resulting unnormalized posterior measure on \(\mathbb {R}\) is the flat Lebesgue measure on \(\mathbb {R}\), which assigns to each interval (m, n) its size, \((n-m)\). The Lebesgue measure of the entire real line, the would-be normalizing constant, is \(\infty \), so we cannot find a posterior probability measure. A statistician would probably not be very bothered about this, because a tiny change in the observed data yields a finite normalizing constant. But that is not good enough for a semanticist, who must give a meaning to every program.
It is difficult to see how a simple syntactic restriction could eliminate program (5) while keeping other useful programs such as (3). Another similar program is
where \({g(x)=\frac{1}{\pi (1+x^2)}}\) is the density function of the standard Cauchy distribution and \(\mathsf {score}(r)\) is shorthand for \((\mathsf {observe}\,0\,\mathsf {from}\, exp (r))\)—recall that the density of the exponential distribution \( exp (r)\) at 0 is \({r=re^{-r\times 0}}\). Program (6) is the importance sampling algorithm for simulating a Cauchy distribution from a Gaussian. To see why this algorithm is correct, i.e. (6) = \(\mathsf {sample}( cauchy (0,1))\), it is helpful to rewrite it:
Notice that the underlined subexpression is the Lebesgue measure, as in (5), and recall that sequencing is integration. So program (6) is correct because it is integrating the density g over the Lebesgue measure; this is equal to the Cauchy probability measure, by definition of density.
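For intuition about why such care is needed, a standard counterexample shows how the interchange (4) can fail once non-s-finite measures are allowed: take the Lebesgue measure \(\lambda \) and the counting measure \(\gamma \) on [0, 1] (the latter is not s-finite), and integrate the characteristic function of the diagonal in the two orders.

```latex
\int_{[0,1]}\gamma(\mathrm{d}y)\int_{[0,1]}\lambda(\mathrm{d}x)\,[x=y]
  \;=\; \int_{[0,1]}\gamma(\mathrm{d}y)\;0 \;=\; 0
\qquad\text{but}\qquad
\int_{[0,1]}\lambda(\mathrm{d}x)\int_{[0,1]}\gamma(\mathrm{d}y)\,[x=y]
  \;=\; \int_{[0,1]}\lambda(\mathrm{d}x)\;1 \;=\; 1.
```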
1.3 Commutativity Through s-Finite Kernels
It is known that commutativity holds not just for finite measures but also for s-finite measures, which are formed from a countable sum of finite measures. The key contribution of this paper is that all closed probabilistic programs define s-finite measures. To show this compositionally, we must also give a semantics to open programs, which we interpret using a notion of s-finite kernel (Definition 2), which is a countable sum of finite, bounded kernels; these support sequential composition (Lemma 3). Iterated integrals and interchange (4) are no problem for s-finite measures (Proposition 5). We conclude (Theorem 4) that the commutativity Eq. (2) is always valid in probabilistic programs.
Moreover, s-finite kernels are exactly what is needed, because:
Theorem 6 (Sect. 5.1). The following are equivalent:

a probabilistic program expression of type \(\mathbb {A}\) and free variables in \(\varGamma \);

an s-finite kernel \(\varGamma \rightsquigarrow \mathbb {A}\).
(The probabilistic programming language here is an idealized one that includes countable sum types, all measurable functions, and all probability distributions.)
Summary of Contribution. We use s-finite kernels to provide the first semantic model (Sect. 3.2) of a probabilistic programming language that

interprets programs such as those in Sect. 1.1;

supports basic program transformations such as commutativity (Theorem 4);

justifies program transformations based on statistical ideas such as conjugate priors, importance sampling and resampling, in a compositional way (Sect. 4.1).
In Sect. 6 we relate our contributions to earlier attempts at this problem.
2 Preliminaries
2.1 Measures and Kernels
Measure theory generalizes the ideas of size and probability distribution from countable discrete sets to uncountable sets. To motivate, recall that if we sample a real number from a standard Gaussian distribution then it is impossible that we should sample the precise value 0, even though that is the expected value. We resolve this apparent paradox by recording the probability that the sample drawn lies within an interval, or more generally, a measurable set. For example, a sample drawn from a standard Gaussian distribution will lie in the interval \((-1,1)\) with probability 0.68. We now recall some rudiments of measure theory; see e.g. [32] for a full introduction.
A \(\sigma \)-algebra on a set X is a collection of subsets of X that contains \(\emptyset \) and is closed under complements and countable unions. A measurable space is a pair \((X,\varSigma _X)\) of a set X and a \(\sigma \)-algebra \(\varSigma _X\) on it. The sets in \(\varSigma _X\) are called measurable sets.
For example, the Borel sets are the smallest \(\sigma \)-algebra on \(\mathbb {R}\) that contains the intervals. We will always consider \(\mathbb {R}\) with this \(\sigma \)-algebra. Similarly the Borel sets on \([0,\infty ]\) are the smallest \(\sigma \)-algebra containing the intervals. For any countable set (e.g. \(\mathbb {N}\), \(\{0,1\}\)) we will consider the discrete \(\sigma \)-algebra, where all sets are measurable.
A measure on a measurable space \((X,\varSigma _X)\) is a function \(\mu :\varSigma _X\rightarrow [0,\infty ]\) into the set \([0,\infty ]\) of extended nonnegative reals that takes countable disjoint unions to sums, i.e. \(\mu (\emptyset )=0\) and \(\mu (\biguplus _{n\in \mathbb {N}}U_n)=\sum _{n\in \mathbb {N}}\mu (U_n)\) for any \(\mathbb {N}\)indexed sequence of disjoint measurable sets U. A probability measure is a measure \(\mu \) such that \(\mu (X)=1\).
For example, the Lebesgue measure \(\lambda \) on \(\mathbb {R}\) is generated by \(\lambda (a,b)=b-a\). For any \(x\in X\), the Dirac measure \(\delta _x\) has \(\delta _x(U)=[x\in U]\). (Here and elsewhere we regard a property, e.g. \([x\in U]\), as its characteristic function \(X\rightarrow \{0,1\}\).) To give a measure on a countable discrete measurable space X it is sufficient to assign an element of \([0,\infty ]\) to each element of X. For example, the counting measure \(\gamma \) is determined by \(\gamma (\{x\})=1\) for all \(x\in X\).
A function \(f:X\rightarrow Y\) between measurable spaces is measurable if \(f^{-1}(U)\in \varSigma _X\) for all \(U\in \varSigma _Y\). This ensures that we can form a pushforward measure \(f_*\mu \) on Y out of any measure \(\mu \) on X, with \((f_*\mu )(U)=\mu (f^{-1}(U))\).
For example, the arithmetic operations on \(\mathbb {R}\) are all measurable. If \(U\in \varSigma _X\) then the characteristic function \([-\in U]:X\rightarrow \{0,1\}\) is measurable.
We can integrate a measurable function \(f:X\rightarrow [0,\infty ]\) over a measure \(\mu \) on X to get a number \(\int _X\mu (\mathrm {d}x)\,f(x)\in [0,\infty ]\). (Some authors use different notation, e.g. \(\int f\,\mathrm {d}\mu \).) Integration satisfies the following properties (e.g. [32, Theorem 12]): \(\int _X\mu (\mathrm {d}x)\,[x\in U]=\mu (U)\), \(\int _X\mu (\mathrm {d}x)\,rf(x)=r\int _X\mu (\mathrm {d}x)\,f(x)\), \(\int _X\mu (\mathrm {d}x)\,0=0\), \(\int _X\mu (\mathrm {d}x)\,(f(x)+g(x))=(\int _X\mu (\mathrm {d}x)\,f(x))+(\int _X\mu (\mathrm {d}x)\,g(x))\), and
for any monotone sequence \(f_1\le f_2\le \dots \) of measurable functions \({f:X\rightarrow [0,\infty ]}\). These properties entirely determine integration, since every measurable function is a limit of a monotone sequence of simple functions [32, Lemma 11]. It follows that countable sums commute with integration:
For example, integration over the Lebesgue measure on \(\mathbb {R}\) is Lebesgue integration, generalizing the idea of the area under a curve. Integration with respect to the counting measure on a countable discrete space is just summation, e.g. \(\int _\mathbb {N}\gamma (\mathrm {d}i)\, f(i)=\sum _{i\in \mathbb {N}} f(i)\).
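On a countable discrete space these notions can be sketched directly in code (the representation of a measure as a dictionary of point masses is ours): integration is just a weighted sum.

```python
def dirac(x):
    # Dirac measure: unit mass at the single point x
    return {x: 1.0}

def counting(points):
    # counting measure on a finite or countable family of points
    return {p: 1.0 for p in points}

def measure_of(mu, U):
    # mu(U): total mass of the points lying in U
    return sum(w for x, w in mu.items() if x in U)

def integrate(mu, f):
    # integral of f over mu: a weighted sum over the point masses
    return sum(w * f(x) for x, w in mu.items())
```

Integration over a Dirac measure evaluates the function, and integration over the counting measure is summation, matching the examples above.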
We can use integration to build new measures. If \(\mu \) is a measure on X and \(f:X\rightarrow [0,\infty ]\) is measurable then we define a measure \(\mu _f\) on X by putting \(\mu _f(U)=\int _X\mu (\mathrm {d}x)\,[x\in U]\, f(x)\). We say f is the density function for \(\mu _f\). For example, the function \(x\mapsto \frac{1}{\sqrt{2 \pi }}e^{-\frac{1}{2} x^2}\) is the density function for the standard Gaussian probability measure on \(\mathbb {R}\) with respect to the Lebesgue measure.
A kernel k from X to Y is a function \({k:X\times \varSigma _Y\rightarrow [0,\infty ]}\) such that each \(k(x,-):\varSigma _Y\rightarrow [0,\infty ]\) is a measure and each \(k(-,U):X\rightarrow [0,\infty ]\) is measurable. Because each \(k(x,-)\) is a measure, we can integrate any measurable function \(f:Y\rightarrow [0,\infty ]\) to get \(\int _Yk(x,\mathrm {d}y)\,f(y)\in [0,\infty ]\). We write \(k:X \rightsquigarrow Y\) if k is a kernel. We say that k is a probability kernel if \(k(x,Y)=1\) for all \(x\in X\).
2.2 s-Finite Measures and Kernels
We begin with a lemma about sums of kernels.
Proposition 1
Let X, Y be measurable spaces. If \(k_1\dots k_n\dots : X\rightsquigarrow Y\) are kernels then the function \((\sum _{i=1}^\infty k_i):X\times \varSigma _Y\rightarrow [0,\infty ]\) given by
is a kernel \(X\rightsquigarrow Y\). Moreover, for any measurable function \(f:Y\rightarrow [0,\infty ]\),
Proof
That \(\sum _{i\in \mathbb {N}} k_i:X\times \varSigma _Y\rightarrow [0,\infty ]\) is a kernel is quite straightforward: it is measurable in X because a countable sum of measurable functions is measurable (e.g. [32, Sect. 2.2]); it is a measure in Y because countable positive sums commute:
The second part of the proposition follows once we understand that every measurable function \({f:Y\rightarrow [0,\infty ]}\) is a limit of simple functions and apply the monotone convergence theorem (7).
Definition 2
Let X, Y be measurable spaces. A kernel \(k:X\rightsquigarrow Y\) is finite if there is a finite \(r\in [0,\infty )\) such that, for all x, \(k(x,Y)<r\).
A kernel \(k:X\rightsquigarrow Y\) is s-finite if there is a sequence \(k_1\dots k_n\dots \) of finite kernels such that \(\sum _{i=1}^\infty k_i=k\).
Note that the bound in the finiteness condition, and the choice of sequence in the s-finiteness condition, are uniform across all arguments to the kernel.
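For a concrete example (ours) of a kernel that is s-finite but not finite, take \(k:[0,\infty )\rightsquigarrow 1\) with \(k(x,\{\star \})=x\): there is no uniform bound, but k is the sum of finite kernels \(k_i\) that each carry the part of the mass lying in the band \([i,i+1)\), uniformly bounded by 1. A numeric sketch:

```python
def k(x):
    # kernel [0, oo) ~> 1: mass x on the one-point space; not a finite
    # kernel, because k(x) is unbounded in x
    return x

def k_i(i, x):
    # i-th finite piece: the part of the mass in the band [i, i+1),
    # uniformly bounded by 1
    if x >= i + 1:
        return 1.0
    if x >= i:
        return x - i
    return 0.0

def k_sum(x, terms=10_000):
    # partial sum of the decomposition; equals k(x) once terms > x
    return sum(k_i(i, x) for i in range(terms))
```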
The definition of s-finite kernel also appears in recent work by Kallenberg [20] and Last and Penrose [23, Appendix A]. The idea of s-finite measures is perhaps more established ([9, Lemma 8.6], [39, Sect. A.0]).
3 Semantics of a Probabilistic Programming Language
We give a typed first order probabilistic programming language in Sect. 3.1, and its semantics in Sect. 3.2. The semantics is new: we interpret programs as s-finite kernels. The idea of interpreting programs as kernels is old (e.g. [21]), but the novelty here is that we can treat infinite measures. It is not a priori obvious that a compositional denotational semantics based on kernels makes sense for infinite measures; the trick is to use s-finite kernels as an invariant, via Lemma 3.
3.1 A Typed First Order Probabilistic Programming Language
Our language syntax is not novel: it is the same language as in [43], and as such an idealized, typed, first order version of Anglican [46], Church [11], Hakaru [30], Venture [26] and so on.
Types. The language has types
where I ranges over countable, nonempty sets. Alongside the usual sum and product types, we have a special type \(\mathbb {R}\) of real numbers and types \(\mathsf {P}(\mathbb {A})\) of probability distributions. For example, \((1+1)\) is a type of booleans, and \(\mathsf {P}(1+1)\) is a type of distributions over booleans, and \(\sum _{i\in \mathbb {N}}1\) is a type of natural numbers. This is not a genuine programming language because we include countably infinite sums rather than recursion schemes; this is primarily because countably infinite disjoint unions play such a crucial role in classical measure theory, and constructive measure theory is an orthogonal issue (but see e.g. [1]).
Types \(\mathbb {A}\) are interpreted as measurable spaces \([\![ \mathbb {A} ]\!]\).

\([\![ \mathbb {R} ]\!]\) is the measurable space of reals, with its Borel sets.

\([\![ \mathsf {P}(\mathbb {A}) ]\!]\) is the set \(P([\![ \mathbb {A} ]\!])\) of probability measures on \([\![ \mathbb {A} ]\!]\) together with the \(\sigma \)-algebra generated by the sets \({\{\mu \mid \mu (U)<r\}}\) for each \(U\in \varSigma _{[\![ \mathbb {A} ]\!]}\) and \({r\in [0,1]}\) (the ‘Giry monad’ [10]).

\([\![ 1 ]\!]\) is the discrete measurable space with one point.

\([\![ \mathbb {A}\times \mathbb {B} ]\!]\) is the product space \([\![ \mathbb {A} ]\!]\times [\![ \mathbb {B} ]\!]\). The \(\sigma \)-algebra \(\varSigma _{[\![ \mathbb {A}\times \mathbb {B} ]\!]}\) is generated by rectangles \((U\times V)\) with \(U\in \varSigma _{[\![ \mathbb {A} ]\!]}\) and \(V\in \varSigma _{[\![ \mathbb {B} ]\!]}\) (e.g. [32, Definition 16]).

\([\![ \sum _{i\in I}\mathbb {A}_i ]\!]\) is the coproduct space \(\biguplus _{i\in I}[\![ \mathbb {A}_i ]\!]\). The \(\sigma \)-algebra \(\varSigma _{[\![ \sum _{i\in I}\mathbb {A}_i ]\!]}\) is generated by sets \(\{(i,a)\mid a\in U\}\) for \(U\in \varSigma _{[\![ \mathbb {A}_i ]\!]}\).
Terms. We distinguish typing judgements: \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {d}}t :\mathbb {A}\) for deterministic terms, and \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\) for probabilistic terms. Formally, a context \(\varGamma =(x_1 :\mathbb {A}_1,\ldots ,x_n :\mathbb {A}_n)\) means the product measurable space \([\![ \varGamma ]\!]=[\![ \mathbb {A}_1 ]\!]\times \dots \times [\![ \mathbb {A}_n ]\!]\). Deterministic terms \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {d}}t :\mathbb {A}\) denote measurable functions \([\![ \varGamma ]\!]\rightarrow [\![ \mathbb {A} ]\!]\), and probabilistic terms \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t' :\mathbb {A}\) denote kernels \([\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\). We give a syntax and type system here, and a semantics in Sect. 3.2.
Sums and Products. The language includes variables, and standard constructors and destructors for sum and product types.
In the rules for sums, I may be infinite. In the last rule, j is 0 or 1. We use some standard syntactic sugar, such as \(\mathsf {false}\) and \(\mathsf {true}\) for the injections in the type \(\mathsf {bool}= 1+1\), and \(\mathsf {if}\) for \(\mathsf {case}\) in that instance.
Sequencing. We include the standard constructs for sequencing (e.g. [25, 29]).
Language-Specific Constructs. So far the language is very standard. We also include constant terms for all measurable functions.
Thus the language contains all the arithmetic operations (e.g. \(+:\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\)) and predicates (e.g. \((=):\mathbb {R}\times \mathbb {R}\rightarrow \mathsf {bool}\)). Moreover, all the families of probability measures are in the language. For example, the Gaussian distributions \( gauss :\mathbb {R}\times \mathbb {R}\rightarrow P(\mathbb {R})\) are parameterized by mean and standard deviation, so that we have a judgement \(\mu :\mathbb {R},\sigma :\mathbb {R}\mathrel {\vdash \!\!\!\!_\mathsf {d}} gauss (\mu ,\sigma ) :\mathsf {P}(\mathbb {R})\). (Some families are not defined for all parameters, e.g. the standard deviation should be positive, but we make ad hoc safe choices throughout rather than using exceptions or subtyping.)
The core of the language is the constructs corresponding to the terms in Bayes’ law (1): sampling from prior distributions, recording likelihood scores,
and calculating the normalizing constant and a normalized posterior.
Normalization will fail if the normalizing constant is zero or infinity. Notice that normalization produces a probability distribution; in a complex model this could then be used as a prior and sampled from. This is sometimes called a ‘nested query’.
Note About Observations. Often a probability distribution d has a widely understood density function f with respect to some base measure. For example, the exponential distribution with rate r is usually defined in terms of the density function \(x\mapsto re^{-r x}\) with respect to the Lebesgue measure on \(\mathbb {R}\). The \(\mathsf {score}\) construct is typically called with a density. In this circumstance, we use the informal notation \(\mathsf {observe}\,t\,\mathsf {from}\,d\) for \(\mathsf {score}(f(t))\). For example, \(\mathsf {observe}\,t\,\mathsf {from}\, exp (r)\) is informal notation for \(\mathsf {score}(r e^{-r\times t})\). In a more realistic programming language, this informality is avoided by defining a ‘distribution object’ to be a pair of a probability measure and a density function for it. There is no difference in expressivity between an observe construction and a \(\mathsf {score}\) construct. For example, \(\mathsf {score}(r)\) can be understood as \(\mathsf {observe}\,0\,\mathsf {from}\, exp (r)\), since \(r e^{-r\cdot 0}=r\).
(Technical point: although density functions can be understood as Radon–Nikodym derivatives, these are not uniquely determined on measure-zero sets, and so a distribution object does need to come with a given density function. Typically the density is continuous with respect to some metric so that the likelihood is not vulnerable to small inaccuracies in observations. See e.g. [43, Sect. 9] for more details.)
3.2 Denotational Semantics
Recall that types \(\mathbb {A}\) are interpreted as measurable spaces \([\![ \mathbb {A} ]\!]\). We now explain how to interpret a deterministic term in context, \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {d}}t :\mathbb {A}\), as a measurable function \([\![ t ]\!]:[\![ \varGamma ]\!]\rightarrow [\![ \mathbb {A} ]\!]\), and how to interpret a probabilistic term in context, \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\), as an s-finite kernel \([\![ t ]\!]:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\).
The semantics is given by induction on the structure of terms. Before we begin we need a lemma.
Lemma 3
Let X, Y, Z be measurable spaces, and let \(k:X\times Y \rightsquigarrow Z\) and \(l:X\rightsquigarrow Y\) be s-finite kernels (Definition 2). Then we can define an s-finite kernel \((k\star l):X\rightsquigarrow Z\) by
so that
Proof
Suppose \(k=\sum _{i=1}^\infty k_i\) and \(l=\sum _{j=1}^\infty l_j\) are s-finite kernels, and that the \(k_i\)’s and \(l_j\)’s are finite kernels. We need to show that \(k\star l\) is a kernel and moreover s-finite. We first show that each \(k_i\star l_j\) is a finite kernel. Each \((k_i\star l_j)(x,-):\varSigma _Z\rightarrow [0,\infty ]\) is a measure:
The measurability of each \((k_i\star l_j)(-,U):X\rightarrow [0,\infty ]\) follows from the general fact that for any measurable function \(f:X\times Y\rightarrow [0,\infty ]\), the function \(\int _Y l_j(-,\mathrm {d}y)\, f(-,y):X\rightarrow [0,\infty ]\) is measurable (e.g. [32, Theorem 20(ii)]). Thus \((k_i\star l_j)\) is a kernel. This step crucially uses the fact that each measure \(l_j(x,-)\) is finite.
To show that \((k_i\star l_j)\) is a finite kernel, we exhibit a bound. Since \(k_i\) and \(l_j\) are finite, we have \(r,s\in (0,\infty )\) such that \(k_i(x,y,Z)<r\) and \(l_j(x,Y)<s\) for all x, y. Now rs is a bound on \((k_i\star l_j)\) since
So each \((k_i\star l_j)\) is a finite kernel. Note that here we used the uniformity in the definition of finite kernel.
We conclude that \((k\star l)\) is an s-finite kernel by showing that it is a countable sum of finite kernels:
The final part of the statement follows by writing f as a limit of a sequence of simple functions and using the monotone convergence property (7).
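On countable discrete spaces the composition \(k\star l\) is a plain sum, which makes the lemma easy to sanity-check numerically. A sketch with our own representation, where a kernel maps a point to a dictionary of masses:

```python
def compose(k, l):
    """(k * l)(x, U) = integral over l(x, dy) of k((x, y), U), written
    as a sum; l maps x to a measure, k maps (x, y) to a measure."""
    def kl(x):
        out = {}
        for y, wy in l(x).items():          # integrate over l(x, dy)
            for z, wz in k(x, y).items():   # mass of k((x, y), -)
                out[z] = out.get(z, 0.0) + wy * wz
        return out
    return kl
```

Composing a kernel of total mass 2 after one of total mass 1 yields a kernel of total mass 2, as the bound rs in the proof predicts.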
Remark
It seems unlikely that we can drop the assumption of s-finiteness in Lemma 3. The difficulty is in showing that \((k\star l):X\times \varSigma _Z\rightarrow [0,\infty ]\) is measurable in its first argument without some extra assumption. (I do not have a counterexample, but then examples of nonmeasurable functions are hard to find.)
Semantics. We now explain the semantics of the language, beginning with variables, sums and products, which is essentially the same as a set-theoretic semantics.
Here we have only treated the \(\mathsf {case}\) expressions when the continuation is deterministic; we return to the probabilistic case later.
The semantics of sequencing is perhaps the most interesting: \(\mathsf {return}\) is the Dirac delta measure, and \(\mathsf {let}\) is integration.
The interpretation \([\![ \mathsf {return}(t) ]\!]\) is finite, hence s-finite. The fact that \([\![ \mathsf {let}\, x=t\,\mathsf {in}\, u ]\!]\) is an s-finite kernel is Lemma 3: this is the most intricate part of the semantics.
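Restricted to countable spaces, the sequencing clauses (together with score, treated below) can be sketched concretely: return is a point mass, and let integrates the continuation against the measure of its body. The representation, and the rounded Poisson likelihoods taken from the telephone-operator example, are ours:

```python
def ret(v):
    # [[return t]]: Dirac measure on the value of t
    return {v: 1.0}

def score(r):
    # [[score t]]: one-point space carrying mass |r|
    return {(): abs(r)}

def let(t, u):
    # [[let x = t in u]]: integrate the kernel u against the measure t
    out = {}
    for x, w in t.items():
        for y, wy in u(x).items():
            out[y] = out.get(y, 0.0) + w * wy
    return out
```

Running the operator model with rounded likelihoods 0.0189 and 0.1680 recovers the unnormalized posterior of Sect. 1.1.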
We return to the case expression where the continuation is probabilistic:
We must show that this is an s-finite kernel. Recall that \({[\![ u_i ]\!]:[\![ \varGamma \times \mathbb {A}_i ]\!]\rightsquigarrow [\![ \mathbb {B} ]\!]}\) is s-finite. We can also form \(\overline{[\![ u_i ]\!]}:[\![ \varGamma ]\!]\times \biguplus _j[\![ \mathbb {A}_j ]\!]\rightsquigarrow [\![ \mathbb {B} ]\!]\) with
and it is easy to show that \(\overline{[\![ u_i ]\!]}\) is an s-finite kernel. Another easy fact is that a countable sum of s-finite kernels is again an s-finite kernel, so we can build an s-finite kernel \((\sum _i\overline{[\![ u_i ]\!]}):[\![ \varGamma ]\!]\times \biguplus _j[\![ \mathbb {A}_j ]\!]\rightsquigarrow [\![ \mathbb {B} ]\!]\). Finally, we use a simple instance of Lemma 3 to compose \((\sum _i\overline{[\![ u_i ]\!]})\) with \([\![ t ]\!]:[\![ \varGamma ]\!]\rightarrow \biguplus _j[\![ \mathbb {A}_j ]\!]\) and conclude that \([\![ \mathsf {case}\ t\ \mathsf {of}\ \{(i,x)\Rightarrow u_i\}_{i \in I} ]\!]\) is an s-finite kernel.
The language-specific constructs are straightforward.
In the semantics of \(\mathsf {sample}\), we are merely using the fact that to give a measurable function \(X\rightarrow P(Y)\) is to give a probability kernel \(X\rightsquigarrow Y\). Probability kernels are finite, hence s-finite.
The semantics of \(\mathsf {score}\) is a one point space whose measure is the argument. (We take the absolute value of \([\![ t ]\!]_{\gamma }\) because measures should be nonnegative. An alternative would be to somehow enforce this in the type system.) We need to show that \([\![ \mathsf {score}(t) ]\!]\) is an s-finite kernel. Although \([\![ \mathsf {score}(t) ]\!]_{\gamma ,1}\) is always finite, \([\![ \mathsf {score}(t) ]\!]\) is not necessarily a finite kernel because we cannot find a uniform bound. To show that it is s-finite, for each \(i\in \mathbb {N}_0\), define a kernel \(k_i:[\![ \varGamma ]\!]\rightsquigarrow 1\) by \(k_i(\gamma ,U)=[\![ \mathsf {score}(t) ]\!]_{\gamma }(U)\) when \(i\le |[\![ t ]\!]_{\gamma }|<i+1\), and \(k_i(\gamma ,U)=0\) otherwise.
So each \(k_i\) is a finite kernel, bounded by \((i+1)\), and \([\![ \mathsf {score}(t) ]\!]=\sum _{i=0}^\infty k_i\), so it is s-finite.
We give a semantics to normalization by finding the normalizing constant and dividing by it, as follows. Consider \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\) and let .
(In practice, the normalization will only be approximate. We leave it for future work to develop semantic notions of approximation in this setting, e.g. [27].)
4 Properties and Examples
4.1 Examples of Statistical Reasoning
Lebesgue Measure, Densities and Importance Sampling. The Lebesgue measure on \(\mathbb {R}\) is not a primitive in our language, because it is not a probability measure, but it is definable. For example, we can score the standard Gaussian by the inverse of its density function, \(f(x)=\frac{1}{\sqrt{2\pi }}e^{-\frac{1}{2}x^2}\).
(On the third line, we use the definition of density function.)
Some languages (such as Stan [40], also Core Hakaru [38]) encourage the use of the Lebesgue measure as an ‘improper prior’. We return to the example of importance sampling, proposed in the introduction. Consider a probability measure p with density g. Then
— a simple example of how an improper prior can lead to a proper posterior. We can derive the importance sampling algorithm for p by combining (11) with (10):
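This derivation can be exercised numerically. The sketch below is illustrative rather than taken from the paper: it fixes a target p = Normal(1, 1) with density g, runs the composite program "sample the standard Gaussian; score by 1/f; score by g" as a weighted sampler, and checks that the weighted mean is the mean of p.

```python
import math
import random

random.seed(0)

def f(x):
    """Density of the standard Gaussian prior."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def g(x):
    """Density of the target p = Normal(1, 1) (an illustrative choice)."""
    return math.exp(-0.5 * (x - 1) ** 2) / math.sqrt(2 * math.pi)

# "let x = sample(gauss(0,1)) in score(1/f(x)); score(g(x)); return x"
# run as a weighted sampler: propose from the prior, weight by g/f.
N = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
weights = [g(x) / f(x) for x in samples]   # the two scores combined

post_mean = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
assert abs(post_mean - 1.0) < 0.05         # p has mean 1
```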
Conjugate Prior Relationships and Symbolic Bayesian Update. A key technique for Bayesian inference involves conjugate prior relationships. In general, inference problems are solved by simulation, but sometimes we can work symbolically, when there is a closed form for updating the parameters of a prior according to an observation. In a probabilistic programming language, this symbolic translation can be done semi-automatically as a program transformation (see e.g. [5]).
Recall that \( beta (\alpha ,\beta )\) is a probability measure on [0, 1] describing the distribution of a bias of a coin from which we have observed \((\alpha -1)\) heads and \((\beta -1)\) tails. This has a conjugate prior relationship with the Bernoulli distribution. For instance,
In the graph, notice that \( beta (3,2)\) depicts the updated posterior belief of the bias of the coin after an additional observation: it is more probable that the coin is biased to heads.
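As a program transformation, the conjugate update replaces a sample-then-score by a score-then-sample with updated parameters. A minimal sketch follows; the helper names `update` and `evidence` are ours, not the paper's.

```python
# Conjugate update for beta/Bernoulli, read as a symbolic transformation:
#   x = sample(beta(a,b)); score(x)
#     ==>  score(a/(a+b)); x = sample(beta(a+1, b))
# (observing heads multiplies the density by x; a/(a+b) is the evidence).

def update(a, b, heads):
    """Posterior beta parameters after one Bernoulli observation."""
    return (a + 1, b) if heads else (a, b + 1)

def evidence(a, b, heads):
    """Model evidence of the observation under a beta(a, b) prior."""
    return a / (a + b) if heads else b / (a + b)

# a beta(2,2) prior (one head, one tail seen) plus an observed head
# gives the beta(3,2) posterior depicted in the graph:
assert update(2, 2, True) == (3, 2)
assert evidence(2, 2, True) == 0.5
```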
Resampling. In many situations, particularly in Sequential Monte Carlo simulations, it is helpful to freeze a simulation and resample from a histogram that has been built (e.g. [31]). In practical terms, this avoids having too many threads of low weight. Resampling in this way is justified by the following program equation:
Notice that we cannot resample if the model evidence is \(\infty \). For example, we cannot resample from the expression above computing the Lebesgue measure (10), although of course this doesn’t prevent us from resampling from programs that contain it (e.g. (11)).
Hard Constraints. A hard constraint is a score of 0; a nonzero score is a soft constraint. In our language, every type is inhabited, so for each type \(\mathbb {A}\) we can define a term
picking arbitrary \(f:1\rightarrow [\![ \mathbb {A} ]\!]\) at each type \(\mathbb {A}\). The semantics is \([\![ \mathsf {fail}_\mathbb {A} ]\!]_{\gamma ,U}=0\).
Hard constraints suffice for scores below 1, because then
Hard constraints cannot express scores above 1, which can arise from continuous likelihoods — for instance, in the example in the introduction, the likelihoods were 0.82 and 1.42. Inference algorithms often perform better when soft constraints are used.
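The reduction of a soft constraint with score \(r\le 1\) to a Bernoulli draw followed by a possible hard \(\mathsf {fail}\) can be checked by simulation. The following sketch (using the likelihood 0.82 from the introduction) verifies that the average weight of the hard-constraint version recovers r.

```python
import random

random.seed(1)

def soft(r):
    """score(r): contribute weight r deterministically."""
    return r

def hard(r):
    """sample(bernoulli(r)); fail unless true -- weight is 0 or 1."""
    return 1.0 if random.random() < r else 0.0

# Both denote the same measure: the expected weight of `hard` is r.
r = 0.82
N = 100_000
avg = sum(hard(r) for _ in range(N)) / N
assert abs(avg - soft(r)) < 0.01
```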
4.2 Basic Semantic Properties
Standard \(\varvec{\beta }/\varvec{\eta }\) laws and associativity of let. The standard \(\beta /\eta \) laws for sums and products hold. These are easy to verify. For instance,
We also have the standard associativity and identity laws for let:
For instance, the associativity law follows from Lemma 3.
Commutativity.
Theorem 4
For any terms \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\), \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}u :\mathbb {B}\), \(\varGamma ,x:\mathbb {A},y:\mathbb {B}\mathrel {\vdash \!\!\!\!_\mathsf {p}}v :\mathbb {C}\), we have
This theorem is an immediate consequence of Proposition 5:
Proposition 5
Let \(\mu \) and \(\lambda \) be s-finite measures on X and Y respectively, and let \(f:X\times Y\rightarrow [0,\infty ]\) be measurable. Then
Proof
This result is known (e.g. [39]) and it is easy to prove. Since \(\mu \) and \(\lambda \) are s-finite, we have \(\mu =\sum _{i=1}^\infty \mu _i\) and \(\lambda =\sum _{j=1}^\infty \lambda _j\), with the \(\mu _i\)’s and \(\lambda _j\)’s all finite. Now,
(The commutativity for finite measures is often called Fubini’s theorem.)
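For a concrete (purely illustrative) check of Proposition 5, we can take discrete s-finite measures given as sums of finitely supported pieces and compare the two iterated integrals directly:

```python
# Discrete check of Proposition 5: write each s-finite measure as a sum
# of finite pieces (dicts point -> mass), integrate a non-negative f in
# both orders, and compare. Purely illustrative values.

mu_pieces = [{0: 0.5, 1: 1.5}, {1: 2.0, 2: 0.25}]    # mu  = mu_1 + mu_2
lam_pieces = [{0: 3.0}, {1: 0.125, 2: 4.0}]           # lam = lam_1 + lam_2

def f(x, y):
    return (x + 1) * (y + 2)        # any measurable f >= 0

def integrate_xy():
    """Integrate over x first, then y."""
    return sum(mx * ly * f(x, y)
               for mu in mu_pieces for x, mx in mu.items()
               for lam in lam_pieces for y, ly in lam.items())

def integrate_yx():
    """Integrate over y first, then x."""
    return sum(mx * ly * f(x, y)
               for lam in lam_pieces for y, ly in lam.items()
               for mu in mu_pieces for x, mx in mu.items())

assert integrate_xy() == integrate_yx()
```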
Iteration. We did not include iteration in our language but in fact it is definable. In brief, we can use the probabilistic constructs to guess how many iterations are needed for termination. (We do not envisage this as a good implementation strategy, we merely want to show that the language and semantic model can accommodate reasoning about iteration.)
In detail, we define a construction \(\mathsf {iterate}\,t\,\mathsf {from}\,x{=}u\) that keeps calling t, starting from \(x{=}u\): if t returns \({u':\mathbb {A}}\), then we repeat with \(x{=}u'\); if t eventually returns in \(\mathbb {B}\), then we stop. This has the following derived typing rule:
We begin by defining the counting measure on \(\mathbb {N}\), which assigns to each set its size. This is not a primitive, because it isn’t a probability measure, but we can define it in a similar way to the Lebesgue measure:
(Recall that the Poisson distribution has \( poisson (1)(\{x\})=\frac{1}{x!e}\).)
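The program "let \(n=\mathsf {sample}( poisson (1))\) in \(\mathsf {score}(n!\,e)\); return n" therefore assigns measure 1 to each singleton, which can be checked pointwise (an illustrative sketch):

```python
import math

# Counting measure on N via poisson(1): the score n! * e exactly cancels
# the sampling probability poisson(1)({n}) = 1 / (n! * e).

def poisson1(n):
    """poisson(1)({n})."""
    return 1.0 / (math.factorial(n) * math.e)

def weight(n):
    """The score applied after sampling n."""
    return math.factorial(n) * math.e

# Resulting measure of each singleton {n} is probability * weight = 1.
for n in range(20):
    assert abs(poisson1(n) * weight(n) - 1.0) < 1e-9
```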
Now we can define
where \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}\mathsf {iterate}^{n}\,t\,\mathsf {from}\,x{=}u :\mathbb {B}\) is the program that returns \(v:\mathbb {B}\) if t returns v after exactly n iterations and \(\mathsf {fail}\)s otherwise:
For a simple illustration, von Neumann’s trick for simulating a fair coin from a biased one d can be written \(d:\mathsf {P}(\mathsf {bool})\mathrel {\vdash \!\!\!\!_\mathsf {p}}\mathsf {iterate}\,t\,\mathsf {from}\,x{=}() :\mathsf {bool}\) where
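An executable rendering of von Neumann's trick (illustrative; the bias 0.8 is an arbitrary choice):

```python
import random

random.seed(2)

def biased():
    """A coin biased towards True (bias 0.8, arbitrary)."""
    return random.random() < 0.8

def von_neumann(coin):
    """iterate: flip twice; stop when the flips differ, return the first."""
    while True:
        a, b = coin(), coin()
        if a != b:
            return a

# The two unequal outcomes (True, False) and (False, True) are
# equiprobable whatever the bias, so the result is a fair coin.
N = 100_000
heads = sum(von_neumann(biased) for _ in range(N))
assert abs(heads / N - 0.5) < 0.01
```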
We leave for future work the relation between this iteration and other axiomatizations of iteration (e.g. [12, Chap. 3]).
5 Remarks About s-Finite Kernels
5.1 Full Definability
Theorem 6
If \(k:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) is s-finite then there is a term \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\) such that \(k=[\![ t ]\!]\).
Proof
We show that probability kernels are definable. Consider a probability kernel \(k:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) with \(\varGamma =(x_1:\mathbb {B}_1\dots x_n:\mathbb {B}_n)\). This corresponds to a measurable function \(f:[\![ \prod _{i=1}^n\mathbb {B} ]\!]\rightarrow P([\![ \mathbb {A} ]\!])\), with \(f(b_1,\dots , b_n)(U)=k(b_1,\dots ,b_n,U)\), and \(k=[\![ \varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}\mathsf {sample}(f(x_1,\dots ,x_n)) :\mathbb {A} ]\!].\)
We move on to subprobability kernels, which are kernels \(k:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) such that \(k(\gamma ,[\![ \mathbb {A} ]\!])\le 1\) for all \(\gamma \). We show that they are all definable. Recall that to give a subprobability kernel \(k:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) is to give a probability kernel \(\bar{k}:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A}+1 ]\!]\). Define
This probability kernel \(\bar{k}\) is definable, with \(\bar{k}=[\![ t ]\!]\), say, and this has the property that
where \(\mathsf {fail}\) is the zero kernel defined in (12). So the subprobability kernel k is definable.
Next, we show that all finite kernels \(k:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) are definable. If k is finite then there is a bound \(r\in (0,\infty )\) such that \(k(\gamma ,[\![ \mathbb {A} ]\!])<r\) for all \(\gamma \). Then \(\frac{1}{r}k\) is a subprobability kernel, hence definable, so we have t such that \(\frac{1}{r} k =[\![ t ]\!]\). So \(k=r[\![ t ]\!]=[\![ \mathsf {score}(r);t ]\!]\).
Finally, if k is s-finite then there are finite kernels \(k_i:[\![ \varGamma ]\!]\rightsquigarrow [\![ \mathbb {A} ]\!]\) such that \(k=\sum _{i=1}^\infty k_i\). Since the \(k_i\)’s are finite, we have terms \(t_i\) with \(k_i=[\![ t_i ]\!]\). Recall that a countable sum is merely integration over the counting measure on \(\mathbb {N}\), which we showed to be definable in (13). So we have \(k=[\![ \mathsf {case}\, counting _\mathbb {N}\,\mathsf {of}\,i\Rightarrow t_i ]\!]\).
5.2 Failure of Commutativity in General
The standard example of the failure of Tonelli’s theorem (e.g. [32, Chap. 4, Example 12]) can be used to explain why the commutativity program equation (2) fails if we allow arbitrary measures as programs.
Let \( lebesgue \) be the Lebesgue measure on \(\mathbb {R}\), and let \( counting _\mathbb {R}\) be the counting measure on \(\mathbb {R}\). Recall that \( counting _\mathbb {R}(U)\) is the cardinality of U if U is finite, and \(\infty \) if U is infinite. Then
So, by Proposition 5, the counting measure on \(\mathbb {R}\) is not s-finite, and hence it is not definable in our language. (This is in contrast to the counting measure on \(\mathbb {N}\), see (13).)
Just for this subsection, we suppose that we can add the counting measure on \(\mathbb {R}\) to our language as a term constructor \(\mathrel {\vdash \!\!\!\!_\mathsf {p}} counting _\mathbb {R} :\mathbb {R}\) and that we can extend the semantics to accommodate it. (This would require some extension of Lemma 3.) The Lebesgue measure is already definable in our language (10). In this extended language we would have
So if such a language extension were possible, we would not have commutativity.
5.3 Variations on s-Finiteness
Infinite versions of Fubini/Tonelli theorems are often stated for \(\sigma \)-finite measures. Recall that a measure \(\mu \) on X is \(\sigma \)-finite if \(X=\biguplus _{i=1}^\infty U_i\) with each \(U_i\in \varSigma _X\) and each \(\mu (U_i)\) finite. The restriction to \(\sigma \)-finite measures is too strong for our purposes. For example, although the Lebesgue measure (\( lebesgue \)) is \(\sigma \)-finite, and definable (10), the measure \([\![ \,\mathrel {\vdash \!\!\!\!_\mathsf {p}}\mathsf {let}\, x= lebesgue \,\mathsf {in}\, () :1 ]\!]\) is the infinite measure on the one-point space, which is not \(\sigma \)-finite. This illustrates the difference between \(\sigma \)-finite and s-finite measures:
Proposition 7
A measure is s-finite if and only if it is a pushforward of a \(\sigma \)-finite measure.
Proof
From left to right, let \(\mu =\sum _{i=1}^\infty \mu _i\) be a measure on X with each \(\mu _i\) finite. Then we can form a \(\sigma \)-finite measure \(\nu \) on \(\mathbb {N}\times X\) with \(\nu (U)=\sum _{i=1}^\infty \mu _i(\{x \mid (i,x)\in U\})\). The original measure \(\mu \) is the pushforward of \(\nu \) along the projection \(\mathbb {N}\times X\rightarrow X\).
From right to left, let \(\nu \) be a \(\sigma \)-finite measure on \(X=\biguplus _{i=1}^\infty U_i\) with each restricted measure \(\nu (U_i)\) finite. Let \(f:X\rightarrow Y\) be measurable. For \(i\in \mathbb {N}\), let \(\mu _i(V)=\nu (\{x\in U_i \mid f(x)\in V\})\). Then each \(\mu _i\) is a finite measure on Y and \(\sum _{i=1}^\infty \mu _i\) is the pushforward of \(\nu \) along f, as required. (See also [9, Lemma 8.6].)
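The left-to-right construction can be checked on a finite example (illustrative, with X = {0, 1} and two finite pieces): the pushforward of \(\nu \) along the projection agrees with \(\mu \) on every subset.

```python
# Finite-support check of Proposition 7, left to right:
# mu = sum of finite mu_i on X is the pushforward of a sigma-finite
# nu on N x X along the projection (i, x) |-> x.

mu_pieces = [{0: 0.5, 1: 1.5}, {0: 2.0}]        # mu = mu_1 + mu_2 on X={0,1}

def nu(U):
    """nu on N x X: nu(U) = sum_i mu_i({x : (i, x) in U})."""
    return sum(m for i, mu_i in enumerate(mu_pieces)
               for x, m in mu_i.items() if (i, x) in U)

def mu(V):
    return sum(m for mu_i in mu_pieces for x, m in mu_i.items() if x in V)

def pushforward(V):
    """Pushforward of nu along the projection, evaluated at V."""
    return nu({(i, x) for i in range(len(mu_pieces)) for x in V})

for V in [set(), {0}, {1}, {0, 1}]:
    assert pushforward(V) == mu(V)
```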
However, this does not mean that s-finite kernels (Definition 2) are ‘just’ kernels whose images are pushforwards of \(\sigma \)-finite measures. In the proof of commutativity, we only needed kernels \(k:X\rightsquigarrow Y\) such that k(x) is an s-finite measure for all \(x\in X\). This condition is implied by the definition of s-finite kernel (Definition 2), but the definition of s-finite kernel seems to be strictly stronger because of the uniformity in the definition. (This is not known for sure; see also the discussion about \(\sigma \)-finite kernels in [32, Sect. 4.10].) The reason we use the notion of s-finite kernel, rather than this apparently weaker notion, is that Lemma 3 (and hence the well-defined semantics of \(\mathsf {let}\)) appears to require the uniformity in the definition of finite and s-finite kernels. In brief, the stronger notion of s-finite kernel provides a compositional semantics giving s-finite measures.
6 Concluding Remarks
6.1 Related Work on Commutativity for Probabilistic Programs
Work Using Finite Kernels. Several other authors have given a semantics for probabilistic programs using kernels. Subprobability kernels and finite measures already appear in Kozen’s work on probabilistic programming [21]. Ramsey and Pfeffer [34] focus on a language like ours but without \(\mathsf {score}\) or \(\mathsf {normalize}\); they give a semantics in terms of probability kernels. The measure transformer semantics of Borgström et al. [3] incorporates observations by moving to finite kernels; their semantics is similar to ours (Sect. 3.2), but they are able to make do with finite kernels by considering a very limited language. In the more recent operational semantics of Borgström et al. [4], problems of commutativity are avoided by requiring scores to be less than 1, so that all the measures are subprobability measures. Jacobs and Zanasi [18] also impose this restriction to make a connection with an elegant mathematical construction. With discrete countable distributions, this is fine because density functions and likelihoods lie below 1. But when dealing with continuous distributions, it is artificial to restrict to scores below 1, since the likelihood of a continuous distribution may just as well lie above 1 as below it. For example, the subprobability semantics could not handle the example in Sect. 1.1. This is not merely a matter of scaling, because density functions are sometimes unbounded, as shown in \( beta (0.5,0.5)\) on the right. Our results here show that, by using s-finite kernels, one can consider arbitrary likelihoods without penalty.
Verification Conditions for Commutativity. Shan and Ramsey [38] use a similar semantics to ours to justify their disintegration program transformation. They interpret a term \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\) as a measurable function into a monad \(\mathbb M\) of measures, \([\![ t ]\!]_{\mathrm {SR}}:[\![ \varGamma ]\!]\rightarrow \mathbb M(\mathbb {A})\), which amounts to the same thing as a kernel. However, there is a problem with a semantics in this style: we do not know a proof of Lemma 3 without the s-finiteness restriction. In other words, we do not know whether the monad of all measures \(\mathbb M\) is a strong monad, and a strength is needed to give a semantics for the \(\mathsf {let}\) construction. So it is not clear whether the semantics is well-defined. Even if \(\mathbb M\) is strong, it is certainly not commutative, as we have discussed in Sect. 5.2, a point also emphasized by Ramsey [35]. Shan and Ramsey [38] regain commutativity by imposing additional verification conditions. Our results here show that these conditions are always satisfied, because all definable programs are s-finite kernels and hence commutative.
Contextual Equivalence. Very recently, Culpepper and Cobb [6] have proposed an operational notion of contextual equivalence for a language with arbitrary likelihoods, and shown that this supports commutativity. The relationship between their argument and sfinite kernels remains to be investigated.
Sampling Semantics. An alternative approach to denotational semantics for probabilistic programs is based on interpreting an expression \(\varGamma \mathrel {\vdash \!\!\!\!_\mathsf {p}}t :\mathbb {A}\) as a probability kernel \([\![ t ]\!]':[\![ \varGamma ]\!]\rightsquigarrow ([0,\infty )\times [\![ \mathbb {A} ]\!])\), so that \([\![ t ]\!]'(\gamma )\) is a probability measure on pairs (r, x) of a result x and a weight r. In brief, the probability measure comes from sampling priors, and the weight comes from scoring likelihoods of observations. Borgström et al. [4] call this a sampling semantics, by contrast with the distribution semantics that we have considered here. This sampling semantics, which has a more intensional flavour and is closer to an operational intuition, is also considered by Ścibior et al. [37] and Staton et al. [43], as well as Doberkat [7]. The two methods are related because every probability kernel \(k:X\rightsquigarrow ([0,\infty )\times Y)\) induces a measure kernel \(\bar{k}:X\rightsquigarrow Y\) by summing over the possible scores:
An advantage of the sampling semantics is that it is clearly commutative, because it is based on a commutative monad \(P([0,\infty )\times (-))\), built by combining the commutative Giry monad P and the commutative monoid monad transformer. However, the sampling semantics does not validate many of the semantic equations in Sect. 4.1: importance sampling, conjugate priors, and resampling are only sound in the sampling semantics if we wrap the programs in \(\mathsf {normalize}(\dots )\). (See e.g. [43].) This makes it difficult to justify applying program transformations compositionally. The point of this paper is that we can verify the semantic equations in Sect. 4.1 directly, while retaining commutativity, by using the measure-based (distributional) semantics.
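The passage from the sampling semantics to the distribution semantics, recovering \(\bar{k}(x,U)\) as the expected weighted indicator, can be sketched as a Monte Carlo estimate. The program below is illustrative (a Gaussian prior with a two-valued likelihood echoing the 0.82 and 1.42 of the introduction), not an example from the paper.

```python
import random

random.seed(3)

def run():
    """Sampling semantics: run the program to a (weight, value) pair."""
    x = random.gauss(0.0, 1.0)       # sample the prior
    r = 1.42 if x > 0 else 0.82      # score an (assumed) likelihood
    return r, x

N = 200_000
pairs = [run() for _ in range(N)]

def measure(pred):
    """k-bar(U), estimated as the average of r * 1_U(x)."""
    return sum(r for r, x in pairs if pred(x)) / N

# Model evidence: each likelihood occurs with probability 1/2.
total = measure(lambda x: True)
assert abs(total - (1.42 + 0.82) / 2) < 0.01
```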
As an aside we note that the probability kernels \(X\rightsquigarrow ([0,\infty )\times Y)\) used in the sampling semantics are closely related to the s-finite kernels advocated in this paper:
Proposition 8
A kernel \(l:X\rightsquigarrow Y\) is s-finite if and only if there exists a probability kernel \(k:X\rightsquigarrow ([0,\infty )\times Y)\) such that \(l(x,U)=\int _{[0,\infty )\times U}k(x,\mathrm {d}(r,y))\,r\).
Proof
(notes). We focus on the case where \(X=[\![ \mathbb {A} ]\!]\) and \(Y=[\![ \mathbb {B} ]\!]\). From left to right: build a probability kernel from an s-finite kernel by first understanding it as a probabilistic program (via Theorem 6) and then using the denotational semantics in [43]. From right to left: given a probability kernel \(k:[\![ \mathbb {A} ]\!]\rightsquigarrow ([0,\infty )\times [\![ \mathbb {B} ]\!])\), we build an s-finite kernel
Valuations Versus Measures. Some authors advocate using valuations on topological spaces instead of measures on measurable spaces. This appears to rule out the problematic examples, such as the counting measure on \(\mathbb {R}\). Indeed, Vickers [45] has shown that a monad of valuations on locales is commutative. This suggests a constructive or topological model of probabilistic programming (see [8, 15]) but a potential obstacle is that conditioning is not always computable [1].
6.2 Related Work on Commutativity More Generally
Multicategories and Data Flow Graphs. An early discussion of commutativity is in Lambek’s work on deductive systems and categories [22]. A judgement \(x_1:A_1,\dots ,x_n:A_n\vdash t : B\) is interpreted as a multimorphism \((A_1\dots A_n)\rightarrow B\). These could be drawn as triangles:
(This hints at a link with the graphical ideas underlying several probabilistic programming languages e.g. Stan [40].) Alongside requiring associativity of composition, Lambek requires commutativity:
which matches with our commutativity condition (2). (See also [42].) In this diagrammatic notation, commutativity says that the semantics is preserved under topological transformations. Without commutativity, one would need extra control flow wires to give a topological description of what rewritings are acceptable (e.g. [19, 28]). Our main technical results (Lemma 3 and Proposition 5) can be phrased as follows:
Measurable spaces and s-finite kernels \(X_1\times \dots \times X_n\rightsquigarrow Y\) form a multicategory.
Monoidal Categories, Monads and Arrows. There is a tight connection between multicategories and monoidal categories [13, 24, 42]. Our main technical results (Lemma 3 and Proposition 5) together with the basic facts in Sect. 4.2 can be phrased as follows:
Consider the category whose objects are measurable spaces and morphisms are s-finite kernels. The cartesian product of spaces extends to a monoidal structure which distributes over the coproduct structure.
From this point of view, the key step is that given s-finite kernels \(k_1:X_1\rightsquigarrow Y_1\) and \(k_2:X_2\rightsquigarrow Y_2\), we can form \((k_1\otimes k_2):X_1\times X_2\rightsquigarrow Y_1\times Y_2\), with
and the interchange law holds, in particular, \((k_1\otimes \mathrm {id})\circ (\mathrm {id}\otimes k_2)= (\mathrm {id}\otimes k_2)\circ (k_1\otimes \mathrm {id})\).
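For discrete finite kernels, represented as nested dictionaries of weights, the tensor and the interchange law can be checked directly (an illustrative sketch; exact float equality holds here because all the weights are dyadic rationals):

```python
# Interchange for discrete finite kernels k[x][y] (weight of y given x):
# (k1 (x) id) ; (id (x) k2)  ==  (id (x) k2) ; (k1 (x) id).

k1 = {0: {0: 0.5, 1: 1.5}, 1: {0: 2.0, 1: 0.25}}     # X1 -> Y1
k2 = {0: {0: 0.125, 1: 3.0}, 1: {0: 4.0, 1: 0.5}}    # X2 -> Y2
ident = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}   # identity kernel

def tensor(a, b):
    """(a (x) b)[(x1,x2)][(y1,y2)] = a[x1][y1] * b[x2][y2]."""
    return {(x1, x2): {(y1, y2): a[x1][y1] * b[x2][y2]
                       for y1 in a[x1] for y2 in b[x2]}
            for x1 in a for x2 in b}

def compose(a, b):
    """Run kernel a, then kernel b: sum over the intermediate point."""
    zs = next(iter(b.values()))
    return {x: {z: sum(a[x][y] * b[y][z] for y in a[x]) for z in zs}
            for x in a}

lhs = compose(tensor(k1, ident), tensor(ident, k2))
rhs = compose(tensor(ident, k2), tensor(k1, ident))
assert lhs == rhs
```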
One way of building monoidal categories is as Kleisli categories for commutative monads. For example, the monoidal category of probability kernels is the Kleisli category for the Giry monad [10]. However, we conjecture that s-finite kernels do not form a Kleisli category for a commutative monad on the category of measurable spaces. One could form a space \(M_{\mathsf {sfin}}(Y)\) of s-finite measures on a given space Y, but, as discussed in Sect. 5.3, it is unlikely that every measurable function \(X\rightarrow M_{\mathsf {sfin}}(Y)\) is an s-finite kernel in general, because of the uniformity in the definition (Definition 2). This makes it difficult to ascertain whether \(M_{\mathsf {sfin}}\) is a strong commutative monad. Having a monad would give us a slightly higher-order type constructor T(A) and the rules
allowing us to thunk (suspend, freeze) a probabilistic computation and then force (resume, run) it again [25, 29]. The rules are reminiscent of, but not the same as, the rules for \(\mathsf {normalize}\) and \(\mathsf {sample}\). Although monads are a convenient way of building a semantics for programming languages, they are not essential for first-order languages such as the language in this paper.
As a technical aside, we recall that Power, Hughes and others have eschewed monads and given categorical semantics for first-order languages in terms of Freyd categories [25] or Arrows [16] (see also [2, 17, 41]), and the idea of structuring the finite kernels as an Arrow already appears in the work of Borgström et al. [3] (see also [36, 44]). Our semantics based on s-finite kernels forms a ‘countably distributive commutative Freyd category’, which is to say that the identity-on-objects functor
preserves countable sums and is monoidal. In fact every countably distributive commutative Freyd category \(\mathcal C\rightarrow \mathcal D\) corresponds to a commutative monad, not on the category \(\mathcal C\) but on the category of countable-product-preserving functors \(\mathcal C^{\mathrm {op}}\rightarrow \mathbf {Set}\) (e.g. [33, 43]). This functor category is cartesian closed, and so it is also a fairly canonical semantics for higher-order programs. (For a more concrete variant, see also [14].)
6.3 Summary
We have given a denotational semantics for a probabilistic programming language using s-finite kernels (Sect. 3.2). Compositionality relied on a technical lemma (Lemma 3). This semantic model supports reasoning based on statistical techniques (Sect. 4.1), such as conjugate priors, as well as basic equational reasoning (Sect. 4.2), such as commutativity (Theorem 4). The model is actually completely described by the syntax, according to our full definability theorem (Theorem 6).
References
Ackerman, N.L., Freer, C.E., Roy, D.M.: Noncomputable conditional distributions. In: Proceedings of the LICS 2011 (2011)
Atkey, R.: What is a categorical model of arrows? In: Proceedings of the MSFP 2008 (2008)
Borgström, J., Gordon, A.D., Greenberg, M., Margetson, J., van Gael, J.: Measure transformer semantics for Bayesian machine learning. LMCS 9(3), 11 (2013)
Borgström, J., Dal Lago, U., Gordon, A.D., Szymczak, M.: A lambda-calculus foundation for universal probabilistic programming. In: Proceedings of the ICFP (2016)
Carette, J., Shan, C.C.: Simplifying probabilistic programs using computer algebra. In: Gavanelli, M., Reppy, J. (eds.) PADL 2016. LNCS, vol. 9585, pp. 135–152. Springer, Cham (2016). doi:10.1007/978-3-319-28228-2_9
Culpepper, R., Cobb, A.: Contextual equivalence for probabilistic programs with continuous random variables and scoring. In: Proceedings of the ESOP 2017 (2017, to appear)
Doberkat, E.E.: Stochastic Relations: Foundations for Markov Transition Systems. Chapman & Hall, London (2007)
Faissole, F., Spitters, B.: Synthetic topology in homotopy type theory for probabilistic programming. In: Proceedings of the PPS 2017 (2017)
Getoor, R.K.: Excessive Measures. Birkhäuser (1990)
Giry, M.: A categorical approach to probability theory. In: Categorical Aspects of Topology and Analysis. LNM, vol. 915, pp. 68–85. Springer (1982)
Goodman, N., Mansinghka, V., Roy, D.M., Bonawitz, K., Tenenbaum, J.B.: Church: a language for generative models. In: UAI (2008)
Haghverdi, E.: A categorical approach to linear logic, geometry of proofs and full completeness. Ph.D. thesis, Ottawa (2000)
Hermida, C.: Representable multicategories. Adv. Math. 151, 164–225 (2000)
Heunen, C., Kammar, O., Staton, S., Yang, H.: A convenient category for higherorder probability theory (2017). arXiv:1701.02547
Huang, D., Morrisett, G.: An application of computable distributions to the semantics of probabilistic programs: part 2. In: Proceedings of the PPS 2017 (2017)
Hughes, J.: Generalising monads to arrows. Sci. Comput. Program. 37(1–3), 67–111 (2000)
Jacobs, B., Heunen, C., Hasuo, I.: Categorical semantics for arrows. J. Funct. Program. 19(3–4), 403–438 (2009)
Jacobs, B., Zanasi, F.: A predicate/state transformer semantics for Bayesian learning. In: Proceedings of the MFPS 2016 (2016)
Jeffrey, A.: Premonoidal categories and a graphical view of programs. Unpublished (1997)
Kallenberg, O.: Stationary and invariant densities and disintegration kernels. Probab. Theory Relat. Fields 160, 567–592 (2014)
Kozen, D.: Semantics of probabilistic programs. J. Comput. Syst. Sci. 22, 328–350 (1981)
Lambek, J.: Deductive systems and categories II. In: Hilton, P.J. (ed.) Category Theory, Homology Theory and Their Applications. LNM, vol. 86, pp. 76–122. Springer, Heidelberg (1969)
Last, G., Penrose, M.: Lectures on the Poisson process. CUP (2016)
Leinster, T.: Higher operads, higher categories. CUP (2004)
Levy, P.B., Power, J., Thielecke, H.: Modelling environments in callbyvalue programming languages. Inf. Comput. 185(2), 182–210 (2003)
Mansinghka, V.K., Selsam, D., Perov, Y.N.: Venture: a higher-order probabilistic programming platform with programmable inference (2014). http://arxiv.org/abs/1404.0099
Mardare, R., Panangaden, P., Plotkin, G.: Quantitative algebraic reasoning. In: Proceedings of the LICS 2016 (2016)
Møgelberg, R.E., Staton, S.: Linear usage of state. Logical Methods Comput. Sci. 10 (2014)
Moggi, E.: Notions of computation and monads. Inf. Comput. 93(1), 55–92 (1991)
Narayanan, P., Carette, J., Romano, W., Shan, C., Zinkov, R.: Probabilistic inference by program transformation in Hakaru (system description). In: Kiselyov, O., King, A. (eds.) FLOPS 2016. LNCS, vol. 9613, pp. 62–79. Springer, Cham (2016). doi:10.1007/978-3-319-29604-3_5
Paige, B., Wood, F.: A compilation target for probabilistic programming languages. In: ICML (2014)
Pollard, D.: A user’s guide to measure theoretic probability. CUP (2002)
Power, J.: Generic models for computational effects. TCS 364(2), 254–269 (2006)
Ramsey, N., Pfeffer, A.: Stochastic lambda calculus and monads of probability distributions. In: POPL (2002)
Ramsey, N.: All you need is the monad… what monad was that again? In: PPS Workshop (2016)
Scherrer, C.: An exponential family basis for probabilistic programming. In: Proceedings of the PPS 2017 (2017)
Ścibior, A., Ghahramani, Z., Gordon, A.D.: Practical probabilistic programming with monads. In: Proceedings of the Haskell Symposium. ACM (2015)
Shan, C.C., Ramsey, N.: Symbolic Bayesian inference by symbolic disintegration (2016)
Sharpe, M.: General Theory of Markov Processes. Academic Press, Cambridge (1988)
Stan Development Team: Stan: a C++ library for probability and sampling, version 2.5.0 (2014). http://mc-stan.org/
Staton, S.: Freyd categories are enriched Lawvere theories. In: Algebra, Coalgebra and Topology, ENTCS, vol. 303 (2013)
Staton, S., Levy, P.: Universal properties for impure programming languages. In: Proceedings of the POPL 2013 (2013)
Staton, S., Yang, H., Heunen, C., Kammar, O., Wood, F.: Semantics for probabilistic programming: higherorder functions, continuous distributions, and soft constraints. In: Proceedings of the LICS 2016 (2016)
Toronto, N., McCarthy, J., Van Horn, D.: Running probabilistic programs backwards. In: Vitek, J. (ed.) ESOP 2015. LNCS, vol. 9032, pp. 53–79. Springer, Heidelberg (2015). doi:10.1007/978-3-662-46669-8_3
Vickers, S.: A monad of valuation locales. Available from the author’s website (2011)
Wood, F., van de Meent, J.W., Mansinghka, V.: A new approach to probabilistic programming inference. In: AISTATS (2014)
Acknowledgements
I am profoundly grateful to my co-authors on [43] for many discussions about the problems and examples in this subject (C. Heunen, O. Kammar, F. Wood, H. Yang). The MFPS 2016 special session on probabilistic programming was also illuminating, and it was helpful to discuss these issues with the participants (J. Borgström, D. Roy, C.C. Shan and others). Thanks too to Adam Ścibior and the ESOP reviewers.
Research supported by a Royal Society University Research Fellowship.
© 2017 Springer-Verlag GmbH Germany
Staton, S. (2017). Commutative Semantics for Probabilistic Programming. In: Yang, H. (ed.) Programming Languages and Systems. ESOP 2017. Lecture Notes in Computer Science, vol. 10201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54434-1_32