An Application of Computable Distributions to the Semantics of Probabilistic Programming Languages
Abstract
Most probabilistic programming languages for Bayesian inference give either an operational semantics in terms of sampling, or a denotational semantics in terms of measure-theoretic distributions. It is important that we can relate the two, given that practitioners often reason about distributions both analytically (e.g., in terms of densities) and algorithmically (i.e., in terms of sampling). In this paper, we give a denotational semantics to a functional language extended with continuous distributions and show that, by restricting attention to computable distributions, we can realize a corresponding sampling semantics.
Keywords
Probabilistic programs · Computable distributions · Semantics
1 Introduction
Probabilistic programming for Bayesian inference hopes to simplify probabilistic modeling by (1) providing a programming language for users to formally specify probabilistic models and (2) automating the task of inferring model parameters given observed data.
A natural interpretation of these probabilistic programs is as a sampler (e.g., [9, 35]). In essence, we can view a probabilistic program as describing a sampling procedure encoding how one believes observed data is generated. The flexibility of sampling has inspired embeddings of probabilistic primitives in full-fledged programming languages, including Scheme and Scala (e.g., [9, 23]). However, languages based on sampling often avoid the basic question of what it means to sample from a continuous distribution. For instance, these languages either have semantics given by an implementation in a host language (e.g., [9]), or use an abstract machine that assumes reals and primitive continuous distributions [21].
Another approach grounds the semantics of probabilistic programs in measure theory, which is the foundation of probability theory (e.g., [3, 29]). Measure theory provides a rigorous definition of conditioning, and a uniform treatment of discrete and continuous distributions. For instance, measure-theoretic probability makes sense of why the probability of obtaining any sample from an (absolutely) continuous distribution is zero, as well as what it means to observe a probability zero event. These situations are encountered frequently in practice, as many models incorporate both continuous distributions and observation of real-valued data. Consequently, measure-theoretic semantics have been proposed as a generalization of sampling semantics (e.g., [3]). However, this approach also has its drawbacks. First, measure theory has been developed without programming language constructs in mind (e.g., recursion and higher-order functions), unlike standard denotational semantics. Hence, languages based on measure theory often omit language features (e.g., [3]) or develop new metatheory (e.g., [29]). Second, measure-theoretic semantics must also develop a corresponding operational theory. For instance, Toronto et al. [29] give denotational semantics and later show how to implement a sound approximation of their ideal semantics, because it is not directly implementable.
In this paper, we address the shortcomings of these existing approaches: we give a denotational semantics to a functional language with continuous distributions that uses largely standard semantic constructs, and we make the connection to a faithful sampling implementation. Our approach is motivated by Type-2 computable distributions, which admit Type-2 computable sampling procedures.^{1} Computable distributions have been defined and studied in the context of Type-2 (Turing) machines and algorithmic randomness (e.g., [6, 7, 31]). Their implications for probabilistic programs have also been hinted at in the literature (e.g., [1, 5]). Hence, we will recast these ideas in the context of high-level probabilistic languages for Bayesian inference.
There are advantages to giving semantics with computable distributions in mind, instead of directly using measure-theoretic distributions. First, every computable distribution admits a sampling algorithm, which operates on input bit-randomness (e.g., a stream of coin flips) instead of requiring black-box primitives that generate Uniformly distributed values. Indeed, a computable distribution is completely characterized by a sampling algorithm (see Proposition 1), which reflects the intuition that we can express a distribution by describing how to sample from it. Second, we can use results from computability theory to guide the design of such a language. In particular, Ackerman et al. show that computing a conditional distribution is akin to solving the Halting problem [1]. Thus, our language provides only conditioning operators for restricted settings. Finally, computable distributions can be represented as ordinary program values. This means that we can use standard programming language metatheory, and it enables us to scale our approach to probabilistic languages embedded in full-fledged programming languages proposed by the machine learning community (e.g., [9, 23]).
However, our approach has some limitations. As we already mentioned above, conditioning is not computable in general, so we do not give semantics to a generic conditioning operator (see Sect. 5). Nonetheless, the situation where we do give semantics to conditioning corresponds to an effective version of Bayes' rule, which is central to Bayesian inference in practice. Second, even though our approach gives realizable semantics, it is not necessarily efficient, because algorithms that operate on computable distributions compute by enumeration. It would be interesting to see whether we can efficiently implement algorithms that compute approximately with computable distributions, but this is not in scope for this paper.
2 The Basic Idea
We begin by expanding upon the issues involved in giving semantics to probabilistic programming languages. In order to focus on the probabilistic aspect, we informally consider simple, first-order languages without recursion, extended with discrete and continuous distributions.
In this paper, we propose to address the gap between a denotational semantics with continuous distributions and an algorithmic sampling interpretation using Type-2 computable distributions. To this end, we will present a core probabilistic language extending PCF with distributions and give it semantics using largely standard techniques. Because we want the denotational semantics to be easily relatable to a sampling semantics, we restrict attention to distributions on topological spaces (in this paper, computable metric spaces) instead of the measurable spaces (i.e., tuples \((X, \mathcal {F})\) of a set and a \(\sigma \)-algebra) used in standard measure-theoretic probability. Thus, our semantics can support only Borel measures, i.e., measures defined uniquely on the open sets of the topology. Nonetheless, this covers a wide class of distributions used in practice, including familiar continuous distributions on the reals \(\mathbb {R}\) (e.g., Uniform, Gaussian, etc.) and (infinite) products of reals.
After we give the semantics, we will identify a subset of the denotations as corresponding to Type-2 computable objects. Importantly, we can implement Type-2 computable operators in a standard programming language. Thus, we will provide an implementation of the sampling semantics as a Haskell library. Moreover, the semantics of conditioning (when computable) can be given as a program that computes the conditional distribution (see Sect. 5). Hence, we can also give semantics to conditioning as a library function. We will see that we can write familiar probabilistic programs with familiar reasoning principles. Because we do not need anything beyond standard language semantics, our approach can also be used to give semantics to probabilistic languages embedded in full-fledged languages proposed and used by the machine learning community (e.g., [9, 23, 35]).
3 Background on Computability and Distributions
In this section, we introduce background on Type-2 computability (see the Type Two Theory of Effectivity, or TTE [32]), which can be used to provide a notion of computability on reals and distributions. The background will primarily be useful for connecting the denotational semantics to an implementation of the sampling semantics (see Sect. 4.2). We start by illustrating the basic idea behind Type-2 computability using computable reals.
Intuitively, a real is computable if we can enumerate its binary expansion. Of course, a Turing machine algorithm can only enumerate a finite prefix of the expansion in a finite amount of time. Thus, Type-2 computability extends the conventional notion of algorithm (which terminates) to account for computing on bitstreams. A Type-2 algorithm is specified using conventional Turing machine code (i.e., with a finite set of states and a finite state-transition specification) and computes a (partial) bitstream function (i.e., \(g_M: \{0, 1\}^\omega \rightharpoonup \{0, 1\}^\omega \)) such that finite prefixes of the output are determined by finite prefixes of the input (called the finite prefix property). This captures the intuition that a Type-2 algorithm computes a function to arbitrary precision in finite time using a finite amount of input, even though computing the entire output bitstream cannot be done in finite time. Now that we have clarified what we mean by enumerate, we can return to computable reals.
More formally, a real \(x \in \mathbb {R}\) is computable if we can enumerate a fast Cauchy sequence of rationals that converges to x. Recall that a sequence \((q_n)_{n \in \mathbb {N}}\) is Cauchy if for every \(\epsilon > 0\), there is an N such that \(d(q_n, q_m) < \epsilon \) for every \(n, m > N\). Thus, the elements of a Cauchy sequence become closer and closer to one another as we traverse the sequence. When \(d(q_n, q_{n + 1}) < 2^{-n}\) for all n, we call \((q_n)_{n \in \mathbb {N}}\) a fast Cauchy sequence. Hence, the representation of a computable real as a fast Cauchy sequence evokes the idea of enumerating its binary expansion.
A function \(f: \mathbb {R}\rightarrow \mathbb {R}\) is computable if, given a fast Cauchy encoding of \(x \in \mathbb {R}\), there is an algorithm that outputs a fast Cauchy sequence converging to f(x). For example, addition \(+: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) is computable because an algorithm can add the (Cauchy) input sequences elementwise (shifting indices to restore the fast Cauchy modulus) to obtain a (Cauchy) output sequence. Next, we introduce computable metric spaces, which can be used to generalize the ideas above to computable distributions.
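As an informal illustration (not part of the paper's development), the fast Cauchy representation and elementwise addition can be sketched in Python. Here a real is a function from an index n to a rational within \(2^{-(n+1)}\) of the limit; the example `sqrt2` and the error bounds are our own for the sketch.

```python
from fractions import Fraction

# A computable real is represented by a function n -> rational q_n forming
# a fast Cauchy sequence: d(q_n, q_{n+1}) < 2**(-n).

def sqrt2(n):
    """Rational approximation of sqrt(2) to within 2**(-(n+1)), by bisection."""
    lo, hi = Fraction(1), Fraction(2)
    while hi - lo > Fraction(1, 2 ** (n + 1)):
        mid = (lo + hi) / 2
        if mid * mid <= 2:
            lo = mid
        else:
            hi = mid
    return lo

def add(x, y):
    """Elementwise addition with an index shift: summing two fast Cauchy
    sequences pointwise doubles the error, so skipping one element
    restores the 2**(-n) modulus."""
    return lambda n: x(n + 1) + y(n + 1)

two_sqrt2 = add(sqrt2, sqrt2)
assert abs(float(two_sqrt2(10)) - 2 * 2 ** 0.5) < 2 ** -9
```

The index shift in `add` is the standard trick for keeping outputs fast Cauchy, matching the requirement that finite output prefixes be determined by finite input prefixes.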
As a reminder, a metric space (X, d) is a set X equipped with a metric \(d: X \times X \rightarrow \mathbb {R}\). A metric induces a collection of sets called (open) balls, where the ball centered at \(c \in X\) with radius \(r \in \mathbb {R}\) is the set of points within r of c, i.e., \(B(c, r) = \{x \in X \mid d(c, x) < r\}\). In this paper, the topology we associate with a metric space will be the one induced by the collection of balls. Hence, a Borel measure will assign probabilities to open balls. For example, \((\mathbb {R}, d_{\text {Euclid}})\) gives the familiar Euclidean topology on the reals \(\mathbb {R}\) (with Euclidean distance \(d_{\text {Euclid}}\)), and a Borel distribution on the reals assigns probabilities to open intervals.
To define a computable metric space, we need some additional properties on (X, d). First, we must be able to approximate elements of X using elements from a simpler, countable set S. This way, we can encode an element \(s \in S\) as a finite sequence of bits and an element \(x \in X\) as the stream of bits corresponding to a sequence of elements of S that converges to x. More formally, we say S is dense in X if for every \(x \in X\), there is a sequence \((s_n)_{n \in \mathbb {N}}\) that converges to x, where \(s_n \in S\) for every n. Second, (X, d) should be complete, i.e., every Cauchy sequence comprised of elements from X also converges to a point in X. Putting this together gives the definition of a computable metric space.
Definition 1
(Computable metric space). A computable metric space is a tuple (X, d, S) such that:
1. (X, d) is a complete metric space.
2. S is a countable, dense subset of X with a fixed numbering.
3. For all \(i, j \in \mathbb {N}\), \(d(s_i, s_j)\) is computable, uniformly in \(\langle i, j\rangle \) (i.e., the function \((i, j) \mapsto d(s_i, s_j)\) is computable), where \(\langle \cdot , \cdot \rangle : \mathbb {N}\times \mathbb {N}\rightarrow \mathbb {N}\) is a pairing function.
The computable reals described at the beginning of this section correspond to the computable metric space \((\mathbb {R}, d_{\text {Euclid}}, \mathbb {Q})\).
Alternatively, we can think of a computable distribution on a computable metric space (X, d, S) in terms of sampling, i.e., as a (Type2) computable function \(\{0, 1\}^\omega \rightarrow X\). To make this more concrete, we sketch an algorithm that samples from the standard Uniform. The idea is to generate a value that can be queried for more precision instead of a sample x in its entirety. Thus, a sampling algorithm will interleave flipping coins with outputting an element to the desired precision, such that the sequence of outputs \((s_n)_{n \in \mathbb {N}}\) converges to a sample.
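The Uniform sampler sketched above can be made concrete as follows (a Python sketch of the idea; the paper's implementation is a Haskell library). After n coin flips the sampler outputs the dyadic rational \(q_n = \sum_{i \le n} b_i 2^{-i}\), and these outputs form a fast Cauchy sequence converging to the sample:

```python
import random
from fractions import Fraction

def uniform_sampler(bits):
    """Given a sequence of fair coin flips, yield dyadic rationals q_n that
    converge to a Uniform(0, 1) sample.  Each output refines the previous
    one: |q_{n+1} - q_n| <= 2**(-(n+1)) < 2**(-n), so the sequence is fast
    Cauchy and the sample can be queried to any desired precision."""
    q = Fraction(0)
    for i, b in enumerate(bits, start=1):
        q += Fraction(b, 2 ** i)
        yield q

flips = [random.randint(0, 1) for _ in range(20)]
approxs = list(uniform_sampler(flips))
assert all(0 <= q < 1 for q in approxs)
assert all(abs(approxs[i + 1] - approxs[i]) <= Fraction(1, 2 ** (i + 2))
           for i in range(19))
```

The sampler interleaves consuming bit-randomness with emitting approximations, rather than producing the sample in its entirety.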
The sampling view is equivalent to the definition of a computable distribution in terms of computable metric spaces. From a practical perspective, this means we can implement probabilistic programs as samplers and still capture the class of computable distributions. For completeness, we state the equivalence below. First, a computable probability space [12, Definition 5.0.1] \((\mathcal {X}, \mu )\) is a pair where \(\mathcal {X}\) is a computable metric space and \(\mu \) is a computable distribution. A function \(s: (\mathcal {X}, \mu ) \rightarrow (\mathcal {Y}, \nu )\) is measure-preserving if \(\nu (A) = \mu (s^{-1}(A))\) for all measurable A. We call a distribution \(\mu \) on \(\mathcal {X}\) samplable if there is a function \(s: (2^\omega , \mu _{\text {iid}}) \rightarrow (\mathcal {X}, \mu )\) that is computable on a domain \(\text {dom}(s)\) of full measure and is measure-preserving.
Proposition 1
(Computable iff samplable). A distribution \(\mu \in \mathcal {M}(X)\) on computable metric space \((X, d, \mathcal {S})\) is computable iff it is samplable (see [4, Lemmas 2 and 3]).
4 Semantics
In this section, we describe a core probabilistic language based on PCF and give it semantics using largely standard constructs (see [34]). The semantics of distributions will be given in terms of both valuations (a topological variant of distributions) and samplers. After we give the semantics, we identify a subset of the denotations as Type2 computable, and give an implementation of the sampling semantics as a Haskell library. We will refer to Haskell plus the library as \(\lambda _{CD}\).
4.1 A Core Language and Its Semantics
Interpretation of Types. The interpretation of types will use both complete partial orders (CPO's) and computable metric spaces. The former is a standard structure from denotational semantics used to give meaning to recursion. The latter were introduced in the background and are the spaces on which we consider distributions.
To start, we introduce basic notation for constructing CPO’s that will be used to give the interpretation of types \(\mathcal {V}\!\!\llbracket \cdot \rrbracket \!\!\). The construction \(\text {Disc}(X)\) equips the set X with the discrete order. \(D_\bot \) creates a lifted domain with underlying set \(\{\lfloor d \rfloor \;  \;d \in D\} \cup \{\bot \}\) with the usual lifted ordering. The CPO construction \(D \Rightarrow E\) creates the CPO of continuous functions between CPO’s D and E.
A valuation \(\nu : \mathcal {O}(X) \rightarrow [0, 1]\) is a function that assigns to each open set of a topological space \((X, \mathcal {O}(X))\) a probability, such that it is strict (\(\nu (\emptyset ) = 0\)), monotone, and modular (\(\nu (U) + \nu (V) = \nu (U \cup V) + \nu (U \cap V)\) for every open U and V). The type of a valuation can be given as a CPO. To see this, let \(\mathcal {O}^\subseteq (X)\) be the open sets on a space X ordered by set inclusion and let \([0, 1]^\uparrow \) be the interval [0, 1] ordered by \(\le \). Then, a valuation can be seen as an element of the CPO \(\mathcal {O}^\subseteq (X) \Rightarrow [0, 1]^\uparrow \). A valuation is similar to a measure (hence, a topological variant of a distribution), but a valuation does not necessarily satisfy countable additivity. A valuation that does is called an \(\omega \)-continuous valuation, where \(\omega \)-continuous means \(\nu (\cup _{n \in \mathbb {N}} O_n) = \sup _{n \in \mathbb {N}} \nu (O_n)\) for increasing open \(O_n\), \(n \in \mathbb {N}\). Indeed, every Borel measure \(\mu \) on \(\mathcal {X}\) can be restricted to an \(\omega \)-continuous valuation \(\mu _{\mathcal {O}(X)} : [\mathcal {O}^\subseteq (X) \Rightarrow [0, 1]^\uparrow ]\). Moreover, a computable distribution can be identified with an \(\omega \)-continuous valuation (see [12, Proposition 4.2.1] and [28]). Now, we can proceed to interpret \(\texttt {Samp} \; \alpha \).
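Before moving on, the valuation axioms can be checked concretely on a small example. The following Python sketch (illustrative only, not from the paper) verifies that a point-mass valuation on a finite discrete space is strict, monotone, and modular:

```python
from itertools import chain, combinations

# All subsets of a small finite space are open in the discrete topology.
X = {0, 1, 2}
opens = [frozenset(s)
         for s in chain.from_iterable(combinations(X, r) for r in range(4))]

def point_mass(x):
    """The valuation of a point mass at x: U |-> 1 if x in U else 0."""
    return lambda U: 1 if x in U else 0

nu = point_mass(1)
assert nu(frozenset()) == 0                                      # strict
assert all(nu(U) <= nu(V) for U in opens for V in opens if U <= V)  # monotone
assert all(nu(U) + nu(V) == nu(U | V) + nu(U & V)
           for U in opens for V in opens)                        # modular
```

On a finite space the valuation is just a finitely additive set function; the subtleties of ω-continuity only appear over infinite unions.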
One idea used in the study of probabilistic powerdomains is to use valuations to put distributions on CPO's (see [13]) via the CPO's Scott topology. For example, the Scott topology of the CPO \(\text {Disc}(\mathbb {N})\) is the powerset \(2^\mathbb {N}\), which is the \(\sigma \)-algebra associated with distributions on the naturals. However, we cannot apply a valuation in this manner to \(\text {Disc}(\mathbb {R})\) to obtain a familiar continuous distribution used in statistics: the Scott topology derived from the CPO \(\text {Disc}(\mathbb {R})\) is the powerset \(2^\mathbb {R}\), which is not the Borel \(\sigma \)-algebra on \(\mathbb {R}\). Instead, we will start with the topology of a computable metric space, which includes familiar continuous distributions, and then derive a CPO from it to support recursion.
To this end, we will use the specialization preorder, which orders \(x \sqsubseteq y\) if every open set that contains x also contains y, to convert a topological space \((X, \mathcal {O}(X))\) into a preordered set. Intuitively, \(x \sqsubseteq y\) if x contains less information than y. In a computable metric space, we can always find an open ball that separates two distinct points x and y (the distance between two distinct points is positive). Hence, the specialization preorder of a computable metric space always gives the discrete order (as an information ordering), and hence, degenerately, a CPO. For example, the specialization preorder of the computable metric space \((\mathbb {R}, d, \mathbb {Q})\) is \(\text {Disc}(\mathbb {R})\). We can now start to put this together to convert a computable metric space into a CPO.
The denotation of the sampler contains an extra lifting to distinguish \(\texttt {bot} : \texttt {Samp} \; \alpha \) from \(\texttt {return} \; \texttt {bot} : \texttt {Samp} \; \alpha \), where \(\mathcal {E}\!\!\llbracket \texttt {bot}\rrbracket \!\!\rho = \bot \). In the former, we obtain the bottom valuation, which assigns 0 mass to every open set. This corresponds to the sampling function \(\lambda u \in 2^\omega . \; \bot \). In the latter, we obtain the valuation that assigns 0 mass to every open set, except for those open sets containing \(\lfloor \bot \rfloor \), which are assigned mass 1. This corresponds to the sampling function \(\lambda u \in 2^\omega . \; \lfloor \bot \rfloor \).
Semantics. The expression denotation function \(\mathcal {E}\!\!\llbracket \varGamma \vdash e : \alpha \rrbracket \!\!\rho \in \mathcal {V}\!\!\llbracket \alpha \rrbracket \!\!\) is defined by induction on the typing derivation. The denotation is also parameterized by a global environment \(\Upsilon \) that interprets constants c and primitive distributions d. The following notation will be used for writing the semantics. We lift a function using \(\dagger \in (D \Rightarrow E_\bot ) \Rightarrow (D_\bot \Rightarrow E_\bot )\), defined in the usual way. The function \(\text {lift}(d) = \lfloor d \rfloor \) lifts an element. Finally, the notation \(\text {let} \; x = e_1 \; \text {in} \; e_2\) is a strict let binding.
Properties. We need to check that the semantics we gave above is well-defined, particularly for the cases corresponding to manipulating distributions. Recall that the interpretation of \(\texttt {Samp} \; \alpha \) is a pair of a valuation \(\nu \) and a sampling function s, such that the pushforward of s is equivalent to \(\nu \). Thus, for the meaning of \(\texttt {return} \; e\), we will need to argue that a sampler that ignores its input randomness is equivalent to a point-mass valuation. For the meaning of \(x \leftarrow e_1 \; ; \; e_2\), we will need to relate a composition of sampling functions to a reweighting by integration. The first sampling function passes input bits that are unconsumed to the second sampling function. We state this formally below.
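Informally, the composition of sampling functions can be pictured as a bind that threads unconsumed randomness from the first sampler to the second. The following Python sketch (with invented names; the paper's development is denotational, over bitstreams rather than finite lists) shows the idea:

```python
def coin(bits):
    """Consume one bit of input randomness; return it with the unconsumed rest."""
    return bits[0], bits[1:]

def bind(s, k):
    """Run sampler s, then hand its leftover bits to the sampler k(value),
    mirroring how x <- e1; e2 passes unconsumed bits from e1 to e2."""
    def run(bits):
        v, rest = s(bits)
        return k(v)(rest)
    return run

# Example: flip a coin, then flip that many extra coins and sum everything.
prog = bind(coin, lambda b: (lambda bits: (b + sum(bits[:b]), bits[b:])))
value, leftover = prog([1, 1, 0, 1])
assert (value, leftover) == (2, [0, 1])
```

The pushforward of `prog` over uniformly random bits is then the integral of the second sampler's pushforward against the first's, which is the content of Lemma 1.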
Lemma 1
(Push). Let X and Y be computable metric spaces.
1. Let \(f \in 2^\omega \Rightarrow (S \circ L(X))_\bot \) be a constant function, i.e., \(f(u) = \lfloor d \rfloor \) for every u. Then \(\texttt {psh}(f) = \lambda U. \; 1_{U}(d)\).
2. Let \(f \in 2^\omega \Rightarrow (S \circ L(X) \times 2^\omega )_\bot \) such that \(\texttt {psh}(f) = \texttt {psh}((\text {lift}\circ \pi _1)^\dagger \circ f) \otimes \mu _{\text {iid}}\) (independence) and \(g \in S \circ L(X) \times 2^\omega \Rightarrow (S \circ L(Y))_\bot \). Then,$$\begin{aligned} \texttt {psh}(g^\dagger \circ f) = \lambda U. \; \int \lambda v. \; \texttt {psh}(g(v))(U) \; d (\texttt {psh}((\text {lift}\circ \pi _1)^\dagger \circ f)) \end{aligned}$$
Consequently, the expression denotation function is welldefined.
Lemma 2
(Denotation well-defined). If \(\varGamma \vdash e : \alpha \), then \(\mathcal {E}\!\!\llbracket e\rrbracket \!\!\rho \in \mathcal {V}\!\!\llbracket \alpha \rrbracket \!\!\).
The expression denotation function remains well-defined when we restrict to the computable interpretation of types, assuming that our global environment contains only Type-2 computable elements and functions. Informally, this follows because Type-2 computable functions are closed under composition. For instance, the composition of Type-2 computable sampling algorithms is Type-2 computable.
Lemma 3
(Denotation well-defined). If \(\varGamma \vdash e : \alpha \), then \(\mathcal {E}\!\!\llbracket e\rrbracket \!\!\rho \in \mathcal {V}^c\!\!\llbracket \alpha \rrbracket \!\!\).
In the next section, we will implement the sampling semantics as a Haskell library.
4.2 Computable Distributions as a Library
We now present a Haskell library (Fig. 1) for expressing samplers that implements the ideas from the previous section. In particular, we will show that we can implement the sampling semantics in Haskell without assuming reals or primitive continuous distributions.
The module \(\texttt {A}\) contains operations for computable metric spaces. First, the type \(\texttt {A} \; \alpha \) models an element of a computable metric space. It can be read as an approximation by a sequence of values of type \(\alpha \). For example, a computable real can be given the type \(\texttt {CReal}\triangleq \texttt {A} \; \texttt {Rat}\), meaning it is a sequence of rationals that converges to a real. We can form values of type \(\texttt {A} \; \alpha \) using \(\texttt {approx}\), which requires us to check that the function we are coercing describes a fast Cauchy sequence, and project out approximations using \(\texttt {anth}\).
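As a rough illustration of this interface, the following is a Python mock-up of the signatures: `approx` and `anth` are the library's names, but the class, the example, and the error bounds are invented for the sketch, and the fast Cauchy check that the library requires is elided here.

```python
from fractions import Fraction

class A:
    """A value of a computable metric space, represented by a (fast Cauchy)
    sequence of simple approximants, indexed by precision."""
    def __init__(self, f):
        self.f = f  # f : int -> approximant; caller must ensure fast Cauchy

def approx(f):
    """Coerce a function into an A value.  The library requires checking
    that f describes a fast Cauchy sequence; this sketch trusts the caller."""
    return A(f)

def anth(x, n):
    """Project out the n-th approximation."""
    return x.f(n)

# CReal = A Rat: a computable real as a sequence of rationals.
# 1/3 approximated from below by dyadic rationals within 2**-(n+2).
third = approx(lambda n: Fraction(2 ** (n + 2) // 3, 2 ** (n + 2)))
assert abs(anth(third, 10) - Fraction(1, 3)) < Fraction(1, 2 ** 10)
```

Queries via `anth` return answers at any requested precision without ever materializing the whole real, matching the finite prefix property.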
5 Conditioning
Conditioning is the core operation of Bayesian inference. As we alluded to earlier, conditioning is not computable in general [1]. We could use the more abstract definition of conditioning from measure theory, but it would be undesirable for the semantics of conditioning in a probabilistic programming language to be uncomputable, given that one of its goals is to automate inference. Instead, we take a library approach, which requires the client to provide an implementation of a conditioning algorithm and limits us to situations where conditioning is computable. Hence, the semantics of our core language remains unchanged.
5.1 Preliminaries
Conditioning can be defined in more abstract settings when we do not have densities. First, we need to introduce additional measure-theoretic definitions. A tuple \((\varOmega , \mathcal {F}, \mu )\) is called a probability space, where \((\varOmega , \mathcal {F})\) is a measurable space and \(\mu \) is a measure. We will omit the underlying \(\sigma \)-algebra and measure when they are unambiguous from context. A function \(f: \varOmega _1 \rightarrow \varOmega _2\) is measurable if \(f^{-1}(B)\) is measurable for every measurable B. In the case of metric spaces, the measurable sets are generated by the open balls, so measurable functions are a superset of the continuous functions. A random variable taking values in a measurable space S is a measurable function \(X: \varOmega \rightarrow S\). The distribution of a random variable X, written \(\mathbf {P}[X \in \cdot ]\), is defined as the pushforward of the underlying measure, i.e., \(\mathbf {P}[X \in A] = \mu (X^{-1}(A))\) for all measurable A. We write \(X \sim \mu \) to indicate that the random variable X is distributed according to \(\mu \).
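The pushforward definition can be illustrated on a small discrete example (a Python sketch, not from the paper):

```python
from fractions import Fraction

# The distribution of a random variable X is the pushforward of the
# underlying measure: P[X in A] = mu(X^{-1}(A)).

omega = [0, 1, 2, 3]
mu = {w: Fraction(1, 4) for w in omega}   # uniform base measure on omega
X = lambda w: w % 2                        # a measurable map into {0, 1}

def pushforward(mu, X, A):
    """Sum the mass of all outcomes that X maps into the event A."""
    return sum(p for w, p in mu.items() if X(w) in A)

assert pushforward(mu, X, {0}) == Fraction(1, 2)
assert pushforward(mu, X, {0, 1}) == 1
```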
These definitions have computable versions.
Definition 2
(Computable probability kernel [1, Definition 4.2]). Let S and T be computable metric spaces, and let \(\mathcal {B}_T\) be the Borel \(\sigma \)-algebra on T. A probability kernel \(\kappa : S \times \mathcal {B}_T \rightarrow [0, 1]\) is computable if \(\kappa (\cdot , A)\) is a lower semicomputable function for every r.e. open set \(A\).
Definition 3
(Computable conditional distribution [1, Definition 4.7]). Let X and Y be random variables in computable metric spaces S and T. Let \(\kappa \) be a version of \(\mathbf {P}[Y \mid X]\) (notation for \(\mathbf {P}[Y \in \cdot \mid X]\)). Then \(\mathbf {P}[Y \mid X]\) is computable if \(\kappa \) is computable on a \(\mathbf {P}_X\)-measure-one subset.
5.2 Conditioning
Now, we add conditioning as a library to \(\lambda _{CD}\) (Fig. 2). \(\lambda _{CD}\) provides only a restricted conditioning operation \(\texttt {obs\_dens}\), which requires a conditional density. We will see that the computability of \(\texttt {obs\_dens}\) corresponds to an effective version of Bayes’ rule, which is central to Bayesian inference, and hence, widely applicable in practice. We have given only one conditioning primitive here, but it is possible to identify other situations where conditioning is computable and add those to the conditioning library. For example, conditioning on positive probability events is computable (see [7, Proposition 3.1.2]).
The library provides the conditioning operation \(\texttt {obs\_dens}\), which enables us to condition on continuousvalued data when a bounded and computable conditional density is available.
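One standard construction behind conditioning with a bounded computable density is rejection sampling: draw from the prior and accept with probability proportional to the density at the observed data. The sketch below (Python, illustrative; the function name is invented and the library's actual algorithm may differ) shows the idea on a toy discrete model:

```python
import random

def obs_dens_rejection(prior, dens, bound, y_obs, rng=random.random):
    """Condition on y_obs via rejection: draw (u, v) from the prior and
    accept with probability dens(y_obs, u) / bound.  Requires the
    conditional density to be bounded by `bound`."""
    while True:
        u, v = prior()
        if rng() < dens(y_obs, u) / bound:
            return u, v

# Toy model: u ~ fair coin, v = u, and y | u agrees with u with prob 0.9.
# Observing y = 1 should make u = 1 about 9 times likelier than u = 0.
random.seed(0)
prior = lambda: (random.randint(0, 1),) * 2
dens = lambda y, u: 0.9 if y == u else 0.1   # bounded by 0.9
samples = [obs_dens_rejection(prior, dens, 0.9, 1)[0] for _ in range(2000)]
assert 0.85 < sum(samples) / len(samples) < 0.95  # posterior P(u=1 | y=1) = 0.9
```

Note that the prior itself needs no density here; only the likelihood of the observation does, which is what makes the restricted operation widely applicable.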
Proposition 2
[1, Corollary 8.8]. Let U, V and Y be computable random variables, where Y is independent of V given U. Let \(p_{Y \mid U}(y \mid u)\) be a conditional density of Y given U that is bounded and computable. Then the conditional distribution \(\mathbf {P}[(U, V) \mid Y]\) is computable.
Another interpretation of the restricted situation is that our observations have been corrupted by independent smooth noise [1, Corollary 8.9]. To see this, let U be the random variable corresponding to our ideal model of how the data was generated, V be the random variable corresponding to the model parameters, and Y be the random variable corresponding to the corrupted data we observe. Notice that the model (U, V) is not required to have a density and can be an arbitrary computable distribution. Indeed, probabilistic programming systems proposed by the machine learning community impose a similar restriction (e.g., [9, 35]).
Because \(\texttt {obs\_dens}\) works with conditional densities, we do not need to worry about the Borel paradox. The Borel paradox shows that we can obtain different conditional distributions by conditioning on equivalent probability zero events [26]. In addition, note that it is not possible to create a boolean value that distinguishes two probability zero events in \(\lambda _{CD}\). For instance, the operator \(\texttt {==}\) implementing equality on reals returns false if two reals are provably not equal, and diverges otherwise, because equality on the reals is not decidable.
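The behavior of such an equality test can be sketched as follows (a Python sketch; real divergence is modeled with a `fuel` parameter, since a faithful implementation would simply never terminate on equal inputs). Two fast Cauchy sequences are provably not equal once their approximation intervals separate:

```python
from fractions import Fraction

def neq(x, y, fuel):
    """Semi-decide x /= y for computable reals given as fast Cauchy
    sequences.  Each q_n lies within 2 * 2**(-n) of its limit, so the
    limits differ once |x(n) - y(n)| > 4 * 2**(-n).  Returns None
    ('still running') instead of diverging, bounded by `fuel`."""
    for n in range(fuel):
        if abs(x(n) - y(n)) > Fraction(4, 2 ** n):
            return True
    return None

half = lambda n: Fraction(1, 2)
third = lambda n: Fraction(2 ** n // 3, 2 ** n)
assert neq(half, third, 50) is True   # separates once 4 * 2**(-n) < 1/6
assert neq(half, half, 50) is None    # would diverge: equality undecidable
```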
6 Examples
In this section, we give additional examples of distributions in \(\lambda _{CD}\), including a nonparametric prior (a distribution on a countable number of parameters) and a singular distribution (neither discrete nor continuous). This highlights the expressiveness of \(\lambda _{CD}\) and demonstrates how to apply the reasoning principles from before.
The encodings show that we can use probabilistic and standard program reasoning principles at the same time. Because we checked that each program encodes its respective representation, we also obtain that sampling sticks an infinite number of times is equivalent to urn, because both encode the Dirichlet process. This might seem strange, because the urn encoding has more sequential dependencies than the stick-breaking representation. The equivalence relies on a probabilistic concept called exchangeability, which asserts the existence of a conditionally independent representation if the distribution is invariant under all finite permutations. Exchangeability has been studied in the Type-2 setting [5], and it would be interesting to see if we can lift those results into \(\lambda _{CD}\).
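For concreteness, the stick-breaking representation mentioned above can be sketched as follows (a Python sketch truncated to finitely many sticks; the function name is invented, and a faithful encoding in \(\lambda _{CD}\) would produce the sticks lazily):

```python
import random

def stick_breaking(alpha, k, rng=random.Random(0)):
    """Stick-breaking weights of a Dirichlet process: v_j ~ Beta(1, alpha),
    and the j-th stick is v_j times the length of stick remaining after
    the first j-1 breaks.  Only the first k sticks are materialized."""
    weights, remaining = [], 1.0
    for _ in range(k):
        v = rng.betavariate(1, alpha)
        weights.append(v * remaining)
        remaining *= 1 - v
    return weights

w = stick_breaking(alpha=1.0, k=100)
assert all(x > 0 for x in w)
assert 0 < sum(w) <= 1  # the sticks sum to 1 only in the infinite limit
```

Larger `alpha` breaks the stick into more, smaller pieces, spreading mass over more atoms.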
7 Related Work
The semantics of probabilistic programs have been studied outside the context of Bayesian inference, in particular, to look at nondeterminism and probabilistic algorithms. For instance, Kozen [15] gives both a sampling semantics and denotational semantics to a first-order imperative language. Instead of CPO constructions, the denotational semantics uses constructs from analysis, and recovers order-theoretic structure to support recursion. In the functional setting, researchers have studied probabilistic powerdomains, which put distributions on CPO's so that the space of distributions itself forms a CPO. For example, Saheb [27] introduces the probabilistic powerdomain on CPO's by considering probability measures. Jones [13] considers powerdomains of valuations on CPO's instead and shows that this results in CPO's that have more desirable properties reminiscent of standard domains (e.g., \(\omega \)-algebraic). The work on powerdomains typically does not consider continuous distributions (we obtain the topology from a computable metric space as opposed to the Scott topology of a CPO) or Bayesian inference.
Semantics for probabilistic programs have also been studied with machine learning applications in mind. Ramsey et al. give denotational semantics to a higher-order language without recursion extended with discrete distributions using the probability monad, and show how to efficiently implement probabilistic queries [25]. Borgström et al. [3] use measure theory to give denotational semantics to a first-order language without recursion using measure transformers. The main focus of that work is to ensure that the semantics of conditioning on events of probability zero is well-defined. Toronto et al. [29] propose another measure-theoretic approach for a first-order language with recursion by interpreting probabilistic programs as preimage computations of measurable functions to support conditioning. They do not use standard constructs from denotational semantics, and thus do not handle recursion with a fixed-point construction. Instead, their target is a variant of lambda calculus extended with set-theoretic operations (\(\lambda _{ZFC}\)). Park et al. [21] give an operational semantics to an ML-like language in terms of sampling functions, but use an idealized abstract machine. Hence, they also use the observation that a continuous distribution is characterized by a sampling procedure, but do not explore connections to a denotational approach nor the (faithful) realizability of their operational semantics.
Probabilistic languages have also been proposed to study inference, and typically have been given operational semantics. Some languages are restricted to expressing well-established abstractions, such as factor graphs and Bayesian networks, in order to automate standard inference algorithms. In this restricted setting, they are given semantics in terms of the abstractions they express. For instance, the BUGS system [16] provides a language for users to specify Bayesian networks and implements Gibbs sampling, a Markov chain Monte Carlo algorithm used in practice that reduces the problem of sampling from a multivariate conditional distribution to sampling from multiple univariate conditional distributions. Several other probabilistic programming languages similar to BUGS have been developed for expressing factor graphs (e.g., [10, 17, 18]) and directed models (e.g., [11, 30]), and automate inference using standard algorithms, including message passing and HMC sampling.
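To illustrate the reduction that Gibbs sampling performs, the following sketch (our own toy example, not BUGS code) samples a standard bivariate normal with correlation \(\rho\) by alternately drawing each coordinate from its univariate conditional, \(x \mid y \sim N(\rho y, 1 - \rho ^2)\) and \(y \mid x \sim N(\rho x, 1 - \rho ^2)\):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Sample a standard bivariate normal with correlation rho by
    alternating draws from the two univariate conditionals."""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho * rho)   # conditional standard deviation
    x, y = 0.0, 0.0                 # arbitrary starting point
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2)
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
mean_x = sum(x for x, _ in samples) / len(samples)   # should be near 0
corr = sum(x * y for x, y in samples) / len(samples)  # E[xy], near rho
```

Each step only ever draws from a one-dimensional distribution, yet the chain's stationary distribution is the full multivariate target; this is the reduction that makes Gibbs sampling attractive for the graphical models BUGS expresses.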
Other researchers extend Turing-complete languages with the ability to sample from primitive distributions (e.g., [9, 14, 20, 23, 24, 33, 35]) in order to study inference in this richer setting. These languages have operational semantics given in terms of an inference method implemented in the host language. Examples include Stochastic Lisp [14] and Ibal [24], which have only discrete distributions. Others add continuous distributions, including Church [9], which embeds in Scheme, and Figaro [23], which embeds in Scala. The Church language performs inference by sampling program traces. Other languages have built upon this, including Probabilistic Matlab [33], Probabilistic C [35], and R2 [20], each proposing a different method to improve the efficiency of sampling program traces.
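The trace-sampling idea behind Church and its successors can be conveyed with a toy single-site Metropolis–Hastings sampler. This is a simplified sketch of ours, not Church's actual implementation: a trace records every random choice the program makes, and each step resamples one choice and accepts or rejects based on the model's likelihood.

```python
import random

def likelihood(trace):
    # Toy model: two independent fair coins a, b; we observe (a or b) == True.
    return 1.0 if (trace["a"] or trace["b"]) else 0.0

def single_site_mh(n_steps, seed=0):
    """Single-site Metropolis-Hastings over program traces."""
    rng = random.Random(seed)
    trace = {"a": True, "b": False}  # a trace consistent with the observation
    hits = 0
    for _ in range(n_steps):
        site = rng.choice(["a", "b"])        # pick one random choice
        proposal = dict(trace)
        proposal[site] = not proposal[site]  # resample (flip) that coin
        # Symmetric proposal and uniform prior, so the acceptance
        # probability reduces to the likelihood ratio.
        if rng.random() < likelihood(proposal) / likelihood(trace):
            trace = proposal
        hits += trace["a"]
    return hits / n_steps

# Posterior P(a = True | a or b) is 2/3 by enumeration.
estimate = single_site_mh(50000)
```

The chain walks over the three traces consistent with the observation and visits them uniformly, so the fraction of steps with `a == True` converges to 2/3; the languages cited above differ mainly in how proposals over traces are chosen and how much of a trace is re-executed per step.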
8 Discussion
In summary, we show that we can give sampling semantics to a high-level probabilistic language for Bayesian inference that corresponds to a natural denotational view. Our approach, by acknowledging the limits of computability, gives a semantics that matches the intuition of probabilistic programs encoding generative models as samplers. In particular, Type-2 computability makes algorithmic sense of sampling from continuous distributions, as well as of the difficulty of supporting a generic conditioning primitive. Moreover, we have shown that many ideas, such as those in the probabilistic powerdomains, can also be applied to give semantics to modern probabilistic languages designed for Bayesian inference. We end with a few directions for future work.
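The Type-2 view of sampling from a continuous distribution can be made concrete: a uniform sample on [0, 1] is an algorithm that, for any requested precision, emits a dyadic rational within that precision by drawing only finitely many random bits. A minimal sketch of ours, illustrating the idea rather than \(\lambda _{CD}\)'s implementation:

```python
import random
from fractions import Fraction

class UniformSample:
    """A Type-2 style uniform sample on [0, 1]: random bits are drawn
    lazily, and approx(n) returns a dyadic rational within 2^-n of the
    (never fully materialized) real being sampled."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.bits = []  # binary expansion drawn so far

    def approx(self, n):
        # Draw bits until we have n of them, then truncate the expansion.
        while len(self.bits) < n:
            self.bits.append(self.rng.randrange(2))
        return sum(Fraction(b, 2 ** (i + 1))
                   for i, b in enumerate(self.bits[:n]))

u = UniformSample(seed=42)
a3 = u.approx(3)    # within 2^-3 of the sample
a10 = u.approx(10)  # within 2^-10, and consistent with a3
```

Successive queries extend the same binary expansion, so the approximations cohere: `a10` refines `a3` rather than describing a different real. No finite computation ever produces the sample exactly, which is precisely why the probability of any single real is zero even though sampling is algorithmically meaningful.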
First, as we mentioned previously, algorithms that operate on computable distributions (and reals) are not necessarily efficient. In many practical situations, it is not necessary to compute to arbitrary precision as Type-2 computability demands, but only to sufficient precision. It would be interesting to see whether we can efficiently implement an approximate semantics in this relaxed setting. Second, our approach has not been designed with inference in mind. The language \(\lambda _{CD}\) is perhaps too expressive—we can express distributions that are not meaningful for inference (e.g., maybeLoop) and singular distributions that have limited applications (e.g., the Cantor distribution). It would be interesting to explore restricted language designs in which efficient inference is guaranteed.
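Singular distributions such as the Cantor distribution are nonetheless easy to express as samplers, which is why \(\lambda _{CD}\) admits them: a Cantor-distributed value is obtained by choosing each ternary digit to be 0 or 2 with a fair coin flip. A sketch of ours (not \(\lambda _{CD}\) code), truncated to finite precision:

```python
import random

def cantor_sample(n_digits, rng):
    """Approximate a draw from the Cantor distribution by choosing
    each ternary digit to be 0 or 2 with a fair coin flip."""
    x = 0.0
    for i in range(1, n_digits + 1):
        digit = 2 * rng.randrange(2)  # 0 or 2, each with probability 1/2
        x += digit / 3 ** i
    return x  # within 3^-n_digits of an exact Cantor-distributed real

rng = random.Random(1)
draws = [cantor_sample(30, rng) for _ in range(10000)]
mean = sum(draws) / len(draws)  # the Cantor distribution has mean 1/2
```

The sampler is trivial, yet the resulting distribution has no density and assigns probability zero to every point, so standard density-based inference machinery cannot be applied to it; this is the mismatch between expressiveness and inference noted above.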
Footnotes
1. Because we will use the phrase “Type-2 computable” frequently, we will sometimes abbreviate it to just “computable” when it is clear from context that we are referring to Type-2 computability.
2. Algorithms that operate on computable metric spaces compute by enumeration, so the algorithm is sensitive to the choice of enumeration.
3. That is, we implement the Type-2 machine code as a Haskell program. The implementation relies on Haskell’s imprecise exceptions mechanism [22] to express the modulus of continuity of a computable function (see Andrej Bauer’s blog http://math.andrej.com/2006/03/27/sometimes-all-functions-are-continuous).
Acknowledgements
This work was supported by Oracle Labs. We would like to thank Nate Ackerman, Stephen Chong, Jean-Baptiste Tristan, Dexter Kozen, Dan Roy, and our anonymous reviewers for helpful discussions and feedback.
References
1. Ackerman, N.L., Freer, C.E., Roy, D.M.: Noncomputable conditional distributions. In: Proceedings of the IEEE 26th Annual Symposium on Logic in Computer Science, LICS 2011, pp. 107–116. IEEE Computer Society, Washington, DC (2011)
2. Bailey, D., Borwein, P., Plouffe, S.: On the rapid computation of various polylogarithmic constants. Math. Comput. 66(218), 903–913 (1997)
3. Borgström, J., Gordon, A.D., Greenberg, M., Margetson, J., Van Gael, J.: Measure transformer semantics for Bayesian machine learning. In: Barthe, G. (ed.) ESOP 2011. LNCS, vol. 6602, pp. 77–96. Springer, Heidelberg (2011)
4. Freer, C.E., Roy, D.M.: Posterior distributions are computable from predictive distributions. In: Teh, Y.W., Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), vol. 9, pp. 233–240 (2010)
5. Freer, C.E., Roy, D.M.: Computable de Finetti measures. Ann. Pure Appl. Logic 163(5), 530–546 (2012)
6. Gács, P.: Uniform test of algorithmic randomness over a general space. Theor. Comput. Sci. 341(1), 91–137 (2005)
7. Galatolo, S., Hoyrup, M., Rojas, C.: Effective symbolic dynamics, random points, statistical behavior, complexity and entropy. Inf. Comput. 208(1), 23–41 (2010)
8. Giry, M.: A categorical approach to probability theory. In: Banaschewski, B. (ed.) Categorical Aspects of Topology and Analysis, pp. 68–85. Springer, Heidelberg (1982)
9. Goodman, N.D., Mansinghka, V.K., Roy, D., Bonawitz, K., Tenenbaum, J.B.: Church: a language for generative models. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, UAI 2008, pp. 220–229 (2008)
10. Hershey, S., Bernstein, J., Bradley, B., Schweitzer, A., Stein, N., Weber, T., Vigoda, B.: Accelerating inference: towards a full language, compiler and hardware stack. CoRR abs/1212.2991 (2012)
11. Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1351–1381 (2013)
12. Hoyrup, M., Rojas, C.: Computability of probability measures and Martin-Löf randomness over metric spaces. Inf. Comput. 207(7), 830–847 (2009)
13. Jones, C., Plotkin, G.: A probabilistic powerdomain of evaluations. In: Proceedings of the Fourth Annual Symposium on Logic in Computer Science, pp. 186–195. IEEE Press, Piscataway (1989)
14. Koller, D., McAllester, D., Pfeffer, A.: Effective Bayesian inference for stochastic programs. In: Proceedings of the 14th National Conference on Artificial Intelligence (AAAI), pp. 740–747 (1997)
15. Kozen, D.: Semantics of probabilistic programs. J. Comput. Syst. Sci. 22(3), 328–350 (1981)
16. Lunn, D., Spiegelhalter, D., Thomas, A., Best, N.: The BUGS project: evolution, critique and future directions. Statistics in Medicine (2009)
17. McCallum, A., Schultz, K., Singh, S.: Factorie: probabilistic programming via imperatively defined factor graphs. Adv. Neural Inf. Process. Syst. 22, 1249–1257 (2009)
18. Minka, T., Guiver, J., Winn, J., Kannan, A.: Infer.NET 2.3. Microsoft Research Cambridge (2009)
19. Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)
20. Nori, A.V., Hur, C.K., Rajamani, S.K., Samuel, S.: R2: an efficient MCMC sampler for probabilistic programs. In: AAAI Conference on Artificial Intelligence (2014)
21. Park, S., Pfenning, F., Thrun, S.: A probabilistic language based upon sampling functions. In: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2005, pp. 171–182. ACM, New York (2005)
22. Peyton Jones, S., Reid, A., Henderson, F., Hoare, T., Marlow, S.: A semantics for imprecise exceptions. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 1999, pp. 25–36. ACM, New York (1999)
23. Pfeffer, A.: Creating and manipulating probabilistic programs with Figaro. In: 2nd International Workshop on Statistical Relational AI (2012)
24. Pfeffer, A.: IBAL: a probabilistic rational programming language. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence, IJCAI 2001, vol. 1, pp. 733–740. Morgan Kaufmann Publishers Inc., San Francisco (2001)
25. Ramsey, N., Pfeffer, A.: Stochastic lambda calculus and monads of probability distributions. In: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2002, pp. 154–165. ACM, New York (2002)
26. Rao, M.M., Swift, R.J.: Probability Theory with Applications. Springer-Verlag New York Inc., Secaucus (2006)
27. Saheb-Djahromi, N.: CPO's of measures for nondeterminism. Theor. Comput. Sci. 12(1), 19–37 (1980)
28. Schröder, M.: Admissible representations for probability measures. Math. Logic Q. 53(4–5), 431–445 (2007)
29. Toronto, N., McCarthy, J., Van Horn, D.: Running probabilistic programs backwards. In: Vitek, J. (ed.) ESOP 2015. LNCS, vol. 9032, pp. 53–79. Springer, Heidelberg (2015)
30. Tristan, J.B., Huang, D., Tassarotti, J., Pocock, A.C., Green, S., Steele, G.L.: Augur: data-parallel probabilistic modeling. In: Advances in Neural Information Processing Systems, pp. 2600–2608 (2014)
31. Weihrauch, K.: Computability on the probability measures on the Borel sets of the unit interval. Theor. Comput. Sci. 219(1), 421–437 (1999)
32. Weihrauch, K.: Computable Analysis: An Introduction. Springer-Verlag New York Inc., Secaucus (2000)
33. Wingate, D., Stuhlmüller, A., Goodman, N.D.: Lightweight implementations of probabilistic programming languages via transformational compilation. In: Artificial Intelligence and Statistics, AISTATS 2011 (2011)
34. Winskel, G.: The Formal Semantics of Programming Languages: An Introduction. MIT Press, Cambridge (1993)
35. Wood, F., van de Meent, J.W., Mansinghka, V.: A new approach to probabilistic programming inference. In: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pp. 2–46 (2014)