Correctness of Sequential Monte Carlo Inference for Probabilistic Programming Languages

Probabilistic programming is an approach to reasoning under uncertainty by encoding inference problems as programs. In order to solve these inference problems, probabilistic programming languages (PPLs) employ different inference algorithms, such as sequential Monte Carlo (SMC), Markov chain Monte Carlo (MCMC), or variational methods. Existing research on such algorithms mainly concerns their implementation and efficiency, rather than the correctness of the algorithms themselves when applied in the context of expressive PPLs. To remedy this, we give a correctness proof for SMC methods in the context of an expressive PPL calculus, representative of popular PPLs such as WebPPL, Anglican, and Birch. Previous work has studied the correctness of MCMC using an operational semantics, and the correctness of SMC and MCMC in a denotational setting without term recursion. However, for SMC inference, one of the most commonly used algorithms in PPLs today, no formal correctness proof exists in an operational setting. In particular, an open question is whether the placement of resample locations in a probabilistic program affects the correctness of SMC. We solve this fundamental problem, and make four novel contributions: (i) we extend an untyped PPL lambda calculus and operational semantics to include explicit resample terms, expressing synchronization points in SMC inference; (ii) we prove, for the first time, that subject to mild restrictions, any placement of the explicit resample terms is valid for a generic form of SMC inference; (iii) as a result of (ii), our calculus benefits from classic results from the SMC literature: a law of large numbers and an unbiased estimate of the model evidence; and (iv) we formalize the bootstrap particle filter for the calculus and discuss how our results can be further extended to other SMC algorithms.


Introduction
operational semantics of an expressive functional PPL calculus based on the operational formalization in Borgström et al. [6], representative of common PPLs. The operational semantics assigns to each pair of a term t and an initial random trace (a sequence of random samples) a non-negative weight. This weight is accumulated during evaluation through a weight construct, which, in current calculi and implementations of SMC, is (implicitly) always followed by a resampling. To decouple resampling from weighting, we present our first contribution.
(i) We extend the calculus from Borgström et al. [6] to include explicit resample terms, expressing explicit synchronization points for performing resampling in SMC. With this extension, we also define a semantics which limits the number of evaluated resample terms, laying the foundation for the remaining contributions.
In Section 4, we define the probabilistic semantics of the calculus. The weight from the operational semantics is used to define unnormalized distributions ⟨⟨t⟩⟩ over traces and ⟦t⟧ over result terms. The measure ⟦t⟧ is called the target measure, and finding a representation of it is the main objective of inference algorithms.
We give a formal definition of SMC inference based on Chopin [9] in Section 5. This includes both a generic SMC algorithm and two standard correctness results from the SMC literature: a law of large numbers [9], and the unbiasedness of the likelihood estimate [32].
In Section 6, we proceed to present the main contributions.
(ii) From the SMC formulation by Chopin [9], we formalize a sequence of distributions ⟨⟨t⟩⟩_n, indexed by n, such that ⟨⟨t⟩⟩_n allows for evaluating at most n resamples. This sequence is determined by the placement of resamples in t. Our first result is Theorem 1, showing that ⟨⟨t⟩⟩_n eventually equals ⟨⟨t⟩⟩ if the number of calls to resample is upper bounded. Because of the explicit resample construct, this also implies that, for all resample placements such that the number of calls to resample is upper bounded, ⟨⟨t⟩⟩_n eventually equals ⟨⟨t⟩⟩. We further relax the finite upper bound restriction and investigate under which conditions lim_{n→∞} ⟨⟨t⟩⟩_n = ⟨⟨t⟩⟩ pointwise. In particular, we relate this equality to the dominated convergence theorem in Theorem 2, which states that the limit converges as long as there exists a function dominating the weights encountered during evaluation. This gives an alternative set of conditions under which ⟨⟨t⟩⟩_n converges to ⟨⟨t⟩⟩ (now asymptotically, in the number of resamplings n).
The contribution is fundamental, in that it provides us with a sequence of approximating distributions ⟨⟨t⟩⟩_n of ⟨⟨t⟩⟩ that can be targeted by the SMC algorithm of Section 5. As a consequence, we can extend the standard correctness results of that section to our calculus. This is our next contribution.
(iii) Given a suitable sequence of transition kernels (ways of moving between the ⟨⟨t⟩⟩_n), we can correctly approximate ⟨⟨t⟩⟩_n with the SMC algorithm from Section 5. The approximation is correct in the sense of Section 5: the law of large numbers and the unbiasedness of the likelihood estimate hold. As a consequence of (ii), SMC also correctly approximates ⟨⟨t⟩⟩, and in turn the target measure ⟦t⟧. Crucially, this also means estimating the model evidence (likelihood), which allows for compositionality [18] and comparisons between different models [37]. This contribution is summarized in Theorem 3.
Related to the above contributions, Ścibior et al. [40] formalize SMC and MCMC inference as transformations over monadic inference representations using a denotational approach (in contrast to our operational approach). They prove that their SMC transformations preserve the measure of the initial representation of the program (i.e., the target measure). Furthermore, their formalization is based on a simply-typed lambda calculus with primitive recursion, while our formalization is based on an untyped lambda calculus which naturally supports full term recursion. Our approach is also rather more elementary, requiring only basic measure theory, compared to the relatively heavy mathematics (category theory and synthetic measure theory) used by them. Regarding generalizability, their approach is both general and compositional in the different inference transformations, while we abstract over parts of the SMC algorithm. This allows us, in particular, to relate directly to standard SMC correctness results.
Section 7 concerns the instantiation of the transition kernels from (iii), and also discusses other SMC algorithms. Our last contribution is the following.
(iv) We define a sequence of sub-probability kernels k_t,n induced by a given program t, corresponding to the fundamental SMC algorithm known as the bootstrap particle filter (BPF) for our calculus. This is the most common version of SMC, and we present a concrete SMC algorithm corresponding to these kernels. We also discuss other SMC algorithms and their relation to our formalization: the resample-move [14], alive [24], and auxiliary [35] particle filters.
Importantly, by combining the above contributions, we justify that the implementation strategies of the BPFs in WebPPL, Anglican, and Birch are indeed correct. In fact, our results show that the strategy in Anglican, in which every evaluation path must resample the same number of times, is too conservative. Detailed proofs for many lemmas found in the paper are available in the appendix. These lemmas are explicitly marked with †.

A Motivating Example from Phylogenetics
In this section, we give a motivating example from phylogenetics. The example is written in a functional PPL developed as part of this paper, in order to verify and experiment with the presented concepts and results. In particular, this PPL supports SMC inference (Algorithm 2) with decoupled resamples and weights, as well as sampling from probability distributions with a sample construct.
Consider the program in Fig. 1, encoding a simplified version of a phylogenetic birth-death model (see Ronquist et al. [37] for the full version). The problem is to find the model evidence for a particular birth rate (lambda = 0.2) and death rate (mu = 0.1), given an observed phylogenetic tree. The tree represents known lineages of evolution, where the leaves are extant (surviving to the present) species. Most importantly, for illustrating the usefulness of the results in this paper, the recursive function simBranch, with its two weight applications #1 and #2, is called a random number of times for each branch in the observed tree. Thus, different SMC executions encounter differing numbers of calls to weight. When resampling is performed after every call to weight (#1, #2, and #3), it is, because of the differing numbers of resamples, not obvious that inference is correct (e.g., the equivalent program in Anglican gives a runtime error). Our results show that such a resampling strategy is indeed correct.
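To make the control-flow issue concrete, the following Python sketch mimics the structure of simBranch. This is not the paper's PPL, and all names, rates, and modeling choices here are ours: hidden speciation events occur along a branch at random times, a side lineage that survives to the present yields a zero weight (#1), and one that goes extinct yields a weight of 2 (#2). The number of weight applications therefore differs between executions.

```python
import math
import random

def goes_extinct(start_time, lam, mu):
    """Simulate a hypothetical side lineage; True if it dies out before the present."""
    t = start_time - random.expovariate(lam + mu)
    if t < 0:
        return False  # the lineage survived to the present
    if random.random() < mu / (lam + mu):
        return True   # death event
    # Speciation: both resulting lineages must go extinct.
    return goes_extinct(t, lam, mu) and goes_extinct(t, lam, mu)

def sim_branch(start, stop, lam, mu):
    """Return the list of log-weights accumulated while walking down one branch."""
    log_weights = []
    t = start
    while True:
        t -= random.expovariate(lam)  # time of the next hidden speciation
        if t < stop:
            return log_weights
        if goes_extinct(t, lam, mu):
            log_weights.append(math.log(2.0))  # weight #2: detour that later died out
        else:
            log_weights.append(-math.inf)      # weight #1: (log) 0, execution is dead
            return log_weights

random.seed(1)
# Different executions encounter differing numbers of weight calls.
counts = {len(sim_branch(10.0, 0.0, 0.2, 0.1)) for _ in range(50)}
```

Running the last two lines shows several distinct counts, which is exactly the situation in which naive "resample after every weight" strategies differ across particles.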
This strategy is far from optimal, however. For instance, only resampling at #3, which is encountered the same number of times in each execution, performs much better [26,37]. Our results show that this is correct as well, and that it gives the same asymptotic results as the naive strategy in the previous paragraph.
Another strategy is to resample only at #1 and #3, again causing executions to encounter differing numbers of resamples. Because #1 weights with (log) 0, this approach gives the same accuracy as resampling only at #3, but avoids useless computation, since a zero-weight execution can never obtain non-zero weight. Equivalently to resampling at #1, zero-weight executions can also be identified and stopped automatically at runtime. This gives a direct performance gain, and both approaches are correct by our results. We compared the three strategies above for SMC inference with 50 000 particles: resampling at #1, #2, and #3 resulted in a runtime of 15.0 seconds, at #3 alone in a runtime of 12.6 seconds, and at #1 and #3 in a runtime of 11.2 seconds. Furthermore, resampling at #1, #2, and #3 resulted in significantly worse accuracy compared to the other two strategies [26,37].
Summarizing the above, the results in this paper ensure correctness when exploring different resampling placement strategies. As just demonstrated, this is useful, because resampling strategies can have a large impact on SMC accuracy and performance.

A Calculus for Probabilistic Programming Languages
In this section, we define the calculus used throughout the paper. In Section 3.1, we begin by defining the syntax, and demonstrate how simple probability distributions can be encoded using it. In Section 3.2, we define the semantics and demonstrate it on the previously encoded probability distributions. This semantics is used in Section 4 to define the target measure for any given program. In Section 3.3, we extend the semantics of Section 3.2 to limit the number of allowed resamples in an evaluation. This extended semantics forms the foundation for formalizing SMC in Sections 6 and 7.

Syntax
The main difference between the calculus presented in this section and the standard untyped lambda calculus is the addition of real numbers, functions operating on real numbers, a sampling construct for drawing random values from real-valued probability distributions, and a construct for weighting executions. The rationale for making these additions is that, in addition to discrete probability distributions, continuous distributions are ubiquitous in most real-world models, and the weighting construct is essential for encoding inference problems. In order to define the calculus, we let X be a countable set of variable names; D ∈ D range over a countable set D of identifiers for families of probability distributions over ℝ, where the family for each identifier D has a fixed number of real parameters |D|; and g ∈ G range over a countable set G of identifiers for real-valued functions with respective arities |g|. More precisely, for each g, there is a measurable function σ_g : ℝ^|g| → ℝ. For simplicity, we often use g to denote both the identifier and its measurable function. We can now give an inductive definition of the abstract syntax, consisting of values v and terms t.
Here, c ∈ R, x ∈ X, D ∈ D, g ∈ G. We denote the set of all terms by T and the set of all values by V.
The formal semantics is given in Section 3.2. Here, we instead give an informal description of the various language constructs. Some examples of distribution identifiers are N ∈ D, the identifier for the family of normal distributions, and U ∈ D, the identifier for the family of continuous uniform distributions. The semantics of the term sample N (0, 1) is, informally, "draw a random sample from the normal distribution with mean 0 and variance 1". The weight construct is illustrated later in this section, and we discuss the resample construct in detail in Sections 3.3 and 6.
We use common syntactic sugar throughout the paper. Most importantly, we use false and true as aliases for 0 and 1, respectively, and () (unit) as another alias for 0. Furthermore, we often write g ∈ G as infix operators. For instance, 1 + 2 is a valid term, where + ∈ G. Now, let ℝ_+ denote the non-negative reals. We define f_D : ℝ^(|D|+1) → ℝ_+ as the function f_D ∈ G such that f_D(c_1, …, c_|D|, ·) is the probability density (continuous distribution) or mass function (discrete distribution) for the probability distribution corresponding to D ∈ D and (c_1, …, c_|D|).
For instance, f_N(0, 1, x) = (1/√(2π)) · e^(−(1/2)·x²) is the standard probability density of the normal distribution with mean 0 and variance 1. Lastly, we will also use let bindings, let rec bindings, sequencing using ;, and lists (all of which can be encoded in the calculus). Sequencing is required for the side-effects produced by weight (see Definition 5) and resample (see Sections 3.3 and 6).
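As a quick sanity check of this density, the following small Python snippet (ours, not part of the calculus) evaluates f_N for a given mean and variance:

```python
import math

def f_normal(mean, var, x):
    """f_N: density of the normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

f_normal(0.0, 1.0, 0.0)  # peak value 1/sqrt(2*pi) ~ 0.3989
```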
The explicit if expressions in the language deserve special mention: as is well known, they can also be encoded in the lambda calculus. The reason for explicitly including them in the calculus is to connect the lambda calculus to the continuous parts of the language. That is, we need a way of making control flow depend on the result of calculations on real numbers (e.g., if c_1 < c_2 then t_1 else t_2, where c_1 and c_2 are real numbers). An alternative to adding if expressions is to let comparison functions in G return Church Booleans, but this requires extending the codomain of primitive functions.
We now consider a set of examples. In Section 3.2 and Section 4.3, these examples will be further considered to illustrate the semantics and the target measure, respectively. Here, we first give the syntax, and informally discuss and visualize the probability distributions (i.e., the target measures, as we will see in Section 4.3) for the examples.
First, consider the program in Fig. 2a. This program encodes a slight variation on the standard geometric distribution: flip a coin with bias 0.6 (i.e., the flip will result in heads, or true, 60% of the time) until a flip results in tails (false). The probability distribution is over the number of flips before encountering tails (including the final tails flip), and is illustrated in Fig. 2b.

[Fig. 3 caption: The Beta(2, 2) distribution as a program in (a), visualized with a solid line in (c). Also, the program t_obs in (b), visualized with a dashed line in (c). The iter function in (b) simply maps the given function over the given list and returns (). That is, it calls observe true, observe false, and observe true purely for the side-effect of weighting.]
The geometric distribution is a discrete distribution, meaning that the set of possible outcomes is countable. We can also encode continuous distributions in the language. Consider first the program in Fig. 3a, directly encoding the Beta(2, 2) distribution, illustrated in Fig. 3c. This distribution naturally represents the uncertainty in the bias of a coin: in this case, the coin is most likely unbiased (bias 0.5), and biases closer to 0 and 1 are less likely. In Fig. 3b, we extend Fig. 3a by observing the sequence [true, false, true] when flipping the coin. These observations are encoded using the weight construct, which simply accumulates a product (as a side-effect) of all real-valued arguments given to it throughout the execution. First, recall the standard mass function (σ_fBern(p, true) = p; σ_fBern(p, false) = 1 − p; σ_fBern(p, x) = 0 otherwise) for the Bernoulli distribution corresponding to f_Bern ∈ G. The observations [true, false, true] are encoded using the observe function, which uses the weight construct internally to assign weights to the current value p according to the Bernoulli mass function. As an example, assume we have drawn p = 0.4. The weight for this execution is σ_fBern(0.4, true) · σ_fBern(0.4, false) · σ_fBern(0.4, true) = 0.4² · 0.6. Now consider p = 0.6 instead. For this value of p, the weight is instead 0.6² · 0.4. This explains the shift in Fig. 3c: a bias closer to 1 is more likely, since we have observed two true flips, but only one false.
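The weight computation in this example can be reproduced in a few lines of Python (a sketch of observe's effect; function names are ours):

```python
def bern_pmf(p, x):
    """sigma_fBern: the Bernoulli mass function (0 for non-Boolean arguments)."""
    if x is True:
        return p
    if x is False:
        return 1 - p
    return 0

def weight_of(p, observations):
    """Accumulated weight: the product of the pmf over all observations."""
    w = 1.0
    for x in observations:
        w *= bern_pmf(p, x)
    return w

obs = [True, False, True]
weight_of(0.4, obs)  # 0.4**2 * 0.6
weight_of(0.6, obs)  # 0.6**2 * 0.4
```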

Semantics
In this section, we define the semantics of our calculus. The definition is split into two parts: a deterministic semantics and a stochastic semantics. We use evaluation contexts to assist in defining our semantics. The evaluation contexts E induce a call-by-value semantics, and are defined as follows.
We denote the set of all evaluation contexts by E.
With the evaluation contexts in place, we proceed to define the deterministic semantics through a small-step relation → Det .
The rules are straightforward, and will not be discussed in further detail here. We use the standard notation for transitive and reflexive closures (e.g., →*_Det) and transitive closures (e.g., →+_Det) of relations throughout the paper. Following the tradition of Kozen [23] and Park et al. [34], sampling in our stochastic semantics works by consuming randomness from a tape of real numbers. We use inverse transform sampling, and therefore the tape consists of numbers from the interval [0, 1]. In order to use inverse transform sampling, we require that for each D ∈ D, there exists a measurable function F⁻¹_D such that F⁻¹_D(c_1, …, c_|D|, ·) is the inverse cumulative distribution function for the probability distribution corresponding to D and (c_1, …, c_|D|). We call the tape of real numbers a trace, and make the following definition.
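Inverse transform sampling itself is easy to illustrate in Python. The following sketch is ours; the Bernoulli convention shown (samples up to the bias yield true) matches the geometric example discussed in Section 3.2.

```python
import math

def inv_cdf_exponential(rate, u):
    """F^-1 for Exp(rate): maps u in [0, 1) to -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate

def inv_cdf_bernoulli(p, u):
    """One convention for Bern(p): u in [0, p] maps to true, (p, 1] to false."""
    return u <= p

# Consuming randomness from a trace, as in the stochastic semantics.
trace = [0.5, 0.3, 0.7]
samples = [inv_cdf_bernoulli(0.6, u) for u in trace]  # [True, True, False]
```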
We use the notation (c_1, c_2, …, c_n)_S to indicate the trace consisting of the n numbers c_1, c_2, …, c_n. Given a trace s, we denote by |s| the length of the trace. We also denote the concatenation of two traces s and s′ by s ∗ s′. Lastly, we let c :: s denote the extension of the trace s with the real number c as head.
With the traces and the F⁻¹_D defined, we can proceed to the stochastic semantics → over T × ℝ_+ × S.
The rule (Det) encapsulates the →_Det relation, and states that terms can move deterministically only to terms of the form t_stop. Note that terms of the form t_stop are found on the left-hand side in the other rules. The (Sample) rule describes how random values are drawn from the inverse cumulative distribution functions and the trace when terms of the form sample_D(c_1, …, c_|D|) are encountered. Similarly, the (Weight) rule determines how the weight is updated when weight(c) terms are encountered. Finally, the resample construct always evaluates to unit, and is therefore meaningless from the perspective of this semantics. We elaborate on the role of the resample construct in Section 3.3.
With the semantics in place, we define two important functions over S for a given term. In the below definition, assume that a fixed term t is given.
Intuitively, r_t is the function returning the result value after having repeatedly applied → on the initial trace s. Analogously, f_t gives the density or weight of a particular s. Note that, if (t, 1, s) gets stuck or diverges, the result value is (), and the weight is 0. In other words, we disregard such traces entirely, since we are in practice only interested in probability distributions over values. Furthermore, note that if the final trace s′ ≠ ()_S (i.e., the trace is not fully consumed), the value and weight are again () and 0, respectively. The motivation for this is discussed in Section 4.3.
To illustrate r_t and f_t, first consider the geometric program t_geo in Fig. 2a, and a trace s = (0.5, 0.3, 0.7)_S. Let E = if [·] then 1 + geometric () else 1. It is easy to check that t_geo →+_Det E[sample_Bern(0.6)]. Now, since Bern(0.6) is the probability distribution for flipping a coin with bias 0.6, samples c ∈ [0, 0.6] produce true, and samples c ∈ (0.6, 1] produce false. Evaluation therefore consumes 0.5 (true), 0.3 (true), and 0.7 (false) from s, and it follows that r_tgeo(s) = 3, and that f_tgeo(s) = 1. Now, instead consider the trace s_2 = (0.5, 0.7, 0.3)_S. Evaluation consumes only 0.5 (true) and 0.7 (false). The term is now stuck, and because we have not used up the entire trace, we have r_tgeo(s_2) = (), f_tgeo(s_2) = 0. The opposite of the above can also occur: given the trace s_3 = (0.5, 0.3)_S, it holds that r_tgeo(s_3) = () and f_tgeo(s_3) = 0, since the provided trace is not long enough. In general, we have that r_tgeo(s) = n and f_tgeo(s) = 1 whenever s ∈ [0, 0.6]^(n−1) × (0.6, 1]. Otherwise, r_tgeo(s) = () and f_tgeo(s) = 0. We will apply this conclusion when reconsidering this example in Section 4.3.
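The behaviour of r_tgeo and f_tgeo can be mimicked with a small Python evaluator specialized to the geometric program (a sketch of ours, not a general interpreter for the calculus):

```python
def eval_geometric(trace, bias=0.6):
    """r_t and f_t for the geometric program: returns (result, weight).
    A trace that is too short, or not fully consumed, yields ((), 0)."""
    flips = 0
    remaining = list(trace)
    while True:
        if not remaining:
            return ((), 0.0)     # stuck: the trace is too short
        u = remaining.pop(0)
        flips += 1
        if u > bias:             # tails: stop flipping
            break
    if remaining:
        return ((), 0.0)         # non-empty final trace: disregarded
    return (flips, 1.0)

eval_geometric([0.5, 0.3, 0.7])  # (3, 1.0)
eval_geometric([0.5, 0.7, 0.3])  # ((), 0.0): a sample is left over
eval_geometric([0.5, 0.3])       # ((), 0.0): trace too short
```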
To illustrate the weight construct, consider the program t_obs in Fig. 3b, and the singleton trace (0.8)_S. This program will, in total, evaluate one call to sample, and three calls to weight. Now, let h(c) = F⁻¹_Beta(2, 2, c) and recall the function σ_fBern from Section 3.1. Using the notation φ(c, x) = σ_fBern(h(c), x), evaluation first draws p = h(0.8) from the trace, and then weights with φ(0.8, true), φ(0.8, false), and φ(0.8, true). That is, r_tobs((0.8)_S) = h(0.8) and f_tobs((0.8)_S) = h(0.8)² · (1 − h(0.8)). For arbitrary c, we see that r_tobs((c)_S) = h(c) and f_tobs((c)_S) = h(c)² · (1 − h(c)). For any other trace s with |s| ≠ 1, r_tobs(s) = () and f_tobs(s) = 0. We will apply this result when reconsidering this example in Section 4.3.
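Since F_Beta(2, 2, ·) has the closed form 3x² − 2x³ on [0, 1], h and f_tobs can be computed numerically. The bisection-based inverse below is our implementation choice, not part of the calculus:

```python
def beta22_cdf(x):
    """CDF of Beta(2, 2): the integral of 6x(1-x), i.e. 3x^2 - 2x^3 on [0, 1]."""
    return 3 * x**2 - 2 * x**3

def h(c, tol=1e-12):
    """Inverse CDF of Beta(2, 2) by bisection (the CDF is strictly increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta22_cdf(mid) < c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def f_tobs(trace):
    """Density of t_obs on a singleton trace (c)_S: h(c)^2 * (1 - h(c))."""
    if len(trace) != 1:
        return 0.0
    p = h(trace[0])
    return p * p * (1 - p)
```

For example, h(0.5) = 0.5 (the Beta(2, 2) CDF passes through (0.5, 0.5)), so f_tobs((0.5)_S) = 0.25 · 0.5 = 0.125.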

Resampling Semantics
In order to connect SMC in PPLs to the classical formalization of SMC presented in Section 5, and thus enable the theoretical treatments in Sections 6 and 7, we need a relation in which terms "stop" after a certain number n of encountered resample terms. In this section, we define such a relation, denoted by ↪. Its definition is given below.
This relation is → extended with a natural number n, indicating how many further resample terms may be evaluated. We implement this limitation by replacing the rule (Resample) of → with the rule (Resample-Fin) of ↪, which decrements n each time it is applied, causing terms to get stuck at the (n + 1)th resample encountered. Now, assume that a fixed term t is given. We define r_t,n and f_t,n similarly to r_t and f_t.
As for r_t and f_t, these functions return the result value and weight, respectively, after having repeatedly applied ↪ on the initial trace s. There is one difference compared to →: besides values, we now also allow stopping with non-zero weight at terms of the form E[resample].
To illustrate ↪, r_t,n(s), and f_t,n(s), consider the term t_seq defined in (12). This term encodes a model in which an object moves along a real-valued axis in discrete time steps, but where the actual positions (x_1, x_2, …) can only be observed through a noisy sensor (c_1, c_2, …). The inference problem consists of finding the probability distribution for the very last position, x_t, given all collected observations (c_1, c_2, …, c_t). Most importantly, note the position of resample in (12): it is evaluated just after evaluating weight in every folding step. Because of this, for n < t and all traces s such that f_tseq,n(s) > 0, evaluation stops at a term of the form E[resample], where the remaining observations to process are [c_{n+1}, c_{n+2}, …, c_{t−1}, c_t] and x_n is the value sampled in sim at the nth folding step. That is, we can now "stop" evaluation at resamples. We will revisit this example in Section 6.
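The essence of t_seq can be sketched in Python, with a generator yield standing in for each resample point. The model constants and names below are ours; the calculus itself uses weight and resample rather than yields.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def sim(observations, x=0.0, log_w=0.0):
    """One execution of a t_seq-like model: at each folding step, sample the
    next position, weight by the noisy observation, then hit a resample point.
    Yields (position, accumulated log-weight) at every resample."""
    for c in observations:
        x = x + random.gauss(0.0, 1.0)            # move along the real-valued axis
        log_w += math.log(normal_pdf(c, x, 1.0))  # observe through a noisy sensor
        yield (x, log_w)                          # resample: execution stops here

random.seed(0)
# After n < t steps, the execution is suspended at a resample, with the
# remaining observations [c_{n+1}, ..., c_t] still to process.
states = list(sim([0.1, -0.2, 0.3]))
```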

The Target Measure of a Program
In this section, we define the target measure induced by any given program in our calculus. We assume basic familiarity with measure theory, Lebesgue integration, and Borel spaces. McDonald and Weiss [28] provide a pedagogical introduction to the subject. We also summarize the definitions and lemmas used in this article in Appendix B.1. In order to define the target measure of a program as a Lebesgue integral (Section 4.3), we require a measure space on traces (Section 4.1), and a measurable space on terms (Section 4.2). For illustration, we derive the target measures for two of the example programs from Section 3 in Section 4.3. The concepts presented in this section are quite standard, and experienced readers might want to quickly skim it, or even skip it entirely.

A Measure Space over Traces
We use a standard measure space over traces of samples [27]. First, we define a measurable space over traces. We denote the Borel σ-algebra on ℝⁿ by Bⁿ, and the Borel σ-algebra on [0, 1]ⁿ by B̄ⁿ. The most common measure on Bⁿ is the n-dimensional Lebesgue measure, denoted λⁿ. For n = 0, we let λ⁰ = δ_{()_S}, where δ denotes the standard Dirac measure. By combining the Lebesgue measures for each n, we construct a measure µ_S over (S, S).
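Concretely, assuming S is the set of all finite traces (the disjoint union of the [0, 1]ⁿ) and S its natural σ-algebra, the combined measure can be written as follows (our rendering of the construction):

```latex
\mu_{\mathbb{S}}(B) \;=\; \sum_{n=0}^{\infty} \lambda^{n}\big(B \cap [0,1]^{n}\big),
\qquad B \in \mathcal{S}.
```

That is, µ_S measures a set of traces by summing, over all lengths n, the n-dimensional Lebesgue measure of its length-n part.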

A Measurable Space over Terms
In order to show that r_t is measurable, we need a measurable space over terms. We let (T, 𝒯) denote the measurable space that we seek to construct, and follow the approach in Staton et al. [42] and Vákár et al. [46]. Because our calculus includes the reals, we would like to at least have B ⊂ 𝒯. Furthermore, we would also like to extend the Borel measurable sets Bⁿ to terms with n reals as subterms: for instance, we want sets of terms that differ only in n real-valued subterms, with those subterms ranging over a set in Bⁿ, to be measurable. This leads us to consider terms in a language in which constants (i.e., reals) are replaced with placeholders [·]. Most importantly, it is easy to verify that the set T_p of such terms is countable. Next, we make the following definitions.
Definition 13. For n ∈ ℕ_0, we denote by Tⁿ_p ⊂ T_p the set of all terms with exactly n placeholders.
Definition 14. We let tⁿ_p range over the elements of Tⁿ_p. Each tⁿ_p can be regarded as a function tⁿ_p : ℝⁿ → tⁿ_p(ℝⁿ) which replaces the n placeholders with the n reals given as arguments.
From the above definitions, we construct the required σ-algebra T .

The Target Measure
We are now in a position to define the target measure. We will first give the formal definitions, and then illustrate the definitions with examples. The definitions rely on the following result.
We can now proceed to define the measure ⟨⟨t⟩⟩ over S induced by a term t using Lebesgue integration: ⟨⟨t⟩⟩(S) = ∫_S f_t dµ_S for S ∈ S.
Importantly, by Lemma 15 and Lemma 2, it holds that the density f_t is unique µ_S-a.e. if ⟨⟨t⟩⟩ is σ-finite.
Using Definition 17 and the measurability of r_t, we can also define a corresponding pushforward measure ⟦t⟧ over (T, 𝒯): ⟦t⟧(T) = ⟨⟨t⟩⟩(r_t⁻¹(T)) for T ∈ 𝒯.
The measure ⟦t⟧ is our target measure, i.e., the measure encoded by our program that we are interested in.
Let us now consider the target measures for our earlier examples. Consider first the program in Fig. 2a. Recall that the density f_tgeo of a given trace s is 1 if s ∈ [0, 0.6]^(n−1) × (0.6, 1] for some n, and 0 otherwise. Since ⟦t_geo⟧ is a distribution over ℕ, it suffices to compute, for each n,
⟦t_geo⟧({n}) = λⁿ([0, 0.6]^(n−1) × (0.6, 1]) = 0.6^(n−1) · 0.4.
As expected, by taking ⟦t_geo⟧({1}), ⟦t_geo⟧({2}), ⟦t_geo⟧({3}), …, we exactly recover the graph from Fig. 2b.

Now consider the continuous distribution given by the program t_obs, and recall that r_tobs((c)_S) = h(c) and f_tobs((c)_S) = h(c)² · (1 − h(c)), where h(c) = F⁻¹_Beta(2, 2, c). The Beta distributions have strictly increasing cumulative distribution functions F_Beta(a, b, ·) for all a and b. It follows that h is the true inverse of this function, and is therefore bijective. Because of this,
⟦t_obs⟧(B) = ∫_{h⁻¹(B)} h(c)² · (1 − h(c)) dλ(c) = ∫_B x² · (1 − x) · (h⁻¹)′(x) dλ(x) = ∫_B x² · (1 − x) · f_Beta(2, 2, x) dλ(x)
for B ∈ B. In the second equality, we have used integration by substitution, and in the third, the fact that (h⁻¹)′ is the derivative of the cumulative distribution function F_Beta(2, 2, ·), i.e., the density f_Beta(2, 2, ·).

We should in some way ensure the target measure is finite (i.e., can be normalized to a probability measure), since we are in the end most often only interested in probability measures. Unfortunately, as observed by Staton [41], there is no known useful syntactic restriction that enforces finite measures in PPLs while still admitting weights > 1. We will discuss this further in Section 6.2 in relation to SMC in our calculus.
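The geometric target measure can also be checked numerically. This quick sketch (names ours) computes the point masses and confirms that, for this particular program, they already sum to 1:

```python
def target_geometric(n, bias=0.6):
    """mu({n}): the lambda^n-measure of [0, 0.6]^(n-1) x (0.6, 1]."""
    return bias ** (n - 1) * (1 - bias)

masses = [target_geometric(n) for n in range(1, 60)]
# masses starts 0.4, 0.24, 0.144, ... and sums to (approximately) 1:
# this target measure happens to already be a probability measure.
```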
Lastly, from Section 3.2, recall that we disallow non-empty final traces in f_t and r_t. We see here why this is needed: if allowed, for every trace s with f_t(s) > 0, all extensions s ∗ s′ would have the same density f_t(s ∗ s′) = f_t(s) > 0. From this, it is easy to check that if ⟦t⟧ ≠ 0 (the zero measure), then ⟦t⟧(T) = ∞ (i.e., the measure is not finite). In fact, for any T ∈ 𝒯, ⟦t⟧(T) > 0 =⇒ ⟦t⟧(T) = ∞. Clearly, this is not a useful target measure.

Formal SMC
In this section, we give a generic formalization of SMC based on Chopin [9]. We assume a basic understanding of SMC. For a concrete SMC example, see Appendix A. For a complete introduction to SMC, we recommend Naesseth et al.
First, in Section 5.1, we introduce transition kernels, a fundamental concept used in the remaining sections of the paper. Second, in Section 5.2, we describe Chopin's generic formalization of SMC as an algorithm for approximating a sequence of distributions based on a sequence of approximating transition kernels. Lastly, in Section 5.3, we give standard correctness results for the algorithm.

Preliminaries: Transition Kernels
Intuitively, transition kernels describe how elements move between measurable spaces. For a more comprehensive introduction, see Vákár and Ong [47].
Definition 19. Let (A, 𝒜) and (A′, 𝒜′) be measurable spaces. A transition kernel between these spaces is a function k : A × 𝒜′ → ℝ_+ such that k(a, ·) is a measure on (A′, 𝒜′) for each a ∈ A, and such that k(·, B) is measurable for each B ∈ 𝒜′. Additionally, we can classify transition kernels according to the below definition.
Definition 20. A transition kernel k is a sub-probability kernel if k(a, ·) is a sub-probability measure for all a ∈ A; a probability kernel if k(a, ·) is a probability measure for all a ∈ A; and a finite kernel if sup_{a∈A} k(a, A′) < ∞.

Algorithm
The starting point in Chopin's formulation of SMC is a sequence of probability measures π_n (over respective measurable spaces (A_n, 𝒜_n), with n ∈ ℕ_0) that are difficult or impossible to draw samples from directly.
Algorithm 1: A generic formulation of sequential Monte Carlo inference based on Chopin [9]. In each step, we let 1 ≤ j ≤ J, where J is the number of samples.

[Algorithm body: 1. Initialization: draw the initial samples from q_0. 2. Correction: weight the samples. 3. Selection: resample; the new empirical distribution is unweighted, is given by {â_n^j}_{j=1}^J, and also approximates π_n. 4. Mutation: increment n and propagate the samples through k_n.]
The SMC approach is to generate samples from the π_n by first sampling from a sequence of proposal measures q_n, and then correcting for the discrepancy between these measures by weighting the proposal samples. The proposal distributions are generated from an initial measure q_0 and a sequence of transition kernels k_n: q_n(B) = ∫_{A_{n−1}} k_n(a_{n−1}, B) π_{n−1}(da_{n−1}).

In order to approximate π_n by weighting samples from q_n, we need some way of obtaining the appropriate weights. Hence, we require each measurable space (A_n, 𝒜_n) to have a default σ-finite measure µ_An, and the measures π_n and q_n to have densities f_πn and f_qn with respect to this default measure. Furthermore, we require that the functions f_πn and f_qn can be efficiently computed pointwise, up to an unknown constant factor per function and value of n. More precisely, we can efficiently compute the densities f̃_πn = Z_πn · f_πn and f̃_qn = Z_qn · f_qn, corresponding to the unnormalized measures π̃_n = Z_πn · π_n and q̃_n = Z_qn · q_n. Here, Z_πn = π̃_n(A_n) ∈ ℝ_+ and Z_qn = q̃_n(A_n) ∈ ℝ_+ denote the unknown normalizing constants for the distributions π_n and q_n.

Algorithm 1 presents a generic version of SMC [9] for approximating the π_n. We make the notion of approximation used in the algorithm precise in Section 5.3. Note that in the correction step, the unnormalized pointwise evaluations of f̃_πn and f̃_qn are used to calculate the weights. In the algorithm description, we also use some new terminology. First, an empirical distribution is the discrete probability measure formed by a finite set of possibly weighted samples {(a_n^j, w_n^j)}_{j=1}^J, where a_n^j ∈ A_n and w_n^j ∈ ℝ_+. Second, when resampling an empirical distribution, we sample J times from it (with replacement), with each sample having its normalized weight as probability of being sampled. More specifically, this is known as multinomial resampling. Other resampling schemes also exist [11], and are often used in practice to reduce variance.
After resampling, the set of samples forms a new empirical distribution with J unweighted (all w_n^j = 1) samples.

An important feature of SMC compared to other inference algorithms is that SMC produces, as a by-product of inference, unbiased estimates Ẑ_{π_n} of the normalizing constants Z_{π_n}. Stated differently, this means that Algorithm 1 not only approximates the π_n, but also the unnormalized versions π̄_n. From the weights w_n^j in Algorithm 1, the estimates are given by

Ẑ_{π_n} = ∏_{m=0}^{n} (1/J) ∑_{j=1}^{J} w_m^j (19)

for each π_n. We give the unbiasedness result for Ẑ_{π_n} in Lemma 5 (item 2) below. The normalizing constant is often used to compare the accuracy of different probabilistic models; as such, it is also known as the marginal likelihood, or model evidence. For an example application, see Ronquist et al. [37].
To conclude this section, note that many sequences of probability kernels k_n can be used to approximate the same sequence of measures π_n. The only requirement on the k_n is that f_{π_n}(a_n) > 0 ⟹ f_{q_n}(a_n) > 0 must hold for all n ∈ ℕ_0 and a_n ∈ A_n (i.e., the proposals must "cover" the π_n) [12]. We call such a sequence of kernels k_n valid. Different choices of k_n induce different proposals q_n, and hence capture different SMC algorithms. The most common example is the BPF, which directly uses the kernels from the model as the sequence of kernels in the SMC algorithm (hence the "bootstrap"). In Section 7.1, we formalize the bootstrap kernels in the context of our calculus. However, we may also want to choose other probability kernels that satisfy the covering condition, since the choice of kernels can have major implications for the rate of convergence [35].
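To make the steps of Algorithm 1 concrete, the following Python sketch (our illustration, not part of the paper's formal development) instantiates initialization, correction, selection, and mutation for a toy linear-Gaussian state-space model. Choosing the model's transition kernel as the proposal kernel, bootstrap-style, means the correction weight f̄_{π_n}/f̄_{q_n} reduces to the likelihood of the nth observation; the running product of average weights is the evidence estimate Ẑ_{π_n}. All names and parameters are our own assumptions.

```python
import math
import random

random.seed(0)

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def resample(particles, weights):
    # Selection: multinomial resampling, J draws with replacement,
    # each particle chosen with probability proportional to its weight.
    return random.choices(particles, weights=weights, k=len(particles))

def smc(observations, J=300):
    # Toy model: X_0 ~ N(0,1), X_n | X_{n-1} ~ N(X_{n-1}, 1),
    # Y_n | X_n ~ N(X_n, 1). Proposal kernel = transition kernel (bootstrap).
    particles = [random.gauss(0.0, 1.0) for _ in range(J)]  # initialization
    z_hat = 1.0
    for n, y in enumerate(observations):
        if n > 0:
            # Mutation: propagate each particle through the kernel k_n.
            particles = [random.gauss(x, 1.0) for x in particles]
        # Correction: weight by the likelihood (the ratio of unnormalized
        # target and proposal densities for this choice of proposal).
        weights = [normal_pdf(y, x, 1.0) for x in particles]
        z_hat *= sum(weights) / J  # evidence estimate, one factor per step
        particles = resample(particles, weights)
    return particles, z_hat

particles, z_hat = smc([0.5, 0.6, 0.4])
```

Each loop iteration performs one correction-selection-mutation cycle of Algorithm 1; after the final selection, the particles form an unweighted empirical approximation of the last target.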

Correctness
We begin by defining the notion of approximation used in Algorithm 1.
Definition 21 (Based on Chopin [9, p. 2387]). Let (A, 𝒜) denote a measurable space, {{(a_{j,J}, w_{j,J})}_{j=1}^J}_{J∈ℕ} a triangular array of random variables in A × ℝ, and π : 𝒜 → ℝ*_+ a probability measure. We say that {{(a_{j,J}, w_{j,J})}_{j=1}^J}_{J∈ℕ} approximates π if

lim_{J→∞} (∑_{j=1}^J w_{j,J} · ϕ(a_{j,J})) / (∑_{j=1}^J w_{j,J}) = E_π(ϕ)

almost surely for all measurable functions ϕ : (A, 𝒜) → (ℝ, ℬ) such that E_π(ϕ), the expected value of the function ϕ over the distribution π, exists.
First, note that the triangular array can also be viewed as a sequence of random empirical distributions (indexed by J). Precisely such sequences are formed by the random empirical distributions in Algorithm 1 when indexed by the increasing number of samples J. For simplicity, we often let context determine the sequence, and directly state that a random empirical distribution approximates some distribution (as in Algorithm 1).
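Definition 21 is the self-normalized importance-sampling estimate taken to its limit. The following small Python experiment (ours; the densities and parameters are illustrative assumptions) checks it numerically for one fixed J: weighting samples from a proposal q = N(0, 4) by an unnormalized target density of π = N(1, 1) recovers E_π(ϕ) for ϕ(x) = x.

```python
import math
import random

random.seed(1)

# Unnormalized target density of pi = N(1, 1): the constant factor is
# deliberately dropped, since Definition 21 only needs weights that are
# correct up to a constant.
def f_pi_bar(x):
    return math.exp(-(x - 1.0) ** 2 / 2.0)

# Normalized density of the proposal q = N(0, 4) (standard deviation 2).
def f_q(x):
    return math.exp(-x * x / 8.0) / math.sqrt(8.0 * math.pi)

J = 20000
samples = [random.gauss(0.0, 2.0) for _ in range(J)]
weights = [f_pi_bar(x) / f_q(x) for x in samples]

# Self-normalized estimate of E_pi(phi) with phi(x) = x; exact value is 1.
estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
```

The unknown normalizing constant cancels in the ratio, which is exactly why Algorithm 1 can work with the unnormalized densities.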
Two classical results from the SMC literature are given in the following lemma: a law of large numbers and the unbiasedness of the normalizing constant estimate. We take these results as the definition of SMC correctness used in this paper.
Lemma 5. Let π_n, n ∈ ℕ_0, be a sequence of probability measures over measurable spaces (A_n, 𝒜_n) with default σ-finite measures μ_{A_n}, such that the π_n have densities f_{π_n} with respect to these default measures. Furthermore, let q_0 be a probability measure with density f_{q_0} with respect to μ_{A_0}, and k_n a sequence of probability kernels inducing a sequence of proposal probability measures q_n, given by (18), over (A_n, 𝒜_n) with densities f_{q_n} with respect to μ_{A_n}. Also, assume the k_n are valid, i.e., that f_{π_n}(a_n) > 0 ⟹ f_{q_n}(a_n) > 0 holds for all n ∈ ℕ_0 and a_n ∈ A_n. Then

1. the empirical distributions {(a_n^j, w_n^j)}_{j=1}^J produced by Algorithm 1 approximate π_n for each n ∈ ℕ_0; and
2. E(Ẑ_{π_n}) = Z_{π_n} for each n ∈ ℕ_0, where the expectation is taken with respect to the weights produced when running Algorithm 1, and Ẑ_{π_n} is given by (19).

Chopin [9, Theorem 1] gives another SMC convergence result in the form of a central limit theorem. This result, however, requires further restrictions on the weights w_n^j in Algorithm 1. It is not clear when these restrictions are fulfilled when applying SMC to a program in our calculus. This is an interesting topic for future work.

Formal SMC for Probabilistic Programming Languages
This section contains our main contribution: how to interpret the operational semantics of our calculus as the unnormalized sequence of measures π̄_n in Chopin's formalization (Section 6.1), as well as sufficient conditions for this sequence of approximating measures to converge to the target measure ⟦t⟧ and for the normalizing constant estimate to be correct (Section 6.2).

An important insight during this work was that it is more convenient to find a sequence of measures ⟨t⟩_n approximating the trace measure ⟨t⟩ than to find a sequence of measures ⟦t⟧_n directly approximating the target measure ⟦t⟧. In Section 6.1, we define the ⟨t⟩_n similarly to ⟨t⟩, except that at most n evaluations of resample are allowed. This upper bound on the number of resamples is formalized through the relation → from Section 3.3.

In Section 6.2, we obtain two different conditions for the convergence of the sequence ⟨t⟩_n to ⟨t⟩: Theorem 1 states that for programs with an upper bound N on the number of resamples they evaluate, ⟨t⟩_N = ⟨t⟩. This precondition holds in many practical settings, for instance when each resampling is connected to a datum collected before inference starts. Theorem 2 states another convergence result for programs without such an upper bound but with dominated weights. Because of these convergence results, we can often approximate ⟨t⟩ by approximating the ⟨t⟩_n with Algorithm 1. When this is the case, Lemma 5 implies that Algorithm 1, either after a sufficient number of time steps or asymptotically, correctly approximates ⟨t⟩ and the normalizing constant Z_{⟨t⟩}. This is the content of Theorem 3. We conclude Section 6.2 by discussing resample placements and their relation to Theorem 3, as well as the practical implications of Theorem 3.

The Sequence of Measures Generated by a Program
We now apply the formalization from Section 4.3 again, but with f_{t,n} and r_{t,n} (from Section 3.3) replacing f_t and r_t. Intuitively, this yields a sequence of measures ⟦t⟧_n indexed by n, which are similar to the target measure ⟦t⟧, but only allow evaluating at most n resamples. To illustrate this idea, consider again the program t_seq in (12). Here, ⟦t_seq⟧_0 is a distribution over terms of the form E_seq^1[resample; x_1], ⟦t_seq⟧_1 a distribution over terms of the form E_seq^2[resample; x_2], and so forth. For n ≥ t, ⟦t_seq⟧_n = ⟦t_seq⟧, because t is clearly an upper bound on the number of resamples evaluated in t_seq.
While the measures ⟦t⟧_n are useful for giving intuition, it is easier from a technical perspective to define and work with ⟨t⟩_n, the sequence of measures over traces in which at most n resamples are allowed. First, we need the following result, analogous to Lemma 4. Analogously to Definition 17, by Lemma 15 and Lemma 2, the density f_{t,n} is unique μ_S-ae if ⟨t⟩_n is σ-finite. We can now also clarify how the resample construct relates to the resampling in the selection step of Algorithm 1. If we approximate the sequence ⟨t⟩_n with Algorithm 1, at the nth selection step of the algorithm, all traces s with non-zero weight must have r_{t,n}(s) = v or r_{t,n}(s) = E[resample], by Definitions 8 and 9. That is, having a q_n in Algorithm 1 that proposes traces other than these is wasteful, since they will in any case have weight zero. We illustrate this further when considering the bootstrap kernel in Section 7.1.

Correctness
We begin with a convergence result for the case when the number of calls to resample in a program is upper bounded.

Theorem 1. If there is an N ∈ ℕ_0 such that t evaluates at most N resamples, then ⟨t⟩_n = ⟨t⟩ for all n ≥ N.

This follows directly since f_{t,n} not only converges to f_t, but is also equal to f_t for all n > N. However, even if the number of calls to resample in t is upper bounded, there is still one concern with using the ⟨t⟩_n as π̄_n in Algorithm 1: there is no guarantee that the measures ⟨t⟩_n can be normalized to probability measures and have unique densities (i.e., that they are finite). This is a requirement for the correctness results in Lemma 5. Unfortunately, recall from Section 4.3 that there is no known useful syntactic restriction that enforces finiteness of the target measure. This is clearly true for the measures ⟨t⟩_n as well, and as such, we need to assume that the ⟨t⟩_n are finite; otherwise, it is not clear that Algorithm 1 produces the correct result, since the conditions in Lemma 5 are not fulfilled. Fortunately, this assumption is valid for most, if not all, models of practical interest. Nevertheless, investigating whether the restriction to probability measures in Lemma 5 can be lifted to some extent is an interesting topic for future work.

Note that, even if the target measure is finite, this does not necessarily imply that all measures ⟨t⟩_n are finite. For example, consider the program

let rec inflate = if sample_Bern(0.5) then weight 2; 1 + inflate () else 0 in
let deflate n = weight (1/2^n) in
let n = inflate () in
resample; deflate n; n,

adapted from [6]. Clearly, ⟨t⟩_0 is not finite (in fact, it is not even σ-finite), while ⟨t⟩_1 = ⟨t⟩ is.
Although of limited practical interest, programs with an unbounded number of calls to resample are of interest from a semantic perspective. If we have lim_{n→∞} ⟨t⟩_n = ⟨t⟩ pointwise, then any SMC algorithm approximating the sequence ⟨t⟩_n also approximates ⟨t⟩, at least asymptotically in the number of steps n. First, consider the variation t_geo-res of the geometric program t_geo, where the only difference is the added resample (marked with a box). Here ⟦t_geo⟧ = ⟦t_geo-res⟧ since, in general, ⟦t⟧ is unaffected by placing resamples in t. Note that t_geo-res has no upper bound on the number of calls to resample, and therefore Theorem 1 is not applicable. One can show, however, that lim_{n→∞} ⟨t_geo-res⟩_n = ⟨t_geo-res⟩ pointwise. The question is then whether lim_{n→∞} ⟨t⟩_n = ⟨t⟩ holds pointwise in general. The answer is no, as we demonstrate next. For lim_{n→∞} ⟨t⟩_n = ⟨t⟩ to hold pointwise, it must hold that lim_{n→∞} f_{t,n} = f_t pointwise μ_S-ae. Unfortunately, this does not hold for all programs. Consider the program t_loop defined by let rec loop _ = resample; loop () in loop (). Here, f_{t_loop} = 0, since the program diverges deterministically, but f_{t_loop,n}(()_S) = 1 for all n. Because μ_S({()_S}) = 0, we do not have lim_{n→∞} f_{t_loop,n} = f_{t_loop} pointwise μ_S-ae.
Even if we have lim_{n→∞} f_{t,n} = f_t pointwise μ_S-ae, we might not have lim_{n→∞} ⟨t⟩_n = ⟨t⟩ pointwise. Consider, for instance, the program t_unit given by

let s = sample_U(0, 1) in
let rec foo n = if s ≤ 1/n then resample; weight 2; foo (2 · n) else weight 0 in
foo 1

(22)
We have f_{t_unit} = 0 and f_{t_unit,n} = 2^n · 1_{[0,1/2^n]} for n > 0. Also, lim_{n→∞} f_{t_unit,n} = f_{t_unit} pointwise. However, ⟨t_unit⟩_n(S) = ∫_S f_{t_unit,n}(s) dμ_S(s) = 2^n · 2^{−n} = 1 for every n > 0, while ⟨t_unit⟩(S) = 0, so lim_{n→∞} ⟨t_unit⟩_n ≠ ⟨t_unit⟩ pointwise. This shows that the limit may fail to hold even for programs that terminate almost surely, as is the case for the program t_unit in (22). In fact, this program is positively almost surely terminating [7], since the expected number of recursive calls to foo is 1.
Guided by the previous example, we now state the dominated convergence theorem, a fundamental result in measure theory, in the context of SMC inference in our calculus.
Theorem 2. Assume that lim_{n→∞} f_{t,n} = f_t holds pointwise μ_S-ae. Furthermore, assume that there exists a measurable function g : (S, 𝒮) → (ℝ_+, ℬ_+) such that f_{t,n} ≤ g μ_S-ae for all n, and ∫_S g(s) dμ_S(s) < ∞. Then lim_{n→∞} ⟨t⟩_n = ⟨t⟩ pointwise.
For a proof, see McDonald and Weiss [28, Theorem 4.9]. It is easy to check that for our example in (22), there is no dominating and integrable g as required in Theorem 2; we have already seen that the conclusion of the theorem fails to hold there. As a corollary, if there exists a dominating and integrable g, the measures ⟨t⟩_n are always finite.

Corollary 1. If there exists a measurable function g : (S, 𝒮) → (ℝ_+, ℬ_+) such that f_{t,n} ≤ g μ_S-ae for all n, and ∫_S g(s) dμ_S(s) < ∞, then ⟨t⟩_n is finite for each n ∈ ℕ_0.

This holds because ⟨t⟩_n(S) = ∫_S f_{t,n}(s) dμ_S(s) ≤ ∫_S g(s) dμ_S(s) < ∞. Hence, we do not need to assume the finiteness of the ⟨t⟩_n in order for Algorithm 1 to be applicable, as was the case in the setting of Theorem 1.
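For completeness, here is the short calculation (ours, spelled out from the densities above) showing why no dominating, integrable g exists for t_unit: any g with f_{t_unit,n} ≤ g μ_S-ae for all n must satisfy g(s) ≥ 2^n on (2^{−(n+1)}, 2^{−n}], since that interval lies inside the support [0, 2^{−n}] of f_{t_unit,n}. Hence

```latex
\int_{\mathsf{S}} g \, \mathrm{d}\mu_{\mathsf{S}}
  \;\ge\; \sum_{n=0}^{\infty} 2^{n} \left( 2^{-n} - 2^{-(n+1)} \right)
  \;=\; \sum_{n=0}^{\infty} \frac{1}{2}
  \;=\; \infty ,
```

so every dominating g fails to be integrable, consistent with the failure of the limit observed for (22).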
In Theorem 3, we summarize and combine the above results with Lemma 5.
Theorem 3. Let t be a term, and apply Algorithm 1 with ⟨t⟩_n as π̄_n and with arbitrary valid kernels k_n. If the condition of Theorem 1 holds and ⟨t⟩_n is finite for each n ∈ ℕ_0, then Algorithm 1 approximates ⟨t⟩ and its normalizing constant after a finite number of steps. Alternatively, if the condition of Theorem 2 holds, then Algorithm 1 approximates ⟨t⟩ and its normalizing constant in the limit n → ∞.
This follows directly from Theorem 1, Theorem 2, and Lemma 5. We conclude this section by discussing resample placements and the practical implications of Theorem 3. First, we define a resample placement for a term t as the term resulting from replacing arbitrary subterms t′ of t with resample; t′. Note that such a placement directly corresponds to constructing the sequence ⟨t⟩_n. Second, note that the trace measure ⟨t⟩ and the target measure ⟦t⟧ are clearly unaffected by such a placement: resample simply evaluates to (), and for ⟨t⟩ and ⟦t⟧ there is no bound on how many resamples we can evaluate. As such, we conclude that every resample placement in t fulfilling one of the two conditions in Theorem 3 leads to a correct approximation of ⟦t⟧ when applying Algorithm 1.

Furthermore, in practice there is always an upper bound on the number of calls to resample, since any concrete run of SMC has an (explicit or implicit) upper bound on its runtime. This is a powerful result, since it implies that when implementing SMC for PPLs, any method for selecting resampling locations in a program is correct under mild conditions (Theorem 1 or Theorem 2) that are most often, if not always, fulfilled in practice. Most importantly, this justifies the basic approach to placing resamples found in WebPPL, Anglican, and Birch, in which every call to weight is (implicitly) directly followed by a call to resample. It also justifies the approach to placing resamples described in Lundén et al. [26]. This latter approach is essential in, e.g., Ronquist et al. [37] for increasing inference efficiency.
Our results also show that the restriction in Anglican, which requires all executions to encounter the same number of resamples, is too conservative. Clearly, this is not a requirement in either Theorem 1 or Theorem 2. For instance, the number of calls to resample varies significantly in (20).

SMC Algorithms
In this section, we take a look at how the kernels k n in Algorithm 1 can be instantiated to yield the concrete SMC algorithm known as the bootstrap particle filter (Section 7.1), and also discuss other SMC algorithms and how they relate to Algorithm 1 (Section 7.2).

The Bootstrap Particle Filter
We define for each term t a particular sequence of kernels k_{t,n} that gives rise to the SMC algorithm known as the bootstrap particle filter (BPF). Informally, these kernels correspond to simply continuing to evaluate the program until reaching either the next resample or a value. For the bootstrap kernel, calculating the weights w_n^j from Algorithm 1 is particularly simple. Similarly to the ⟨t⟩_n, it is more convenient to define and work with sequences of kernels over traces, rather than over terms. We will define k_{t,n}(s, ·) to be the sub-probability measure over extended traces s ∗ s′ resulting from evaluating the term r_{t,n−1}(s) until the next resample or value v, ignoring any call to weight. First, we immediately have that the set of all traces that do not have s as a prefix must have measure zero. To make this formal, we use the inverse images of the functions prepend_s(s′) = s ∗ s′, s′ ∈ S, in the definition of the kernel. The next ingredient for defining the kernels k_{t,n} is a function p_{t,n} that indicates which traces are possible when executing t until the (n+1)th resample or value.
The proof is analogous to that of Lemma 6. We can now formally define the kernels k t,n .
Definition 24. k_{t,n}(s, S) = ∫_{prepend_s^{−1}(S)} p_{r_{t,n−1}(s),1}(s′) dμ_S(s′)

By the definition of p_{t,n}, the k_{t,n} are sub-probability kernels rather than probability kernels. Intuitively, the reason is that during evaluation, terms can get stuck, deterministically diverge, or even stochastically diverge. Such traces are assigned weight 0 by p_{t,n}.
Lemma 9. The functions k_{t,n} : S × 𝒮 → ℝ_+ are sub-probability kernels.†

We get a natural starting measure q_0 from the sub-probability distribution that results from running the initial program t until reaching a value or a call to resample, ignoring weights.
Definition 25. q_{t,0}(S) = ∫_S p_{t,0}(s) dμ_S(s)

Now we have all the ingredients for the general SMC algorithm described in Section 5.2: a sequence of target measures π̄_n = ⟨t⟩_n (Definition 22), a starting measure q_{t,0} ∝ q_0 (Definition 25), and a sequence of kernels k_{t,n} ∝ k_n (Definition 24). These then induce a sequence of proposal measures q_{t,n} ∝ q_n as in Equation (18), which we instantiate in the following definition.
Definition 26. q_{t,n}(S) = ∫ k_{t,n}(s, S) f_{t,n−1}(s) dμ_S(s), n > 0

Intuitively, the measures q_{t,n} are obtained by evaluating the terms in the support of the measure ⟨t⟩_{n−1} until reaching the next resample or value. For an efficient implementation, we need to factorize this definition into the history and the current step, which amounts to splitting the traces. Each feasible trace can be split in such a way.
Since the kernels k_{t,n} are sub-probability kernels, the measures q_{t,n} are finite given that the ⟨t⟩_n are finite.

Lemma 12. q_{t,0} is a sub-probability measure. Also, if ⟨t⟩_{n−1} is finite, then q_{t,n} is finite.†

As discussed in Section 6.2, the ⟨t⟩_n are finite, either by assumption (Theorem 1) or as a consequence of the dominating function of Theorem 2. From this and Lemma 12, the q_{t,n} are also finite. Furthermore, checking that the q_{t,n} are valid, i.e., that the density of each q_{t,n} covers the density f_{t,n} of ⟨t⟩_n, is trivial. As such, by Lemma 5, we can now correctly approximate the ⟨t⟩_n using Algorithm 1. The details are given in Algorithm 2, which closely resembles the standard SMC algorithm in WebPPL. For ease of notation, we assume it is possible to draw samples from q_{t,0} and k_{t,n}(s, ·), even though these are sub-probability measures. This essentially corresponds to assuming that evaluation never gets stuck or diverges; ensuring that this is the case is not within the scope of this paper. The weights in Algorithm 2 at time step n can easily be calculated according to the following lemma.
Here, s = s′ ∗ s′′ is the unique decomposition from Lemma 10.†

Finally, it is now obvious how the resample construct relates to the resampling in the selection step of Algorithm 2: only traces for which r_{t,n}(s_n^j) is a term of the form E[resample], or a value, issue from the mutation step and thus participate in resampling at the selection step. As a consequence of how the kernels k_{t,n} are constructed, we only stop at such terms in steps (1) and (5) when running the program. This is the reason for naming the construct resample.

Algorithm 2. A concrete instantiation of Algorithm 1 with π̄_n = ⟨t⟩_n, k_n ∝ k_{t,n}, q_0 ∝ q_{t,0}, and as a consequence q_n ∝ q_{t,n} (for n > 0). In each step, we let 1 ≤ j ≤ J, where J is the number of samples.
1. Initialization: Set n = 0. Draw s_0^j ∼ q_{t,0} for 1 ≤ j ≤ J. That is, run the program t, and draw from U(0, 1) whenever required by a sample_D. Record these draws as the trace s_0^j. Stop when reaching a term of the form E[resample] or a value v.
2. Correction: As a consequence of Lemma 13, this step is trivial: simply set w_n^j to the weight accumulated while running t in step (1), or r_{t,n−1}(ŝ_{n−1}^j) in step (5). The empirical distribution given by {(s_n^j, w_n^j)}_{j=1}^J approximates ⟨t⟩_n/Z_{⟨t⟩_n}.
3. Termination: If all samples r_t(s_n^j) are values, terminate and output {(s_n^j, w_n^j)}_{j=1}^J. If not, go to the next step. We cannot evaluate values further, so running the algorithm further when all samples are values is pointless. When terminating, assuming the conditions in Theorem 1 or Theorem 2 hold, {(s_n^j, w_n^j)}_{j=1}^J approximates ⟨t⟩/Z_{⟨t⟩_n}. Also, by the definition of ⟦t⟧, {(r_t(s_n^j), w_n^j)}_{j=1}^J approximates ⟦t⟧/Z_{⟨t⟩_n}, the normalized version of ⟦t⟧.
4. Selection: Resample the empirical distribution {(s_n^j, w_n^j)}_{j=1}^J. The new empirical distribution is unweighted and given by {ŝ_n^j}_{j=1}^J. This distribution also approximates ⟨t⟩_n/Z_{⟨t⟩_n}.
5. Mutation: Increment n. Draw s_n^j ∼ k_{t,n}(ŝ_{n−1}^j, ·) for 1 ≤ j ≤ J. That is, simply run the intermediate program r_{t,n−1}(ŝ_{n−1}^j), and draw from U(0, 1) whenever required by a sample_D. Record these draws and append them to ŝ_{n−1}^j, resulting in the trace s_n^j. Stop when reaching a term of the form E[resample] or a value v. The empirical distribution {s_n^j}_{j=1}^J approximates q_{t,n}/Z_{q_{t,n}}. Go to (2).
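The trace-centric view of Algorithm 2 can be sketched in code. In the following Python sketch (our illustration; the representation of programs and all names are our own assumptions), a particle is exactly a recorded trace of uniform draws. A program is a deterministic function of its trace, so the selection step can duplicate particles simply by copying traces, and the mutation step replays a trace and extends it with fresh draws until the next resample, mirroring the kernels k_{t,n}. The example program is a geometric model in the style of t_geo-res, with a weight 0.5 and a resample in each recursion.

```python
import random

random.seed(2)

class Trace:
    # A particle: replays recorded uniform draws, then draws fresh
    # ones from U(0, 1) and records them.
    def __init__(self, draws):
        self.draws = list(draws)
        self.pos = 0

    def sample(self):
        if self.pos == len(self.draws):
            self.draws.append(random.random())
        u = self.draws[self.pos]
        self.pos += 1
        return u

def geometric(trace, seg):
    # Deterministically evaluates the program on `trace`. Returns
    # (value or None, w), where w is the weight accumulated in segment
    # `seg` (between the seg-th and (seg+1)-th resample), in the spirit
    # of Lemma 13; None means evaluation suspended at a resample.
    n, w, resamples = 0, 1.0, 0
    while trace.sample() < 0.5:
        n += 1
        if resamples == seg:
            w *= 0.5                      # weight 0.5
        resamples += 1                    # resample
        if resamples > seg:
            return None, w
    return n, w

def bpf(J=1000):
    particles = [Trace([]) for _ in range(J)]
    seg = 0
    while True:
        runs = [(t,) + geometric(t, seg) for t in particles]
        if all(value is not None for _, value, _ in runs):
            return [(value, w) for _, value, w in runs]  # all terminated
        # Selection: multinomial resampling by the incremental weights;
        # duplicating a particle is sound because we copy its trace.
        chosen = random.choices(runs, weights=[w for _, _, w in runs], k=J)
        particles = [Trace(t.draws) for t, _, _ in chosen]
        seg += 1          # mutation: the next run replays and extends traces

results = bpf()
posterior_mean = (sum(v * w for v, w in results) /
                  sum(w for _, w in results))
```

For this program, the normalized target puts probability (3/4)(1/4)^n on the value n, so the posterior mean is 1/3; the sketch recovers it approximately.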

Other SMC Algorithms
In this section, we discuss SMC algorithms other than the BPF. First, we have the resample-move algorithm of Gilks and Berzuini [14], which is also implemented in WebPPL [16] and treated by Chopin [9] and Ścibior et al. [40]. In this algorithm, the SMC kernel is composed with a suitable MCMC kernel, such that one or more MCMC steps are taken for each sample after each resampling. This helps with the so-called degeneracy problem in SMC, which refers to the tendency of SMC samples to share a common ancestry as a result of resampling. We can directly achieve this algorithm in our context by simply choosing appropriate transition kernels in Algorithm 1. Let k_{MCMC,n} be MCMC transition kernels with the normalized ⟨t⟩_{n−1} as invariant distributions. Using the bootstrap kernels as the main kernels, we let k_n = k_{t,n} ∘ k_{MCMC,n}, where ∘ denotes kernel composition. The sequence k_n is valid because of the validity of the main SMC kernels and the invariance of the MCMC kernels.
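As a concrete sketch of the rejuvenation idea (ours, using an illustrative one-dimensional target rather than a trace-space one), composing the selection step with a few Metropolis-Hastings steps that leave the current target invariant restores particle diversity after resampling:

```python
import math
import random

random.seed(3)

def mh_move(x, log_target, scale=0.5):
    # One random-walk Metropolis-Hastings step; the distribution with
    # log-density log_target is invariant under this kernel, which is
    # exactly what resample-move requires of the MCMC kernels.
    proposal = x + random.gauss(0.0, scale)
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        return proposal
    return x

def rejuvenate(particles, log_target, steps=3):
    # Applied right after the selection step, independently to each
    # (possibly duplicated) particle.
    for _ in range(steps):
        particles = [mh_move(x, log_target) for x in particles]
    return particles

# Demo: a fully degenerate particle set (all identical, as can happen
# after resampling) targeting a standard normal, log pi(x) = -x^2 / 2.
degenerate = [5.0] * 200
rejuvenated = rejuvenate(degenerate, lambda x: -x * x / 2.0, steps=300)
```

Because each MH kernel leaves the target invariant, the composed kernel remains valid in the sense of Section 5.2 while spreading duplicated particles apart.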
While Algorithm 1 captures different SMC algorithms by allowing the use of different kernels, some algorithms require changes to Algorithm 1 itself. The first such variation of Algorithm 1 is the alive particle filter, recently discussed by Kudlicka et al. [24], which reduces the tendency to degeneracy by not including sample traces with zero weight in resampling. This is done by repeating the selection and mutation steps (for each sample individually) until a trace with non-zero weight is proposed; the corresponding modifications to Algorithm 1 are straightforward. The unbiasedness result of Kudlicka et al. [24] can easily be extended to our PPL context, with another minor modification to Algorithm 1.
Another variation of Algorithm 1 is the auxiliary particle filter [35]. Informally, this algorithm allows the selection and mutation steps of Algorithm 1 to be guided by future information regarding the weights w_n. For many models, this is possible because the weighting functions w_n from Algorithm 1 are often parametric in an explicitly available sequence of observation data points, which can also be used to derive better kernels k_n. Clearly, such optimizations are model-specific, and cannot directly be applied in expressive PPL calculi such as ours. However, the general idea of using look-ahead in general-purpose PPLs to guide selection and mutation is interesting and should be explored further.

Related Work
The only major previous work related to formal SMC correctness in PPLs is Ścibior et al. [40] (see Section 1). They validate both the BPF and the resample-move SMC algorithms in a denotational setting. In a companion paper, Ścibior et al. [39] also give a Haskell implementation of these inference techniques.
Although formal correctness proofs of SMC in PPLs are sparse, there are many languages that implement SMC algorithms. Goodman and Stuhlmüller [17] describe SMC for the probabilistic programming language WebPPL. They implement a basic BPF very similar to Algorithm 2, but do not show correctness with respect to any language semantics. Also, related to WebPPL, Stuhlmüller et al.
[43] discuss a coarse-to-fine SMC inference technique for probabilistic programs with independent sample statements.
Wood et al.
[50] describe PMCMC, an MCMC inference technique that uses SMC internally, for the probabilistic programming language Anglican [44]. Similarly to WebPPL, Anglican also includes a basic BPF similar to Algorithm 2, with the exception that every execution needs to encounter the same number of calls to resample. They use various types of empirical tests to validate correctness, in contrast to the formal proof found in this paper. Related to Anglican, a brief discussion on resample placement requirements can be found in van de Meent et al. [48].
Birch [31] is an imperative object-oriented PPL, with a particular focus on SMC. It supports a number of SMC algorithms, including the BPF [19] and the auxiliary particle filter [35]. Furthermore, they support dynamic analytical optimizations, for instance using locally-optimal proposals and Rao-Blackwellization [30]. As with WebPPL and Anglican, the focus is on performance and efficiency, and not on formal correctness.
There are quite a few papers studying the correctness of MCMC algorithms for PPLs. Using the same underlying framework as for their SMC correctness proof, Ścibior et al. [40] also validate a trace MCMC algorithm. Another proof of correctness for trace MCMC is given in Borgström et al. [6], which instead uses an untyped lambda calculus and an operational semantics. Much of the formalization in this paper is based on constructions from their paper. For instance, the functions f_t and r_t are defined similarly, as are the measure space (S, 𝒮, μ_S) and the measurable space (T, 𝒯). Our measurability proofs for f_t, r_t, f_{t,n}, and r_{t,n} largely follow the same strategies as theirs. Similarly to us, they also relate their proof of correctness to classical results from the MCMC literature. A difference is that we use inverse transform sampling, whereas they use probability density functions. As a result, our traces consist of numbers in [0, 1], while their traces consist of numbers in ℝ. Also, inverse transform sampling naturally allows for built-in discrete distributions; in contrast, discrete distributions must be encoded in the language itself when using probability densities. Another difference is that they restrict the arguments to weight to [0, 1], in order to ensure the finiteness of the target measure.
Other classical work on SMC includes Chopin [9], which we use as a basis for our formalization. In particular, Chopin [9] provides a general formulation of SMC, placing few requirements on the underlying model. The book by Del Moral [10] contains a vast number of classical SMC results, including the law of large numbers and the unbiasedness result of Lemma 5. A more accessible summary of the important SMC convergence results in Del Moral [10] can be found in Naesseth et al. [32].

Conclusions
In conclusion, we have formalized SMC inference for an expressive functional PPL calculus, based on the formalization by Chopin [9]. We showed that in this context, SMC is correct in that it approximates the target measures encoded by programs in the calculus under mild conditions. Furthermore, we illustrated a particular instance of SMC for our calculus, the bootstrap particle filter, and discussed other variations of SMC and their relation to our calculus.
As indicated in Section 2, the approach used for selecting resampling locations can have a large impact on SMC accuracy and performance. This leads us to the following general question: can we select optimal resampling locations in a given program, according to some formally defined measure of optimality? We leave this important research direction for future work.

A SMC: an Illustrative Example
In order to fully appreciate the contributions of this paper, we devote this section to an informal example that introduces SMC inference to readers unfamiliar with it. The example is based on Lindholm [25].

A.1 Model
Consider the following scenario: a pilot is flying an aircraft in bad weather with zero visibility, and is attempting to estimate the aircraft's position. In order to do this, the pilot has available an elevation map of the area, a noisy altimeter, and a noisy sensor for measuring the vertical distance to the ground (see Fig. 4 for an illustration). Concretely, assume that (a) X_{0:t} = X_0, X_1, …, X_t are real-valued random variables representing the true horizontal position of the aircraft at the discrete time steps 0, 1, …, t, and (b) Y_{0:t} = Y_0, Y_1, …, Y_t are real-valued random variables for the measurements given by subtracting the vertical-distance sensor reading from the altimeter reading.
The problem we consider is to estimate the positions X_n, n ≤ t, given all combined sensor measurements Y_{0:n} collected up until time n. This random variable is denoted X_n | Y_{0:n}, and its distribution is known as the target measure. In general, X | Y denotes the random variable X conditioned on Y having been observed. Concretely, we assume the following model for n ∈ ℕ:

X_0 ∼ U(0, 100),
X_n | X_{n−1} ∼ N(X_{n−1} + 2, 1),
Y_n | X_n ∼ N(elevation(X_n), 2).

In other words, the initial position X_0 of the aircraft is uniformly distributed between 0 and 100, and at each time step n, X_n is normally distributed around X_{n−1} + 2 with variance 1 (the conditional distribution of X_n | X_{n−1} is known as a transition kernel). Finally, the combined measurement Y_n from the sensors is normally distributed around the true elevation of the ground at the current horizontal position X_n with variance 2, where the true elevation is given by our elevation map, here modeled as a function elevation.

A.2 Inference
With the model in place, we can proceed to sequentially estimating the probability distributions for the random variables X n | Y 0:n using the BPF, a fundamental SMC algorithm. In Section 7.1, we will give a formal definition of this algorithm for models encoded in our calculus. Here, we instead give an informal description for our current aircraft model. In Fig. 4, we show the true initial aircraft position (1), and the true position at three later time steps, denoted by (2), (3), and (4). In addition, for each of these time steps, we show the empirical SMC approximations to the distributions for X n | Y 0:n , where n is increasing for each of the four positions.
Step (1) consists of four sub-steps, illustrated in Fig. 4. (1.1) Draw a set of samples (particles) of X_0 from the prior U(0, 100). (1.2) Weight each sample by the likelihood of the first observation Y_0 given that sample's position. (1.3) Next, we take the set of weighted particles from the previous sub-step and resample them according to their weights. That is, we draw (with replacement) a set of new samples from the previous set of samples, based on their relative weights. We see that the samples with high weight are indeed the ones to survive this resampling step. Note that after resampling, we also reset the weights (which is required for correctness). (1.4) For each sample of X_0, draw from the distribution of X_1 | X_0 to propagate it forwards by one time step. (2) At this point, we have completed many iterations of the above four sub-steps, the exception being that in the first sub-step, we do not draw from U(0, 100), but instead reuse the set of particles from the previous step. We see that the set of samples now correctly cluster on the true position. (3) After flying over less varied terrain, the samples have diverged slightly, representing the increased uncertainty in the aircraft's position. (4) When encountering more varied terrain once again, the uncertainty is reduced, and the set of samples again cluster more closely on the true position.
The key step in every SMC algorithm is the resampling step illustrated above.
Resampling allows for focusing the empirical approximations on regions of the sample space with high probability, yielding efficient inference for many models of practical interest. For instance, SMC is commonly used in tracking problems [1,20]. It is also possible to encode the example as a program in the calculus from Section 3. This is done in Fig. 5. The real numbers c 0 , c 1 , c 2 , . . . , c t in the program correspond to the observations of Y 0:t .
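The inference procedure just described can also be sketched directly in Python (our illustration; the terrain function, parameters, and synthetic data below are invented stand-ins for the elevation map and the sensor readings of Fig. 4):

```python
import math
import random

random.seed(4)

def elevation(x):
    # Hypothetical terrain profile standing in for the elevation map.
    return 50.0 + 0.5 * x + 20.0 * math.sin(x / 10.0)

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def aircraft_bpf(ys, J=1000):
    # Draw initial particles from X_0 ~ U(0, 100).
    xs = [random.uniform(0.0, 100.0) for _ in range(J)]
    estimates = []
    for n, y in enumerate(ys):
        if n > 0:
            # Propagate through the transition kernel X_n | X_{n-1}.
            xs = [random.gauss(x + 2.0, 1.0) for x in xs]
        # Weight by the likelihood of Y_n | X_n ~ N(elevation(X_n), 2).
        ws = [normal_pdf(y, elevation(x), 2.0) for x in xs]
        total = sum(ws)
        estimates.append(sum(w * x for w, x in zip(ws, xs)) / total)
        # Resample with replacement, proportionally to the weights.
        xs = random.choices(xs, weights=ws, k=J)
    return estimates

# Synthetic flight: simulate positions and sensor readings from the model.
true_x = [30.0]
ys = [random.gauss(elevation(true_x[-1]), math.sqrt(2.0))]
for _ in range(10):
    true_x.append(random.gauss(true_x[-1] + 2.0, 1.0))
    ys.append(random.gauss(elevation(true_x[-1]), math.sqrt(2.0)))

estimates = aircraft_bpf(ys)
```

Each entry of `estimates` is the weighted posterior mean of the aircraft's position at that time step, computed just before resampling.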

B Definitions and Proofs
In this appendix, we prove lemmas found throughout the main article. First, we introduce measure theory and Borel spaces (Section B.1), and define pointwise convergence of functions (Section B.2). Then, we introduce metric spaces and their properties (Section B.3), and look closer at the measure space (S, S, µ S ) (Section B.4) and the measurable space (T, T ) (Section B.5). In Section B.6 and Section B.7, we establish further results required for proving the measurability of r t and f t (Section B.8), and r t,n and f t,n (Section B.9). Lastly, we look at the bootstrap particle filter kernels k t,n and induced proposal measures t n (Section B.10).

B.1 Preliminaries: Measure Theory and Borel Spaces
This section gives fundamental definitions and lemmas from measure theory, and defines Borel spaces. For a more pedagogical introduction to the subject, we recommend McDonald and Weiss [28].
Definition 27. Let A be a set. We say that A ⊂ P(A) is a σ-algebra on A if (1) A ∈ A; (2) if B ∈ A, then A \ B ∈ A; and (3) if {B_n}_n ⊂ A is countable, then ∪_n B_n ∈ A.

Definition 28. Let (A, A) and (A', A') be measurable spaces. A function f : A → A' is measurable if f⁻¹(B) ∈ A for all B ∈ A'. To indicate that a function is measurable with respect to specific measurable spaces, we write f : (A, A) → (A', A').
Definition 29. Let (A, A) be a measurable space, and let R* denote the extended non-negative real numbers. A function µ : A → R* is a measure on A if (1) µ(B) ≥ 0 for all B ∈ A; (2) µ(∅) = 0; and (3) if {A_n}_n ⊂ A is countable, and such that A_i ∩ A_j = ∅ for i ≠ j, then µ(∪_n A_n) = Σ_n µ(A_n). Furthermore, we call (A, A, µ) a measure space if A is a σ-algebra on A, and µ is a measure on A.

Definition 35. Let (A, A, µ) be a measure space. We say that a property holds µ-almost everywhere, or µ-ae for short, if there is a set B ∈ A of µ-measure 0 such that the property holds on A \ B.
When µ is a (sub-)probability measure, the term "almost surely" is used interchangeably with "almost everywhere".

B.2 Preliminaries: Convergence
In this section, we recall the definition of pointwise convergence of sequences of functions. Convergence is used to define correctness in Section 5.3 and Section 6.2. For a more comprehensive introduction to convergence, we recommend McDonald and Weiss [28].
Definition 36. Let {x_n}_{n∈N} be a sequence of real numbers, and x a real number. We say that lim_{n→∞} x_n = x if for all ε > 0, there exists an N such that |x_n − x| < ε for all n > N.

Definition 37. Let {f_n : A → R}_{n∈N} be a sequence of functions, and f : A → R a function. We say that lim_{n→∞} f_n = f pointwise if for all x ∈ A, it holds that lim_{n→∞} f_n(x) = f(x).
In particular, we say that lim_{n→∞} f_n = f µ-ae if the sequence f_n converges pointwise to f, except on a set of µ-measure 0.

Definition 39. For n ∈ N, we let

d_{R^n}((x_1, x_2, . . . , x_n), (y_1, y_2, . . . , y_n)) = |x_1 − y_1| + |x_2 − y_2| + · · · + |x_n − y_n|, (25)

and d_R = d_{R^1}. It is easy to verify that d_{R^n} is a metric for each n.

Proof. We have to show that S is a σ-algebra. For closure under complement: since B^c_n ∈ B^n_{[0,1]}, the implication holds. For closure under countable unions: since ∪_i B_{n,i} ∈ B^n_{[0,1]}, the implication holds.
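As a classic concrete example of the convergence notions in Definitions 36 and 37 (our own illustration, not from the article): the functions f_n(x) = x^n on [0, 1] converge pointwise to a limit that is 0 on [0, 1) and 1 at x = 1; since the exceptional set {1} has Lebesgue measure 0, f_n converges to the zero function µ-ae for µ the Lebesgue measure on [0, 1].

```python
def f(n, x):
    """f_n(x) = x**n on [0, 1]."""
    return x ** n

# For 0 <= x < 1 we have f_n(x) -> 0, while f_n(1) = 1 for every n.
# So f_n converges pointwise to a discontinuous limit, and since the
# exceptional set {1} has Lebesgue measure 0, f_n -> 0 mu-ae.
values_at_large_n = [f(1000, x) for x in (0.0, 0.5, 0.9, 1.0)]
```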
Proof. We begin by showing that µ_S is a measure.
2. µ_S(∅) = 0, which follows directly from the definition of µ_S.
Next, we need to show that µ_S is σ-finite. To do this, we show that there is a sequence {S_i}_i ⊂ S, with µ_S(S_i) < ∞ for all i, such that ∪_i S_i = S. We can choose these S_i simply as the sets of traces of length i. We now define a metric on S.
Definition 45. Let c_i and c'_i denote the ith element of s ∈ S and s' ∈ S, respectively. We define d_S(s, s') = Σ_i |c_i − c'_i| if s and s' have the same length, and d_S(s, s') = ∞ otherwise.

It is easy to verify that S_Q, the set of traces with only rational components, is a countable dense subset of S, from which the result follows.
Proof. Informally, this follows since S is the union of a countable set of isolated subspaces (the distance from each element in a subset to all elements of other subsets is ∞) which are all isomorphic to R n , for some n ∈ N 0 .
More formally, note that S = σ(∪_{n∈N_0} B^n_{[0,1]}). Because the distance between traces of different length is ∞, each set of traces of a fixed length n is both open and closed in (S, d_S), and its induced Borel σ-algebra is exactly B^n_{[0,1]}. The result follows.

Proof. Analogous to the proof for Lemma 29.
where k is the maximum of the number of occurrences of x in t 1 and t 2 .
Proof. The result follows immediately if d T (t 1 , t 2 ) = ∞. Therefore, assume d T (t 1 , t 2 ) < ∞. We now proceed by induction over the structure of t 1 and t 2 .
The result follows immediately.
- Case t_1 = λx'.t, t_2 = λx'.t'. By using the induction hypothesis, the result follows, because the number of occurrences k of x is the same in (λx'.t) and t.
- Case t_1 = x', t_2 = x'. In this case, we have two subcases. If x = x', the result follows immediately (k = 1). If x ≠ x', the result follows immediately (k = 0).
- Case t_1 = t'_1 t'_2, t_2 = t''_1 t''_2. The result follows by using the induction hypothesis on the subterms.
- The remaining cases follow by largely similar arguments.
where × denotes the usual Cartesian product of sets.
Lemma 32. Let where n i=1 d i is the Manhattan metric formed from the component metrics d i .
be a set of separable metric spaces. Then is a separable metric space, and

Lemma 35. Let A ⊂ P(A). Furthermore, let (A', A') and (A, σ(A)) be measurable spaces. Then f : (A', A') → (A, σ(A)) is measurable if and only if f⁻¹(A) ∈ A' for each A ∈ A.

Proof. The "only if" part is trivial. We now show the "if" part. Consider the set B = {A ∈ P(A) | f⁻¹(A) ∈ A'}. Obviously, A ⊂ B. Furthermore, from properties of the preimage, it is easy to check that B is a σ-algebra. Therefore, σ(A) ⊂ B, and f⁻¹(A) ∈ A' for each A ∈ σ(A). Hence, f is measurable.
Lemma 36. Let {f_i : (A', A') → (A_i, A_i)}_{i=1}^n be a finite set of measurable functions. Then (f_1, . . . , f_n) : (A', A') → (×_{i=1}^n A_i, ⊗_{i=1}^n A_i) is measurable.

Proof. By Lemma 35, it suffices to check that the preimage of each generating set of the form ×_{i=1}^n A_i, with A_i ∈ A_i, is measurable. Hence, for all A_× = ×_{i=1}^n A_i, by properties of the preimage and the measurability of the f_i, (f_1, . . . , f_n)⁻¹(A_×) = ∩_{i=1}^n f_i⁻¹(A_i) ∈ A'. The result follows.
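As a quick concrete check (our own sketch, not part of the proofs), the Manhattan metric formed from component metrics behaves as expected on R^2:

```python
def d_manhattan(x, y):
    """Manhattan (L1) metric on R^n: the sum of component distances.
    This is the metric sum_i d_i when every component metric d_i
    is the absolute difference |.| on R."""
    assert len(x) == len(y)
    return sum(abs(a - b) for a, b in zip(x, y))

# Metric axioms on sample points: identity, symmetry, triangle inequality.
x, y, z = (1.0, 2.0), (4.0, 6.0), (0.0, 0.0)
```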

B.7 The Big-Step Function Induced by a Small-Step Relation
Assume there is a small-step relation → which can be regarded as a measurable function → defined on a measurable subset of A (i.e., on a set A' with A' ∈ A). We complete this function, forming the function step_→ : A → A, which applies → where it is defined and acts as the identity id elsewhere.
Because → and id are measurable, we have step⁻¹_→(A) ∈ A, as required.
In the following, we write step^n_→ for the n-fold composition of step_→ with itself, with n ∈ N_0. Next, assume that we have a measurable function extract : (A, A) → (H, H). We require that H has a bottom element ⊥ (such that {⊥} ∈ H) and that H is equipped with a flat partial order ≤_H (i.e., the smallest partial order with ⊥ ≤_H h for all h ∈ H). Furthermore, we require that extract has the following property with respect to the function step_→: extract(a) ≤_H extract(step_→(a)) for all a ∈ A, so that the sequence extract ∘ step^n_→ is monotone in ≤_H. Proof. This proof is based on Borgström et al. [5, Lemma 89]. Let f_n = extract ∘ step^n_→. The function f_n is clearly measurable, since it is a composition of measurable functions (step^n_→ is measurable as a consequence of Lemma 37). Next, let sup_n f_n = final_{→,extract}, and pick an arbitrary H ∈ H such that ⊥ ∉ H. Then

final⁻¹_{→,extract}(H) = ∪_n f_n⁻¹(H), (60)

which is measurable by definition. Also,

final⁻¹_{→,extract}({⊥}) = ∩_n f_n⁻¹({⊥}) (61)

is also measurable by definition. Now assume ⊥ ∈ H. Then final⁻¹_{→,extract}(H) = final⁻¹_{→,extract}(H \ {⊥}) ∪ final⁻¹_{→,extract}({⊥}), which is also measurable by (60) and (61).
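The construction of final as a supremum of the iterates extract ∘ step^n_→ can be sketched operationally (a simplified illustration under our own naming; a real implementation cannot take a true supremum over all n, so we cap the number of iterations):

```python
BOTTOM = None  # stand-in for the bottom element of the flat order on H

def make_final(step, extract, max_iter=10_000):
    """Approximate final(a) = sup_n extract(step^n(a)) under the flat
    partial order: return the first non-bottom value that extract
    produces along the step sequence, or BOTTOM if none appears within
    max_iter steps (the true construction takes a supremum over all n)."""
    def final(a):
        for _ in range(max_iter):
            h = extract(a)
            if h is not BOTTOM:
                return h
            a = step(a)
        return BOTTOM
    return final

# Toy instance: states are integers, step counts down, and extract
# succeeds (returns a non-bottom result) exactly at 0.
step = lambda n: n - 1 if n > 0 else n
extract = lambda n: "done" if n == 0 else BOTTOM
final = make_final(step, extract)
```

Because the order is flat, the sequence of extracted values is either constantly ⊥ or jumps once from ⊥ to a fixed value; the loop above simply finds that jump.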
We summarize all of the above in the following lemma.

B.8 Measurability of r_t and f_t

In this section, we prove that r_t and f_t are measurable. We follow the proof strategy from Borgström et al. [5].
Condition 2 We require that, for each identifier D ∈ D, the function is measurable.

Condition 3
We require that, for each identifier g ∈ G, the function is measurable.
Lemma 42. T App , T Prim , T IfTrue , T IfFalse , and T d are T -measurable.
Proof. We can write all of these sets as countable unions of sets of the form t^n_p(R^n). Hence, they must be T-measurable.

Definition 55.

Proof. By Lemma 14.
(73) By applying Lemma 27, Lemma 31, and Lemma 29 (in that order), we see that by selecting δ = ε/(k + 1), we get the implication (73), and the function is continuous, and hence measurable.
For any E[g(c_1, . . . , c_|g|)] ∈ T_Prim, by Lemma 29, we have that unbox is continuous (set δ = ε) and hence measurable. Furthermore, box_E is continuous (set δ = ε) and measurable as well. It holds that box_E ∘ σ_g ∘ unbox is measurable (composition of measurable functions) for each g and E. Because E and G are countable, by Lemma 32, step_Prim is measurable.
Proof. We show that step IfTrue is continuous as a function between the metric spaces (T IfTrue , d T ) and (T, d T ). By Lemma 44 and Lemma 34, the result then follows.
Pick arbitrary E[if true then t_1 else t_2] ∈ T_IfTrue and ε > 0. Following Definition 51, we want to show that there exists a δ > 0 such that for all E'[if true then t'_1 else t'_2] ∈ T_IfTrue, the implication (79) holds. By selecting δ = ε, we get the implication (79), and the function is continuous, and hence measurable.
Proof. Follows from Lemma 45 and Lemma 32.
Let us now make the following definition. With T_⊥ = T ∪ {⊥}, and T_⊥ the corresponding least σ-algebra such that T ⊂ T_⊥ (which must necessarily contain {⊥}), we have the following lemma.
Proof. We have extract_→Det = id|_{T^c_d} ∪ ⊥|_{T_d}, where ⊥ here denotes the constant function producing ⊥. Because id, ⊥, and T_d are measurable, the result follows by Lemma 32.
Definition 58. The partial order ≤ d is the least partial order on T ⊥ with ⊥ ≤ d t.
Proof. Consider first t ∈ T_d. We then have extract_→Det(t) = ⊥ by definition, and the result holds immediately. Now consider t ∉ T_d. By definition, step_→Det(t) = t, and the result holds.
Lastly, we apply Lemma 41 to get the measurable function final → Det .
Proof. We can write the sets T_Sample, T_Weight, and T_Resample as countable unions of sets of the form t^n_p(R^n). Hence, they must be T-measurable. T_Det is measurable because final_→Det is a measurable function, and V, T_Sample, T_Weight, and T_Resample are measurable. Finally, T_s is measurable because it is a finite union of measurable sets.
The lemma below allows us to ignore the element ⊥ introduced by the function extract_→Det.

Proof. The restriction of a measurable function to a measurable set is also a measurable function (this follows from Lemma 32). Furthermore, we can restrict the codomain from (T_⊥, T_⊥) to (T, T) as a result of Lemma 55 and by the definition of (T_⊥, T_⊥).
Hence, by choosing δ = ε, we see that π_j is continuous.
Lemma 63. X_Det, X_Sample, X_Weight, X_Resample, and X_s are all X-measurable.

Proof. X_Det, X_Sample, X_Weight, and X_Resample are the Cartesian products of measurable sets, hence measurable. X_s is a finite union of measurable sets, hence measurable.
Lemma 64. X Det , X Sample , X Weight , X Resample , and X s are σ-algebras.
Lemma 65. Let d_X = d_T + d_{R+} + d_S. Then (X, d_X) is a separable metric space.
Proof. The projections π_1, π_2, and π_3 are continuous and hence measurable. From Lemma 57, final_→Det|_{T_Det} is measurable, and therefore, so is the composition final_→Det|_{T_Det} ∘ π_1. By Lemma 36, the result now follows.
By copying the arguments from the proof of Lemma 48, unbox and box_E are measurable. Next, define head(p :: s) = p and tail(p :: s) = s.
By letting δ = ε, we see that head is continuous and hence measurable. Furthermore, by a similar argument, tail is continuous and measurable. Now, we note that By the measurability of the component functions, Lemma 36 (applied twice), and Lemma 32, the result follows.
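The continuity of head and tail with respect to the trace metric can be illustrated concretely (our own sketch; d_S here is the sum-of-components metric on equal-length traces, with distance ∞ between traces of different length, as described in this appendix):

```python
INF = float("inf")

def d_S(s, t):
    """Metric on traces: componentwise sum for equal-length traces,
    infinity otherwise (traces of different length are isolated)."""
    if len(s) != len(t):
        return INF
    return sum(abs(a - b) for a, b in zip(s, t))

def head(s):
    return s[0]

def tail(s):
    return s[1:]

# head and tail are nonexpansive: moving s a little moves head(s) and
# tail(s) at most as much, so both maps are continuous (delta = epsilon).
s, t = (0.1, 0.5, 0.9), (0.2, 0.4, 0.9)
```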
Proof. Pick arbitrary E ∈ E. By using similar arguments as in the proof of Lemma 48, it holds that unbox and box are measurable. Here, · denotes the pointwise function product; that is, for two functions f and g, (f · g)(x) = f(x) · g(x). It is a standard result in measure theory that the pointwise product of two measurable functions is measurable. By the measurability of the component functions, Lemma 36, and Lemma 32, the result now follows.