Densities of Almost Surely Terminating Probabilistic Programs are Differentiable Almost Everywhere

We study the differential properties of higher-order statistical probabilistic programs with recursion and conditioning. Our starting point is an open problem posed by Hongseok Yang: what class of statistical probabilistic programs have densities that are differentiable almost everywhere? To formalise the problem, we consider Statistical PCF (SPCF), an extension of call-by-value PCF with real numbers, and constructs for sampling and conditioning. We give SPCF a sampling-style operational semantics à la Borgström et al., and study the associated weight (commonly referred to as the density) function and value function on the set of possible execution traces. Our main result is that almost surely terminating SPCF programs, generated from a set of primitive functions (e.g. the set of analytic functions) satisfying mild closure properties, have weight and value functions that are almost everywhere differentiable. We use a stochastic form of symbolic execution to reason about almost everywhere differentiability. A by-product of this work is that almost surely terminating deterministic (S)PCF programs with real parameters denote functions that are almost everywhere differentiable. Our result is of practical interest, as almost everywhere differentiability of the density function is required to hold for the correctness of major gradient-based inference algorithms.


Introduction
Probabilistic programming refers to a set of tools and techniques for the systematic use of programming languages in Bayesian statistical modelling. Users of probabilistic programming - those wishing to make inferences or predictions - (i) encode their domain knowledge in program form; (ii) condition certain program variables based on observed data; and (iii) make a query. The resulting code is then passed to an inference engine which performs the necessary computation to answer the query, usually following a generic approximate Bayesian inference algorithm. (In some recent systems [5,14], users may also write their own inference code.) The Programming Language community has contributed to the field by developing formal methods for probabilistic programming languages (PPLs), seen as usual languages enriched with primitives for (i) sampling and (ii) conditioning. (The query (iii) can usually be encoded as the return value of the program.) It is crucial to have access to reasoning principles in this context. The combination of these new primitives with the traditional constructs of programming languages leads to a variety of new computational phenomena, and a major concern is the correctness of inference: given a query, will the algorithm converge, in some appropriate sense, to a correct answer? In a universal PPL (i.e. one whose underlying language is Turing-complete), this is not obvious: the inference engine must account for a wide class of programs, going beyond the more well-behaved models found in many of the current statistical applications. Thus the design of inference algorithms, and the associated correctness proofs, are quite delicate. It is well-known, for instance, that in its original version the popular lightweight Metropolis-Hastings algorithm [53] contained a bug affecting the result of inference [20,25].
Fortunately, research in this area benefits from decades of work on the semantics of programs with random features, starting with pioneering work by Kozen [26] and Saheb-Djahromi [44]. Both operational and denotational models have recently been applied to the validation of inference algorithms: see e.g. [20,8] for the former and [45,10] for the latter. There are other approaches, e.g. using refined type systems [33].
Inference algorithms in probabilistic programming are often based on the concept of program trace, because the operational behaviour of a program is parametrised by the sequence of random numbers it draws along the way. Accordingly a probabilistic program has an associated value function which maps traces to output values. But the inference procedure relies on another function on traces, commonly called the density of the program, which records a cumulative likelihood for the samples in a given trace. Approximating a normalised version of the density is the main challenge that inference algorithms aim to tackle. We will formalise these notions: in Sec. 3 we demonstrate how the value function and density of a program are defined in terms of its operational semantics.

Contributions.
The main result of this paper is that both the density and value function are differentiable almost everywhere (that is, everywhere but on a set of measure zero), provided the program is almost surely terminating in a suitable sense. Our result holds for a universal language with recursion and higher-order functions. We emphasise that it follows immediately that purely deterministic programs with real parameters denote functions that are almost everywhere differentiable. This class of programs is important, because they can express machine learning models which rely on gradient descent [30].
This result is of practical interest, because many modern inference algorithms are "gradient-based": they exploit the derivative of the density function in order to optimise the approximation process. This includes the well-known methods of Hamiltonian Monte-Carlo [15,37] and stochastic variational inference [18,40,6,27]. But these techniques can only be applied when the derivative exists "often enough", and thus, in the context of probabilistic programming, almost everywhere differentiability is often cited as a requirement for correctness [55,31]. The question of which probabilistic programs satisfy this property was selected by Hongseok Yang in his FSCD 2019 invited lecture [54] as one of three open problems in the field of semantics for probabilistic programs.
Points of non-differentiability exist largely because of branching, which typically arises in a program when the control flow reaches a conditional statement. Hence our work is a study of the connections between the traces of a probabilistic program and its branching structure. To achieve this we introduce stochastic symbolic execution, a form of operational semantics for probabilistic programs, designed to identify sets of traces corresponding to the same control-flow branch. Roughly, a reduction sequence in this semantics corresponds to a control flow branch, and the rules additionally provide for every branch a symbolic expression of the trace density, parametrised by the outcome of the random draws that the branch contains. We obtain our main result in conjunction with a careful analysis of the branching structure of almost surely terminating programs.
Outline. We devote Sec. 2 to a more detailed introduction to the problem of trace-based inference in probabilistic programming, and the issue of differentiability in this context. In Sec. 3, we present a trace-based operational semantics for Statistical PCF, a prototypical higher-order functional language previously studied in the literature. This is followed by a discussion of differentiability and almost sure termination of programs (Sec. 4). In Sec. 5 we define the "symbolic" operational semantics required for the proof of our main result, which we present in Sec. 6. We discuss related work and further directions in Sec. 7.

Probabilistic Programming and Trace-Based Inference
In this section we give a short introduction to probabilistic programs and the densities they denote, and we motivate the need for gradient-based inference methods. Our account relies on classical notions from measure theory, so we start with a short recap.

Measures and Densities
A measurable space is a pair (X, Σ X ) consisting of a set together with a σ-algebra of subsets, i.e. Σ X ⊆ P(X) contains ∅ and is closed under complements and countable unions and intersections. Elements of Σ X are called measurable sets. A measure on (X, Σ X ) is a function µ : Σ X → [0, ∞] such that µ(∅) = 0 and µ(⋃ n∈N U n ) = Σ n∈N µ(U n ) for every sequence of pairwise disjoint U n ∈ Σ X ; it is a probability measure if moreover µ(X) = 1. The space R of real numbers is an important example. The (Borel) σ-algebra Σ R is the smallest one containing all intervals [a, b), and the Lebesgue measure Leb is the unique measure on (R, Σ R ) satisfying Leb([a, b)) = b − a. For measurable spaces (X, Σ X ) and (Y, Σ Y ), the product σ-algebra Σ X×Y is the smallest one containing all U × V , where U ∈ Σ X and V ∈ Σ Y . So in particular we get for each n ∈ N a space (R n , Σ R n ), and additionally there is a unique measure Leb n on R n satisfying Leb n (∏ i U i ) = ∏ i Leb(U i ).
When a function f : X → R is measurable and non-negative and µ is a measure on X, for each U ∈ Σ X we can define the integral ∫ U f dµ ∈ [0, ∞]. Common families of probability distributions on the reals (Uniform, Normal, etc.) are examples of measures on (R, Σ R ). Most often these are defined in terms of probability density functions with respect to the Lebesgue measure, meaning that for each such measure µ D there is a measurable function pdf D : R → R ≥0 which determines it: µ D (U ) = ∫ U pdf D d Leb. As we will see, density functions such as pdf D have a central place in Bayesian inference.
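For instance (a standard fact, recorded here because the running example below uses the primitive pdfN): the normal distribution with mean m and standard deviation σ is determined by the density

\[ \mathrm{pdf}_{\mathcal{N}(m,\sigma)}(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-m)^2}{2\sigma^2}}, \qquad \mu_{\mathcal{N}(m,\sigma)}(U) \;=\; \int_U \mathrm{pdf}_{\mathcal{N}(m,\sigma)}\, \mathrm{d\,Leb}. \]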
Formally, if µ is a measure on a measurable space X, a density for µ with respect to another measure ν on X (most often ν is the Lebesgue measure) is a measurable function f : X → R such that µ(U ) = ∫ U f dν for every U ∈ Σ X . In the context of the present work, an inference algorithm can be understood as a method for approximating a distribution of which we only know the density up to a normalising constant. In other words, if the algorithm is fed a (measurable) function g : X → R, it should produce samples approximating the probability measure U ↦ (∫ U g dν) / (∫ X g dν) on X.

We will make use of some basic notions from topology: given a topological space X and a set A ⊆ X, the interior of A is the largest open set Å contained in A. Dually the closure of A is the smallest closed set Ā containing A, and the boundary of A is defined as ∂A := Ā \ Å. Note that for all U ⊆ R n , all of Ů, Ū and ∂U are measurable (in Σ R n ).

Probabilistic Programming: a (Running) Example
Our running example is based on a random walk in R ≥0 .
The story is as follows: a pedestrian has gone on a walk on a certain semi-infinite street (i.e. extending infinitely on one side), where she may periodically change directions. Upon reaching the end of the street she has forgotten her starting point, only remembering that she started no more than 3km away. Thanks to an odometer, she knows the total distance she has walked is 1.1km, although there is a small margin of error. Her starting point can be inferred using probabilistic programming, via the program in Fig. 1a.
The function walk in Fig. 1a is a recursive simulation of the random walk: note that in this model a new direction is sampled after at most 1km. Once the pedestrian has travelled past 0 the function returns the total distance travelled. The rest of the program first specifies a prior distribution for the starting point, representing the pedestrian's belief - uniform distribution on [0, 3] - before observing the distance measured by the odometer. After drawing a value for start the program simulates a random walk, and the execution is weighted (via score) according to how close distance is to the observed value of 1.1. The return value is our query: it indicates that we are interested in the posterior distribution on the starting point.

    (* returns total distance travelled *)
    let rec walk start =
      if (start <= 0) then 0
      else
        (* each leg < 1km *)
        let step = Uniform(0, 1) in
        if (flip ()) then
          (* go towards +infty *)
          step + walk (start + step)
        else
          (* go towards 0 *)
          step + walk (start - step)
    in
    (* prior *)
    let start = Uniform(0, 3) in
    let distance = walk start in
    (* likelihood *)
    score ((pdfN distance 0.1) 1.1);
    (* query *)
    start

(Fig. 1a: Running example in pseudo-code.)

The histogram in Fig. 1b is obtained by sampling repeatedly from the posterior of a Python model of our running example. It shows the mode of the pedestrian's starting point to be around the 0.8km mark.
To approximate the posterior, inference engines for probabilistic programs often proceed indirectly and operate on the space of program traces, rather than on the space of possible return values. By trace, we mean the sequence of samples drawn in the course of a particular run, one for each random primitive encountered. Because each random primitive (qua probability distribution) in the language comes with a density, given a particular trace we can compute a coefficient as the appropriate product. We can then multiply this coefficient by all scores encountered in the execution, and this yields a (weight) function, mapping traces to the non-negative reals, over which the chosen inference algorithm may operate. This indirect approach is more practical, and enough to answer the query, since every trace unambiguously induces a return value.
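Schematically (our summary, anticipating the formal definitions of Sec. 3): if a terminating run draws its i-th sample from a primitive with density pdf D i and passes through scores c 1 , . . . , c k , then the weight of its trace s = [s 1 , . . . , s n ] is

\[ \mathrm{weight}(s) \;=\; \prod_{i=1}^{n} \mathrm{pdf}_{D_i}(s_i) \,\cdot\, \prod_{j=1}^{k} c_j . \]

In the language presented below every sample is drawn from U(0, 1), whose density is constantly 1 on (0, 1), so the weight reduces to the product of the scores encountered.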

Remark 1.
In much of the probabilistic programming literature (e.g. [31,55,54], including this paper), the above-mentioned weight function on traces is called the density of the probabilistic program. This may be confusing: as we have seen, a probabilistic program induces a posterior probability distribution on return values, and it is natural to ask whether this distribution admits a probability density function (Radon-Nikodym derivative) w.r.t. some base measure. This problem is of current interest [2,3,21] but unrelated to the present work.

Gradient-Based Approximate Inference
Some of the most influential and practically important inference algorithms make use of the gradient of the density functions they operate on, when these are differentiable. Generally the use of gradient-based techniques allows for much greater efficiency in inference.
A popular example is the Markov Chain Monte Carlo algorithm known as Hamiltonian Monte Carlo (HMC) [15,37]. Given a density function g : X → R, HMC samples are obtained as the states of a Markov chain by (approximately) simulating Hamilton's equations via an integrator that uses the gradient ∇ x g(x). Another important example is (stochastic) variational inference [18,40,6,27], which transforms the posterior inference problem to an optimisation problem. This method takes two inputs: the posterior density function of interest g : X → R, and a function h : Θ × X → R; typically, the latter function is a member of an expressive and mathematically well-behaved family of densities that are parameterised in Θ. The idea is to use stochastic gradient descent to find the parameter θ ∈ Θ that minimises the "distance" (typically the Kullback-Leibler divergence) between h(θ, −) and g, relying on a suitable estimate of the gradient of the objective function. When g is the density of a probabilistic program (the model ), h can be specified as the density of a second program (the guide) whose traces have additional θ-parameters. The gradient of the objective function is then estimated in one approach (score function [41]) by computing the gradient ∇ θ h(θ, x), and in another (reparameterised gradient [24,42,49]) by computing the gradient ∇ x g(x).
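As a concrete illustration of why the gradient is needed, here is a minimal sketch (ours, not the paper's) of a single leapfrog step, the integrator at the heart of HMC. We assume a function grad_log_density computing the gradient of the log-density at a point; position q and momentum p are float arrays, and a unit mass matrix is assumed.

    (* one leapfrog step of size eps for the Hamiltonian induced by the log-density *)
    let leapfrog_step grad_log_density eps (q, p) =
      (* half-step update of the momentum, using the gradient at the given position *)
      let kick p q =
        Array.map2 (fun pi gi -> pi +. 0.5 *. eps *. gi) p (grad_log_density q)
      in
      let p_half = kick p q in
      (* full-step update of the position *)
      let q' = Array.map2 (fun qi pi -> qi +. eps *. pi) q p_half in
      (* second momentum half-step, at the new position *)
      let p' = kick p_half q' in
      (q', p')

Simulating Hamilton's equations amounts to iterating this step; if the density is not differentiable at the points the trajectory visits, the scheme is not even well-defined there, which is why almost everywhere differentiability matters.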
In probabilistic programming, the above inference methods must be adapted to deal with the fact that in a universal PPL, the set of random primitives encountered can vary between executions, and traces can have arbitrary and unbounded dimension; moreover, the density function of a probabilistic program is generally not (everywhere) differentiable. Crucially these adapted algorithms are only valid when the input densities are almost everywhere differentiable [55,38,32]; this is the subject of this paper.
Our main result (Thm. 3) states that the weight function and value function of almost surely terminating SPCF programs are almost everywhere differentiable. This applies to our running example: the program in Fig. 1a (expressible in SPCF using primitive functions that satisfy Assumption 1 -see Ex. 1) is almost surely terminating.

Sampling Semantics for Statistical PCF
In this section, we present a simply-typed statistical probabilistic programming language with recursion and its operational semantics.

Statistical PCF
Statistical PCF (SPCF) is a higher-order probabilistic programming language with recursion, in purified form. The terms and part of the (standard) typing system of SPCF are presented in Fig. 2. In the rest of the paper we write x to represent a sequence of variables x 1 , . . . , x n , Λ for the set of SPCF terms, and Λ 0 for the set of closed SPCF terms. In the interest of readability, we sometimes use pseudo-code (e.g. Fig. 1a) in the style of Core ML to express SPCF terms. SPCF is a statistical probabilistic version of call-by-value PCF [46,47] with reals as the ground type. The probabilistic constructs of SPCF are relatively standard (see for example [48]): the sampling construct sample draws from U(0, 1), the standard uniform distribution with end points 0 and 1; the scoring construct score(M ) enables conditioning on observed data by multiplying the weight of the current execution with the (non-negative) real number denoted by M . Sampling from other real-valued distributions can be obtained from U(0, 1) by applying the inverse of the distribution's cumulative distribution function, as illustrated below.
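For example, a draw from a normal distribution can be encoded as follows (a sketch in the pseudo-code style of Fig. 1a; invcdfN, the inverse cumulative distribution function of the standard normal, is an assumed primitive function):

    (* inverse transform sampling: if u ~ U(0,1) then invcdfN u ~ N(0,1) *)
    let gaussian mu sigma =
      let u = sample in
      mu + sigma * (invcdfN u)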
Our SPCF is an (inconsequential) variant of CBV SPCF [51] and a (CBV) extension of PPCF [16] with scoring; it may be viewed as a simply-typed version of the untyped probabilistic languages of [8,13,52].
Example 1 (Running Example Ped). We express in SPCF the example in Fig. 1a.
The let construct, let x = N in M , is syntactic sugar for the term (λx.M ) N ; and pdf N(1.1,0.1) , the density function of the normal distribution with mean 1.1 and variance 0.1, is a primitive function. To enhance readability we use infix notation and omit the underline for standard functions such as addition.

Operational Semantics
The execution of a probabilistic program generates a trace: a sequence containing the values sampled during a run. Our operational semantics captures this dynamic perspective. This is closely related to the treatment in [8] which, following [26], views a probabilistic program as a deterministic program parametrised by the sequence of random draws made during the evaluation.

Traces.
Recall that in our language, sample produces a random value in the open unit interval; accordingly a trace is a finite sequence of elements of (0, 1). We define a measure space S of traces to be the set ⋃ n∈N (0, 1) n , equipped with the standard disjoint union σ-algebra, and the sum of the respective (higher-dimensional) Lebesgue measures. Formally, writing S n := (0, 1) n , we define:

S := ⨄ n∈N S n     Σ S := {⋃ n∈N U n | U n ∈ Σ S n }     µ S (U ) := Σ n∈N Leb n (U ∩ S n ).

To avoid clutter, we will elide the subscript from µ R m ×S whenever it is clear from the context.

Small-Step Reduction. Next, we define the values (typically denoted V ), redexes (typically R) and evaluation contexts (typically E); schematically:

V ::= r | λy. M
R ::= (λy. M ) V | f (r 1 , . . . , r ℓ ) | Y(λy. M ) | if(r ≤ 0, M , N ) | sample | score(r)
E ::= [ ] | E M | V E | f (r 1 , . . . , r j , E, M 1 , . . . , M k ) | if(E ≤ 0, M , N ) | score(E)

We write Λ v for the set of SPCF values, and Λ 0 v for the set of closed SPCF values.
It is easy to see that every closed SPCF term M is either a value, or there exists a unique pair of context E and redex R such that M ≡ E[R].
We now present the operational semantics of SPCF as a rewrite system of configurations, which are triples of the form ⟨M, w, s⟩ where M is a closed SPCF term, w ∈ R ≥0 is a weight, and s ∈ S a trace. The small-step reduction relation → is defined in Fig. 3. In the rule for sample, a random value r ∈ (0, 1) is generated and recorded in the trace, while the weight remains unchanged: in a uniform distribution on (0, 1) each value is drawn with likelihood 1. In the rule for score(r), the current weight is multiplied by non-negative r ∈ R: typically this reflects the likelihood of the current execution given some observed data. Similarly to [8] we reduce terms which cannot be reduced in a reasonable way (i.e. scoring with negative constants or evaluating functions outside their domain) to fail.

Example 2. We present a possible reduction sequence for the program in Ex. 1. In this execution, the initial sample yields 0.2, which is appended to the trace. At step (⋆), we assume given a reduction sequence ⟨walk 0.6, 1, [0.2]⟩ →* ⟨0.9, 1, [0.2, 0.9, 0.7]⟩; this means that in the call to walk, 0.9 was sampled as the step size and 0.7 as the direction factor; this makes the new location −0.3, which is negative, so the return value is 0.9. In the final step, we perform conditioning using the likelihood of observing 0.9 given the data 1.1: the score() expression updates the current weight using the density of 0.9 in the normal distribution with parameters (1.1, 0.1).
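For reference, the two probabilistic rules used above can be rendered schematically as follows (our paraphrase of Fig. 3, which is not reproduced here; E denotes an evaluation context):

\[ \langle E[\texttt{sample}],\, w,\, s \rangle \;\to\; \langle E[\underline{r}],\, w,\, s +\!\!+ [r] \rangle \ \ (r \in (0,1)), \qquad \langle E[\texttt{score}(\underline{r})],\, w,\, s \rangle \;\to\; \langle E[\underline{r}],\, r \cdot w,\, s \rangle \ \ (r \geq 0). \]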
Value and Weight Functions. Using the relation →, we now aim to reason more globally about probabilistic programs in terms of the traces they produce. Let M be an SPCF term with free variables amongst x 1 , . . . , x m (all of type R). The value function value M : R m × S → Λ 0 v ∪ {⊥} returns, given values for each free variable and a trace, the output value of the program, if the program terminates in a value. The weight function weight M : R m × S → R ≥0 returns the final weight of the corresponding execution. Formally:

value M (r, s) := V if ⟨M [r/x], 1, []⟩ →* ⟨V, w, s⟩ for some w ∈ R ≥0 , and ⊥ otherwise;
weight M (r, s) := w if ⟨M [r/x], 1, []⟩ →* ⟨V, w, s⟩ for some value V , and 0 otherwise.

This corresponds to the denotational semantics of SPCF in the ω-quasi-Borel space model via computational adequacy [51].
Returning to Remark 1, what are the connections, if any, between the two types of density of a program? To distinguish them, let's refer to the weight function of the program, weight M , as its trace density, and the Radon-Nikodym derivative d⟦M⟧/dν of the program's value-measure ⟦M⟧, where ν is the reference measure on the measurable space (Λ 0 v , Σ Λ 0 v ), as the output density. Observe that, for any measurable U ⊆ Λ 0 v , the value-measure satisfies ⟦M⟧(U ) = ∫ value −1 M (U ) weight M dµ: the output density concerns the pushforward of the trace density along value M . In other words, if our intended output is just a sequence of samples, then our inference engine does not need to concern itself with the consequences of change of variables.

Differentiability of the Weight and Value Functions
To reason about the differential properties of these functions we place ourselves in a setting in which differentiation makes sense. We start with some preliminaries.

Background on Differentiable Functions
Basic real analysis gives a standard notion of differentiability at a point x ∈ R n for functions between Euclidean spaces R n → R m . In this context a function f : R n → R m is smooth on an open U ⊆ R n if it has derivatives of all orders at every point of U . The theory of differential geometry (see e.g. the textbooks [50,29,28]) abstracts away from Euclidean spaces to smooth manifolds. We recall the formal definitions.
A topological space M is locally Euclidean at a point x ∈ M if x has a neighbourhood U such that there is a homeomorphism φ from U onto an open subset of R n , for some n. The pair (U, φ) is then called a chart about x. A manifold is a topological space which is locally Euclidean at every point, and an atlas is a family of charts whose domains cover the space. A smooth manifold is a manifold equipped with an atlas all of whose transition maps (the composites ψ ∘ φ −1 on overlaps of charts (U, φ) and (V, ψ)) are smooth. It follows from the topological invariance of dimension that charts that cover a part of the same connected component have the same dimension. We emphasise that, although this might be considered slightly unusual, distinct connected components need not have the same dimension. This is important for our purposes: S is easily seen to be a smooth manifold since each connected component S i is diffeomorphic to R i . It is also straightforward to endow the set Λ of SPCF terms with a (smooth) manifold structure. Following [8] we view Λ as ⨄ m∈N SK m × R m , where SK m is the set of SPCF terms with exactly m place-holders for numerals (a.k.a. skeleton terms). Thus identified, we give Λ the countable disjoint union topology of the product topology of the discrete topology on SK m and the standard topology on R m . Note that the connected components of Λ have the form {M } × R m , with M ranging over SK m , and m over N. So in particular, the subspace Λ v ⊆ Λ of values inherits the manifold structure. We fix the Borel algebra of this topology to be the σ-algebra on Λ. Given smooth manifolds M and N , a function f : M → N is differentiable at x ∈ M if ψ ∘ f ∘ φ −1 is differentiable at φ(x) for charts (U, φ) about x and (V, ψ) about f (x); f is differentiable on a subset of M if it is differentiable at every point of it.

The definitions above are useful because they allow for a uniform presentation. But it is helpful to unpack the definition of differentiability in a few instances, and we see that they boil down to the standard sense in real analysis. Take an SPCF term M with free variables amongst x 1 , . . . , x m (all of type R), and (r, s) ∈ R m × S n : then weight M restricted to R m × S n is a map R m+n → R, and differentiability at (r, s) is the usual notion for such maps.

Why Almost Everywhere Differentiability Can Fail
Conditional statements break differentiability. This is easy to see with an example: the weight function of a term comparing two samples, such as if(sample − sample ≤ 0, score(1), score(0)), is the characteristic function of {[s 1 , s 2 ] ∈ S 2 | s 1 ≤ s 2 }, which is not differentiable at any point of the diagonal. This function is however differentiable almost everywhere: the diagonal is an uncountable set but has Leb 2 measure zero in the space S 2 . Unfortunately, this is not true in general. Without sufficient restrictions, conditional statements also break almost everywhere differentiability. This can happen for two reasons. Problem 1: the primitive functions allowed in branching conditions may themselves be pathological. Problem 2: using recursion, a term may probe a single sample with infinitely many conditionals, as in a program which draws a value and terminates just if that value is rational (see the sketch below). The induced weight function is the characteristic function of {[s 1 ] ∈ S | s 1 ∈ Q}; the set of points at which this function is non-differentiable is S 1 , which has measure 1.
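The following sketch (ours) realises Problem 2; nth_rational, enumerating the rationals in (0, 1), is a hypothetical primitive, and the equality test would be written with nested ≤-conditionals in actual SPCF:

    (* terminates, with weight 1, iff the sampled value is rational *)
    let rec search s k =
      if (s = nth_rational k) then 0 else search s (k + 1)
    in
    let s = sample in search s 1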
We proceed to overcome Problem 1 by making appropriate assumptions on the set of primitives. We will then address Problem 2 by focusing on almost surely terminating programs.

Admissible Primitive Functions
One contribution of this work is to identify sufficient conditions on F . We will show in Sec. 6 that our main result holds provided F is a set of partial, measurable functions R ℓ ⇀ R, including all constant and projection functions, which satisfies the following (Assumption 1): (1) F is closed under composition; (2) every f ∈ F is differentiable in the interior of its domain; (3) for every f ∈ F , the boundary of f −1 ([0, ∞)) has measure zero. Two examples of such sets F are as follows (Ex. 3). 1. The set F 1 of analytic functions. Recall that a function f : R ℓ → R n is analytic if it is infinitely differentiable and its multivariate Taylor expansion at every point x 0 ∈ R ℓ converges pointwise to f in a neighbourhood of x 0 . 2. The set F 2 of (partial) functions f : R ℓ ⇀ R such that f is differentiable everywhere in dom(f ), and f −1 (I) is a finite union of (possibly unbounded) rectangles, for (possibly unbounded) intervals I.
Note that all primitive functions mentioned in our examples (and in particular the density of the normal distribution) are included in both F 1 and F 2 .
It is worth noting that both F 1 and F 2 satisfy the following stronger (than Assumption 1.3) property: Leb n (∂f −1 I) = 0 for every interval I, for every primitive function f .
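For instance (our illustration, linking back to the example of Sec. 4.2): the analytic function f (s 1 , s 2 ) = s 1 − s 2 belongs to F 1 , and

\[ f^{-1}([0,\infty)) = \{(s_1, s_2) \mid s_1 \ge s_2\}, \qquad \partial f^{-1}([0,\infty)) = \{(s_1, s_2) \mid s_1 = s_2\}, \]

the diagonal, which is a Leb 2 -null set, as required.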

Almost Sure Termination
To rule out the contrived counterexamples which diverge we restrict attention to almost surely terminating SPCF terms. Intuitively, a program M (closed term of ground type) is almost surely terminating if the probability that a run of M terminates is 1. Take an SPCF term M with variables amongst x 1 , . . . , x m (all of type R), and set

T M,term := {(r, s) ∈ R m × S | value M (r, s) ≠ ⊥}.    (1)

Let us first consider the case of closed M ∈ Λ 0 , i.e. m = 0 (notice that the measure µ R m ×S is not finite, for m ≥ 1). As T M,term now coincides with value −1 M (Λ 0 v ), T M,term is a measurable subset of S. Plainly if M is deterministic (i.e. sample-free), then µ S (T M,term ) = 1 if M converges to a value, and 0 otherwise. Generally for an arbitrary (stochastic) term M we can regard µ S (T M,term ) as the probability that a run of M converges to a value, because of Lem. 1.
More generally, if M has free variables amongst x 1 , . . . , x m (all of type R), then we say that M is almost surely terminating if for almost every (instantiation of the free variables by) r ∈ R m , M [r/x] terminates with probability 1.
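A standard example (ours, in the pseudo-code of Fig. 1a) may be helpful: the program below terminates with probability 1, since the probability of needing more than n recursive calls is 2 −n , yet it admits runs of every finite length, so it does not terminate on all traces.

    (* almost surely terminating, but not surely terminating *)
    let rec geometric () =
      if (sample <= 0.5) then 0 else 1 + geometric ()
    in geometric ()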
We formalise the notion of almost sure termination as follows. Suppose that M is a closed term and M ♭ is obtained from M by recursively replacing subterms score(L) with the term if(L < 0, N fail , L), where N fail is a term that reduces to fail, such as 1/0. It is easy to see that for all s ∈ S, ⟨M ♭ , 1, []⟩ →* ⟨V, 1, s⟩ iff for some (unique) w ∈ R ≥0 , ⟨M, 1, []⟩ →* ⟨V, w, s⟩. Therefore, T M ♭ ,term = T M,term and ⟦M ♭⟧(Λ 0 v ) = µ S (T M,term ). Consequently, the closed term M terminates almost surely iff ⟦M ♭⟧ is a probability measure; together with the clause for open terms above, this constitutes our formal definition (Def. 1).
- Like many treatments of semantics of probabilistic programs in the literature, we make no distinction between non-terminating runs and aborted runs of a (closed) term M : both could result in the value semantics ⟦M ♭⟧ being a sub-probability measure (cf. [4]).
- Even so, current probabilistic programming systems do not place any restrictions on the code that users can write: it is perfectly possible to construct invalid models, because catching programs that do not define valid probability distributions can be hard, or even impossible. This is not surprising, because almost sure termination is hard to decide: it is Π⁰₂-complete in the arithmetic hierarchy [22]. Nevertheless, because a.s. termination is an important correctness property of probabilistic programs (not least because of the main result of this paper, Thm. 3), the development of methods to prove a.s. termination is a hot research topic.
Accordingly the main theorem of this paper is stated as follows: Theorem 3. Let M be an SPCF term (possibly with free variables of type R) which terminates almost surely. Then its weight function weight M and value function value M are differentiable almost everywhere.

Stochastic Symbolic Execution
We have seen that a source of discontinuity is the use of if-statements. Our main result therefore relies on an in-depth understanding of the branching behaviour of programs. The operational semantics given in Sec. 3 is unsatisfactory in this respect: any two execution paths are treated independently, whether they go through different branches of an if-statement or one is obtained from the other by using slightly perturbed random samples not affecting the control flow.
More concretely, note that although we have derived (in Ex. 2) the value of weight Ped on one particular trace, this by itself says nothing about nearby traces, even when they follow the same control-flow path. So we propose an alternative symbolic operational semantics (similar to the "compilation scheme" in [55]), in which no sampling is performed: whenever a sample command is encountered, we simply substitute a fresh variable α i for it, and continue on with the execution. We can view this style of semantics as a stochastic form of symbolic execution [12,23], i.e., a means of analysing a program so as to determine what inputs, and random draws (from sample), cause each part of a program to execute.
Consider the term M ≡ let x = sample · 3 in (walk x), defined using the function walk of Ex. 1. We have a reduction path in which the sample is replaced by a fresh variable α 1 :

⟪let x = sample · 3 in (walk x), 1, S 0 ⟫ ⇒ ⟪let x = α 1 · 3 in (walk x), 1, S 1 ⟫

but at this point we are stuck: the CBV strategy requires a value for α 1 . We will "delay" the evaluation of the multiplication α 1 · 3; we signal this by drawing a box around the delayed operation: α 1 · 3. We continue the execution, inspecting the definition of walk, and get a conditional testing whether α 1 · 3 ≤ 0. We are stuck again: the value of α 1 is needed in order to know which branch to follow. Our approach consists in considering the space S 1 = (0, 1) of possible values for α 1 , and splitting it into {s 1 ∈ (0, 1) | s 1 · 3 ≤ 0} = ∅ and {s 1 ∈ (0, 1) | s 1 · 3 > 0} = (0, 1). Each of the two branches will then yield a weight function restricted to the appropriate subspace.

Formally, our symbolic operational semantics is a rewrite system of configurations of the form ⟪M , w , U ⟫, where M is a term with delayed (boxed) operations and free "sampling" variables α 1 , . . . , α n (M may also be open and contain other free "non-sampling" variables, usually denoted x 1 , . . . , x m ); U ⊆ S n is the subspace of sampling values compatible with the current branch; and w : U → R ≥0 is a function assigning to each s ∈ U a weight w (s). (We use the meta-lambda-abstraction λx. f (x) to denote the set-theoretic function x ↦ f (x).) In particular, for our running example the non-empty branch above is represented by a configuration whose third component is (0, 1) and whose weight function is λs. 1. Recall that M appears in the context of our running example Ped. Using our calculations above we derive one of its branches: continuing past the conditional, through the sampling of the step size (α 2 ) and of the direction factor (α 3 ), and following the same control flow as the run of Ex. 2, we reach a configuration ⟪V , w , U ⟫ with U ⊆ S 3 the set of traces taking this path.
In particular the trace [0.2, 0.9, 0.7] of Ex. 2 lies in the subspace U . We can immediately read off the corresponding value and weight functions for all [s 1 , s 2 , s 3 ] ∈ U simply by evaluating the computation α 1 · 3, which we have delayed until now: every trace in U follows the same control flow as Ex. 2, so the value and weight of any [s 1 , s 2 , s 3 ] ∈ U are obtained by instantiating α 1 , α 2 , α 3 with s 1 , s 2 , s 3 in the symbolic result.

Symbolic Terms and Values
We have just described informally our symbolic execution approach, which involves delaying the evaluation of primitive operations. We make this formal by introducing an extended notion of terms, which we call symbolic terms and define in Fig. 4a along with a notion of symbolic values. For this we assume fixed denumerable sequences of distinguished variables: α 1 , α 2 , . . ., used to represent sampling, and x 1 , x 2 , . . . used for free variables of type R. Symbolic terms are typically denoted M , N , or L. They contain terms of the form f (V 1 , . . . , V ℓ ) for f : R ℓ ⇀ R ∈ F a primitive function, representing delayed evaluations, and they also contain the sampling variables α j . The type system is adapted in a straightforward way, see Fig. 4b.
We use Λ (m,n) to refer to the set of well-typed symbolic terms with free variables amongst x 1 , . . . , x m and α 1 , . . . , α n (and all are of type R). Note that every term in the sense of Fig. 2 is also a symbolic term.
Each symbolic term M ∈ Λ (m,n) has a corresponding set of regular terms, accounting for all possible values for its sampling variables α 1 , . . . , α n and its (other) free variables x 1 , . . . , x m . For r ∈ R m and s ∈ S n , we call partially evaluated instantiation of M the term ⌊M ⌋ (r, s) obtained from M [r/x, s/α] by recursively "evaluating" subterms of the form f (r 1 , . . . , r ℓ ) to f (r 1 , . . . , r ℓ ), provided (r 1 , . . . , r ℓ ) ∈ dom(f ). In this operation, subterms of the form f (r 1 , . . . , r ℓ ) are left unchanged, and so are any other redexes. ⌊M ⌋ can be viewed as a partial function ⌊M ⌋ : R m × S n ⇀ Λ and a formal definition is presented in Fig. 5b. (To be completely rigorous, we define for fixed m and n, partial functions ⌊M ⌋ m,n : R m × S n ⇀ Λ for symbolic terms M whose distinguished variables are amongst x 1 , . . . , x m and α 1 , . . . , α n . M may contain other variables y, z, . . . of any type. Since m and n are usually clear from the context, we omit them.) Observe that for M ∈ Λ (m,n) and (r, s) ∈ dom ⌊M ⌋, ⌊M ⌋ (r, s) is a closed term. More generally, observe that if Γ ⊢ M : σ and (r, s) ∈ dom ⌊M ⌋ then Γ ⊢ ⌊M ⌋ (r, s) : σ. In order to evaluate conditionals if L ≤ 0, M , N we need to reduce L to a real constant, i.e., we need to have ⌊L⌋ (r, s) = r for some r ∈ R. This is the case whenever L is a symbolic value of type R, since these are built only out of delayed operations, real constants and distinguished variables x i or α j . Indeed we can show that if V is a symbolic value of type R and (r, s) ∈ dom ⌊V ⌋, then ⌊V ⌋ (r, s) is a real constant (Lem. 2). For symbolic values V : R and (r, s) ∈ dom ⌊V ⌋ we employ the notation V (r, s) := r ′ provided that ⌊V ⌋ (r, s) = r ′ .
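To make the shape of this definition concrete, here is a minimal sketch (ours, not the paper's formalisation, and ignoring the partiality of primitives beyond a missing table lookup) of symbolic values and their partially evaluated instantiation, with primitive functions looked up by name in an assumed table prims:

    type sym_value =
      | Const of float                      (* numerals r *)
      | FreeVar of int                      (* variables x_i *)
      | SampleVar of int                    (* sampling variables alpha_j *)
      | Delayed of string * sym_value list  (* boxed f(V_1, ..., V_l) *)

    (* the map (r, s) -> |V|(r, s): substitute r for the x_i and s for the
       alpha_j, then evaluate the delayed primitives recursively *)
    let rec instantiate prims (r : float array) (s : float array) = function
      | Const c -> c
      | FreeVar i -> r.(i - 1)
      | SampleVar j -> s.(j - 1)
      | Delayed (f, args) ->
          (Hashtbl.find prims f) (List.map (instantiate prims r s) args)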
A simple induction on symbolic terms and values yields the following property, which is crucial for the proof of our main result (Thm. 3): Lemma 3. Suppose the set F of primitives satisfies Item 1 of Assumption 1. 1. For each symbolic value V of type R, by identifying dom V with a subset of R m+n , we have V ∈ F . 2. If F also satisfies item 2 of Assumption 1 then for each symbolic term M , ⌊M ⌋ : R m × S n ⇀ Λ is differentiable in the interior of its domain.

Symbolic Operational Semantics
We aim to develop a symbolic operational semantics that provides a sound and complete abstraction of the (concrete) operational trace semantics. As formalised by Thm. 1, the key intuition behind symbolic configurations ⟪M , w , U ⟫ (that are reachable from a given ⟪M, 1, R m ⟫) is that, whenever M is a symbolic value:
- M gives a correct local view of value M (restricted to U ), and
- w gives a correct local view of weight M (restricted to U );
moreover, the respective third components U (of the symbolic configurations ⟪M , w , U ⟫) cover T M,term .
To establish Thm. 1, we introduce symbolic reduction contexts and symbolic redexes. These are presented in Fig. 4c and extend the usual notions (replacing real constants with arbitrary symbolic values of type R).
Using Lem. 2 we obtain: If R is a symbolic redex and (r, s) ∈ dom ⌊R ⌋ then ⌊R ⌋ (r, s) is a redex.

The following can be proven by a straightforward induction (see Appendix A.3):

Lemma 5 (Subject Construction). Let M be a symbolic term. Then M is either a symbolic value, or there is a unique pair of a symbolic evaluation context E and a symbolic redex R such that M ≡ E [R ].

The partial instantiation function also extends to symbolic contexts E in the evident way - we give the full definition in Appendix A.3 (Def. 2). Now, we introduce the rules for symbolic redex contractions, shown schematically below.
The rules are designed to closely mirror their concrete counterparts. Crucially, the rule for sample introduces a "fresh" sampling variable, and the two rules for conditionals split the last component U ⊆ R m × S n according to whether V (r, s) ≤ 0 or V (r, s) > 0. The "delay" contraction (second rule) is introduced for a technical reason: ultimately, to enable item 1 (Soundness) of Thm. 1. Otherwise it is, for example, unclear whether λy. α 1 + 1 should correspond to λy. 0.5 + 1 or λy. 1.5 for s 1 = 0.5.
Finally we lift this to arbitrary symbolic terms using the obvious rule for symbolic evaluation contexts. Note that we do not need rules corresponding to reductions to fail because the third component of the symbolic configurations "filters out" the pairs (r, s) corresponding to undefined behaviour (this is made precise by Lem. 6). A key advantage of the symbolic execution is that the induced computation tree is finitely branching, since branching only arises from conditionals, splitting the trace space into disjoint subsets. This contrasts with the concrete situation (from Sec. 3), in which sampling creates uncountably many branches.
Lemma 7 (Basic Properties). Let ⟪M , w , U ⟫ be a symbolic configuration. Then U is a measurable subset of R m × S n (for the appropriate n), and w : U → R ≥0 is measurable.

Crucially, there is a correspondence between the concrete and symbolic semantics in that they can "simulate" each other:

Proposition 1 (Correspondence). Suppose ⟪M , w , U ⟫ is a symbolic configuration, and (r, s) ∈ U . Let M ≡ ⌊M ⌋ (r, s) and w := w (r, s). Then every concrete reduction step from M with weight w is the instantiation of a symbolic reduction step from ⟪M , w , U ⟫, and conversely every symbolic step whose third component retains (an extension of) (r, s) instantiates to a concrete step from M .

As a consequence, we obtain a proof of Thm. 1.

Densities of Almost Surely Terminating Programs are Differentiable Almost Everywhere
So far we have seen that the symbolic execution semantics provides a sound and complete way to reason about the weight and value functions. In this section we impose further restrictions on the primitive operations and the terms to obtain results about the differentiability of these functions. Henceforth we assume Assumption 1 and we fix a term M with free variables amongst x 1 , . . . , x m . From Lem. 3 we immediately obtain the following: Lemma 8. Let ⟪M , w , U ⟫ be a symbolic configuration such that w is differentiable onŮ and µ(∂U ) = 0. If ⟪M , w , U ⟫ ⇒ ⟪M ′ , w ′ , U ′ ⟫ then w ′ is differentiable onŮ ′ and µ(∂U ′ ) = 0.

Differentiability on Terminating Traces
As an immediate consequence of the preceding, Lem. 3 and the Soundness (item 1 of Thm. 1), whenever ⟪M, 1, R m ⟫ ⇒ * ⟪V , w , U ⟫ then weight M and value M are differentiable everywhere inŮ .
Recall the set T M,term of (r, s) ∈ R m × S from Eq. (1) for which M terminates. We abbreviate T M,term to T term and define T int term := ⋃ {Ů | ⟪M, 1, R m ⟫ ⇒ * ⟪V , w , U ⟫ with V a symbolic value}. We calculate:

µ(T term \ T int term ) = Σ U µ((U ∩ T term ) \ Ů ) ≤ Σ U µ(∂U ) = 0

where U ranges over the third components of the (countably many) terminal symbolic configurations as above. The first equation holds because the U -indexed union is of pairwise disjoint sets. The inequality is due to (U \ Ů ) ⊆ ∂U . The last equation above holds because each µ(∂U ) = 0 (Assumption 1 and Lem. 8).
Thus we conclude: Theorem 2. Let M be an SPCF term. Then its weight function weight M and value function value M are differentiable for almost all terminating traces.

Differentiability for Almost Surely Terminating Terms
Next, we would like to extend this insight, for almost surely terminating terms, to suitable subsets of R m × S whose union constitutes almost the entirety of R m × S. It is therefore worth examining the consequences of almost sure termination (see Def. 1).

[Fig. 6: Illustration of how R m × S - visualised as the entire rectangle - is partitioned to prove Thm. 3. The value function returns ⊥ in the red dotted area and a closed value elsewhere (i.e. in the blue shaded area).]
We say that (r, s) ∈ R m × S is maximal if some run of M [r/x] consumes the trace s in its entirety and never demands a further sample; we write T max for the set of maximal pairs. Note that T term ⊆ T max and there are terms for which the inclusion is strict (e.g. for the diverging term M ≡ Y(λf. f ), [] ∈ T max but [] ∉ T term ). Besides, T max is measurable because, thanks to Prop. 1, for every n ∈ N the set T max ∩ (R m × S n ) can be expressed in terms of the sets U of reachable symbolic configurations, and the resulting expression is a countable union of measurable sets (Lemmas 6 and 7).
The following is a consequence of the definition of almost sure termination and a corollary of Fubini's theorem (see Appendix A.4 for details):

Lemma 9. If M terminates almost surely then µ(T max \ T term ) = 0.

We partition the complement of T term into T pref , the non-maximal pairs (the run either demands more samples than s provides, or terminates without consuming all of s), and T stuck := T max \ T term (the run consumes s in full but fails to reach a value). On T pref ∪ T stuck we have weight M = 0 and value M = ⊥, and by Lem. 9, µ(T stuck ) = 0 for almost surely terminating M . Writing T int pref and T int stuck for the respective interiors, consider the union T int term ∪ T int pref ∪ T int stuck . Clearly, this is an open set and the situation is illustrated in Fig. 6. By what we have seen, its complement in R m × S has measure zero. Moreover, to conclude the proof of our main result Thm. 3 it suffices to note: 1. weight M and value M are differentiable everywhere on T int term (as for Thm. 2), and 2. weight M (r, s) = 0 and value M (r, s) = ⊥ for (r, s) ∈ T int pref ∪ T int stuck , so both functions are locally constant, hence differentiable, there.
Theorem 3. Let M be an SPCF term (possibly with free variables of type R) which terminates almost surely. Then its weight function weight M and value function value M are differentiable almost everywhere.
We remark that almost sure termination was not used in our development until the proof of Lem. 9. For Thm. 3 we could have instead directly assumed the conclusion of Lem. 9; that is, almost all maximal traces are terminating. This is a strictly weaker condition than almost sure termination. The exposition we give is more appropriate: almost sure termination is a standard notion, and the development of methods to prove almost sure termination is a subject of active research.
We also note that the technique used in this paper to establish almost everywhere differentiability could be used to target another "almost everywhere" property instead: one can simply remove the requirement that elements of F are differentiable, and replace it with the desired property. A basic example of this is smoothness.

Conclusion
We have solved an open problem in the theory of probabilistic programming. This is mathematically interesting, and motivated the development of stochastic symbolic execution, a more informative form of operational semantics in this context. The result is also of major practical interest, since almost everywhere differentiability is necessary for correct gradient-based inference.
Related Work. This problem was partially addressed in the work of Zhou et al. [55], who prove a restricted form of our theorem for recursion-free first-order programs with analytic primitives. Our stochastic symbolic execution is related to their compilation scheme, which we extend to a more general language.
The idea of considering the possible control paths through a probabilistic program is fairly natural and not new to this paper; it has been used towards the design of specialised inference algorithms for probabilistic programming, see [11,56]. To our knowledge, this is the first semantic formalisation of the concept, and the first time it is used to reason about whole-program density.
The notions of weight function and value function in this paper are inspired by the more standard trace-based operational semantics of Borgström et al. [8] (see also [52,31]). Mazza and Pagani [35] study the correctness of automatic differentiation (AD) of purely deterministic programs. This problem is orthogonal to the work reported here, but it is interesting to combine their result with ours. Specifically, we show a.e. differentiability whilst [35] proves a.s. correctness of AD on the differentiable domain. Combining both results one concludes that for a deterministic program, AD returns a correct gradient a.s. on the entire domain. Going deeper into the comparison, Mazza and Pagani propose a notion of admissible primitive function strikingly similar to ours: given continuity, their condition 2 and our condition 3 are equivalent. On the other hand we require admissible functions to be differentiable, when they are merely continuous in [35]. Finally, we conjecture that "stable points", a central notion in [35], have a clear counterpart within our framework: for a symbolic evaluation path arriving at ⟪V , w, U ⟫, for V a symbolic value, the points of Ů are precisely the stable points.
Our work is also connected to recent developments in differentiable programming. Lee et al. [30] study the family of piecewise functions under analytic partition, or just "PAP" functions. PAP functions are a well-behaved family of almost everywhere differentiable functions, which can be used to reason about automatic differentiation in recursion-free first-order programs. An interesting question is whether this can be extended to a more general language, and whether densities of almost surely terminating SPCF programs are PAP functions. (See also [19,9] for work on differentiable programs without conditionals.) A similar class of functions is also introduced by Bolte and Pauwels [7] in very recent work; this is used to prove a convergence result for stochastic gradient descent in deep learning. Whether this class of functions can be used to reason about probabilistic program densities remains to be explored.
Finally we note that open logical relations [1] are a convenient proof technique for establishing properties of programs which hold at first order, such as almost everywhere differentiability. This approach remains to be investigated in this context, as the connection with probabilistic densities is not immediate.
Further Directions. This investigation would benefit from a denotational treatment; this is not currently possible as existing models of probabilistic programming do not account for differentiability.
In another direction, it is likely that we can generalise the main result by extending SPCF with recursive types, as in [51], and, more speculatively, first-class differential operators as in [17]. It would also be useful to add to SPCF a family of discrete distributions, and more generally continuous-discrete mixtures, which have practical applications [36].
Our work will have interesting implications in the correctness of various gradient-based inference algorithms, such as the recent discontinuous HMC [39] and reparameterisation gradient for non-differentiable models [32]. But given the lack of guarantees of correctness properties available until now, these algorithms have not yet been developed in full generality, leaving many perspectives open.

A Supplementary Materials
A.1 More Details on Ex. 3

We prove that F 1 and F 2 of Ex. 3 satisfy Assumption 1. 1. Clearly, all constant and projection functions are analytic. Since analytic functions are total and differentiable (hence continuous) functions, they are Borel-measurable. Therefore, due to the fact that analytic functions are closed under pairing and composition [?, Prop. 2.4], it remains to check whether the boundary of f −1 ([0, ∞)) has measure zero. Note that the set of finite unions of (possibly unbounded) rectangles forms an algebra A (i.e. a collection of subsets of R n closed under complements and finite unions, hence finite intersections). Besides, for every U ∈ A, Leb(∂U ) = 0 (because Leb(∂R) = 0 for every rectangle R).
It remains to prove that F 2 is closed under composition. Suppose that f : R ℓ ⇀ R ∈ F 2 and g 1 , . . . , g ℓ : R k ⇀ R ∈ F 2 . The domain of the composite f • ⟨g 1 , . . . , g ℓ ⟩ remains open (note that • is relational composition). Moreover, for every x in this domain, each g i is differentiable at x and f is differentiable at (g 1 (x), . . . , g ℓ (x)). It follows that the (standard) chain rule is applicable, and f • ⟨g i ⟩ ℓ i=1 is differentiable at x. Besides, suppose I is a (possibly unbounded) interval. By assumption there are m ∈ N and (potentially unbounded) intervals I i,j , where 1 ≤ i ≤ m and 1 ≤ j ≤ ℓ, such that f −1 (I) = ⋃ m i=1 (I i,1 × · · · × I i,ℓ ). Observe that (f • ⟨g 1 , . . . , g ℓ ⟩) −1 (I) = ⋃ m i=1 (g −1 1 (I i,1 ) ∩ · · · ∩ g −1 ℓ (I i,ℓ )), and this is in A because algebras are closed under finite unions and intersections.
A.2 Supplementary Materials for Sec. 5

Proof (of Lem. 3, item 1).
- For r ′ ∈ R, r ′ is a constant function, and x i and α j are projections, which are in F by assumption.
- Next, suppose V is a symbolic value f (V 1 , . . . , V ℓ ). By the inductive hypothesis, each V i ∈ F . It suffices to note that f (V 1 , . . . , V ℓ ) (r, s) = f ( V 1 (r, s), . . . , V ℓ (r, s)) for (r, s) ∈ dom f (V 1 , . . . , V ℓ ). Therefore, f (V 1 , . . . , V ℓ ) ∈ F because F is assumed to be closed under composition.
- Finally, note that we do not need to consider abstractions because they do not have type R.

Proof.
We prove all parts of the lemma simultaneously by structural induction on M .
-First, note that for every E and R all of the following holds and the left hand sides are symbolic values.   In the latter case in particular U 1 ∩ U 2 = ∅ holds.
This implies the second and third part of the lemma.
Proof. We prove the lemma by case analysis on the symbolic redex contractions.
Proof. For the Score-rule this is due to Lem. 3.

Proof (of Lem. 9). Let T ∈ B m be such that µ(R m \ T ) = 0 and for every r ∈ T , M [r/x] terminates almost surely. For r ∈ R m we use the abbreviations

S r,max := {s ∈ S | (r, s) ∈ T max }     S r,term := {s ∈ S | (r, s) ∈ T term }

and we can argue analogously to T max and T term that they are measurable. Similarly to Lem. 1, for all r ∈ R m , µ(S r,max ) ≤ 1 because (s ++ s ′ ) ∈ S r,max and s ′ ≠ [] implies s ∉ S r,max . Therefore, for every r ∈ T , µ(S r,max \ S r,term ) = 0. Finally, due to a consequence of Fubini's theorem (Lem. 12) and the fact that the Lebesgue measure is σ-finite,

µ(T max \ T term ) = µ({(r, s) ∈ R m × S | s ∈ S r,max \ S r,term }) = 0.

Lemma 12. Let (X, Σ X , µ) and (Y, Σ Y , ν) be σ-finite measure spaces. Suppose that U ∈ Σ X satisfies µ(X \ U ) = 0, that for every r ∈ X, V r ∈ Σ Y with ν(V r ) = 0 whenever r ∈ U , and that W := {(r, s) ∈ X × Y | s ∈ V r } is measurable. Then (µ × ν)(W ) = 0.
Proof. Let X n ∈ Σ X and Y n ∈ Σ Y (for n ∈ N) be such that X = ⋃ n∈N X n , Y = ⋃ n∈N Y n and µ(X n ), ν(Y n ) < ∞ for every n ∈ N. Define W n := W ∩ (X n × Y n ). Clearly (µ × ν)(W n ) is finite. By assumption the characteristic function 1 W : X × Y → R ≥0 is measurable. By Fubini's theorem [?, Thm. 1.27], for every n ∈ N,

(µ × ν)(W n ) = ∫ X n ×Y n 1 W d(µ × ν) = ∫ X n ν(V r ∩ Y n ) dµ(r) = ∫ X n ∩U ν(V r ∩ Y n ) dµ(r) = 0.

The third equation is due to µ(X n \ U ) = 0, and the last holds because ν(V r ) = 0 for r ∈ U . The claim is immediate by W = ⋃ n∈N W n .