Suspension Analysis and Selective Continuation-Passing Style for Universal Probabilistic Programming Languages

. Universal probabilistic programming languages (PPLs) make it relatively easy to encode and automatically solve statistical inference problems. To solve inference problems, PPL implementations often apply Monte Carlo inference algorithms that rely on execution suspension. State-of-the-art solutions enable execution suspension either through (i) continuation-passing style (CPS) transformations or (ii) efficient, but comparatively complex, low-level solutions that are often not available in high-level languages. CPS transformations introduce overhead due to unnecessary closure allocations—a problem the PPL community has generally overlooked. To reduce overhead, we develop a new efficient selective CPS approach for PPLs. Specifically, we design a novel static suspension analysis technique that determines parts of programs that require suspension, given a particular inference algorithm. The analysis allows selectively CPS transforming the program only where necessary. We formally prove the correctness of the analysis and implement the analysis and transformation in the Miking CorePPL compiler. We evaluate the implementation for a large number of Monte Carlo inference algorithms on real-world models from phylogenetics, epidemiology, and topic modeling. The evaluation results demonstrate significant improvements across all models and inference algorithms.


Introduction
Probabilistic programming languages (PPLs), such as Anglican [49], Birch [35], WebPPL [18], Stan [10], Pyro [6], and Gen [11], make it possible to encode and solve statistical inference problems.Such inference problems are of significant interest in many research fields, including phylogenetics [42], computer vision [25], topic modeling [7], inverse graphics [20], and cognitive science [19].A particularly appealing feature of PPLs is the separation between the inference problem specification (the language) and the inference algorithm used to solve the problem (the language implementation).This separation allows PPL users to focus solely on encoding their inference problems while inference algorithm experts deal with the intricacies of inference implementation.
Implementations of PPLs apply many different inference algorithms.Monte Carlo inference algorithms-such as Markov chain Monte Carlo (MCMC) [16] and sequential Monte Carlo (SMC) [12]-are popular due to their asymptotic correctness and relative ease of implementation for universal 5 PPLs.The central idea behind all Monte Carlo methods in PPLs is to execute probabilistic programs multiple times to generate samples that approximate the target distribution for the encoded inference problem.However, repeated execution is expensive, and PPL implementations must avoid unnecessary overhead.
Monte Carlo algorithms often need to suspend executions.For example, MCMC algorithms can suspend at random draws in the program to avoid unnecessary re-execution when proposing new executions, and SMC algorithms can suspend at likelihood updates to resample executions.Languages such as WebPPL [18] and Anglican [49], and the approach described by Ritchie et al. [40], apply continuation-passing style (CPS) transformations [3] to enable arbitrary suspension during execution.The main benefit of CPS transformations is that they are relatively easy to implement in functional programming languages.However, one disadvantage with CPS transformations is that highperformance low-level languages, without higher-order functions, do not support them.For this reason, there are also more direct low-level alternatives to CPS, including non-preemptive multitasking (e.g., coroutines [15]) and PPL controlflow graphs [30].These more direct alternatives can additionally avoid much of the overhead resulting from CPS 6 , but are more complex to implement.
We consider how to bridge the performance gap between CPS-based PPLs and lower-level PPLs that rely on, e.g., direct implementation of coroutines.We consider optimizations at the CPS transformation level, and not the translation from CPS-based PPLs to lower-level representations.CPS overhead is a result of closure allocations for continuations.We make the important observation that PPLs do not require the arbitrary suspensions provided by full CPS transformations.Most Monte Carlo inference algorithms require suspension only in very specific parts of programs.Current state-of-the-art CPS-based PPLs do not consider inference-specific suspension requirements to reduce CPS overhead.
We design a new static suspension analysis and a new selective CPS transformation for PPLs that together significantly reduce runtime overhead com-pared to a traditional full CPS transformation.Current state-of-the-art functional PPLs that use CPS for execution suspension can therefore greatly benefit from our new approach.The suspension analysis identifies all parts of programs that may require suspension as a result of applying a particular inference algorithm.We formalize the suspension analysis algorithm using a core PPL calculus equipped with a big-step operational semantics.Specifically, the challenge lies in capturing how suspension requirements propagate through the program in the presence of higher-order functions.Furthermore, we formalize the selective CPS transformation and justify its correctness when guided by the suspension analysis.Prior work on selective CPS for general-purpose programming languages, e.g., by Nielsen [37] and Asai and Uehara [4], focuses on analyses based on type systems and type inference.In contrast, we instead build our suspension analysis using 0-CFA [45] and it operates directly on an untyped calculus.
Overall, we (i) prove that the suspension analysis is correct, (ii) show that the resulting selective CPS transformation gives significant performance gains compared to using a full CPS transformation, and (iii) show that the overall approach is directly applicable to a large set of inference algorithms.Specifically, we evaluate the approach for the following inference algorithms: likelihood weighting, the SMC bootstrap particle filter, the SMC alive particle filter [24], aligned lightweight MCMC [29,48], and particle-independent Metropolis-Hastings [39].We consider each inference algorithm for four real-world models from phylogenetics, epidemiology, and topic modeling.
We implement the suspension analysis and selective CPS transformation in Miking CorePPL [30,9].Similarly to WebPPL and Anglican, the implementation supports the co-existence of many inference problems and applications of inference algorithms to these problems within the same program.However, compared to full CPS, such programs are more challenging to handle with selective CPS, as the CPS transformation of an inference problem also depends on the applied inference algorithm-different inference algorithms generally require different suspensions.To complicate things further, different inference problems may share some code, or the PPL user may apply two different inference algorithms to the same inference problem.The compiler must then apply different CPS transformations to different parts of the program, and sometimes even many different CPS transformations to separate copies of the same part of the program.To solve this, we develop an approach that, for any given Miking CorePPL program, extracts all possible inference problems and corresponding inference algorithm applications.This extraction procedure allows the correct application of selective CPS throughout the program.
In summary, we make the following contributions.
-We design, formalize, and prove the correctness of a suspension analysis for PPLs, where the suspension requirements come from a given inference algorithm (Section 4).-We design and formalize a new selective CPS transformation for PPLs.Compared to full CPS, selectively CPS transforming PPL programs guided by the suspension analysis significantly reduces runtime overhead resulting from unnecessary closure allocations (Section 5).-We implement the suspension analysis and selective CPS transformation in the Miking CorePPL compiler.Unlike full CPS, selective CPS introduces challenges for probabilistic programs containing many inference problems and inference algorithm applications.We implement an approach that correctly applies selective CPS to such programs by extracting individual inference problems (Section 6).
Section 7 presents the evaluation and its results for the implementations in Miking CorePPL, Section 8 discusses related work in more detail, and Section 9 concludes.We first consider a motivating example in Section 2 and introduce the underlying PPL calculus in Section 3.

A Motivating Example
This section introduces the running example in Fig. 1 and uses it to present the basic idea behind PPLs and how inference algorithms such as SMC and MCMC make use of CPS to suspend executions.Most importantly, we illustrate the motivation and key ideas behind selective CPS for PPLs.Consider the probabilistic program in Fig. 1a, written in a functional-style PPL.The program encodes an inference problem for estimating the probability distribution over the bias of a coin, conditioned on the outcome of four experimental coin flips: true, true, false, and true (true = heads and false = tails).At line 1, we use the PPL-specific assume construct to define our prior belief in the bias a 1 of the coin.We set this prior belief to a Beta(2, 2) probability distribution, illustrated in Fig. 1b.In the illustration, 0 indicates a coin that always results in false, 1 a coin that always results in true, and 0.5 a fair coin.We see that our prior belief is quite evenly spread out, but with more probability mass towards a fair coin.To condition this prior distribution on the observed coin flips, we conceptually execute the program in Fig 1a infinitely many times, sampling values from the prior Beta distribution at assume (line 1) and, as a side effect, accumulating the product of weights given as argument to the PPLspecific weight construct (line 4).We make the four consecutive calls weight (f Bernoulli a 1 true), weight (f Bernoulli a 1 true), weight (f Bernoulli a 1 false), and weight (f Bernoulli a 1 true) 7   (e) Suspension at weight.that, because we observed three true outcomes and only one false, the weights shift the probability mass towards 1 and narrows it slightly as we are now more sure about the bias of the coin.Increasing the number of experimental coin flips would make Fig. 1c more and more narrow.
We can approximate the infinite number of samples by running the program a large (but finite) number of times.This basic inference algorithm is known as likelihood weighting.The problem with likelihood weighting is that it is only accurate enough for simple models.For complex models, it is common that only a few likelihood weighting samples (often only one) get much larger weights relative to the other samples, greatly reducing inference accuracy.Real-world models require more powerful inference algorithms based on, e.g., SMC or MCMC.A key requirement in both SMC and MCMC is the ability to suspend executions of probabilistic programs at calls to weight and/or assume.One way to enable suspensions is by writing programs in CPS.We first illustrate a simple use of CPS to suspend at assume in Fig. 1d.Here, the program immediately returns an object Suspension assume (Beta 2 2, k), indicating that execution stopped at an assume with the argument Beta 2 2 and a continuation k (i.e., the abstraction binding a 1 ) that executes the remainder of the program.With likelihood weighting, we would simply sample a value a 1 from the Beta 2 2 distribution and resume execution by calling k a 1 .This call then runs the program until termination and results in the actual return value of the program, which is a 1 .Many MCMC inference algorithms reuse samples from previous executions at Suspension assume , and the suspensions are thus useful to avoid unnecessary re-execution [40].
As a second example, we illustrate suspension at weight for, e.g., SMC inference in Fig. 1e.Here, we require suspensions in the middle of the recursive call to iter , and writing the program in CPS is more challenging.We rewrite the iter function to take a continuation k as argument, and call the continuation with the return value () at line 3 instead of directly returning () as in Fig. 1a at line 3.This continuation argument k is precisely what allows us to construct and return Suspension weight objects at line 5.To illustrate the suspensions, consider executing the program with likelihood weighting.First, the program returns the object Suspension weight (f Bernoulli(a1) true, k ′ ), where k ′ is the continuation that line 7 constructs.Likelihood weighting now updates the weight for the execution with the value f Bernoulli(a1) true and resumes execution by calling k ′ ().Similarly, this next execution returns Suspension weight (f Bernoulli(a1) true, k ′′ ) for the second recursive call to iter , and we again update the weight and resume by calling k ′′ ().We similarly encounter Suspension weight (f Bernoulli(a1) false, k ′′′ ) and Suspension weight (f Bernoulli(a1) true, k ′′′′ ) before the final call k ′′′′ () runs the program to termination and produces the actual return value a 1 .In SMC, we run many executions concurrently and wait until they all have returned a Suspension weight object.At this point, we resample the executions according to their weights (the first value in Suspension weight ), which discards executions with low weight and replicates executions with high weight.After resampling, we continue to the next suspension and resampling by calling the continuations.
PPL implementations enable suspensions at assume and/or weight through automatic and full CPS transformations.Fig. 1f illustrates such a transformation for Fig. 1a.We indicate CPS versions of intrinsic functions with the CPS subscript.Note that the full CPS transformation results in many additional closure allocations compared to Fig. 1d and Fig. 1e.As a result, runtime overhead increases significantly.The contribution in this paper is a static analysis that allows an automatic and selective CPS transformation of programs, as in Fig. 1d and Fig. 1e.With a selective transformation, we avoid many unnecessary closure allocations, and can significantly reduce runtime overhead while still allowing suspensions as required for a given inference algorithm.

Syntax and Semantics
This section introduces the PPL calculus used to formalize the suspension analysis in Section 4 and selective CPS transformation in Section 5. Section 3.1 gives the abstract syntax and Section 3.2 a big-step operational semantics.Section 3.3 introduces A-normal form-a prerequisite for both the suspension analysis and the selective CPS transformation.

Syntax
We build upon the standard untyped lambda calculus, representative of functional universal PPLs such as Anglican, WebPPL, and Miking CorePPL.We define the abstract syntax below.
Definition 1 (Terms, values, and environments).We define terms t ∈ T and values v ∈ V as The countable set X contains variable names, C intrinsic values and operations, and D ⊂ C intrinsic probability distributions.The set P contains evaluation environments, i.e., maps from variables in X to values in V .
Definition 2 (Target language terms).As a target language for the selective CPS transformation in Section 5, we additionally extend Definition 1 to target language terms t ∈ T + by Fig. 1a gives an example of a term in T , and Fig. 1d and Fig. 1e of terms in T + .However, note that the programs in Fig. 1 also use the list constructor [. ..] (not part of the above definitions) to make the example more interesting.
In addition to the standard variable, abstraction, and application terms in the untyped lambda calculus, we include explicit let expressions for convenience.Furthermore, we use the syntactic sugar let rec f = λx.t 1 in t 2 to define recursive functions (translating to an application of a call-by-value fixed-point combinator).We use t 1 ; t 2 as a shorthand for (λ_.t 2 ) t 1 , where _ means that we do not use the argument.That is, we evaluate t 1 for side effects only.
We include a set C of intrinsic operations and constants essential to inference problems encoded in PPLs.The set of intrinsics includes boolean truth values, the unit value, real numbers, and probability distributions.We can also add further operations and constants to C. For example, we can let + ∈ C to support addition of real numbers.To allow control flow to depend on intrinsic values, we include if expressions that use intrinsic booleans as condition.
We saw examples of the assume and weight constructs in Section 2. The assume construct takes distributions D ⊂ C as argument, and produces random variables distributed according to these distributions.For example, we can let N ∈ C be a function that constructs normal distributions.Then, assume (N 0 1), where N 0 1 ∈ D, defines a random variable with a standard normal distribution.Partially constructed distributions, e.g., N 0, are also in C, but not in D (they are not yet proper distributions).As we saw in Section 2, the weight construct updates the likelihood with the real number given as argument, and allows conditioning on data (e.g., the four coin flips in Fig. 1).

Semantics
We construct a call-by-value big-step operational semantics, based on Lundén et al. [29], describing how to evaluate terms t ∈ T .Such a semantics is a key component when formally defining the probability distributions corresponding to terms t ∈ T (e.g., the distribution in Fig. 1c corresponding to the program in Fig. 1a) and also when proving various properties of PPLs and their inference algorithms (e.g., inference correctness).See, e.g., the work by Borgström et al. [8] and Lundén et al. [28] for full formal treatments.
We use the semantics to formally define suspension, and use this definition to state the soundness of the suspension analysis in Section 4 (Theorem 1).We use a big-step semantics, as we do not require the additional control provided by a small-step semantics.For example, we do not concern ourselves with details of termination, as the soundness of the analysis relates only to terminating executions.Fig. 2 presents the full semantics as a relation ρ ⊢ t s ⇓ w u v over tuples (P, T, S, {false, true}, R, V ).S is a set of traces capturing the random draws at assume during evaluation.Intuitively, ρ ⊢ t s ⇓ w u v holds iff t evaluates to v in the environment ρ with the trace s and the total probability density (i.e., the accumulated weight) w.We describe the suspension flag u later in this section.
Most of the rules are standard and we focus on explaining key properties related to PPLs and suspension.We first consider the rule (Const-App), which uses the δ-function to evaluate intrinsic operations.

Definition 3 (Intrinsic arities and the δ-function).
For each c ∈ C, we let |c| ∈ N denote its arity.We also assume the existence of a partial function Fig. 2: A big-step operational semantics for t ∈ T .We omit the rule (If-False) for brevity; it is analogous to (If-True).The environment ρ, x → v denotes ρ extended with a binding v for x.For each d ∈ D, the function f d is its probability density or probability mass function.E.g., f N (0,1) (x) = e x 2 /2 / √ 2π, the density function of the standard normal distribution.We use the following notation: ∥ for sequence concatenation, • for multiplication, and ∨ for logical disjunction.
The rule (Assume) formalizes random draws and consumes elements of the trace.Specifically, (Assume) updates the evaluation's total probability density w ∈ R with the density w ′ of the first trace element with respect to the distribution given as argument to assume.The rule (Weight) furthermore directly modifies the total probability density according to the weight argument.
We now consider the special suspension flag u in the derivation ρ ⊢ t s ⇓ w u v.

Definition 5 (Suspension requirement).
A derivation ρ ⊢ t s ⇓ w u v requires suspension if the suspension flag u is true.
For example, the rule (App) requires suspension if u 1 ∨ u 2 ∨ u 3 -i.e., if any subderivation requires suspension.To reflect the particular suspension requirements in SMC and MCMC inference, we limit the source of suspension requirements to assume and weight.We turn the individual sources on and off through the boolean variables suspend assume and suspend weight in Fig. 2. For the examples in the remainder of this paper, we let suspend weight = true and suspend assume = false (i.e., only weight requires suspension, as in SMC inference).Fig. 3: The running example t example from Fig. 1a transformed to ANF.
To illustrate the semantics, consider t example of Fig. 1a again.Because t example evaluates precisely one assume, the only valid traces for t example are singleton traces [a 1 ], where a 1 ∈ R [0,1] due to the Beta prior for a 1 .By initially setting ρ to the empty environment ∅ and following the rules of Fig. 2, we derive Note that every evaluation of t example has u = true, as there are always four calls to weight during evaluation.That is, the derivation requires suspension.However, many subderivations of t example do not require suspension.For example, the subderivations assume (Beta 2 2) and null obs do not (i.e., have u = false).Section 4 presents a suspension analysis that conservatively approximates which subderivations require suspension.The analysis enables, e.g., the selective CPS transformation in Fig. 1e.

A-Normal Form
We simplify the suspension analysis in Section 4 and the selective CPS transformation in Section 5 by requiring that terms are in A-normal form (ANF) [13].
Definition 6 (A-normal form).We define the A-normal form terms t ANF ∈ T ANF as follows.
It holds that T ANF ⊂ T .Furthermore, there exist standard transformations to convert terms in T to T ANF .Fig. 3 illustrates Fig. 1a transformed to ANF.We will use Fig. 1a as a running example in Section 4 and Section 5.
Restricting programs to ANF significantly simplifies the suspension analysis and selective CPS transformation.From now on we require that all variable bindings in programs are unique, and together with ANF, the result is that every expression in a program t ∈ T ANF is uniquely labeled by a variable name from a let expression.This property is essential for the treatment in Section 4.

Suspension Analysis
This section presents the main technical contribution: the suspension analysis.The analysis goal is to identify program expressions that may require suspension in the sense of Definition 5. Identifying such expressions leads to the selective CPS transformation in Section 5, enabling transformations such as in Fig 1e.
The suspension analysis builds upon the 0-CFA algorithm [45,38], and we formalize our algorithms based on Lundén et al. [29].The main challenge we solve is how to model the propagation of suspension in the presence of higherorder functions.The 0 in 0-CFA stands for context insensitivity-the analysis considers every part of the program in one global context.Context insensitivity makes the analysis more conservative compared to context-sensitive approaches such as k-CFA, where k ∈ N indicates the level of context sensitivity [32].We use 0-CFA for two reasons: (i) the worst-case time complexity for the analysis is polynomial, while it is exponential for k-CFA already at k = 1, and (ii) the limitations of 0-CFA rarely matter in practical PPL applications.For example, k-CFA provides no benefits over 0-CFA for the programs in Section 7.
We assume ⟨λx.t, ρ⟩ ̸ ∈ C (recall that C is the set of intrinsics).That is, we assume that closures are not part of the intrinsics.In particular, this disallows intrinsic operations (including the use of assume d, d ∈ D ⊂ C) to produce closures, which would needlessly complicate the analysis without any benefit.
Consider the program in Fig. 3, and assume that weight requires suspension.Clearly, the expression labeled by w 1 at line 20 then requires suspension.Furthermore, w 1 evaluates as part of the larger expression labeled by t 8 at line 10.Consequently, the evaluation of t 8 also requires suspension.Also, t 8 evaluates as part of an application of the abstraction binding obs at line 7.In particular, the abstraction binding obs binds to iter , and we apply iter at lines 23 and 33.Thus, the expressions named by t 17 and t 22 require suspension.In summary, we have that w 1 , t 8 , t 17 , and t 22 require suspension, and we also note that all applications of the abstraction binding obs require suspension.
We proceed to the formalization and first introduce standard abstract values.
Definition 7 (Abstract values).We define the abstract values a ∈ A as a ::= λx.y | const x n for x, y ∈ X and n ∈ N.
The abstract value λx.y represents all closures originating at, e.g., a term λx.let y = 1 in y in a program at runtime (recall that we assume that the variables x and y are unique).Note that the y indicates the name returned by the body (formalized by the function name in Algorithm 1).The abstract value const x n represents all intrinsic functions of arity n originating at x.For example, const x 2 originates at, e.g., a term let x = + in t.
Algorithm 1 Constraint generation for the suspension analysis.We write the functional-style pseudocode for the algorithm itself in sans serif font to distinguish it from terms in T .
function generateConstraints(t): TANF → P(R) = The central objects in the analysis are sets S x ∈ P(A) and boolean values suspend x for all x ∈ X.The set S x contains all abstract values that may flow to the expression labeled by x, and suspend x indicates whether or not the expression requires suspension.A trivial but useless solution is S x = A and suspend x = true for all variables x in the program.To get more precise information regarding suspension, we wish to find smaller solutions to the S x and suspend x .
To formalize the set of sound solutions for S x and suspend x , we generate constraints c ∈ R for programs (for a formal definition of constraints, see Appendix A.1). Algorithm 1 formalizes the necessary constraints for programs t ∈ T ANF with a function generateConstraints that recursively traverses the program t to generate a set of constraints.Due to ANF, there are only two cases in the top match (line 1).Variables generate no constraints, and the important case is for let expressions at lines 3-30.The algorithm makes use of an auxiliary function name (line 39) that determines the name of an ANF expression, and a function suspendNames (line 44) that determines the names of all top-level expressions within an expression that may suspend (namely, applications, if expressions, and assume and/or weight).
We next illustrate and motivate the generated constraints by considering the set of constraints generateConstraints(t example ), where t example is the pro-gram in Fig. 3.Many constraints are standard, and we therefore focus on the new suspension constraints introduced as part of this paper.In particular, the challenge is to correctly capture the flow of suspension requirements across function applications and higher-order functions.First, we see that defining aliases (line 6) generates constraints of the form S y ⊆ S x , that constants introduce const abstract values (e.g., const t6 1 ∈ S t6 ), and that assume and weight introduce suspension requirements, e.g., suspend w1 (shorthand for suspend w1 = true).
First, we consider the constraints generated for λobs.(line 7 in Fig. 3) through the case at lines 9-12 in Algorithm 1.To keep the example simple, we treat the unexpanded let rec as an ordinary let in the analysis (for this particular example, the analysis result is unaffected).Omitting the recursively generated constraints for the abstraction body, the generated constraints are The first constraint is standard and states that the abstract value λobs.t 8 flows to S iter as the variable naming the λobs expression is t 8 at line 26 in Fig. 3 (difficult to notice due to the column breaks).The remaining constraints are new and sets up the flow of suspension requirements.Specifically, the abstraction obs itself requires suspension if any expression bound by a top-level let in its body requires suspension.For efficiency, we only set up dependencies for expressions that may suspend (formalized by suspendNames in Algorithm 1).Note here that we do not add the constraint suspend w1 ⇒ suspend obs , as w 1 is not at top-level in the body of obs.Instead, we later add the constraint suspend w1 ⇒ suspend t8 , and suspend w1 ⇒ suspend obs follows by transitivity.
The constraints generated for the if bound to t 8 at line 10 through the case at lines 31-37 in Algorithm 1 are (omitting recursively generated constraints) ( The first two constraints are standard, and state that abstract values in the results of both branches flow to the result S t8 .The last set of constraints is new and similar to the abstraction suspension constraints.The constraints capture that all expressions at top-level in both branches that require suspension also cause t 8 to require suspension.
Consider the application at line 23 in Fig. 3.The generated constraints through the case at lines 13-25 in Algorithm 1 are The first two constraints are standard and state how abstract values flow as a result of applications.The last three constraints are new and relate to suspension.The third and fourth constraints state that if an abstraction or intrinsic requiring suspension flows to iter , the result t 17 of the application also requires suspension.The fifth constraint states that if the result t 17 requires suspension, then all abstractions and constants flowing to iter require suspension.This last constraint is not strictly required to later prove the soundness of the analysis in Theorem 1, but, as we will see in Section 5, it is required for the selective CPS transformation.
We find a solution to the constraints through Algorithm 3 in Appendix A.1.The algorithm propagates abstract values according to the constraints until fixpoint, and is fairly standard.However, we extend the algorithm to support the new suspension constraints.The algorithm is a function analyzeSuspend: T ANF → ((X → P(A)) × P(X)).The function returns a map data : X → P(A) that assigns sets of abstract values to all S x and a set suspend : P(X) that assigns suspend x = true iff x ∈ suspend.Importantly, the assignments to S x and suspend x satisfy all generated constraints.To illustrate the algorithm, here are the analysis results analyzeSuspend(t example ): The above results confirm our earlier reasoning: the expressions labeled by obs, w 1 , t 8 , t 17 , and t 22 may require suspension.
We now consider the soundness of the analysis.First, the soundness of 0-CFA is well established (see, e.g., Nielson et al. [38]) and extends to our new constraints, and we take the following lemma to hold without proof.
Lemma 1 (0-CFA soundness).For every t ∈ T ANF , the solution given by analyzeSuspend(t) for S x and suspend x , x ∈ X, satisfies the constraints generateConstraints(t).
Next, we must show that the constraints themselves are sound.Consider the evaluation of an arbitrary term t ∈ T ANF .For each subderivation of t, labeled by a name x (due to ANF), it must hold that suspend x = true if the subderivation requires suspension.Otherwise, the analysis is unsound.Theorem 1 formally captures the soundness.Note that the analysis is conservative (i.e., incomplete), because it may find suspend x = true even if the subderivation for x does not require suspension.
Theorem 1 (Suspension analysis soundness).Let t ∈ T ANF , s ∈ S, u ∈ {false, true}, w ∈ R, and v ∈ V such that ∅ ⊢ t s ⇓ w u v. Now, let S x and suspend x for x ∈ X according to analyzeSuspend(t).For every subderivation Algorithm 2 Selective continuation-passing style transformation.We define t id = λx.x.The term c CPS is the CPS version of c.We write the functional-style pseudocode for the algorithm itself in sans serif font to distinguish it from terms in T .
function cps(vars, t): P(X) × TANF → T + = 1 return cps ′ (t id , t)  The proof of Lemma 2 uses Lemma 1 and structural induction over the derivation ∅ ⊢ t s ⇓ w u v. Next, we use the suspension analysis to selectively CPS transform programs.

Selective CPS Transformation
This section presents the second technical contribution: the selective CPS transformation.The transformations themselves are standard, and the challenge is to correctly use the suspension analysis results for a selective transformation.
Algorithm 2 is the full algorithm.Using terms in ANF as input significantly helps reduce the algorithm's complexity.The main function cps takes as input a set vars : P(X), indicating which expressions to CPS transform, and a program t ∈ T ANF to transform.It is the new vars argument that separates the transformation from a standard CPS transformation.For the purposes of this paper, we always use vars = {x | suspend x = true}, where the suspend x come from analyzeSuspend(t).One could also use vars = X for a standard full CPS transformation (e.g., Fig 1f), or some other set vars for other application domains.The value returned from the cps function is a (non-ANF) term of the type T + .The helper function cps ′ , initially called at line 1, takes as input a continuation term cont, indicating the continuation to apply in tail position.Initially, this continuation term is t id , which indicates no continuation.Similarly to Algorithm 1, the top-level match at line 4 has two cases: a simple case for variables (line 5) and a complex case for let expressions (lines 6-49).To enable optimization of tail calls, the auxiliary function tailCall indicates whether or not an ANF expression is a tail call (i.e., of the form let x = t ′ in x).
We now illustrate Algorithm 2 by computing cps(vars example , t example ), where vars example = {obs, w 1 , t 8 , t 17 , t 22 } is from (7), and t example is from Fig. 3. Fig. 4 presents the final result.First, we note that the transformation does not change expressions not labeled by a name in vars example , as they do not require suspension.In the following, we therefore focus only on the transformed expressions.First, consider the abstraction obs defined at line 7 in Fig. 3, handled by the case at line 12 in Algorithm 2. As obs ∈ vars example , we apply the standard CPS transformation for abstractions: add a continuation parameter to the abstraction and recursively transform the body with this continuation.Next, consider the transformation of the weight expression w 1 at line 20 in Fig. 3, handled by the case at line 44 in Algorithm 2. The expression is not at tail position, so we build a new continuation containing the subsequent let expressions, recursively transform the body of the continuation, and then wrap the end result in a Suspension object.The if expression t 8 at line 10 in Fig. 3, handled by the case at line 28 in Algorithm 2, is in tail position (it is directly followed by returning t 8 ).Consequently, we transform both branches recursively.Finally, we have the applications t 17 and t 22 at lines 23 and 33 in Fig. 3  not at tail position, so we construct a continuation k ′ that returns the final value a 1 (line 34 in Fig. 3), and then add it as an argument to the application.It is not guaranteed that Algorithm 2 produces a correct result.Specifically, for all applications lhs rhs, we must ensure that (i) if we CPS transform the application, we must also CPS transform all possible abstractions that can occur at lhs, and (ii) if we do not CPS transform the application, we must not CPS transform any abstraction that can occur at lhs.We control this through the argument vars.In particular, assigning vars according to the suspension analysis produces a correct result.To see this, consider the application constraints at lines 13-25 in Algorithm 1 again, and note that if any abstraction or intrinsic operation that requires suspension occur at lhs, suspend x = true.Furthermore, the last application constraint ensures that if suspend x = true, then all abstractions and intrinsic operations that occur at lhs require suspension.Consequently, for all λy._ and const y _, either all suspend y = true or all suspend y = false.

Implementation
We implement the suspension analysis and selective CPS transformation in Miking CorePPL [30], a core PPL implemented in the domain-specific language construction framework Miking [9].We choose Miking CorePPL for the implementation over other CPS-based PPLs, as the language implementation contains an existing 0-CFA base implementation which simplifies the suspension analysis implementation.Fig. 5 presents the organization of the CorePPL compiler.The input is a CorePPL program that may contain many inference problems and applications of inference algorithms, similar to WebPPL and Anglican.The output is an executable produced by one of the Miking backend compilers.Section 6.1 gives the details of the suspension analysis and selective CPS implementations, and in particular the differences compared to the core calculus in Section 3. Section 6.2 presents the inference extractor and its operation combined with selective CPS.The suspension analysis, selective CPS transformation, and inference extraction implementations consist of roughly 1500 lines of code (a contribution in this paper).The code is available on GitHub [2].

Suspension Analysis and Selective CPS
Miking CorePPL extends the abstract syntax in Definition 1 with standard functional data structures and features such as algebraic data types (records, tuples, and variants), lists, and pattern matching.The suspension analysis and selective CPS implementations in Miking CorePPL extend Algorithm 1 and Algorithm 2 to support these language features.Furthermore, compared to suspend weight and suspend assume in Fig. 2, the implementation allows arbitrary configuration of suspension sources.In particular, the implementation uses this arbitrary configuration together with the alignment analysis by Lundén et al. [29].This combination allows selectively CPS transforming to suspend at a subset of assumes or weights for aligned versions of SMC and MCMC inference algorithms.
Miking CorePPL also includes a framework for inference algorithm implementation.Specifically, to implement new inference algorithms, users implement an inference-specific compiler and inference-specific runtime.Fig. 5 illustrates the different compilers and runtimes.Each inference-specific compiler applies the suspension analysis and selective CPS transformation to suit the inference algorithm's particular suspension requirements.
Next, we show how Miking CorePPL handles programs containing many inference problems solved with different inference algorithms.

Inference Problem Extraction
Fig. 5 includes the inference extraction compiler procedure.First, the compiler applies an inference extractor to the input program.The result is a set of inference problems and a main program containing remaining glue code.Second, the compiler applies inference-specific compilers to each inference problem.Finally, the compiler combines the main program and the compiled inference problems with inference-specific runtimes and supplies the result to a backend compiler.
Consider the example in Fig. 6a.We define a function m that constructs a minimal inference problem on lines 7-10, using a single call to assume and a single call to observe (modifying the execution weight similar to weight).The function takes an initial probability distribution d and a data point y as input.We apply aligned lightweight MCMC inference for the inference problem through the infer construct on lines 12-16.The first argument to infer gives the inference algorithm configuration, and the second argument the inference problem.Inference problems are thunks (i.e., functions with a dummy unit argument).We construct the inference problem thunk by an application of m with a uniform   initial distribution and data point 1.0.The inference result d0 is another probability distribution, and we use it as the first initial distribution in the recursive repeat function (lines [19][20][21][22][23][24].This function repeatedly performs inference using the SMC bootstrap particle filter (lines 21-23), again using the function m to construct the sequence of inference problems.Each infer application uses the result distribution from the previous iteration as the initial distribution and consumes data points from the data sequence.We extract and print the samples from the final result distribution d1 at lines 29-33.A limitation with the current extraction approach is that we do not yet support nested infers.
A key challenge in the compiler design is how to handle different inference algorithms within one probabilistic program.In particular, inference algorithms require different selective CPS transformations, applied to different parts of the code.To allow the separate handling of inference algorithms, we apply the extraction approach by Hummelgren et al. [22] on the infer applications, producing separate inference problems for each occurrence of infer.Although the compiler design mostly concerns rather comprehensive engineering work, special care must be taken to handle the non-trivial problem of name bindings when transforming and combining different code entities together.For instance, the compiler must selectively CPS transform Fig. 6b to suspend at assume (required by MCMC) and selectively CPS transform Fig. 6c to suspend at observe (required by SMC).We design a robust and modular solution, where it is possible to easily add new inference algorithms without worrying about name conflicts.

Evaluation
This section presents the evaluation of the suspension analysis and selective CPS implementations.Our main claims are that (i) the approach of selective CPS significantly improves performance compared to traditional full CPS, and (ii) that this holds for a significant set of inference algorithms, evaluated on realistic inference problems.We use four PPL models and corresponding data sets from the Miking benchmarks repository, available on GitHub [1].The models are: constant rate birth-death (CRBD) in Section 7.1, cladogenetic diversification rate shift (ClaDS) in Section 7.2, latent Dirichlet allocation (LDA) in Section 7.3, and vector-borne disease (VBD) in Section 7.4.All models are significant and actively used in different research areas: CRBD and ClaDS in evolutionary biology and phylogenetics [36,42,31], LDA in topic modeling [7], and VBD in epidemiology [14,33].In addition to the Miking CorePPL models from the Miking benchmarks, we also implement CRBD in WebPPL and Anglican.
We add a number of popular inference algorithms in Miking CorePPL with support for selective CPS.The first is standard likelihood weighting (LW), as introduced in Section 2. LW does not strictly require CPS, but we implement it with suspensions at weight to highlight the difference between no CPS, selective CPS, and full CPS.LW gives a good direct measure of CPS overhead as the algorithm simply executes programs many times.Suspending at weight can also be useful in LW to stop executions with weight 0 (i.e., useless samples) early.However, we do not use early stopping to isolate the effect CPS has on execution time.Next, we add the bootstrap particle filter (BPF) and alive particle filter (APF).Both are SMC algorithms that suspend at weight to resample executions.BPF is a standard algorithm often used in PPLs, and APF is a related algorithm introduced in a PPL context by Kudlicka et al. [24].The final two inference algorithms we add are aligned lightweight MCMC (just MCMC for short) and particle-independent Metropolis-Hastings. Aligned lightweight MCMC [29] is an extension to the standard PPL Metropolis-Hastings approach introduced by Wingate et al. [48], and suspends at a subset of calls to assume.Particle-independent Metropolis-Hastings (PIMH) is an MCMC algorithm that repeatedly uses the BPF (suspending at weight) within a Metropolis-Hastings MCMC algorithm [39].We limit the scope to single-core CPU inference.
In addition to the inference algorithms in Miking CorePPL, we also use three other state-of-the-art PPLs for CRBD: Anglican, WebPPL, and the special highperformance RootPPL compiler for Miking CorePPL [30].For Anglican, we apply LW, BPF, and PIMH inference.For WebPPL, we use BPF and (non-aligned) lightweight MCMC.For the RootPPL version of Miking CorePPL, we use BPF inference (the only supported inference algorithm).We consider two configurations for each model: 1 000 and 10 000 samples.An exception is for CRBD and ClaDS, where we adjust APF to use 500 and 5 000 samples to make the inference accuracy comparable to the related BPF.We run each experiment 300 times (with one warmup run) and measure execution time (excluding compile time).To justify the efficiency of the suspension analysis and selective CPS transformation that are part of the compiler, we note here that they, combined, run in only 1-5 ms for all models.
The experiments do not compare the performance of different inference algorithms.To do this, one would also need to consider how accurate the inference results are for a given amount of execution time.Accuracy varies dramatically between different combinations of inference algorithms and models.We evaluate the execution time of selective and full CPS in isolation for individual inference algorithms.Selective CPS is solely an execution time optimization-the algorithms themselves and their accuracy remain unchanged (we verify this in Appendix B for LW, BPF, and APF).
For Miking CorePPL, we used OCaml 4.12.0 as backend compiler for the implementation in Section 6 and GCC 7.5.0 for the separate RootPPL compiler.We used Anglican 1.1.0(OpenJDK 11.0.19) and WebPPL 0.9.15 (Node.js16.18.0).We ran the experiments on an Intel Xeon Gold 6148 CPU with 64 GB of memory using Ubuntu 18.04.6.

Constant Rate Birth-Death
CRBD is a diversification model, used by evolutionary biologists to infer distributions over birth and death rates for observed evolutionary trees of groups of species, called phylogenies.For the CRBD experiment, we use the Alcedinidae phylogeny (Kingfisher birds, 54 extant species) [42,23].We compare CRBD in  Miking CorePPL (55 lines of code), Anglican (129 lines of code), and WebPPL (66 lines of code).The source code is available in Appendix B.1.The total experiment execution time was 9 hours.Fig. 7 presents the results.We note that selective CPS is faster than full CPS in all cases.Unlike full CPS, the overhead of selective CPS compared to no CPS is marginal for LW.The execution time for early MCMC samples is sensitive to initial conditions, and we therefore see more variance for MCMC compared to the other algorithms.When we increase the number of samples to 10 000, the variance reduces.With the exception of MCMC in WebPPL, the execution times for Anglican and WebPPL are one order of magnitude slower than the equivalent algorithms in Miking CorePPL.However, note that the comparison is only for reference and not entirely fair, as Anglican and WebPPL use different execution environments compared to Miking CorePPL.Lastly, we note that the Miking CorePPL BPF implementation with selective CPS is not much slower than when compiling Miking CorePPL to RootPPL BPF-a compiler designed specifically for efficiency (but with other limitations, such as the lack of garbage collection).RootPPL does not use CPS, and instead enables suspension through a low-level transformation using the concept of PPL control-flow graphs [30].

Cladogenetic Diversification Rate Shift
ClaDS is another diversification model used in evolutionary biology [31,42].Unlike CRBD, it allows birth and death rates to change over time.We again use the Alcedinidae phylogeny.The full source code (72 lines of code) is available in Appendix B.2.The total experiment execution time was 3 hours.Fig. 8 presents the results.We note that selective CPS is faster than full CPS in all cases.

Latent Dirichlet Allocation
LDA [7] is a model from natural language processing used to categorize documents into topics.We use a synthetic data set with size comparable to the data  set in Ritchie et al. [40]: a vocabulary of 100 words, 10 topics, and 25 observed documents (30 words in each).We do not apply any optimization techniques such as collapsed Gibbs sampling [21].Solving the inference problem using a PPL is therefore challenging already for small data sets.The full source code (26 lines of code) is available in Appendix B.3.The total experiment execution time was 12 hours.Fig. 9 presents the results.We note that selective CPS is faster than full CPS in all cases.Interestingly, the reduction in overhead compared to full CPS for LW is not as significant.The reason is that suspension at weight for the model requires that we CPS transform the most computationally expensive recursion.

Vector-Borne Disease
We use the VBD model from Funk et al. [14] and later Murray et al. [33].The background is a dengue outbreak in Micronesia and the spread of disease between mosquitos and humans.The inference problem is to find the true numbers of susceptible, exposed, infectious, and recovered (SEIR) individuals each day, given daily reported numbers of new cases at health centers.The full source code (140 lines) is available in Appendix B.4.The total execution time was 8 hours.Fig. 10 presents the results.Again, we note that selective CPS is faster than full CPS in all cases, except seemingly for APF and 1 000 samples.This is very likely a statistical anomaly, as the variance for APF is quite severe for the case with 1 000 samples.Compared to the BPF, APF uses a resampling approach for which the execution time varies a lot if the number of samples is too low [24].The plots clearly show this as, compared to 1 000 samples, the variance is reduced to BPF-comparable levels for 10 000 samples.In summary, the evaluation demonstrates the clear benefits of selective CPS over full CPS for universal PPLs.

Related Work
There are a number of universal PPLs that require non-trivial suspension.One such language is Anglican [49], which solves the suspension problem using CPS.Anglican performs a full CPS transformation with one exception-certain statically known functions named primitive procedures, that include a subset of the regular Clojure (the host language of Anglican) functions, are guaranteed to not execute PPL code, and Anglican does not CPS transform them [46].However, higher-order functions in Clojure libraries cannot be primitive procedures, and Anglican must manually reimplement such functions (e.g., map and fold).Anglican does not consider a selective CPS transformation of PPL code, and always fully CPS transforms the PPL part of Anglican programs.
WebPPL [18] and the approach by Ritchie et al. [40] also make use of CPS transformations to implement PPL inference.They do not, however, consider selective CPS transformations.Ścibior et al. [43] present an architectural design for a probabilistic functional programming library based on monads and monad transformers (and corresponding theory in Ścibior et al. [44]).In particular, they use a coroutine monad transformer to suspend SMC inference.This approach is similar to ours in that it makes use of high-level functional language features to enable suspension.They do not, however, consider a selective transformation.
The PPLs Pyro [6], Stan [10,5], Gen [11,27], and Edward [47] either implement inference algorithms that do not require suspension (e.g., Hamiltonian Monte Carlo), or restrict the language in such a way that suspension is explicit and trivially handled by the language implementation.For example, SMC in Pyro8 and newer versions of Birch require that users explicitly write programs as a step function that the SMC implementation calls iteratively.Resampling only occurs in between calls to step, and suspension is therefore trivial.
Work on general-purpose selective CPS transformations include Nielsen [37], Asai and Uehara [4], Rompf et al. [41], and Leijen [26].They consider typed languages, unlike the untyped language in this paper.The early work by Nielsen [37] considers the efficient implementation of call/cc through a selective CPS transformation.The transformation requires manual user annotations, unlike the fully automatic approach in this paper.A more recent approach is due to Asai and Uehara [4], who consider an efficient implementation of delimited continuations us-ing shift and reset through a selective CPS transformation.Similar to us, they automatically determine where to selectively CPS transform programs.They use an approach based on type inference, while our approach builds upon 0-CFA.Rompf et al. [41] follow a similar approach to Asai and Uehara [4], but for Scala, and additionally require user annotations.Leijen [26] uses a type-directed selective CPS transformation to compile algebraic effect handlers.
There are low-level alternatives to CPS for suspension in PPLs.In particular, there are various languages and approaches that directly implement support for non-preemptive multitasking (e.g., coroutines).Turing [15] and older versions of Birch [35,34] implement coroutines to enable arbitrary suspension, but do not discuss the implementations in detail.Lundén et al. [30] introduces and uses the concept of PPL control-flow graphs to compile Miking CorePPL to the low-level C++ framework RootPPL.The compiler explicitly introduces code that maintains special execution call stacks, distinct from the implicit C++ call stacks.The implementation results in excellent performance, but supports neither garbage collection nor higher-order functions.Another low-level approach is due to Paige and Wood [39], who exploits mutual exclusion locks and the fork system call to suspend and resample SMC executions.In theory, many of the above low-level alternatives to CPS can, if implemented efficiently, result in the least possible overhead due to more fine-grained low-level control.The approaches do, however, require significantly more implementation effort compared to a CPS transformation.Comparatively, the selective CPS transformation is a surprisingly simple, high-level, and easy-to-implement alternative that brings the overhead of CPS closer to that of more low-level approaches.

Conclusion
This paper introduces a selective CPS transformation for the purpose of execution suspension in PPLs.To enable the transformation, we develop a static suspension analysis that determines parts of programs that require a CPS transformation as a consequence of inference algorithm suspension requirements.We implement the suspension analysis, selective CPS transformation, and an inference problem extraction procedure (required as a result of the selective CPS transformation) in Miking CorePPL.Furthermore, we evaluate the implementation on real-world models from phylogenetics, topic-modeling, and epidemiology.The results demonstrate significant speedups compared to the standard full CPS suspension approach for a large number of Monte Carlo inference algorithms.

A Suspension Analysis, Continued
This section provides additional details on the suspension analysis.Specifically, Section A.1 describes the suspension analysis algorithm, and Section A.2 presents the proof for Theorem 1.

A.1 Algorithm
Before proceeding to the algorithm, we formally define constraints.Definition 8 (Constraints).We define the constraints c ∈ R as follows.x, y, lhs, rhs, app, res ∈ X.
(8) Algorithm 3 is the full suspension analysis.The algorithm uses a worklist and constraints produced by Algorithm 1 to propagate abstract values throughout the program until fixpoint.In particular, the algorithm propagates the new suspension-related constraints.
Algorithm 3 Suspension analysis.We write the functional-style pseudocode for the algorithm itself in sans serif font to distinguish it from terms in T .

3 if
, using the recursive function iter .The function application f Bernoulli a 1 o gives the probability of the outcome o given a bias a 1 for the coin.I.e., f Bernoulli a 1 true = a 1 and f Bernoulli a 1 false = 1 − a 1 .So, for example, a sample a 1 = 0.4 gets the accumulated weight 0.4 • 0.4 • 0.6 • 0.4 and a 1 = 0.7 the accumulated weight 0.7•0.7•0.3•0.7.The end result is an infinite set of weighted samples of a 1 (the program returns a 1 at line 8) that approximate the posterior or target distribution of Fig. 1a, illustrated in Fig 1c.Note 1 let a1 = assume (Beta 2 2) in 2 let rec iter = λobs.null obs then () else 4 weight (f Bernoulli a1 (head obs)); 5 iter (tail obs) 6 in 7 iter [true,true,false,true]; 8 a1 (a) Program t example .Distribution of t example .

Fig. 1 :
Fig. 1: A probabilistic program t example modeling the bias of a coin.Fig.(a) gives the program.The function f Bernoulli is the probability mass function of the Bernoulli distribution.Fig.(b) illustrates the distribution for a 1 at line 1 in (a).Fig. (c) shows the set of (weighted) samples resulting from conceptually running t example infinitely many times.Fig.(d) and Fig. (e) show the selective CPS transformations required for suspension at assume and weight, respectively.Fig. (f) gives t example in full CPS, with suspensions at assume and weight.The CPS subscript indicates CPS-versions of intrinsic functions such as head and tail .

Fig. 5 :
Fig.5: Overview of the Miking CorePPL compiler implementation.We divide the overall compiler into two parts, (i) suspension analysis and selective CPS (Section 6.1) and (ii) inference problem extraction (Section 6.2).The figure depicts artifacts as gray rectangular boxes and transformation units and libraries as blue rounded boxes.Note how the inference extractors transformation separates the program into two different paths that are combined again after the inference-specific compilation.The white inheritance arrows (pointing to suspension analysis and selective CPS transformations) mean that these libraries are used within the inference-specific compiler transformation.

1
let m = lam d. lam y. lam. 2 let x = assume d in 3 observe y (Gaussian x 0.1); 4 x in 5 m d y () (c) Extracted inference problem from line 22 in (a).

Fig. 6 :
Fig. 6: Example Miking CorePPL program in (a) with two non-trivial uses of infer.Figures (b) and (c) show the extracted and selectively CPS-transformed inference problems at lines 13 and 22 in (a), respectively.The compiler handles the free variables d and y in (c) in a later stage.
Suspension assume (y,λx.cps′ (cont, t2)) vars 14 then λk.λy.cps ′ (k, t b ) 15 else λy.cps ′ (t id , t b ) 19 if x ∈ vars then 20 if tailCall(t) 21 then lhs cont rhs 22 else lhs (λx.t ′ 2 ) rhs 23 else let x = t1 in t ′ 35 if y then cps ′ (k, tt) else cps ′ (k, te) 36 else let x = if y then cps ′ (t id , tt) 37 else cps ′ (t id , te) in t ′ 2 38 | assume y → let x = t1 in t ′ 41 then Suspension assume (y, cont) 42 else 47 then Suspension weight (y, cont) , handled by the case at line 18 in Algorithm 2. The application t 17 is at tail position, and we transform it by adding the current continuation as an argument.The application at t 22 is Fig. 7: Mean execution times for the CRBD model.The error bars show 95% confidence intervals (using the option ('ci', 95) in Seaborn's barplot).The table shows standard deviations.
Mean execution times for the LDA model.The error bars show 95% confidence intervals (using the option ('ci', 95) in Seaborn's barplot).