How long, O Bayesian network, will I sample thee? A program analysis perspective on expected sampling times

Bayesian networks (BNs) are probabilistic graphical models for describing complex joint probability distributions. The main problem for BNs is inference: Determine the probability of an event given observed evidence. Since exact inference is often infeasible for large BNs, popular approximate inference methods rely on sampling. We study the problem of determining the expected time to obtain a single valid sample from a BN. To this end, we translate the BN together with observations into a probabilistic program. We provide proof rules that yield the exact expected runtime of this program in a fully automated fashion. We implemented our approach and successfully analyzed various real-world BNs taken from the Bayesian network repository.


Introduction
Bayesian networks (BNs) are probabilistic graphical models representing joint probability distributions of sets of random variables with conditional dependencies. Graphical models are a popular and appealing modeling formalism, as they allow to succinctly represent complex distributions in a human-readable way. Bayesian Networks have been intensively studied at least since 1985 [42] and have a wide range of applications including machine learning [23], speech recognition [49], sports betting [10], gene regulatory networks [17], diagnosis of diseases [26], and finance [38].
Probabilistic programs are programs with the key ability to draw values at random. Seminal papers by Kozen from the 1980s consider formal semantics [31] as well as initial work on verification [46,32]. McIver & Morgan [34] build on this work to further weakest-precondition style verification for imperative probabilistic programs. Given a Bayesian network with observed evidence, how long does it take in expectation to obtain a single sample that satisfies the observations?
As an example, consider the BN in Figure 1 which consists of just three nodes (random variables) that can each assume values 0 or 1. Each node X comes with a conditional probability table determining the probability of X assuming some value given the values of all nodes Y that X depends on (i.e. X has an incoming edge from Y ), see Appendix A.1 for detailed calculations. For instance, the probability that G assumes value 0, given that S and R are both assume 1, is 0.2. Note that this BN is paramterized by a ∈ [0, 1]. Now, assume that our observed evidence is the event G=0 and we apply rejection sampling to obtain one accepting sample from this BN. Then our approach will yield that a rejection sampling algorithm will, on average, require 200a 2 − 40a − 460 89a 2 − 69a − 21 guard evaluations, random assignments, etc. until it obtains a single sample that complies with the observation G=0 (the underlying runtime model is discussed in detail in Section 3.3). By examination of this function, we see that for large ranges of values of a the BN is rather well-behaved: For a ∈ [0.08, 0.78] the expected sampling time stays below 18. Above a = 0.95 the expected sampling time starts to grow rapidly up to 300.
While 300 is still moderate, we will see later that expected sampling times of real-world BNs can be much larger. For some BNs, the expected sampling time even exceeded 10 18 , rendering sampling based methods infeasible. In this case, exact inference (despite NP-hardness) was a viable alternative (see Section 6).
Our approach. We apply weakest precondition style reasoning a lá McIver & Morgan [34] and Kaminski et al. [29] to analyze both expected outcomes and expected runtimes (ERT) of a syntactic fragment of pGCL, which we call the Bayesian Network Language (BNL). Note that since BNL is a syntactic fragment of pGCL, every BNL program is a pGCL program but not vice versa. The main restriction of BNL is that (in contrast to pGCL) loops are of a special form that prohibits undesired data flow across multiple loop iterations. While this restriction renders BNL incapable of, for instance, counting the number of loop iterations 1 , BNL is expressive enough to encode Bayesian networks with observed evidence.
As a central notion behind these rules, we introduce f -i.i.d.-ness of probabilistic loops, a concept closely related to stochastic independence, that allows us to rule out undesired parts of the data flow across loop iterations. Furthermore, we show how every BN with observations is translated into a BNL program, such that (a) executing the BNL program corresponds to sampling from the conditional joint distribution given by the BN and observed data, and (b) the ERT of the BNL program corresponds to the expected time until a sample that satisfies the observations is obtained from the BN.
As a consequence, exact expected sampling times of BNs can be inferred by means of weakest precondition reasoning in a fully automated fashion. This can be seen as a first step towards formally evaluating the quality of a plethora of different sampling methods (cf. [30,48]) on source code level.
Contributions. To summarize, our main contributions are as follows: -We develop easy-to-apply proof rules to reason about expected outcomes and expected runtimes of probabilistic programs with f -i.i.d. loops. -We study a syntactic fragment of probabilistic programs, the Bayesian network language (BNL), and show that our proof rules are applicable to every BNL program; expected runtimes of BNL programs can thus be inferred. -We give a formal translation from Bayesian networks with observations to BNL programs; expected sampling times of BNs can thus be inferred. -We implemented a prototype tool that automatically analyzes the expected sampling time of BNs with observations. An experimental evaluation on real-world BNs demonstrates that very large expected sampling times (in the magnitude of millions of years) can be inferred within less than a second; This provides practitioners the means to decide whether sampling based methods are appropriate for their models.
Outline. We discuss related work in Section 2. Syntax and semantics of the probabilistic programming language pGCL are presented in Section 3. Our proof rules are introduced in Section 4 and applied to BNs in Section 5. Section 6 reports on experimental results and Section 7 concludes.

Related Work
While various techniques for formal reasoning about runtimes and expected outcomes of probabilistic programs have been developed, e.g. [24,6,5,16,37], none of them explicitly apply formal methods to reason about Bayesian networks on source code level. In the following, we focus on approaches close to our work.
Weakest preexpectation calculus. Our approach builds upon the expected runtime calculus [29], which is itself based on work by Kozen [31,32] and McIver and Morgan [34]. In contrast to [29], we develop specialized proof rules for a clearly specified program fragment without requiring user-supplied invariants. Since finding invariants often requires heavy calculations, our proof rules contribute towards simplifying and automating verification of probabilistic programs.
Ranking supermartingales. Reasoning about almost-sure termination is often based on ranking (super)martingales (cf. [7,9]). In particular, Chatterjee et al. [8] consider the class of affine probabilistic programs for which linear ranking supermartingales exist (Lrapp); thus proving (positive 2 ) almost-sure termination for all programs within this class. They also present a doubly-exponential algorithm to approximate ERTs of Lrapp programs. While all BNL programs lie within Lrapp, our proof rules yield exact ERTs as expectations (thus allowing for compositional proofs), in contrast to a single number for a fixed initial state.
Bayesian networks and probabilistic programs. Bayesian networks are a -if not the most -popular probabilistic graphical model (cf. [3,30] for details) for reasoning about conditional probabilities. They are closely tied to (a fragment of) probabilistic programs. For example, Infer.NET [35] performs inference by compiling a probabilistic program into a Bayesian network. While correspondences between probabilistic graphical models, such as BNs, have been considered in the literature [36,20,22], we are not aware of a formal soudness proof for a translation from classical BNs into probabilistic programs including conditioning. Conversely, some probabilistic programming languages such as Church [20], Stan [25], and R2 [39] directly perform inference on the program level using sampling techniques similar to those developed for Bayesian networks. Our approach is a step towards understanding sampling based approaches formally: We obtain the exact expected runtime required to generate a sample that satisfies all observations. This may ultimately be used to evaluate the quality of a plethora of proposed sampling methods for Bayesian inference (cf. [30,48]).

Probabilistic Programs
We briefly present the probabilistic programming language that is used throughout this paper. Since our approach is embedded into weakest-precondition style approaches, we also recap calculi for reasoning about both expected outcomes and expected runtimes of probabilistic programs.

The Probabilistic Guarded Command Language
We enhance Dijkstra's Guarded Command Language [14,13] by a probabilistic construct, namely a random assignment. We thereby obtain a probabilistic Guarded Command Language (for a closely related language, see [34]).
Let Vars be a finite set of program variables. Moreover, let Q be the set of rational numbers, and let D (Q) be the set of discrete probability distributions over Q. The set of program states is given by A distribution expression µ is a function of type µ : Σ → D (Q) that takes a program state and maps it to a probability distribution on values from Q. We denote by µ σ the distribution obtained from applying σ to µ.
The probabilistic guarded command language (pGCL) is given by the grammar where x ∈ Vars is a program variable, µ is a distribution expression, and ϕ is a Boolean expression guarding a choice or a loop. A pGCL program that contains neither diverge, nor while, nor repeat − until loops is called loop-free. For σ ∈ Σ and an arithmetical expression E over Vars, we denote by σ(E) the evaluation of E in σ, i.e. the value that is obtained by evaluating E after replacing any occurrence of any program variable x in E by the value σ(x). Analogously, we denote by σ(ϕ) the evaluation of a guard ϕ in state σ to either true or false. Furthermore, for a value v ∈ Q we write σ [x → v] to indicate that we set program variable We use the Iverson bracket notation to associate with each guard its according indicator function. Formally, the Iverson bracket [ϕ] of ϕ is thus defined as the function [ϕ] = λ σ. σ(ϕ).
Let us briefly go over the pGCL constructs and their effects: skip does not alter the current program state. The program diverge is an infinite busy loop, thus takes infinite time to execute. It returns no final state whatsoever.
The random assignment x :≈ µ is (a) the only construct that can actually alter the program state and (b) the only construct that may introduce random behavior into the computation. It takes the current program state σ, then samples a value v from probability distribution µ σ , and then assigns v to program variable x. An example of a random assignment is If the current program state is σ, then the program state is altered to either σ [x → 5] with probability 1 /2, or to σ [x → σ(y) + 1] with probability 1 /6, or to σ [x → σ(y) − 1] with probability 1 /3. The remainder of the pGCL constructs are standard programming language constructs.
In general, a pGCL program C is executed on an input state and yields a probability distribution over final states due to possibly occurring random assignments inside of C. We denote that resulting distribution by C σ . Strictly speaking, programs can yield subdistributions, i.e. probability distributions whose total mass may be below 1. The "missing" probability mass represents the probability of nontermination. Let us conclude our presentation of pGCL with an example: Example 1 (Geometric Loop). Consider the program C geo given by x :≈ 0; c :≈ 1 /2 · 0 + 1 /2 · 1 ; while (c = 1) {x :≈ x + 1; c :≈ 1 /2 · 0 + 1 /2 · 1 } This program basically keeps flipping coins until it flips, say, heads (c = 0). In x it counts the number of unsuccessful trials. 4 In effect, it almost surely sets c to 0 and moreover it establishes a geometric distribution on x. The resulting distribution is given by

The Weakest Preexpectation Transformer
We now present the weakest preexpectation transformer wp for reasoning about expected outcomes of executing probabilistic programs in the style of McIver & Morgan [34]. Given a random variable f mapping program states to reals, it allows us to reason about the expected value of f after executing a probabilistic program on a given state.
Expectations. The random variables the wp transformer acts upon are taken from a set of so-called expectations, a term coined by McIver & Morgan [34]: Definition 1 (Expectations). The set of expectations E is defined as . We will use the notation f [x/E] to indicate the replacement of every occurrence of x in f by E. Since x, however, does not actually occur in f , we more A complete partial order ≤ on E is obtained by point-wise lifting the canonical total order on R ∞ ≥0 , i.e.
Its least element is given by λσ. 0 which we (by slight abuse of notation) also denote by 0. Suprema are constructed pointwise, i.e. for S ⊆ E the supremum sup S is given by sup S = λσ. sup f ∈S f (σ).
We allow expectations to map only to positive reals, so that we have a complete partial order readily available, which would not be the case for expectations of type Σ → R ∪ {−∞, +∞}. A wp calculus that can handle expectations of such type needs more technical machinery and cannot make use of this underlying natural partial order [28]. Since we want to reason about ERTs which are by nature non-negative, we will not need such complicated calculi. Notice that we use a slightly different definition of expectations than McIver & Morgan [34], as we allow for unbounded expectations, whereas [34] requires that expectations are bounded. This however would prevent us from capturing ERTs, which are potentially unbounded.
Expectation Transformers. For reasoning about the expected value of f ∈ E after execution of C, we employ a backward-moving weakest preexpectation transformer wp C : E → E, that maps a postexpectation f ∈ E to a preexpectation wp C (f ) ∈ E, such that wp C (f ) (σ) is the expected value of f after executing C on initial state σ. Formally, if C executed on input σ yields final distribution C σ , then the weakest preexpectation wp C (f ) of C with respect to postexpectation f is given by  Let us briefly go over the definitions in Table 1: For skip the program state is not altered and thus the expected value of f is just f . The program diverge will never yield any final state. The distribution over the final states yielded by diverge is thus the null distribution ν 0 (τ ) = 0, that assigns probability 0 to every state. Consequently, the expected value of f after execution of diverge is given by The rule for the random assignment x :≈ µ is a bit more technical: Let the current program state be σ. Then for every value v ∈ Q, the random assignment assigns v to x with probability µ σ (v), where σ is the current program state. The value of f after assigning v to x is f (σ [x → v]) = f [x/v](σ) and therefore the expected value of f after executing the random assignment is given by Expressed as a function of σ, the latter yields precisely the definition in Table 1.
The definition for the conditional choice if (ϕ) {C 1 } else {C 2 } is not surprising: if the current state satisfies ϕ, we have to opt for the weakest preexpectation of C 1 , whereas if it does not satisfy ϕ, we have to choose the weakest preexpectation of C 2 . This yields precisely the definition in Table 1.
The definition for the sequential composition C 1 ; C 2 is also straightforward: We first determine wp C 2 (f ) to obtain the expected value of f after executing C 2 . Then we mentally prepend the program C 2 by C 1 and therefore determine the expected value of wp C 2 (f ) after executing C 1 . This gives the weakest preexpectation of C 1 ; C 2 with respect to postexpectation f .
The definition for the while loop makes use of a least fixed point, which is a standard construction in program semantics. Intuitively, the fixed point iteration of the wp-characteristic functional, given by 0, . ., corresponds to the portion the expected value of f after termination of the loop, that can be collected within at most 0, 1, 2, 3, . . . loop guard evaluations. The Kleene Fixed Point Theorem [33] ensures that this iteration converges to the least fixed point, i.e.
By inspection of the above equality, we see that the least fixed point is exactly the construct that we want for while loops, since sup n∈N F n f (0) in principle allows the loop to run for any number of iterations, which captures precisely the semantics of a while loop, where the number of loop iterations is -in contrast to e.g. for loops -not determined upfront.
Finally, since repeat {C} until (ϕ) is syntactic sugar for C; while (ϕ) {C}, we simply define the weakest preexpectation of the former as the weakest preexpectation of the latter. Let us conclude our study of the effects of the wp transformer by means of an example: Example 2. Consider the following program C: Say we wish to reason about the expected value of x + c after execution of the above program. We can do so by calculating wp C (x + c) using the rules in Table 1. This calculation in the end yields wp C (x + c) = 3y + 26 /18 The expected valuation of the expression x + c after executing C is thus 3y + 26 /18. Note that x + c can be thought of as an expression that is evaluated in the final states after execution, whereas 3y + 26 /18 must be evaluated in the initial state before execution of C. △ Healthiness Conditions of wp. The wp transformer enjoys some useful properties, sometimes called healthiness conditions [34]. Two of these healthiness conditions that we will heavily make use of are given below: Theorem 1 (Healthiness Conditions for the wp Transformer [34]). For all C ∈ pGCL, f 1 , f 2 ∈ E, and a ∈ R ≥0 , the following holds:

The Expected Runtime Transformer
While for deterministic programs we can speak of the runtime of a program on a given input, the situation is different for probabilistic programs: For those we instead have to speak of the expected runtime (ERT). Notice that the ERT can be finite (even constant) while the program may still admit infinite executions. An example of this is the geometric loop in Example 1.
A wp-like transformer designed specifically for reasoning about ERTs is the ert transformer [29]. Like wp, it is of type ert C : E → E and it can be shown that measures the time that is needed after executing C (thus f is evaluated in the final states after termination of C), then ert C (f ) (σ) is the expected time that is needed to run C on input σ and then let time f pass. For a more in-depth treatment of the ert transformer, see [29,Section 3]. The transformer is defined as follows: Definition 3 (The ert Transformer [29]). The expected runtime transformer ert : pGCL → E → E is defined by induction on all pGCL programs according to the rules given in Table 2. We call F f (X) = 1 + [¬ϕ]·f + [ϕ] ·wp C (X) the ertcharacteristic functional of the loop while (ϕ) {C} with respect to postexpectation f . As with wp, for a given ert-characteristic function F f , we call the sequence The rules for ert are very similar to the rules for wp. The runtime model we assume is that skip statements, random assignments, and guard evaluations for both conditional choice and while loops cost one unit of time. This runtime model can easily be adopted to count only the number of loop iterations or only the number of random assignments, etc. We conclude with a strong connection between the wp and the ert transformer, that is crucial in our proofs: Theorem 2 (Decomposition of ert [40]). For any C ∈ pGCL and f ∈ E,

Expected Runtimes of i.i.d. Loops
We derive a proof rule that allows to determine exact ERTs of independent and identically distributed loops (or i.i.d. loops for short). Intuitively, a loop loop sampling a point within a circle uniformly at random using rejection sampling. The picture on the right-hand side visualizes the procedure: In each iteration a point (×) is sampled. If we obtain a point within the white area inside the square, we terminate. Otherwise, i.e. if we obtain a point within the gray area outside the circle, we resample.
if the distributions of states that are reached at the end of different loop iterations are equal. This is the case whenever there is no data flow across different iterations. In the non-probabilistic case, such loops either terminate after exactly one iteration or never. This is different for probabilistic programs. As a running example, consider the program C circle in Figure 2. C circle samples a point within a circle with center (5, 5) and radius r = 5 uniformly at random using rejection sampling. In each iteration, it samples a point (x, y) ∈ [0, . . . , 10] 2 within the square (with some fixed precision). The loop ensures that we resample if a sample is not located within the circle. Our proof rule will allow us to systematically determine the ERT of this loop, i.e. the average amount of time required until a single point within the circle is sampled.
Towards obtaining such a proof rule, we first present a syntactical notion of the i.i.d. property. It relies on expectations that are not affected by a pGCL program: Definition 4. Let C ∈ pGCL and f ∈ E. Moreover, let Mod (C) denote the set of all variables that occur on the left-hand side of an assignment in C, and let Vars (f ) be the set of all variables that "occur in f ", i.e. formally .
We are interested in expectations that are unaffected by pGCL programs because of a simple, yet useful observation: If g ⋓ C, then g can be treated like a constant w.r.t. the transformer wp (i.e. like the a in Theorem 1 (1)). For our running example C circle (see Figure 2), the expectation f = wp C body ([x + y ≤ 10]) is unaffected by the loop body C body of C circle . Consequently, we have wp C body (f ) = f · wp C body (1) = f . In general, we obtain the following property: Proof. By induction on the structure of C. See Appendix A.2.

⊓ ⊔
We develop a proof rule that only requires that both the probability of the guard evaluating to true after one iteration of the loop body (i.e. wp C ([ϕ])) as well as the expected value of [¬ϕ] · f after one iteration (i.e. wp C ([¬ϕ] · f )) are unaffected by the loop body. We thus define the following: Definition 5 (f -Independent and Identically Distributed Loops). Let C ∈ pGCL, ϕ be a guard, and f ∈ E. Then we call the loop while This is due to the fact that Table 1) and (again for some fixed precision p ∈ N \ {0}) Our main technical Lemma is that we can express the orbit of the wp-characteristic function as a partial geometric series: and let F f be the corresponding wpcharacteristic function. Then for all n ∈ N \ {0}, it holds that Proof. By use of Lemma 1, see Appendix A.3.
Using this precise description of the wp orbits, we now establish proof rules for f -i.i.d. loops, first for wp and later for ert.
Proof. We have The preexpectation ( †) is to be evaluated in some state σ for which we have two cases: The first case is when wp C ([ϕ]) (σ) < 1. Using the closed form of the geometric series, i.e.
(closed form of geometric series) The second case is when wp C ([ϕ]) (σ) = 1. This case is technically slightly more involved. The full proof can be found in Appendix A.4 ⊓ ⊔ We now derive a similar proof rule for the ERT of an f -i.i.d. loop while (ϕ) {C}.
Theorem 4 (Proof Rule for ERTs of f -i.i.d. Loops). Let C ∈ pGCL, ϕ be a guard, and f ∈ E such that all of the following conditions hold: 2. wp C (1) = 1 (loop body terminates almost-surely).
3. ert C (0) ⋓ C (every iteration runs in the same expected time).
Then for the ERT of the loop while (ϕ) {C} w.r.t. postruntime f it holds that where we define 0 0 := 0 and a 0 := ∞, for a = 0.
Proof. We first prove To this end, we propose the following expression as the orbit of the ert-characteristic function of the loop w.r.t. 0: For a verification that the above expression is indeed the correct orbit, we refer to the rigorous proof of this theorem located in Appendix A.5. Now, analogously to the reasoning in the proof of Theorem 3 (i.e. using the closed form of the geometric series and case distinction on whether wp C ([ϕ]) < 1 or wp C ([ϕ]) = 1), we get that the supremum of this orbit is indeed the right-hand side of ( ‡). To complete the proof, consider the following: (by ( ‡) and Theorem 3)

A Programming Language for Bayesian Networks
So far we have derived proof rules for formal reasoning about expected outcomes and expected run-times of i.i.d. loops (Theorems 3 and 4). In this section, we apply these results to develop a syntactic pGCL fragment that allows exact computations of closed forms of ERTs. In particular, no invariants, (super)martingales or fixed point computations are required. After that, we show how BNs with observations can be translated into pGCL programs within this fragment. Consequently, we call our pGCL fragment the Bayesian Network Language. As a result of the above translation, we obtain a systematic and automatable approach to compute the expected sampling time of a BN in the presence of observations. That is, the expected time it takes to obtain a single sample that satisfies all observations.

The Bayesian Network Language
Programs in the Bayesian Network Language are organized as sequences of blocks. Every block is associated with a single variable, say x, and satisfies two constraints: First, no variable other than x is modified inside the block, i.e. occurs on the left-hand side of a random assignment. Second, every variable accessed inside of a guard has been initialized before. These restrictions ensure that there is no data flow across multiple executions of the same block. Thus, intuitively, all loops whose body is composed from blocks (as described above) are f -i.i.d. loops.
Definition 6 (The Bayesian Network Language). Let Vars = {x 1 , x 2 , . . .} be a finite set of program variables as in Section 3. The set of programs in Bayesian Network Language, denoted BNL, is given by the grammar where x i ∈ Vars is a program variable, all variables in ϕ have been initialized before, and B xi is a non-terminal parameterized with program variable x i ∈ Vars.
That is, for all x i ∈ Vars there is a non-terminal B xi . Moreover, ψ is an arbitrary guard and µ is a distribution expression of the form µ = n j=1 p j · a j with a j ∈ Q for 1 ≤ j ≤ n.
Example 4. Consider the BNL program C dice : This program first throws a fair die. After that it keeps throwing a second die until its result is at least as large as the first die. △ For any C ∈ BNL, our goal is to compute the exact ERT of C, i.e. ert C (0). In case of loop-free programs, this amounts to a straightforward application of the ert calculus presented in Section 3. To deal with loops, however, we have to perform fixed point computations or require user-supplied artifacts, e.g. invariants, supermartingales, etc. For BNL programs, on the other hand, it suffices to apply the proof rules developed in Section 4. As a result, we directly obtain an exact closed form solution for the ERT of a loop. This is a consequence of the fact that all loops in BNL are f -i.i.d., which we establish in the following. By definition, every loop in BNL is of the form repeat {B xi } until (ψ), which is equivalent to B xi ; while (¬ψ) {B xi }. Hence, we want to apply Theorem 4 to that while loop. Our first step is to discharge the theorem's premises: Lemma 3. Let Seq be a sequence of BNL-blocks, g ∈ E, and ψ be a guard. Then: 1. The expected value of g after executing Seq is unaffected by Seq. That is, Proof. 1. is proven by induction on the length of the sequence of blocks Seq and 2. is a consequence of 1., see Appendix A.6. 3. follows immediately from 1. by instantiating g with [¬ψ] and [ψ] · f , respectively.

⊓ ⊔
We are now in a position to derive a closed form for the ERT of loops in BNL.
Theorem 5. For every loop repeat {Seq} until (ψ) ∈ BNL and every f ∈ E, Proof. Let f ∈ E. Moreover, recall that repeat {Seq} until (ψ) is equivalent to the program Seq; while (¬ψ) {Seq} ∈ BNL. Applying the semantics of ert (Table 2), we proceed as follows: Since the loop body Seq is loop-free, it terminates certainly, i.e. wp Seq (1) = 1 (Premise 2. of Theorem 4). Together with Lemma 3.1. and 3., all premises of Theorem 4 are satisfied. Hence, we obtain a closed form for ert while (¬ψ) {Seq} (f ): By Theorem 2, we know ert Seq (g) = ert Seq (0) + wp C (g) for any g. Thus: Since wp is linear (Theorem 1 (2)), we obtain: By a few simple algebraic transformations, this coincides with: Let R denote the fraction above. Then Lemma 3.1. and 2. implies R ⋓ Seq. We may thus apply Lemma 1 to derive wp Seq ([¬ψ] · R) = wp Seq ([¬ψ]) · R. Hence: Again, by Theorem 2, we know that ert Seq (g) = ert Seq (0) + wp Seq (g) for any g. Thus, for g = [ψ] · f , this yields: Then a few algebraic transformations lead us to the claimed ERT: . ⊓ ⊔ Note that Theorem 5 holds for arbitrary postexpectations f ∈ E. This enables compositional reasoning about ERTs of BNL programs. Since all other rules of the ert-calculus for loop-free programs amount to simple syntactical transformations (see Table 2), we conclude that Corollary 1. For any C ∈ BNL, a closed form for ert C (0) can be computed compositionally.
Example 5. Theorem 5 allows us to comfortably compute the ERT of the BNL program C dice introduced in Example 4: For the ERT, we have Table 2) = 3.45 . △

Bayesian Networks
To reason about expected sampling times of Bayesian networks, it remains to develop a sound translation from BNs with observations into equivalent BNL programs. Before we present a formal translation, let us briefly recall BNs. A BN is a probabilistic graphical model that is given by a directed acyclic graph. Every node is a random variable and a directed edge between two nodes expresses a probabilistic dependency between these nodes. As a running example, consider the BN depicted in Figure 3 (inspired by [30]) that models the mood of students after taking an exam. The network contains four random variables. They represent the difficulty of the exam (D), the level of preparation of a student (P ), the achieved grade (G), and the resulting mood (M ). For simplicity, let us assume that each random variable assumes either 0 or 1. The underlying dependencies express that the mood of a student depends on the achieved grade which, in turn, depends on the difficulty of the exam and the amount of preparation before taking it. Every node is accompanied by a conditional probability table that provides the probabilities of a node given the values of all the nodes it depends upon. We can then use the BN to answer queries such as "What is the probability that a student is well-prepared for an exam (P = 1), but ends up with a bad mood (M = 0)?" In order to translate BNs into equivalent BNL programs, we need a formal representation first. Technically, we consider extended BNs in which nodes may additionally depend on inputs that are not represented by nodes in the network. This allows us to define a compositional translation without modifying conditional probability tables. Towards a formal definition of extended BNs, we use the following notation. A tuple (s 1 , . . . , s k ) ∈ S k of length k over some set S is denoted by s. The empty tuple is ε. Moreover, for 1 ≤ i ≤ k, the i-th element of tuple s is given by s(i). To simplify the presentation, we assume that all nodes and all inputs are represented by natural numbers. -E ⊆ V × V is a set of edges such that (V, E) is a directed acyclic graph.
-Vals is a finite set of possible values that can be assigned to each node.
dep : V → (V ∪I) * is a function assigning each node v to an ordered sequence of dependencies. That is, dep(v) = (u 1 , . . . , u m ) such that u i < u i+1 (1 ≤ i < m). Moreover, every dependency u j (1 ≤ j ≤ m) is either an input, i.e. u j ∈ I, or a node with an edge to v, i.e. u j ∈ V and (u j , v) ∈ E. cpt is a function mapping each node v to its conditional probability Here, the i-th entry in a tuple z ∈ Vals k corresponds to the value assigned to the i-th entry in the sequence of dependencies dep(v). In general, the conditional probability table cpt determines the conditional probability distribution of each node v ∈ V given the nodes and inputs it depends on. Formally, we interpret an entry in a conditional probability table as follows: where v ∈ V is a node, a ∈ Vals is a value, and z is a tuple of values of length |dep(v)|. Then, by the chain rule, the joint probability of a BN is given by the product of its conditional probability tables (cf. [3]).
Definition 8. Let B = (V, I, E, Vals, dep, cpt) be an extended Bayesian network. Moreover, let W ⊆ V be a downward closed 5 set of nodes. With each w ∈ W ∪ I, we associate a fixed value w ∈ Vals. This notation is lifted pointwise to tuples of nodes and inputs. Then the joint probability in which nodes in W assume values W is given by The conditional joint probability distribution of a set of nodes W , given observations on a set of nodes O, is then given by the quotient Pr(W =W ) /Pr(O=O).
For example, the probability of a student having a bad mood, i.e. M = 0, after getting a bad grade (G = 0) for an easy exam (D = 0) given that she was well-prepared, i.e. P = 1, is

From Bayesian Networks to BNL
We now develop a compositional translation from EBNs into BNL programs. Throughout this section, let B = (V, I, E, Vals, dep, cpt) be a fixed EBN. Moreover, with every node or input v ∈ V ∪ I we associate a program variable x v . We proceed in three steps: First, every node together with its dependencies is translated into a block of a BNL program. These blocks are then composed into a single BNL program that captures the whole BN. Finally, we implement conditioning by means of rejection sampling.
Step 1: We first present the atomic building blocks of our translation. Let v ∈ V be a node. Moreover, let z ∈ Vals |dep(v)| be an evaluation of the dependencies of v. That is, z is a tuple that associates a value with every node and input that v depends on (in the same order as dep(v)). For every node v and evaluation of its dependencies z, we define a corresponding guard and a random assignment: Note that dep(v)(i) is the i-th element from the sequence of nodes dep(v).
Example 7. Continuing our previous example (see Figure 1), assume we fixed the node v = G. Moreover, let z = (1, 0) be an evaluation of dep(v) = (S, R). Then the guard and assignment corresponding to v and z are given by: guard B (G, (1, 0)) = x D = 1 ∧ x P = 0, and assign B (G, (1, 0)) = x G :≈ 0.6 · 0 + 0.4 · 1 . △ We then translate every node v ∈ V into a program block that uses guards to determine the rows in the conditional probability table under consideration. After that, the program samples from the resulting probability distribution using the previously constructed assignments. In case a node does neither depend on other nodes nor input variables we omit the guards. Formally, Remark 1. The guards under consideration are conjunctions of equalities between variables and literals. We could thus use a more efficient translation of conditional probability tables by adding a switch-case statement to our probabilistic programming language. Such a statement is of the form where x is a tuple of variables, and a 1 , . . . a m−1 are tuples of rational numbers of the same length as x. With respect to the wp semantics, a switch-case statement is syntactic sugar for nested if-then-else blocks as used in the above translation. However, the runtime model of a switch-case statement requires just a single guard evaluation (ϕ) instead of potentially multiple guard evaluations when evaluating nested if-then-else blocks. Since the above adaption is straightforward, we opted to use nested if-then-else blocks to keep our programming language simple and allow, in principle, more general guards. △ Step 2 The next step is to translate a complete EBN into a BNL program. To this end, we compose the blocks obtained from each node starting at the roots of the network. That is, all nodes that contain no incoming edges. Formally, After translating every node in the network, we remove them from the graph, i.e. every root becomes an input, and proceed with the translation until all nodes have been removed. More precisely, given a set of nodes S ⊆ V , the extended BN B \ S obtained by removing S from B is defined as With these auxiliary definitions readily available, an extended BN B is translated into a BNL program as follows: Step 3 To complete the translation, it remains to account for observations. Let cond : V → Vals ∪ {⊥} be a function mapping every node either to an observed value in Vals or to ⊥. The former case is interpreted as an observation that node v has value cond(v). Otherwise, i.e. if cond(v) = ⊥, the value of node v is not observed. We collect all observed nodes in the set O = {v ∈ V | cond(v) = ⊥}. It is then natural to incorporate conditioning into our translation by applying rejection sampling: We repeatedly execute a BNL program until every observed node has the desired value cond(v). In the presence of observations, we translate the extended BN B into a BNL program as follows: Example 8. Consider, again, the BN B depicted in Figure 3. Moreover, assume we observe P = 1. Hence, the conditioning function cond is given by cond(P ) = 1 and cond(v) = ⊥ for v ∈ {D, G, M }. Then the translation of B and cond, i.e. BNL(B, cond), is the BNL program C mood depicted in Figure 4. △ Since our translation yields a BNL program for any given BN, we can compositionally compute a closed form for the expected simulation time of a BN. This is an immediate consequence of Corollary 1. We still have to prove, however, that our translation is sound, i.e. the conditional joint probabilities inferred from a BN coincide with the (conditional) joint probabilities from the corresponding BNL program. Formally, we obtain the following soundness result.  Figure 3. Moreover, recall the corresponding program C mood derived from B in Figure 4, where we observed P = 1. By Theorem 6 we can also determine the probability that a student who got a bad grade in an easy exam was wellprepared by means of weakest precondition reasoning. This yields Furthermore, by Corollary 1, it is straightforward to determine the expected time to obtain a single sample of B that satisfies the observation P = 1:

Implementation
We implemented a prototype in Java to analyze expected sampling times of Bayesian networks. More concretely, our tool takes as input a BN together with observations in the popular Bayesian Network Interchange Format. 6 The BN is then translated into a BNL program as shown in Section 5. Our tool applies the ert-calculus together with our proof rules developed in Section 4 to compute the exact expected runtime of the BNL program. The size of the resulting BNL program is linear in the total number of rows of all conditional probability tables in the BN. The program size is thus not the bottleneck of our analysis. As we are dealing with an NP-hard problem [11,12], it is not surprising that our algorithm has a worst-case exponential time complexity. However, also the space complexity of our algorithm is exponential in the worst case: As an expectation is propagated backwards through an if-clause of the BNL program, the size of the expectation is potentially multiplied. This is also the reason that our analysis runs out of memory on some benchmarks.
All experiments were performed on an HP BL685C G7. Although up to 48 cores with 2.0GHz were available, only one core was used apart from Java's garbage collection. The Java virtual machine was limited to 8GB of RAM.
Our experimental results are shown in Table 3. The number of nodes of the considered BNs ranges from 56 to 1041. For each Bayesian network, we computed the expected sampling time (EST) for different collections of observed nodes (#obs). Furthermore, Table 3 provides the average Markov Blanket size, i.e. the average number of parents, children and children's parents of nodes in the BN [42], as an indicator measuring how independent nodes in the BN are.
Observations were picked at random. Note that the time required by our prototype varies depending on both the number of observed nodes and the actual observations. Thus, there are cases in which we run out of memory although the total number of observations is small.
In order to obtain an understanding of what the EST corresponds to in actual execution times on a real machine, we also performed simulations for the win95pts network. More precisely, we generated Java programs from this network analogously to the translation in Section 5. This allowed us to approximate that our Java setup can execute 9.714 · 10 6 steps (in terms of EST) per second.
For the win95pts with 17 observations, an EST of 1.11·10 15 then corresponds to an expected time of approximately 3.6 years in order to obtain a single valid sample. We were additionally able to find a case with 13 observed nodes where our tool discovered within 0.32 seconds an EST that corresponds to approximately 4.3 million years. In contrast, exact inference using variable elimination was almost instantaneous. This demonstrates that knowing expected sampling times upfront can indeed be beneficial when selecting an inference method.

Conclusion
We presented a syntactic notion of independent and identically distributed probabilistic loops and derived dedicated proof rules to determine exact expected outcomes and runtimes of such loops. These rules do not require any user-supplied information, such as invariants, (super)martingales, etc.
Moreover, we isolated a syntactic fragment of probabilistic programs that allows to compute expected runtimes in a highly automatable fashion. This fragment is non-trivial: We show that all Bayesian networks can be translated into programs within this fragment. Hence, we obtain an automated formal method for computing expected simulation times of Bayesian networks. We implemented this method and successfully applied it to various real-world BNs that stem from, amongst others, medical applications. Remarkably, our tool was capable of proving extremely large expected sampling times within seconds.
There are several directions for future work: For example, there exist subclasses of BNs for which exact inference is in P, e.g. polytrees. Are there analogies for probabilistic programs? Moreover, it would be interesting to consider more complex graphical models, such as recursive BNs [15].

A.1 Calculation for Expected Sampling Time of Example Network
Let B denote the Bayesian Network shown in Figure 1. We denote the nodes by R, S, and G, respectively. Since we want to condition G to value 0, let cond(G) = 0, cond(R) = ⊥, and cond(S) = ⊥. According to the translation from Bayesian Networks to BNL programs presented in Section 5.1, we obtain the following program BNL(B, cond): repeat{ // translation of node R, denoted block B (R, cond) x R :≈ a · 0 + (1 − a) · 1 ; Let C denote the body of the above loop. Theorem 5 then yields We first compute ert C (0) = ert block B (R, cond); block B (S, cond); block B (G, cond) (0). We have and finally Next, we determine wp C ([x G = 0]) and therefore calculate Overall, by applying some simple algebra, we get A.2 Proof of Lemma 1 Proof. By induction on the structure of C. As the induction base we have the atomic programs: (by Table 1) empty: Analogous to skip.
diverge: We have (by Table 1) x :≈ µ: We have Table 1 As the induction hypothesis we now assume that for arbitrary but fixed C 1 , C 2 ∈ pGCL and all f, g ∈ E with g ⋓ C 1 and g ⋓ C 2 it holds that both (by Table 1) (by Table 1) Table 1) = wp C 1 (g · wp C 2 (f )) (g ⋓ C 2 and I.H. on C 2 ) = g · wp C 1 (wp C 2 (f )) (g ⋓ C 1 and I.H. on C 1 ) (by Table 1) holds for all h ∈ E. Thus we show for all n ∈ N, which implies the desired result. We proceed by induction on n.
Induction base n = 0. We have Induction Hypothesis: Equation 3 holds for some arbitrary but fixed n ∈ N.

A.3 Proof of Lemma 2
Proof. By induction on n. We consider the cases n = 1 and n = 2 as the induction base. For n = 1, we have For n = 2, we have Before we perform the inductive step, observe the following equality implied by Lemma 1, which holds for all n ∈ N: As induction hypothesis, we now assume that the lemma holds for some arbitrary but fixed n ∈ N \ {0}. Then Proof. We have ( †) † is to be evaluated in some state σ for which we have two cases: The first case is when wp C ([ϕ]) (σ) < 1. Using the closed form of the geometric series, i.e.
(closed form of geometric series) The second case is when wp C ([ϕ]) (σ) = 1. This can only be true if every state τ that is reachable with non-zero probability by executing C on σ has to satisfy ϕ. Otherwise, the integral would not add up to exactly 1. Formally, From that, it follows that We then get and ert C (0) ⋓ C .
It then follows that Vars (wp Seq (g)) ∩ Mod (Seq) = ∅ and hence wp Seq (g) ⋓ Seq by definition. Note that every sequence Seq ∈ BNL is of the form B xi 1 ; . . . ; B xi n for some n ≥ 1, where each B xi j is a block. Given a program C ∈ pGCL, let Vars Guard (C) denote the the set of variables occurring in a guard in C. A straightforward induction on the structure of a block B xi ∈ BNL yields We now proceed by induction on the length of a sequence n. Proof. We show that Vars (ert Seq (0)) = ∅ holds for every sequence Seq ∈ BNL. This implies ert Seq (0) ⋓ Seq. First observe that since every Seq ∈ BNL is loopfree, we have Vars (wp Seq (g)) = Vars (ert Seq (g)) . The proof makes use of two technical lemmas that are proven afterwards.