How long, O Bayesian network, will I sample thee?
Abstract
Bayesian networks (BNs) are probabilistic graphical models for describing complex joint probability distributions. The main problem for BNs is inference: Determine the probability of an event given observed evidence. Since exact inference is often infeasible for large BNs, popular approximate inference methods rely on sampling.
We study the problem of determining the expected time to obtain a single valid sample from a BN. To this end, we translate the BN together with observations into a probabilistic program. We provide proof rules that yield the exact expected runtime of this program in a fully automated fashion. We implemented our approach and successfully analyzed various real–world BNs taken from the Bayesian network repository.
Keywords
Probabilistic programs, Expected runtimes, Weakest preconditions, Program verification
1 Introduction
Bayesian networks (BNs) are probabilistic graphical models representing joint probability distributions of sets of random variables with conditional dependencies. Graphical models are a popular and appealing modeling formalism, as they allow complex distributions to be represented succinctly in a human–readable way. BNs have been intensively studied at least since 1985 [43] and have a wide range of applications including machine learning [24], speech recognition [50], sports betting [11], gene regulatory networks [18], diagnosis of diseases [27], and finance [39].
Probabilistic programs are programs with the key ability to draw values at random. Seminal papers by Kozen from the 1980s consider formal semantics [32] as well as initial work on verification [33, 47]. McIver and Morgan [35] build on this work to further weakest–precondition style verification for imperative probabilistic programs.
The interest in probabilistic programs has been rapidly growing in recent years [20, 23]. Part of the reason for this déjà vu is their use for representing probabilistic graphical models [31] such as BNs. The full potential of modern probabilistic programming languages like Anglican [48], Church [21], Figaro [44], R2 [40], or Tabular [22] is that they enable rapid prototyping and obviate the need to manually provide inference methods tailored to an individual model.
Probabilistic inference is the problem of determining the probability of an event given observed evidence. It is a major problem for both BNs and probabilistic programs, and has been subject to intense investigations by both theoreticians and practitioners for more than three decades; see [31] for a survey. In particular, it has been shown that for probabilistic programs exact inference is highly undecidable [28], while for BNs both exact inference as well as approximate inference to an arbitrary precision are NP–hard [12, 13]. In light of these complexity–theoretical hurdles, a popular way to analyze probabilistic graphical models as well as probabilistic programs is to gather a large number of independent and identically distributed (i.i.d. for short) samples and then do statistical reasoning on these samples. In fact, all of the aforementioned probabilistic programming languages support sampling based inference methods.
Rejection sampling is a fundamental approach to obtain valid samples from BNs with observed evidence. In a nutshell, this method first samples from the joint (unconditional) distribution of the BN. If the sample complies with all evidence, it is valid and accepted; otherwise it is rejected and one has to resample.
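The loop described above can be sketched in a few lines of Python. This is only an illustration: the three-node network `A -> B, (A, B) -> C`, all of its probabilities, and the function names `sample_joint` and `rejection_sample` are assumptions made for this sketch and are not taken from the paper.

```python
import random

def sample_joint(rng):
    """Draw one sample from the joint distribution of a toy BN A -> B, (A, B) -> C.
    All probabilities are made up for illustration."""
    a = 1 if rng.random() < 0.2 else 0
    p_b = 0.1 if a == 1 else 0.4                      # P(B = 1 | A = a)
    b = 1 if rng.random() < p_b else 0
    # P(C = 1 | B = b, A = a), keyed by (b, a)
    p_c = {(0, 0): 0.0, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}[(b, a)]
    c = 1 if rng.random() < p_c else 0
    return {"A": a, "B": b, "C": c}

def rejection_sample(evidence, rng):
    """Resample from the joint distribution until the evidence holds.
    Returns the accepted sample and the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        sample = sample_joint(rng)
        if all(sample[v] == val for v, val in evidence.items()):
            return sample, attempts

rng = random.Random(0)
sample, attempts = rejection_sample({"C": 1}, rng)
```

Since the attempts are independent and identically distributed, their number follows a geometric distribution whose mean is the reciprocal of the probability of the evidence. Rare evidence therefore translates directly into many rejected samples.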
Apart from rejection sampling, there are more sophisticated sampling techniques, which mainly fall in two categories: Markov Chain Monte Carlo (MCMC) and importance sampling. But while MCMC requires heavy hand–tuning and suffers from slow convergence rates on real–world instances [31, Chapter 12.3], virtually all variants of importance sampling rely again on rejection sampling [31, 49].
If too many samples are rejected, the expected sampling time grows so large that sampling becomes infeasible. The expected sampling time of a BN is therefore a key figure for deciding whether sampling based inference is the method of choice.
“the main challenge in this setting [i.e. sampling based approaches] is that many samples that are generated during execution are ultimately rejected for not satisfying the observations.”
Given a Bayesian network with observed evidence, how long does it take in expectation to obtain a single sample that satisfies the observations?
As an example, consider the BN in Fig. 1 which consists of just three nodes (random variables) that can each assume values 0 or 1. Each node X comes with a conditional probability table determining the probability of X assuming some value given the values of all nodes Y that X depends on (i.e. X has an incoming edge from Y), see [3, Appendix A.1] for detailed calculations. For instance, the probability that G assumes value 0, given that S and R both assume 1, is 0.2. Note that this BN is parameterized by \(a \in [0,1]\).
While 300 is still moderate, we will see later that expected sampling times of real–world BNs can be much larger. For some BNs, the expected sampling time even exceeded \(10^{18}\), rendering sampling based methods infeasible. In this case, exact inference (despite NP–hardness) was a viable alternative (see Sect. 6).
Our Approach. We apply weakest precondition style reasoning à la McIver and Morgan [35] and Kaminski et al. [30] to analyze both expected outcomes and expected runtimes (ERT) of a syntactic fragment of pGCL, which we call the Bayesian Network Language (BNL). Note that since BNL is a syntactic fragment of pGCL, every BNL program is a pGCL program but not vice versa. The main restriction of BNL is that (in contrast to pGCL) loops are of a special form that prohibits undesired data flow across multiple loop iterations. While this restriction renders BNL incapable of, for instance, counting the number of loop iterations^{1}, BNL is expressive enough to encode Bayesian networks with observed evidence.
For BNL, we develop dedicated proof rules to determine exact expected values and the exact ERT of any BNL program, including loops, without any user–supplied data, such as invariants [30, 35], ranking or metering functions [19], (super)martingales [8, 9, 10], etc.
 (a) executing the BNL program corresponds to sampling from the conditional joint distribution given by the BN and observed data, and
 (b) the ERT of the BNL program corresponds to the expected time until a sample that satisfies the observations is obtained from the BN.
As a consequence, exact expected sampling times of BNs can be inferred by means of weakest precondition reasoning in a fully automated fashion. This can be seen as a first step towards formally evaluating the quality of a plethora of different sampling methods (cf. [31, 49]) on source code level.

We develop easy–to–apply proof rules to reason about expected outcomes and expected runtimes of probabilistic programs with f–i.i.d. loops.

We study a syntactic fragment of probabilistic programs, the Bayesian network language (BNL), and show that our proof rules are applicable to every BNL program; expected runtimes of \( \textsf {{BNL}} \) programs can thus be inferred.

We give a formal translation from Bayesian networks with observations to \( \textsf {{BNL}} \) programs; expected sampling times of BNs can thus be inferred.

We implemented a prototype tool that automatically analyzes the expected sampling time of BNs with observations. An experimental evaluation on real–world BNs demonstrates that very large expected sampling times (in the magnitude of millions of years) can be inferred within less than a second. This provides practitioners with the means to decide whether sampling based methods are appropriate for their models.
2 Related Work
While various techniques for formal reasoning about runtimes and expected outcomes of probabilistic programs have been developed, e.g. [6, 7, 17, 25, 38], none of them explicitly apply formal methods to reason about Bayesian networks on source code level. In the following, we focus on approaches close to our work.
Weakest Preexpectation Calculus. Our approach builds upon the expected runtime calculus [30], which is itself based on work by Kozen [32, 33] and McIver and Morgan [35]. In contrast to [30], we develop specialized proof rules for a clearly specified program fragment without requiring user–supplied invariants. Since finding invariants often requires heavy calculations, our proof rules contribute towards simplifying and automating verification of probabilistic programs.
Ranking Supermartingales. Reasoning about almost–sure termination is often based on ranking (super)martingales (cf. [8, 10]). In particular, Chatterjee et al. [9] consider the class of affine probabilistic programs for which linear ranking supermartingales exist (Lrapp); thus proving (positive^{2}) almost–sure termination for all programs within this class. They also present a doubly–exponential algorithm to approximate ERTs of Lrapp programs. While all BNL programs lie within Lrapp, our proof rules yield exact ERTs as expectations (thus allowing for compositional proofs), in contrast to a single number for a fixed initial state.
Bayesian Networks and Probabilistic Programs. Bayesian networks are a popular, if not the most popular, probabilistic graphical model (cf. [4, 31] for details) for reasoning about conditional probabilities. They are closely tied to (a fragment of) probabilistic programs. For example, Infer.NET [36] performs inference by compiling a probabilistic program into a Bayesian network. While correspondences between probabilistic graphical models, such as BNs, and probabilistic programs have been considered in the literature [21, 23, 37], we are not aware of a formal soundness proof for a translation from classical BNs into probabilistic programs including conditioning.
Conversely, some probabilistic programming languages such as Church [21], Stan [26], and R2 [40] directly perform inference on the program level using sampling techniques similar to those developed for Bayesian networks. Our approach is a step towards understanding sampling based approaches formally: We obtain the exact expected runtime required to generate a sample that satisfies all observations. This may ultimately be used to evaluate the quality of a plethora of proposed sampling methods for Bayesian inference (cf. [31, 49]).
3 Probabilistic Programs
We briefly present the probabilistic programming language that is used throughout this paper. Since our approach is embedded into weakest–precondition style approaches, we also recap calculi for reasoning about both expected outcomes and expected runtimes of probabilistic programs.
3.1 The Probabilistic Guarded Command Language
We enhance Dijkstra’s Guarded Command Language [14, 15] by a probabilistic construct, namely a random assignment. We thereby obtain a probabilistic Guarded Command Language (for a closely related language, see [35]).
Let \(\textsf {Vars}\) be a finite set of program variables. Moreover, let \(\mathbb {Q}\) be the set of rational numbers, and let \(\mathcal {D}\left( {\mathbb {Q}} \right) \) be the set of discrete probability distributions over \(\mathbb {Q}\). The set of program states is given by \(\varSigma = \left\{ \sigma ~\middle |~ \sigma :\textsf {Vars}\rightarrow \mathbb {Q} \right\} \).
A distribution expression \(\mu \) is a function of type \(\mu :\varSigma \rightarrow \mathcal {D}\left( {\mathbb {Q}} \right) \) that takes a program state and maps it to a probability distribution on values from \(\mathbb {Q}\). We denote by \(\mu _\sigma \) the distribution obtained from applying \(\mu \) to \(\sigma \).
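The set of \( \textsf {{pGCL}} \) programs is given by the grammar $$\begin{aligned} C ~\longrightarrow ~ \mathtt {skip} ~|~ \mathtt {diverge} ~|~ {x}\mathrel {:\approx }{\mu } ~|~ \mathtt {if} \left( {\varphi } \right) \left\{ {C} \right\} \mathtt {else} \left\{ {C} \right\} ~|~ {C};\,{C} ~|~ \mathtt {while} \left( {\varphi }\right) \left\{ {C} \right\} ~|~ \mathtt {repeat}\left\{ {C}\right\} \mathtt {until}\left( {\varphi }\right) \end{aligned}$$ where \(x \in \textsf {Vars}\) is a program variable, \(\mu \) is a distribution expression, and \(\varphi \) is a Boolean expression guarding a choice or a loop. A \( \textsf {{pGCL}} \) program that contains neither \(\mathtt {diverge}\), nor \(\mathtt {while}\), nor \(\mathtt {repeatuntil}\) loops is called loop–free.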
Let us briefly go over the \( \textsf {{pGCL}} \) constructs and their effects: \(\mathtt {skip}\) does not alter the current program state. The program \(\mathtt {diverge}\) is an infinite busy loop, thus takes infinite time to execute. It returns no final state whatsoever.
In general, a \( \textsf {{pGCL}} \) program C is executed on an input state \(\sigma \) and yields a probability distribution over final states due to possibly occurring random assignments inside of C. We denote that resulting distribution by \(\llbracket C \rrbracket _{\sigma }\). Strictly speaking, programs can yield subdistributions, i.e. probability distributions whose total mass may be below 1. The “missing” probability mass represents the probability of nontermination. Let us conclude our presentation of pGCL with an example:
Example 1 (Geometric Loop)
3.2 The Weakest Preexpectation Transformer
We now present the weakest preexpectation transformer \(\mathsf {wp}\) for reasoning about expected outcomes of executing probabilistic programs in the style of McIver and Morgan [35]. Given a random variable f mapping program states to reals, it allows us to reason about the expected value of f after executing a probabilistic program on a given state.
Expectations. The random variables the \(\mathsf {wp}\) transformer acts upon are taken from a set of so–called expectations, a term coined by McIver and Morgan [35]:
Definition 1 (Expectations)
We allow expectations to map only to non–negative reals, so that we have a complete partial order readily available, which would not be the case for expectations of type \(\varSigma \rightarrow \mathbb {R}\cup \{-\infty ,\, +\infty \}\). A \(\mathsf {wp}\) calculus that can handle expectations of such type needs more technical machinery and cannot make use of this underlying natural partial order [29]. Since we want to reason about ERTs which are by nature non–negative, we will not need such complicated calculi.
Notice that we use a slightly different definition of expectations than McIver and Morgan [35], as we allow for unbounded expectations, whereas [35] requires that expectations are bounded. This however would prevent us from capturing ERTs, which are potentially unbounded.
Definition 2 (The \(\mathsf {wp}\) Transformer [35])
The weakest preexpectation transformer \(\mathsf {wp}: \textsf {{pGCL}} \rightarrow \mathbb {E}\rightarrow \mathbb {E}\) is defined by induction on all \( \textsf {{pGCL}} \) programs according to the rules in Table 1. We call \(F_f(X) = \left[ {\lnot \varphi } \right] \cdot f + \left[ {\varphi } \right] \cdot \mathsf {wp}\llbracket C' \rrbracket \left( X \right) \) the \(\mathsf {wp}\)–characteristic functional of the loop \(\mathtt {while} \left( {\varphi }\right) \left\{ {C'} \right\} \) with respect to postexpectation f. For a given \(\mathsf {wp}\)–characteristic functional \(F_f\), we call the sequence \(\{F_f^n(0) \}_{n\in \mathbb {N}}\) the orbit of \(F_f\).
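Table 1. Rules for the \(\mathsf {wp}\)–transformer.
\(\varvec{C}\)  \(\mathsf {wp}\llbracket C \rrbracket \left( f \right) \)

\(\mathtt {skip}\)  f 
\(\mathtt {diverge}\)  0 
\({x}\mathrel {:\approx }{\mu }\)  \(\lambda \sigma .~\int _{\mathbb {Q}} \left( \lambda v.~f[x/v] \right) \,d\mu _\sigma \) 
\(\mathtt {if} \left( {\varphi } \right) \left\{ {C_1} \right\} \mathtt {else} \left\{ {C_2} \right\} \)  \(\left[ {\varphi } \right] \cdot \mathsf {wp}\llbracket C_1 \rrbracket \left( f \right) + \left[ {\lnot \varphi } \right] \cdot \mathsf {wp}\llbracket C_2 \rrbracket \left( f \right) \) 
\({C_1};\,{C_2}\)  \(\mathsf {wp}\llbracket C_1 \rrbracket \left( \mathsf {wp}\llbracket C_2 \rrbracket \left( f \right) \right) \) 
\(\mathtt {while} \left( {\varphi }\right) \left\{ {C'} \right\} \)  \(\mathsf {lfp}\,X.~\left[ {\lnot \varphi } \right] \cdot f + \left[ {\varphi } \right] \cdot \mathsf {wp}\llbracket C' \rrbracket \left( X \right) \) 
\(\mathtt {repeat}\left\{ {C'}\right\} \mathtt {until}\left( {\varphi }\right) \)  \(\mathsf {wp}\llbracket {C'};\,{\mathtt {while} \left( {\lnot \varphi }\right) \left\{ {C'} \right\} } \rrbracket \left( f \right) \) 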
Let us briefly go over the definitions in Table 1: For \(\mathtt {skip}\) the program state is not altered and thus the expected value of f is just f. The program \(\mathtt {diverge}\) will never yield any final state. The distribution over the final states yielded by \(\mathtt {diverge}\) is thus the null distribution \(\nu _0(\tau ) = 0\), that assigns probability 0 to every state. Consequently, the expected value of f after execution of \(\mathtt {diverge}\) is given by \(\int _{\varSigma }~{f}~d{\nu _0} = \sum _{\tau \in \varSigma }0 \cdot f(\tau ) = 0\).
The definition for the conditional choice \(\mathtt {if} \left( {\varphi } \right) \left\{ {C_1} \right\} \mathtt {else} \left\{ {C_2} \right\} \) is not surprising: if the current state satisfies \(\varphi \), we have to opt for the weakest preexpectation of \(C_1\), whereas if it does not satisfy \(\varphi \), we have to choose the weakest preexpectation of \(C_2\). This yields precisely the definition in Table 1.
The definition for the sequential composition \({C_1};\,{C_2}\) is also straightforward: We first determine \(\mathsf {wp}\llbracket C_2 \rrbracket \left( f \right) \) to obtain the expected value of f after executing \(C_2\). Then we mentally prepend \(C_1\) to the program \(C_2\) and therefore determine the expected value of \(\mathsf {wp}\llbracket C_2 \rrbracket \left( f \right) \) after executing \(C_1\). This gives the weakest preexpectation of \({C_1};\,{C_2}\) with respect to postexpectation f.
Finally, since \(\mathtt {repeat}\left\{ {C}\right\} \mathtt {until}\left( {\varphi }\right) \) is syntactic sugar for \({C};\,{\mathtt {while} \left( {\lnot \varphi }\right) \left\{ {C} \right\} }\), we simply define the weakest preexpectation of the former as the weakest preexpectation of the latter. Let us conclude our study of the effects of the \(\mathsf {wp}\) transformer by means of an example:
Example 2
Healthiness Conditions of wp. The \(\mathsf {wp}\) transformer enjoys some useful properties, sometimes called healthiness conditions [35]. Two of these healthiness conditions that we will heavily make use of are given below:
Theorem 1
3.3 The Expected Runtime Transformer
While for deterministic programs we can speak of the runtime of a program on a given input, the situation is different for probabilistic programs: For those we instead have to speak of the expected runtime (ERT). Notice that the ERT can be finite (even constant) while the program may still admit infinite executions. An example of this is the geometric loop in Example 1.
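To make the distinction between finite ERT and infinite executions concrete, the following sketch assumes a geometric loop of the form `while (c = 1) { c :≈ p·<0> + (1-p)·<1> }` started with `c = 1`. The concrete loop, the bias `p`, and the function names are assumptions made for illustration. The sketch counts loop iterations only, not the per-statement costs of a full runtime model.

```python
import random

def geometric_loop(p, rng):
    """Simulate the assumed loop from c = 1 and return the number of iterations."""
    c, iterations = 1, 0
    while c == 1:
        c = 0 if rng.random() < p else 1   # leave the loop with probability p
        iterations += 1
    return iterations

def expected_iterations(p, terms=200):
    """Truncated series sum_{n >= 1} n * (1 - p)^(n - 1) * p, which converges to 1 / p."""
    return sum(n * (1 - p) ** (n - 1) * p for n in range(1, terms + 1))

rng = random.Random(1)
runs = [geometric_loop(0.5, rng) for _ in range(100_000)]
empirical = sum(runs) / len(runs)   # close to 2 for p = 0.5
exact = expected_iterations(0.5)    # the series value, 1 / p up to truncation
```

The run in which the loop is never left is a valid execution, but it has probability 0 and therefore does not contribute to the expected value.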
A \(\mathsf {wp}\)–like transformer designed specifically for reasoning about ERTs is the \(\mathsf {ert}\) transformer [30]. Like \(\mathsf {wp}\), it is of type \(\mathsf {ert}\llbracket C \rrbracket :\mathbb {E}\rightarrow \mathbb {E}\) and it can be shown that \(\mathsf {ert}\llbracket C \rrbracket \left( 0 \right) (\sigma )\) is precisely the expected runtime of executing C on input \(\sigma \). More generally, if \(f:\varSigma \rightarrow \mathbb {R}_{\ge 0}^{\infty }\) measures the time that is needed after executing C (thus f is evaluated in the final states after termination of C), then \(\mathsf {ert}\llbracket C \rrbracket \left( f \right) (\sigma )\) is the expected time that is needed to run C on input \(\sigma \) and then let time f pass. For a more in–depth treatment of the \(\mathsf {ert}\) transformer, see [30, Sect. 3]. The transformer is defined as follows:
Definition 3
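Table 2. Rules for the \(\mathsf {ert}\)–transformer.
\(\varvec{C}\)  \(\mathsf {ert}\llbracket C \rrbracket \left( f \right) \)

\(\mathtt {skip}\)  \(1 + f\) 
\(\mathtt {diverge}\)  \(\infty \) 
\({x}\mathrel {:\approx }{\mu }\)  \(1 + \lambda \sigma .~\int _{\mathbb {Q}} \left( \lambda v.~f[x/v] \right) \,d\mu _\sigma \) 
\(\mathtt {if} \left( {\varphi } \right) \left\{ {C_1} \right\} \mathtt {else} \left\{ {C_2} \right\} \)  \(1 + \left[ {\varphi } \right] \cdot \mathsf {ert}\llbracket C_1 \rrbracket \left( f \right) + \left[ {\lnot \varphi } \right] \cdot \mathsf {ert}\llbracket C_2 \rrbracket \left( f \right) \) 
\({C_1};\,{C_2}\)  \(\mathsf {ert}\llbracket C_1 \rrbracket \left( \mathsf {ert}\llbracket C_2 \rrbracket \left( f \right) \right) \) 
\(\mathtt {while} \left( {\varphi }\right) \left\{ {C'} \right\} \)  \(\mathsf {lfp}\,X.~1 + \left[ {\lnot \varphi } \right] \cdot f + \left[ {\varphi } \right] \cdot \mathsf {ert}\llbracket C' \rrbracket \left( X \right) \) 
\(\mathtt {repeat}\left\{ {C'}\right\} \mathtt {until}\left( {\varphi }\right) \)  \(\mathsf {ert}\llbracket {C'};\,{\mathtt {while} \left( {\lnot \varphi }\right) \left\{ {C'} \right\} } \rrbracket \left( f \right) \) 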
The rules for \(\mathsf {ert}\) are very similar to the rules for \(\mathsf {wp}\). The runtime model we assume is that \(\mathtt {skip}\) statements, random assignments, and guard evaluations for both conditional choice and while loops cost one unit of time. This runtime model can easily be adapted to count only the number of loop iterations or only the number of random assignments, etc. We conclude with a strong connection between the \(\mathsf {wp}\) and the \(\mathsf {ert}\) transformer, that is crucial in our proofs:
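For loop–free programs, the runtime model just described can be turned into a small recursive evaluator. The following Python sketch is an illustration under assumptions: the tuple encoding of programs, the restriction to finite distributions given as `(probability, value)` pairs, and the function name `ert` are choices made here, not part of the calculus in [30].

```python
from fractions import Fraction

# Programs as nested tuples (an ad-hoc encoding assumed for this sketch):
#   ("skip",)
#   ("assign", x, [(p1, a1), ..., (pn, an)])   # x :≈ p1·<a1> + ... + pn·<an>
#   ("if", guard, C1, C2)                      # guard: state dict -> bool
#   ("seq", C1, C2)

def ert(prog, f):
    """Return the expected runtime of prog followed by post-runtime f,
    as a function from states (dicts) to numbers."""
    kind = prog[0]
    if kind == "skip":                          # one unit of time, state unchanged
        return lambda s: 1 + f(s)
    if kind == "assign":                        # one unit plus the expected value of f
        _, x, dist = prog
        return lambda s: 1 + sum(p * f({**s, x: a}) for p, a in dist)
    if kind == "if":                            # one unit for evaluating the guard
        _, guard, c1, c2 = prog
        e1, e2 = ert(c1, f), ert(c2, f)
        return lambda s: 1 + (e1(s) if guard(s) else e2(s))
    if kind == "seq":                           # run C2's analysis first, then C1's
        _, c1, c2 = prog
        return ert(c1, ert(c2, f))
    raise ValueError(f"unsupported construct: {kind}")

half = Fraction(1, 2)
prog = ("seq",
        ("assign", "x", [(half, 0), (half, 1)]),
        ("if", lambda s: s["x"] == 1,
               ("assign", "y", [(Fraction(1), 1)]),
               ("seq", ("skip",), ("skip",))))
runtime = ert(prog, lambda s: 0)({})            # Fraction(7, 2)
```

In this example, the branch taken for `x = 1` costs 2 units (guard plus assignment) and the other branch costs 3 units (guard plus two `skip` statements), so the result is 1 + 1/2 · 2 + 1/2 · 3 = 7/2. Loops are deliberately left out, since they require a fixed point computation or a dedicated proof rule.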
Theorem 2
4 Expected Runtimes of i.i.d. Loops
As a running example, consider the program \(C_{ circle }\) in Fig. 2. \(C_{ circle }\) samples a point within a circle with center (5, 5) and radius \(r=5\) uniformly at random using rejection sampling. In each iteration, it samples a point \((x,y) \in [0, \ldots , 10]^2\) within the square (with some fixed precision). The loop ensures that we resample if a sample is not located within the circle. Our proof rule will allow us to systematically determine the ERT of this loop, i.e. the average amount of time required until a single point within the circle is sampled.
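The acceptance probability of this rejection sampler can be computed by brute force, which gives a feeling for the quantity the proof rule will deliver in closed form. The sketch below assumes that both coordinates are drawn uniformly from an evenly spaced grid on [0, 10]; the grid resolution `steps_per_unit` and the function name are assumptions, and only loop iterations are counted.

```python
from fractions import Fraction

def acceptance_probability(steps_per_unit=10):
    """Fraction of grid points in [0, 10]^2 that lie inside the circle with
    center (5, 5) and radius 5. Coordinates are scaled to integers."""
    k = steps_per_unit
    n = 10 * k                                  # grid indices 0..n represent 0..10
    inside = sum(
        1
        for x in range(n + 1)
        for y in range(n + 1)
        if (x - 5 * k) ** 2 + (y - 5 * k) ** 2 <= (5 * k) ** 2
    )
    return Fraction(inside, (n + 1) ** 2)

p = acceptance_probability()
expected_iterations = 1 / p   # mean of a geometric distribution with success probability p
```

As the grid gets finer, `p` approaches the area ratio π/4 of circle to square, so roughly 1.27 sampling rounds are needed on average.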
Towards obtaining such a proof rule, we first present a syntactical notion of the i.i.d. property. It relies on expectations that are not affected by a \( \textsf {{pGCL}} \) program:
Definition 4
We are interested in expectations that are unaffected by \( \textsf {{pGCL}} \) programs because of a simple, yet useful observation: If g is unaffected by C, then g can be treated like a constant w.r.t. the transformer \(\mathsf {wp}\) (i.e. like the a in Theorem 1 (1)). For our running example \(C_{ circle }\) (see Fig. 2), the expectation Open image in new window is unaffected by the loop body \(C_{ body }\) of \(C_{ circle }\). Consequently, we have Open image in new window . In general, we obtain the following property:
Lemma 1 (Scaling by Unaffected Expectations)
Let \(C\in \textsf {{pGCL}} \) and \(f,g \in \mathbb {E}\). If g is unaffected by C, then \(\mathsf {wp}\llbracket C \rrbracket \left( g \cdot f \right) = g \cdot \mathsf {wp}\llbracket C \rrbracket \left( f \right) \).
Proof
By induction on the structure of C. See [3, Appendix A.2]. \(\square \)
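We develop a proof rule for loops \(\mathtt {while} \left( {\varphi }\right) \left\{ {C} \right\} \) that only requires that both the probability of the guard evaluating to true after one iteration of the loop body (i.e. \(\mathsf {wp}\llbracket C \rrbracket \left( \left[ {\varphi } \right] \right) \)) as well as the expected value of \(\left[ {\lnot \varphi } \right] \cdot f\) after one iteration (i.e. \(\mathsf {wp}\llbracket C \rrbracket \left( \left[ {\lnot \varphi } \right] \cdot f \right) \)) are unaffected by the loop body. We thus define the following: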
Definition 5
Example 3
Our main technical lemma is that we can express the orbit of the \(\mathsf {wp}\)–characteristic functional as a partial geometric series:
Lemma 2
Using this precise description of the \(\mathsf {wp}\) orbits, we now establish proof rules for f–i.i.d. loops, first for \(\mathsf {wp}\) and later for \(\mathsf {ert}\).
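The orbit of a characteristic functional can be inspected directly on a loop with finitely many states. The sketch below assumes the loop `while (x = 1) { x :≈ 1/2·<0> + 1/2·<1> }` over the two states `x = 0` and `x = 1`; the loop, the dictionary representation of expectations, and the function names are illustrative assumptions.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
STATES = (0, 1)  # possible values of the single variable x

def characteristic_functional(f):
    """Characteristic functional of the assumed loop for postexpectation f:
    outside the guard it returns f, inside the guard it returns the expected
    value of the argument after one execution of the loop body."""
    def F(X):
        body = HALF * X[0] + HALF * X[1]        # does not depend on the current x
        return {x: (body if x == 1 else f[x]) for x in STATES}
    return F

def orbit(f, n):
    """Return the first n + 1 elements of the orbit, starting from the zero expectation."""
    F = characteristic_functional(f)
    X = {x: Fraction(0) for x in STATES}
    result = [X]
    for _ in range(n):
        X = F(X)
        result.append(X)
    return result

one = {x: Fraction(1) for x in STATES}
values_at_x1 = [X[1] for X in orbit(one, 5)]    # 0, 0, 1/2, 3/4, 7/8, 15/16
```

For the constant postexpectation 1, the values at `x = 1` are the partial sums 1/2 + 1/4 + ..., which approach the termination probability 1. This is the kind of geometric pattern that a closed form for the orbit makes explicit.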
Theorem 3
Proof
We now derive a similar proof rule for the ERT of an f–i.i.d. loop \(\mathtt {while} \left( {\varphi }\right) \left\{ {C} \right\} \).
Theorem 4
 1. \(\mathtt {while} \left( {\varphi }\right) \left\{ {C} \right\} \) is f–i.i.d.
 2. \(\mathsf {wp}\llbracket C \rrbracket \left( 1 \right) = 1\) (loop body terminates almost–surely).
 3. \(\mathsf {ert}\llbracket C \rrbracket \left( 0 \right) \) is unaffected by C (every iteration runs in the same expected time).
Proof
5 A Programming Language for Bayesian Networks
So far we have derived proof rules for formal reasoning about expected outcomes and expected runtimes of i.i.d. loops (Theorems 3 and 4). In this section, we apply these results to develop a syntactic \( \textsf {{pGCL}} \) fragment that allows exact computations of closed forms of ERTs. In particular, no invariants, (super)martingales or fixed point computations are required.
After that, we show how BNs with observations can be translated into \( \textsf {{pGCL}} \) programs within this fragment. Consequently, we call our \( \textsf {{pGCL}} \) fragment the Bayesian Network Language. As a result of the above translation, we obtain a systematic and automatable approach to compute the expected sampling time of a BN in the presence of observations. That is, the expected time it takes to obtain a single sample that satisfies all observations.
5.1 The Bayesian Network Language
Programs in the Bayesian Network Language are organized as sequences of blocks. Every block is associated with a single variable, say x, and satisfies two constraints: First, no variable other than x is modified inside the block, i.e. occurs on the left–hand side of a random assignment. Second, every variable accessed inside of a guard has been initialized before. These restrictions ensure that there is no data flow across multiple executions of the same block. Thus, intuitively, all loops whose body is composed from blocks (as described above) are f–i.i.d. loops.
Definition 6
where \(x_i \in \textsf {Vars}\) is a program variable, all variables in \(\varphi \) have been initialized before, and \(B_{x_i}\) is a non–terminal parameterized with program variable \(x_i \in \textsf {Vars}\). That is, for all \(x_i \in \textsf {Vars}\) there is a non–terminal \(B_{x_i}\). Moreover, \(\psi \) is an arbitrary guard and \(\mu \) is a distribution expression of the form \(\mu = \sum _{j=1}^{n} p_j \cdot \langle a_j \rangle \) with \(a_j \in \mathbb {Q}\) for \(1 \le j \le n\).
Example 4
For any \(C \in \textsf {{BNL}} \), our goal is to compute the exact ERT of C, i.e. \(\mathsf {ert}\llbracket C \rrbracket \left( 0 \right) \). In case of loop–free programs, this amounts to a straightforward application of the \(\mathsf {ert}\) calculus presented in Sect. 3. To deal with loops, however, one in general has to perform fixed point computations or require user–supplied artifacts, e.g. invariants, supermartingales, etc. For \( \textsf {{BNL}} \) programs, on the other hand, it suffices to apply the proof rules developed in Sect. 4. As a result, we directly obtain an exact closed form solution for the ERT of a loop. This is a consequence of the fact that all loops in \( \textsf {{BNL}} \) are f–i.i.d., which we establish in the following.
By definition, every loop in \( \textsf {{BNL}} \) is of the form \(\mathtt {repeat}\left\{ {B_{x_{i}}}\right\} \mathtt {until}\left( {\psi }\right) \), which is equivalent to \({B_{x_{i}}};\,{\mathtt {while} \left( {\lnot \psi }\right) \left\{ {B_{x_{i}}} \right\} }\). Hence, we want to apply Theorem 4 to that while loop. Our first step is to discharge the theorem’s premises:
Lemma 3
 1. The expected value of g after executing \(\textit{Seq}\), i.e. \(\mathsf {wp}\llbracket \textit{Seq} \rrbracket \left( g \right) \), is unaffected by \(\textit{Seq}\).
 2. The ERT of \(\textit{Seq}\), i.e. \(\mathsf {ert}\llbracket \textit{Seq} \rrbracket \left( 0 \right) \), is unaffected by \(\textit{Seq}\).
 3. For every \(f \in \mathbb {E}\), the loop \(\mathtt {while} \left( {\lnot \psi }\right) \left\{ {\textit{Seq}} \right\} \) is f–i.i.d.
Proof
1. is proven by induction on the length of the sequence of blocks \(\textit{Seq}\) and 2. is a consequence of 1., see [3, Appendix A.6]. 3. follows immediately from 1. by instantiating g with \(\left[ {\lnot \psi } \right] \) and \(\left[ {\psi } \right] \cdot f\), respectively. \(\square \)
We are now in a position to derive a closed form for the ERT of loops in \( \textsf {{BNL}} \).
Theorem 5
Proof
Note that Theorem 5 holds for arbitrary postexpectations \(f \in \mathbb {E}\). This enables compositional reasoning about ERTs of \( \textsf {{BNL}} \) programs. Since all other rules of the \(\mathsf {ert}\)–calculus for loop–free programs amount to simple syntactical transformations (see Table 2), we conclude that
Corollary 1
For any \(C \in \textsf {{BNL}} \), a closed form for \(\mathsf {ert}\llbracket C \rrbracket \left( 0 \right) \) can be computed compositionally.
5.2 Bayesian Networks
To reason about expected sampling times of BNs, it remains to develop a sound translation from BNs with observations into equivalent \( \textsf {{BNL}} \) programs. A BN is a probabilistic graphical model that is given by a directed acyclic graph. Every node is a random variable and a directed edge between two nodes expresses a probabilistic dependency between these nodes.
In order to translate BNs into equivalent BNL programs, we need a formal representation first. Technically, we consider extended BNs in which nodes may additionally depend on inputs that are not represented by nodes in the network. This allows us to define a compositional translation without modifying conditional probability tables.
Towards a formal definition of extended BNs, we use the following notation. A tuple \((s_1,\ldots ,s_k) \in S^{k}\) of length k over some set S is denoted by \(\mathbf {s}\). The empty tuple is \(\mathbf {\varepsilon }\). Moreover, for \(1 \le i \le k\), the ith element of tuple \(\mathbf {s}\) is given by \(\mathbf {s}(i)\). To simplify the presentation, we assume that all nodes and all inputs are represented by natural numbers.
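Definition 7
An extended Bayesian network (EBN) is a tuple \(\mathcal {B}= (V,I,E,\textsf {Vals},\mathsf {dep},\mathsf {cpt})\), where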

\(V\subseteq \mathbb {N}\) and \(I\subseteq \mathbb {N}\) are finite disjoint sets of nodes and inputs.

\(E\subseteq V\times V\) is a set of edges such that \((V,E)\) is a directed acyclic graph.

\(\textsf {Vals}\) is a finite set of possible values that can be assigned to each node.

\(\mathsf {dep}:V\rightarrow (V\cup I)^{*}\) is a function assigning each node v to an ordered sequence of dependencies. That is, \(\mathsf {dep}(v) ~{}={}~(u_{1}, \ldots , u_{m})\) such that \(u_i < u_{i+1}\) (\(1 \le i < m\)). Moreover, every dependency \(u_j\) \((1 \le j \le m\)) is either an input, i.e. \(u_j \in I\), or a node with an edge to v, i.e. \(u_j \in V\) and \((u_j,v) \in E\).
 \(\mathsf {cpt}\) is a function mapping each node v to its conditional probability table \(\mathsf {cpt}[v]\). That is, for \(k = |\mathsf {dep}(v)|\), \(\mathsf {cpt}[v]\) is given by a function of the form $$\begin{aligned} \mathsf {cpt}[v] \,:\, \textsf {Vals}^{k} \rightarrow \textsf {Vals}\rightarrow [0,1] \quad \text {such that}\quad \sum _{a \in \textsf {Vals}} \mathsf {cpt}[v](\mathbf {z})(a) ~{}={}~1 \quad \text {for all } \mathbf {z} \in \textsf {Vals}^{k}. \end{aligned}$$ Here, the ith entry in a tuple \(\mathbf {z} \in \textsf {Vals}^{k}\) corresponds to the value assigned to the ith entry in the sequence of dependencies \(\mathsf {dep}(v)\).
A Bayesian network (BN) is an extended BN without inputs, i.e. \(I= \emptyset \). In particular, the dependency function is of the form \(\mathsf {dep}:V\rightarrow V^{*}\).
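The components of this definition map naturally onto a small data structure. The following Python sketch is one possible encoding, not the representation used by the authors' tool: the class name `ExtendedBN`, its field names, and the nested-dictionary layout of the tables are assumptions. It also assumes that the normalization condition means each row of a conditional probability table sums to 1.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ExtendedBN:
    nodes: frozenset      # V
    inputs: frozenset     # I
    edges: frozenset      # E, as pairs (u, v)
    vals: tuple           # Vals
    dep: dict             # node -> tuple of dependencies in increasing order
    cpt: dict             # node -> {tuple of dependency values: {value: probability}}

    def validate(self):
        assert not (self.nodes & self.inputs), "V and I must be disjoint"
        for v in self.nodes:
            deps = self.dep[v]
            assert list(deps) == sorted(set(deps)), "dependencies must be strictly increasing"
            for u in deps:
                # every dependency is an input or a node with an edge to v
                assert u in self.inputs or (u in self.nodes and (u, v) in self.edges)
            for z in product(self.vals, repeat=len(deps)):
                row = self.cpt[v][z]
                assert all(0 <= row[a] <= 1 for a in self.vals)
                assert abs(sum(row[a] for a in self.vals) - 1) < 1e-9, "each row sums to 1"
        self._check_acyclic()

    def _check_acyclic(self):
        # repeatedly remove nodes without incoming edges from the remaining nodes
        remaining, removed = set(self.nodes), True
        while remaining and removed:
            sources = {v for v in remaining
                       if not any((u, v) in self.edges for u in remaining)}
            removed = bool(sources)
            remaining -= sources
        assert not remaining, "(V, E) must be acyclic"

bn = ExtendedBN(
    nodes=frozenset({1, 2}), inputs=frozenset(), edges=frozenset({(1, 2)}),
    vals=(0, 1),
    dep={1: (), 2: (1,)},
    cpt={1: {(): {0: 0.3, 1: 0.7}},
         2: {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.6, 1: 0.4}}},
)
bn.validate()
```

Because `inputs` is empty, `bn` is an ordinary BN in the sense of the definition. A node without dependencies has a table with a single row, indexed by the empty tuple.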
Example 6
The formalization of our example BN (Fig. 3) is straightforward. For instance, the dependencies of variable G are given by \(\mathsf {dep}(G) = (D,P)\) (assuming D is encoded by an integer less than P). Furthermore, every entry in the conditional probability table of node G corresponds to an evaluation of the function \(\mathsf {cpt}[G]\). For example, if \(D = 1\), \(P = 0\), and \(G = 1\), we have \(\mathsf {cpt}[G](1,0)(1) = 0.4\).\(\triangle \)
Definition 8
5.3 From Bayesian Networks to BNL
We now develop a compositional translation from EBNs into BNL programs. Throughout this section, let \(\mathcal {B}= (V,I,E,\textsf {Vals},\mathsf {dep},\mathsf {cpt})\) be a fixed EBN. Moreover, with every node or input \(v \in V\cup I\) we associate a program variable \(x_{v}\).
We proceed in three steps: First, every node together with its dependencies is translated into a block of a \( \textsf {{BNL}} \) program. These blocks are then composed into a single \( \textsf {{BNL}} \) program that captures the whole BN. Finally, we implement conditioning by means of rejection sampling.
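The three steps can be sketched as a small program generator. The code below is a hedged illustration rather than the formal translation: it emits program text using the \(\textsf{pGCL}\) constructs of Sect. 3, and the concrete layout (one branch per table row, variables named `x1`, `x2`, ..., observations passed as a dictionary `cond` containing only the observed nodes) is an assumption of this sketch.

```python
def block(v, dep, cpt, vals):
    """Step 1: one block for node v that branches on the values of its dependencies
    and assigns x_v according to the matching row of the table."""
    def assign(row):
        terms = " + ".join(f"{row[a]}·<{a}>" for a in vals)
        return f"x{v} :≈ {terms}"
    rows = list(cpt[v].items())
    code = assign(rows[-1][1])                  # the last row becomes the final else branch
    for z, row in reversed(rows[:-1]):
        guard = " && ".join(f"x{u} = {a}" for u, a in zip(dep[v], z))
        code = f"if ({guard}) {{ {assign(row)} }} else {{ {code} }}"
    return code

def translate(order, dep, cpt, vals, cond):
    """Steps 2 and 3: compose the blocks in a topological order of the nodes,
    then wrap them in a repeat-until loop that rejects unwanted samples."""
    body = "; ".join(block(v, dep, cpt, vals) for v in order)
    if not cond:
        return body
    guard = " && ".join(f"x{v} = {a}" for v, a in sorted(cond.items()))
    return f"repeat {{ {body} }} until ({guard})"

dep = {1: (), 2: (1,)}
cpt = {1: {(): {0: 0.3, 1: 0.7}},
       2: {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.6, 1: 0.4}}}
program = translate([1, 2], dep, cpt, (0, 1), {2: 1})
```

The generated text grows by one branch per table row, which matches the intuition that the program size is governed by the total number of rows of all conditional probability tables.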
Example 7
Remark 1
Example 8
Consider, again, the BN \(\mathcal {B}\) depicted in Fig. 3. Moreover, assume we observe \(P = 1\). Hence, the conditioning function \(\textit{cond}\) is given by \(\textit{cond}(P) = 1\) and \(\textit{cond}(v) = \bot \) for \(v \in \{D,G,M\}\). Then the translation of \(\mathcal {B}\) and \(\textit{cond}\), i.e. \(\textit{BNL}(\mathcal {B},\textit{cond})\), is the \( \textsf {{BNL}} \) program \(C_{\textit{mood}}\) depicted in Fig. 4.\(\triangle \)
Since our translation yields a \( \textsf {{BNL}} \) program for any given BN, we can compositionally compute a closed form for the expected simulation time of a BN. This is an immediate consequence of Corollary 1.
We still have to prove, however, that our translation is sound, i.e. the conditional joint probabilities inferred from a BN coincide with the (conditional) joint probabilities from the corresponding \( \textsf {{BNL}} \) program. Formally, we obtain the following soundness result.
Theorem 6 (Soundness of Translation)
Proof
Without conditioning, i.e. \(O= \emptyset \), the proof proceeds by induction on the number of nodes of \(\mathcal {B}\). With conditioning, we additionally apply Theorems 3 and 5 to deal with loops introduced by observed nodes. See [3, Appendix A.7]. \(\square \)
Example 9 (Expected Sampling Time of a BN)
6 Implementation
We implemented a prototype in Java to analyze expected sampling times of Bayesian networks. More concretely, our tool takes as input a BN together with observations in the popular Bayesian Network Interchange Format.^{6} The BN is then translated into a BNL program as shown in Sect. 5. Our tool applies the \(\mathsf {ert}\)–calculus together with our proof rules developed in Sect. 4 to compute the exact expected runtime of the BNL program.
The size of the resulting BNL program is linear in the total number of rows of all conditional probability tables in the BN. The program size is thus not the bottleneck of our analysis. As we are dealing with an NP–hard problem [12, 13], it is not surprising that our algorithm has a worst–case exponential time complexity. The space complexity of our algorithm is, however, also exponential in the worst case: As an expectation is propagated backwards through an \(\texttt {if}\)–clause of the BNL program, its size may multiply. This is also why our analysis runs out of memory on some benchmarks.
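To see where this blow–up comes from, it helps to recall the rule for conditionals in the \(\mathsf {ert}\)–calculus. The form below is the usual one from [30] and is quoted here as an assumption about the exact notation: \(f\) denotes the expectation being propagated backwards, \(\xi\) the guard, and \([\xi ]\) its indicator function.

\[ \mathsf {ert}\left[ \texttt {if}\,(\xi )\,\{C_1\}\,\texttt {else}\,\{C_2\}\right] (f) ~=~ 1 + [\xi ] \cdot \mathsf {ert}\left[ C_1\right] (f) + [\lnot \xi ] \cdot \mathsf {ert}\left[ C_2\right] (f) \]

The expectation \(f\) occurs once per branch on the right–hand side. Unless the two copies can be simplified, propagating an expectation backwards through \(k\) conditionals in sequence can therefore yield up to \(2^k\) copies of it.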
We evaluated our implementation on the largest BNs in the Bayesian Network Repository [46], which consists, to a large extent, of real–world BNs including expert systems for, e.g., electromyography (munin) [2], hematopathology diagnosis (hepar2) [42], weather forecasting (hailfinder) [1], and printer troubleshooting in Windows 95 (win95pts) [45, Sect. 5.6.2]. For an evaluation of all BNs in the repository, we refer to the extended version of this paper [3, Sect. 6].
All experiments were performed on an HP BL685C G7. Although up to 48 cores with 2.0 GHz were available, only one core was used apart from Java’s garbage collection. The Java virtual machine was limited to 8 GB of RAM.
Experimental results. Time is in seconds. MO denotes out of memory.

| BN (#nodes, #edges, avg. Markov Blanket) | #obs | Time | EST | #obs | Time | EST | #obs | Time | EST |
|---|---|---|---|---|---|---|---|---|---|
| hailfinder (56, 66, 3.54) | 0 | 0.23 | \(9.500 \cdot 10^1\) | 5 | 0.63 | \(5.016 \cdot 10^5\) | 9 | 0.46 | \(9.048 \cdot 10^6\) |
| hepar2 (70, 123, 4.51) | 0 | 0.22 | \(1.310 \cdot 10^2\) | 1 | 1.84 | \(1.579 \cdot 10^2\) | 2 | MO | – |
| win95pts (76, 112, 5.92) | 0 | 0.20 | \(1.180 \cdot 10^2\) | 1 | 0.36 | \(2.284 \cdot 10^3\) | 3 | 0.36 | \(4.296 \cdot 10^5\) |
| | 7 | 0.91 | \(1.876 \cdot 10^6\) | 12 | 0.42 | \(3.973 \cdot 10^7\) | 17 | 61.73 | \(1.110 \cdot 10^{15}\) |
| pathfinder (135, 200, 3.04) | 0 | 0.37 | 217 | 1 | 0.53 | \(1.050 \cdot 10^4\) | 3 | 31.31 | \(2.872 \cdot 10^4\) |
| | 5 | MO | – | 7 | 5.44 | \(\infty \) | 7 | 480.83 | \(\infty \) |
| andes (223, 338, 5.61) | 0 | 0.46 | \(3.570 \cdot 10^2\) | 1 | MO | – | 3 | 1.66 | \(5.251 \cdot 10^3\) |
| | 5 | 1.41 | \(9.862 \cdot 10^3\) | 7 | 0.99 | \(8.904 \cdot 10^4\) | 9 | 0.90 | \(6.637 \cdot 10^5\) |
| pigs (441, 592, 3.66) | 0 | 0.57 | \(7.370 \cdot 10^2\) | 1 | 0.74 | \(2.952 \cdot 10^3\) | 3 | 0.88 | \(2.362 \cdot 10^3\) |
| | 5 | 0.85 | \(1.260 \cdot 10^5\) | 7 | 1.02 | \(1.511 \cdot 10^6\) | 8 | MO | – |
| munin (1041, 1397, 3.54) | 0 | 1.29 | \(1.823 \cdot 10^3\) | 1 | 1.47 | \(3.648 \cdot 10^4\) | 3 | 1.37 | \(1.824 \cdot 10^7\) |
| | 5 | 1.43 | \(\infty \) | 9 | 1.79 | \(1.824 \cdot 10^{16}\) | 10 | 65.64 | \(1.153 \cdot 10^{18}\) |

Observations were picked at random. Note that the time required by our prototype varies depending on both the number of observed nodes and the actual observations. Thus, there are cases in which we run out of memory although the total number of observations is small.
To relate the EST to actual execution times on a real machine, we also performed simulations for the win95pts network. More precisely, we generated Java programs from this network analogously to the translation in Sect. 5. From these runs we estimate that our Java setup executes about \(9.714\cdot 10^6\) steps (in terms of EST) per second.
For win95pts with 17 observations, an EST of \(1.11 \cdot 10^{15}\) then corresponds to an expected time of approximately 3.6 years to obtain a single valid sample. We were additionally able to find a case with 13 observed nodes for which our tool computed, within 0.32 s, an EST that corresponds to approximately 4.3 million years. In contrast, exact inference using variable elimination was almost instantaneous. This demonstrates that knowing expected sampling times upfront can indeed be beneficial when selecting an inference method.
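The conversion from EST to wall–clock time can be retraced with a few lines of Python. The throughput and the EST are the values reported above; the function name and the length of a year (a Julian year of 365.25 days) are assumptions of this sketch.

```python
# Values reported in the text: measured throughput and EST for win95pts, 17 observations.
STEPS_PER_SECOND = 9.714e6
EST_WIN95PTS_17_OBS = 1.11e15

# Assumption of this sketch: one year is a Julian year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def est_to_years(est, steps_per_second=STEPS_PER_SECOND):
    """Convert an expected sampling time (in steps) into years of wall-clock time."""
    return est / steps_per_second / SECONDS_PER_YEAR


years_17_obs = est_to_years(EST_WIN95PTS_17_OBS)
print(f"{years_17_obs:.1f} years")  # prints "3.6 years"
```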
7 Conclusion
We presented a syntactic notion of independent and identically distributed probabilistic loops and derived dedicated proof rules to determine exact expected outcomes and runtimes of such loops. These rules do not require any user–supplied information, such as invariants, (super)martingales, etc.
Moreover, we isolated a syntactic fragment of probabilistic programs for which expected runtimes can be computed in a highly automatable fashion. This fragment is non–trivial: We show that all Bayesian networks can be translated into programs within this fragment. Hence, we obtain an automated formal method for computing expected simulation times of Bayesian networks. We implemented this method and successfully applied it to various real–world BNs that stem from, amongst others, medical applications. Remarkably, our tool was capable of proving extremely large expected sampling times within seconds.
There are several directions for future work: For example, there exist subclasses of BNs for which exact inference is in \(\textsf {P}\), e.g. polytrees. Are there analogies for probabilistic programs? Moreover, it would be interesting to consider more complex graphical models, such as recursive BNs [16].
Footnotes
 1.
An example of a program that is not expressible in BNL is given in Example 1.
 2.
Positive almost–sure termination means termination in finite expected time [5].
 3.
We use \(\lambda \)–expressions to construct functions: Function \(\lambda X.\, \epsilon \) applied to an argument \(\alpha \) evaluates to \(\epsilon \) in which every occurrence of X is replaced by \(\alpha \).
 4.
This counting is also the reason that \(C_{ geo }\) is an example of a program that is not expressible in the BNL language we present later.
 5.
W is downward closed if \(v \in W\) and \((u,v) \in E\) implies \(u \in W\).
 6.
References
 1. Abramson, B., Brown, J., Edwards, W., Murphy, A., Winkler, R.L.: Hailfinder: a Bayesian system for forecasting severe weather. Int. J. Forecast. 12(1), 57–71 (1996)
 2. Andreassen, S., Jensen, F.V., Andersen, S.K., Falck, B., Kjærulff, U., Woldbye, M., Sørensen, A., Rosenfalck, A., Jensen, F.: MUNIN: an expert EMG Assistant. In: Computer-Aided Electromyography and Expert Systems, pp. 255–277. Pergamon Press (1989)
 3. Batz, K., Kaminski, B.L., Katoen, J., Matheja, C.: How long, O Bayesian network, will I sample thee? arXiv extended version (2018)
 4. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
 5. Bournez, O., Garnier, F.: Proving positive almost-sure termination. In: Giesl, J. (ed.) RTA 2005. LNCS, vol. 3467, pp. 323–337. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-32033-3_24
 6. Brázdil, T., Kiefer, S., Kucera, A., Vareková, I.H.: Runtime analysis of probabilistic programs with unbounded recursion. J. Comput. Syst. Sci. 81(1), 288–310 (2015)
 7. Celiku, O., McIver, A.: Compositional specification and analysis of cost-based properties in probabilistic programs. In: Fitzgerald, J., Hayes, I.J., Tarlecki, A. (eds.) FM 2005. LNCS, vol. 3582, pp. 107–122. Springer, Heidelberg (2005). https://doi.org/10.1007/11526841_9
 8. Chakarov, A., Sankaranarayanan, S.: Probabilistic program analysis with martingales. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 511–526. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39799-8_34
 9. Chatterjee, K., Fu, H., Novotný, P., Hasheminezhad, R.: Algorithmic analysis of qualitative and quantitative termination problems for affine probabilistic programs. In: POPL, pp. 327–342. ACM (2016)
 10. Chatterjee, K., Novotný, P., Zikelic, D.: Stochastic invariants for probabilistic termination. In: POPL, pp. 145–160. ACM (2017)
 11. Constantinou, A.C., Fenton, N.E., Neil, M.: pi-football: a Bayesian network model for forecasting association football match outcomes. Knowl. Based Syst. 36, 322–339 (2012)
 12. Cooper, G.F.: The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell. 42(2–3), 393–405 (1990)
 13. Dagum, P., Luby, M.: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 60(1), 141–153 (1993)
 14. Dijkstra, E.W.: Guarded commands, nondeterminacy and formal derivation of programs. Commun. ACM 18(8), 453–457 (1975)
 15. Dijkstra, E.W.: A Discipline of Programming. Prentice-Hall, Upper Saddle River (1976)
 16. Etessami, K., Yannakakis, M.: Recursive Markov chains, stochastic grammars, and monotone systems of nonlinear equations. JACM 56(1), 1:1–1:66 (2009)
 17. Fioriti, L.M.F., Hermanns, H.: Probabilistic termination: soundness, completeness, and compositionality. In: POPL, pp. 489–501. ACM (2015)
 18. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. In: RECOMB, pp. 127–135. ACM (2000)
 19. Frohn, F., Naaf, M., Hensel, J., Brockschmidt, M., Giesl, J.: Lower runtime bounds for integer programs. In: Olivetti, N., Tiwari, A. (eds.) IJCAR 2016. LNCS (LNAI), vol. 9706, pp. 550–567. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40229-1_37
 20. Goodman, N.D.: The principles and practice of probabilistic programming. In: POPL, pp. 399–402. ACM (2013)
 21. Goodman, N.D., Mansinghka, V.K., Roy, D.M., Bonawitz, K., Tenenbaum, J.B.: Church: A language for generative models. In: UAI, pp. 220–229. AUAI Press (2008)
 22. Gordon, A.D., Graepel, T., Rolland, N., Russo, C.V., Borgström, J., Guiver, J.: Tabular: a schema-driven probabilistic programming language. In: POPL, pp. 321–334. ACM (2014)
 23. Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: Future of Software Engineering, pp. 167–181. ACM (2014)
 24. Heckerman, D.: A tutorial on learning with Bayesian networks. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Bayesian Networks. Studies in Computational Intelligence, vol. 156, pp. 33–82. Springer, Heidelberg (2008)
 25. Hehner, E.C.R.: A probability perspective. Formal Aspects Comput. 23(4), 391–419 (2011)
 26. Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)
 27. Jiang, X., Cooper, G.F.: A Bayesian spatiotemporal method for disease outbreak detection. JAMIA 17(4), 462–471 (2010)
 28. Kaminski, B.L., Katoen, J.P.: On the hardness of almost–sure termination. In: Italiano, G.F., Pighizzini, G., Sannella, D.T. (eds.) MFCS 2015. LNCS, vol. 9234, pp. 307–318. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48057-1_24
 29. Kaminski, B.L., Katoen, J.: A weakest pre-expectation semantics for mixed-sign expectations. In: LICS (2017)
 30. Kaminski, B.L., Katoen, J.P., Matheja, C., Olmedo, F.: Weakest precondition reasoning for expected run–times of probabilistic programs. In: Thiemann, P. (ed.) ESOP 2016. LNCS, vol. 9632, pp. 364–389. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49498-1_15
 31. Koller, D., Friedman, N.: Probabilistic Graphical Models - Principles and Techniques. MIT Press, Cambridge (2009)
 32. Kozen, D.: Semantics of probabilistic programs. J. Comput. Syst. Sci. 22(3), 328–350 (1981)
 33. Kozen, D.: A probabilistic PDL. J. Comput. Syst. Sci. 30(2), 162–178 (1985)
 34. Lassez, J.L., Nguyen, V.L., Sonenberg, L.: Fixed point theorems and semantics: a folk tale. Inf. Process. Lett. 14(3), 112–116 (1982)
 35. McIver, A., Morgan, C.: Abstraction, Refinement and Proof for Probabilistic Systems. Springer, New York (2004). http://doi.org/10.1007/b138392
 36. Minka, T., Winn, J.: Infer.NET (2017). http://infernet.azurewebsites.net/. Accessed Oct 17
 37. Minka, T., Winn, J.M.: Gates. In: NIPS, pp. 1073–1080. Curran Associates (2008)
 38. Monniaux, D.: An abstract analysis of the probabilistic termination of programs. In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 111–126. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-47764-0_7
 39. Neapolitan, R.E., Jiang, X.: Probabilistic Methods for Financial and Marketing Informatics. Morgan Kaufmann, Burlington (2010)
 40. Nori, A.V., Hur, C., Rajamani, S.K., Samuel, S.: R2: an efficient MCMC sampler for probabilistic programs. In: AAAI, pp. 2476–2482. AAAI Press (2014)
 41. Olmedo, F., Kaminski, B.L., Katoen, J., Matheja, C.: Reasoning about recursive probabilistic programs. In: LICS, pp. 672–681. ACM (2016)
 42. Onisko, A., Druzdzel, M.J., Wasyluk, H.: A probabilistic causal model for diagnosis of liver disorders. In: Proceedings of the Seventh International Symposium on Intelligent Information Systems (IIS98), pp. 379–387 (1998)
 43. Pearl, J.: Bayesian networks: a model of self-activated memory for evidential reasoning. In: Proceedings of CogSci, pp. 329–334 (1985)
 44. Pfeffer, A.: Figaro: an object-oriented probabilistic programming language. Charles River Analytics Technical Report 137, 96 (2009)
 45. Ramanna, S., Jain, L.C., Howlett, R.J.: Emerging Paradigms in Machine Learning. Springer, Heidelberg (2013)
 46. Scutari, M.: Bayesian Network Repository (2017). http://www.bnlearn.com
 47. Sharir, M., Pnueli, A., Hart, S.: Verification of probabilistic programs. SIAM J. Comput. 13(2), 292–314 (1984)
 48. Wood, F., van de Meent, J., Mansinghka, V.: A new approach to probabilistic programming inference. In: JMLR Workshop and Conference Proceedings, AISTATS, vol. 33, pp. 1024–1032 (2014). JMLR.org
 49. Yuan, C., Druzdzel, M.J.: Importance sampling algorithms for Bayesian networks: principles and performance. Math. Comput. Model. 43(9–10), 1189–1207 (2006)
 50. Zweig, G., Russell, S.J.: Speech recognition with dynamic Bayesian networks. In: AAAI/IAAI, pp. 173–180. AAAI Press/The MIT Press (1998)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.