Bayesian strategies: probabilistic programs as generalised graphical models

We introduce Bayesian strategies, a new interpretation of probabilistic programs in game semantics. This interpretation can be seen as a refinement of Bayesian networks. Bayesian strategies are based on a new form of event structure, with two causal dependency relations respectively modelling control flow and data flow. This gives a graphical representation for probabilistic programs which resembles the concrete representations used in modern implementations of probabilistic programming. From a theoretical viewpoint, Bayesian strategies provide a rich setting for denotational semantics. To demonstrate this we give a model for a general higher-order programming language with recursion, conditional statements, and primitives for sampling from continuous distributions and trace re-weighting. This is significant because Bayesian networks do not easily support higher-order functions or conditionals.


Introduction
One promise of probabilistic programming languages (PPLs) is to make Bayesian statistics accessible to anyone with a programming background. In a PPL, the programmer can express complex statistical models clearly and precisely, and they additionally gain access to the set of inference tools provided by the probabilistic programming system, which they can use for simulation, data analysis, etc. Such tools are usually designed so that the user does not require any in-depth knowledge of Bayesian inference algorithms.
A challenge for language designers is to provide efficient inference algorithms. This can be intricate, because programs can be arbitrarily complex, and inference requires a close interaction between the inference engine and the language interpreter [42, Ch. 6]. In practice, many modern inference engines do not manipulate the program syntax directly but instead exploit some representation of it, more suited to the type of inference method at hand (Metropolis-Hastings (MH), Sequential Monte Carlo (SMC), Hamiltonian Monte Carlo, variational inference, etc.).
While many authors have recently given proofs of correctness for inference algorithms (see for example [11,24,32]), most have focused on idealised descriptions of the algorithms, based on syntax or operational semantics, rather than on the concrete program representations used in practice. In this paper we instead put forward a mathematical semantics for probabilistic programs designed to provide reasoning tools for existing implementations of inference.
Our work targets a specific class of representations which we call data flow representations. We understand data flow as describing the dependence relationships between random variables of a program. This is in contrast with control flow, which describes in what order samples are performed. Such data flow representations are widely used in practice. We give a few examples. For Metropolis-Hastings inference, Church [30] and Venture [41] manipulate dependency graphs for random variables ("computation traces" or "probabilistic execution traces"); Infer.NET [22] compiles programs to factor graphs in order to apply message passing algorithms; for a subset of well-behaved programs, Gen [23] statically constructs a representation based on certain combinators which is then exploited by a number of inference algorithms; and finally, for variational inference, Pyro [9] and Edward [55] rely on data flow graphs for efficient computation of gradients by automatic differentiation. (Also [52,28].) In this paper, we make a step towards correctness of these implementations and introduce Bayesian strategies, a new representation based on Winskel's event structures [46] which tracks both data flow and control flow. The Bayesian strategy corresponding to a program is obtained compositionally as is standard in concurrent game semantics [63], and provides an intensional foundation for probabilistic programs, complementary to existing approaches [24,57]. This paper was inspired by the pioneering work of Ścibior et al. [53], which provides the first denotational analysis for concrete inference representations. In particular, their work provides a general framework for proving correct inference algorithms based on static representations. But the authors do not show how their framework can be used to accommodate data flow representations or verify any of the concrete implementations mentioned above.
The work of this paper does not fill this gap, as we make no attempt to connect our semantic constructions with those of [53], or indeed to prove correct any inference algorithms. This could be difficult, because our presentation arises out of previous work on game semantics and thus does not immediately fit in with the monadic techniques employed in [53]. Nonetheless, efforts to construct game semantics monadically are underway [14], and it is hoped that the results presented here will set the ground for the development of event structure-based validation of inference.

From Bayesian networks to Bayesian strategies
Consider the following basic model, found in the Pyro tutorials (and also used in [39]), used to infer the weight of an object based on two noisy measurements. The measurements are represented by random variables meas1 and meas2, whose values are drawn from a normal distribution around the true weight (weight), whose prior distribution is also normal, and centered at 2. (In this situation, meas1 and meas2 are destined to be conditioned on actual observed values, and the problem is then to infer the posterior distribution of weight based on these observations. We leave out conditioning in this example and focus on the model specification.) To describe this model it is convenient to use a Bayesian network, i.e. a DAG of random variables in which the distribution of each variable depends only on the value of its parents. The same probabilistic model can be encoded in an ML-style language:

let weight = sample weight normal(2, 1) in
sample meas1 normal(weight, 0.1);
sample meas2 normal(weight, 0.1);
()

Our choice of sampling meas1 before meas2 is arbitrary: the same program with the second and third lines swapped corresponds to the same probabilistic model. This redundancy is unavoidable because programs are inherently sequential. It is the purpose of "commutative" semantics for probabilistic programs, as introduced by Staton et al. [54,57], to clarify this situation. They show that reordering program lines does not change the semantics, even in the presence of conditioning. This result says that when specifying a probabilistic model, only data flow matters, and not control flow. This motivates the use of program representations based on data flow such as the examples listed above. In our game semantics, a probabilistic program is interpreted as a control flow graph annotated by a data dependency relation.
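The generative process described by this network and program can be sketched in Python using only the standard library. This is an illustrative aside, not the paper's semantics; the function name `weigh_model` is ours.

```python
import random

def weigh_model(rng: random.Random) -> dict:
    """Ancestral sampling of the noisy-scales model: weight ~ N(2, 1),
    meas1, meas2 ~ N(weight, 0.1), matching the Bayesian network above."""
    weight = rng.gauss(2.0, 1.0)    # prior on the true weight
    meas1 = rng.gauss(weight, 0.1)  # first noisy measurement
    meas2 = rng.gauss(weight, 0.1)  # second noisy measurement
    return {"weight": weight, "meas1": meas1, "meas2": meas2}

# Swapping the two measurement lines changes control flow but not the
# joint distribution: meas1 and meas2 depend only on weight (data flow).
```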
The Bayesian strategy associated with the program above is as follows: in brief, one arrow relation represents data flow, the other control flow, and the dashed node is the program output. (Probability distributions are as in the Bayesian network.) The semantics is not commutative, simply because reordering lines affects control flow; we emphasise that the point of this work is not to prove any new program equations, but instead to provide a formal framework for the representations involved in practical inference settings.

Our approach
To formalise this idea we use event structures, which naturally model control flow, enriched with additional structure for probability and an explicit data flow relation. Event structures were used in previous work by the author and Castellan on probabilistic programming [18], and were shown to be a good fit for reasoning about MH inference. But the representation in [18] combines data flow and control flow in a single transitive relation, and thus suffers from important limitations. The present paper is a significant improvement: by maintaining a clear separation between control flow and data flow, we can reframe the ideas in the well-established area of concurrent game semantics [63], which enables an interpretation of recursion and higher-order functions; these were not considered in [18]. Additionally, here we account for the fact that data flow in probabilistic programming is not in general a transitive relation.
While there is some work in setting up the right notion of event structure, the standard methods of concurrent game semantics adapt well to this setting. This is not surprising, as event structures and games are known to be resistant to the addition of extra structure, see e.g. [21,5,15]. One difficulty is to correctly define composition, keeping track of potential hidden data dependencies. In summary:
- We introduce a general notion of Bayesian event structure, modelling control flow, data flow, and probability.
- We set up a compositional framework for these event structures based on concurrent games. Specifically, we define a category BG of arenas and Bayesian strategies, and give a description of its abstract properties.
- We give a denotational semantics for a higher-order statistical language. Our semantics gives an operationally intuitive representation for programs and their data flow structure, while only relying on standard mathematical tools.
Paper outline. We start by recalling the basics of probability and Bayesian networks, and we then describe the syntax of our language (Sec. 2). In Sec. 3, we introduce event structures and Bayesian event structures, and informally describe our semantics using examples. In Sec. 4 we define our category of arenas and strategies, which we apply to the denotational semantics of the language in Sec. 5. We give some context and perspectives in Sec. 6.
Acknowledgements. I am grateful to Simon Castellan, Mathieu Huot and Philip Saville for helpful comments on early versions of this paper. This work was supported by grants from EPSRC and the Royal Society.

Probability and measure
We recall the basic notions, see e.g. [8] for a reference.

Measures.
A measurable space is a set X equipped with a σ-algebra, that is, a set Σ_X of subsets of X containing X itself, and closed under complements and countable unions. The elements of Σ_X are called measurable subsets of X. An important example of measurable space is the set R equipped with its σ-algebra Σ_R of Borel sets, the smallest one containing all intervals. Another basic example is the discrete space N, in which all subsets are measurable. A measure on (X, Σ_X) is a function μ : Σ_X → [0, ∞] which is countably additive, i.e. μ(⋃_{i∈I} U_i) = Σ_{i∈I} μ(U_i) for I countable and the U_i pairwise disjoint, and satisfies μ(∅) = 0. A fundamental example is the Lebesgue measure λ on R, defined on intervals as λ([a, b]) = b − a and extended to all Borel sets. Another example (for arbitrary X) is the Dirac measure δ_x at a point x ∈ X: for any U ∈ Σ_X, δ_x(U) = 1 if x ∈ U, and 0 otherwise. Given a measure μ on a space X and a measurable function f : X → R, for every measurable subset U of X we can define the integral ∫_U f dμ, an element of R ∪ {∞}. For non-negative f, the assignment U ↦ ∫_U f dμ yields a measure on X, the measure with density f. (Many well-known probability distributions on the reals arise in this way from their density.) Kernels. We will make extensive use of kernels, which can be seen as parametrised families of measures. Formally a kernel from X to Y is a map k : X × Σ_Y → [0, ∞] such that for every x ∈ X, k(x, −) is a measure on Y, and for every V ∈ Σ_Y, k(−, V) is a measurable function. It is a sub-probability kernel if each k(x, −) is a sub-probability measure, and it is an s-finite kernel if it is a countable (pointwise) sum of sub-probability kernels. Kernels compose by integration: for k from X to Y and h from Y to Z, (h ∘ k)(x, W) = ∫_Y h(y, W) k(x, dy). Every measurable function is also a kernel, and the Dirac kernel δ_id (often just δ) is an identity for this composition. We note that if both h and k are sub-probability kernels, then h ∘ k is a sub-probability kernel. Finally, observe that a kernel from 1 to X, for 1 a singleton space, is the same thing as a measure on X.
In this paper we will refer to the bernoulli, normal, and uniform families of distributions; all of these are sub-probability kernels from their parameter spaces to N or R. For example, there is a kernel from R² to R given by ((x, y), U) ↦ μ_{N(x,y)}(U), where μ_{N(x,y)} is the measure associated with a normal distribution with parameters (x, y), if y > 0, and the zero measure otherwise. We understand the bernoulli distribution as returning either 0 or 1 ∈ N.
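As an illustration, the normal kernel just described can be rendered as a map from parameters and an interval to a measure value. The following Python sketch (our own encoding, with the interval measure approximated by midpoint integration of the density) returns the zero measure for invalid parameters, as above:

```python
import math

def normal_density(x, mean, sd):
    """Density of N(mean, sd) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def normal_kernel(params, interval, n=20000):
    """Sub-probability kernel from R^2 to R: ((mean, sd), U) -> N(mean, sd)(U),
    and the zero measure when sd <= 0.  The measure of an interval U = (a, b)
    is approximated by a midpoint Riemann sum of the density."""
    mean, sd = params
    if sd <= 0:
        return 0.0  # zero measure for invalid parameters
    a, b = interval
    step = (b - a) / n
    return sum(normal_density(a + (i + 0.5) * step, mean, sd) for i in range(n)) * step
```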
Product spaces and independence. When several random quantities are under study one uses the notion of product space: given (X, Σ_X) and (Y, Σ_Y) we can equip the set X × Y with the product σ-algebra, written Σ_{X×Y}, defined as the smallest one containing all rectangles U × V, for U ∈ Σ_X and V ∈ Σ_Y.
A measure μ on X × Y gives rise to marginals μ_X and μ_Y, measures on X and Y respectively, defined by μ_X(U) = μ(U × Y) and μ_Y(V) = μ(X × V). Given kernels k from X to Y and h from Z to W we define the product kernel k × h from X × Z to Y × W via iterated integration: (k × h)((x, z), U) = ∫_Y ∫_W χ_U(y, w) h(z, dw) k(x, dy), where χ_U is the characteristic function of U ∈ Σ_{Y×W}. When X = Z = 1 this gives the notion of product measure.
The definitions above extend with no difficulty to product spaces ∏_{i∈I} X_i. A measure P on ∏_{i∈I} X_i has marginals P_J for any J ⊆ I, and we say that X_i and X_j are independent w.r.t. P if the marginal P_{i,j} is equal to the product measure P_i × P_j.
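For a finite discrete joint distribution, marginals and the independence check can be computed directly. The following Python sketch (our own encoding of a joint distribution on a binary product as a dictionary) illustrates the definitions:

```python
from itertools import product

def marginal(joint, axis):
    """Marginal of a joint distribution given as {(x, y): prob}:
    sum out the other component."""
    m = {}
    for (x, y), p in joint.items():
        k = (x, y)[axis]
        m[k] = m.get(k, 0.0) + p
    return m

def is_independent(joint, tol=1e-9):
    """Check that the joint equals the product of its marginals."""
    mx, my = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint.get((x, y), 0.0) - mx[x] * my[y]) <= tol
               for x, y in product(mx, my))
```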

Bayesian networks
An efficient way to define measures on product spaces is using probabilistic graphical models [37], for example Bayesian networks, whose definition we briefly recall now. The idea is to use a graph structure to encode a set of independence constraints between the components of a product space. We recall the definition of conditional independence. With respect to a joint distribution P on ∏_{i∈I} X_i, we say X_i and X_j are conditionally independent given X_k if there exists a kernel k from X_k to X_i × X_j which is a conditional distribution of X_i × X_j given X_k (w.r.t. P) and whose measures k(x, −) are product measures. Under some reasonable conditions [8] a conditional distribution always exists, and the independence condition is the main requirement.
Adapting the presentation used in [27], we define a Bayesian network as a directed acyclic graph G = (V, →) where each node v ∈ V is assigned a measurable space M(v). We define the parents pa(v) of v to be the set of nodes u with u → v, and its non-descendants nd(v) to contain the nodes u such that there is no path v → · · · → u. Writing M(S) = ∏_{v∈S} M(v) for any subset S ⊆ V, a measure P on M(V) is said to be compatible with G if for every v ∈ V, M(v) and M(nd(v)) are independent given M(pa(v)). It is straightforward to verify that given a Bayesian network G, we can construct a compatible measure by supplying, for every v ∈ V, an s-finite kernel k_v from M(pa(v)) to M(v). (In practice, Bayesian networks are used to represent probabilistic models, and so typically every kernel k_v is strictly probabilistic. Here the k_v are only required to be s-finite, so they are in general unnormalised. As we will see, this is because we consider possibly conditioned models.) Bayesian networks are an elegant way of constructing models, but they are limited. We now present a programming language whose expressivity goes beyond them.
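When every kernel k_v is probabilistic, a compatible measure can be realised by ancestral sampling along a topological order. The following Python sketch (the dict-based encoding and function names are ours, not the paper's) instantiates this for the noisy-scales network of the introduction:

```python
import random

def ancestral_sample(nodes, parents, kernels, rng):
    """Sample a Bayesian network: `nodes` lists V in topological order,
    `parents` maps a node to its tuple of parents, and `kernels[v]` maps
    parent values to a sample, playing the role of k_v : M(pa(v)) -> M(v)."""
    values = {}
    for v in nodes:
        args = tuple(values[u] for u in parents[v])
        values[v] = kernels[v](args, rng)
    return values

# The noisy-scales network from the introduction:
nodes = ["weight", "meas1", "meas2"]
parents = {"weight": (), "meas1": ("weight",), "meas2": ("weight",)}
kernels = {
    "weight": lambda a, rng: rng.gauss(2.0, 1.0),
    "meas1": lambda a, rng: rng.gauss(a[0], 0.1),
    "meas2": lambda a, rng: rng.gauss(a[0], 0.1),
}
```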

A language for probabilistic modelling
Our language of study is a call-by-value statistical language with sums, products, and higher-order types, as well as recursive functions. Languages with comparable features are considered in [11,57,40].
The syntax of this language is described in Fig. 1. Note the distinction between general terms M, N and values V. The language includes the usual term constructors and pattern matching. Base types are the unit type, the real numbers and the natural numbers, and for each of them there are associated constants. The language is parametrised by a set L of labels, a set F of partial measurable functions R^n → R or R^n → N, and a set D of standard distribution families, which are sub-probability kernels from R^n to R or from R^n to N. There is also a primitive score which multiplies the weight of the current trace by the value of its argument. This is an idealised form of conditioning via soft constraints, which justifies the move from sub-probability to s-finite kernels (see [54]). Terms of the language are typed in the standard way; in Fig. 2 we present a subset of the rules which could be considered non-standard. We use X to stand for either N or R, and we do not distinguish between the type and the corresponding measurable space. We also write B for 1 + 1, and use syntactic sugar for let-bindings, sequencing, and conditionals.
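The effect of score on trace weights can be illustrated by likelihood weighting. The following Python sketch (our own illustration, not the paper's language; the observed value 2.3 for the first measurement is our choice) scores each trace by the density of the observation and forms a self-normalised estimate of the posterior mean:

```python
import math
import random

def weighted_trace(rng, observed=2.3):
    """One run of the noisy-scales model in which score re-weights the
    trace by the density of the observed first measurement (soft conditioning)."""
    weight = rng.gauss(2.0, 1.0)  # prior sample
    sd = 0.1
    # score(density): multiply the trace weight by the likelihood
    density = math.exp(-((observed - weight) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))
    return weight, density        # (sampled value, trace weight)

def posterior_mean(n=5000, seed=1):
    """Self-normalised importance sampling estimate of E[weight | observation]."""
    rng = random.Random(seed)
    pairs = [weighted_trace(rng) for _ in range(n)]
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total
```

With a N(2, 1) prior and a N(weight, 0.1) likelihood at 2.3, conjugacy gives the exact posterior mean (2·1 + 2.3·100)/(1 + 100) = 232/101 ≈ 2.297, which the estimate approaches.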

Programs as event structures
In this section, we introduce our causal approach. We give a series of examples illustrating how programs can be understood as graph-like structures known as event structures, of which we assume no prior knowledge. Event structures were introduced by Winskel et al. [46], though for the purposes of this work the traditional notion must be significantly enriched.
The examples which follow are designed to showcase the following features of the semantics: combination of data flow and control flow with probability (Sec. 3.1), conditional branching (Sec. 3.2), open programs with multiple arguments (Sec. 3.3) and finally higher-order programs (Sec. 3.4). We will then give further definitions in Sec. 3.5 and Sec. 3.6.
Our presentation in Sec. 3.1 and Sec. 3.2 is intended to be informal; we give all the necessary definitions starting from Sec. 3.3.

Control flow, data flow, and probability
We briefly recall the example of the introduction; the program and its semantics are given in Fig. 3. As before, one arrow relation represents control flow, and the other data flow. There is a node for each random choice in the program, and the dependency relationships are pictured using the appropriate arrows. Naturally, a data dependency imposes constraints on the control flow: every data flow arrow must be realised by a path of control flow arrows. There is an additional node for the output value, drawn in a dashed box, which indicates that it is a possible point of interaction with other programs. This will be discussed in Sec. 3.3.
Although this is not pictured in the above diagram, the semantics also comprises a family of kernels, modelling the probabilistic execution according to the distributions specified by the program. Intuitively, each node has a distribution whose parameters are its parents for the data flow relation. For example, the node labelled meas2 will be assigned a kernel k_meas2 from R to R defined so that k_meas2(weight, −) is a normal distribution with parameters (weight, 0.1).

Branching
Consider a modified scenario in which only one measurement is performed, but with probability 0.01 an error occurs and the scales display a random number between 0 and 10. The corresponding program and its semantics are given in Fig. 4.
In order to represent the conditional statement we have introduced a new element to the graph: a binary relation known as conflict, indicating that two nodes are incompatible and any execution of the program will only encounter one of them. Conflict is hereditary, in the sense that the respective futures of two nodes in conflict are also incompatible. Hence we need two copies of (), one for each branch of the conditional statement. Unsurprisingly, beyond the branching point all events depend on error, since their very existence depends on its value.
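The branching model can be sketched operationally in Python (an illustrative aside; the encoding is ours). Each run takes exactly one of the two conflicting branches, mirroring the hereditary conflict in the diagram:

```python
import random

def faulty_scales(rng):
    """The branching model: with probability 0.01 the scales fail and
    show Uniform(0, 10); otherwise the measurement is N(weight, 0.1).
    The two branches are incompatible: a single run takes exactly one."""
    weight = rng.gauss(2.0, 1.0)
    error = rng.random() < 0.01
    if error:
        meas = rng.uniform(0.0, 10.0)  # error branch
    else:
        meas = rng.gauss(weight, 0.1)  # normal branch
    return {"weight": weight, "error": error, "meas": meas}
```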
We continue our informal presentation with a description of the semantics of open terms. This will provide enough context to formally define the notion of event structure we use in this paper, which differs from others found in the literature.

Programs with free variables
We turn the example in Sec. 3.2 into one involving two free variables, guess and rate, used as parameters for the distributions of weight and error, respectively. These allow the same program to serve as a model for different situations. Formally we have a term M such that guess : R, rate : R ⊢ M : 1. In the semantics, the free variables are represented by nodes, drawn in dotted boxes, showing that (like the output nodes) they are a point of interaction with the program's external environment; this time, a value is received rather than sent. Below, we will distinguish between the different types of nodes by means of a polarity function. We attach to the parameter nodes the appropriate data dependency arrows. The subtlety here is with control flow: while it is clear that parameter values must be obtained before the start of the execution, so that guess and rate must causally precede weight, it is less clear what relationship guess and rate should have with each other.
In a call-by-value language, we find that leaving program arguments causally independent (of each other) leads to soundness issues. But it would be equally unsound to impose a causal order between them. Therefore, we introduce a form of synchronisation relation between guess and rate, amounting to having each of them causally depend on the other. In event structure terminology this is known as a coincidence, and was introduced by [19] to study the synchronous π-calculus. Note that in many approaches to call-by-value games (e.g. [31,26]) one would bundle both parameters into a single node representing the pair (guess, rate), but this is not suitable here since our data flow analysis requires separate nodes.
We proceed to define event structures, combining the ingredients we have described so far: control dependency, data dependency, conflict, and coincidence, together with a polarity function, used implicitly above to distinguish between input nodes (−), output nodes (+), and internal random choices (0).

Definition 1. An event structure is a tuple (E, ≤, ⇝, #, ∼, pol) where E is a set of events; ≤ is a partial order of control dependency such that every history [e] = {e′ ∈ E | e′ ≤ e} is finite; ⇝ is a data dependency relation; # is an irreflexive, symmetric conflict relation which is hereditary (if e # e′ and e′ ≤ e″ then e # e″); ∼ is a coincidence relation; and pol : E → {−, 0, +} is a polarity function.
Often we write E instead of the whole tuple (E, ≤, ⇝, #, ∼, pol). It is sometimes useful to quotient out coincidences: we write E for the set of ∼-equivalence classes, which we denote as boldface letters (e, a, s, . . . ). It is easy to check that this also carries the structure of an event structure, with e ≤ e′ (resp. e # e′, e ⇝ e′) if there are e ∈ e and e′ ∈ e′ with e ≤ e′ (resp. e # e′, e ⇝ e′), and the evident polarity function. We will see in Sec. 3.5 how this structure can be equipped with quantitative information (in the form of measurable spaces and kernels). Before discussing higher-order programs, we introduce the fundamental concept of configuration, which will play an essential role in the technical development of this paper.
Definition 2. A configuration of E is a finite subset x ⊆ E which is down-closed (if e ≤ e′ and e′ ∈ x then e ∈ x) and conflict-free (if e, e′ ∈ x then ¬(e # e′)). The set of all configurations of E is denoted C(E) and it is a partial order under ⊆.
We introduce some important terminology. For an event e ∈ E, we have defined its history [e] above. This is always a configuration of E, and the smallest one containing e. More generally, for a coincidence class e we can define [e] = {e′ ∈ E | ∃e ∈ e. e′ ≤ e}, and [e) = [e] \ e.
The covering relation −⊂ defines the smallest non-trivial extensions to a configuration; it is defined as follows: x −⊂ y if there is a coincidence class e ∈ E such that x ∩ e = ∅ and y = x ∪ e. We will sometimes write x −⊂^e y. We sometimes annotate −⊂ and ⊆ with the polarities of the added events: so for instance x ⊆^{+,0} y if each e ∈ y \ x has polarity + or 0.
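For finite event structures (ignoring coincidences and polarity), configurations can be enumerated directly. This Python sketch, with our own encoding of ≤ and # as predicates, illustrates Definition 2 on the weight/measurement example:

```python
from itertools import combinations

def configurations(events, leq, conflict):
    """All configurations of a finite event structure: subsets of `events`
    that are down-closed for `leq` and contain no pair in `conflict`."""
    configs = []
    for r in range(len(events) + 1):
        for xs in combinations(events, r):
            x = set(xs)
            # down-closed: every leq-predecessor of a member is a member
            down = all(a in x for b in x for a in events if leq(a, b))
            # conflict-free: no two members are in conflict
            free = all(not conflict(a, b) for a in x for b in x)
            if down and free:
                configs.append(frozenset(x))
    return configs
```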

Higher-order programs
We return to a fairly informal presentation; our goal now is to convey intuition about the representation of higher-order programs in the framework of event structures. We will see in Sec. 4 how this representation is obtained from the usual categorical approach to denotational semantics.
Consider yet another faulty-scales scenario, in which the probability of error now depends on the object's weight. Suppose that this dependency is not known by the program, and thus left as a parameter rate : R → R. The resulting program has type rate : R → R, guess : R ⊢ R, as follows. It is an important feature of the semantics presented here that higher-order programs are interpreted as causal structures involving only values of ground type. In the example, the argument rate is initially received not as a mathematical function, but as a single message of unit type (labelled λ_rate), which gives the program the possibility to call the function rate by feeding it an input value. Because the behaviour of rate is unknown, its output is treated as a new argument to the program, represented by the negative out node. The shaded region highlights the part of computation during which the program interacts with its argument rate. The semantics accommodates the possibility that rate itself has internal random choices; this will be accounted for in the compositional framework of Sec. 4.

Bayesian event structures
We show now that event structures admit a probabilistic enrichment.
Definition 3. A measurable event structure is an event structure together with the assignment of a measurable space M(e) for every event e ∈ E. For any X ⊆ E we set M(X) = ∏_{e∈X} M(e).
As is common in statistics, we often write e (or X) for an element of M(e) (or M(X)). We now proceed to equip this with a kernel for each event. Our Bayesian event structures are quantitative event structures satisfying an additional axiom, which we introduce next. This axiom is necessary for a smooth combination of data flow and control flow; without it, the compositional framework of the next section is not possible. We finally define:
Definition 7. A Bayesian event structure is a quantitative event structure such that if e ∈ E is non-uniform, and e ≤ e′ with e and e′ not coincident, then pa(e) ⊆ pa(e′).
The purpose of this condition is to ensure that Bayesian event structures support a well-behaved notion of "hiding", which we will define in the next section.

Symmetry
For higher-order programs, event structures in the sense of Definition 1 present a limitation. This has to do with the possibility for a program to call a function argument more than once, which the compositional framework of Sec. 4 does not readily support. We will use a linear logic-inspired "!" to duplicate nodes, thus making certain configurations available in infinitely many copies. The following additional structure, called symmetry, is there to enforce that these configurations yield equivalent behaviour.

Definition 8 (Winskel [61]).
A symmetry on an event structure E is a family ≅_E of bijections θ : x ≅ y, with x, y ∈ C(E), containing all identity bijections and closed under composition and inverses, satisfying the following axioms.
We write θ : x ≅_E y for a bijection in this family. When E is Bayesian, we additionally require k_e = k_{θ(e)} for every non-negative e ∈ x. (This is well-defined because θ preserves data flow and thus pa(θ(e)) = θ(pa(e)).) Although symmetry can be mathematically subtle, combining it with additional data on event structures does not usually pose any difficulty [15,48].
In this section we have described Bayesian event structures with symmetry, which are the basic mathematical objects we use to represent programs. A central contribution of this paper is to define a compositional semantics, in which the interpretation of a program is obtained from that of its sub-programs. This is the topic of the next section.

Games and Bayesian strategies
The presentation is based on game semantics, a line of research in the semantics of programming languages initiated in [3,33], though the subject has earlier roots in the semantics of linear logic proofs (e.g. [10]).
It is typical of game semantics that programs are interpreted as concrete computational trees, and that higher-order terms are described in terms of the possible interactions with their arguments. As we have seen in the examples of the previous section, this interaction takes the form of an exchange of firstorder values. The central technical achievement of game semantics is to provide a method for composing such representations.
To the reader not familiar with game semantics, the terminology may be misleading: the work of this paper hardly retains any connection to game theory. In particular there is no notion of winning. The analogy may be understood as follows for a given program of type Γ ⊢ M : A. There are two players: the program itself, and its environment. The "game", which we study from the point of view of the program, takes place in the arena ⟦Γ ⊢ A⟧, which specifies which moves are allowed (calls to arguments in Γ, internal samples, return values in A, etc.). The semantics of M is a strategy (written ⟦M⟧), which specifies a plan of action for the program to follow in reaction to the moves played by the environment; this plan has to obey the constraints specified by the arena.

An introduction to game semantics based on event structures
There are many formulations of game semantics in the literature, with varying advantages. This paper proposes to use concurrent games, based on event structures, for reasoning about data flow in probabilistic programs. Originally introduced in [51] (though some important ideas appeared earlier: [25,44]), concurrent games based on event structures have been extensively developed and have found a range of applications.
In Sec. 3, we motivated our approach by assigning event structures to programs; these event structures are examples of strategies, which we will shortly define. First we define arenas, which are the objects of the category we will eventually build. (The morphisms will be strategies.) Perhaps surprisingly, an arena is also defined as an event structure, though a much simpler one, with no probabilistic information, an empty data dependency relation, and no neutral polarity events. We call this a simple event structure. This event structure does not itself represent any computation, but is simply there to constrain the shape of strategies, just as types constrain programs. Before giving the definition, we present examples in Fig. 7. So, arenas provide a set of moves together with certain constraints for playing those moves. Our definition of strategy is slightly technical, but the various conditions ensure that strategies can be composed soundly; we will explore this second point in Sec. 4.2.
For a strategy S to be well-defined relative to an arena A, each positive or negative move of S must correspond to a move of A; however, neutral moves of S correspond to internal samples of the program, and these should not be constrained by the type. Accordingly, a strategy comprises a partial map S ⇀ A defined precisely on the non-neutral events. Condition (1) amounts to σ being a map of event structures [60]. Combined with (2) and (3), we get the usual notion of a concurrent strategy on an arena with symmetry [17]; and finally (4) is a form of courtesy. To these four conditions we add innocence [33,56,16], which prevents any non-local or concurrent behaviour. It is typically used to characterise "purely functional" sequential programs, i.e. those using no state or control features. Here, we use innocence as a way to confine ourselves to a simpler semantic universe. In particular we avoid the need to deal with the difficulties of combining concurrency and probability [62].
In the rest of the paper, a Bayesian strategy is an innocent strategy in the sense of Definition 10 and Definition 11.

Composition of strategies
At this point, we have seen how to define arenas, and we have said that the event structures of Sec. 3 arise as strategies σ : S ⇀ A for an arena A. As usual in denotational semantics, these will be obtained compositionally, by induction on the syntax. For this we must move to a categorical setting, in which arenas are objects and strategies are morphisms.

Strategies as morphisms.
Before we introduce the notion of strategy from A to B we must introduce some important constructions on event structures.

Definition 12. If A is an event structure, its dual A^⊥ is the event structure with the same structure as A except for polarity, which is defined by pol_{A^⊥}(a) = −pol_A(a). (Negative moves become positive, and vice versa, with neutral moves not affected.) Given a family (A_i)_{i∈I} of event structures with symmetry, we define their parallel composition ∥_{i∈I} A_i to have events ⋃_{i∈I} A_i × {i}, with polarity, conflict and both kinds of dependency obtained componentwise. Noticing that a configuration x ∈ C(∥_{i∈I} A_i) corresponds to a family (x_i)_{i∈I} where each x_i ∈ C(A_i), and x_i = ∅ for all but finitely many i, we define the symmetry on ∥_{i∈I} A_i componentwise as well.

If the A i are arenas we define the two other symmetries in the same way.
We can now define our morphisms: a strategy from A to B is a strategy on the arena A^⊥ ∥ B, i.e. a partial map σ : S ⇀ A^⊥ ∥ B. The event structure S consists of A-moves (those mapped to the A^⊥ component), B-moves, and internal (i.e. neutral) events. We sometimes write S : A +→ B. The purpose of the composition operation which we proceed to define is therefore to produce, from a pair of strategies σ : S ⇀ A^⊥ ∥ B and τ : T ⇀ B^⊥ ∥ C, a strategy τ ⊙ σ : T ⊙ S ⇀ A^⊥ ∥ C. A constant feature of denotational games models is that composition is defined in two steps: interaction, in which S and T synchronise by playing matching B-moves, and hiding, where the matching pairs of events are deleted. The setting of this paper allows both σ and τ to be partial maps, so that in general there can be neutral events in both S and T; these never synchronise, and indeed they should not be hidden, since we aim to give an account of internal sampling.
Before moving on to composition, a word of warning: the resulting structure will not be a category. Instead, arenas and strategies assemble into a weaker structure called a bicategory [6]. Bicategories have objects, morphisms, and 2-cells (morphisms between morphisms), and the associativity and identity laws are relaxed: they only need to hold up to isomorphism. (This situation is relatively common for intensional models of non-determinism.) Two strategies σ : S ⇀ A and σ′ : S′ ⇀ A are isomorphic when there is an isomorphism of event structures S ≅ S′ commuting with σ and σ′; intuitively, S and S′ have the same moves up to the choice of copy indices. We know from [17] that isomorphism is preserved by composition (and all other constructions), so from now on we always consider strategies up to isomorphism; we then get a category.

Interaction.
In what follows we assume fixed Bayesian innocent strategies S : A +→ B and T : B +→ C as above, and study their interaction. We have hinted at the concept of "matching events", but the more convenient notion is that of matching configurations, which we define next.
There is an event structure with symmetry T ⊛ S whose configurations correspond precisely to matching pairs; it is a well-known fact in game semantics that innocent strategies compose "like relations" [43,15]. Because "matching" B-moves have a different polarity in S and T, there is an ambiguity in the polarity of some events in T ⊛ S; we address this after the lemma.

Lemma 1. Ignoring polarity, there is, up to isomorphism, a unique event structure with symmetry T ⊛ S, equipped with partial projections Π_S : T ⊛ S ⇀ S and Π_T : T ⊛ S ⇀ T, such that:
-For every e, e′ ∈ T ⊛ S, e ≤ e′ iff either Π_S(e) ≤ Π_S(e′) or Π_T(e) ≤ Π_T(e′), and the same property holds for the conflict and data dependency relations.
-Π_S and Π_T preserve and reflect labels.
Furthermore, for every e ∈ T ⊛ S, at least one of Π_S(e) and Π_T(e) is defined.
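The slogan that innocent strategies compose "like relations" can be illustrated by forgetting all causal structure: keeping only pairs of input/output configurations, matching pairs then compose exactly as relations do. A toy Python sketch, our own encoding with configurations named by strings:

```python
# Hedged illustration of "strategies compose like relations": we keep only
# pairs (A-configuration, B-configuration) and compose by matching on the
# shared B-component, exactly as for binary relations.
def compose(sigma, tau):
    # sigma ⊆ A-configs × B-configs, tau ⊆ B-configs × C-configs
    return {(a, c) for (a, b1) in sigma for (b2, c) in tau if b1 == b2}

# toy strategies: only the pair matching on "b1" survives composition
sigma = {("a0", "b0"), ("a1", "b1")}
tau = {("b1", "c1"), ("b2", "c2")}
print(compose(sigma, tau))  # {('a1', 'c1')}
```

The real construction refines this relational picture with causal order, symmetry, and probability, but the matching discipline is the same.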
When reasoning about the polarity of events in T ⊛ S, a subtlety arises because B-moves are not assigned the same polarity in S and T. This is not surprising: polarity is there precisely to allow strategies to communicate by sending (+) and receiving (−) values; in this interaction, S and T play complementary roles. To reason about the flow of information in the event structure T ⊛ S it will be important, for each B-move e of T ⊛ S, to know whether it is positive in S or in T; in other words, whether information is flowing from S to T, or vice-versa.
Accordingly, each event of T ⊛ S is assigned a polarity in {−, +_S, 0_S, +_T, 0_T}, recording the component in which it is positive or neutral.
Probability in the interaction. Unlike with polarity, S and T agree on which measurable space to assign to each B-move, since by the conditions on strategies this is determined by the arena. So for each e ∈ T ⊛ S we can set M(e) = M(Π_S(e)) or M(Π_T(e)), unambiguously, and an easy argument shows that this makes T ⊛ S a well-defined measurable event structure with symmetry.
We can turn T ⊛ S into a quantitative event structure by defining a kernel k_e : M(pa(e)) → M(e) for every e ∈ T ⊛ S such that pol(e) ≠ −. The key observation is that when pol(e) ∈ {+_S, 0_S}, the parents of e correspond precisely to the parents of Π_S(e) in S (and symmetrically for T). Since Π_S preserves the measurable space associated to an event, we may then take k_e = k_{Π_S(e)}.
Hiding. Hiding is the process of deleting the B-moves from T ⊛ S, yielding a strategy from A to C. The B-moves are exactly those on which both projections are defined, so the new set of events is obtained as follows: T ⊙ S = {e ∈ T ⊛ S | Π_S(e) and Π_T(e) are not both defined}.
This set inherits a preorder ≤, conflict relation #, and measurable structure directly from T ⊛ S. Polarity is lifted from either S or T via the projections. (Note that by removing the B-moves we have resolved the polarity mismatch.) To define the data flow dependency, we must take care to ensure that the resulting T ⊙ S is Bayesian. For e, e′ ∈ T ⊙ S, we say e ⇝ e′ if one of the following holds: (1) There exist n ≥ 0 and e_1, . . . , e_n ∈ T ⊛ S, all B-moves, such that e ⇝ e_1 ⇝ · · · ⇝ e_n ⇝ e′ (in T ⊛ S). (2) There exist a non-uniform d ∈ T ⊛ S, n ≥ 0 and e_1, . . . , e_n ∈ T ⊛ S, all B-moves, such that e ⇝ e_1 ⇝ · · · ⇝ e_n ⇝ d and d ≤ e′.
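Clause (1) of this induced data-flow relation can be sketched concretely: after hiding, one visible event feeds another when a data-flow chain between them passes only through hidden B-moves. The following Python sketch is our own encoding (clause (2), involving non-uniform events, is omitted):

```python
# Clause (1) of the induced data flow after hiding (our own encoding):
# visible e feeds visible e' if a chain e ~> e1 ~> ... ~> en ~> e' exists
# in which every intermediate event is hidden.
def induced_data_flow(data_edges, hidden):
    visible = {e for edge in data_edges for e in edge} - hidden
    result = set()
    for src in visible:
        # depth-first search allowed to pass through hidden events only
        stack = [tgt for (s, tgt) in data_edges if s == src]
        seen = set()
        while stack:
            e = stack.pop()
            if e in seen:
                continue
            seen.add(e)
            if e in hidden:
                stack.extend(tgt for (s, tgt) in data_edges if s == e)
            else:
                result.add((src, e))
    return result

# e ~> b1 ~> b2 ~> e_prime, with b1 and b2 hidden B-moves
edges = {("e", "b1"), ("b1", "b2"), ("b2", "e_prime")}
print(induced_data_flow(edges, hidden={"b1", "b2"}))  # {('e', 'e_prime')}
```

This is how a data dependency between an A-move and a C-move can survive the deletion of the B-moves that mediated it.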
From a configuration x ∈ C(T ⊙ S) we can recover the hidden moves to get an interaction witness x̄ = {e ∈ T ⊛ S | e ≤ e′ for some e′ ∈ x}, a configuration of C(T ⊛ S). For x, y ∈ C(T ⊙ S), a bijection θ : x ≅ y is in ∼=_{T ⊙ S} if there is θ̄ : x̄ ∼=_{T ⊛ S} ȳ which restricts to θ. This gives a measurable event structure with symmetry T ⊙ S.
To make T ⊙ S a Bayesian event structure, we must define for every e ∈ T ⊙ S a kernel, which we denote k̂_e to emphasise the difference with the kernel k_e defined above. Indeed the parents pa_⊛(e) of e in T ⊛ S may no longer exist in T ⊙ S, where e has a different set of parents pa(e).
We therefore consider the subset of hidden ancestors of e which ought to affect the kernel k̂_e: Definition 15. For strategies S : A +→ B and T : B +→ C, and e ∈ T ⊙ S, an essential hidden ancestor of e is a B-move d ∈ T ⊛ S, such that d ≤ e and one of the following holds: (1) There are e_1, e_2 ∈ pa(e) such that e_1 ⇝ · · · ⇝ d ⇝ · · · ⇝ e_2 in T ⊛ S. (2) There are e_0 ∈ pa(e), B-moves d′ and e_1, . . . , e_n, with d′ non-uniform, such that e_0 ⇝ e_1 ⇝ · · · ⇝ e_j ⇝ d ⇝ e_{j+1} ⇝ · · · ⇝ e_n ⇝ d′.
Since T ⊙ S is innocent, e has a sequential history, and thus the set of essential hidden ancestors of e forms a finite, total preorder, for which there exists a linear enumeration d_1 ≤ · · · ≤ d_n. We then define k̂_e : M(pa(e)) → M(e) by integrating out the hidden ancestors:
k̂_e(pa(e))(U) = ∫_{M(d_1)} · · · ∫_{M(d_n)} k_e(pa_⊛(e))(U) k_{d_n}(pa(d_n))(dd_n) · · · k_{d_1}(pa(d_1))(dd_1),
where we abuse notation: using that for every i ≤ n, pa(d_i) ⊆ pa(e) ∪ {d_j | j < i}, we may write pa(d_i) for the only element of M(pa(d_i)) compatible with pa(e) and d_1, . . . , d_{i−1}. The particular choice of linear enumeration does not matter, by Fubini's theorem for s-finite kernels.
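In discrete toy cases the integral over the hidden ancestors' values reduces to a finite sum. The following sketch is our own encoding (kernels as dictionaries of finite distributions, a single hidden ancestor d), illustrating the marginalisation:

```python
# Toy discrete analogue of integrating out one hidden ancestor d:
# k_hat(x)(v) = sum over d of k_d(x)(d) * k_e(x, d)(v).
# Kernels map an input to a distribution (dict of value -> probability).
def marginalise(k_d, k_e):
    def k_hat(x):
        out = {}
        for d, p in k_d[x].items():
            for v, q in k_e[(x, d)].items():
                out[v] = out.get(v, 0.0) + p * q
        return out
    return k_hat

k_d = {"x": {0: 0.5, 1: 0.5}}                                  # hidden coin flip
k_e = {("x", 0): {"v": 1.0}, ("x", 1): {"v": 0.2, "w": 0.8}}   # visible event
print(marginalise(k_d, k_e)("x"))  # {'v': 0.6, 'w': 0.4}
```

With several hidden ancestors the sums nest, one per ancestor, in any linear order; this is the discrete shadow of the Fubini argument above.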

Lemma 2.
There is a map τ ⊙ σ : T ⊙ S ⇀ A^⊥ ∥ C making T ⊙ S a Bayesian strategy. We call this the composition of S and T.
Copycat. We have defined morphisms between arenas, and how they compose. We now define identities, called copycat strategies. In the semantics of our language, these are used to interpret typing judgements of the form x : A ⊢ x : A, and the copycat acts by forwarding values received on one side across to the other. To guide the intuition, the copycat strategy for the game R^⊥ ∥ R is pictured in Fig. 8. Formally, the copycat strategy on an arena A is a Bayesian event structure (with symmetry) CC_A, together with a (total) map cc_A : CC_A → A^⊥ ∥ A. As should be clear in the example of Fig. 8, the events, polarity, conflict, and measurable structure of CC_A are those of A^⊥ ∥ A. The order ≤ is the transitive closure of that of A^⊥ ∥ A enriched with the pairs {((a, 1), (a, 2)) | a ∈ A and pol_A(a) = +} ∪ {((a, 2), (a, 1)) | pol_A(a) = −}. The same sets of pairs also make up the data dependency relation in CC_A; recall that there is no data dependency in the event structure A. Note that because CC_A is just A^⊥ ∥ A with added constraints, configurations of CC_A can be seen as a subset of those of A^⊥ ∥ A, and thus the symmetry ∼=_{CC_A} is inherited from ∼=_{A^⊥ ∥ A}.
To make copycat a Bayesian strategy, we observe that for every positive e ∈ CC_A, pa(e) contains a single element: the corresponding negative move in A^⊥ ∥ A, which carries the same measurable space. Naturally, we take k_e : M(e) → M(e) to be the identity kernel.
We have defined objects, morphisms, composition, and identities. They assemble into a category. Theorem 1. Arenas and Bayesian strategies, with the latter considered up to isomorphism, form a category BG. BG has a subcategory BG_+ whose objects are positive, regular arenas and whose morphisms are negative strategies (i.e. strategies whose initial moves are negative), up to isomorphism.
The restriction implies (using receptivity) that for every strategy S : A +→ B in BG_+, the initial moves of S correspond to init(A). This reflects the dynamics of a call-by-value language, where arguments are received before anything else. We now set out to define the semantics of our language in BG_+.

A denotational model
In Sec. 5.1, we describe some abstract constructions in the category, which provide the necessary ingredients for interpreting types and terms in Sec. 5.2.

Categorical structure
The structure required to model a calculus of this kind is fairly standard. The first games model for a call-by-value language was given by Honda and Yoshida [31] (see also [4]). Their construction was re-enacted in the context of concurrent games by Clairambault et al. [20], from whom we draw inspiration. The adaptation is not however automatic, as we must account for measurability, probability, data flow, and an interpretation of product types based on coincidences.
Tensor. Tensor products are more subtle, partly because in this paper we use coincidence to deal with pairs, as motivated in Sec. 3.3. For example, given two arenas each having a single initial move, we construct their tensor product by taking their parallel composition and making the two initial moves coincident. Now, because arenas in BG_+ are regular (Definition 9), it is easy to see that each A is isomorphic to a sum Σ_{i∈I} A_i of elementary arenas.

Coproducts. Given arenas (A_i)_{i∈I}, their sum Σ_{i∈I} A_i is obtained from the parallel composition ∥_{i∈I} A_i by placing events of distinct components in conflict; this provides coproducts in BG_+.
In order to give semantics to pairs of terms, we must define the action of ⊗ on strategies. Consider two strategies σ : A +→ A′ and τ : B +→ B′. Informally, in σ ⊗ τ the strategies synchronise at the start, i.e. all initial moves are received at the same time, and they synchronise again when they are both ready to move to the A′ ⊗ B′ side for the first time.
The operations − ⊗ B and A ⊗ − on BG_+ define functors. However, as is typically the case for models of call-by-value, the tensor fails to be bifunctorial, and thus BG_+ is not monoidal but only premonoidal [50]. The unit for ⊗ is the arena 1 with one (positive) event () : 1. There are "copycat-like" associativity, unit and braiding strategies, which we omit.
The failure of bifunctoriality in this setting means that for σ : A +→ A′ and τ : B +→ B′, the strategy σ ⊗ τ is in general distinct from the following two strategies: the left tensor σ ⊗_l τ = (A′ ⊗ τ) ⊙ (σ ⊗ B) and the right tensor σ ⊗_r τ = (σ ⊗ B′) ⊙ (A ⊗ τ). See Fig. 9 for an example of the ⊗ and ⊗_l constructions on simple strategies. Observe that the data flow relation is not affected by the choice of tensor: this is related to our discussion of commutativity in Sec. 1.1: a commutative semantics is one that satisfies ⊗_l = ⊗_r = ⊗.
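The difference between the two tensors is easy to observe in any model with effects. The following Python sketch is our own toy model of a premonoidal tensor, not the paper's strategies: it logs effect order to show that the left and right tensors schedule the two components differently while computing the same value.

```python
# Effectful maps as Python functions that log an event when they fire;
# the left/right tensors differ only in which component's effects are
# scheduled first (our own toy model of a premonoidal tensor).
def tensor_l(f, g):
    def h(pair):
        a = f(pair[0])   # left component acts first ...
        b = g(pair[1])   # ... then the right component
        return (a, b)
    return h

def tensor_r(f, g):
    def h(pair):
        b = g(pair[1])   # right component acts first
        a = f(pair[0])
        return (a, b)
    return h

log = []
f = lambda x: (log.append("f"), x)[1]
g = lambda y: (log.append("g"), y)[1]

tensor_l(f, g)((1, 2)); order_l = list(log); log.clear()
tensor_r(f, g)((1, 2)); order_r = list(log)
print(order_l, order_r)  # ['f', 'g'] ['g', 'f']
```

A commutative semantics is precisely one in which this difference in scheduling is invisible, which is why only the control flow, not the data flow, distinguishes ⊗_l from ⊗_r.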
We will make use of the left tensor ⊗_l in our denotational semantics, because it reflects a left-to-right evaluation strategy, which is standard. It will also be important that the interpretation of values lies in the centre of the premonoidal category, which consists of those strategies σ for which σ ⊗_l τ = σ ⊗_r τ and τ ⊗_l σ = τ ⊗_r σ for every τ. Finally we note that ⊗ distributes over +, in the sense that for every A, B, C the canonical strategy (A ⊗ B) + (A ⊗ C) +→ A ⊗ (B + C) has an inverse.
Function spaces. We now investigate the construction of arenas of the form A ⊸ B. This is a linear function space construction, allowing at most one call to the argument A; in Sec. 5.1 we will construct an extended arena !(A ⊸ B) permitting arbitrary usage. Given A and B we construct A ⊸ B as follows. (This construction is the same as in other call-by-value game semantics, e.g. [31,20].) Recall that we can write A = Σ_{i∈I} A_i with each A_i an elementary arena. Then, A ⊸ B has the same set of events as 1 ∥ Σ_{i∈I}(A_i^⊥ ∥ B), with inherited polarity and measurable structure, but with a preorder enriched with the pairs (λ, e) for e initial in Σ_{i∈I}(A_i^⊥ ∥ B), where in this case we call λ the unique move of 1.
For every strategy σ : A ⊗ B +→ C we call Λ(σ) : A +→ (B ⊸ C) the strategy which, upon receiving an opening A-move (or coincidence) a, deterministically (and with no data-flow link) plays the move λ in B ⊸ C, waits for Opponent to play a B-move (or coincidence) b, and continues as σ would on the input formed of a and b. Additionally there is for every B and C an evaluation morphism ev_{B,C} : (B ⊸ C) ⊗ B +→ C defined as in [20].
Duplication. We define, for every arena A, a "reusable" arena !A. Its precise purpose will become clear when we define the semantics of our language. It is helpful to start with the observation that ground type values are readily duplicable, in the sense that there is a strategy R +→ R ⊗ R in BG. Therefore ! will have no effect on R, but only on more sophisticated arenas (e.g. R ⊸ R) for which no such (well-behaved) map exists. We start by studying negative arenas.

Definition 16. Let
A be a negative arena. We define !A to be the measurable event structure !A = ∥_{i∈ω} A, equipped with symmetries in which the copies of A may additionally be permuted. It can be shown that !A is a well-defined negative arena, i.e. meets the conditions of Definition 9. Observe that an elementary positive arena B corresponds precisely to a set e of coincident positive events, all initial for ≤, immediately followed by a negative arena which we call B^−. Followed here means that e ≤ b for all e ∈ e and b ∈ B^−, and we write B = e · B^−. We define !B = e · !B^−. Finally, recall that an arbitrary positive arena B can be written as a sum of elementary ones: B = Σ_{i∈I} B_i. We then define !B = Σ_{i∈I} !B_i. For positive A and B, a central strategy σ : A +→ B induces a strategy !σ : !A +→ !B, and this is functorial. The functor ! extends to a linear exponential comonad on the category with elementary arenas as objects and central strategies as morphisms (see [20] for the details of a similar construction).

Recursion.
To interpret fixed points, we consider an ordering relation on strategies. We momentarily break our habit of considering strategies up to isomorphism, as in this instance it becomes technically inconvenient [17].

Definition 17. If σ : S
⇀ A and τ : T ⇀ A are strategies, we write S ⊑ T if S ⊆ T, the inclusion map is a map of event structures preserving all structure, including kernels, and for every s ∈ S, σ(s) = τ(s).

Lemma 5.
Every ω-chain S_0 ⊑ S_1 ⊑ . . . has a least upper bound ⊔_{i∈ω} S_i, given by the union ⋃_{i∈ω} S_i, with all structure obtained by componentwise union.
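Concretely, this least upper bound supports the usual Kleene-style computation of least fixed points: start from the least element and repeatedly apply a monotone operator, taking the union of the resulting ω-chain. A generic sketch over finite sets, in our own encoding, with the empty set standing for the least strategy:

```python
# Kleene iteration (our own generic sketch): the least fixed point of a
# monotone operator on finite sets is the union of the chain
# bottom, step(bottom), step(step(bottom)), ...
def lfp(step, bottom=frozenset()):
    x = bottom
    while True:
        y = step(x)
        if y == x:
            return x
        x = y

# example monotone operator: reachability from "a" in a small graph
edges = {("a", "b"), ("b", "c")}
def step(reachable):
    return reachable | {"a"} | {t for (s, t) in edges if s in reachable}

print(sorted(lfp(step)))  # ['a', 'b', 'c']
```

For strategies the chain is given by the ⊑ order above and the operator by unfolding the recursive definition once; the interpretation of a fixed point is then the union of its finite approximants.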
There is also a least strategy ⊥ on every arena, unique up to isomorphism. We are now ready to give the semantics of our language.

Denotational semantics
The interpretation of types is as follows: This interpretation extends to contexts via ⟦·⟧ = 1 and ⟦x_1 : A_1, . . . , x_n : A_n⟧ = ⟦A_1⟧ ⊗ · · · ⊗ ⟦A_n⟧. (In Fig. 7 we used Γ ⊢ A to refer to the arena ⟦Γ⟧^⊥ ∥ ⟦A⟧.) A term Γ ⊢ M : A is interpreted as a strategy ⟦M⟧_Γ : ⟦Γ⟧ +→ ⟦A⟧, defined inductively. For every type A, the arena ⟦A⟧ is both a !-coalgebra and a commutative comonoid, so there are strategies w_A : ⟦A⟧ +→ 1, c_A : ⟦A⟧ +→ ⟦A⟧ ⊗ ⟦A⟧, and h_A : ⟦A⟧ +→ !⟦A⟧. Using that the comonad ! is monoidal, this structure extends to contexts; we write c_Γ, w_Γ and h_Γ for the induced maps. The interpretation of constants is shown in Fig. 10, and the rest of the semantics is given in Fig. 11. Lemma 6. For a value Γ ⊢ V : A, the strategy ⟦V⟧_Γ is central.
The semantics is sound for the usual call-by-value equations.
The equations are directly verified. Standard reasoning principles apply given the categorical structure we have outlined above. (It is well known that premonoidal categories provide models for call-by-value [50], and our interpretation is a version of Girard's translation of call-by-value into linear logic [29].)

Conclusion and perspectives
We have defined, for every term Γ ⊢ M : A, a strategy ⟦M⟧_Γ. This gives a model for probabilistic programming which provides an explicit representation of data flow. In particular, if ⊢ M : 1 and M has no subterm of type B + C, then the Bayesian strategy ⟦M⟧ is a Bayesian network equipped with a total ordering of its nodes: the control flow relation ≤. Our proposed compositional semantics additionally supports sum types, higher types, and open terms. This paper does not contain an adequacy result, largely for lack of space: the 'Monte Carlo' operational semantics of probabilistic programs is difficult to define in full rigour. In further work I hope to address this and to carry out the integration of causal models into the framework of [53]. The objective remains to obtain proofs of correctness for existing and new inference algorithms.
Related work on denotational semantics. Our representation of data flow based on coincidences and a relation is novel, but the underlying machinery relies on existing work in concurrent game semantics, in particular the framework of games with symmetry developed by Castellan et al. [17]. This was applied to a language with discrete probability in [15], and to a call-by-name and affine language with continuous probability in [49]. This paper is the first instance of a concurrent games model for a higher-order language with recursion and continuous probability, and the first to track internal sampling and data flow.
There are other interactive models for statistical languages, e.g. by Ong and Vákár [47] and Dal Lago et al. [38]. Their objectives are different: they do not address data flow (i.e. their semantics only represents the control flow), and do not record internal samples.
Prior to the development of probabilistic concurrent games, probabilistic notions of event structures were considered by several authors (see [58,1,59]). The literature on probabilistic Petri nets contains important related work, as Petri nets can sometimes provide finite representations for infinite event structures. Markov nets [7,2] satisfy conditional independence conditions based on the causal structure of Petri nets. More recently, Bruni et al. [12,13] relate a form of Petri nets to Bayesian networks and inference, though their probability spaces are discrete.
Related work on graphical representations. Our event structures are reminiscent of Jeffrey's graphical language for premonoidal categories [35], which combines string diagrams [36] with a control flow relation. Note that in event structures the conflict relation provides a model for sum types, which is difficult to obtain in Jeffrey's setting. The problem of representing sum types arises also in probabilistic modelling, because Bayesian networks do not support them: [45] propose an extended graphical language, which could serve to interpret first-order probabilistic programs with conditionals. Another approach is by [42], whose Bayesian networks have edges labelled by predicates describing the branching condition. Finally, the theory of Bayesian networks has also been investigated extensively by Jacobs [34] from a categorical viewpoint. It will be important to understand the formal connections between our work and the above.