Causal Inference by String Diagram Surgery

Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors. A key notion in the identification of causal effects is that of an intervention, whereby a variable is forcefully set to a particular value independent of any prior propensities. We represent the effect of such an intervention as an endofunctor which performs 'string diagram surgery' within the syntactic category of string diagrams. This diagram surgery in turn yields a new, interventional distribution via the interpretation functor. While in general there is no way to compute interventional distributions purely from observed data, we show that this is possible in certain special cases using a calculational tool called comb disintegration. We demonstrate the use of this technique on a well-known toy example, where we predict the causal effect of smoking on cancer in the presence of a confounding common cause. After developing this specific example, we show this technique provides simple sufficient conditions for computing interventions which apply to a wide variety of situations considered in the causal inference literature.


Introduction
Causality is about understanding the mechanics of the world around us. This world presents itself in the form of streams of observations, in which statistical (in)dependences can be recognised. A big question, both in science and in daily life, is: how to distinguish correlation from causation and recognise genuine causal relationships.
An important conceptual tool for distinguishing correlation from causation is the possibility of intervention. For example, a randomised drug trial attempts to destroy any confounding 'common cause' explanation for correlations between drug use and recovery by randomly assigning a patient to the control or treatment group, independent of any background factors. In an ideal setting, the observed correlations of such a trial will reflect genuine causal influence. Unfortunately, it is not always possible (or ethical) to ascertain causal effects by means of actual interventions. For instance, one is unlikely to get approval to run a clinical trial on whether smoking causes cancer by randomly assigning 50% of the patients to smoke, and waiting a bit to see who gets cancer. However, in certain situations it is possible to predict the effect of such a hypothetical intervention from purely observational data.
In this paper, we will focus on the problem of causal identifiability. For this problem, we are given observational data as a joint distribution on a set of variables and we are furthermore provided with a causal structure associated with those variables. This structure, which typically takes the form of a directed acyclic graph or some variation thereof, tells us which variables can in principle have a causal influence on others. The problem then becomes whether we can measure how strong those causal influences are, by means of computing an interventional distribution. That is, can we ascertain what would have happened if a (hypothetical) intervention had occurred?
Over the past three decades, a great deal of work has been done in identifying necessary and sufficient conditions for causal identifiability in various special cases, starting with very specific notions such as the back-door and front-door criteria [19] and progressing to more general necessary and sufficient conditions for causal identifiability based on the do-calculus [11], or combinatoric concepts such as confounded components in semi-Markovian models [25,24].
This style of causal reasoning relies crucially on a delicate interplay between syntax and semantics, which is often not made explicit in the literature. The syntactic object of interest is the causal structure (e.g. a causal graph), which captures something about our understanding of the world, and the mechanisms which gave rise to some observed phenomena. The semantic object of interest is the data: joint and conditional probability distributions on some variables. Fixing a causal structure entails certain constraints on which probability distributions can arise, hence it is natural to see distributions satisfying those constraints as models of the syntax.
In this paper, we make this interplay precise using functorial semantics in the spirit of Lawvere [16], and develop basic syntactic and semantic tools for causal reasoning in this setting. We take as our starting point a functorial presentation of Bayesian networks similar to the one appearing in [7]. The syntactic role is played by string diagrams, which give an intuitive way to represent morphisms of a monoidal category as boxes plugged together by wires. Given a directed acyclic graph (dag) G, we can form a free category Syn G whose arrows are (formal) string diagrams which represent the causal structure syntactically. Structure-preserving functors from Syn G to Stoch, the category of stochastic matrices, then correspond exactly to Bayesian networks based on the dag G.
Within this framework, we develop the notion of intervention as an operation of 'string diagram surgery'. Intuitively, this cuts a string diagram at a certain variable, severing its link to the past. Formally, this is represented as an endofunctor on the syntactic category cut X : Syn G → Syn G , which propagates through a model F : Syn G → Stoch to send observational probabilities F (ω) to interventional probabilities F (cut X (ω)).
The cut X endofunctor gives us a diagrammatic means of computing interventional distributions given complete knowledge of F . However, more interestingly, we can sometimes compute interventionals given only partial knowledge of F , namely some observational data. We show that this can also be done via a technique we call comb disintegration, which is a string diagrammatic version of a technique called c-factorisation introduced by Tian and Pearl [25]. Our approach generalises disintegration, a calculational tool whereby a joint state on two variables is factored into a single-variable state and a channel, representing the marginal and conditional parts of the distribution, respectively. Disintegration has recently been formulated categorically in [5] and using string diagrams in [4]. We take the latter as a starting point, but instead consider a factorisation of a three-variable state into a channel and a comb. The latter is a special kind of map which allows inputs and outputs to be interleaved. They were originally studied in the context of quantum communication protocols, seen as games [8], but have recently been used extensively in the study of causally-ordered quantum [3,20] and generalised [14] processes. While originally imagined for quantum processes, the categorical formulation given in [14] makes sense in both the classical case (Stoch) and the quantum. Much like Tian and Pearl's technique, comb factorisation allows one to characterise when the confounding parts of a causal structure are suitably isolated from each other, then exploit that isolation to perform the concrete calculation of interventional distributions.
However, unlike in the traditional formulation, the syntactic and semantic aspects of causal identifiability within our framework exactly mirror one another. Namely, we can give conditions for causal identifiability in terms of factorisation of a morphism in Syn G , whereas the actual concrete computation of the interventional distribution involves factorisation of its interpretation in Stoch. Thanks to the functorial semantics, the former immediately implies the latter.
To introduce the framework, we make use of a running example taken from Pearl's book [19]: identifying the causal effect of smoking on cancer with the help of an auxiliary variable (the presence of tar in the lungs). After providing some preliminaries on stochastic matrices and the functorial presentation of Bayesian networks in Sections 2 and 3, we introduce the smoking example in Section 4. In Section 5 we formalise the notion of intervention as string diagram surgery, and in Section 6 we introduce combs and prove our main calculational result: the existence and uniqueness of comb factorisations. In Section 7, we show how to apply this theorem in computing the interventional distribution in the smoking example, and in Section 8, we show how this theorem can be applied in a more general case which captures (and slightly generalises) the conditions given in [25]. In Section 9, we conclude and describe several avenues of future work.

Stochastic Matrices and Conditional Probabilities
Symmetric monoidal categories (SMCs) give a very general setting for studying processes which can be composed in sequence (via the usual categorical composition ∘) and in parallel (via the monoidal composition ⊗). Throughout this paper, we will use string diagram notation [23] for depicting composition of morphisms in an SMC. In this notation, morphisms are depicted as boxes with labelled input and output wires, composition ∘ as 'plugging' boxes together, and the monoidal product ⊗ as placing boxes side-by-side. Identity morphisms are depicted simply as a wire and the unit I of ⊗ as the empty diagram. The 'symmetric' part of the structure consists of symmetry morphisms, which enable us to permute inputs and outputs arbitrarily. We depict these as wire-crossings. Morphisms whose domain is I are called states, and they will play a special role throughout this paper.
A monoidal category of prime interest in this paper is Stoch, whose objects are finite sets and whose morphisms f : A → B are |B| × |A| dimensional stochastic matrices. That is, they are matrices of non-negative numbers whose columns each sum to 1:

$$f = (f^j_i)_{i \in A,\, j \in B} \quad \text{with} \quad f^j_i \geq 0 \quad \text{and} \quad \sum_j f^j_i = 1 \ \text{for all } i.$$

Note we adopt the physicists' convention of writing row indices as superscripts and column indices as subscripts. Stochastic matrices are of interest for probabilistic reasoning, because they exactly capture the data of a conditional probability distribution. That is, if we take A := {1, . . . , m} and B := {1, . . . , n}, conditional probabilities naturally arrange themselves into a stochastic matrix:

$$f^j_i := P(B = j \mid A = i), \qquad f = \begin{pmatrix} P(B=1 \mid A=1) & \cdots & P(B=1 \mid A=m) \\ \vdots & \ddots & \vdots \\ P(B=n \mid A=1) & \cdots & P(B=n \mid A=m) \end{pmatrix}.$$

States, i.e. stochastic matrices from a trivial input I := { * }, are (nonconditional) probability distributions, represented as column vectors. There is only one stochastic matrix with trivial output: the row vector consisting only of 1's. The latter, the discarding map, will play a special role in this paper (see (1) below).
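As a minimal sketch of these conventions (the numbers and the helper name `is_stochastic` are illustrative assumptions, not taken from the paper), a morphism f : A → B can be stored as a NumPy array of shape (|B|, |A|) whose columns are probability distributions:

```python
import numpy as np

def is_stochastic(f, tol=1e-9):
    """True if f has non-negative entries and every column sums to 1."""
    f = np.asarray(f, dtype=float)
    return bool(np.all(f >= 0) and np.allclose(f.sum(axis=0), 1.0, atol=tol))

# Hypothetical channel f : A -> B with |A| = 2 and |B| = 3.
# Column i holds the distribution P(B | A = i).
f = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])

# A state on A is a stochastic matrix with a single column.
state = np.array([[0.4],
                  [0.6]])

# The only stochastic matrix A -> I is the row vector of ones.
discard_A = np.ones((1, 2))

print(is_stochastic(f), is_stochastic(state), is_stochastic(discard_A))
```

Storing the output index first matches the row-as-superscript convention above, so that composition later becomes ordinary matrix multiplication.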
Composition of stochastic matrices is matrix multiplication. In terms of conditional probabilities, this corresponds to multiplication, followed by marginalization over the shared variable: $\sum_B P(C|B) P(B|A)$. Identities are therefore given by identity matrices, which we will often express in terms of the Kronecker delta function $\delta^j_i$. The monoidal product ⊗ in Stoch is the cartesian product on objects, and Kronecker product of matrices:

$$(f \otimes g)^{(k,l)}_{(i,j)} := f^k_i \, g^l_j.$$

We will typically omit parentheses and commas in the indices, writing e.g. $h^{kl}_{ij}$ instead of $h^{(k,l)}_{(i,j)}$ for an arbitrary matrix entry of h : A⊗B → C⊗D. In terms of conditional probabilities, the Kronecker product corresponds to taking product distributions. That is, if f represents the conditional probabilities P (B|A) and g the probabilities P (D|C), then f ⊗ g represents P (B|A)P (D|C). Stoch also comes with a natural choice of 'swap' matrices σ : A ⊗ B → B ⊗ A given by $\sigma^{kl}_{ij} := \delta^l_i \delta^k_j$, making it into a symmetric monoidal category. Every object A in Stoch has three other pieces of structure which will play a key role in our formulation of Bayesian networks and interventions: the copy map A → A ⊗ A, the discarding map A → I, and the uniform state I → A. As matrices, these are given by

$$\mathrm{copy}^{jk}_i := \delta^j_i \delta^k_i, \qquad \mathrm{discard}_i := 1, \qquad \mathrm{uniform}^i := \frac{1}{|A|}. \qquad (1)$$

Abstractly, this provides Stoch with the structure of a CDU category.
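The operations just described can be sketched in NumPy as follows. The example matrices and the helper names `copy_map`, `discard` and `uniform` are assumptions made for illustration; only `@` (matrix product) and `np.kron` (Kronecker product) are standard library calls.

```python
import numpy as np

f = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # f : A -> B, i.e. P(B|A)
g = np.array([[0.7, 0.4],
              [0.3, 0.6]])        # g : B -> C, i.e. P(C|B)

# Sequential composition: sum over B of P(C|B) P(B|A).
g_after_f = g @ f

# Parallel composition: entry ((k,l),(i,j)) equals f[k,i] * g[l,j].
f_tensor_g = np.kron(f, g)

def copy_map(n):
    """Copy map A -> A ⊗ A as an (n*n) x n matrix sending i to (i, i)."""
    c = np.zeros((n * n, n))
    for i in range(n):
        c[i * n + i, i] = 1.0
    return c

def discard(n):
    """Discarding map A -> I: the row vector of ones."""
    return np.ones((1, n))

def uniform(n):
    """Uniform state I -> A."""
    return np.full((n, 1), 1.0 / n)

# Copying and then discarding one copy gives back the identity.
counit_law = np.kron(discard(2), np.eye(2)) @ copy_map(2)
```

Both composites are again column-stochastic, and `discard(2) @ f` equals `discard(2)`, which is the finality property used later in the text.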
CDU functors are symmetric monoidal functors between CDU categories preserving copy maps, discard maps and uniform states.
We assume that the CDU structure on I is trivial and the CDU structure on A ⊗ B is constructed in the obvious way from the structure on A and B. We also use the first equation in (2) to justify writing 'copy' maps with arbitrarily many output wires. Similar to [2], we can form the free CDU category FreeCDU(X, Σ) over a pair (X, Σ) of a generating set of objects X and a generating set Σ of typed morphisms f : u → w, with u, w ∈ X ⋆ , as follows. The category FreeCDU(X, Σ) has X ⋆ as its set of objects, and as morphisms the string diagrams constructed from the elements of Σ together with a copy map x → x ⊗ x, a discarding map x → I and a uniform state I → x for each x ∈ X, taken modulo the equations (2).

Lemma 2.2.
Stoch is a CDU category, with CDU structure defined as in (1).
An important feature of Stoch is that I = {⋆} is the final object, with the discarding map B → I being the map provided by the universal property, for any set B. This yields equation (3): for any f : A → B, composing f with the discarding map on B equals the discarding map on A. This justifies the name "discarding map".
We conclude by recording another significant feature of Stoch: disintegration [5,4]. In probability theory, this is the mechanism of factoring a joint probability distribution P (AB) as a product of the first marginal P (A) and a conditional distribution P (B|A). We recall from [4] the string diagrammatic rendition of this process. We say that a morphism f : X → Y in Stoch has full support if, as a stochastic matrix, it has no zero entries. When f is a state, it is a standard result that full support ensures uniqueness of disintegrations of f .
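A small numerical sketch of disintegration for a two-variable state may help fix the idea. The joint distribution below is hypothetical and the function name `disintegrate` is an assumption for illustration.

```python
import numpy as np

def disintegrate(omega):
    """Split a full-support joint omega[i, j] = P(A=i, B=j) into
    the marginal a[i] = P(A=i) and the channel b[j, i] = P(B=j | A=i)."""
    omega = np.asarray(omega, dtype=float)
    if not np.all(omega > 0):
        raise ValueError("omega must have full support")
    a = omega.sum(axis=1)            # discard B
    b = (omega / a[:, None]).T       # column i is P(B | A=i)
    return a, b

omega = np.array([[0.1, 0.3],
                  [0.2, 0.4]])       # hypothetical joint state on A ⊗ B
a, b = disintegrate(omega)

# Reassemble the joint: copy A, then apply b to one of the copies.
rebuilt = np.einsum("i,ji->ij", a, b)
```

The full-support check mirrors the hypothesis under which the factorisation is unique; with zero entries, some columns of `b` would be undetermined.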

Proposition 2.3 (Disintegration). For any state
Note that equation (3) and the CDU rules immediately imply that the unique

Bayesian Networks as String Diagrams
Bayesian networks are a widely-used tool in probabilistic reasoning. They give a succinct representation of conditional (in)dependences between variables as a directed acyclic graph. Traditionally, a Bayesian network on a set of variables A, B, C, . . . is defined as a directed acyclic graph (dag) G, an assignment of sets to each of the nodes V G := {A, B, C, . . .} of G and a joint probability distribution over those variables which factorises as $P(V_G) = \prod_{A \in V_G} P(A \mid Pa(A))$, where Pa(A) is the set of parents of A in G. Any joint distribution that factorises this way is said to satisfy the global Markov property with respect to the dag G. Alternatively, a Bayesian network can be seen as a dag equipped with a set of conditional probabilities {P (A | Pa(A)) | A ∈ V G } which can be combined to form the joint state. Thanks to disintegration, these two perspectives are equivalent.
Much like in the case of disintegration in the previous section, Bayesian networks have a neat categorical description as string diagrams in the category Stoch [7,12,13]. For example, a Bayesian network can be depicted both in its traditional form, as a dag with an associated joint distribution over its vertices, and as a string diagram in Stoch. In such a string diagram, the stochastic matrix a : I → A contains the probabilities P (A), b : A → B contains the conditional probabilities P (B|A), c : B ⊗ D → C contains P (C|BD), and so on. The entire diagram is then equal to a joint state ω.
Note the dag and the diagram look similar in structure. The main difference is the use of copy maps to make each variable (even those that are not leaves of the dag, A, B and D) an output of the overall diagram. This corresponds to a variable being observed. We can also consider Bayesian networks with latent variables, which do not appear in the joint distribution due to marginalisation. Continuing the example, making A into a latent variable amounts to discarding the A-labelled output of the diagram. In general, a Bayesian network (with possible latent variables) is a string diagram in Stoch that (1) only has outputs and (2) consists only of copy maps and boxes which each have exactly one output.
By 'a string diagram in Stoch', we mean not only the stochastic matrix itself, but also its decomposition into components. We can formalise exactly what we mean by taking a perspective on Bayesian networks which draws inspiration from Lawvere's functorial semantics of algebraic theories [15]. In this perspective, which elaborates on [7, Ch. 4], we maintain a conceptual distinction between the purely syntactic object (the diagram) and its probabilistic interpretation.
Starting from a dag G = (V G , E G ), we construct a free CDU category Syn G which provides the syntax of causal structures labelled by G. The objects of Syn G are generated by the vertices of G, whereas the morphisms are generated by a signature containing one box a : B 1 ⊗ . . . ⊗ B k → A for each vertex A ∈ V G with parents B 1 , . . . , B k ∈ V G . The following result establishes that models (à la Lawvere) of Syn G coincide with G-based Bayesian networks.
Proposition 3.1. There is a 1-1 correspondence between Bayesian networks based on the dag G and CDU functors of type Syn G → Stoch.
The proof is given in Appendix A.1. This proposition justifies the following definition of a category BN G of G-based Bayesian networks: objects are CDU functors Syn G → Stoch and arrows are monoidal natural transformations between them.

Towards Causal Inference: the Smoking Scenario
We will motivate our approach to causal inference via a classic example, inspired by the one given in Pearl's book [19]. Imagine a dispute between a scientist and a tobacco company. The scientist claims that smoking causes cancer. As a source of evidence, the scientist cites a joint probability distribution ω over variables S for smoking and C for cancer, which disintegrates as in (5) below, with matrix $c = \begin{pmatrix} 0.9 & 0.7 \\ 0.1 & 0.3 \end{pmatrix}$. Inspecting this c : S → C, the scientist notes that the probability of getting cancer for smokers (0.3) is three times as high as for nonsmokers (0.1). Hence, the scientist claims that smoking has a significant causal effect on cancer.
An important thing to stress here is that the scientist draws this conclusion using not only the observational data ω but also an assumed causal structure which gave rise to that data, as captured in the diagram in equation (5). That is, rather than treating diagram (5) simply as a calculation on the observational data, it can also be treated as an assumption about the actual, physical mechanism that gave rise to that data. Namely, this diagram encompasses the assumption that there is some prior propensity for people to smoke captured by s : I → S, which is both observed and fed into some other process c : S → C whereby an individual's choice to smoke determines whether or not they get cancer.
The tobacco company, in turn, says that the scientist's assumptions about the provenance of this data are too strong. While they concede that in principle it is possible for smoking to have some influence on cancer, the scientist should allow for the possibility that there is some latent common cause (e.g. genetic conditions, stressful work environment, etc.) which leads people both to smoke and get cancer. Hence, says the tobacco company, a 'more honest' causal structure to ascribe to the data ω is (6). This structure then allows for either party to be correct. If the scientist is right, the output of c : S ⊗ H → C depends mostly on its first input, i.e. the causal path from smoking to cancer. If the tobacco company is right, then c depends very little on its first input, and the correlation between S and C can be explained almost entirely from the hidden common cause.
So, who is right after all? Just from the observed distribution ω, it is impossible to tell. So, the scientist proposes a clinical trial, in which patients are randomly required to smoke or not to smoke. We can model this situation by replacing s in (6) with a process that ignores its inputs and outputs the uniform state. Graphically, this looks like 'cutting' the link s between H and S: This captures the fact that variable S is now randomised and no longer dependent on any background factors. This new distribution ω ′ represents the data the scientist would have obtained had they run the trial. That is, it gives the results of an intervention at s. If this ω ′ still shows a strong correlation between smoking and cancer, one can conclude that smoking indeed causes cancer even when we assume the weaker causal structure (6).
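To make the effect of such a trial concrete, here is a minimal numerical sketch of structure (6) and its cut version. All probabilities are invented for illustration (the paper does not give values for h, s or c in this structure), and the helper `conditional` is likewise an assumption.

```python
import numpy as np

# Hypothetical model for structure (6); all variables boolean.
h = np.array([0.5, 0.5])                        # h[x]       = P(H=x)
s = np.array([[0.8, 0.3],
              [0.2, 0.7]])                      # s[i, x]    = P(S=i | H=x)
c = np.array([[[0.9, 0.6], [0.8, 0.4]],
              [[0.1, 0.4], [0.2, 0.6]]])        # c[k, i, x] = P(C=k | S=i, H=x)

# Observational distribution on S ⊗ C (H is latent, so it is summed out).
omega = np.einsum("x,ix,kix->ik", h, s, c)

# Intervention: s is replaced by "discard H, then output the uniform state".
s_cut = np.full_like(s, 0.5)
omega_cut = np.einsum("x,ix,kix->ik", h, s_cut, c)

def conditional(joint):
    """P(C | S) from a joint indexed [S, C]; row i is the distribution for S=i."""
    return joint / joint.sum(axis=1, keepdims=True)

print(conditional(omega))      # correlation seen in observational data
print(conditional(omega_cut))  # what a randomised trial would show
```

Note that computing `omega_cut` here uses `h` and `c` directly, which is exactly the information the scientist does not have when only ω is observed.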
Unsurprisingly, the scientist fails to get ethical approval to run the trial, and hence has only the observational data ω to work with. Given that the scientist only knows ω (and not c and h), there is no way to compute ω ′ in this case. However, a key insight of statistical causal inference is that sometimes it is possible to compute interventional distributions from observational ones. Continuing the smoking example, suppose the scientist proposes the following revision to the causal structure: they posit a structure (8) that includes a third observed variable (the presence T of tar in the lungs), which completely mediates the causal effect of smoking on cancer.
As with our simpler structure, the diagram (8) contains some assumptions about the provenance of the data ω. In particular, by omitting wires, we are asserting there is no direct causal link between certain variables. The absence of an H-labelled input to t says there is no direct causal link from H to T (only mediated by S), and the absence of an S-labelled input wire into c captures that there is no direct causal link from S to C (only mediated by T ). In the traditional approach to causal inference, such relationships are typically captured by a graph-theoretic property called d-separation on the dag associated with the causal structure.
We can again imagine intervening at S by replacing s : H → S by the map that discards its input and outputs the uniform state. Again, this 'cutting' of the diagram will result in a new interventional distribution ω ′ . However, unlike before, it is possible to compute this distribution from the observational distribution ω.
In order to do that, we first need to develop the appropriate categorical framework. In Section 5, we will model 'cutting' as a functor. In Section 6, we will introduce a generalisation of disintegration, which we call comb disintegration. These tools will enable us to compute ω ′ from ω in Section 7.

Interventional Distributions as Diagram Surgery
The goal of this section is to define the 'cut' operation in (7) as an endofunctor on the category of Bayesian networks. First, we observe that such an operation exclusively concerns the string diagram part of a Bayesian network: following the functorial semantics given in Section 3, it is thus appropriate to define cut as an endofunctor on Syn G , for a given dag G.
Definition 5.1. For a fixed node A ∈ V G in a graph G, let cut A : Syn G → Syn G be the CDU functor freely obtained by the following action on the generators: the generator with output wire of type A is sent to the discarding map on its inputs followed by the uniform state on A, and every other generator is sent to itself. Intuitively, cut A applied to a string diagram f of Syn G removes from f each occurrence of a box with output wire of type A.
Proposition 3.1 allows us to "transport" the cutting operation over to Bayesian networks. Given any Bayesian network based on G, let F : Syn G → Stoch be the corresponding CDU functor given by Proposition 3.1. Then, we can define its A-cutting as the Bayesian network identified by the CDU functor F ∘ cut A . This yields an (idempotent) endofunctor Cut A : BN G → BN G .
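At the level of a concrete Bayesian network, the A-cutting can be sketched as an operation on a table of conditional probabilities. The dictionary encoding, the function names `cut` and `joint`, and the numbers are all assumptions made for this illustration; a cut node keeps its parent list but receives a table that ignores them, which has the same interpretation as discarding the parents and emitting the uniform state.

```python
import numpy as np
import string

def cut(network, node):
    """Replace the box producing `node` by 'discard parents, emit uniform'."""
    parents, table = network[node]
    new = dict(network)
    new[node] = (parents, np.full_like(table, 1.0 / table.shape[0]))
    return new

def joint(network, order):
    """Joint distribution over all nodes, with axes ordered as in `order`."""
    letter = {v: string.ascii_lowercase[n] for n, v in enumerate(order)}
    specs, tables = [], []
    for v, (parents, table) in network.items():
        specs.append(letter[v] + "".join(letter[p] for p in parents))
        tables.append(table)
    out = "".join(letter[v] for v in order)
    return np.einsum(",".join(specs) + "->" + out, *tables)

# Hypothetical network H -> S, (S, H) -> C; table axes are (node, *parents).
net = {
    "H": ((), np.array([0.5, 0.5])),
    "S": (("H",), np.array([[0.8, 0.3], [0.2, 0.7]])),
    "C": (("S", "H"), np.array([[[0.9, 0.6], [0.8, 0.4]],
                                 [[0.1, 0.4], [0.2, 0.6]]])),
}

once = joint(cut(net, "S"), ["H", "S", "C"])
twice = joint(cut(cut(net, "S"), "S"), ["H", "S", "C"])
```

Here `once` and `twice` agree, reflecting the idempotence of Cut A noted above.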

The Comb Factorisation
Thanks to the developments of Section 5, we can understand the transition from left to right in (7) as the application of the functor Cut S to the 'Smoking' node S. The next step is being able to actually compute the individual Stoch-morphisms appearing in (8), to give an answer to the causality question.

In order to do that, we want to work in a setting where t : S → T can be isolated and 'extracted' from (8). What is left behind is a stochastic matrix with a 'hole' where t has been extracted. To define 'morphisms with holes', it is convenient to pass from SMCs to compact closed categories (see e.g. [23]). Stoch is not itself compact closed, but it embeds into Mat(ℝ⁺), whose morphisms are all matrices over the non-negative real numbers. Mat(ℝ⁺) has a (self-dual) compact closed structure; that means, for any set A there is a 'cap' ∩ : A ⊗ A → I and a 'cup' ∪ : I → A ⊗ A, which satisfy the 'yanking' equations. As matrices, caps and cups are defined by $\cap_{ij} = \cup^{ij} = \delta^j_i$. Intuitively, they amount to 'bent' identity wires. Another aspect of Mat(ℝ⁺) that is useful to recall is the following handy characterisation of the subcategory Stoch.
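The cap, the cup and one of the yanking equations can be checked directly on matrices. This is only a sketch for a three-element set, with the flattening of A ⊗ A into a single index chosen here as an assumption (it is the one used by `np.kron`).

```python
import numpy as np

n = 3
identity = np.eye(n)

cup = identity.reshape(n * n, 1)     # I -> A ⊗ A, entries delta_{ij}
cap = identity.reshape(1, n * n)     # A ⊗ A -> I, entries delta_{ij}

# Yanking: (cap ⊗ id) after (id ⊗ cup) is the identity wire on A.
yank = np.kron(cap, identity) @ np.kron(identity, cup)

# The cup is not a stochastic matrix: its single column sums to n, not 1.
cup_total = cup.sum()
```

The last line illustrates why these maps live in the larger category of non-negative matrices rather than in Stoch itself.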

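A morphism f : A → B of Mat(ℝ⁺) is a stochastic matrix (thus a morphism of Stoch) if and only if (3) holds.
A suitable notion of 'stochastic map with a hole' is provided by a comb. These structures originate in the study of certain kinds of quantum channels [3]. Concretely, a 2-comb is a morphism f : A 1 ⊗ A 2 → B 1 ⊗ B 2 of Stoch satisfying condition (9): discarding the output B 2 yields f ′ followed in parallel by the discarding map on A 2 , for some f ′ : A 1 → B 1 .
This definition extends inductively to n-combs, where we require that discarding the rightmost output yields f ′ in parallel with a discarding map on the rightmost input, for some (n − 1)-comb f ′ . However, for our purposes, restricting to 2-combs will suffice.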
The intuition behind condition (9) is that the contribution from input A 2 is only visible via output B 2 . Thus, if we discard B 2 we may as well discard A 2 . In other words, the input/output pair A 2 , B 2 happen 'after' the pair A 1 , B 1 . Hence, it is typical to depict 2-combs in the shape of a (hair) comb, with 2 'teeth', as in (10). While combs themselves live in Stoch, Mat(ℝ⁺) accommodates a second-order reading of the transition in (10): we can treat f as a map which expects as input a map g : B 1 → A 2 and produces as output a map of type A 1 → B 2 . Plugging g : B 1 → A 2 into the 2-comb can be formally defined in Mat(ℝ⁺) by composing f and g in the usual way, then feeding the output of g into the second input of f , using caps and cups, as in (11).
Importantly, for generic f and g of Stoch, there is no guarantee that forming the composite (11) in Mat(ℝ⁺) yields a valid Stoch-morphism, i.e. a morphism satisfying the finality equation (3). However, if f is a 2-comb and g is a Stoch-morphism, equation (9) enables a discarding map plugged into the output B 2 in (11) to 'fall through' the right side of f , which guarantees that the composed map satisfies the finality equation for discarding. See Appendix A.2 for the explicit diagram calculation.
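The following sketch illustrates both halves of this remark numerically. A 2-comb is stored as a tensor `f[b1, b2, a1, a2]`; the helper names, the random test data and the particular non-comb counterexample are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_out, n_in):
    """Random stochastic tensor: axis 0 is the output, remaining axes are inputs."""
    t = rng.random((n_out, *n_in)) + 0.1
    return t / t.sum(axis=0, keepdims=True)

def is_2comb(f):
    """f[b1, b2, a1, a2]: after discarding B2, nothing may depend on A2."""
    m = f.sum(axis=1)                           # m[b1, a1, a2]
    return bool(np.allclose(m, m[..., :1]))

def plug(f, g):
    """Feed g[a2, b1] : B1 -> A2 into the hole of f, giving a map A1 -> B2."""
    return np.einsum("bcad,db->ca", f, g)

# A comb built from two stages that share a memory wire M.
f1 = random_channel(4, (2,)).reshape(2, 2, 2)   # f1[b1, m, a1] : A1 -> B1 ⊗ M
f2 = random_channel(2, (2, 2))                  # f2[b2, m, a2] : M ⊗ A2 -> B2
f = np.einsum("bma,cmd->bcad", f1, f2)
g = random_channel(2, (2,))
plugged = plug(f, g)                            # columns sum to 1

# A stochastic map that is not a comb: B1 is a copy of the later input A2.
bad = np.einsum("bd,c,a->bcad", np.eye(2), np.full(2, 0.5), np.ones(2))
not_gate = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
plugged_bad = plug(bad, not_gate)               # the zero matrix
```

In the counterexample the loop demands a value equal to its own negation, so no probability mass survives and the result is not stochastic.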
With the concept of 2-combs in hand, we can state our factorisation result.
Theorem 6.3. For any state ω : I → A ⊗ B ⊗ C of Stoch with full support, there exist a unique 2-comb f : B → A ⊗ C and a unique stochastic matrix g : A → B such that ω is obtained by plugging g into f as in (12); in matrix terms, $\omega^{ijk} = f^{ik}_j \, g^j_i$.
Proof. The construction of f and g mimics the construction of c-factors in [25], using string diagrams and (diagrammatic) disintegration. We first use ω to construct maps a : I → A, b : A → B, c : A ⊗ B → C, then construct f using a and c and construct g using b. The full proof, including uniqueness, is given in Appendix A.3.
Note that Theorem 6.3 generalises the normal disintegration property given in Proposition 2.3. The latter is recovered by taking A := I (or C := I) above.
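The construction outlined in the proof can be sketched on arrays. The index layout `f[i, k, j]`, `g[j, i]` follows the matrix form used in the appendix; the function name and the random test state are assumptions for illustration.

```python
import numpy as np

def comb_disintegrate(omega):
    """Factor a full-support state omega[i, j, k] on A ⊗ B ⊗ C so that
    omega[i, j, k] == f[i, k, j] * g[j, i]."""
    omega = np.asarray(omega, dtype=float)
    if not np.all(omega > 0):
        raise ValueError("omega must have full support")
    ab = omega.sum(axis=2)                 # P(A, B)
    a = ab.sum(axis=1)                     # a[i] = P(A=i)
    g = (ab / a[:, None]).T                # g[j, i] = b = P(B=j | A=i)
    c = omega / ab[:, :, None]             # c[i, j, k] = P(C=k | A=i, B=j)
    f = np.einsum("i,ijk->ikj", a, c)      # f[i, k, j] = P(A=i) P(C=k | A=i, B=j)
    return f, g

rng = np.random.default_rng(1)
omega = rng.random((2, 3, 2)) + 0.1
omega /= omega.sum()

f, g = comb_disintegrate(omega)
rebuilt = np.einsum("ikj,ji->ijk", f, g)   # plug g back into the comb
discard_C = f.sum(axis=1)                  # [i, j]: must not depend on j
```

That `discard_C` is constant along `j` is the comb condition for f : B → A ⊗ C: once the C output is discarded, the B input has no visible effect.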

Returning to the Smoking Scenario
We now return to the smoking scenario of Section 4. There, we concluded by claiming that the introduction of an intermediate variable T to the observational distribution ω : I → S⊗T ⊗C would enable us to calculate the interventional distribution. That is, we can calculate ω ′ = F (cut X (ω)) from ω := F (ω). Thanks to Theorem 6.3, we are now able to perform that calculation. We first observe that our assumed causal structure for ω fits the form of Theorem 6.3, where g is t and f is a 2-comb containing everything else. Hence, f and g are computable from ω. If we plug them back together as in (12), we will get ω back. However, if we insert a 'cut' between f and g, we obtain ω ′ = F (cut X (ω)).
Let us now consider a concrete example. We fix interpretations for the sets S, T , and C as booleans: S = T = C = {0, 1} and let ω : I → S ⊗ T ⊗ C be a given stochastic matrix. From the resulting interventional distribution, we conclude that, in a (hypothetical) clinical trial, patients are about twice as likely to get cancer if they smoke (54% vs. 25%). So, since 54 < 68, there was some confounding influence between S and C in our observational data, but after removing it via comb disintegration, we see there is still a significant causal link between smoking and cancer. Note this conclusion depends totally on the particular observational data that we picked. For a different interpretation of ω in Stoch, one might conclude that there is no causal connection, or even that smoking decreases the chance of getting cancer. Interestingly, all three cases can arise even when a naïve analysis of the data shows a strong direct correlation between S and C. To see and/or experiment with these cases, we have provided the Python code used to perform these calculations. See also [18] for a pedagogical overview of this example (using traditional Bayesian network language) with some sample calculations.
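The calculation can be sketched end to end as follows. This is not the authors' code and does not use their data: the hidden model below is invented so that the result obtained from ω alone can be compared with the true interventional distribution, which would normally be unknown.

```python
import numpy as np

# Hypothetical ground-truth model with the tar structure; all variables boolean.
h = np.array([0.5, 0.5])                        # h[x]       = P(H=x)
s = np.array([[0.8, 0.3], [0.2, 0.7]])          # s[i, x]    = P(S=i | H=x)
t = np.array([[0.9, 0.2], [0.1, 0.8]])          # t[j, i]    = P(T=j | S=i)
c = np.array([[[0.9, 0.6], [0.7, 0.3]],
              [[0.1, 0.4], [0.3, 0.7]]])        # c[k, j, x] = P(C=k | T=j, H=x)

# Observational data: the only input to the calculation below.
omega = np.einsum("x,ix,ji,kjx->ijk", h, s, t, c)   # P(S, T, C)

# Comb disintegration of omega, with A := S, B := T, C := C.
st = omega.sum(axis=2)                          # P(S, T)
p_s = st.sum(axis=1)                            # P(S)
g = (st / p_s[:, None]).T                       # g[j, i] = P(T=j | S=i)
f = np.einsum("i,ijk->ikj", p_s, omega / st[:, :, None])   # f[i, k, j]

# Cut: discard the S output of f, emit a uniform S, copy it to the
# output and into g, then plug g back into the comb.
u = np.full(2, 0.5)
omega_cut = np.einsum("akj,i,ji->ijk", f, u, g)

# Ground truth, computed from the hidden model for comparison only.
truth = np.einsum("x,i,ji,kjx->ijk", h, u, t, c)

# P(C=1 | S) before and after the intervention.
p_cancer_obs = omega[:, :, 1].sum(axis=1) / omega.sum(axis=(1, 2))
p_cancer_do = omega_cut[:, :, 1].sum(axis=1) / omega_cut.sum(axis=(1, 2))
```

In this invented model `omega_cut` coincides with `truth`, while `p_cancer_obs` and `p_cancer_do` differ, which is the gap attributable to the hidden common cause.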

The General Case for a Single Intervention
While we applied the comb decomposition to a particular example, this technique applies essentially unmodified to many examples where we intervene at a single variable (called X below) within an arbitrary causal structure.
Theorem 8.1. Let G be a dag with a fixed node X that has corresponding generator x : Y 1 ⊗ . . . ⊗ Y n → X in Syn G . Then, let ω be a morphism in Syn G of the following form: for some morphisms f 1 , f 2 and g in Syn G not containing x as a subdiagram. Then the interventional distribution ω ′ := F (cut X (ω)) is computable from the observational distribution ω = F (ω).
Proof. The proof is very close to the example in the previous section. Interpreting ω into Stoch, we get a diagram of stochastic maps, which we can comb-disintegrate, then recompose with the uniform state in place of x to produce the interventional distribution. The resulting diagram is then F (cut X (ω)).
This is general enough to cover several well-known sufficient conditions from the causality literature, including single-variable versions of the so-called front-door and back-door criteria, as well as the sufficient condition based on confounding paths given by Tian and Pearl [25]. As the latter subsumes the other two, we will say a few words about the relationship between the Tian/Pearl condition and Theorem 8.1. In [25], the authors focus on semi-Markovian models, where the only latent variables have exactly two observed children and no parents. If we write A ↔ B whenever two observed variables are connected by a latent common cause, then one can characterise confounding paths as the transitive closure of ↔. They go on to show that the interventional distribution corresponding to cutting X is computable whenever there are no confounding paths connecting X to one of its children.
We can compare this to the form of expression ω in equation (14). First, note this factorisation implies that all boxes which take X as an input must occur as sub-diagrams of g. Hence, any 'confounding path' connecting X to its children would yield at least one (un-copied) wire from f 1 to g, hence it cannot be factored as (14). Conversely, if there are no confounding paths from X to its children, then we can place the boxes involved in any other confounding path either entirely inside of g or entirely outside of g and obtain factorisation (14). Hence, restricting to semi-Markovian models, the no-confounding-path condition from [25] is equivalent to ours. However, Theorem 8.1 is slightly more general: its formulation doesn't rely on the causal structure ω being semi-Markovian.
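The no-confounding-path condition from [25] discussed above is a simple reachability check, sketched below for semi-Markovian models. The edge-list encoding and the function names are assumptions made for illustration; directed edges are pairs (parent, child) and each latent common cause is recorded as an unordered pair of observed variables.

```python
def confounded_component(bidirected, x):
    """Variables reachable from x along latent-common-cause (<->) edges."""
    adjacent = {}
    for u, v in bidirected:
        adjacent.setdefault(u, set()).add(v)
        adjacent.setdefault(v, set()).add(u)
    seen, stack = {x}, [x]
    while stack:
        for nxt in adjacent.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def no_confounding_path_to_children(directed, bidirected, x):
    """True if no confounding path connects x to one of its children."""
    children = {v for u, v in directed if u == x}
    return children.isdisjoint(confounded_component(bidirected, x))

# Smoking with tar: S -> T -> C, with S <-> C via the hidden H.
with_tar = no_confounding_path_to_children(
    [("S", "T"), ("T", "C")], [("S", "C")], "S")

# Smoking without tar: S -> C, with S <-> C via the hidden H.
without_tar = no_confounding_path_to_children(
    [("S", "C")], [("S", "C")], "S")

print(with_tar, without_tar)
```

The two calls reproduce the contrast from the smoking scenario: the mediating variable T separates S from its confounded partner C, whereas in the structure without tar the child of S is itself confounded with S.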

Conclusion and future work
This paper takes a fresh, systematic look at the problem of causal identifiability. By clearly distinguishing syntax (string diagram surgery and identification of comb shapes) and semantics (comb-disintegration of joint states) we obtain a clear methodology for computing interventional distributions, and hence causal effects, from observational data.
A natural next step is moving beyond single-variable interventions to the general case, i.e. situations where we allow interventions on multiple variables which may have some arbitrary causal relationships connecting them. This would mean extending the comb factorisation Theorem 6.3 from a 2-comb and a channel to arbitrary n-combs. This seems to be straightforward, via an inductive extension of the proof in Appendix A.3. A more substantial direction of future work will be the strengthening of Theorem 8.1 from sufficient conditions for causal identifiability to a full characterisation. Indeed, the related condition based on confounding paths from [25] is a necessary and sufficient condition for computing the interventional distribution on a single variable. Hence, it will be interesting to formalise this necessity proof (and more general versions, e.g. [10]) within our framework and investigate, for example, the extent to which it holds beyond the semi-Markovian case.
While we focus exclusively on the case of taking models in Stoch in this paper, the techniques we gave are posed at an abstract level in terms of composition and factorisation. Hence, we are optimistic about their prospects to generalise to other probabilistic (e.g. infinite discrete and continuous variables) and quantum settings. In the latter case, this could provide insights into the emerging field of quantum causal structures [6,21,17,22,9], which attempts in part to replay some of the results coming from statistical causal reasoning, but where quantum processes play a role analogous to stochastic ones. A key difficulty in applying our framework to a category of quantum processes, rather than Stoch, is the unavailability of 'copy' morphisms due to the quantum no-cloning theorem [26]. However, a recent proposal for the formulation of 'quantum common causes' [1] suggests a (partially-defined) analogue to the role played by 'copy' in our formulation constructed via multiplication of certain commuting Choi matrices. Hence, it may yet be possible to import results from classical causal reasoning into the quantum case just by changing the category of models.

A Appendix: Omitted Proofs
A.1 Proof of Proposition 3.1 Proposition. There is a 1-1 correspondence between Bayesian networks based on the dag G and CDU functors of type Syn G → Stoch.
Proof. In one direction, consider a Bayesian network consisting of the dag G and, for each node A ∈ V G , an assignment of a set τ (A) and a conditional probability P (A|Pa(A)). This data yields a CDU functor F : Syn G → Stoch, defined by the following mappings: F :: . . . It is immediate that these two mappings are inverse to each other, thus proving the statement.

A.2 Comb composition (11) yields a Stoch morphism
The composition given in (11) satisfies the finality equation (3), and hence yields a Stoch-morphism: a discarding map plugged into the output B 2 falls through f by (9), and then through g by (3).

A.3 Proof of Theorem 6.3
Now, we let g := b and let f be the 2-comb constructed from a and c. Plugging g into f then recovers ω; the last step of this calculation is just diagram deformation and the comonoid laws, and the resulting diagram is equal to ω by (16). For uniqueness, suppose (12) holds for some other f ′ , g ′ . Then by uniqueness of disintegration, it follows that g ′ = g = b.
To show that f = f ′ , we expand (15) explicitly in terms of matrices. This equation is equivalent to $\omega^{ijk} = f^{ik}_j \, g^j_i = (f')^{ik}_j \, g^j_i$. Note that if g had any zero elements, ω would not have full support, hence $g^j_i \neq 0$ and therefore $f^{ik}_j = (f')^{ik}_j$ for all i, j, k.