
Causal identifiability and piecemeal experimentation

  • S.I.: Evidence Amalgamation in the Sciences

Abstract

In medicine and the social sciences, researchers often measure only a handful of variables simultaneously. The underlying assumption behind this methodology is that combining the results of dozens of smaller studies can, in principle, yield as much information as one large study, in which dozens of variables are measured simultaneously. Mayo-Wilson (2011, 2013) shows that this assumption is false when causal theories are inferred from observational data. This paper extends Mayo-Wilson's results to cases in which experimental data are available. I prove several new theorems showing that, as the number of variables under investigation grows, experiments do not improve, in the worst case, one's ability to identify the true causal model if one can measure only a few variables at a time. However, stronger statistical assumptions (e.g., Gaussianity) significantly aid causal discovery in piecemeal inquiry, even if such assumptions are unhelpful when all variables can be measured simultaneously.


Notes

  1. As is standard, I use the word experiment to refer to settings in which at least one variable is manipulated. A randomized controlled trial (rct) is a paradigmatic example of an experiment.

  2. Proofs of all new theorems are in the technical appendix.

  3. See Tillman and Spirtes (2011), Tsamardinos et al. (2012), and Triantafillou and Tsamardinos (2015) for algorithms that work with observational data in the presence of latent confounding. Section 6 of Tillman and Eberhardt (2014) extends these algorithms to experimental data.

  4. Two events A and B are conditionally independent given C if \(P(A,B|C) = P(A|C) \cdot P(B|C)\). This definition extends to random variables in the obvious way.

  5. For the remainder of the paper, I use the uppercase letters G and H to denote dags, and the uppercase letters U, V, and W to denote vertices in graphs. I use \({{{\mathcal {V}}}}\) to represent a causally sufficient set of variables under investigation, and I use calligraphic letters like \({{\mathcal {U}}}\) and \({{\mathcal {E}}}\) to denote subsets of \({{\mathcal {V}}}\). Finally, I use the scripted letters \(\mathscr {U}, \mathscr {V}\) and \(\mathscr {W}\) to denote subsets of the power set \({{\mathcal {P}}}({{\mathcal {V}}})\) of \({{\mathcal {V}}}\).

  6. For an extensive discussion of both principles, see Spirtes et al. (2000). For further defenses of the cmc, see Hausman and Woodward (2004) and Steel (2005); for criticisms, see Cartwright (2002, 2007). For criticisms of cfc, see Freedman and Humphreys (1999) and Cartwright (2007).

  7. “I” stands for independence. Causal theories that cannot be distinguished by conditional independence facts might nonetheless be distinguishable using background knowledge, temporal information, and other statistical assumptions [e.g. that variables are non-Gaussian and linear combinations of their causes (Shimizu et al. 2006)]. I discuss these issues further in Sect. 4. Formally, two graphs are what I call “\(\mathsf {I}\)-indistinguishable” if their d-separation relations are identical (see “Appendix”), and so \(\mathsf {I}\)-indistinguishability is a purely graph-theoretic relation that does not require any probabilistic notions. Nonetheless, it is easiest to explain \(\mathsf {I}\)-indistinguishability using its relationship to a mathematically equivalent notion of “Markov equivalence,” which is a relationship between Bayesian Networks, i.e., pairs of the form \(\langle G, p \rangle \) where G is a dag containing random variables as its vertices and p is a probability distribution over the variables in G satisfying particular conditional independence facts. Again, see Sect. 4.

  8. I will not discuss either the ethics or epistemological necessity of randomization in treatment. See Kadane and Seidenfeld (1990) and Worrall (2007) for critical discussions of randomization.

  9. The way in which experiments are represented here is most appropriate for modeling the effects of medical treatments in rcts in which patients comply perfectly with treatment. In such (rare) trials, researchers' choices completely determine a patient's treatment. Such experiments are often called "hard" or "surgical" interventions. In this paper, I restrict my attention to hard interventions and ignore "soft" interventions, which introduce a new cause of a variable but fail to eliminate other causal influences. See Nyberg and Korb (2006), Eberhardt (2007), and Eaton and Murphy (2007) for discussions of soft interventions.

  10. There are \(\left( {\begin{array}{c}4\\ 3\end{array}}\right) \) different ways to choose which three variables are observed. Given three variables, one can perform \(3=\left( {\begin{array}{c}3\\ 1\end{array}}\right) \) many interventions on one variable, \(3=\left( {\begin{array}{c}3\\ 2\end{array}}\right) \) on two variables, and \(1=\left( {\begin{array}{c}3\\ 3\end{array}}\right) \) on all three variables. So there are \(\left( {\begin{array}{c}4\\ 3\end{array}}\right) (3+3+1) =28\) possible experiments. Not all such experiments (e.g., intervening on all three variables simultaneously) will be informative. (This count is reproduced in the first code sketch following these notes.)

  11. If every pair in \(\mathscr {E}\) is of the form \(\langle {{\mathcal {E}}}, {{\mathcal {V}}} \rangle \) (i.e., all variables are observed in the intervention) and \(\langle \emptyset , {{\mathcal {V}}} \rangle \in \mathscr {E}\) (i.e. all variables are passively observed simultaneously), then the \(\mathscr {E}\)-indistinguishability class is identical to the interventional equivalence class induced by \({{\mathcal {I}}} = \{ {{\mathcal {E}}}: \langle {{\mathcal {E}}}, {{\mathcal {V}}} \rangle \in \mathscr {E} \}\) in the sense of Hauser and Bühlmann (2012).

  12. This theorem follows immediately from the next, and so only the latter is proven in the “Appendix”.

  13. Importantly, this argument does not rule out the possibility that, if no interventions are performed, the two share a common cause. It simply shows that plaque is a cause of heart disease.

  14. See Theorem 6 in Mayo-Wilson (2012).

  15. See Theorem 6 in Mayo-Wilson (2012).

  16. Thanks to David Danks for suggesting this point.

  17. So distributional assumptions include both parametric assumptions (e.g., that the true model is linear Gaussian) and non-parametric ones (e.g., that the model is non-Gaussian).

  18. Saying that G satisfies the cmc "with respect to p" means that every variable in G is conditionally independent, with respect to p, of its non-descendants given its parents.

  19. So “q is \(\varvec{M}_{BN}\) compatible with H” means “q is Markov and faithful to H”.

  20. Note, I have defined “Bayesian network” using the Markov and faithfulness conditions. If one defines “Bayesian network” in terms of a condition that requires a probability distribution to factor in a particular way, the equivalence of Markov-equivalence and \(\mathsf {I}\)-indistinguishability requires a proof. See Lauritzen et al. (1990).

  21. To my knowledge, Theorem 10 was first proven by Geiger et al. (1990). In the linear Gaussian case, the theorem was generalized by Richardson and Spirtes (2002) to include causal theories with latent variables in the presence of selection bias. An alternative proof in the discrete case was given by Meek (1995). A constructive proof for the linear Gaussian case is given in Mayo-Wilson (2012). Theorem 10 ought to be contrasted with the results of Shimizu et al. (2006), which show that G is \(\varvec{M}\)-indistinguishable from only itself if \(\varvec{M}=\varvec{Lingam}\). Hoyer et al. (2009) show that \(\varvec{M}\)-indistinguishability differs from \(\mathsf {I}\)-indistinguishability when \(\varvec{M}\) contains nonlinear additive noise models, but as far as I am aware, there is no general characterization of \(\varvec{M}\)-indistinguishability for noisy-or models or for nonlinear additive noise models.

  22. At this point, I should remind readers that all of the equivalence classes I have introduced characterize indistinguishability "in the limit," i.e., with arbitrarily large samples.

  23. Here's the proof, which was suggested to me by Frederick Eberhardt. If every pair of variables is comeasured, then every variable V is measured in some study, and so one can calculate the mean of each variable V. Similarly, if V and W are comeasured in an observational study, then one can calculate the covariance of V and W. Thus, if one can comeasure all pairs of variables, one can calculate the entire covariance matrix and mean vector. If the true model is linear Gaussian, then the unknown probability distribution is completely characterized by the mean vector and covariance matrix. Hence, any two models that are \(\varvec{M}\)-2-indistinguishable are likewise \(\varvec{M}\)-indistinguishable. (A simulation illustrating this assembly appears in the second code sketch following these notes.)

  24. But one will need to eliminate the last clause, “and hence, \(\mathsf {I}\)-indistinguishable”, from the statement of the theorem, as \(\varvec{M}\)-indistinguishability does not entail \(\mathsf {I}\)-indistinguishability if \(\varvec{M} = \varvec{Lingam}\).
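Two short code sketches referenced in the notes above. First, footnote 10's count of possible experiments, reproduced in a few lines of Python:

```python
from math import comb

# Choose which 3 of the 4 variables to observe, then intervene on a
# non-empty subset (of size 1, 2, or 3) of the observed variables.
observed_sets = comb(4, 3)
interventions = comb(3, 1) + comb(3, 2) + comb(3, 3)
print(observed_sets * interventions)   # 4 * 7 = 28
```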
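Second, a simulation illustrating footnote 23's assembly of a full covariance matrix from pairwise studies. This sketch is mine, not the paper's: the model, seed, and sample size are arbitrary illustrative assumptions. Each simulated "study" comeasures exactly one pair of variables, and the pairwise estimates are stitched into a full mean vector and covariance matrix:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_vars, n_samples = 4, 100_000

# A hypothetical true linear Gaussian model over four variables.
A = rng.normal(size=(n_vars, n_vars))
true_cov = A @ A.T
true_mean = rng.normal(size=n_vars)

mean_hat = np.zeros(n_vars)
cov_hat = np.zeros((n_vars, n_vars))

# Each simulated "study" comeasures exactly one pair of variables.
for i, j in combinations(range(n_vars), 2):
    study = rng.multivariate_normal(true_mean, true_cov, size=n_samples)[:, [i, j]]
    mean_hat[i], mean_hat[j] = study.mean(axis=0)
    c = np.cov(study.T)
    cov_hat[i, i], cov_hat[j, j] = c[0, 0], c[1, 1]
    cov_hat[i, j] = cov_hat[j, i] = c[0, 1]

# The stitched-together estimates recover the full covariance matrix,
# which (in the linear Gaussian case) pins down the joint distribution.
print(np.abs(cov_hat - true_cov).max())   # small (sampling error only)
```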

References

  • Cartwright, N. (2002). Against modularity, the causal Markov condition, and any link between the two: Comments on Hausman and Woodward. The British Journal for the Philosophy of Science, 53(3), 411–453.


  • Cartwright, N. (2007). Hunting causes and using them: Approaches in philosophy and economics. Cambridge: Cambridge University Press.


  • Danks, D., & Glymour, C. (2001). Linearity properties of Bayes nets with binary variables. In Proceedings of the seventeenth conference on uncertainty in artificial intelligence (pp. 98–104).

  • Eaton, D., & Murphy, K. (2007). Exact Bayesian structure learning from uncertain interventions. In Artificial intelligence and statistics (pp. 107–114).

  • Eberhardt, F. (2007). Causation and intervention. Unpublished doctoral dissertation, Carnegie Mellon University.

  • Eberhardt, F., & Scheines, R. (2007). Interventions and causal inference. Philosophy of Science, 74(5), 981–995.


  • Eberhardt, F., Glymour, C., & Scheines, R. (2006). N-1 experiments suffice to determine the causal relations among n variables. In D. E. Holmes & L. C. Jain (Eds.), Innovations in machine learning (pp. 97–112). Berlin, Heidelberg: Springer.


  • Freedman, D., & Humphreys, P. (1999). Are there algorithms that discover causal structure? Synthese, 121(1), 29–54.


  • Geiger, D., Verma, T., & Pearl, J. (1990). Identifying independence in Bayesian networks. Networks, 20(5), 507–534.


  • Hauser, A., & Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13, 2409–2464.


  • Hausman, D., & Woodward, J. (2004). Manipulation and the causal Markov condition. Philosophy of Science, 71(5), 846–856.


  • Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems (pp. 689–696).

  • Hyttinen, A., Eberhardt, F., & Hoyer, P. O. (2010). Causal discovery for linear cyclic models with latent variables. In Proceedings of the 5th European workshop on probabilistic graphical models (PGM 2010).

  • Kadane, J. B., & Seidenfeld, T. (1990). Randomization in a Bayesian perspective. Journal of Statistical Planning and Inference, 25(3), 329–345.


  • Lauritzen, S. L., Dawid, A. P., Larsen, B. N., & Leimer, H. G. (1990). Independence properties of directed Markov fields. Networks, 20(5), 491–505.


  • Mayo-Wilson, C. (2011). The problem of piecemeal induction. Philosophy of Science, 78(5), 864–874.


  • Mayo-Wilson, C. (2012). Combining causal theories and dividing scientific labor. Doctoral dissertation, Carnegie Mellon University.

  • Mayo-Wilson, C. (2013). The limits of piecemeal causal inference. The British Journal for the Philosophy of Science, 65, 213–249. https://doi.org/10.1093/bjps/axs030.


  • Meek, C. (1995). Strong completeness and faithfulness in Bayesian networks. In Proceedings of the eleventh conference on uncertainty in artificial intelligence (pp. 411–418).

  • Nyberg, E., & Korb, K. (2006). Informative interventions. In P. McKay Illari, F. Russo, & J. Williamson (Eds.), Causality and probability in the sciences. London: College Publications.

  • Pearl, J. (2000). Causality: Models, reasoning, and inference (Vol. 47). Cambridge: Cambridge University Press.


  • Pearl, J., & Verma, T. S. (1995). A theory of inferred causation. In D. Prawitz, B. Skyrms, & D. Westerståhl (Eds.), Logic, methodology and philosophy of science IX (Studies in Logic and the Foundations of Mathematics, Vol. 134, pp. 789–811). New York: Elsevier.


  • Richardson, T., & Spirtes, P. (2002). Ancestral graph Markov models. The Annals of Statistics, 30(4), 962–1030.


  • Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.


  • Spirtes, P., Glymour, C. N., & Scheines, R. (2000). Causation, prediction, and search. Cambridge: The MIT Press.


  • Steel, D. (2005). Indeterminism and the causal Markov condition. The British Journal for the Philosophy of Science, 56(1), 3–26.


  • Tillman, R. E., & Eberhardt, F. (2014). Learning causal structure from multiple datasets with similar variable sets. Behaviormetrika, 41(1), 41–64.


  • Tillman, R. E., & Spirtes, P. (2011). Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables. In Proceedings of the 14th international conference on artificial intelligence and statistics (AISTATS 2011).

  • Triantafillou, S., & Tsamardinos, I. (2015). Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16, 2147–2205.


  • Tsamardinos, I., Triantafillou, S., & Lagani, V. (2012). Towards integrative causal analysis of heterogeneous data sets and studies. Journal of Machine Learning Research, 13, 1097–1157.


  • Worrall, J. (2007). Why there’s no cause to randomize. The British Journal for the Philosophy of Science, 58(3), 451–488.



Acknowledgements

Thanks to David Danks, Clark Glymour, and Peter Spirtes for asking the questions that led to the results reported in this paper. While writing the paper, I benefited from several discussions with Frederick Eberhardt. Finally, thanks to three anonymous reviewers for their detailed comments and feedback.


Appendix

Directed acyclic graphs

1.1 Notation and definitions

For any finite set \({{\mathcal {V}}}\), let \(\textsc {dag}_{{\mathcal {V}}}\) denote the set of all directed acyclic graphs (dags) that have the vertex set \({{\mathcal {V}}}\).

Let \(G \in \textsc {dag}_{{\mathcal {V}}}\) and \(V\rightarrow W\) be an edge in G. Then V is called a parent of W, and W is called a child of V. If V is either a parent or child of W, then the two vertices are said to be adjacent in G. Let \(\textsc {pa}_G(V)\) denote the set of parents of V in G, and let \(\textsc {ch}_G(V)\) denote its children. If \(V_1 \rightarrow V_3 \leftarrow V_2 \in G\), then \(V_3\) is called a collider with respect to \(V_1\) and \(V_2\). If \(V_3\) is a collider with respect to \(V_1\) and \(V_2\) and, in addition, there is no edge between \(V_1\) and \(V_2\), then \(V_3\) is called an unshielded collider with respect to \(V_1\) and \(V_2\).

A path \(\pi \) in G is a non-repeating sequence of vertices \(\pi = \langle V_1,V_2,\ldots ,V_n \rangle \) such that \(V_i\) and \(V_{i+1}\) are adjacent if \(1 \le i < n\). If \(V_{i}\) and \(V_{i+2}\) are both parents of \(V_{i+1}\) in G, then \(V_{i+1}\) is said to be a collider on \(\pi \); if not, it is a non-collider on \(\pi \). Endpoints of a path are, by definition, non-colliders on the path. A path \(\pi \) is called directed if \(V_i\) is a parent of \(V_{i+1}\) for all i. If there is a directed path from V to W, then V is said to be an ancestor of W, and W is said to be a descendant of V. For any vertex V, let \(\textsc {desc}_G(V)\) denote the set of descendants of V in G.

Given a path \(\pi = \langle V_1,V_2,\ldots ,V_n \rangle \), let \(\pi \downarrow V_i = \langle V_1, \ldots , V_i \rangle \), and call \(\pi \downarrow V_i\) the initial segment of \(\pi \) that terminates with \(V_i\). Similarly, let \(\pi \uparrow V_i = \langle V_i, \ldots , V_n \rangle \), and call \(\pi \uparrow V_i \) the tail of \(\pi \) that begins with \(V_i\) and terminates with the end of \(\pi \). Given two paths \(\pi _1\) and \(\pi _2\) in a graph G such that the endpoint of \(\pi _1\) is the starting point of \(\pi _2\), let \(\pi _1 \frown \pi _2\) denote the concatenation of the two paths.
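For readers who want to experiment with these definitions, here is a minimal Python sketch of the vocabulary just introduced; the class and method names are my own, not the paper's:

```python
class DAG:
    """A minimal sketch of the appendix's graph vocabulary."""

    def __init__(self, vertices, edges):
        self.vertices = set(vertices)
        self.edges = set(edges)            # directed edges (parent, child)

    def parents(self, v):                  # pa_G(v)
        return {u for (u, w) in self.edges if w == v}

    def children(self, v):                 # ch_G(v)
        return {w for (u, w) in self.edges if u == v}

    def descendants(self, v):              # desc_G(v): reachable by directed paths
        seen, frontier = set(), [v]
        while frontier:
            for w in self.children(frontier.pop()):
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return seen

    def adjacent(self, u, v):
        return (u, v) in self.edges or (v, u) in self.edges

# Example: V4 -> V2 -> V3, with V3 an unshielded collider for V1 and V2.
g = DAG({"V1", "V2", "V3", "V4"}, {("V4", "V2"), ("V2", "V3"), ("V1", "V3")})
print(g.descendants("V4"))   # {'V2', 'V3'}
```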

In diagrams, I use straight lines to indicate the existence of an edge (Fig. 8). Undirected paths are indicated by curves with no end markers (like that between \(V_2\) and \(V_3\)), and a directed path is indicated by a curve with an arrow marker at one end (e.g. there is a directed path from \(V_4\) to \(V_3\)).

Fig. 8: Edges, undirected paths, and directed paths

1.2 d-separation

Fix a set \({{\mathcal {V}}}\) and \(G \in \textsc {dag}_{{\mathcal {V}}}\). Let \(V_1, V_2 \in {{\mathcal {V}}}\) be distinct vertices and \({{\mathcal {U}}} \subseteq {{\mathcal {V}}} \setminus \{V_1,V_2\}\). A path \(\pi \) between \(V_1\) and \(V_2\) is said to be d-connecting given \({{\mathcal {U}}}\) (or active given \({{\mathcal {U}}}\)) if both of the following conditions hold:

  1. Every non-collider on \(\pi \) is not in \({{\mathcal {U}}}\), and

  2. Every collider on \(\pi \) either is in \({{\mathcal {U}}}\) or has a descendant in \({{\mathcal {U}}}\).

Say \(V_1\) and \(V_2\) are d-connected given \({{\mathcal {U}}}\) in G if there is a d-connecting path between the two, and say they are d-separated otherwise. Given three pairwise disjoint vertex sets \({{\mathcal {V}}}_1, {{\mathcal {V}}}_2, {{\mathcal {U}}} \subseteq {{\mathcal {V}}}\), say that \({{\mathcal {V}}}_1\) and \({{\mathcal {V}}}_2\) are d-connected given \({{\mathcal {U}}}\) if there exist vertices \(V_1 \in {{\mathcal {V}}}_1\) and \(V_2 \in {{\mathcal {V}}}_2\) such that \(V_1\) and \(V_2\) are d-connected given \({{\mathcal {U}}}\); otherwise, say that \({{\mathcal {V}}}_1\) and \({{\mathcal {V}}}_2\) are d-separated given \({{\mathcal {U}}}\).
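d-separation can also be tested mechanically. The following sketch is mine; rather than enumerate paths, it uses the moralization criterion of Lauritzen et al. (1990), which is equivalent to the path-based definition above:

```python
from itertools import combinations

def ancestors(edges, targets):
    """Vertices with a directed path into some member of `targets` (inclusive)."""
    pa = {}
    for u, v in edges:
        pa.setdefault(v, set()).add(u)
    seen, frontier = set(targets), list(targets)
    while frontier:
        for u in pa.get(frontier.pop(), ()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen

def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs in the dag `edges`:
    restrict to the ancestors of xs, ys, and zs, marry co-parents, drop edge
    directions, delete zs, and check that nothing in ys is reachable from xs."""
    keep = ancestors(edges, set(xs) | set(ys) | set(zs))
    sub = [(u, v) for (u, v) in edges if u in keep and v in keep]
    links = {frozenset(e) for e in sub}
    pa = {}
    for u, v in sub:
        pa.setdefault(v, set()).add(u)
    for parent_set in pa.values():         # marry the parents of each vertex
        links |= {frozenset(pair) for pair in combinations(parent_set, 2)}
    zs = set(zs)
    seen = set(xs) - zs
    frontier = list(seen)
    while frontier:                        # undirected search avoiding zs
        u = frontier.pop()
        for e in links:
            if u in e:
                (w,) = e - {u}
                if w not in seen and w not in zs:
                    seen.add(w)
                    frontier.append(w)
    return not (seen & set(ys))

# In the unshielded collider V1 -> V3 <- V2, V1 and V2 are d-separated given
# the empty set but d-connected given the collider:
collider = [("V1", "V3"), ("V2", "V3")]
print(d_separated(collider, {"V1"}, {"V2"}, set()))    # True
print(d_separated(collider, {"V1"}, {"V2"}, {"V3"}))   # False
```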

Let \({{\mathcal {V}}}\) be any set. Given \(\mathscr {U} \subseteq {{\mathcal {P}}}({{\mathcal {V}}})\), let \(\mathsf {I}^{\mathscr {U}}\) denote the set of all pairwise disjoint triples \(\langle {{\mathcal {V}}}_1, {{\mathcal {V}}}_2, {{\mathcal {W}}} \rangle \) such that \({{\mathcal {V}}}_1 \cup {{\mathcal {V}}}_2 \cup {{\mathcal {W}}} \subseteq {{\mathcal {U}}}\) for some \({{\mathcal {U}}} \in \mathscr {U}\). If \({{\mathcal {V}}} \in \mathscr {U}\), I will write \(\mathsf {I}^{{{\mathcal {V}}}}\) instead of \(\mathsf {I}^{\mathscr {U}}\). Let |S| denote the cardinality of the set S. For each natural number \(k \le |{{\mathcal {V}}}|\), let \(\mathscr {U}_k = \{ {{\mathcal {U}}} \subseteq {{\mathcal {V}}}: |{{\mathcal {U}}}| \le k \}\), and define \(\mathsf {I}^k=\mathsf {I}^{\mathscr {U}_k}\).

Given \(G \in \textsc {dag}_{{\mathcal {V}}}\) and \(\mathscr {U} \subseteq {{\mathcal {P}}}({{\mathcal {V}}})\), let \(\mathsf {I}^{\mathscr {U}}_G\) denote the set of all triples \(\langle {{\mathcal {V}}}_1, {{\mathcal {V}}}_2, {{\mathcal {W}}} \rangle \in \mathsf {I}^{\mathscr {U}}\) such that \({{\mathcal {V}}}_1\) and \({{\mathcal {V}}}_2\) are d-separated in G given \({{\mathcal {W}}}\). Let \(\mathsf {D}^{\mathscr {U}}_G\) denote the relative complement of \(\mathsf {I}^{\mathscr {U}}_G\) in \(\mathsf {I}^{\mathscr {U}}\). When \({{\mathcal {V}}} \in \mathscr {U}\), write \(\mathsf {I}_G\) instead of \(\mathsf {I}^{\mathscr {U}}_G\). If \({{\mathcal {V}}}_1\) and \({{\mathcal {V}}}_2\) are d-separated in G given \({{\mathcal {U}}}\), we say that G satisfies the triple \(\langle {{\mathcal {V}}}_1, {{\mathcal {V}}}_2, {{\mathcal {U}}} \rangle \) (or that the triple holds in G). In the special case in which \({{\mathcal {V}}}_1\) and \({{\mathcal {V}}}_2\) are singletons \(\{V_1\}\) and \(\{V_2\}\) respectively, we write \(\langle V_1, V_2, {{\mathcal {W}}} \rangle \in \mathsf {I}_G\) instead of \(\langle {{\mathcal {V}}}_1, {{\mathcal {V}}}_2, {{\mathcal {W}}} \rangle \in \mathsf {I}_G\).
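Enumerating the index set \(\mathsf {I}^k\) is likewise mechanical; here is a small sketch restricted to singleton first and second coordinates:

```python
from itertools import combinations

def singleton_I_k(variables, k):
    """The triples <V1, V2, W> of I^k with singleton first coordinates:
    distinct vertices V1, V2 and a disjoint conditioning set W such that
    |{V1, V2} union W| <= k.  (A sketch: the full I^k also allows
    set-valued first and second coordinates.)"""
    variables = list(variables)
    for v1, v2 in combinations(variables, 2):
        rest = [v for v in variables if v not in (v1, v2)]
        for m in range(min(k - 2, len(rest)) + 1):
            for w in combinations(rest, m):
                yield v1, v2, frozenset(w)

# I^k_G is then the set of such triples that are d-separated in G (testable
# with the d_separated sketch above), and D^k_G its complement in I^k.
print(len(list(singleton_I_k(["V1", "V2", "V3", "V4"], 3))))   # 18
```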

Typically, only paths are said to be d-connecting/active. In some of the theorems below, it will be helpful to consider active variable sequences, which may contain some vertex twice. Let \(\alpha \) be a variable sequence with endpoints \(V_1\) and \(V_2\), and let \({{\mathcal {U}}} \subseteq {{\mathcal {V}}} \setminus \{V_1,V_2\}.\) Then a vertex \(V_3\) is active on \(\alpha \) in G given \({{\mathcal {U}}}\) just in case either

  1. \(V_3\) is not a collider on \(\alpha \) and \(V_3 \not \in {{\mathcal {U}}}\), or

  2. \(V_3\) is a collider on \(\alpha \) and either (i) \(V_3 \in {{\mathcal {U}}}\) or (ii) there is \(W \in \textsc {desc}_{G}(V_3) \cap {{\mathcal {U}}}\) (or both).

The following lemma, which is a special case of Lemma 3.3.1 in Spirtes et al. (2000, p. 386), asserts that an active variable sequence indicates the existence of an active path with the same endpoints.

Lemma 1

Let G be any \(\textsc {dag}\), and suppose that \(\beta \) is an active variable sequence given \({{\mathcal {U}}}\) with endpoints V and W. Then there is a d-connecting path between V and W given \({{\mathcal {U}}}\).

1.3 \(\mathscr {E}\)-equivalence

I say \(G, H \in \textsc {dag}_{{\mathcal {V}}}\) are \(\mathscr {U}\)-equivalent if \(\mathsf {I}^{\mathscr {U}}_G = \mathsf {I}^{\mathscr {U}}_H\), and I write \(G \equiv _{\mathscr {U}} H\) in this case. When \({{\mathcal {V}}} \in \mathscr {U}\), I will write \(G \equiv H\) and say that G and H are \(\mathsf {I}\)-equivalent. Let \(G \in \textsc {dag}_{{\mathcal {V}}}\). Given a subset \({{\mathcal {E}}} \subseteq {{\mathcal {V}}}\), let \(G \parallel {{\mathcal {E}}}\) be the graph obtained by removing from G all edges into each variable in \({{\mathcal {E}}}\). I will use the script letter \(\mathscr {E}\) to denote sets of pairs \(\{ \langle {{\mathcal {E}}}_1, {{\mathcal {U}}}_1 \rangle , \ldots , \langle {{\mathcal {E}}}_m, {{\mathcal {U}}}_m \rangle \}\) such that \({{\mathcal {E}}}_i \subseteq {{\mathcal {U}}}_i \subseteq {{\mathcal {V}}}\) for each \(i \le m\). I will call such pairs experiments, and given an experiment \(\langle {{\mathcal {E}}}, {{\mathcal {U}}} \rangle \), I will say that \({{\mathcal {U}}}\) is observed and that the variables of \({{\mathcal {E}}}\) are subject to an intervention. Given \(G, H \in \textsc {dag}_{{\mathcal {V}}}\), write \(G \equiv _{\mathscr {E}} H\) if \(\mathsf {I}_{G \parallel {{\mathcal {E}}}}^{{\mathcal {U}}} = \mathsf {I}_{H \parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\) for all \(\langle {{\mathcal {E}}}, {{\mathcal {U}}} \rangle \in \mathscr {E}\). Let \([G]_\mathscr {E}\) denote the \(\mathscr {E}\)-equivalence class of G.

I will study the special case of \(\mathscr {E}\)-equivalence in which all subsets of k or fewer variables are observed, and all interventions of size \(j \le k\) are possible. To do so, define:

$$\begin{aligned} \mathscr {E}_{k,j} = \{\langle {{\mathcal {E}}}, {{\mathcal {U}}} \rangle \in {{\mathcal {P}}}({{\mathcal {V}}})^2: {{\mathcal {E}}} \subseteq {{\mathcal {U}}} \text{ and } |{{\mathcal {U}}}| \le k \text{ and } |{{\mathcal {E}}}| \le j\}. \end{aligned}$$

Write \(G \equiv _{k,j} H\) if \(G \equiv _{\mathscr {E}_{k,j} } H\). Similarly, let \([G]_{k,j}\) be the \(\mathscr {E}_{k,j}\)-equivalence class of G. Notice that \(G \equiv _{k,j} H\) entails that \(G \equiv _{k,l} H\) for all \(l \le j\). When \(j = 0\), I drop the subscript and write \(G \equiv _k H\), \([G]_k\), and so on.
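The definitions of \(G \parallel {{\mathcal {E}}}\) and \(\mathscr {E}_{k,j}\) translate directly into code; a minimal sketch (the function names are mine):

```python
from itertools import combinations

def manipulate(edges, E):
    """G || E: delete every edge pointing into a manipulated variable in E."""
    return [(u, v) for (u, v) in edges if v not in E]

def experiments(variables, k, j):
    """Enumerate E_{k,j}: pairs <E, U> with E subset of U, |U| <= k, |E| <= j."""
    variables = list(variables)
    for m in range(1, min(k, len(variables)) + 1):
        for U in combinations(variables, m):
            for size_e in range(min(j, m) + 1):
                for E in combinations(U, size_e):
                    yield set(E), set(U)

# G || {Y} cuts the X -> Y edge but keeps Y -> Z:
print(manipulate([("X", "Y"), ("Y", "Z")], {"Y"}))   # [('Y', 'Z')]

# Two dags are E_{k,j}-equivalent when, for every experiment <E, U>,
# manipulate(G, E) and manipulate(H, E) induce the same d-separation facts
# among the variables in U (testable with the d_separated sketch above).
```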

The following lemma will be essential. It is a generalization of Lemma 8 in Mayo-Wilson (2013).

Lemma 2

Let \(G \in \textsc {dag}_{{\mathcal {V}}}\) be a graph with n vertices, and let \(k < n\). Suppose that there are \(k-1\) disjoint, directed paths from \(V_1\) to \(V_2\) in G. Moreover, suppose that \(V_1\) is not a parent of \(V_2\). Let H be the graph obtained by adding to G an edge from \(V_1\) to \(V_2\). Then \(G \equiv _{k,j} H\) for all \(j \le k\).

Proof

Let \(\langle {{\mathcal {E}}}, {{\mathcal {U}}} \rangle \) be an experiment such that \(|{{\mathcal {U}}}| \le k\). It is necessary to show that \(\mathsf {D}_{G \parallel {{\mathcal {E}}}}^{{\mathcal {U}}} =\mathsf {D}_{H \parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\). First, suppose that \(V_2 \in {{\mathcal {E}}}\). Then it follows that \(G \parallel {{\mathcal {E}}} = H \parallel {{\mathcal {E}}}\), as G and H differ by only one edge, which points into \(V_2\). Hence, it immediately follows that \(\mathsf {D}_{G \parallel {{\mathcal {E}}}}^{{\mathcal {U}}} =\mathsf {D}_{H \parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\).

So suppose that \(V_2 \not \in {{\mathcal {E}}}\). Then \(H \parallel {{\mathcal {E}}}\) is the graph obtained by adding the \(V_1 \rightarrow V_2\) edge to \(G \parallel {{\mathcal {E}}}\). It follows that \(\mathsf {D}_{G \parallel {{\mathcal {E}}}}^{{\mathcal {U}}} \subseteq \mathsf {D}_{H \parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\). So it suffices to show that \(\mathsf {D}_{H\parallel {{\mathcal {E}}}}^{{\mathcal {U}}} \subseteq \mathsf {D}_{G\parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\). To this end, consider any triple \(\langle Z_1 , Z_2, {{\mathcal {W}}} \rangle \) in \(\mathsf {D}_{H\parallel {{\mathcal {E}}}}^{{\mathcal {U}}}\). By definition, there is a d-connecting path \(\pi _{H \parallel {{\mathcal {E}}}}\) from \(Z_1\) to \(Z_2\) given \({{\mathcal {W}}}\) in \(H \parallel {{\mathcal {E}}}\). I will construct a d-connecting path \(\pi _{G \parallel {{\mathcal {E}}}}\) from \(Z_1\) to \(Z_2\) given \({{\mathcal {W}}}\) in \(G \parallel {{\mathcal {E}}}\). To do so, let the \(k-1\) disjoint, directed paths from \(V_1\) to \(V_2\) be denoted \(\delta _1\) through \(\delta _{k-1}\) respectively. The proof breaks into two cases, and each case has two subcases.

Case 1 Suppose there is some \(\delta _i\) that contains no members of \({{\mathcal {E}}} \cup {{\mathcal {W}}}\). Now either \(\pi _{H \parallel {{\mathcal {E}}}}\) is also a path in \(G \parallel {{\mathcal {E}}}\) or it is not.

Suppose first that \(\pi _{H \parallel {{\mathcal {E}}}}\) is also a path in \(G \parallel {{\mathcal {E}}}\). Then it is easy to show that \(\pi _{G \parallel {{\mathcal {E}}}} = \pi _{H \parallel {{\mathcal {E}}}}\) is likewise d-connecting in \(G \parallel {{\mathcal {E}}}\). Why? Every non-collider on \(\pi _{G \parallel {{\mathcal {E}}}}\) is not a member of \({{\mathcal {W}}}\) because \(\pi _{H \parallel {{\mathcal {E}}}}\) is active given \({{\mathcal {W}}}\) in \(H \parallel {{\mathcal {E}}}\). Moreover, every collider C on \(\pi _{G \parallel {{\mathcal {E}}}}\) is also a collider on \(\pi _{H \parallel {{\mathcal {E}}}}\). Since \(\pi _{H \parallel {{\mathcal {E}}}}\) is active given \({{\mathcal {W}}}\), it follows that either C or one of C's descendants in \(H \parallel {{\mathcal {E}}}\) is a member of \({{\mathcal {W}}}\). But since \(H \parallel {{\mathcal {E}}}\) is obtained from \(G \parallel {{\mathcal {E}}}\) by adding an edge from \(V_1\) to \(V_2\), and \(V_1\) is already an ancestor of \(V_2\) in \(G \parallel {{\mathcal {E}}}\) (as \(G \parallel {{\mathcal {E}}}\) contains the directed path \(\delta _i\)), the sets of descendants of C in \(H \parallel {{\mathcal {E}}}\) and in \(G \parallel {{\mathcal {E}}}\) are identical.

If \(\pi _{H \parallel {{\mathcal {E}}}}\) is not a path in \(G \parallel {{\mathcal {E}}}\), then it must contain the edge \(V_1 \rightarrow V_2\). Let \(\beta \) be the variable sequence obtained by replacing the edge \(V_1 \rightarrow V_2\) with the directed path \(\delta _i\). In other words, define:

$$\begin{aligned} \beta = (\pi _{H \parallel {{\mathcal {E}}}} \downarrow V_1) \frown \delta _i \frown (\pi _{H \parallel {{\mathcal {E}}}} \uparrow V_2). \end{aligned}$$

The sequence \(\beta \) is pictured in red in Fig. 9.

Fig. 9: The path \(\beta \) described in Lemma 2

Notice every variable on \((\pi _{H \parallel {{\mathcal {E}}}} \downarrow V_1)\) and \((\pi _{H \parallel {{\mathcal {E}}}} \uparrow V_2)\) other than \(V_1\) and \(V_2\) is active on \(\beta \) because it is active on \(\pi _{H \parallel {{\mathcal {E}}}}\). Every vertex on \(\delta _i\) other than \(V_1\) and \(V_2\) is a non-collider on \(\beta \) and, by the assumption of Case 1, not a member of \({{\mathcal {W}}}\); hence, it too is active. Moreover, \(V_1\) is a non-collider on both \(\pi _{H \parallel {{\mathcal {E}}}}\) and \(\beta \), and hence, it is active on both. Similarly, \(V_2\) is a collider on \(\beta \) if and only if it is a collider on \(\pi _{H \parallel {{\mathcal {E}}}}\). Therefore, it is active given \({{\mathcal {W}}}\) on \(\beta \) because it is on \(\pi _{H \parallel {{\mathcal {E}}}}\). So \(\beta \) is an active variable sequence given \({{\mathcal {W}}}\) between \(Z_1\) and \(Z_2\). By Lemma 1, there is an active path \(\pi _{G \parallel {{\mathcal {E}}}}\) between \(Z_1\) and \(Z_2\) given \({{\mathcal {W}}}\).

Case 2 Suppose all of the paths \(\delta _1, \delta _2, \ldots \delta _{k-1}\) contain some member of \({{\mathcal {E}}} \cup {{\mathcal {W}}}\). As the paths are disjoint by assumption, it follows that \({{\mathcal {E}}} \cup {{\mathcal {W}}}\) contains at least \(k-1\) members. Since \(Z_1, Z_2 \in {{\mathcal {U}}} \setminus {{\mathcal {W}}}\) and \({{\mathcal {E}}} \cup {{\mathcal {W}}} \subseteq {{\mathcal {U}}}\), it follows that either (a) \(Z_1 \in {{\mathcal {E}}}\) and is a member of some path \(\delta _i\), or (b) \(Z_2 \in {{\mathcal {E}}}\) and is a member of some path \(\delta _i\).

Case 2a Suppose \(Z_1 \in {{\mathcal {E}}}\) and is a member of some path \(\delta _i\). Then \(Z_1\) is a descendant of \(V_1\) in G. Again, either (i) \(\pi _{H \parallel {{\mathcal {E}}}}\) is a path in \(G \parallel {{\mathcal {E}}}\) or (ii) it is not.

Case 2ai Suppose that \(\pi _{H \parallel {{\mathcal {E}}}}\) is a path in \(G \parallel {{\mathcal {E}}}\). I claim it is active given \({{\mathcal {W}}}\) in \(G \parallel {{\mathcal {E}}}\). Suppose for the sake of contradiction it is not. Since \(\pi _{H \parallel {{\mathcal {E}}}}\) is active given \({{\mathcal {W}}}\) in \(H \parallel {{\mathcal {E}}}\), every non-collider on \(\pi _{H \parallel {{\mathcal {E}}}}\) is not in \({{\mathcal {W}}}\). Hence, every such non-collider is also active given \({{\mathcal {W}}}\) in \(G \parallel {{\mathcal {E}}}\). So if \(\pi _{H \parallel {{\mathcal {E}}}}\) is not active, then there is some collider C on \(\pi _{H \parallel {{\mathcal {E}}}}\) that is active in \(H \parallel {{\mathcal {E}}}\) but not so in \(G \parallel {{\mathcal {E}}}\). Let C be the collider closest to \(Z_1\). Since \(Z_1 \in {{\mathcal {E}}}\), it follows that the initial segment of \(\pi _{H \parallel {{\mathcal {E}}}}\) points away from \(Z_1\). Hence, as C is the closest collider to \(Z_1\) on \(\pi _{H \parallel {{\mathcal {E}}}}\), it follows that C is a descendant of \(Z_1\) in \(H \parallel {{\mathcal {E}}}\). As \(Z_1\) occurs on a directed path from \(V_1\) to \(V_2\) in H by assumption of Case 2a, it follows that C is also a descendant of \(V_1\) in H.

Because C is active on \(\pi _{H \parallel {{\mathcal {E}}}}\) in \(H \parallel {{\mathcal {E}}}\) but not so in \(G \parallel {{\mathcal {E}}}\), it follows that C has a descendant in \(H \parallel {{\mathcal {E}}}\) that is not a descendant in \(G \parallel {{\mathcal {E}}}\). Because \(H \parallel {{\mathcal {E}}}\) is obtained from \(G \parallel {{\mathcal {E}}}\) by adding the edge \(V_1 \rightarrow V_2\), it must be the case that C is an ancestor of \(V_1\). So C is an ancestor of \(V_1\) in \(H \parallel {{\mathcal {E}}}\), and hence, also in H. But I have already shown that C is a descendant of \(V_1\) in H. So H contains a cycle, contradicting assumption.

Case 2aii Now suppose \(\pi _{H \parallel {{\mathcal {E}}}}\) is not a path in \(G \parallel {{\mathcal {E}}}\). Then \(\pi _{H \parallel {{\mathcal {E}}}}\) contains the edge \(V_1 \rightarrow V_2\). Let \(\beta = (\delta _i \uparrow Z_1) \frown (\pi _{H \parallel {{\mathcal {E}}}} \uparrow V_2)\). If \(\beta \) is active in \(G \parallel {{\mathcal {E}}}\), then by Lemma 1, there is a d-connecting path between \(Z_1\) and \(Z_2\) given \({{\mathcal {W}}}\) in \(G \parallel {{\mathcal {E}}}\).

If \(\beta \) is not active in \(G \parallel {{\mathcal {E}}}\) (either because some consecutive variables on \(\beta \) are no longer adjacent in \(G \parallel {{\mathcal {E}}}\), or because some vertex on \(\beta \) is inactive given \({{\mathcal {W}}}\)), it follows that \(\delta _i \uparrow Z_1\) contains some member \(W_i\) of \({{\mathcal {E}}} \cup {{\mathcal {W}}}\) other than \(Z_1\). Thus:

$$\begin{aligned} W_i, Z_1 \in {{\mathcal {E}}} \cup {{\mathcal {W}}} \subseteq {{\mathcal {U}}}. \end{aligned}$$

Recall by assumption of Case 2, there are \(k-2\) many other paths \(\{\delta _j\}_{j \ne i}\), each of which contains some member of \({{\mathcal {E}}} \cup {{\mathcal {W}}} \subseteq {{\mathcal {U}}}\). So \({{\mathcal {U}}}\) contains \(Z_1,Z_2,W_i\) and a distinct element \(W_j\) from each of the paths \(\delta _j\), where \(j \ne i\). It follows that \(Z_2\) is one of the \(W_j\)'s, as otherwise \({{\mathcal {U}}}\) would contain \(k+1\) many elements. Hence, \(Z_2\) is an ancestor of \(V_2\) and a descendant of \(V_1\), and in particular, \(Z_2\) does not equal \(V_1\) or \(V_2\).

Notice that \(V_2\) is not equal to any of the elements of \({{\mathcal {U}}}=\{Z_1, Z_2, W_1, \ldots , W_{k-1} \}\). So, in particular, \(V_2 \not \in {{\mathcal {W}}} \subseteq {{\mathcal {U}}}\). Hence, \(V_2\) is not a collider on \(\pi _{H \parallel {{\mathcal {E}}}}\): every member of \({{\mathcal {W}}}\) lies on a directed path into \(V_2\) and so is an ancestor of \(V_2\), so by acyclicity neither \(V_2\) nor any of its descendants belongs to \({{\mathcal {W}}}\), and an active collider must have a descendant in \({{\mathcal {W}}}\). Since \(\pi _{H \parallel {{\mathcal {E}}}}\) contains the edge \(V_1 \rightarrow V_2\), it follows that the path \(\pi _{H \parallel {{\mathcal {E}}}} \uparrow V_2\) points away from \(V_2\) and towards \(Z_2\). Since \(Z_2\) is an ancestor of \(V_2\), the path \(\pi _{H \parallel {{\mathcal {E}}}} \uparrow V_2\) is not directed from \(V_2\) to \(Z_2\). So it must contain a collider C which is closest to \(V_2\); so C is a descendant of \(V_2\). Because \(\pi _{H \parallel {{\mathcal {E}}}}\) is active given \({{\mathcal {W}}}\) in \(H \parallel {{\mathcal {E}}}\), it follows that either C or one of its descendants is in \({{\mathcal {W}}}\). Since \({{\mathcal {W}}} \subseteq {{\mathcal {U}}} \setminus \{Z_1,Z_2\}\) and \({{\mathcal {U}}} = \{Z_1, Z_2, W_1, \ldots , W_{k-1}\}\), it follows that C is either equal to \(W_j\) for some j, or is an ancestor of some \(W_j\). In either case, C is an ancestor of \(V_2\). But I have already shown that C is a descendant of \(V_2\). So H contains a cycle, contradicting assumption.

Case 2b Suppose \(Z_2 \in {{\mathcal {E}}}\) and is a member of some path \(\delta _i\). Again, either (i) \(\pi _{H \parallel {{\mathcal {E}}}}\) is a path in \(G \parallel {{\mathcal {E}}}\) or (ii) it is not.

Case 2bi This case is symmetric to Case 2ai.

Case 2bii Suppose that \(\pi _{H \parallel {{\mathcal {E}}}}\) is not a path in \(G \parallel {{\mathcal {E}}}\), and hence, it contains the edge \(V_1 \rightarrow V_2\). Since \(Z_2 \in {{\mathcal {E}}}\), it follows that the tail of \(\pi _{H \parallel {{\mathcal {E}}}}\) between \(V_1\) and \(Z_2\) points away from \(Z_2\). Hence, there is a collider between \(V_1\) and \(Z_2\) on \(\pi _{H \parallel {{\mathcal {E}}}}\). Let C be the collider closest to \(V_1\), so that C is either \(V_2\) or a descendant of \(V_2\) in H.

Since C is active on \(\pi _{H \parallel {{\mathcal {E}}}}\), it follows that C or one of its descendants is in \({{\mathcal {W}}}\) (and similarly, either \(V_2\) or one of its descendants is in \({{\mathcal {W}}}\)). Recall, by assumption of Case 2, each path \(\delta _j\) contains at least one member from the set \({{\mathcal {E}}} \cup {{\mathcal {W}}}\). Since there are \(k-2\) disjoint paths other than \(\delta _i\) (i.e., the directed path containing \(Z_2\)), it follows that \({{\mathcal {U}}}\) contains a distinct element \(W_j\) from each path \(\delta _j\) such that \(j \ne i\). So \({{\mathcal {U}}}\) contains \(Z_1, Z_2, C\) and \(k-2\) many elements \(W_j\). It follows that \(Z_1 = W_j\) for some \(j \ne i\), as otherwise, \({{\mathcal {U}}}\) would contain \(k+1\) many elements.

Consider \(\beta = (\delta _j \uparrow Z_1) \frown rev(\delta _i \uparrow Z_2)\), where rev reverses the order of the variables. Clearly, \(\beta \) has endpoints \(Z_1\) and \(Z_2\). Now if \(\beta \) is an active variable sequence given \({{\mathcal {W}}}\) in \(G \parallel {{\mathcal {E}}}\), then Lemma 1 entails the result.

Suppose for the sake of contradiction that \(\beta \) is not active. Since every variable on \(\beta \) other than \(V_2\) is a non-collider, it follows that at least one of the four following cases holds: (1) at least one variable on \((\delta _j \uparrow Z_1)\), other than \(Z_1\) and \(V_2\), is a member of \({{\mathcal {E}}} \cup {{\mathcal {W}}}\); (2) at least one variable on \((\delta _i \uparrow Z_2)\), other than \(Z_2\) and \(V_2\), is a member of \({{\mathcal {E}}} \cup {{\mathcal {W}}}\); (3) \(V_2 \in {{\mathcal {E}}}\); or (4) neither \(V_2\) nor any of its descendants is a member of \({{\mathcal {W}}}\). I ruled out possibility (3) at the beginning of the proof, and possibility (4) contradicts the first sentence of the second-to-last paragraph.

So either (1) or (2) must hold. Consider (1) first, i.e., that \((\delta _j \uparrow Z_1)\) contains some member \(W_l \in {{\mathcal {E}}} \cup {{\mathcal {W}}}\) other than \(Z_1\) or \(V_2\). So \({{\mathcal {U}}}\) contains \(Z_1, Z_2, W_l, C\) and \(k-3\) distinct elements \(W_m\) from each of the paths \(\delta _m\), where \(m \ne i,j\). Since \(Z_1, Z_2\) and \(W_l\) are pairwise distinct, it follows \(C=W_m\) for some \(m \ne i, j\), as otherwise \({{\mathcal {U}}}\) would contain at least \(k+1\) many members. So C is on a directed path from \(V_1\) to \(V_2\) in H, and hence, an ancestor of \(V_2\). But I have shown already that C is a descendant of \(V_2\) in H. This is a contradiction. The proof that (2) leads to a contradiction is similar. \(\square \)

Theorem 4

If \(G \equiv _{2,1} H\) and there is a directed path from \(V_1\) to \(V_2\) in G, then there is likewise a directed path from \(V_1\) to \(V_2\) in H.

Proof

Consider the experiment \(\langle {{\mathcal {E}}}, {{\mathcal {U}}} \rangle := \langle \{V_1 \}, \{V_1, V_2 \} \rangle \). Since \(G \equiv _{2,1} H\), it follows that \(\mathsf {D}^{{\mathcal {U}}}_{G \parallel {{\mathcal {E}}}}=\mathsf {D}^{{\mathcal {U}}}_{H \parallel {{\mathcal {E}}}}\). By assumption, there is a directed path from \(V_1\) to \(V_2\) in G. So there is still a directed path from \(V_1\) to \(V_2\) in \(G \parallel {{\mathcal {E}}}\), and that path is clearly d-connecting given the empty set. So \(\langle V_1, V_2, \emptyset \rangle \in \mathsf {D}^{{\mathcal {U}}}_{G \parallel {{\mathcal {E}}}}\). Because \(\mathsf {D}^{{\mathcal {U}}}_{G \parallel {{\mathcal {E}}}}=\mathsf {D}^{{\mathcal {U}}}_{H \parallel {{\mathcal {E}}}}\), it follows that there is a d-connecting path \(\pi \) between \(V_1\) and \(V_2\) in \(H \parallel {{\mathcal {E}}}\) given the empty set. By definition of d-connecting, it follows that there are no colliders on \(\pi \). Since all edges incident to \(V_1\) in \(H \parallel {{\mathcal {E}}}\) point “out of” \(V_1\), it follows that the path \(\pi \) is out of \(V_1\). Hence, because \(\pi \) contains no colliders and points away from \(V_1\), it follows that \(\pi \) is a directed path from \(V_1\) to \(V_2\) as desired. \(\square \)

Theorem 5

Let n, k be natural numbers such that \(k <n\). Let \(h= \lfloor \frac{n-2}{k-1} \rfloor \) and \(M=(n - 1) - h(k-1)\). Then there exist graphs G and H such that G is k, 1-equivalent to H, and yet H contains \(f(n,k)\) edges that G does not, where

  1. \(f(n,k)=n-k\) if \(h=1\),

  2. \(f(n,k)=(2n+1)-3k\) if \(h=2\), and

  3. \(f(n,k)=(k-1)^2\left( {\begin{array}{c}h-1\\ 2\end{array}}\right) +(k-1)(h-1)+Mh\) if \(h \ge 3\).

Proof

Let \({{\mathcal {V}}}\) be a set with n many elements. Divide the vertices of \({{\mathcal {V}}}\) into groups as follows. Make h many groups of size \(k-1\), and enumerate those groups \(\{V_{1,1}, V_{1,2}, \ldots V_{1,k-1} \}\), \(\{V_{2,1}, V_{2,2}, \ldots V_{2,k-1} \}\), \(\ldots \), \(\{V_{h,1}, \ldots , V_{h,k-1}\}\). There will be \(n - (k-1)h = M+1\) vertices remaining. Denote one of those remaining vertices by \(V_{0,1}\), and let the remainder be enumerated by \(V_{h+1,1}, V_{h+1,2}, \ldots V_{h+1,M}\). To construct the graph G, place the variables in a matrix (as shown in Fig. 10) so that \(V_{r,c}\) is in the \(r^{th}\) row and \(c^{th}\) column. Draw an edge from every vertex in row \(r+1\) to every vertex in row r. Call the resulting graph G. The graph H is obtained by adding an edge from each vertex in row r to each vertex in row \(s < r\) in G.

Fig. 10: The graph described in Theorem 5

Notice that, if V is a vertex in row \(r+2\) and W is a vertex in row r, then V and W are not adjacent and there are \(k-1\) disjoint directed paths from V to W. Hence, by Lemma 2, the edge \(V \rightarrow W\) can be added to G without breaking k, j-equivalence. Since H is the result of adding all such edges to G (iterating this reasoning row by row), it follows that \(G \equiv _{k,j} H\). \(\square \)
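To make the construction concrete, the following sketch builds the layered graph G on my reading of the proof; the function and vertex names are mine:

```python
def layered_graph(n, k):
    """Theorem 5's graph G (a sketch of my reading of the construction):
    rows 1..h of size k-1, a lone vertex V_{0,1} in row 0, and M remaining
    vertices in row h+1; every vertex in row r+1 points to every vertex
    in row r."""
    h = (n - 2) // (k - 1)
    M = (n - 1) - h * (k - 1)
    rows = [["V0,1"]]
    rows += [[f"V{r},{c}" for c in range(1, k)] for r in range(1, h + 1)]
    rows.append([f"V{h+1},{c}" for c in range(1, M + 1)])
    edges = [(u, v) for r in range(len(rows) - 1)
             for u in rows[r + 1] for v in rows[r]]
    return rows, edges

rows, edges = layered_graph(n=8, k=4)          # h = 2, M = 1
print([len(r) for r in rows], len(edges))      # [1, 3, 3, 1] 15
```

Between any vertex in row r+2 and any vertex in row r there are k-1 directed paths through the k-1 vertices of row r+1, which is exactly the hypothesis of Lemma 2.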

Let \(G_k\) denote the dag (pictured below) containing the \(k+2\) vertices \(\{X_1, X_2, W_1, \ldots W_k \}\) and the following edges: (1) an edge from \(X_1\) to \(W_i\) for all \(1 \le i \le k\), (2) an edge from \(W_i\) to \(X_2\) for all \(1 \le i \le k\), and (3) an edge from \(X_1\) to \(X_2\).

Fig. 11: The graph denoted \(G_k\) in the remaining theorems
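A one-function sketch of this construction:

```python
def G_k(k):
    """The dag G_k of Fig. 11: X1 -> W_i and W_i -> X2 for i = 1..k,
    plus the direct edge X1 -> X2."""
    ws = [f"W{i}" for i in range(1, k + 1)]
    return ([("X1", w) for w in ws]
            + [(w, "X2") for w in ws]
            + [("X1", "X2")])

# Deleting ("X1", "X2") from G_{k-1} yields the graph H_2 used in Theorem 6;
# by Lemma 2, the two graphs are (k, j)-equivalent for all j <= k.
print(G_k(2))
```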

Theorem 6

Suppose \(k \ge 2\) and that G has fewer than \(2k-2\) edges. Then \([G]_{k,1} = \{G \}\). Moreover, there is a graph H with exactly \(2k-2\) edges such that \([H]_{k,j} = \{H_1, H_2 \}\) for all \(j \le k\), and moreover, \(H_1\) and \(H_2\) are not \(\mathsf {I}\)-equivalent.

Proof

Suppose \(G \equiv _{k,1} H\) and G has fewer than \(2k-2\) edges. Then \(G \equiv _{k,0} H\), and so by Theorem 7 in Mayo-Wilson (2013), it follows that G and H are \(\mathsf {I}\)-equivalent. Hence, they share the same skeleton by Verma and Pearl’s theorem. So it suffices to show that all edges in G and H are oriented in the same direction. To do so, note that if \(V_1 \rightarrow V_2\) is an edge in G, then trivially there is a directed path from \(V_1\) to \(V_2\) in G. By Theorem 4, it follows that there is a directed path from \(V_1\) to \(V_2\) in H. By acyclicity, it follows that the edge between \(V_1\) and \(V_2\) in H must be oriented as \(V_1 \rightarrow V_2\).

For the second part of the theorem, let \(H_1 = G_{k-1}\) and \(H_2\) be the graph obtained by deleting the \(X_1 \rightarrow X_2\) edge from \(H_1\). By Lemma 2, it follows that \(H_1 \equiv _{k,j} H_2\). \(\square \)

Theorem 7

Fix a natural number k and any \(j \le k\), and let \(p_{k,j}(n)\) be the fraction of dags G with \(n \ge k\) many variables such that \([G]_{k,j}\) contains a graph that is not \(\mathsf {I}\)-equivalent to G. Then \(p_{k,j}(n) \rightarrow 1\) as \(n \rightarrow \infty \).

Proof

By Lemma 15 in Mayo-Wilson (2012), the proportion of \(\textsc {dag}\)s containing an isomorphic copy of \(G_{k-1}\) approaches one as \(n \rightarrow \infty \). If G contains a copy of \(G_{k-1}\), then by Lemma 2, \([G]_{k,j}\) contains a graph that is not \(\mathsf {I}\)-equivalent to G, namely, the graph in which the \(X_1 \rightarrow X_2\) edge in G is removed. Hence, \(p_{k,j}(n) \rightarrow 1\) as \(n \rightarrow \infty \). \(\square \)

Theorem 8

Fix any \(k \in {\mathbb {N}}\) and any \(j \le k\). Let \(E_{k,j}(G)\) be the maximum number m of dags \(H_1, H_2, \ldots H_m\) such that \(G \equiv _{k,j} H_i\) and \(G \not \equiv H_i\) for all \(i \le m\). Let \(E_{k,j}(n)\) be the average of \(E_{k,j}(G)\) over all dags G with n many variables. Then \(E_{k,j}(n) \rightarrow \infty \) as \(n \rightarrow \infty \).

Proof

Let \(m \in {\mathbb {N}}\). Let H be a dag containing m disjoint copies of \(G_{k-1}\), and let \(H_i\) be the result of removing the \(X_1 \rightarrow X_2\) edge from the \(i^{th}\) copy. By Lemma 2, \(H \equiv _{k,j} H_i\) for all \(i \le m\), but \(H \not \equiv H_i\) as their skeletons differ. Thus, \(E_{k,j}(H) \ge m\). By Lemma 15 in Mayo-Wilson (2012), the proportion of \(\textsc {dag}\)s containing m disjoint copies of \(G_{k-1}\) approaches one as \(n \rightarrow \infty \). Since m was arbitrary, \(E_{k,j}(n) \rightarrow \infty \). \(\square \)

Bayesian networks

1.1 Markov equivalence

In this section, random variables will be denoted by the capital letters X, Y, and Z, and values of random variables will be denoted x, y, and z. Vectors will be bolded. So \(\varvec{X}\) will represent a vector of random variables, and \(\varvec{x}\) is a vector of values of \(\varvec{X}\). Sets of random variables will be denoted in a scripted font, e.g., \({{\mathcal {X}}}, {{\mathcal {Y}}}, {{\mathcal {Z}}}\). Given some ordering of the random variables \({{\mathcal {X}}}\), let \(\varvec{{{\mathcal {X}}}}\) denote the random vector obtained by ordering those variables.

Given a probability measure p, write \(p \models {{\mathcal {X}}} \amalg {{\mathcal {Y}}} | {{\mathcal {Z}}}\) if the variables \({{\mathcal {X}}}\) and \({{\mathcal {Y}}}\) are conditionally independent given \({{\mathcal {Z}}}\) with respect to the measure p. Let \({{\mathcal {V}}}\) be any finite set and \({{\mathcal {X}}}=\{X_V\}_{V \in {{\mathcal {V}}}}\) be a collection of random variables indexed by \({{\mathcal {V}}}\). A Bayesian network over \({{\mathcal {X}}}\) is a pair \(\langle G, p \rangle \), where \(G \in \textsc {dag}_{{{\mathcal {X}}}}\) and \(p \models {{\mathcal {W}}} \amalg {{\mathcal {Y}}} | {{\mathcal {Z}}}\) if and only if \(\langle {{\mathcal {W}}}, {{\mathcal {Y}}}, {{\mathcal {Z}}} \rangle \in \mathsf {I}_G\). For any graph \(G \in \textsc {dag}_{{\mathcal {V}}}\), let \({\mathbb {P}}^{{\mathcal {X}}}_G\) be the set of probability measures p such that \(\langle G^{{\mathcal {X}}}, p \rangle \) is a Bayesian network, where \(G^{{\mathcal {X}}}\) is the dag obtained by replacing V with \(X_{V}\) for all \(V \in {{\mathcal {V}}}\). Say that two graphs \(G, H \in \textsc {dag}_{{\mathcal {V}}}\) are Markov equivalent if \({\mathbb {P}}^{{\mathcal {X}}}_{G}={\mathbb {P}}^{{\mathcal {X}}}_H\) for any collection of random variables \({{\mathcal {X}}}\) indexed by \({{\mathcal {V}}}\).

Given \(p \in {\mathbb {P}}^{{\mathcal {X}}}_G\) and some \({{\mathcal {U}}} \subseteq {{\mathcal {V}}}\), let \(p_{{\mathcal {U}}}\) denote the marginal distribution of p over \(\{X_U \in {{\mathcal {X}}}: U \in {{\mathcal {U}}}\}\). Given \(\mathscr {U} \subseteq {{\mathcal {P}}}({{\mathcal {V}}})\), let \({\mathbb {P}}^{\mathscr {U}}_G\) be the set \(\{\{p_{{\mathcal {U}}}\}_{{{\mathcal {U}}} \in \mathscr {U}}: p \in {\mathbb {P}}^{{\mathcal {X}}}_G\}\). Say \(G, H \in \textsc {dag}_{{\mathcal {V}}}\) are \(\mathscr {U}\)-Markov-equivalent if \({\mathbb {P}}^{\mathscr {U}}_G = {\mathbb {P}}^{\mathscr {U}}_H\) with respect to all collections of random variables \({{\mathcal {X}}}\). Write \(G \approx _{\mathscr {U}} H\) in this case.

Theorem 9

If G and H are \(\mathscr {U}\)-Markov equivalent, they are also \(\mathscr {U}\)-equivalent. The converse is false in a fairly strong sense: for all \({{\mathcal {V}}}\) and all \(\mathscr {U}\) not containing \({{\mathcal {V}}}\), there exist G and H such that G and H are \(\mathscr {U}\)-equivalent but not \(\mathscr {U}\)-Markov equivalent.

Proof

The first claim is trivial. To show that the converse is false, let G be the graph obtained by deleting the \(X_1 \rightarrow X_2\) edge from the graph \(G_{n-2}\) pictured in Fig. 11 (taking \(k = n-2\)), and let H be the graph obtained by adding that edge back to G, i.e., \(H = G_{n-2}\). The remainder of the proof shows that these two graphs witness the claim.

Suppose \({{\mathcal {X}}}\) consists exclusively of binary random variables, and define a Bayesian network \(\langle G, p \rangle \) over \({{\mathcal {X}}}\) by the following conditions:

  • \(p(X_1=0) = \frac{1}{2}\),

  • \(p(W_i = 0|X_1= 0)= 1\) and \(p(W_i = 0|X_1= 1) = \frac{1}{2}\) for all \(1 \le i \le n-2\), and

  • \(p(X_2 = 1|\varvec{W} = \varvec{w})= \frac{2^{\Sigma \varvec{w}} - 1}{2^{\Sigma \varvec{w}}}\) .

where \(\varvec{W}= \langle W_1, W_2, \ldots , W_{n-2} \rangle \) and \(\Sigma \varvec{w}\) is the number of non-zero coordinates in \(\varvec{w}\). Below, I show that the distribution p is faithful to G. Before doing so, let q be any distribution over \({{\mathcal {X}}}\) that agrees with p on all of the marginal distributions over any proper subset of variables of \({{\mathcal {X}}}\). I show that \(q \models X_1 \amalg X_2 | \{W_1, \ldots , W_{n-2}\}\), and hence, that q is not faithful to H. By definition, this entails that G and H are not \(\mathscr {U}\)-Markov equivalent.
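Before the formal argument, the construction can be checked by brute-force enumeration. The following sketch is mine (the choice of three W-variables is illustrative): it confirms that, under p, \(X_1\) and \(X_2\) are independent given all of the W's, yet dependent given the empty set.

```python
from itertools import product

K = 3                                   # number of W variables (illustrative)

def p_joint(x1, w, x2):
    """The distribution p defined in the proof of Theorem 9."""
    pr = 0.5                                       # p(X1 = x1) = 1/2
    for wi in w:                                   # p(W_i = wi | X1 = x1)
        p_wi0 = 1.0 if x1 == 0 else 0.5
        pr *= p_wi0 if wi == 0 else 1 - p_wi0
    s = sum(w)
    p_x2_1 = (2**s - 1) / 2**s                     # p(X2 = 1 | W = w)
    pr *= p_x2_1 if x2 == 1 else 1 - p_x2_1
    return pr

def cond_x2(x2, x1, w):
    """p(X2 = x2 | X1 = x1, W = w), or None if the condition has probability 0."""
    denom = sum(p_joint(x1, w, v) for v in (0, 1))
    return None if denom == 0 else p_joint(x1, w, x2) / denom

# X1 is independent of X2 given all of W: for each w, the conditional
# probability of X2 does not vary with x1 (wherever it is defined).
for w in product((0, 1), repeat=K):
    vals = {cond_x2(1, x1, w) for x1 in (0, 1)} - {None}
    assert len(vals) <= 1

# ...but X1 and X2 are dependent given the empty set (Category 1 with Z empty):
def p_x2_given_x1(x2, x1):
    return sum(p_joint(x1, w, x2) for w in product((0, 1), repeat=K)) / 0.5

print(p_x2_given_x1(1, 0), p_x2_given_x1(1, 1))    # 0.0 versus about 0.58
```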

To prove \(q \models X_1 \amalg X_2 | \{W_1, \ldots , W_{n-2}\}\), it is necessary to show

$$\begin{aligned} q(X_2 = x_2 | X_1=x_1, \varvec{W}=\varvec{w}) = q(X_2 = x_2 | \varvec{W}=\varvec{w}) \end{aligned}$$

for any set of values \(x_1, x_2,\varvec{w}\). There are two cases to consider:

Case 1 Assume that at least one coordinate of \(\varvec{w}\) is equal to one.

$$\begin{aligned} q(X_2=x_2|\varvec{W}=\varvec{w})&= q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=1) \cdot q(X_1=1|\varvec{W}=\varvec{w}) \\&\quad + q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=0) \cdot q(X_1=0|\varvec{W}=\varvec{w}) \qquad \text{ by total probability }\\&= q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=1) \cdot p(X_1=1|\varvec{W}=\varvec{w}) \\&\quad + q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=0) \cdot p(X_1=0|\varvec{W}=\varvec{w}) \qquad \text{ as } p \text{ and } q \text{ agree on all marginal distributions over proper subsets of } {{\mathcal {X}}} \\&= q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=1) \cdot 1 + q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=0) \cdot 0 \\&\qquad \text{ by definition of } p, \text{ as at least one coordinate of } \varvec{w} \text{ equals one } \\&= q(X_2=x_2|\varvec{W}=\varvec{w}, X_1=1) \end{aligned}$$

Case 2 Assume that all coordinates of \(\varvec{w}\) are equal to zero, i.e., \(\varvec{w} = \varvec{0}\). There are two subcases to consider.

Case 2a Suppose \(x_2 = 0\). Then:

$$\begin{aligned} 1&= p(X_2=x_2=0|\varvec{W}=\varvec{0}) = q(X_2=x_2|\varvec{W}=\varvec{0}) \qquad \text{ as } p \text{ and } q \text{ agree on all marginal distributions over proper subsets of } {{\mathcal {X}}} \\&= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1) \cdot q(X_1=1|\varvec{W}=\varvec{0}) + q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \cdot q(X_1=0|\varvec{W}=\varvec{0}) \qquad \text{ by total probability }\\&= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1) \cdot q(X_1=1|\varvec{W}=\varvec{0}) + q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \cdot \left( 1 - q(X_1=1|\varvec{W}=\varvec{0}) \right) \qquad \text{ by properties of conditional probability }\\&= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1) \cdot p(X_1=1|\varvec{W}=\varvec{0}) + q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \cdot \left( 1 - p(X_1=1|\varvec{W}=\varvec{0}) \right) \qquad \text{ as } p \text{ and } q \text{ agree on all marginal distributions over proper subsets of } {{\mathcal {X}}} \end{aligned}$$

As \(1> p(X_1=1|\varvec{W}=\varvec{0}) > 0\), the last equation holds if and only if:

$$\begin{aligned} 1=q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1)= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \end{aligned}$$

and hence,

$$\begin{aligned} q(X_2=x_2|\varvec{W}=\varvec{0})=q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1)= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0). \end{aligned}$$

Case 2b Suppose \(x_2 = 1\). Then \(p(X_2=x_2=1|\varvec{W}=\varvec{0}) =0\) by definition of p, and so

$$\begin{aligned} 0&= p(X_2=x_2=1|\varvec{W}=\varvec{0}) = q(X_2=x_2|\varvec{W}=\varvec{0}) \qquad \text{ as } p \text{ and } q \text{ agree on all marginal distributions over proper subsets of } {{\mathcal {X}}} \\&= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1) \cdot p(X_1=1|\varvec{W}=\varvec{0}) + q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \cdot \left( 1 - p(X_1=1|\varvec{W}=\varvec{0}) \right) \qquad \text{ by the same reasoning as in Case 2a } \end{aligned}$$

As \(1> p(X_1=1|\varvec{W}=\varvec{0}) > 0\), the last equation holds if and only if:

$$\begin{aligned} 0=q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1)= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0) \end{aligned}$$

and hence,

$$\begin{aligned} q(X_2=x_2|\varvec{W}=\varvec{0})=q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=1)= q(X_2=x_2|\varvec{W}=\varvec{0}, X_1=0). \end{aligned}$$

To show p is faithful to G, note that all triples not entailed by G fall into one of the following five categories:

  1. \(X_1 \amalg X_2 | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subsetneq \{W_1, W_2, \ldots W_{n-2} \}\),

  2. \(X_1 \amalg W_i | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_1, W_i \}\) and \(1 \le i \le n-2\),

  3. \(X_2 \amalg W_i | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_2, W_i \}\) and \(1 \le i \le n-2\),

  4. \(W_i \amalg W_j | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_1, X_2, W_i, W_j \}\) and \(1 \le i,j \le n-2\),

  5. \(W_i \amalg W_j | {{\mathcal {Z}}}\) where \(X_2 \in {{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{W_i, W_j \}\) and \(1 \le i,j \le n-2\).

I now show that p likewise does not entail any of the above conditional independences.

Category 1 It must be shown that \(p \not \models X_1 \amalg X_2 | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subsetneq \{W_1, W_2, \ldots W_{n-2} \}\). By definition, \(p(X_2=0 | X_1 = 0, \varvec{{{\mathcal {Z}}}} = \varvec{0})=1\) whereas \(p(X_2=0 | \varvec{{{\mathcal {Z}}}} = \varvec{0}) < 1\) because there is some \(W_i \not \in {{\mathcal {Z}}}\).

Category 2 It must be shown that \(p \not \models X_1 \amalg W_i | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_1, W_i\}\). This is similar to the last case. Note, by definition of p, it follows that \(p(W_i=0 | X_1 = 0, \varvec{{{\mathcal {Z}}}} = \varvec{0})=1,\) whereas \(p(W_i=0 | \varvec{{{\mathcal {Z}}}} = \varvec{0}) < 1\).

Category 3 It must be shown that \(p \not \models X_2 \amalg W_i | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_2, W_i\}\). To do so, it suffices to show \(p(X_2 = 1 | W_i = 1, \varvec{{{\mathcal {Z}}}} = \varvec{0}) \ne p(X_2 = 1 | \varvec{{{\mathcal {Z}}}} = \varvec{0})\). If \(X_1 \in {{\mathcal {Z}}}\), this is trivial because the conditional probability on the left hand-side is undefined whereas that on the right is positive. If \(X_1 \not \in {{\mathcal {Z}}}\), it is tedious but routine to verify that \(p(X_2 = 1 | W_i = 1, \varvec{{{\mathcal {Z}}}} = \varvec{0}) > p(X_2 = 1 | \varvec{{{\mathcal {Z}}}} = \varvec{0})\) using (1) the definition of p, (2) the law of total probability, and (3) the fact that \(p(\varvec{{{\mathcal {W}}}} = \varvec{w} | X_1 = 1) = \prod _{W_j \in {{\mathcal {W}}}} p(W_j = \varvec{w}_j|X_1 = 1) = \frac{1}{2^{|{{\mathcal {W}}}|}}\) for all \({{\mathcal {W}}} \subseteq \{W_1, W_2, \ldots W_{n-2}\}\), which is an instance of the factorization property for Bayesian networks.

Category 4 It must be shown that \(p \not \models W_i \amalg W_j | {{\mathcal {Z}}}\) where \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{X_1,X_2, W_i,W_j\}\). To do so, use the three facts described in Category 3 to show \(p(W_i=0|W_j=0,\varvec{{{\mathcal {Z}}}} = \varvec{0}) > p(W_i=0|\varvec{{{\mathcal {Z}}}} = \varvec{0})\): conditioning on \(W_j = 0\) raises the probability that \(X_1 = 0\), which in turn raises the probability that \(W_i = 0\).

Category 5 It must be shown that \(p \not \models W_i \amalg W_j | {{\mathcal {Z}}}\) where \(X_2 \in {{\mathcal {Z}}}\) and \({{\mathcal {Z}}} \subseteq {{\mathcal {X}}} \setminus \{W_i,W_j\}\). To do this, use the three facts described in Category 3 to show \(p(W_i=1|W_j=0, \varvec{{{\mathcal {Z}}} \setminus \{X_2\}} = \varvec{0}, X_2 = 1) > p(W_i=1|\varvec{{{\mathcal {Z}}} \setminus \{X_2\}} = \varvec{0}, X_2 = 1)\): conditioning on the collider \(X_2 = 1\) while \(W_j = 0\) makes it more likely that some other parent, in particular \(W_i\), explains \(X_2\)'s value. \(\square \)

1.2 Stronger distributional assumptions

Write \(G \approx _{\mathscr {U}}^{\varvec{M}} H\) if G and H are \(\mathscr {U}\) \(\varvec{M}\)-indistinguishable. In the special case in which \(\mathscr {U}\) is all subsets of \({{\mathcal {V}}}\) of a fixed size k, write \(G \approx _{k}^{\varvec{M}} H\). When \({{\mathcal {V}}} \in \mathscr {U}\), we drop the subscript \(\mathscr {U}\) and write \(G \approx ^{\varvec{M}} H\).

Theorem 12

There are G and H such that \(G \equiv _{2,0} H\) but \(G \not \approx _{2}^{\varvec{D}^{+}} H\).

Proof

Let G be the graph \(X \rightarrow Y \rightarrow Z\), and let H be the graph \(X \rightarrow Z \rightarrow Y\). Clearly, \(G \equiv _2 H\). Suppose for the sake of contradiction that \(G \approx _{2}^{\varvec{D}^{+}} H\).

Suppose X, Y, and Z are all binary random variables, and consider any two discrete, multinomial models \(\langle G, p \rangle \) and \(\langle H, q \rangle \). Since \(G \approx _{2}^{\varvec{D}^{+}} H\), the correlations of any pair of variables with respect to p and q are identical. Hence, I write \(\rho _{VW}\) to indicate the correlation between \(V,W \in \{X,Y,Z\}\) in the two discrete models.

By Theorem 2.12 of Danks and Glymour (2001), the correlation between any two variables in a singly connected graph containing only binary variables is the product of the correlations along the unique trek connecting them. Since G contains the trek \(X \rightarrow Y \rightarrow Z\), it follows that \(\rho _{XZ} = \rho _{XY} \cdot \rho _{YZ}\). By the same theorem, since H contains the trek \(X \rightarrow Z \rightarrow Y\), it follows that \(\rho _{XY} = \rho _{XZ} \cdot \rho _{YZ}\). So \(\rho _{YZ}^2=1\); in other words, Y and Z are perfectly correlated. So the distributions p and q are not positive, contradicting assumption. \(\square \)

The last proof cannot be generalized straightforwardly to either (a) discrete variables taking more than two values, or (b) \(\varvec{D}^{+}\) k-equivalence when \(k >2\). The former generalization is not straightforward because the theorem that correlations can be multiplied along treks holds only for binary variables. The latter generalization is difficult because the same theorem applies only to singly connected networks, and two graphs that are k, 0-equivalent but not \(\mathsf {I}\)-equivalent will often contain multiple treks between two variables (because they will generally contain differently oriented edges).
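The trek rule for binary chains invoked above can be spot-checked numerically; in the sketch below, the conditional probability tables are arbitrary illustrative choices:

```python
from itertools import product
import math

# A binary chain X -> Y -> Z with arbitrary illustrative CPTs.
p_x1 = 0.3
p_y1 = {0: 0.2, 1: 0.9}        # p(Y = 1 | X = x)
p_z1 = {0: 0.4, 1: 0.7}        # p(Z = 1 | Y = y)

def joint(x, y, z):
    px = p_x1 if x else 1 - p_x1
    py = p_y1[x] if y else 1 - p_y1[x]
    pz = p_z1[y] if z else 1 - p_z1[y]
    return px * py * pz

def corr(i, j):
    states = list(product((0, 1), repeat=3))
    e = [sum(s[a] * joint(*s) for s in states) for a in range(3)]
    e_ij = sum(s[i] * s[j] * joint(*s) for s in states)
    var = [e[a] - e[a] ** 2 for a in range(3)]   # binary: E[V^2] = E[V]
    return (e_ij - e[i] * e[j]) / math.sqrt(var[i] * var[j])

# corr(X, Z) equals the product of correlations along the trek X -> Y -> Z:
print(math.isclose(corr(0, 2), corr(0, 1) * corr(1, 2)))   # True
```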

Theorem 13

For all \({{\mathcal {V}}}\) and all \(\mathscr {U} \subseteq {{\mathcal {P}}}({{\mathcal {V}}})\) such that \({{\mathcal {V}}} \not \in \mathscr {U}\), there exist G and H such that \(G \equiv _{\mathscr {U}} H\) but \(G \not \approx _{\mathscr {U}}^{\varvec{N} \vee } H\).

Proof

The proof is the same as that of Theorem 9 because the distribution p in that proof is a noisy-or parameterization. In greater detail, for a graph with n many variables, enumerate the variables \({{\mathcal {V}}} = \{X_1, X_2, W_1, \ldots W_{n-2}\}\). Consider the graph \(G_{n-2}\) with the latent "noise" terms \(\{E_V: V \in {{\mathcal {V}}}\} \cup \{B_{U,V}: U \rightarrow V \text{ is an edge in } G_{n-2} \}\), as shown in Fig. 12.

Fig. 12: The graph denoted \(G_{n-2}\) with noise terms, as described in Theorem 13

Let p be the unique noisy-or parameterization of \(G_{n-2}\) such that (1) \(p(E_{X_1} = 1) = 1/2\), (2) \(p(E_{W_i} = 1) = 0\) for all \(W_i\) and \(p(E_{X_2} = 1) = 0\), and (3) \(p(B_{U,V} = 1) = 1/2\) if \(U \rightarrow V\) is an edge in \(G_{n-2}\). (The noise term for \(X_2\) must be silent so that \(p(X_2 = 1 | \varvec{W} = \varvec{0}) = 0\), as the target distribution requires.) Then

  • \(p(X_1=0) = \frac{1}{2}\) because \(p(X_1 = 0) = p(E_{X_1} = 0) = \frac{1}{2}\).

  • \(p(W_i = 0|X_1= 0)= 1\) and \(p(W_i = 0|X_1= 1) = \frac{1}{2}\) for all \(W_i\). The former equation holds because \(p(W_i = 0|X_1= 0)= p(E_{W_i} = 0) = 1\), and the latter holds because \(p(E_{W_i} = 1) = 0\) and \(p(W_i = 0|X_1= 1) = p(B_{X_1, W_i} = 0) = \frac{1}{2}\).

So to show that p is the same distribution as in the proof of Theorem 9, we need only show that \(p(X_2 = 1|\varvec{W} = \varvec{w})= \frac{2^{\Sigma \varvec{w}} - 1}{2^{\Sigma \varvec{w}}}\). To do that, we rely on the following lemma, which can be proven by induction on k.

Lemma 3

\(\sum _{m=1}^k (-1)^{m+1} \cdot \left( {\begin{array}{c}k\\ m\end{array}}\right) \frac{1}{2^m} = \frac{2^k - 1}{2^k}\) for all \(k \ge 1\).
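For what it is worth, Lemma 3 also follows in one line from the binomial theorem (this derivation is mine, not the paper's):

$$\begin{aligned} \sum _{m=1}^k (-1)^{m+1} \cdot \left( {\begin{array}{c}k\\ m\end{array}}\right) \frac{1}{2^m} = 1 - \sum _{m=0}^k \left( {\begin{array}{c}k\\ m\end{array}}\right) \left( -\frac{1}{2}\right) ^m = 1 - \left( 1 - \frac{1}{2}\right) ^k = \frac{2^k - 1}{2^k}. \end{aligned}$$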

With that lemma, let \({{\mathcal {Z}}} = \{B_{W_j, X_2}: \varvec{w}_j = 1\}\) be the set of edge indicators for the coordinates of \(\varvec{w}\) equal to one, and note that \({{\mathcal {Z}}}\) has size \(k = \Sigma \varvec{w}\). Given \(m \le k\), let \({{\mathcal {Z}}}_{m} = \{{{\mathcal {U}}} \subseteq {{\mathcal {Z}}}: |{{\mathcal {U}}}|=m\}\) be the set of all subsets of \({{\mathcal {Z}}}\) of size m. Since \(p(E_{X_2} = 1) = 0\), we have \(X_2 = 1\) given \(\varvec{W} = \varvec{w}\) exactly when some member of \({{\mathcal {Z}}}\) equals one. Then:

$$\begin{aligned} p(X_2 = 1|\varvec{W} = \varvec{w})&= p\Big(\bigcup _{Z \in {{\mathcal {Z}}}} Z = 1\Big) \\&= \sum _{m = 1}^{k} (-1)^{m+1} \cdot \sum _{{{\mathcal {U}}} \in {{\mathcal {Z}}}_{m}} p\Big( \bigcap _{U \in {{\mathcal {U}}}} U = 1 \Big) \qquad \text{ by the inclusion-exclusion principle } \\&= \sum _{m = 1}^{k} (-1)^{m+1} \sum _{{{\mathcal {U}}} \in {{\mathcal {Z}}}_{m}} \frac{1}{2^m} \qquad \text{ because the } B_{W_j, X_2} \text{ are mutually independent } \\&= \sum _{m = 1}^{k} (-1)^{m+1} \cdot \left( {\begin{array}{c}k\\ m\end{array}}\right) \cdot \frac{1}{2^m} \\&= \frac{2^k - 1}{2^k} \qquad \text{ by the lemma } \end{aligned}$$

\(\square \)
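The closed form can also be checked by enumeration. The sketch below encodes my reading of the parameterization (fair, mutually independent edge indicators and a silent noise term for \(X_2\)):

```python
from itertools import product
from math import comb, isclose

def p_x2_one(k):
    """Exact p(X2 = 1 | W = w) with k = #{j : w_j = 1}: the probability that
    at least one of k independent, fair edge indicators B_{W_j, X2} fires."""
    return sum(1 for bs in product((0, 1), repeat=k) if any(bs)) / 2**k

for k in range(1, 8):
    closed_form = (2**k - 1) / 2**k
    incl_excl = sum((-1)**(m + 1) * comb(k, m) / 2**m for m in range(1, k + 1))
    assert isclose(p_x2_one(k), closed_form) and isclose(incl_excl, closed_form)
```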


Cite this article

Mayo-Wilson, C. Causal identifiability and piecemeal experimentation. Synthese 196, 3029–3065 (2019). https://doi.org/10.1007/s11229-018-1826-4
