Improved baselines for causal structure learning on interventional data

Causal structure learning (CSL) refers to the estimation of causal graphs from data. Causal versions of tools such as ROC curves play a prominent role in empirical assessment of CSL methods and performance is often compared with “random” baselines (such as the diagonal in an ROC analysis). However, such baselines do not take account of constraints arising from the graph context and hence may represent a “low bar”. In this paper, motivated by examples in systems biology, we focus on assessment of CSL methods for multivariate data where part of the graph structure is known via interventional experiments. For this setting, we put forward a new class of baselines called graph-based predictors (GBPs). In contrast to the “random” baseline, GBPs leverage the known graph structure, exploiting simple graph properties to provide improved baselines against which to compare CSL methods. We discuss GBPs in general and provide a detailed study in the context of transitively closed graphs, introducing two conceptually simple baselines for this setting, the observed indegree predictor (OIP) and the transitivity assuming predictor (TAP). While the former is straightforward to compute, for the latter we propose several simulation strategies. Moreover, we study and compare the proposed predictors theoretically, including a result showing that the OIP outperforms in expectation the “random” baseline on a subclass of latent network models featuring positive correlation among edge probabilities. Using both simulated and real biological data, we show that the proposed GBPs outperform random baselines in practice, often substantially. Some GBPs even outperform standard CSL methods (whilst being computationally cheap in practice). Our results provide a new way to assess CSL methods for interventional data.

2009; Spirtes 2010). CSL is an important and challenging topic in its own right and has attracted a great deal of recent research attention in a number of fields including statistics, machine learning and philosophy (reviewed in Heinze-Deml et al. 2018). Broadly speaking, given data X (which might be observational and/or interventional), CSL methods provide a graph estimate Ĝ(X) (or a probabilistic analogue) with edges intended to encode causal relationships. The semantics of such graphs can be complex and depend on the precise model and application domain, but for the present it is important only to emphasize that such estimators use data X to infer relationships between entities and can be viewed as encoding such information as a directed graph Ĝ.
CSL methods necessarily require assumptions on the underlying causal system that may or may not hold in real applications and whose validity may be difficult to check in practice. As a result, the behaviour of CSL methods under realistic conditions (noise levels, limited sample sizes etc.) may not be clear in advance. As such, in practical settings it is important to empirically assess the efficacy of CSL methods. To this end a number of studies have focused on such assessment (including, among others, Hill et al. 2016; Heinze-Deml et al. 2018; Eigenmann et al. 2020). In the empirical assessment of CSL methods, a common strategy is to compare the estimated graph Ĝ with a "ground truth" graph G* (depending on context, either the true data-generating graph in a simulation, or a scientifically/experimentally defined gold standard). Such quantitative comparisons are usually made alongside baselines, which provide a way to contextualize the performance of the estimator Ĝ on the specific problem. Random baselines, such as the diagonal in an ROC analysis, are widely used, motivated by the idea that large deviations from the random case are an indicator that the estimator is successfully identifying causal structure.
In this paper we put forward a new class of baselines for the assessment of CSL methods in the setting that (some) interventional data is available. While random baselines are a good and useful tool, they ignore structure that might be inherent in the problem, in the sense of regularities in the ground truth graph G * . In the interventional data setting, some information on G * is available at the outset. We argue that such information can constrain possible solutions such that random baselines are in a way too general for this setting and provide only a "low bar" against which to assess CSL methods. Instead, we propose to exploit the knowledge of part of the ground truth graph in combination with straightforward graph properties, to define new baselines called graph-based predictors (GBPs), that share conceptual simplicity with classical baselines but that constitute a demonstrably stronger test.
A related line of work, developing and utilizing null models for networks, seeks to contextualise interesting network features with reference to default, background models; see e.g. the surveys Fosdick et al. (2018) and Gauvin et al. (2018), as well as Chapter 11 in Fornito et al. (2016) and references therein. The key idea in these approaches is to understand whether a seemingly salient feature of a network (e.g. high levels of connectivity within specific subsets of the graph, the focus of the thriving area of community detection; see e.g. Newman and Girvan (2004), the survey Fortunato (2010) and the references therein) is really unusual or noteworthy. In a similar fashion, we seek to contextualise the performance of CSL methods, using certain graph properties to define suitable baselines. However, a key difference is that in the null models literature the network itself is assumed known; in contrast, in our paper and in CSL in general, the network itself is (partially) inferred.
Our work is motivated by, and illustrated in the context of, interventional experiments that have become feasible in recent years in molecular biology (see, among others, Sachs et al. 2005; Kemmeren et al. 2014; Shalem et al. 2015; Dixit et al. 2016; Ursu et al. 2022). Such experiments are crucial for the inference of molecular networks, encoding causal relationships between entities such as genes or proteins, which in turn play a central role in disease and systems biology (see e.g. Phillips 2008; Parikshak et al. 2015). The inference of molecular networks from data is a long-standing problem at the intersection of statistics, machine learning and systems biology (for introductions see e.g. Ideker et al. 2001; Babu et al. 2004; Sanguinetti and Huynh-Thu 2019; Nogueira et al. 2022).
In practice the interventional experiments in biology involve perturbation of molecular nodes (for example genes) and subsequent measurement of a high-dimensional readout (such as gene expression); specific examples include gene knock-out, knock-down, knock-up and knock-in experiments. Such data are relevant for causal learning because the measurement of the expression level of a gene B after perturbation of a gene A gives information on the (total) causal effect of A on B. Hence, if available, incorporating interventional data alongside observational data in CSL methods is desirable, and this has been studied from a number of perspectives (relevant literature includes Hauser and Bühlmann 2012; Rau et al. 2013; Spencer et al. 2015; Peters et al. 2016; Magliacane et al. 2016a, b; Meinshausen et al. 2016; Magliacane and van Ommen 2017; Wang et al. 2017; Hill et al. 2019; Rothenhäusler et al. 2019; Brouillard et al. 2020).
At the same time, interventional data are widely used to obtain gold standards to assess CSL methods (see e.g. Colombo and Maathuis 2014; Meinshausen et al. 2016; Wang et al. 2017). Notably, in practice it is usually not feasible to perform all possible perturbation experiments due to time- and cost-constraints; rather, only a subset can be performed. As we discuss in detail in Sect. 2, this can be viewed as providing a partial observation of the ground truth graph G*, and this practical scenario is the one we focus on.
A particularly interesting and relevant special case concerns transitively closed graphs. As noted above, in real-world gene perturbation experiments, one observes the total causal effect of perturbing one gene (the target A) on another gene B (usually many such genes are measured in contemporary "omics" designs; we refer to such data in the following as omics readouts or simply as omics data). An effect of A on B may be mediated by other genes intermediate on the underlying causal path. For this reason, such effects are transitive in the sense that if A has a causal edge to B (in the underlying causal graph) and B to C, then an intervention on A may change C (this corresponds to the total causal effect of A on C), resulting in an edge from A to C in a graph constructed directly from the perturbation experiments. The assumption of observing transitively closed or ancestral causal graphs has also been made in Magliacane et al. (2016a), who consider estimating transitively closed graphs, and in Heinze-Deml et al. (2018) and Eigenmann et al. (2020), where CSL methods were evaluated with respect to ancestral relations of this kind.
The contributions of this paper are as follows:
• New class of baselines. We propose a new class of baselines for CSL that take account of graph properties in the case that interventional data is available. The proposed baselines leverage structural properties rooted in the underlying causal graph.
• Methods for transitively closed graphs. Motivated by the nature of real-world gene perturbation experiments, we focus particular attention on transitivity and related properties and put forward specific baselines that exploit constraints derived from these properties.
• Theoretical results on superiority and delineation. We prove a superiority statement for a particular baseline in the context of latent network models. Moreover, we delineate the proposed baselines from each other theoretically.
• Empirical results using real gene/protein perturbation data. Using real data from large-scale gene and protein perturbation experiments, we study the behaviour of the proposed methods to understand whether they can actually provide improved baselines in practice.
Taken together, our results provide a framework for constructing improved baselines for CSL and thereby to more thoroughly assess the capabilities of CSL methods, with a focus on the use of interventional data, an area of key relevance for ongoing efforts at the interface between systems biology and large-scale perturbation designs.
The remainder of the paper is organized as follows. We begin in Sect. 2 with notation and background, defining the precise set-up for which the proposed baselines are intended. In Sects. 3.1 and 3.2 we introduce two general ways to construct graph-based predictors, based respectively on indegree information and on constraints rooted in transitivity. These two classes are illustrated with specific implementations, the observed indegree predictor (OIP) and several transitivity assuming predictors (TAPs) respectively, which are specifically derived for their use as baselines in systems biology experiments. For the OIP a theoretical result of superiority over random baselines is given. Moreover, in Sect. 3.2 we propose simulation strategies for the TAPs, as their direct computation is infeasible. In Sect. 3.3 combinations of the OIP and the TAPs are discussed. We detail in Sect. 3.4 the theoretical differences between all introduced candidate baselines and outline potential similarities. Section 4.1 provides a detailed analysis of a simulation study of the proposed GBPs. In Sect. 4.2 we then study the behaviour of the proposed GBPs using real transcriptomics and proteomics data including observational and interventional experiments, alongside application of standard CSL methods from the literature to the same data sets. We conclude with a brief discussion of open questions and possible future work in Sect. 5.

Notation and background
In this section we give some background on CSL and introduce notation and the general set-up. In particular, we detail the structure of the data X and its underlying causal graph G in the context of CSL on interventional data.

Contextualization within CSL
We focus on the setting in which interventional and observational data are included in X. For example, in the case of omics data, X includes rows of readouts after targeted gene perturbations (interventional) and after control experiments (observational). In practice a gold standard ground truth graph G* might be obtained by comparing interventional and observational data, either in the current set of experiments or using previous experimental data. Given measurement of a variable B after perturbation of variable A, the causal relationship (A, B) ("from" A "to" B) is inferred by comparing the empirical distribution of B under the control experiments with the corresponding distribution under intervention on A. Since omics designs usually involve measuring many variables in parallel, we consider here the common case that, given a perturbation is performed on A, we measure all other genes, i.e. each intervention experiment corresponds to a whole row of readouts in X. We consider only single interventions (i.e. only one node A is intervened upon in a given experiment). It is important for the set-up detailed below that we have access to interventional data in which some (but not all) genes are intervened upon, which is the common case in practice.
Some clarifications regarding our set-up are as follows: (1) We do not a priori rule out cycles in directed graphs. This is because in practice an intervention on a variable A may change B and vice versa (see also below). (2) For ease of discussion we assume that the type of intervention is fixed and that causal claims relate to the specific type of intervention. This is motivated by the fact that in practice, the precise nature of an intervention is defined by the experimental protocol, hence claims and predictions are limited to changes under the specific protocol. As a concrete example, if a knock-out of a gene A changes gene B, this does not imply that a knockdown of A would change B (since the latter experiment might induce a sub-threshold change to A) and so on. (3) For ease of computation we consider self-edges to be present at every node (compare M[k, k] = 1 for all k in (2.1) further below).
Point (1) stands in contrast to some of the classical CSL literature, in particular to methods based on directed acyclic graphs (DAGs), where the assumption of acyclicity plays a crucial role (Spirtes et al. 2000; Maathuis et al. 2009; Colombo and Maathuis 2014). Cyclic models have been discussed in the literature (see e.g. Richardson 1996; Hyttinen et al. 2014; Hill et al. 2019). In the applied context of perturbation omics experiments, cyclic models are natural, because an intervention on one gene A may lead to a change in another gene B, but an intervention on B may vice versa lead to a change in A. This is essentially due to the fact that real omics data are measurements at a given time in a dynamic system (with the causal effects always forward-in-time in the underlying system).

Notation and basic definitions
Denote a directed, unweighted graph by G = (V, E) with vertex set V = {v_1, v_2, ..., v_p} and edge matrix E ∈ ℰ, with

ℰ := {M ∈ {0, 1}^{p×p} : M[k, k] = 1 for all k},   (2.1)

where E[k, ℓ] = 1 encodes an edge from v_k to v_ℓ. As the graphs of interest encode causal relationships between entities in V, where useful we refer to them as causal graphs.
(1) We say there exists a causal path from v_k to v_ℓ if there are indices k = i_0, i_1, ..., i_r = ℓ with E[i_{j−1}, i_j] = 1 for all j = 1, ..., r. (2) Call G^+ = (V, E^+) the ancestral causal graph (or the causal transitive closure) of G if E^+[k, ℓ] = 1 holds exactly when there exists a causal path from v_k to v_ℓ. Moreover, call G an underlying causal graph of G^+. For an example see Fig. 1.
(3) Call G ancestral or transitively closed if G + = G holds.
(4) For a node v_k ∈ V define the indegree of v_k by deg^-(v_k) := #{ℓ ≠ k : E[ℓ, k] = 1}.

We note that ancestral causality has been studied in the literature using a variety of models (see e.g. Zhang 2008; Magliacane et al. 2016b; Malinsky and Spirtes 2016; Mooij and Claassen 2020) and is a complex topic in its own right. The purpose of the above definition is simply to introduce the notion of a transitive closure and to make the connection to indirect causation, to facilitate the introduction of specific, transitivity assuming baselines below.
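The notions of transitive closure, transitive closedness and indegree above can be made concrete in a few lines of code. Below is a minimal Python sketch (function names are ours; the paper's accompanying code is in R), using a Warshall-style update on the 0/1 edge matrix:

```python
import numpy as np

def transitive_closure(E):
    """Return E^+, the edge matrix of the causal transitive closure G^+.

    E[k, l] = 1 encodes an edge from v_k to v_l; the diagonal is treated
    as 1 throughout (self-edges present at every node, as in the set-up).
    """
    Ep = (np.asarray(E) != 0).astype(int)
    np.fill_diagonal(Ep, 1)
    p = Ep.shape[0]
    for j in range(p):  # Warshall update: paths k -> j and j -> l imply k -> l
        Ep |= np.outer(Ep[:, j], Ep[j, :])
    return Ep

def is_transitively_closed(E):
    """G is ancestral/transitively closed iff G^+ = G."""
    Ed = (np.asarray(E) != 0).astype(int)
    np.fill_diagonal(Ed, 1)
    return bool(np.array_equal(transitive_closure(Ed), Ed))

def indegree(E, k):
    """deg^-(v_k): number of incoming edges, not counting the self-edge."""
    E = np.asarray(E)
    return int((E[:, k] != 0).sum()) - int(E[k, k] != 0)
```

For the chain v_1 → v_2 → v_3, for example, the closure adds the ancestral edge v_1 → v_3, and the resulting graph is transitively closed while the chain itself is not.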
We will use directed graphs that are random in an edgewise Erdős-Rényi sense as defined next (such graphs are studied in Karp 1990).
Definition 2.2 Define a random directed graph (RDG) of size p, with edge probability q, denoted by RDG_q(p) = (V, E), as a directed graph with |V| = p nodes, where all off-diagonal entries of E are iid draws from a Bernoulli distribution with success probability q. Moreover, given a graph G̃ = (Ṽ, Ẽ) and a subset of edges K ⊂ {[k, ℓ]}_{1 ≤ k ≠ ℓ ≤ p}, we construct as RDG_{q,K}(G̃) = (V, E) the partially random directed graph with underlying G̃ and edge probability q by drawing

E[k, ℓ] ~ δ(Ẽ[k, ℓ]) for [k, ℓ] ∈ K,   E[k, ℓ] ~ B(1, q) otherwise,

with δ denoting the Dirac delta distribution and with iid draws from the Bernoulli distribution B(1, q).
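Definition 2.2 translates directly into code. A minimal Python sketch (ours; δ(Ẽ[k, ℓ]) simply means copying the corresponding entry of the underlying graph):

```python
import numpy as np

def rdg(p, q, seed=None):
    """Draw RDG_q(p): all off-diagonal entries iid Bernoulli(q), diagonal 1."""
    rng = np.random.default_rng(seed)
    E = (rng.random((p, p)) < q).astype(int)
    np.fill_diagonal(E, 1)  # self-edges present at every node
    return E

def partial_rdg(E_tilde, q, K, seed=None):
    """Draw RDG_{q,K}(G~): entries in K are copied from the underlying
    edge matrix E_tilde (the Dirac part); all remaining off-diagonal
    entries are iid Bernoulli(q)."""
    rng = np.random.default_rng(seed)
    p = E_tilde.shape[0]
    E = (rng.random((p, p)) < q).astype(int)
    np.fill_diagonal(E, 1)
    for (k, l) in K:
        E[k, l] = E_tilde[k, l]
    return E
```

Setting q = 0, for instance, leaves exactly the self-edges plus the entries of K copied from the underlying graph.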
Assumption 2.3 below specifies the set-up of the CSL problem on interventional data.

Assumption 2.3
Let G = (V, E) be a causal graph with |V| = p. Suppose we are given available interventional data X_1 ∈ R^{n_1 × p} and observational data X_2 ∈ R^{m_1 × p}, as well as latent, unavailable interventional data Y_1 ∈ R^{n_2 × p} and latent, unavailable observational data Y_2 ∈ R^{m_2 × p}, on the nodes V, with n_1, n_2, m_1, m_2 ∈ N_{>0}. We assume there exists a set of indices/vertices I ⊂ {1, 2, ..., p}, called the set of available interventions, such that all interventional measurements in X_1 correspond to an intervention on a node v_k with k ∈ I and all interventional measurements in Y_1 correspond to an intervention on a node v_ℓ with ℓ ∉ I. Moreover, we assume the existence of two ground truth functions mapping idealized available and latent data to the edge labels on the index sets S_1 and S_2, respectively, where S_1 := {[k, ℓ] : k ∈ I, ℓ ∈ {1, ..., p}, k ≠ ℓ} and S_2 := {[k, ℓ] : k ∉ I, ℓ ∈ {1, ..., p}, k ≠ ℓ}. Define by X := (X_1, X_2) the available data and by Y := (Y_1, Y_2) the latent data, respectively. We denote by E_X := E[S_1] the partial observation of G w.r.t. I and define analogously E_Y := E[S_2], the unobserved causal relationships of G. Note that, after possibly reordering the rows of E, we have the relationship E = (E_X, E_Y), by slight abuse of notation as we consider only off-diagonal entries.
Let a partial observation E_X of a causal graph G based on available observational and interventional data X be given. We call a predictor assigning to each unobserved causal relationship a probability of its existence based solely on the partial observation E_X a graph-based predictor (GBP). Meanwhile, a predictor assigning to each unobserved causal relationship a probability of its existence based on the available data matrix X will be called a data-based predictor (DBP).
The foregoing assumptions essentially ensure that the graph estimand is operationally well-defined as it is assumed that there exists some oracle procedure by which the edge structure could be determined from idealized data. In the terms above, CSL methods would usually be classified as DBPs, since they use empirical data to obtain a graph estimand.
For the sake of completeness, we introduce here notation and nomenclature for the ROC curve and the AUC in terms of our set-up, as these are widely used performance measures for predictors such as those defined in (2.2) and (2.3), respectively. The ROC curve has to be defined with respect to a gold standard; accordingly, for Definition 2.4 we assume that the entire graph is known for the purpose of computing the ROC curve and related quantities (of course, only part of the graph is available to any estimator/CSL method; specifically, E_Y is unavailable).

Definition 2.4
Let E_X be a partial observation of a non-trivial causal graph G = (V, E) and let S_2 be the indices of the unobserved causal relationships. Let R ∈ [0, 1]^{|S_2|} be the output of a predictor as in (2.2) or (2.3). The receiver operator characteristic (ROC) curve ROC(R) is given as the linear interpolation of the points (FPR_R(c_t), TPR_R(c_t)) over a decreasing sequence of thresholds c_t ∈ [0, 1], where FPR_R(c) and TPR_R(c) denote the fraction of non-edges, respectively edges, of E on S_2 that receive a score R[k, ℓ] ≥ c; for c_t = 0 we have FPR_R(0) = 1 = TPR_R(0), and note that both denominators are not 0 by non-triviality of G. We define the area under the curve (AUC) of the ROC curve as the finite area enclosed by ROC(R), the x-axis and the line {x = 1}. Note that by definition

(FPR_R(c_0), TPR_R(c_0)) = (0, 0)   (2.4)

for a threshold c_0 exceeding all entries of R. By the above definition the random predictor given by R[k, ℓ] = 0.5 for all [k, ℓ] ∈ S_2 induces a diagonal ROC curve, as it is the linear interpolation of the points (0, 0) and (1, 1), yielding an AUC of 0.5.
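The AUC of Definition 2.4 can be computed without constructing the curve explicitly: the area under the linearly interpolated ROC curve equals the Mann-Whitney statistic with ties counted as 1/2. A minimal Python sketch (ours; `scores` plays the role of R on S_2, `labels` that of the corresponding entries of E):

```python
import numpy as np

def auc(scores, labels):
    """Area under the linearly interpolated ROC curve.

    Equivalent to the Mann-Whitney statistic: the probability that a
    randomly chosen true edge is scored above a randomly chosen
    non-edge, with ties counted as 1/2.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]   # scores of true edges
    neg = scores[labels == 0]   # scores of non-edges
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

In particular, the constant random predictor R ≡ 0.5 produces only ties and hence an AUC of exactly 0.5, matching the diagonal ROC curve.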

Construction and theory
In the following section we propose two general forms of graph-based predictors and derive special cases thereof. Moreover, we propose computation and simulation strategies and delineate the proposed GBPs from each other. R-code for the proposed GBPs is available at github.com/richterrob/GraphBasedPredictors.

Observed indegree predictor
We start in this subsection with the idea that a node-level statistic which is partially observed in E_X can carry non-trivial information about edge labels in E_Y. We go on to provide a specific instance of this general approach that uses the indegree as the node-level statistic, leading to the observed indegree predictor (OIP).
GBPs based on a node-level statistic To utilize a node-level statistic to predict the unknown entries of E_Y, we need it to be both estimable from the partial observation E_X and to carry information about E_Y. Suppose G = (V, E) is a causal graph and that we are given a statistic γ_G : V → W mapping the nodes of G to some feature space, e.g. W = R, Z. We desire of γ_G that it (1.) depends only on G = (V, E); (2.) is not constant on V; and (3.) admits a map θ as in (3.1) predicting the edge labels in E_Y "better than random", given that (1.) and (2.) are satisfied, with "better than random" meaning that the AUC as defined in Definition 2.4 for R = θ(γ_G(V)) is strictly larger than 0.5.
Examples of such a statistic γ_G might include
• mappings to the respective in- and outdegrees;
• mappings to the respective number of ancestors and/or descendants.
Let us give an example of how (3.) might be satisfied for the above node-level statistics. Consider a graph G with p nodes, featuring nodes v_1, ..., v_ℓ ∈ V with no incoming and no outgoing edge and nodes v_{ℓ+1}, ..., v_p with at least one incoming and one outgoing edge (here 1 ≤ ℓ ≤ p − 1). In this case, for v_1, ..., v_ℓ the statistics mentioned above would either convey the information that there are no incoming edges or that there are no outgoing edges. Considering as an example the first case, with γ_G assigning to each node the number of its ancestors, we can choose θ strictly increasing in γ_G to obtain a predictor performing better than random with respect to the area under the curve of R = θ(γ_G(V)). We formalize a graph-based predictor based on a node-level statistic in the following definition.
Definition 3.1 Let E_X be a partial observation of a causal graph G, γ_G : V → W a statistic on the nodes of G and θ as in (3.1); the node-level-statistic predictor (NLS) is then defined, via an estimate β(E_X) of γ_G(V), as in (3.2).

Assume that G, γ_G and θ of the above Definition 3.1 satisfy the desiderata (1.), (2.) and (3.) stated further above. Then, given that β predicts γ_G(V) sufficiently well, it is reasonable to claim that the NLS performs better than random with respect to the AUC. A concrete example follows in the next subsection with the OIP, including a discussion of the regime in which the given GBP performs better than random. For the moment let us make the following remark.

Remark 3.2
The construction of the NLS, as a general GBP given in (3.2), encodes the idea that the partial observation of a node-level statistic can carry information on unseen edges. Under which conditions the NLS performs "better" than the random baseline depends on its actual construction (i.e. the choices of γ_G, θ, β, I) and on the underlying distribution on the set of graphs, i.e. G ∼ D.
Observed indegree predictor In the following we consider the indegree statistic by setting γ_G(v_k) = deg^-(v_k). Consider the desiderata on γ_G of Sect. 3.1: given that (2.) is satisfied, γ_G satisfies (1.) and (3.) by construction. To see this for (3.), consider any predictor θ in (3.1) that is strictly increasing with respect to the indegree of the potential effect. It remains to assume (2.), given below as Assumption 3.3.

Assumption 3.3 Given the set-up of Assumption 2.3, deg^- is not a constant function on the vertex set V.
Note that Assumption 3.3 is arguably a weak assumption, especially for large p. The indegree thus yields the following graph-based predictor, as a special case of (3.2).

Definition 3.4 Given a partial observation E_X, define the observed indegree of a node v_ℓ as deg^-_X(v_ℓ) := #{k ∈ I : k ≠ ℓ, E_X[k, ℓ] = 1}, and define the observed indegree predictor (OIP) by assigning to each [k, ℓ] ∈ S_2 a score strictly increasing in deg^-_X(v_ℓ).
The OIP is a good candidate for a graph-based predictor under Assumption 3.3 due to the following heuristic. Assuming that the set of performed interventions I was chosen independently of the edge matrix E, deg^-_X(v_k) follows a hypergeometric distribution (population size p − 1, number of success states deg^-(v_k), number of draws |I|), yielding E[deg^-_X(v_k)] = |I| · deg^-(v_k)/(p − 1), i.e. the observed indegree is, in expectation, proportional to the true indegree. In fact, for graphs with positive correlation structure we have the following result on the expected AUC of the OIP on a subset of S_2.

Theorem 3.5 Let G = (V, E) be such that E is drawn at random with marginal probabilities P(E[k, ℓ] = 1) = q for all k ≠ ℓ, where q ∈ (0, 1), with E[k, ℓ] and E[k', ℓ'] drawn independently for all k, k' and all ℓ ≠ ℓ', and with a positive covariance structure, Cov(E[k, ℓ], E[k', ℓ]) > 0, for all ℓ and any pairwise distinct k, k'.

Then, for any realization of the unknown relationships, the expected AUC of the OIP on the corresponding subset of S_2 is strictly larger than 0.5.
The proof of Theorem 3.5 can be found in "Appendix 1". Furthermore, we show that a subclass of latent network models (e.g. Hoff et al. 2002; Bollobás et al. 2007) falls within the setting of Theorem 3.5 (see Lemma 3 in "Appendix 1").
Remark 3.6 Whether Theorem 3.5 extends to the AUC on all of S_2 (i.e. to the complete E_Y as predicted by the OIP) is at this point open. Considering the proof of Theorem 3.5, additional assumptions on the distribution of deg^-_Y and/or additional assumptions on q and κ_{N,J} seem to be needed. For more details we refer the reader to "Appendix 1".
Notably, the outdegree, on the other hand, is not a suitable candidate for a graph-based predictor in the context of Assumption 2.3: consider any unknown relationship [k, ℓ] ∈ S_2; since E_X is formed by complete rows of E, we have no observations on the outgoing edge labels of v_k to help us estimate deg^+(v_k).
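To make the OIP of this subsection concrete, here is a minimal Python sketch (the authors provide R code; the function name and the normalization by |I|, chosen so scores lie in [0, 1], are our illustrative choices, any strictly increasing transform of the observed indegree being equivalent for ROC purposes):

```python
import numpy as np

def oip(E_X, I, p):
    """Observed indegree predictor (a sketch of Definition 3.4).

    E_X is the partial observation: one row per intervened node, in the
    order of I, with E_X[i, l] = 1 iff intervening on node I[i] changed
    node l. Returns a dict scoring every unobserved relationship [k, l]
    in S_2 by the observed indegree of the target node l.
    """
    I = list(I)
    E_X = np.asarray(E_X)
    deg_in_obs = np.zeros(p)
    for i, k in enumerate(I):          # count observed incoming edges per target
        for l in range(p):
            if l != k and E_X[i, l] == 1:
                deg_in_obs[l] += 1
    return {(k, l): deg_in_obs[l] / len(I)
            for k in range(p) if k not in I
            for l in range(p) if l != k}
```

Note that the score of [k, ℓ] depends only on the target ℓ, reflecting that E_X carries indegree but no outdegree information about the unobserved source nodes.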

Transitivity assuming predictor
In this section we introduce a second way to construct a graph-based predictor, by assuming that the graph satisfies some property amounting to non-trivial constraint(s) on its edge matrix, such that E_X carries information on E_Y. Moreover, a special case of such a graph-based predictor, based on transitive closedness, will be derived.
GBPs based on a graph property Let the graph G in Assumption 2.3 satisfy some constraint(s) denoted by (C), such that the partial observation E_X carries information on E_Y. We then construct a graph-based predictor via the matrix of expected values of the existence of an edge, given a random draw from all graphs that satisfy (C) and are consistent with E_X. Examples of (C) might include
• the graph being transitively closed;
• the graph being a k-reachability graph;
• the nodes of the graph having an upper/lower bound on their in- and/or outdegrees.
Definition 3.7 Let Ẽ_X be a partial observation of a causal graph G̃ = (Ṽ, Ẽ). Suppose G̃ satisfies constraint(s) denoted by (C). Then a graph-based predictor based on a graph property (d-GP) is defined in (3.6), assigning to each unobserved relationship the fraction of graphs that satisfy (C) and are consistent with Ẽ_X in which the corresponding edge is present. We have at once the following remark.
Remark 3.8 Let G̃ = (Ṽ, Ẽ) be drawn uniformly from the set of all graphs satisfying (C); then the d-GP of (3.6) coincides with the conditional probability of each unobserved edge given the partial observation Ẽ_X.

In general the d-GP can be very hard to compute or even to simulate. For a feasible example consider (C) = (G is undirected (i.e. E is symmetric) and features degree sequence d ∈ R^p) of prescribed edge degrees. In this case there exists a broad literature on how to draw (asymptotically) uniformly at random from the set {G : G satisfies (C)} (see e.g. Artzy-Randrup and Stone 2005; Newman 2003; Blitzstein and Diaconis 2011; Milo et al. 2003; Greenhill 2014), allowing in the worst case for Monte Carlo rejection sampling of (3.6) and in the best case for direct sampling via a suitable adaptation of the Maslov-Sneppen MCMC algorithm. Unfortunately, similar strategies are not known, to the best of the authors' knowledge, for drawing uniformly at random from the set of all transitively closed graphs, not to mention the denominator set of (3.6) with (C) = (G is transitively closed). However, as elaborated in the introduction, the case of transitively closed graphs is of particular interest in the context of omics readouts after gene perturbation experiments, due to the fact that in conventional designs for such experiments direct causal relationships are in general not easily distinguished from ancestral relationships. Thus, to the end of obtaining an easier to compute/simulate GBP, we construct an indirect version of (3.6), described in Eq. (3.8) below.

Definition 3.9 Let Ẽ_X be a partial observation of a causal graph G̃. Let G̃ satisfy constraint(s) denoted by (C) and let φ be a surjective mapping from the space of all graphs to the space of all graphs satisfying (C). A graph-based predictor based on a graph property (indirect version) is defined in (3.8) by averaging the edge indicators of φ(G_0) over random graphs G_0 for which φ(G_0) is consistent with Ẽ_X. The special case of (3.8) considered in the remainder of this section is (C) = (G̃ is transitively closed), for which we use φ(G̃) = G̃^+. Moreover, also for the indirect version we can make a remark in the spirit of Remark 3.8.
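Returning to the undirected degree-sequence example above: the Maslov-Sneppen MCMC mentioned there operates by repeated double edge swaps. A minimal Python sketch of this rewiring move (ours, not the paper's implementation): pick two edges (a, b) and (c, d), replace them by (a, d) and (c, b), and reject any proposal that would create a self-loop or a multi-edge, so every node's degree is preserved at each step.

```python
import random

def double_edge_swap(edges, n_swaps, seed=0):
    """Maslov-Sneppen-style degree-preserving rewiring of an undirected
    simple graph. Each accepted swap replaces edges (a, b), (c, d) by
    (a, d), (c, b); proposals creating a self-loop or a multi-edge are
    rejected, so the degree sequence is invariant throughout."""
    rng = random.Random(seed)
    E = {tuple(sorted(e)) for e in edges}
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(sorted(E), 2)
        if rng.random() < 0.5:  # also propose the alternative pairing
            c, d = d, c
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or e1 in E or e2 in E:
            continue  # reject: self-loop or multi-edge
        E -= {tuple(sorted((a, b))), tuple(sorted((c, d)))}
        E |= {e1, e2}
    return E

def degree_sequence(E):
    d = {}
    for a, b in E:
        d[a] = d.get(a, 0) + 1
        d[b] = d.get(b, 0) + 1
    return d
```

Run long enough, chains of such swaps mix towards the uniform distribution on graphs with the prescribed degree sequence, which is exactly what sampling from (C) requires in this example.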
Remark 3.10 Let G̃ = (Ṽ, Ẽ) be drawn uniformly from the set of all graphs and let φ be as in Definition 3.9, mapping into the set of all graphs satisfying (C); then (3.9) holds.

Transitivity assuming predictors As an instance of a graph-based predictor arising from a graph property, we consider in this section predicting E_Y of ancestral causal graphs. As mentioned earlier, this is motivated by the nature of omics readouts after intervention, since in such experiments what is seen is the total causal effect of perturbing gene A on gene B, potentially via mediators, rather than a necessarily direct causal effect.
Assumption 3.11 Given the set-up of Assumption 2.3, the causal graph G is ancestral.
Following Assumption 3.11, as a special case of (3.8), we define the following graph-based predictor.

Definition 3.12
Let E_X be a partial observation of an ancestral causal graph G = (V, E), with S_2 being the indices of the unobserved causal relationships of G. Define by X the set of all edge matrices E_0 whose transitive closure E_0^+ coincides with E on the index set S_1, i.e. the set of all edge matrices that are consistent with the partial observation E_X. We define in (3.10) the transitivity assuming predictor (TAP) by averaging, over all E_0 ∈ X, the entries E_0^+[k, ℓ] for [k, ℓ] ∈ S_2.
In contrast to the OIP, for which computation is straightforward, computing/simulating the TAP is non-trivial. In a non-trivial scenario, i.e. S_1, S_2 ≠ ∅, the set X is determined by constraints on the (p − 1)-th power of E_0. Concretely, two types of constraints surface; in detail, we have E_0 ∈ X if and only if Eqs. (3.11) and (3.12) below are both satisfied.
A closed form for (3.10) can, to the best of the authors' knowledge, only be given for those entries [k, ℓ] ∈ S_2 which feature TAP(E_X)[k, ℓ] = 0, as they are induced by the constraint (3.11), as Lemma 3.13 below implies; the proof of the lemma is given in "Appendix 1".

Algorithm 1 TAP - Monte-Carlo Rejection Sampling
Input: E_X partial observation of G, T ∈ N number of required successful draws, q ∈ [0, 1] edge probability for drawing the partial RDG.
1. Compute the set of impossible edges K ⊂ S_1 ∪ S_2 using the characterization in Lemma 3.13.
2. Set τ = 0.
3. while τ < T do
3.A Let G be an edgeless graph with p vertices. Draw E_0 as RDG_{q,K}(G), using Definition 2.2.
if E_0 ∈ X then
3.B update, for all [k, ℓ] ∈ S_2, the running average of E_0^+[k, ℓ] and increment τ, yielding after T accepted draws the Monte Carlo estimate of the TAP with parameters (T, q).

Lemma 3.13 Given E_X a partial observation of an ancestral causal graph G = (V, E), let TAP be the predictor defined in (3.10). Then TAP(E_X)[k, ℓ] = 0 if and only if the right hand side of (3.13) is satisfied, where A_v denotes the set of known parents of v ∈ V given in E_X. We call edges satisfying the right hand side of (3.13) impossible edges.
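One direction of the impossible-edge characterization can be read off directly from transitivity (our paraphrase of the idea behind Lemma 3.13, stated there as an iff in terms of the known parent sets A_v): if some intervened node i ∈ I is a known parent of k while E_X records no effect of i on ℓ, then the edge k → ℓ cannot exist, since i → k and k → ℓ would force the missing ancestral edge i → ℓ. A minimal Python sketch of this rule:

```python
def impossible_edges(E_X_rows, I, p):
    """Edges in S_2 ruled out by transitivity (one direction of the
    characterization behind Lemma 3.13, in our paraphrase).

    E_X_rows maps each intervened node i in I to its observed row, with
    E_X_rows[i][l] = 1 iff intervening on i changed l. If some i in I is
    a known parent of k (E_X_rows[i][k] == 1) while E_X_rows[i][l] == 0,
    then k -> l cannot exist in any transitively closed graph consistent
    with E_X: i -> k and k -> l would force i -> l.
    """
    out = set()
    for k in range(p):
        if k in I:
            continue
        parents_k = {i for i in I if i != k and E_X_rows[i][k] == 1}
        for l in range(p):
            if l == k:
                continue
            if any(i != l and E_X_rows[i][l] == 0 for i in parents_k):
                out.add((k, l))
    return out
```

For instance, with I = {0} and observed row (1, 1, 0, 1) for node 0 (node 0 affects nodes 1 and 3 but not node 2), the edges 1 → 2 and 3 → 2 are impossible.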
To compute TAP(E_X)[k, ℓ] beyond impossible edges, we are left with brute-force calculation with unfavourable computational complexity, such that already for p ≈ 10 calculations may be intractable. In the remainder of the section we propose simulation strategies for the TAP and variants thereof, which are computationally less expensive.
Rejection sampling and choice of q. Algorithm 1, given above, simulates for q = 0.5 the TAP defined in (3.10) by straightforward Monte Carlo rejection sampling, with edge probability 0 for impossible edges, cf. Lemma 3.13. In general, it sets impossible edges to zero, draws the remaining edge matrix entries as a partial RDG with edge probability q ∈ (0, 1) (see Definition 2.2), and rejects the drawn edge matrix E_0 if E_0 ∉ X. This procedure is repeated until a fixed number T ∈ N of non-discarded graphs has been drawn. By construction, the so-obtained TAP^(T,0.5) is a consistent estimator of the TAP.
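A minimal sketch of the rejection sampler in the spirit of Algorithm 1 (our own helper names; the RDG_{q,K} draw of Definition 2.2 is approximated here by independent Bernoulli(q) edges outside the impossible set):

```python
import random

def transitive_closure(E):
    """Warshall-style boolean transitive closure."""
    p = len(E)
    C = [row[:] for row in E]
    for j in range(p):
        for i in range(p):
            if C[i][j]:
                for k in range(p):
                    if C[j][k]:
                        C[i][k] = 1
    return C

def tap_rejection(p, S1, S2, E_obs, impossible, T=100, q=0.5, seed=0):
    """Monte Carlo rejection sampling: draw edge matrices with edge
    probability q (impossible edges fixed to 0), keep those whose closure
    matches E_obs on S1, and average the closure entries on S2 over the
    T accepted draws."""
    rng = random.Random(seed)
    counts = {e: 0 for e in S2}
    accepted = 0
    while accepted < T:
        E0 = [[0] * p for _ in range(p)]
        for i in range(p):
            for j in range(p):
                if i != j and (i, j) not in impossible and rng.random() < q:
                    E0[i][j] = 1
        C = transitive_closure(E0)
        if all(C[k][l] == E_obs[k][l] for (k, l) in S1):  # E0 in X?
            accepted += 1
            for (k, l) in S2:
                counts[(k, l)] += C[k][l]
    return {e: c / T for e, c in counts.items()}
```

For small acceptance probabilities the loop illustrates directly why T must grow with p: most draws are discarded once the closure of a moderately dense RDG becomes complete.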
The rationale for introducing the parameter q in Algorithm 1 is as follows. Since the probability that an RDG features the complete graph as its transitive closure goes to 1 as p → ∞ (see Karp 1990; Krivelevich and Sudakov 2013), we have to scale the parameter T with p for sufficient convergence, increasing the computational costs. Meanwhile, letting q → 0 as p → ∞ reduces the convergence time of Algorithm 1, as we will see in Fig. 12 in "Appendix 1" (in particular with regard to Algorithm 2 further below), where q is chosen with respect to the sparsity of the observed graph. The caveat of choosing q ≠ 0.5 is that TAP^(T,q) is then in general not equal to the TAP, i.e. TAP^(T,q) is for q ≠ 0.5 not a consistent estimator of the TAP, as shown in Lemma 3.22 in Sect. 3.4.

Algorithm 2 B-TAP - Biased Sampling from X
Input: partial observation E_X of G; T ∈ N, the number of draws; q ∈ [0, 1], the edge probability for drawing the partial RDG.
1. Compute the set of impossible edges K ⊂ S_1 ∪ S_2 using the characterization in Lemma 3.13.
3. Let G = (V, E) be given by |V| = p and (3.13).
Biased sampling from X. Even with q selected smaller and smaller as the graph size p grows, the computational costs of Algorithm 1 still grow with p → ∞, since the rejection sampler draws an ever-growing number of discarded edge matrices; see Fig. 12 in "Appendix 1". To reduce the computational costs further, consider Algorithm 2, which avoids rejections altogether. In addition to excluding impossible edges, Algorithm 2 includes a step that draws spanning trees to ensure the inequality constraints of (3.12) are met, pasting them into the partial RDGs drawn in Step 3.A. To this end we introduce the Broder algorithm below.
Definition 3.14 (Broder 1989) Let G = (V, E) be an undirected graph, i.e. a graph as introduced in Sect. 2.2 with symmetric E, and assume G to be connected. Draw a random spanning tree (RST) rooted in v_1 by simulating a random walk x_1, x_2, x_3, . . . , x_T on G with x_1 = v_1 and stopping time T ∈ N such that every vertex is visited at least once.
Denote for each vertex v_k ≠ v_1 by t_k the index with x_{min({1 ≤ ℓ ≤ T : x_ℓ = v_k}) − 1} = v_{t_k}, i.e. v_{t_k} is the predecessor of the first visit of the random walk to v_k. Then the RST is given by the set of edges {[t_k, k] : v_k ∈ V \ {v_1}}. In Broder (1989) it is shown that an RST of Definition 3.14 is drawn uniformly at random from the set of all spanning trees of G rooted in v_1. However, for a directed graph G which is not strongly connected, a random walk as in Definition 3.14 could get "stuck" (compare also Anari et al. 2020). Consider in the following a directed graph G = (V, E) featuring a path from v_1 to every other vertex. To draw a computationally feasible spanning tree in G rooted in v_1, we use a modified version of the RST construction and call the so-obtained tree T_mod a modified RST (m-RST). Note that this construction does not guarantee that T_mod is drawn uniformly at random from the set of all spanning trees rooted in v_1. In the case Y = ∅, however, the draw of the modified RST coincides with the draw of an RST. Since, as we will show in Sect. 3.4, Algorithm 2 does not draw uniformly at random from X even if the spanning tree were drawn uniformly at random from all spanning trees, we accept this caveat for the sake of computational simplicity. In particular, the resulting estimator is in general not equal to the TAP for any q ∈ [0, 1]. We call the predictor B-TAP(q) the biased transitivity assuming predictor (B-TAP) with edge probability q ∈ (0, 1). As for Algorithm 1, with growing p we propose to choose q according to the sparsity of the observed graph for feasible run times.
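For a connected undirected graph given as adjacency lists, Broder's random-walk construction can be sketched as follows (an illustration only; the directed modification yielding the m-RST is omitted):

```python
import random

def broder_rst(adj, root=0, seed=0):
    """Random spanning tree via Broder's random walk: walk until every
    vertex has been visited; the tree edge into v is the edge from the
    predecessor of the walk's first visit to v."""
    rng = random.Random(seed)
    n = len(adj)
    visited = {root}
    tree = []                      # edges (predecessor, vertex)
    x = root
    while len(visited) < n:
        nxt = rng.choice(adj[x])   # uniform step of the random walk
        if nxt not in visited:     # first visit: record predecessor edge
            visited.add(nxt)
            tree.append((x, nxt))
        x = nxt
    return tree
```

Every vertex other than the root appears exactly once as the head of a tree edge, so the output always has n − 1 edges and is a spanning tree.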

Extensions
The graph-based predictors defined in (3.2), (3.6) and (3.8) are related, and additional graph-based predictors can be constructed. In the following we exemplify both points.
First, under Assumption 3.11 the graph G has to stem from a quite restrictive subset of all graphs in order not to satisfy Assumption 3.3, as Lemma 3.15 below shows.

Lemma 3.15 Let G with n nodes satisfy Assumption 3.11 but not Assumption 3.3. Then there exist m, K ∈ N with Km = n such that G has K strongly connected components of cardinality m that each form complete subgraphs.
The proof of Lemma 3.15 can be found in "Appendix 1". Due to Lemma 3.15 we can motivate the OIP not only by the type of observations E_X (complete rows) but also by the heuristic of observing ancestral graphs. This leads to the following combination of the TAP and the OIP.
Definition 3.16 Given a partial observation E_X of a causal graph G, let K be the set of impossible edges as given by Lemma 3.13. Define the transitivity-assuming observed indegree predictor (T-OIP) by (3.14). Note that to define the T-OIP, the assumption of transitivity is not needed.
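Under the assumption (ours, consistent with the name) that the OIP scores a candidate edge [k, ℓ] by the in-degree of ℓ observed in the known rows of E_X, the T-OIP amounts to forcing impossible edges to score zero and scoring all other candidates by observed in-degree. A hedged sketch, not a reproduction of the display (3.14):

```python
def t_oip(E_obs, S2, observed_rows, impossible):
    """T-OIP sketch: impossible edges (cf. Lemma 3.13) are forced to
    score 0; every other candidate edge [k, l] is scored by the
    in-degree of l observed in the known rows of E_X."""
    p = len(E_obs)
    indeg = [sum(E_obs[r][l] for r in observed_rows) for l in range(p)]
    return {(k, l): 0 if (k, l) in impossible else indeg[l]
            for (k, l) in S2}
```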
Second, we can extend the definition of the TAPs from ancestral causal graphs to all possible causal graphs. This is particularly important for omics data: first, because the TAPs should be computable even if Assumption 3.11 does not hold, for example when assuming that the causal effect dies out over long causal chains; second, because we need to be able to compute TAPs also in the case of faulty assignments in E_X, e.g. due to measurement errors. To this end we introduce the following relaxed versions.

Non-equivalence of the proposed predictors
Having introduced multiple predictors using closely related heuristics (cf. Lemma 3.15), the question arises whether the respective ROC curves of the predictors are related or even coincide. To this end, we provide a set of counterexamples demonstrating differences in predicted values and, where applicable, differences in induced ROC curves between the predictors. The first example shows that TAP ≠ B-TAP(0.5), and in particular that the random draw from X described in Algorithm 2 is not uniform even if the RSTs are drawn uniformly at random. To compare the marginal distributions on the edges of the graphs drawn from X, introduce θ_TAP(E_X) in (3.17) as the marginal conditional probability of the existence of an edge when drawing E_0 according to Algorithm 1 with q = 0.5, conditioned on E_0 ∈ X. We have at once the following corollary to Lemma 3.13; for a proof see "Appendix 1".

Corollary 3.18 Let E_X be a partial observation of an ancestral causal graph G and let θ_TAP(E_X) be given as in (3.17).

Moreover, the following lemma shows that "not-impossible" edges between nodes without a known common ancestor feature θ_TAP = 1/2.

Lemma 3.19 Let E_X be a partial observation of an ancestral causal graph G and let θ_TAP be as in (3.17). Then, for all not-impossible edges [k, ℓ] ∈ S_2 such that v_k and v_ℓ have no known ancestor in common, θ_TAP(E_X)[k, ℓ] = 1/2. The proof is given in "Appendix 1".

Given the above, we state below the counterexample for TAP ≠ B-TAP(0.5). Note that in the following, for the sake of readability, we augment the image space of the predictors. A detailed computation of the matrices below is given in "Appendix 1"; consider to this end also (b) and (c) of Fig. 2. In this example we observe:

The marginal distribution θ_B-TAP^(0.5)(E_X) of Algorithm 2 and the resulting prediction B-TAP^(0.5)(E_X) are not equal to θ_TAP(E_X) and TAP(E_X), respectively. Hence B-TAP^(T,0.5)(E_X) does not converge to the TAP as T → ∞. To the best of the authors' knowledge, a counterexample with different ROC curves for the TAP and the B-TAP is not known. As a consequence we make the following conjecture.

Conjecture 3.20 Given a partial observation E_X of an ancestral causal graph G, under (possibly quite restrictive) conditions on the descendant sets D_{v_k} for k ∈ I, the ROC curves induced by the TAP and B-TAP(0.5) coincide.

Staying with the above example, we show that T-OIP(E_X) ≠ TAP(E_X) and, even more, that the induced ROC curves might differ.
Example (cont'd) Let G and its partial observation E_X be as before, and compute the corresponding predictions. Analogously to Conjecture 3.20, we conjecture that the T-OIP is a "coarser" predictor than the TAP in the following sense.

Conjecture 3.21 Given a partial observation E_X of an ancestral causal graph G, under (possibly quite restrictive) conditions we have for edges [k, ℓ], [r, s] ∈ S_2 (possibly under some condition) that the ordering of the predicted values under the T-OIP is preserved by the TAP.
Last, Lemma 3.22 below shows that changing q in Algorithm 1 may lead to different ROC curves.

Lemma 3.22 There exist an ancestral causal graph G and a partial observation E_X of G, as well as q_0 ∈ (0, 1) and edges [k, ℓ], [k′, ℓ′] ∈ S_2, such that the predictions for [k, ℓ] and [k′, ℓ′] are ordered differently for q = 0.5 and q = q_0.

The proof can be found in "Appendix 1". Similarly, one can deduce that changing q in Algorithm 2 with RSTs drawn uniformly at random may lead to potentially different ROC curves, leading us to conjecture that a counterexample can also be found for Algorithm 2 with RSTs drawn as m-RSTs; we refer again to "Appendix 1" for details.

Simulation study
In this section we study the use of the graph-based predictors as baselines, both when the underlying ground-truth graph satisfies Assumption 3.11 and beyond. To this end, we use simulated and real graphs.

On simulated graphs
We simulate graphs of cardinality p as transitive closures of RDGs with edge probability α/p, governed by a sparsity parameter α ∈ (0, 1). Note that the dependence of the edge probability on p is needed in order not to draw only graphs featuring the complete graph as their transitive closure (Krivelevich and Sudakov 2013).
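The simulation just described can be sketched as follows (a minimal illustration with our own helper name):

```python
import random

def simulate_ground_truth(p, alpha=0.7, seed=0):
    """Draw an RDG with edge probability alpha/p and return its transitive
    closure, i.e. a simulated transitively closed ground-truth graph."""
    rng = random.Random(seed)
    E = [[1 if i != j and rng.random() < alpha / p else 0
          for j in range(p)] for i in range(p)]
    for j in range(p):             # Warshall transitive closure
        for i in range(p):
            if E[i][j]:
                for k in range(p):
                    if E[j][k]:
                        E[i][k] = 1
    return E
```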
In Fig. 3, box plots of the AUC performance of the ROC curves over 20 runs are presented for varying graph size p; the parameter α was set to 0.7 and the number of known rows was |I| = p/5. Compared are the predictors TAP (Algorithm 1, (T, q) = (100, 0.5)), TAP-q (Algorithm 1, (100, α/p)), B-TAP (Algorithm 2, (100, 0.5)), B-TAP-q (Algorithm 2, (100, α/p)), the T-OIP and the OIP. The TAP, TAP-q, B-TAP and B-TAP-q are only computed up to p = 25, p = 100, p = 1000 and p = 1000, respectively, due to their exploding computational costs (see "Appendix 1"). For all p and all predictors, the respective performance is on average better than random. While the variability in AUC performance decreases with growing p, the mean performance increases for all but the B-TAP and TAP. For large p, the B-TAP and the TAP suffer from their slow convergence, which is especially visible when compared to the B-TAP-q and TAP-q, respectively. It stands out that the OIP, T-OIP and B-TAP-q have similar performance and substantially outperform the classic random baseline (at 0.5 AUC). Moreover, the OIP and the T-OIP were by a margin the fastest to compute; for a comparison of computation times see Fig. 12 in "Appendix 1".
For the influence of α and |I| on the B-TAP, B-TAP-q, OIP and T-OIP performance we refer the reader to Figs. 9 and 10 in "Appendix 1". In summary, the ordering in performance of the methods remains largely unchanged. Furthermore, for some example mean ROC curves corresponding to Fig. 3 we refer to Fig. 8 in "Appendix 1".
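The reported AUCs can be computed directly from predictor scores on the unobserved entries; a minimal rank-based sketch (equal to the area under the empirical ROC curve, with score ties counted as one half):

```python
def auc(scores, labels):
    """Probability that a randomly chosen true edge is scored above a
    randomly chosen non-edge, counting score ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((s > t) + 0.5 * (s == t) for s in pos for t in neg)
    return wins / (len(pos) * len(neg))
```

A constant predictor, such as the classic "random" baseline, obtains an AUC of exactly 0.5 under this convention.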
In Fig. 4 we present the AUC performance of the B-TAP, B-TAP-q, OIP and T-OIP in the case that the ground-truth graph is a k-reachability graph and thus violates Assumption 3.11 to varying extents. In particular, k = 1 recovers the graph itself and k ≥ p − 1 yields the transitive closure G^+. For Fig. 4 we drew an RDG with edge probability 0.7/p and graph size p = 1000 and computed the respective k-reachability graph. For each k, the number of known rows was set to |I| = 200. We observe that already for k = 25 the AUC performance was comparable to that on the transitively closed graph (k = 1000). Meanwhile, performance did not decrease drastically for k = 2, 5, and prediction performance for all predictors remains better than random. One reason might be that Assumption 3.3 continues to hold even if Assumption 3.11 is violated. Additionally, when drawing k-reachability graphs in this way, the probabilities of existence of incoming edges at a particular node are positively correlated, relating to our findings in Theorem 3.5. Note that the T-OIP loses its advantage over the OIP from incorporating the impossible edges the more Assumption 3.11 is violated. As in Fig. 3, we see that the B-TAP performs significantly worse than the B-TAP-q due to its slower convergence with respect to T. Last, for k = 1 we see performance of all predictors around random, which could be expected, as for randomly drawn graphs the expected indegree of each node is equal, possibly violating Assumption 3.3.

On graphs derived from "omics"-data
In the following we test the new predictors on real yeast gene expression data from Kemmeren et al. (2014) (used for CSL by Meinshausen et al. 2016) and on proteomics data from Sachs et al. (2005) (used for CSL by Wang et al. 2017). The baselines proposed in this paper are compared with the PC and IDA algorithms (Spirtes et al. 2000; Maathuis et al. 2009, respectively), the MCMC-Mallow approach of Rau et al. (2013), the GIES algorithm (Hauser and Bühlmann 2012), using the R package pcalg (Kalisch et al. 2012), and the IGSP algorithm of Wang et al. (2017).
As the backgrounds of the approaches vary, let us make some remarks on their usage in this study:
• For PC, GIES and IGSP the output is an estimated graph (rather than a matrix of scores); as such, only points on the ROC plane (one for each run) are depicted in Fig. 7 and comparison via the AUC is not possible.
• The PC and IDA algorithms consider any measurement as observational, as they are not designed to deal with interventional measurements. For a fair comparison, we report their performance when only the available observational measurements (i.e. X_2) are passed to the algorithms (denoted by (obs)) and when all available measurements (i.e. X) are passed (denoted by (int-obs)). Note that even when interventional measurements are passed, they are treated by PC and IDA as observational.
• Tuning parameters were left at the defaults of Kalisch et al. (2012) for PC, IDA and GIES; for IGSP, α_IGSP was set to 0.2, as it was among the best performing α's in the corresponding experiment in Wang et al. (2017); and the MCMC-Mallow algorithm was used with constants set as in the accompanying R code of Rau et al. (2013).

Transcriptomics data (Kemmeren et al.)
The data consist of gene expression readouts from 262 observational experiments (i.e. with no intervention) and 1479 interventional experiments (each intervention on a single gene, specifically knock-outs, each targeting a different gene); 6170 genes are measured in total (including the 1479 intervened-upon genes). We consider in this evaluation the "square" graph using only the readouts of the 1479 genes that have been intervened upon. Denote by X_1 ∈ R^{p̃×p} the available interventional measurements and by X_2 ∈ R^{N_1×p} the available observational measurements; denote furthermore by Y_1 ∈ R^{(p−p̃)×p} and Y_2 ∈ R^{N_2×p} their unavailable counterparts, cf. Assumption 2.3, and assume (if necessary via reordering) that row k and column k correspond to gene v_k. The partial observation E_X is then constructed by the following gold-standard rule: E_X[k, ℓ] = 1 if and only if |X_1[k, ℓ] − Med(X_2[·, ℓ])| > Z · IQR(X_2[·, ℓ]), where X_1[k, ℓ] is the readout of gene v_ℓ after the intervention on v_k, Med(·) assigns to a vector its median, Z > 0 is the gold-standard threshold and IQR(·) assigns to a vector its interquartile distance; i.e. there exists an edge from A to B if and only if the readout of B under intervention on A has an absolute z-score higher than Z with respect to the empirical distribution of readouts of B under no intervention. The unobserved causal relationships E_Y are constructed analogously via Y_1 and Y_2 with the same gold-standard threshold Z. Given a graph size p, the following protocol was used to obtain available and unavailable data:
1. Pick p of the 1479 genes at random and discard the rest.
2. Pick p̃ = p/5 rows of the interventional readouts at random; these constitute X_1. The remaining rows constitute Y_1.
3. Pick N_1 = 131 (half) of the rows of the observational readouts at random; these constitute X_2. The remaining rows constitute Y_2.
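The gold-standard rule can be sketched per entry as follows (illustrative; the paper does not fix a quantile convention for the IQR, so the simple one below is our assumption):

```python
from statistics import median

def iqr(xs):
    """Interquartile distance under a simple quantile convention (an
    assumption made for illustration)."""
    s = sorted(xs)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]

def gold_standard_edge(readout, obs_column, Z):
    """E_X[k, l] = 1 iff the readout of gene l under intervention on gene k
    has absolute robust z-score (median/IQR of the observational readouts
    of l) above the threshold Z."""
    return abs(readout - median(obs_column)) > Z * iqr(obs_column)
```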
For Figs. 5 and 6 the above protocol was repeated 10 times; T was set to 100 for the TAP, TAP-q, B-TAP and B-TAP-q, and q of TAP-q and B-TAP-q was set to the sparsity of the provided partial observation. While in Fig. 5 the graph size p varies and Z is set to 5, in Fig. 6 p is set to 500 and Z varies. Due to computational demands it was not feasible to apply all methods for all p. Figures 5 and 6 show that the proposed graph-based predictors clearly outperform the classical random baseline on the given data set. Moreover, they outperform IDA and MCMC-Mallow (where the latter were computed). The ordering of performance generally holds also for varying Z; in particular, the OIP consistently outperforms IDA. Interestingly, for large Z, corresponding to considering only "large" effects, the differences in performance between the OIP and the TAPs seem to diminish slightly, while as Z decreases only the OIP achieves a performance clearly better than random. This suggests that Assumption 3.11 may hold in practice particularly when considering larger effects.
In (a) and (b) of Fig. 7, close-ups of the mean ROC curves for p = 250 and p = 500 are displayed. For methods producing an estimated graph, results are shown as points on the ROC plane. For both PC and GIES we observe a performance slightly above random, which is outperformed by the OIPs and the TAPs. Moreover, on closer inspection, the ascent of the OIPs and TAPs is particularly steep at the start of the ROC curves in the bottom-left corner, a region often considered important when CSL methods are used for hypothesis generation (see e.g. Colombo et al. 2012; Meinshausen et al. 2016).

Proteomics data (Sachs et al.)
The data consist of protein measurements from 992 observational experiments (i.e. with no interventions) and in total 13435 interventional experiments, each targeting a single protein, spread over 8 target proteins (the number of interventional measurements per target protein varies between 301 and 3602). In total 24 proteins are measured (among them the 8 targeted in the interventions).
As the sample size for the interventional experiments is far larger than for the data from Kemmeren et al., the two-sided Wilcoxon rank-sum test is used to construct the ground truth, as done in Wang et al. (2017). In detail, given available observational measurements X_2 and available interventional measurements X_{1,k}, with k corresponding to the intervention target, i.e. X_1 = (X_{1,1}^T · · · X_{1,m̃}^T)^T (for some 1 ≤ m̃ ≤ 7), we say that there is an edge from protein k to protein ℓ, i.e. E_X[k, ℓ] = 1, if the two-sided Wilcoxon rank-sum test rejects (at significance level 0.05) the null hypothesis that the samples X_2[·, ℓ] and X_{1,k}[·, ℓ] stem from the same distribution. Via the same gold-standard rule, E_Y is constructed from Y_1 and Y_2. We followed the protocol below:
1. Pick m̃ = 4 = 8/2 interventional targets at random; all of their interventional measurements combined constitute X_1. The remaining measurements, namely those targeting one of the other four interventional targets, constitute Y_1.
2. Pick 496 = 992/2 rows of the observational measurements at random; these constitute X_2. The remaining rows constitute Y_2.
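The edge rule can be sketched with a self-contained two-sided rank-sum test using the normal approximation (midranks for ties, no tie correction; in practice one would use a library implementation such as `scipy.stats.ranksums`):

```python
from math import sqrt
from statistics import NormalDist

def ranksum_edge(obs, interv, alpha=0.05):
    """Declare an edge iff the two-sided Wilcoxon rank-sum test (normal
    approximation) rejects, at level alpha, the null that obs and interv
    stem from the same distribution."""
    combined = sorted((v, src) for src, xs in ((0, obs), (1, interv))
                      for v in xs)
    n = len(combined)
    rank_of = [0.0] * n
    i = 0
    while i < n:                   # assign midranks to tied values
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        mid = (i + 1 + j) / 2      # average of ranks i+1 .. j
        for t in range(i, j):
            rank_of[t] = mid
        i = j
    w = sum(r for r, (v, src) in zip(rank_of, combined) if src == 1)
    n1, n2 = len(interv), len(obs)
    mu = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha
```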
In (c) and (d) of Fig. 7, the mean ROC curves over 10 runs of the protocol are compared. Again, for methods producing an estimated graph, results are shown as points on the ROC plane. Even on this graph with a small number of nodes and with only |I| = 4, we observe better-than-random performance of the GBPs; in particular, the variants of the OIP and the TAP even outperform IDA and perform comparably to or slightly better than the MCMC-Mallow approach; compare also the AUC comparison in Fig. 11 in "Appendix 1". Moreover, we see that CSL methods outputting an estimated graph in fact lie above the mean OIP ROC curve only in a minority of runs (PC (int-obs): 2-3/10; IGSP: 1-2/10), or never, as is the case for PC (obs) and GIES.
Furthermore, in Fig. 12 of "Appendix 1" the computational costs for Figs. 5 and 7 are reported. In particular, the OIPs have very low computation times, while MCMC-Mallow and IGSP take considerably longer to compute.

Discussion
In this paper we have argued for new baselines for evaluating causal structure learning methods on interventional data, as a complement to random baselines that in some settings may represent a "low bar". The inclusion of interventional measurements carries information not only on the edges of the causal graph corresponding to the available interventional measurements, but also, to some extent, on the remaining edges in the graph. This is why, in settings where such data are available, simple heuristics that account for the available information can provide improved baselines. For these settings we introduced three general graph-based predictors, cf. (3.2), (3.6) and (3.8). Motivated by large-scale systems biology experiments, we went on to consider special cases of (3.2) and (3.8), the observed indegree predictor (OIP) and the transitivity assuming predictor (TAP), and extensions thereof. We showed that the OIP performs better than the random baseline under quite general conditions, and we established theoretical differences between the introduced predictors. The potential of the OIPs and TAPs as more challenging baselines was demonstrated in a simulation study as well as on real data. In fact, on real data the newly defined baselines can outperform standard CSL methods (with default tuning parameter values), although it should be emphasized that in the particular application studied the assumptions underpinning some of the methods may not hold, and furthermore in some examples we had to apply the methods in ways that deviate from their intended use.
In the future, new graph-based predictors could be defined for specific use cases. Moreover, an evaluation of the baselines' performance on further metrics beyond the ROC might be desirable. Given its general nature, this paper focussed on ROC curves and their accompanying AUCs. As GBPs estimate only the graph structure and not underlying distributions, recently proposed evaluations of CSL methods that take into account estimated distributions of the measurements X cannot be applied (O'Donnell et al. 2021). However, for particular use cases, evaluation on a more specific metric and/or forcing the GBPs to predict binary graphs (as PC, IGSP and GIES do), for example via cross-validation, might be insightful. This is particularly true for the OIP, as it performed best on the real data in Sect. 4.2.
Regarding the computation of the TAP, it remains to be seen whether for large p one can devise a feasible, consistent simulation procedure, or whether resorting to the B-TAP or a modified q remains necessary. Moreover, it would be of interest to study whether the resulting ROC curves of the TAP, B-TAP and OIP can in general be related as conjectured in Sect. 3.4.
Acknowledgements We would like to thank Bernd Taschler of the University of Oxford for feedback and help in data processing and in testing the R code. Moreover, we thank the anonymous reviewers and the associate editor for their help in improving this contribution. This work was supported by the Bundesministerium für Bildung und Forschung (BMBF) project "MechML". S. Bhamidi was supported in part by the National Science Foundation (NSF) grants DMS-2113662 and DMS-2134107.
Author contributions All authors contributed to the conceptualization and methodology of the paper. R.R. conducted the formal analysis including the derivation of the proofs, wrote the software and prepared the figures. Proofs and statements of Theorem 3.4, Lemma A.1 and Lemma A.3 were derived equally by R.R. and S.B. R.R. and S.M. wrote the main manuscript text with input from S.B. All authors reviewed the manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: Proofs
Appendix A.1: Proof of Theorem 3.5

To prove Theorem 3.5 we need the following preliminary result.

Lemma 1 Let G = (V , E) be such that E is drawn at random with marginal probabilities
where q ∈ (0, 1), with E[k, ℓ] and E[k′, ℓ′] drawn independently for all k, k′ and all ℓ ≠ ℓ′, and with a covariance structure as stated, for all ℓ and any pairwise distinct k, k′, k̃_1, . . . , k̃_J ∈ {1, 2, . . . , p}. Fix ℓ ≠ ℓ̃, disjoint sets Q_1, Q_2 ⊂ {1, 2, . . . , p} and disjoint sets Q̃_1, Q̃_2 ⊂ {1, 2, . . . , p} satisfying the stated conditions. Then the first inequality holds if and only if the second does.

Proof First note that by construction, (A1) holds for x, x̃ ∈ {0, 1}^{|Q_1|} and y, ỹ ∈ {0, 1}^{|Q_2|}. Second, let ℓ and Q ⊂ {1, . . . , p} be arbitrary with ℓ ∉ Q, and let m = (m_j) ∈ {0, 1}^{|Q|} be a vector with Σ_j m_j ≥ 1. Suppose furthermore, without loss of generality, that 1 ∈ Q and m_1 = 1. We have for ℓ ≠ k ∉ Q the identity (A2), where m̃ is the vector m without its first entry. For ease of notation, denote the corresponding events as above. Then, using (A1), we obtain (A3). By the symmetry of the covariance structure yielding (A3), we conclude that for any k, k′, ℓ with k ≠ ℓ and k′ ≠ ℓ and subsets Q, Q̃ with |Q| = |Q̃|, (A6) holds if and only if (A7) holds. In particular, by (A1), if equality holds in (A6), equality holds in (A7), yielding the "if and only if" part of the statement. Moreover, we obtain analogously the opposite statement: for any k, k′, ℓ with k ≠ ℓ and k′ ≠ ℓ and subsets Q, Q̃ with |Q| = |Q̃|, (A8) holds if and only if (A9) holds, with equality holding either in both or in neither of the equations, by virtue of (A3). Let us in the following assume, without loss of generality by the symmetry of the covariance structure and the independence between columns, that Q_x = Q̃_x for x = 1, 2.
The claim of Lemma 1 can now be proven via induction on the size of the set Q_1, while keeping Q_2 fixed. To this end, let m, m̃, Q_1 and Q_2 be as in the assumption (including ℓ ∉ Q_1, Q_2). We initialize the induction with |Q_1| = |{k}| = 1 and define the events as above. In the case ||m||_1 > ||m̃||_1, we have by (A6) and (A8) that the claimed inequality holds. In the same way, if ||m||_1 = ||m̃||_1, we obtain equality by using that equality in (A7) yields equality in (A6), establishing the base case of the induction. It remains to show the induction step. Let the claim be shown for |Q_1| = N ∈ N and consider now |Q_1| = |{k_1, . . . , k_{N+1}}| = N + 1 and ||m||_1 > ||m̃||_1. For ease of notation, define for x, y = 0, 1 the events given above; we then have (A10). Using the induction assumption, we immediately obtain (A11). Moreover, by the induction assumption, as ||m||_1 + 1 > ||m̃||_1, we obtain (A12), and since ||m||_1 ≥ ||m̃||_1 + 1, we have (A13). Next we show (A14) in order to drop the corresponding term. For its first part we use the induction assumption; for the second term we use the symmetry of the covariance structure, yielding (A3), and the independence between the columns of the edge matrix, yielding (A15). Using (A14) and (A15) and plugging (A11), (A12) and (A13) into (A10), we obtain (A16), showing the induction step for ||m||_1 > ||m̃||_1. This leaves the case ||m||_1 = ||m̃||_1. First, by the induction assumption, we have equality in both displays of (A11). Second, by (A3) and the symmetry of the construction, the remaining terms coincide. Putting both observations together, we have equality in (A16) for ||m||_1 = ||m̃||_1, finishing the proof.
Proof of Theorem 3.5 Let us start by stating the expected value of the AUC_IC derived from Remark 2.5, as in (A17), where Ẽ_{Y,x} = {[k, ℓ] ∈ E_{Y,x} : ℓ ∉ I} for x = 0, 1. Case 1: ℓ = ℓ̃. Case 2: ℓ ≠ ℓ̃: by assumption we have ℓ, ℓ̃ ∉ I. Let us first consider the case that (A20) holds; then for any such pair we obtain the chain of inequalities (A21), where the last inequality holds by Lemma 1. Since the sum of the last two lines in (A21) is 1 by construction, we obtain (A22). Moreover, we have by the last line of (A21) that (A23) holds for k_1, k̃_1 ∉ I with E[k_1, ℓ] = 0 and E[k̃_1, ℓ̃] = 1; recall to this end also Eq. (A3) of Lemma 1. In the case that (A20) does not hold, we have by Lemma 1 equality in (A21) and thus (A26). Plugging Eqs. (A19), (A24) and (A26) into (A17), and since by assumption there exists at least one pair ℓ, ℓ̃ ∉ I such that (A20) holds, we obtain the claim. Last, let us give an example of a graph generation process that falls under Theorem 3.5.
Definition 2 (compare e.g. Hoff et al. 2002; Bollobás et al. 2007) We define a directed latent network model with fixed outgoing and node-dependent incoming propensities (LNM-fix-O) by drawing propensities from some non-degenerate distribution D on (0, 1) (i.e. D is not a Dirac delta distribution) and subsequently drawing G = (V, E) by iid draws as specified. Let us assume for now that D is a discrete random variable and consider the covariance in (A28). By Bayes' theorem, (A29) only depends on J and N and not on the exact configuration of m. Hence the covariance in (A28) depends only on J and N and, by construction, not on k, k′, ℓ and the exact configuration of m. It remains to show that (A28) is strictly greater than 0. To this end, consider m̃ = (m, 1) ∈ {0, 1}^{J+1}. We have, by (A29), (A30) and renaming k = k̃_{J+1}, the identity (A31). By the non-degenerate nature of D, we can use the strict form of Jensen's inequality to conclude from (A31) that (A32) holds; plugging (A32) into (A28) yields the claim.

Since M ∈ X(V, E_X) and A_{v_k} ⊆ A_{v_ℓ}, adding the edge [k, ℓ] will not interfere with the zero-constraints given by (3.11), and as adding edges can never interfere with the constraints in (3.12), we have M + [k, ℓ] ∈ X(V, E_X) and thus φ is well-defined. Moreover, φ is by definition injective.
The map ψ is well-defined, since deleting an edge cannot interfere with the zero-constraints given by E_X. By definition, ψ is the inverse of φ, making φ a bijection; hence the claimed identity follows.

Proof Let G_q be an ancestral causal graph with node set V_s := {v_0, v_1, v_2, v_3, w_0, w_1, . . . , w_s} for s ≥ 2. The set of available interventions is given by I = {v_0, v_3, w_0}, together with the partial observation E_X as specified. To show the claim, it suffices to show that there exist q_0 and s_0 such that γ_0(0.5, s_0) < γ_1(0.5, s_0) and γ_0(q_0, s_0) > γ_1(q_0, s_0).
For any s ≥ 2, q < 1/2 and x, y ∈ V, we compute the quantities in (A35), where X_0 ⊂ X is the subset of graphs with a minimal number of edges. Claim 3: for fixed s ≥ 2 there exists a c < 1/2 such that γ_1(q, s) → c as q → 0. Consider that for any spanning tree on V rooted in w_0 that features a path from w_1 to w_2, we can switch the labels of w_1 and w_2 to obtain a spanning tree without a path from w_1 to w_2. By construction, this assignment is injective, yielding that there are at most as many spanning trees featuring a path from w_1 to w_2 as there are spanning trees without such a path. Furthermore, considering the spanning tree given by E[w_0, w_k] = 1 for all k = 1, . . . , s shows that there exist spanning trees that feature neither a path from w_1 to w_2 nor from w_2 to w_1. Hence, by (A35) and the acyclicity of spanning trees, there exists a c < 1/2 such that γ_1(q, s) → c as q → 0.
To adapt the above proof for Algorithm 2, when the RSTs in Step 4.B are drawn uniformly at random, consider the following:
• The first claim follows analogously.
• Instead of the second claim, it can be shown that γ_0(q, s) → 5/8, by considering that E^+[v_1, v_2] = 1 for 5 of the 8 possible spanning trees rooted in v_0 (note that E[v_3, v_2] = 1 in any case). Note, moreover, that in the case where the RSTs are drawn via the modified Broder algorithm of Sect. 3.2, γ_0(q, s) → c < 1/2.
• The third claim can be shown to hold for Algorithm 2 by analogous arguments. This is true even for the modified version of the Broder algorithm, as it coincides with the classical one on V.