Structural learning and estimation of joint causal effects among network-dependent variables

Bayesian networks in the form of Directed Acyclic Graphs (DAGs) represent an effective tool for modeling and inferring dependence relations among variables, a process known as structural learning. In addition, when equipped with the notion of intervention, a causal DAG model can be adopted to quantify the causal effect on a response due to a hypothetical intervention on some variable. Observational data cannot distinguish between DAGs encoding the same set of conditional independencies (Markov equivalent DAGs), which however can be different from a causal perspective. In addition, because causal effects depend on the underlying network structure, uncertainty around the DAG generating model crucially affects the causal estimation results. We propose a Bayesian methodology which combines structural learning of Gaussian DAG models and inference of causal effects as arising from simultaneous interventions on any given set of variables in the system. Our approach fully accounts for the uncertainty around both the network structure and causal relationships through a joint posterior distribution over DAGs, DAG parameters and then causal effects.

1 Introduction sciences and biology; see for instance Friedman (2004), Yin and Li (2011), Markowetz and Spang (2007) and references therein. A DAG imposes a set of conditional independencies between variables through a DAG-dependent factorization of the joint distribution and provides an effective tool to read-off such relations directly from the graph using graphical criteria.
Real data problems generally suggest that the set of conditional dependencies between variables cannot be postulated in advance, which makes the DAG generating model unknown. In many fields (e.g. genomics) DAG structural learning also represents the very ultimate goal of the analysis since it can reveal dependence relations which may help understanding the behavior of biological pathways; see for instance Friedman and Koller (2003) and Sachs et al. (2005). From a statistical viewpoint this issue can be tackled by adopting a model selection perspective and several methodologies have been proposed accordingly. On the frequentist side, a distinction can be made between score and constraint-based methods. The former implement a score function which is maximised over the model space (Chickering 2002) while the latter usually provide a graph estimate by performing a sequence of conditional independence tests; see for instance Kalish and Buhlmann (2007). Moreover, Bayesian methodologies for DAG structural learning estimate a posterior distribution over the space of graphs which in turn provides a coherent quantification of the uncertainty around the data generating model; see for instance Cooper and Herskovits (1992), Ben-David et al. (2015) and the more recent paper Ni et al. (2017) which presents a unified framework for model selection of both directed and undirected graphs. By contrast, Bayesian methods require prior elicitation for each model-dependent parameter which is subject to constraints imposed by the conditional independencies of the underlying DAG structure; see also Geiger and Heckerman (2002).
In their standard formulation, DAGs provide information on the dependence structure between variables in terms of association. In many contexts however one might be interested in establishing causal relationships, whose precise quantification would require experimental (interventional) data, e.g. produced from randomized controlled experiments; see also Spirtes et al. (2000) and Ellis and Wong (2008). However intervention experiments are not always available since they may be unethical or infeasible. Because causal concepts are relationships that cannot be defined from the (observational) DAG distribution alone (Pearl, 2003) causal inference from non-experimental data requires further assumptions on the DAG generating mechanism. These are encoded in the notions of intervention and postintervention distribution, leading to the definition of causal DAG; see also Sect. 2.2 for a more detailed discussion. Causal DAGs applied to observational data hence provide a quantification of the effect of a hypothetical intervention on a specific variable w.r.t. a response of interest. This can be used for instance to predict the effect of a single gene knockout on some other gene or phenotype of interest; see for instance Maathuis et al. (2009). In turn, causal DAGs provide a tool for the design of experiments, because the collection of gene-causal effects can indicate which of them are likely to have a large effect on the response. In many situations it is impossible to design experiments which act on one specific variable only while keeping all the other fixed. Also, causal effects from simultaneous interventions can significantly deviate from their single-variable counterparts, since the contribution of a given intervened variable in a joint intervention is ''adjusted'' by the effect of knocking out the others; see also Henckel et al. (2019). The following example aims at illustrating it.
Example 1 Consider DAG D in Fig. 1 and the parameters associated to its edges, which can be interpreted as the coefficients of a linear Structural Equation Model (SEM); see also Sect. 3. Table 1 provides a comparison between the causal effect of a given variable w.r.t. response node 1 from single and joint (simultaneous) interventions on set of variables. Each column refers to an intervened variable, while the two rows report the causal effect under a single intervention and the causal effect under a joint intervention on a pair of variables. As an instance, it appears that, while the causal effect of (an intervention on) node 6 is the same under single and joint interventions on f6; 5g (À1:09), this increases up to À0:68 when node 6 is intervened simultaneously with 4. DAGs encoding the same conditional independencies are called Markov equivalent and cannot be distinguished from observational data. However they can be different from a causal perspective; see also Sect. 2. Accordingly, a frequentist approach would estimate first an equivalence class of DAGs using observational data and then a set of DAG-dependent causal effects within the class. This is at the basis of the IDA and joint-IDA methods developed by Maathuis et al. (2009) andNandy et al. (2017). However, results can be highly sensitive to the input (estimated) equivalence class, which also depends on the specific structural learning methodology that is adopted; see also . All of these features suggest the adoption of a unified Bayesian method for DAG structural learning and causal effect estimation which fully accounts for the uncertainty around the underlying network structure. We remark that, within the Guassian setting hereinafter considered, literature on Bayesian causal discovery is, to our knowledge, still narrow, with few recent exceptions such as .
In this work we present a Bayesian methodology which combines DAG structural learning and causal effect estimation for continuous, Gaussian data. As the result, our method returns an approximated posterior distribution over the space of DAGs and a posterior of the causal effects as arising from simultaneous interventions on any given set of variables. The rest of the paper is organized as follows. In Sect. 2 we introduce some theory and notation on graphical models based on DAGs, causal diagrams and causal discovery. In Sect. 3 we introduce Gaussian DAG models in terms of likelihood factorization and derive the post-intervention distribution and allied causal effect in the general case of simultaneous interventions. Section 4 deals with our Bayesian methodology for structural learning and causal discovery and introduces prior on DAGs and Cholesky parameters. We discuss in Sect. 5 computational details leading to an MCMC scheme for posterior inference. We apply our methodology on simulation settings and real data in Sects. 6 and 7 respectively. Finally, Sect. 8 offers a brief discussion together with possible future developments. All codes are written in R (R Core Team 2017) and are available upon request to the authors.

Graphical models, causal diagrams and causal effects
In this section we introduce basic concepts relative to graphical models based on directed acyclic graphs and causal inference. Fur further information on these topics the reader can refer to Pearl (2009) andLauritzen (1996).

Graphical models
We briefly introduce the graph notation hereinafter adopted. Let G ¼ ðV; EÞ be a graph, where V ¼ f1; . . .; qg is a set of nodes (or vertices) and E V Â V a set of edges. In what follows, if ðu; vÞ 2 E and ðv; uÞ 6 2 E, G contains a directed edge u ! v, while if both ðu; vÞ 2 E and ðv; uÞ 2 E, then G contains an undirected edge u À v. A graph is called directed if contains only directed edges. Moreover, a Directed Acyclic Graph (DAG) D is a directed graph which contains no loops, that is sequences of nodes ðu 1 ; u 2 ; . . .; u k Þ with u 1 ¼ u k , such that there exists a path u 1 ! u 2 ! Á Á Á ! u k . Moreover, if ðu; vÞ 2 E we say that u is a parent of v and denote the set of all parents of v in D as pa D ðvÞ. Also, if there exists a directed path from u to v we say that v is a descendant of u and let de D ðuÞ be the set of all descendants of u in D. Hence, the non-descendants of u are all nodes in the set nd D ðuÞ ¼ V n de D ðuÞ.
Let now ðX 1 ; . . .; X q Þ be a random vector. The connection between a graph and probabilistic model f ðx 1 ; . . .; x q Þ for the random vector arises as we associate each variable X j to a node in the graph. The latter introduces a set of conditional independencies among X 1 ; . . .; X q via the so-called Markov property of the graph. As different types of dependence patterns exist, different types of graphs are in general equipped with different Markov properties. A DAG encodes a set of conditional independencies between variables which can be read-off from the DAG using graphical criteria such as d-separation (Pearl, 2009). We then denote with IðDÞ the set of all conditional independencies implied by D. Let D be a DAG, ðX 1 ; . . .; X q Þ a collection of random variables. A distribution f ðx 1 ; . . .; x q Þ is said to be compatible with the DAG D or Markov relative to D if it admits the factorization As many distributions may admit the factorization (1), it is possible to define a family of distributions MðDÞ that are Markov relative to D. For a given f ðx 1 ; . . .; x q Þ f , if we let I(f) be the set of conditional independencies in f, then f 2 MðDÞ if and only if IðDÞ Iðf Þ. Moreover, if IðDÞ ¼ Iðf Þ, then f is said to be faithful to DAG D. This means that the conditional independencies in D are all and only those embodied in the joint distribution f. A further important property of Bayesian networks is Markov equivalence. In particular, two DAGs D 1 and D 2 are Markov equivalent if and only if IðD 1 Þ ¼ IðD 2 Þ. It follows that a given set of conditional independencies can be described by several DAGs which are collected into an equivalence class. The latter can be uniquely represented through a Completed Partially Directed Acyclic graph (CPDAG) (Chickering 2002), also known as Essential Graph (EG) (Andersson et al. 1997) which is obtained as the union (over the edge sets) of all DAGs within the equivalence class.

Structural causal models and causal diagrams
DAGs are not necessarily carriers of causal information and their common extension to probabilistic graphical models, namely Bayesian networks, only allow to make conditional independence statements. Causal concepts are instead relationships that cannot be deduced from the distribution alone (Pearl, 2009) and accordingly require additional assumptions on the generating mechanism. One possibility is to assume that each parent-child relationship in the network represents a stable and autonomous physical mechanism, which means that it is possible to change one relationship without affecting the others. This assumption leads to the construction of Structural Causal Models (SCM) and the corresponding graphical tools, named causal diagrams. See also Pearl (2009) Traditionally, causal concepts are handled in econometrics and social sciences through linear structural causal models, that is SCM in which the relationships between variables are assumed to be linear. In general, an SCM can be represented through a system of relations of the form where pa D ðjÞ is now to be interpreted as the set of variables which directly determine the level of X j . Moreover, U j is an error term and the left-pointing arrow indicates a structural relation (as opposed to algebraic relations); see also Pearl (2009)[Sect. 1.4]. Given a causal model in the form of (2), drawing an arrow from each variable in pa D ðjÞ towards X j results in a DAG D called causal diagram which is called Markovian if it is acyclic and the error terms are jointly independent. It can be proved that every Markovian structural causal model M induces a distribution f which admits the same recursive decomposition (1) that characterizes Bayesian networks. However, causal models in the form (2) are more powerful as the assumptions of stability and autonomous mechanism allow to compute the effect of hypothetical interventions from non-experimental (observational) data.
We now introduce the notion of intervention. A hard (or deterministic) intervention on the set of variables fX j ; j 2 Ig, I V, is denoted by dofX j 1 x j g j2I and is defined as the action of fixing each X j , j 2 I, to some chosen valuex j . A hard intervention modifies the SCM by replacing each equations X j f j ðpa D ðjÞ; U j Þ for j 2 I with a point mass atx j . From a graphical perspective, the effect of a hard intervention can be represented through the so-called intervention DAG. This is obtained from the original DAG D by removing all edges (u, j) such that j 2 I and is denoted by D I ; see also the example in Fig. 2.
A hard intervention dofX j ¼x j g j2I leads to the definition of post-intervention distribution which can be written using the truncated factorization Importantly, the conditional densities in (3) are the same appearing in (1): this means that the post-intervention distribution can be expressed in terms of observational densities. Moreover, Nandy et al. (2017) define the total joint effect of an intervention dofX j ¼x j g j2I on X 1 Y as Fig. 2 Three DAGs, D 1 ; D 2 ; D 3 , and the corresponding intervention DAGs of an intervention on I ¼ 2. Intervention graphs D I 2 ; D I 3 are equal to the corresponding preintervention (observational) DAGs D 2 ; D 3 , while D I 1 differs from D 1 for the missing edge where for each h 2 I is the causal effect on Y associated to variable X h in the joint intervention.

Causal discovery and causal effect estimation
Causal effects can be estimated whenever a causal diagram representing the causal structure of the problem is available. However, often this is not the case and the causal structure must be inferred from the data. Causal discovery methods, that is procedures whose aim is to learn causal DAGs from the data, are traditionally divided into three main classes: constraint-based methods, which estimate equivalence classes of DAGs by testing for conditional independencies between variables; score-based methods, which score DAGs through penalized likelihoods; hybrid methods which combine features of the first two approaches. The PC algorithm (Spirtes et al. 2000) is one of the most popular algorithms for causal discovery. It is a constraint-based method that assumes acyclicity, causal faithfulness and causal sufficiency, where the latter refers to the absence of hidden (latent) variables. The PC algorithm provides an estimate of the CPDAG representing the true causal DAG. Specifically, it first estimates the CPDAG skeleton (that is the undirected graph that would be obtained by removing all the edge orientations from the DAG) and then orients as many edges as possible using various orientation rules; see also Kalish and Buhlmann (2007). For a complete review on causal discovery algorithms the reader can refer to Heinze-Deml et al. (2018).
A slightly different approach has been adopted by Maathuis et al. (2009), who propose a methodology for causal effect estimation from single-node hard interventions in Gaussian models when the DAG is not available. The resulting algorithm is called IDA (Identification when DAG is Absent). In its basic version, IDA first estimates an equivalence class using the PC algorithm (alternatively any other score-based method can be adopted). Next, for each DAG within the input class, the causal effect of X h on Y is computed using multiple linear regression models. This basic version is slightly modified due to computational reasons. In particular, they propose a faster alternative which only returns the distinct causal effects compatible with the input equivalence class, thus avoiding a full enumeration of the DAGs. Their methodology is further extended to joint (simultaneous) hard interventions by Nandy et al. (2017), leading to their joint-IDA method.
As in the case of single interventions, joint-IDA relies on a CPDAG which is estimated up-front, e.g. by using the PC algorithm. Next, three alternative methods for causal estimation from joint interventions are proposed, namely the path method, the recursive regression for causal effects method (RRC) and the modified Cholesky decomposition method (MCD); see the original paper for details.

Causal effects in Gaussian graphical models
In this section we focus on Gaussian DAG models and provide the definition of causal effect under the assumption that the distribution of X 1 ; . . .; X q is jointly normal and Markov w.r.t. a given DAG which is known beforehand. In Sect. 4 we will deal instead under model (DAG) uncertainty.
Let D ¼ ðV; EÞ be a DAG and assume ðX 1 ; . . .; X q Þ j R; D $ N q ð0; RÞ, where R 2 C D , the cone of symmetric-positive-definite covariance matrices Markov w.r.t. D. Accordingly, R reflects the conditional independencies encoded by D.
Because of the normality assumption, Equation (6) can be equivalently written as a linear SEM. To this end, consider the decomposition of R ¼ L À> DL À1 , where L is a (q, q) matrix of coefficients with diagonal elements equal to 1 and D ¼ diagðr 2 1 ; . . .; r 2 q Þ is a diagonal matrix collecting node-conditional variances. From this re-parameterization, the constraints imposed by D on the model parameters become more apparent since for each (u, v)-element of L, u 6 ¼ v, we have L u;v 6 ¼ 0 if and only if u 2 pa D ðvÞ, that is there is an edge u ! v in D. Hence, for j ¼ 1; . . .; q, where 0 j ¼ pa D ðjÞ Â j and L AÂB denotes the sub-matrix of L with elements belonging to rows and columns indexed by A and B respectively. Therefore, Let now I f2; . . .; qg be an intervention target. The post-intervention distribution of ðX 1 ; . . .; X q Þ given dofX j ¼x j g j2I becomes An important implication of Equation (9) is that where and in particular Equation (10) corresponds to the post-intervention distribution of ðX 1 ; . . .; X q Þ in the Gaussian setting and depends on the covariance matrix R I . Most importantly, the latter can be reconstructed from the observational parameters ðD; LÞ appearing in (7). Finally, the causal effect of X h on X 1 ðh 2 IÞ in a joint intervention on fX j g j2I is given by see also Nandy et al. (2017). It follows that the causal effect h I h;1 is a function of the covariance matrix R which in turn depends on the underlying DAG D. Therefore, inference on DAG D and its parameter R will drive inference of causal effects under model uncertainty; see the next section for details. (8) and the (n, q) data matrix X, row-binding of the x i 's. The likelihood function relative to ðD; L; DÞ can be written as

Bayesian inference of causal effects under model uncertainty
where X A denotes the sub-matrix of X with columns indexed by A V and I n is the (n, n) identity matrix. We now proceed by assigning a prior distribution on DAG D and its Cholesky parameters ðD; LÞ.

Prior on DAG D
For a given DAG D ¼ ðV; EÞ, let S D be the 0-1 adjacency matrix of its skeleton (the underlying undirected graph obtained after removing the orientation of all of its edges), such that for each (u, v)-element in S D , S D u;v ¼ 1 if and only if ðu; vÞ 2 E or ðv; uÞ 2 E, 0 otherwise. Conditionally on a prior probability of inclusion p 2 ð0; 1Þ we then assume S D u;v j p $ iid BerðpÞ for each u [ v. Therefore, where jS D j is the number of edges in D (equivalently in its skeleton) and qðq À 1Þ=2 corresponds to the maximum number of edges in a DAG on q nodes. Finally we set for any D 2 S q , where S q is the space of all DAGs on q nodes. The resulting prior thus depends on the DAG skeleton only and assigns equal prior weights to DAGs having the same number of edges. Hyperparameter p can be tuned to reflect some prior knowledge of sparsity in the DAG space, when this information is available. Moreover, one can adopt distinct hyperparameters p u;v , to include edge-specific prior information on the edge inclusions.

Prior on Cholesky parameters ðD,LÞ
Consider now the DAG-constrained covariance matrix R 2 C D and its modified Cholesky decomposition R ¼ L À> DL À1 ; see also Sect. 3. Moreover, without loss of generality assume a parent ordering of the nodes such that u [ v whenever u is a parent of v. We assign a prior to R through a DAG-Wishart prior on ðD; LÞ with hyperparameter U (a q Â q s.p.d. matrix) and shape hyperparameter a D ¼ Ben-David et al., 2015). The DAG-Wishart distribution has been introduced as a conjugate prior for covariance matrices Markov w.r.t. a DAG D and therefore provides an extension of the standard Wishart distribution which can be adopted for unconstrained covariance matrices (equivalently, complete DAG models); see also Ben-David et al. (2015) for details.
An interesting feature of the DAG-Wishart distribution is that it induces a reparameterization of R in terms of node-parameters ðL 0j ; r 2 j Þ that are a priori independent with distribution r 2 j j D $ I-Ga a D j ; where  Castelletti and Consonni (2020). Also, a standard choice for U, hereinafter adopted, is U ¼ gI q (g [ 0) which reflects a prior belief of (marginal) independence among variables. Hyperparameter g regulates the strength of our prior statement: lower values of g correspond to less informative prior distributions on X. Under this choice, Equation (17) becomes Therefore, a prior on the DAG Cholesky parameters ðD; LÞ is given by Moreover, because of conjugacy of (18) with the likelihood (14), the posterior distribution of ðD; LÞ given X is such that

Computational details
In this section we detail the MCMC scheme that we adopt to sample from the target distribution pðD; L; D j XÞ / f ðX j D; L; DÞ pðD; L j DÞ pðDÞ; and therefore to approximate the posterior distribution of the causal effect in (13) which is a function of the Cholesky parameters ðD; LÞ; see Equations (12) and (13). An efficient sampler can be implemented by resorting to a Partial Analytic Structure (PAS) algorithm (Godsill, 2012) whereas the update of DAG D and model parameters ðD; LÞ is performed in two steps.

Update of DAG D
In the first step, given the current DAG D, a new DAG D Ã is drawn from a suitable proposal distribution qðD Ã j DÞ which is defined as follows. We consider three types of operators that locally modify a DAG: insert a directed edge (InsertD u ! v for short), delete a directed edge (DeleteD u ! v) and reverse a directed edge (ReverseD u ! v). For a given D 2 S q , being S q the set of all DAGs on q nodes, we then construct the set of valid operators O D , that is operators whose resulting graph is a DAG. A DAG D Ã is then called a direct successor of D if it can be reached by applying an operator in O D to D. Therefore, given the current D we propose D Ã by uniformly sampling an element in O D and applying it to D. Since there is a one-toone correspondence between each operator and resulting DAG, the probability of transition is qðD Ã j DÞ ¼ 1=jO D j, for each D Ã direct successor of D.
Consider now two DAGs D and D Ã which differ by one edge ðu; vÞ 2 D, ðu; vÞ 6 2 D Ã and let ðD D ; L D Þ and ðD D Ã ; L D Ã Þ be the corresponding Cholesky parameters. Notice that these differ only with regard to their v-th component ððr D v Þ 2 ; L D 0v Þ and ððr D Ã v Þ 2 ; L D Ã 0v Þ. Moreover, the remaining parameters fðr D r Þ 2 ; L D 0r ; r 6 ¼ vg and fðr D Ã r Þ 2 ; L D Ã 0r ; r 6 ¼ vg are componentwise equivalent between the two graphs because they refer to structurally equivalent conditional models. This feature is crucial for the correct application of the PAS algorithm; see also Wang and Li (2012) Under a PAS algorithm the acceptance probability for a DAG D Ã (defined as above) is given by a D Ã ¼ minf1; r D Ã g where and (similarly under D Ã ). In addition, because of the likelihood and prior factorizations in (14) and (19), it can be shown that (22) reduces to where, because of conjugacy of the Normal-Inverse-Gamma prior with the Normal density, and S pa D ðvÞ , b v are defined as in Sect. 4.2.

Update of Cholesky parameters ðD,LÞ
In the second step we then sample the model-dependent parameters ðD; LÞ conditionally on the accepted DAG from their full conditional distribution in (20). In addition, samples from the Cholesky parameters ðD; LÞ can be used to recover R I and then compute the causal effect for any given target of intervention nodes I f2; . . .; qg as in (13).

Posterior inference
Our MCMC scheme is summarized in Algorithm 1. Its output consists of a collection of DAGs and Cholesky parameters È D ðsÞ ; ðD ðsÞ ; L ðsÞ Þ É S s¼1 sampled from the posterior (21) and a collection of causal effect coefficients È h IðsÞ h;1 É S s¼1 for an input intervention target I and each h 2 I.
Moreover, the posterior probability of a DAG D can be approximated from the MCMC output as which corresponds to the DAG frequency of visits in the chain. Alternatively, approximations of posterior model probabilities can be obtained from re-normalized marginal likelihoods; see also García-Donato and Martínez-Beneito (2013) for a discussion. From the same output, summaries of interest can be also obtained. In particular, for each pair of nodes (u, v) we can compute the (estimated) posterior probability of edge inclusionp where 6 Simulation study

Settings
In this section we evaluate our methodology through simulation experiments. Specifically, we construct different scenarios by varying the sample size n 2 f50; 100; 200; 500g, the number of intervened nodes (size of the target) s 2 f2; 4g and the number of nodes q 2 f10; 20g. Under each simulation scenario defined by n we generate N ¼ 30 datasets, each obtained as follows. We first randomly sample a sparse DAG D by fixing a probability of edge inclusion equal to 0.2. We then generate the corresponding (true) Cholesky parameters ðD; LÞ by drawing the non-zero elements of L in ½À1; À0:1 [ ½0:1; 1 while fixing D ¼ I q . We finally construct the covariance matrix R ¼ L À> DL À1 and generate n multivariate i.i.d. observations, representing an (n, q) dataset X, from the Gaussian DAG-model N q ð0; RÞ. Moreover, for a given s we randomly choose a target I consisting of s nodes randomly sampled from f2; . . .; qg. We then recover the post-intervention parameters L I and R I using (11) and (12); the true set of causal effects fh I h;1 g h2I follows from (13). For each simulated dataset we run S ¼ 15000 iterations of Algorithm 1 to approximate the posterior distribution over DAGs, Cholesky parameters and causal effects.

Graph selection
To assess the accuracy of our method in recovering the underlying causal structure we compare DAG point estimates that can be retrieved from our MCMC output with the corresponding true DAGs. Similarly, we evaluate the performance of the frequentist PC algorithm, the structural learning method underlying the IDA approach of Maathuis et al. (2009). Specifically, with regard to our method, we consider both the Median Probability (DAG) Model (MPM) and the Maximum A Posteriori (MAP) DAG estimates, where the former corresponds to the DAG obtained by including those edges whose posterior probability of inclusion exceeds 0.5, while the latter corresponds to the DAG with highest MCMC frequency of visits. We implement PC algorithm at significance level a 2 f0:01; 0:05; 0:10g. In addition, because PC outputs a CPDAG, starting from each of our DAG estimates (MPM and MAP) we construct the representative CPDAG, that is the CPDAG representing the equivalence class of the estimated DAG. We compare each CPDAG estimate with the CPDAG representing the equivalence class of the true DAG in terms of Structural Hamming Distance (SHD) and Structural Intervention Distance (SID). SHD corresponds to the number of edge insertions, deletions or flips needed to transform the estimated graph into the true graph. SID was instead introduced by Peters and Bühlmann (2015) and is based on a graphical criterion quantifying the closeness between two graphs in terms of the corresponding sets of compatible intervention distributions; see also the original paper for details. Lower values of SHD and SID correspond to better performances. Our results are summarized in the box-plots of Figs are competitive with PC for moderate sample sizes and perform slightly better than PC as n increases both in terms of SHD and SID.

Causal effect estimation
We now evaluate the performance of our method in causal effect estimation. To this end, under each simulation we compute the BMA estimate (27) We also include in our analysis the joint-IDA method of Nandy et al. (2017) (see also Sect. 2.3). In particular, for the graph selection step we implement PC algorithm at significance level a ¼ 0:01 which has also been shown to perform better in sparse settings Kalish and Buhlmann 2007. Joint-IDA recovers for each intervened node h 2 I the set of distinct causal effects compatible with the input CPDAG. This is then summarized through the arithmetic mean which provides an estimate of h I h;1 , h 2 I. The joint-IDA estimate is compared with the true causal effect by computing the absolute-value distance d IDA differently from our Bayesian method, joint-IDA relies on a given (estimated) equivalence class of DAGs. Indeed, causal inference results strongly depend on the input CPDAG estimate and therefore on the accuracy in the graph selection. Anyway, results obtained under different CPDAG estimates (e.g. the PC algorithm for different significance levels) did not lead to improvements in the causal effect estimation. By contrast, our MCMC-based method relies on a posterior distribution over a collection of DAGs some of which, although lying outside the true-DAG equivalence class, might be ''structurally similar'' to the true causal DAG and still result in a causal effect which is close to the true one. This result is also consistent with the behavior observed in the Structural Intervention Distance (Fig. 5). We finally investigate the computational time required by our method. Figure 6 summarizes the behaviour of the running time (averaged over 12 replicates) per iteration, as a function of q 2 f10; . . .; 100g for n ¼ 500 (upper panel), and as a function of n 2 f50; . . .; 1000g for q ¼ 50 (lower panel). We also highlight that the computational burden of our method might increase with the number of variables also because the size of the DAG space grows more than exponentially in q and accordingly a larger number of MCMC iterations are in general required to reach convergence. By converse, the computational burden of IDA is generally lower. However, we remark that our method provides not only a point estimate of DAGs and causal effects, but also an approximation of the whole posterior distribution over the joint space of DAGs and parameters.

Robustness to model misspecification
To evaluate the performance of our method under non-Gaussian data we implement a simulation study where for a given DAG D data are generated under a Structural Equation Model of the form for j 2 f1. . .; qg. With regard to the error term distribution, we consider three different scenarios: where t 6 denotes a scaled t distribution with 6 degrees of freedom. The specific parameter choice in 2-3 guarantees that Varðe j Þ ¼ 1 which is therefore coherent with setting 1. We fix q ¼ 10 and, for each sample size n 2 f50; 100; 200; 500g and scenario 1-2-3, we perform N ¼ 30 independent simulations using the same generating DAGs and SEMs used in our previous simulation studies. We apply our method to evaluate the causal effect on X 1 of a joint intervention on s ¼ 2 randomly selected nodes among f2; . . .; 10g. The resulting estimates are compared with the true causal effects in terms of the absolute-value distance d as in (28). Our results are summarized in the box-plots of Fig. 7. Results compared across scenarios 1-2-3 are quite similar, showing that our method, when applied to causal effect estimation, is somewhat robust with respect to different distributions of the error term in the SEM model.

Real data analysis
In this section we apply our methodology and joint-IDA to the ''Wine quality'' dataset of Cortez et al. (2009); the dataset is publicly available at https:// archive.ics.uci.edu/. In our analysis we include observations of seven continuous variables measuring physicochemical properties of a Portoguese wine called Vinho verde, and a response variable representing a sensory score of the wine quality (ranging in 0-10) given by n ¼ 1593 independent assessors. This dataset has been often used for prediction tasks, i.e. to evaluate the quality of wine on the basis of its physicochemical properties only. However, one might be also interested in causal questions, such as whether intervening on one (or more) physicochemical property may change the wine sensory score. As a consequence, this can lead to identify the target of intervention which produces the largest increase in the score.
We run Algorithm 1 to approximate the posterior distribution of DAGs, DAGparameters and causal effects for any variable in the system and the joint-IDA method based on a CPDAG estimated obtained from PC algorithm. Because one can reasonably assume that the quality score does not affect any of the physicochemical properties (but rather the opposite is argued), we restrict the space of DAGs by imposing that node 1 (the sensory score) cannot have descendant nodes. Such a constraint introduces prior information on the causal structure which is suggested by the concrete problem. In our MCMC algorithm this is achieved by limiting the set of valid operators of type Insert involving node 1 to those of the form u ! 1; see also Sect. 5. In the PC algorithm instead, this background information is included with the following procedure: we first estimate the skeleton between variables X 1 ; . . .; X q as in the standard first step of PC. Next, we orient undirected edges between variables Y and covariates X 2 ; . . .; X q as X j ! Y, while apply Meek's orientation rules to orient the sub-graph of X 2 ; . . .; X q ; see also Kalish and Buhlmann (2007) for details.
We first assess the convergence of the MCMC algorithm by running two independent chains of length S ¼ 30000. Figure 8 summarizes the estimated posterior probabilities of edge inclusion (Equation 26) computed from each MCMC chain. The two resulting heatmaps suggest a highly satisfactory agreement between the two chains.
Starting from our MCMC output we consider both the Maximum a Posteriori (MAP) and the Median Probability Model (MPM) as DAG estimates. However, we stress that our final BMA estimate does not rely on a single DAG but rather on a full posterior of DAGs and accordingly a single DAG estimate is only constructed as an overall graph summary. The two graphs are reported in Fig. 9, together with the lower heatmaps of Fig. 10. Substantial differences between the two methods appear, instead, for variable total.SO2, which under joint-IDA is associated with a (negative) causal effect on quality. In addition, joint-IDA causal effect estimates are somewhat higher than those obtained under our BMA method. We remark that the effect of joint interventions on more than two variables can be evaluated in a similar way. However, for simplicity of exposition we have limited our analysis to the case of pair-nodes interventions.

Discussion
In this paper we present a Bayesian methodology for structural learning of dependence and causal relations among variables. In particular, we assume that multivariate observational data have been generated by a Directed Acyclic Graph (DAG) model which is unknown. Of special interest is also the causal effect of a specific variable on a response arising from a joint intervention on several variables in the system. The latter depends on the underlying network structure which  Fig. 9 Real data analysis. Comparison between estimated graphs. From upper-left to bottom: maximum a posteriori, median probability and modified-PC DAG estimates estimated equivalence class of DAGs whose identification may affect the causal estimation results.
Joint interventions lead to causal effects that can significantly deviate from their single-node counterparts. Accordingly, a desired effect on the response can be obtained through a unique intervention involving several variables simultaneously, rather than a sequence of single-node interventions. Since the number of possible joint interventions grows exponentially in the number of variables, the investigation of an optimization strategy which identifies the optimal intervention target producing the desired level of the response could be of interest.
In general, a DAG cannot be uniquely identified from observational data and accordingly a possibly large collection of causal effects is estimated. Randomized intervention experiments producing interventional data can be used to improve the identifiability of the data generating model which consequently reduces the uncertainty around the causal effect estimate; see also Castelletti and Consonni (2020). In principle, one could then perform sequential simultaneous intervention leading to the identification of the true causal effect. This issue can be tackled from an optimal design of experiment perspective implementing an objective function whose optimization reduces the uncertainty related to each BMA causal effect estimate of interest.
In this paper we consider causal effect estimation from joint hard interventions. A more general framework, named soft interventions, assumes that parent-child dependencies are ''modified'' but yet preserved after intervention. In this setting, Correa and Bareinboim (2020) introduce a set of rules (named r-calculus) for the identifiability of causal effects arising from soft interventions. They then show how these rules can be applied to identify the causal effect of an interventions from a combination of observational and interventional data, the latter arising from exogenous perturbations of selected variables in the system. A Bayesian framework for causal effect estimation under soft interventions is however still lacking to our knowledge and is currently under investigation by the authors.
Funding Open access funding provided by Università Cattolica del Sacro Cuore within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creativecommons.org/licenses/by/4.0/.