1 Introduction

1.1 Background

Networks are graphical structures that arise in many biological applications. For example, microbial networks are a graphical representation of a microbial community where nodes represent microbial taxa and edges correspond to some form of interaction (Matchado et al. 2021). Microbial networks arise in microbiome studies from soil (Barberán et al. 2012) to the human gut (Baldassano and Bassett 2016; Claesson et al. 2012). Other biological networks include brain networks, where nodes correspond to brain regions (or voxels) and edges correspond to connections between them (Rubinov and Sporns 2010), and ecological networks which, like microbial networks, represent a community of species in the wild (e.g., food webs; Pimm et al. 1991).

Many biological networks need to be estimated from data, and given their broad applicability, a plethora of network methods has been developed to reconstruct biological networks from a wide variety of data types: from microbiome sequencing data (Jovel et al. 2016) to MRI images of brain regions (van Straaten and Stam 2013). Many network methods used in microbiome studies, however, estimate a network under static conditions. That is, researchers estimate the human gut microbial network from sequencing reads of gut samples from different individuals without incorporating information about treatments applied to those individuals. In the absence of any treatments or other disturbances of the network structure, Gaussian graphical models are among the most widely used models to estimate such networks (Layeghifard et al. 2017).

Many researchers, however, want to use information from different experimental treatments to better infer the biological network. For example, we can collect human gut microbiome samples from control patients and from patients under antibiotic treatment. The different responses (abundances of microbes) under different treatments can provide information on the microbial network structure by, for example, eliminating key players in the community, as in the case of antibiotic treatments. The conditions are no longer static (no treatment). Instead, the nodes are perturbed by experimental treatments, and these perturbations provide information on the network structure itself. We thus highlight that experimental treatments are incorporated into the model to better estimate the network structure, not because there is an interest in the effect of these treatments on the responses. On the contrary, the exact experimental treatment effects are nuisance parameters, and our only focus is the estimation of the network structure with the help that the experimental treatments provide.

1.2 The Gaussian Chain Graph Model

Standard Gaussian graphical models no longer apply when there are experimental treatments affecting the network structure. Thus, chain graph models arise as a suitable alternative for the “network under treatment” setting (Lauritzen and Richardson 2002). More formally, a Gaussian chain graph model with k response nodes (\({\textbf{Y}}_i\in \mathbb {R}^{ k}\)) and p predictor nodes (\({\textbf{X}}_i\in \mathbb {R}^{p}\)) is given by:

$$\begin{aligned} \textbf{Y}_i \mid \textbf{X}_i, \textbf{B}, {\varvec{\Omega }} \sim \mathcal {N}({\varvec{\Omega }}^{-1} \textbf{B}^T \textbf{X}_i, {\varvec{\Omega }}^{-1}) \end{aligned}$$
(1)

where \(\textbf{B} \in \mathbb {R}^{p \times k}\) is the matrix of regression coefficients (e.g., treatment effects) and \({\varvec{\Omega }} \in \mathbb {R}^{k \times k}\) is the precision matrix among responses (e.g., the network structure). Note that the treatment effects could also be used to account for subject heterogeneity when samples are not independent and identically distributed draws from a Gaussian graphical model. In the microbial network example described above, the responses correspond to the abundances of microbes in the samples and the predictors correspond to the experimental treatments. Our parameter of interest is \({\varvec{\Omega }}\), which represents the network among responses (the microbial network), while \(\textbf{B}\) represents the direct effects of treatments on the responses (e.g., the effect of an antibiotic on different microbes). As mentioned, we are not interested in \(\textbf{B}\) itself. The introduction of \(\textbf{B}\) into the model is done to facilitate inference of \({\varvec{\Omega }}\), so in this sense, \(\textbf{B}\) is a nuisance parameter. Bayesian implementations of Gaussian chain graph models further include prior distributions for \(\textbf{B}\) and \({\varvec{\Omega }}\), which allow the inclusion of prior biological knowledge into the model. Different priors have been proposed for the precision matrix, such as the conjugate Wishart prior, shrinkage priors like the LASSO and adaptive LASSO priors (Wang 2012), spike-and-slab priors (Gan et al. 2019), priors based on matrix decompositions and transformations like the spectrum (Daniels and Kass 1999), Cholesky (Daniels and Pourahmadi 2002), and Givens angle (Kang and Cressie 2011) decompositions, and the reference prior (Yang and Berger 1994). Unfortunately, because the mean structure involves the precision matrix in chain graph models, a Gibbs sampler is not straightforward to implement. In multivariate linear regression, one can condition on the regression coefficients and then sample the full conditional of the precision matrix using the same sampling method as for a Gaussian graphical model. In contrast, in a Gaussian chain graph model, the mean structure involves the precision matrix, and thus sampling usually requires modifications. For instance, despite following the same derivation as the original graphical LASSO (Wang 2012), the sampler for the graphical LASSO prior on Gaussian chain graph models (Shen and Solis-Lemus 2020) differs significantly in its full conditionals. While a thorough comparison of multiple priors on Gaussian chain graph models is indeed an interesting research direction, it is beyond the scope of this paper; here we focus on the priors that have already been adopted in chain graphs.
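To make the data-generating mechanism concrete, the following minimal sketch (in Python with NumPy; all names are ours and purely illustrative) simulates responses from model (1): each \(\textbf{Y}_i\) has conditional mean \({\varvec{\Omega }}^{-1}\textbf{B}^T\textbf{X}_i\) and conditional covariance \({\varvec{\Omega }}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 200, 2, 3                      # samples, predictor nodes, response nodes
B = rng.normal(size=(p, k))              # conditional treatment effects (nuisance)
A = rng.normal(size=(k, k))
Omega = A @ A.T + k * np.eye(k)          # a positive-definite precision matrix (the network)
Sigma = np.linalg.inv(Omega)             # conditional covariance of Y given X

X = rng.normal(size=(n, p))              # design matrix (e.g., treatments)
# Y_i | X_i ~ N(Omega^{-1} B^T X_i, Omega^{-1}) as in Eq. (1);
# row i of the mean below is (Omega^{-1} B^T X_i)^T = X_i^T B Omega^{-1}
Y = X @ B @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)
```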

Under the setup of Bayesian inference for the Gaussian chain graph model, a rather unexplored statistical question is which experimental design on the predictors (design matrix \({\textbf{X}}\)) would improve the inference of the precision matrix (\({\varvec{\Omega }}\)). Traditional experimental design in linear models involves selecting a data matrix \({\textbf{X}}\) such that the variance (covariance) of the estimator of the regression coefficient \(\hat{\beta }\) is small, which translates into selecting \({\textbf{X}}\) such that \(\varvec{\Sigma }_{{\textbf{X}}} = \frac{1}{n} {\textbf{X}}^T {\textbf{X}}\) is as large as possible. The design problem is then built around an optimality criterion (or measure of success) \(V(\varvec{\Sigma }_{{\textbf{X}}})\) or \(V({\textbf{X}})\). For example, under the D-optimality setting, the design aims to maximize the determinant of \(\varvec{\Sigma }_{{\textbf{X}}}\), or equivalently to minimize the determinant of its inverse: \(V_D(\varvec{\Sigma }_{{\textbf{X}}}) = \det (\varvec{\Sigma }^{-1}_{{\textbf{X}}})\). Other optimality criteria involve maximizing the smallest eigenvalue of \(\varvec{\Sigma }_{{\textbf{X}}}\) (E-optimality) or minimizing the trace of \(\varvec{\Sigma }_{{\textbf{X}}}^{-1}\) (A-optimality). Bayesian alternatives of these designs change slightly with the incorporation of priors. For example, the Bayesian D-optimality setting involves maximizing \(\det \left( \varvec{\Sigma }_{{\textbf{X}}} + \frac{1}{n} \textbf{V}_0^{-1} \right) \) where \(\textbf{V}_0\) is the prior covariance matrix of \(\beta \).
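As a hedged illustration of these criteria (our own code, not tied to any specific design software; the candidate designs are simply random matrices), the sketch below scores design matrices under the classical D-, A-, and E-criteria and the Bayesian D-criterion described above.

```python
import numpy as np

def sigma_x(X):
    """Scaled information matrix Sigma_X = X^T X / n."""
    return X.T @ X / X.shape[0]

def d_criterion(X):
    """D-optimality: maximize det(Sigma_X), i.e., minimize det(Sigma_X^{-1})."""
    return np.linalg.det(sigma_x(X))

def a_criterion(X):
    """A-optimality: minimize trace(Sigma_X^{-1})."""
    return np.trace(np.linalg.inv(sigma_x(X)))

def e_criterion(X):
    """E-optimality: maximize the smallest eigenvalue of Sigma_X."""
    return np.linalg.eigvalsh(sigma_x(X)).min()

def bayes_d_criterion(X, V0):
    """Bayesian D-optimality: maximize det(Sigma_X + V0^{-1} / n),
    with V0 the prior covariance of the regression coefficients."""
    return np.linalg.det(sigma_x(X) + np.linalg.inv(V0) / X.shape[0])

# score a handful of random candidate designs (purely illustrative)
rng = np.random.default_rng(1)
n, p = 100, 3
V0 = np.eye(p)
candidates = [rng.normal(size=(n, p)) for _ in range(5)]
best = max(candidates, key=lambda X: bayes_d_criterion(X, V0))
```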

The question of optimal experimental design is not unexplored from the biological perspective. Researchers have long been interested in experiments with specificity: the treatment is conditionally independent of all but one of the response nodes. However, it can be difficult or impossible to design an experiment in which only one response node (e.g., one microbe) is perturbed. Thus, in the absence of specific treatments, we can study which other experimental designs can aid in the estimation of the network structure.

Here, we address the question of whether we can find an optimal Bayesian experimental design to infer \({\varvec{\Omega }}\), our parameter of interest, in a Gaussian chain graph model. We focus on the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) as our optimality criterion. We choose to focus on the marginal precision matrix of \({\varvec{\Omega }}\) (instead of the joint precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\)), given that we want to account for the nuisance parameter \(\textbf{B}\) without focusing on it directly. In addition, we use the Laplace approximation of the precision matrix (instead of the exact precision matrix), given that the exact precision matrix tends to be intractable for most posteriors (but see Sect. 1.3).

We study the case of four different prior distributions: a flat prior, the conjugate Normal-Wishart prior, the novel Normal-Matrix Generalized Inverse Gaussian (Normal-MGIG) prior, and a general independent prior. For each prior, we obtain the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) to use as our optimality criterion to find the optimal experimental design \({\textbf{X}}\). We find, however, that the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is not a function of \({\textbf{X}}\) for either the flat or the conjugate prior. This implies that it is difficult, if not impossible, to find an optimal Bayesian experimental design to aid in the estimation of \({\varvec{\Omega }}\) for these two priors. In contrast, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is a function of \({\textbf{X}}\) for the novel Normal-MGIG prior and for a general independent prior, which allows the search for an optimal experimental design. However, we discover an information bound for both of these priors, which implies that there is a theoretical limit to how much information can be gained from experiments in the inference of \({\varvec{\Omega }}\).

Our work has important repercussions for domain scientists who use experimental settings to aid in the estimation of the network structure (\({\varvec{\Omega }}\)). Under a Bayesian Gaussian chain graph model, the choice of prior is highly impactful, but even when appropriate priors are selected, there is a bound on the information gained from the experiment. This bound depends on our prior knowledge about the (conditional) effect of the experiment itself.

1.3 Motivating Example: Toy Data with Explicit Posterior Precision

We show here one example that illustrates the interplay between prior knowledge and experiments. We simulate \(k=3\) responses, \(p=1\) predictor and \(n=200\) samples under an AR(1) model with \(\sigma _{ij}=0.7^{|i-j|}\) (denoted Model 1 in the Simulations in Sect. 3). We simulate two settings: 1) a null experiment (design matrix \(\textbf{X}=\textbf{0}\) despite a potential experimental effect \(\textbf{B}\ne \textbf{0}\)) and 2) an experiment in which the predictor has an effect only on the third response. We consider two priors, each with two uncertainty levels: 1) the Normal-Wishart prior with \(\lambda =8\), \(\varvec{\Phi }=10^{-3} \textbf{I}_3\) and two uncertainty levels for \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3\) (uncertain case), and 2) the Normal-MGIG prior with \(\lambda =4\), \(\varvec{\Psi }=\varvec{\Phi }=10^{-3} \textbf{I}_3\) and two uncertainty levels for \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3\) (uncertain case). We set \(\textbf{B}_0= \textbf{B}\) (see Sect. 2 for more details on the priors). The design matrix is sampled from the standard Normal distribution.

For this toy example, \({\varvec{\Omega }}\) has an explicit marginal posterior distribution, namely a matrix generalized inverse Gaussian (MGIG) distribution (Eq. A1) with parameters \(\lambda + \frac{n}{2}\), \(\varvec{\Psi }+(\textbf{XB}_0)^T (\textbf{XB}_0)\) and \(\varvec{\Phi } + \textbf{Y}^T\textbf{Y}\).
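The sketch below (our own code, mirroring the toy setup under the assumption that \(\textbf{B}_0\) equals the true \(\textbf{B}\)) computes these three posterior parameters from simulated data; sampling from the resulting MGIG distribution itself (e.g., by importance sampling) is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 200, 1, 3

# AR(1) covariance (Model 1) and the implied precision matrix
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
Omega = np.linalg.inv(Sigma)

B = np.array([[0.0, 0.0, 1.0]])          # the predictor affects only the third response
X = rng.normal(size=(n, p))              # experiment; set X = 0 for the null experiment
Y = X @ B @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)

# Normal-MGIG hyperparameters (certain case), with B0 set to the true B
lam, Psi, Phi, B0 = 4.0, 1e-3 * np.eye(k), 1e-3 * np.eye(k), B

# Explicit marginal posterior of Omega: MGIG(lam + n/2, Psi_hat, Phi_hat)
lam_post = lam + n / 2
Psi_hat = Psi + (X @ B0).T @ (X @ B0)
Phi_hat = Phi + Y.T @ Y
```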

Figure 1 shows the posterior distribution of the (1, 2)-th entry of the precision matrix (\(\rho _{12}\)). We can observe that the experiment has an effect on the inference of \(\rho _{12}\) and that this effect differs by prior, with the Normal-MGIG prior displaying less variability than the conjugate Normal-Wishart prior. Empirical observations like this toy example motivated our pursuit of the theoretical interplay between prior distributions and experimental design in the inference of a precision matrix in a chain graph model.

Fig. 1

Posterior distribution for the (1, 2)-th entry of the precision matrix in the toy simulated data with \(k=3\) responses, \(p=1\) predictor and \(n=200\) samples under an AR(1) model, with and without an experiment affecting only the third node, and under the Normal-MGIG and Normal-Wishart priors with different uncertainty levels. Note that we are showing an entry of the precision matrix, not the covariance matrix

1.4 Structure of the Paper

The structure of the paper is as follows: In Sect. 2, we aim to understand how the choice of prior can influence the effectiveness of an experimental design. We begin with a recap of the Laplace approximation of the posterior precision matrix, and then we derive the Laplace approximation of the marginal posterior precision matrix of the parameter of interest \({\varvec{\Omega }}\) under four priors: a flat prior, the Normal-Wishart conjugate prior, the Normal-MGIG prior and a general independent prior. We show that optimal experimental design is only possible under the two latter priors, and that even in these cases there is an information limit. In Sect. 3, we numerically simulate the posterior under several priors and different experimental designs. We evaluate the results using the Kullback–Leibler divergence between prior and posterior, which measures the information gained by conducting experiments. In addition, we use Stein's loss of the maximum a posteriori (MAP) estimate of the precision matrix, which evaluates the performance of the point estimates from the experiment.

In Sect. 4, we revisit a discussion similar to the one in the motivating toy dataset: we compare the posterior of partial correlations among responses under different experiments and priors for a real human gut microbiome dataset. Finally, in Sect. 5, we conclude with some practical advice for domain scientists as well as future directions.

2 Experimental Design under Different Priors in a Gaussian Chain Graph

Ideally, we want to use the marginal posterior precision matrix of our parameter of interest \({\varvec{\Omega }}\) as the optimality criterion in an experimental design setting. That is, we want to find the optimal design matrix \({\textbf{X}}\) that maximizes the posterior precision of \({\varvec{\Omega }}\). However, this posterior precision matrix can be intractable in many cases, and thus, we will use its Laplace approximation instead. We begin this section with a summary of the Laplace approximation of a posterior distribution. Then, we present the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) under the four priors under study.

2.1 Laplace Approximation of the Posterior Precision Matrix

For a log-concave posterior distribution \(p(\theta | Y)\) for a random variable \(Y\) and parameter of interest \(\theta \), the Laplace approximation of the posterior precision is the negative of the Hessian of the log posterior, \(-\nabla ^2_{\theta }\log p(\hat{\theta }|Y)\), which can be partitioned as the sum of the Hessian of the log prior and the Hessian of the log likelihood, \(\nabla ^2_{\theta }\log p(\hat{\theta }|Y) =\nabla ^2_{\theta }\log p(\hat{\theta })+\nabla ^2_{\theta }\log p(Y|\hat{\theta })\), near the maximum a posteriori (MAP) estimator \(\hat{\theta }\). However, we often do not have a closed-form expression for the MAP; we simply know that it is close to the true parameter. Thus, we make the additional approximation of evaluating at the true parameter instead of the MAP. We note that while the MAP can depend on the experimental design \(\textbf{X}\), this approximation should not have a considerable impact on the results, given that the MAP should be close to the true parameter when the prior has enough support around this true value.
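As a self-contained illustration of the idea, and not of the chain graph computation itself, the sketch below (our code) approximates the posterior precision of a scalar Gaussian mean by the negative finite-difference Hessian of the log posterior evaluated at the true parameter; in this conjugate toy case the exact answer is \(n/\sigma ^2 + 1/\tau ^2\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, tau = 50, 1.0, 2.0              # likelihood and prior standard deviations
theta_true = 0.5
y = rng.normal(theta_true, sigma, size=n)

def log_posterior(theta):
    # N(theta, sigma^2) likelihood with a N(0, tau^2) prior, up to an additive constant
    return -0.5 * np.sum((y - theta) ** 2) / sigma**2 - 0.5 * theta**2 / tau**2

# Laplace approximation: posterior precision ~ -(d^2/dtheta^2) log p(theta | y),
# evaluated here at the true parameter instead of the MAP, as discussed in the text
h = 1e-4
hess = (log_posterior(theta_true + h) - 2 * log_posterior(theta_true)
        + log_posterior(theta_true - h)) / h**2
laplace_precision = -hess                 # matches n / sigma**2 + 1 / tau**2
```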

For the Gaussian chain graph model (1), the Hessian of the log likelihood has the following form (derivation can be found in Appendix A1):

$$\begin{aligned} \begin{aligned} \frac{\partial ^2 \ell }{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vec}\,}}(\textbf{B})^T}&=-\varvec{\Omega }^{-1}\otimes \textbf{X}^T\textbf{X}\\ \frac{\partial ^2 \ell }{\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })^T}&=-\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\ \frac{\partial ^2 \ell }{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })^T}&=\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) . \end{aligned} \end{aligned}$$

where \(\textbf{D}_k\) is the duplication matrix (Minka 2000; Magnus and Neudecker 2019), i.e., the matrix such that \(\textbf{D}_k{{\,\textrm{vech}\,}}({\varvec{\Omega }}) = {{\,\textrm{vec}\,}}({\varvec{\Omega }})\), where \({{\,\textrm{vech}\,}}({\varvec{\Omega }})\) denotes the vectorization of the unique parameters of \({\varvec{\Omega }}\) (the upper triangular part in our case); given that \({\varvec{\Omega }}\) is symmetric, there are fewer free parameters.
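A minimal construction of \(\textbf{D}_k\) (our own helper, not an existing library routine) together with a check of the defining identity \(\textbf{D}_k{{\,\textrm{vech}\,}}({\varvec{\Omega }})={{\,\textrm{vec}\,}}({\varvec{\Omega }})\) is sketched below.

```python
import numpy as np

def duplication_matrix(k):
    """D_k of size k^2 x k(k+1)/2 such that D_k vech(A) = vec(A) for symmetric A."""
    pairs = [(i, j) for j in range(k) for i in range(j + 1)]   # vech ordering
    D = np.zeros((k * k, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        D[j * k + i, c] = 1.0          # position of A[i, j] in vec(A)
        D[i * k + j, c] = 1.0          # its symmetric counterpart A[j, i]
    return D

def vech(A):
    """Stack the upper-triangular (including diagonal) entries column by column."""
    return np.concatenate([A[:j + 1, j] for j in range(A.shape[0])])

k = 3
M = np.random.default_rng(4).normal(size=(k, k))
Omega = M @ M.T + k * np.eye(k)
Dk = duplication_matrix(k)
assert np.allclose(Dk @ vech(Omega), Omega.flatten(order="F"))   # vec(Omega)
```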

Because the Gaussian chain graph model is an exponential family, the Hessian of the log likelihood is also the negative of the Fisher information matrix:

$$\begin{aligned} I({\varvec{\Omega }}, \textbf{B})=\left[ \begin{array}{ll} \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k &{} -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\ -(\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) )^T &{} {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X} \end{array} \right] .\nonumber \\ \end{aligned}$$
(2)

Having the Hessian of the log likelihood (the negative of the Fisher information in (2)), we only need the Hessian of the log prior for each of the priors under study to obtain the negative Hessian of the log posterior, and thus the Laplace approximation of the posterior precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\), which can be written in block form:

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{A} &{} \textbf{G}\\ \textbf{C} &{} \textbf{D}\\ \end{array} \right] . \end{aligned}$$
(3)

Given that this matrix corresponds to the Laplace-approximated posterior precision, its inverse corresponds to the Laplace-approximated posterior covariance matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\):

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{A} &{} \textbf{G}\\ \textbf{C} &{} \textbf{D}\\ \end{array} \right] ^{-1} = \left[ \begin{array}{ll} (\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1} &{} -\textbf{A}^{-1}\textbf{G}(\textbf{D}-\textbf{C}\textbf{A}^{-1}\textbf{G})^{-1}\\ -\textbf{D}^{-1}\textbf{C}(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1} &{} (\textbf{D}-\textbf{C}\textbf{A}^{-1}\textbf{G})^{-1}\\ \end{array} \right] . \end{aligned}$$

The block \((\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1}\), the inverse of the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) of the block \(\textbf{D}\) (Prasolov 1994), corresponds to the Laplace approximation of the marginal posterior covariance matrix of \({\varvec{\Omega }}\). The Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) itself is then the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\), our optimality criterion to address the question of optimal Bayesian experimental design for each of the priors. As mentioned before, we focus on the marginal precision matrix of \({\varvec{\Omega }}\) because we want to account for \(\textbf{B}\), but only as a nuisance parameter.
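The sketch below (our code) illustrates this relationship numerically on a random positive-definite matrix playing the role of (3): the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) coincides with the inverse of the corresponding block of the inverse matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 4, 3                             # sizes of the Omega- and B-blocks
M = rng.normal(size=(d1 + d2, d1 + d2))
P = M @ M.T + (d1 + d2) * np.eye(d1 + d2)  # a positive-definite "joint precision"

A, G = P[:d1, :d1], P[:d1, d1:]
C, D = P[d1:, :d1], P[d1:, d1:]

# The marginal covariance of the first block is the (1,1) block of P^{-1};
# its inverse equals the Schur complement A - G D^{-1} C.
marginal_precision = A - G @ np.linalg.solve(D, C)
assert np.allclose(marginal_precision, np.linalg.inv(np.linalg.inv(P)[:d1, :d1]))
```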

Our goal for the remainder of this section is to obtain the matrix \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) for each of the four priors under study, with \(\textbf{A}, \textbf{G}, \textbf{C}, \textbf{D}\) coming from the negative of the Hessian of the log posterior. This matrix is our optimality criterion for Bayesian experimental design and will allow us to search for the optimal design matrix \({\textbf{X}}\) under each prior.

Finally, we note that the Laplace approximation is only valid for posterior densities that are log concave. This is trivially true for the case of the flat prior because the likelihood model is log concave. We prove log concavity of the other three priors under study in the Appendix.

2.2 Flat Prior

As mentioned, the posterior under the flat prior is trivially log concave because the likelihood of the Gaussian chain graph model is log concave. Thus, we can obtain the Laplace approximation of the posterior precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\) as the negative Hessian of the log likelihood (the Fisher information in (2)), which can be written as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k, \\ \textbf{G}&=\textbf{C}^T=-\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) ,\\ \textbf{D}&={\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} {\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X})^{-1}\right] \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X})\right) \right] ^T\\&=\frac{n}{2}\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\right) \textbf{D}_k. \end{aligned}$$

The next step in Bayesian experimental design would be to find a design matrix \({\textbf{X}}\) that maximizes our optimality criterion \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\). We note, however, that \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\) is not a function of \({\textbf{X}}\), and thus, optimal experimental design cannot be performed with this optimality criterion. We also note that when \(\textbf{B}\) is completely known, the Fisher information matrix becomes the upper left block in (2) (matrix \(\textbf{A}\)), which does contain \({\textbf{X}}\), and thus experimental design is possible in the known-\(\textbf{B}\) case. This remark hints at the possibility that prior knowledge about \(\textbf{B}\) could help infer \({\varvec{\Omega }}\), as will be confirmed in the next sections.
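This cancellation is easy to check numerically. The sketch below (our code, rebuilding the duplication matrix as before) forms the blocks of (2) for a random design and verifies that the Schur complement equals \(\frac{n}{2}\textbf{D}_k^T({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1})\textbf{D}_k\), with no dependence on \({\textbf{X}}\) (provided \({\textbf{X}}^T{\textbf{X}}\) is invertible).

```python
import numpy as np

def duplication_matrix(k):
    pairs = [(i, j) for j in range(k) for i in range(j + 1)]
    D = np.zeros((k * k, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        D[j * k + i, c] = 1.0
        D[i * k + j, c] = 1.0
    return D

rng = np.random.default_rng(6)
n, p, k = 40, 2, 3
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, k))
M = rng.normal(size=(k, k))
Omega = M @ M.T + k * np.eye(k)
Oi = np.linalg.inv(Omega)
Dk = duplication_matrix(k)
XtX = X.T @ X

# Blocks of the Fisher information (2), i.e., the negative Hessian of the log likelihood
A = Dk.T @ (n / 2 * np.kron(Oi, Oi) + np.kron(Oi, Oi @ B.T @ XtX @ B @ Oi)) @ Dk
G = -Dk.T @ np.kron(Oi, Oi @ B.T @ XtX)
Dblk = np.kron(Oi, XtX)

schur = A - G @ np.linalg.solve(Dblk, G.T)
assert np.allclose(schur, n / 2 * Dk.T @ np.kron(Oi, Oi) @ Dk)   # no dependence on X
```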

2.3 Standard Conjugate Prior: Normal-Wishart

The standard conjugate prior for a Gaussian chain graph model is the Normal-Wishart family:

$$\begin{aligned} \begin{aligned} {\varvec{\Omega }}|\lambda , \varvec{\Phi }&\sim W_k(\lambda , \varvec{\Phi }^{-1})\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }},\textbf{B}_0,\varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}(\textbf{B}_0{\varvec{\Omega }}),\varvec{\Omega }\otimes \varvec{\Lambda }) \end{aligned} \end{aligned}$$
(4)

where \(\varvec{\Phi }\in \mathbb {R}^{k\times k}\) is positive definite, \(\lambda \) is a scalar, \(\textbf{B}_0 \in \mathbb {R}^{p \times k}\), and \(\varvec{\Lambda }\in \mathbb {R}^{p\times p}\) represents the uncertainty on \(\textbf{B}\). Then, the posterior distribution is given by:

$$\begin{aligned} \begin{aligned} {\varvec{\Omega }}| {\textbf{Y}},{\textbf{X}}, \lambda , \varvec{\Phi }&\sim W_k(\lambda +n,\hat{\varvec{\Phi }}^{-1})\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }},{\textbf{Y}},{\textbf{X}}, \textbf{B}_0, \varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}((\varvec{\Lambda }^{-1}+{\textbf{X}}^T{\textbf{X}})^{-1}(\varvec{\Lambda }^{-1}\textbf{B}_0+{\textbf{X}}^T {\textbf{Y}}) {\varvec{\Omega }}),\varvec{\Omega }\otimes (\varvec{\Lambda }^{-1}+{\textbf{X}}^T{\textbf{X}})^{-1}) \end{aligned} \end{aligned}$$

where \(\hat{\varvec{\Phi }}=\varvec{\Phi }+\textbf{Y}^T\textbf{Y}+\textbf{B}_0^T\varvec{\Lambda }^{-1}\textbf{B}_0-(\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}(\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})^T\). We show that this matrix is positive definite in Appendix A4.
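As a sketch of this posterior update (our code; we assume SciPy's Wishart parameterization by degrees of freedom and scale matrix, and use the true \(\textbf{B}\) both to generate data and as the prior location \(\textbf{B}_0\)):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(7)
n, p, k = 200, 2, 3
X = rng.normal(size=(n, p))
B0 = rng.normal(size=(p, k))               # used as the true B and as the prior location
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
Y = X @ B0 @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)

# Normal-Wishart hyperparameters
lam, Phi, Lam = 2 * k + 2, 1e-3 * np.eye(k), 1e-3 * np.eye(p)
Lam_inv = np.linalg.inv(Lam)

# Posterior: Omega | Y, X ~ W_k(lam + n, Phi_hat^{-1})
M = B0.T @ Lam_inv + Y.T @ X               # k x p
Phi_hat = (Phi + Y.T @ Y + B0.T @ Lam_inv @ B0
           - M @ np.linalg.solve(X.T @ X + Lam_inv, M.T))
omega_draws = wishart.rvs(df=lam + n, scale=np.linalg.inv(Phi_hat), size=1000)
```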

To use the Laplace approximation for the posterior precision, the posterior distribution needs to be log concave. The Normal-Wishart posterior is log concave when \(\frac{1}{2}(\lambda -k-p-1)\ge k/2\) (Appendix A2), so we can use the Laplace approximation of the posterior precision, namely the negative Hessian of the log posterior, which is the sum of the negative Hessian of the log likelihood (2) and the negative Hessian of the log prior; the Hessian of the log prior is given by:

$$\begin{aligned}&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}{(\textbf{B})} \partial {{\,\textrm{vec}\,}}{(\textbf{B})}^T} \log p(\varvec{\Omega },\textbf{B}) =-{\varvec{\Omega }}^{-1}\otimes \varvec{\Lambda }^{-1}\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})} \partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})}^T} \log p(\varvec{\Omega },\textbf{B})\nonumber \\&\quad =-\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( \frac{1}{2}(\lambda -k-p-1) {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T \varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right) \textbf{D}_k\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})} \partial {{\,\textrm{vec}\,}}{(\textbf{B})}^T} \log p(\varvec{\Omega },\textbf{B})=\textbf{D}_k^T \left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T \varvec{\Lambda }^{-1} \right) \right) . \end{aligned}$$
(5)

Writing \(\alpha = \frac{1}{2}(\lambda -k-p-1)\), the negative Hessian of the log posterior can then be written as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( \alpha {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T \varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right) \textbf{D}_k \\&\quad + \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k \\&= \textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T\textbf{XB}+\textbf{B}^T\varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k,\\ \textbf{G}&= \textbf{C}^T = -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T \varvec{\Lambda }^{-1} \right) \right) -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\&= -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}) \right) \right) ,\\ \textbf{D}&= {\varvec{\Omega }}^{-1}\otimes {(\textbf{X}^T \textbf{X} + \varvec{\Lambda }^{-1})}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( \left( \frac{n}{2}+\alpha \right) {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T\textbf{XB}+\textbf{B}^T \varvec{\Lambda }^{-1} \textbf{B}){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})) \right) \\&\quad \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right] \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1}))\right) \right] ^T\\&=\left( \frac{n}{2}+\alpha \right) \textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right] \textbf{D}_k \end{aligned} \end{aligned}$$
(6)

which is again not a function of design \({\textbf{X}}\) (see Appendix A5 for details on the algebraic simplification). This implies again that we cannot find an optimal experimental design for the conjugate prior with the optimality criterion of the Laplace approximation of the marginal posterior precision of \({\varvec{\Omega }}\).

In addition, if we use the Normal-Wishart conjugate prior, our prior knowledge is actually on the marginal regression coefficient \(\tilde{\textbf{B}} = \textbf{B} {\varvec{\Omega }}^{-1}\), not on \(\textbf{B}\). That is, the Normal-Wishart prior does not identify the parameters \({\varvec{\Omega }}\) and \(\textbf{B}\) separately; rather, it only carries information about combinations of them. This is also evident by taking \(\varvec{\Lambda }\rightarrow 0\) (the uncertainty on \(\textbf{B}\)): \(\textbf{B}\) is still not fully known. Instead, we would only know the marginal coefficient \(\tilde{\textbf{B}}=\textbf{B}{\varvec{\Omega }}^{-1}=\textbf{B}_0\), while \(\textbf{B}=\textbf{B}_0{\varvec{\Omega }}\) remains random through \({\varvec{\Omega }}\). Thus, when the uncertainty on \(\textbf{B}\) goes to zero, the conjugate Normal-Wishart prior does not reduce to the known-\(\textbf{B}\) case, which, in addition to not allowing optimal experimental design to infer \({\varvec{\Omega }}\), makes the conjugate prior suboptimal for the Gaussian chain graph model when our focus is the estimation of \({\varvec{\Omega }}\).

2.4 Normal-Matrix Generalized Inverse Gaussian Prior

The drawbacks of the Normal-Wishart conjugate prior for the estimation of \({\varvec{\Omega }}\) motivate us to search for other prior alternatives. We consider the matrix generalized inverse Gaussian (MGIG) distribution (Barndorff-Nielsen et al. 1982; Fazayeli and Banerjee 2016) for \({\varvec{\Omega }}\) to define the Normal-MGIG prior, which is not conjugate for \(\textbf{B}\), but yields an MGIG posterior for \({\varvec{\Omega }}\) that can be sampled via importance sampling (Fazayeli and Banerjee 2016).

The Normal-MGIG prior is given by:

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|\lambda , \varvec{\Psi }, \varvec{\Phi }&\sim MGIG(\lambda ,\varvec{\Psi },\varvec{\Phi })\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }}, \textbf{B}_0, \varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}(\textbf{B}_0),\varvec{\Omega }\otimes \varvec{\Lambda }) \end{aligned} \end{aligned}$$
(7)

where \(\varvec{\Psi }\), \(\varvec{\Phi }\in {\mathbb {R}}^{k\times k}\) are positive definite while \(\lambda \) is a scalar. \(\textbf{B}_0\in {\mathbb {R}}^{p\times k}\) is the mean of \(\textbf{B}\) and \(\varvec{\Lambda }\in {\mathbb {R}}^{p\times p}\) is the uncertainty on \(\textbf{B}\). Then, the posterior distribution is proportional to:

$$\begin{aligned}&p(\varvec{\Omega },\textbf{B}|\textbf{Y},\textbf{X},\theta )\nonumber \\&\quad \propto |\varvec{\Omega }|^{\frac{n}{2}}\exp \left( {{\,\textrm{tr}\,}}(\textbf{Y}^T\textbf{X}\textbf{B})-\frac{1}{2} {{\,\textrm{tr}\,}}(\textbf{Y}^T\textbf{Y}\varvec{\Omega })- \frac{1}{2} {{\,\textrm{tr}\,}}(\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B} \varvec{\Omega }^{-1})\right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{-\frac{p}{2}}\exp \left( -\frac{1}{2}{{\,\textrm{tr}\,}}([\textbf{B}-\textbf{B}_0]^T \varvec{\Lambda }^{-1}[\textbf{B}-\textbf{B}_0] \varvec{\Omega }^{-1})) \right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{\lambda -\frac{k+1}{2}}\exp (-\frac{1}{2} {{\,\textrm{tr}\,}}[\varvec{\Psi }\varvec{\Omega }^{-1}]-\frac{1}{2}{{\,\textrm{tr}\,}}[\varvec{\Phi }\varvec{\Omega }])\nonumber \\&\quad \propto |\varvec{\Omega }|^{-\frac{p}{2}}\exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}(-2(\varvec{\Omega }^{-1}\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T \textbf{X})\textbf{B}+\textbf{B}^T(\varvec{\Lambda }^{-1}+\textbf{X}^T\textbf{X}) \textbf{B}\varvec{\Omega }^{-1})) \right) \nonumber \\&\qquad \times \exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}\left( (\varvec{\Omega }^{-1} \textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X}) (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}(\varvec{\Omega }^{-1} \textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})^T \varvec{\Omega }\right) \right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{\lambda +\frac{n}{2} -\frac{k+1}{2}}\exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}((\varvec{\Phi }+\textbf{Y}^T \textbf{Y}-\textbf{Y}^T\textbf{X}(\textbf{X}^T\textbf{X} +\varvec{\Lambda }^{-1})^{-1}\textbf{X}^T\textbf{Y})\varvec{\Omega })\right) \nonumber \\&\qquad \times \exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}((\varvec{\Psi }+\textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0-\textbf{B}_0^T\varvec{\Lambda }^{-1} (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1} \textbf{B}_0)\varvec{\Omega }^{-1})\right) \end{aligned}$$
(8)

where \(\theta \) denotes all hyperparameters in the prior. Thus, we get

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|\textbf{Y},\textbf{X},\theta&\sim MGIG\left( \lambda +\frac{n}{2},\hat{\varvec{\Psi }},\hat{\varvec{\Phi }}\right) \\ {{\,\textrm{vec}\,}}(\textbf{B})|\varvec{\Omega },\textbf{Y},\textbf{X},\theta&\sim N\left( \left( \varvec{\Omega }\otimes (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right) {{\,\textrm{vec}\,}}(\textbf{X}^T\textbf{Y}+\varvec{\Lambda }^{-1}\textbf{B}_0\varvec{\Omega }^{-1}),\ \varvec{\Omega }\otimes (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1} \right) \end{aligned} \end{aligned}$$

where \(\hat{\varvec{\Psi }}=\varvec{\Psi }+\textbf{B}_0^T\varvec{\Lambda }^{-1}\textbf{B}_0-\textbf{B}_0^T\varvec{\Lambda }^{-1}(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\textbf{B}_0\) and \(\hat{\varvec{\Phi }}=\varvec{\Phi }+\textbf{Y}^T\textbf{Y}-\textbf{Y}^T\textbf{X}(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\textbf{X}^T\textbf{Y}\). We show that these matrices are positive definite in Appendix A4.
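These updates are direct matrix computations, as in the sketch below (our code, with placeholder data purely for illustration). As \(\varvec{\Lambda }\rightarrow 0\), \(\hat{\varvec{\Psi }}\) and \(\hat{\varvec{\Phi }}\) reduce to the simpler formulas \(\varvec{\Psi }+(\textbf{XB}_0)^T (\textbf{XB}_0)\) and \(\varvec{\Phi }+\textbf{Y}^T\textbf{Y}\) used in Sect. 1.3.

```python
import numpy as np

def mgig_posterior_params(Y, X, lam, Psi, Phi, B0, Lam):
    """Hyperparameters of Omega | Y, X ~ MGIG(lam + n/2, Psi_hat, Phi_hat)."""
    n = Y.shape[0]
    Lam_inv = np.linalg.inv(Lam)
    K = np.linalg.inv(X.T @ X + Lam_inv)
    Psi_hat = Psi + B0.T @ Lam_inv @ B0 - B0.T @ Lam_inv @ K @ Lam_inv @ B0
    Phi_hat = Phi + Y.T @ Y - Y.T @ X @ K @ X.T @ Y
    return lam + n / 2, Psi_hat, Phi_hat

# illustrative call with placeholder data
rng = np.random.default_rng(12)
n, p, k = 100, 2, 3
X, B0, Y = rng.normal(size=(n, p)), rng.normal(size=(p, k)), rng.normal(size=(n, k))
lam_post, Psi_hat, Phi_hat = mgig_posterior_params(
    Y, X, lam=k + 1, Psi=1e-3 * np.eye(k), Phi=1e-3 * np.eye(k),
    B0=B0, Lam=1e-3 * np.eye(p))
```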

To the best of our knowledge, the Normal-MGIG prior has not been used for the Gaussian chain graph model, and thus we prove some of its properties in Appendix A3: we show that the MGIG prior is conjugate in the case of known \(\textbf{B}\) (Proposition 1), that it is log concave under certain conditions (Proposition 2), that it is unimodal in the case of unknown \(\textbf{B}\) (Proposition 3), and that its limiting case is indeed the case of known \(\textbf{B}\) (Remark 1).

The Normal-MGIG posterior is log concave when \(\lambda -\frac{k+p+1}{2}\ge \frac{p}{2}\) (Proposition 2 in the Appendix), so we can use the Laplace approximation of the posterior precision, namely the negative Hessian of the log posterior, which is the sum of the negative Hessian of the log likelihood (2) and the negative Hessian of the log prior; the Hessian of the log prior has the following form:

$$\begin{aligned}&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vec}\,}}(\textbf{B})^T}\log p({\varvec{\Omega }}, \textbf{B}) =-{\varvec{\Omega }}^{-1}\otimes \varvec{\Lambda }^{-1}\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})^T}\log p({\varvec{\Omega }}, \textbf{B}) \nonumber \\&\quad = -\textbf{D}_k^T\left( \left( \lambda -\frac{k+p+1}{2}\right) {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right) \textbf{D}_k\nonumber \\&\qquad -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}[(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1} (\textbf{B}-\textbf{B}_0)+\varvec{\Psi }]{\varvec{\Omega }}^{-1}\right) \right) \textbf{D}_k\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})^T}\log p({\varvec{\Omega }}, \textbf{B}) =\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes ({\varvec{\Omega }}^{-1}(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})\right) . \end{aligned}$$
(9)

Writing \(\alpha = \lambda -\frac{k+p+1}{2}\), the negative Hessian of the log posterior can then be written as the matrix in (3) with:

$$\begin{aligned}&\textbf{A}= \textbf{D}_k^T\left( \alpha {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right) \textbf{D}_k +\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}[(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0)+\varvec{\Psi }] {\varvec{\Omega }}^{-1}\right) \right) \textbf{D}_k\\&\quad + \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X} \textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} +{\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T \textbf{XB}+(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0)+\varvec{\Psi }){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k, \\ \textbf{G}&=\textbf{C}^T = -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes ({\varvec{\Omega }}^{-1}(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1}) \right) -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\&= -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})) \right) ,\\ \textbf{D}&= {\varvec{\Omega }}^{-1}\otimes {(\textbf{X}^T \textbf{X} + \varvec{\Lambda }^{-1} )}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned}&\varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})\nonumber \\&\quad =\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} +{\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T \textbf{XB}+(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0) +\varvec{\Psi }){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\nonumber \\&\qquad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})) \right) \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right] \nonumber \\&\qquad \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1}))\right) \right] ^T\nonumber \\&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1}+{\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} -\varvec{\Lambda }^{-1} ({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1} \varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\nonumber \\&\qquad +\textbf{D}_k^T\left[ \varvec{\Omega }^{-1} \otimes ({\varvec{\Omega }}^{-1}\varvec{\Psi }{\varvec{\Omega }}^{-1})\right] \textbf{D}_k \end{aligned}$$
(10)

which is a function of the design \({\textbf{X}}\) (see Appendix A6 for details on the algebraic simplification). Note that the only term that involves the design matrix \(\textbf{X}\) is \({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0{\varvec{\Omega }}^{-1}\), which is \(\textbf{0}\) when \({\textbf{X}}=\textbf{0}\) (i.e., no experiment), so we can understand this term as the information gain due to the experiment. We can then base the optimal design on \(\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\), the only factor that we have control over. For example, for a D-optimal design, we want to minimize \(|\varvec{\Lambda }^{-1}({\textbf{X}}^T {\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}|=\frac{1}{|\varvec{\Lambda }|^2|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|}\), which can be achieved by maximizing \(|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|\); this coincides with the usual Bayesian D-optimal design for the marginal regression coefficient (Chaloner and Verdinelli 1995).
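The sketch below (our code, with arbitrary illustrative values) computes the factor \(\varvec{\Lambda }^{-1}-\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\) for a null and a random design, confirming that it vanishes when \({\textbf{X}}=\textbf{0}\), and ranks candidate designs by the Bayesian D-criterion \(|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|\).

```python
import numpy as np

def gain_factor(X, Lam_inv):
    """Lam^{-1} - Lam^{-1} (X^T X + Lam^{-1})^{-1} Lam^{-1}; zero when X = 0."""
    return Lam_inv - Lam_inv @ np.linalg.solve(X.T @ X + Lam_inv, Lam_inv)

rng = np.random.default_rng(8)
n, p = 100, 2
Lam_inv = np.linalg.inv(0.1 * np.eye(p))

X_null = np.zeros((n, p))
X_rand = rng.normal(size=(n, p))
print(np.allclose(gain_factor(X_null, Lam_inv), 0))      # True: no experiment, no gain
print(np.linalg.eigvalsh(gain_factor(X_rand, Lam_inv)))  # positive: information gained

# Bayesian D-optimal choice among candidate designs: maximize |X^T X + Lam^{-1}|
candidates = [rng.normal(size=(n, p)) for _ in range(5)]
best = max(candidates, key=lambda X: np.linalg.slogdet(X.T @ X + Lam_inv)[1])
```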

2.4.1 Information Bound under the Normal-MGIG Prior

We now know that the experimental design has an effect on the estimation of \({\varvec{\Omega }}\) under a Normal-MGIG prior by influencing its approximate posterior precision. However, it turns out that there is a bound on the information we can gain from the experiment (\({\textbf{X}}\)).

Recall that we denote by \({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0{\varvec{\Omega }}^{-1}\) the information gain from the experiment. First, we observe that the inequality \(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\le \varvec{\Lambda }^{-1}\) holds because the subtracted term \(\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\) is positive semi-definite (it is a quadratic form in the positive-definite matrix \(({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\)).

By multiplying the inequality on the left by \({\varvec{\Omega }}^{-1} \textbf{B}_0^T\) and on the right by \(\textbf{B}_0 {\varvec{\Omega }}^{-1}\), we get

$$\begin{aligned} {\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1} \le {\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1} \end{aligned}$$

where the term on the left is precisely the information that we can gain from nonzero experiments (\({\textbf{X}}\ne \textbf{0}\)) and this term is bounded by \({\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1}\). This means that when we try to find the optimal experimental design \({\textbf{X}}\) that maximizes the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\), the only term that depends on \({\textbf{X}}\) is bounded, and thus, there is a limit to how much can be gained by an optimal experimental design.

Next, we observe that this bound is sharp as the equality is achieved when \({\textbf{X}}^T {\textbf{X}}\rightarrow \infty \). In addition, this inequality provides the intuition that the information gain due to the experiment is bounded by the product of the marginal effect of the experiment (\(\textbf{B}_0 {\varvec{\Omega }}^{-1}\)) and the prior certainty on the experiment’s conditional effect (\(\varvec{\Lambda }^{-1}\)). Thus, if our prior on the effect of the experiment is no effect on any nodes (\(\textbf{B}_0=\textbf{0}\)), we will gain no information from the experiment. We can re-write the bound in terms of the marginal regression coefficient \(\tilde{\textbf{B}}_0 = \textbf{B}_0 {\varvec{\Omega }}^{-1}\): \({\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1}\le \tilde{\textbf{B}}_0^T \varvec{\Lambda }^{-1}\tilde{\textbf{B}}_0\).
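The bound is easy to verify numerically: in the sketch below (our code, with arbitrary illustrative values), the difference between \({\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1}\) and the information gain stays positive semi-definite, and the gap shrinks as \({\textbf{X}}^T{\textbf{X}}\) grows.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, k = 50, 2, 3
Lam_inv = np.linalg.inv(0.5 * np.eye(p))
B0 = rng.normal(size=(p, k))
M = rng.normal(size=(k, k))
Omega_inv = np.linalg.inv(M @ M.T + k * np.eye(k))

# the information bound Omega^{-1} B0^T Lam^{-1} B0 Omega^{-1}
bound = Omega_inv @ B0.T @ Lam_inv @ B0 @ Omega_inv

for scale in [0.0, 1.0, 10.0, 100.0]:
    X = scale * rng.normal(size=(n, p))
    middle = Lam_inv - Lam_inv @ np.linalg.solve(X.T @ X + Lam_inv, Lam_inv)
    gain = Omega_inv @ B0.T @ middle @ B0 @ Omega_inv
    # bound - gain stays positive semi-definite, and the gap shrinks as X^T X grows
    print(scale, np.linalg.eigvalsh(bound - gain).min(), np.abs(bound - gain).max())
```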

In conclusion, given the information bound, the experimental design \({\textbf{X}}\) is not as important as the prior knowledge on the experiment's effect. To increase the information bound, there are two directions: either having a large prior marginal effect (\(\tilde{\textbf{B}}_0\)) or a small uncertainty on the conditional regression effects (\(\varvec{\Lambda }\)). That is, the most helpful experiments are the ones with large marginal effects on nodes with well-known conditional effects. More practical advice for domain scientists is given in the Discussion (Sect. 5).

2.5 General Independent Prior

We now investigate whether the information bound is specific to the Normal-MGIG case or whether it exists in the general case when we have a prior distribution on \(\textbf{B}\) independent of \({\varvec{\Omega }}\), assuming that the prior distribution is log concave.

Let \(\log p({\varvec{\Omega }}, \textbf{B})=f(\textbf{B})+g({\varvec{\Omega }})\) be the log prior density where \(f(\textbf{B})\) has a Hessian given by \(-\varvec{\Lambda }^{-1}\in \mathbb {R}^{kp\times kp}\) and \(g(\varvec{\Omega })\) has a Hessian (with respect to unique parameters of \(\varvec{\Omega }\)) given by \(-\varvec{\Psi }\in \mathbb {R}^{\frac{k(k+1)}{2}\times \frac{k(k+1)}{2}}\). Then, the negative Hessian of the log posterior is the sum of the negative Hessian of the log prior and the negative Hessian from log likelihood (2):

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k +\varvec{\Psi }&{} -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\ -(\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) )^T &{} {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1} \end{array} \right] \end{aligned}$$

which we can write as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }, \\ \textbf{G}&= \textbf{C}^T = -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) , \\ \textbf{D}&= {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left( {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}\right) ^{-1}\left( \varvec{\Omega }^{-1}\otimes (\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}) \right) \textbf{D}_k. \end{aligned} \end{aligned}$$

To simplify the notation, we take \(\textbf{E}=\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \) and \(\textbf{F}={\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}\). It is simple to check that \(\textbf{EF}^{-1}=\textbf{I}_{k}\otimes {\varvec{\Omega }}^{-1}\textbf{B}^T, \textbf{F}^{-1}\textbf{E}^T=\textbf{I}_{k}\otimes \textbf{B}\varvec{\Omega }^{-1}\) and \(\textbf{EF}^{-1}\textbf{E}^{T} = \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\). The Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) (\(\varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})\)) then becomes:

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left( {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}\right) ^{-1}\left( \varvec{\Omega }^{-1}\otimes (\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}) \right) \textbf{D}_k\\&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\textbf{EF}^{-1}\textbf{E}^T\right) \textbf{D}_k+\varvec{\Psi } -\textbf{D}_k^T\textbf{E} \left( \textbf{F}+\varvec{\Lambda }^{-1}\right) ^{-1}\textbf{E} ^T \textbf{D}_k\\&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T\right) \textbf{D}_k+\varvec{\Psi }. \end{aligned} \end{aligned}$$

Note that the only term that involves \({\textbf{X}}\) is \(\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T\) which we can re-write by taking the Cholesky decomposition of \(\varvec{\Lambda }^{-1}=\textbf{LL}^T\) and by (A4) (with \(\textbf{A}=\textbf{F}\), \(\textbf{P}=\textbf{I}_{kp}\), \(\textbf{U}=\textbf{V}^T=\textbf{L}\)) as

$$\begin{aligned} \begin{aligned} \textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T&=\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\textbf{LL}^T)^{-1})\textbf{E}^T\\&=\textbf{E}\textbf{F}^{-1}\textbf{L}(\textbf{I}_{kp}+\textbf{L}^T\textbf{F}^{-1}\textbf{L})^{-1}\textbf{L}^T\textbf{F}^{-1}\textbf{E}^T. \end{aligned} \end{aligned}$$
(11)

Then, we have the following information bound (derivation in Appendix 5)

$$\begin{aligned} \begin{aligned} \textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T&=\textbf{E}\textbf{F}^{-1}\textbf{L}(\textbf{I}_{kp}+\textbf{L}^T\textbf{F}^{-1}\textbf{L})^{-1}\textbf{L}^T(\textbf{E}\textbf{F}^{-1})^{T}\\&\le \textbf{E}\textbf{F}^{-1}\varvec{\Lambda }^{-1} (\textbf{E}\textbf{F}^{-1})^{T}\\&=\left( \textbf{I}_{k}\otimes (\textbf{B}{\varvec{\Omega }}^{-1})^T\right) \varvec{\Lambda }^{-1} \left( \textbf{I}_{k}\otimes (\textbf{B}{\varvec{\Omega }}^{-1})^T\right) ^T. \end{aligned} \end{aligned}$$

This bound is similar to the Normal-MGIG case (Sect. 2.4.1) in that it depends on the marginal regression coefficients \(\textbf{B}\varvec{\Omega }^{-1}\) and is also not a function of \({\textbf{X}}\). We observe that this bound is also sharp, as equality is achieved when \(\textbf{X}^T\textbf{X}\) tends to infinity. In addition, this bound can guide the choice of experiments when the goal is the estimation of the precision matrix. Again, one should choose experiments that have large marginal effects and for which the prior on the conditional effects is highly certain. More practical advice for domain scientists is given in the Discussion (Sect. 5).

3 Simulation Study

In order for an experimental design to help infer the network structure, such an experiment should at least provide better results than not doing any experiment (a null experiment corresponding to \(\textbf{X}=\textbf{0}\)). In addition, we evaluate a standard type of experiment, denoted "specific", in which the treatment affects only one of the nodes. For each experimental design (null, random, specific), we calculate the KL divergence between the prior and the posterior, which represents the information gained by each experiment. To evaluate point estimation, we compare the performance of the different priors using Stein's loss (Dey and Srinivasan 1985) between the maximum a posteriori (MAP) estimate and the true value of \({\varvec{\Omega }}\).
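For concreteness, a common form of Stein's (entropy) loss for a precision matrix is sketched below (our code); it is zero exactly when the estimate equals the truth.

```python
import numpy as np

def steins_loss(Omega_hat, Omega_true):
    """Stein's (entropy) loss tr(A) - log det(A) - k with A = Omega_hat Omega_true^{-1}."""
    k = Omega_true.shape[0]
    A = Omega_hat @ np.linalg.inv(Omega_true)
    _, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - k

Omega = np.array([[2.0, 0.5], [0.5, 1.0]])
print(steins_loss(Omega, Omega))            # 0.0: perfect estimate
print(steins_loss(np.eye(2), Omega))        # > 0: imperfect estimate
```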

We simulate data under the 6 covariance structures in Wang (2012) with \(k=50\) responses and \(p=50\) specific predictors (\(\textbf{B}=\textbf{I}_{50}\)). Each simulation is repeated 100 times.

  • Model 1: AR(1) model with \(\sigma _{ij}=0.7^{|i-j|}\)

  • Model 2: AR(2) model with \(\omega _{ii}=1\), \(\omega _{i-1,i}=\omega _{i,i-1}=0.5\), \(\omega _{i-2,i}=\omega _{i,i-2}=0.25\) for \(i=1,\dots ,k\)

  • Model 3: Block model with \(\sigma _{ii}= 1\) for \(i=1,\dots ,k\), \(\sigma _{ij}= 0.5\) for \(1\le i\ne j\le k/2\), \(\sigma _{ij}=0.5\) for \(k/2 + 1\le i\ne j\le k\) and \(\sigma _{ij}=0\) otherwise.

  • Model 4: Star model with every node connected to the first node, with \(\omega _{ii}=1\), \(\omega _{1,i}=\omega _{i,1}= 0.1\) for \(i=1,\dots ,k\), and \(\omega _{ij}= 0\) otherwise.

  • Model 5: Circle model with \(\omega _{ii}= 2\), \(\omega _{i-1,i}=\omega _{i,i-1}= 1\) for \(i=2,\dots ,k\), and \(\omega _{1,k}=\omega _{k,1}= 0.9\).

  • Model 6: Full model with \(\omega _{ii}= 2\) and \(\omega _{ij}= 1\) for \(i\ne j \in \{1,\dots ,k\}\).

We take sample sizes ranging from 200 to 2200 in increments of 50.
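A few of these structures are easy to construct directly; the sketch below (our code) builds Models 1, 2 and 4 and simulates responses under a random design with the specific coefficient matrix \(\textbf{B}=\textbf{I}_k\).

```python
import numpy as np

k = 50

# Model 1: AR(1) covariance, sigma_ij = 0.7^{|i-j|}
Sigma1 = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))

# Model 2: AR(2) precision, unit diagonal with 0.5 and 0.25 on the first two off-diagonals
Omega2 = (np.eye(k)
          + 0.5 * (np.eye(k, k=1) + np.eye(k, k=-1))
          + 0.25 * (np.eye(k, k=2) + np.eye(k, k=-2)))

# Model 4: star precision, every node connected to the first node with weight 0.1
Omega4 = np.eye(k)
Omega4[0, 1:] = 0.1
Omega4[1:, 0] = 0.1

# simulate responses under a random design with the specific coefficients B = I_k
rng = np.random.default_rng(10)
n, B = 200, np.eye(k)
Sigma4 = np.linalg.inv(Omega4)
X = rng.normal(size=(n, k))
Y = X @ B @ Sigma4 + rng.multivariate_normal(np.zeros(k), Sigma4, size=n)
```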

For each covariance structure, we test two priors, each with two levels of uncertainty, and with and without bias in the prior of \(\textbf{B}\). Namely,

  1. Normal-Wishart prior with \(\lambda =2k+2\), \(\varvec{\Phi }=10^{-3} \textbf{I}_k\) and two uncertainty levels of \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case) and (ii) \(\varvec{\Lambda }=10^3 \textbf{I}_p\) (uncertain case).

  2. Normal-MGIG prior with \(\lambda =k+1\), \(\varvec{\Psi }=\varvec{\Phi }=10^{-3} \textbf{I}_k\) and two uncertainty levels of \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\textbf{I}_{p}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3 \textbf{I}_p\) (uncertain case).

The biased case for the prior of \(\textbf{B}\) is given by a biased prior mean \(\textbf{B}_0 = \textbf{B} + \epsilon \) with \(\epsilon \sim N(0,1)\). Note that we also test a case with a smaller bias and present these results in the Appendix. In addition, we test intermediate uncertainty levels \(\varvec{\Lambda } = 10 \textbf{I}_{p}\) and \(\varvec{\Lambda } = 0.1 \textbf{I}_{p}\) in the Appendix.

In addition to these two priors, we test two shrinkage priors: the chain graph lasso (cglasso) prior (Shen and Solis-Lemus 2020) and the multivariate regression lasso (mlasso) prior. Both place a graphical LASSO prior (Wang 2012) on \({\varvec{\Omega }}\). The cglasso prior places an independent Laplace prior on the entries of \(\textbf{B}\), whereas the mlasso prior places an independent Laplace prior on the entries of \(\tilde{\textbf{B}}=\textbf{B} {\varvec{\Omega }}^{-1}\). Specifically,

  1. cglasso prior: \(\textbf{B}_{i,j}\sim \text {Laplace}(\lambda ^2)\) and \(\omega _{k,k'}\sim \text {Laplace}(\lambda ^2)\), \(\omega _{k,k}\sim \text {Exp}(\lambda )\), with two shrinkage levels \(\lambda =2,10\)

  2. mlasso prior: \((\textbf{B}\varvec{\Omega }^{-1})_{i,j}\sim \text {Laplace}(\lambda ^2)\) and \(\omega _{k,k'}\sim \text {Laplace}(\lambda ^2)\), \(\omega _{k,k}\sim \text {Exp}(\lambda )\), with two shrinkage levels \(\lambda =2,10\)

We sample from these priors with rejection sampling, whose acceptance rate drops below \(10^{-8}\) when \(k>10\), so for the shrinkage priors we use \(k=10\). We then sample the posterior with 10,000 iterations, discarding the first 5,000 as burn-in. KL divergences are estimated using the R package FNN.
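To make this sampling bottleneck concrete, here is a minimal sketch of such a rejection sampler (our own illustration, not the paper’s code; the Laplace/exponential rate parameterization is an assumption, and positive definiteness is taken as the acceptance criterion):

```r
# Sketch (ours): rejection sampling from a graphical-LASSO-type prior on Omega.
rlaplace <- function(n, rate) rexp(n, rate) * sample(c(-1, 1), n, replace = TRUE)

draw_omega_prior <- function(k, lambda) {
  repeat {
    Om <- matrix(0, k, k)
    Om[upper.tri(Om)] <- rlaplace(k * (k - 1) / 2, lambda)  # off-diagonals ~ Laplace
    Om <- Om + t(Om)
    diag(Om) <- rexp(k, lambda)                             # diagonals ~ Exponential
    # accept only positive definite draws; this is what makes k > 10 impractical
    if (min(eigen(Om, symmetric = TRUE, only.values = TRUE)$values) > 0) return(Om)
  }
}
# draw_omega_prior(10, 2)   # feasible for k = 10; essentially never returns for large k
```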

3.1 Information Gain by Experiments

We evaluated how much the posterior distribution changes relative to the prior under different experimental settings. Intuitively, if the prior and the posterior are similar, then not much information is gained from the data or the experiment. However, simply evaluating the difference between prior and posterior is not enough, since the sample size itself has an effect. Thus, to better evaluate the effect of the design rather than the effect of the sample size, we need a baseline experiment. We take the null design (\(\textbf{X}=0\)) as this baseline because it corresponds to performing no experiment, and one would expect it to provide the least information. That is, we consider an experiment to provide useful information when it outperforms the null experiment in the same setting, and thus we report the difference between the given design (random or specific) and the null design as our measure of performance in the figures below.

For each simulation at a given sample size, we calculate (1) the KL divergence between the prior and the posterior for a given experiment (random or specific) and (2) the KL divergence between the prior and the posterior under the null experiment. The information gain due to the experiment is then evaluated as the difference in log KL divergence between the design (random or specific) and the null design. Figure 2 shows the results when simulating data under the covariance structure of Model 4 (star model). Points above 0 indicate more information gained by conducting a non-null experiment. We expect random experiments to do better in this case because the marginal effect \(\textbf{B}{\varvec{\Omega }}^{-1}\) can be large. The results for all other covariance structures (with similar conclusions) can be found in Appendix A8, and the simulation results with shrinkage priors can be found in Appendix A9; the latter largely align with the results under the Normal-MGIG and Normal-Wishart priors.
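A hedged sketch of how this metric could be computed with the FNN package mentioned above (the choice to vectorize the \({\varvec{\Omega }}\) draws by their upper triangle, the neighbor count, and the averaging over neighbor orders are our own assumptions, and the named draw arrays are hypothetical objects):

```r
# Sketch (ours): nearest-neighbor KL estimates between prior and posterior draws,
# and the log-KL difference of a design relative to the null design.
library(FNN)
# prior_draws, post_design, post_null: k x k x nDraws arrays of Omega draws (hypothetical)
vec_upper <- function(draws) t(apply(draws, 3, function(M) M[upper.tri(M, diag = TRUE)]))
kl_design <- mean(KL.divergence(vec_upper(prior_draws), vec_upper(post_design), k = 5))
kl_null   <- mean(KL.divergence(vec_upper(prior_draws), vec_upper(post_null),   k = 5))
info_gain <- log(kl_design) - log(kl_null)   # values above 0 favor the non-null design
```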

Fig. 2

Difference in log KL divergence between prior and posterior comparing random experiment (\({\textbf{X}} \ne \textbf{0}\)) and specific experiment (diagonal \({\textbf{X}}\)) versus null experiment (\({\textbf{X}}=\textbf{0}\)) under a star model with 50 responses and 50 predictors, with and without bias on the prior of \(\textbf{B}\). Lines are averages over 100 repeats, while error bars are 0.975 and 0.025 quantiles. The only case in which the experiment gains information compared to the null experiment is when we have certain and unbiased prior knowledge on \(\textbf{B}\) (Normal-MGIG certain case in green). With a biased prior, MGIG still has better information gain while certain Wishart also leads to better information gains compared to a null design

Except for the case of the Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), the difference in information gain eventually reaches 0 for all other prior cases as the sample size increases, meaning that there is no longer any information gain from doing an experiment (random or specific) compared to no experiment at all. The Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), in contrast, stays at a distance from 0, and this distance does not depend much on the sample size. With a biased prior, MGIG still shows better information gain, while the certain Wishart also leads to better information gain compared to a null design. More results on other types of biases can be found in Appendix A10.

3.2 Performance on Point Estimation of \({\varvec{\Omega }}\)

While KL divergence evaluates the information gain of the experiment, information gain does not imply better point estimates. To compare the performance of point estimation of \({\varvec{\Omega }}\), we use the difference in log Stein’s loss between the experiment (random or specific) and the null design. The results are shown in Fig. 3. Points below 0 indicate more accurate point estimates under an experimental design than under the null experiment.

We observe a pattern similar to the information gain results. That is, except for the Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), the difference in log Stein’s loss eventually reaches 0 for all other prior cases regardless of the experimental design (random or specific). The Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case) stays at a distance from 0, and this distance does not depend much on the sample size. While biased priors such as the certain MGIG and certain Wishart can undermine our ability to obtain good point estimates from experiments, smaller biases do not have such a strong negative effect (see Appendix A10).

Fig. 3

Difference in log Stein’s loss of random experiment (\({\textbf{X}} \ne \textbf{0}\)) and specific experiment (diagonal \({\textbf{X}}\)) versus null experiment (\({\textbf{X}}=\textbf{0}\)) under Star models with 50 responses and 50 predictors, with and without biases on the prior of \(\textbf{B}\). Lines are averages over 100 repeats, while error bars are 0.975 and 0.025 quantiles. The only case in which the random design mostly has lower Stein’s loss is when we have certain and unbiased prior knowledge on \(\textbf{B}\) (Normal-MGIG certain case in green). All other unbiased prior cases eventually reach the zero line (no difference in MAP performance of \({\varvec{\Omega }}\) compared to null experiment). With a biased prior, certain MGIG and certain Wishart lead to less accurate point estimates

4 Human Gut Microbiome Data

We revisit a comparison similar to the one in the motivating toy example (Sect. 1.3): the posterior distribution of partial correlations among responses under different priors and experiments. We use data from Claesson et al. (2012), which collected fecal microbiota composition from 178 elderly subjects, together with the subjects’ residence type (community, day-hospital, rehabilitation, or long-term residential care) and diet (data at O’Toole (2008)), with the goal of understanding the interactions between microbes and environment via partial correlations.

We use the MG-RAST server (Meyer et al. 2008) for profiling with an e-value of 5, 60% identity, alignment length of 15 bp, and minimal abundance of 10 reads. Unclassified hits are not included in the analysis. Genera with more than 0.5% relative abundance in more than 50 samples are selected as focal genera, and all other genera serve as the reference group. This yields 13 responses and 11 predictors (i.e., \(p=11,k=14\)). We then fit a Gaussian chain graph model to the data. Since we cannot design an experiment on these already collected data, we compare against a hypothetical null experiment: we draw a simulated sample from a Gaussian chain graph model whose regression coefficients and precision matrix are set to the MLEs from the original data, where the null experiment, by definition, has \(\textbf{X} = \textbf{0}\).
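As an illustration of the filtering rule, a short sketch in R (assuming a hypothetical samples-by-genera count matrix `counts`; the paper’s exact preprocessing may differ):

```r
# Sketch (ours): select focal genera with > 0.5% relative abundance in more than 50 samples.
rel_abund <- counts / rowSums(counts)                  # relative abundances per sample
focal     <- colSums(rel_abund > 0.005) > 50           # filtering rule described above
Y_counts  <- counts[, focal, drop = FALSE]             # focal genera (responses)
reference <- rowSums(counts[, !focal, drop = FALSE])   # remaining genera pooled as reference
```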

Here, we focus on the partial correlation between Bacteroides, one of the largest genera in the gut, and Clostridium, a group known to be pathogenic. Figure 4 shows the posterior distribution of this partial correlation under the two different priors (Normal-Wishart and Normal-MGIG) with three different uncertainty levels on \(\textbf{B}\): \(\varvec{\Lambda }=10^{-1}\textbf{I}_p\), \(10^0 \textbf{I}_p\), \(10^{1}\textbf{I}_p\) (shown as columns: 0.1, 1, 10), with \(\varvec{\Psi }= 0.01\textbf{I}_k\) and \(\varvec{\Phi }=0.01\textbf{I}_k\). Just as in the toy example (and in agreement with the expectation of our theory in Sect. 2), only the Normal-MGIG prior results in lower uncertainty when an experiment is performed.
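The partial correlations follow the usual transformation of the precision matrix, \(\rho _{ij} = -\omega _{ij}/\sqrt{\omega _{ii}\,\omega _{jj}}\); a minimal R sketch (ours) applied to posterior draws of \({\varvec{\Omega }}\):

```r
# Sketch (ours): partial correlation matrix from a single draw of the precision matrix Omega.
partial_cor <- function(Omega) {
  D <- diag(1 / sqrt(diag(Omega)))
  P <- -D %*% Omega %*% D       # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
  diag(P) <- 1
  P
}
# Applying partial_cor() to each MCMC draw and extracting the Bacteroides-Clostridium
# entry gives posterior distributions of the kind shown in Fig. 4.
```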

Fig. 4

Posterior distribution for the partial correlation between Bacteroides, one of the largest genera in the gut, and Clostridium, a group known to be pathogenic. Rows correspond to the two priors: Normal-MGIG (top) and Normal-Wishart (bottom). Columns correspond to the certainty level on the prior of \(\textbf{B}\) (\(\varvec{\Lambda }=10^{-1}\textbf{I}_p\), \(10^0 \textbf{I}_p\), \(10^{1}\textbf{I}_p\); shown as 0.1, 1, 10, respectively). We observe that only under the MGIG prior does the experiment reduce the uncertainty in the posterior

5 Discussion

Chain graph models are relevant in genomic, microbiome, and ecological applications because they encode conditional dependence among responses and predictors and focus on the estimation of the precision matrix, an important parameter for understanding interactions within microbial and ecological communities. Here, we evaluated the effect of prior knowledge on conducting experiments to better estimate the precision matrix in a chain graph model. Using the Laplace approximation of the marginal posterior of the precision matrix \({\varvec{\Omega }}\) as the optimality criterion for the experimental design, we proved theoretically that without prior knowledge that identifies \(\textbf{B}\) and \({\varvec{\Omega }}\) separately (rather than only the combination \(\textbf{B}{\varvec{\Omega }}^{-1}\)), experiments provide no gain in knowledge for the estimation of \({\varvec{\Omega }}\). That is, the Laplace approximation of the marginal posterior of \({\varvec{\Omega }}\) is not a function of \({\textbf{X}}\). We also showed a bound on the information gain under the Normal-MGIG prior which generalizes to the case of any independent priors. Our findings are highly relevant for domain scientists who aim to design optimal experiments to infer the precision matrix.

We further verified our theoretical conclusions using numerical simulations, where we showed that without certain prior knowledge on \(\textbf{B}\), experiments provide nearly no information gain, nor do they improve the estimation of \({\varvec{\Omega }}\). Furthermore, it is not enough for an experiment to be specific; prior knowledge about this specificity is also needed (more examples below under Practical advice for domain scientists).

Connections to multicollinearity. Chain graph models have a dependence property that is similar to multicollinearity in classical regression. Take the conditional distribution of the qth response node in sample \({\textbf{Y}}_i \in \mathbb {R}^k\) with the design \({\textbf{X}}_i \in \mathbb {R}^p\):

$$\begin{aligned} \left[ Y_{qi}\mid {\textbf{X}}_i= \textbf{x}_i,{\textbf{Y}}_{-q,i}= \textbf{y}_{-q,i}\right] = \frac{1}{\omega _{qq}}\sum _{j=1}^p\beta _{jq}x_{ji}-\frac{1}{\omega _{qq}}\sum _{l\ne q} \omega _{ql} y_{li}+\epsilon _{qi} \end{aligned}$$
(12)

where \(\epsilon _{qi}\sim {\mathcal {N}}(0,1/\omega _{qq})\), \(\beta _{jq}\) is the (j,q) entry of the \(\textbf{B}\) matrix, \(\omega _{qq}\) and \(\omega _{ql}\) are the (q,q) and (q,l) entries of the \({\varvec{\Omega }}\) matrix, and \({\textbf{Y}}_{-q,i}\) is the vector of responses for sample i without the qth response. Multicollinearity arises in this model because the correlation between \({\textbf{Y}}_l\) and \({\textbf{X}}_j\) is 0 only if the (j,l) entry of \(\textbf{B}\varvec{\Omega }^{-1}\) is 0, that is, only if \(\textbf{B}_{j\cdot } {\varvec{\Omega }}^{-1}_{\cdot l}=0\), a condition that rarely holds for all j and l. Thus, in practice, we are most likely to have some intrinsic multicollinearity in chain graph models. In univariate settings, we could in principle design experiments to avoid problems caused by multicollinearity. When such experiments are hard to conduct, an alternative approach is to place informative priors on some of the parameters. For instance, if two predictors are collinear, the sum of the two respective regression coefficients can be easily identified, but not the individual ones; however, one individual coefficient can be identified if we have an informative prior on the other. Carrying this intuition to chain graphs, where we have multicollinearity between the \(\omega \)’s and \(\beta \)’s, prior knowledge on the regression coefficients \(\textbf{B}\) might actually help the estimation of \(\varvec{\Omega }\) under certain experimental conditions. Thus, our work to explore the interplay between experimental design and prior knowledge in chain graph models is also justified by the classical regression setting, which has routinely used prior knowledge to infer parameters in cases where multicollinearity arises.
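A small numerical check of this point (our own sketch; the specific \(\textbf{B}\) and \({\varvec{\Omega }}\) below are arbitrary): the marginal cross-covariance between predictors and responses equals \(\textbf{B}{\varvec{\Omega }}^{-1}\), which is generally dense even when \(\textbf{B}\) is sparse.

```r
# Sketch (ours): the marginal association between X_j and Y_l is governed by (B Omega^{-1})_{jl},
# which can be nonzero even for entries where B itself is zero.
library(MASS)
set.seed(1)
k <- 3; p <- 2; n <- 1e4
Omega <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), k, k)
B <- matrix(c(1, 0, 0,          # predictor 1 is "specific" to response 1
              0, 1, 0), p, k, byrow = TRUE)
Sigma <- solve(Omega)
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% B %*% Sigma + mvrnorm(n, rep(0, k), Sigma)
round(cov(X, Y), 2)             # empirical cross-covariance, approximately B %*% Sigma
round(B %*% Sigma, 2)           # e.g., entry (1, 3) is nonzero although B[1, 3] = 0
```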

Future directions. As mentioned, the difficulty of designing experiments for the estimation of the precision matrix in a chain graph model is similar to the multicollinearity problem in univariate regression. In both cases, a prior can help identify one set of parameters to better infer another set of parameters. A natural question is whether this is also true in a general Gibbs measure with two-body interactions. For instance, the auto-logistic model (Ising model) has been used to infer networks, and existing work has discussed the experimental design of this model when the effect of the treatment is completely known (Jiang et al. 2019). One interesting question is whether, in order to design experiments effectively to infer the network among responses with this model, we need prior knowledge on the effect of the treatment. One difficulty in answering this question is the intractable normalizing constant (partition function) of a general Gibbs measure. Approximations of the partition function proposed by Wainwright and Jordan (2006) might help connect it with what we already know for the Gaussian case.

Practical advice for domain scientists. As shown in both our theoretical and simulation studies, for experiments to aid the estimation of the precision matrix under a chain graph model, the experiment should have large marginal effects, and the experimenter should have prior knowledge of the conditional effects of the predictors on the responses, with high certainty on those conditional effects. For instance, if an experimenter wants to understand a microbial community, she could try different candidate experiments on the community to identify the treatment that alters the community the most (i.e., that has large marginal effects). Then, the experimenter should culture several of the species to evaluate the effect of those candidate treatments (i.e., gain prior knowledge of \(\textbf{B}\)). By focusing on a few single species, the experimenter will (ideally) have high certainty on the conditional effects of some of the treatments.

Similarly, an experiment in which the experimenter knocks out one gene and evaluates the reaction of another gene in order to infer the interaction between the two genes is useful because it can affect both target genes (marginal effect) while we know (by assumption) that it is specific to one of the genes (good prior knowledge of its conditional effect). While there is keen interest in experiments that are specific (e.g., gene knockouts), our theory shows that specificity (that is, the row in \(\textbf{B}\) having only one nonzero entry) is not necessary for the experiment to be useful in the inference of the precision matrix \({\varvec{\Omega }}\). However, specificity is helpful, given that it is easier to obtain prior knowledge about specificity (being certain that some entries of \(\textbf{B}\) are zero) than to obtain prior knowledge about multiple nonzero entries of \(\textbf{B}\).

For any given experiment, the experimenter has control over the design (\({\textbf{X}}\)) and the prior on the regression coefficients (\(\textbf{B}\)). Our findings show that without prior knowledge of \(\textbf{B}\) for at least some individual predictors, experiments produce zero information gain for the estimation of \({\varvec{\Omega }}\). Our work draws attention to the importance of a thorough analysis of priors and experimental design for domain scientists who aim to infer biological network structures from controlled experimental data.