1 Introduction

1.1 Background

Networks are graphical structures that arise in many biological applications. For example, microbial networks are a graphical representation of a microbial community where nodes represent microbial taxa and edges correspond to some form of interaction (Matchado et al. 2021). Microbial networks arise in microbiome studies from soil (Barberán et al. 2012) to the human gut (Baldassano and Bassett 2016; Claesson et al. 2012). Other biological networks include brain networks, where nodes correspond to brain regions (or voxels) and edges correspond to connections between them (Rubinov and Sporns 2010), and ecological networks which, like microbial networks, represent a community of species in the wild (e.g., food webs; Pimm et al. 1991).

Many biological networks need to be estimated from data, and given their broad applicability, a plethora of network methods has been developed to reconstruct biological networks from a wide variety of data types: from microbiome sequencing data (Jovel et al. 2016) to MRI images of brain regions (van Straaten and Stam 2013). Many network methods used in microbiome studies, however, estimate a network under static conditions. That is, researchers estimate the human gut microbial network from sequencing reads of gut samples from different individuals without incorporating information about treatments applied to those individuals. In the absence of any treatments or other disturbances of the network structure, Gaussian graphical models are among the most widely used models to estimate such networks (Layeghifard et al. 2017).

Many researchers, however, want to use information from different experimental treatments to better infer the biological network. For example, we can collect human gut microbiome samples from control patients and from patients under antibiotic treatment. The different responses (abundances of microbes) under different treatments can provide information on the microbial network structure by, for example, eliminating key players in the community, as in the case of antibiotic treatments. The conditions are no longer static (no treatment). Instead, the nodes are perturbed by experimental treatments, and these perturbations provide information on the network structure itself. We thus highlight that experimental treatments are incorporated into the model to better estimate the network structure, not because there is an interest in the effect of these treatments on the responses. On the contrary, the exact experimental treatment effects are nuisance parameters, and our only focus is the estimation of the network structure with the help that the experimental treatments provide.

1.2 The Gaussian Chain Graph Model

Standard Gaussian graphical models no longer apply when there are experimental treatments affecting the network structure. Thus, chain graph models arise as a suitable alternative for the “network under treatment” setting (Lauritzen and Richardson 2002). More formally, a Gaussian chain graph model with k response nodes (\({\textbf{Y}}_i\in \mathbb {R}^{ k}\)) and p predictor nodes (\({\textbf{X}}_i\in \mathbb {R}^{p}\)) is given by:

$$\begin{aligned} \textbf{Y}_i \mid \textbf{X}_i, \textbf{B}, {\varvec{\Omega }} \sim \mathcal {N}({\varvec{\Omega }}^{-1} \textbf{B}^T \textbf{X}_i, {\varvec{\Omega }}^{-1}) \end{aligned}$$
(1)

where \(\textbf{B} \in \mathbb {R}^{p \times k}\) is the matrix of regression coefficients (e.g., treatment effects) and \({\varvec{\Omega }} \in \mathbb {R}^{k \times k}\) is the precision matrix among responses (e.g., the network structure). Note that the treatment effects could also be used to account for subject heterogeneity when samples are not independent and identically distributed draws from a Gaussian graphical model. In the microbial network example described above, the responses correspond to the abundances of microbes in the samples and the predictors correspond to the experimental treatments. Our parameter of interest is \({\varvec{\Omega }}\), which represents the network among responses (the microbial network), while \(\textbf{B}\) represents the direct effects of treatments on the responses (e.g., the effect of an antibiotic on different microbes). As mentioned, we are not interested in \(\textbf{B}\) itself. The introduction of \(\textbf{B}\) into the model is done to facilitate inference of \({\varvec{\Omega }}\), so in this sense, \(\textbf{B}\) is a nuisance parameter. Bayesian implementations of Gaussian chain graph models further include prior distributions for \(\textbf{B}\) and \({\varvec{\Omega }}\), which allow the inclusion of prior biological knowledge into the model. Different priors have been proposed for the precision matrix, such as the conjugate Wishart prior, shrinkage priors like the LASSO and adaptive LASSO priors (Wang 2012), spike-and-slab priors (Gan et al. 2019), priors based on matrix decompositions and transformations like the spectrum (Daniels and Kass 1999), Cholesky (Daniels and Pourahmadi 2002), and Givens angle (Kang and Cressie 2011) decompositions, and the reference prior (Yang and Berger 1994). Unfortunately, because the mean structure involves the precision matrix in chain graph models, a Gibbs sampler is not straightforward to implement. In multivariate linear regression, one can condition on the regression coefficients and then sample the full conditional of the precision matrix using the same sampling method as for a Gaussian graphical model. In contrast, in a Gaussian chain graph model, the mean structure involves the precision matrix, and thus sampling usually requires modifications. For instance, despite following the same derivation as the original graphical LASSO (Wang 2012), the sampler for the graphical LASSO prior on Gaussian chain graph models (Shen and Solis-Lemus 2020) differs significantly in its full conditionals. While a thorough comparison of multiple priors on Gaussian chain graph models is indeed an interesting research direction, it is beyond the scope of this paper; here we focus on the priors that have already been adopted in chain graphs.
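To make the data-generating mechanism concrete, the following minimal sketch (in Python with NumPy; all names are ours and purely illustrative) simulates responses from model (1): each \(\textbf{Y}_i\) has conditional mean \({\varvec{\Omega }}^{-1}\textbf{B}^T\textbf{X}_i\) and conditional covariance \({\varvec{\Omega }}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 200, 2, 3                      # samples, predictor nodes, response nodes
B = rng.normal(size=(p, k))              # conditional treatment effects (nuisance)
A = rng.normal(size=(k, k))
Omega = A @ A.T + k * np.eye(k)          # a positive-definite precision matrix (the network)
Sigma = np.linalg.inv(Omega)             # conditional covariance of Y given X

X = rng.normal(size=(n, p))              # design matrix (e.g., treatments)
# Y_i | X_i ~ N(Omega^{-1} B^T X_i, Omega^{-1}) as in Eq. (1);
# row i of the mean below is (Omega^{-1} B^T X_i)^T = X_i^T B Omega^{-1}
Y = X @ B @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)
```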

Under the setup of Bayesian inference for the Gaussian chain graph model, a rather unexplored statistical question is which experimental design on the predictors (design matrix \({\textbf{X}}\)) would improve the inference of the precision matrix (\({\varvec{\Omega }}\)). Traditional experimental design in linear models involves selecting a data matrix \({\textbf{X}}\) such that the variance (covariance) of the estimator of the regression coefficient \(\hat{\beta }\) is small, which translates into selecting \({\textbf{X}}\) such that \(\varvec{\Sigma }_{{\textbf{X}}} = \frac{1}{n} {\textbf{X}}^T {\textbf{X}}\) is as large as possible. The design problem is then built around an optimality criterion (or measure of success) \(V(\varvec{\Sigma }_{{\textbf{X}}})\) or \(V({\textbf{X}})\). For example, under the D-optimality setting, the design aims to maximize the determinant of \(\varvec{\Sigma }_{{\textbf{X}}}\), or equivalently to minimize the determinant of its inverse: \(V_D(\varvec{\Sigma }_{{\textbf{X}}}) = \det (\varvec{\Sigma }^{-1}_{{\textbf{X}}})\). Other optimality criteria involve maximizing the smallest eigenvalue of \(\varvec{\Sigma }_{{\textbf{X}}}\) (E-optimality) or minimizing the trace of \(\varvec{\Sigma }_{{\textbf{X}}}^{-1}\) (A-optimality). Bayesian alternatives of these designs change slightly with the incorporation of priors. For example, the Bayesian D-optimality setting involves maximizing \(\det \left( \varvec{\Sigma }_{{\textbf{X}}} + \frac{1}{n} \textbf{V}_0^{-1} \right) \) where \(\textbf{V}_0\) is the prior covariance matrix of \(\beta \).
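As a hedged illustration of these criteria (our own code, not tied to any specific design software; the candidate designs are simply random matrices), the sketch below scores design matrices under the classical D-, A-, and E-criteria and the Bayesian D-criterion described above.

```python
import numpy as np

def sigma_x(X):
    """Scaled information matrix Sigma_X = X^T X / n."""
    return X.T @ X / X.shape[0]

def d_criterion(X):
    """D-optimality: maximize det(Sigma_X), i.e., minimize det(Sigma_X^{-1})."""
    return np.linalg.det(sigma_x(X))

def a_criterion(X):
    """A-optimality: minimize trace(Sigma_X^{-1})."""
    return np.trace(np.linalg.inv(sigma_x(X)))

def e_criterion(X):
    """E-optimality: maximize the smallest eigenvalue of Sigma_X."""
    return np.linalg.eigvalsh(sigma_x(X)).min()

def bayes_d_criterion(X, V0):
    """Bayesian D-optimality: maximize det(Sigma_X + V0^{-1} / n),
    with V0 the prior covariance of the regression coefficients."""
    return np.linalg.det(sigma_x(X) + np.linalg.inv(V0) / X.shape[0])

# score a handful of random candidate designs (purely illustrative)
rng = np.random.default_rng(1)
n, p = 100, 3
V0 = np.eye(p)
candidates = [rng.normal(size=(n, p)) for _ in range(5)]
best = max(candidates, key=lambda X: bayes_d_criterion(X, V0))
```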

The question of optimal experimental design is not unexplored from the biological perspective. Researchers have long been interested in experiments with specificity: the treatment is conditionally independent of all but one of the response nodes. However, it can be difficult or impossible to design an experiment in which only one response node (e.g., one microbe) is perturbed. Thus, in the absence of specific treatments, we can study which other experimental designs can aid in the estimation of the network structure.

Here, we address the question of whether we can find an optimal Bayesian experimental design to infer \({\varvec{\Omega }}\), our parameter of interest, in a Gaussian chain graph model. We focus on the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) as our optimality criterion. We choose to focus on the marginal precision matrix of \({\varvec{\Omega }}\) (instead of the joint precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\)), given that we want to account for the nuisance parameter \(\textbf{B}\) without focusing on it directly. In addition, we use the Laplace approximation of the precision matrix (instead of the exact precision matrix), given that the exact precision matrix tends to be intractable for most posteriors (but see Sect. 1.3).

We study the case of four different prior distributions: a flat prior, the conjugate Normal-Wishart prior, the novel Normal-Matrix Generalized Inverse Gaussian (Normal-MGIG) prior, and a general independent prior. For each prior, we obtain the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) to use as our optimality criterion to find the optimal experimental design \({\textbf{X}}\). We find, however, that the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is not a function of \({\textbf{X}}\) for either the flat or the conjugate prior. This implies that it is difficult, if not impossible, to find an optimal Bayesian experimental design to aid in the estimation of \({\varvec{\Omega }}\) for these two priors. In contrast, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is a function of \({\textbf{X}}\) for the novel Normal-MGIG prior and for a general independent prior, which allows the search for an optimal experimental design. However, we discover an information bound for both of these priors, which implies that there is a theoretical limit to how much information can be gained from experiments in the inference of \({\varvec{\Omega }}\).

Our work has important repercussions for domain scientists who use experimental settings to aid in the estimation of the network structure (\({\varvec{\Omega }}\)). Under a Bayesian Gaussian chain graph model, the choice of prior is highly impactful, but even when appropriate priors are selected, there is a bound on the information gained from the experiment. This bound depends on our prior knowledge about the (conditional) effect of the experiment itself.

1.3 Motivating Example: Toy Data with Explicit Posterior Precision

We show here one example that illustrates the interplay between prior knowledge and experiments. We simulate \(k=3\) responses, \(p=1\) predictor and \(n=200\) samples under an AR(1) model with \(\sigma _{ij}=0.7^{|i-j|}\) (denoted Model 1 in the Simulations in Sect. 3). We simulate two settings: 1) a null experiment (design matrix \(\textbf{X}=\textbf{0}\) despite a potential experimental effect \(\textbf{B}\ne \textbf{0}\)) and 2) an experiment in which the predictor has an effect only on the third response. We consider two priors, each with two uncertainty levels: 1) the Normal-Wishart prior with \(\lambda =8\), \(\varvec{\Phi }=10^{-3} \textbf{I}_3\) and two uncertainty levels for \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3\) (uncertain case), and 2) the Normal-MGIG prior with \(\lambda =4\), \(\varvec{\Psi }=\varvec{\Phi }=10^{-3} \textbf{I}_3\) and two uncertainty levels for \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3\) (uncertain case). We set \(\textbf{B}_0= \textbf{B}\) (see Sect. 2 for more details on the priors). The design matrix is sampled from the standard Normal distribution.

For this toy example, \({\varvec{\Omega }}\) has an explicit marginal posterior distribution, namely a matrix generalized inverse Gaussian (MGIG) distribution (Eq. A1) with parameters \(\lambda + \frac{n}{2}\), \(\varvec{\Psi }+(\textbf{XB}_0)^T (\textbf{XB}_0)\) and \(\varvec{\Phi } + \textbf{Y}^T\textbf{Y}\).
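The sketch below (our own code, mirroring the toy setup under the assumption that \(\textbf{B}_0\) equals the true \(\textbf{B}\)) computes these three posterior parameters from simulated data; sampling from the resulting MGIG distribution itself (e.g., by importance sampling) is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 200, 1, 3

# AR(1) covariance (Model 1) and the implied precision matrix
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
Omega = np.linalg.inv(Sigma)

B = np.array([[0.0, 0.0, 1.0]])          # the predictor affects only the third response
X = rng.normal(size=(n, p))              # experiment; set X = 0 for the null experiment
Y = X @ B @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)

# Normal-MGIG hyperparameters (certain case), with B0 set to the true B
lam, Psi, Phi, B0 = 4.0, 1e-3 * np.eye(k), 1e-3 * np.eye(k), B

# Explicit marginal posterior of Omega: MGIG(lam + n/2, Psi_hat, Phi_hat)
lam_post = lam + n / 2
Psi_hat = Psi + (X @ B0).T @ (X @ B0)
Phi_hat = Phi + Y.T @ Y
```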

Figure 1 shows the posterior distribution of the (1, 2)-th entry of the precision matrix (\(\rho _{12}\)). We can observe that the experiment has an effect on the inference of \(\rho _{12}\) and that this effect differs by prior, with the Normal-MGIG prior displaying less variability than the conjugate Normal-Wishart prior. Empirical observations like this toy example motivated our pursuit of the theoretical interplay between prior distributions and experimental design in the inference of a precision matrix in a chain graph model.

Fig. 1

Posterior distribution for the (1, 2)-th entry of the precision matrix in the toy simulated data with \(k=3\) responses, \(p=1\) predictor and \(n=200\) samples under an AR(1) model, with and without an experiment affecting only the third node, and under the Normal-MGIG and Normal-Wishart priors with different uncertainty levels. Note that we are showing an entry of the precision matrix, not the covariance matrix

1.4 Structure of the Paper

The structure of the paper is as follows: In Sect. 2, we aim to understand how the choice of prior can influence the effectiveness of an experimental design. We begin with a recap of the Laplace approximation of the posterior precision matrix, and then we derive the Laplace approximation of the marginal posterior precision matrix of the parameter of interest \({\varvec{\Omega }}\) under four priors: a flat prior, the Normal-Wishart conjugate prior, the Normal-MGIG prior and a general independent prior. We show that optimal experimental design is only possible under the two latter priors, and that even in these cases there is an information limit. In Sect. 3, we numerically simulate the posterior under several priors and different experimental designs. We evaluate the results using the Kullback–Leibler divergence between prior and posterior, which measures the information gained by conducting experiments. In addition, we use Stein's loss of the maximum a posteriori (MAP) estimate of the precision matrix, which evaluates the performance of the point estimates from the experiment.

In Sect. 4, we revisit a discussion similar to the one in the motivating toy dataset: we compare the posterior of partial correlations among responses under different experiments and priors for a real human gut microbiome dataset. Finally, in Sect. 5, we conclude with some practical advice for domain scientists as well as future directions.

2 Experimental Design under Different Priors in a Gaussian Chain Graph

Ideally, we want to use the marginal posterior precision matrix of our parameter of interest \({\varvec{\Omega }}\) as the optimality criterion in an experimental design setting. That is, we want to find the optimal design matrix \({\textbf{X}}\) that maximizes the posterior precision of \({\varvec{\Omega }}\). However, this posterior precision matrix can be intractable in many cases, and thus, we will use its Laplace approximation instead. We begin this section with a summary of the Laplace approximation of a posterior distribution. Then, we present the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) under the four priors under study.

2.1 Laplace Approximation of the Posterior Precision Matrix

For a log-concave posterior distribution \(p(\theta | Y)\) for a random variable \(Y\) and parameter of interest \(\theta \), the Laplace approximation of the posterior precision is the negative of the Hessian of the log posterior, \(-\nabla ^2_{\theta }\log p(\hat{\theta }|Y)\), which can be partitioned as the sum of the Hessian of the log prior and the Hessian of the log likelihood, \(\nabla ^2_{\theta }\log p(\hat{\theta }|Y) =\nabla ^2_{\theta }\log p(\hat{\theta })+\nabla ^2_{\theta }\log p(Y|\hat{\theta })\), near the maximum a posteriori (MAP) estimator \(\hat{\theta }\). However, we often do not have a closed-form expression for the MAP; we simply know that it is close to the true parameter. Thus, we make the additional approximation of evaluating at the true parameter instead of the MAP. We note that while the MAP can depend on the experimental design \(\textbf{X}\), this approximation should not have a considerable impact on the results, given that the MAP should be close to the true parameter when the prior has enough support around this true value.
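As a self-contained illustration of the idea, and not of the chain graph computation itself, the sketch below (our code) approximates the posterior precision of a scalar Gaussian mean by the negative finite-difference Hessian of the log posterior evaluated at the true parameter; in this conjugate toy case the exact answer is \(n/\sigma ^2 + 1/\tau ^2\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, tau = 50, 1.0, 2.0              # likelihood and prior standard deviations
theta_true = 0.5
y = rng.normal(theta_true, sigma, size=n)

def log_posterior(theta):
    # N(theta, sigma^2) likelihood with a N(0, tau^2) prior, up to an additive constant
    return -0.5 * np.sum((y - theta) ** 2) / sigma**2 - 0.5 * theta**2 / tau**2

# Laplace approximation: posterior precision ~ -(d^2/dtheta^2) log p(theta | y),
# evaluated here at the true parameter instead of the MAP, as discussed in the text
h = 1e-4
hess = (log_posterior(theta_true + h) - 2 * log_posterior(theta_true)
        + log_posterior(theta_true - h)) / h**2
laplace_precision = -hess                 # matches n / sigma**2 + 1 / tau**2
```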

For the Gaussian chain graph model (1), the Hessian of the log likelihood has the following form (derivation can be found in Appendix A1):

$$\begin{aligned} \begin{aligned} \frac{\partial ^2 \ell }{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vec}\,}}(\textbf{B})^T}&=-\varvec{\Omega }^{-1}\otimes \textbf{X}^T\textbf{X}\\ \frac{\partial ^2 \ell }{\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })^T}&=-\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\ \frac{\partial ^2 \ell }{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vech}\,}}(\varvec{\Omega })^T}&=\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) . \end{aligned} \end{aligned}$$

where \(\textbf{D}_k\) is the duplication matrix (Minka 2000; Magnus and Neudecker 2019), i.e., the matrix such that \(\textbf{D}_k{{\,\textrm{vech}\,}}({\varvec{\Omega }}) = {{\,\textrm{vec}\,}}({\varvec{\Omega }})\), where \({{\,\textrm{vech}\,}}({\varvec{\Omega }})\) denotes the vectorization of the unique parameters of \({\varvec{\Omega }}\) (the upper triangular part in our case); given that \({\varvec{\Omega }}\) is symmetric, there are fewer free parameters.
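A minimal construction of \(\textbf{D}_k\) (our own helper, not an existing library routine) together with a check of the defining identity \(\textbf{D}_k{{\,\textrm{vech}\,}}({\varvec{\Omega }})={{\,\textrm{vec}\,}}({\varvec{\Omega }})\) is sketched below.

```python
import numpy as np

def duplication_matrix(k):
    """D_k of size k^2 x k(k+1)/2 such that D_k vech(A) = vec(A) for symmetric A."""
    pairs = [(i, j) for j in range(k) for i in range(j + 1)]   # vech ordering
    D = np.zeros((k * k, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        D[j * k + i, c] = 1.0          # position of A[i, j] in vec(A)
        D[i * k + j, c] = 1.0          # its symmetric counterpart A[j, i]
    return D

def vech(A):
    """Stack the upper-triangular (including diagonal) entries column by column."""
    return np.concatenate([A[:j + 1, j] for j in range(A.shape[0])])

k = 3
M = np.random.default_rng(4).normal(size=(k, k))
Omega = M @ M.T + k * np.eye(k)
Dk = duplication_matrix(k)
assert np.allclose(Dk @ vech(Omega), Omega.flatten(order="F"))   # vec(Omega)
```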

Because the Gaussian chain graph model is an exponential family, the Hessian of the log likelihood is also the negative of the Fisher information matrix:

$$\begin{aligned} I({\varvec{\Omega }}, \textbf{B})=\left[ \begin{array}{ll} \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k &{} -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\ -(\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) )^T &{} {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X} \end{array} \right] .\nonumber \\ \end{aligned}$$
(2)

Having the Hessian of the log likelihood (the negative of the Fisher information in (2)), we only need the Hessian of the log prior for each of the priors under study to obtain the negative Hessian of the log posterior, and thus the Laplace approximation of the posterior precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\), which can be written in block form:

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{A} &{} \textbf{G}\\ \textbf{C} &{} \textbf{D}\\ \end{array} \right] . \end{aligned}$$
(3)

Given that this matrix corresponds to the Laplace-approximated posterior precision, its inverse corresponds to the Laplace-approximated posterior covariance matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\):

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{A} &{} \textbf{G}\\ \textbf{C} &{} \textbf{D}\\ \end{array} \right] ^{-1} = \left[ \begin{array}{ll} (\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1} &{} -\textbf{A}^{-1}\textbf{G}(\textbf{D}-\textbf{C}\textbf{A}^{-1}\textbf{G})^{-1}\\ -\textbf{D}^{-1}\textbf{C}(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1} &{} (\textbf{D}-\textbf{C}\textbf{A}^{-1}\textbf{G})^{-1}\\ \end{array} \right] . \end{aligned}$$

The block \((\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C})^{-1}\), the inverse of the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) of the block \(\textbf{D}\) (Prasolov 1994), corresponds to the Laplace approximation of the marginal posterior covariance matrix of \({\varvec{\Omega }}\). The Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) itself is then the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\), our optimality criterion to address the question of optimal Bayesian experimental design for each of the priors. As mentioned before, we focus on the marginal precision matrix of \({\varvec{\Omega }}\) because we want to account for \(\textbf{B}\), but only as a nuisance parameter.
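The sketch below (our code) illustrates this relationship numerically on a random positive-definite matrix playing the role of (3): the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) coincides with the inverse of the corresponding block of the inverse matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 4, 3                             # sizes of the Omega- and B-blocks
M = rng.normal(size=(d1 + d2, d1 + d2))
P = M @ M.T + (d1 + d2) * np.eye(d1 + d2)  # a positive-definite "joint precision"

A, G = P[:d1, :d1], P[:d1, d1:]
C, D = P[d1:, :d1], P[d1:, d1:]

# The marginal covariance of the first block is the (1,1) block of P^{-1};
# its inverse equals the Schur complement A - G D^{-1} C.
marginal_precision = A - G @ np.linalg.solve(D, C)
assert np.allclose(marginal_precision, np.linalg.inv(np.linalg.inv(P)[:d1, :d1]))
```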

Our goal for the remainder of this section is to obtain the matrix \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\) for each of the four priors under study, with \(\textbf{A}, \textbf{G}, \textbf{C}, \textbf{D}\) coming from the negative of the Hessian of the log posterior. This matrix is our optimality criterion for Bayesian experimental design and will allow us to search for the optimal design matrix \({\textbf{X}}\) under each prior.

Finally, we note that the Laplace approximation is only valid for posterior densities that are log concave. This is trivially true for the case of the flat prior because the likelihood model is log concave. We prove log concavity of the other three priors under study in the Appendix.

2.2 Flat Prior

As mentioned, the posterior under the flat prior is trivially log concave because the likelihood of the Gaussian chain graph model is log concave. Thus, we can obtain the Laplace approximation of the posterior precision matrix of \(\textbf{B}\) and \({\varvec{\Omega }}\) as the negative Hessian of the log likelihood (the Fisher information in (2)), which can be written as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k, \\ \textbf{G}&=\textbf{C}^T=-\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) ,\\ \textbf{D}&={\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} {\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X})^{-1}\right] \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X})\right) \right] ^T\\&=\frac{n}{2}\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\right) \textbf{D}_k. \end{aligned}$$

The next step in Bayesian experimental design would be to find a design matrix \({\textbf{X}}\) that maximizes our optimality criterion \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\). We note, however, that \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\) is not a function of \({\textbf{X}}\), and thus, optimal experimental design cannot be performed with this optimality criterion. We also note that when \(\textbf{B}\) is completely known, the Fisher information matrix becomes the upper left block in (2) (matrix \(\textbf{A}\)), which does contain \({\textbf{X}}\), and thus experimental design is possible in the known-\(\textbf{B}\) case. This remark hints at the possibility that prior knowledge about \(\textbf{B}\) could help infer \({\varvec{\Omega }}\), as will be confirmed in the next sections.
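This cancellation is easy to check numerically. The sketch below (our code, rebuilding the duplication matrix as before) forms the blocks of (2) for a random design and verifies that the Schur complement equals \(\frac{n}{2}\textbf{D}_k^T({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1})\textbf{D}_k\), with no dependence on \({\textbf{X}}\) (provided \({\textbf{X}}^T{\textbf{X}}\) is invertible).

```python
import numpy as np

def duplication_matrix(k):
    pairs = [(i, j) for j in range(k) for i in range(j + 1)]
    D = np.zeros((k * k, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        D[j * k + i, c] = 1.0
        D[i * k + j, c] = 1.0
    return D

rng = np.random.default_rng(6)
n, p, k = 40, 2, 3
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, k))
M = rng.normal(size=(k, k))
Omega = M @ M.T + k * np.eye(k)
Oi = np.linalg.inv(Omega)
Dk = duplication_matrix(k)
XtX = X.T @ X

# Blocks of the Fisher information (2), i.e., the negative Hessian of the log likelihood
A = Dk.T @ (n / 2 * np.kron(Oi, Oi) + np.kron(Oi, Oi @ B.T @ XtX @ B @ Oi)) @ Dk
G = -Dk.T @ np.kron(Oi, Oi @ B.T @ XtX)
Dblk = np.kron(Oi, XtX)

schur = A - G @ np.linalg.solve(Dblk, G.T)
assert np.allclose(schur, n / 2 * Dk.T @ np.kron(Oi, Oi) @ Dk)   # no dependence on X
```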

2.3 Standard Conjugate Prior: Normal-Wishart

The standard conjugate prior for a Gaussian chain graph model is the Normal-Wishart family:

$$\begin{aligned} \begin{aligned} {\varvec{\Omega }}|\lambda , \varvec{\Phi }&\sim W_k(\lambda , \varvec{\Phi }^{-1})\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }},\textbf{B}_0,\varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}(\textbf{B}_0{\varvec{\Omega }}),\varvec{\Omega }\otimes \varvec{\Lambda }) \end{aligned} \end{aligned}$$
(4)

where \(\varvec{\Phi }\in \mathbb {R}^{k\times k}\) is positive definite, \(\lambda \) is a scalar, \(\textbf{B}_0 \in \mathbb {R}^{p \times k}\), and \(\varvec{\Lambda }\in \mathbb {R}^{p\times p}\) represents the uncertainty on \(\textbf{B}\). Then, the posterior distribution is given by:

$$\begin{aligned} \begin{aligned} {\varvec{\Omega }}| {\textbf{Y}},{\textbf{X}}, \lambda , \varvec{\Phi }&\sim W_k(\lambda +n,\hat{\varvec{\Phi }}^{-1})\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }},{\textbf{Y}},{\textbf{X}}, \textbf{B}_0, \varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}((\varvec{\Lambda }^{-1}+{\textbf{X}}^T{\textbf{X}})^{-1}(\varvec{\Lambda }^{-1}\textbf{B}_0+{\textbf{X}}^T {\textbf{Y}}) {\varvec{\Omega }}),\varvec{\Omega }\otimes (\varvec{\Lambda }^{-1}+{\textbf{X}}^T{\textbf{X}})^{-1}) \end{aligned} \end{aligned}$$

where \(\hat{\varvec{\Phi }}=\varvec{\Phi }+\textbf{Y}^T\textbf{Y}+\textbf{B}_0^T\varvec{\Lambda }^{-1}\textbf{B}_0-(\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}(\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})^T\). We show that this matrix is positive definite in Appendix A4.
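As a sketch of this posterior update (our code; we assume SciPy's Wishart parameterization by degrees of freedom and scale matrix, and use the true \(\textbf{B}\) both to generate data and as the prior location \(\textbf{B}_0\)):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(7)
n, p, k = 200, 2, 3
X = rng.normal(size=(n, p))
B0 = rng.normal(size=(p, k))               # used as the true B and as the prior location
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
Y = X @ B0 @ Sigma + rng.multivariate_normal(np.zeros(k), Sigma, size=n)

# Normal-Wishart hyperparameters
lam, Phi, Lam = 2 * k + 2, 1e-3 * np.eye(k), 1e-3 * np.eye(p)
Lam_inv = np.linalg.inv(Lam)

# Posterior: Omega | Y, X ~ W_k(lam + n, Phi_hat^{-1})
M = B0.T @ Lam_inv + Y.T @ X               # k x p
Phi_hat = (Phi + Y.T @ Y + B0.T @ Lam_inv @ B0
           - M @ np.linalg.solve(X.T @ X + Lam_inv, M.T))
omega_draws = wishart.rvs(df=lam + n, scale=np.linalg.inv(Phi_hat), size=1000)
```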

To use the Laplace approximation for the posterior precision, the posterior distribution needs to be log concave. The Normal-Wishart posterior is log concave when \(\frac{1}{2}(\lambda -k-p-1)\ge k/2\) (Appendix A2), so we can use the Laplace approximation of the posterior precision, namely the negative Hessian of the log posterior, which is the sum of the negative Hessian of the log likelihood (2) and the negative Hessian of the log prior; the Hessian of the log prior is given by:

$$\begin{aligned}&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}{(\textbf{B})} \partial {{\,\textrm{vec}\,}}{(\textbf{B})}^T} \log p(\varvec{\Omega },\textbf{B}) =-{\varvec{\Omega }}^{-1}\otimes \varvec{\Lambda }^{-1}\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})} \partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})}^T} \log p(\varvec{\Omega },\textbf{B})\nonumber \\&\quad =-\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( \frac{1}{2}(\lambda -k-p-1) {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T \varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right) \textbf{D}_k\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}{({\varvec{\Omega }})} \partial {{\,\textrm{vec}\,}}{(\textbf{B})}^T} \log p(\varvec{\Omega },\textbf{B})=\textbf{D}_k^T \left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T \varvec{\Lambda }^{-1} \right) \right) . \end{aligned}$$
(5)

Writing \(\alpha = \frac{1}{2}(\lambda -k-p-1)\), the negative Hessian of the log posterior can then be written as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( \alpha {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T \varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right) \textbf{D}_k \\&\quad + \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k \\&= \textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T\textbf{XB}+\textbf{B}^T\varvec{\Lambda }^{-1}\textbf{B}){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k,\\ \textbf{G}&= \textbf{C}^T = -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T \varvec{\Lambda }^{-1} \right) \right) -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\&= -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}\textbf{B}^T (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}) \right) \right) ,\\ \textbf{D}&= {\varvec{\Omega }}^{-1}\otimes {(\textbf{X}^T \textbf{X} + \varvec{\Lambda }^{-1})}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( \left( \frac{n}{2}+\alpha \right) {\varvec{\Omega }}^{-1} + {\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T\textbf{XB}+\textbf{B}^T \varvec{\Lambda }^{-1} \textbf{B}){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})) \right) \\&\quad \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right] \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1}))\right) \right] ^T\\&=\left( \frac{n}{2}+\alpha \right) \textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right] \textbf{D}_k \end{aligned} \end{aligned}$$
(6)

which is again not a function of design \({\textbf{X}}\) (see Appendix A5 for details on the algebraic simplification). This implies again that we cannot find an optimal experimental design for the conjugate prior with the optimality criterion of the Laplace approximation of the marginal posterior precision of \({\varvec{\Omega }}\).

In addition, if we use the Normal-Wishart conjugate prior, our prior knowledge is actually on the marginal regression coefficient \(\tilde{\textbf{B}} = \textbf{B} {\varvec{\Omega }}^{-1}\), not on \(\textbf{B}\). That is, the Normal-Wishart prior does not identify the parameters \({\varvec{\Omega }}\) and \(\textbf{B}\) separately; rather, it only carries information about combinations of them. This is also evident by taking \(\varvec{\Lambda }\rightarrow 0\) (the uncertainty on \(\textbf{B}\)): \(\textbf{B}\) is still not fully known. Instead, we would only know the marginal coefficient \(\tilde{\textbf{B}}=\textbf{B}{\varvec{\Omega }}^{-1}=\textbf{B}_0\), while \(\textbf{B}=\textbf{B}_0{\varvec{\Omega }}\) remains random through \({\varvec{\Omega }}\). Thus, when the uncertainty on \(\textbf{B}\) goes to zero, the conjugate Normal-Wishart prior does not reduce to the known-\(\textbf{B}\) case, which, in addition to not allowing optimal experimental design to infer \({\varvec{\Omega }}\), makes the conjugate prior suboptimal for the Gaussian chain graph model when our focus is the estimation of \({\varvec{\Omega }}\).

2.4 Normal-Matrix Generalized Inverse Gaussian Prior

The drawbacks of the Normal-Wishart conjugate prior for the estimation of \({\varvec{\Omega }}\) motivate us to search for other prior alternatives. We consider the matrix generalized inverse Gaussian (MGIG) distribution (Barndorff-Nielsen et al. 1982; Fazayeli and Banerjee 2016) for \({\varvec{\Omega }}\) to define the Normal-MGIG prior, which is not conjugate for \(\textbf{B}\), but yields an MGIG posterior for \({\varvec{\Omega }}\) that can be sampled via importance sampling (Fazayeli and Banerjee 2016).

The Normal-MGIG prior is given by:

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|\lambda , \varvec{\Psi }, \varvec{\Phi }&\sim MGIG(\lambda ,\varvec{\Psi },\varvec{\Phi })\\ {{\,\textrm{vec}\,}}(\textbf{B})|{\varvec{\Omega }}, \textbf{B}_0, \varvec{\Lambda }&\sim N({{\,\textrm{vec}\,}}(\textbf{B}_0),\varvec{\Omega }\otimes \varvec{\Lambda }) \end{aligned} \end{aligned}$$
(7)

where \(\varvec{\Psi }\), \(\varvec{\Phi }\in {\mathbb {R}}^{k\times k}\) are positive definite while \(\lambda \) is a scalar. \(\textbf{B}_0\in {\mathbb {R}}^{p\times k}\) is the mean of \(\textbf{B}\) and \(\varvec{\Lambda }\in {\mathbb {R}}^{p\times p}\) is the uncertainty on \(\textbf{B}\). Then, the posterior distribution is proportional to:

$$\begin{aligned}&p(\varvec{\Omega },\textbf{B}|\textbf{Y},\textbf{X},\theta )\nonumber \\&\quad \propto |\varvec{\Omega }|^{\frac{n}{2}}\exp \left( {{\,\textrm{tr}\,}}(\textbf{Y}^T\textbf{X}\textbf{B})-\frac{1}{2} {{\,\textrm{tr}\,}}(\textbf{Y}^T\textbf{Y}\varvec{\Omega })- \frac{1}{2} {{\,\textrm{tr}\,}}(\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B} \varvec{\Omega }^{-1})\right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{-\frac{p}{2}}\exp \left( -\frac{1}{2}{{\,\textrm{tr}\,}}([\textbf{B}-\textbf{B}_0]^T \varvec{\Lambda }^{-1}[\textbf{B}-\textbf{B}_0] \varvec{\Omega }^{-1})) \right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{\lambda -\frac{k+1}{2}}\exp (-\frac{1}{2} {{\,\textrm{tr}\,}}[\varvec{\Psi }\varvec{\Omega }^{-1}]-\frac{1}{2}{{\,\textrm{tr}\,}}[\varvec{\Phi }\varvec{\Omega }])\nonumber \\&\quad \propto |\varvec{\Omega }|^{-\frac{p}{2}}\exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}(-2(\varvec{\Omega }^{-1}\textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T \textbf{X})\textbf{B}+\textbf{B}^T(\varvec{\Lambda }^{-1}+\textbf{X}^T\textbf{X}) \textbf{B}\varvec{\Omega }^{-1})) \right) \nonumber \\&\qquad \times \exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}\left( (\varvec{\Omega }^{-1} \textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X}) (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}(\varvec{\Omega }^{-1} \textbf{B}_0^T\varvec{\Lambda }^{-1}+\textbf{Y}^T\textbf{X})^T \varvec{\Omega }\right) \right) \nonumber \\&\qquad \times |\varvec{\Omega }|^{\lambda +\frac{n}{2} -\frac{k+1}{2}}\exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}((\varvec{\Phi }+\textbf{Y}^T \textbf{Y}-\textbf{Y}^T\textbf{X}(\textbf{X}^T\textbf{X} +\varvec{\Lambda }^{-1})^{-1}\textbf{X}^T\textbf{Y})\varvec{\Omega })\right) \nonumber \\&\qquad \times \exp \left( -\frac{1}{2} {{\,\textrm{tr}\,}}((\varvec{\Psi }+\textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0-\textbf{B}_0^T\varvec{\Lambda }^{-1} (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1} \textbf{B}_0)\varvec{\Omega }^{-1})\right) \end{aligned}$$
(8)

where \(\theta \) denotes all hyperparameters in the prior. Thus, we get

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|\textbf{Y},\textbf{X},\theta&\sim MGIG\left( \lambda +\frac{n}{2},\hat{\varvec{\Psi }},\hat{\varvec{\Phi }}\right) \\ {{\,\textrm{vec}\,}}(\textbf{B})|\varvec{\Omega },\textbf{Y},\textbf{X},\theta&\sim N\left( \left( \varvec{\Omega }\otimes (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right) {{\,\textrm{vec}\,}}(\textbf{X}^T\textbf{Y}+\varvec{\Lambda }^{-1}\textbf{B}_0\varvec{\Omega }^{-1}),\ \varvec{\Omega }\otimes (\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1} \right) \end{aligned} \end{aligned}$$

where \(\hat{\varvec{\Psi }}=\varvec{\Psi }+\textbf{B}_0^T\varvec{\Lambda }^{-1}\textbf{B}_0-\textbf{B}_0^T\varvec{\Lambda }^{-1}(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\textbf{B}_0\) and \(\hat{\varvec{\Phi }}=\varvec{\Phi }+\textbf{Y}^T\textbf{Y}-\textbf{Y}^T\textbf{X}(\textbf{X}^T\textbf{X}+\varvec{\Lambda }^{-1})^{-1}\textbf{X}^T\textbf{Y}\). We show that these matrices are positive definite in Appendix A4.
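These updates are direct matrix computations, as in the sketch below (our code, with placeholder data purely for illustration). As \(\varvec{\Lambda }\rightarrow 0\), \(\hat{\varvec{\Psi }}\) and \(\hat{\varvec{\Phi }}\) reduce to the simpler formulas \(\varvec{\Psi }+(\textbf{XB}_0)^T (\textbf{XB}_0)\) and \(\varvec{\Phi }+\textbf{Y}^T\textbf{Y}\) used in Sect. 1.3.

```python
import numpy as np

def mgig_posterior_params(Y, X, lam, Psi, Phi, B0, Lam):
    """Hyperparameters of Omega | Y, X ~ MGIG(lam + n/2, Psi_hat, Phi_hat)."""
    n = Y.shape[0]
    Lam_inv = np.linalg.inv(Lam)
    K = np.linalg.inv(X.T @ X + Lam_inv)
    Psi_hat = Psi + B0.T @ Lam_inv @ B0 - B0.T @ Lam_inv @ K @ Lam_inv @ B0
    Phi_hat = Phi + Y.T @ Y - Y.T @ X @ K @ X.T @ Y
    return lam + n / 2, Psi_hat, Phi_hat

# illustrative call with placeholder data
rng = np.random.default_rng(12)
n, p, k = 100, 2, 3
X, B0, Y = rng.normal(size=(n, p)), rng.normal(size=(p, k)), rng.normal(size=(n, k))
lam_post, Psi_hat, Phi_hat = mgig_posterior_params(
    Y, X, lam=k + 1, Psi=1e-3 * np.eye(k), Phi=1e-3 * np.eye(k),
    B0=B0, Lam=1e-3 * np.eye(p))
```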

To the best of our knowledge, the Normal-MGIG prior has not been used for the Gaussian chain graph model, and thus we prove some of its properties in Appendix A3: we show that the MGIG prior is conjugate in the case of known \(\textbf{B}\) (Proposition 1), that it is log concave under certain conditions (Proposition 2), that it is unimodal in the case of unknown \(\textbf{B}\) (Proposition 3), and that its limiting case is indeed the case of known \(\textbf{B}\) (Remark 1).

The Normal-MGIG posterior is log concave when \(\lambda -\frac{k+p+1}{2}\ge \frac{p}{2}\) (Proposition 2 in the Appendix), so we can use the Laplace approximation of the posterior precision, namely the negative Hessian of the log posterior, which is the sum of the negative Hessian of the log likelihood (2) and the negative Hessian of the log prior; the Hessian of the log prior has the following form:

$$\begin{aligned}&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vec}\,}}(\textbf{B})^T}\log p({\varvec{\Omega }}, \textbf{B}) =-{\varvec{\Omega }}^{-1}\otimes \varvec{\Lambda }^{-1}\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})^T}\log p({\varvec{\Omega }}, \textbf{B}) \nonumber \\&\quad = -\textbf{D}_k^T\left( \left( \lambda -\frac{k+p+1}{2}\right) {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right) \textbf{D}_k\nonumber \\&\qquad -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}[(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1} (\textbf{B}-\textbf{B}_0)+\varvec{\Psi }]{\varvec{\Omega }}^{-1}\right) \right) \textbf{D}_k\nonumber \\&\frac{\partial ^2}{\partial {{\,\textrm{vec}\,}}(\textbf{B})\partial {{\,\textrm{vech}\,}}({\varvec{\Omega }})^T}\log p({\varvec{\Omega }}, \textbf{B}) =\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes ({\varvec{\Omega }}^{-1}(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})\right) . \end{aligned}$$
(9)

Writing \(\alpha = \lambda -\frac{k+p+1}{2}\), the negative Hessian of the log posterior can then be written as the matrix in (3) with:

$$\begin{aligned}&\textbf{A}= \textbf{D}_k^T\left( \alpha {\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\right) \textbf{D}_k +\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes \left( {\varvec{\Omega }}^{-1}[(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0)+\varvec{\Psi }] {\varvec{\Omega }}^{-1}\right) \right) \textbf{D}_k\\&\quad + \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X} \textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k\\&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} +{\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T \textbf{XB}+(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0)+\varvec{\Psi }){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k, \\ \textbf{G}&=\textbf{C}^T = -\textbf{D}_k^T\left( {\varvec{\Omega }}^{-1}\otimes ({\varvec{\Omega }}^{-1}(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1}) \right) -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\&= -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})) \right) ,\\ \textbf{D}&= {\varvec{\Omega }}^{-1}\otimes {(\textbf{X}^T \textbf{X} + \varvec{\Lambda }^{-1} )}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned}&\varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})\nonumber \\&\quad =\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1} +{\varvec{\Omega }}^{-1}(\textbf{B}^T{\textbf{X}}^T \textbf{XB}+(\textbf{B}-\textbf{B}_0)^T \varvec{\Lambda }^{-1}(\textbf{B}-\textbf{B}_0) +\varvec{\Psi }){\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\nonumber \\&\qquad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1})) \right) \left[ \varvec{\Omega }\otimes (\textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1})^{-1}\right] \nonumber \\&\qquad \left[ (\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}(\textbf{B}^T\textbf{X}^T\textbf{X} +(\textbf{B}-\textbf{B}_0)^T\varvec{\Lambda }^{-1}))\right) \right] ^T\nonumber \\&=\textbf{D}_k^T\left[ {\varvec{\Omega }}^{-1}\otimes \left( (\frac{n}{2}+\alpha ){\varvec{\Omega }}^{-1}+{\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} -\varvec{\Lambda }^{-1} ({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1} \varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1} \right) \right] \textbf{D}_k\nonumber \\&\qquad +\textbf{D}_k^T\left[ \varvec{\Omega }^{-1} \otimes ({\varvec{\Omega }}^{-1}\varvec{\Psi }{\varvec{\Omega }}^{-1})\right] \textbf{D}_k \end{aligned}$$
(10)

which is a function of the design \({\textbf{X}}\) (see Appendix A6 for details on the algebraic simplification). Note that the only term that involves the design matrix \(\textbf{X}\) is \({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0{\varvec{\Omega }}^{-1}\), which is \(\textbf{0}\) when \({\textbf{X}}=\textbf{0}\) (i.e., no experiment), so we can understand this term as the information gain due to the experiment. We can then base the optimal design on \(\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\), the only factor that we have control over. For example, for a D-optimal design, we want to minimize \(|\varvec{\Lambda }^{-1}({\textbf{X}}^T {\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}|=\frac{1}{|\varvec{\Lambda }|^2|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|}\), which can be achieved by maximizing \(|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|\); this coincides with the usual Bayesian D-optimal design for the marginal regression coefficient (Chaloner and Verdinelli 1995).
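The sketch below (our code, with arbitrary illustrative values) computes the factor \(\varvec{\Lambda }^{-1}-\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\) for a null and a random design, confirming that it vanishes when \({\textbf{X}}=\textbf{0}\), and ranks candidate designs by the Bayesian D-criterion \(|{\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1}|\).

```python
import numpy as np

def gain_factor(X, Lam_inv):
    """Lam^{-1} - Lam^{-1} (X^T X + Lam^{-1})^{-1} Lam^{-1}; zero when X = 0."""
    return Lam_inv - Lam_inv @ np.linalg.solve(X.T @ X + Lam_inv, Lam_inv)

rng = np.random.default_rng(8)
n, p = 100, 2
Lam_inv = np.linalg.inv(0.1 * np.eye(p))

X_null = np.zeros((n, p))
X_rand = rng.normal(size=(n, p))
print(np.allclose(gain_factor(X_null, Lam_inv), 0))      # True: no experiment, no gain
print(np.linalg.eigvalsh(gain_factor(X_rand, Lam_inv)))  # positive: information gained

# Bayesian D-optimal choice among candidate designs: maximize |X^T X + Lam^{-1}|
candidates = [rng.normal(size=(n, p)) for _ in range(5)]
best = max(candidates, key=lambda X: np.linalg.slogdet(X.T @ X + Lam_inv)[1])
```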

2.4.1 Information Bound under the Normal-MGIG Prior

We now know that the experimental design has an effect on the estimation of \({\varvec{\Omega }}\) under a Normal-MGIG prior by influencing its approximate posterior precision. However, it turns out that there is a bound on the information we can gain from the experiment (\({\textbf{X}}\)).

Recall that we denote by \({\varvec{\Omega }}^{-1}\otimes {\varvec{\Omega }}^{-1}\textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0{\varvec{\Omega }}^{-1}\) the information gain from the experiment. First, we observe that the inequality \(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\le \varvec{\Lambda }^{-1}\) holds because the subtracted term \(\varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1}\) is positive semi-definite (it is a quadratic form in the positive-definite matrix \(({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\)).

By multiplying the inequality on the left by \({\varvec{\Omega }}^{-1} \textbf{B}_0^T\) and on the right by \(\textbf{B}_0 {\varvec{\Omega }}^{-1}\), we get

$$\begin{aligned} {\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1} \le {\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1} \end{aligned}$$

where the term on the left is precisely the information that we can gain from nonzero experiments (\({\textbf{X}}\ne \textbf{0}\)) and this term is bounded by \({\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1}\). This means that when we try to find the optimal experimental design \({\textbf{X}}\) that maximizes the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\), the only term that depends on \({\textbf{X}}\) is bounded, and thus, there is a limit to how much can be gained by an optimal experimental design.

Next, we observe that this bound is sharp as the equality is achieved when \({\textbf{X}}^T {\textbf{X}}\rightarrow \infty \). In addition, this inequality provides the intuition that the information gain due to the experiment is bounded by the product of the marginal effect of the experiment (\(\textbf{B}_0 {\varvec{\Omega }}^{-1}\)) and the prior certainty on the experiment’s conditional effect (\(\varvec{\Lambda }^{-1}\)). Thus, if our prior on the effect of the experiment is no effect on any nodes (\(\textbf{B}_0=\textbf{0}\)), we will gain no information from the experiment. We can re-write the bound in terms of the marginal regression coefficient \(\tilde{\textbf{B}}_0 = \textbf{B}_0 {\varvec{\Omega }}^{-1}\): \({\varvec{\Omega }}^{-1} \textbf{B}_0^T(\varvec{\Lambda }^{-1} - \varvec{\Lambda }^{-1}({\textbf{X}}^T{\textbf{X}}+\varvec{\Lambda }^{-1})^{-1}\varvec{\Lambda }^{-1})\textbf{B}_0 {\varvec{\Omega }}^{-1}\le \tilde{\textbf{B}}_0^T \varvec{\Lambda }^{-1}\tilde{\textbf{B}}_0\).
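The bound is easy to verify numerically: in the sketch below (our code, with arbitrary illustrative values), the difference between \({\varvec{\Omega }}^{-1} \textbf{B}_0^T \varvec{\Lambda }^{-1}\textbf{B}_0 {\varvec{\Omega }}^{-1}\) and the information gain stays positive semi-definite, and the gap shrinks as \({\textbf{X}}^T{\textbf{X}}\) grows.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, k = 50, 2, 3
Lam_inv = np.linalg.inv(0.5 * np.eye(p))
B0 = rng.normal(size=(p, k))
M = rng.normal(size=(k, k))
Omega_inv = np.linalg.inv(M @ M.T + k * np.eye(k))

# the information bound Omega^{-1} B0^T Lam^{-1} B0 Omega^{-1}
bound = Omega_inv @ B0.T @ Lam_inv @ B0 @ Omega_inv

for scale in [0.0, 1.0, 10.0, 100.0]:
    X = scale * rng.normal(size=(n, p))
    middle = Lam_inv - Lam_inv @ np.linalg.solve(X.T @ X + Lam_inv, Lam_inv)
    gain = Omega_inv @ B0.T @ middle @ B0 @ Omega_inv
    # bound - gain stays positive semi-definite, and the gap shrinks as X^T X grows
    print(scale, np.linalg.eigvalsh(bound - gain).min(), np.abs(bound - gain).max())
```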

In conclusion, given the information bound, the experimental design \({\textbf{X}}\) is not as important as the prior knowledge on the experiment's effect. To increase the information bound, there are two directions: either having a large prior marginal effect (\(\tilde{\textbf{B}}_0\)) or a small uncertainty on the conditional regression effects (\(\varvec{\Lambda }\)). That is, the most helpful experiments are the ones with large marginal effects on nodes with well-known conditional effects. More practical advice for domain scientists is given in the Discussion (Sect. 5).

2.5 General Independent Prior

We now investigate whether the information bound is specific to the Normal-MGIG case or whether it exists in the general case when we have a prior distribution on \(\textbf{B}\) independent of \({\varvec{\Omega }}\), assuming that the prior distribution is log concave.

Let \(\log p({\varvec{\Omega }}, \textbf{B})=f(\textbf{B})+g({\varvec{\Omega }})\) be the log prior density where \(f(\textbf{B})\) has a Hessian given by \(-\varvec{\Lambda }^{-1}\in \mathbb {R}^{kp\times kp}\) and \(g(\varvec{\Omega })\) has a Hessian (with respect to unique parameters of \(\varvec{\Omega }\)) given by \(-\varvec{\Psi }\in \mathbb {R}^{\frac{k(k+1)}{2}\times \frac{k(k+1)}{2}}\). Then, the negative Hessian of the log posterior is the sum of the negative Hessian of the log prior and the negative Hessian from log likelihood (2):

$$\begin{aligned} \left[ \begin{array}{ll} \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k +\varvec{\Psi }&{} -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \\ -(\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) )^T &{} {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1} \end{array} \right] \end{aligned}$$

which we can write as the matrix in (3) with

$$\begin{aligned} \textbf{A}&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }, \\ \textbf{G}&= \textbf{C}^T = -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) , \\ \textbf{D}&= {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}. \end{aligned}$$

Then, the Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) is given by the Schur complement \(\textbf{A}-\textbf{G}\textbf{D}^{-1}\textbf{C}\), which we denote \({\varvec{\Omega }} | I({\varvec{\Omega }}, \textbf{B})\):

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left( {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}\right) ^{-1}\left( \varvec{\Omega }^{-1}\otimes (\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}) \right) \textbf{D}_k. \end{aligned} \end{aligned}$$

To simplify the notation, we take \(\textbf{E}=\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \) and \(\textbf{F}={\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}\). It is simple to check that \(\textbf{EF}^{-1}=\textbf{I}_{k}\otimes {\varvec{\Omega }}^{-1}\textbf{B}^T, \textbf{F}^{-1}\textbf{E}^T=\textbf{I}_{k}\otimes \textbf{B}\varvec{\Omega }^{-1}\) and \(\textbf{EF}^{-1}\textbf{E}^{T} = \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\). The Laplace approximation of the marginal posterior precision matrix of \({\varvec{\Omega }}\) (\(\varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})\)) then becomes:

$$\begin{aligned} \begin{aligned} \varvec{\Omega }|I({\varvec{\Omega }},\textbf{B})&= \textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}\right) \textbf{D}_k+\varvec{\Psi }\\&\quad -\textbf{D}_k^T\left( \varvec{\Omega }^{-1}\otimes (\varvec{\Omega }^{-1}\textbf{B}^T\textbf{X}^T\textbf{X}) \right) \left( {\varvec{\Omega }}^{-1} \otimes \textbf{X}^T \textbf{X}+\varvec{\Lambda }^{-1}\right) ^{-1}\left( \varvec{\Omega }^{-1}\otimes (\textbf{X}^T\textbf{X}\textbf{B}\varvec{\Omega }^{-1}) \right) \textbf{D}_k\\&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\textbf{EF}^{-1}\textbf{E}^T\right) \textbf{D}_k+\varvec{\Psi } -\textbf{D}_k^T\textbf{E} \left( \textbf{F}+\varvec{\Lambda }^{-1}\right) ^{-1}\textbf{E} ^T \textbf{D}_k\\&=\textbf{D}_k^T\left( \frac{n}{2} \varvec{\Omega }^{-1}\otimes \varvec{\Omega }^{-1}+\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T\right) \textbf{D}_k+\varvec{\Psi }. \end{aligned} \end{aligned}$$

Note that the only term that involves \({\textbf{X}}\) is \(\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T\) which we can re-write by taking the Cholesky decomposition of \(\varvec{\Lambda }^{-1}=\textbf{LL}^T\) and by (A4) (with \(\textbf{A}=\textbf{F}\), \(\textbf{P}=\textbf{I}_{kp}\), \(\textbf{U}=\textbf{V}^T=\textbf{L}\)) as

$$\begin{aligned} \begin{aligned} \textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T&=\textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\textbf{LL}^T)^{-1})\textbf{E}^T\\&=\textbf{E}\textbf{F}^{-1}\textbf{L}(\textbf{I}_{kp}+\textbf{L}^T\textbf{F}^{-1}\textbf{L})^{-1}\textbf{L}^T\textbf{F}^{-1}\textbf{E}^T. \end{aligned} \end{aligned}$$
(11)

Then, we have the following information bound (derivation in Appendix 5)

$$\begin{aligned} \begin{aligned} \textbf{E}(\textbf{F}^{-1}-(\textbf{F}+\varvec{\Lambda }^{-1})^{-1})\textbf{E}^T&=\textbf{E}\textbf{F}^{-1}\textbf{L}(\textbf{I}_{kp}+\textbf{L}^T\textbf{F}^{-1}\textbf{L})^{-1}\textbf{L}^T(\textbf{E}\textbf{F}^{-1})^{T}\\&\le \textbf{E}\textbf{F}^{-1}\varvec{\Lambda }^{-1} (\textbf{E}\textbf{F}^{-1})^{T}\\&=\left( \textbf{I}_{k}\otimes (\textbf{B}{\varvec{\Omega }}^{-1})^T\right) \varvec{\Lambda }^{-1} \left( \textbf{I}_{k}\otimes (\textbf{B}{\varvec{\Omega }}^{-1})^T\right) ^T. \end{aligned} \end{aligned}$$

This bound is similar to the Normal-MGIG case (Sect. 2.4.1) in that it depends on the marginal regression coefficients \(\textbf{B}\varvec{\Omega }^{-1}\) and is also not a function of \({\textbf{X}}\). We observe that this bound is also sharp, as equality is achieved when \(\textbf{X}^T\textbf{X}\) tends to infinity. In addition, this bound can guide the choice of experiments when the goal is the estimation of the precision matrix. Again, one should choose experiments that have large marginal effects and for which the prior on the conditional effects is highly certain. More practical advice for domain scientists is given in the Discussion (Sect. 5).

3 Simulation Study

In order for an experimental design to help infer the network structure, such an experiment should at least provide better results than not doing any experiment (a null experiment corresponding to \(\textbf{X}=\textbf{0}\)). In addition, we evaluate a standard type of experiment, denoted "specific", in which the treatment affects only one of the nodes. For each experimental design (null, random, specific), we calculate the KL divergence between the prior and the posterior, which represents the information gained by each experiment. To evaluate point estimation, we compare the performance of the different priors using Stein's loss (Dey and Srinivasan 1985) between the maximum a posteriori (MAP) estimate and the true value of \({\varvec{\Omega }}\).
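For concreteness, a common form of Stein's (entropy) loss for a precision matrix is sketched below (our code); it is zero exactly when the estimate equals the truth.

```python
import numpy as np

def steins_loss(Omega_hat, Omega_true):
    """Stein's (entropy) loss tr(A) - log det(A) - k with A = Omega_hat Omega_true^{-1}."""
    k = Omega_true.shape[0]
    A = Omega_hat @ np.linalg.inv(Omega_true)
    _, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - k

Omega = np.array([[2.0, 0.5], [0.5, 1.0]])
print(steins_loss(Omega, Omega))            # 0.0: perfect estimate
print(steins_loss(np.eye(2), Omega))        # > 0: imperfect estimate
```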

We simulate data under the 6 covariance structures in Wang (2012) with \(k=50\) responses and \(p=50\) specific predictors (\(\textbf{B}=\textbf{I}_{50}\)). Each simulation is repeated 100 times.

  • Model 1: AR(1) model with \(\sigma _{ij}=0.7^{|i-j|}\)

  • Model 2: AR(2) model with \(\omega _{ii}=1\), \(\omega _{i-1,i}=\omega _{i,i-1}=0.5\), \(\omega _{i-2,i}=\omega _{i,i-2}=0.25\) for \(i=1,\dots ,k\)

  • Model 3: Block model with \(\sigma _{ii}= 1\) for \(i=1,\dots ,k\), \(\sigma _{ij}= 0.5\) for \(1\le i\ne j\le k/2\), \(\sigma _{ij}=0.5\) for \(k/2 + 1\le i\ne j\le k\) and \(\sigma _{ij}=0\) otherwise.

  • Model 4: Star model with every node connected to the first node, with \(\omega _{ii}=1\), \(\omega _{1,i}=\omega _{i,1}= 0.1\) for \(i=1,\dots ,k\), and \(\omega _{ij}= 0\) otherwise.

  • Model 5: Circle model with \(\omega _{ii}= 2\), \(\omega _{i-1,i}=\omega _{i,i-1}= 1\) for \(i=2,\dots ,k\), and \(\omega _{1,k}=\omega _{k,1}= 0.9\).

  • Model 6: Full model with \(\omega _{ii}= 2\) and \(\omega _{ij}= 1\) for \(i\ne j \in \{1,\dots ,k\}\).

We take sample sizes ranging from 200 to 2200 in increments of 50.
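A few of these structures are easy to construct directly; the sketch below (our code) builds Models 1, 2 and 4 and simulates responses under a random design with the specific coefficient matrix \(\textbf{B}=\textbf{I}_k\).

```python
import numpy as np

k = 50

# Model 1: AR(1) covariance, sigma_ij = 0.7^{|i-j|}
Sigma1 = 0.7 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))

# Model 2: AR(2) precision, unit diagonal with 0.5 and 0.25 on the first two off-diagonals
Omega2 = (np.eye(k)
          + 0.5 * (np.eye(k, k=1) + np.eye(k, k=-1))
          + 0.25 * (np.eye(k, k=2) + np.eye(k, k=-2)))

# Model 4: star precision, every node connected to the first node with weight 0.1
Omega4 = np.eye(k)
Omega4[0, 1:] = 0.1
Omega4[1:, 0] = 0.1

# simulate responses under a random design with the specific coefficients B = I_k
rng = np.random.default_rng(10)
n, B = 200, np.eye(k)
Sigma4 = np.linalg.inv(Omega4)
X = rng.normal(size=(n, k))
Y = X @ B @ Sigma4 + rng.multivariate_normal(np.zeros(k), Sigma4, size=n)
```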

For each covariance structure, we test two priors, each with two levels of uncertainty, and with and without bias in the prior of \(\textbf{B}\). Namely,

  1. Normal-Wishart prior with \(\lambda =2k+2\), \(\varvec{\Phi }=10^{-3} \textbf{I}_k\) and two uncertainty levels of \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case) and (ii) \(\varvec{\Lambda }=10^3 \textbf{I}_p\) (uncertain case).

  2. Normal-MGIG prior with \(\lambda =k+1\), \(\varvec{\Psi }=\varvec{\Phi }=10^{-3} \textbf{I}_k\) and two uncertainty levels of \(\textbf{B}\): (i) \(\varvec{\Lambda }=10^{-3}\textbf{I}_{p}\) (certain case) and (ii) \(\varvec{\Lambda }=10^3 \textbf{I}_p\) (uncertain case).

The biased case for the prior of \(\textbf{B}\) is given by a biased prior mean \(\textbf{B}_0 = \textbf{B} + \epsilon \) with \(\epsilon \sim N(0,1)\). Note that we also test a case with a smaller bias and present these results in the Appendix. In addition, we test intermediate uncertainty levels \(\varvec{\Lambda } = 10 \textbf{I}_{p}\) and \(\varvec{\Lambda } = 0.1 \textbf{I}_{p}\) in the Appendix.

In addition to these two priors, we test two shrinkage priors: the chain graph lasso (cglasso) prior (Shen and Solis-Lemus 2020) and the multivariate regression lasso (mlasso) prior. Both place a graphical LASSO prior (Wang 2012) on \({\varvec{\Omega }}\). The cglasso prior places an independent Laplace prior on the entries of \(\textbf{B}\), whereas the mlasso prior places an independent Laplace prior on the entries of \(\tilde{\textbf{B}}=\textbf{B} {\varvec{\Omega }}^{-1}\). Specifically,

  1. cglasso prior: \(\textbf{B}_{i,j}\sim \text {Laplace}(\lambda ^2)\) and \(\omega _{k,k'}\sim \text {Laplace}(\lambda ^2)\), \(\omega _{k,k}\sim \text {Exp}(\lambda )\), with two shrinkage levels \(\lambda =2,10\)

  2. mlasso prior: \((\textbf{B}\varvec{\Omega }^{-1})_{i,j}\sim \text {Laplace}(\lambda ^2)\) and \(\omega _{k,k'}\sim \text {Laplace}(\lambda ^2)\), \(\omega _{k,k}\sim \text {Exp}(\lambda )\), with two shrinkage levels \(\lambda =2,10\)

We sample from these priors with rejection sampling, whose acceptance rate drops below \(10^{-8}\) when \(k>10\), so for the shrinkage priors we use \(k=10\). We then sample the posterior with 10,000 iterations, discarding the first 5,000 as burn-in. KL divergences are estimated using the R package FNN.
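To make this sampling bottleneck concrete, here is a minimal sketch of such a rejection sampler (our own illustration, not the paper’s code; the Laplace/exponential rate parameterization is an assumption, and positive definiteness is taken as the acceptance criterion):

```r
# Sketch (ours): rejection sampling from a graphical-LASSO-type prior on Omega.
rlaplace <- function(n, rate) rexp(n, rate) * sample(c(-1, 1), n, replace = TRUE)

draw_omega_prior <- function(k, lambda) {
  repeat {
    Om <- matrix(0, k, k)
    Om[upper.tri(Om)] <- rlaplace(k * (k - 1) / 2, lambda)  # off-diagonals ~ Laplace
    Om <- Om + t(Om)
    diag(Om) <- rexp(k, lambda)                             # diagonals ~ Exponential
    # accept only positive definite draws; this is what makes k > 10 impractical
    if (min(eigen(Om, symmetric = TRUE, only.values = TRUE)$values) > 0) return(Om)
  }
}
# draw_omega_prior(10, 2)   # feasible for k = 10; essentially never returns for large k
```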

3.1 Information Gain by Experiments

We evaluated how much the posterior distribution changes relative to the prior under different experimental settings. Intuitively, if the prior and the posterior are similar, then not much information is gained from the data or the experiment. However, simply evaluating the difference between prior and posterior is not enough, since the sample size itself has an effect. Thus, to better evaluate the effect of the design rather than the effect of the sample size, we need a baseline experiment. We take the null design (\(\textbf{X}=0\)) as this baseline because it corresponds to performing no experiment, and one would expect it to provide the least information. That is, we consider an experiment to provide useful information when it outperforms the null experiment in the same setting, and thus we report the difference between the given design (random or specific) and the null design as our measure of performance in the figures below.

For each simulation at a given sample size, we calculate (1) the KL divergence between the prior and the posterior for a given experiment (random or specific) and (2) the KL divergence between the prior and the posterior under the null experiment. The information gain due to the experiment is then evaluated as the difference in log KL divergence between the design (random or specific) and the null design. Figure 2 shows the results when simulating data under the covariance structure of Model 4 (star model). Points above 0 indicate more information gained by conducting a non-null experiment. We expect random experiments to do better in this case because the marginal effect \(\textbf{B}{\varvec{\Omega }}^{-1}\) can be large. The results for all other covariance structures (with similar conclusions) can be found in Appendix A8, and the simulation results with shrinkage priors can be found in Appendix A9; the latter largely align with the results under the Normal-MGIG and Normal-Wishart priors.
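A hedged sketch of how this metric could be computed with the FNN package mentioned above (the choice to vectorize the \({\varvec{\Omega }}\) draws by their upper triangle, the neighbor count, and the averaging over neighbor orders are our own assumptions, and the named draw arrays are hypothetical objects):

```r
# Sketch (ours): nearest-neighbor KL estimates between prior and posterior draws,
# and the log-KL difference of a design relative to the null design.
library(FNN)
# prior_draws, post_design, post_null: k x k x nDraws arrays of Omega draws (hypothetical)
vec_upper <- function(draws) t(apply(draws, 3, function(M) M[upper.tri(M, diag = TRUE)]))
kl_design <- mean(KL.divergence(vec_upper(prior_draws), vec_upper(post_design), k = 5))
kl_null   <- mean(KL.divergence(vec_upper(prior_draws), vec_upper(post_null),   k = 5))
info_gain <- log(kl_design) - log(kl_null)   # values above 0 favor the non-null design
```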

Fig. 2

Difference in log KL divergence between prior and posterior comparing random experiment (\({\textbf{X}} \ne \textbf{0}\)) and specific experiment (diagonal \({\textbf{X}}\)) versus null experiment (\({\textbf{X}}=\textbf{0}\)) under a star model with 50 responses and 50 predictors, with and without bias on the prior of \(\textbf{B}\). Lines are averages over 100 repeats, while error bars are 0.975 and 0.025 quantiles. The only case in which the experiment gains information compared to the null experiment is when we have certain and unbiased prior knowledge on \(\textbf{B}\) (Normal-MGIG certain case in green). With a biased prior, MGIG still has better information gain while certain Wishart also leads to better information gains compared to a null design

Except for the case of the Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), the difference in information gain eventually reaches 0 for all other prior cases as the sample size increases, meaning that there is no longer any information gain from doing an experiment (random or specific) compared to no experiment at all. The Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), in contrast, stays at a distance from 0, and this distance does not depend much on the sample size. With a biased prior, MGIG still shows better information gain, while the certain Wishart also leads to better information gain compared to a null design. More results on other types of biases can be found in Appendix A10.

3.2 Performance on Point Estimation of \({\varvec{\Omega }}\)

While KL divergence evaluates the information gain of the experiment, information gain does not imply better point estimates. To compare the performance of point estimation of \({\varvec{\Omega }}\), we use the difference in log Stein’s loss between the experiment (random or specific) and the null design. The results are shown in Fig. 3. Points below 0 indicate more accurate point estimates under an experimental design than under the null experiment.

We observe a pattern similar to the information gain results. That is, except for the Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case), the difference in log Stein’s loss eventually reaches 0 for all other prior cases regardless of the experimental design (random or specific). The Normal-MGIG prior with \(\varvec{\Lambda }=10^{-3} \textbf{I}_p\) (certain case) stays at a distance from 0, and this distance does not depend much on the sample size. While biased priors such as the certain MGIG and certain Wishart can undermine our ability to obtain good point estimates from experiments, smaller biases do not have such a strong negative effect (see Appendix A10).

Fig. 3

Difference in log Stein’s loss of random experiment (\({\textbf{X}} \ne \textbf{0}\)) and specific experiment (diagonal \({\textbf{X}}\)) versus null experiment (\({\textbf{X}}=\textbf{0}\)) under Star models with 50 responses and 50 predictors, with and without biases on the prior of \(\textbf{B}\). Lines are averages over 100 repeats, while error bars are 0.975 and 0.025 quantiles. The only case in which the random design mostly has lower Stein’s loss is when we have certain and unbiased prior knowledge on \(\textbf{B}\) (Normal-MGIG certain case in green). All other unbiased prior cases eventually reach the zero line (no difference in MAP performance of \({\varvec{\Omega }}\) compared to null experiment). With a biased prior, certain MGIG and certain Wishart lead to less accurate point estimates

4 Human Gut Microbiome Data

We revisit a comparison similar to the one in the motivating toy example (Sect. 1.3): the posterior distribution of partial correlations among responses under different priors and experiments. We use data from Claesson et al. (2012), which collected fecal microbiota composition from 178 elderly subjects, together with the subjects’ residence type (community, day-hospital, rehabilitation, or long-term residential care) and diet (data at O’Toole (2008)), with the goal of understanding the interactions between microbes and environment via partial correlations.

We use the MG-RAST server (Meyer et al. 2008) for profiling with an e-value of 5, 60% identity, alignment length of 15 bp, and minimal abundance of 10 reads. Unclassified hits are not included in the analysis. Genera with more than 0.5% relative abundance in more than 50 samples are selected as focal genera, and all other genera serve as the reference group. This yields 13 responses and 11 predictors (i.e., \(p=11,k=14\)). We then fit a Gaussian chain graph model to the data. Since we cannot design an experiment on these already collected data, we compare against a hypothetical null experiment: we draw a simulated sample from a Gaussian chain graph model whose regression coefficients and precision matrix are set to the MLEs from the original data, where the null experiment, by definition, has \(\textbf{X} = \textbf{0}\).
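As an illustration of the filtering rule, a short sketch in R (assuming a hypothetical samples-by-genera count matrix `counts`; the paper’s exact preprocessing may differ):

```r
# Sketch (ours): select focal genera with > 0.5% relative abundance in more than 50 samples.
rel_abund <- counts / rowSums(counts)                  # relative abundances per sample
focal     <- colSums(rel_abund > 0.005) > 50           # filtering rule described above
Y_counts  <- counts[, focal, drop = FALSE]             # focal genera (responses)
reference <- rowSums(counts[, !focal, drop = FALSE])   # remaining genera pooled as reference
```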

Here, we focus on the partial correlation between Bacteroides, one of the largest genera in the gut, and Clostridium, a group known to be pathogenic. Figure 4 shows the posterior distribution of this partial correlation under the two different priors (Normal-Wishart and Normal-MGIG) with three different uncertainty levels on \(\textbf{B}\): \(\varvec{\Lambda }=10^{-1}\textbf{I}_p\), \(10^0 \textbf{I}_p\), \(10^{1}\textbf{I}_p\) (shown as columns: 0.1, 1, 10), with \(\varvec{\Psi }= 0.01\textbf{I}_k\) and \(\varvec{\Phi }=0.01\textbf{I}_k\). Just as in the toy example (and in agreement with the expectation of our theory in Sect. 2), only the Normal-MGIG prior results in lower uncertainty when an experiment is performed.
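The partial correlations follow the usual transformation of the precision matrix, \(\rho _{ij} = -\omega _{ij}/\sqrt{\omega _{ii}\,\omega _{jj}}\); a minimal R sketch (ours) applied to posterior draws of \({\varvec{\Omega }}\):

```r
# Sketch (ours): partial correlation matrix from a single draw of the precision matrix Omega.
partial_cor <- function(Omega) {
  D <- diag(1 / sqrt(diag(Omega)))
  P <- -D %*% Omega %*% D       # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
  diag(P) <- 1
  P
}
# Applying partial_cor() to each MCMC draw and extracting the Bacteroides-Clostridium
# entry gives posterior distributions of the kind shown in Fig. 4.
```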

Fig. 4

Posterior distribution for the partial correlation between Bacteroides, one of the largest genera in the gut, and Clostridium, a group known to be pathogenic. Rows correspond to the two priors: Normal-MGIG (top) and Normal-Wishart (bottom). Columns correspond to the certainty level on the prior of \(\textbf{B}\) (\(\varvec{\Lambda }=10^{-1}\textbf{I}_p\), \(10^0 \textbf{I}_p\), \(10^{1}\textbf{I}_p\); shown as 0.1, 1, 10, respectively). We observe that only under the MGIG prior does the experiment reduce the uncertainty in the posterior

5 Discussion

Chain graph models are relevant in genomic, microbiome, and ecological applications because they encode conditional dependence among responses and predictors and focus on the estimation of the precision matrix, an important parameter for understanding interactions within microbial and ecological communities. Here, we evaluated the effect of prior knowledge on conducting experiments to better estimate the precision matrix in a chain graph model. Using the Laplace approximation of the marginal posterior of the precision matrix \({\varvec{\Omega }}\) as the optimality criterion for the experimental design, we proved theoretically that without prior knowledge that identifies \(\textbf{B}\) and \({\varvec{\Omega }}\) separately (rather than only the combination \(\textbf{B}{\varvec{\Omega }}^{-1}\)), experiments provide no gain in knowledge for the estimation of \({\varvec{\Omega }}\). That is, the Laplace approximation of the marginal posterior of \({\varvec{\Omega }}\) is not a function of \({\textbf{X}}\). We also showed a bound on the information gain under the Normal-MGIG prior which generalizes to the case of any independent priors. Our findings are highly relevant for domain scientists who aim to design optimal experiments to infer the precision matrix.

We further verified our theoretical conclusions using numerical simulations, where we showed that without certain prior knowledge on \(\textbf{B}\), experiments provide nearly no information gain, nor do they improve the estimation of \({\varvec{\Omega }}\). Furthermore, it is not enough for an experiment to be specific; prior knowledge about this specificity is also needed (more examples below under Practical advice for domain scientists).

Connections to multicollinearity. Chain graph models have a dependence property that is similar to multicollinearity in classical regression. Take the conditional distribution of the qth response node in sample \({\textbf{Y}}_i \in \mathbb {R}^k\) with the design \({\textbf{X}}_i \in \mathbb {R}^p\):

$$\begin{aligned} \left[ Y_{qi}\mid {\textbf{X}}_i= \textbf{x}_i,{\textbf{Y}}_{-q,i}= \textbf{y}_{-q,i}\right] = \frac{1}{\omega _{qq}}\sum _{j=1}^p\beta _{jq}x_{ji}-\frac{1}{\omega _{qq}}\sum _{l\ne q} \omega _{ql} y_{li}+\epsilon _{qi} \end{aligned}$$
(12)

where \(\epsilon _{qi}\sim {\mathcal {N}}(0,1/\omega _{qq})\), \(\beta _{jq}\) is the (j,q) entry of the \(\textbf{B}\) matrix, \(\omega _{qq}\) and \(\omega _{ql}\) are the (q,q) and (q,l) entries of the \({\varvec{\Omega }}\) matrix, and \({\textbf{Y}}_{-q,i}\) is the vector of responses for sample i without the qth response. Multicollinearity arises in this model because the correlation between \({\textbf{Y}}_l\) and \({\textbf{X}}_j\) is 0 only if the (j,l) entry of \(\textbf{B}\varvec{\Omega }^{-1}\) is 0, that is, only if \(\textbf{B}_{j\cdot } {\varvec{\Omega }}^{-1}_{\cdot l}=0\), a condition that rarely holds for all j and l. Thus, in practice, we are most likely to have some intrinsic multicollinearity in chain graph models. In univariate settings, we could in principle design experiments to avoid problems caused by multicollinearity. When such experiments are hard to conduct, an alternative approach is to place informative priors on some of the parameters. For instance, if two predictors are collinear, the sum of the two respective regression coefficients can be easily identified, but not the individual ones; however, one individual coefficient can be identified if we have an informative prior on the other. Carrying this intuition to chain graphs, where we have multicollinearity between the \(\omega \)’s and \(\beta \)’s, prior knowledge on the regression coefficients \(\textbf{B}\) might actually help the estimation of \(\varvec{\Omega }\) under certain experimental conditions. Thus, our work to explore the interplay between experimental design and prior knowledge in chain graph models is also justified by the classical regression setting, which has routinely used prior knowledge to infer parameters in cases where multicollinearity arises.
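A small numerical check of this point (our own sketch; the specific \(\textbf{B}\) and \({\varvec{\Omega }}\) below are arbitrary): the marginal cross-covariance between predictors and responses equals \(\textbf{B}{\varvec{\Omega }}^{-1}\), which is generally dense even when \(\textbf{B}\) is sparse.

```r
# Sketch (ours): the marginal association between X_j and Y_l is governed by (B Omega^{-1})_{jl},
# which can be nonzero even for entries where B itself is zero.
library(MASS)
set.seed(1)
k <- 3; p <- 2; n <- 1e4
Omega <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), k, k)
B <- matrix(c(1, 0, 0,          # predictor 1 is "specific" to response 1
              0, 1, 0), p, k, byrow = TRUE)
Sigma <- solve(Omega)
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% B %*% Sigma + mvrnorm(n, rep(0, k), Sigma)
round(cov(X, Y), 2)             # empirical cross-covariance, approximately B %*% Sigma
round(B %*% Sigma, 2)           # e.g., entry (1, 3) is nonzero although B[1, 3] = 0
```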

Future directions. As mentioned, the difficulty of designing experiments for the estimation of the precision matrix in a chain graph model is similar to the multicollinearity problem in univariate regression. In both cases, a prior can help identify one set of parameters to better infer another set of parameters. A natural question is whether this is also true in a general Gibbs measure with two-body interactions. For instance, the auto-logistic model (Ising model) has been used to infer networks, and existing work has discussed the experimental design of this model when the effect of the treatment is completely known (Jiang et al. 2019). One interesting question is whether, in order to design experiments effectively to infer the network among responses with this model, we need prior knowledge on the effect of the treatment. One difficulty in answering this question is the intractable normalizing constant (partition function) of a general Gibbs measure. Approximations of the partition function proposed by Wainwright and Jordan (2006) might help connect it with what we already know for the Gaussian case.

Practical advice for domain scientists. As shown in both our theoretical and simulation studies, for experiments to aid the estimation of the precision matrix under a chain graph model, the experiment should have large marginal effects, and the experimenter should have prior knowledge of the conditional effects of the predictors on the responses, with high certainty on those conditional effects. For instance, if an experimenter wants to understand a microbial community, she could try different candidate experiments on the community to identify the treatment that alters the community the most (i.e., that has large marginal effects). Then, the experimenter should culture several of the species to evaluate the effect of those candidate treatments (i.e., gain prior knowledge of \(\textbf{B}\)). By focusing on a few single species, the experimenter will (ideally) have high certainty on the conditional effects of some of the treatments.

Similarly, an experiment in which the experimenter knocks out one gene and evaluates the reaction of another gene in order to infer the interaction between the two genes is useful because it can affect both target genes (marginal effect) while we know (by assumption) that it is specific to one of the genes (good prior knowledge of its conditional effect). While there is keen interest in experiments that are specific (e.g., gene knockouts), our theory shows that specificity (that is, the row in \(\textbf{B}\) having only one nonzero entry) is not necessary for the experiment to be useful in the inference of the precision matrix \({\varvec{\Omega }}\). However, specificity is helpful, given that it is easier to obtain prior knowledge about specificity (being certain that some entries of \(\textbf{B}\) are zero) than to obtain prior knowledge about multiple nonzero entries of \(\textbf{B}\).

For any given experiment, the experimenter has control over the design (\({\textbf{X}}\)) and the prior on the regression coefficients (\(\textbf{B}\)). Our findings show that without prior knowledge of \(\textbf{B}\) for at least some individual predictors, experiments produce zero information gain for the estimation of \({\varvec{\Omega }}\). Our work draws attention to the importance of a thorough analysis of priors and experimental design for domain scientists who aim to infer biological network structures from controlled experimental data.