Random graphical model of microbiome interactions in related environments

The microbiome constitutes a complex microbial ecology of interacting components that regulates important pathways in the host. Measurements of microbial abundances are key to learning the intricate network of interactions amongst microbes. Microbial communities at various body sites tend to share some overall common structure, while also showing diversity related to the needs of the local environment. We propose a computational approach for the joint inference of microbiota systems from metagenomic data for a number of body sites. The random graphical model (RGM) allows for heterogeneity across the different environments while quantifying their relatedness at the structural level. In addition, the model allows for the inclusion of external covariates at both the microbial and interaction levels, further adapting to the richness and complexity of microbiome data. Our results show how: the RGM approach is able to capture varying levels of structural similarity across the different body sites and how this is supported by their taxonomical classification; the Bayesian implementation of the RGM fully quantifies parameter uncertainty; the microbiome network posteriors show not only a stable core, but also interesting individual differences between the various body sites, as well as interpretable relationships between various classes of microbes.


Introduction
The microbiome constitutes a complex microbial ecology of interacting components that regulates important pathways in the host.Microbiotic systems have been intensively studied in recent years, and associations have been found with a number of health conditions, such as obesity (Le Chatelier et al., 2013), diabetes (Pedersen et al., 2016) and the response to immunotherapy (Lee et al., 2022).Rich sources of high-throughput data of the microbiome, such as those generated by the Human Microbiome Project (HMP Consortium, 2012) and the Metagenomics of the Human Intestinal Tract (MetaHIT) project (Qin et al., 2010) are key to learning the intricate network of interactions amongst microbe communities.
As the microbiome interacts with the local environment, the microbiome varies in constitution profile at different body sites (Sharon et al., 2022).For example, Segata et al. (2012) find four groups of digestive tract sites, characterized by distinct bacterial compositions and metabolic processes.Despite this heterogeneity, from a structural perspective, it is expected that the interaction profile is largely shared between different body sites.This constitutes a core microbiome network, describing stable components of the microbiome interactions across time, body sites and populations.
Most studies rely solely on fecal samples to represent the gut microbiome and on saliva samples to describe the oral microbiome (Sharon et al., 2022).Equally, all available methods and implementations, such as the commonly used SparCC (Friedman and Alm, 2012) and SPIEC-EASI (Kurtz et al., 2015), infer one microbiota system from a given dataset.As such, they are suited to learn either environment-specific systems from a study on that environment or a core microbiome network from pooled data across different body sites.Instead, we propose a computational approach for the joint inference of microbiota systems from metagenomic data for a number of body sites that captures both the core metabolic network as well as individual differences.
Closest to the approach proposed in this paper, Vinciotti et al. (2022) develop a Gaussian copula graphical model to infer microbiota systems from count genomic data.While the approach recovers a core microbiome network generating the data, the parametric form used for the marginals is able to capture both the heterogeneity of microbial abundances across different body sites and the typical features of microbial data, such as zero inflation and compositionality.An efficient Bayesian inferential procedure allows to quantify the uncertainty around the estimated parameters.This is the case also for the recovered network, i.e., interactions between microbes are returned with an associated posterior edge probability and partial correlation distribution.
In this paper, we aim to capture heterogeneity also at the structural level of microbial interactions, while quantifying the possible relatedness among microbiota systems from different environments.To this end, we augment the model of Vinciotti et al. (2022) with a random graph model on the conditional independence graphs that describe the joint microbial count distributions at each body site.We define a novel random graphical model as the combination of a graphical model with this random independence graph model.
Borrowing from the network science literature (Hoff et al., 2002), we formalise the random graph model as a latent probit network model, where the probability of an edge in a particular microbiota system depends on a latent space of potentially related environments, i.e., it will increase if the body site is close in this latent space to another environment where that particular edge is present.In addition, the edge probability depends on individual network sparsity levels for each body site and on external covariates at the network level.For the latter, we consider the effect of taxonomy sharing on the propensity of microbes to interact, but, in principle, any other coavariate or external knowledge can be included at this stage.
We apply the proposed random graphical model to a microbiome study, which is part of the Human Microbiome Project (HMP).After preprocessing, we use 4500 samples of 87 microbes across 13 body sites.Our analysis shows that the latent space is able to capture the biological relatedness between the 13 microbiotic systems.Indeed, the locations of the body sites in the inferred latent space match closely both with the classification made by Segata et al. (2012) and with the Uberon anatomy classification of body sites (Mungall et al., 2012).The environment-specific networks, and in particular their associated estimated edge probabilities, can be queried further, in order to characterize the individual networks as well as to highlight commonalities and differences between the 13 environments.Beyond the information that can be discovered from the data using the proposed model, we find that the new approach leads to a more stable recovery of the microbiotic systems, compared to individual analyses conducted for each body site separately.
In Section 2, we describe the random graphical model and the Bayesian inference procedure in detail, while in Section 3 we present the results on real and simulated data, before a discussion and conclusion in Section 4.

Random graphical model
In this section we define the random graphical model for network inference from heterogeneous microbiome data from a number of environments.Let p ) be the random vector of interest, that is the abundances of p Operating Taxonomic Units (OTUs) in environment k, with k = 1, . . ., B. In our study, B = 13.At an abstract level, we assume Y (k) to be distributed according to some graphical model (GM), relative to some conditional independence graph G (k) with some associated parameters Ω (k) .The graphs G = G (k)  k are themselves distributed according to a joint random graph model, for some vector of parameters Θ.
The type of graphical model and the type of random graph model depend on the situation under consideration.As for the graphical model, we consider the Gaussian copula graphical model, due to its easy mathematical formulation.It is a flexible model for multivariate non-Gaussian data, as the count microbiome data under consideration.Thus, similarly to (Cougoul et al., 2019;Vinciotti et al., 2022), we assume: where Φ p is the cumulative distribution function of a p-dimensional multivariate normal with a zero mean vector and correlation matrix R (k) = Ω (k) −1 , Φ is the standard univariate normal distribution function, and F j is the marginal distribution of OTU j.The dependency structure induced by this model in condition k is represented by the conditional independence graph G (k) .Following from the theory of Gaussian graphical models (Lauritzen, 1996), this is given by the zero-patterns of the inverse of the correlation matrix Ω (k) , typically called the precision or concentration matrix.
In order to adapt to the richness and heterogeneity of microbiome data, the marginal distribution of OTU j can be linked to external covariates, including body site and normalizing factors such as sequencing depth.As in (Vinciotti et al., 2022), we formalise this with the use of a parametric marginal model.In particular, we consider discrete Weibull regression marginals (Peluso et al., 2019), i.e., j = 1, . . ., p, with node covariates x = (1, x 1 , . . ., x m ) and regression coefficients η j and γ j associated to the two parameters defining the discrete Weibull distribution, respectively.Due to the discreteness of the data, the mapping to the latent Gaussian space z j = Φ −1 (F j (y j )) of the copula is not unique.Indeed, each observation (y j , x) is associated to an interval in the latent space, given by As for the joint random graph model, we are particularly interested in modelling the relatedness of the different environments as well as a possible link with external covariates/existing knowledge at the microbial interaction level.To this end, we formalise the model with the following latent probit network model (Hoff et al., 2002) where G j 1 ,j 2 (k) = 1 defines an edge between node Y j 1 and node Y j 2 in condition k, w ∈ R d is the vector of edge-specific covariates, c 1 , . . ., c B ∈ R 2 are the latent space variables for each condition, α k is the intercept of the model and relates to the overall sparsity level of graph G (k) .We denote with Θ = (α, β, c) the vector of parameters associated to the joint random graph model.

Bayesian inference
Figure 1 describes the proposed random graphical model and how it generates microbiome data from different, possibly related, environments.In particular, given the parameters Θ = (α, β, c) that define the latent network model, edges in G (k) become independent, conditional on the remaining graphs, and are, therefore, the result of Bernoulli draws.Given graphs G (k) for each condition k = 1, . . ., B, the data are then generated via a Gaussian copula graphical model (Vinciotti et al., 2022), i.e., positive-definite precision matrices are drawn via G-Wishart distributions, and the resulting multivariate normal draws are combined with the parametric marginals to generate count microbiome data.
In order to quantify the full uncertainty in the estimation of the parameters, we opt for a Bayesian inferential procedure.To this end, we consider the following prior distributions: N (0, 10) priors on each parameter in Θ, G-Wishart priors for the precision matrix Ω (k) ∼ W G (3, I p ) conditional on the graph G (k) , and N (0, 1) priors on each regression coefficient in (η j , γ j ), j = 1, . . ., p.
We have devised a Markov Chain Monte Carlo (MCMC) scheme for generating samples from the posterior distribution of the parameters.In particular, the scheme is made of the following steps: ) latent network probit model Gaussian copula GM 2. Gibbs sampling of the parameters Θ from their posterior distribution via a sequence of probit regressions (with offset): (Mohammadi and Wit, 2015) 5. Continuous time birth-death MCMC sampling of the graph generating a new graph from the current graph G (k) with an edge e added, removed or kept (Mohammadi and Wit, 2015).
Upon convergence, posterior distributions of all parameters are returned.We focus particularly on the parameters Θ, which provide information on the latent process generating the graphs and how related the different environments are at the structural level, and on the graphs G (k) , which are associated to posterior edge inclusion probabilities where N is the number of MCMC iterations and W (Ω t (k) , Θ) is the waiting time for graph G t (k) with precision matrix Ω t (k) (Vinciotti et al., 2022).Posterior distributions on the precision matrices Ω (k) are also available and can be converted to partial correlations for each edge, via These values give information also about the sign of the dependencies in each environment.
In a similar vein, posterior distributions of any network statistic of interest can be derived from the MCMC chain of graphs that is returned.

Simulation study
In order to clarify the data generating process behind the proposed random graphical model (Figure 1) and to assess its performance in inferring parameters from data, we simulate n = 346 observations on p = 87 variables for B = 13 environments, with the sample size and dimensions matching those of the real data.For the simulation, we construct a latent space Θ with the following components: α parameters drawn from a N (−2, 1) distribution, i.e., a high level of network sparsity; one edge covariate W from a U (−0.5, 0.5) distribution with an associated parameter β = 2.5; latent vectors c ∈ R 2 with each component drawn from a N (0, 0.3) distribution.Given Θ, we first generate 13 graphs G (k) via Bernoulli draws for each edge conditional on the others.We repeat the sampling for 100 iterations in order to sample from the actual joint distribution of the graphs.Given the true graphs, we sample their associated precision matrices from a W G (3, I p ) distribution and we finally obtain the data for each environment from a multivariate Gaussian draw.We omit here the case of discrete marginals and concentrate on the inference of the latent space and recovery of the networks.The data constructed as described above can be retrieved by running the function sim.rgm of the rgm package accompanying this paper, using the default values for the inputs.
Figure 2 reports the results after 10000 MCMC iterations, obtained by running the function rgm with prior distributions as described in Section 2.2.The computational time is about 14 seconds per iteration.We retain the last 25% of the iterations for the calculation of posterior edge distributions from Equation (4) and posterior distributions of the parameters Θ of the random graph model.The first plot shows a good recovery of the latent network space Θ, by comparing the true probit probabilities from Equation (3) with those obtained using the mean posterior estimates of the α, β and c parameters.The second plot shows an accurate reconstruction of the networks G (k) , by comparing the recovered graphs with Figure 2: Results from the simulation study, evaluated on the last 2500 MCMC iterations.Left: True probit probabilities from Equation (3) versus those calculated using the mean posterior estimates of the α, β and c parameters.Right: Receiver operating characteristic curves of the recovered graphs against the true ones for each environment, across a sequence of thresholds on the posterior edge probabilities from Equation ( 4).The 13 colours distinguish the 13 environments, respectively.
the true graphs, for each environment and across a sequence of thresholds on the posterior edge probabilities.The average area under the receiver operating characteristic curves is 0.95, across the 13 environments.
A number of covariates are considered both at the marginal level of each OTU and at the network level.As covariates x for the discrete Weibull marginal distribution for each OTU (Equation 1), we include the library size for each sample, estimated by the geometric mean of pairwise ratios of OTU abundances of that sample with all other samples (function GMPR in rMAGMA), dummy variables for each body site, and interactions between body sites and library size.This results in 26 parameters per OTU.As regards to the random graph model in Equation 3, this is defined by a sparsity parameter α k and a latent location c k ∈ R 2 , for each environment, and a vector β of regression coefficients associated to six binary variables (w) that encode the belonging of a pair of OTUs to the same taxonomy level.In particular, we consider the six taxonomy levels given, respectively, by the bacterial phylum, class, order, family, genus and species.
We fit discrete Weibull parametric marginals for each OTU via 50000 MCMC iterations (function bdw.reg in the BDgraph package (Mohammadi and Wit, 2019)).As typical of inferential approaches for Gaussian copula graphical models, this step is performed first, followed by a calculation of the intervals from Equation 2 using posterior mean estimates of the parameters (evaluated on the last 25% of the iterations).These intervals are then used for the subsequent learning of the structural dependencies, by iterating steps 2-5 of the procedure described in Section 2.2 (function rgm in the rgm package that accompanies this paper).Given the huge space of graphs, we let the Bayesian structural learning procedure run for 3 million MCMC iterations.All subsequent results are evaluated on the last 7,500 iterations.
Interpretation of results.The most immediate output of the analysis are the 13 networks that are inferred for each environment.Figure 3 summarises these networks by the posterior edge probabilities, calculated from Equation (4).It is clear that the networks tend to be sparse, and vary to some extend between the conditions.The reasons for this environmental network variation can be found from the random graph generative process, described by Equation (3). Figure 3 shows a high level of sparsity across all networks (mean posterior edge probabilities equal to 6.8%, on average across the 13 environments).This is captured by low α values of the fitted random graph model (mean posterior estimate -2.7, on average across the 13 environments).
Figure 3 shows a high structural similarity between some environments, with a high sharing of edges with high probability and of edges with low probability among similar environments.This is captured by the latent locations c of the random graph model, whose posterior means are plotted in Figure 4.For example, sup-gingiva and sub-gingiva are highly related environments, and similarly throat and tonsils.Indeed, in both cases the two associated c vectors have a large inner product, as they are close to each other in the space and far from zero.The indicator function in Equation (3) further encourages sharing of edges between these networks.Indeed, 93% of the edges with posterior edge probability greater than 0.5 are in common between the sup-gingiva and sub-gingiva networks, and 95% between the throat and tonsils networks.Looking at the posterior mean of partial correlations, calculated from the precision matrices via Equation ( 5), we find an agreement also on the sign of the dependency, with a correlation of 0.90 between sup-gingiva and sub-gingiva partial correlation values for each edge, and 0.93 between throat and tonsils.Finally, as the two pairs of networks are almost orthogonal to each other in the latent space, we expect little structural sharing between the sup-gingiva/sub-gingiva networks and the throat/tonsils networks.Indeed, they have the lowest agreement of high probability edges across all pairs, with an average sharing of 21.3%.
The similarities between the environments detected by the proposed method are partly supported by the Uberon anatomy classification of body sites, particularly when it comes to the three skin-related body sites.These are located close to each other in the latent space of Figure 4 and have on average 63% of high-probability edges in common (Figure 3).On the other hand, the oral cavity-related body sites appear to be further split into two groups.This is in line with the analysis of Segata et al. (2012), which find four groups of body sites based on similar community compositions, namely: cheek, ker-gingiva, palate; saliva, tongue, tonsils, throat; sub-gingiva and sup-gingiva; and stool.These groups are also clearly evident in Figure 4. Finally, the taxonomical relatedness of the microbes encourages the presence of a link between them.Figure 5 shows how the probability of two OTUs connecting, in any environment, is positively associated with their belonging to some of the taxonomy levels considered, in particular to the species, genus and class taxonomies.
Comparison with other methods.Figure 6 shows that the random graphical model leads also to a more stable recovery of the individual networks, compared to estimating individual networks.Indeed, the figure shows that the variances of the posterior edge probabilities are smaller for the proposed rgm approach than when fitting individual Gaussian copula graphical models for each environment separately.For the latter, we use the approach of Vinciotti et al. (2022) (function bdgraph.dw in the BDgraph package), which uses the same parametric marginals as those considered in this paper but a more traditional Erdös-Rényi random graph prior for each environment.To facilitate comparison, the prior edge probability of the Erdös-Rényi prior is set to match the sparsity level of the networks recovered by the rgm analysis.Figure 6 shows how the posterior probabilities detected by rgm are more concentrated on either 0 or 1, than with the alternative approach, leading to a lower level of noise on the structural dependencies that are recovered by the method proposed in this paper.

Figure 1 :
Figure 1: Sketch of the random graphical model generating microbiome data across different related environments.The model combines a latent network probit model with a Gaussian copula graphical model.

Figure 3 :
Figure 3: Random graphical model inferred from microbiome data: Posterior edge probabilities for each edge and each environment, rearranged via row and column clustering.

Figure 4 :
Figure 4: Random graphical model inferred from microbiome data: mean posterior locations of the body sites (c) in a 2D latent space.Colours refer to the Uberon anatomy classification of the body sites.

Figure 5 :
Figure 5: Random graphical model inferred from microbiome data: estimation of β parameters associated to six dummy variables indicating if a pair of nodes forming an edge belongs to the same taxonomy level