Abstract
Joint analysis of microbiome and metabolomic data represents an imperative objective as the field moves beyond basic microbiome association studies and turns towards mechanistic and translational investigations. We present a censored Gaussian graphical model framework, where the metabolomic data are treated as continuous and the microbiome data as censored at zero, to identify direct interactions (defined as conditional dependence relationships) between microbial species and metabolites. Simulated examples show that our method metaMint performs favorably compared to existing methods. metaMint also provides interpretable microbe-metabolite interactions when applied to a bacterial vaginosis data set. An R implementation of metaMint is available on GitHub.
1 Introduction
The field of microbiome research is shifting rapidly from cataloging the taxonomic compositions of microbial communities [1] to refined technologies that capture strain-level variations or amplicon sequence variants [2,3,4] and to multi-omics studies that better capture community functional activity [5]. In particular, metabolomics has been extremely useful in explaining microbial functional potential because of its capability in tracking microbially derived metabolites [6,7,8]. Associations between specific microbes and metabolites provide key insights and improved mechanistic models of host-microbe interactions [9,10,11,12]. In practice, the nonparametric Spearman’s rank correlation is often used to quantify the pairwise correlation between microbes and metabolites. However, Spearman’s rank correlation only captures marginal monotonic association and does not distinguish between direct and indirect interactions. In contrast, partial correlations measure conditional dependencies and allow the identification of direct interactions between microbes and metabolites [13].
One analytical challenge specific to microbiome data is the uneven sequencing depth that arises from differential efficiency of the sequencing process. The total number of reads in a sample is also constrained by the biological specimen at hand and does not reflect the absolute abundance present in the ecosystem. A common practice to address this issue is to transform the raw counts into relative abundances by normalizing over the total sequencing reads in each sample. In other words, raw sequencing counts are transformed into proportions of different microbes whose sum has to be one, also known as compositional data. Several lines of work have been proposed to model marginal and/or conditional microbial interactions from compositional data. For example, SparCC [14] and CCLasso [15] both estimate the linear Pearson correlations between log-transformed counts. A major limitation of marginal association measures such as the Pearson correlation is that they cannot distinguish between direct and indirect relationships [16]. To address this issue, SPIEC-EASI [17] learns the conditional dependencies between pairs of microbes while adjusting for effects from other species in the analysis. This is achieved by estimating the inverse covariance of the centered log-ratio (clr) transformed data using, e.g., the graphical lasso algorithm [18]. Fang et al. [19] assumed that the observed relative abundances follow the logistic normal distribution and proposed a Majorization-Minimization algorithm for learning the conditional dependence relationships among microbes.
Many of the aforementioned methods are specific to microbiome data and are not directly applicable for joint analysis of microbiome and other omics data types. One naive approach for joint estimation is to apply the graphical lasso algorithm directly to clr transformed microbiome and metabolomic data. However, as illustrated in Fig. 1, the Gaussian graphical model may be a poor fit for microbiome data because the marginal distributions of transformed raw counts are in fact highly skewed and often zero inflated.
This motivates the need for new statistical methodology that can accommodate both microbiome and metabolomic data while accounting for the zero inflation in microbial abundance. Some zero values are sampling zeros that arise due to limited sequencing depths, whereas others are biological zeros that indicate complete absence of a species [20]. Silverman et al. [21] in an unpublished manuscript illustrated that biological zeros in many applications can be approximated as sampling zeros because they both represent a truly low abundance. In this paper, we treat the observed zeros as due to undersampling, and propose a censored Gaussian graphical model (cGGM) to infer the conditional dependencies among microbes and metabolites. Specifically, let \({\varvec{W}}=(W_1, \ldots , W_q)^{\intercal }\) with \(W_j>0\) for all j be the latent variables, called the basis, that represent the true absolute abundance for each species. Due to undersampling and uneven sequencing depths, the observed abundance \({\varvec{R}}\) is related to \({\varvec{W}}\) via
where \(N>0\) is a scaling factor that may depend on \({\varvec{W}}\), \(u_j\) is a constant which indicates the limit of detection for the jth variable, and \({\varvec{I}}(\cdot )\) is the indicator function. The censoring value \(u_j\) may be known from the experiment or estimated from data. To adjust for the uneven sequencing depths, we apply the modified clr (mclr) transformation to \({\varvec{R}}\), which transforms all nonzero counts using the usual clr and shifts all transformed values to be strictly positive [22]. The diagonal panels in Fig. 1 show the histograms of mclr-transformed abundances. Compared to the usual clr transformation that requires a pseudo count when dealing with zeros, mclr preserves the ranking of observed counts across multiple samples and is less biased towards rare species [22]. Denote by \({\varvec{X}}_1 = {\text{mclr}}_{\varepsilon }({\varvec{R}})\) the resulting vector after mclr transformation with parameter \(\varepsilon \), which we elaborate on in Sect. 2.3. Let \({\varvec{X}}_2=(X_{q+1},\ldots , X_{p})^{\intercal }\) denote the log-transformed concentration measures from \(p-q \ (p>q)\) metabolites. A natural model for integrating microbiome and metabolomic data is to assume that \({\varvec{X}}_1\) and \({\varvec{X}}_2\) jointly follow a censored multivariate normal distribution with mean \({\varvec{\mu }}\) and covariance \(\varSigma \). Zero entries in the inverse covariance matrix \(\varOmega =\varSigma ^{-1}\) capture the conditional independence relationships among the microbes and metabolites.
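The link between zeros of the inverse covariance matrix and direct interactions can be made concrete with a standard identity for Gaussian vectors. The display below is an illustrative note rather than part of the model specification; it applies to the underlying (uncensored) Gaussian vector, and the notation \(\rho _{jk\mid \text{rest}}\) for the partial correlation is introduced here only for illustration.

```latex
\[
\rho _{jk\mid \text{rest}}
  = -\frac{\varOmega _{jk}}{\sqrt{\varOmega _{jj}\,\varOmega _{kk}}},
\qquad\text{so that}\qquad
\varOmega _{jk}=0
\iff
\rho _{jk\mid \text{rest}}=0 .
\]
```

In words, the pair (j, k) has no direct interaction exactly when the corresponding off-diagonal entry of \(\varOmega \) vanishes, even though the marginal correlation \(\varSigma _{jk}\) may still be nonzero through shared neighbors.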
The problem of inferring the joint microbe-metabolite network thus reduces to estimating \(\varOmega \) from n independent and identically distributed observations on \(({\varvec{X}}_1,{\varvec{X}}_2)\). We provide metaMint, which is based on estimating each pairwise marginal correlation by maximum likelihood. Given the estimated correlation matrix, metaMint uses the graphical lasso to recover the conditional dependencies between microbes and metabolites (direct interactions). We compare our method with several existing approaches in simulations, and show that metaMint outperforms the others in network structural recovery and accuracy of estimating the inverse covariance matrix. When applied to a real data set on bacterial vaginosis [9], the integrated network reveals biologically relevant microbe-metabolite interactions and also identifies novel interactions that may serve as potential biomarkers for diagnosis and treatment of bacterial vaginosis.
The censored multivariate normal distribution has been commonly used to analyze environmental data that are often subject to prespecified detection limits. For example, Hoffman and Johnson [23], Pesonen et al. [24] and Jones et al. [25] studied covariance estimation for the left-censored multivariate normal distribution in the classic low-dimensional setting. Recently, Augugliaro et al. [26] proposed an approximated EM algorithm for inverse covariance estimation in the high-dimensional setting and applied the method to single-cell data. The work by McDavid et al. [27] was also motivated by single-cell data, but the authors proposed the zero-inflated Gaussian graphical model, which treats zeros as coming from a degenerate point mass at zero instead of being censored. Compared to existing literature, our contribution is a unified model for joint estimation of the integrated microbe and metabolite network in the high-dimensional setting. In the simulations of Sect. 3, our algorithm performs well across three different network topologies.
The rest of the paper is organized as follows. In Sect. 2, we describe the censored Gaussian graphical model framework and the proposed algorithm. We present extensive numerical studies in Sect. 3 and a real data example on bacterial vaginosis in Sect. 4. We conclude our paper with discussions in Sect. 5.
2 The Censored Gaussian Graphical Model
The censored Gaussian graphical model is suitable for zero-inflated data, which is often the case with microbiome data as shown in Fig. 1. In practice, it is reasonable to assume that the observed zeros are due to undersampling or censoring from below.
Definition 1
A random vector \({\varvec{X}}\) is said to follow a censored multivariate normal distribution with mean \({\varvec{\mu }}\) and covariance \(\varSigma \) if there exist constants \(u_1, \ldots , u_p\) such that \(X_j=Y_j {\varvec{I}}(Y_j > u_j) + u_j {\varvec{I}}(Y_j \le u_j)\) for \(j=1,\ldots ,p\), where
\[{\varvec{Y}}=(Y_1,\ldots ,Y_p)^{\intercal } \sim N({\varvec{\mu }}, \varSigma ).\]
The censoring values \({\varvec{u}}= (u_1,\ldots , u_p)^\intercal \) are experiment specific and can be inferred from data. For example, one can use the smallest observed value as an estimate, provided that it occurs in more than a prespecified fraction of the samples (e.g., 10%). The threshold is necessary to ensure that the smallest value occurs more often than would be expected by chance. For zero-inflated microbiome data, the censoring values are set to 0. When there is no censoring in the jth variable, we set \(u_j=-\infty \).
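The censoring mechanism in Definition 1 and the frequency rule for choosing a censoring value can be sketched in a few lines of Python. This is a minimal illustration, not the metaMint code; the function names `censor_from_below` and `estimate_censoring_value` and the default `min_fraction=0.10` are assumptions made for this example.

```python
def censor_from_below(y, u):
    """Apply X_j = Y_j * I(Y_j > u_j) + u_j * I(Y_j <= u_j) elementwise."""
    return [yi if yi > u else u for yi in y]


def estimate_censoring_value(x, min_fraction=0.10):
    """Return the smallest observed value if it is frequent enough, else None.

    None stands for "no censoring detected", i.e. u_j = -infinity.
    """
    smallest = min(x)
    share = sum(1 for xi in x if xi == smallest) / len(x)
    return smallest if share > min_fraction else None


latent = [-1.2, 0.3, 2.0, -0.1, 0.8]          # hypothetical latent values Y_j
observed = censor_from_below(latent, 0.0)      # values at or below 0 are recorded as 0
u_hat = estimate_censoring_value(observed)     # 2 of 5 observations sit at the minimum
```

Returning `None` instead of negative infinity is a design choice for the sketch: it keeps the "uncensored" case explicit for downstream code that has to pick the right likelihood.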
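The density of the multivariate normal distribution with mean \({\varvec{\mu }}\) and inverse covariance \(\varOmega =\varSigma ^{-1}\) is
\[f({\varvec{y}}; {\varvec{\mu }}, \varOmega ) = (2\pi )^{-p/2} (\det \varOmega )^{1/2} \exp \left\{ -\frac{1}{2} ({\varvec{y}}-{\varvec{\mu }})^{\intercal } \varOmega \, ({\varvec{y}}-{\varvec{\mu }}) \right\} .\]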
Without loss of generality, let \({\varvec{X}}= ({\varvec{X}}_o, {\varvec{X}}_c)\) where \({\varvec{X}}_o\) denotes the uncensored components and \({\varvec{X}}_c\) denotes the censored components. Given censoring values \({\varvec{u}}= (-\infty ,\ldots ,-\infty , {\varvec{u}}_c)\), the density function of \({\varvec{X}}\) is
Let \(\{{\varvec{x}}^{(1)}, \ldots , {\varvec{x}}^{(n)}\}\) denote a set of n independent and identically distributed observations on \({\varvec{X}}\). In high-dimensional settings, a natural strategy to estimate the inverse covariance matrix is to maximize the \(\ell _1\)-penalized loss function
where \(\lambda _n\) is a regularization parameter that controls the sparsity of \(\varOmega \). However, direct optimization of (3) is challenging due to the integral in (2) over a potentially high-dimensional space. Augugliaro et al. [26] studied a general version of (3) where variables can be left and right censored. They proposed to use the EM algorithm to optimize the expectation of the full log-likelihood with respect to the conditional distribution \({\varvec{X}}_c\mid {\varvec{X}}_o\). However, an exact implementation of the EM algorithm is computationally challenging as it requires the second moment of \({\varvec{X}}_c \mid {\varvec{X}}_o\), which is a multivariate truncated Gaussian. The approximation in Augugliaro et al. [26] is adapted from Guo et al. [28] and only works well when the inverse covariance matrix is very sparse or the regularization parameter \(\lambda _n\) is large.
2.1 A Direct Estimator Via Marginal Correlations
Our proposal metaMint is based on estimating the marginal correlations directly. A similar idea was used to estimate the correlation matrix of ordinal graphical models [29], where the authors showed that the direct estimator achieves more accurate estimation of the inverse covariance matrix compared to the approximated EM approach in Guo et al. [28].
The first step in metaMint is to estimate the marginal distribution for each variable, which can be done by fitting a univariate Tobit model [30] and has been implemented in the R package censReg [31]. Let \({\hat{\mu }}_j\) and \({\hat{\sigma }}_j^2\) be, respectively, the estimate of the mean and variance for the jth variable. It can be shown that \({\hat{\mu }}_j\) is a consistent estimate of \(\mu _j\), and \({\hat{\sigma }}_j^2\) is consistent for \(\sigma _j^2=\varSigma _{jj}\). To find the empirical covariance matrix \({\hat{\varSigma }}\), it suffices to estimate each pairwise correlation.
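The univariate Tobit fit in this first step amounts to maximizing a left-censored normal log-likelihood. The sketch below shows that likelihood in plain Python; it is not the censReg implementation, and the helper names (`tobit_loglik`, `fit_tobit_grid`) as well as the crude grid search standing in for a numerical optimizer are assumptions made for illustration.

```python
import math


def norm_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(x, mu, sigma):
    # Phi(z) = 0.5 * erfc(-z / sqrt(2)); the floor avoids log(0) far in the tail.
    p = 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))
    return math.log(max(p, 1e-300))


def tobit_loglik(x, u, mu, sigma):
    """Log-likelihood of observations left-censored at u under N(mu, sigma^2)."""
    total = 0.0
    for xi in x:
        if xi > u:                       # observed: normal density
            total += norm_logpdf(xi, mu, sigma)
        else:                            # censored: P(Y <= u)
            total += norm_logcdf(u, mu, sigma)
    return total


def fit_tobit_grid(x, u, mus, sigmas):
    """Pick the (mu, sigma) pair on a grid with the largest log-likelihood."""
    return max(((m, s) for m in mus for s in sigmas),
               key=lambda ms: tobit_loglik(x, u, ms[0], ms[1]))


x = [0.0, 0.0, 0.4, 1.1, 1.9, 2.5]                     # hypothetical data, censored at 0
mus = [i / 10.0 for i in range(-10, 21)]               # candidate means
sigmas = [i / 10.0 for i in range(5, 31)]              # candidate standard deviations
mu_hat, sigma_hat = fit_tobit_grid(x, 0.0, mus, sigmas)
```

In practice a gradient-based optimizer would replace the grid; the grid is used here only to keep the example dependency-free.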
Suppose we have two variables \(X_j\) and \(X_k\ (j<k)\). If no observation is censored, it is straightforward to estimate their correlation using Pearson’s correlation coefficient. In the following, we provide details on correlation estimation when at least one variable is censored.
Consider first the case where both variables \(X_j\) and \(X_k\) are censored from below at \(u_j\) and \(u_k\), respectively. For the ith observation, let \(\eta _{ij} = {\varvec{I}}(x^{(i)}_j>u_{j})\) indicate whether the jth variable is observed above its censoring value, so that \(\eta _{ij}=0\) when it is censored. The pairwise joint log-likelihood can be written as a function of the correlation \( \rho _{jk}\),
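where \(Y_j\) and \(Y_k\) are bivariate normal with mean \((\mu _j, \mu _k)^\intercal \) and covariance
\[\begin{pmatrix} \sigma _j^2 &amp; \rho _{jk}\sigma _j\sigma _k \\ \rho _{jk}\sigma _j\sigma _k &amp; \sigma _k^2 \end{pmatrix}.\]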
Let \(\phi (\cdot )\) and \(\varPhi (\cdot )\) denote, respectively, the density and the cumulative distribution function (c.d.f.) of a standard normal variable. Let the c.d.f. of a bivariate standard normal variable with correlation \(\rho \) be \(\varPhi _2(u,v,\rho )\). The conditional distribution \(Y_k \mid Y_j=x^{(i)}_j\) is again a normal distribution with mean \({\tilde{\mu }}_k = \mu _k + \frac{\sigma _k}{\sigma _j}\rho _{jk}(x^{(i)}_j - \mu _j)\) and standard deviation \({\tilde{\sigma }}_k = \sigma _k \sqrt{1-\rho _{jk}^2}\). The pairwise joint log-likelihood thus becomes
where
If \(u_{j} = -\infty \), the jth variable is never censored, which yields a bivariate random vector with only one of the two variables being censored. Then the joint log-likelihood becomes
We can solve for \(\rho _{jk}\) as
Because the entries of \({\hat{\varSigma }}\) are estimated separately, \({\hat{\varSigma }}\) is not guaranteed to be positive semidefinite, as a valid covariance matrix must be. One way of bypassing this issue is to project \({\hat{\varSigma }}\) onto the positive semidefinite cone, as done in Fan et al. [32]. In practice, one can calculate the eigendecomposition of \({\hat{\varSigma }}\) and threshold the negative eigenvalues to zero, which yields a new estimator \({\widetilde{ \varSigma }}\).
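The pairwise step that produces each entry of \({\hat{\varSigma }}\) is easiest to illustrate when only one variable of the pair is censored, because the likelihood then needs only the univariate normal density and c.d.f. The sketch below assumes \(X_j\) is fully observed and \(X_k\) is left-censored at \(u_k\), and uses the conditional mean and standard deviation of \(Y_k \mid Y_j\) given above. The function names and the grid search over \(\rho \) are assumptions for this example; the doubly censored case additionally requires the bivariate normal c.d.f. and is not covered.

```python
import math


def norm_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(x, mu, sigma):
    # Phi(z) = 0.5 * erfc(-z / sqrt(2)); the floor avoids log(0) far in the tail.
    p = 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))
    return math.log(max(p, 1e-300))


def mixed_pair_loglik(rho, xj, xk, u_k, mu_j, sigma_j, mu_k, sigma_k):
    """Joint log-likelihood in rho when X_j is observed and X_k is censored at u_k."""
    cond_sd = sigma_k * math.sqrt(1.0 - rho * rho)
    total = 0.0
    for a, b in zip(xj, xk):
        cond_mean = mu_k + (sigma_k / sigma_j) * rho * (a - mu_j)
        total += norm_logpdf(a, mu_j, sigma_j)            # marginal density of X_j
        if b > u_k:                                       # X_k observed
            total += norm_logpdf(b, cond_mean, cond_sd)
        else:                                             # X_k censored: P(Y_k <= u_k | Y_j = a)
            total += norm_logcdf(u_k, cond_mean, cond_sd)
    return total


def estimate_rho(xj, xk, u_k, mu_j, sigma_j, mu_k, sigma_k):
    """Maximize the pairwise log-likelihood over a grid of correlations in (-1, 1)."""
    grid = [i / 100.0 for i in range(-99, 100)]
    return max(grid, key=lambda r: mixed_pair_loglik(
        r, xj, xk, u_k, mu_j, sigma_j, mu_k, sigma_k))


xj = [-2.0, -1.0, 0.0, 1.0, 2.0]          # hypothetical uncensored variable
xk = [0.0, 0.0, 0.0, 1.0, 2.0]            # hypothetical variable censored at 0
rho_hat = estimate_rho(xj, xk, 0.0, 0.0, 1.0, 0.0, 1.0)
```

The marginal means and standard deviations are passed in as fixed values, mirroring the two-stage idea of plugging in marginal estimates before estimating each correlation.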
Given \({\widetilde{\varSigma }}\), one can apply the graphical lasso algorithm [18]
to solve for the inverse covariance matrix \(\varOmega \).
Remark 1
The graphical lasso in (5) can be replaced with other algorithms for inverse covariance matrix estimation such as the method by Cai et al. [33] or its adaptive version [34].
metaMint has been implemented in R. In particular, the optimization in (4) is solved using the optim function in R, and (5) is solved by the graphical lasso algorithm in the glasso package.
2.2 Tuning Parameter Selection
As with other penalization-based methods, the proposed algorithm requires the specification of a tuning parameter \(\lambda _n\) that controls the sparsity of the inverse covariance matrix. One can use the cross-validation procedure in Guo et al. [28] or the stability approach in Liu et al. [35] to select the optimal parameter. In simulations where the ground truth is known, model selection can also be done by maximizing the accuracy in network structural recovery. In Sect. 4 on real data analysis, we used the stability approach in Liu et al. [35].
2.3 The Modified Centered Log-Ratio
The centered log-ratio transformation is often used to transform observed microbial counts to values that are comparable across samples before downstream analysis [36,37,38]. Let \(g({\varvec{r}}) = (\Pi _{j=1}^p r_j)^{1/p}\) denote the geometric mean of \({\varvec{r}}=(r_1,\ldots ,r_p)\). The clr of \({\varvec{r}}\) is defined as
\[\text{clr}({\varvec{r}}) = \left( \log \frac{r_1}{g({\varvec{r}})}, \ldots , \log \frac{r_p}{g({\varvec{r}})} \right) .\]
In practice, each sample may consist of many rare species that have zero counts. Thus a pseudo count of 0.5 or 1 is often added to all counts before clr is applied. However, this practice may unfairly bias rare species and impact the accuracy in correlation estimation. The modified centered log-ratio (mclr) [22] attempts to address this limitation by transforming the nonzero counts with the usual clr and shifting all transformed values to be strictly positive.
Without loss of generality, let \({\varvec{r}}^{(i)}= ({\varvec{r}}^{(i)}_1,\mathbf{0})^\intercal = (N_i {\varvec{w}}^{(i)}_1,\mathbf{0})^\intercal \) where only components in \({\varvec{r}}^{(i)}_1\) (and \({\varvec{w}}^{(i)}_1\)) are positive. Although the sample-specific scaling factor \(N_i\) does not affect the relative abundances in sample i, it captures the variation among total sequencing reads. For example, Vandeputte et al. [39] observed up to tenfold differences in the total microbial loads after correcting for microbial cell counts. We define \({\text{mclr}}_{\varepsilon }({\varvec{r}}^{(i)})\) as \((\text{clr}({\varvec{r}}^{(i)}_1)+\varepsilon , \mathbf{0})^\intercal \), where the constant \(\varepsilon \) is set to be \(|\min _{i,j} \log \{r_j^{(i)}/g({\varvec{r}}_1^{(i)})\}|+c\), with the minimum taken over the positive counts, and \(c>0\) is a small constant used to differentiate small positive counts from observed zeros. The resulting \({\text{mclr}}_{\varepsilon }({\varvec{r}}^{(i)})\) is independent of the scaling factor \(N_i\), because \(\text{clr}({\varvec{r}}^{(i)}_1)=\text{clr}({\varvec{w}}^{(i)}_1)\). In contrast, adding a pseudo count to zeros and applying clr may introduce unnecessary bias towards zero counts. Figure 2 illustrates the marginal distributions of the genus Fusobacterium after the two transformations. Compared to clr, the mclr preserves the relative ranking of all counts while adjusting for the total sequencing depths.
Lastly, it is worth mentioning that mclr defined above is equivalent to transforming the relative abundances as done in Yoon et al. [22]. To see this, let the relative abundance \({\varvec{z}}^{(i)}\) be defined such that \(z_{j}^{(i)} = r_{j}^{(i)}/S\), where \(S=\sum _{j=1}^p r_{j}^{(i)}\). Moreover, we can write \({\varvec{z}}^{(i)}=({\varvec{z}}^{(i)}_1,\mathbf{0})^\intercal \) such that only components in \({\varvec{z}}^{(i)}_1\) are positive. For any \(z_{j}^{(i)}>0\),
\[\log \frac{z_j^{(i)}}{g({\varvec{z}}_1^{(i)})} = \log \frac{r_j^{(i)}/S}{g({\varvec{r}}_1^{(i)})/S} = \log \frac{r_j^{(i)}}{g({\varvec{r}}_1^{(i)})}.\]
In other words, mclr is scale invariant.
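The transformation and its scale invariance can be checked numerically. The following is a minimal sketch, not the reference implementation from [22]: the function name `mclr` is chosen for this example, each sample is assumed to contain at least one positive count, and \(\varepsilon \) is taken as the absolute value of the smallest clr value plus c, which is what makes every transformed positive count strictly positive.

```python
import math


def mclr(counts, c=0.1):
    """Modified clr of a list of samples (rows of non-negative counts).

    Assumes every row has at least one positive entry.
    """
    clr_rows = []
    for row in counts:
        positive = [r for r in row if r > 0]
        # log of the geometric mean over the positive entries only
        log_gm = sum(math.log(r) for r in positive) / len(positive)
        clr_rows.append([math.log(r) - log_gm if r > 0 else None for r in row])
    smallest = min(v for row in clr_rows for v in row if v is not None)
    eps = abs(smallest) + c              # shift so that all nonzero entries exceed 0
    return [[v + eps if v is not None else 0.0 for v in row] for row in clr_rows]


counts = [[0, 2, 8], [1, 0, 4]]          # hypothetical counts for two samples
rescaled = [[0, 20, 80], [3, 0, 12]]     # same samples with different sequencing depths
x1 = mclr(counts)
x2 = mclr(rescaled)                      # agrees with x1 up to floating-point error
```

Zeros remain exactly zero after the transformation, which is what allows them to be treated as left-censored values in the model.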
3 Simulation Studies
3.1 Model Setup
We first generated \({\varvec{y}}^{(i)}\ (i=1,\ldots ,n)\) from a multivariate normal distribution with mean \({\varvec{\mu }}_0\) and inverse covariance \(\varOmega _0\). The mean parameter \({\varvec{\mu }}_0\) was generated uniformly from \([0.5,2]\) to reflect the heterogeneity in abundances of microbial sequences and metabolites. To generate the inverse covariance matrix \(\varOmega _0\), we considered the following network models, each with p nodes:

(1) Scale-free network. This network was generated using the Barabási-Albert algorithm [40] and has \((p-1)\) edges. The left panel of Fig. 3 illustrates a scale-free network.
(2) Erdős-Rényi random graph [41]. This network has p edges, as illustrated in the middle panel of Fig. 3.
(3) Nearest-neighbor network. We constructed this network using the same procedure described in Guo et al. [28], where we uniformly sampled p points on a unit square and linked any two points that are 5 nearest neighbors of each other in terms of their Euclidean distances. This network has about 2.5p edges. The right panel of Fig. 3 illustrates one realization of a sparse network generated with 2 nearest neighbors.
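The three topologies above can be sketched with the Python standard library. This is an illustrative generator under stated assumptions, not the code used for the reported simulations: the scale-free graph is grown by attaching each new node to one existing node with probability proportional to its degree (which gives p-1 edges), and all function names are hypothetical.

```python
import random


def scale_free_tree_edges(p, rng):
    """Preferential attachment with one edge per new node: p - 1 edges in total."""
    edges = {(0, 1)}
    endpoints = [0, 1]                   # each node appears once per incident edge
    for new in range(2, p):
        old = rng.choice(endpoints)      # chosen with probability proportional to degree
        edges.add((old, new))
        endpoints += [old, new]
    return edges


def erdos_renyi_edges(p, n_edges, rng):
    """Sample n_edges distinct undirected edges uniformly among p nodes."""
    all_pairs = [(j, k) for j in range(p) for k in range(j + 1, p)]
    return set(rng.sample(all_pairs, n_edges))


def mutual_knn_edges(p, k, rng):
    """Link points on the unit square that are among each other's k nearest neighbors."""
    pts = [(rng.random(), rng.random()) for _ in range(p)]

    def neighbors(j):
        others = sorted((m for m in range(p) if m != j),
                        key=lambda m: (pts[j][0] - pts[m][0]) ** 2
                        + (pts[j][1] - pts[m][1]) ** 2)
        return set(others[:k])

    nn = [neighbors(j) for j in range(p)]
    return {(j, m) for j in range(p) for m in nn[j] if j < m and j in nn[m]}


rng = random.Random(1)
sf = scale_free_tree_edges(60, rng)
er = erdos_renyi_edges(60, 60, rng)
knn = mutual_knn_edges(60, 5, rng)
```

Each function returns a set of index pairs (j, k) with j < k, which can then be used as the support of the off-diagonal entries of \(\varOmega _0\).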
Given the network topology, the off-diagonal entries in \(\varOmega _0\) were generated uniformly from \([-1,-0.5] \cup [0.5,1]\), with diagonal entries being \(|\varLambda _{\min }(\varOmega _0^{-})| + 0.1\). Here \(\varOmega _0^{-}\) represents the matrix \(\varOmega _0\) with zeros in the diagonal and \(\varLambda _{\min }(A)\) denotes the smallest eigenvalue of A. The covariance matrix \(\varSigma _0\) is then determined by
\[\varSigma _0 = D\,\varOmega _0^{-1}D, \qquad D = \text{diag}\{(\varOmega _0^{-1})_{11}^{-1/2}, \ldots , (\varOmega _0^{-1})_{pp}^{-1/2}\}.\]
By construction, the diagonal entries of \(\varSigma _0\) are all 1.
Given the latent \({\varvec{y}}^{(i)}\), the basis vector \({\varvec{w}}^{(i)} = (w_{1}^{(i)}, \ldots , w_{p}^{(i)})^{\intercal }\) was obtained through the transformation \(w_{j}^{(i)} = e^{y_{j}^{(i)}}\). Censored abundances \({\varvec{r}}^{(i)} = (r_{1}^{(i)}, \ldots , r_{p}^{(i)})^{\intercal }\) were generated such that
where \(N_i\) is generated uniformly between 1 and 10. Here q indicates the number of microbes. Only microbiome data are assumed to be censored and compositional in this article, but this assumption can be relaxed in general. In all simulations, we set the constant \(c=0.1\) in the modified clr transformation. Denote \({\varvec{x}}^{(i)}_1 = {\text{mclr}}_{\varepsilon }({\varvec{r}}_{1:q}^{(i)})\) and the observed abundances \({\varvec{x}}^{(i)}= ({\varvec{x}}^{(i)}_1,\log r^{(i)}_{q+1}, \ldots , \log r^{(i)}_{p})^\intercal \).
3.2 Results
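We compared metaMint with SPIEC-EASI [17] and gCoda [19]. The oracle estimator obtained from the latent basis \(\{{\varvec{w}}^{(i)}\}_{i=1}^n\) is used as a benchmark, though in practice the oracle is generally unknown. To evaluate the performance of network recovery, we used the receiver operating characteristic (ROC) curve to plot the true positive rate (TPR) against the false positive rate (FPR) defined, respectively, as
\[\text{TPR} = \frac{\#\{(j,k): j<k,\ {\hat{\varOmega }}_{jk}\ne 0,\ \varOmega _{0,jk}\ne 0\}}{\#\{(j,k): j<k,\ \varOmega _{0,jk}\ne 0\}}, \qquad \text{FPR} = \frac{\#\{(j,k): j<k,\ {\hat{\varOmega }}_{jk}\ne 0,\ \varOmega _{0,jk}= 0\}}{\#\{(j,k): j<k,\ \varOmega _{0,jk}= 0\}},\]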
where \({\hat{\varOmega }}\) denotes the estimated network. The F1 score [42], which is between 0 and 1, measures the accuracy of an estimator by summarizing both false positives and false negatives. Larger F1 scores indicate better structural recovery. For \({\hat{\varOmega }}_{\lambda ^*}\) estimated at the optimal penalty parameter \({\lambda ^*}\) selected by maximizing the F1 score, we also compared the entropy loss (EL) and Frobenius norm loss (FL) for estimation accuracy:
Our first comparison is based on only microbiome data where \(p=q=60\) and \(n=100\). In this example, the percentage of zeros per species ranges from 0% to 70%. Input for gCoda is the censored abundance matrix \({\mathcal{D}} = ({\varvec{r}}^{(1)}+0.5, \ldots , {\varvec{r}}^{(n)}+0.5)^\intercal \). The clr transformation is then applied to each row in \({\mathcal{D}}\) and the resulting matrix is used as input for SPIEC-EASI. Figure 4 shows the ROC curves obtained from different methods across different network models. One can see that SPIEC-EASI and gCoda perform similarly, and both underperform compared to metaMint. Because the nearest-neighbor network is denser, the ROC curves in the right panel of Fig. 4 are generally lower compared to their counterparts in other network models.
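The structural-recovery metrics behind these comparisons (TPR, FPR and the F1 score) depend only on the supports of the true and estimated inverse covariance matrices. A minimal sketch, assuming matrices are given as nested lists and that entries with absolute value below a hypothetical tolerance `tol` count as zero:

```python
def support(omega, tol=1e-8):
    """Set of off-diagonal index pairs (j, k), j < k, with a nonzero entry."""
    p = len(omega)
    return {(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(omega[j][k]) > tol}


def recovery_metrics(omega_true, omega_hat):
    """Return (TPR, FPR, F1) for edge recovery."""
    p = len(omega_true)
    n_pairs = p * (p - 1) // 2
    true_edges, est_edges = support(omega_true), support(omega_hat)
    tp = len(true_edges & est_edges)
    fp = len(est_edges - true_edges)
    fn = len(true_edges - est_edges)
    tn = n_pairs - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return tpr, fpr, f1


omega_true = [[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]]           # true edges: (0, 1) and (1, 2)
omega_hat = [[1.0, 0.4, 0.3],
             [0.4, 1.0, 0.0],
             [0.3, 0.0, 1.0]]            # estimated edges: (0, 1) and (0, 2)
tpr, fpr, f1 = recovery_metrics(omega_true, omega_hat)   # (0.5, 1.0, 0.5)
```

Evaluating `recovery_metrics` along a grid of penalty parameters traces out the ROC curve and the F1 profile used for model selection in the simulations.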
In our second study, we look at larger data sets where the number of microbes is \(q=100\) and the number of metabolites is \(p-q=100\). The sample size is \(n=300\). The method gCoda is thus not applicable because it was proposed specifically for microbiome data. Because we only censor microbiome data, the proportion of censored variables in this example is smaller. We first compare different methods in terms of network structural recovery. Figure 5 shows the average F1 score of each method across a range of penalty parameters. It can be seen that metaMint has overall higher F1 scores than SPIEC-EASI, and closely resembles the oracle estimator.
Since we know the true network structure, we also look at comparisons in terms of inverse covariance estimation accuracy at the optimal penalty parameter selected by maximizing the F1 score. As shown in Fig. 6, SPIEC-EASI performs the worst in all cases because its entropy and Frobenius norm losses are the largest. It is worth pointing out that there still exists a substantial gap in both EL and FL between metaMint and the oracle estimator as a result of censoring. We anticipate that this issue can be partly addressed with increased sequencing depths.
4 Analysis of Bacterial Vaginosis Data
4.1 Data Description and Processing
Bacterial vaginosis (BV) is a common vaginal condition characterized by depletion of specific Lactobacillus species and increased abundance of diverse anaerobic bacteria such as the genera Gardnerella, Prevotella and others [43, 44]. This condition affects an estimated 30% of women at any given time [45], and is associated with increased transmission of HIV and increased risk of preterm labor [46, 47]. Improved diagnosis and treatment of BV require not only a clearer understanding of the roles of BV-associated bacterial species and their interactions, but also a detailed catalog of the interactions between these bacteria and relevant metabolites. We applied the proposed multi-omic approach to a cohort of 131 Rwandan women from McMillan et al. [9]. The microbiome data from sequencing the 16S rRNA gene consist of 51 bacterial species after initial filtering, and the vaginal metabolome determined by GC-MS contains 128 metabolites (see the Methods section in [9]). One bacterial species is present in only 13 individuals, so we removed this rare species and used 50 taxa in all analyses. Of the 131 women, 79 were normal, 23 were diagnosed with BV, 22 were classified as intermediate between BV and the normal state, and 7 did not have a diagnosis. To account for the different sequencing depths, we applied the clr and modified clr to the microbiome data. Metabolomic data available from McMillan et al. [9] have already been log transformed. After the mclr transformation, a species is treated as censored at zero if it has at least one zero count. Based on this criterion, 27 of the 50 species are left censored.
We compare metaMint with SPIEC-EASI by applying the former to mclr-transformed data and the latter to clr-transformed data. At the optimal tuning parameter, which was selected using the stability approach in Liu et al. [35] with prespecified stability threshold \(\alpha \), we randomly subsampled 80% of all samples to estimate the network using each method. This procedure was repeated 50 times and an edge selection frequency matrix was constructed such that each entry represents the proportion of times the corresponding edge was present. Only edges with at least 95% selection frequency were kept.
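The subsampling step can be sketched independently of the network estimator. In the code below, `estimate_edges` is a hypothetical callback that takes a list of samples and returns a set of edges; it stands in for a fit of metaMint or SPIEC-EASI at the selected tuning parameter, and the defaults (50 repetitions, 80% subsamples, 95% cutoff) mirror the description above.

```python
import random


def edge_selection_frequency(samples, estimate_edges, n_rep=50, frac=0.8, seed=0):
    """Proportion of subsamples in which each edge is selected."""
    rng = random.Random(seed)
    m = int(round(frac * len(samples)))
    counts = {}
    for _ in range(n_rep):
        subset = rng.sample(samples, m)          # subsample without replacement
        for edge in estimate_edges(subset):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_rep for edge, c in counts.items()}


def stable_edges(freq, min_freq=0.95):
    """Keep edges selected in at least min_freq of the subsamples."""
    return {edge for edge, f in freq.items() if f >= min_freq}


def toy_estimator(subset):
    """Placeholder estimator: always reports edge (0, 1); (0, 2) only on the full data."""
    edges = {(0, 1)}
    if len(subset) == 10:
        edges.add((0, 2))
    return edges


samples = list(range(10))                         # stand-in for 10 observations
freq = edge_selection_frequency(samples, toy_estimator)
kept = stable_edges(freq)
```

Storing frequencies in a dictionary keyed by edge, rather than a dense matrix, is a convenience for the sketch; a p-by-p frequency matrix carries the same information.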
4.2 Results
We first compare metaMint and SPIEC-EASI by estimating a single integrated microbe and metabolite network for all subjects at stability threshold \(\alpha =0.01\). Figure 7 presents the joint microbe-metabolite network estimated by the two methods, where the thick black edges are shared between the two methods, blue edges are unique to metaMint, and red edges are unique to SPIEC-EASI. We can see that a majority of edges are shared between the two methods. In particular, both methods reported the conditional association between the genus Gardnerella and metabolite GHB (6–82), and between Lactobacillus and unknown sugar 1 (3–165). These two edges are relatively stable and show up in the network for any stability threshold \(\alpha \ge 0.004\). Importantly, the interaction between Gardnerella and GHB was also observed and reproduced experimentally in McMillan et al. [9]. Other notable microbe-metabolite interactions that are unique to each method include Prevotella–unknown sugar 2 (7–166), estimated only by metaMint, and Dialister–N-acetylputrescine (10–106) and Dialister–phenylethylamine (10–111), estimated only by SPIEC-EASI. These microbe-metabolite interactions are unique to each method until the stability threshold increases to \(\alpha =0.02\). The differences reported by the two methods are manifestations of the different transformations and whether the model directly accounts for zero inflation.
To gain further insights into the roles of these microbe-metabolite interactions, we partitioned all subjects into two groups: the normal group (\(n_1=79\)) and everyone else (the BV group, \(n_2=52\)). metaMint and SPIEC-EASI were applied to estimate a network for each group using the same model selection procedure as before. In general, we observe more interactions in the group-specific network estimated by SPIEC-EASI compared to the corresponding network estimated by metaMint. At stability threshold 0.01, no interaction between microbes and metabolites was recovered due to the reduced sample size in each group. As we gradually increase the stability threshold, the first microbe-metabolite interaction unique to the BV group is between Gardnerella and GHB, and was identified by both metaMint and SPIEC-EASI. Table 1 provides a list of microbe-metabolite interactions that are unique to each group of patients identified by both methods at stability threshold 0.02. It is worth noting that Gardnerella–GHB, Prevotella–unknown sugar 2, and Dialister–cadaverine only show up for the BV group, whereas the interactions between Lactobacillus species and several metabolites appear only for the normal group. Abundance of Lactobacillus and Prevotella has long been used as a diagnostic signature for bacterial vaginosis [43, 44]. In addition, McMillan et al. [9] hypothesized that Dialister is responsible for malodor in the vagina. Our analysis may shed light on the mechanistic link between metabolic end products and microbes in vaginal bacterial communities, and provide key guidance regarding the diagnosis and treatment of BV.
5 Discussion
The uneven sequencing depths and sparsity in microbiome data present significant challenges in inferring interactions between microbial species and their products. The different sequencing depths imply different levels of uncertainty, but how to handle varying sequencing depths in multivariate statistical analysis remains an unsolved problem [48, 49]. This paper proposes the censored Gaussian graphical model for joint estimation of microbiome and metabolomic network, which can be used to identify conditional dependencies (direct interactions) between microbial species and metabolites. Key to our proposal is the use of the modified centered logratio for transforming the observed microbial counts, which is scale invariant and preserves the ranking of positive counts relative to zeros. Observed zeros are attributed to undersampling and modeled as due to left censoring. Our method metaMint can be generalized to study other omics data types that fit in the censored Gaussian graphical model framework. Analysis of the bacterial vaginosis data demonstrates that metaMint facilitates the discovery of important microbemetabolite interactions for diagnosis and treatment of this condition. The data example in Sect. 4 has about 50% censored variables, although 11 of them have less than 10% zero counts. As we move into highresolution studies which collect microbiome data at the strain or amplicon sequence variant level, our model that explicitly accounts for observed zeros may exhibit more advantage over existing methods.
From a methodological perspective, metaMint estimates the correlations in a marginal manner, which may not be optimal because marginal approaches ignore the fact that the correlation matrix is positive semidefinite. Augugliaro et al. [26] proposed an approximated EM algorithm that jointly estimates all entries in the correlation matrix; however, their method only works well under specific settings and there is a lack of theoretical understanding about the resulting estimator. Obvious but nontrivial extension is to explore computationally and statistically efficient alternatives that jointly estimate all entries in the correlation matrix.
Our model is related to but substantially different from the zero-inflated Gaussian graphical model in McDavid et al. [27]. While our model assumes the observed zeros are due to undersampling, McDavid et al. [27] use a two-part Hurdle model that treats all zeros as structural. The multivariate Hurdle model consists of an Ising model that captures the discrete part and a Gaussian graphical model that describes the continuous part if the hurdle is passed. When the study design favors the two-part process, as is the case in single-cell RNA-seq analysis, the multivariate Hurdle model should be considered. On the other hand, the censored Gaussian graphical model is simpler and works well if the study design favors sampling zeros and/or structural zeros can be reasonably approximated as sampling zeros [21].
It is worth pointing out that the observed data defined in (1) are continuous-valued. In this paper, we have made the simplifying assumption that the observed counts can be approximated by a log-normal distribution with left censoring. An alternative approach is to analyze observed counts directly while still treating zeros as due to left censoring. In the regression setting, Clark et al. [50] provided a general framework that uses a latent continuous variable to model observed species abundance, which can be presence/absence, continuous abundance, ordinal counts, or counts that are subject to a total sum constraint. It would be interesting to see if similar ideas can be used to model interactions between microbial species and other molecules.
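As a hedged illustration of what left censoring means on the log scale, the Python sketch below fits the mean and standard deviation of a single Gaussian variable that is observed only above a threshold. It uses the standard censored (Tobit-type) likelihood, in which an uncensored value contributes the normal density and a censored value contributes the normal probability of falling at or below the threshold. The simulated data, the threshold `c`, the coarse grid search, and all function names are assumptions made for illustration; this is a univariate toy example, not the estimation procedure used in metaMint.

```python
import math
import numpy as np

def censored_nll(mu, sigma, x, c):
    """Negative log-likelihood of N(mu, sigma^2) data left-censored at c."""
    obs = x[x > c]                      # uncensored observations
    n_cens = int(np.sum(x <= c))        # number of censored observations
    ll_obs = (-0.5 * np.sum(((obs - mu) / sigma) ** 2)
              - obs.size * math.log(sigma)
              - 0.5 * obs.size * math.log(2 * math.pi))
    # P(latent value <= c) via the normal CDF written with erf.
    p_cens = 0.5 * (1 + math.erf((c - mu) / (sigma * math.sqrt(2))))
    ll_cens = n_cens * math.log(max(p_cens, 1e-300))
    return -(ll_obs + ll_cens)

rng = np.random.default_rng(0)
latent = rng.normal(0.0, 1.0, size=2000)   # true mean 0, true sd 1
c = 0.0                                    # roughly half the values are censored
x = np.maximum(latent, c)                  # values below c are recorded as c

# Coarse grid search keeps the sketch dependency-free.
grid_mu = np.linspace(-1.0, 1.0, 81)
grid_sigma = np.linspace(0.5, 1.5, 41)
_, mu_hat, sigma_hat = min(
    (censored_nll(m, s, x, c), m, s) for m in grid_mu for s in grid_sigma
)

naive_mean = x.mean()   # ignores censoring; its expectation here is about 0.40
print(mu_hat, sigma_hat, naive_mean)
```

The comparison with `naive_mean` shows why treating censored zeros as ordinary observations biases the estimated location upward.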
References
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, Creasy HH, Earl AM, FitzGerald MG, Fulton RS et al (2012) Structure, function and diversity of the healthy human microbiome. Nature 486(7402):207–214
Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall AB, Brady A, Creasy HH, McCracken C, Giglio MG et al (2017) Strains, functions and dynamics in the expanded human microbiome project. Nature 550(7674):61–66
Callahan BJ, McMurdie PJ, Holmes SP (2017) Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 11(12):2639–2643
Gilbert JA, Blaser MJ, Caporaso JG, Jansson JK, Lynch SV, Knight R (2018) Current understanding of the human microbiome. Nat Med 24(4):392
iHMP Research Network Consortium (2019) The integrative human microbiome project. Nature 569:641–648
McHardy IH, Goudarzi M, Tong M, Ruegger PM, Schwager E, Weger JR, Graeber TG, Sonnenburg JL, Horvath S, Huttenhower C et al (2013) Integrative analysis of the microbiome and metabolome of the human intestinal mucosal surface reveals exquisite interrelationships. Microbiome 1(1):17
Wu GD, Compher C, Chen EZ, Smith SA, Shah RD, Bittinger K, Chehoud C, Albenberg LG, Nessel L, Gilroy E et al (2016) Comparative metabolomics in vegans and omnivores reveal constraints on diet-dependent gut microbiota metabolite production. Gut 65(1):63–72
Jia W, Xie G, Jia W (2018) Bile acid-microbiota crosstalk in gastrointestinal inflammation and carcinogenesis. Nat Rev Gastroenterol Hepatol 15(2):111–128
McMillan A, Rulisa S, Sumarah M, Macklaim JM, Renaud J, Bisanz JE, Gloor GB, Reid G (2015) A multi-platform metabolomics approach identifies highly specific biomarkers of bacterial diversity in the vagina of pregnant and non-pregnant women. Sci Rep 5:14174
Org E, Blum Y, Kasela S, Mehrabian M, Kuusisto J, Kangas AJ, Soininen P, Wang Z, Ala-Korpela M, Hazen SL et al (2017) Relationships between gut microbiota, plasma metabolites, and metabolic syndrome traits in the METSIM cohort. Genome Biol 18(1):70
Liu R, Hong J, Xu X, Feng Q, Zhang D, Gu Y, Shi J, Zhao S, Liu W, Wang X et al (2017) Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat Med 23(7):859–868
Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, Andrews E, Ajami NJ, Bonham KS, Brislawn CJ et al (2019) Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569(7758):655–662
Gould AL, Zhang V, Lamberti L, Jones EW, Obadia B, Gavryushkin A, Korasidis N, Carlson JM, Beerenwinkel N, Ludington WB (2018) High-dimensional microbiome interactions shape host fitness. Proc Natl Acad Sci 115(51):E11951–E11960
Friedman J, Alm EJ (2012) Inferring correlation networks from genomic survey data. PLoS Comput Biol 8(9):e1002687
Fang H, Huang C, Zhao H, Deng M (2015) CCLasso: correlation inference for compositional data through lasso. Bioinformatics 31(19):3172–3180
de la Fuente A, Bing N, Hoeschele I, Mendes P (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20(18):3565–3574
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11(5):e1004226
Friedman JH, Hastie TJ, Tibshirani RJ (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3):432–441
Fang H, Huang C, Zhao H, Deng M (2017) gCoda: conditional dependence network inference for compositional data. J Comput Biol 24(7):699–708
Kaul A, Mandal S, Davidov O, Peddada SD (2017) Analysis of microbiome data in the presence of excess zeros. Front Microbiol 8:2114
Silverman JD, Roche K, Mukherjee S, David LA (2018) Naught all zeros in sequence count data are the same. bioRxiv, p 477794
Yoon G, Gaynanova I, Müller CL (2019) Microbial networks in SPRING - semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data. Front Genet 10:516
Hoffman HJ, Johnson RE (2015) Pseudolikelihood estimation of multivariate normal parameters in the presence of left-censored data. J Agric Biol Environ Stat 20(1):156–171
Pesonen M, Pesonen H, Nevalainen J (2015) Covariance matrix estimation for left-censored data. Comput Stat Data Anal 92:13–25
Jones MP, Perry SS, Thorne PS (2015) Maximum pairwise pseudolikelihood estimation of the covariance matrix from left-censored data. J Agric Biol Environ Stat 20(1):83–99
Augugliaro L, Abbruzzo A, Vinciotti V (2018) \(\ell_1\)-penalized censored Gaussian graphical model. Biostatistics 21:1–16
McDavid A, Gottardo R, Simon N, Drton M et al (2019) Graphical models for zero-inflated single cell gene expression. Ann Appl Stat 13(2):848–873
Guo J, Levina E, Michailidis G, Zhu J (2015) Graphical models for ordinal data. J Comput Gr Stat 24(1):183–204
Suggala AS, Yang E, Ravikumar P (2017) Ordinal graphical models: a tale of two approaches. In: International conference on machine learning, pp 3260–3269
Tobin J (1958) Estimation of relationships for limited dependent variables. Econom: J Econom Soc 26(1):24–36
Henningsen A (2010) Estimating censored regression models in R using the censreg package. R package vignettes
Fan J, Liu H, Ning Y, Zou H (2017) High dimensional semiparametric latent graphical model for mixed data. J R Stat Soc: Ser B (Stat Methodol) 79(2):405–421
Cai TT, Liu W, Luo X (2011) A constrained \(\ell_1\) minimization approach to sparse precision matrix estimation. J Am Stat Assoc 106(494):594–607
Cai TT, Liu W, Zhou HH (2016) Estimating sparse precision matrix: optimal rates of convergence and adaptive estimation. Ann Stat 44(2):455–488
Liu H, Roeder K, Wasserman L (2010) Stability approach to regularization selection (stars) for high dimensional graphical models. In: Advances in neural information processing systems, pp 1432–1440
van den Boogaart KG, Tolosana-Delgado R (2013) Analyzing compositional data with R, vol 122. Springer, Berlin
Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ (2017) Microbiome datasets are compositional: and this is not optional. Front Microbiol 8:2224
Zhou W, Sailani MR, Contrepois K, Zhou Y, Ahadi S, Leopold SR, Zhang MJ, Rao V, Avina M, Mishra T et al (2019) Longitudinal multi-omics of host-microbe dynamics in prediabetes. Nature 569(7758):663–671
Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, Wang J, Tito RY, De Commer L, Darzi Y et al (2017) Quantitative microbiome profiling links gut community variation to microbial load. Nature 551(7681):507–511
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Erdős P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung Acad Sci 5:17–61
van Rijsbergen CJ (1979) Information retrieval, 2nd edn. Butterworth-Heinemann, Newton
Fredricks DN, Fiedler TL, Marrazzo JM (2005) Molecular identification of bacteria associated with bacterial vaginosis. N Engl J Med 353(18):1899–1911
Ravel J, Gajer P, Abdo Z, Schneider GM, Koenig SS, McCulle SL, Karlebach S, Gorle R, Russell J, Tacket CO et al (2011) Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci 108(Supplement 1):4680–4687
Koumans EH, Sternberg M, Bruce C, McQuillan G, Kendrick J, Sutton M, Markowitz LE (2007) The prevalence of bacterial vaginosis in the United States, 2001–2004; associations with symptoms, sexual behaviors, and reproductive health. Sex Transm Dis 34(11):864–869
Guerra B, Ghi T, Quarta S, Morselli-Labate AM, Lazzarotto T, Pilu G, Rizzo N (2006) Pregnancy outcome after early detection of bacterial vaginosis. Eur J Obstet Gynecol Reprod Biol 128(1–2):40–45
Atashili J, Poole C, Ndumbe PM, Adimora AA, Smith JS (2008) Bacterial vaginosis and HIV acquisition: a meta-analysis of published studies. AIDS 22(12):1493
Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, Vázquez-Baeza Y, Birmingham A et al (2017) Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5(1):27
McKnight DT, Huerlimann R, Bower DS, Schwarzkopf L, Alford RA, Zenger KR (2019) Methods for normalizing microbiome data: an ecological perspective. Methods Ecol Evol 10(3):389–400
Clark JS, Nemergut D, Seyednasrollah B, Turner PJ, Zhang S (2017) Generalized joint attribute modeling for biodiversity analysis: median-zero, multivariate, multifarious data. Ecol Monogr 87(1):34–56
Acknowledgements
J. Ma is partially supported by NIH 1R01GM12951201. The author would like to thank three anonymous referees for their constructive comments and suggestions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ma, J. Joint Microbial and Metabolomic Network Estimation with the Censored Gaussian Graphical Model. Stat Biosci 13, 351–372 (2021). https://doi.org/10.1007/s1256102009294z