Background

The human microbiome is defined as the collection of microorganisms, including bacteria, viruses, and some unicellular eukaryotes, that live in and on our bodies [1]. Research on the microbiome has grown exponentially in the past few years and it has been argued that the microbiota can be regarded as a “second genome”[2, 3]. Indeed, just the human gut microbiome is estimated to be composed of approximately 1014 bacterial cells, i.e. ten times more than the total number of human cells in the body [4]. The contribution of the human microbiome on several health outcomes has been frequently reported in the literature. For example, microbial dysbiosis in the gut has been linked to irritable bowel syndrome and Crohn’s disease [5], type 2 diabetes [6], cardiovascular disease [7], and psychological conditions via the so-called “gut-brain axis” [8]. The composition of microbiota at other body sites have also been associated with conditions such as eczema [9] and pre-term labor [10]. This stream of research holds great potential for a better understanding of many mechanistic processes in the development of human diseases, especially with respect to immune regulation and barrier defense [11, 12].

Microbiome data is most commonly obtained by sequencing variable regions of the 16S rRNA gene, then grouping the transcripts into Operational Taxonomic Units (OTUs), based on their similarity to one another. The OTUs are then defined as a cluster of reads based on a similarity threshold (typically, 97%) set by the researcher. The membership count of each cluster is then used as a proxy for taxa abundances in the sample [13]. See [14] for a discussion of how the selection of the cutoff might impact the resulting OTUs, in particular for rare species. Many studies summarize the taxa abundances by constructing several indicators of community composition (e.g. alpha and beta diversity indexes, see, [15]). Alternatively, the full OTUs abundance table can be used to obtain more detailed information about existing associations between environment or phenotypes and microbes. Well-established statistical models for the analysis of count data (e.g., Poisson or Negative Binomial distribution) can be efficaciously employed for the analysis of taxonomic count data [16]. Although less common, other distributions (e.g., a two-parameter Weibull distribution) have also been shown to provide a good fit to the data for some communities (see, e.g. [17]). One distinctive characteristic of the microbiome data is their overdispersion: while some taxa (e.g., Bacteroides and Lactobacillus species) are common among samples, many other taxa are present at much lower abundances, and often never recorded in a sample, leading to zero-inflated distributions. Many of the existing tools for microbial community analysis (e.g., the QIIME platform, [18]) bypass those characteristics and rely on nonparametric tests to compare species across different conditions [19, 20]. Other approaches use ordination, e.g. multidimensional scaling, to summarize abundances, and are sometimes employed to link the microbiome data with available clinical covariates and phylogenetic information [21, 22]. In those approaches, the choice of the distance metric is often crucial. The interpretation of biological phenomena can also be challenging in low dimensional projections. Most importantly, distance-based methods do not explicitly quantify the relative importance of significant associations between taxa and covariates, and therefore are of limited use for clinical decisions.

In this manuscript, we consider an integrative Bayesian approach based on the use of Dirichlet-Multinomial (DM) distributions [23] for studying the association between taxa abundance data and available measurements on clinical, genetic and environmental covariates. Recently, La Rosa et al. [24] proposed the use of a DM model for hypothesis testing and power calculations in microbiome experiments. Holmes et. al [25] used a finite mixture of DM distributions to directly model the taxa counts. Neither method incorporate predictors to study the influence of external factors on the microbiome’s abundance. A penalized likelihood approach based on a DM regression model has been proposed instead by [26] to determine significant associations between the microbiome composition and a set of covariates which describe the individual dietary nutrients’ intakes. Similarly, [27] develop a structure constrained version of sparse canonical correlation analysis that integrates compositionalized microbiome data, phylogenetic information, and nutrient information. Furthermore, [28] propose penalized regression models to associate the multivariate compositionalized microbiome data with some univariate phenotype of interest, e.g. body mass index, as a response. However, the use of a constrained optimization approach does not allow to fully characterize the uncertainty in the selection of the significant associations, which is of particular importance, especially when dealing with high-dimensional and highly-correlated data.

Here, we propose a probabilistic modeling approach which both flexibly takes into account the typical features of microbiome count data and also allows for straightforward incorporation of available covariate information within a DM log-linear regression framework. With respect to modeling approaches as in [28], our framework allows the study of associations between multivariate microbiome data and multivariable predictors. By imposing sparsity inducing spike-and-slab priors on the regression coefficients, our model obtains a parsimonious summary of the effects of the associations and also allows an assessment of the uncertainty of the selection process. We evaluate the performance of our model first on simulated data, where we provide comparisons with methods developed for microbiome or similar type of data. We also illustrate our method on data obtained from the Human Microbiome Project [29], to investigate the association between taxonomic abundances and metabolic pathways inferred from whole genome shotgun sequencing reads. It is known that the combination of environmental and host genetic factors shape the composition of the gut microbiota, and these interactions appear to have a significant effect on several biological mechanisms, which may be related, for example, to the individual immunity and barrier defense, as well as metabolism and diet [30, 31]. The approach has been implemented in a user-friendly R code, which has been made publicly available (see the Licensing Section).

Methods

We describe our Bayesian variable selection approach for the analysis of microbiome data and their association with a set of available covariates in the context of DM log-linear regression models.

Dirichlet-multinomial regression with variable selection

Let y i =(y i1,…,y iJ ) indicate the vector of counts representing the taxonomic abundance table obtained from the ith patient, with y ij denoting the frequency of the jth microbial taxon, for j=1,…,J and i=1,…,n. Furthermore, let X=(x 1,…,x P ) indicate a n×P matrix of measurements on P covariates. We start by modeling the taxonomic count data with a Multinomial distribution

$$\begin{array}{@{}rcl@{}} \boldsymbol{y}_{i} \mid \boldsymbol{\phi}_{i} \sim \text{Multinomial}\left(y_{i+}, \boldsymbol{\phi}_{i}\right), \end{array} $$
(1)

with \(y_{i+}=\sum _{j=1}^{J} y_{ij}\) the summation of all counts in the vector, and where the parameters ϕ’s are defined on the J dimensional simplex

$$\begin{array}{@{}rcl@{}} \mathcal{S}^{J-1} = \left\{ \left(\phi_{1}, \dots, \phi_{J}\right) : \phi_{j} \ge 0, \; \forall j, \; \textstyle \sum_{j = 1}^{J} \phi_{j} = 1 \right\}, \end{array} $$

We further impose a conjugate Dirichlet prior on ϕ, that is ϕ∼Dirichlet(γ), where γ=(γ 1,…,γ J ) indicates a J-dimensional vector of strictly positive parameters. An advantage of our hierarchical formulation is that conjugacy can be exploited to integrate ϕ out, obtaining the Dirichlet–Multinomial model, y i ∼DM(γ), with probability mass function

$$\begin{array}{*{20}l} f\left(\mathbf{y} \mid \boldsymbol{\gamma}\right) &= \frac{\Gamma\left(y_{+} + 1\right)\Gamma\left(\gamma_{+}\right)}{\Gamma\left(y_{+} + \gamma_{+}\right)}\\ & \quad\times \prod_{j = 1}^{J}\frac{\Gamma\left(y_{j} + \gamma_{+}\right)}{\Gamma(\gamma_{j})\Gamma\left(y_{j} + 1\right)}. \end{array} $$

where \(\gamma _{+} = \sum _{j}^{J} \gamma _{j}\). First described in [23] as the compound multinomial, the DM(γ) allows more flexibility than the Multinomial when encountering overdispersion in multivariate count data, as it induces an increase in the variance by a factor of (y ++γ +)/(1+γ +).

Next, we incorporate the covariates into the modeling via a log-linear regression framework where the DM parameters depend on the available covariates X’s. More specifically, we define ζ j = log(γ j ) and assume

$$\begin{array}{*{20}l} \zeta_{j} = \alpha_{j} + \sum_{p = 1}^{P} \beta_{pj} \, \mathbf{x}_{p}, \end{array} $$
(2)

i=1,…,n; j=1,…,J. In this formulation, the intercept term α j corresponds to the log baseline parameter for the taxon j, whereas the regression parameter β pj captures the effect of the pth covariate on the abundance for that taxon.

Identifying the significant associations between taxa and covariates in model (1)–(2) is equivalent to determining the non-zero β pj parameters. One way to address this issue is through variable selection and the use of spike-and-slab mixture priors [32, 33]. First, we introduce latent binary indicator vectors ξ j =(ξ 1j ,ξ 2j ,…,ξ pj ), such that ξ pj =1 if the pth covariate influences the abundance of the jth taxa and ξ pj =0 otherwise. Then, we write the prior on the β pj ’s as

$$\begin{array}{*{20}l} \beta_{pj} \sim \xi_{pj} \, \mathcal{N}\left(0, r^{2}_{j}\right) + \left(1 - \xi_{pj}\right)\, \delta_{0}\left(\beta_{pj}\right), \end{array} $$
(3)

where δ 0 denotes a Dirac-delta at 0 and \(r^{2}_{j}\) is some suitably large value [34, 35]. It is common to choose relatively large values for \(r^{2}_{j}\). Such a choice suggest a flat prior distribution on the location of the coefficients {β pj ξ pj =1} and therefore encourages the selection of relatively large effects. In the Results Section, we discuss the results of a sensitivity analysis to assist with the choice of this parameter. We place Bernoulli priors on the selection indicators ξ pj , that is

$$\begin{array}{*{20}l} \pi\left(\boldsymbol{\xi}_{j} \mid \mathbf{p}_{j}\right) = \prod_{p = 1}^{P} p_{pj}^{\xi_{pj}}\left(1 - p_{pj}\right)^{1 - \xi_{pj}}. \end{array} $$
(4)

We also specify Beta hyperpriors on the hyperparameters p pj , i.e., p pj ∼Beta(ab), as this has been shown to provide an automatic adjustment for multiplicity [36]. This is equivalent to placing a Beta mixed Binomial distribution on ξ pj ,

$$\pi\left(\xi_{pj}\right) = \int \pi\left(\xi_{pj} \mid \lambda\right) \pi(\lambda) d\lambda, $$

with λ=(a,b). As a practical suggestion, the hyper-parameters a and b should be chosen so to induce a relatively weakly specification of the prior as a “flat” Beta distribution. This can be obtained by setting a and b so that a+b=2, and the prior expected mean value m=a/(a+b). For most cases, a value of m=0.01, which corresponds to assuming a priori that 1% of the P covariates will be selected, provides an adequate balance between false positives and false negative counts, as we further illustrate in a sensitivity analysis in the Results Section. Finally, we assume normal priors on the α j ’s, i.e. \(\alpha _{j}\sim \mathcal {N}(0, s^{2}_{j})\). Large values for \(s^{2}_{j}\) encode a diffuse prior, to describe non-informative or objective prior beliefs. However, results are typically quite robust to prior choices on the intercept parameters, and \(s^{2}_{j}=10\) is usually assumed as a default specification in Bayesian regression when dealing with standardized variables. Figure 1 provides an overview of the proposed integrative modeling approach, with reference to the application to the Human Microbiome Project data we describe later.

Fig. 1
figure 1

Schematic overview of the proposed integrative Bayesian approach for the application to data from the Human Microbiome Project. The observed data counts (right) are regressed on the available covariates (left), through a variable selection approach, which informs the (unknown) population abundance of each taxon

MCMC algorithm

We implement a stochastic search Markov Chain Monte Carlo (MCMC) algorithm for posterior inference that employs a Gibbs scan to sample the non-zero regression coefficients [37]. We encourage an efficient sampling by employing a component-wise adaptive Metropolis algorithm [38] as described below. A generic iteration of the MCMC algorithm comprises the following steps:

  1. 1.

    Update of α : This is a Metropolis-Hastings step with a symmetric random walk proposal \(\alpha _{j}^{'} \sim \mathcal {N}(\alpha _{j}, t_{\alpha }^{2})\), for j=1,…,J.

  2. 2.

    Joint update of ( ξ , β ): We sample these parameters jointly via a Gibbs scan that employs a Metropolis acceptance step. For each j=1,…,J and p=1,…,P:

    • if ξ pj =1: propose \(\xi _{pj}^{'} = 0\) and \(\beta _{pj}^{'} = 0\).

    • if ξ pj =0: propose \(\xi _{pj}^{'} = 1\) and then propose \(\beta _{pj}^{'}\) following an adaptive Metropolis-Hasting scheme

      $$\begin{array}{*{20}l} \beta_{pj}^{'} \sim&\, 0.95 \, \mathcal{N}\left(\beta_{pj}, 2.38^{2} \times \hat{\sigma}^{2}_{\beta_{pj}} / J\times P\right)\\ &+ 0.05 \, \mathcal{N}\left(\beta_{pj}, 0.01 / J\times P\right), \end{array} $$

      where \(\hat {\sigma }^{2}_{\beta _{pj}}\) is the current estimate of the variance of the target distribution. The value of \(\hat {\sigma }^{2}_{\beta _{pj}}\) is updated using a recursive formula as in [39] on all the previous draws for β pj .

    • Accept (\(\xi _{pj}^{'}, \beta _{pj}^{'}\)) with probability

      $$a = \min \left\{ 1, \frac{\pi\left(\xi_{pj}^{'}, \beta_{pj}^{'} \mid \boldsymbol{\xi}_{pj}^{'}, \boldsymbol{\beta}_{pj}^{'}, \text{else}\right) }{ \pi\left(\xi_{pj}, \beta_{pj} \mid \boldsymbol{\xi}_{pj}^{'}, \boldsymbol{\beta}_{pj}^{'}, \text{else}\right)} \right\}, $$

      where \(\boldsymbol {\xi }_{pj}^{'} = (\xi _{1,j}^{'}, \dots, \xi _{p-1,j}^{'}, \xi _{p + 1,j}, \dots, \xi _{pj})\), and \(\boldsymbol {\beta }_{pj}^{'} = (\beta _{1,j}^{'}, \dots, \beta _{p-1,j}^{'}, \beta _{p + 1,j}, \dots, \beta _{pj})\).

For posterior inference, we are interested in identifying the relevant associations between taxa and covariates as captured by the selection indicators ξ pj ’s and the corresponding regression coefficients β pj ’s. Estimates of the marginal posterior probabilities of inclusion (PPIs) of the latent indicators ξ pj can be calculated by counting the number of times that each taxa/covariate association is included across the MCMC iterations. A selection of the significant associations can then be made by choosing those elements that have marginal PPIs greater than a specific value, for example greater than 0.5 for the median probability model of [40]. Another choice for the threshold which controls for multiplicity [41] relies on an estimated pre-specified Bayesian false discovery rate α calculated as

$$\widehat{\text{FDR}}(c) = \frac{\sum_{p = 1}^{P} \sum_{j = 1}^{J} \left(1 - \widehat{\text{PPI}}_{pj}\right) D_{pj}}{\sum_{p = 1}^{P} \sum_{j = 1}^{J} D_{pj}}, $$

where . An optimal threshold c can be found for error rate α by choosing c such that \(\widehat {\text {FDR}}(c') < \alpha \). Estimates of the non-zero regression coefficients β pj can also be calculated by averaging over the sampled MCMC values.

In order to compare selection performance of different methods, we calculate accuracy, false positive rate (FPR), false negative rate (FNR) and Matthews correlation coefficient (MCC), across 30 replicated datasets. We define accuracy as ACC = (TP + TN)/(P + N), with TP the number of true positives out of P selected and TN the number of true negatives out of N not selected. The false negative rate is calculated as FNR = FN/(FN + TP), the false positive rate as FPR = FP/(FP + TN), and the Matthews correlation coefficient as

$$\begin{array}{*{20}l} \text{MCC} =\text{MCC} = \frac{TP / N - S \times P} {\sqrt{PS (1 - S) (1 - P)} }, \end{array} $$

with N=TN+TP+FN+FP, \(\mathrm {P} = \frac {TP + FP} { N }\) and \(\mathrm {S} = \frac {TP + FN} { N }\) [42]. Since the MCC balances TP and FP counts, and can be used even if the classes are of very different sizes, it is generally regarded as one of the most appropriate measures of classification accuracy. We further computed receiving operating curves (ROC) to compare the performance of the selection procedure across the different methods.

Comparison study on simulated data

We carry out a simulation study to assess the performance of our model and compare results to alternative methods. More specifically, we consider two methods which have been specifically employed for the integrative analysis of microbiome data: the penalized approach of Chen and Li [26], and the false discovery rate-corrected pair-wise correlation tests considered in [19]. In addition, we consider the factorized maximum a posteriori (MAP) Bayesian lasso of [43], a recently proposed general statistical method for conducting variable selection in multivariate count-response regression. When fitting the Bayesian Gamma Lasso method of [43], model selection was done using the minimum AIC, while for Chen and Li’s approach the minimum BIC was calculated with the group penalty set to 20%. We also fit the method of Chen and Li to the untransformed data. The false discovery rate threshold for the Spearman’s correlation tests was set to 0.05.

In simulating data, we set n=100, P=50 and J=50, and chose P r =9 and J r =5 to obtain a total number of relevant taxa/covariate associations equal to 25. We simulated the covariate matrix X according to a Multivariate-Normal (0,Σ) with Σ i,j =ρ |ij| and ρ=0.4. We drew each vector y i of counts from a Dirichlet-Multinomial distribution as follows. For i=1,…,n, \(\mathbf {y}_{i}\sim \text {Multinomial}(N_{i},\boldsymbol {\pi }_{i}^{*})\), with the row sum N i DiscreteUnif[1,0000;2,000], and \(\boldsymbol {\pi }^{*}_{i}=(\pi ^{*}_{i1}, \ldots, \pi ^{*}_{iJ})\sim \text {Dirichlet}(\boldsymbol {\gamma }^{*})\). For \(\boldsymbol {\gamma }^{*}=(\gamma ^{*}_{1},\dots,\gamma ^{*}_{J})\), we set \(\gamma _{j}^{*}=\frac {\gamma _{j}}{\gamma _{+}}\,\frac {1-\psi }{\psi }\), j=1,…,J, with γ j = exp{α j +X β j }, \(\gamma _{+} = \sum _{j = 1}^{J} \gamma _{j}\) and ψ∈ [ 0,1] an overdispersion parameter. When ψ→0, the simulated values approximate a Multinomial (π) distribution, while for large ψ, the sampled values are more disperse. Here, we set ψ=0.01. We sampled the non-zero β pj ’s from the intervals ±[ 0.5,1.0] and the intercept parameters from a Uniform(–2.3, 2.3). Below we report performance results as averages over 30 replicated simulated datasets.

When running the MCMC, we used a vague prior for the intercept by setting the variance parameter to \(s_{pj}^{2} = 10\). Similarly, we set \(r_{pj}^{2} = 10\), to provide sufficiently vague prior information on the non-zero log-linear regression coefficients. Finally, we set m=0.01 (or a=0.02 and b=1.98), resulting in a sparse prior mean on selected associations of 1% of the total. We provide comments on the sensitivity of the selection results to the choice of these hyperparameters in the Section below. We ran the MCMC algorithm for 10,000 iterations and thinned to every fifth iteration. On a single dataset, the C code took approximately 31.5 min to run on an Intel Xeon E5-2630 2.30 GHz processor. We assessed convergence visually and via the Geweke diagnostic [44] as implemented in the R package coda. Convergence was checked for a) the number of active variables in each iteration and b) the samples from each of the selected β pj . The five number summary of the 25 Geweke z-scores was (–3.43, –1.06, –0.63, 0.71, 1.98).

Inferring associations between taxonomic abundances and metabolic pathways

We demonstrate our approach on publicly available data obtained from the Human Microbiome Project (HMP) website [29] from which we use 79 samples from healthy individuals. The Y matrix in our model contains 16S rRNA microbial counts from stool samples at the genus taxonomic level. As common in microbiome studies, the genera abundances (Bacteroides, Prevotella, etc.) were filtered by requiring each genus to be present in at least 5% of the samples. This procedure removes extremely low-abundance genera leaving 80 genera for the analysis. From the same 79 individuals, we obtained KEGG orthology group abundances which are used as the matrix of covariates X of our model. The KEGG orthology groups were reconstructed from metagenomic shotgun sequencing (WGS) using the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline [45] and were also provided on the HMP website. These values represent inferred abundances of biochemical functional groups and metabolic pathways present due to the shotgun sequenced reads of bacterial and non-bacterial genes in the sample. To reduce correlation among the covariates we used average linkage clustering on the correlation matrix of the KEGG groups and chose one representative from each cluster, according to its relevance to microbiome research, leaving 76 columns in X. Finally, the columns in X were mean centered and scaled to unit variance. Though the HMP sampled 300 individuals for several timepoints and over many sites, there were relatively few samples that included the WGS used to obtain the KEGG orthology data. Thus, when joining the samples from the 16S rRNA data and the KEGG orthology data, a total of 79 matched samples remained.

We used the same hyperparameter settings as in the simulation study, that is \(s_{pj}^{2} = 10\) and \(r_{pj}^{2} = 10\) and set m=0.01, resulting in a sparse mean selection prior of 1% of the total 6,080 possible associations. The MCMC algorithm described in “MCMC algorithm” section above was run for 500,000 iterations and thinned to every 100th draw. We assessed convergence visually and via the Geweke diagnostic [44] as implemented in the R package coda. The five number summary of the Geweke z-scores for the 26 β pj ’s was (–3.83, –1.19, 0.15, 1.46, 3.38).

Results

Simulation study

In Fig. 2 we show the plot of the marginal PPIs of the P×J elements ξ pj , obtained by computing the proportion of times that ξ pj =1 across all iterations, after burn-in. The selected median model, corresponding to a threshold of 0.5 on the PPIs, results in a false positive rate of 0.0004 and a false negative rate of 0.04. The value of the AUC for this replicate was 0.99.

Fig. 2
figure 2

Simulated data: Marginal posterior probabilities of inclusion (PPI) for each coefficient β jp, j=1,…,50, p=1,…,50, describing the association between each taxa and each covariate. Each PPI is obtained by averaging the number of times that each taxa/covariate association is included across the MCMC iterations, after burn-in. The true associations are indicated as red dots

Figure 3 illustrates the selection performance of the proposed method, by plotting the average ROC curves over the 30 replicated datasets (ψ=0.01) for each of the methods included in the comparison. The Figure shows that our proposed model outperforms the competing methods in terms of achieved average true and false positive rates.

Fig. 3
figure 3

Simulated data: Comparison results of selection performances (ROC curves). DMBVS: Dirichlet–Multinomial Bayesian Variable Selection (our method), MAPBL : Maximum A Posteriori Bayesian Lasso, CORTEST: Multiplicity Corrected Correlation Tests as in Wu et al. (2010), C&L: composite penalty from Chen and Li (2013)

As an additional comparison, together with the total number of correctly identified regression parameters, which we term “overall recovery”, we also looked at the “taxa-wise recovery”, which we defined as the correct recovery of any element from one of the J taxa. Thus, recovery for overall selection occurs for P×J elements while taxa-wise selection occurs for J elements. Table 1 reports average values for accuracy, FPR, FNR and MCC, averaged across the 30 replicated datasets, for both overall and taxa-wise recovery. These results show that our method in particular outperforms competing methods for taxa-wise recovery. In the same Table we report results for a more challenging simulated scenario, obtained with a higher value of the overdispersion parameter (ψ=0.1). As expected, the increase in overdispersion makes the selection task more difficult for all methods. However, our method still outperforms or is commensurate with the competing methods, even in the presence of considerable overdispersion.

Table 1 Simulated data: performance assessment for two different scenarios, characterized by different values of the dispersion parameter ψ

Sensitivity analysis

Since our proposal requires the choice of a number of hyperparameters, it is important to investigate how sensitive the results are to varying parameter sets. Therefore, we conclude our simulation study by briefly discussing the sensitivity of the results to the prior specifications. In general, we found that results were robust to the prior choices on the intercept parameters, α j , while, as expected, some sensitivity was observed with respect to the variance hyperparameters of the spike-and-slab prior (3) on the regression coefficients, β pj , and the hyperparameters of the Beta priors on p pj . In Table 2 we report results obtained by considering a full grid of values for the prior expected value of p pj , i.e. m∈{0.005,0.01,0.05}, and the slab variance, \(r^{2}_{pj} \in \{1, 10, 100 \}\). In the Additional file 1, we further report the corresponding ROC curves. With only 25 truly non-zero β pj ’s, out of 2,500 parameters, small increases in false positive rates can drastically decrease the Matthews correlation coefficient. Thus imposing some sparsity by using a smaller value for m improves overall performance while larger values of m allow for more false positives. The results appear to suggest that assuming moderate sparsity a priori (e.g., m=0.01) generally leads to good operating characteristics. Similarly, when the slab variance is small, e.g. \(r_{pj}^{2} = 1\), there is more prior density close to zero, which allows small but insignificant variables to be selected. Conversely, when the slab variance is large, e.g. \(r_{pj}^{2} = 100\), false positives are less likely but false negatives increase, since the prior density is spread more evenly over the support. Therefore, an intermediate value, e.g. \(r_{pj}^{2} = 10\), provides a reasonable compromise, which favors relatively large effect sizes and a small number of false positives. In the Additional file 1, we also report the performance of our method for varying values of the over-dispersion parameter ψ and the sample size n. As expected, the results show that the performances improve for larger sample sizes and decreasing overdispersion.

Table 2 Simulated data: sensitivity analysis for varying values of the prior expected value of p pj , m, and the slab variance \(r^{2}_{pj}\), and for two different scenarios, characterized by different values of the dispersion parameter ψ

Data analysis

Figure 4 shows the traceplot of the number of included taxa/covariate associations and the plot of the marginal PPIs of the P×J elements ξ pj , obtained by computing the proportion of times that ξ pj =1 across all iterations, after burn-in. Here the median model, corresponding to a threshold of 0.5 on the PPIs, selects 92 associations. Among those, 26 have a marginal PPI greater than 0.98, which corresponds to a Bayesian FDR of 0.1. These 26 associations are listed in Table 3, together with the corresponding estimated regression coefficients, and depicted in Figs. 5 and 6, for positive and negative associations, respectively. In these Figures, the magnitude of the association, as captured by the estimated β pj ’s, is proportional to the width of the edges, with red lines indicating negative associations and blue lines positive associations. As a comparison, the method by Chen and Li [26] identified 120 associations, whereas the Bayesian Lasso of [43] and the correlation test-based method of [19] identified, respectively, 220 and 711 associations. Those results appear to confirm the sparser selection achieved by our method, consistently with the results of the simulation study.

Fig. 4
figure 4

Real data: Marginal posterior probabilities of inclusion (PPI) for each coefficient β jp, in Eq. (2), describing the association between each taxa and each covariate. Each PPI is obtained by averaging the number of times that each taxa/covariate association is included across the MCMC iterations, after burn-in. Here, the median model, corresponding to a threshold of 0.5 on the PPIs, selects 92 associations. Among those, 26 have a marginal PPI greater than 0.98, which corresponds to a Bayesian FDR of 0.1. These 26 associations are indicated as red dots

Fig. 5
figure 5

Real data: Selected positive taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β pj , is proportional to the width of the edges

Fig. 6
figure 6

Real data: Selected negative taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β pj , is proportional to the width of the edges

Table 3 Real data: selection results using a BFDR of 0.1

Discussion

A close investigation of the biological significance of the associations identified by our model reveals several interesting characteristics and affirms the relevance of these associations. Commensal microbiota that inhabit the human gut are proficient at scavenging glycans and polysaccharides, including those in plants, such as starches or cellulose, animal-derived tissues (glycosaminoglycans and N-linked glycans), and glycans from host mucus (O-linked glycans) [46]. Ruminococcus spp. are known to participate in both resistant starch and glycosaminoglycan degradation [46, 47]. It has been reported that long-term consumption of diets rich in protein and animal fat were associated with an enterotype primarily containing increased Bacteroides and Ruminococcus species [19]. Additionally, Ruminococcus torques and Ruminococcus gnavus have been shown to degrade mucins [48]. Thus, it is logical that Ruminococcus, which is one of the noteworthy genera involved in glycosaminoglycan degradation, would be negatively associated to glycosaminoglycan biosynthesis (ko00534) (Table 3). Similarly, Parabacteroides which is also negatively associated with N-glycan biosynthesis (ko00513), is involved in deglycosylation and utilization of N-glycans [49]. Also, among the associations identified for the glycan pathways, Prevotella was negatively associated with mucin type O-glycan biosynthesis (ko00512). In the literature, Prevotella has implications for mucosal homeostasis, as some Prevotella spp. express a unique mucin-desulfating glycosidase that can hydrolyze GlcNAc residues on mucin-type O-glycans, and thus is important for mucin degradation [50]. Other associations affirmed through the literature included that of Bacteroides with naphthalene degradation (ko00626). It has been reported that Bacteroidetes possess the capability to degrade polycyclic aromatic hydrocarbons such as naphthalene [51]. Associations of Ruminococcus with pyruvate metabolism (ko00620) are also supported, as phosphoenolpyruvate carboxykinase was previously reported to be associated with Ruminococcus flavefaciens in the rumen [52]. Another supported association is that of Prevotellaceae with sulfur metabolism. L-cysteine desulfhydrase enzymes have been characterized in Prevotella intermedia [53]. Additionally, glycosulfatase enzymes have been described in Prevotella [54]. Equally interesting is the selection of pathways that are expected to be omnipresent among many bacteria, such as glycolysis/gluconeogenesis (ko00010), as glycolysis occurs, with variations, in nearly all organisms, both aerobic and anaerobic. Thus, it is not surprising that taxa like Clostridiales are positively associated with glycolysis/gluconeogenesis as they are abundant taxa within the gut microbiome.

Given the complexity of metabolic pathways and the process of mapping specific genes to pathways, some of the selected associations are unexpected, and might be due to the 16S abundances that were made available at the HMP site and the mapping of metagenomic sequences to specific KEGG orthology groups by HUMAnN. For example, several species of Ruminococcus are known to participate in butanoate (butyrate) metabolism [55], Dialister spp. have phenylalanine arylamidase activities [56], and Prevotella spp. are known to participate in pyruvate metabolism [57, 58]. Since those associations should be driven exclusively by bacterial genes, it is interesting that we find significant associations between the abundance of certain bacterial taxa and KEGG pathways that are primarily reported among eukaryotic species (i.e., T-cell receptor signaling, hedgehog signaling, pathways in cancer, etc.). Indeed, although precautionary steps are performed, the HMP consortium reported that human contaminants are found in 50–90% of the sequences [15]. This might also explain the negative association exhibited by Bacteriodes and glycolysis/gluconeogenesis. These unexpected findings suggest the need for further investigations and validation.

Conclusion

Herein, we have developed a Bayesian approach to the Dirichlet-Multinomial regression models that allows for the selection of significant associations between covariates and taxa from a microbiome abundance table by imposing spike-and-slab priors on the log-linear regression coefficients of the model. We have applied our model to simulated data and compared performances with methods developed for similar applications. We have illustrated the performance of our method using publicly available data on taxonomic abundances and metabolic pathways inferred from whole genome shotgun sequencing reads, which we obtained from the Human Microbiome Project website. Our results have revealed interesting links between specific taxa (i.e. genera) and particular metabolic pathways, which we have validated via existing literature.

Several extensions of our model are possible. Because some habitats, e.g. the gut, are thought to have highly variable dynamics, longitudinal sampling may be preferred to cross-sectional sampling since it may give a better sense of long-term trends [59]. Thus, incorporating repeated samples with specified correlation structures in the linear predictor could produce additional insights. Another important aspect of microbiome data, which is receiving attention from researchers, is the heterogeneity in community structure across samples, as this can be an indication of the existence of “enterotypes” [60, 61]. This can be addressed within our modeling framework by employing Bayesian nonparametric models that would allow to cluster selected associations across partitions of the samples. These extensions are currently under investigation.