Data and notation
Suppose n samples have been collected, each with a microbial community profile. For sample i, let \(Y_{i}\) denote an outcome of interest, which can be binary (e.g., disease status) or continuous. Let \(X_{i}=(X_{i1},\dots,X_{ip})\) be the p covariates, such as age, gender, and other clinical and environmental variables that we want to adjust for. Let \(Z_{i}=(Z_{i1},\dots,Z_{im})\) be the abundances of m taxa derived from the observed q OTUs for the ith sample. Note that an OTU represents a common species while a taxon is a group of one or more species. Here, we assume that each of \(Z_{i1},\dots,Z_{iq}\) is the count of the OTU in sample i and \(Z_{ik}\), q+1≤k≤m, is the sum of the counts of the OTUs belonging to taxon k in sample i. The evolutionary relationships among these OTUs and taxa are given by a rooted phylogenetic tree, which contains all q OTUs (as leaf nodes) and m−q taxa (as internal nodes). Suppose \(b_{k}\) is the distance from the root of the phylogenetic tree to taxon k, and \(p_{ik}=Z_{ik} / \sum _{j=1}^{q}Z_{ij}\) is the proportion of taxon k in sample i. The goal is to test for a possible association between the overall microbial community composition and the outcome of interest after adjusting for the covariates.
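As a minimal illustration of this notation, the sketch below extends q OTU counts to m taxon counts and computes the proportions \(p_{ik}\) for one sample. The toy tree, the function names, and the representation of each internal node as a list of leaf indices are assumptions made for illustration only.

```python
def taxon_counts(otu_counts, internal_taxa):
    """Return (Z_i1, ..., Z_im): the q OTU counts followed by one summed
    count per internal node (taxon) of the tree."""
    z = list(otu_counts)
    for leaves in internal_taxa:
        z.append(sum(otu_counts[j] for j in leaves))
    return z


def taxon_proportions(z, q):
    """p_ik = Z_ik / sum_{j=1}^{q} Z_ij, where the total is over OTUs only."""
    total = sum(z[:q])
    return [zk / total for zk in z]


# Hypothetical tree ((OTU0, OTU1)A, OTU2)root with q = 3 leaves.
otu_counts = [30, 10, 60]
internal_taxa = [[0, 1], [0, 1, 2]]  # leaves under A, leaves under the root

z = taxon_counts(otu_counts, internal_taxa)
p = taxon_proportions(z, q=3)
print(z)  # [30, 10, 60, 40, 100]
print(p)  # [0.3, 0.1, 0.6, 0.4, 1.0]
```

Because every internal node sums the counts of its descendant leaves, the root always has proportion 1.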
A new class of tests: MiSPU
The MiSPU and aMiSPU tests are introduced in this section. Figure 1 illustrates the overall structure of the tests, detailing the input (a rooted phylogenetic tree, a sample of OTU counts, an outcome of interest, and possibly some covariates) and the three key steps: calculating a generalized taxon proportion for each taxon, calculating the test statistics, and applying a residual permutation scheme to obtain the p values.
One major characteristic of microbial composition data is that taxa are related as described by a phylogenetic tree. Phylogenetic distance measures that account for phylogenetic relationships among taxa can be much more powerful than those ignoring evolutionary information [20]. Among these, UniFrac distances are most popular. Consider two samples i and j. The unweighted UniFrac distance, which considers only species presence or absence, is a qualitative measure and is defined as [18]:
$$\begin{array}{*{20}l} d_{ij}^{U} =\frac{\sum_{k = 1}^{m}\{b_{k} |I(p_{ik}>0)-I(p_{jk}>0)|\}}{\sum_{k=1}^{m} b_{k}}, \end{array} $$
where I(·) is the indicator function. In contrast, weighted UniFrac, which uses OTU abundance information, is a quantitative measure [19]:
$$\begin{array}{*{20}l} d_{ij}^{W} = \frac{\sum_{k=1}^{m} b_{k} |p_{ik} -p_{jk}|}{\sum_{k=1}^{m} b_{k} |p_{ik} + p_{jk}|}. \end{array} $$
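The two distances can be evaluated directly from the definitions above. The following sketch does so for a pair of samples. The branch-related values b and the proportions are made-up toy numbers, and the function names are hypothetical.

```python
def unweighted_unifrac(p_i, p_j, b):
    """d^U: b-weighted share of taxa present in exactly one of the two samples."""
    num = sum(bk * abs((pi > 0) - (pj > 0)) for bk, pi, pj in zip(b, p_i, p_j))
    return num / sum(b)


def weighted_unifrac(p_i, p_j, b):
    """d^W: b-weighted absolute differences, normalized by b-weighted sums."""
    num = sum(bk * abs(pi - pj) for bk, pi, pj in zip(b, p_i, p_j))
    den = sum(bk * abs(pi + pj) for bk, pi, pj in zip(b, p_i, p_j))
    return num / den


# Toy inputs for m = 5 taxa (hypothetical values).
b = [2.0, 2.0, 1.0, 1.0, 0.0]
p_i = [0.3, 0.1, 0.6, 0.4, 1.0]
p_j = [0.0, 0.5, 0.5, 0.5, 1.0]

print(unweighted_unifrac(p_i, p_j, b))
print(weighted_unifrac(p_i, p_j, b))
```

In this toy example only the first taxon differs in presence, so the unweighted distance is 2/6, while the weighted distance also reflects the abundance shifts in the other taxa.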
Our basic observation is that phylogenetic distance metrics, which account for the relationship among taxa via a phylogenetic tree, measure the distance among samples using all the variables (i.e., taxa) without variable selection or variable weighting. Since the dimension of microbial data is usually high, much larger than the number of samples, many taxa may provide only weak or no signals. Using a phylogenetic distance without variable weighting or variable selection may or may not be powerful. Instead, corresponding to the unweighted and weighted UniFrac distances, for each sample i and taxon k, we define the corresponding generalized taxon proportions as
$$\begin{array}{*{20}l} Q_{ik}^{u} = b_{k} I(p_{ik} >0), \qquad Q_{ik}^{w} = b_{k}p_{ik}, \end{array} $$
respectively. Note that the raw weighted UniFrac distance [19] between two samples is exactly the same as the \(L_{1}\) distance of the weighted generalized taxon proportion between the two samples.
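The equivalence noted above can be checked numerically. The sketch below computes both versions of the generalized taxon proportion and compares the \(L_{1}\) distance of \(Q^{w}\) with the unnormalized (raw) weighted UniFrac sum \(\sum_{k} b_{k}|p_{ik}-p_{jk}|\). All inputs are toy values and the function name is hypothetical.

```python
def generalized_proportions(p, b, weighted=True):
    """Q^w_ik = b_k * p_ik (weighted) or Q^u_ik = b_k * I(p_ik > 0) (unweighted)."""
    if weighted:
        return [bk * pk for bk, pk in zip(b, p)]
    return [bk * (1.0 if pk > 0 else 0.0) for bk, pk in zip(b, p)]


# Toy inputs for m = 5 taxa (hypothetical values).
b = [2.0, 2.0, 1.0, 1.0, 0.0]
p_i = [0.3, 0.1, 0.6, 0.4, 1.0]
p_j = [0.0, 0.5, 0.5, 0.5, 1.0]

q_i = generalized_proportions(p_i, b, weighted=True)
q_j = generalized_proportions(p_j, b, weighted=True)

l1_distance = sum(abs(a - c) for a, c in zip(q_i, q_j))
raw_weighted_unifrac = sum(bk * abs(a - c) for bk, a, c in zip(b, p_i, p_j))

# The two quantities agree up to floating-point rounding.
print(abs(l1_distance - raw_weighted_unifrac) < 1e-12)  # True
```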
Inspired by a multivariate test for association analysis of rare variants [23], we construct a class of versatile score-based tests such that for a given scenario, at least one of the tests is powerful. Then we combine these tests to maintain high power across a wide range of scenarios. Specifically, for a binary outcome, we use a logistic regression model:
$$\begin{array}{*{20}l} \text{Logit}[\text{Pr}(Y_{i} = 1)] = \beta_{0} +\beta' X_{i} + \sum_{k = 1}^{m}Q_{ik}\varphi_{k}, \end{array} $$
where \(Q_{ik}\) is either \(Q_{ik}^{u}\) or \(Q_{ik}^{w}\).
For a continuous outcome, we use a linear model:
$$\begin{array}{*{20}l} Y_{i} = \beta_{0} +\beta' X_{i} + \sum_{k = 1}^{m}Q_{ik}\varphi_{k} + \epsilon_{i}, \end{array} $$
where \(\epsilon_{i}\) is an error term with mean 0 and variance \(\sigma^{2}\).
We are interested in testing the null hypothesis \(H_{0}\): \(\varphi=(\varphi_{1},\dots,\varphi_{m})'=0\). That is, there is no association between any taxa and the outcome of interest under \(H_{0}\). The score vector \(U=(U_{1},\dots,U_{m})'\) for \(\varphi\) is [17, 23–25]:
$$\begin{array}{*{20}l} U =& \sum_{i=1}^{n}(Y_{i}-\hat{\mu}_{i,0})Q_{\textit{i}\cdot}, \end{array} $$
where \(Q_{i\cdot}=(Q_{i1},Q_{i2},\dots,Q_{im})\) and \(\hat {\mu }_{i,0}\) is the predicted mean of the outcome of interest (\(Y_{i}\)) under \(H_{0}\). Note that a general weighted score-based test can be written as
$$\begin{array}{*{20}l} T_{\mathrm{G}} = w'U = \sum_{k= 1}^{m} w_{k}U_{k}, \end{array} $$
where \(w=(w_{1},\dots,w_{m})'\) is a vector of weights for the m generalized taxon proportions. Most existing association tests use the score vector U to construct a test statistic, because of the closed form of the score vector U and because most of the information in the data is contained in U. Therefore, we use U to construct the weights for the score vector U. Under \(H_{0}\), we have \(U\sim N(0,\text{Cov}(U|H_{0}))\) asymptotically, suggesting that a larger \(|U_{k}|\) offers stronger evidence to reject \(H_{0,k}\): \(\varphi_{k}=0\). Specifically, we choose \(w=(U_{1}^{\gamma -1},\dots,U_{m}^{\gamma -1})'\) to weight the score vector for the generalized taxon proportions, leading to a MiSPU test:
$$\begin{array}{*{20}l} T_{\text{MiSPU}(\gamma)} = w'U = \sum_{k= 1}^{m} U_{k}^{\gamma}. \end{array} $$
Since γ=1 essentially treats all the variables as equally important while association directions of the generalized taxon proportions may vary, γ=1 often yields low power and thus is excluded here. Importantly, as γ increases, the MiSPU(γ) test puts more weight on the larger components of U while gradually ignoring the remaining components. As γ goes to infinity, we have
$$\begin{array}{*{20}l} T_{\text{MiSPU}(\infty)} \propto ||U||_{\infty} = \max_{k=1}^{m}|U_{k}|. \end{array} $$
We simply define \(T_{\text {MiSPU}(\infty)} = \max _{k=1}^{m}|U_{k}|\). Note that the two versions of \(Q_{ik}\), i.e., \(Q_{ik}^{w}\) and \(Q_{ik}^{u}\), yield weighted MiSPUw and unweighted MiSPUu, respectively.
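To make the role of γ concrete, the following sketch computes the score vector and the MiSPU(γ) statistic from the formulas above. It takes the null-model residuals \(Y_{i}-\hat{\mu}_{i,0}\) as given, restricts γ to positive integers or infinity, and uses made-up toy values. The function names are hypothetical and not taken from the MiSPU package.

```python
import math


def score_vector(residuals, Q):
    """U_k = sum_i (Y_i - mu_hat_i0) * Q_ik, with Q stored as n rows of length m."""
    m = len(Q[0])
    return [sum(r * row[k] for r, row in zip(residuals, Q)) for k in range(m)]


def mispu_stat(U, gamma):
    """T_MiSPU(gamma) = sum_k U_k^gamma; for gamma = infinity, max_k |U_k|."""
    if math.isinf(gamma):
        return max(abs(u) for u in U)
    return sum(u ** gamma for u in U)


# Toy data: n = 4 samples, m = 3 generalized taxon proportions.
residuals = [1.0, -1.0, 0.5, -0.5]
Q = [[1, 0, 2], [0, 1, 2], [2, 0, 0], [0, 2, 0]]

U = score_vector(residuals, Q)
print(U)                        # [2.0, -2.0, 0.0]
print(mispu_stat(U, 2))         # 8.0
print(mispu_stat(U, 3))         # 0.0
print(mispu_stat(U, math.inf))  # 2.0
```

In this toy example the two non-zero score components have opposite signs, so they cancel for the odd power γ=3 but add up for the even power γ=2.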
We use a permutation scheme [23] to calculate the p value as follows:
1. Fit the null linear or logistic regression model by regressing Y on the covariates X under \(H_{0}\) to obtain \(\hat {\mu }_{i,0} = E(Y_{i}|H_{0})\) and residuals \(r_{i} = Y_{i} -\hat {\mu }_{i,0}\).
2. Permute the residuals \(r=\{r_{i} \mid i=1,\dots,n\}\) to obtain a permuted set \(r^{(b)}\).
3. Regress Q on the covariates X to obtain the residuals \(\hat {Q}\).
4. Calculate the new score vector based on the permuted residuals as \(U^{(b)} = \sum _{i = 1}^{n} \hat {Q}_{\textit {i}\cdot } r_{i}^{(b)}\) and the corresponding null statistic \(T_{\text {MiSPU}}^{(b)} = T_{\text {MiSPU}}(U^{(b)})\).
5. Calculate the p value as \(\left [\sum _{b=1}^{B} I\left (|T_{\text {MiSPU}}^{(b)}| \geq |T_{\text {MiSPU}}| \right)+1\right ]/(B+1)\) after B permutations.
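The five steps above can be sketched as follows for a continuous outcome. To stay within the Python standard library, this sketch assumes a single covariate plus an intercept, so each regression has a closed form. The function names, the toy data, and the default B are illustrative assumptions, not the implementation in the MiSPU package.

```python
import math
import random


def ols_residuals(y, x):
    """Residuals from regressing y on an intercept and ONE covariate x
    (a simplifying assumption of this sketch)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return [(yi - y_bar) - slope * (xi - x_bar) for xi, yi in zip(x, y)]


def mispu_stat(U, gamma):
    """sum_k U_k^gamma for integer gamma, or max_k |U_k| for gamma = infinity."""
    if math.isinf(gamma):
        return max(abs(u) for u in U)
    return sum(u ** gamma for u in U)


def mispu_perm_pvalue(y, x, Q, gamma, B=999, seed=0):
    rng = random.Random(seed)
    m = len(Q[0])
    r = ols_residuals(y, x)  # step 1: null-model residuals
    # Step 3: residualize each column of Q on x; q_res[k][i] is Q_hat_ik.
    q_res = [ols_residuals([row[k] for row in Q], x) for k in range(m)]

    def stat(res):
        U = [sum(ri * qi for ri, qi in zip(res, col)) for col in q_res]
        return mispu_stat(U, gamma)

    # For a linear null model, r is orthogonal to x and the intercept, so the
    # observed score vector is the same whether Q or Q_hat is used.
    t_obs = stat(r)
    exceed = 0
    for _ in range(B):
        r_b = r[:]
        rng.shuffle(r_b)  # step 2: permute the residuals
        if abs(stat(r_b)) >= abs(t_obs):  # steps 4 and 5
            exceed += 1
    return (exceed + 1) / (B + 1)


# Toy data with a strong signal in the first generalized taxon proportion.
rng = random.Random(1)
n = 40
x = [rng.gauss(0, 1) for _ in range(n)]
Q = [[rng.random() for _ in range(3)] for _ in range(n)]
y = [5.0 * row[0] + 0.5 * xi + 0.1 * rng.gauss(0, 1) for row, xi in zip(Q, x)]
print(mispu_perm_pvalue(y, x, Q, gamma=2, B=199))
```

A binary outcome would require fitting a logistic null model in step 1, which is omitted here.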
It would be desirable to choose the value of γ and the version of the generalized taxon proportion data-adaptively, since the optimal choice depends on the unknown true association patterns. Like the adaptive SPU (aSPU) test [23], we propose an adaptive MiSPU (aMiSPU) test, which combines the p values of multiple MiSPU tests with various values of γ and two versions of \(Q_{ik}\). Suppose that we have some candidate values of γ in Γ, e.g., Γ={2,3,…,8,∞}, as used in our later simulations and real-data analysis. Then, our combining procedure is to take the minimum p value:
$$\begin{array}{*{20}l} T_{\text{aMiSPU}_{\mathrm{u}}} &= \min_{\gamma \in \Gamma} P_{\text{MiSPU}_{\mathrm{u}}(\gamma)}, \\ T_{\text{aMiSPU}_{\mathrm{w}}} &= \min_{\gamma \in \Gamma} P_{\text{MiSPU}_{\mathrm{w}}(\gamma)}, \\ T_{\text{aMiSPU}} &= \min \left\{P_{\text{aMiSPU}_{\mathrm{u}}}, P_{\text{aMiSPU}_{\mathrm{w}}}\right\}. \end{array} $$
Note that we take the minimum p value of aMiSPUu and aMiSPUw to form the final aMiSPU test. \(T_{\text {aMiSPU}_{\mathrm {u}}}\), \(T_{\text {aMiSPU}_{\mathrm {w}}}\), and \(T_{\text {aMiSPU}}\) are no longer genuine p values, but we can use permutation to estimate their p values, using the same set of null statistics used to calculate the p values for the MiSPU tests [23].
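One way to reuse a single set of null statistics for the minimum-p combination is sketched below. It follows the general idea of ranking each null statistic against the other null statistics, and the exact details (including the function name and input layout) are assumptions of this sketch rather than a description of the published implementation.

```python
def adaptive_pvalue(t_obs, t_null):
    """t_obs: observed statistics, one per candidate gamma.
    t_null: B lists of null statistics, each ordered like t_obs.
    Returns (per-gamma p values, minimum p value, p value of the minimum)."""
    B, G = len(t_null), len(t_obs)

    # p value of each observed statistic against its B null statistics.
    p_obs = [
        (sum(abs(t_null[b][g]) >= abs(t_obs[g]) for b in range(B)) + 1) / (B + 1)
        for g in range(G)
    ]
    t_adapt = min(p_obs)

    # p value of each null statistic against the remaining B - 1 null
    # statistics, then the minimum over gamma for every permutation b.
    min_p_null = []
    for b in range(B):
        p_b = [
            (sum(abs(t_null[c][g]) >= abs(t_null[b][g])
                 for c in range(B) if c != b) + 1) / B
            for g in range(G)
        ]
        min_p_null.append(min(p_b))

    # Null distribution of the minimum p value gives the adaptive p value.
    p_adapt = (sum(mp <= t_adapt for mp in min_p_null) + 1) / (B + 1)
    return p_obs, t_adapt, p_adapt


# Toy input: B = 4 permutations, two candidate values of gamma.
t_null = [[1.0, 0.5], [2.0, 1.5], [3.0, 2.5], [4.0, 3.5]]
t_obs = [3.5, 0.1]
print(adaptive_pvalue(t_obs, t_null))  # ([0.4, 1.0], 0.4, 0.4)
```

Applied to the statistics of one version of \(Q_{ik}\), this corresponds to aMiSPUu or aMiSPUw. The further combination of the two versions into aMiSPU is not sketched here.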
We comment on the choice of Γ and the version of the generalized taxon proportion. Depending on how many taxa are truly associated with the outcome of interest, one may use a smaller or larger γ. For example, if more of the taxa are not associated, a larger γ would be desirable. In our numerical simulations and real-data analysis, we have found that Γ={2,3,…,8,∞} often suffices. MiSPU(8) often gives almost the same results as those of MiSPU(∞), suggesting there is no need to use other larger γ’s. In practice, we suggest using the aMiSPU test, which combines the strengths (and possibly weaknesses) of various MiSPU tests. The aMiSPU test can be regarded as a rigorous means for multiple testing adjustment with the use of several MiSPU tests, while the results of MiSPU tests may shed light on the underlying association patterns. For example, if a MiSPU with the unweighted generalized taxon proportion gives the most significant p value, it may indicate the outcome of interest is more likely to be associated with the abundance changes in rare taxa. If some odd γ’s yield more significant results than even γ’s, then most or all of the large associations are in the same direction.
Although we focus on rRNA sequencing data, the proposed method can be applied to metagenomic whole-genome shotgun sequencing data as well. Via MEGAN [26], DNA reads (or contigs) can be summarized as OTUs and their counts. Using a standard algorithm, species-specific sequences are assigned to OTUs or taxa near the leaves of a phylogenetic tree, whereas widely conserved sequences are assigned to taxa closer to the root [26]. Once we have OTU abundance data and a phylogenetic tree, aMiSPU can be applied as before.
Taxon selection
A limitation of most multivariate tests is their inability to select variables: even if the null hypothesis is rejected, they may not give any information on which taxa are (or are not) likely to be associated with the outcome of interest. We note that the aMiSPU test can be used to rank the importance of the taxa. First, if \(P_{\text {aMiSPU}_{\mathrm {u}}} < P_{\text {aMiSPU}_{\mathrm {w}}}\), we use the unweighted generalized taxon proportion in the subsequent analysis; otherwise, we use the weighted one. For ease of exposition, suppose we choose the weighted one. Second, we take \(\hat {\gamma } = \text {argmin}_{\gamma \in \Gamma } P_{\text {MiSPU}_{\mathrm {w}}(\gamma)}\), the value of γ chosen by the aMiSPUw test. If \(\hat {\gamma } = \infty \), we can easily find the most significant taxon, namely the one with the largest \(|U_{k}|\). Third, if \(\hat {\gamma } < \infty \), we assess the relative contribution of each taxon r to the aMiSPUw test as \(\mathcal {C}_{r} = |U_{r}|^{\hat {\gamma }} / \sum _{j=1}^{m}|U_{j}|^{\hat {\gamma }}\). Fourth, we rank the taxa based on their \(\mathcal {C}_{r}\) values, and we can select a few top \(k_{1}\) taxa, such as \(k_{1}=1\), or such that the sum of their relative contributions \(\sum _{r = 1}^{k_{1}}\mathcal {C}_{r} \geq \alpha _{1}\) with \(\alpha_{1}=0.7\), say. The choice of \(k_{1}\) or \(\alpha_{1}\) determines the trade-off between increasing true positives and increasing false positives.
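The ranking step for a finite \(\hat{\gamma}\) can be sketched as below. The score values are made up and the function name is hypothetical. The threshold \(\alpha_{1}=0.7\) is the example value mentioned above.

```python
def rank_taxa(U, gamma_hat, alpha1=0.7):
    """Relative contributions C_r = |U_r|^gamma / sum_j |U_j|^gamma and the
    smallest top-ranked set whose contributions sum to at least alpha1."""
    powered = [abs(u) ** gamma_hat for u in U]
    total = sum(powered)
    contrib = [v / total for v in powered]

    order = sorted(range(len(U)), key=lambda r: contrib[r], reverse=True)
    selected, cumulative = [], 0.0
    for r in order:
        selected.append(r)
        cumulative += contrib[r]
        if cumulative >= alpha1:
            break
    return contrib, selected


# Toy score vector for m = 4 taxa with gamma_hat = 2.
contrib, selected = rank_taxa([3.0, -1.0, 0.5, 2.0], gamma_hat=2)
print([round(c, 3) for c in contrib])
print(selected)  # [0, 3]
```

Here taxon 0 alone contributes about 63 %, below the 0.7 threshold, so the second-ranked taxon 3 is added as well.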
The MiSPU package and implementation
We implemented the MiSPU and aMiSPU tests in an R package called MiSPU, which also provides a C++ implementation of the UniFrac distances that is faster than the GUniFrac R package. The package is available on GitHub (https://github.com/ChongWu-Biostat/MiSPU) and CRAN. We applied MiRKAT from the MiRKAT R package developed by Ni Zhao and Michael Wu, available at http://research.fhcrc.org/wu/en/software.html. The SPU and aSPU tests are available in the R package aSPU on CRAN.
Simulation settings
We used a phylogenetic tree of OTUs from a real throat microbiome data set [27], which consists of 856 OTUs after discarding singleton OTUs. The simulation settings were similar to those used in [16]. Specifically, we generated the OTU counts for each individual via the following steps:
1. Based on a real throat microbiome data set [27], the estimated OTU proportions \((\hat {\pi }_{1},\hat {\pi }_{2},\dots,\hat {\pi }_{856})\) as well as the estimated overdispersion parameter \(\hat {\theta }\) were obtained via maximum likelihood.
2. For sample i, the observed OTU proportions were randomly generated from a Dirichlet distribution: \((p_{1i},p_{2i},\dots,p_{856i}) \sim \text {Dirichlet}(\hat {\pi }_{1},\hat {\pi }_{2},\dots,\hat {\pi }_{856},\hat {\theta })\).
3. The total count of OTUs for sample i, say \(n_{i}\), was randomly drawn from a negative binomial distribution with mean 1000 and size 25. This step mimicked varying total reads per sample.
4. For sample i, the observed OTU counts were randomly generated from a multinomial distribution: \((Z_{i1},Z_{i2},\dots,Z_{i856}) \sim \text {Multinomial}(n_{i};p_{1i},p_{2i},\dots,p_{856i})\).
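Steps 2 to 4 can be sketched with the Python standard library as follows. This is not the simulation function of the MiSPU package. It assumes that the Dirichlet concentration for OTU j is \(\hat{\pi}_{j}(1-\hat{\theta})/\hat{\theta}\), a common Dirichlet-multinomial parameterization that the text does not spell out, and it takes toy values for \(\hat{\pi}\) and \(\hat{\theta}\) instead of the estimates from step 1.

```python
import math
import random


def simulate_otu_counts(pi_hat, theta_hat, rng, mean_depth=1000, size=25):
    """Return (OTU counts, total count n_i) for one simulated sample."""
    # Step 2: Dirichlet proportions as normalized gamma variates.
    # ASSUMPTION: concentration_j = pi_j * (1 - theta) / theta.
    conc = (1.0 - theta_hat) / theta_hat
    gammas = [rng.gammavariate(pj * conc, 1.0) for pj in pi_hat]
    total = sum(gammas)
    props = [g / total for g in gammas]

    # Step 3: negative binomial depth with the given mean and (integer) size,
    # drawn as a sum of `size` geometric failure counts.
    prob = size / (size + mean_depth)
    depth = sum(
        int(math.log(1.0 - rng.random()) / math.log(1.0 - prob))
        for _ in range(size)
    )

    # Step 4: multinomial counts given the depth and the proportions.
    counts = [0] * len(pi_hat)
    for j in rng.choices(range(len(pi_hat)), weights=props, k=depth):
        counts[j] += 1
    return counts, depth


rng = random.Random(2016)
counts, depth = simulate_otu_counts([0.5, 0.3, 0.2], theta_hat=0.02, rng=rng)
print(counts, depth)
```

With 856 OTUs the same function applies unchanged, provided every entry of the proportion vector is strictly positive.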
The procedure for generating simulated data is available as a function in R package MiSPU. We considered several simulation scenarios that differed in how some OTUs were related to the outcome of interest.
Under simulation scenario 1, we partitioned the 856 OTUs into 20 clusters (lineages) by partitioning around medoids based on the cophenetic distance matrix. The abundance of these 20 OTU clusters varied tremendously, such that each OTU cluster corresponded to some possible bacterial taxa. We assumed that the outcome of interest depended on the abundance cluster that constituted 6.7 % of the total OTU reads. Then we simulated dichotomous outcomes as follows:
$${\small{\begin{aligned} {}\text{Logit}\left(E(Y_{i}|X_{i},Z_{i})\right) = 0.5 \, \text{scale}(X_{1i} + X_{2i}) + \beta \, \text{scale} \left(\sum_{j\in A} Z_{ij}\right), \end{aligned}}} $$
where β was the effect size and scale(·) standardized its argument across the n samples to have sample mean 0 and standard deviation 1. For continuous outcomes, we simulated under the model
$$\begin{array}{*{20}l} Y_{i}= 0.5 \, \text{scale}(X_{1i} + X_{2i}) + \beta \, \text{scale} \left(\sum_{j\in A} Z_{ij}\right) + \epsilon_{i}, \end{array} $$
where \(\epsilon_{i} \sim N(0,1)\). \(X_{1i}\) and \(X_{2i}\) were the covariates to be adjusted for, and A was the index set of the selected OTU cluster. \(X_{1i}\) was generated from a Bernoulli distribution Bin(1,0.5), while \(X_{2i}\) was from a standard normal distribution N(0,1). To consider the effect of potential confounders, we studied the case where \(X_{2i}\) and \(Z_{i}\) were correlated, specifically, \(X_{2i} = \text {scale} \left (\sum _{j\in A} Z_{ij}\right) + N(0,1)\). We varied the effect size β to mimic different magnitudes of association.
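The two outcome models can be sketched as below. The function names, the toy count matrix, and the choice of the n−1 denominator in the standard deviation are assumptions of this sketch. The index set A here is arbitrary rather than a cluster obtained by partitioning around medoids.

```python
import math
import random


def scale(values):
    """Standardize to sample mean 0 and standard deviation 1 (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def simulate_outcomes(Z, A, beta, rng, binary=True, confounded=False):
    """Simulate Y from OTU counts Z (n rows) and the associated OTU index set A."""
    n = len(Z)
    abundance = scale([sum(row[j] for j in A) for row in Z])
    x1 = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]  # Bin(1, 0.5)
    if confounded:
        x2 = [a + rng.gauss(0, 1) for a in abundance]  # correlated with Z
    else:
        x2 = [rng.gauss(0, 1) for _ in range(n)]       # N(0, 1)
    covariate = scale([a + c for a, c in zip(x1, x2)])
    eta = [0.5 * c + beta * a for c, a in zip(covariate, abundance)]
    if binary:
        # Logit(E(Y)) = eta, so Y is Bernoulli with success probability expit(eta).
        y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-e)) else 0 for e in eta]
    else:
        y = [e + rng.gauss(0, 1) for e in eta]
    return y, x1, x2


rng = random.Random(5)
Z = [[rng.randint(0, 50) for _ in range(6)] for _ in range(100)]  # toy counts
y, x1, x2 = simulate_outcomes(Z, A=[0, 1], beta=1.0, rng=rng, binary=True)
print(sum(y), "cases out of", len(y))
```

Setting beta to 0 yields data under the null hypothesis, which is the setting used to evaluate the type I error.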
Under simulation scenario 2, we partitioned all the OTUs into 40 clusters and assumed the outcome was associated with the abundance cluster with only three OTUs. Under simulation scenarios 3, 4, and 5, we assumed that the outcome of interest was associated with the abundance cluster with 24.8 %, 16.6 %, and 1.5 % of the total OTU reads, respectively. Under simulation scenario 6, we assumed the outcome was associated with 50 randomly selected OTUs.
For all the simulation scenarios, we considered using MiSPUu and MiSPUw with γ=2,3,…,8. We combined the MiSPU tests to get aMiSPUu, aMiSPUw, and aMiSPU. We compared aMiSPU with MiRKAT with the weighted and unweighted UniFrac kernels (\(K_{w}\) and \(K_{u}\), respectively), the Bray–Curtis kernel (\(K_{BC}\)), and a generalized UniFrac kernel with α=0.5 (\(K_{5}\)). We also applied the optimal MiRKAT, which combines the above four kernels.
Throughout the simulations, the sample size and test significance level were fixed at 100 and α=0.05, respectively. The results were based on 1000 independent replicates for β≠0 and 10,000 independent simulations for β=0.
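As a rough guide to the precision these replicate counts imply, the sketch below evaluates the binomial Monte Carlo standard error \(\sqrt{\text{rate}(1-\text{rate})/R}\) of an estimated rejection rate over R replicates. This calculation is an added illustration, not part of the original analysis, and the function name is hypothetical.

```python
import math


def mc_standard_error(rate, n_rep):
    """Standard error of a rejection rate estimated from n_rep independent replicates."""
    return math.sqrt(rate * (1.0 - rate) / n_rep)


# Type I error near the nominal 0.05 level with 10,000 null replicates.
print(round(mc_standard_error(0.05, 10_000), 4))  # 0.0022
# Power estimate in the least precise case (rate 0.5) with 1000 replicates.
print(round(mc_standard_error(0.5, 1_000), 4))    # 0.0158
```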