Abstract
Differential abundance analysis is at the core of statistical analysis of microbiome data. The compositional nature of microbiome sequencing data makes false positive control challenging. Here, we show that the compositional effects can be addressed by a simple, yet highly flexible and scalable, approach. The proposed method, LinDA, only requires fitting linear regression models on the centered logratio transformed data, and correcting the bias due to compositional effects. We show that LinDA enjoys asymptotic FDR control and can be extended to mixedeffect models for correlated microbiome data. Using simulations and real examples, we demonstrate the effectiveness of LinDA.
Background
The role of the human microbiome in health and disease has been intensively studied over the past few years, see, e.g., [1, 2], for several reviews. Potentially pathogenic or probiotic microorganisms can be identified by analyzing their abundances in a microbial ecosystem (e.g., the human gut) with respect to some covariate of interest such as disease status. Current prevailing technologies for studying the human microbiome use metagenomic sequencing, where either the DNA of a taxonomically informative gene (e.g., 16S rRNA) or all the genomic DNA in the microbial genome is sequenced. After obtaining the raw sequencing reads, the reads can be clustered into operational taxonomic units (OTUs), denoised into amplicon sequence variants (ASVs), or mapped to a microbial reference database (taxa) using existing bioinformatics pipelines such as UPARSE, DADA2, and MetaPhlAn [3–5]. For simplicity, we use the term taxon (pl. taxa) to represent any taxonomic unit (OTU/ASV/taxon) from a bioinformatics pipeline. Therefore, after bioinformatics processing, we have an abundance table recording the frequencies of detected taxa in the samples, together with a meta data table capturing the samplelevel information. Differential abundance analysis is then carried out based on the abundance and meta data table.
Ideally, we want to measure the absolute abundance of the microorganisms, i.e., the number of microorganisms per unit area/volume at the microbial ecosystem, and differential abundance analysis is performed on the absolute abundance data. However, the data from a sequencing experiment only captures the relative abundance (compositional) information since the total sequence read count, also known as sequencing depth or library size, does not reflect the total microbial load in the specimen due to the complex chemistry involved in sequencing [6, 7]. Although there are several experimental techniques such as qPCR, spikein and flow cytometry that can be used to achieve absolute abundance measurement, the severe limitations of these techniques prevent their wide adoption [8]. Therefore, the prevailing sequencing protocol is still only able to measure the relative abundances. Drawing inferences about the changes on the unknown absolute abundance based on the measured relative abundance data is challenging due to missing the total microbial load information. The increase or decrease in the abundance of some taxa with respect to a covariate of interest automatically results in changes in the relative abundances of all other taxa, a statistical phenomenon known as compositional effects. Therefore, using the standard statistical techniques such as twosample ttest, Wilcoxon rank sum test, and linear regression analysis ignoring the compositional nature of the data could lead to a large number of false discoveries.
For the differential abundance problem to be well defined, one has to make assumptions. One major assumption is that the differential signal is sparse, i.e., only a small proportion of taxa are associated with the covariate of interest. Although many studies have supported the sparse signal assumption, there are also studies support dense signal hypotheses, where a large number of taxa are differential with small effect sizes [9, 10]. Therefore, the validity of a method and the definition of true or false positive depends on the specific assumption one is willing to accept. Here our goal is to provide a statistical tool that could be potentially useful for pinpointing top candidate taxa for further biological validation.
To address compositional effects in differential analysis, one popular approach is robust normalization. It involves calculating a normalizing factor (scale factor), which is robust to a small number of differential taxa and could well capture the sequencing effort for the nondifferential part. Therefore, dividing by such a normalizing factor will bring the abundance of the nondifferential taxa to the same scale while retaining the differences for those differential ones. Assuming the number of differential taxa is small, different strategies have been used to calculate a robust normalizing factor including TMM, RLE, CSS, and GMPR [11–14]. We list these methods in Additional file 1: Table S1. In contrast, the naive total sum scaling (TSS) normalization, which divides the counts by the library size, is not a robust normalization method [14].
These normalization techniques can be combined with different statistical procedures in differential abundance analysis. For example, we can divide the counts by the normalizing factor from the normalization techniques in Additional file 1: Table S1 and then apply standard statistical tools based on the normalized data. The normalizing factor could also be included as an offset in regression models such as edgeR [15], DESeq2 [16], MicrobiomeDDA [17], and metagenomeSeq [13], where the TMM, RLE, GMPR, and CSS normalization are the accompanying normalization methods. The recently developed MaAsLin2 [18] uses log linear models on the normalized abundance data. Different normalization approaches including TSS, TMM, CSS, and CLR are options in MaAsLin2. A variant to the robust normalization approach is to find a reference taxon or a set of reference taxa, which are assumed to be nondifferential with respect to the covariate of interest. The data are then normalized by the count of the reference taxon (or the sum of the counts of the reference taxa). This strategy was used in RAIDA [19] and DACOMP [20].
Another line of methods to tackle the compositional effect uses (log) ratio approach since only ratios are well defined for compositional data [21]. The ALDEx2 method by [22] uses the centered logratio (CLR) transformation, where the counts of a sample are divided by their geometric mean before taking logarithms. Differential abundance analysis is then performed using Wilcoxon rank sum test or ttest based on the CLR transformed data. In the CLR approach, the geometric mean can also be regarded as a robust normalizing factor. The ANCOM proposed by [23] computes the pairwise ratios of the relative abundances and identifies the taxa with the most differential ratios. This is based on the observation that the abundance ratios for those differential taxa to other taxa are all differential assuming distinct effect sizes while the ratios for those nondifferential taxa are mostly nondifferential. Therefore, by analyzing the pattern of the pairwise ratios, one could distinguish the differential taxa from a background of nondifferential taxa with high accuracy. Recently, a biascorrected version of ANCOM (called ANCOMBC) has been proposed [24], which uses a linear regression framework based on logtransformed taxa count and estimates the unknown bias term due to the compositional effect through an EM algorithm.
In the work of [25, 26], the authors evaluated several popular methods in differential abundance analysis (ANCOMBC/MaAsLin2 not included) and showed that the inflation of the false discovery rate (FDR) is still a ubiquitous problem, and no method is satisfactory in all aspects. A method that is computationally efficient, relatively robust and powerful, and flexible enough to allow covariate adjustment and application to correlated microbiome data is still lacking in the field. In this paper, we propose a linear regression framework for differential abundance analysis (LinDA) to fill the methodological gap. LinDA involves three simple steps that can be carried out efficiently. First, it runs linear regressions using the CLRtransformed abundance data as the response. Then it identifies a bias term due to the compositional effect and corrects for the bias using the mode of the regression coefficients across different taxa. Finally, it computes the p values based on the biascorrected regression coefficients and applies the BenjaminiHochberg (BH) procedure to control the FDR. We rigorously prove the asymptotic FDR control of the proposed method, making it the first procedure that enjoys a theoretical FDR control guarantee. Our approach is related to ANCOMBC but differs in several aspects. (i) Our derivation provides a clear interpretation of the bias term and suggests a simple way to correct it. (ii) Our procedure does not involve the EM algorithm and can be 100–1000 times faster than ANCOMBC in our numerical studies. (iii) Our method can be directly extended to the mixedeffect models. Longitudinal and repeated measurementbased microbiome studies have been increasingly common [27, 28] but statistical tools for correlated microbiome data analysis remain scarce. LinDA can analyze the correlated microbiome data using the classic linear mixedeffects models. Through extensive simulation studies and real data analyses, we show that the new method outperforms the stateoftheart approaches in terms of FDR control and power.
Results
Numerical studies
Setups
We conducted comprehensive simulations to evaluate the performance of the proposed method under different setups. We set m=500 as the baseline for the number of taxa, which is similar to the number of tests at the species level for a typical microbiome study. We investigated the sample size n=50, 200 representing small and large sample sizes, respectively. More combinations of m and n were studied in additional settings. We generated the absolute abundances from the log normal distribution and considered three cases for the covariate of interest and confounders: the covariate of interest follows the Bernoulli distribution and no confounder (denoted as C0), the covariate of interest follows the standard normal distribution and no confounder (C1), and the covariate of interest follows the Bernoulli distribution and two confounders (C2). In addition to the basic setting (log normal abundance distribution, denoted as S0), we investigated other settings to study the robustness of the proposed method including zeroinflated absolute abundances (S1), correlated absolute abundances (S2), gamma abundance distribution (S3), smaller m (S4), smaller n (S5), 10fold difference in library size (S6), negative binomial abundance distribution (S7) and correlated microbiome data generated by mixedeffect model (S8). See the “Detailed setups for numerical studies” section for more details.
Competing methods
We compared our method with ANCOMBC, ALDEx2, DESeq2, edgeR, metagenomeSeq and MaAsLin2. For DESeq2 and edgeR, we replaced their native normalization methods with GMPR normalization, which was shown to improve the power and false positive control in differential abundance analysis [14]. For metagenomeSeq, there are two implementations, fitZig and fitFeatureModel, in the R Bioconductor package metagenomeSeq. Currently, fitFeatureModel is only applicable to binary covariate case (C0). We use metagenomeSeq2 and metagenomeSeq to denote the fitFeatureModel and fitZig procedures, respectively. We also compared with the standard nonparametric methods: Wilcoxon rank sum test for case C0 (a binary covariate) and Spearman correlation test for case C1 (a continuous covariate), both with the GMPR normalized data.
For the proposed method, we considered two zerohandling approaches. The first approach adds a pseudocount of 0.5 to all the counts, which is widely used in microbiome data analysis on the log scale. However, it has been shown to be problematic under certain situations [20]. We thus designed a new imputationbased approach, where the zeros were imputed by \(N_{s}/(\max _{k:Y_{ik}=0}N_{k})\) for ith taxon in sth sample, where N_{s} denotes the library size (sequencing depth) of sth sample and Y_{ik} denotes the read count of ith taxon in kth sample. In other words, zeros were imputed differently according to the library size of the sample, and zeros in the sample with a larger library size were replaced with larger fractions. In this imputation approach, we treat zeros as leftcensored missing data. Suppose we only know the library sizes, then a natural strategy is to impute zeros in proportion to the library size with the sample of the largest library size receiving a fractional count close to 1 (in our approach, we simply set it as 1). The purpose of the imputation strategy is to reduce false positives when the library size is correlated with the covariate of interest. As shown in the simulation studies, the pseudocount approach worked sufficiently well in most settings except the setting S6, where the library size between the groups differed by 10folds. In contrast, the imputation approach reduced the false positive rate extensively for the setting S6 (Additional file 2: Fig. S1). However, it was slightly less powerful than the pseudocount approach when the library size was a not confounder (Additional file 2: Fig. S2). Thus, in the implementation, we used an adaptive approach: we first tested the association between the covariate of interest and the library size. If the p value was smaller than 0.1, we used the imputation approach conservatively; otherwise, we used the pseudocount approach. Additional file 2: Fig. S1 and S2 show that the adaptive method controls the false positives when the library sizes are very different among groups while retaining the power when the library sizes are similar.
The proposed LinDA method can be viewed as a threestep procedure: CLROLSBC (OLS stands for ordinary least squares and BC stands for bias correction), which can be easily extended to the linear mixedeffects model using CLRLMMBC (LMM stands for linear mixedeffect model). In the setting S8 (correlated microbiome data), we compared CLRLMMBC to CLROLSBC, CLROLS, and CLRLMM to demonstrate the utility of LinDA for correlated microbiome data analysis.
Results
We use S0C0 (log normal abundance distribution, a binary covariate) to denote the setting S0 (log normal abundance distribution) with the covariate design C0 (a binary covariate) and likewise for other setups. For S0, we studied all the three covariate designs (C0–C2), and for S1–S8, we only performed C0 for demonstration. We found that DESeq2, edgeR and metagenomeSeq had severe FDR inflation under most settings. To increase the readability of the results (presented in figures), we did not include them in the main comparison and focused on the comparison between LinDA, ANCOMBC, ALDEx2, metagenomeSeq2, MaAsLin2 and Wilcoxon (Figs. 1 and 2 and Additional file 2: Fig. S1–S15). Full results of all methods are presented in Additional file 2: Fig. S20–S30.
We first point out that MaAsLin2 with CLR normalization is essentially the same as the CLROLS procedure we described earlier. Additional file 2: Fig. S3 compares LinDA, CLROLS, MaAsLin2TSS, MaAsLin2TMM, MaAsLin2CSS and MaAsLin2CLR under the setting S0C0 (log normal abundance distribution, a binary covariate). We can see that MaAsLin2CLR is close to CLROLS, both of which suffer from FDR inflation. We included MaAsLin2 with its default configuration (i.e., TSS normalization) in our comparisons below.
Figure 1 and Additional file 2: Fig. S4 and S5 show the results of the competing methods under the log normal abundance distribution with three covariate designs: a binary covariate (S0C0), a continuous covariate (S0C1), and a binary variable of interest with confounders (S0C2), respectively. Generally speaking, LinDA and ANCOMBC have the best FDR and power tradeoff. Under C0 and C2 (Fig. 1 and Additional file 2: Fig. S5), both methods control the FDR around the target level, and ANCOMBC is slightly more powerful than LinDA when the sample size is small. However, under C1 (Additional file 2: Fig. S4), LinDA controls FDR at the target level at both sample sizes while ANCOMBC has slight FDR inflation when the sample size is small. LinDA is also slightly more powerful than ANCOMBC at a small sample size. The Wilcoxon rank sum test based on GMPR normalized data and MaAsLin2 perform well under C0 (a binary covariate, Fig. 1) with slightly inflated FDR at larger effect sizes and reasonable power across settings. In contrast, for a continuous covariate (C1, Additional file 2: Fig. S4), the Spearman rank correlation test and MaAsLin2 have large FDR inflation when the signal is dense. When there are confounders (C2, Additional file 2: Fig. S5), Wilcoxon has severe FDR inflation when the sample size is large due to its inability to adjust for confounders, while MaAsLin2 provides acceptable results as under C0. ALDEx2 is a conservative method, which offers the strongest FDR control but is much less powerful. metagenomeSeq2 performs well when the signal is sparse but fails to control the FDR when the signal is dense. We also studied the effect of zero inflation and the correlations among taxa (S1C0 and S2C0, Additional file 2: Fig. S6 and S7), where we observed similar patterns such that LinDA and ANCOMBC had overall the best performance among the compared methods.
Since LinDA assumes a log normal distribution of the absolute abundance, it is interesting to evaluate its performance when the log normal assumption is violated. We thus simulated the absolute abundance data using a gamma distribution (S3C0), and the results are depicted in Fig. 2. It shows that LinDA controls the FDR close to the target level and has the highest power. When the signal is dense (20%), ANCOMBC has a noticeable FDR inflation while ALDEx2, metagenomeSeq2, MaAsLin2 and Wilcoxon have severe FDR inflation when the signal is dense.
With a smaller number of taxa (m=50, S4C0, Additional file 2: Fig. S8), ANCOMBC controls the FDR and the power is also high. LinDA is the most powerful but it has slight FDR inflation. metagenomeSeq2, MaAsLin2, and Wilcoxon control the FDR but are less powerful in the case of sparse signal. However, when the signal is dense, they could not control the FDR properly. When the sample size is very small (n=20 or 30, S5C0), LinDA controls the FDR around the target level and maintains high power (Additional file 2: Fig. S9). ANCOMBC and metagenomeSeq2 have large FDR inflation and the inflation seems to increase as the sample size gets smaller. MaAsLin2 and Wilcoxon are much less powerful and ALDEx2 has virtually no power. Under the setting S6C0, where the sequencing depth differs by 10folds, only ALDEx2 and our proposed method with adaptive zerohandling approach are able to control the FDR (Additional file 2: Fig. S10). LinDA achieves a better performance in both the FDR control and power than ALDEx2. It is interesting that ALDEx2 performs better under S6C0 than under other settings. We point out here that when we implemented ANCOMBC, we disabled its zero treatment. To further investigate whether its zero treatment option improves its performance, we also run the procedure enabling its zero treatment (zero_cut = 0.9, lib_cut = 1000, struc_zero = TRUE), and found the results were very similar (S6C0, Additional file 2: Fig. S11).
Under the previous simulation settings, we found that DESeq2 and edgeR had the worst false positive control (Additional file 2: Fig. S20–S28). As the two methods assume negative binomial distribution for the counts, it is interesting to see their performance when the data are generated by their assumed model (S7C0). Additional file 2: Fig. S29 shows that DESeq2 and edgeR (and metagenomeSeq) remain to have serious FDR inflation, indicating that the normalization approach to address the compositional effect is not sufficient. In contrast, LinDA and ANCOMBC perform the best among competitors as in other settings, and ANCOMBC achieves higher power than LinDA (Additional file 2: Fig. S12).
Finally, we applied LinDA to correlated microbiome data (S8C0), where the other competing methods except MaAsLin2 are not applicable to correlated samples. Additional file 2: Fig. S13 and S14 compare the methods CLRLMMBC (LinDALMM), CLROLSBC (LinDAOLS), CLRLMM, CLROLS, and MaAsLin2 for correlated data. In the scenario of comparing the pretreatment and posttreatment samples (S8.1, Additional file 2: Fig. S13), we could clearly see that ignoring the bias tremendously increases the FDR level especially under dense signals (LinDALMM vs CLRLMM). In addition, LinDALMM is more powerful than LinDAOLS due to its ability to exploit the correlation between pre and posttreatment samples. Under the replicate sampling setting (S8.2, Additional file 2: Fig. S14), we see that the LinDAOLS has significant FDR inflation by treating the replicate samples as independent ones. In contrast, LinDALMM controls the FDR at the target level. MaAsLin2 control the FDR under both settings but is less powerful than LinDALMM.
Based on the presented simulation settings, we summarize that LinDA and ANCOMBC have overall the most robust performance among the methods evaluated. However, ANCOMBC is computationally intensive. As shown in Table 1, LinDA could be 100–1000 times faster than ANCOMBC, making LinDA a highly scalable method in practice. In addition, the extension of LinDA to the mixedeffect models is easily carried out and performs well.
Real data applications
Datasets
We applied LinDA and the competing methods to three real datasets with independent samples from the studies ofC. difficile infection (CDI, [29]), inflammatory bowel disease (IBD, [30]), and rheumatoid arthritis (RA, [31]). To demonstrate the use of LinDA on correlated microbiome samples, we applied LinDA to a dataset from the study of the smoking effect on the human upper respiratory tract (SMOKE, [32]). We used the microbiome samples from the throat for illustration, where each subject has two samples from the left and right sides of the throat. The CDI and RA datasets were provided by the authors while the IBD and the SMOKE datasets were downloaded from the Qiita database [33] with the study ID 1460 and 524. All the datasets have binary phenotypes. Antibiotics use is the confounder for the IBD dataset (p=0.03 and OR = 0) while sex is the confounder for the SMOKE dataset (p=0.02 and OR= 2.26). They will be adjusted in methods that are capable of covariate adjustment. We excluded samples with less than 1000 read counts and taxa which appear in less than 10% of the samples. The basic characteristics for the four filtered datasets are summarized in Table 2. We compared the detection power as well as their overlap patterns for LinDA, ANCOMBC, ALDEx2, metagenomeSeq2, MaAsLin2, and Wilcoxon. Specifically, we compared the number of discoveries at different FDR levels (0.01–0.25) and used UpSet plot [34] to show the overlap at the target FDR of 0.1. We used winsorization at quantile 0.97 to reduce the impact of potential outliers as recommended in [17].
Results
For the CDI dataset, LinDA or MaAsLin2 made the most discoveries at different FDR levels (Fig. 3). At 10% FDR, LinDA discovered eight and MaAsLin2 discovered six taxa associated with CDI. In contrast, ANCOMBC, ALDEx2, and Wilcoxon discovered three while metagenomeSeq2 discovered two. As discussed in [29], subjects with CDI were more likely to have the bacterial family Lachnospiraceae and Erysipelotrichaceae. LinDA found one more taxon belonging to Lachnospiraceae than other methods (blue bars in Fig. 4). Besides, LinDA, MaAsLin2, and Wilcoxon found one differential taxon belonging to Erysipelotrichaceae while the other three methods did not identify any (orange bar in Fig. 4). For the IBD dataset, LinDA detected a similar number of taxa as ANCOMBC and MaAsLin2. Wilcoxon rank sum test detected a large number of taxa associated with the disease status, but this could be due to the confounding effects of antibiotics use since it could not adjust for covariates. From Fig. 4, we observe that most discoveries by LinDA are shared by ANCOMBC, MaAsLin2, or Wilcoxon. For the RA dataset, LinDA detected a similar number of taxa as ANCOMBC and more taxa than other methods. The differential taxa detected by LinDA and ANCOMBC largely overlapped. Overall, the results are consistent with the simulation studies.
Finally, we applied LinDALMM and MaAsLin2 to the SMOKE dataset, where each subject has two replicate samples from the throat. The aim is to identify smokingassociated taxa adjusting for the sex. To account for the correlation between the replicate samples, we included a subjectlevel random intercept in LinDALMM. As a comparison, we also applied LinDAOLS and MaAsLin2 to the right and left throat samples separately. LinDAOLS based on the left or right throat samples alone discovered 12 and 15 differential taxa at 10% FDR. When both left and right samples were used in LinDALMM, 21 differential taxa were identified, covering the majority of the taxa identified based on the left or right throat samples alone (Fig. 4). In addition, LinDALMM detected five taxa that were missed by analyzing the left or right samples separately. Compared to MaAsLin2, LinDALMM discovered seven more differential taxa. Therefore, LinDALMM provides a convenient way to analyze correlated microbiome datasets and enjoys the power improvement by analyzing all samples together.
To visualize the results, LinDA provides a function to generate the effect size plot and volcano plot for differential taxa. Additional file 2: Fig. S16–S19 display the effect size plots of differential taxa at FDR level of 0.1 and volcano plots for the four datasets, respectively. In the effect size plots (Additional file 2: Fig. S16A–S19A), the taxa in black are detected by LinDA and taxa in red are detected solely by LinDA. In Additional file 2: Fig. S16A (CDI), the taxa in blue are missed by LinDA but detected by one or more other methods. In Additional file 2: Fig. S17A (IBD) and S18A (RA), the taxa in blue are missed by LinDA but detected by two or more other methods. No taxa are detected by MaAsLin2 but missed by LinDALMM for the SMOKE dataset. Based on the effect size plots for the CDI, IBD and RA datasets, we can see that, for the taxa solely detected by LinDA, the effect sizes tend to be underestimated without bias correction and bias correction improves the power in these cases. On the contrary, for the taxa missed by LinDA, the effect sizes tend to be overestimated without bias correction. In addition, we observe that the differential taxa for the IBD, CDI, and RA datasets are more unbalanced (i.e., more negative or positive associations) while the differential taxa for the SMOKE dataset are relatively balanced (i.e., similar numbers of negative and positive associations). Indeed, the effect size plots, where we plot both the debiased and undebiased coefficients, revealed larger biases for the IBD, CDI, and RA datasets.
Discussion
Differential abundance analysis is at the core of the statistical analysis of microbiome data. Microbiome data are compositional in nature and all we know are the relative abundances, making the identification of differentially abundant taxa at the ecological site particularly challenging [6, 7]. Numerous differential abundance analysis methods have been proposed focusing on addressing the compositional effects [13, 15–17, 19, 20, 22–24]. Among all the competing methods, ANCOMBC is the stateoftheart method, it has been demonstrated to be more robust and powerful than the competing methods. However, there are two drawbacks of ANCOMBC. First, it is computationally intensive for largescale microbiome datasets such as the AmericanGut dataset. Due to the huge intersubject variation, largescale microbiome studies have been increasingly popular, resulting in larger sample sizes. On the other hand, metagenomic sequencing has become increasingly deeper to have a highresolution view of the microbiome, leading to an unprecedented number of new microbial features. To meet the analysis need for such largescale datasets, a computationally efficient tool is much needed. Secondly, ANCOMBC is not applicable to correlated/clustered microbiome datasets such as those from family/longitudinal microbiome studies or studies with paired and repeated measurements [27, 28]. Longitudinal microbiome analysis, which enables the study of the trajectory of the microbiome as well as controls for potential confounders, has been increasingly employed in human microbiome studies. Unfortunately, statistical tools for longitudinal microbiome studies are scarce. In contrast, LinDA is computationally efficient since it only involves fitting regular linear regression models and could be easily scaled to hundreds of thousands of taxa. Moreover, the extension of LinDA to linear mixedeffects models (LMM) is straightforward and we have highly efficient tools such as the R lme4 package [35] for fitting LMM. Therefore, differential abundance analysis of correlated/clustered microbiome datasets could be easily performed using LinDA. Our framework also gives more insights into the CLRbased approach, which has been widely used in compositional data analysis [21]. However, the bias of CLR regression models has not been formally recognized to our best knowledge. Our framework justifies the use of CLR regression and provides a solution to correct the bias associated with CLR regression.
In the simulation, we found that Wilcoxon rank sum test and MaAsLin2 showed similar FDR/power curves and performed fairly well in most settings. As we mentioned earlier, MaAsLin2 is based on log linear models on the normalized count data, and it is essentially a twosample ttest when no confounders are included, which explains why MaAsLin2 is close to Wilcoxon rank sum test. However, when we simulated an even stronger compositional effect by drawing the differential taxa from the top 25% most abundant taxa, we found that Wilcoxon rank sum test and MaAsLin2 began to break down (Additional file 2: Fig. S15). ANCOMBC was overall robust and powerful, but it had inflated type I error at small sample sizes. metagenomeSeq2 did not perform well when the signal was dense and was generally less powerful than ANCOMBC and LinDA. ALDEx2 was the most conservative method: its strong FDR control was at the expense of statistical power. LinDA was as competitive as ANCOMBC in most settings. It had better FDR control than ANCOMBC when the sample size was small or the covariate of interest was continuous. However, LinDA had some FDR inflation when the number of taxa was small. Under a very strong compositional effect (Additional file 2: Fig. S15), LinDA also showed some FDR inflation but overall it had the best FDR and power tradeoff.
When the library size was associated with the covariate of interest, all existing methods had severe type I error inflation. Fortunately, such association is detectable and if we see a significant association, rarefaction should be used for those methods. Although rarefaction controls the effect of uneven library sizes, it discards a significant portion of the reads and thus loses much information in the data. When there are many samples with small library sizes, the users have to decide whether to retain more reads or more samples. In LinDA, we implemented a heuristic imputation method, where the imputed values are proportional to the library sizes. This procedure makes the imputed data after CLR transformation independent of the library size and substantially reduces the inflated type I error due to library size confounding.
Although the presented simulation settings could give basic insights into the performance of various methods, such modelbased simulations might not be able to capture the full characteristics of the real microbiome data. It is very likely that the performance of the compared methods will change using a different simulation framework. Moreover, our simulation strategy purposely creates strong compositional effects, where all differential taxa show the same direction of change. Such setting is used to test the limit of the various methods in addressing the compositional effects. However, in real data, the compositional effects may not be always strong, and the FDR inflation of many methods could be very moderate. Therefore, a future benchmarking study, which uses real databased simulation strategy and investigates all biologically plausible differential settings, is much needed to have a comprehensive and objective evaluation of existing differential abundance analysis methods.
As for all modelbased approaches, LinDA has several assumptions and limitations. First, LinDA relies on the assumption that there is a mode at 0 for the regression coefficients (Condition (vi) in Theorem 1). This assumption is easy to be met if the signal is sparse. In the simulation, we show that when the signal density is around 20%, LinDA is still very robust. However, when the signal is extremely dense, LinDA could fail. Second, LinDA assumes a log linear model on the absolute abundance. Although this is a reasonable assumption, which has been widely adopted in the analysis of abundance data, the interaction between the host and the microbiome could be more complex than the simple log linear relationship. Analysis of the residuals from the CLR regression could provide evidence about whether the assumption is reasonable. If the model assumption is violated, a permutation test or transformation of the variables may be performed. Finally, although LinDA provides asymptotic FDR control, its finitesample FDR control is not guaranteed. Based on numerical simulations, we found that LinDA performed well under small sample sizes. However, we did observe some FDR inflation under a small feature size due to inefficiency in mode estimation with few features. Therefore, we do not recommend applying LinDA to datasets with small feature sizes (e.g., m <50) such as phylumlevel abundance data.
LinDA uses the relative abundance data and does not model the sampling variability of the read counts. This could reduce the statistical power for those less abundant taxa, whose sampling variability is larger than those abundant taxa. To remedy the power loss, another multinomial sampling layer could be imposed on top of LinDA. However, the computational complexity will be increased significantly, breaking the simplicity of LinDA. Another approach is to perform posterior inference of the underlying proportions based on a Bayes approach. Once we obtain the posterior samples, LinDA can be applied to the posterior samples and results are then aggregated, similar in the spirit to the multiple imputation method [36].
Besides microbiome data, LinDA could be applied to other sequencing data such as RNASeq data since all sequencing data are compositional in nature [37]. Thus, LinDA could be an alternative for differential expression analysis if there are strong compositional effects, for example, when the highly abundant genes are differential with the same direction of change.
Finally, we comment that addressing compostionality is more relevant when analyzing individual microbial features such as differential abundance analysis, since the major interest to biomedical investigators is to find those truly differential features (“driver”) instead of those driven by the compositional effect (“passenger”). However, for communitywide analysis such as distancebased analysis [38, 39], addressing the compositionality may not be necessary in order to control the type I error. This is because that compositional effect is only relevant under the alternative hypotheses. Considering compositionality in the communitywide analysis has also been found to have small effects on the statistical power [25, 40]. Additionally, in microbiomebased predictive models [41], the relative abundances and/or their ratios could already be informative features for prediction and addressing compositionality may not necessarily increase the prediction accuracy significantly. Therefore, whether to address compositionality depends on the specific problems.
Conclusions
In summary, we proposed LinDA for differential abundance analysis of microbiome compositional data. LinDA identified a bias associated with traditional linear regression models based on CLRtransformed abundance data and proposed a strategy to estimate and correct the bias. LinDA can be extended to linear mixedeffects model for analysis of correlated microbiome data. As a general methodology, LinDA can be applied to differential abundance analysis of other highdimensional compositional data.
Methods
Setup
We use C, C_{1}, and C_{2} to denote positive constants, which can be different from line to line. As summarized in the background, there are two ways to tackle the compositional effects in differential abundance analysis, namely normalization and logratio transformation. In this paper, we adopt the CLR transformation and develop a biascorrection procedure to address the compositional effects. Denote the absolute abundance and the observed read count of the ith taxon in the sth sample by X_{is} and Y_{is}, respectively. For the sth sample, the total read count of all taxa, \(N_{s}=\sum ^{m}_{i=1}Y_{is}\), is determined by the sequencing depth and DNA materials. Given N_{s}, it is natural to model the stratified count data over m taxa through a multinomial distribution as
Under (1), we have
where e_{is} denotes the estimation error, which is expected to diminish as N_{s} gets large.
OLS estimation
We consider the log linear model on the absolute abundance
where c_{s}=(c_{s1},...,c_{sd})^{⊤} is the ddimensional covariates to be adjusted, u_{s} is the covariate of interest, and ε_{is} is the error term. Our goal is to discover taxa that are differentially abundant with respect to u_{s}. Statistically, we want to simultaneously test the following m hypotheses
Set ε_{is}=ε_{is}+e_{is}. Under (2) and (3), the CLRtransformed data satisfies the following linear model
where \(\bar {\alpha }=m^{1}\sum _{i=1}^{m}\alpha _{i}, \bar {\boldsymbol {\beta }}=m^{1}\sum _{i=1}^{m}\boldsymbol {\beta }_{i}\), and \(\bar {\varepsilon }_{s}=m^{1}\sum _{i=1}^{m}\varepsilon _{is}\). From (4), we can see that the OLS estimator for α based on the CLRtransformed data is biased with the bias term being \(\bar {\alpha }\). Let \(\bar {\alpha }_{i}=\alpha _{i}\bar {\alpha }, \bar {\boldsymbol {\beta }}_{i}=\boldsymbol {\beta }_{i}\bar {\boldsymbol {\beta }}, \bar {\varepsilon }_{is}=\varepsilon _{is}\bar {\varepsilon }_{s}\), and \(\bar {\sigma }_{i}^{2}=\text {var}(\bar {\varepsilon }_{is})\). Denote by \(\tilde {\alpha }_{i}, \tilde {\boldsymbol {\beta }}_{i}\), and \(\hat {\sigma }_{i}^{2}\) the OLS estimators of \(\bar {\alpha }_{i}, \bar {\boldsymbol {\beta }}_{i}\), and \(\bar {\sigma }_{i}^{2}\), respectively. We then have
where \(\mathbf {z}_{s}=(u_{s},1,\mathbf {c}_{s}^{\top })^{\top }\). We respectively let var_{z}(·) and cov_{z}(·,·) denote the variance and covariance computed conditional on z_{1},...,z_{n}. It can be shown that
where \(\hat {\rho }\) is the (1,1)th element of \((n^{1}\sum _{s=1}^{n}\mathbf {z}_{s}\mathbf {z}_{s}^{\top })^{1}\).
Bias correction
In many applications, it is reasonable to assume that there is only a small portion of differential taxa, i.e., most α_{i}’s are equal to 0. Under this assumption, as \(\tilde {\alpha }_{i}\) is an unbiased estimator for \(\bar {\alpha }_{i}=\alpha _{i}\bar {\alpha }\), the mode of \(\tilde {\alpha }_{i}\) is expected to be close to \(\bar {\alpha }\). This observation motivates us to estimate \(\bar {\alpha }\) by
In (6), K is a nonnegative even function with \(\int _{\infty }^{\infty }K(y)dy=1\), and h is the bandwidth parameter. Under some regular conditions, we have
as m,n→∞ (see the supplementary material for the proof). Therefore, one can estimate α_{i} by the biascorrected estimator \(\hat {\alpha }_{i}=\tilde {\alpha }_{i}+\tilde {\alpha }\).
Testing procedure
To construct a statistic for testing H_{0,i}, we need to find a proper estimator for the variance of \(\hat {\alpha }_{i}\). To this end, we note that
Since \(\text {var}_{\mathbf {z}}(\tilde {\alpha }_{i})\) is \(\hat {\rho } \bar {\sigma }_{i}^{2}/n\), it dominates \(\text {var}_{\mathbf {z}}(\tilde {\alpha })\) and \(\text {cov}_{\mathbf {z}}(\tilde {\alpha }_{i},\tilde {\alpha })\) as n,m→∞ under mild conditions. Thus, we estimate the variance of \(\hat {\alpha }_{i}\) by \(\hat {\rho }\hat {\sigma }_{i}^{2}/n\). As shown in the next section, the studentized statistic \(T_{i}:=\sqrt {n}\hat {\alpha }_{i}/\sqrt {\hat {\rho }\hat {\sigma }_{i}^{2}}\) is asymptotically normal. However, for small sample, we found that tdistribution provides a better approximation to the sampling distribution of T_{i}. We define the p value for testing H_{0,i} as
where F_{n−d−2}(·) denotes the cumulative distribution function of tdistribution with n−d−2 degrees of freedom. Based on the p values in (7), we can use the BH procedure to control the FDR. The above discussion leads to the following Algorithm 1.
Remark 1
Built upon the linear regression framework, our method could be easily extended to the mixedeffect model:
where γ_{i} is the random effect and r_{s} is the corresponding design. Mixedeffects can be used to analyze correlated microbiome data from studies involving replicates or spatial sampling as well as familybased and longitudinal microbiome studies. We suggest using the R function lmer to estimate the parameters for the CLRtransformed data. Denote by \(\tilde {\alpha }_{i,\text {lmer}}, \hat {\sigma }_{i,\text {lmer}}^{2}\), and df_{i,lmer} the estimations for \(\bar {\alpha }_{i}\), the variance of \(\tilde {\alpha }_{i,\text {lmer}}\), and the degrees of freedom of \(\tilde {\alpha }_{i,\text {lmer}}\) from the lmer function. We compute the biascorrected estimates \(\hat {\alpha }_{i,\text {lmer}}=\tilde {\alpha }_{i,\text {lmer}}+\tilde {\alpha }_{\text {lmer}}\), where \(\tilde {\alpha }_{\text {lmer}}\) is obtained as the same procedure used in (6). Similarly, we let \(T_{i,\text {lmer}}=\hat {\alpha }_{i,\text {lmer}}/\hat {\sigma }_{i,\text {lmer}}\) and \(\phantom {\dot {i}\!}p_{i,\text {lmer}}=2F_{\text {df}_{i,\text {lmer}}}(T_{i,\text {lmer}})\). The BH procedure on p_{i,lmer} is finally used to control the FDR.
Remark 2
Compared to the existing methods based on either normalization or CLR transformation, our method is computationally much more efficient and can be easily scaled to problems with tens of thousands of taxa. Table 1 compares the computation time of LinDA and ANCOMBC based on simulated datasets. We observe that our method is 100–1000 times faster than ANCOMBC. We also tested on a massive dataset of the similar scale of the AmericanGut project [42] (m=5000 and n=10,000). ANCOMBC completed the analysis in 85 min compared to 28 s for our method (see the column of S0C0 in Table 1). Largescale microbiome studies have been increasingly common to overcome the large intersubject variability, making our method practically useful for the analysis of big microbiome datasets.
Asymptotic FDR control
Suppose the target FDR controlling level is q. The BH procedure is equivalent to finding the smallest t^{∗} such that \(\widehat {\text {FDP}}(t^{*})\le q\), where
Here \(\mathbb {I}\) denotes the indicator function. To show the asymptotic FDR control as m,n→∞, we take a Bayesian perspective by assuming that the parameters α_{i}’s are independently generated from a common distribution. The key result is summarized in the following theorem and technical details can be found in the supplementary note.
Theorem 1
Let ρbe the (1,1)th element of \(\{\mathbb {E}(\mathbf {z}_{s}\mathbf {z}_{s}^{\top })\}^{1}\). Suppose the following conditions are satisfied:(i) z_{s}’s are i.i.d.; u_{s} and c_{sa},a=1,...,d, are subGaussian; \(\sigma _{\text {min}}\{\mathbb {E}(\mathbf {z}_{s}\mathbf {z}_{s}^{\top })\}>C\), where σ_{min}(A) represents the minimum eigenvalue of a matrix A.(ii) σ_{i}’s are i.i.d. and \(\mathbb {P}(C_{1}<\sigma _{i}<C_{2})=1\).(iii) \(\varepsilon _{is}/\sigma _{i}\sim ^{\text {i.i.d.}}\mathcal {E}=^{d} N(0,1)\) for i=1,...,m and s=1,...,n.(iv) α_{i}’s are i.i.d.(v) z_{s},σ_{i},ε_{is}/σ_{i}, and α_{i} for i=1,...,m and s=1,...,n are mutually independent.(vi) Denote by f_{n}(·;a)the density function of \(\sqrt {n}\alpha _{i}+\sqrt {a}\varepsilon _{is}\) for any a>0. For large enough n, the density f_{n}(·;ρ)has a unique mode at 0, i.e., \(\arg \max _{x\in \mathbb {R}} f_{n}(x;\rho)=0\); for any ε>0, there exists a δ>0 such that minn infx>εf_{n}(x;ρ)−f_{n}(0;ρ)>δ.(vii) The Fourier transform \(k(u)=\int _{\infty }^{\infty }e^{\imath uy}K(y)dy\) is absolutely integrable, where \(\imath =\sqrt {1}\) is the imaginary unit. (viii) h=o(1) and 1/(mh^{2})=o(1). (ix) m=o(e^{Cn}). (x) Let \(S_{\infty,n}(t) =\mathbb {P}(\mathcal {E}+\sqrt {n}\alpha _{i}/\sqrt {\rho \sigma _{i}^{2}}>t)\). There exists t_{0} such that for large enough n, 2F_{n−d−2}(−t_{0})/S_{∞,n}(t_{0})≤q.Let
Under the above conditions, we have
Conditions (i)–(v) help prove the consistency of the variance estimators and the mode of the regression coefficients. By assuming that the errors follow the normal distributions (Condition (iii)), we can integrate all the relevant covariate information in a single parameter \(\hat \rho \), which facilitates the establishment of the consistency of the kernel density estimation and hence the estimator of mode. In the simulation studies, we also investigated the scenario of nonnormal distribution. We use an example to illustrate Condition (vi). In particular, we assume that \(\sqrt {n}\alpha _{i}\) follows a discrete distribution with \(\mathbb {P}(\sqrt {n}\alpha _{i}=a_{n,l})=\pi _{l}\) for l=0,1, where a_{n,0}=0,a_{n,1}≠0,π_{l}>0, and π_{0}+π_{1}=1. To reflect the sparsity, π_{0} is set to be 0.8. We choose a_{n,1}=2 and 5 representing weak and strong signals, respectively. We consider two cases for the error variance: (i) σ_{i}=1; (ii) σ_{i}∼IG(a,b), i.e., σ_{i} follows the inversegamma distribution with the shape parameter a and scale parameter b. As seen from Fig. 5, when the signal strength is weak, the mode of \(\sqrt {n}\alpha _{i}+\sqrt {\rho }\varepsilon _{is}\) slightly deviates from 0 as the blue curve in the left panel indicates. For strong signals, the mode is exactly equal to zero. As shown in [43], Condition (vii) is fulfilled by many commonly used kernels such as the Gaussian kernel and the uniform kernel on [−1,1]. Condition (ix) allows the number of taxa to be exponentially larger than the sample size. Condition (x) ensures the existence of a cutoff value to control the FDR at level q. A similar assumption was imposed in Theorem 4 of [44].
Detailed setups for numerical studies
The differential taxa were randomly drawn from the entire set. In particular, let H_{i}=0 if the ith taxon is differentially abundant and H_{i}=1 otherwise. The underlying truth H_{i} was generated from
We simulated two levels of signal density (i.e., percentage of differential taxa) γ = 5%, 20%, roughly corresponding to sparse and dense signals. We assumed that the baseline absolute abundance \(X_{is}^{(0)}\) follows
and correspondingly the absolute abundance X_{is} were draw based on
where \(\boldsymbol {\beta }_{i}^{(0)}\) represents the coefficients of the confounders, i=1,…,m. Let
The observed OTUs data were simulated by
To create a power curve, we included six effect sizes labeled as {1,2,...,6} in the figures. We made the effect sizes have the same signs for differential taxa (i.e., the differential taxa have the same direction of change), creating a relatively strong compositional effect. Since lowabundance taxa have much less statistical power, we upweighted their effects so that the power will not be dominated by those abundant ones. Specifically, for a randomly drawn differential taxon i, we set
where μ is equally spaced on [1.05,2] and \(\bar \pi _{i}^{(0)}=\sum _{s=1}^{n}\pi _{is}^{(0)} / n\). We considered three cases for the covariate and confounders:

A binary covariate. u_{s}∼^{i.i.d.}Bernoulli(1/2) and no confounder.

A continuous covariate. u_{s}∼^{i.i.d.}N(0,1) and no confounder.

A binary covariate of interest and two confounders. u_{s}∼Bernoulli({1+ exp(−0.5c_{s1}−0.5c_{s2})}^{−1}) independently, where c_{s1} and c_{s2} are confounders (i.e., c_{s}=(c_{s1},c_{s2})^{⊤}). In the above, c_{s1} is specified to independently follow the Rademacher distribution and c_{s2}∼^{i.i.d.}N(0,1). The corresponding coefficients of the confounders \(\boldsymbol {\beta }_{i}^{(0)}=(\beta ^{(1)}_{i}, \beta ^{(2)}_{i})^{\top }, i=1, \dots, m\), were independently generated from a 2dimensional normal distribution with mean (1,2)^{⊤} and variance I_{2}, where I_{2} denotes the 2 by 2 identity matrix.
The parameters \(\beta _{i}^{(0)}, \sigma _{i}^{2}\), and N_{s} were generated based on the estimation for a real dataset (COMBO) from the study of the gut microbiota in a general population [45], which consists of 98 samples and 6674 taxa. We only used its 500 most abundant taxa. Since \(\beta _{i}^{(0)}\) and \(\sigma _{i}^{2}\) were not directly estimable using the relative abundance data, we estimated \(\beta _{i}^{(0)}\beta _{j}^{(0)}\) and \(\sigma _{i}^{2}+\sigma _{j}^{2}\) based on the pairwise log ratios, forced some \(\beta _{i}^{(0)}\)’s to be zeros to obtain the estimators of \(\beta _{1}^{(0)},\dots, \beta _{m}^{(0)}\), and derived \(\sigma _{i}^{2}\) from the values of \(\{\sigma _{i}^{2}+\sigma _{j}^{2}\}_{i,j}\). We assume that the library size for each sample follows the negative binomial distribution
where the mean and dispersion parameters were estimated based on the combo data. The resulting sparsity (percent of zeros) of the count matrix is around 65–75%.
In addition to the basic setting (S0, log normal abundance distribution), we designed seven other settings to study the robustness of the proposed method. Specifically, on top of S0 and C0 (a binary covariate), we studied

zeroinflated absolute abundances. The microbiome data contains excessive zeros and many zeros in the microbiome data can be explained by insufficient sampling [46] since majority of the taxa are of lowabundance. However, it is also possible that zeros are due to physical absence of the taxa [47]. To study the effect of zero inflation on differential abundance analysis, we randomly forced 30% of the absolute abundance data to be 0.

Correlated absolute abundances. Existing differential abundance analysis methods assume independence among taxa. However, in practice, taxa are interconnected forming networks [48]. It is interesting to see if the methods compared are robust to the correlations among the taxa. In this setting, we simulated blockcorrelation structure by dividing the 500 taxa into 25 equalsized blocks. Within each block, we further divided the block into 2 by 2 subblocks and simulated equal positive correlations (0.5) within each subblock and equal negative correlations (−0.5) between the two subblocks. This mimics the scenario that there are mutualistic relationships within the group and competitive relationships between groups.

Gamma abundance distribution. Although the log normal distribution has been widely used for modeling species abundance data, other models such as gamma distribution are also possible [49]. We thus did additional simulation studies using the gamma distribution. Let \(X_{is}^{(0)}\sim ^{\text {i.i.d.}}\text {Gamma}(\eta _{i}^{(0)},1)\) and \(X_{is}\sim ^{\text {i.i.d.}}\text {Gamma}(\eta _{i}^{(0)} \exp (u_{s}\alpha _{i}+\mathbf {c}_{s}^{\top }\boldsymbol {\beta }_{i}^{(0)}),1)\). Similarly, we estimated \(\eta _{i}^{(0)}\) from the COMBO data, where we first estimated the baseline proportion \(\pi _{i}^{(0)}\) based on the Dirichletmultinomial distribution using the R function dirmult and set the overdispersion parameter θ^{(0)} to be 0.003, then let \(\eta _{i}^{(0)}=\pi ^{(0)}_{i}(1/\theta ^{(0)}  1)\).

Smaller m. In microbiome data, each taxon can be assigned a taxonomic lineage and taxa abundances can be aggregated at different taxonomic ranks. Differential abundance analysis at higher ranks such as family and genus is also routinely performed. At the higher ranks, the number of taxa is much smaller. We thus studied a small number of taxa (m=50) to see if the proposed method is robust to a small m. In this setting, we randomly chose 50 elements from \(\boldsymbol {\beta }^{(0)}=(\beta _{1}^{(0)},...,\beta _{500}^{(0)})^{\top }\) and \(\boldsymbol {\sigma }^{2}=(\sigma _{1}^{2},...,\sigma _{500}^{2})^{\top }\) in each simulation run. We set N_{s}∼NB(1500,5.3).

Smaller n. In pilot microbiome studies, the sample sizes are usually small. It is interesting to study the performance of the methods at a much smaller sample size. We studied n=20, 30, and used the same effect size as n=50.

10fold difference in library size. When the microbiome samples are not fully randomized in sequencing, it is likely that samples of the two groups end up in two separate sequencing runs leading to very different library sizes for the two groups. Since the presence/absence of a taxon strongly depends on the library size, the differential library size will confound the twosample comparison, especially for those rare taxa [25]. To create differential library sizes, we generated the library size from N_{s}∼NB(5000,5.3) and N_{s}∼NB(50000,5.3) for the two groups, respectively.

Negative bionomial abundance distribution. DESeq2 and edgeR assume negative binomial distribution for the counts, thus we included one more simulation setting, where we generated the counts from the negative binomial distribution. Let \(X_{is}^{(0)}\sim ^{\text {i.i.d.}}\text {NB}(\text {exp}(7645\kappa _{i}^{(0)}), \theta _{i}^{(0)})\) and \(Y_{is}\sim ^{\text {i.i.d.}}\text {NB}(\text {exp}(N_{s}\kappa _{i}^{(0)} + u_{s}\alpha _{i}+\mathbf {c}_{s}^{\top }\boldsymbol {\beta }_{i}^{(0)}), \theta _{i}^{(0)})\). Similarly, we estimated the \(\kappa _{i}^{(0)}\) (regression coefficient for the library size with respect to the log of the count of ith taxon) and \(\theta _{i}^{(0)}\) (dispersion parameter) from the COMBO data using the R function glm.nb.

Mixedeffect model. We considered two scenarios: Pretreatment and posttreatment comparison (S8.1) and Replicate sampling (S8.2). Under S8.1, for n=50 (or 200), we simulated 25 (or 100) subjects and each has paired pretreatment and posttreatment samples. The aim is to detect taxa affected by treatment. Under S8.2, each subject has multiple measurements. For n=50 (or 200), we generated 25 (or 50) subjects with each having 2 (or 4) replicates. Specifically, we let
$$\log(X_{is})\sim \mathbf{r}_{s}^{\top}\boldsymbol{\gamma}_{i}+N(\beta_{i}^{(0)}+u_{s}\alpha_{i}+\mathbf{c}_{s}^{\top}\boldsymbol{\beta}_{i}^{(0)},\sigma_{i}^{2}), $$where r_{s} has one element equal to 1 and all the others equal to 0 indicating the subject ID of sample s. Each element of γ_{i} follows \(N(0,\tau _{i}^{2})\) independently, where we let \(\tau _{i}^{2}=a_{i}\sigma _{i}^{2}\) with a_{i}∼Unif([0,1]).
Availability of data and materials
LinDA is implemented as the linda function in the CRAN R package MicrobiomeStat (https://CRAN.Rproject.org/package=MicrobiomeStat). The LinDA package is also available at GitHub (https://github.com/zhouhj1994/LinDA) [54]. The entire codes and data for generating the presented results are available at the repository Zenodo (https://doi.org/10.5281/zenodo.6326019) and at GitHub (https://github.com/zhouhj1994/LinDAmanuscriptresult) under MIT License [55]. The CDI, IBD, RA, SMOKE and COMBO datasets are from [29–32, 45], respectively.
References
Fan Y, Pedersen O. Gut microbiota in human metabolic health and disease. Nat Rev Microbiol. 2021; 19(1):55–71.
Valdes AM, Walter J, Segal E, Spector TD. Role of the gut microbiota in nutrition and health. Bmj. 2018; 361:2179.
Edgar RC. Uparse: highly accurate otu sequences from microbial amplicon reads. Nature methods. 2013; 10(10):996–998.
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. Dada2: highresolution sample inference from illumina amplicon data. Nature methods. 2016; 13(7):581–583.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique cladespecific marker genes. Nature methods. 2012; 9(8):811–814.
Gloor GB, Macklaim JM, PawlowskyGlahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Frontiers in microbiology. 2017; 8:2224.
Tsilimigras MC, Fodor AA. Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Annals of epidemiology. 2016; 26(5):330–335.
Morton JT, Marotz C, Washburne A, Silverman J, Zaramela LS, Edlund A, Zengler K, Knight R. Establishing microbial composition measurement standards with reference frames. Nature communications. 2019; 10:2719.
Xiao J, Chen L, Yu Y, Zhang X, Chen J. A phylogenyregularized sparse regression model for predictive modeling of microbial community data. Front Microbiol. 2018; 9:3112.
Xiao J, Chen L, Johnson S, Yu Y, Zhang X, Chen J. Predictive modeling of microbiome data using a phylogenyregularized generalized linear mixed model. Front Microbiol. 2018; 9:1391.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rnaseq data. Genome Biol. 2010; 11:25.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:106.
Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013; 10(12):1200–2.
Chen L, Reeve J, Zhang L, Huang S, Wang X, Chen J. Gmpr: A robust normalization method for zeroinflated count data with application to microbiome sequencing data. PeerJ. 2018; 6:4600.
Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rnaseq data with deseq2. Genome Biol. 2014; 15:550.
Chen J, King E, Deek R, Wei Z, Yu Y, Grill D, Ballman K. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics. 2018; 34(4):643–51.
Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, Tickle TL, Weingart G, Ren B, Schwager EH, et al.Multivariable association discovery in populationscale metaomics studies. PLoS Comput Biol. 2021; 17(11):1009442.
Sohn MB, Du R, An L. A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics. 2015; 31(14):2269–75.
Brill B, Amir A, Heller R. Testing for differential abundance in compositional counts data, with application to microbiome studies. arXiv preprint arXiv:1904.08937. 2020.
Aitchison J. The Statistical Analysis of Compositional Data. New York: Chapman and Hall; 1986.
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of highthroughput sequencing datasets: characterizing rnaseq, 16s rrna gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014; 2:15.
Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis. 2015; 26(1):27663.
Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nat Commun. 2020; 11:3514.
Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, VázquezBaeza Y, Birmingham A, et al.Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017; 5:27.
Hawinkel S, Mattiello F, Bijnens L, Thas O. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinforma. 2019; 20(1):210–21.
Faust K, Lahti L, Gonze D, De Vos WM, Raes J. Metagenomics meets time series analysis: unraveling microbial community dynamics. Curr Opin Microbiol. 2015; 25:56–66.
Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, Bittinger K, Bailey A, Friedman ES, Hoffmann C, et al.Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric crohn’s disease. Cell Host Microbe. 2015; 18(4):489–500.
Schubert AM, Rogers MA, Ring C, Mogle J, Petrosino JP, Young VB, Aronoff DM, Schloss PD. Microbiome data distinguish patients with clostridium difficile infection and nonc. difficileassociated diarrhea from healthy controls. MBio. 2014; 5(3):01021–14.
Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, et al.Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012; 13:79.
Scher JU, Sczesnak A, Longman RS, Segata N, Ubeda C, Bielski C, Rostron T, Cerundolo V, Pamer EG, Abramson SB, et al.Expansion of intestinal prevotella copri correlates with enhanced susceptibility to arthritis. elife. 2013; 2:01202.
Charlson ES, Chen J, CustersAllen R, Bittinger K, Li H, Sinha R, Hwang J, Bushman FD, Collman RG. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS ONE. 2010; 5(12):15216.
Gonzalez A, NavasMolina JA, Kosciolek T, McDonald D, VázquezBaeza Y, Ackermann G, DeReus J, Janssen S, Swafford AD, Orchanian SB, et al.Qiita: rapid, webenabled microbiome metaanalysis. Nat Methods. 2018; 15(10):796–8.
Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. Upset: visualization of intersecting sets. IEEE Trans Vis Comput Graph. 2014; 20(12):1983–92.
Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixedeffects models using lme4. J Stat Softw. 2015; 67:1.
Carpenter J, Kenward M. Multiple Imputation and Its Application. Hoboken: John Wiley & Sons; 2012.
Quinn TP, Erb I, Richardson MF, Crowley TM. Understanding sequencing data as compositions: an outlook and review. Bioinformatics. 2018; 34(16):2870–8.
Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, Collman RG, Bushman FD, Li H. Associating microbiome composition with environmental covariates using generalized unifrac distances. Bioinformatics. 2012; 28(16):2106–13.
Chen J, Zhang X. Dmanova: fast distancebased multivariate analysis of variance for largescale microbiome association studies. Bioinformatics. 2022; 38(1):286–8.
Thorsen J, Brejnrod A, Mortensen M, Rasmussen MA, Stokholm J, AlSoud WA, Sørensen S, Bisgaard H, Waage J. Largescale benchmarking reveals false discoveries and count transformation sensitivity in 16s rrna gene amplicon data analysis methods used in microbiome studies. Microbiome. 2016; 4:62.
Zhou YH, Gallins P. A review and tutorial of machine learning methods for microbiome host trait prediction. Front Genet. 2019; 10:579.
McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen Y, et al.American gut: an open platform for citizen science microbiome research. Msystems. 2018; 3(3):00031–18.
Parzen E. On estimation of a probability density function and mode. Ann Math Stat. 1962; 33(3):1065–76.
Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc Ser B (Stat Methodol). 2004; 66(1):187–205.
Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, Keilbaugh SA, Bewtra M, Knights D, Walters WA, Knight R, et al.Linking longterm dietary patterns with gut microbial enterotypes. Science. 2011; 334(6052):105–8.
Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J. 2020; 18:2789–98.
Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of microbiome data in the presence of excess zeros. Front Microbiol. 2017; 8:2114.
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 2015; 11(5):1004226.
Connolly SR, MacNeil MA, Caley MJ, Knowlton N, Cripps E, Hisano M, Thibaut LM, Bhattacharya BD, BenedettiCecchi L, Brainard RE, et al.Commonness and rarity in the marine biosphere. Proc Natl Acad Sci. 2014; 111(23):8524–9.
Zhou H, Zhang X, Chen J. Covariate adaptive familywise error rate control for genomewide association studies. Biometrika. 2021; 108(4):915–31.
Vershynin R. Highdimensional Probability: An Introduction with Applications in Data Science, vol 47. Cambridge: Cambridge University Press; 2018.
Wainwright MJ. Highdimensional Statistics: A Nonasymptotic Viewpoint, vol 48. Cambridge: Cambridge University Press; 2019.
Cao H, Chen J, Zhang X. Optimal false discovery rate control for large scale multiple testing with auxiliary information. Annals Stat. 2022; 50(2):807–857.
Zhou H, He K, Chen J, Zhang X. LinDA: Linear Models for Differential Abundance Analysis of Microbiome Compositional Data. Github. 2022. https://github.com/zhouhj1994/LinDA.
Zhou H, He K, Chen J, Zhang X. LinDA: Linear Models for Differential Abundance Analysis of Microbiome Compositional Data. Zenodo. 2022. https://doi.org/10.5281/zenodo.6326019.2022.
Acknowledgements
Not applicable.
Peer review information
Kevin Pang was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 3.
Funding
This work was supported by National Institute of Health R01GM144351 (Chen & Zhang), National Science Foundation DMS1830392, DMS2113359, DMS1811747 (Zhang & Zhou) and National Science Foundation DMS2113360 and Mayo Clinic Center for Individualized Medicine (Chen).
Author information
Authors and Affiliations
Contributions
J.C. and X.Z. conceived, designed and supervised the work together. X.Z. developed the method. X.Z., K.H., and H.Z. performed the theoretical analysis. H.Z. and J.C. performed the evaluation and developed the software. J.C., H.Z., X.Z., and K.H wrote and revised the manuscript together. The authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 2
Supplementary figures. Fig. S1 and S2 compare the proposed method LinDA with different zerohandling approaches under settings S6C0 and S0C0. Fig. S3 depicts the results of LinDA, CLROLS and MaAsLin2 with different normalization approaches under setting S0C0. Fig. S4–S10 and S12–S14 show the results of settings S0C1, S0C2, S1C0, S2C0, S4C0, S5C0, S6C0, S7C0, S8.1C0, and S8.2C0, respectively. The comparison between disabling and enabling zero treatment of the ANCOMBC method is depicted in Fig. S11 under setting S6C0. Fig. S15 shows the results of setting S0C0 with stronger compositional effects. Fig. S16–S19 show the effect size plots and volcano plots for the four datasets (CDI, IBD, RA, and SMOKE) respectively. Fig. S20–S30 present the full result of all methods under different simulation settings.
Additional file 3
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhou, H., He, K., Chen, J. et al. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol 23, 95 (2022). https://doi.org/10.1186/s13059022026555
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059022026555
Keywords
 Compositional effect
 Differential abundance analysis
 False discovery rate
 Multiple testing