A Log-Linear Model for Inference on Bias in Microbiome Studies

Zhao, Ni; Satten, Glen A.

doi:10.1007/978-3-030-73351-3_9

Part of the book series: Frontiers in Probability and the Statistical Sciences ((FROPROSTAS))

The original version of this chapter was revised: Revised Chap. 9 has been updated. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-73351-3_14

Abstract

Microbiome sequencing data are known to be biased; the measured taxa relative abundances can be systematically distorted from their true values at every step in the experimental/analysis workflow. If this bias is not accounted for, it can lead to spurious discoveries and invalid conclusions. Unfortunately, in order to measure bias, it is necessary to have samples for which the true relative abundances are known, such as model or mock community samples. In this chapter, we propose a log-linear model for the biases observed when analyzing model communities data. Our model expands the recent work from McLaren, Willis and Callahan (MWC) [eLife, 8:e46923, 2019] that proposed a multiplicative bias structure for microbiome data. Our extension of the MWC model is general enough to allow testing of complex hypotheses and readily handles situations in which samples have a different number of bacteria present by design. An F-test with permutation-based hypothesis testing is proposed to assess statistical significance. We conduct simulations to show the validity and the power of our method and also demonstrate the utility of our method through an analysis of a complex model communities dataset that allows us to directly test the multiplicative bias assumption of the MWC model. An R package implementing the proposed work is publicly available at https://github.com/zhaoni153/MicroBias.

Change history

10 May 2022
Owing to an oversight on the part of Springer, “the Brooks data” provided in this chapter were initially published with errors. The correct presentation is given here.

References

Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math. Geol. 32, 271–275 (2000)
Brooks, J.P., Edwards, D.J., Harwich, M.D., Rivera, M.C., Fettweis, J.M., Serrano, M.G., Reris, R.A., Sheth, N.U., Huang, B., Girerd, P., Strauss, J.F., Jefferson, K.K., Buck, G.A.: The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015)
Charlson, E.S., Chen, J., Custers-Allen, R., Bittinger, K., Li, H., Sinha, R., Hwang, J., Bushman, F.D., Collman, R.G.: Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 5(12), e15216 (2010)
Costea, P.I., Zeller, G., Sunagawa, S., Pelletier, E., Alberti, A., Levenez, F., Tramontano, M., Driessen, M., Hercog, R., Jung, F.E., Kultima, J.R., Hayward, M.R., Coelho, L.P., Allen-Vercoe, E., Bertrand, L., Blaut, M., Brown, J.R.M., Carton, T., Cools-Portier, S., Daigneault, M., Derrien, M., Druesne, A., de Vos, W.M., Finlay, B.B., Flint, H.J., Guarner, F., Hattori, M., Heilig, H., Luna, R.A., van Hylckama Vlieg, J., Junick, J., Klymiuk, I., Langella, P., Le Chatelier, E., Mai, V., Manichanh, C., Martin, J.C., Mery, C., Morita, H., O’Toole, P.W., Orvain, C., Patil, K.R., Penders, J., Persson, S., Pons, N., Popova, M., Salonen, A., Saulnier, D., Scott, K.P., Singh, B., Slezak, K., Veiga, P., Versalovic, J., Zhao, L., Zoetendal, E.G., Ehrlich, S.D., Dore, J., Bork, P.: Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35(11), 1069–1076 (2017)
D’Amore, R., Ijaz, U.Z., Schirmer, M., Kenny, J.G., Gregory, R., Darby, A.C., Shakya, M., Podar, M., Quince, C., Hall, N.: A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling. BMC Genom. 17, 55 (2016)
Hugerth, L.W., Andersson, A.F.: Analysing microbial community composition through amplicon sequencing: from sampling to hypothesis testing. Front. Microbiol. 8, 1561 (2017)
Kembel, S.W., Wu, M., Eisen, J.A., Green, J.L.: Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput. Biol. 8(10), e1002743 (2012)
Ledoit, O., Wolf, M.: Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J. Empir. Finan. 10(5), 603–621 (2003) ISSN: 0927-5398
McLaren, M.R., Willis, A.D., Callahan, B.J.: Consistent and correctable bias in metagenomic sequencing experiments. eLife 8, e46923 (2019). ISSN: 2050-084X
Morgan, J.L., Darling, A.E., Eisen, J.A.: Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5(4), e10209 (2010)
Pollock, J., Glendinning, L., Wisedchanwet, T., Watson, M.: The madness of microbiome: attempting to find consensus “Best Practice” for 16S microbiome studies. Appl. Environ. Microbiol. 84(7), e02627-17 (2018)
Ross, M.G., Russ, C., Costello, M., Hollinger, A., Lennon, N.J., Hegarty, R., Nusbaum, C., Jaffe, D.B.: Characterizing and measuring bias in sequence data. Genome Biol. 14(5), R51 (2013)
Siegwald, L., Caboche, S., Even, G., Viscogliosi, E., Audebert, C., Chabé, M.: The impact of bioinformatics pipelines on microbiota studies: does the analytical “Microscope” affect the biological interpretation? Microorganisms 7(10), 393 (2019)
Sinha, R., Abu-Ali, G., Vogtmann, E., Fodor, A.A., Ren, B., Amir, A., Schwager, E., Crabtree, J., Ma, S., Abnet, C.C., Knight, R., White, O., Huttenhower, C.: Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 35(11), 1077–1086 (2017)
Tyx, R., Rivera, A., Zhao, N., Satten, G.: Comparing biases of extraction methods in mock community data (with and without a biological matrix) and in real samples (2020, in preparation)
van den Boogaart, K.G., Tolosana-Delgado, R.: “Compositions”: a unified R package to analyze compositional data. Comput. Geosci. 34, 320–338 (2008)
Wang, Y.: Solving least squares or quadratic programming problems under equality/inequality constraints (2014)
Wang, Y.: CovTools: statistical tools for covariance analysis (2019)

Acknowledgements

NZ is supported in part by the National Institutes of Health, Environmental influences on Child Health Outcomes (ECHO) Data Analysis Center (U24OD023382). GS is supported in part by the National Institutes of Health, National Institute of Environmental Health Sciences (R24ES029490) and the Office of the Director (UG3OD023318/UH3OD023318).

Author information

Authors and Affiliations

Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
Ni Zhao
Department of Gynecology and Obstetrics, Emory University, Atlanta, GA, USA
Glen A. Satten

Corresponding author

Correspondence to Glen A. Satten.

Editor information

Editors and Affiliations

Biostatistics, University of Florida, Gainesville, FL, USA
Somnath Datta
Biostatistics, University of Florida, Gainesville, FL, USA
Subharup Guha

Appendix

From Eq. (9) and the definition of Y_i⋅ after Eq. (4), we see that the variance–covariance matrix of the residuals for sample i has the form Σ_i = P_i Σ P_i. If the compositional mean, denoted by μ, is known, a simple estimator for Σ_i is obtained by first estimating Σ from the estimating equation

$$\displaystyle \begin{aligned} \widehat{V} := \sum_{i=1}^N\,(r^T_{i\cdot}-P_i\mu)(r_{i\cdot}-\mu^TP_i)=\sum_iP_i \Sigma P_i. \end{aligned} $$

(16)

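For each i in the sum, we use the vec trick, $\text{vec}(P_i \Sigma P_i) = (P_i \otimes P_i)\,\text{vec}(\Sigma)$ (which holds because P_i is symmetric), and then solve the resulting equation for Σ to obtain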

$$\displaystyle \begin{aligned} \text{vec}{(\widehat{\Sigma})} =\left(\sum_iP_i\otimes P_i\right)^{-} \text{vec} ( \widehat{V} ). \end{aligned} $$

(17)
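As an illustration of Eqs. (16) and (17), the following is a minimal NumPy sketch, not the implementation in the MicroBias R package. It rests on assumptions that are not stated in this appendix: P_i is taken to be the centering projection over the taxa present in sample i (the hypothetical helper `centering_projection`), the superscript "−" is read as a Moore-Penrose pseudoinverse, and the residuals are simulated toy values.

```python
import numpy as np

def centering_projection(present):
    # ASSUMPTION: P_i centers a vector over the taxa present in sample i
    # and is zero for absent taxa; the chapter defines P_i elsewhere.
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def estimate_sigma(residuals, projections, mu):
    """Solve sum_i P_i Sigma P_i = V_hat for Sigma via the vec trick."""
    J = len(mu)
    V_hat = np.zeros((J, J))
    K = np.zeros((J * J, J * J))
    for r, P in zip(residuals, projections):
        e = r - P @ mu            # centered residual for sample i
        V_hat += np.outer(e, e)   # left-hand side of Eq. (16)
        K += np.kron(P, P)        # vec(P S P) = (P kron P) vec(S)
    # ASSUMPTION: the generalized inverse is taken to be the pseudoinverse.
    # An explicit cutoff keeps numerically zero singular values out.
    vec_sigma = np.linalg.pinv(K, rcond=1e-10) @ V_hat.reshape(-1, order="F")
    return vec_sigma.reshape((J, J), order="F"), V_hat

# Toy data: 5 samples, 4 taxa, some taxa absent by design.
rng = np.random.default_rng(0)
presence = np.array([[1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 1, 1],
                     [1, 1, 1, 1], [1, 1, 0, 1]])
projections = [centering_projection(row) for row in presence]
residuals = np.array([P @ rng.normal(size=4) for P in projections])
sigma_hat, V_hat = estimate_sigma(residuals, projections, mu=np.zeros(4))
```

Because the system is singular, the pseudoinverse returns the minimum-norm solution; substituting `sigma_hat` back into the right-hand side of Eq. (16) reproduces `V_hat` for these toy data.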

If there is reason to believe that the precision of the data varies substantially across samples (which may occur if library sizes vary widely across samples), we may wish to weight the terms in the sums of Eq. (16) by weights ω_i proportional to the precision of the data from the ith sample. We have not pursued this, as the large library sizes in the Brooks data would seem to make it unnecessary.

In general, the centering vector μ is unknown and must be estimated. An estimate $\widehat{\mu}$ can be obtained by solving the estimating equation

$$\displaystyle \begin{aligned} \sum_ir_{i\cdot}^T=\sum_iP_i \mu \ , \end{aligned} $$

(18)

which has solution

$$\displaystyle \begin{aligned} \widehat{\mu}=\left(\sum_iP_i\right)^{-}\left(\sum_ir_{i\cdot}^T\right). \end{aligned}$$

As with the estimator of Σ, we may wish to weight the sums in Eq. (18) if there is a substantial difference in precision across samples.
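The unweighted solution of Eq. (18) can be sketched in a few lines of NumPy. As before, this is an illustration under assumptions rather than the authors' code: P_i is assumed to be the centering projection over the taxa present in sample i, the generalized inverse is read as a pseudoinverse, and the vector `mu_true` and the presence pattern are made-up toy values.

```python
import numpy as np

def centering_projection(present):
    # ASSUMPTION: P_i centers over the taxa present in sample i.
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def estimate_mu(residuals, projections):
    """Solve sum_i r_i^T = (sum_i P_i) mu with a pseudoinverse."""
    P_sum = np.sum(projections, axis=0)
    r_sum = np.sum(residuals, axis=0)
    return np.linalg.pinv(P_sum, rcond=1e-10) @ r_sum

# Toy check: noise-free residuals generated from a known, centered mu.
presence = np.array([[1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1]])
projections = [centering_projection(row) for row in presence]
mu_true = np.array([0.5, -0.2, 0.1, -0.4])   # sums to zero
residuals = np.array([P @ mu_true for P in projections])
mu_hat = estimate_mu(residuals, projections)
```

Under the assumed form of P_i, the sum of the projections annihilates constant vectors, so μ is identified only up to an additive constant; the pseudoinverse picks the solution whose entries sum to zero.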

Estimation of the compositional mean μ and variance Σ was considered by van den Boogaart and Tolosana-Delgado [16]. Here, we use the same estimator $\widehat{\mu}$ as [16], but use the novel estimator $\widehat{\Sigma}$ shown here because the estimator derived in [16] is more complex and slower to compute. We typically find $\widehat{\mu}=0$, except in cases where the null model does not allow for a separate intercept for each feature (taxon).

If the number of taxa is large, a shrinkage estimator of $\widehat{V}$ can be used, which in turn implies a shrinkage estimator of $\widehat{\Sigma}$ via Eq. (17). One possible approach is the empirical Bayes shrinkage proposed by Ledoit and Wolf [8], which is implemented in the R package “CovTools” [18]. In this approach, Σ is estimated by $\delta \widehat{\Sigma} + (1 - \delta) T$, where $\widehat{\Sigma}$ is the estimated variance–covariance matrix (e.g., as in Eq. (17)) and T is a pre-defined target matrix. When the residuals are full rank, the target matrix is usually taken to be the identity matrix or a diagonal matrix with positive diagonal elements. In the current context, a reasonable target matrix is $\hat{\sigma}^2 \sum_i P_i / n$, in which $\frac{1}{n} \sum_i P_i$ is the average of the compositional projection operators and $\hat{\sigma}^2$ is the usual variance estimated from $|r^T_{i\cdot} - P_i \widehat{\mu}|$.
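The convex combination with the projection-based target can be sketched as follows. This is a hedged illustration only: it does not call CovTools and does not compute the Ledoit-Wolf shrinkage intensity, so δ is supplied by hand. The pooled variance used for $\hat{\sigma}^2$ (total squared residual divided by the total rank of the P_i) is an assumption about what "the usual variance" means, the form of P_i is assumed as a centering projection, and `sigma_raw` is a simple stand-in for the output of Eq. (17).

```python
import numpy as np

def centering_projection(present):
    # ASSUMPTION: P_i centers over the taxa present in sample i.
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def shrink_sigma(sigma_hat, residuals, projections, mu_hat, delta):
    """Return delta * sigma_hat + (1 - delta) * T and the target T."""
    n = len(projections)
    centered = [r - P @ mu_hat for r, P in zip(residuals, projections)]
    # ASSUMPTION: pooled variance = total squared residual / total rank,
    # where trace(P_i) equals the rank of the projection P_i.
    dof = sum(np.trace(P) for P in projections)
    sigma2 = sum(float(e @ e) for e in centered) / dof
    target = sigma2 * np.sum(projections, axis=0) / n
    return delta * sigma_hat + (1.0 - delta) * target, target

rng = np.random.default_rng(1)
presence = np.array([[1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1]])
projections = [centering_projection(row) for row in presence]
residuals = np.array([P @ rng.normal(size=4) for P in projections])
mu_hat = np.zeros(4)
# Stand-in for the Eq. (17) estimate, used only to exercise the function.
sigma_raw = residuals.T @ residuals / len(residuals)
sigma_shrunk, target = shrink_sigma(sigma_raw, residuals, projections,
                                    mu_hat, delta=0.7)
```

Unlike an identity target, this target shares the null space of the compositional projections (it maps constant vectors to zero), which is the motivation given above for using it when the residuals are not full rank.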

Finally, we note that since the decorrelated residuals are given by $\widehat{\Sigma}_i^{-\frac{1}{2}} r^T_{i\cdot}$, their variance–covariance matrix is $(P_i \widehat{\Sigma} P_i)^{-\frac{1}{2}} (P_i \widehat{\Sigma} P_i) (P_i \widehat{\Sigma} P_i)^{-\frac{1}{2}}$ (under the assumption that Σ is well estimated). Using the SVD to express $P_i \widehat{\Sigma} P_i$, it is easy to see that this variance–covariance matrix is just the projection operator onto the range (column or row space) of $P_i \widehat{\Sigma} P_i$. By assumption, we take the range of $\widehat{\Sigma}$ to contain the range of P_i; thus, this projection operator is in fact P_i itself.
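The projection argument above can be checked numerically. The sketch below is an illustration under assumptions: P_i is again assumed to be the centering projection over the taxa present in sample i, the inverse square root is computed as a pseudoinverse square root, and a symmetric eigendecomposition is used in place of the SVD (the two coincide for a symmetric positive semidefinite matrix). The matrix `Sigma` is an arbitrary full-rank toy example.

```python
import numpy as np

def centering_projection(present):
    # ASSUMPTION: P_i centers over the taxa present in sample i.
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def pinv_sqrt(S, tol=1e-10):
    """Pseudoinverse square root of a symmetric PSD matrix S."""
    w, U = np.linalg.eigh(S)
    keep = w > tol * w.max()              # drop the numerical null space
    U_keep = U[:, keep]
    return (U_keep / np.sqrt(w[keep])) @ U_keep.T

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + np.eye(4)               # full-rank toy covariance
P = centering_projection([1, 1, 0, 1])    # third taxon absent from the sample
Sigma_i = P @ Sigma @ P

W = pinv_sqrt(Sigma_i)
decorrelated_cov = W @ Sigma_i @ W        # covariance of decorrelated residuals
```

For this toy example `decorrelated_cov` agrees with `P` up to floating-point error, as the argument predicts when the range of Σ contains the range of P_i.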

About this chapter

Cite this chapter

Zhao, N., Satten, G.A. (2021). A Log-Linear Model for Inference on Bias in Microbiome Studies. In: Datta, S., Guha, S. (eds) Statistical Analysis of Microbiome Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-73351-3_9

DOI: https://doi.org/10.1007/978-3-030-73351-3_9
Published: 24 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73350-6
Online ISBN: 978-3-030-73351-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
