A Log-Linear Model for Inference on Bias in Microbiome Studies

Statistical Analysis of Microbiome Data

Part of the book series: Frontiers in Probability and the Statistical Sciences (FROPROSTAS)

Abstract

Microbiome sequencing data are known to be biased; the measured taxa relative abundances can be systematically distorted from their true values at every step in the experimental/analysis workflow. If this bias is not accounted for, it can lead to spurious discoveries and invalid conclusions. Unfortunately, in order to measure bias, it is necessary to have samples for which the true relative abundances are known, such as model or mock community samples. In this chapter, we propose a log-linear model for the biases observed when analyzing model communities data. Our model expands the recent work from McLaren, Willis and Callahan (MWC) [eLife, 8:e46923, 2019] that proposed a multiplicative bias structure for microbiome data. Our extension of the MWC model is general enough to allow testing of complex hypotheses and readily handles situations in which samples have a different number of bacteria present by design. An F-test with permutation-based hypothesis testing is proposed to assess statistical significance. We conduct simulations to show the validity and the power of our method and also demonstrate the utility of our method through an analysis of a complex model communities dataset that allows us to directly test the multiplicative bias assumption of the MWC model. An R package implementing the proposed work is publicly available at https://github.com/zhaoni153/MicroBias.

Change history

  • 10 May 2022

    Owing to an oversight on the part of Springer, “the Brooks data” provided in this chapter were initially published with errors. The correct presentation is given here.

References

  1. Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math. Geol. 32, 271–275 (2000)

  2. Brooks, J.P., Edwards, D.J., Harwich, M.D., Rivera, M.C., Fettweis, J.M., Serrano, M.G., Reris, R.A., Sheth, N.U., Huang, B., Girerd, P., Strauss, J.F., Jefferson, K.K., Buck, G.A.: The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015)

  3. Charlson, E.S., Chen, J., Custers-Allen, R., Bittinger, K., Li, H., Sinha, R., Hwang, J., Bushman, F.D., Collman, R.G.: Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 5(12), e15216 (2010)

  4. Costea, P.I., Zeller, G., Sunagawa, S., Pelletier, E., Alberti, A., Levenez, F., Tramontano, M., Driessen, M., Hercog, R., Jung, F.E., Kultima, J.R., Hayward, M.R., Coelho, L.P., Allen-Vercoe, E., Bertrand, L., Blaut, M., Brown, J.R.M., Carton, T., Cools-Portier, S., Daigneault, M., Derrien, M., Druesne, A., de Vos, W.M., Finlay, B.B., Flint, H.J., Guarner, F., Hattori, M., Heilig, H., Luna, R.A., van Hylckama Vlieg, J., Junick, J., Klymiuk, I., Langella, P., Le Chatelier, E., Mai, V., Manichanh, C., Martin, J.C., Mery, C., Morita, H., O’Toole, P.W., Orvain, C., Patil, K.R., Penders, J., Persson, S., Pons, N., Popova, M., Salonen, A., Saulnier, D., Scott, K.P., Singh, B., Slezak, K., Veiga, P., Versalovic, J., Zhao, L., Zoetendal, E.G., Ehrlich, S.D., Dore, J., Bork, P.: Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35(11), 1069–1076 (2017)

  5. D’Amore, R., Ijaz, U.Z., Schirmer, M., Kenny, J.G., Gregory, R., Darby, A.C., Shakya, M., Podar, M., Quince, C., Hall, N.: A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling. BMC Genom. 17, 55 (2016)

  6. Hugerth, L.W., Andersson, A.F.: Analysing microbial community composition through amplicon sequencing: from sampling to hypothesis testing. Front. Microbiol. 8, 1561 (2017)

  7. Kembel, S.W., Wu, M., Eisen, J.A., Green, J.L.: Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput. Biol. 8(10), e1002743 (2012)

  8. Ledoit, O., Wolf, M.: Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J. Empir. Finan. 10(5), 603–621 (2003)

  9. McLaren, M.R., Willis, A.D., Callahan, B.J.: Consistent and correctable bias in metagenomic sequencing experiments. eLife 8, e46923 (2019)

  10. Morgan, J.L., Darling, A.E., Eisen, J.A.: Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5(4), e10209 (2010)

  11. Pollock, J., Glendinning, L., Wisedchanwet, T., Watson, M.: The madness of microbiome: attempting to find consensus “Best Practice” for 16S microbiome studies. Appl. Environ. Microbiol. 84(7), e02627-17 (2018)

  12. Ross, M.G., Russ, C., Costello, M., Hollinger, A., Lennon, N.J., Hegarty, R., Nusbaum, C., Jaffe, D.B.: Characterizing and measuring bias in sequence data. Genome Biol. 14(5), R51 (2013)

  13. Siegwald, L., Caboche, S., Even, G., Viscogliosi, E., Audebert, C., Chabé, M.: The impact of bioinformatics pipelines on microbiota studies: does the analytical “Microscope” affect the biological interpretation? Microorganisms 7(10), 393 (2019)

  14. Sinha, R., Abu-Ali, G., Vogtmann, E., Fodor, A.A., Ren, B., Amir, A., Schwager, E., Crabtree, J., Ma, S., Abnet, C.C., Knight, R., White, O., Huttenhower, C.: Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 35(11), 1077–1086 (2017)

  15. Tyx, R., Rivera, A., Zhao, N., Satten, G.: Comparing biases of extraction methods in mock community data (with and without a biological matrix) and in real samples (2020, in preparation)

  16. van den Boogaart, K.G., Tolosana-Delgado, R.: “Compositions”: a unified R package to analyze compositional data. Comput. Geosci. 34, 320–338 (2008)

  17. Wang, Y.: Solving least squares or quadratic programming problems under equality/inequality constraints (2014)

  18. Wang, Y.: CovTools: statistical tools for covariance analysis (2019)

Acknowledgements

NZ is supported in part by the National Institutes of Health, Environmental influences on Child Health Outcomes (ECHO) Data Analysis Center (U24OD023382). GS is supported in part by the National Institutes of Health, National Institute of Environmental Health Sciences (R24ES029490) and the Office of the Director (UG3OD023318/UH3OD023318).

Corresponding author

Correspondence to Glen A. Satten.

Appendix

From Eq. (9) and the definition of \(Y_i\) after Eq. (4), we see that the variance–covariance matrix of the residuals for sample i has the form \(\Sigma_i = P_i \Sigma P_i\). If the compositional mean, denoted by \(\mu\), is known, a simple estimator for \(\Sigma_i\) is obtained by first estimating \(\Sigma\) from the estimating equation

$$\displaystyle \begin{aligned} \widehat{V} := \sum_{i=1}^N\,(r^T_{i\cdot}-P_i\mu)(r_{i\cdot}-\mu^TP_i)=\sum_iP_i \Sigma P_i. \end{aligned} $$
(16)

For each i in the sum, we use the vec trick and then solve the resulting equation for Σ to obtain

$$\displaystyle \begin{aligned} \text{vec}{(\widehat{\Sigma})} =\left(\sum_iP_i\otimes P_i\right)^{-} \text{vec} ( \widehat{V} ). \end{aligned} $$
(17)
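
As a brief reminder of the step behind Eq. (17), the following uses only the standard identity \(\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)\), where vec stacks the columns of a matrix. It assumes that each \(P_i\) is symmetric, as the form \(\Sigma_i = P_i \Sigma P_i\) suggests. Applying the identity to each term on the right-hand side of Eq. (16) gives

$$\displaystyle \begin{aligned} \text{vec}(\widehat{V}) = \sum_i \text{vec}(P_i \Sigma P_i) = \sum_i (P_i^T \otimes P_i)\,\text{vec}(\Sigma) = \left(\sum_i P_i \otimes P_i\right)\text{vec}(\Sigma). \end{aligned} $$

The matrix \(\sum_i P_i \otimes P_i\) is singular whenever the \(P_i\) share a null vector, which is why Eq. (17) is written with a generalized inverse rather than an ordinary inverse.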

If there is a reason to believe that there is substantial variation in the precision of the data across samples (which may occur if the variation in library sizes across samples is large enough), we may wish to weight the terms in the sums of Eq. (16) by weights \(\omega_i\) that are proportional to the precision of the data from the ith sample. We have not considered this because the large library sizes in the Brooks data would seem to make it unnecessary.
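
The unweighted estimator of Eqs. (16) and (17) can be sketched numerically as follows. This is a minimal illustration in Python with NumPy, not the code of the MicroBias package. Two things are assumptions made for the sketch: the form of \(P_i\) (centering over the taxa present in sample i, zero elsewhere) and the use of the Moore-Penrose pseudo-inverse as the generalized inverse. The function names are hypothetical.

```python
import numpy as np

def centering_projection(present):
    """ASSUMED form of P_i: centre over the taxa present in sample i, zero elsewhere."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def estimate_sigma(residuals, projections, mu):
    """Solve Eq. (16) for Sigma via the vec form of Eq. (17).

    residuals   : list of length-J arrays, one per sample (r_i as a vector)
    projections : list of J x J matrices P_i
    mu          : length-J centering vector, treated as known
    Returns (Sigma_hat, V_hat).
    """
    mu = np.asarray(mu, dtype=float)
    J = mu.shape[0]
    V_hat = np.zeros((J, J))
    K = np.zeros((J * J, J * J))          # accumulates sum_i P_i (x) P_i
    for r, P in zip(residuals, projections):
        e = np.asarray(r, dtype=float) - P @ mu
        V_hat += np.outer(e, e)
        K += np.kron(P, P)
    # Column-stacking vec; pinv is one choice of generalized inverse (assumption).
    vec_sigma = np.linalg.pinv(K) @ V_hat.reshape(-1, order="F")
    Sigma_hat = vec_sigma.reshape((J, J), order="F")
    return (Sigma_hat + Sigma_hat.T) / 2, V_hat   # symmetrise against round-off
```

The Kronecker matrix has size \(J^2 \times J^2\) for J taxa, so this direct approach is only practical for a moderate number of taxa.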

In general, the centering vector μ is unknown and needs to be estimated. \(\widehat \mu \) can be obtained by solving the estimating equation

$$\displaystyle \begin{aligned} \sum_ir_{i\cdot}^T=\sum_iP_i \mu \ , \end{aligned} $$
(18)

which has solution

$$\displaystyle \begin{aligned} \widehat{\mu}=\left(\sum_iP_i\right)^{-}\left(\sum_ir_{i\cdot}^T\right). \end{aligned}$$

As with the estimator of Σ, we may wish to weight the sums in Eq. (18) if there is a substantial difference in precision across samples.
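
A small numerical sketch of the solution of Eq. (18) is given here, again in Python with NumPy. The form of \(P_i\), the use of the Moore-Penrose pseudo-inverse as the generalized inverse, and the optional `weights` argument (standing in for precision weights) are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def centering_projection(present):
    """ASSUMED form of P_i: centre over the taxa present in sample i, zero elsewhere."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def estimate_mu(residuals, projections, weights=None):
    """Solve sum_i w_i r_i = (sum_i w_i P_i) mu for mu with a pseudo-inverse.

    With weights=None every sample has weight 1, which is Eq. (18) as written.
    """
    n = len(residuals)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    P_sum = sum(wi * P for wi, P in zip(w, projections))
    r_sum = sum(wi * np.asarray(r, dtype=float) for wi, r in zip(w, residuals))
    return np.linalg.pinv(P_sum) @ r_sum
```

Because the summed projections annihilate constant vectors under the assumed form of \(P_i\), the pseudo-inverse returns the solution whose entries sum to zero.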

Estimation of the compositional mean \(\widehat {\mu }\) and variance \(\widehat {\Sigma }\) was considered by van den Boogaart and Tolosana-Delgado [16]. Here, we use the same estimator \(\widehat {\mu }\) as [16], but use the novel estimator \(\widehat {\Sigma }\) shown here because the estimator derived in [16] is more complex and slower to compute. We typically find \(\widehat {\mu }=0\), except in cases where the null model does not allow for a separate intercept for each feature (taxon).

If the number of taxa is large, a shrinkage estimator of \(\widehat {V}\) can be used. This will in turn imply a shrinkage estimator of \(\widehat {\Sigma }\) via Eq. (17). One possible shrinkage approach is the empirical Bayes shrinkage proposed by Ledoit and Wolf [8], which is implemented in the R package “CovTools” [18]. In this approach, \(\Sigma\) is estimated using \(\delta \widehat \Sigma + (1 - \delta ) T\), where \(\widehat \Sigma \) is the estimated variance–covariance matrix (e.g., as estimated in Eq. (17)) and T is a pre-defined target matrix. In situations when the residuals are full rank, the target matrix is usually taken as the identity matrix or a diagonal matrix with positive diagonal elements. In the current context, a reasonable target matrix is \(\hat \sigma ^2 \sum _i P_i/ n\), in which \(\frac {1}{n} \sum _i P_i\) is the average of the compositional projection operators, and \(\hat \sigma ^2\) is the usual variance estimated from \(|r^T_{i\cdot }-P_i \widehat \mu |\).
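
The convex combination with the projection-based target can be sketched as below. This is an illustration only, and it does not reproduce the Ledoit and Wolf procedure or the CovTools implementation: the shrinkage weight `delta` is taken as a user-supplied input rather than estimated from the data, the pooled variance divides the residual sum of squares by \(\sum_i \text{trace}(P_i)\) as one plausible reading of "the usual variance", and the form of \(P_i\) and all function names are assumptions.

```python
import numpy as np

def centering_projection(present):
    """ASSUMED form of P_i: centre over the taxa present in sample i, zero elsewhere."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def pooled_variance(residuals, projections, mu):
    """ASSUMED reading of sigma^2: residual sum of squares over sum_i trace(P_i)."""
    mu = np.asarray(mu, dtype=float)
    ss, df = 0.0, 0.0
    for r, P in zip(residuals, projections):
        e = np.asarray(r, dtype=float) - P @ mu
        ss += float(e @ e)
        df += float(np.trace(P))          # rank of P_i for a projection
    return ss / df

def shrink_sigma(Sigma_hat, projections, sigma2, delta):
    """Return delta * Sigma_hat + (1 - delta) * T with T = sigma2 * mean_i(P_i).

    delta is supplied by the caller here (assumption); it is not estimated.
    """
    T = sigma2 * sum(projections) / len(projections)
    return delta * np.asarray(Sigma_hat, dtype=float) + (1.0 - delta) * T
```

Setting `delta=1` returns the unshrunk estimate and `delta=0` returns the target, which gives a quick sanity check on the direction of the weighting.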

Finally, we note that since the decorrelated residuals are given by \(\widehat {\Sigma }_i^{- \frac {1}{2}} r^T_i\), their variance–covariance matrix is \((P_i \widehat {\Sigma } P_i)^{- \frac {1}{2}} (P_i \widehat {\Sigma } P_i) (P_i\widehat {\Sigma } P_i)^{- \frac {1}{2}}\) (under the assumption that \(\Sigma\) is well estimated). Using the SVD to express \(P_i \widehat {\Sigma } P_i\), it is easy to see this variance–covariance matrix is just the projection operator into the range (column or row space) of \(P_i \widehat {\Sigma } P_i\). By assumption, we take the range of \(\widehat {\Sigma }\) to contain the range of \(P_i\); thus, this projection operator is in fact \(P_i\) itself.
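
This last claim is easy to check numerically. The sketch below uses an eigendecomposition, which coincides with the SVD for a symmetric positive semi-definite matrix, to form the pseudo-inverse square root of \(P_i \Sigma P_i\). The form of \(P_i\), the tolerance used to discard zero eigenvalues, and the function names are assumptions made for illustration.

```python
import numpy as np

def centering_projection(present):
    """ASSUMED form of P_i: centre over the taxa present in sample i, zero elsewhere."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def pinv_sqrt(A, tol=1e-10):
    """Pseudo-inverse square root of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh((A + A.T) / 2)
    keep = vals > tol * vals.max()        # drop directions outside the range of A
    U = vecs[:, keep]
    return (U / np.sqrt(vals[keep])) @ U.T

def whitened_covariance(Sigma, P):
    """Covariance of the decorrelated residuals when Sigma_i = P Sigma P."""
    S_i = P @ Sigma @ P
    W = pinv_sqrt(S_i)
    return W @ S_i @ W
```

For a positive definite `Sigma`, whose range trivially contains that of \(P_i\), `whitened_covariance(Sigma, P)` should agree with `P` up to round-off.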

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this chapter

Zhao, N., Satten, G.A. (2021). A Log-Linear Model for Inference on Bias in Microbiome Studies. In: Datta, S., Guha, S. (eds) Statistical Analysis of Microbiome Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-73351-3_9
