Abstract
Microbiome sequencing data are known to be biased; the measured relative abundances of taxa can be systematically distorted from their true values at every step in the experimental/analysis workflow. If this bias is not accounted for, it can lead to spurious discoveries and invalid conclusions. Unfortunately, in order to measure bias, it is necessary to have samples for which the true relative abundances are known, such as model or mock community samples. In this chapter, we propose a log-linear model for the biases observed when analyzing data from model communities. Our model extends the recent work of McLaren, Willis and Callahan (MWC) [eLife, 8:e46923, 2019], who proposed a multiplicative bias structure for microbiome data. Our extension of the MWC model is general enough to allow testing of complex hypotheses and readily handles situations in which samples have different numbers of bacteria present by design. An F-test with permutation-based hypothesis testing is proposed to assess statistical significance. We conduct simulations to show the validity and power of our method, and we demonstrate its utility through an analysis of a complex model communities dataset that allows us to directly test the multiplicative bias assumption of the MWC model. An R package implementing the proposed work is publicly available at https://github.com/zhaoni153/MicroBias.
Change history
10 May 2022
Owing to an oversight on the part of Springer, “the Brooks data” provided in this chapter were initially published with errors. The correct presentation is given here.
References
Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math. Geol. 32, 271–275 (2000)
Brooks, J.P., Edwards, D.J., Harwich, M.D., Rivera, M.C., Fettweis, J.M., Serrano, M.G., Reris, R.A., Sheth, N.U., Huang, B., Girerd, P., Strauss, J.F., Jefferson, K.K., Buck, G.A.: The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015)
Charlson, E.S., Chen, J., Custers-Allen, R., Bittinger, K., Li, H., Sinha, R., Hwang, J., Bushman, F.D., Collman, R.G.: Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 5(12), e15216 (2010)
Costea, P.I., Zeller, G., Sunagawa, S., Pelletier, E., Alberti, A., Levenez, F., Tramontano, M., Driessen, M., Hercog, R., Jung, F.E., Kultima, J.R., Hayward, M.R., Coelho, L.P., Allen-Vercoe, E., Bertrand, L., Blaut, M., Brown, J.R.M., Carton, T., Cools-Portier, S., Daigneault, M., Derrien, M., Druesne, A., de Vos, W.M., Finlay, B.B., Flint, H.J., Guarner, F., Hattori, M., Heilig, H., Luna, R.A., van Hylckama Vlieg, J., Junick, J., Klymiuk, I., Langella, P., Le Chatelier, E., Mai, V., Manichanh, C., Martin, J.C., Mery, C., Morita, H., O’Toole, P.W., Orvain, C., Patil, K.R., Penders, J., Persson, S., Pons, N., Popova, M., Salonen, A., Saulnier, D., Scott, K.P., Singh, B., Slezak, K., Veiga, P., Versalovic, J., Zhao, L., Zoetendal, E.G., Ehrlich, S.D., Dore, J., Bork, P.: Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35(11), 1069–1076 (2017)
D’Amore, R., Ijaz, U.Z., Schirmer, M., Kenny, J.G., Gregory, R., Darby, A.C., Shakya, M., Podar, M., Quince, C., Hall, N.: A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling. BMC Genom. 17, 55 (2016)
Hugerth, L.W., Andersson, A.F.: Analysing microbial community composition through amplicon sequencing: from sampling to hypothesis testing. Front. Microbiol. 8, 1561 (2017)
Kembel, S.W., Wu, M., Eisen, J.A., Green, J.L.: Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput. Biol. 8(10), e1002743 (2012)
Ledoit, O., Wolf, M.: Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J. Empir. Finan. 10(5), 603–621 (2003)
McLaren, M.R., Willis, A.D., Callahan, B.J.: Consistent and correctable bias in metagenomic sequencing experiments. eLife 8, e46923 (2019)
Morgan, J.L., Darling, A.E., Eisen, J.A.: Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5(4), e10209 (2010)
Pollock, J., Glendinning, L., Wisedchanwet, T., Watson, M.: The madness of microbiome: attempting to find consensus “Best Practice” for 16S microbiome studies. Appl. Environ. Microbiol. 84(7), e02627-17 (2018)
Ross, M.G., Russ, C., Costello, M., Hollinger, A., Lennon, N.J., Hegarty, R., Nusbaum, C., Jaffe, D.B.: Characterizing and measuring bias in sequence data. Genome Biol. 14(5), R51 (2013)
Siegwald, L., Caboche, S., Even, G., Viscogliosi, E., Audebert, C., Chabé, M.: The impact of bioinformatics pipelines on microbiota studies: does the analytical “Microscope” affect the biological interpretation? Microorganisms 7(10), 393 (2019)
Sinha, R., Abu-Ali, G., Vogtmann, E., Fodor, A.A., Ren, B., Amir, A., Schwager, E., Crabtree, J., Ma, S., Abnet, C.C., Knight, R., White, O., Huttenhower, C.: Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 35(11), 1077–1086 (2017)
Tyx, R., Rivera, A., Zhao, N., Satten, G.: Comparing biases of extraction methods in mock community data (with and without a biological matrix) and in real samples (2020, in preparation)
van den Boogaart, K.G., Tolosana-Delgado, R.: “Compositions”: a unified R package to analyze compositional data. Comput. Geosci. 34, 320–338 (2008)
Wang, Y.: Solving least squares or quadratic programming problems under equality/inequality constraints (2014)
Wang, Y.: CovTools: statistical tools for covariance analysis (2019)
Acknowledgements
NZ is supported in part by the National Institutes of Health, Environmental influences on Child Health Outcomes (ECHO) Data Analysis Center (U24OD023382). GS is supported in part by the National Institutes of Health, National Institute of Environmental Health Sciences (R24ES029490) and the Office of the Director (UG3OD023318/UH3OD023318).
Appendix
From Eq. (9) and the definition of \(Y_{i\cdot}\) after Eq. (4), we see that the variance–covariance matrix of the residuals for sample i has the form \(\Sigma_i = P_i \Sigma P_i\). If the compositional mean, denoted by μ, is known, a simple estimator for \(\Sigma_i\) is obtained by first estimating Σ by solving the estimating equation
For each i in the sum, we use the vec trick and then solve the resulting equation for Σ to obtain
If there is reason to believe that there is substantial variation in the precision of the data across samples (which may occur if the variation in library sizes across samples is large enough), we may wish to weight the terms in the sums of Eq. (16) by weights \(\omega_i\) that are proportional to the precision of the data from the ith sample. We have not considered this, as the large library sizes in the Brooks data would seem to make this unnecessary.
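The vec trick used above can be sketched numerically. The sketch below is an illustration under stated assumptions, not the chapter's exact estimator: it assumes each \(P_i\) is the centering projection over the taxa present in sample i (the helper `centering_projection` is hypothetical), assumes the estimating equation equates the summed outer products of centered residuals with the summed \(P_i \Sigma P_i\), and uses unweighted sums. It relies on the identity \(\mathrm{vec}(P \Sigma P) = (P \otimes P)\,\mathrm{vec}(\Sigma)\) for symmetric P.

```python
import numpy as np

def centering_projection(present):
    """Assumed form of P_i: center over taxa flagged present; absent taxa get zero rows/columns."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def solve_sigma(moment_matrices, projections):
    """Least-squares solution of sum_i (S_i - P_i Sigma P_i) = 0 for Sigma."""
    J = projections[0].shape[0]
    A = np.zeros((J * J, J * J))
    b = np.zeros(J * J)
    for S, P in zip(moment_matrices, projections):
        A += np.kron(P, P)          # vec(P Sigma P) = (P kron P) vec(Sigma) for symmetric P
        b += S.flatten(order="F")   # column-stacking vec of the moment matrix
    # A is singular, so take the minimum-norm least-squares solution.
    vec_sigma, *_ = np.linalg.lstsq(A, b, rcond=1e-10)
    return vec_sigma.reshape((J, J), order="F")

rng = np.random.default_rng(0)
presence = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
projections = [centering_projection(p) for p in presence]
B = rng.normal(size=(4, 4))
Sigma_true = B @ B.T
mu = np.zeros(4)  # known compositional mean (taken to be zero here)

# Noise-free check: feed the exact moments P_i Sigma P_i into the solver.
exact_moments = [P @ Sigma_true @ P for P in projections]
Sigma_exact = solve_sigma(exact_moments, projections)

# Simulated residuals with covariance P_i Sigma P_i, 500 samples per presence pattern.
samples = [(P, P @ rng.multivariate_normal(mu, Sigma_true))
           for P in projections for _ in range(500)]
moments = [np.outer(r - P @ mu, r - P @ mu) for P, r in samples]
Sigma_hat = solve_sigma(moments, [P for P, _ in samples])
```

Only the projected matrices \(P_i \Sigma P_i\) are determined by the data, so `Sigma_exact` need not equal `Sigma_true` entry by entry, but `P @ Sigma_exact @ P` matches `P @ Sigma_true @ P` for every projection in the list.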
In general, the centering vector μ is unknown and needs to be estimated. \(\widehat \mu \) can be obtained by solving the estimating equation
which has solution
As with the estimator of Σ, we may wish to weight the sums in Eq. (18) if there is a substantial difference in precision across samples.
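A minimal numerical sketch of estimating the centering vector follows. It assumes the estimating equation has the least-squares form \(\sum_i P_i (r^T_{i\cdot} - P_i \mu) = 0\), which gives \(\widehat\mu = (\sum_i P_i)^{-} \sum_i P_i r^T_{i\cdot}\) with a generalized inverse; this form, the unweighted sums, and the helper `centering_projection` are assumptions made for illustration.

```python
import numpy as np

def centering_projection(present):
    """Assumed form of P_i: center over taxa flagged present; absent taxa get zero rows/columns."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def estimate_mu(residuals, projections):
    """Minimum-norm solution of sum_i P_i (r_i - P_i mu) = 0."""
    P_sum = sum(projections)  # uses P_i P_i = P_i (idempotent projections)
    rhs = sum(P @ r for P, r in zip(projections, residuals))
    # sum_i P_i annihilates the constant vector, so use a least-squares solve.
    mu_hat, *_ = np.linalg.lstsq(P_sum, rhs, rcond=1e-10)
    return mu_hat

presence = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
projections = [centering_projection(p) for p in presence]

# Noise-free residuals generated from a known, sum-to-zero centering vector.
mu_true = np.array([0.5, -0.2, -0.4, 0.1])
residuals = [P @ mu_true for P in projections]
mu_hat = estimate_mu(residuals, projections)
```

Because every \(P_i\) of this assumed form maps the constant vector to zero, μ is identified only up to an additive constant; the minimum-norm solution picks the version whose entries sum to zero.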
Estimation of the compositional mean \(\widehat {\mu }\) and variance \(\widehat {\Sigma }\) was considered by van den Boogaart and Tolosana-Delgado [16]. Here, we use the same estimator \(\widehat {\mu }\) as [16], but use the novel estimator for \(\widehat {\Sigma }\) shown here because the estimator derived in [16] is more complex and slower to compute. We typically find \(\widehat {\mu }=0\), except in cases where the null model does not allow for a separate intercept for each feature (taxon).
If the number of taxa is large, a shrinkage estimator of \(\widehat {V}\) can be used. This will in turn imply a shrinkage estimator of \(\widehat {\Sigma }\) via Eq. (17). One possible shrinkage approach is the empirical Bayes shrinkage proposed by Ledoit and Wolf [8], which is implemented in the R package “CovTools” [18]. In this approach, Σ is estimated using \(\delta \widehat \Sigma + (1 - \delta ) T\), where \(\widehat \Sigma \) is the estimated variance–covariance matrix (e.g., as in Eq. (17)) and T is a pre-defined target matrix. In situations when the residuals are full rank, the target matrix is usually taken as the identity matrix or a diagonal matrix with positive diagonal elements. In the current context, a reasonable target matrix is \(\hat \sigma ^2 \sum _i P_i/ n\), in which \(\frac {1}{n} \sum _i P_i\) is the average of the compositional projection operators, and \(\hat \sigma ^2\) is the usual variance estimated from \(|r^T_{i\cdot }-P_i \widehat \mu |\).
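The shrinkage combination \(\delta \widehat \Sigma + (1 - \delta ) T\) with the target \(\hat \sigma ^2 \sum _i P_i/ n\) can be sketched as follows. Several choices here are assumptions: δ is supplied by the user rather than estimated as in Ledoit and Wolf, \(\hat \sigma ^2\) is computed as the pooled sum of squared centered residuals divided by \(\sum_i \mathrm{tr}(P_i)\), and `centering_projection` is a hypothetical helper for the form of \(P_i\).

```python
import numpy as np

def centering_projection(present):
    """Assumed form of P_i: center over taxa flagged present; absent taxa get zero rows/columns."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def shrinkage_target(residuals, projections, mu_hat):
    """Target T = sigma2 * mean(P_i), with sigma2 pooled over all centered residuals."""
    centered = [r - P @ mu_hat for r, P in zip(residuals, projections)]
    total_ss = sum(float(e @ e) for e in centered)
    total_df = sum(float(np.trace(P)) for P in projections)  # trace(P_i) = rank of P_i
    sigma2 = total_ss / total_df
    return sigma2 * sum(projections) / len(projections)

def shrink(sigma_hat, target, delta):
    """Convex combination delta * sigma_hat + (1 - delta) * target, with 0 <= delta <= 1."""
    return delta * sigma_hat + (1.0 - delta) * target

rng = np.random.default_rng(1)
presence = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
projections = [centering_projection(p) for p in presence]
residuals = [P @ rng.normal(size=4) for P in projections]
mu_hat = np.zeros(4)

sigma_hat = sum(np.outer(r, r) for r in residuals) / len(residuals)  # crude stand-in for Eq. (17)
target = shrinkage_target(residuals, projections, mu_hat)
sigma_shrunk = shrink(sigma_hat, target, delta=0.7)
```

Setting `delta=1.0` returns the unshrunk estimate and `delta=0.0` returns the target alone; the target is positive semidefinite because it is a nonnegative multiple of an average of projections.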
Finally, we note that since the decorrelated residuals are given by \(\widehat {\Sigma }_i^{- \frac {1}{2}} r^T_{i\cdot }\), their variance–covariance matrix is \((P_i \widehat {\Sigma } P_i)^{- \frac {1}{2}} (P_i \widehat {\Sigma } P_i) (P_i\widehat {\Sigma } P_i)^{- \frac {1}{2}}\) (under the assumption that Σ is well estimated). Using the SVD to express \(P_i \widehat {\Sigma } P_i\), it is easy to see that this variance–covariance matrix is just the projection operator onto the range (column or row space) of \(P_i \widehat {\Sigma } P_i\). By assumption, we take the range of \(\widehat {\Sigma }\) to contain the range of \(P_i\); thus, this projection operator is in fact \(P_i\) itself.
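This projection property is easy to check numerically. The sketch below assumes a full-rank \(\widehat {\Sigma }\) (so its range trivially contains the range of \(P_i\)) and an assumed centering form for \(P_i\) via the hypothetical helper `centering_projection`; it computes the generalized inverse square root with a symmetric eigendecomposition, which coincides with the SVD for a symmetric positive semidefinite matrix.

```python
import numpy as np

def centering_projection(present):
    """Assumed form of P_i: center over taxa flagged present; absent taxa get zero rows/columns."""
    d = np.asarray(present, dtype=float)
    return np.diag(d) - np.outer(d, d) / d.sum()

def inv_sqrt_psd(M, tol=1e-10):
    """Generalized inverse square root of a symmetric PSD matrix, ignoring null directions."""
    w, U = np.linalg.eigh(M)
    keep = w > tol * w.max()
    # Scale each retained eigenvector by 1/sqrt(eigenvalue).
    return (U[:, keep] / np.sqrt(w[keep])) @ U[:, keep].T

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + np.eye(4)               # full rank, so range(Sigma) contains range(P)
P = centering_projection([1, 1, 1, 0])    # fourth taxon absent from this sample

Sigma_i = P @ Sigma @ P
W = inv_sqrt_psd(Sigma_i)
cov_decorrelated = W @ Sigma_i @ W        # covariance of the decorrelated residuals
```

Here `cov_decorrelated` agrees with `P` up to floating-point error, so the decorrelated residuals have unit variance within the subspace of present, centered taxa and zero variance outside it.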
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zhao, N., Satten, G.A. (2021). A Log-Linear Model for Inference on Bias in Microbiome Studies. In: Datta, S., Guha, S. (eds) Statistical Analysis of Microbiome Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-73351-3_9
DOI: https://doi.org/10.1007/978-3-030-73351-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73350-6
Online ISBN: 978-3-030-73351-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)