Abstract
High-throughput sequencing (HTS)-based technology enables the identification and quantification of non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contaminant sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but different strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.
Acknowledgments
We are grateful for the thoughtful reading and suggestions made by the editors and referees, which helped improve the manuscript. This work was funded by a VMRC Grant from the Gates Foundation and Grant R01AI112401 from the NIH. We are happy to acknowledge the R and Bioconductor Core Teams and the authors of the packages BARBI, dada2, DESeq2, phyloseq, decontam, ggplot2, and rstan, which were used for constructing figures and running the analyses in this paper.
Cite this article
Jeganathan, P., Holmes, S.P. A Statistical Perspective on the Challenges in Molecular Microbial Biology. JABES 26, 131–160 (2021). https://doi.org/10.1007/s13253-021-00447-1
DOI: https://doi.org/10.1007/s13253-021-00447-1