Journal of Biosciences, 44:111

Successful strategies for human microbiome data generation, storage and analyses

  • Susan Holmes


Current interest in the clinical potential of new tools for improving human health is now focused on techniques for studying the human microbiome and its interaction with environmental and clinical covariates. This review outlines statistical strategies that have been developed in past studies and that can inform the successful design and analysis of controlled perturbation experiments performed in the human microbiome. We carefully outline what the data are, their imperfections, and how we need to transform, decontaminate and denoise them. We show how to identify the important unknown parameters and how we can leverage the variability we see to produce efficient models for prediction and uncertainty quantification. We encourage a reproducible strategy that builds on best-practice principles that can be adapted for effective experimental design and reproducible workflows. Nonparametric, data-driven denoising strategies already provide the best strain identification and decontamination methods. Data-driven models can be combined with uncertainty quantification to provide reproducible aids to decision making in the clinical context, as long as careful, separate, registered confirmatory testing is undertaken. Here we provide guidelines for effective longitudinal studies and their analyses. Lessons learned along the way are that visualizations at every step can pinpoint problems and outliers, and that normalization and filtering improve power in downstream testing. We recommend collecting and binding the metadata and covariates to sample descriptors and recording complete computer scripts in an R markdown supplement, which can reduce opportunities for human error and enable collaborators and readers to replicate all the steps of the study.
Finally, we note that optimizing the bioinformatic and statistical workflow involves adopting a wait-and-see approach that is particularly effective in cases where features such as ‘mass spectrometry peaks’ and metagenomic tables can only be partially annotated.
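
To make the filtering, transformation and uncertainty quantification steps mentioned above more concrete, the following is a minimal sketch in Python, not the workflow used in the review. The function names, the toy count table, the 50% prevalence threshold and the log transform with a pseudocount are all illustrative assumptions. The bootstrap shown resamples samples independently, which would not be appropriate as-is for longitudinal designs where observations are correlated in time.

```python
import math
import random


def prevalence_filter(counts, min_prevalence=0.5):
    """Keep taxa (columns) observed in at least `min_prevalence` of samples (rows)."""
    n_samples = len(counts)
    n_taxa = len(counts[0])
    keep = [
        j for j in range(n_taxa)
        if sum(1 for row in counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    return keep, [[row[j] for j in keep] for row in counts]


def log_transform(counts, pseudocount=1.0):
    """Simple log(count + pseudocount) transform (an assumed stand-in for a
    proper variance-stabilizing transformation)."""
    return [[math.log(c + pseudocount) for c in row] for row in counts]


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, resampling values with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical toy table: 4 samples (rows) by 4 taxa (columns).
counts = [
    [10, 0, 3, 0],
    [12, 0, 0, 1],
    [8, 0, 5, 0],
    [15, 2, 4, 0],
]

keep, filtered = prevalence_filter(counts, min_prevalence=0.5)
transformed = log_transform(filtered)

# Interval for the mean transformed abundance of the first retained taxon.
first_taxon = [row[0] for row in transformed]
lo, hi = bootstrap_ci(first_taxon, mean)
print("retained taxa:", keep)
print("bootstrap interval:", (lo, hi))
```

Fixing the random seed and keeping every step in a script, rather than in manual spreadsheet operations, is in the spirit of the reproducible workflow recommended above.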


Bayesian; bootstrap; experimental design; latent variable; longitudinal, statistical analyses; microbiome; visualization



The work was partly supported by NIH Grant AI112401. The author is thankful to Dr. Yogesh Shouche and the team at ICMR2018 for the opportunity to provide this short personal review of the challenges in designing and analyzing microbiome studies.



Copyright information

© Indian Academy of Sciences 2019

Authors and Affiliations

  1. Statistics Department, Stanford, USA
