Successful strategies for human microbiome data generation, storage and analyses
Current interest in the clinical use of new tools for improving human health is now focused on techniques for studying the human microbiome and its interaction with environmental and clinical covariates. This review outlines statistical strategies that were developed in past studies and can inform the successful design and analysis of controlled perturbation experiments on the human microbiome. We describe what the data are, where they are imperfect, and how they need to be transformed, decontaminated and denoised. We show how to identify the important unknown parameters and how to leverage the observed variability to produce efficient models for prediction and uncertainty quantification. We encourage a strategy that builds on best-practice principles and can be adapted for effective experimental design and reproducible workflows. Nonparametric, data-driven denoising strategies already provide the best strain identification and decontamination methods. Data-driven models can be combined with uncertainty quantification to provide reproducible aids to decision making in the clinical context, as long as careful, separate, registered confirmatory testing is undertaken. Here we provide guidelines for effective longitudinal studies and their analyses. Lessons learned along the way are that visualizations at every step can pinpoint problems and outliers, and that normalization and filtering improve power in downstream testing. We recommend binding the metadata and covariates to the sample descriptors and recording complete computer scripts in an R Markdown supplement, which reduces opportunities for human error and enables collaborators and readers to replicate all the steps of the study.
Finally, we note that optimizing the bioinformatic and statistical workflow involves adopting a wait-and-see approach, which is particularly effective when features such as mass spectrometry peaks and metagenomic tables can only be partially annotated.
Keywords: Bayesian; bootstrap; experimental design; latent variable; longitudinal; statistical analyses; microbiome; visualization
The work was partly supported by NIH Grant AI112401. The author is thankful to Dr. Yogesh Shouche and the team at ICMR2018 for the opportunity to provide this short personal review of the challenges in designing and analyzing microbiome studies.