Normalization of Microbiome Profiling Data

  • Paul J. McMurdie
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1849)

Abstract

Normalization is a term that is often used but rarely defined and poorly understood. The number of available normalization procedures is large, some of them inappropriate or inadmissible, and each is narrowly relevant to a specific analysis that depends on both the nature of the data and the question being asked. This chapter describes key definitions of normalization as they apply in metagenomics, mainly for taxonomic profiling data. It also demonstrates specific, reproducible examples of normalization procedures in the context of the analysis techniques for which they were intended. The analysis and graphics code is distributed as a supplemental companion to this chapter so that the motivated reader can reuse it on new data.

Key words

Normalization, Microbiome, Metagenomics, DNA sequencing, Statistics

Supplementary material

340450_1_En_10_MOESM1_ESM.zip (44 kb)
supportingcode.R (ZIP 44 KB)


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Whole Biome, Inc., San Francisco, USA
