Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis

  • Ruofei Du
  • Lingling An
  • Zhide Fang
Part of the ICSA Book Series in Statistics book series (ICSABSS)


Background: In recent years, metagenomics, a set of research techniques that do not require cultivation, has become increasingly popular for studying the genomic and genetic variation of microbes in environmental or clinical samples. Although metagenomic sequence data are generated by sequencing technologies similar to those used for RNA-Seq, there is increasing evidence that they should not be treated as just another variant of RNA-Seq count data, chiefly because of their compositional characteristics. Comparing taxonomic or functional profiles of microbial communities between conditions is often of primary interest, and normalization for library size is usually a necessary step before a typical differential abundance analysis. Several methods have been proposed for such normalization, but existing performance evaluations of these methods on metagenomic sequence data do not adequately account for the compositional characteristics.

Result: The normalization methods assessed in this chapter are Total Sum Scaling (TSS), Relative Log Expression (RLE), Trimmed Mean of M-values (TMM), Cumulative Sum Scaling (CSS), and Rarefying (RFY). Simulated data were generated to reflect not only compositional proportions but also overdispersion, zero inflation, and under-sampling. The impact of normalization on subsequent differential abundance analysis was further studied.
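As a rough illustration of what three of these normalizations compute, the following Python sketch applies TSS, RLE size factors, and rarefying to a small made-up count table (features in rows, samples in columns). The function names, the toy table, and the choice of NumPy are assumptions made for this example and are not taken from the chapter; TMM and CSS are omitted for brevity.

```python
import numpy as np

def tss(counts):
    """Total Sum Scaling: divide each sample (column) by its library size."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True)

def rle_size_factors(counts):
    """RLE size factors: per-sample median ratio to the per-feature geometric mean.

    Only features with nonzero counts in every sample are used, which is a
    common convention and an assumption of this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    positive = (counts > 0).all(axis=1)
    log_counts = np.log(counts[positive])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def rarefy(counts, depth, seed=0):
    """Rarefying: subsample each sample without replacement to a common depth."""
    counts = np.asarray(counts, dtype=int)
    rng = np.random.default_rng(seed)
    out = np.zeros_like(counts)
    for j in range(counts.shape[1]):
        # Expand the column into one entry per read, labeled by feature index.
        reads = np.repeat(np.arange(counts.shape[0]), counts[:, j])
        kept = rng.choice(reads, size=depth, replace=False)
        out[:, j] = np.bincount(kept, minlength=counts.shape[0])
    return out

# Toy table: 4 features (rows) x 3 samples (columns); the values are made up.
toy = np.array([[10,  20,  5],
                [30,  60, 15],
                [ 0,   4,  1],
                [60, 116, 29]])

proportions = tss(toy)                      # each column sums to 1
size_factors = rle_size_factors(toy)        # one scaling factor per sample
rarefied = rarefy(toy, depth=int(toy.sum(axis=0).min()))
```

Rarefying to the smallest library size, as done here, is only one possible choice of depth; it discards reads from the deeper samples, which is one reason the methods can behave differently in downstream analysis.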

Conclusion: Selection of a normalization method for metagenomic compositional data should be made on a case-by-case basis. Simulation using the parameters learned from the experimental data may be carried out to assist the selection.
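A simulation of this kind might be sketched as follows. This is not the chapter's simulation design: the zero-inflated gamma-Poisson (negative binomial) model, the function name `simulate_counts`, and every parameter value below are assumptions chosen only to show how compositional proportions, overdispersion, and zero inflation could enter a simulated count table. In practice the proportions, dispersion, and zero probability would be estimated from the experimental data.

```python
import numpy as np

def simulate_counts(proportions, library_sizes, dispersion, zero_prob, seed=0):
    """Simulate a features x samples count table (hypothetical model).

    proportions   : true relative abundances of the features (sum to 1)
    library_sizes : sequencing depth of each sample
    dispersion    : negative binomial dispersion (variance = mean + dispersion * mean**2)
    zero_prob     : probability that a count is replaced by an excess zero
    """
    rng = np.random.default_rng(seed)
    proportions = np.asarray(proportions, dtype=float)
    library_sizes = np.asarray(library_sizes, dtype=float)

    # Expected count of each feature in each sample.
    mean = np.outer(proportions, library_sizes)

    # Gamma-Poisson mixture, equivalent to a negative binomial draw.
    shape = 1.0 / dispersion
    rates = rng.gamma(shape, mean * dispersion)
    counts = rng.poisson(rates)

    # Zero inflation: independently zero out a fraction of the entries.
    dropout = rng.random(counts.shape) < zero_prob
    counts[dropout] = 0
    return counts

# Made-up parameters for illustration only.
props = [0.5, 0.3, 0.15, 0.05]
libs = [1000, 5000, 20000]
sim = simulate_counts(props, libs, dispersion=0.5, zero_prob=0.2)
```

Candidate normalization methods can then be compared on tables simulated this way, where the truly differential features are known by construction.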



The authors are grateful to two anonymous reviewers for their careful reading of the manuscript and their comments and suggestions. ZF’s research is supported by grant U54 GM104940 from the National Institute of General Medical Sciences of the National Institutes of Health, which funds the Louisiana Clinical and Translational Science Center of Pennington Biomedical Research Center. LA’s research is partially supported by the National Science Foundation [DMS-1222592] and the United States Department of Agriculture [Hatch project, ARZT-1360830-H22-138]. RD’s research was supported in part by the UNM Comprehensive Cancer Center, a recipient of NCI Cancer Support Grant 2 P30 CA118100-11 (PI: Cheryl L. Willman, MD).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Biostatistics Shared Resource, University of New Mexico Comprehensive Cancer Center, Albuquerque, USA
  2. Department of Agricultural and Biosystems Engineering, University of Arizona, Tucson, USA
  3. Interdisciplinary Program in Statistics, University of Arizona, Tucson, USA
  4. Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, USA
