Journal of Computer Science and Technology

Volume 25, Issue 1, pp 71–81

Metagenomics: Facts and Artifacts, and Computational Challenges

  • John C. Wooley
  • Yuzhen Ye


Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling an analysis of populations including many (so-far) unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology, and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including those in ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. New studies, however, have revealed various kinds of artifacts present in metagenomics data caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of the artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.


Keywords: metagenomics; next-generation sequencing (NGS); taxonomic/functional profiling; statistical approaches; comparative metagenomics





Copyright information

© Springer 2010

Authors and Affiliations

  1. Center for Research on BioSystems, Calit2, University of California San Diego, La Jolla, U.S.A.
  2. School of Informatics and Computing, Indiana University, Bloomington, U.S.A.
