Current Microbiology, Volume 72, Issue 5, pp 612–616

Evaluating the Quantitative Capabilities of Metagenomic Analysis Software

Article

Abstract

DNA sequencing technologies are now widely used to describe metagenomes, i.e., microbial communities in environmental or clinical samples, without the need to culture them. These technologies usually return short DNA reads (100–300 base pairs long), which are processed by metagenomic analysis software that assigns phylogenetic composition information to the dataset. Here we evaluate three metagenomic analysis tools (AmphoraNet, a webserver implementation of AMPHORA2; MG-RAST; and MEGAN5) for their ability to assign quantitative phylogenetic information to the data, that is, to describe how frequently microorganisms of each taxon appear in the sample. The difficulty of the task arises from the fact that, at equal abundance, an organism with a longer genome produces more reads than one with a shorter genome, so some tools assign higher frequencies to species with longer genomes than to those with shorter ones. This phenomenon is called the “genome length bias.” Dozens of complex artificial metagenome benchmarks can be found in the literature. Because of the complexity of those benchmarks, it is usually difficult to judge how resistant a metagenomic tool is to this “genome length bias.” Therefore, we have made a simple benchmark for evaluating “taxon-counting” in a metagenomic sample: we took the same number of copies of three full bacterial genomes of different lengths, broke them up randomly into short reads with an average length of 150 bp, and mixed the reads. Because of its simplicity, the benchmark is not intended to serve as a mock metagenome, but if a tool fails on this simple task, it will surely fail on most real metagenomes. We applied the three tools to the benchmark. The ideal quantitative solution would assign the same proportion to each of the three bacterial taxa. We found that AMPHORA2/AmphoraNet gave the most accurate results, while the other two tools under-performed: they quite reliably assigned each short read to its respective taxon, and so reproduced the typical genome length bias. The benchmark dataset is available at http://pitgroup.org/static/3RandomGenome-100kavg150bps.fna.
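The genome length bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' benchmark generator and does not reproduce how any of the three tools work. The genome lengths, the uniform read-length distribution, and all function names are assumptions chosen for illustration. It fragments one copy each of three genomes of different lengths into reads averaging 150 bp, then compares naive read-count proportions with proportions obtained after dividing by genome length.

```python
import random


def fragment(genome_length, mean_read_len, rng):
    """Cut one genome copy into consecutive reads and return their lengths."""
    reads = []
    pos = 0
    while pos < genome_length:
        # Assumption: read lengths are uniform within +/- 50 bp of the mean.
        length = rng.randint(mean_read_len - 50, mean_read_len + 50)
        length = min(length, genome_length - pos)  # last read may be shorter
        reads.append(length)
        pos += length
    return reads


def simulate(genome_lengths, copies=1, mean_read_len=150, seed=0):
    """Return the number of reads produced per genome at equal copy number."""
    rng = random.Random(seed)
    counts = {}
    for name, length in genome_lengths.items():
        counts[name] = sum(
            len(fragment(length, mean_read_len, rng)) for _ in range(copies)
        )
    return counts


def proportions(values):
    """Scale a dict of non-negative numbers so that they sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


# Hypothetical genome lengths in base pairs (not the genomes used in the paper).
GENOMES = {"short": 1_000_000, "medium": 3_000_000, "long": 6_000_000}

read_counts = simulate(GENOMES)

# Counting every read toward its taxon reproduces the genome length bias.
naive = proportions(read_counts)

# Dividing by genome length is one simple correction, used here only to show
# what the ideal answer (equal proportions) looks like.
normalized = proportions({k: read_counts[k] / GENOMES[k] for k in GENOMES})

print("naive:     ", naive)
print("normalized:", normalized)
```

Under these assumptions, the naive proportions follow the genome lengths (roughly 1:3:6), while the length-normalized proportions are each close to one third, which is the ideal quantitative result for a benchmark built from equal copy numbers.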

Keywords

Environmental microbiology, Genomics, Microbiology

Supplementary material

284_2016_991_MOESM1_ESM.docx (91 kb)
Supplementary material 1 (DOCX 91 kb)

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. PIT Bioinformatics Group, Eötvös University, Budapest, Hungary
  2. Uratim Ltd., Budapest, Hungary
