A Probabilistic Approach to Accurate Abundance-Based Binning of Metagenomic Reads

  • Olga Tanaseichuk
  • James Borneman
  • Tao Jiang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7534)


An important problem in metagenomic analysis is to determine and quantify the species (or genomes) in a metagenomic sample. The identification of phylogenetically related groups of sequence reads in a metagenomic dataset is often referred to as binning. Similarity-based binning methods rely on reference databases and are unable to classify reads from unknown organisms. Composition-based methods exploit compositional patterns that are preserved in sufficiently long fragments, but are not suitable for binning very short next-generation sequencing (NGS) reads. Recently, several new metagenomic binning algorithms have been developed that can deal with NGS reads and do not rely on reference databases. However, all of them have difficulty handling samples that contain low-abundance species. We propose a new method to accurately estimate the abundance levels of species, based on a novel probabilistic model for counting l-mer frequencies in a metagenomic dataset that takes into account the frequencies of erroneous l-mers and repeated l-mers. An expectation maximization (EM) algorithm is used to learn the parameters of the model. Our algorithm automatically determines the number of abundance groups in a dataset and bins the reads into these groups. We show that our method outperforms the most recent abundance-based binning method, AbundanceBin, on both simulated and real datasets. We also show that the improved abundance-based binning method can enhance the performance of the recent tool TOSS, which separates genomes with similar abundance levels and employs AbundanceBin as a preprocessing step to handle different abundance levels. We test the improved TOSS on simulated datasets and show that it significantly outperforms the original TOSS on datasets containing low-abundance genomes. Finally, we compare this approach against the very recent metagenomic binning tools MetaCluster 4.0 and MetaCluster 5.0 on simulated data and demonstrate that it usually achieves better sensitivity and breaks fewer genomes.
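The general idea of abundance-based binning can be illustrated with a minimal sketch: count l-mers across the reads, then fit a mixture of Poisson distributions to the counts with EM, one component per abundance group. This is not the model proposed in the paper, which additionally accounts for erroneous and repeated l-mers and determines the number of groups automatically. The function names (`count_lmers`, `em_poisson_mixture`), the fixed number of components, and the initial means are assumptions made purely for illustration.

```python
import math
from collections import Counter


def count_lmers(reads, l):
    """Count every length-l substring (l-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing count k given mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def em_poisson_mixture(values, initial_means, iterations=100):
    """Fit a plain Poisson mixture to l-mer counts with EM.

    values: observed l-mer counts.
    initial_means: starting guess for each component mean; the number of
        components is fixed by the caller (an assumption of this sketch).
    Returns (weights, means).
    """
    n = len(values)
    k = len(initial_means)
    means = [float(m) for m in initial_means]
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: posterior probability of each component for each count,
        # computed in log space for numerical stability.
        responsibilities = []
        for v in values:
            logs = [math.log(weights[j]) + poisson_log_pmf(v, means[j])
                    for j in range(k)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            responsibilities.append([p / total for p in probs])

        # M-step: re-estimate mixing weights and component means.
        for j in range(k):
            nj = sum(r[j] for r in responsibilities)
            weights[j] = max(nj / n, 1e-12)  # clamp to keep log() defined
            if nj > 0:
                weighted_sum = sum(r[j] * v
                                   for r, v in zip(responsibilities, values))
                means[j] = max(weighted_sum / nj, 1e-9)

    return weights, means
```

In such a sketch, the l-mer counts would come from `count_lmers(reads, l).values()`, and each l-mer (and, from its l-mers, each read) could then be assigned to the component with the highest posterior probability.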


Keywords: metagenomics, next-generation sequencing, expectation maximization, abundance-based binning





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Olga Tanaseichuk (1)
  • James Borneman (2)
  • Tao Jiang (1)

  1. Department of Computer Science and Engineering, University of California, Riverside, USA
  2. Department of Plant Pathology and Microbiology, University of California, Riverside, USA
