A Probabilistic Approach to Accurate Abundance-Based Binning of Metagenomic Reads
An important problem in metagenomic analysis is to determine and quantify species (or genomes) in a metagenomic sample. The identification of phylogenetically related groups of sequence reads in a metagenomic dataset is often referred to as binning. Similarity-based binning methods rely on reference databases, and are unable to classify reads from unknown organisms. Composition-based methods exploit compositional patterns that are preserved in sufficiently long fragments, but are not suitable for binning very short next-generation sequencing (NGS) reads. Recently, several new metagenomic binning algorithms that can deal with NGS reads and do not rely on reference databases have been developed. However, all of them have difficulty with handling samples containing low-abundance species. We propose a new method to accurately estimate the abundance levels of species based on a novel probabilistic model for counting l-mer frequencies in a metagenomic dataset that takes into account frequencies of erroneous l-mers and repeated l-mers. An expectation maximization (EM) algorithm is used to learn the parameters of the model. Our algorithm automatically determines the number of abundance groups in a dataset and bins the reads into these groups. We show that our method outperforms the most recent abundance-based binning method, AbundanceBin, on both simulated and real datasets. We also show that the improved abundance-based binning method can be incorporated into a recent tool TOSS, which separates genomes with similar abundance levels and employs AbundanceBin as a preprocessing step to handle different abundance levels, to enhance its performance. We test the improved TOSS on simulated datasets and show that it significantly outperforms TOSS on datasets containing low-abundance genomes. Finally, we compare this approach against very recent metagenomic binning tools MetaCluster 4.0 and MetaCluster 5.0 on simulated data and demonstrate that it usually achieves a better sensitivity and breaks fewer genomes.
Keywordsmetagenomics next-generation sequencing expectation maximization abundance-based binning
Unable to display preview. Download preview PDF.
- 1.Amann, R.I., Ludwig, W., Schleifer, K.H.: Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiological Reviews 59(1), 143–169 (1995)Google Scholar
- 6.Margulies, M., Egholm, M., Altman, W.E., et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376–380 (2005)Google Scholar
- 14.Ghosh, T., Monzoorul Haque, M., Mande, S.: DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences. BMC Bioinformatics 11(suppl. 7), S14+ (2010)Google Scholar
- 16.Diaz, N., Krause, L., Goesmann, A., et al.: TACOA - Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics 10(1), 56+ (2009)Google Scholar
- 20.Teeling, H., Waldmann, J., Lombardot, T., et al.: TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5(1), 163+ (2004)Google Scholar
- 21.Prabhakara, S., Acharya, R.: A two-way multi-dimensional mixture model for clustering metagenomic sequences. In: Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2011, pp. 191–200. ACM (2011)Google Scholar
- 22.Yang, B., Peng, Y., Leung, H., et al.: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinformatics 11(Suppl 2), S5+ (2010)Google Scholar
- 23.Wang, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: MetaCluster 4.0: A Novel Binning Algorithm for NGS Reads and Huge Number of Species. Journal of Computational Biology: a Journal of Computational Molecular Cell Biology 19(2), 241–249 (2012)Google Scholar
- 24.Wang, Y., Leung, H., Yiu, S., Chin, F.: Metacluster 5.0: A two-round binning approach for metagenomic data for low-abundance species in a noisy sample. In: Proceedings of the ECCB (to appear, 2012)Google Scholar
- 28.Richter, D.C., Ott, F., Auch, A.F., et al.: MetaSim: a Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3(10), e3373+ (2008)Google Scholar