A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify all (or most) of the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics, which is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised, clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimations of species abundances, as well as their genome sizes—two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequence lengths are very short (e.g. 75 bp) or have sequencing errors.
KeywordsBinning metagenomics EM algorithm Poisson distribution
Unable to display preview. Download preview PDF.
- 8.Margulies, M., Egholm, M., Altman, W.E., et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376–380 (2005)Google Scholar
- 16.Schmidt, H.A., Strimmer, K., Vingron, M., et al.: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(3), 502–504 (2002)Google Scholar
- 18.Krause, L., Diaz, N.N., Goesmann, A., et al.: Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 36(7), 2230–2239 (2008)Google Scholar
- 19.Finn, R.D., Mistry, J., Schuster-Bockler, B., et al.: Pfam: clans, web tools and services. Nucleic Acids Res. 34(Database issue), D247–D251 (2006)Google Scholar
- 20.Brady, A., Salzberg, S.L.: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 6(9), 673–676 (2009)Google Scholar
- 23.Woyke, T., Teeling, H., Ivanova, N.N., et al.: Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443(7114), 950–955 (2006)Google Scholar
- 27.Foerstner, K.U., von Mering, C., Hooper, S.D., et al.: Environments shape the nucleotide composition of genomes. EMBO Rep. 6(12), 1208–1213 (2005)Google Scholar
- 29.Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)Google Scholar
- 30.Li, X., Waterman, M.S.: Estimating the repeat structure and length of DNA sequences using l-tuples. Genome Res. 13(8), 1916–1922 (2003)Google Scholar
- 34.White, J.R., Roberts, M., Yorke, J.A., et al.: Figaro: a novel statistical method for vector sequence removal. Bioinformatics 24(4), 462–467 (2008)Google Scholar