Separating Metagenomic Short Reads into Genomes via Clustering

Tanaseichuk, Olga; Borneman, James; Jiang, Tao

doi:10.1007/978-3-642-23038-7_25

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6833)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and cost. On the other hand, they produce shorter reads, which makes it harder to separate reads from different species. In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers belong to a single genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. The final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. Our tests on a large number of simulated metagenomic datasets, covering species at various phylogenetic distances, demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
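The observation about l-mers can be illustrated with a small sketch. It is not the algorithm of the paper: it ignores paired-end information, reverse complements, sequencing errors, and the repeat-based merging phase, and it groups reads directly rather than building l-mer clusters first. All names (`lmers`, `shared_fraction`, `cluster_reads`, `random_genome`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def lmers(seq, l):
    """Return the set of all length-l substrings (l-mers) of seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def shared_fraction(genome_a, genome_b, l):
    """Fraction of distinct l-mers that occur in both sequences (Jaccard index)."""
    a, b = lmers(genome_a, l), lmers(genome_b, l)
    return len(a & b) / len(a | b)


def random_genome(n, seed):
    """Toy stand-in for a genome: n uniformly random bases (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


def cluster_reads(reads, l):
    """Group reads that are connected through shared l-mers (union-find).

    Two reads land in the same cluster if they share an l-mer, directly or
    through a chain of other reads. Returns lists of read indices.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # l-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for mer in lmers(read, l):
            if mer in first_seen:
                parent[find(idx)] = find(first_seen[mer])
            else:
                first_seen[mer] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())
```

On two unrelated random sequences, `shared_fraction` is close to 1 for small l (for example l = 4) and drops to essentially 0 for large l (for example l = 20), which is the property the clustering relies on. Real genomes share repeats and conserved regions, so the separation is less clean in practice.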

Author information

Authors and Affiliations

Department of Computer Science and Engineering, UC Riverside, CA, USA
Olga Tanaseichuk & Tao Jiang
Department of Plant Pathology and Microbiology, UC Riverside, CA, USA
James Borneman

Editor information

Editors and Affiliations

National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, 20894, Bethesda, MD, USA
Teresa M. Przytycka
Institut National de Recherche en Informatique et en Automatique (INRIA) and Université Lyon 1 (UCBL), 43 bd du 11 Novembre 1918, 69622, Villeurbanne cedex, France
Marie-France Sagot

About this paper

Cite this paper

Tanaseichuk, O., Borneman, J., Jiang, T. (2011). Separating Metagenomic Short Reads into Genomes via Clustering. In: Przytycka, T.M., Sagot, MF. (eds) Algorithms in Bioinformatics. WABI 2011. Lecture Notes in Computer Science, vol 6833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23038-7_25

DOI: https://doi.org/10.1007/978-3-642-23038-7_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23037-0
Online ISBN: 978-3-642-23038-7
eBook Packages: Computer Science (R0)
