Separating Metagenomic Short Reads into Genomes via Clustering

(Extended Abstract)

  • Conference paper
Algorithms in Bioinformatics (WABI 2011)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6833)

Abstract

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. On the other hand, they produce shorter reads, which makes it harder to separate reads from different species. In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers occur in only one genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. The final clusters are then used to assign reads. The algorithm can handle very short reads and sequencing errors. Our tests on a large number of simulated metagenomic datasets covering species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with high precision and sensitivity.
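
As a rough illustration of the idea that l-mers can link reads from the same genome, the following Python sketch groups reads that are transitively connected by a shared l-mer. It is not the two-phase algorithm of the paper: it ignores paired-end information, reverse-complement strands, sequencing errors, and the repeat-based merging phase. The names `lmers`, `UnionFind`, and `cluster_reads` are hypothetical and chosen only for this example.

```python
def lmers(seq, l):
    """Return all length-l substrings of seq, in order of occurrence."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


class UnionFind:
    """Minimal disjoint-set structure over hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_reads(reads, l):
    """Assign a cluster label to each read.

    All l-mers of one read are placed in the same cluster (assumption:
    a read comes from a single genome). Reads sharing any l-mer therefore
    end up with the same label. Reads shorter than l get the label None.
    """
    uf = UnionFind()
    for read in reads:
        mers = lmers(read, l)
        for mer in mers[1:]:
            uf.union(mers[0], mer)

    labels = []
    for read in reads:
        mers = lmers(read, l)
        labels.append(uf.find(mers[0]) if mers else None)
    return labels
```

In this simplified form, a single l-mer shared between two genomes would merge their clusters, which is why the choice of a sufficiently large l, and some treatment of repeats, matters in practice.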

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tanaseichuk, O., Borneman, J., Jiang, T. (2011). Separating Metagenomic Short Reads into Genomes via Clustering. In: Przytycka, T.M., Sagot, MF. (eds) Algorithms in Bioinformatics. WABI 2011. Lecture Notes in Computer Science, vol 6833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23038-7_25

  • DOI: https://doi.org/10.1007/978-3-642-23038-7_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23037-0

  • Online ISBN: 978-3-642-23038-7

  • eBook Packages: Computer Science, Computer Science (R0)
