Separating Metagenomic Short Reads into Genomes via Clustering

(Extended Abstract)

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNBI, volume 6833)

Abstract

The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high-complexity datasets in which, in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve sequencing efficiency and reduce cost. However, they produce shorter reads, which makes separating reads from different species harder. In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most l-mers belong to a single genome when l is sufficiently large. The first phase of the algorithm produces clusters of l-mers, each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. The final clusters are then used to assign reads to genomes. The algorithm can handle very short reads and sequencing errors. Our tests on a large number of simulated metagenomic datasets, covering species at various phylogenetic distances, demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method separates NGS reads with high precision and sensitivity.
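The observation that most l-mers belong to a single genome for sufficiently large l can be illustrated with a small sketch. This is not the authors' algorithm; it only measures how the overlap between the l-mer sets of two sequences shrinks as l grows. The function names (`lmers`, `shared_fraction`, `random_genome`) are hypothetical, and the two "genomes" are uniformly random strings, which is a strong simplifying assumption: real genomes contain repeats and conserved regions, so the overlap there would be larger.

```python
import random


def lmers(seq, l):
    """Return the set of all substrings of length l in seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def shared_fraction(a, b, l):
    """Fraction of distinct l-mers that occur in both sequences.

    Computed as |A & B| / |A | B| over the two l-mer sets
    (a Jaccard index, chosen here for illustration only).
    """
    set_a, set_b = lmers(a, l), lmers(b, l)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def random_genome(length, rng):
    """Toy stand-in for a genome: a uniformly random DNA string."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# Assumption: two unrelated toy genomes of 2000 bases each.
rng = random.Random(0)
genome_a = random_genome(2000, rng)
genome_b = random_genome(2000, rng)

# Short l-mers are shared by almost any two sequences;
# long l-mers are almost always specific to one of them.
for l in (4, 8, 16):
    print(l, round(shared_fraction(genome_a, genome_b, l), 4))
```

In this toy setting, nearly every 4-mer appears in both sequences, while 16-mers are essentially never shared, which is the property the first phase relies on when it groups l-mers by genome.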

Keywords

  • Sequencing Error
  • Abundance Ratio
  • Phylogenetic Distance
  • Abundance Level
  • Normalize Error Rate

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • DOI: 10.1007/978-3-642-23038-7_25
  • Chapter length: 16 pages


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tanaseichuk, O., Borneman, J., Jiang, T. (2011). Separating Metagenomic Short Reads into Genomes via Clustering. In: Przytycka, T.M., Sagot, MF. (eds) Algorithms in Bioinformatics. WABI 2011. Lecture Notes in Computer Science, vol 6833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23038-7_25

  • DOI: https://doi.org/10.1007/978-3-642-23038-7_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23037-0

  • Online ISBN: 978-3-642-23038-7

  • eBook Packages: Computer Science, Computer Science (R0)