Skip to main content

Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps

  • Conference paper

Part of the book series: Advances in Intelligent and Soft Computing ((AINSC,volume 93))

Abstract

Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers from a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment, ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of <1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for error tolerance and variability in k-mer centric sequence alignment and assembly applications.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   169.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   219.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Goodrich, M., Tamassia, R.: Data Structures and Algorithms in Java. John Wiley & Sons, Chichester (2001)

    Google Scholar 

  2. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990)

    Google Scholar 

  3. Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389 (1997)

    Article  Google Scholar 

  4. Pearson, W., Lipman, D.: Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences 85, 2444 (1988)

    Article  Google Scholar 

  5. Pevzner, P., Tang, H., Waterman, M.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 98, 9748 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  6. Zerbino, D., Birney, E.: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18, 821 (2008)

    Article  Google Scholar 

  7. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18, 1851 (2008)

    Article  Google Scholar 

  8. Rumble, S., Lacroute, P., Dalca, A., Fiume, M., Sidow, A., Brudno, M.: SHRiMP: accurate mapping of short color-space reads. PLoS computational biology 5 (2009)

    Google Scholar 

  9. Li, R., Li, Y., Kristiansen, K., Wang, J.: SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713 (2008)

    Article  Google Scholar 

  10. Lin, H., Zhang, Z., Zhang, M., Ma, B., Li, M.: ZOOM! Zillions of oligos mapped. Bioinformatics 24, 2431 (2008)

    Article  Google Scholar 

  11. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440 (2002)

    Article  Google Scholar 

  12. Hall, N.: Advanced sequencing technologies and their wider impact in microbiology. Journal of Experimental Biology 210, 1518 (2007)

    Article  Google Scholar 

  13. Schatz, M., Delcher, A., Salzberg, S.: Assembly of large genomes using second-generation sequencing. Genome Research 20, 1165 (2010)

    Article  Google Scholar 

  14. Pop, M.: Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10, 354 (2009)

    Article  Google Scholar 

  15. Batzoglou, S.: The many faces of sequence alignment. Briefings in Bioinformatics 6, 6 (2005)

    Article  Google Scholar 

  16. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform., bbq015 (2010)

    Google Scholar 

  17. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Digital SRC Research Report (1994)

    Google Scholar 

  18. Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nature Methods 6, S6–S12 (2009)

    Article  Google Scholar 

  19. Li, R., Yu, C., Li, Y., Lam, T., Yiu, S., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966 (2009)

    Article  Google Scholar 

  20. Topac, V.: Efficient fuzzy search enabled hash map, pp. 39–44 (2010)

    Google Scholar 

  21. Gosling, J., Joy, B., Steele, G., Bracha, G.: Java (TM) Language Specification, The Java (Addison-Wesley): Addison-Wesley Professional (2005)

    Google Scholar 

  22. Hamming, R.: Error detecting and error correcting codes. Bell System Technical Journal 29, 147–160 (1950)

    MathSciNet  Google Scholar 

  23. Bookstein, A., Tomi Klein, S., Raita, T.: Fuzzy Hamming Distance: A New Dissimilarity Measure (Extended Abstract), pp. 86–97 (2001)

    Google Scholar 

  24. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals (1966)

    Google Scholar 

  25. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I., Belmonte, M., Lander, E., Nusbaum, C., Jaffe, D.: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 18, 810 (2008)

    Article  Google Scholar 

  26. Simpson, J., Wong, K., Jackman, S., Schein, J., Jones, S., Birol: ABySS: A parallel assembler for short read sequence data. Genome Research 19, 1117 (2009)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Healy, J., Chambers, D. (2011). Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps. In: Rocha, M.P., Rodríguez, J.M.C., Fdez-Riverola, F., Valencia, A. (eds) 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Advances in Intelligent and Soft Computing, vol 93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19914-1_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-19914-1_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19913-4

  • Online ISBN: 978-3-642-19914-1

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics