Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps

  • John Healy
  • Desmond Chambers
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 93)


Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers from a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of less than 1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for tolerating errors and variability in k-mer-centric sequence alignment and assembly applications.
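The abstract does not say how the fuzzy hash-map is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions. It swaps in a plain dictionary of exact k-mers and makes the lookup "fuzzy" by also probing every k-mer at Hamming distance 1 from the query; it does not handle insertions or deletions and is not the authors' data structure. The function names (`build_index`, `fuzzy_lookup`, `anchor_read`), the one-mismatch tolerance, and the toy sequences are all hypothetical choices made for illustration.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbours(kmer):
    """Yield the k-mer itself, then every k-mer at Hamming distance 1 from it."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def fuzzy_lookup(index, kmer):
    """Return sorted reference positions whose k-mer is within one mismatch."""
    hits = set()
    for candidate in neighbours(kmer):
        # .get() avoids creating empty entries in the defaultdict
        hits.update(index.get(candidate, ()))
    return sorted(hits)


def anchor_read(index, read, k):
    """Let each fuzzy k-mer hit vote for a reference offset; return the winner.

    Returns None when no k-mer of the read matches the reference.
    """
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in fuzzy_lookup(index, read[j:j + k]):
            votes[pos - j] += 1
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy data (assumed, not from the paper): a read taken from offset 5 of the
# reference with a single substituted base.
reference = "ACGTTGCAAGGCTTAACCGGTACGATCGA"
read = "GCATGGCTTAACCGGT"
index = build_index(reference, 8)
print(anchor_read(index, read, 8))  # prints 5
```

In this toy case the first k-mer of the read, `GCATGGCT`, has no exact entry in the index, yet the fuzzy lookup still recovers position 5, so the substituted base does not stop the read from being anchored. Probing neighbours costs 3k + 1 dictionary lookups per k-mer, which is one simple way to trade a constant factor in speed for mismatch tolerance.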


Keywords: Draft Genome, Edit Distance, Hash Code, Mycoplasma Genitalium, Fuzzy Index
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • John Healy (1)
  • Desmond Chambers (2)
  1. Department Computing & Mathematics, Galway-Mayo Institute of Technology, Ireland
  2. Department of Information Technology, National University of Ireland, Galway, Ireland
