Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps

Healy, John; Chambers, Desmond

doi:10.1007/978-3-642-19914-1_21

Conference paper

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 93)

Abstract

Although hash-based approaches to sequence alignment and genome assembly are long established, their utility is predicated on the rapid identification of exact k-mers from a hash-map or similar data structure. We describe how a fuzzy hash-map can be applied to quickly and accurately align a prokaryotic genome to the reference genome of a related species. Using this technique, a draft genome of Mycoplasma genitalium, sampled at 1X coverage, was accurately anchored against the genome of Mycoplasma pneumoniae. The fuzzy approach to alignment ordered and orientated more than 65% of the reads from the draft genome in under 10 seconds, with an error rate of <1.5%. Without sacrificing execution speed, fuzzy hash-maps also provide a mechanism for error tolerance and variability in k-mer-centric sequence alignment and assembly applications.
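The idea of tolerating mismatches during k-mer lookup can be illustrated with a minimal Python sketch. It is not the authors' implementation, which this preview does not show. It assumes a plain dictionary index and emulates the fuzzy lookup by probing every k-mer at Hamming distance 1 from the query; the names `build_index`, `fuzzy_get` and `anchor`, the toy reference string and the choice of k = 6 are all hypothetical.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbours(kmer):
    """Yield the k-mer itself and every k-mer at Hamming distance 1."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def fuzzy_get(index, kmer):
    """Return reference positions of k-mers within one mismatch of `kmer`."""
    hits = []
    for candidate in neighbours(kmer):
        hits.extend(index.get(candidate, ()))
    return hits


def anchor(index, read, k):
    """Vote for the (strand, offset) placement best supported by fuzzy hits."""
    votes = Counter()
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for i in range(len(seq) - k + 1):
            for pos in fuzzy_get(index, seq[i:i + k]):
                votes[(strand, pos - i)] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATGCATTAGC"  # toy reference, assumption
    index = build_index(reference, 6)
    read = "GCAATGCTTACC"  # reference[5:17] with one substituted base
    print("GCAATG" in index)       # False: the exact lookup misses
    print(anchor(index, read, 6))  # ('+', 5)
```

In this sketch an exact lookup of the mutated k-mer fails, while the fuzzy lookup still recovers position 5, and the vote over both strands supplies the order and orientation of the read. Enumerating neighbours costs 3k + 1 probes per k-mer, so a purpose-built fuzzy hash-map would be expected to behave differently in practice.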

Author information

Authors and Affiliations

Department of Computing & Mathematics, Galway-Mayo Institute of Technology, Ireland
John Healy
Department of Information Technology, National University of Ireland, Galway, Ireland
Desmond Chambers

Editor information

Editors and Affiliations

Dep. Informática / CCTC, Universidade do Minho, 4710 - 057, Braga, Portugal
Miguel P. Rocha
Department of Computing Science and Control Faculty of Science, University of Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan M. Corchado Rodríguez
Edificio Politécnico, ESEI: Escuela Superior de Ingeniería Informática, 32004, Ourense, Spain
Florentino Fdez-Riverola
Structural Biology and BioComputing Programme (CNIO), Spanish National Cancer Research Centre, Melchor Fdez Almagro 3, 28029, Madrid, Spain
Alfonso Valencia

About this paper

Cite this paper

Healy, J., Chambers, D. (2011). Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps. In: Rocha, M.P., Rodríguez, J.M.C., Fdez-Riverola, F., Valencia, A. (eds) 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Advances in Intelligent and Soft Computing, vol 93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19914-1_21

DOI: https://doi.org/10.1007/978-3-642-19914-1_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19913-4
Online ISBN: 978-3-642-19914-1
eBook Packages: Engineering, Engineering (R0)
