A Hash Trie Filter Approach to Approximate String Matching for Genomic Databases

  • Ye-In Chang
  • Jiun-Rung Chen
  • Min-Tze Hsu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5579)

Abstract

For genomic databases, approximate string matching with k errors is commonly used to search genomic sequences, where the k errors may be caused by substitution, insertion, or deletion operations. In this paper, we propose a new approximate string matching method, the hash trie filter, for efficient searching in genomic databases. Our method not only reduces the number of candidates by pruning unreasonable matched positions, but also dynamically decides the number of ordered matched grams for each candidate, which increases precision. The experimental results show that the hash trie filter outperforms the well-known (k + s) q-samples filter in both response time and precision, across different query pattern lengths and different error levels.
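To make the problem statement concrete, the following is a minimal sketch of approximate string matching with k errors. It uses the classical dynamic-programming formulation usually attributed to Sellers, not the hash trie filter proposed in the paper; filtering methods typically discard most text positions first and apply a check of this kind only to the remaining candidates. The function name `k_error_matches` and its parameters are illustrative assumptions, not taken from the paper.

```python
def k_error_matches(pattern, text, k):
    """Return the 1-based end positions in `text` where some substring
    matches `pattern` with at most k edit errors (substitution,
    insertion, or deletion).

    Illustrative sketch only: plain dynamic programming, not the
    paper's hash trie filter.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # old col[i-1], needed for the diagonal move
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra character in the text
                col[i - 1] + 1,    # pattern character missing from the text
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_error_matches("ACGT", "TTACGTTT", 0))
    print(k_error_matches("ACGT", "AGT", 1))
```

This check costs time proportional to the product of the pattern and text lengths, which is why a filter that shrinks the candidate set before verification matters for large genomic databases.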

Keywords

Approximate string matching, filter, hash trie

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Ye-In Chang (1)
  • Jiun-Rung Chen (1)
  • Min-Tze Hsu (1)
  1. Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan
