Gapped Local Similarity Search with Provable Guarantees

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3240)

Abstract

We present a program qhash, based on q-gram filtration and high-dimensional search, to find gapped local similarities between two sequences. Our approach differs from past q-gram-based approaches in two main aspects. Our filtration step uses algorithms for a sparse all-pairs problem, while past studies use suffix-tree-like structures and counters. Our program works in sequence-sequence mode, while most past ones (except QUASAR) work in pattern-database mode.
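To make the idea of q-gram filtration concrete, here is a minimal sketch in sequence-sequence mode. It is not the qhash implementation: the function names, the fixed window length `w`, and the brute-force comparison of every window pair are assumptions made for illustration. The threshold comes from the standard q-gram lemma, which says that two strings of length w within edit distance e share at least w - q + 1 - q*e q-grams. The paper's filtration step replaces the quadratic loop below with sparse all-pairs algorithms.

```python
from collections import Counter


def qgram_profile(s, q):
    """Multiset of all overlapping length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    # Counter '&' keeps the minimum count of each q-gram.
    return sum((qgram_profile(a, q) & qgram_profile(b, q)).values())


def qgram_lemma_threshold(w, q, e):
    """Lower bound on shared q-grams for two length-w strings
    whose edit distance is at most e (q-gram lemma)."""
    return w - q + 1 - q * e


def candidate_window_pairs(x, y, w, q, e, step=1):
    """Naive filter (hypothetical helper): report start positions (i, j)
    of length-w windows of x and y that share enough q-grams to possibly
    lie within edit distance e. Quadratic in the number of windows."""
    t = qgram_lemma_threshold(w, q, e)
    hits = []
    for i in range(0, len(x) - w + 1, step):
        for j in range(0, len(y) - w + 1, step):
            if shared_qgrams(x[i:i + w], y[j:j + w], q) >= t:
                hits.append((i, j))
    return hits
```

Window pairs that pass the filter are only candidates. A verification step, such as dynamic-programming alignment, would still be needed to confirm each one.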

We leverage existing research in high-dimensional proximity search to discuss sparse all-pairs algorithms, and show them to be subquadratic under certain reasonable input assumptions. Our qhash program has provable sensitivity (even on worst-case inputs) and average-case performance guarantees. It is significantly faster than a fully sensitive dynamic-programming-based program for strong similarity search on long sequences.
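One well-known technique from high-dimensional proximity search is min-wise hashing, which estimates the overlap between two sets from short signatures. The sketch below applies it to q-gram sets. The abstract does not say which proximity-search method qhash uses, so this is an illustrative assumption, and all function names are hypothetical.

```python
import hashlib


def qgram_set(s, q):
    """Set of distinct length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _seeded_hash(item, seed):
    """Deterministic 8-byte hash of a string under a given integer seed."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value
    over the (non-empty) set of items."""
    return [min(_seeded_hash(it, seed) for it in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this approximates
    the Jaccard similarity |A & B| / |A | B| of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Blocks whose signatures agree in many positions probably share many q-grams, so they can be grouped by signature instead of comparing every pair of blocks directly.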

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Narayanan, M., Karp, R.M. (2004). Gapped Local Similarity Search with Provable Guarantees. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_7

  • DOI: https://doi.org/10.1007/978-3-540-30219-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23018-2

  • Online ISBN: 978-3-540-30219-3

  • eBook Packages: Springer Book Archive
