
Multiple filtration and approximate pattern matching

Algorithmica

Abstract

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
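The two-stage idea (filter first, then verify) can be illustrated with a minimal sketch. The filter below is not the multiple filtration technique of the paper, whose details the abstract does not give; it is the simpler, standard pigeonhole filter, swapped in for illustration: if two m-tuples differ by at most k mismatches, then at least one of k + 1 disjoint blocks of length l = floor(m / (k + 1)) matches exactly. The function names `hamming` and `k_mismatch_pairs` and the use of a dictionary as the hash table are assumptions made for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_pairs(text, query, m, k):
    """Return sorted (i, j) such that text[i:i+m] and query[j:j+m]
    differ by at most k mismatches.

    Stage 1 (filtration): hash exact l-tuples to preselect candidates.
    Stage 2 (verification): check each candidate by Hamming distance.
    This is a single pigeonhole filter, an assumption of this sketch,
    not the paper's multiple filtration.
    """
    n, q = len(text), len(query)
    if m > n or m > q:
        return []

    l = m // (k + 1)  # block length guaranteed to contain one exact match
    if l == 0:
        # The filter gives no information; fall back to comparing all pairs.
        return sorted(
            (i, j)
            for i in range(n - m + 1)
            for j in range(q - m + 1)
            if hamming(text[i:i + m], query[j:j + m]) <= k
        )

    # Hash table: l-tuple -> positions where it starts in the query.
    index = defaultdict(list)
    for j in range(q - l + 1):
        index[query[j:j + l]].append(j)

    # Stage 1: every exact l-tuple hit proposes up to k + 1 candidate pairs,
    # one for each block offset 0, l, ..., k*l the hit could occupy.
    candidates = set()
    for i in range(n - l + 1):
        for j in index.get(text[i:i + l], ()):
            for block in range(k + 1):
                si, sj = i - block * l, j - block * l
                if si >= 0 and sj >= 0 and si + m <= n and sj + m <= q:
                    candidates.add((si, sj))

    # Stage 2: accurate comparison of the preselected m-tuples.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )
```

With q = m the same function solves the classical k-mismatch string matching problem: every reported pair has j = 0, and the i values are the match positions in the text.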




Additional information

Communicated by E. W. Myers.

This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institutes of Health under Grant No. GM-36230.


About this article

Cite this article

Pevzner, P.A., Waterman, M.S. Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995). https://doi.org/10.1007/BF01188584


  • DOI: https://doi.org/10.1007/BF01188584
