Volume 13, Issue 1–2, pp 135–154

Multiple filtration and approximate pattern matching

  • P. A. Pevzner
  • M. S. Waterman


Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
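The two-stage idea (filter cheaply, then verify exactly) can be illustrated with a minimal Python sketch. This is not the multiple-filtration technique of the paper, whose details the abstract does not give. It swaps in a simpler, well-known pigeonhole filter: if two m-tuples differ in at most k positions and each is cut into k + 1 blocks, at least one block must match exactly, so hashing the blocks preselects candidate pairs. The names `hamming` and `k_mismatch_pairs` are hypothetical, and the sketch assumes 0 <= k < m.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_pairs(text, query, m, k):
    """Return sorted (i, j) with hamming(text[i:i+m], query[j:j+m]) <= k.

    Illustrative pigeonhole filter, not the paper's multiple filtration.
    """
    if not 0 <= k < m:
        raise ValueError("this sketch assumes 0 <= k < m")

    # Cut every m-tuple into k + 1 consecutive, non-empty blocks.
    bounds = [b * m // (k + 1) for b in range(k + 2)]
    blocks = list(zip(bounds, bounds[1:]))

    # Filtration, part 1: hash each block of each query m-tuple.
    index = defaultdict(list)
    for j in range(len(query) - m + 1):
        for b, (lo, hi) in enumerate(blocks):
            index[b, query[j + lo:j + hi]].append(j)

    hits = set()
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        # Filtration, part 2: keep only query positions that share
        # at least one exactly matching block with this text m-tuple.
        candidates = set()
        for b, (lo, hi) in enumerate(blocks):
            candidates.update(index.get((b, window[lo:hi]), ()))
        # Verification: exact mismatch count on the surviving pairs.
        for j in candidates:
            if hamming(window, query[j:j + m]) <= k:
                hits.add((i, j))
    return sorted(hits)
```

Calling `k_mismatch_pairs(text, query, m, k)` with a query of length exactly m corresponds to the case q = m, that is, classical string matching with k mismatches. The filter never discards a true match, so the verification stage only has to remove false candidates.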

Key words

String matching, Computational molecular biology





Copyright information

© Springer-Verlag New York Inc. 1995

Authors and Affiliations

  • P. A. Pevzner (1, 2)
  • M. S. Waterman (1, 3)
  1. Department of Mathematics, University of Southern California, Los Angeles, USA
  2. Computer Science Department, The Pennsylvania State University, University Park, USA
  3. Department of Molecular Biology, University of Southern California, Los Angeles, USA
