Abstract
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
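The two-stage structure (a cheap filter that preselects candidate m-tuples, followed by an exact comparison) can be illustrated with a minimal sketch. The abstract does not spell out the multiple-filtration technique itself, so the filter below is an assumption made for illustration: it uses the standard pigeonhole filter, in which each m-tuple is cut into k + 1 blocks and two m-tuples with at most k mismatches must agree exactly on at least one block. It is not the authors' method, and the names `hamming` and `k_mismatch_pairs` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_pairs(text, query, m, k):
    """Return sorted (i, j) pairs such that text[i:i+m] and query[j:j+m]
    differ in at most k positions.

    Illustrative pigeonhole filter plus verification; an assumption,
    not the multiple-filtration scheme of the paper.
    """
    if m <= 0 or k < 0:
        raise ValueError("m must be positive and k non-negative")
    n_text = len(text) - m + 1    # number of m-tuples in the text
    n_query = len(query) - m + 1  # number of m-tuples in the query
    if n_text <= 0 or n_query <= 0:
        return []
    if k >= m:
        # Every pair of m-tuples trivially has at most k mismatches.
        return [(i, j) for i in range(n_text) for j in range(n_query)]

    # Cut the m positions into k + 1 non-empty contiguous blocks.
    blocks = [(b * m // (k + 1), (b + 1) * m // (k + 1)) for b in range(k + 1)]

    # Stage 1 (filter): hash each block of every text m-tuple, then look up
    # the same block of every query m-tuple. With at most k mismatches,
    # at least one of the k + 1 blocks must match exactly.
    candidates = set()
    for lo, hi in blocks:
        index = defaultdict(list)
        for i in range(n_text):
            index[text[i + lo:i + hi]].append(i)
        for j in range(n_query):
            for i in index.get(query[j + lo:j + hi], ()):
                candidates.add((i, j))

    # Stage 2 (verification): compare each surviving pair exactly.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )
```

For example, `k_mismatch_pairs("ACGT", "ACGA", 4, 1)` returns `[(0, 0)]`, because the two 4-tuples differ only in their last letter. With q = m the query contains a single m-tuple, which matches the classical k-mismatches setting mentioned above. The filter never discards a true match, so the cost of the approach depends on how many false candidates reach the verification stage.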
Additional information
Communicated by E. W. Myers.
This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institutes of Health under Grant No. GM-36230.
Cite this article
Pevzner, P.A., Waterman, M.S. Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995). https://doi.org/10.1007/BF01188584
DOI: https://doi.org/10.1007/BF01188584