Multiple filtration and approximate pattern matching

Pevzner, P. A.; Waterman, M. S.

doi:10.1007/BF01188584

Published: February 1995

Algorithmica, Volume 13, pages 135–154 (1995)

P. A. Pevzner^1,2 &
M. S. Waterman^1,3

Abstract

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
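
The two-stage idea (filter, then verify) can be illustrated with a minimal Python sketch. It does not reproduce the multiple filtration technique of the paper, whose details are not given in the abstract. As an assumption, it swaps in a simple pigeonhole filter: if two m-tuples differ in at most k positions and each is cut into k + 1 blocks at the same offsets, at least one block must match exactly. The function names `hamming`, `block_bounds`, and `approx_tuple_matches` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(m, k):
    """Split range(m) into k + 1 contiguous blocks of near-equal size."""
    parts = k + 1
    return [(p * m // parts, (p + 1) * m // parts) for p in range(parts)]


def approx_tuple_matches(text, query, m, k):
    """Return sorted (i, j) with hamming(text[i:i+m], query[j:j+m]) <= k.

    Assumption: a pigeonhole filter stands in for the paper's filtration.
    """
    if m <= k:
        raise ValueError("this sketch assumes m > k so every block is non-empty")
    bounds = block_bounds(m, k)

    # Stage 1 (filtration): hash every block of every m-tuple of the text,
    # keyed by block number and block content.
    index = defaultdict(list)
    for i in range(len(text) - m + 1):
        for p, (lo, hi) in enumerate(bounds):
            index[(p, text[i + lo:i + hi])].append(i)

    # A pair (i, j) becomes a candidate if some block matches exactly
    # at the same offset inside both m-tuples.
    candidates = set()
    for j in range(len(query) - m + 1):
        for p, (lo, hi) in enumerate(bounds):
            for i in index.get((p, query[j + lo:j + hi]), ()):
                candidates.add((i, j))

    # Stage 2 (verification): compare the preselected m-tuples exactly.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )
```

For example, `approx_tuple_matches("ACGT", "ACGA", 4, 1)` returns `[(0, 0)]`: the two 4-tuples share the block `AC` and differ in one position. The filter never misses a true match, so the verification stage only has to discard false candidates.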

Author information

Authors and Affiliations

Department of Mathematics, University of Southern California, 90084-1113, Los Angeles, CA, USA
P. A. Pevzner & M. S. Waterman
Computer Science Department, The Pennsylvania State University, 16802, University Park, PA, USA
P. A. Pevzner
Department of Molecular Biology, University of Southern California, 90089-1113, Los Angeles, CA, USA
M. S. Waterman

Additional information

Communicated by E. W. Myers.

This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institute of Health under Grant No. GM-36230.

About this article

Cite this article

Pevzner, P.A., Waterman, M.S. Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995). https://doi.org/10.1007/BF01188584

Received: 06 August 1992
Revised: 09 February 1993
Issue Date: February 1995
DOI: https://doi.org/10.1007/BF01188584
