Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
Key words: String matching, Computational molecular biology
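The two-stage structure described in the abstract (a cheap hashing filter followed by exact verification) can be illustrated with a minimal sketch. The abstract does not specify how multiple filtration works, so the filter below is a stand-in: a simple pigeonhole partition filter, which relies on the fact that two m-tuples with at most k mismatches must agree exactly on at least one of k + 1 contiguous blocks. The names `similar_tuples`, `block_bounds`, and `hamming` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(m, k):
    """Split range(m) into k + 1 contiguous blocks of near-equal size."""
    parts = k + 1
    return [(b * m // parts, (b + 1) * m // parts) for b in range(parts)]


def similar_tuples(text, query, m, k):
    """Return sorted (i, j) such that text[i:i+m] and query[j:j+m]
    differ in at most k positions.

    Assumption: stage 1 is a pigeonhole block filter, not the paper's
    multiple filtration technique.
    """
    if m <= 0 or k < 0:
        raise ValueError("need m > 0 and k >= 0")
    bounds = block_bounds(m, k)

    # Stage 1 (filtration): hash every block of every query m-tuple.
    index = defaultdict(list)
    for j in range(len(query) - m + 1):
        for b, (s, e) in enumerate(bounds):
            index[(b, query[j + s:j + e])].append(j)

    # A pair (i, j) becomes a candidate if at least one block matches exactly.
    candidates = set()
    for i in range(len(text) - m + 1):
        for b, (s, e) in enumerate(bounds):
            for j in index.get((b, text[i + s:i + e]), ()):
                candidates.add((i, j))

    # Stage 2 (verification): compare each candidate pair exactly.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )
```

For example, `similar_tuples("AAAA", "AAAT", 4, 1)` returns `[(0, 0)]`, while the same call with `k=0` returns `[]`. The filter never discards a true match, so the result agrees with a brute-force comparison of all pairs; only the number of pairs passed to the verification stage changes.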