Abstract
The classical approximate string-matching problem of finding the locations of approximate occurrences P′ of pattern string P in text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m^2 q + size of the output). Here n = |T|, m = |P|, and q varies depending on the problem instance between 0 and n. In the case of the unit cost edit distance it is shown that q = O(min(n, m^(k+1) |Σ|^k)), where Σ is the alphabet.
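The idea of running dynamic programming over a tree of the suffixes of T can be illustrated with a small sketch. It is not the algorithm of the paper: it assumes a naive, quadratic-size suffix trie instead of a suffix tree, uses no suffix links, and does not achieve the running times stated above. The names `build_suffix_trie` and `approx_search` are hypothetical, chosen for this illustration only. Each trie path spells a substring of T; one column of the unit-cost edit distance table is computed per trie edge, and a branch is abandoned as soon as every entry of its column exceeds k.

```python
def build_suffix_trie(text):
    """Naive suffix trie (illustration only, O(n^2) nodes in the worst case).

    Each node stores the start positions of all suffixes passing through it.
    """
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i)
    return root


def approx_search(text, pattern, k):
    """Return sorted (start, end) pairs such that the non-empty substring
    text[start:end] has unit-cost edit distance <= k from pattern."""
    m = len(pattern)
    root = build_suffix_trie(text)
    hits = set()

    def visit(node, depth, column):
        # column[i] = edit distance between pattern[:i] and the string
        # spelled by the path from the root to this node.
        if column[m] <= k:
            for start in node["starts"]:
                hits.add((start, start + depth))
        if min(column) > k:
            return  # column minima never decrease, so no extension can match
        for ch, child in node["children"].items():
            new = [column[0] + 1]
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == ch else 1
                new.append(min(column[i - 1] + cost,  # match or substitution
                               column[i] + 1,         # extra text character
                               new[i - 1] + 1))       # extra pattern character
            visit(child, depth + 1, new)

    visit(root, 0, list(range(m + 1)))
    return sorted(hits)
```

With k = 0 the search degenerates to exact matching: `approx_search("abracadabra", "abra", 0)` returns `[(0, 4), (7, 11)]`. Because identical substrings of T share one trie path, each distinct column is computed only once per path, which is the kind of saving that the preprocessing of T is meant to provide.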
This work was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany).
© 1993 Springer-Verlag Berlin Heidelberg
Cite this paper
Ukkonen, E. (1993). Approximate string-matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science, vol 684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0029808
DOI: https://doi.org/10.1007/BFb0029808
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56764-6
Online ISBN: 978-3-540-47732-7
eBook Packages: Springer Book Archive