Sequences II, pp. 300–312

Approximate string-matching and the q-gram distance

  • Esko Ukkonen

Abstract

Some results are summarized on approximate string-matching with a string distance function that is computable in linear time and is based on the so-called q-grams (‘n-grams’). An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern P, |P| = m, in text T, |T| = n, in time O(n log(m − q)). The occurrences with distance ≤ k can be found in time O(n log k). This should be compared to the edit-distance-based k-differences problem, for which the best algorithm currently known needs O(kn). The q-gram distance yields a lower bound for the unit-cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
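
The abstract does not spell out the distance itself, so the following is only a minimal sketch under a stated assumption: the q-gram distance is taken to be the L1 distance between the q-gram count profiles of two strings, which is the usual definition. The function names (qgram_profile, qgram_distance, edit_distance) are hypothetical and chosen for illustration; the matching algorithm with the O(n log(m − q)) bound is not reproduced here. Under this definition, a single insertion, deletion, or substitution can change the profile difference by at most 2q, so the q-gram distance divided by 2q is a lower bound on the unit-cost edit distance.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Count every substring of length q in s (empty profile if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Assumed definition: L1 distance between the q-gram profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # Counter returns 0 for q-grams that are missing from one of the strings.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance by the textbook quadratic dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete cx
                            curr[j - 1] + 1,             # insert cy
                            prev[j - 1] + (cx != cy)))   # match or substitute
        prev = curr
    return prev[-1]


def respects_lower_bound(x: str, y: str, q: int) -> bool:
    """Check that qgram_distance(x, y, q) <= 2 * q * edit_distance(x, y)."""
    return qgram_distance(x, y, q) <= 2 * q * edit_distance(x, y)
```

Building both profiles with hashing takes expected time linear in the total length of the two strings, in keeping with the linear-time claim above, whereas the dynamic program is quadratic. Note that the distance can be 0 for distinct strings (for q = 1, "ab" and "ba" have the same profile), so it serves as a filter rather than as a replacement for the edit distance.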

Keywords

Edit Distance; String Match; Suffix Tree; Difference Problem; Approximate String Match

Copyright information

© Springer-Verlag New York, Inc. 1993

Authors and Affiliations

  • Esko Ukkonen (1)
  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
