Theoretical and empirical comparisons of approximate string matching algorithms

  • William I. Chang
  • Jordan Lampe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 644)

Abstract

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to their dependence on the alphabet size b. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.
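
To make the problem statement concrete, the following is a minimal sketch of the column-by-column DP for k differences matching, together with a cut-off variant in the spirit of the algorithm the abstract attributes to Ukkonen. It is not the new algorithm presented in the paper, which the abstract does not describe. The function names, the convention of reporting 1-based end positions, and the use of k + 1 as a stand-in for any value above k are assumptions made for illustration.

```python
def k_differences(pattern, text, k):
    """Baseline DP: return 1-based end positions j in text where some
    substring ending at j is within k differences of pattern.
    Computes every cell of the (m + 1) x (n + 1) table, one column at a time."""
    m = len(pattern)
    col = list(range(m + 1))      # column j = 0: D[i][0] = i
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]        # D[0][j-1], always 0
        col[0] = 0                # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]          # D[i][j-1]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # substitute or match
                         old + 1,                            # extra text symbol
                         col[i - 1] + 1)                     # extra pattern symbol
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def k_differences_cutoff(pattern, text, k):
    """Cut-off variant (sketch): compute each column only down to one row
    below the last row whose value in the previous column was <= k.
    Values along a diagonal never decrease, so every skipped cell exceeds k."""
    m = len(pattern)
    col = list(range(m + 1))
    last = min(k, m)              # largest row with col[row] <= k
    ends = []
    for j, c in enumerate(text, start=1):
        top = min(last + 1, m)    # deepest row that can still be <= k
        prev_diag = col[0]
        col[0] = 0
        for i in range(1, top + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),
                         old + 1,
                         col[i - 1] + 1)
            prev_diag = old
        last = top
        while col[last] > k:      # stops at row 0 at the latest, since col[0] = 0
            last -= 1
        if last == m:
            ends.append(j)
        else:
            col[last + 1] = k + 1  # mark the first inactive row as "more than k"
    return ends


print(k_differences("abc", "abd", 1))          # [2, 3]
print(k_differences_cutoff("abc", "abd", 1))   # [2, 3]
```

Both functions return the same positions. The cut-off variant differs only in how many table entries it touches per column, which is the quantity the abstract's probabilistic analysis is concerned with.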

Keywords

Edit Distance, Suffix Tree, Longest Common Subsequence, Alphabet Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1992

Authors and Affiliations

  • William I. Chang (1)
  • Jordan Lampe (2)
  1. Cold Spring Harbor Laboratory, Cold Spring Harbor
  2. University of Washington, Seattle
