CPM 1992: Combinatorial Pattern Matching pp 175-184

Theoretical and empirical comparisons of approximate string matching algorithms

• William I. Chang
• Jordan Lampe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 644)

Abstract

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.

Keywords

Edit Distance Suffix Tree Longe Common Subsequence Alphabet Size Longe Common Subsequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

1. 1.
R. Arratia and M.S. Waterman, Critical Phenomena in Sequence Matching, The Annals of Probability 13:4(1985), pp. 1236–1249.Google Scholar
2. 2.
W.I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, computer program, 1990.Google Scholar
3. 3.
W.I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, U.C. Berkeley, August 1991.Google Scholar
4. 4.
W.I. Chang and E.L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 116–124.Google Scholar
5. 5.
W.I. Chang and E.L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), abstract in Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.Google Scholar
6. 6.
V. Chvátal and D. Sankoff, Longest Common Subsequences of Two Random Sequences, Technical Report STAN-CS-75-477, Stanford University, Computer Science Department, 1975.Google Scholar
7. 7.
J. Deken, Some Limit Results for Longest Common Subsequences, Discrete Mathematics 26(1979), pp. 17–31. J. Applied Prob. 12(1975), pp. 306–315.Google Scholar
8. 8.
Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, Journal of Complexity 4(1988), pp. 33–72.Google Scholar
9. 9.
Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19:6(1990), pp. 989–999.Google Scholar
10. 10.
Z. Galil and K. Park, Dynamic Programming with Convexity, Concavity, and Sparsity, manuscript, October 1990.Google Scholar
11. 11.
D. Gusfield, K. Balasubramanian, J. Bronder, D. Mayfield, D. Naor, Paral: A Method and Computer Package for Optimal String Alignment using Variable Weights, in preparation.Google Scholar
12. 12.
D. Gusfield, K. Balasubramanian and D. Naor, Parametric Optimization of Sequence Alignment, submitted.Google Scholar
13. 13.
P.A.V. Hall and G.R. Dowling, Approximate String Matching, Computing Surveys 12:4(1980), pp. 381–402.Google Scholar
14. 14.
D. Harel and R.E. Tarjan, Fast Algorithms for Finding Nearest Common Ancestors, SIAM J. Comput. 13(1984), pp. 338–355.Google Scholar
15. 15.
N.I. Johnson and S. Kotz, Distributions in Statistics: Discrete Distributions, Houghton Mifflin Company (1969).Google Scholar
16. 16.
P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, manuscript, October 1990.Google Scholar
17. 17.
S. Karlin, F. Ost, and B.E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 133–157.Google Scholar
18. 18.
R.M. Karp, Probabilistic Analysis of Algorithms, lecture notes, U.C. Berkeley (Spring 1988; Fall 1989).Google Scholar
19. 19.
G.M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. Sys. Sci. 37(1988), pp. 63–78.Google Scholar
20. 20.
G.M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10(1989), pp. 157–169.Google Scholar
21. 21.
G.M. Landau, U. Vishkin, and R. Nussinov, Locating alignments with k differences for nucleotide and amino acid sequences, CABIOS 4:1(1988), pp. 19–24.Google Scholar
22. 22.
V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6(1966), pp. 126–136.Google Scholar
23. 23.
E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23:2 (1976), pp. 262–272.Google Scholar
24. 24.
U. Manber and S. Wu, Approximate String Matching with Arbitrary Costs for Text and Hypertext, manuscript, February 1990.Google Scholar
25. 25.
E.W. Myers, An O(ND) Difference Algorithm and Its Variations, Algorithmica 1(1986), pp. 252–266.Google Scholar
26. 26.
E.W. Myers, Incremental Alignment Algorithms and Their Applications, SIAM J. Comput., accepted for publication.Google Scholar
27. 27.
D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983).Google Scholar
28. 28.
D. Sankoff and S. Mainville, Common Subsequences and Monotone Subsequences, in D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983), pp. 363–365.Google Scholar
29. 29.
B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17:6(1988), pp. 1253–1262.Google Scholar
30. 30.
P.H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1(1980), pp. 359–373.Google Scholar
31. 31.
J. Tarhio and E. Ukkonen, Approximate Boyer-Moore String Matching, Report A-1990-3, Dept. of Computer Science, University of Helsinki, March 1990.Google Scholar
32. 32.
E. Ukkonen, Algorithms for Approximate String Matching, Inf. Contr. 64(1985), pp. 100–118.Google Scholar
33. 33.
E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6(1985), pp. 132–137.Google Scholar
34. 34.
35. 35.
E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Dept. of Computer Science, University of Helsinki, April 1990.Google Scholar
36. 36.
M.S. Waterman, Sequence Alignments, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 53–92.Google Scholar
37. 37.
M.S. Waterman, L. Gordon, and R. Arratia, Phase transitions in sequence matches and nucleic acid structure, Proc. Natl. Acad. Sci. USA 84(1987), pp. 1239–1243.Google Scholar