# Theoretical and empirical comparisons of approximate string matching algorithms

## Abstract

We study in depth a model of non-exact pattern matching based on *edit distance*, the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the *k differences approximate string matching problem* specifies a text string of length *n*, a pattern string of length *m*, and the number *k* of differences (substitutions, insertions, deletions) allowed in a *match*, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various *O(kn)* algorithms based on dynamic programming (DP), paying particular attention to their dependence on the alphabet size *b*. An empirical observation on the average values of the DP table entries makes each algorithm's dependence on *b* apparent. We present a new algorithm that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table to prove that the expected running time of our algorithm (as well as that of an earlier “cut-off” algorithm due to Ukkonen) is *O(kn)* for random text. Furthermore, we give a heuristic argument that our algorithm is *O(kn/(√b-1))* on average when alphabet size is taken into consideration.
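To make the problem statement concrete, the following is a minimal Python sketch of the classical column-by-column DP for *k* differences, together with a cut-off variant in the spirit of the one the abstract attributes to Ukkonen. It is not the new algorithm announced above, whose details the abstract does not give. The function names, the 0-based end-position convention, and the capping of entries at *k*+1 are assumptions made for illustration.

```python
def k_differences_full(pattern, text, k):
    """Plain O(mn) DP: return 0-based text indices where a match
    with at most k differences ends."""
    m = len(pattern)
    col = list(range(m + 1))          # column for the empty text prefix: D[i][0] = i
    ends = []
    for j, c in enumerate(text):
        diag = col[0]                 # D[0][j-1]; row 0 is always 0 (match may start anywhere)
        for i in range(1, m + 1):
            up_left = diag            # D[i-1][j-1]
            diag = col[i]             # D[i][j-1], saved before it is overwritten
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(up_left + cost, diag + 1, col[i - 1] + 1)
        if col[m] <= k:
            ends.append(j)
    return ends


def k_differences_cutoff(pattern, text, k):
    """Same output, but each column is computed only down to one row
    below the last row whose value was <= k in the previous column.
    Entries known to exceed k are stored as k + 1 (an assumption of
    this sketch, used as a stand-in for 'too large')."""
    m = len(pattern)
    cap = k + 1
    col = [min(i, cap) for i in range(m + 1)]
    last = min(k, m)                  # last row with value <= k
    ends = []
    for j, c in enumerate(text):
        top = min(last + 1, m)        # rows below top must exceed k
        diag = col[0]
        for i in range(1, top + 1):
            up_left = diag
            diag = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(up_left + cost, diag + 1, col[i - 1] + 1, cap)
        last = top
        while col[last] > k:          # stops at row 0 at the latest, since col[0] == 0
            last -= 1
        if last == m:
            ends.append(j)
    return ends


print(k_differences_cutoff("ab", "cabab", 0))  # exact matching when k = 0
```

The cut-off relies on the fact that DP values never decrease along a diagonal, so any row more than one below the last row with value at most *k* must exceed *k* in the next column. The amount of work saved therefore depends on how quickly column values grow past *k*, which is where alphabet size enters.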

## Keywords

Edit Distance, Suffix Tree, Longest Common Subsequence, Alphabet Size

## References

- 1. R. Arratia and M.S. Waterman, Critical Phenomena in Sequence Matching, *The Annals of Probability* 13:4 (1985), pp. 1236–1249.
- 2. W.I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, computer program, 1990.
- 3. W.I. Chang, *Approximate Pattern Matching and Biological Applications*, Ph.D. thesis, U.C. Berkeley, August 1991.
- 4. W.I. Chang and E.L. Lawler, Approximate String Matching in Sublinear Expected Time, *Proc. 31st Annual IEEE Symposium on Foundations of Computer Science*, St. Louis, MO, October 1990, pp. 116–124.
- 5. W.I. Chang and E.L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), abstract in *Human Genome II Official Program and Abstracts*, San Diego, CA, Oct. 22–24, 1990, p. 24.
- 6. V. Chvátal and D. Sankoff, Longest Common Subsequences of Two Random Sequences, *Technical Report STAN-CS-75-477*, Stanford University, Computer Science Department, 1975; *J. Applied Prob.* 12 (1975), pp. 306–315.
- 7. J. Deken, Some Limit Results for Longest Common Subsequences, *Discrete Mathematics* 26 (1979), pp. 17–31.
- 8. Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, *Journal of Complexity* 4 (1988), pp. 33–72.
- 9. Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, *SIAM J. Comput.* 19:6 (1990), pp. 989–999.
- 10. Z. Galil and K. Park, Dynamic Programming with Convexity, Concavity, and Sparsity, manuscript, October 1990.
- 11. D. Gusfield, K. Balasubramanian, J. Bronder, D. Mayfield, D. Naor, Paral: A Method and Computer Package for Optimal String Alignment using Variable Weights, in preparation.
- 12. D. Gusfield, K. Balasubramanian and D. Naor, Parametric Optimization of Sequence Alignment, submitted.
- 13. P.A.V. Hall and G.R. Dowling, Approximate String Matching, *Computing Surveys* 12:4 (1980), pp. 381–402.
- 14. D. Harel and R.E. Tarjan, Fast Algorithms for Finding Nearest Common Ancestors, *SIAM J. Comput.* 13 (1984), pp. 338–355.
- 15. N.L. Johnson and S. Kotz, *Distributions in Statistics: Discrete Distributions*, Houghton Mifflin Company (1969).
- 16. P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, manuscript, October 1990.
- 17. S. Karlin, F. Ost, and B.E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M.S. Waterman, ed., *Mathematical Methods for DNA Sequences*, CRC Press (1989), pp. 133–157.
- 18. R.M. Karp, *Probabilistic Analysis of Algorithms*, lecture notes, U.C. Berkeley (Spring 1988; Fall 1989).
- 19. G.M. Landau and U. Vishkin, Fast String Matching with k Differences, *J. Comp. Sys. Sci.* 37 (1988), pp. 63–78.
- 20. G.M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, *J. Algorithms* 10 (1989), pp. 157–169.
- 21. G.M. Landau, U. Vishkin, and R. Nussinov, Locating alignments with k differences for nucleotide and amino acid sequences, *CABIOS* 4:1 (1988), pp. 19–24.
- 22. V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, *Soviet Phys. Dokl.* 6 (1966), pp. 126–136.
- 23. E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, *J. ACM* 23:2 (1976), pp. 262–272.
- 24. U. Manber and S. Wu, Approximate String Matching with Arbitrary Costs for Text and Hypertext, manuscript, February 1990.
- 25. E.W. Myers, An O(ND) Difference Algorithm and Its Variations, *Algorithmica* 1 (1986), pp. 252–266.
- 26. E.W. Myers, Incremental Alignment Algorithms and Their Applications, *SIAM J. Comput.*, accepted for publication.
- 27. D. Sankoff and J.B. Kruskal, eds., *Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison*, Addison-Wesley (1983).
- 28. D. Sankoff and S. Mainville, Common Subsequences and Monotone Subsequences, in D. Sankoff and J.B. Kruskal, eds., *Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison*, Addison-Wesley (1983), pp. 363–365.
- 29. B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, *SIAM J. Comput.* 17:6 (1988), pp. 1253–1262.
- 30. P.H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, *J. Algorithms* 1 (1980), pp. 359–373.
- 31. J. Tarhio and E. Ukkonen, Approximate Boyer-Moore String Matching, Report A-1990-3, Dept. of Computer Science, University of Helsinki, March 1990.
- 32. E. Ukkonen, Algorithms for Approximate String Matching, *Inf. Contr.* 64 (1985), pp. 100–118.
- 33. E. Ukkonen, Finding Approximate Patterns in Strings, *J. Algorithms* 6 (1985), pp. 132–137.
- 34. E. Ukkonen, personal communications.
- 35. E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Dept. of Computer Science, University of Helsinki, April 1990.
- 36. M.S. Waterman, Sequence Alignments, in M.S. Waterman, ed., *Mathematical Methods for DNA Sequences*, CRC Press (1989), pp. 53–92.
- 37. M.S. Waterman, L. Gordon, and R. Arratia, Phase transitions in sequence matches and nucleic acid structure, *Proc. Natl. Acad. Sci. USA* 84 (1987), pp. 1239–1243.