Theoretical and empirical comparisons of approximate string matching algorithms

Chang, William I.; Lampe, Jordan

doi:10.1007/3-540-56024-6_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 644)

Included in the following conference series: Annual Symposium on Combinatorial Pattern Matching

Abstract

We study in depth a model of non-exact pattern matching based on edit distance, the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to their dependence on the alphabet size b. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.
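
As a rough illustration of the k differences problem defined in the abstract, the Python sketch below shows a plain column-by-column DP and a cut-off variant in the spirit of the one the abstract attributes to Ukkonen. It is not the new algorithm of the paper, whose details the abstract does not give. The function names, the 1-based end-position convention, and the cell counter are assumptions made for this example, and k is assumed to be non-negative.

```python
def k_differences_full(pattern, text, k):
    """Plain O(mn) DP. Returns 1-based end positions j such that some
    substring of text ending at j is within k differences of pattern."""
    m = len(pattern)
    col = list(range(m + 1))      # column 0: D[i][0] = i
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]        # D[0][j-1] = 0: a match may start anywhere
        for i in range(1, m + 1):
            left = col[i]         # D[i][j-1]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # (mis)match
                         left + 1,                           # extra text symbol
                         col[i - 1] + 1)                     # extra pattern symbol
            prev_diag = left
        if col[m] <= k:
            ends.append(j)
    return ends


def k_differences_cutoff(pattern, text, k):
    """Cut-off variant: in each column, only rows up to one past the last
    row whose value was <= k in the previous column are computed.
    Returns (end positions, number of DP cells computed)."""
    m = len(pattern)
    big = k + 1                   # stands for "more than k"; exact value unneeded
    col = [i if i <= k else big for i in range(m + 1)]
    last = min(k, m)              # last row with value <= k in the current column
    ends, cells = [], 0
    for j, c in enumerate(text, start=1):
        top = min(last + 1, m)    # values cannot drop to <= k below this row
        prev_diag = col[0]
        for i in range(1, top + 1):
            left = col[i] if i <= last else big
            val = min(prev_diag + (pattern[i - 1] != c),
                      left + 1,
                      col[i - 1] + 1)
            prev_diag = left
            col[i] = min(val, big)
            cells += 1
        last = top
        while col[last] > k:      # col[0] == 0, so this loop stops at row 0
            last -= 1
        if last == m:
            ends.append(j)
    return ends, cells
```

Both functions should report the same end positions; the second also returns how many table entries it filled in, which can be compared against m·n for alphabets of different sizes to get a feel for the dependence on b that the abstract discusses.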

This research was conducted at the University of California, Berkeley, and was supported in part by Department of Energy grant DE-FG03-90ER60999.

References

R. Arratia and M.S. Waterman, Critical Phenomena in Sequence Matching, The Annals of Probability 13:4(1985), pp. 1236–1249.
W.I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, computer program, 1990.
W.I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, U.C. Berkeley, August 1991.
W.I. Chang and E.L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 116–124.
W.I. Chang and E.L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), abstract in Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.
V. Chvátal and D. Sankoff, Longest Common Subsequences of Two Random Sequences, Technical Report STAN-CS-75-477, Stanford University, Computer Science Department, 1975; J. Applied Prob. 12(1975), pp. 306–315.
J. Deken, Some Limit Results for Longest Common Subsequences, Discrete Mathematics 26(1979), pp. 17–31.
Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, Journal of Complexity 4(1988), pp. 33–72.
Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19:6(1990), pp. 989–999.
Z. Galil and K. Park, Dynamic Programming with Convexity, Concavity, and Sparsity, manuscript, October 1990.
D. Gusfield, K. Balasubramanian, J. Bronder, D. Mayfield, D. Naor, Paral: A Method and Computer Package for Optimal String Alignment using Variable Weights, in preparation.
D. Gusfield, K. Balasubramanian and D. Naor, Parametric Optimization of Sequence Alignment, submitted.
P.A.V. Hall and G.R. Dowling, Approximate String Matching, Computing Surveys 12:4(1980), pp. 381–402.
D. Harel and R.E. Tarjan, Fast Algorithms for Finding Nearest Common Ancestors, SIAM J. Comput. 13(1984), pp. 338–355.
N.L. Johnson and S. Kotz, Distributions in Statistics: Discrete Distributions, Houghton Mifflin Company (1969).
P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, manuscript, October 1990.
S. Karlin, F. Ost, and B.E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 133–157.
R.M. Karp, Probabilistic Analysis of Algorithms, lecture notes, U.C. Berkeley (Spring 1988; Fall 1989).
G.M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. Sys. Sci. 37(1988), pp. 63–78.
G.M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10(1989), pp. 157–169.
G.M. Landau, U. Vishkin, and R. Nussinov, Locating alignments with k differences for nucleotide and amino acid sequences, CABIOS 4:1(1988), pp. 19–24.
V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6(1966), pp. 126–136.
E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23:2 (1976), pp. 262–272.
U. Manber and S. Wu, Approximate String Matching with Arbitrary Costs for Text and Hypertext, manuscript, February 1990.
E.W. Myers, An O(ND) Difference Algorithm and Its Variations, Algorithmica 1(1986), pp. 252–266.
E.W. Myers, Incremental Alignment Algorithms and Their Applications, SIAM J. Comput., accepted for publication.
D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983).
D. Sankoff and S. Mainville, Common Subsequences and Monotone Subsequences, in D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983), pp. 363–365.
B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17:6(1988), pp. 1253–1262.
P.H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1(1980), pp. 359–373.
J. Tarhio and E. Ukkonen, Approximate Boyer-Moore String Matching, Report A-1990-3, Dept. of Computer Science, University of Helsinki, March 1990.
E. Ukkonen, Algorithms for Approximate String Matching, Inf. Contr. 64(1985), pp. 100–118.
E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6(1985), pp. 132–137.
E. Ukkonen, personal communications.
E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Dept. of Computer Science, University of Helsinki, April 1990.
M.S. Waterman, Sequence Alignments, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 53–92.
M.S. Waterman, L. Gordon, and R. Arratia, Phase transitions in sequence matches and nucleic acid structure, Proc. Natl. Acad. Sci. USA 84(1987), pp. 1239–1243.

Author information

Authors and Affiliations

William I. Chang: Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724
Jordan Lampe: University of Washington, Seattle, WA 98195

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

Cite this paper

Chang, W.I., Lampe, J. (1992). Theoretical and empirical comparisons of approximate string matching algorithms. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_14

DOI: https://doi.org/10.1007/3-540-56024-6_14
Published: 04 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive
