Abstract
The edit distance problem is a classical, fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution computes the edit distance between a pair of strings of total length $O(N)$ in $O(N^2)$ time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings.
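For orientation, the quadratic baseline mentioned above can be sketched as follows. This is a minimal illustration rather than code from the paper: the function name `edit_distance` and the uniform `sub_cost` and `indel_cost` parameters are assumptions chosen for the example, and a matching pair of characters is charged zero.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Classical dynamic program; time O(len(a) * len(b)), space O(len(b)).

    sub_cost and indel_cost are illustrative uniform costs (an assumption);
    replacing a character with an identical one costs zero.
    """
    # prev[j] holds the distance between the empty prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # curr[0] is the cost of deleting the first i characters of a.
        curr = [i * indel_cost] + [0] * len(b)
        for j in range(1, len(b) + 1):
            match = 0 if a[i - 1] == b[j - 1] else sub_cost
            curr[j] = min(
                prev[j - 1] + match,       # replace (or keep) a[i-1]
                prev[j] + indel_cost,      # delete a[i-1]
                curr[j - 1] + indel_cost,  # insert b[j-1]
            )
        prev = curr
    return prev[len(b)]
```

With unit costs this computes the Levenshtein distance; for example, `edit_distance("kitten", "sitting")` evaluates to 3.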
As it turns out, practically all known $o(N^2)$ edit-distance algorithms work, in some sense, under the paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings compressed under any compression scheme. Put differently, can a single algorithm exploit the compressibility of strings under any compression method, even if each string is compressed using a different scheme? In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods.
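To make the notion concrete, a straight-line program can be viewed as a grammar in which every variable derives either a single character or the concatenation of two variables, so that the start variable derives exactly one string. The sketch below is only an illustration under that assumption: the dictionary encoding and the names `slp_length`, `slp_expand`, and `RULES` are hypothetical and not taken from the paper.

```python
def slp_length(rules, start):
    """Length of the string derived by `start`, without expanding it."""
    memo = {}

    def length(var):
        if var not in memo:
            rhs = rules[var]
            # A rule is either a terminal character or a pair of variables.
            memo[var] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return memo[var]

    return length(start)


def slp_expand(rules, start):
    """Fully decompress the string derived by `start`."""
    memo = {}

    def expand(var):
        if var not in memo:
            rhs = rules[var]
            memo[var] = rhs if isinstance(rhs, str) else expand(rhs[0]) + expand(rhs[1])
        return memo[var]

    return expand(start)


# Hypothetical example: 5 rules deriving a string of length 8.
RULES = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),  # ab
    "X4": ("X3", "X3"),  # abab
    "X5": ("X4", "X4"),  # abababab
}
```

Here `slp_expand(RULES, "X5")` yields `"abababab"`, while `slp_length(RULES, "X5")` returns 8 by visiting each of the 5 rules once. Repeated doubling of this kind is why the number of rules can be far smaller than the length of the derived string.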
For two strings of total length $N$ having straight-line program representations of total size $n$, we present an algorithm running in $O(nN \lg(N/n))$ time for computing the edit distance of these two strings under any rational scoring function, and an $O(n^{2/3} N^{4/3})$ time algorithm for arbitrary scoring functions. Our new result, while providing a speed-up for compressible strings, does not surpass the quadratic time bound even in the worst-case scenario.
Notes
Important exceptions of this list are statistical compressors such as Huffman or arithmetic coding, as well as compressions that are applied after a Burrows-Wheeler transformation.
We note that in most cases, including the Levenshtein distance [22], when $a_i = b_j$, the cost of replacing $a_i$ with $b_j$ is zero.
A matrix $M$ is totally monotone if for every $i$ and $j$ we have that if $M[i,j] \le M[i+1,j]$ then $M[i,j+1] \le M[i+1,j+1]$.
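The condition in this note can be checked directly on a small matrix. The sketch below tests exactly the implication as stated, on adjacent rows and columns; the function name `is_totally_monotone` and the list-of-lists representation are assumptions made for illustration.

```python
def is_totally_monotone(M):
    """Check the stated condition for every pair of adjacent rows and columns:
    if M[i][j] <= M[i+1][j] then M[i][j+1] <= M[i+1][j+1].
    """
    if not M or not M[0]:
        return True  # an empty matrix satisfies the condition vacuously
    rows, cols = len(M), len(M[0])
    for i in range(rows - 1):
        for j in range(cols - 1):
            if M[i][j] <= M[i + 1][j] and M[i][j + 1] > M[i + 1][j + 1]:
                return False
    return True
```

For instance, `[[1, 2], [3, 4]]` passes the check, whereas `[[1, 5], [2, 3]]` fails it because $1 \le 2$ holds in the first column but $5 \le 3$ does not hold in the second.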
References
Aggarwal, A., Klawe, M.M., Moran, S., Shor, P., Wilber, R.: Geometric applications of a matrix-searching algorithm. Algorithmica 2, 195–208 (1987)
Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. System Sci. 52(2), 299–307 (1996)
Amir, A., Landau, G.M., Sokol, D.: Inplace 2D matching in compressed images. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 853–862 (2003)
Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complexity 15(1), 4–16 (1999)
Apostolico, A., Atallah, M.J., Larmore, L.L., McFaddin, S.: Efficient parallel algorithms for string editing and related problems. SIAM J. Comput. 19(5), 968–988 (1990)
Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inform. Process. Lett. 83(6), 307–314 (2001)
Bille, P., Farach-Colton, M.: Fast and compact regular expression matching. Theoret. Comput. Sci. 409(3), 486–496 (2008)
Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inform. Process. Lett. 54, 93–96 (1995)
Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the First International Computer Science Symposium in Russia (CSR), pp. 127–136 (2006)
Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32, 1654–1673 (2003)
Gage, P.: A new algorithm for data compression. C Users J. 12, 23–38 (1994)
Gąsieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proceedings of the 4th Scandinavian Workshop on Algorithm Theory (SWAT), pp. 392–403 (1996)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hermelin, D., Landau, S., Landau, G.M., Weimann, O.: A unified algorithm for accelerating edit-distance via text-compression. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 529–540 (2009)
Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)
Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Symposium on Combinatorial Pattern Matching (CPM), pp. 195–209 (2000)
Karpinski, M., Rytter, W., Shinohara, A.: Pattern-matching for strings with short descriptions. In: Proceedings of the 6th Symposium on Combinatorial Pattern Matching (CPM), pp. 205–214 (1995)
Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298, 253–272 (2003)
Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002)
Lempel, A., Ziv, J.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)
Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)
Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 228–240 (2007)
Lifshits, Y., Lohrey, M.: Querying and embedding compressed texts. In: Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS), pp. 681–692 (2006)
Mäkinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–13 (1999)
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Symposium on Combinatorial Pattern Matching (CPM), pp. 31–49 (1994)
Masek, W.J., Paterson, M.S.: A faster algorithm computing string edit distances. J. Comput. System Sci. 20, 18–31 (1980)
Miyazaki, M., Shinohara, A., Takeda, M.: An improved pattern matching algorithm for strings in terms of straight-line programs. In: Proceedings of the 8th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–11 (1997)
Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15 (2007)
Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the 11th Data Compression Conference (DCC), pp. 459–468 (2001)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1–3), 211–222 (2003)
Schmidt, J.P.: All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Comput. 27(4), 972–992 (1998)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC), pp. 306–315 (2000)
Tiskin, A.: Faster subsequence recognition in compressed strings. Journal of Mathematical Sciences, to appear
Tiskin, A.: Fast distance multiplication of unit-Monge matrices. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1287–1296 (2010)
Wagner, R., Fischer, M.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22(1), 75–81 (1976)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)
Additional information
Gad M. Landau was partially supported by the National Science Foundation Award 0904246, Israel Science Foundation grant 347/09, Yahoo, Grant No. 2008217 from the United States-Israel Binational Science Foundation (BSF), and DFG.
About this article
Cite this article
Hermelin, D., Landau, G.M., Landau, S. et al. Unified Compression-Based Acceleration of Edit-Distance Computation. Algorithmica 65, 339–353 (2013). https://doi.org/10.1007/s00453-011-9590-6
DOI: https://doi.org/10.1007/s00453-011-9590-6