Unified Compression-Based Acceleration of Edit-Distance Computation

Hermelin, Danny; Landau, Gad M.; Landau, Shir; Weimann, Oren

doi:10.1007/s00453-011-9590-6

Published: 17 November 2011

Algorithmica, Volume 65, pages 339–353 (2013)

Danny Hermelin¹, Gad M. Landau²,³, Shir Landau⁴ & Oren Weimann²

Abstract

The edit distance problem is a classical, fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings.
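For reference, the quadratic baseline mentioned above can be sketched as follows. This is a minimal textbook dynamic program, not the algorithm of this paper; the function name `edit_distance` and the parameters `sub_cost` and `indel_cost` are illustrative assumptions, with unit costs as defaults.

```python
from typing import Callable


def edit_distance(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float] = lambda x, y: 0 if x == y else 1,
    indel_cost: float = 1,
) -> float:
    """Textbook O(|a| * |b|) dynamic program (illustrative names and costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + indel_cost,           # delete x from a
                    curr[j - 1] + indel_cost,       # insert y into a
                    prev[j - 1] + sub_cost(x, y),   # replace x with y (or match)
                )
            )
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at any time, so the sketch uses linear space while still filling all of the quadratically many cells.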

As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods.
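To make the notion of a straight-line program concrete, here is a minimal sketch under an assumed encoding: each rule maps a variable either to a single character or to a pair of variables whose expansions are concatenated. The names `Rules`, `slp_length`, and `slp_expand` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: a rule is a single character (terminal) or a pair of
# variable names (concatenation of their expansions).
Rules = Dict[str, Union[str, Tuple[str, str]]]


def slp_length(rules: Rules, start: str) -> int:
    """Length of the string generated by `start`, without expanding it."""
    memo: Dict[str, int] = {}

    def length(v: str) -> int:
        if v not in memo:
            rhs = rules[v]
            memo[v] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return memo[v]

    return length(start)


def slp_expand(rules: Rules, start: str) -> str:
    """Fully decompress the string generated by `start`."""
    memo: Dict[str, str] = {}

    def expand(v: str) -> str:
        if v not in memo:
            rhs = rules[v]
            memo[v] = rhs if isinstance(rhs, str) else expand(rhs[0]) + expand(rhs[1])
        return memo[v]

    return expand(start)


# Six rules generate the 16-character string "abababababababab".
rules: Rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
    "X4": ("X3", "X3"),
}
```

In this toy instance the program has size 6 while the generated string has length 16; repeated doubling of this kind is what allows highly repetitive strings to have very small representations.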

For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^{2/3} N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a speedup for compressible strings, never exceeds the quadratic time bound, even in the worst case.
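The claim that neither bound exceeds the quadratic one can be checked with a short calculation. The sketch below assumes n ≤ N (a program larger than the strings it represents would be pointless) and takes lg to be the base-2 logarithm; the substitution x = N/n is introduced here only for illustration.

```latex
\begin{align*}
nN\lg\frac{N}{n} &= N^{2}\cdot\frac{\lg x}{x} \le N^{2},
  \qquad x = \frac{N}{n} \ge 1,\ \text{since } \lg x \le x, \\
n^{2/3}N^{4/3} &= N^{2}\left(\frac{n}{N}\right)^{2/3} \le N^{2}.
\end{align*}
```

For instance, with N = 2^20 and n = 2^10 the first bound evaluates to 10 · 2^30, against N² = 2^40, a gap of a factor of about 102 (ignoring the constants hidden by the O-notation).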

Notes

Important exceptions to this list are statistical compressors such as Huffman or arithmetic coding, as well as compressions that are applied after a Burrows-Wheeler transformation.
We note that in most cases, including the Levenshtein distance [22], when a_i = b_j, the cost of replacing a_i with b_j is zero.
A matrix M is totally monotone if for every i and j we have that if M[i,j]≤M[i+1,j] then M[i,j+1]≤M[i+1,j+1].
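The condition in the last note can be tested directly on a small matrix. The following is a brute-force sketch that checks the condition exactly as stated there, for every pair of adjacent rows and adjacent columns; the function name `is_totally_monotone` is an illustrative assumption.

```python
from typing import Sequence


def is_totally_monotone(M: Sequence[Sequence[float]]) -> bool:
    """Check: M[i][j] <= M[i+1][j] implies M[i][j+1] <= M[i+1][j+1]."""
    rows = len(M)
    cols = len(M[0]) if rows else 0
    for i in range(rows - 1):
        for j in range(cols - 1):
            premise = M[i][j] <= M[i + 1][j]
            conclusion = M[i][j + 1] <= M[i + 1][j + 1]
            if premise and not conclusion:
                return False
    return True
```

For example, `[[1, 2], [3, 4]]` satisfies the condition, whereas `[[1, 5], [2, 3]]` does not, because 1 ≤ 2 holds in the first column but 5 ≤ 3 fails in the second.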

References

1. Aggarwal, A., Klawe, M.M., Moran, S., Shor, P., Wilber, R.: Geometric applications of a matrix-searching algorithm. Algorithmica 2, 195–208 (1987)
2. Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. System Sci. 52(2), 299–307 (1996)
3. Amir, A., Landau, G.M., Sokol, D.: Inplace 2D matching in compressed images. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 853–862 (2003)
4. Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complexity 15(1), 4–16 (1999)
5. Apostolico, A., Atallah, M.J., Larmore, L.L., McFaddin, S.: Efficient parallel algorithms for string editing and related problems. SIAM J. Comput. 19(5), 968–988 (1990)
6. Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inform. Process. Lett. 83(6), 307–314 (2001)
7. Bille, P., Farach-Colton, M.: Fast and compact regular expression matching. Theoret. Comput. Sci. 409(3), 486–496 (2008)
8. Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inform. Process. Lett. 54, 93–96 (1995)
9. Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the First International Computer Science Symposium in Russia (CSR), pp. 127–136 (2006)
10. Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32, 1654–1673 (2003)
11. Gage, P.: A new algorithm for data compression. C Users J. 12, 23–38 (1994)
12. Gąsieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proceedings of the 4th Scandinavian Workshop on Algorithm Theory (SWAT), pp. 392–403 (1996)
13. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
14. Hermelin, D., Landau, S., Landau, G.M., Weimann, O.: A unified algorithm for accelerating edit-distance via text-compression. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 529–540 (2009)
15. Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)
16. Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Symposium on Combinatorial Pattern Matching (CPM), pp. 195–209 (2000)
17. Karpinski, M., Rytter, W., Shinohara, A.: Pattern-matching for strings with short descriptions. In: Proceedings of the 6th Symposium on Combinatorial Pattern Matching (CPM), pp. 205–214 (1995)
18. Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298, 253–272 (2003)
19. Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002)
20. Lempel, A., Ziv, J.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)
21. Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 409(3), 486–496 (2008)
22. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)
23. Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 228–240 (2007)
24. Lifshits, Y., Lohrey, M.: Querying and embedding compressed texts. In: Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS), pp. 681–692 (2006)
25. Mäkinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–13 (1999)
26. Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Symposium on Combinatorial Pattern Matching (CPM), pp. 31–49 (1994)
27. Masek, W.J., Paterson, M.S.: A faster algorithm computing string edit distances. J. Comput. System Sci. 20, 18–31 (1980)
28. Miyazaki, M., Shinohara, A., Takeda, M.: An improved pattern matching algorithm for strings in terms of straight-line programs. In: Proceedings of the 8th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–11 (1997)
29. Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15 (2007)
30. Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the 11th Data Compression Conference (DCC), pp. 459–468 (2001)
31. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1–3), 211–222 (2003)
32. Schmidt, J.P.: All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Comput. 27(4), 972–992 (1998)
33. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC), pp. 306–315 (2000)
34. Tiskin, A.: Faster subsequence recognition in compressed strings. Journal of Mathematical Sciences, to appear
35. Tiskin, A.: Fast distance multiplication of unit-Monge matrices. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1287–1296 (2010)
36. Wagner, R., Fischer, M.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)
37. Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22(1), 75–81 (1976)
38. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)

Author information

Authors and Affiliations

Max-Planck-Institute for Informatics, Saarbrücken, Germany
Danny Hermelin
Department of Computer Science, University of Haifa, Haifa, Israel
Gad M. Landau & Oren Weimann
Department of Computer Science and Engineering, NYU-Poly, New York, USA
Gad M. Landau
Department of Computer Science, Tel Aviv University, Tel Aviv, Israel
Shir Landau

Corresponding author

Correspondence to Shir Landau.

Additional information

Gad M. Landau was partially supported by National Science Foundation Award 0904246, Israel Science Foundation grant 347/09, Yahoo, Grant No. 2008217 from the United States-Israel Binational Science Foundation (BSF), and DFG.

About this article

Cite this article

Hermelin, D., Landau, G.M., Landau, S. et al. Unified Compression-Based Acceleration of Edit-Distance Computation. Algorithmica 65, 339–353 (2013). https://doi.org/10.1007/s00453-011-9590-6

Received: 08 July 2010
Accepted: 01 November 2011
Published: 17 November 2011
Issue Date: February 2013
DOI: https://doi.org/10.1007/s00453-011-9590-6
