Unified Compression-Based Acceleration of Edit-Distance Computation

Published in: Algorithmica

Abstract

The edit distance problem is a classical, fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution computes the edit distance between a pair of strings of total length $O(N)$ in $O(N^2)$ time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings.
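For orientation, the quadratic baseline mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `edit_distance`, the default unit costs, and the two-row memory layout are all assumptions made for the example.

```python
def edit_distance(a, b, sub_cost=None, indel_cost=1):
    """Textbook quadratic-time dynamic program for edit distance.

    sub_cost(x, y) is a hypothetical scoring hook; by default it is the
    Levenshtein cost (0 for a match, 1 for a mismatch).
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 0 if x == y else 1

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]  # distance between a[:i] and the empty string
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # delete x from a
                curr[j - 1] + indel_cost,     # insert y into a
                prev[j - 1] + sub_cost(x, y)  # replace x with y (or match)
            ))
        prev = curr
    return prev[len(b)]
```

Every cell of the (len(a)+1) by (len(b)+1) table is filled exactly once, which is where the quadratic running time comes from.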

As it turns out, practically all known $o(N^2)$ edit-distance algorithms work, in some sense, under the paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings compressed under any compression scheme. Put differently, can a single algorithm exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme? In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods.
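To make the notion concrete, a straight-line program can be viewed as a grammar in which every variable either produces a single character or the concatenation of two earlier variables. The sketch below is only an illustration under that assumption; the rule encoding (a dict mapping names to a character or a pair of names) and the helper names `slp_lengths` and `slp_expand` are hypothetical and not taken from the paper.

```python
def slp_lengths(rules):
    """Length of the string derived by each variable.

    Assumes rules are listed in dependency order, i.e. a variable only
    refers to variables that appear before it in the dict.
    """
    lengths = {}
    for name, rhs in rules.items():
        if isinstance(rhs, str):      # terminal rule: X -> 'c'
            lengths[name] = 1
        else:                         # concatenation rule: X -> Y Z
            left, right = rhs
            lengths[name] = lengths[left] + lengths[right]
    return lengths


def slp_expand(rules, start):
    """Decompress the string derived by `start` (for small examples only)."""
    rhs = rules[start]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return slp_expand(rules, left) + slp_expand(rules, right)


# Six rules describing a string of length nine.
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),   # ab
    "X4": ("X3", "X3"),   # abab
    "X5": ("X4", "X4"),   # abababab
    "X6": ("X5", "X1"),   # ababababa
}
```

Repeated doubling, as in `X4` and `X5`, is what lets a program with few rules describe a much longer string.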

For two strings of total length $N$ having straight-line program representations of total size $n$, we present an algorithm running in $O(nN \lg(N/n))$ time for computing the edit distance of these two strings under any rational scoring function, and an $O(n^{2/3} N^{4/3})$ time algorithm for arbitrary scoring functions. Our new result, while providing a speedup for compressible strings, does not surpass the quadratic time bound even in the worst-case scenario.
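As a rough sanity check of the last claim, assume $n \le N$ (an assumption made here for illustration; the abstract does not state it explicitly). Both bounds then stay within a constant factor of $N^2$:

```latex
\[
n^{2/3} N^{4/3} \;\le\; N^{2/3} N^{4/3} \;=\; N^{2},
\qquad
n N \lg\frac{N}{n} \;\le\; \frac{\lg e}{e}\, N^{2} \;\approx\; 0.53\, N^{2}.
\]
```

The second inequality uses the fact that $n \lg(N/n)$ is maximised at $n = N/e$.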

Notes

  1. Important exceptions to this list are statistical compressors such as Huffman or arithmetic coding, as well as compression schemes that are applied after a Burrows-Wheeler transformation.

  2. We note that in most cases, including the Levenshtein distance [22], when $a_i = b_j$, the cost of replacing $a_i$ with $b_j$ is zero.

  3. A matrix M is totally monotone if for every i and j we have that if M[i,j]≤M[i+1,j] then M[i,j+1]≤M[i+1,j+1].
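The condition in note 3 can be checked directly on a small matrix. The following is a minimal sketch that tests exactly the adjacent-row, adjacent-column condition as worded in the note; the function name `is_totally_monotone` and the list-of-lists representation are assumptions made for the example.

```python
def is_totally_monotone(M):
    """Check the condition from note 3 on a rectangular list-of-lists matrix.

    For every i and j: if M[i][j] <= M[i+1][j],
    then M[i][j+1] <= M[i+1][j+1] must also hold.
    """
    for i in range(len(M) - 1):
        for j in range(len(M[i]) - 1):
            premise = M[i][j] <= M[i + 1][j]
            conclusion = M[i][j + 1] <= M[i + 1][j + 1]
            if premise and not conclusion:
                return False
    return True
```

For instance, `[[1, 2], [2, 4]]` satisfies the condition, while `[[1, 5], [2, 3]]` violates it: the premise 1 ≤ 2 holds in the first column, but 5 ≤ 3 fails in the second.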

References

  1. Aggarwal, A., Klawe, M.M., Moran, S., Shor, P., Wilber, R.: Geometric applications of a matrix-searching algorithm. Algorithmica 2, 195–208 (1987)

  2. Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. System Sci. 52(2), 299–307 (1996)

  3. Amir, A., Landau, G.M., Sokol, D.: Inplace 2D matching in compressed images. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 853–862 (2003)

  4. Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complexity 15(1), 4–16 (1999)

  5. Apostolico, A., Atallah, M.J., Larmore, L.L., McFaddin, S.: Efficient parallel algorithms for string editing and related problems. SIAM J. Comput. 19(5), 968–988 (1990)

  6. Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inform. Process. Lett. 83(6), 307–314 (2001)

  7. Bille, P., Farach-Colton, M.: Fast and compact regular expression matching. Theoret. Comput. Sci. 409(3), 486–496 (2008)

  8. Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inform. Process. Lett. 54, 93–96 (1995)

  9. Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the First International Computer Science Symposium in Russia (CSR), pp. 127–136 (2006)

  10. Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32, 1654–1673 (2003)

  11. Gage, P.: A new algorithm for data compression. C Users J. 12, 23–38 (1994)

  12. Gąsieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proceedings of the 4th Scandinavian Workshop on Algorithm Theory (SWAT), pp. 392–403 (1996)

  13. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  14. Hermelin, D., Landau, S., Landau, G.M., Weimann, O.: A unified algorithm for accelerating edit-distance via text-compression. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 529–540 (2009)

  15. Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)

  16. Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Symposium on Combinatorial Pattern Matching (CPM), pp. 195–209 (2000)

  17. Karpinski, M., Rytter, W., Shinohara, A.: Pattern-matching for strings with short descriptions. In: Proceedings of the 6th Symposium on Combinatorial Pattern Matching (CPM), pp. 205–214 (1995)

  18. Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298, 253–272 (2003)

  19. Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002)

  20. Lempel, A., Ziv, J.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)

  21. Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)

  22. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)

  23. Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 228–240 (2007)

  24. Lifshits, Y., Lohrey, M.: Querying and embedding compressed texts. In: Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS), pp. 681–692 (2006)

  25. Mäkinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–13 (1999)

  26. Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Symposium on Combinatorial Pattern Matching (CPM), pp. 31–49 (1994)

  27. Masek, W.J., Paterson, M.S.: A faster algorithm computing string edit distances. J. Comput. System Sci. 20, 18–31 (1980)

  28. Miyazaki, M., Shinohara, A., Takeda, M.: An improved pattern matching algorithm for strings in terms of straight-line programs. In: Proceedings of the 8th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–11 (1997)

  29. Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15 (2007)

  30. Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the 11th Data Compression Conference (DCC), pp. 459–468 (2001)

  31. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1–3), 211–222 (2003)

  32. Schmidt, J.P.: All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Comput. 27(4), 972–992 (1998)

  33. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC), pp. 306–315 (2000)

  34. Tiskin, A.: Faster subsequence recognition in compressed strings. Journal of Mathematical Sciences, to appear

  35. Tiskin, A.: Fast distance multiplication of unit-monge matrices. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1287–1296 (2010)

  36. Wagner, R., Fischer, M.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)

  37. Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22(1), 75–81 (1976)

  38. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)

Author information

Correspondence to Shir Landau.

Additional information

Gad M. Landau partially supported by the National Science Foundation Award 0904246, Israel Science Foundation grant 347/09, Yahoo, Grant No. 2008217 from the United States-Israel Binational Science Foundation (BSF) and DFG.

Cite this article

Hermelin, D., Landau, G.M., Landau, S. et al. Unified Compression-Based Acceleration of Edit-Distance Computation. Algorithmica 65, 339–353 (2013). https://doi.org/10.1007/s00453-011-9590-6
