Efficient LZ78 Factorization of Grammar Compressed Text

  • Hideo Bannai
  • Shunsuke Inenaga
  • Masayuki Takeda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7608)

Abstract

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size n representing a text S of length N, our algorithm computes the LZ78 factorization of T in \(O(n\sqrt{N}+m\log N)\) time and \(O(n\sqrt{N}+m)\) space, where m is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the \(n\sqrt{N}\) term in the time and space complexities becomes either nL, where L is the length of the longest LZ78 factor, or (N − α) where α ≥ 0 is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of S of a certain length. Since m = O(N/log σ N) where σ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when σ is constant, and can be more efficient when the text is compressible, i.e. when m and n are small.

Keywords

Compression Algorithm Context Free Grammar Linear Time Algorithm Derivation Tree Alphabet Size 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Amir, A., Farach, M., Idury, R.M., Poutré, J.A.L., Schäffer, A.A.: Improved dynamic dictionary matching. Information and Computation 119(2), 258–282 (1995)MathSciNetMATHCrossRefGoogle Scholar
  2. 2.
    Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)MathSciNetMATHCrossRefGoogle Scholar
  3. 3.
    Berkman, O., Vishkin, U.: Finding level-ancestors in trees. J. Comput. System Sci. 48(2), 214–230 (1994)MathSciNetMATHCrossRefGoogle Scholar
  4. 4.
    Bille, P., Fagerberg, R., Gørtz, I.L.: Improved approximate string matching and regular expression matching on Ziv-Lempel compressed texts. ACM Transactions on Algorithms 6(1) (2009)Google Scholar
  5. 5.
    Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proc. SODA 2011, pp. 373–389 (2011)Google Scholar
  6. 6.
    Cilibrasi, R., Vitányi, P.M.: Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545 (2005)CrossRefGoogle Scholar
  7. 7.
    Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)MathSciNetMATHCrossRefGoogle Scholar
  8. 8.
    Farach, M.: Optimal suffix tree construction with large alphabets. In: Proc. FOCS 1997, pp. 137–143 (1997)Google Scholar
  9. 9.
    Freschi, V., Bogliolo, A.: A faster algorithm for the computation of string convolutions using LZ78 parsing. Information Processing Letters 110(14-15), 609–613 (2010)MathSciNetMATHCrossRefGoogle Scholar
  10. 10.
    Gawrychowski, P.: Optimal pattern matching in LZW compressed strings. In: Proc. SODA 2011, pp. 362–372 (2011)Google Scholar
  11. 11.
    Gawrychowski, P.: Tying up the loose ends in fully LZW-compressed pattern matching. In: Proc. STACS 2012, pp. 624–635 (2012)Google Scholar
  12. 12.
    Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Fast q-gram Mining on SLP Compressed Strings. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 278–289. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  13. 13.
    Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Speeding Up q-Gram Mining on Grammar-Based Compressed Texts. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 220–231. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  14. 14.
    Jansson, J., Sadakane, K., Sung, W.-K.: Compressed Dynamic Tries with Applications to LZ-Compression in Sublinear Time and Space. In: Arvind, V., Prasad, S. (eds.) FSTTCS 2007. LNCS, vol. 4855, pp. 424–435. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  15. 15.
    Jeż, A.: Faster Fully Compressed Pattern Matching by Recompression. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012, Part I. LNCS, vol. 7391, pp. 533–544. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  16. 16.
    Kida, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: A unifying framework for compressed pattern matching. Theor. Comput. Sci. 298(1), 253–272 (2003)MathSciNetMATHCrossRefGoogle Scholar
  17. 17.
    Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proc. DCC 1999, pp. 296–305. IEEE Computer Society (1999)Google Scholar
  18. 18.
    Li, M., Sleep, R.: Genre classification via an LZ78-based string kernel. In: Proc. ISMIR 2005, pp. 252–259 (2005)Google Scholar
  19. 19.
    Li, M., Sleep, R.: An LZ78 Based String Kernel. In: Li, X., Wang, S., Dong, Z.Y. (eds.) ADMA 2005. LNCS (LNAI), vol. 3584, pp. 678–689. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  20. 20.
    Li, M., Zhu, Y.: Image Classification Via LZ78 Based String Kernel: A Comparative Study. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 704–712. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  21. 21.
    McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of ACM 23(2), 262–272 (1976)MathSciNetMATHCrossRefGoogle Scholar
  22. 22.
    Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC 1994, pp. 244–253 (1994)Google Scholar
  23. 23.
    Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1-3), 211–222 (2003)MathSciNetMATHCrossRefGoogle Scholar
  24. 24.
    Shibuya, T.: Constructing the suffix tree of a tree with a large alphabet. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86-A(5), 1061–1066 (2003)Google Scholar
  25. 25.
    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)MathSciNetMATHCrossRefGoogle Scholar
  26. 26.
    Weiner, P.: Linear pattern-matching algorithms. In: Proc. of 14th IEEE Ann. Symp. on Switching and Automata Theory, pp. 1–11. Institute of Electrical Electronics Engineers, New York (1973)CrossRefGoogle Scholar
  27. 27.
    Westbrook, J.: Fast Incremental Planarity Testing. In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 342–353. Springer, Heidelberg (1992)CrossRefGoogle Scholar
  28. 28.
    Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)MathSciNetCrossRefGoogle Scholar
  29. 29.
    Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)MathSciNetMATHCrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hideo Bannai
    • 1
  • Shunsuke Inenaga
    • 1
  • Masayuki Takeda
    • 1
  1. 1.Department of InformaticsKyushu UniversityJapan

Personalised recommendations