LZD Factorization: Simple and Practical Online Grammar Compression with Variable-to-Fixed Encoding

  • Conference paper
Combinatorial Pattern Matching (CPM 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9133)

Abstract

We propose a new variant of the LZ78 factorization which we call the LZ Double-factor factorization (LZD factorization). Each factor of the LZD factorization of a string is the concatenation of the two longest previous factors, while each factor of the LZ78 factorization is that of the longest previous factor and the following character. Interestingly, this simple modification drastically improves the compression ratio in practice. We propose two online algorithms to compute the LZD factorization in \(O(m (M + \min (m, M)\log \sigma ))\) time and \(O(m)\) space, or in \(O(N \log \sigma )\) time and \(O(N)\) space, where \(m\) is the number of factors to output, \(M\) is the length of the longest factor(s), \(N\) is the length of the input string, and \(\sigma \) is the alphabet size. We also show two versions of our LZD factorization with variable-to-fixed encoding, and present online algorithms to compute these versions in \(O(N + \min (m, 2^L) (M + \min (m, M, 2^L) \log \sigma ))\) time and \(O(\min (2^L, m))\) space, where \(L\) is the bit-length of each fixed-length code word. The LZD factorization and its versions with variable-to-fixed encoding are actually grammar-based compression, and our experiments show that our algorithms outperform the state-of-the-art online grammar-based compression algorithms on several data sets.
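The definition above can be illustrated with a minimal sketch. It is a naive reading of the definition, not either of the online algorithms proposed in the paper, and it does not attain the stated time bounds. The function name `lzd_factorize` and the example string are assumptions made for illustration. Each half of a factor is taken to be the longest prefix of the remaining text that equals a previous factor, falling back to a single character when no previous factor matches; the second half of the last factor may be empty.

```python
def lzd_factorize(text):
    """Naive LZD factorization sketch (hypothetical helper, not the paper's algorithm).

    Returns a list of (first, second) string pairs; each factor is first + second.
    """
    factors = []    # output factors as (first, second) pairs
    seen = set()    # all previous factors, stored as plain strings
    longest = 0     # length of the longest factor so far
    n = len(text)
    pos = 0

    def longest_match(start):
        # Try previous factors from longest to shortest (lengths >= 2).
        for length in range(min(longest, n - start), 1, -1):
            candidate = text[start:start + length]
            if candidate in seen:
                return candidate
        # Fall back to a single character ("" when start is past the end).
        return text[start:start + 1]

    while pos < n:
        first = longest_match(pos)
        second = longest_match(pos + len(first))
        factors.append((first, second))
        factor = first + second
        seen.add(factor)
        longest = max(longest, len(factor))
        pos += len(factor)
    return factors


if __name__ == "__main__":
    pairs = lzd_factorize("abaaabababaabb$")
    print(["".join(p) for p in pairs])
    # prints ['ab', 'aa', 'abab', 'abaa', 'bb', '$']
```

Because every factor is a pair of previous factors or characters, each pair can be read as a production of the form \(X_i \rightarrow X_j X_k\), which is the sense in which the factorization is a grammar. On the same example string, LZ78 produces eight factors (a, b, aa, ab, aba, ba, abb, $) against six here.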

Notes

  1. The bound \(M = O(N)\) can be achieved with the string \(a^{N-1}\$\), where \(N-1 = 2^k\) for some \(k\). Observe that \(f_1 = aa\), \(f_2 = f_1f_1 = aaaa\), \(\ldots\), \(f_{m-1} = a^{\frac{N-1}{2}}\), and \(f_{m} = \$\).

  2. Source code is available at https://github.com/kg86/lzd.

  3. The number of characters the algorithm can process per second.

  4. http://pizzachili.dcc.uchile.cl/texts.html.

  5. http://pizzachili.dcc.uchile.cl/repcorpus.html.

  6. The first 10 GB of enwiki-20150112-pages-meta-history1.xml-p000000010p000002983.7z, downloaded from http://dumps.wikimedia.org/backup-index.html.

Acknowledgements

We would like to thank Shirou Maruyama and Takuya Kida for providing the source code of their compression programs FOLCA and ADS.

Author information

Corresponding author

Correspondence to Keisuke Goto.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2015). LZD Factorization: Simple and Practical Online Grammar Compression with Variable-to-Fixed Encoding. In: Cicalese, F., Porat, E., Vaccaro, U. (eds) Combinatorial Pattern Matching. CPM 2015. Lecture Notes in Computer Science, vol 9133. Springer, Cham. https://doi.org/10.1007/978-3-319-19929-0_19

  • DOI: https://doi.org/10.1007/978-3-319-19929-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19928-3

  • Online ISBN: 978-3-319-19929-0

  • eBook Packages: Computer Science, Computer Science (R0)
