Skip to main content
Log in

Lempel–Ziv-Like Parsing in Small Space

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression \(n H_k + o(n\log \sigma )\) with \(k = o(\log _\sigma n)\), where n is the input length and \(\sigma\) is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be \(\Omega (\log n)\) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Notes

  1. Hereafter, \(\log\) denote logarithm with base 2 if it is not explicitly stated otherwise.

  2. To conform with the indexation scheme used throughout the paper, we do not follow the standard practice to index the least significant bit as zeroth.

References

  1. Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., Vandevenne, L.: Brotli: a general-purpose data compressor. ACM Trans. Inf. Syst. 37(1), 4 (2018). https://doi.org/10.1145/3231935

    Article  Google Scholar 

  2. Amir, A., Landau, G.M., Ukkonen, E.: Online timestamped text indexing. Inf. Process. Lett. 82(5), 253–259 (2002). https://doi.org/10.1016/S0020-0190(01)00275-7

    Article  MathSciNet  MATH  Google Scholar 

  3. Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the CPM 2018, LIPIcs, vol. 105, pp. 7:1–7:12. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.7

  4. Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA 2016, pp. 2053–2071. SIAM (2016). https://doi.org/10.1137/1.9781611974331.ch143

  5. Bille, P., Cording, P.H., Fischer, J., Gørtz, I.L.: Lempel–Ziv compression in a sliding window. In: Proceedings of the CPM 2017, LIPIcs, vol. 78. Schloss Dagstuhl–Leibniz–Zentrum für Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.15

  6. Deorowicz, S., Danek, A., Niemiec, M.: GDC 2: compression of large collections of genomes. Sci. Rep. 5, 11565 (2015). https://doi.org/10.1038/srep11565

    Article  Google Scholar 

  7. Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27(21), 2979–2986 (2011). https://doi.org/10.1093/bioinformatics/btr505

    Article  Google Scholar 

  8. Duda, J.: Asymmetric numeral systems as close to capacity low state entropy coders. CoRR abs/1311.2540 (2013). http://arxiv.org/abs/1311.2540

  9. Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21(2), 194–203 (1975). https://doi.org/10.1109/TIT.1975.1055349

    Article  MathSciNet  MATH  Google Scholar 

  10. Ferragina, P., Nitto, I., Venturini, R.: On the bit-complexity of Lempel-Ziv compression. SIAM J. Comput. 42(4), 1521–1541 (2013). https://doi.org/10.1137/120869511

    Article  MathSciNet  MATH  Google Scholar 

  11. Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the ESA 2015, LNCS, vol. 9294, pp. 533–544. Springer (2015). https://doi.org/10.1007/978-3-662-48350-3_45

  12. Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006). https://doi.org/10.1016/j.ipl.2006.04.008

    Article  MathSciNet  MATH  Google Scholar 

  13. Gagie, T., Manzini, G.: Space-conscious compression. In: Proc. MFCS 2007, LNCS, vol. 4708, pp. 206–217. Springer (2007). https://doi.org/10.1007/978-3-540-74456-6_20

  14. Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel–Ziv parsing. In: Proceedings of the LATIN 2018, LNCS, vol. 10807, pp. 490–503. Springer (2018). https://doi.org/10.1007/978-3-319-77404-6_36

  15. Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA 2018, pp. 1459–1477. SIAM (2018). https://doi.org/10.1137/1.9781611975031.96

  16. Gagie, T., Puglisi, S.J., Valenzuela, D.: Analyzing relative Lempel–Ziv reference construction. In: Proceedings of the SPIRE 2016, LNCS, vol. 9954, pp. 160–165. Springer (2016). https://doi.org/10.1007/978-3-319-46049-9_16

  17. Gańczorz, M.: Entropy bounds for grammar compression. CoRR abs/1804.08547 (2018). http://arxiv.org/abs/1804.08547

  18. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: Plug and play with succinct data structures. In: Proceedings of the SEA 2014, LNCS, vol. 8504, pp. 326–337. Springer (2014). https://doi.org/10.1007/978-3-319-07959-2_28

  19. Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow. 5(3), 265–273 (2011). https://doi.org/10.14778/2078331.2078341

    Article  Google Scholar 

  20. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: Simple, fast, small. In: Proceedings of the CPM 2013, LNCS, vol. 7922, pp. 189–200. Springer (2013). https://doi.org/10.1007/978-3-642-38905-4_19

  21. Karkkainen, J., Kempa, D., Puglisi, S.J.: Lempel–Ziv parsing in external memory. In: Proceedings of the DCC 2014, pp. 153–162. IEEE (2014). https://doi.org/10.1109/DCC.2014.78

  22. Kempa, D., Kosolobov, D.: LZ-End parsing in compressed space. In: Proceedings of the DCC 2017, pp. 350–359. IEEE (2017). https://doi.org/10.1109/DCC.2017.73

  23. Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the STOC 2018, pp. 827–840. ACM (2018). https://doi.org/10.1145/3188745.3188814

  24. Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999). https://doi.org/10.1137/S0097539797331105

    Article  MathSciNet  MATH  Google Scholar 

  25. Kosolobov, D.: Relations between greedy and bit-optimal LZ77 encodings. In: Proceedings of the STACS 2018, LIPIcs, vol. 96, pp. 46:1–46:14. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.STACS.2018.46

  26. Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the DCC 2010, pp. 239–248. IEEE (2010). https://doi.org/10.1109/DCC.2010.29

  27. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv compression of genomes for large-scale storage and retrieval. In: Proceedings of the SPIRE 2010, LNCS, vol. 6393, pp. 201–206. Springer (2010). https://doi.org/10.1007/978-3-642-16321-0_20

  28. Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative Lempel–Ziv compression of genomes. In: Australasian Computer Science Conference, pp. 91–98. Australian Computer Society, Inc. (2011)

  29. Larsson, N.J.: Most recent match queries in on-line suffix trees. In: Proceedings of the CPM 2014, LNCS, vol. 8486, pp. 252–261 (2014). https://doi.org/10.1007/978-3-319-07566-2_26

  30. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theor. Comput. Sci. 387(3), 258–272 (2007). https://doi.org/10.1016/j.tcs.2007.07.017

    Article  MathSciNet  MATH  Google Scholar 

  31. Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Softw. Pract. Exp. 45(1), 1–29 (2015)

    Article  Google Scholar 

  32. Levenshtein, V.I.: On the redundancy and delay of decodable coding of natural numbers. Syst. Theory Res. 20, 149–155 (1968)

    MathSciNet  Google Scholar 

  33. Liao, K., Petri, M., Moffat, A., Wirth, A.: Effective construction of relative Lempel–Ziv dictionaries. In: Proceedings of the WWW 2016, pp. 807–816. International World Wide Web Conferences Steering Committee (2016). https://doi.org/10.1145/2872427.2883042

  34. Mäkinen, V., Navarro, G.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007). https://doi.org/10.1145/1216370.1216372

    Article  MATH  Google Scholar 

  35. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001). https://doi.org/10.1145/382780.382782

    Article  MathSciNet  MATH  Google Scholar 

  36. Navarro, G.: Indexing highly repetitive collections. In: Proceedings of the IWOCA 2012, LNCS, vol. 7643, pp. 274–279 (2012). https://doi.org/10.1007/978-3-642-35926-2_29

  37. Ochoa, C., Navarro, G.: RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Trans. Inf. Theory (2018). https://doi.org/10.1109/TIT.2018.2871452

    Article  MATH  Google Scholar 

  38. Policriti, A., Prezza, N.: Fast online Lempel–Ziv factorization in compressed space. In: Proceedings of the SPIRE 2015, LNCS, vol. 9309, pp. 13–20. Springer (2015). https://doi.org/10.1007/978-3-319-23826-5_2

  39. Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80(7), 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z

    Article  MathSciNet  MATH  Google Scholar 

  40. Puglisi, S.J.: Lempel-Ziv compression. In: Kao, M.-Y. (ed.) Encyclopedia of algorithms, pp. 1095–1100., Springer, New York (2016). https://doi.org/10.1007/978-1-4939-2864-4_634

    Chapter  Google Scholar 

  41. Shields, P.C.: Performance of LZ algorithms on individual sequences. IEEE Trans. Inf. Theory 45(4), 1283–1288 (1999). https://doi.org/10.1109/18.761286

    Article  MathSciNet  MATH  Google Scholar 

  42. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346

    Article  MathSciNet  MATH  Google Scholar 

  43. Tillson, T.W.: A hamiltonian decomposition of \(K^*_{2m}\), \(2m \ge 8\). J. Combin. Theory Ser. B 29(1), 68–74 (1980). https://doi.org/10.1016/0095-8956(80)90044-1

    Article  MathSciNet  MATH  Google Scholar 

  44. Valenzuela, D.: CHICO: A compressed hybrid index for repetitive collections. In: Proceedings of the SEA 2016, LNCS, vol. 9685, pp. 326–338. Springer (2016). https://doi.org/10.1007/978-3-319-38851-9_22

  45. Wandelt, S., Leser, U.: FRESCO: referential compression of highly similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(5), 1275–1288 (2013). https://doi.org/10.1109/TCBB.2013.122

    Article  Google Scholar 

  46. Wyner, A.J.: The redundancy and distribution of the phrase lengths of the fixed-database Lempel–Ziv algorithm. IEEE Trans. Inf. Theory 43(5), 1452–1464 (1997). https://doi.org/10.1109/18.623144

    Article  MathSciNet  MATH  Google Scholar 

  47. Yann Collet: Zstandard. (2016). Retrieved from: https://facebook.github.io/zstd/. Accessed 2018-09-17

  48. Yuta Mori: libdivsufsort. https://github.com/y-256/libdivsufsort/. Accessed 22 May 2020

  49. Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501

    Article  MathSciNet  MATH  Google Scholar 

  50. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714

    Article  MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dmitry Kosolobov.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454).

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kosolobov, D., Valenzuela, D., Navarro, G. et al. Lempel–Ziv-Like Parsing in Small Space. Algorithmica 82, 3195–3215 (2020). https://doi.org/10.1007/s00453-020-00722-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-020-00722-6

Keywords

Navigation