Lempel–Ziv-Like Parsing in Small Space

Kosolobov, Dmitry; Valenzuela, Daniel; Navarro, Gonzalo; Puglisi, Simon J.

doi:10.1007/s00453-020-00722-6

Lempel–Ziv-Like Parsing in Small Space

Published: 25 May 2020

Volume 82, pages 3195–3215, (2020)
Cite this article

Algorithmica Aims and scope Submit manuscript

Dmitry Kosolobov ORCID: orcid.org/0000-0002-2909-2952¹,
Daniel Valenzuela²,
Gonzalo Navarro³ &
…
Simon J. Puglisi²

374 Accesses
8 Citations
1 Altmetric
Explore all metrics

Abstract

Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression \(n H_k + o(n\log \sigma )\) with \(k = o(\log _\sigma n)\), where n is the input length and \(\sigma\) is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be \(\Omega (\log n)\) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1

Longest Common Substring with Approximately k Mismatches

Article Open access 16 February 2019

The Clustal Omega Multiple Alignment Package

Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences

Notes

Hereafter, \(\log\) denote logarithm with base 2 if it is not explicitly stated otherwise.
To conform with the indexation scheme used throughout the paper, we do not follow the standard practice to index the least significant bit as zeroth.

References

Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., Vandevenne, L.: Brotli: a general-purpose data compressor. ACM Trans. Inf. Syst. 37(1), 4 (2018). https://doi.org/10.1145/3231935
Article Google Scholar
Amir, A., Landau, G.M., Ukkonen, E.: Online timestamped text indexing. Inf. Process. Lett. 82(5), 253–259 (2002). https://doi.org/10.1016/S0020-0190(01)00275-7
Article MathSciNet MATH Google Scholar
Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the CPM 2018, LIPIcs, vol. 105, pp. 7:1–7:12. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.7
Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA 2016, pp. 2053–2071. SIAM (2016). https://doi.org/10.1137/1.9781611974331.ch143
Bille, P., Cording, P.H., Fischer, J., Gørtz, I.L.: Lempel–Ziv compression in a sliding window. In: Proceedings of the CPM 2017, LIPIcs, vol. 78. Schloss Dagstuhl–Leibniz–Zentrum für Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.15
Deorowicz, S., Danek, A., Niemiec, M.: GDC 2: compression of large collections of genomes. Sci. Rep. 5, 11565 (2015). https://doi.org/10.1038/srep11565
Article Google Scholar
Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27(21), 2979–2986 (2011). https://doi.org/10.1093/bioinformatics/btr505
Article Google Scholar
Duda, J.: Asymmetric numeral systems as close to capacity low state entropy coders. CoRR abs/1311.2540 (2013). http://arxiv.org/abs/1311.2540
Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21(2), 194–203 (1975). https://doi.org/10.1109/TIT.1975.1055349
Article MathSciNet MATH Google Scholar
Ferragina, P., Nitto, I., Venturini, R.: On the bit-complexity of Lempel-Ziv compression. SIAM J. Comput. 42(4), 1521–1541 (2013). https://doi.org/10.1137/120869511
Article MathSciNet MATH Google Scholar
Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the ESA 2015, LNCS, vol. 9294, pp. 533–544. Springer (2015). https://doi.org/10.1007/978-3-662-48350-3_45
Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006). https://doi.org/10.1016/j.ipl.2006.04.008
Article MathSciNet MATH Google Scholar
Gagie, T., Manzini, G.: Space-conscious compression. In: Proc. MFCS 2007, LNCS, vol. 4708, pp. 206–217. Springer (2007). https://doi.org/10.1007/978-3-540-74456-6_20
Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel–Ziv parsing. In: Proceedings of the LATIN 2018, LNCS, vol. 10807, pp. 490–503. Springer (2018). https://doi.org/10.1007/978-3-319-77404-6_36
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA 2018, pp. 1459–1477. SIAM (2018). https://doi.org/10.1137/1.9781611975031.96
Gagie, T., Puglisi, S.J., Valenzuela, D.: Analyzing relative Lempel–Ziv reference construction. In: Proceedings of the SPIRE 2016, LNCS, vol. 9954, pp. 160–165. Springer (2016). https://doi.org/10.1007/978-3-319-46049-9_16
Gańczorz, M.: Entropy bounds for grammar compression. CoRR abs/1804.08547 (2018). http://arxiv.org/abs/1804.08547
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: Plug and play with succinct data structures. In: Proceedings of the SEA 2014, LNCS, vol. 8504, pp. 326–337. Springer (2014). https://doi.org/10.1007/978-3-319-07959-2_28
Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow. 5(3), 265–273 (2011). https://doi.org/10.14778/2078331.2078341
Article Google Scholar
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: Simple, fast, small. In: Proceedings of the CPM 2013, LNCS, vol. 7922, pp. 189–200. Springer (2013). https://doi.org/10.1007/978-3-642-38905-4_19
Karkkainen, J., Kempa, D., Puglisi, S.J.: Lempel–Ziv parsing in external memory. In: Proceedings of the DCC 2014, pp. 153–162. IEEE (2014). https://doi.org/10.1109/DCC.2014.78
Kempa, D., Kosolobov, D.: LZ-End parsing in compressed space. In: Proceedings of the DCC 2017, pp. 350–359. IEEE (2017). https://doi.org/10.1109/DCC.2017.73
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the STOC 2018, pp. 827–840. ACM (2018). https://doi.org/10.1145/3188745.3188814
Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999). https://doi.org/10.1137/S0097539797331105
Article MathSciNet MATH Google Scholar
Kosolobov, D.: Relations between greedy and bit-optimal LZ77 encodings. In: Proceedings of the STACS 2018, LIPIcs, vol. 96, pp. 46:1–46:14. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.STACS.2018.46
Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the DCC 2010, pp. 239–248. IEEE (2010). https://doi.org/10.1109/DCC.2010.29
Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel–Ziv compression of genomes for large-scale storage and retrieval. In: Proceedings of the SPIRE 2010, LNCS, vol. 6393, pp. 201–206. Springer (2010). https://doi.org/10.1007/978-3-642-16321-0_20
Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative Lempel–Ziv compression of genomes. In: Australasian Computer Science Conference, pp. 91–98. Australian Computer Society, Inc. (2011)
Larsson, N.J.: Most recent match queries in on-line suffix trees. In: Proceedings of the CPM 2014, LNCS, vol. 8486, pp. 252–261 (2014). https://doi.org/10.1007/978-3-319-07566-2_26
Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theor. Comput. Sci. 387(3), 258–272 (2007). https://doi.org/10.1016/j.tcs.2007.07.017
Article MathSciNet MATH Google Scholar
Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Softw. Pract. Exp. 45(1), 1–29 (2015)
Article Google Scholar
Levenshtein, V.I.: On the redundancy and delay of decodable coding of natural numbers. Syst. Theory Res. 20, 149–155 (1968)
MathSciNet Google Scholar
Liao, K., Petri, M., Moffat, A., Wirth, A.: Effective construction of relative Lempel–Ziv dictionaries. In: Proceedings of the WWW 2016, pp. 807–816. International World Wide Web Conferences Steering Committee (2016). https://doi.org/10.1145/2872427.2883042
Mäkinen, V., Navarro, G.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007). https://doi.org/10.1145/1216370.1216372
Article MATH Google Scholar
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001). https://doi.org/10.1145/382780.382782
Article MathSciNet MATH Google Scholar
Navarro, G.: Indexing highly repetitive collections. In: Proceedings of the IWOCA 2012, LNCS, vol. 7643, pp. 274–279 (2012). https://doi.org/10.1007/978-3-642-35926-2_29
Ochoa, C., Navarro, G.: RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Trans. Inf. Theory (2018). https://doi.org/10.1109/TIT.2018.2871452
Article MATH Google Scholar
Policriti, A., Prezza, N.: Fast online Lempel–Ziv factorization in compressed space. In: Proceedings of the SPIRE 2015, LNCS, vol. 9309, pp. 13–20. Springer (2015). https://doi.org/10.1007/978-3-319-23826-5_2
Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80(7), 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z
Article MathSciNet MATH Google Scholar
Puglisi, S.J.: Lempel-Ziv compression. In: Kao, M.-Y. (ed.) Encyclopedia of algorithms, pp. 1095–1100., Springer, New York (2016). https://doi.org/10.1007/978-1-4939-2864-4_634
Chapter Google Scholar
Shields, P.C.: Performance of LZ algorithms on individual sequences. IEEE Trans. Inf. Theory 45(4), 1283–1288 (1999). https://doi.org/10.1109/18.761286
Article MathSciNet MATH Google Scholar
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346
Article MathSciNet MATH Google Scholar
Tillson, T.W.: A hamiltonian decomposition of \(K^*_{2m}\), \(2m \ge 8\). J. Combin. Theory Ser. B 29(1), 68–74 (1980). https://doi.org/10.1016/0095-8956(80)90044-1
Article MathSciNet MATH Google Scholar
Valenzuela, D.: CHICO: A compressed hybrid index for repetitive collections. In: Proceedings of the SEA 2016, LNCS, vol. 9685, pp. 326–338. Springer (2016). https://doi.org/10.1007/978-3-319-38851-9_22
Wandelt, S., Leser, U.: FRESCO: referential compression of highly similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(5), 1275–1288 (2013). https://doi.org/10.1109/TCBB.2013.122
Article Google Scholar
Wyner, A.J.: The redundancy and distribution of the phrase lengths of the fixed-database Lempel–Ziv algorithm. IEEE Trans. Inf. Theory 43(5), 1452–1464 (1997). https://doi.org/10.1109/18.623144
Article MathSciNet MATH Google Scholar
Yann Collet: Zstandard. (2016). Retrieved from: https://facebook.github.io/zstd/. Accessed 2018-09-17
Yuta Mori: libdivsufsort. https://github.com/y-256/libdivsufsort/. Accessed 22 May 2020
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
Article MathSciNet MATH Google Scholar
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).

Author information

Authors and Affiliations

Ural Federal University, Ekaterinburg, Russia
Dmitry Kosolobov
Department of Computer Science, University of Helsinki, Helsinki, Finland
Daniel Valenzuela & Simon J. Puglisi
CeBiB, Department of Computer Science, University of Chile, Santiago, Chile
Gonzalo Navarro

Authors

Dmitry Kosolobov
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Valenzuela
View author publications
You can also search for this author in PubMed Google Scholar
Gonzalo Navarro
View author publications
You can also search for this author in PubMed Google Scholar
Simon J. Puglisi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dmitry Kosolobov.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kosolobov, D., Valenzuela, D., Navarro, G. et al. Lempel–Ziv-Like Parsing in Small Space. Algorithmica 82, 3195–3215 (2020). https://doi.org/10.1007/s00453-020-00722-6

Download citation

Received: 20 August 2019
Accepted: 29 April 2020
Published: 25 May 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s00453-020-00722-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Lempel–Ziv-Like Parsing in Small Space

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

The Clustal Omega Multiple Alignment Package

Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Lempel–Ziv-Like Parsing in Small Space

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

The Clustal Omega Multiple Alignment Package

Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation