On the Approximation Ratio of Lempel-Ziv Parsing

Gagie, Travis; Navarro, Gonzalo; Prezza, Nicola

doi:10.1007/978-3-319-77404-6_36

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10807)

Included in the following conference series:

Latin American Symposium on Theoretical Informatics

Abstract

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that \(z = O(b \log(n/b))\), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where \(z = \Omega(b \log n)\). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.
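
To make the two computable measures named in the abstract concrete, here is a minimal Python sketch of z and r. It rests on assumptions that the abstract does not spell out: the Lempel-Ziv variant used lets each phrase be the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed), or a single fresh symbol; the Burrows-Wheeler transform is taken over the text with a sentinel `$` appended, which is assumed to sort below every text symbol. The function names `lz_parse`, `bwt`, and `count_runs` are illustrative only, and both routines are deliberately naive (quadratic or worse), not the linear-time algorithms the abstract refers to.

```python
def lz_parse(text):
    """Greedy left-to-right Lempel-Ziv parse (assumed variant, see lead-in).

    Each phrase is the longest prefix of text[i:] that also starts at some
    earlier position j < i (source and target may overlap), or a single
    symbol when no such occurrence exists.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate source positions strictly to the left
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a fresh symbol forms a phrase of length 1
        phrases.append(text[i:i + step])
        i += step
    return phrases


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal equal-letter runs in s."""
    return sum(1 for k in range(len(s)) if k == 0 or s[k] != s[k - 1])


if __name__ == "__main__":
    t = "banana"
    z = len(lz_parse(t))
    r = count_runs(bwt(t))
    print("z =", z, "r =", r)
```

The measure b is not sketched because, as the abstract notes, computing it is NP-complete.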

Partially funded by Basal Funds FB0001, Conicyt, by Fondecyt Grants 1-171058 and 1-170048, Chile, and by the Danish Research Council DFF-4005-00267.

Notes

1. https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
2. For this case, we could have defined bordering in a stricter way, as the first or last block of a chunk.

Acknowledgements

We thank the reviewers for their insightful comments, which helped us improve the presentation significantly.

Author information

Authors and Affiliations

EIT, Diego Portales University, Santiago, Chile
Travis Gagie
Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
Travis Gagie & Gonzalo Navarro
Department of Computer Science, University of Chile, Santiago, Chile
Gonzalo Navarro
DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
Nicola Prezza

Corresponding author

Correspondence to Gonzalo Navarro.

Editor information

Editors and Affiliations

Stony Brook University, Stony Brook, New York, USA
Michael A. Bender
Rutgers University, New Brunswick, New Jersey, USA
Martín Farach-Colton
Pace University, New York, New York, USA
Miguel A. Mosteiro

About this paper

Cite this paper

Gagie, T., Navarro, G., Prezza, N. (2018). On the Approximation Ratio of Lempel-Ziv Parsing. In: Bender, M., Farach-Colton, M., Mosteiro, M. (eds) LATIN 2018: Theoretical Informatics. LATIN 2018. Lecture Notes in Computer Science, vol 10807. Springer, Cham. https://doi.org/10.1007/978-3-319-77404-6_36

DOI: https://doi.org/10.1007/978-3-319-77404-6_36
Published: 13 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77403-9
Online ISBN: 978-3-319-77404-6
eBook Packages: Computer Science, Computer Science (R0)
