Sublinear Algorithms for Approximating String Compressibility

Raskhodnikova, Sofya; Ron, Dana; Rubinfeld, Ronitt; Smith, Adam

doi:10.1007/s00453-012-9618-6

Published: 22 February 2012

Algorithmica, Volume 65, pages 685–709 (2013)

Sofya Raskhodnikova¹,
Dana Ron²,
Ronitt Rubinfeld²,³ &
Adam Smith¹

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
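As a rough illustration of what a sublinear estimate for RLE can look like, the sketch below estimates the number of runs in a string by probing random positions instead of reading the whole input. This is not taken from the article, whose algorithms are not reproduced on this page. It rests on the assumption that the run count is a reasonable proxy for RLE output size, and it ignores the bits needed to encode run lengths. The names `count_runs` and `estimate_runs` are hypothetical.

```python
import random


def count_runs(w: str) -> int:
    """Exact number of maximal runs of identical symbols (reads all of w)."""
    if not w:
        return 0
    return 1 + sum(1 for i in range(1, len(w)) if w[i] != w[i - 1])


def estimate_runs(w: str, samples: int, rng: random.Random) -> float:
    """Estimate the number of runs from `samples` random probes.

    Position i >= 1 starts a new run exactly when w[i] != w[i - 1], so the
    fraction of probed positions that start a run, scaled by the n - 1
    candidate positions, is an unbiased estimate of (runs - 1).
    """
    n = len(w)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    hits = 0
    for _ in range(samples):
        i = rng.randrange(1, n)  # uniform over positions 1 .. n-1
        if w[i] != w[i - 1]:
            hits += 1
    return 1.0 + (n - 1) * hits / samples


if __name__ == "__main__":
    text = "a" * 4000 + "b" * 3000 + "ab" * 1500
    print("exact runs:    ", count_runs(text))
    print("estimated runs:", estimate_runs(text, 500, random.Random(1)))
```

Each probe reads two adjacent symbols, so the cost depends on the number of samples rather than on the string length. The price is an error that grows as the sample size shrinks.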

Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
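To make the ℓth subword complexity concrete, here is a minimal sketch that counts the distinct length-ℓ substrings of a string exactly. It reads the whole input, so it only illustrates the quantity and is not a sublinear estimator. The function name `subword_complexity` is an assumption made for this example.

```python
def subword_complexity(w: str, ell: int) -> int:
    """Number of distinct substrings of length ell that occur in w."""
    if ell <= 0 or ell > len(w):
        return 0
    return len({w[i:i + ell] for i in range(len(w) - ell + 1)})


if __name__ == "__main__":
    repetitive = "ab" * 8          # highly repetitive, few distinct substrings
    varied = "abcdefghijklmnop"    # all symbols distinct
    for ell in (1, 2, 3):
        print(ell, subword_complexity(repetitive, ell), subword_complexity(varied, ell))
```

A repetitive string has few distinct short substrings, while a string of the same length with no repeats has close to the maximum possible number. That matches the intuition that this count tracks how compressible the string is.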

Notes

When the sample size is much larger than the alphabet size, then the frequency of each individual symbol (and hence the entropy) can be estimated accurately. When the alphabet is larger than the sample size, then the approximability of the entropy depends on several features of the distribution; see, e.g., Batu et al. [4], Cai et al. [9], Paninski [33, 34], Brautbar and Samorodnitsky [6].
For example, a variant of the RLE scheme, typically used to compress images, runs RLE on the concatenated rows of the image and on the concatenated columns of the image, and stores the shorter of the two compressed files.
The notation \(\tilde{O}(g(k))\) for a function g of a parameter k means \(O(g(k)\cdot\mathrm{polylog}(g(k)))\), where \(\mathrm{polylog}(g(k))=\log^{c}(g(k))\) for some constant c.
To see this, set \(A=o(n^{\alpha/2})\) and \(\epsilon=o(n^{-\alpha/2})\).
Let \(b_i\) be a Boolean variable representing the outcome of the ith coin. Then the output is \(0b_{1}01\overline{b_{2}}10b_{3}01\overline{b_{4}}1\ldots\).
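The image variant of RLE mentioned in the notes above (encode the concatenated rows, encode the concatenated columns, keep the shorter result) can be sketched as follows. This is an illustrative sketch, not the article's code. It assumes "shorter" is measured by the number of (symbol, run length) pairs, and the names `rle` and `rle_image` are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle(seq: Sequence[str]) -> List[Tuple[str, int]]:
    """Run-length encode a sequence as (symbol, run length) pairs."""
    return [(sym, len(list(group))) for sym, group in groupby(seq)]


def rle_image(rows: Sequence[str]) -> Tuple[str, List[Tuple[str, int]]]:
    """Encode an image row-wise and column-wise and keep the shorter encoding.

    `rows` holds equal-length strings, one per image row. Ties go to rows.
    """
    by_rows = rle([pixel for row in rows for pixel in row])
    by_cols = rle([pixel for col in zip(*rows) for pixel in col])
    if len(by_cols) < len(by_rows):
        return "columns", by_cols
    return "rows", by_rows


if __name__ == "__main__":
    vertical_stripes = ["0011", "0011", "0011"]
    print(rle_image(vertical_stripes))
```

For vertical stripes the column-wise scan yields long runs, so the column encoding wins. For horizontal stripes the row-wise scan wins instead.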

References

1. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Comput. 23(1), 90–93 (1974)
2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)
3. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Sampling algorithms: lower bounds and applications. In: Proceedings of the Thirty-Third Annual ACM Symposium on the Theory of Computing (STOC), pp. 266–275 (2001)
4. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)
5. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4), 048702 (2002). See comment by Khmelev D.V., Teahan W.J.: Phys. Rev. Lett. 90(8), 089803 (2003); and the reply: Phys. Rev. Lett. 90(8), 089804 (2003)
6. Brautbar, M., Samorodnitsky, A.: Approximating entropy from sublinear samples. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 366–375 (2007)
7. Bunge, J.: Bibliography on estimating the number of classes in a population. www.stat.cornell.edu/~bunge/bibliography.htm
8. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)
9. Cai, H., Kulkarni, S.R., Verdú, S.: Universal entropy estimation via block sorting. IEEE Trans. Inf. Theory 50(7), 1551–1561 (2004)
10. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 268–279. ACM, New York (2000)
11. Chui, C.K.: An Introduction to Wavelets. Academic Press, San Diego (1992)
12. Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)
13. Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J., Cooper, S.B., Li, A. (eds.) Proceedings of the Third International Conference on Theory and Applications of Models of Computation (TAMC). Lecture Notes in Computer Science, vol. 3959, pp. 21–45. Springer, Berlin (2006)
14. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
15. Cormode, G., Muthukrishnan, S.: Substring compression problems. In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 321–330 (2005)
16. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)
17. Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinformatics 2007, 8:252 (2007)
18. de Luca, A.: On the combinatorics of finite words. Theor. Comput. Sci. 218(1), 13–39 (1999)
19. Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proceedings of the Data Compression Conference (DCC), p. 555 (2000)
20. Gheorghiciuc, I., Ward, M.: On correlation polynomials and subword complexity. In: Discrete Math and Theoretical Computer Science (DMTCS), Proceedings of the Conference on Analysis of Algorithms (AofA), pp. 1–18 (2007)
21. Ilie, L., Yu, S., Zhang, K.: Repetition complexity of words. In: Ibarra, O.H., Zhang, L. (eds.) Proceedings of the 8th Annual International Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 2387, pp. 320–329. Springer, Berlin (2002)
22. Janson, S., Lonardi, S., Szpankowski, W.: On average sequence complexity. Theor. Comput. Sci. 326(1–3), 213–227 (2004)
23. Kása, Z.: On the d-complexity of strings. Pure Math. Appl. 9(1–2), 119–128 (1998)
24. Keller, O., Kopelowitz, T., Landau, S., Lewenstein, M.: Generalized substring compression. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 26–38 (2009)
25. Keogh, E., Lonardi, S., Ratanamahatana, C.: Towards parameter-free data mining. In: Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 206–215 (2004)
26. Keogh, E.J., Keogh, L., Handley, J.: Compression-based data mining. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining, pp. 278–285. IGI Global (2009)
27. Kukushkina, O.V., Polikarpov, A.A., Khmelev, D.V.: Using literal and grammatical statistics for authorship attribution. Problemy Peredachi Inf. 37(2), 96–98 (2000) (Problems of Information Transmission (Engl. Transl.) 37, 172–184 (2001))
28. Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002)
29. Levé, F., Séébold, P.: Proof of a conjecture on word complexity. Bull. Belg. Math. Soc. 8(2), 277–291 (2001)
30. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)
31. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (1997)
32. Loewenstern, D., Hirsh, H., Noordewier, M., Yianilos, P.: DNA sequence classification using compression-based induction. Tech. Rep. 95-04, Rutgers University, DIMACS (1995)
33. Paninski, L.: Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253 (2003)
34. Paninski, L.: Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 50(9), 2200–2203 (2004)
35. Pierce, L. II, Shields, P.C.: Sequences incompressible by SLZ (LZW), yet fully compressible by ULZ. In: Numbers, Information and Complexity, I, pp. 385–390. Kluwer, Norwell (2000)
36. Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.: Sublinear algorithms for approximating string compressibility. In: Proceedings of the Eleventh International Workshop on Randomization and Computation (RANDOM), pp. 609–623 (2007)
37. Raskhodnikova, S., Ron, D., Shpilka, A., Smith, A.: Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39(3), 813–842 (2009)
38. Sculley, D., Brodley, C.E.: Compression and machine learning: a new perspective on feature space vectors. In: Proceedings of the Data Compression Conference (DCC), pp. 332–341 (2006)
39. Shallit, J.: On the maximum number of distinct factors of a binary string. Graphs Comb. 9(2), 197–200 (1993)
40. Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J.: The context-tree weighting method: basic properties. IEEE Trans. Inf. Theory 41(3), 653–664 (1995)
41. Witten, I.H., Bray, Z., Mahoui, M., Teahan, W.J.: Text mining: a new frontier for lossless compression. In: Proceedings of the Data Compression Conference (DCC), pp. 198–207 (1999)
42. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337–343 (1977)
43. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 530–536 (1978)

Acknowledgements

We would like to thank Amir Shpilka, who was involved in a related paper on distribution support testing [37] and whose comments greatly improved drafts of this article. We would also like to thank Eric Lehman for discussing his thesis material with us and Oded Goldreich and Omer Reingold for helpful comments. Finally, we thank several anonymous reviewers for helpful comments, especially regarding previous work.

Author information

Authors and Affiliations

Pennsylvania State University, University Park, PA, USA
Sofya Raskhodnikova & Adam Smith
Tel Aviv University, Tel Aviv, Israel
Dana Ron & Ronitt Rubinfeld
MIT, Cambridge, MA, USA
Ronitt Rubinfeld

Corresponding author

Correspondence to Sofya Raskhodnikova.

Additional information

A preliminary version of this paper appeared in the proceedings of RANDOM 2007 [36].

This research was initiated while the first three authors were visiting the Radcliffe Institute for Advanced Study in Cambridge, MA and conducted while S.R. was at the Hebrew University of Jerusalem, Israel, supported by the Lady Davis Fellowship, and while both S.R. and A.S. were at the Weizmann Institute of Science, Israel. A.S. was supported at Weizmann by the Louis L. and Anita M. Perlman Postdoctoral Fellowship. Currently, S.R. is supported by NSF/CCF CAREER award 0845701 and A.S., by NSF/CCF CAREER award 0747294. D.R. is supported by the Israel Science Foundation (grant number 89/05).

R.R. is supported by NSF awards CCF-1065125 and CCF-0728645, Marie Curie Reintegration grant PIRG03-GA-2008-231077 and the Israel Science Foundation grant nos. 1147/09 and 1675/09.

About this article

Cite this article

Raskhodnikova, S., Ron, D., Rubinfeld, R. et al. Sublinear Algorithms for Approximating String Compressibility. Algorithmica 65, 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6

Received: 31 March 2011
Accepted: 04 February 2012
Published: 22 February 2012
Issue Date: March 2013
DOI: https://doi.org/10.1007/s00453-012-9618-6
