Sublinear Algorithms for Approximating String Compressibility

Algorithmica

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.

Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
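The two quantities the abstract refers to can be made concrete with a small sketch. This is an illustration only, not the authors' algorithms: `subword_complexity` computes the ℓth subword complexity exactly in linear time, and `estimate_run_count` (a hypothetical helper) uses the number of runs as a rough proxy for RLE cost, ignoring the bits needed to encode run lengths, and estimates it by sampling adjacent positions.

```python
import random


def subword_complexity(s: str, ell: int) -> int:
    """Exact number of distinct substrings of length ell in s."""
    if ell <= 0 or ell > len(s):
        return 0
    return len({s[i:i + ell] for i in range(len(s) - ell + 1)})


def estimate_run_count(s: str, samples: int, rng: random.Random) -> float:
    """Estimate the number of runs in s from `samples` random adjacent pairs.

    Assumption: the run count stands in for RLE cost; a real RLE cost
    would also account for the encoding of each run length.
    """
    n = len(s)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    boundaries = 0
    for _ in range(samples):
        i = rng.randrange(n - 1)  # uniform position among adjacent pairs
        if s[i] != s[i + 1]:      # a run boundary lies between i and i+1
            boundaries += 1
    # One run plus the estimated number of boundaries among n-1 pairs.
    return 1.0 + (n - 1) * boundaries / samples
```

The sampler reads only `samples` pairs of characters, so its cost does not depend on the string length; its additive error shrinks as the sample count grows.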

Notes

  1. When the sample size is much larger than the alphabet size, then the frequency of each individual symbol (and hence the entropy) can be estimated accurately. When the alphabet is larger than the sample size, then the approximability of the entropy depends on several features of the distribution; see, e.g., Batu et al. [4], Cai et al. [9], Paninski [33, 34], Brautbar and Samorodnitsky [6].

  2. For example, a variant of the RLE scheme, typically used to compress images, runs RLE on the concatenated rows of the image and on the concatenated columns of the image, and stores the shorter of the two compressed files.

  3. The notation \(\tilde{O}(g(k))\) for a function \(g\) of a parameter \(k\) means \(O(g(k)\cdot\operatorname{polylog}(g(k)))\), where \(\operatorname{polylog}(g(k))=\log^{c}(g(k))\) for some constant \(c\).

  4. To see this, set \(A=o(n^{\alpha/2})\) and \(\epsilon=o(n^{\alpha/2})\).

  5. Let \(b_{i}\) be a boolean variable representing the outcome of the \(i\)th coin. Then the output is \(0b_{1}01\overline{b_{2}}10b_{3} 01\overline{b_{4}}1\ldots\).
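The RLE variant described in note 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a list of equal-length rows, and the size of a compressed file is measured by its number of (symbol, run length) pairs. The names `rle` and `rle_image` are hypothetical.

```python
from itertools import groupby


def rle(seq):
    """Run-length encode a sequence as a list of (symbol, run_length) pairs."""
    return [(sym, sum(1 for _ in grp)) for sym, grp in groupby(seq)]


def rle_image(rows):
    """Encode the concatenated rows and the concatenated columns; keep the shorter.

    Assumption: "shorter" means fewer (symbol, run_length) pairs.
    """
    by_rows = rle([p for row in rows for p in row])
    by_cols = rle([p for col in zip(*rows) for p in col])
    if len(by_rows) <= len(by_cols):
        return ("rows", by_rows)
    return ("cols", by_cols)
```

For an image with vertical stripes, such as `[[0, 1], [0, 1], [0, 1]]`, the column order produces two long runs while the row order produces six runs of length one, so the column encoding is kept.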

References

  1. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Comput. 23(1), 90–93 (1974)

  2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

  3. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Sampling algorithms: lower bounds and applications. In: Proceedings of the Thirty-Third Annual ACM Symposium on the Theory of Computing (STOC), pp. 266–275 (2001)

  4. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)

  5. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4), 048702 (2002). See comment by Khmelev D.V., Teahan W.J.: Phys. Rev. Lett. 90(8), 089803 (2003); and the reply: Phys. Rev. Lett. 90(8), 089804 (2003)

  6. Brautbar, M., Samorodnitsky, A.: Approximating entropy from sublinear samples. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 366–375 (2007)

  7. Bunge, J.: Bibliography on estimating the number of classes in a population. www.stat.cornell.edu/~bunge/bibliography.htm

  8. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

  9. Cai, H., Kulkarni, S.R., Verdú, S.: Universal entropy estimation via block sorting. IEEE Trans. Inf. Theory 50(7), 1551–1561 (2004)

  10. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 268–279. ACM, New York (2000)

  11. Chui, C.K.: An Introduction to Wavelets. Academic Press, San Diego (1992)

  12. Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)

  13. Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J., Cooper, S.B., Li, A. (eds.) Proceedings of the Third International Conference on Theory and Applications of Models of Computation (TAMC). Lecture Notes in Computer Science, vol. 3959, pp. 21–45. Springer, Berlin (2006)

  14. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)

  15. Cormode, G., Muthukrishnan, S.: Substring compression problems. In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 321–330 (2005)

  16. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)

  17. Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinformatics 8, 252 (2007)

  18. de Luca, A.: On the combinatorics of finite words. Theor. Comput. Sci. 218(1), 13–39 (1999)

  19. Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proceedings of the Data Compression Conference (DCC), p. 555 (2000)

  20. Gheorghiciuc, I., Ward, M.: On correlation polynomials and subword complexity. In: Discrete Math and Theoretical Computer Science (DMTCS), Proceedings of the Conference on Analysis of Algorithms (AofA), pp. 1–18 (2007)

  21. Ilie, L., Yu, S., Zhang, K.: Repetition complexity of words. In: Ibarra, O.H., Zhang, L. (eds.) Proceedings of the 8th Annual International Conference on Computing and Combinatorics (COCOON). Lecture Notes in Computer Science, vol. 2387, pp. 320–329. Springer, Berlin (2002)

  22. Janson, S., Lonardi, S., Szpankowski, W.: On average sequence complexity. Theor. Comput. Sci. 326(1–3), 213–227 (2004)

  23. Kása, Z.: On the d-complexity of strings. Pure Math. Appl. 9(1–2), 119–128 (1998)

  24. Keller, O., Kopelowitz, T., Landau, S., Lewenstein, M.: Generalized substring compression. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 26–38 (2009)

  25. Keogh, E., Lonardi, S., Ratanamahatana, C.: Towards parameter-free data mining. In: Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 206–215 (2004)

  26. Keogh, E.J., Keogh, L., Handley, J.: Compression-based data mining. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining, pp. 278–285. IGI Global (2009)

  27. Kukushkina, O.V., Polikarpov, A.A., Khmelev, D.V.: Using literal and grammatical statistics for authorship attribution. Problemy Peredachi Inf. 37(2), 96–98 (2000) (Problems of Information Transmission (Engl. Transl.) 37, 172–184 (2001))

  28. Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002)

  29. Levé, F., Séébold, P.: Proof of a conjecture on word complexity. Bull. Belg. Math. Soc. 8(2), 277–291 (2001)

  30. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)

  31. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (1997)

  32. Loewenstern, D., Hirsh, H., Noordewier, M., Yianilos, P.: DNA sequence classification using compression-based induction. Tech. Rep. 95-04, Rutgers University, DIMACS (1995)

  33. Paninski, L.: Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253 (2003)

  34. Paninski, L.: Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 50(9), 2200–2203 (2004)

  35. Pierce, L. II, Shields, P.C.: Sequences incompressible by SLZ (LZW), yet fully compressible by ULZ. In: Numbers, Information and Complexity, I, pp. 385–390. Kluwer, Norwell (2000)

  36. Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.: Sublinear algorithms for approximating string compressibility. In: Proceedings of the Eleventh International Workshop on Randomization and Computation (RANDOM), pp. 609–623 (2007)

  37. Raskhodnikova, S., Ron, D., Shpilka, A., Smith, A.: Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39(3), 813–842 (2009)

  38. Sculley, D., Brodley, C.E.: Compression and machine learning: a new perspective on feature space vectors. In: Proceedings of the Data Compression Conference (DCC), pp. 332–341 (2006)

  39. Shallit, J.: On the maximum number of distinct factors of a binary string. Graphs Comb. 9(2), 197–200 (1993)

  40. Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J.: The context-tree weighting method: basic properties. IEEE Trans. Inf. Theory 41(3), 653–664 (1995)

  41. Witten, I.H., Bray, Z., Mahoui, M., Teahan, W.J.: Text mining: a new frontier for lossless compression. In: Proceedings of the Data Compression Conference (DCC), pp. 198–207 (1999)

  42. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337–343 (1977)

  43. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 530–536 (1978)

Acknowledgements

We would like to thank Amir Shpilka, who was involved in a related paper on distribution support testing [37] and whose comments greatly improved drafts of this article. We would also like to thank Eric Lehman for discussing his thesis material with us and Oded Goldreich and Omer Reingold for helpful comments. Finally, we thank several anonymous reviewers for helpful comments, especially regarding previous work.

Author information

Corresponding author

Correspondence to Sofya Raskhodnikova.

Additional information

A preliminary version of this paper appeared in the proceedings of RANDOM 2007 [36].

This research was initiated while the first three authors were visiting the Radcliffe Institute for Advanced Study in Cambridge, MA and conducted while S.R. was at the Hebrew University of Jerusalem, Israel, supported by the Lady Davis Fellowship, and while both S.R. and A.S. were at the Weizmann Institute of Science, Israel. A.S. was supported at Weizmann by the Louis L. and Anita M. Perlman Postdoctoral Fellowship. Currently, S.R. is supported by NSF/CCF CAREER award 0845701 and A.S., by NSF/CCF CAREER award 0747294. D.R. is supported by the Israel Science Foundation (grant number 89/05).

R.R. is supported by NSF awards CCF-1065125 and CCF-0728645, Marie Curie Reintegration grant PIRG03-GA-2008-231077 and the Israel Science Foundation grant nos. 1147/09 and 1675/09.

About this article

Cite this article

Raskhodnikova, S., Ron, D., Rubinfeld, R. et al. Sublinear Algorithms for Approximating String Compressibility. Algorithmica 65, 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6

  • DOI: https://doi.org/10.1007/s00453-012-9618-6
