Algorithmica, Volume 65, Issue 3, pp 685–709

Sublinear Algorithms for Approximating String Compressibility

  • Sofya Raskhodnikova
  • Dana Ron
  • Ronitt Rubinfeld
  • Adam Smith

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
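To make the idea of sublinear approximation concrete for RLE, here is a minimal sketch, not the paper's algorithm. It relies on the fact that the number of runs in a string equals the number of positions that start a run, so sampling random positions and checking each against its predecessor estimates the run count while reading only a few symbols. The function names and the `num_samples` parameter are hypothetical, and the sketch makes no claim about the sample size needed for a given error guarantee.

```python
import random


def exact_run_count(s):
    """Return the number of maximal runs of equal symbols in s (reads all of s)."""
    if not s:
        return 0
    return 1 + sum(1 for i in range(1, len(s)) if s[i] != s[i - 1])


def estimate_run_count(s, num_samples, rng=None):
    """Estimate the run count of s from random positions.

    Reads at most 2 * num_samples symbols. A position "starts a run" if it
    is the first position or differs from its predecessor. The fraction of
    sampled positions that start a run, scaled by len(s), is an unbiased
    estimate of the number of runs.
    """
    rng = rng or random.Random()
    n = len(s)
    if n == 0:
        return 0.0
    hits = 0
    for _ in range(num_samples):
        i = rng.randrange(n)  # uniform position in [0, n)
        if i == 0 or s[i] != s[i - 1]:
            hits += 1
    return n * hits / num_samples
```

The estimate is only informative when runs are not too rare: if very few positions start a run, a small sample is likely to miss all of them, so the estimator is better suited to additive than to multiplicative error.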

Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
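The two quantities that the structural lemmas relate can be computed directly on small inputs. The sketch below is an illustration under stated assumptions: `subword_complexity` counts distinct substrings of a given length, and `lz77_phrase_count` uses a simple greedy parsing rule (copy the longest match that starts earlier in the string, otherwise emit one literal symbol). That rule is an assumption and may differ from the LZ77 variant studied in the paper. Both function names are hypothetical, and both run in superlinear time, so they serve only to define the quantities, not to approximate them.

```python
def subword_complexity(s, length):
    """Return the number of distinct substrings of s with the given length."""
    if length <= 0 or length > len(s):
        return 0
    return len({s[i:i + length] for i in range(len(s) - length + 1)})


def lz77_phrase_count(s):
    """Count phrases in a simple greedy LZ77-style parse of s.

    Assumed rule: at position i, take the longest substring starting at i
    that also starts at some earlier position j < i (overlap allowed).
    If no symbol matches, emit a single literal instead.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):
            k = 0
            # Extend the match while the symbols agree and we stay in bounds.
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # advance by the match length, or by one literal
        phrases += 1
    return phrases
```

On a highly repetitive string such as `"aaaa"`, both quantities are small: there is one distinct substring of length 2, and the parse has two phrases (a literal followed by one overlapping copy).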

Keywords

Sublinear algorithms, Lossless compression, Run-length encoding, Lempel-Ziv

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Sofya Raskhodnikova (1)
  • Dana Ron (2)
  • Ronitt Rubinfeld (2, 3)
  • Adam Smith (1)

  1. Pennsylvania State University, University Park, USA
  2. Tel Aviv University, Tel Aviv, Israel
  3. MIT, Cambridge, USA
