Algorithmica, Volume 61, Issue 1, pp 51–74

On Optimally Partitioning a Text to Improve Its Compression

  • Paolo Ferragina
  • Igor Nitto
  • Rossano Venturini
Article

Abstract

In this paper we investigate the problem of partitioning an input string T in such a way that compressing its parts individually via a base-compressor C yields a compressed output that is shorter than the one obtained by applying C over the entire T at once. This problem was introduced in Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688–713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184–193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007). Unfortunately, the literature offers poor solutions: namely, we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825–851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129–143, 2003), or a few heuristics that do not guarantee any bounds on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003), or algorithms that are efficient but work only in some specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688–713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007) and achieve compression performance that might be worse than the optimal partitioning by an Ω(log n/log log n) factor. Therefore, efficiently computing the optimal solution is still an open problem (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939–942, 2008).
In this paper we provide the first algorithm that computes, in O(n log^{1+ε} n) time and O(n) space, a partition of T whose compressed output is guaranteed to be no more than a factor (1+ε) worse than the optimal one, where ε may be any positive constant fixed in advance. This result holds for any base-compressor C whose compression performance can be bounded in terms of the zero-th or the k-th order empirical entropy of the text T. We will also discuss extensions of our results to BWT-based compressors and to the compression booster of Ferragina et al. (J. ACM 52:688–713, 2005).
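
For intuition about the problem, the following is a minimal Python sketch of an exact dynamic program in the spirit of the baseline approach mentioned above. It is not the (1+ε)-approximation algorithm of the paper. The cost model is an assumption made here for illustration: each part s is charged |s|·H0(s) bits (its zero-th order empirical entropy) plus a fixed per-part overhead, and the names `optimal_partition` and `overhead_bits` are hypothetical. Because the entropy of each candidate part is updated incrementally, this sketch runs in O(n²) time under that simplified model.

```python
import math
from collections import defaultdict


def optimal_partition(text, overhead_bits=8.0):
    """Exact DP: best[j] = min over i < j of best[i] + cost(text[i:j]).

    Assumed cost model (illustrative only):
        cost(s) = |s| * H0(s) + overhead_bits
    where |s| * H0(s) = |s| * log2|s| - sum_c n_c * log2(n_c).
    """
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[j]: cheapest encoding of text[:j]
    prev = [0] * (n + 1)            # prev[j]: start of the last part in that solution

    for i in range(n):
        counts = defaultdict(int)
        s = 0.0                     # running value of sum_c n_c * log2(n_c)
        for j in range(i + 1, n + 1):
            c = text[j - 1]
            k = counts[c]
            if k > 0:
                s -= k * math.log2(k)
            counts[c] = k + 1
            s += (k + 1) * math.log2(k + 1)

            length = j - i
            entropy_bits = max(0.0, length * math.log2(length) - s)
            candidate = best[i] + entropy_bits + overhead_bits
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the back-pointers to recover the parts.
    parts = []
    j = n
    while j > 0:
        i = prev[j]
        parts.append(text[i:j])
        j = i
    parts.reverse()
    return best[n], parts
```

With a small overhead, a text such as "aaaabbbb" is cheaper to encode as two homogeneous parts than as one mixed part; with a very large overhead, the single-part solution wins. The quadratic number of candidate parts examined here is what makes an exact approach of this kind impractical on long texts.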

Keywords

Data compression, Dynamic programming, Compression boosting, Table compression, Empirical entropy, Burrows-Wheeler transform, Arithmetic and Huffman coding

References

  1. Bentley, J.L., McIlroy, M.D.: Data compression with long repeated strings. Inf. Sci. 135(1–2), 1–11 (2001)
  2. Bentley, J.L., Sleator, D.D., Tarjan, R.E., Wei, V.K.: A locally adaptive data compression scheme. Commun. ACM 29(4), 320–330 (1986)
  3. Buchsbaum, A.L., Giancarlo, R.: Table compression. In: Kao, M.Y. (ed.) Encyclopedia of Algorithms, pp. 939–942. Springer, Berlin (2008)
  4. Buchsbaum, A.L., Caldwell, D.F., Church, K.W., Fowler, G.S., Muthukrishnan, S.: Engineering the compression of massive tables: an experimental approach. In: Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 175–184 (2000)
  5. Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. J. ACM 50(6), 825–851 (2003)
  6. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
  7. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2) (2008)
  8. Dasgupta, S., Papadimitriou, C., Vazirani, U.: Algorithms. McGraw-Hill Science/Engineering/Math, New York (2006)
  9. Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)
  10. Ferragina, P., Manzini, G.: On compressing the textual web. In: Proc. of Third ACM Conference on Web Search and Data Mining (WSDM) (2010)
  11. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms (2010, to appear)
  12. Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52, 688–713 (2005)
  13. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. of 46th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 184–193 (2005)
  14. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and searching XML data via two zips. In: Proc. of 15th International World Wide Web Conference (WWW), pp. 751–760 (2006)
  15. Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library: theory vs practice in BWT compression. In: Proc. of 14th European Symposium on Algorithms (ESA’06). LNCS, vol. 4168, pp. 756–767. Springer, Berlin (2006)
  16. Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Inf. Comput. 207, 849–866 (2009)
  17. Giancarlo, R.: Dynamic programming: special cases. In: Apostolico, A., Galil, Z. (eds.) Pattern Matching Algorithms, 2nd edn., pp. 201–236. Oxford University Press, London (1997)
  18. Giancarlo, R., Sciortino, M.: Optimal partitions of strings: a new class of Burrows-Wheeler compression algorithms. In: Proc. of 14th Symposium on Combinatorial Pattern Matching (CPM). LNCS, vol. 2676, pp. 129–143. Springer, Berlin (2003)
  19. Giancarlo, R., Restivo, A., Sciortino, M.: From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theor. Comput. Sci. 387(3), 236–248 (2007)
  20. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. of 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
  21. Howard, P.G., Vitter, J.S.: Analysis of arithmetic coding for data compression. Inf. Process. Manag. 28(6), 749–764 (1992)
  22. Kaplan, H., Landau, S., Verbin, E.: A simpler analysis of Burrows-Wheeler-based compression. Theor. Comput. Sci. 387(3), 220–235 (2007)
  23. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)
  24. Kulkarni, P., Douglis, F., LaVoie, J.D., Tracey, J.M.: Redundancy elimination within large collections of files. In: USENIX Annual Technical Conference, pp. 59–72 (2004)
  25. Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. of 7th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 3887, pp. 703–714. Springer, Berlin (2006)
  26. Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Proc. of 14th Symp. on String Processing and Information Retrieval (SPIRE). LNCS, vol. 4726, pp. 229–241. Springer, Berlin (2007)
  27. Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
  28. Moffat, A., Isal, R.Y.: Word-based text compression using the Burrows-Wheeler transform. Inf. Process. Manag. 41(5), 1175–1192 (2005)
  29. Ouyang, Z., Memon, N.D., Suel, T., Trendafilov, D.: Cluster-based delta compression of a collection of files. In: Proc. of 3rd Conference on Web Information Systems Engineering (WISE), pp. 257–268. IEEE Comput. Soc., Los Alamitos (2002)
  30. Suel, T., Memon, N.: Algorithms for delta compression and remote file synchronization. In: Sayood, K. (ed.) Lossless Compression Handbook. Academic Press, New York (2002)
  31. Trendafilov, D., Memon, N., Suel, T.: Compressing file collections with a TSP-based approach. Technical Report TR-CIS-2004-02, Polytechnic University (2004)
  32. Vo, B.D., Vo, K.-P.: Compressing table data with column dependency. Theor. Comput. Sci. 387(3), 273–283 (2007)
  33. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Paolo Ferragina (1), email author
  • Igor Nitto (1)
  • Rossano Venturini (2)

  1. Dipartimento di Informatica, University of Pisa, Pisa, Italy
  2. ISTI-CNR, Pisa, Italy