On Optimally Partitioning a Text to Improve Its Compression

Algorithmica

Abstract

In this paper we investigate the problem of partitioning an input string T so that compressing its parts individually via a base compressor C yields a compressed output that is shorter than the one obtained by applying C to the entire T at once. This problem was introduced by Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688–713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184–193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007). Unfortunately, the literature offers poor solutions: we know either a cubic-time algorithm that computes the optimal partition via dynamic programming (Buchsbaum et al. in J. ACM 50(6):825–851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129–143, 2003), or a few heuristics that do not guarantee any bound on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003), or algorithms that are efficient but work only in some specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688–713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007) and achieve compression performance that might be worse than that of the optimal partitioning by an Ω(log n/log log n) factor. Therefore, efficiently computing the optimal solution is still an open problem (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939–942, 2008).
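
To make the dynamic-programming baseline mentioned above concrete, here is a minimal Python sketch. It is not the algorithm of this paper: it is the straightforward recurrence best[i] = min over j < i of best[j] + |C(T[j:i])|, with zlib used purely as a stand-in for the base compressor C. The names `optimal_partition` and `zlib_cost` are hypothetical. With a cost function that takes time linear in the block length, the two nested loops give the cubic running time.

```python
import zlib
from typing import Callable, List, Tuple


def zlib_cost(block: bytes) -> int:
    """Size in bytes of `block` compressed on its own (stand-in for C)."""
    return len(zlib.compress(block, 9))


def optimal_partition(
    text: bytes, cost: Callable[[bytes], int] = zlib_cost
) -> Tuple[float, List[Tuple[int, int]]]:
    """Exhaustive DP over all split points; returns (total cost, blocks).

    Each block is a half-open index pair (start, end) into `text`.
    """
    n = len(text)
    best = [0.0] + [float("inf")] * n  # best[i]: cheapest encoding of text[:i]
    prev = [0] * (n + 1)               # prev[i]: start of the last block
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(text[j:i])
            if candidate < best[i]:
                best[i] = candidate
                prev[i] = j
    # Walk the back-pointers to recover the blocks in left-to-right order.
    blocks: List[Tuple[int, int]] = []
    i = n
    while i > 0:
        blocks.append((prev[i], i))
        i = prev[i]
    blocks.reverse()
    return best[n], blocks


if __name__ == "__main__":
    sample = b"aaaaabbbbbabcabc"
    total, blocks = optimal_partition(sample)
    print(total, blocks)
```

Because the single-block partition is always among the candidates, the returned cost can never exceed that of compressing the whole input at once. Every zlib stream carries a fixed header and checksum, so on very short inputs splitting is unlikely to pay off with this particular stand-in.
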
In this paper we provide the first algorithm that computes, in O(n log_{1+ε} n) time and O(n) space, a partition of T whose compressed output is guaranteed to be at most (1+ε) times the size of the optimal one, where ε may be any positive constant fixed in advance. This result holds for any base compressor C whose compression performance can be bounded in terms of the zeroth or the k-th order empirical entropy of the text T. We also discuss extensions of our results to BWT-based compressors and to the compression booster of Ferragina et al. (J. ACM 52:688–713, 2005).
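
As a small illustration of an entropy-bounded cost, the sketch below computes the standard zeroth-order empirical entropy H_0(T) = Σ_c (n_c/n) log2(n/n_c), where n_c is the number of occurrences of symbol c in T, and uses |T|·H_0(T) as an idealized block cost in bits. The function names `h0` and `entropy_cost` are hypothetical, and the per-block overhead that any real compressor pays is deliberately ignored here.

```python
import math
from collections import Counter


def h0(text: str) -> float:
    """Zeroth-order empirical entropy of `text`, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())


def entropy_cost(text: str) -> float:
    """Idealized cost |text| * H_0(text) in bits (no per-block overhead)."""
    return len(text) * h0(text)


if __name__ == "__main__":
    print(entropy_cost("aaaabbbb"))                      # 8.0
    print(entropy_cost("aaaa") + entropy_cost("bbbb"))   # 0.0
```

The whole string costs 8 bits under this measure, while its two homogeneous halves cost 0 bits each, which is the kind of gain a good partition is meant to capture.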

References

  1. Bentley, J.L., McIlroy, M.D.: Data compression with long repeated strings. Inf. Sci. 135(1–2), 1–11 (2001)

  2. Bentley, J.L., Sleator, D.D., Tarjan, R.E., Wei, V.K.: A locally adaptive data compression scheme. Commun. ACM 29(4), 320–330 (1986)

  3. Buchsbaum, A.L., Giancarlo, R.: Table compression. In: Kao, M.Y. (ed.) Encyclopedia of Algorithms, pp. 939–942. Springer, Berlin (2008)

  4. Buchsbaum, A.L., Caldwell, D.F., Church, K.W., Fowler, G.S., Muthukrishnan, S.: Engineering the compression of massive tables: an experimental approach. In: Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 175–184 (2000)

  5. Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. J. ACM 50(6), 825–851 (2003)

  6. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  7. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2) (2008)

  8. Dasgupta, S., Papadimitriou, C., Vazirani, U.: Algorithms. McGraw-Hill Science/Engineering/Math, New York (2006)

  9. Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)

  10. Ferragina, P., Manzini, G.: On compressing the textual web. In: Proc. of Third ACM Conference on Web Search and Data Mining (WSDM) (2010)

  11. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms (2010, to appear)

  12. Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52, 688–713 (2005)

  13. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. of 46th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 184–193 (2005)

  14. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and searching XML data via two zips. In: Proc. of 15th International World Wide Web Conference (WWW), pp. 751–760 (2006)

  15. Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library: theory vs practice in BWT compression. In: Proc. of 14th European Symposium on Algorithms (ESA’06). LNCS, vol. 4168, pp. 756–767. Springer, Berlin (2006)

  16. Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Inf. Comput. 207, 849–866 (2009)

  17. Giancarlo, R.: Dynamic programming: special cases. In: Apostolico, A., Galil, Z. (eds.) Pattern Matching Algorithms, 2nd edn., pp. 201–236. Oxford University Press, London (1997)

  18. Giancarlo, R., Sciortino, M.: Optimal partitions of strings: a new class of Burrows-Wheeler compression algorithms. In: Proc. of 14th Symposium on Combinatorial Pattern Matching (CPM). LNCS, vol. 2676, pp. 129–143. Springer, Berlin (2003)

  19. Giancarlo, R., Restivo, A., Sciortino, M.: From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theor. Comput. Sci. 387(3), 236–248 (2007)

  20. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. of 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)

  21. Howard, P.G., Vitter, J.S.: Analysis of arithmetic coding for data compression. Inf. Process. Manag. 28(6), 749–764 (1992)

  22. Kaplan, H., Landau, S., Verbin, E.: A simpler analysis of Burrows-Wheeler-based compression. Theor. Comput. Sci. 387(3), 220–235 (2007)

  23. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)

  24. Kulkarni, P., Douglis, F., LaVoie, J.D., Tracey, J.M.: Redundancy elimination within large collections of files. In: USENIX Annual Technical Conference, pp. 59–72 (2004)

  25. Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. of 7th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 3887, pp. 703–714. Springer, Berlin (2006)

  26. Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Proc. of 14th Symp. on String Processing and Information Retrieval (SPIRE). LNCS, vol. 4726, pp. 229–241. Springer, Berlin (2007)

  27. Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)

  28. Moffat, A., Isal, R.Y.: Word-based text compression using the Burrows-Wheeler transform. Inf. Process. Manag. 41(5), 1175–1192 (2005)

  29. Ouyang, Z., Memon, N.D., Suel, T., Trendafilov, D.: Cluster-based delta compression of a collection of files. In: Proc. of 3rd Conference on Web Information Systems Engineering (WISE), pp. 257–268. IEEE Comput. Soc., Los Alamitos (2002)

  30. Suel, T., Memon, N.: Algorithms for delta compression and remote file synchronization. In: Sayood, K. (ed.) Lossless Compression Handbook. Academic Press, New York (2002)

  31. Trendafilov, D., Memon, N., Suel, T.: Compressing file collections with a TSP-based approach. Technical Report TR-CIS-2004-02, Polytechnic University (2004)

  32. Vo, B.D., Vo, K.-P.: Compressing table data with column dependency. Theor. Comput. Sci. 387(3), 273–283 (2007)

  33. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)

Author information

Corresponding author

Correspondence to Paolo Ferragina.

Additional information

The first author has been supported in part by a Yahoo! Research grant, by the MIUR-PRIN project “Mad Web”, and by the MIUR-FIRB project “Linguistica 2006”.

About this article

Cite this article

Ferragina, P., Nitto, I. & Venturini, R. On Optimally Partitioning a Text to Improve Its Compression. Algorithmica 61, 51–74 (2011). https://doi.org/10.1007/s00453-010-9437-6

  • DOI: https://doi.org/10.1007/s00453-010-9437-6
