Abstract
In this paper we investigate the problem of partitioning an input string T so that compressing its parts individually via a base compressor C yields a compressed output that is shorter than the one obtained by applying C to the entire T at once. This problem was introduced by Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688–713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184–193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007). Unfortunately, the literature offers poor solutions: we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825–851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129–143, 2003), or a few heuristics that do not guarantee any bound on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175–184, 2000; J. ACM 50(6):825–851, 2003), or algorithms that are efficient but work only in some specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688–713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229–241, 2007) and achieve compression performance that might be worse than that of the optimal partitioning by an Ω(log n/log log n) factor. Therefore, efficiently computing the optimal solution is still an open problem (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939–942, 2008).
In this paper we provide the first algorithm that computes, in O(n log^{1+ε} n) time and O(n) space, a partition of T whose compressed output is guaranteed to be at most (1+ε) times the size of the optimal one, where ε may be any positive constant fixed in advance. This result holds for any base compressor C whose compression performance can be bounded in terms of the zeroth or the k-th order empirical entropy of the text T. We also discuss extensions of our results to BWT-based compressors and to the compression booster of Ferragina et al. (J. ACM 52:688–713, 2005).
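To make the partitioning problem concrete, the following is a minimal Python sketch of the straightforward dynamic-programming baseline mentioned above; it is not the approximation algorithm of this paper. It assumes a simplified cost model in which a block costs its length times its zeroth-order empirical entropy, plus a fixed per-block overhead. The names `h0_bits`, `optimal_partition` and `overhead_bits` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def h0_bits(counts, length):
    """Return length * H_0 of a block, in bits, given its symbol counts."""
    return sum(c * math.log2(length / c) for c in counts.values() if c > 0)


def optimal_partition(text, overhead_bits=8.0):
    """Exact optimal partition under the assumed cost model.

    Assumption (not from the paper): a block costs |block| * H_0(block)
    bits plus a constant `overhead_bits`. The loop examines all O(n^2)
    candidate blocks, so it is only practical for short inputs.
    """
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[j]: minimum cost of text[:j]
    cut = [0] * (n + 1)            # cut[j]: start of the last block of text[:j]
    for i in range(n):
        counts = Counter()
        for j in range(i + 1, n + 1):
            counts[text[j - 1]] += 1  # extend block text[i:j] by one symbol
            cost = best[i] + h0_bits(counts, j - i) + overhead_bits
            if cost < best[j]:
                best[j] = cost
                cut[j] = i
    # Walk the cut points backwards to recover the blocks.
    parts = []
    j = n
    while j > 0:
        parts.append(text[cut[j]:j])
        j = cut[j]
    parts.reverse()
    return best[n], parts
```

Under these assumptions, a text made of two homogeneous runs such as `"aaaaaaaabbbbbbbb"` is split at the run boundary, because each run alone has zero zeroth-order entropy while the whole string costs one bit per symbol. Each of the quadratically many candidate blocks is evaluated here in time proportional to the number of distinct symbols it contains.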
References
Bentley, J.L., McIlroy, M.D.: Data compression with long repeated strings. Inf. Sci. 135(1–2), 1–11 (2001)
Bentley, J.L., Sleator, D.D., Tarjan, R.E., Wei, V.K.: A locally adaptive data compression scheme. Commun. ACM 29(4), 320–330 (1986)
Buchsbaum, A.L., Giancarlo, R.: Table compression. In: Kao, M.Y. (ed.) Encyclopedia of Algorithms, pp. 939–942. Springer, Berlin (2008)
Buchsbaum, A.L., Caldwell, D.F., Church, K.W., Fowler, G.S., Muthukrishnan, S.: Engineering the compression of massive tables: an experimental approach. In: Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 175–184 (2000)
Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. J. ACM 50(6), 825–851 (2003)
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2) (2008)
Dasgupta, S., Papadimitriou, C., Vazirani, U.: Algorithms. McGraw-Hill Science/Engineering/Math, New York (2006)
Fenwick, P.M.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)
Ferragina, P., Manzini, G.: On compressing the textual web. In: Proc. of Third ACM Conference on Web Search and Data Mining (WSDM) (2010)
Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms (2010, to appear)
Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52, 688–713 (2005)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: Proc. of 46th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 184–193 (2005)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and searching XML data via two zips. In: Proc. of 15th International World Wide Web Conference (WWW), pp. 751–760 (2006)
Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library: theory vs practice in BWT compression. In: Proc. of 14th European Symposium on Algorithms (ESA’06). LNCS, vol. 4168, pp. 756–767. Springer, Berlin (2006)
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Inf. Comput. 207, 849–866 (2009)
Giancarlo, R.: Dynamic programming: special cases. In: Apostolico, A., Galil, Z. (eds.) Pattern Matching Algorithms, 2nd edn., pp. 201–236. Oxford University Press, London (1997)
Giancarlo, R., Sciortino, M.: Optimal partitions of strings: a new class of Burrows-Wheeler compression algorithms. In: Proc. of 14th Symposium on Combinatorial Pattern Matching (CPM). LNCS, vol. 2676, pp. 129–143. Springer, Berlin (2003)
Giancarlo, R., Restivo, A., Sciortino, M.: From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theor. Comput. Sci. 387(3), 236–248 (2007)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. of 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
Howard, P.G., Vitter, J.S.: Analysis of arithmetic coding for data compression. Inf. Process. Manag. 28(6), 749–764 (1992)
Kaplan, H., Landau, S., Verbin, E.: A simpler analysis of Burrows-Wheeler-based compression. Theor. Comput. Sci. 387(3), 220–235 (2007)
Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)
Kulkarni, P., Douglis, F., LaVoie, J.D., Tracey, J.M.: Redundancy elimination within large collections of files. In: USENIX Annual Technical Conference, pp. 59–72 (2004)
Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proc. of 7th Latin American Symposium on Theoretical Informatics (LATIN). LNCS, vol. 3887, pp. 703–714. Springer, Berlin (2006)
Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Proc. of 14th Symp. on String Processing and Information Retrieval (SPIRE). LNCS, vol. 4726, pp. 229–241. Springer, Berlin (2007)
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Moffat, A., Isal, R.Y.: Word-based text compression using the Burrows-Wheeler transform. Inf. Process. Manag. 41(5), 1175–1192 (2005)
Ouyang, Z., Memon, N.D., Suel, T., Trendafilov, D.: Cluster-based delta compression of a collection of files. In: Proc. of 3rd Conference on Web Information Systems Engineering (WISE), pp. 257–268. IEEE Comput. Soc., Los Alamitos (2002)
Suel, T., Memon, N.: Algorithms for delta compression and remote file synchronization. In: Sayood, K. (ed.) Lossless Compression Handbook. Academic Press, New York (2002)
Trendafilov, D., Memon, N., Suel, T.: Compressing file collections with a TSP-based approach. Technical Report TR-CIS-2004-02, Polytechnic University (2004)
Vo, B.D., Vo, K.-P.: Compressing table data with column dependency. Theor. Comput. Sci. 387(3), 273–283 (2007)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Los Altos (1999)
Additional information
The first author has been supported in part by a Yahoo! Research grant, by the MIUR-PRIN project “Mad Web”, and by the MIUR-FIRB project “Linguistica 2006”.
About this article
Cite this article
Ferragina, P., Nitto, I. & Venturini, R. On Optimally Partitioning a Text to Improve Its Compression. Algorithmica 61, 51–74 (2011). https://doi.org/10.1007/s00453-010-9437-6
DOI: https://doi.org/10.1007/s00453-010-9437-6