Abstract
We present a technique for proving lower bounds on the compression ratio of algorithms that are based on the Burrows-Wheeler Transform (BWT). We study three well-known BWT-based compressors: the original algorithm suggested by Burrows and Wheeler; BWT with distance coding; and BWT with run-length encoding. For each compressor, we exhibit a Markov source such that, for asymptotically large text generated by the source, the compression ratio divided by the entropy of the source is a constant greater than 1. This constant is 2 − ε, 1.26, and 1.29 for the three compressors, respectively. Our technique is robust and can be used to prove similar claims for most BWT-based compressors (with a few notable exceptions). This stands in contrast to statistical compressors and Lempel-Ziv-style dictionary compressors, which have long been known to be optimal, in the sense that for any Markov source, the compression ratio divided by the entropy of the source asymptotically tends to 1.
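For readers unfamiliar with the transform, the following is a minimal sketch of the BWT and of move-to-front coding, the second stage commonly described for the original Burrows-Wheeler pipeline. It is a naive, quadratic illustration and not the construction analysed in the paper; the function names and the choice of `"\0"` as an end-of-text sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, keep the last column.

    The sentinel (an assumption of this sketch) must not occur in the text and
    sorts before every other character, which makes the transform invertible.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def move_to_front(seq: str) -> list[int]:
    """Replace each symbol by its current position in a recency list."""
    recency = sorted(set(seq))
    ranks = []
    for ch in seq:
        idx = recency.index(ch)
        ranks.append(idx)
        recency.insert(0, recency.pop(idx))  # move the symbol to the front
    return ranks


if __name__ == "__main__":
    transformed = bwt("banana")
    print(repr(transformed))
    print(move_to_front(transformed))
    print(inverse_bwt(transformed))
```

The BWT tends to group equal symbols together, so the move-to-front output is dominated by small integers that an order-0 coder can encode cheaply. The lower bounds in the abstract concern how far such pipelines can remain from the source entropy.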
We experimentally corroborate our theoretical bounds. Furthermore, we compare BWT-based compressors to other compressors and show that for “realistic” Markov sources they indeed perform poorly, often worse than other compressors. This is in contrast with the well-known fact that on English text, BWT-based compressors are superior to many other types of compressors.
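The kind of measurement described above can be sketched with Python's standard library. The example assumes a hypothetical symmetric two-state Markov source (not one of the sources constructed in the paper), whose entropy rate equals the binary entropy of its flip probability. It reports compressed bits per symbol divided by that entropy rate for `bz2` (BWT-based), `zlib` (LZ77-style), and `lzma`. It does not reproduce the constants stated in the abstract, and container overhead inflates the ratio on short inputs.

```python
import bz2
import lzma
import math
import random
import zlib


def binary_entropy(p: float) -> float:
    """Entropy in bits of a biased coin with heads probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def sample_two_state_chain(n: int, flip_prob: float, seed: int = 0) -> bytes:
    """Emit n symbols from a symmetric two-state chain over {'a', 'b'}.

    At each step the state flips with probability flip_prob, so the entropy
    rate of this (hypothetical) source is binary_entropy(flip_prob).
    """
    rng = random.Random(seed)
    state = 0
    out = bytearray()
    for _ in range(n):
        if rng.random() < flip_prob:
            state ^= 1
        out.append(ord("a") + state)
    return bytes(out)


def ratio_over_entropy(data: bytes, entropy_rate: float, compress) -> float:
    """Compressed bits per source symbol, divided by the entropy rate."""
    bits_per_symbol = 8 * len(compress(data)) / len(data)
    return bits_per_symbol / entropy_rate


if __name__ == "__main__":
    flip_prob = 0.1
    data = sample_two_state_chain(200_000, flip_prob, seed=1)
    h = binary_entropy(flip_prob)
    for name, compress in [("bz2", bz2.compress),
                           ("zlib", zlib.compress),
                           ("lzma", lzma.compress)]:
        print(f"{name}: {ratio_over_entropy(data, h, compress):.3f}")
```

A ratio close to 1 means the compressor is near the entropy of the source. Values well above 1 on long inputs indicate the kind of gap the paper quantifies for BWT-based compressors.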
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Kaplan, H., Verbin, E. (2007). Most Burrows-Wheeler Based Compressors Are Not Optimal. In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_13
DOI: https://doi.org/10.1007/978-3-540-73437-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73436-9
Online ISBN: 978-3-540-73437-6
eBook Packages: Computer Science, Computer Science (R0)