The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression

  • Paolo Ferragina
  • Raffaele Giancarlo
  • Giovanni Manzini
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4168)

Abstract

Data Compression is one of the most challenging arenas for both algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient” part of the Burrows-Wheeler compression process. However, only recently have two theoretically superior alternatives to Move-to-Front been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of the boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast-adapting order-zero compressor is enough to provide state-of-the-art BWT compression by simply compressing the run-length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be bypassed by a fast learner.
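To make the stages named in the abstract concrete, here is a minimal Python sketch of a Burrows-Wheeler transform followed by either Move-to-Front encoding or run-length encoding. It is an illustration under stated assumptions, not the authors' boosting library: the function names are hypothetical, the transform uses naive rotation sorting rather than a suffix array, and the static order-zero entropy is only a rough stand-in for the output size of an adaptive order-zero coder.

```python
from collections import Counter
from math import log2


def bwt(text, sentinel="\0"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Assumption: `sentinel` does not occur in `text` and sorts before every
    other character. Real implementations use suffix arrays instead.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(seq):
    """Replace each symbol by its current position in a self-organizing list."""
    table = sorted(set(seq))
    ranks = []
    for symbol in seq:
        rank = table.index(symbol)
        ranks.append(rank)
        table.insert(0, table.pop(rank))  # move the symbol to the front
    return ranks


def run_length_encode(seq):
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for symbol in seq:
        if runs and runs[-1][0] == symbol:
            runs[-1] = (symbol, runs[-1][1] + 1)
        else:
            runs.append((symbol, 1))
    return runs


def order_zero_entropy_bits(symbols):
    """Total bits needed by an ideal static order-zero coder for `symbols`."""
    symbols = list(symbols)
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    transformed = bwt("banana")
    print(repr(transformed))
    print(move_to_front(transformed))
    print(run_length_encode(transformed))
    print(order_zero_entropy_bits(move_to_front(transformed)))
```

The sketch only shows where Move-to-Front and run-length encoding sit after the transform; it does not reproduce the paper's experiments or its adaptive coder.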

Keywords

Compression Ratio · Compression Algorithm · Optimal Partition · Arithmetic Code · Tree Data Structure
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Paolo Ferragina (1)
  • Raffaele Giancarlo (2)
  • Giovanni Manzini (3)
  1. Dipartimento di Informatica, Università di Pisa, Italy
  2. Dipartimento di Matematica ed Applicazioni, Università di Palermo, Italy
  3. Dipartimento di Informatica, Università del Piemonte Orientale, Italy
