Improving Semistatic Compression Via Pair-Based Coding

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4378)

Abstract

In recent years, new semistatic word-based byte-oriented compressors, such as Plain Huffman, Tagged Huffman, and the Dense Codes, have been used to improve the efficiency of text retrieval systems while reducing compressed collections to 30–35% of their original size.

In this paper, we present a new semistatic compressor, called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27–28% of their original size, outperforming the optimal zero-order prefix-free semistatic compressor (Plain Huffman) by more than 3 percentage points. Moreover, PETDC also permits random decompression and direct searches using fast Boyer-Moore algorithms.
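
A small Python sketch may help show why direct search and random decompression are feasible. It assumes the End-Tagged Dense Code byte layout that PETDC's name refers to, in which only the last byte of a codeword has its high bit set. The helper names (`find_codeword`, `split_from`) and the sample byte string are hypothetical, and `bytes.find` merely stands in for a Boyer-Moore-type scan.

```python
def is_end_byte(b: int) -> bool:
    """Assumed layout: only the last byte of a codeword has its high bit set."""
    return b >= 128


def find_codeword(compressed: bytes, codeword: bytes) -> list[int]:
    """Return the offsets where `codeword` occurs on a codeword boundary."""
    hits = []
    pos = compressed.find(codeword)  # stand-in for a Boyer-Moore-type scan
    while pos != -1:
        # A true match starts at offset 0 or right after an end-tagged byte;
        # otherwise the bytes are only the tail of a longer codeword.
        if pos == 0 or is_end_byte(compressed[pos - 1]):
            hits.append(pos)
        pos = compressed.find(codeword, pos + 1)
    return hits


def split_from(compressed: bytes, start: int) -> list[bytes]:
    """Resynchronise at an arbitrary offset and split the rest into codewords."""
    # Skip forward until the previous byte closes a codeword.
    while 0 < start < len(compressed) and not is_end_byte(compressed[start - 1]):
        start += 1
    codewords, current = [], bytearray()
    for b in compressed[start:]:
        current.append(b)
        if is_end_byte(b):
            codewords.append(bytes(current))
            current.clear()
    return codewords


# Hypothetical stream holding the codewords 85 | 00 85 | 81 | 85 (hex).
sample = bytes([0x85, 0x00, 0x85, 0x81, 0x85])
print(find_codeword(sample, b"\x85"))  # the occurrence at offset 2 is rejected
print(split_from(sample, 2))           # decoding resumes at the next boundary
```

Checking a single preceding byte is enough to discard false matches, which is why a plain byte-level pattern matcher can run over the compressed text without decoding it.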

PETDC builds a vocabulary containing both words and pairs of words. The basic idea behind PETDC is that, since each symbol in the vocabulary is given a codeword, compression improves when two words of the source text are replaced by a single codeword.
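
The idea of a mixed vocabulary can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the pair-selection rule (keep every adjacent pair seen at least `min_pair_freq` times), the greedy left-to-right parse, and the function names are assumptions made for illustration. Codewords are assigned by frequency rank using the End-Tagged Dense Code byte layout, in which the last byte of a codeword has its high bit set.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """End-Tagged Dense Code codeword for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte carries the end tag (high bit set)
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)    # earlier bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def parse_with_pairs(words, pairs):
    """Greedy parse: emit a pair symbol whenever the next two words form a chosen pair."""
    symbols, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            symbols.append((words[i], words[i + 1]))
            i += 2
        else:
            symbols.append((words[i],))
            i += 1
    return symbols


def compress(text: str, min_pair_freq: int = 2):
    """Return (compressed bytes, symbol -> codeword table) for a toy text."""
    words = text.split()
    pair_counts = Counter(zip(words, words[1:]))
    # Hypothetical selection rule, not taken from the paper.
    pairs = {p for p, c in pair_counts.items() if c >= min_pair_freq}
    symbols = parse_with_pairs(words, pairs)
    freq = Counter(symbols)
    # Most frequent symbols get the shortest codewords; ties broken by symbol.
    ranked = sorted(freq, key=lambda s: (-freq[s], s))
    code = {s: etdc_encode(r) for r, s in enumerate(ranked)}
    return b"".join(code[s] for s in symbols), code


compressed, code = compress("to be or not to be that is to be")
print(len(compressed), code[("to", "be")])
```

In this toy text the pair ("to", "be") becomes a single symbol, so the ten words are written as seven codewords. A semistatic compressor must also store the vocabulary, so a real selection rule has to weigh the saving from each pair against the growth of the vocabulary.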

Keywords

Compression Ratio, Encode Scheme, String Match, Source Text, Inverted Index
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Gonzalo Navarro (2)
  • José R. Paramá (1)
  1. Database Lab., Univ. da Coruña, Facultade de Informática, Campus de Elviña s/n, 15071 A Coruña, Spain
  2. Dept. of Computer Science, Univ. de Chile, Blanco Encalada 2120, Santiago, Chile