Improving Semistatic Compression Via Pair-Based Coding
In recent years, new semistatic word-based byte-oriented compressors, such as Plain and Tagged Huffman and the Dense Codes, have been used to improve the efficiency of text retrieval systems while reducing the compressed collections to 30–35% of their original size.
In this paper, we present a new semistatic compressor, called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27–28% of their original size, outperforming the optimal zero-order prefix-free semistatic compressor (Plain Huffman) by more than 3 percentage points. Moreover, PETDC also permits random decompression and direct searches using fast Boyer-Moore algorithms.
PETDC builds a vocabulary containing both words and pairs of words. The basic idea on which PETDC is based is that, since each symbol in the vocabulary is given a codeword, compression improves when two words of the source text are replaced by a single codeword.
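The pair-based idea can be illustrated with a minimal sketch. It is not the authors' implementation: the pair-selection rule (keep every adjacent pair seen at least `min_pair_freq` times), the greedy left-to-right parse, and the names `etdc_codeword`, `build_codebook` and `compress` are assumptions made for illustration only. Codewords follow the usual End-Tagged Dense Code convention, in which the last byte of each codeword has its high bit set.

```python
from collections import Counter

def etdc_codeword(rank):
    """End-Tagged Dense Code for a 0-based frequency rank.
    The final byte is >= 128 (tagged); all earlier bytes are < 128."""
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        out.insert(0, rank % 128)
        rank = rank // 128 - 1
    return bytes(out)

def build_codebook(words, min_pair_freq=2):
    """Vocabulary of single words plus frequent adjacent pairs.
    Assumption: a plain frequency threshold selects the pairs."""
    symbols = {(w,): f for w, f in Counter(words).items()}
    pair_freq = Counter(zip(words, words[1:]))
    symbols.update({p: f for p, f in pair_freq.items() if f >= min_pair_freq})
    # More frequent symbols get smaller ranks, hence shorter codewords.
    ranked = sorted(symbols, key=lambda s: (-symbols[s], s))
    return {s: etdc_codeword(i) for i, s in enumerate(ranked)}

def compress(words, codebook):
    """Greedy parse: emit a pair codeword when the next two words
    form a vocabulary pair, otherwise a single-word codeword."""
    out = bytearray()
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in codebook:
            out += codebook[pair]
            i += 2
        else:
            out += codebook[(words[i],)]
            i += 1
    return bytes(out)

text = "to be or not to be".split()
codebook = build_codebook(text)
compressed = compress(text, codebook)
```

On this toy input the pair ("to", "be") enters the vocabulary, so the six words are emitted as four one-byte codewords instead of six. The sketch ranks symbols by raw counts and does not adjust word frequencies after pairs are chosen, which a real compressor would need to handle.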
Keywords: Compression Ratio, Encode Scheme, String Match, Source Text, Inverted Index