Abstract
We present a new statistical compression method, which we call Phrase Based Dense Code (PBDC), aimed at compressing large digital libraries. PBDC compresses the text collection to 30–32% of its original size, allows the text to be kept compressed at all times, and offers efficient on-line information retrieval services. The novelty of PBDC is that it supports continuous growth of the compressed text collection by automatically adapting the vocabulary both to new words and to changes in the word frequency distribution, without degrading the compression ratio. Text compressed with PBDC can be searched directly, without decompression, using fast Boyer-Moore algorithms. It is also possible to decompress arbitrary portions of the collection. Alternative compression methods oriented to information retrieval focus on static collections and are thus less well suited to digital libraries.
This work is partially supported by the CYTED VII.19 RIBIDI Project. It is also funded in part (for the Spanish group) by MCyT (PGE and FEDER) grant (TIC2003-06593) and (for G. Navarro) by Fondecyt Grant 1-050493, Chile.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brisaboa, N.R., Fariña, A., Navarro, G., Paramá, J.R. (2005). Compressing Dynamic Text Collections via Phrase-Based Coding. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_41
DOI: https://doi.org/10.1007/11551362_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28767-4
Online ISBN: 978-3-540-31931-3
eBook Packages: Computer Science, Computer Science (R0)