Compressing Dynamic Text Collections via Phrase-Based Coding

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3652)

Abstract

We present a new statistical compression method, which we call Phrase-Based Dense Code (PBDC), aimed at compressing large digital libraries. PBDC compresses the text collection to 30–32% of its original size, allows the text to be kept compressed at all times, and offers efficient on-line information retrieval services. The novelty of PBDC is that it supports continuous growth of the compressed text collection by automatically adapting the vocabulary both to new words and to changes in the word frequency distribution, without degrading the compression ratio. Text compressed with PBDC can be searched directly, without decompression, using fast Boyer-Moore-type algorithms. It is also possible to decompress arbitrary portions of the collection. Alternative compression methods oriented to information retrieval focus on static collections and are thus less well suited to digital libraries.
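The abstract does not specify the PBDC codeword format, so the following is only a minimal sketch of what searching compressed text directly can look like. It assumes an end-tagged, byte-oriented dense code (the last byte of each codeword has its high bit set, all earlier bytes do not), which is an assumption and not taken from this abstract. The function names and the word ranks are hypothetical, and Horspool's algorithm stands in for the Boyer-Moore family.

```python
def dense_codeword(rank: int) -> bytes:
    """Codeword for a 0-based frequency rank under the ASSUMED end-tagged scheme."""
    out = [128 + rank % 128]        # final byte: high bit set marks the codeword end
    rank = rank // 128 - 1
    while rank >= 0:
        out.insert(0, rank % 128)   # earlier bytes: high bit clear
        rank = rank // 128 - 1
    return bytes(out)


def search_compressed(text: bytes, pattern: bytes) -> list[int]:
    """Horspool search over compressed bytes; returns codeword-aligned offsets."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character shifts, built from every pattern byte except the last.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            # Reject matches that are only the tail of a longer codeword:
            # a true codeword start follows an end-tagged byte (or is at offset 0).
            if pos == 0 or text[pos - 1] >= 128:
                hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


# Hypothetical vocabulary ranks, chosen only to produce a two-byte codeword.
ranks = {"the": 0, "text": 1, "compression": 130}
words = ["the", "text", "compression", "the", "text"]
compressed = b"".join(dense_codeword(ranks[w]) for w in words)

print(search_compressed(compressed, dense_codeword(ranks["text"])))
```

In this toy run the one-byte codeword for rank 2 also appears as the second byte of the codeword for "compression"; the alignment check is what keeps such a raw byte match from being reported as an occurrence.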

Keywords

Text Compression; Text Databases; Digital Libraries

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Gonzalo Navarro (2)
  • José R. Paramá (1)

  1. Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain
  2. Dept. of Computer Science, Univ. de Chile, Santiago, Chile
