Compressing Dynamic Text Collections via Phrase-Based Coding

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3652)

Abstract

We present a new statistical compression method, which we call Phrase Based Dense Code (PBDC), aimed at compressing large digital libraries. PBDC compresses the text collection to 30–32% of its original size, keeps the text compressed at all times, and offers efficient on-line information retrieval services. The novelty of PBDC is that it supports continuous growth of the compressed text collection by automatically adapting the vocabulary both to new words and to changes in the word frequency distribution, without degrading the compression ratio. Text compressed with PBDC can be searched directly, without decompression, using fast Boyer-Moore algorithms. Arbitrary portions of the collection can also be decompressed. Alternative compression methods oriented to information retrieval focus on static collections and are therefore less well suited to digital libraries.
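The abstract does not describe how codewords are assigned. As a rough illustration of the general idea behind byte-oriented dense codes that can be searched without decompression, the following is a minimal Python sketch of a static, word-based End-Tagged Dense Code. It is not PBDC itself: it handles neither phrases nor a growing vocabulary. The function names (`etdc_encode`, `build_codebook`, `compress`, `find_word`) are made up for this example, and Python's built-in `bytes.find` stands in for a Boyer-Moore-type matcher.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with the given 0-based frequency rank.

    Assumption of this sketch: End-Tagged Dense Code, where the last
    byte of a codeword is in 128..255 and all earlier bytes are in 0..127.
    """
    out = [128 + rank % 128]          # terminator byte, high bit set
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # continuation byte, high bit clear
        rank //= 128
    return bytes(reversed(out))


def build_codebook(words):
    """Assign shorter codewords to more frequent words (ties broken alphabetically)."""
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: etdc_encode(r) for r, w in enumerate(ranked)}


def compress(words, codebook):
    """Concatenate the codewords of the input words."""
    return b"".join(codebook[w] for w in words)


def find_word(compressed: bytes, codeword: bytes):
    """Byte offsets where `codeword` occurs as a whole codeword.

    A raw byte match is accepted only if it starts at offset 0 or right
    after a terminator byte (>= 128); otherwise it is the tail of a
    longer codeword.
    """
    positions = []
    start = compressed.find(codeword)
    while start != -1:
        if start == 0 or compressed[start - 1] >= 128:
            positions.append(start)
        start = compressed.find(codeword, start + 1)
    return positions


words = "to be or not to be".split()
codebook = build_codebook(words)
compressed = compress(words, codebook)
hits = find_word(compressed, codebook["to"])
```

Because every codeword ends in a byte with the high bit set, codeword boundaries are recognizable in the compressed stream, which is what makes both direct searching and decompression from an arbitrary position possible in this family of codes.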

This work is partially supported by the CYTED VII.19 RIBIDI Project. It is also funded in part (for the Spanish group) by MCyT (PGE and FEDER) grant TIC2003-06593 and (for G. Navarro) by Fondecyt Grant 1-050493, Chile.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brisaboa, N.R., Fariña, A., Navarro, G., Paramá, J.R. (2005). Compressing Dynamic Text Collections via Phrase-Based Coding. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_41

  • DOI: https://doi.org/10.1007/11551362_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28767-4

  • Online ISBN: 978-3-540-31931-3

  • eBook Packages: Computer Science, Computer Science (R0)