Information Retrieval, Volume 10, Issue 1, pp 1–33

Lightweight natural language text compression

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Article

Abstract

Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
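The abstract does not spell out how the codewords are built, so the following is only a minimal sketch of End-Tagged Dense Code. It assumes byte-oriented codewords in which the highest bit is set on the last byte of each codeword and clear on all others, with words ranked by decreasing frequency (rank 0 is the most frequent). The function names `etdc_encode`, `etdc_decode`, and `build_codebook` are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for a 0-based frequency rank (assumed ETDC layout)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    # Last byte: high bit set (values 128..255) tags the end of the codeword.
    out = [128 + rank % 128]
    x = rank // 128
    # Earlier bytes: high bit clear (values 0..127), all combinations used.
    while x > 0:
        x -= 1
        out.append(x % 128)
        x //= 128
    return bytes(reversed(out))


def etdc_decode(codeword: bytes) -> int:
    """Inverse of etdc_encode: recover the rank from one codeword."""
    x = 0
    for b in codeword[:-1]:
        x = x * 128 + b + 1
    return x * 128 + (codeword[-1] - 128)


def build_codebook(words):
    """Semistatic first pass: rank words by frequency, then assign codewords."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: etdc_encode(i) for i, w in enumerate(ranked)}


if __name__ == "__main__":
    text = "to be or not to be that is the question to be".split()
    codebook = build_codebook(text)
    compressed = b"".join(codebook[w] for w in text)
    print(len(compressed), "bytes for", len(text), "words")
```

Under these assumptions a codeword depends only on the rank of the word, not on its exact frequency, which is why no Huffman tree needs to be built. The tagged last byte is also what would let a search routine recognize codeword boundaries anywhere in the compressed text.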

Keywords

Text databases; Natural language text compression; Searching compressed text

Notes

Acknowledgements

Supported by CYTED VII.19 RIBIDI Project and (for the third author) Millennium Nucleus Center for Web Research, Grant P04-67-F, Mideplan, Chile. Also funded (for the Spanish group) by MCyT (PGE and FEDER) grant (TIC2003-06593) and Xunta de Galicia grant (PGIDIT05SIN10502PR).

References

  1. Allauzen, C., Crochemore, M., & Raffinot, M. (1999). Factor oracle: a new structure for pattern matching. SOFSEM, LNCS 1725 (pp. 295–310).
  2. Baeza-Yates, R., & Navarro, G. (2004). Recent advances in applied probability. In R. Baeza-Yates, J. Glaz, H. Gzyl, J. Husler & J. Palacios (Eds.), Modeling text databases (pp. 1–25). Springer.
  3. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley Longman.
  4. Bell, T. C., Cleary, J. G., & Witten, I. H. (1990). Text compression. Prentice Hall.
  5. Boyer, R. S., & Moore, J. S. (1977). A fast string searching algorithm. Communications of the ACM, 20(10), 762–772.
  6. Brisaboa, N., Fariña, A., Navarro, G., & Esteller, M. (2003a). (s,c)-Dense coding: an optimized compression code for natural language text databases. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE'03) (pp. 122–136). LNCS 2857, Springer-Verlag.
  7. Brisaboa, N., Fariña, A., Navarro, G., & Paramá, J. (2004). Simple, fast, and efficient natural language adaptive compression. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE'04) (pp. 230–241). LNCS 3246, Springer-Verlag.
  8. Brisaboa, N., Fariña, A., Navarro, G., & Paramá, J. (2005a). Compressing dynamic text collections via phrase-based coding. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'05) (pp. 462–474). LNCS 3652, Springer-Verlag.
  9. Brisaboa, N., Fariña, A., Navarro, G., & Paramá, J. (2005b). Efficiently decodable and searchable natural language adaptive compression. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05) (pp. 234–241). ACM Press.
  10. Brisaboa, N., Iglesias, E. L., Navarro, G., & Paramá, J. R. (2003b). An efficient compression code for text databases. In Proceedings of the 25th European Conference on IR Research (ECIR'03) (pp. 468–481). LNCS 2633, Springer-Verlag.
  11. Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
  12. Carpinelli, J., Moffat, A., Neal, R., Salamonsen, W., Stuiver, L., Turpin, A., & Witten, I. (1999). Word, character, integer, and bit based compression using arithmetic coding. http://www.cs.mu.oz.au/~alistair/arith_coder/
  13. Elias, P. (1975). Universal codeword sets and the representation of the integers. IEEE Transactions on Information Theory, 21, 194–203.
  14. Fariña, A. (2005). New compression codes for text databases. PhD thesis, Database Laboratory, University of A Coruña. http://coba.dc.fi.udc.es/~fari/phd/
  15. Fraenkel, & Klein. (1996). Robust universal complete codes for transmission and compression. Discrete Applied Mathematics and Combinatorial Operations Research and Computer Science, 64, 31–55.
  16. Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.
  17. Heaps, H. S. (1978). Information retrieval: computational and theoretical aspects. New York: Academic Press.
  18. Horspool, R. N. (1980). Practical fast searching in strings. Software Practice and Experience, 10(6), 501–506.
  19. Huffman, D. A. (1952). A method for the construction of minimum redundancy codes. In Proceedings of the Institute of Electronics and Radio Engineers (IRE), 40(9), 1098–1101.
  20. Klein, S. T., & Shapira, D. (2005). Pattern matching in Huffman encoded texts. Information Processing and Management, 41(4), 829–841.
  21. Lakshmanan, K. B. (1981). On universal codeword sets. IEEE Transactions on Information Theory, 27(5), 659–662.
  22. Manber, U. (1997). A text compression scheme that allows fast searching directly in the compressed file. ACM Transactions on Information Systems, 15(2), 124–136.
  23. Manber, U., & Wu, S. (1994). GLIMPSE: A tool to search through entire file systems. In Proc. of the Winter 1994 USENIX Technical Conference (pp. 23–32).
  24. Mandelbrot, B. (1953). An information theory of the statistical structure of language. In W. Jackson (Ed.), Communication theory (pp. 486–504). Academic Press N.Y.
  25. Miyazaki, M., Fukamachi, S., Takeda, M., & Shinohara, T. (1998). Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing Society of Japan, 39(9), 2638–2648.
  26. Moffat, A. (1989). Word-based text compression. Software—Practice and Experience, 19(2), 185–198.
  27. Moffat, A., & Katajainen, J. (1995). In-place calculation of minimum-redundancy codes. In Proceedings of the 4th International Workshop on Algorithms and Data Structures (WADS'95) (pp. 393–402). LNCS 955, Springer.
  28. Moffat, A., & Turpin (1996). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 170–179.
  29. Moura, E., Navarro, G., Ziviani, N., & Baeza-Yates, R. (1998). Fast searching on compressed text allowing errors. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98) (pp. 298–306). ACM Press.
  30. Moura, E., Navarro, G., Ziviani, N., & Baeza-Yates, R. (2000). Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2), 113–139.
  31. Navarro, G., & Brisaboa, N. (2006). New bounds on D-ary optimal codes. Information Processing Letters, 96(5), 178–184.
  32. Navarro, G., Moura, E., Neubert, M., Ziviani, N., & Baeza-Yates, R. (2000). Adding compression to block addressing inverted indexes. Information Retrieval, 3(1), 49–77.
  33. Navarro, G., & Raffinot, M. (2002). Flexible pattern matching in strings—practical on-line search algorithms for texts and biological sequences. Cambridge University Press.
  34. Navarro, & Tarhio, J. (2000). Boyer-Moore string matching over Ziv-Lempel compressed text. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, number 1848 in Lecture Notes in Computer Science (pp. 166–180). Springer-Verlag, Berlin, Montreal, Canada.
  35. Navarro, & Tarhio, J. (2005). LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Software Practice and Experience (SPE), 35(12), 1107–1130.
  36. Rautio, J., Tanninen, J., & Tarhio, J. (2002). String matching with stopper encoding and code splitting. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM 2002) (pp. 42–52). LNCS 2373, Springer.
  37. Savari, S. A., & Szpankowski, W. (2002). On the analysis of variable-to-variable length codes. In Proceedings of 2002 IEEE International Symposium on Information Theory (ISIT'02) (p. 176). See also http://citeseer.ist.psu.edu/616808.html
  38. Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A., & Arikawa, S. (2000). A Boyer-Moore type algorithm for compressed pattern matching. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'00) (pp. 181–194). LNCS 1848, Springer-Verlag.
  39. Takeda, M., Shibata, Y., Matsumoto, T., Kida, T., Shinohara, A., Fukamachi, S., Shinohara, T., & Arikawa, S. (2001). Speeding up string pattern matching by text compression: the dawn of a new era. Transactions of Information Processing Society of Japan, 42(3), 370–384.
  40. Turpin, A., & Moffat, A. (1997). Fast file search using text compression. In Proceedings of the 20th Australian Computer Science Conference (pp. 1–8).
  41. Wan, R. (2003). Browsing and searching compressed documents. PhD thesis, Department of Computer Science and Software Engineering, University of Melbourne, Australia. http://eprints.unimelb.edu.au/archive/00000484/
  42. Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann Publishers, USA.
  43. Wu, S., & Manber, U. (1992a). Agrep—a fast approximate pattern-matching tool. In Proceedings USENIX Winter 1992 Technical Conference (pp. 153–162). San Francisco, CA.
  44. Wu, S., & Manber, U. (1992b). Fast text searching allowing errors. Communications of the ACM, 35(10), 83–91.
  45. Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley.
  46. Ziv, J., & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 337–343.
  47. Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5), 530–536.
  48. Ziviani, N., Moura, E., Navarro, G., & Baeza-Yates, R. (2000). Compression: a key for next-generation text retrieval systems. IEEE Computer, 33(11), 37–44.

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Nieves R. Brisaboa (1)
  • Antonio Fariña (1)
  • Gonzalo Navarro (2)
  • José R. Paramá (1)

  1. Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain
  2. Center for Web Research, Dept. of Computer Science, Univ. de Chile, Blanco Encalada, Santiago, Chile
