Binary Interpolative Coding for Effective Index Compression

Abstract

Information retrieval systems hold large volumes of text, with collections now typically reaching into the gigabyte range. Inverted indexes are an important method for providing search facilities over these collections, but unless compressed they require a great deal of space. In this paper we introduce a new method for compressing inverted indexes that yields excellent compression, allows fast decoding, and exploits clustering, the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others. We also describe two other, quite separate applications of the same compression method: representing the MTF (move-to-front) list positions generated by the Burrows-Wheeler block-sorting transformation, and transmitting the codebook for semi-static block-based minimum-redundancy coding.
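The abstract does not spell out the mechanism, so the following is only a minimal sketch of the idea as interpolative coding is commonly described: the middle element of a sorted list of document numbers is coded first, within the bounds implied by its neighbours, and the two halves are then coded recursively inside the narrowed ranges. The function names (`interp_encode`, `interp_decode`) and the use of a plain fixed-width binary code are assumptions made for illustration; the paper's own coder may use a tighter binary code and a different traversal.

```python
def _width(r):
    # Bits needed by a fixed-width binary code over r possible values
    # (zero bits when only one value is possible).
    return (r - 1).bit_length()


def _encode(values, lo, hi, out):
    n = len(values)
    if n == 0:
        return
    m = n // 2
    x = values[m]
    # values is strictly increasing, so x must leave room for the
    # m items to its left and the n - 1 - m items to its right.
    low = lo + m
    high = hi - (n - 1 - m)
    w = _width(high - low + 1)
    if w:
        out.append(format(x - low, "0{}b".format(w)))
    _encode(values[:m], lo, x - 1, out)
    _encode(values[m + 1:], x + 1, hi, out)


def interp_encode(values, lo, hi):
    """Encode a strictly increasing list drawn from lo..hi as a bit string."""
    out = []
    _encode(values, lo, hi, out)
    return "".join(out)


def _decode(bits, pos, n, lo, hi, out):
    if n == 0:
        return pos
    m = n // 2
    low = lo + m
    high = hi - (n - 1 - m)
    w = _width(high - low + 1)
    x = low + (int(bits[pos:pos + w], 2) if w else 0)
    pos += w
    pos = _decode(bits, pos, m, lo, x - 1, out)
    out.append(x)  # emitted between the halves, so output is sorted
    return _decode(bits, pos, n - 1 - m, x + 1, hi, out)


def interp_decode(bits, n, lo, hi):
    """Inverse of interp_encode; the decoder must know n, lo and hi."""
    out = []
    _decode(bits, 0, n, lo, hi, out)
    return out


if __name__ == "__main__":
    docs = [3, 8, 9, 11, 12, 13, 17]  # illustrative list with a cluster
    bits = interp_encode(docs, 1, 20)
    print(len(bits), interp_decode(bits, len(docs), 1, 20) == docs)
```

In this sketch a fully dense run such as 5, 6, 7, 8 within the range 5..8 costs no bits at all, because every value is forced by its bounds. That is one way to see how such a scheme can benefit from clustering.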




Cite this article

Moffat, A., Stuiver, L. Binary Interpolative Coding for Effective Index Compression. Information Retrieval 3, 25–47 (2000). https://doi.org/10.1023/A:1013002601898


Keywords

  • index compression
  • context-based model
  • document database