Information Retrieval, Volume 6, Issue 1, pp 5–19

Compressing Inverted Files

  • Andrew Trotman

Abstract

Research into inverted file compression has focused on compression ratio, that is, how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.

The premise “smaller is better” may not be true. To build truly faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then both the compressed and raw postings would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size but decrease load/decompress time, thereby increasing throughput.
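The sector arithmetic in this example can be checked with a short sketch. The 512-byte sector and 4-byte word come from the text; the function name `sectors_needed`, the one-byte-per-posting compressed size, and the assumption that the list starts on a sector boundary are illustrative choices, not taken from the paper.

```python
SECTOR_BYTES = 512  # minimum disk read unit, as stated in the text
WORD_BYTES = 4      # machine word size, as stated in the text


def sectors_needed(size_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Whole sectors read to fetch size_bytes (assumes sector-aligned start)."""
    if size_bytes <= 0:
        return 0
    # Ceiling division using integers only.
    return -(-size_bytes // sector_bytes)


postings = 128
raw_bytes = postings * WORD_BYTES   # 128 words of 4 bytes each = 512 bytes
compressed_bytes = postings * 1     # assumed: one byte per posting = 128 bytes

# Both layouts fit in a single sector, so the disk cost is the same.
print(sectors_needed(raw_bytes), sectors_needed(compressed_bytes))
```

Under these assumptions the compressed list saves 384 bytes of storage but no disk reads, so any decompression time is pure overhead for a list of this size.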

Examined here are five compression techniques: Golomb, Elias gamma, Elias delta, Variable Byte Encoding and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.
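To give a feel for the simplest of these techniques, here is a minimal sketch of variable byte encoding applied to the gaps between ascending document numbers. It assumes one common convention (7 data bits per byte, most significant group first, high bit set on the last byte of each integer); the paper's own implementation may use a different layout, and all function names here are hypothetical.

```python
def to_gaps(doc_ids):
    """Turn ascending document ids into differences (d-gaps)."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild document ids from d-gaps by running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable byte encoding needs non-negative integers")
        chunk = [n & 0x7F]          # least significant 7 bits
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # flag the terminating (least significant) byte
        out.extend(reversed(chunk)) # emit most significant group first
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:             # terminating byte: finish this integer
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


doc_ids = [3, 7, 135, 100000]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Decoding works a whole byte at a time with no bit-level shifting across byte boundaries, which is the kind of trade-off the abstract describes: somewhat larger output than a bit-aligned code in exchange for cheaper decompression.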

Keywords: index compression, inverted files, document indexing, text searching

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Andrew Trotman (1)
  1. Department of Computer Science, University of Otago, Dunedin, New Zealand
