Compressing Inverted Files
Cite this article as: Trotman, A. Information Retrieval (2003) 6: 5. doi:10.1023/A:1022949613039
Abstract
Research into inverted file compression has focused on compression ratio: how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.
The premise “smaller is better” may not be true. To truly build faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then the raw postings occupy 512 bytes and the compressed postings 128 bytes, so both would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size, but decrease load/decompress time, thereby increasing throughput.
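The sector arithmetic in this example can be checked with a short sketch. It is only an illustration of the reasoning above, not code from the article: the helper name `sectors_needed` is hypothetical, and the list is assumed to start on a sector boundary and to compress to one byte per posting.

```python
import math

SECTOR_BYTES = 512  # minimum disk read size assumed in the text
WORD_BYTES = 4      # word size assumed in the text


def sectors_needed(num_bytes, sector_bytes=SECTOR_BYTES):
    """Whole sectors read to fetch num_bytes (hypothetical helper).

    Assumes the data starts on a sector boundary; at least one
    sector is always read.
    """
    return max(1, math.ceil(num_bytes / sector_bytes))


postings = 128
raw_bytes = postings * WORD_BYTES   # 128 words of 4 bytes = 512 bytes
compressed_bytes = postings * 1     # assumption: one byte per posting

raw_sectors = sectors_needed(raw_bytes)
compressed_sectors = sectors_needed(compressed_bytes)
print(raw_sectors, compressed_sectors)
```

Under these assumptions both lists cost the same single sector read, so for a list this short any time spent decompressing is pure overhead.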
Examined here are five compression techniques: Golomb, Elias gamma, Elias delta, Variable Byte Encoding and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.
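To give a feel for two of the techniques named above, the following is a minimal sketch of variable byte encoding and Elias gamma coding applied to the differences between successive document ids. It is not the article's implementation. The function names are hypothetical, and the byte layout (7 data bits per byte, low-order group first, high bit marking the last byte of each integer) is only one common convention, as is the gamma convention of a run of zeros followed by the binary representation.

```python
def to_gaps(doc_ids):
    """Convert strictly increasing document ids to differences (gaps)."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def vbyte_encode(numbers):
    """Variable byte encode non-negative integers.

    Assumed layout: 7 data bits per byte, low-order group first;
    the high bit is set on the final byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # more bytes follow, high bit clear
            n >>= 7
        out.append(n | 0x80)      # last byte, high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode under the same assumed layout."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def elias_gamma(n):
    """Elias gamma code of n >= 1 as a bit string.

    The binary form of n is preceded by (length - 1) zeros.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


doc_ids = [3, 7, 150, 151, 1000]
gaps = to_gaps(doc_ids)          # [3, 4, 143, 1, 849]
encoded = vbyte_encode(gaps)
print(len(encoded), vbyte_decode(encoded) == gaps)
print(elias_gamma(5))
```

The byte-aligned code spends at least 8 bits on every gap but decodes with a few shifts and masks per byte, whereas the bit-aligned gamma code can spend as little as 1 bit on a gap of 1 at the cost of bit-level decoding. That is the kind of size versus decode time trade-off the abstract describes.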
Keywords: index compression, inverted files, document indexing, text searching
© Kluwer Academic Publishers 2003