Research into inverted file compression has focused on compression ratio: how small the indexes can be. Compression ratio is considered important for fast interactive searching, and it is taken as read that the smaller the index, the faster the search.
The premise that “smaller is better” may not hold. To build truly faster indexes it is often necessary to forfeit some compression. For an inverted list of only 128 occurrences, compression may add nothing but overhead. Such a list might be stored in 128 bytes rather than 128 words, but it must still be read from disk. If the minimum disk read is one 512-byte sector and the word size is 4 bytes, then the compressed and the raw postings each require one disk seek and one sector read. A less effective compression technique may increase the file size yet decrease load and decompression time, thereby increasing throughput.
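The sector arithmetic in this example can be checked with a minimal sketch. The 512-byte sector and 4-byte word come from the text; the helper name `sectors_needed` and the assumption of exactly one byte per compressed posting are illustrative only.

```python
import math

SECTOR_BYTES = 512  # minimum disk read size assumed in the text
WORD_BYTES = 4      # word size assumed in the text


def sectors_needed(n_bytes, sector_bytes=SECTOR_BYTES):
    """Whole sectors that must be read to fetch n_bytes (hypothetical helper)."""
    return max(1, math.ceil(n_bytes / sector_bytes))


postings = 128
raw_bytes = postings * WORD_BYTES   # 128 words of 4 bytes = 512 bytes
compressed_bytes = postings * 1     # assumption: one byte per posting = 128 bytes

# Both layouts fit in a single sector, so the disk cost is the same.
print(sectors_needed(raw_bytes), sectors_needed(compressed_bytes))
```

Under these assumptions the compressed list saves 384 bytes on disk but no sector reads, so any decompression time is pure overhead for a list of this length.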
Five compression techniques are examined here: Golomb, Elias gamma, Elias delta, Variable Byte Encoding and Binary Interpolative Coding. The effects on file size, file seek time, file read time and decompression time are all measured. A quantitative measure of throughput is developed and the performance of each method is determined.
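As an illustration of the simplest of these techniques, the following is a minimal sketch of variable byte encoding applied to the gaps between sorted document identifiers. It assumes one common variant (7 data bits per byte, low-order group first, high bit set on the final byte of each number); the variant measured in the paper may differ, and the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Convert sorted document ids into differences (d-gaps)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order group first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers can be encoded")
        while n >= 128:
            out.append(n & 0x7F)  # continuation byte: high bit clear
            n >>= 7
        out.append(n | 0x80)      # final byte of this number: high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:              # final byte: emit the completed number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


doc_ids = [3, 7, 130, 1000, 100000]
encoded = vbyte_encode(to_gaps(doc_ids))
print(len(encoded), "bytes, versus", 4 * len(doc_ids), "bytes as 4-byte words")
```

Decoding works on whole bytes with shifts and masks only, which is why a byte-aligned scheme can decompress quickly even though bit-level codes such as Golomb or Elias may produce a smaller file.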