Information Retrieval, Volume 6, Issue 1, pp 5–19

Compressing Inverted Files

  • Andrew Trotman

DOI: 10.1023/A:1022949613039

Cite this article as:
Trotman, A. Information Retrieval (2003) 6: 5. doi:10.1023/A:1022949613039


Research into inverted file compression has focused on compression ratio, that is, on how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.

The premise “smaller is better” may not be true. To build truly faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then both the compressed and the raw postings would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size but decrease load/decompress time, thereby increasing throughput.
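The sector arithmetic in this example can be checked with a minimal sketch. The helper name `sectors_needed` is hypothetical, and the sketch assumes the inverted list starts on a sector boundary and that the compressed list costs one byte per posting, as in the text.

```python
import math

SECTOR_BYTES = 512  # minimum disk sector read size, from the text
WORD_BYTES = 4      # word size, from the text


def sectors_needed(size_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Whole sectors that must be read to fetch size_bytes.

    Assumes the data starts on a sector boundary (an illustrative
    simplification, not something the abstract states).
    """
    return max(1, math.ceil(size_bytes / sector_bytes))


postings = 128
raw_bytes = postings * WORD_BYTES   # 128 words of 4 bytes each
compressed_bytes = postings * 1     # 128 bytes, one byte per posting

raw_sectors = sectors_needed(raw_bytes)
compressed_sectors = sectors_needed(compressed_bytes)
```

Under these assumptions both `raw_sectors` and `compressed_sectors` come out as one, so the compressed list saves no disk reads and only adds decompression work.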

Examined here are five compression techniques: Golomb, Elias gamma, Elias delta, Variable Byte Encoding, and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.
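To make one of the five techniques concrete, the following is a minimal sketch of variable byte encoding. The abstract does not give the bit layout used in the paper, so this follows one common convention (seven data bits per byte, low-order group first, high bit set on the final byte of each number). Storing differences between successive document identifiers is also an assumption here, and all function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Convert ascending document ids to differences (assumed layout)."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum restores the document ids."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable byte encoding needs n >= 0")
        while n >= 128:
            out.append(n & 0x7F)   # continuation byte: high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte: high bit set
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:            # final byte of this number
            numbers.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return numbers


doc_ids = [3, 7, 300, 70000]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Decoding touches whole bytes only, which is why a byte-aligned scheme of this kind can decompress quickly even though it usually produces a larger file than the bit-level codes.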

index compression, inverted files, document indexing, text searching

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Andrew Trotman (1)
  1. Department of Computer Science, University of Otago, Dunedin, New Zealand
