Abstract
We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.
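The abstract does not spell out the codeword layout, so the following is only a minimal sketch of what a word-aligned binary code can look like. It assumes a 32-bit word split into a 4-bit selector and 28 data bits, with nine fixed layouts (the arrangement commonly known as Simple-9). The names `LAYOUTS`, `encode`, and `decode` are hypothetical, and the input is taken to be a list of non-negative document gaps.

```python
# Assumed layouts: (codes per word, bits per code), each fitting in 28 data bits.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def encode(values):
    """Greedily pack integers into 32-bit words, densest usable layout first."""
    words = []
    pos = 0
    while pos < len(values):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = values[pos:pos + count]
            # A layout is usable only if it can be filled completely
            # and every value in the chunk fits in `bits` bits.
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28  # selector sits in the top 4 bits
                for i, v in enumerate(chunk):
                    word |= v << (i * bits)
                words.append(word)
                pos += count
                break
        else:
            raise ValueError("value is negative or does not fit in 28 bits")
    return words


def decode(words):
    """Unpack each word with one table lookup and fixed shifts and masks."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        for i in range(count):
            out.append((word >> (i * bits)) & mask)
    return out


gaps = [1, 3, 2, 1, 1, 5, 200, 7, 1, 1]
packed = encode(gaps)
assert decode(packed) == gaps
```

In this sketch the ten gaps pack into three 32-bit words. Decoding needs no bit-by-bit parsing, only a selector lookup followed by shifts and masks on whole words, which is the property that makes this style of code fast to decode.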
Cite this article
Anh, V.N., Moffat, A. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8, 151–166 (2005). https://doi.org/10.1023/B:INRT.0000048490.99518.5c
DOI: https://doi.org/10.1023/B:INRT.0000048490.99518.5c