Information Retrieval, Volume 8, Issue 1, pp 151–166

Inverted Index Compression Using Word-Aligned Binary Codes

  • Vo Ngoc Anh
  • Alistair Moffat

Abstract

We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.

Keywords: index compression, integer coding, index representation
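
To make the idea of a word-aligned binary code concrete, the following is a minimal sketch, not the article's exact scheme. It assumes 32-bit words split into a 4-bit selector and 28 data bits, where the selector says how many equal-width integers share the word. The layout table and the names `LAYOUTS`, `encode`, `decode`, and `to_gaps` are illustrative assumptions. Inverted lists are assumed to hold strictly increasing document numbers stored as gaps.

```python
# Assumed layouts: (count, bits) means `count` integers of `bits` bits each
# are packed into the 28 data bits of one 32-bit word.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def to_gaps(doc_ids):
    """Turn strictly increasing document numbers into gaps (first gap is the first id)."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def encode(values):
    """Pack non-negative integers below 2**28 into a list of 32-bit words."""
    words, i = [], 0
    while i < len(values):
        # Try the densest layout first; take the first one that fits the next values.
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28  # selector lives in the top 4 bits
                for k, v in enumerate(chunk):
                    word |= v << (k * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits: %r" % values[i])
    return words


def decode(words):
    """Unpack words produced by encode() back into the original integers."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        for k in range(count):
            out.append((word >> (k * bits)) & mask)
    return out


if __name__ == "__main__":
    doc_ids = [3, 4, 6, 7, 8, 12, 13, 913, 915, 916]
    words = encode(to_gaps(doc_ids))
    print(len(words), "words for", len(doc_ids), "postings")
    print(decode(words))
```

Decoding in this sketch needs only one table lookup per word followed by fixed-width shifts and masks, which is the property that makes word-aligned codes attractive for fast query processing.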

Copyright information

© Kluwer Academic Publishers 2005

Authors and Affiliations

  • Vo Ngoc Anh, Department of Computer Science and Software Engineering, The University of Melbourne, Australia
  • Alistair Moffat, Department of Computer Science and Software Engineering, The University of Melbourne, Australia
