On Inverted Index Compression for Search Engine Efficiency

  • Matteo Catena
  • Craig Macdonald
  • Iadh Ounis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)

Abstract

Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this paper, we experiment with different modern integer compression algorithms, integrating these into a modern IR system. Through comprehensive experiments conducted on two large, widely used document corpora and large query sets, our results show the benefit of compression for different types of posting information to the space- and time-efficiency of the search engine. Overall, we find that the simple Frame of Reference compression scheme results in the best query response times for all types of posting information. Moreover, we observe that the frequency and position posting information in Web corpora that have large volumes of anchor text are more challenging to compress, yet compression is beneficial in reducing average query response times.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Dean, J.: Challenges in building large-scale information retrieval systems: invited talk. In: Proc. WSDM 2009 (2009)Google Scholar
  2. 2.
    Anh, V.N., Moffat, A.: Pruned query evaluation using pre-computed impact scores. In: Proc. SIGIR 2006 (2006)Google Scholar
  3. 3.
    Moffat, A., Webber, W., Zobel, J., Baeza-Yates, R.: A pipelined architecture for distributed text query evaluation. Inf. Retr. 10 (2007)Google Scholar
  4. 4.
    Broccolo, D., Macdonald, C., Orlando, S., Ounis, I., Perego, R., Tonellotto, N.: Load-sensitive selective pruning for distributed search. In: Proc. CIKM 2013 (2013)Google Scholar
  5. 5.
    Witten, I.H., Bell, T.C., Moffat, A.: Managing Gigabytes: Compressing and Indexing Documents and Images, 1st edn. (1994)Google Scholar
  6. 6.
    Elias, P.: Universal codeword sets and representations of the integers. Trans. Info. Theory 21(2) (1975)Google Scholar
  7. 7.
    Yan, H., Ding, S., Suel, T.: Compressing term positions in web indexes. In: Proc. SIGIR 2009 (2009)Google Scholar
  8. 8.
    Zukowski, M., Heman, S., Nes, N., Boncz, P.: Super-scalar RAM-CPU cache compression. In: Proc. ICDE 2006 (2006)Google Scholar
  9. 9.
    Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proc. WWW 2009 (2009)Google Scholar
  10. 10.
    Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Software: Practice and Experience (2013)Google Scholar
  11. 11.
    Delbru, R., Campinas, S., Samp, K., Tummarello, G.: Adaptive frame of reference for compressing inverted lists. Technical Report 2010-12-16, DERI (2010)Google Scholar
  12. 12.
    Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A High Performance and Scalable IR Platform. In: Proc. OSIR 2006 (2006)Google Scholar
  13. 13.
    Wang, L., Lin, J., Metzler, D.: Learning to efficiently rank. In: Proc. SIGIR 2010 (2010)Google Scholar
  14. 14.
    Shurman, E., Brutlag, J.: Performance related changes and their user impacts. In: Velocity: Web Performance and Operations Conference (2009)Google Scholar
  15. 15.
    Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proc. CIKM 2003 (2003)Google Scholar
  16. 16.
    Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., Degenhardt, J.: Early exit optimizations for additive machine learned ranking systems. In: Proc. WSDM 2010 (2010)Google Scholar
  17. 17.
    Wang, L., Lin, J., Metzler, D.: A cascade ranking model for efficient ranked retrieval. In: Proc. SIGIR 2011 (2011)Google Scholar
  18. 18.
    Macdonald, C., Santos, R.L., Ounis, I., He, B.: About learning models with multiple query dependent features. Trans. Info. Sys. 13(3) (2013)Google Scholar
  19. 19.
    Williams, H.E., Zobel, J.: Compressing integers for fast file access. The Computer Journal 42 (1999)Google Scholar
  20. 20.
    Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proc. SIGIR 2002 (2002)Google Scholar
  21. 21.
    Golomb, S.: Run-length encodings. Trans. Infor. Theory 12(3) (1966)Google Scholar
  22. 22.
    Rice, R., Plaunt, J.: Adaptive variable-length coding for efficient compression of spacecraft television data. Trans. Communication Technology 19(6) (1971)Google Scholar
  23. 23.
    Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Inf. Retr. 8(1) (2005)Google Scholar
  24. 24.
    Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proc. ICDE 1998 (1998)Google Scholar
  25. 25.
    Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in search engines. In: Proc. WWW 2008 (2008)Google Scholar
  26. 26.
    Zhang, J., Suel, T.: Efficient search in large textual collections with redundancy. In: Proc. WWW 2007 (2007)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Matteo Catena
    • 1
  • Craig Macdonald
    • 2
  • Iadh Ounis
    • 2
  1. 1.GSSI - Gran Sasso Science Institute, INFNL’AquilaItaly
  2. 2.School of Computing ScienceUniversity of GlasgowGlasgowUK

Personalised recommendations