Advertisement

Information Retrieval Journal

, Volume 20, Issue 3, pp 199–220 | Cite as

The role of index compression in score-at-a-time query evaluation

  • Jimmy Lin
  • Andrew TrotmanEmail author
Information Retrieval Efficiency

Abstract

This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered—no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.

Keywords

Query Term Impact Score Compression Scheme Query Evaluation Cache Line 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Amati, G., & van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.CrossRefGoogle Scholar
  2. Anh, V. N., & Moffat, A. (2005a). Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1), 151–166.CrossRefGoogle Scholar
  3. Anh, V.N., & Moffat, A. (2005b). Simplified similarity scoring using term ranks. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 226–233). Salvador.Google Scholar
  4. Anh, V. N., & Moffat, A. (2006). Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2006) (pp. 372–379). Seattle.Google Scholar
  5. Anh, V. N., & Moffat, A. (2010). Inverted index compression using word-aligned binary codes. Software: Practice and Experience, 40(2), 131–147.Google Scholar
  6. Anh, V. N., de Kretser, O., & Moffat, A. (2001). Vector-space ranking with effective early termination. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 35–42). New Orleans.Google Scholar
  7. Asadi, N., & Lin, J. (2013). Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 997–1000). Dublin.Google Scholar
  8. Brewer, E. A. (2005). Combining systems and databases: A search engine retrospective. Readings in database systems. Cambridge: MIT Press.Google Scholar
  9. Büttcher, S., & Clarke, C.L.A. (2007). Index compression is good, especially for random access. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 761–770). Lisbon.Google Scholar
  10. Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the third ACM international conference on web search and data mining (WSDM 2010) (pp. 411–420). New York.Google Scholar
  11. Crane, M., Trotman, A., & O’Keefe, R. (2013). Maintaining discriminatory power in quantized indexes. In Proceedings of 22nd international conference on information and knowledge management (CIKM 2013) (pp. 1221–1224). San Francisco.Google Scholar
  12. Dean, J. (2009). Challenges in building large-scale information retrieval systems. In Keynote presentation at the second ACM international conference on web search and data mining (WSDM 2009). Barcelona.Google Scholar
  13. Dean, J., & Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2), 74–80.CrossRefGoogle Scholar
  14. Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proceedings of the 34rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 993–1002). Beijing.Google Scholar
  15. Fontoura, M., Josifovski, V., Liu, J., Venkatesan, S., Zhu, X., & Zien, J. (2011). Evaluation strategies for top-k queries over memory-resident inverted indexes. In Proceedings of the 37th international conference on very large data bases (VLDB 2011) (pp. 1213–1224). Seattle.Google Scholar
  16. Jacob, B. (2009). The memory system: you can’t avoid it, you can’t ignore it, you can’t fake it. San Rafael: Morgan & Claypool Publishers.Google Scholar
  17. Jia, X.F., Trotman, A., & O’Keefe, R. (2010). Efficient accumulator initialisation. In Proceedings of the 15th Australasian document computing symposium (ADCS 2010).Google Scholar
  18. Lemire, D., & Boytsov, L. (2015). Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1), 1–29.Google Scholar
  19. Lemire, D., Boytsov, L., & Kurz, N. (2014). SIMD compression and the intersection of sorted integers. arXiv:1401.6399v9.
  20. Lin, J., & Trotman, A. (2015). Anytime ranking for impact-ordered indexes. In Proceedings of the ACM international conference on the theory of information retrieval (ICTIR 2015) (pp. 301–304). Northampton.Google Scholar
  21. Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., et al. (2016). Toward reproducible baselines: The open-source IR reproducibility challenge. In Proceedings of the 38th European conference on information retrieval (ECIR 2016) (pp. 408–420). Padua.Google Scholar
  22. Moffat, A., & Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4), 349–379.CrossRefGoogle Scholar
  23. Patterson, D. A. (2004). Latency lags bandwidth. Communications of the ACM, 47(10), 71–75.CrossRefGoogle Scholar
  24. Persin, M., Zobel, J., & Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10), 749–764.CrossRefGoogle Scholar
  25. Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 1998) (pp. 275–281). Melbourne.Google Scholar
  26. Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the fourth text retrieval conference (TREC-4) (pp. 73–96). Gaithersburg.Google Scholar
  27. Rossi, C., de Moura, E. S., Carvalho, A. L., & da Silva, A. S. (2013). Fast document-at-a-time query processing using two-tier indexes. In Proceedings of the 36th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2013) (pp. 183–192). Dublin.Google Scholar
  28. Salton, G. (1971). The SMART retrieval system-experiments in automatic document processing. Englewood Cliffs: Prentice-Hall.Google Scholar
  29. Scholer, F., Williams, H. E., Yiannis, J., & Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 222–229). Tampere.Google Scholar
  30. Silvestri, F., & Venturini, R. (2010). VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of 19th international conference on information and knowledge management (CIKM 2010) (pp. 1219–1228). Toronto.Google Scholar
  31. Smucker, M.D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth international conference on information and knowledge management (CIKM 2007) (pp. 623–632). Lisbon.Google Scholar
  32. Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., & Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proceedings of 20th international conference on information and knowledge management (CIKM 2011) (pp. 317–326). Glasgow.Google Scholar
  33. Strohman, T., & Croft, W. B. (2007). Efficient document retrieval in main memory. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2007) (pp. 175–182). Amsterdam.Google Scholar
  34. Strohman, T., Turtle, H., & Croft, W.B. (2005). Optimization strategies for complex queries. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2005) (pp. 219–225). Salvador.Google Scholar
  35. Thompson, M., & Barker, M. (2010). LMAX: How to do 100k TPS at less than 1ms latency. In Presentation at QCon San Francisco 2010 Google Scholar
  36. Tonellotto, N., Macdonald, C., & Ounis, I. (2013). Efficient and effective retrieval using selective pruning. In Proceedings of the sixth ACM international conference on web search and data mining (WSDM 2013) (pp. 63–72). Rome.Google Scholar
  37. Trotman, A. (2003). Compressing inverted files. Information Retrieval, 6(1), 5–19.CrossRefGoogle Scholar
  38. Trotman, A. (2014). Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian document computing symposium (ADCS ’14) (pp. 50–57). Melbourne.Google Scholar
  39. Trotman, A., Jia, X.F., & Crane, M. (2012). Towards an efficient and effective search engine. In Proceedings of the SIGIR 2012 workshop on open source information retrieval. Portland.Google Scholar
  40. Turtle, H., & Flood, J. (1995). Query evaluation: Strategies and optimizations. Information Processing and Management, 31(6), 831–850.CrossRefGoogle Scholar
  41. Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2011) (pp. 105–114). Beijing.Google Scholar
  42. Williams, H. E., & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3), 193–201.CrossRefGoogle Scholar
  43. Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes: Compressing and indexing documents and images. San Francisco: Morgan Kaufmann Publishing.zbMATHGoogle Scholar
  44. Wolfe, R., & Hanley, J. (2002). If we’re so different, why do we keep overlapping? CMAJ, 166(1), 65–66.Google Scholar
  45. Wu, H., & Fang, H. (2014). Document prioritization for scalable query processing. In Proceedings of 23rd international conference on information and knowledge management (CIKM 2014) (pp. 1609–1618). Shanghai.Google Scholar
  46. Yan, H., Ding, S., & Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th international eorld wide web conference (WWW 2009) (pp. 401–410). Madrid.Google Scholar
  47. Zhang, J., Long, X., & Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international world wide web conference (WWW 2008) (pp. 387–396). Beijing.Google Scholar
  48. Zukowski, M., Héman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the IEEE 22nd international conference on data engineering (ICDE 2006). Atlanta.Google Scholar

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. 1.David R. Cheriton School of Computer ScienceUniversity of WaterlooWaterlooCanada
  2. 2.Department of Computer ScienceUniversity of OtagoDunedinNew Zealand

Personalised recommendations