Pre-indexing Pruning Strategies

  • Conference paper
String Processing and Information Retrieval (SPIRE 2020)

Abstract

We explore different techniques for pruning an inverted index in advance, that is, without building the full index. These techniques provide interesting trade-offs between index size, answer quality, and query coverage. We analyze them experimentally on a large public web collection with two different query logs. The trade-offs we find range from an index with 4% of the size and 35% of the precision@10 to an index with 46% of the size and 90% of the precision@10, both relative to the full index. In both cases we cover almost 97% of the query volume. We also perform a relative relevance analysis on a smaller private web collection and query log, finding that some of our techniques reduce the index size by almost 40% while losing less than 2% in NDCG@10.
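The two quality measures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes that precision@10 "with respect to the full index" means treating the full index's top-10 results as the reference set, and it uses the common log2-discounted form of NDCG. The function names and the sample document ids are hypothetical.

```python
import math


def precision_at_k(pruned_ranking, full_ranking, k=10):
    """Share of the pruned index's top-k that also appears in the full
    index's top-k (assumption: the full index acts as ground truth)."""
    reference = set(full_ranking[:k])
    hits = sum(1 for doc in pruned_ranking[:k] if doc in reference)
    return hits / k


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal
    ordering of all judged gains; 0.0 when nothing is relevant."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical document ids: the pruned index keeps 3 of the full top-10.
full_top = [f"d{i}" for i in range(1, 11)]
pruned_top = ["d1", "d2", "d3"] + [f"x{i}" for i in range(1, 8)]
print(precision_at_k(pruned_top, full_top))
```

Averaging either measure over a query log and dividing by the full-index average would give a relative figure of the kind the abstract reports, under the same assumptions.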

Notes

  1. https://github.com/MojoJolo/textteaser

  2. Common Crawl web collection, http://commoncrawl.org/2017/11/november-2017-crawl-archive-now-available/

  3. Boilerpipe, https://github.com/robbypond/boilerpipe

  4. http://opennlp.sourceforge.net/models-1.5/

  5. https://fasttext.cc/docs/en/language-identification.html

  6. https://www.elastic.co/products/elasticsearch

Corresponding author

Correspondence to Soner Altin.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Altin, S., Baeza-Yates, R., Cambazoglu, B.B. (2020). Pre-indexing Pruning Strategies. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_13

  • DOI: https://doi.org/10.1007/978-3-030-59212-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59211-0

  • Online ISBN: 978-3-030-59212-7

  • eBook Packages: Computer Science, Computer Science (R0)
