Abstract
We explore different techniques for pruning an inverted index in advance, that is, without building the full index. These techniques provide interesting trade-offs between index size, answer quality and query coverage. We experimentally analyze them on a large public web collection with two different query logs. The trade-offs that we find range from an index of 4% of the full size with 35% of precision@10 to an index of 46% of the full size with 90% of precision@10, both relative to the full index. In both cases we cover almost 97% of the query volume. We also perform a relative relevance analysis with a smaller private web collection and query log, finding that some of our techniques reduce the index size by almost 40% while losing less than 2% in NDCG@10.
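As a rough illustration of the two quality measures named in the abstract, the sketch below computes precision@k of a pruned index relative to the full index, and the fraction of query volume covered. It assumes that the top-k results of the full index serve as the reference set and that coverage is weighted by query frequency; the abstract does not spell out either definition, and the function names `precision_at_k` and `query_coverage` are hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(pruned: List[str], full: List[str], k: int = 10) -> float:
    """Share of the full index's top-k documents also found in the pruned top-k.

    Assumption: the full index's top-k is treated as the ground truth.
    """
    reference = set(full[:k])
    if not reference:
        return 0.0
    hits = len(set(pruned[:k]) & reference)
    return hits / len(reference)


def query_coverage(query_counts: Dict[str, int], answered: Set[str]) -> float:
    """Share of the total query volume whose queries the pruned index can answer.

    Assumption: each query is weighted by its frequency in the log.
    """
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for query, count in query_counts.items() if query in answered)
    return covered / total


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    full_results = ["d1", "d2", "d3", "d4"]
    pruned_results = ["d1", "d2", "d9"]
    print(precision_at_k(pruned_results, full_results, k=4))

    log = {"cheap flights": 3, "rare query": 1}
    print(query_coverage(log, {"cheap flights"}))
```

Averaging `precision_at_k` over all queries in a log would give a single figure comparable in spirit to the percentages reported above, under the stated assumptions.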
Notes
- 2. Common Crawl web collection, http://commoncrawl.org/2017/11/november-2017-crawl-archive-now-available/.
- 3. Boilerpipe, https://github.com/robbypond/boilerpipe.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Altin, S., Baeza-Yates, R., Cambazoglu, B.B. (2020). Pre-indexing Pruning Strategies. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_13
DOI: https://doi.org/10.1007/978-3-030-59212-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59211-0
Online ISBN: 978-3-030-59212-7
eBook Packages: Computer Science, Computer Science (R0)