Abstract
In search environments where large document collections are partitioned into smaller subsets (shards), processing the query against only the relevant shards improves search efficiency. The problem of ranking the shards by their estimated relevance to the query has been studied extensively. However, the related and important task of identifying how many of the top-ranked shards should be searched for the query, so as to balance the competing objectives of effectiveness and efficiency, has not received much attention. This task of shard rank cutoff estimation is the focus of the presented work. The central premise of the proposed solution is that the number of top shards searched should depend on three factors: (1) the query, (2) the given ranking of shards, and (3) the type of search need being served (a precision-oriented versus a recall-oriented task). An array of features that captures these three factors is defined, and a regression model is induced from these features to learn a query-specific shard rank cutoff estimator. An empirical evaluation using two large datasets demonstrates that the learned shard rank cutoff estimator provides substantial improvement in search efficiency as compared to strong baselines, without degrading search effectiveness.
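As a rough illustration of the idea described above, the following Python sketch builds one feature for each of the three factors and uses a predicted cutoff to decide how many top-ranked shards to search. It is a minimal sketch under stated assumptions, not the paper's method: the three features, the function names `cutoff_features` and `select_shards`, and the hand-set `toy_model` are all hypothetical stand-ins. In particular, `toy_model` only marks the place where a trained regression model would be called.

```python
from typing import Callable, List, Sequence, Tuple


def cutoff_features(query: str, shard_scores: Sequence[float],
                    recall_oriented: bool) -> List[float]:
    """Hypothetical features, one per factor named in the abstract."""
    scores = sorted(shard_scores, reverse=True)
    total = sum(scores) or 1.0  # avoid division by zero
    top_share = scores[0] / total if scores else 0.0
    return [
        float(len(query.split())),         # query: number of terms
        top_share,                         # ranking: score mass on the top shard
        1.0 if recall_oriented else 0.0,   # task: recall- vs precision-oriented
    ]


def select_shards(ranked: Sequence[Tuple[str, float]], query: str,
                  recall_oriented: bool,
                  predict: Callable[[List[float]], float]) -> List[str]:
    """Return the ids of the top-k shards, where k is the predicted cutoff.

    `ranked` holds (shard_id, score) pairs, best shard first.
    `predict` stands in for a trained regression model.
    """
    feats = cutoff_features(query, [score for _, score in ranked],
                            recall_oriented)
    k = int(round(predict(feats)))
    k = max(1, min(k, len(ranked)))  # clamp to a valid cutoff
    return [shard_id for shard_id, _ in ranked[:k]]


def toy_model(feats: List[float]) -> float:
    """Hand-set weights for illustration only; nothing here is learned."""
    _, top_share, recall_flag = feats
    # Search deeper for recall-oriented tasks and for flat score distributions.
    return 1.0 + 2.0 * recall_flag + 2.0 * (1.0 - top_share)


if __name__ == "__main__":
    ranked = [("s3", 8.0), ("s1", 1.0), ("s7", 1.0)]
    print(select_shards(ranked, "shard cutoff", False, toy_model))
    print(select_shards(ranked, "shard cutoff", True, toy_model))
```

In this sketch the cutoff grows when the task is recall-oriented or when the shard scores are spread evenly, which mirrors the premise that the cutoff should vary with the query, the ranking, and the task. A real estimator would replace `toy_model` with a regression model fitted on training queries.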
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kulkarni, A. (2015). ShRkC: Shard Rank Cutoff Prediction for Selective Search. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science(), vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_32
DOI: https://doi.org/10.1007/978-3-319-23826-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science (R0)