Abstract
Collection selection is one of the key problems in distributed information retrieval. Due to resource constraints it is not usually feasible to search all collections in response to a query. Therefore, the central component (broker) selects a limited number of collections to be searched for the submitted queries. During the past decade, several collection selection algorithms have been introduced. However, their performance varies on different testbeds. We propose a new collection-selection method based on the ranking of downloaded sample documents. We test our method on six testbeds and show that our technique can significantly outperform other state-of-the-art algorithms in most cases. We also introduce a new testbed based on the TREC GOV2 documents.
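The idea of selecting collections from the ranking of sampled documents can be illustrated with a small sketch. It is only an illustration under stated assumptions: the abstract does not give the scoring formula, so the reciprocal-rank weighting, the `top_n` cutoff, and the names `select_collections` and `ranked_samples` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def select_collections(ranked_samples, k, top_n=50):
    """Pick k collections from a central ranking of sampled documents.

    ranked_samples: one collection id per sampled document, listed in the
    order in which a central index of all samples ranked those documents
    for the query (best first).
    """
    scores = defaultdict(float)
    for rank, collection_id in enumerate(ranked_samples[:top_n], start=1):
        # Assumed weighting (not from the paper): a sampled document at
        # rank r contributes 1/r to the score of its source collection.
        scores[collection_id] += 1.0 / rank
    # Highest score first; ties are broken by collection id for determinism.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k]


# Toy central ranking: each entry names the collection a sample came from.
ranking = ["news", "gov", "news", "web", "gov", "gov"]
selected = select_collections(ranking, k=2)
```

In this toy ranking, "news" holds the samples at ranks 1 and 3 and "gov" those at ranks 2, 5 and 6, so the broker would search those two collections and skip "web".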
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Shokouhi, M. (2007). Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_17
DOI: https://doi.org/10.1007/978-3-540-71496-5_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)