Information Retrieval Journal, Volume 20, Issue 3, pp 221–252

Efficient distributed selective search

  • Yubin Kim
  • Jamie Callan
  • J. Shane Culpepper
  • Alistair Moffat
Information Retrieval Efficiency

Abstract

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of shards to search for each query, fewer postings are evaluated. In this paper we extend the study of selective search into new areas using a fine-grained simulation, examining the difference in efficiency when term-based and sample-based resource selection algorithms are used; measuring the effect of two policies for assigning index shards to machines; and exploring the benefits of index-spreading and mirroring as the number of deployed machines is varied. Results obtained for two large datasets and four large query logs confirm that selective search is significantly more efficient than conventional distributed search architectures and can handle higher query rates. Furthermore, we demonstrate that selective search can be tuned to avoid bottlenecks, and thus maximize usage of the underlying computer hardware.
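The core idea described above, ranking topical shards for each query and searching only the top-ranked ones, can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than the method evaluated in the paper: the three shards, the damped document-frequency score used as a stand-in for a term-based resource selection algorithm, and the helper names `rank_shards` and `postings_evaluated` are all hypothetical.

```python
from collections import Counter
import math

# Toy topical shards (assumed data): shard name -> documents as term lists.
SHARDS = {
    "sports": [["football", "goal", "league"], ["tennis", "match", "league"]],
    "cooking": [["recipe", "oven", "bread"], ["recipe", "pasta", "sauce"]],
    "travel": [["flight", "hotel", "beach"], ["hotel", "league", "tour"]],
}


def doc_freqs(docs):
    """Document frequency of each term in one shard (its posting list lengths)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


SHARD_DF = {name: doc_freqs(docs) for name, docs in SHARDS.items()}


def rank_shards(query):
    """Hypothetical term-based ranker: sum of damped query-term frequencies."""
    scores = {
        name: sum(math.log1p(df[term]) for term in query)
        for name, df in SHARD_DF.items()
    }
    # Highest score first; ties broken by shard name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


def postings_evaluated(query, shard_names):
    """Count the postings a query touches in the given shards."""
    return sum(SHARD_DF[name][term] for name in shard_names for term in query)


query = ["league", "match"]
selected = rank_shards(query)[:1]                     # search only the top shard
selective_cost = postings_evaluated(query, selected)
exhaustive_cost = postings_evaluated(query, SHARDS)   # search every shard
print(selected, selective_cost, exhaustive_cost)
```

In this toy setting the selective run touches fewer postings than the exhaustive run because the query terms are concentrated in one shard. How much is saved on a real collection depends on the partitioning and the ranker, which is what the paper's simulation examines.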

Keywords

Selective search, Distributed search, Load balancing, Efficiency

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Yubin Kim (1)
  • Jamie Callan (1)
  • J. Shane Culpepper (2)
  • Alistair Moffat (3)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. RMIT University, Melbourne, Australia
  3. The University of Melbourne, Melbourne, Australia
