
Information Retrieval, Volume 14, Issue 4, pp 390–412

A multi-collection latent topic model for federated search

  • Mark Baillie
  • Mark Carman
  • Fabio Crestani
Article

Abstract

Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection, adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low-dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information about each collection, such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state-of-the-art collection selection algorithms.
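
The idea of ranking collections by the topical relationship between query and collection can be illustrated with a small sketch. This is not the authors' model: it skips the hierarchical hyperprior and all inference, and simply assumes that topic-word distributions and per-collection topic mixtures have already been estimated. All names and numbers below (`TOPIC_WORD`, `COLLECTION_TOPICS`, the two toy collections) are hypothetical. It scores each collection by the query likelihood obtained by marginalising over topics, P(w | c) = Σ_k P(w | k) P(k | c).

```python
import math

# Hypothetical toy topic-word distributions P(w | topic); values are made up.
TOPIC_WORD = {
    "sports": {"match": 0.5, "goal": 0.4, "court": 0.1},
    "law": {"match": 0.05, "goal": 0.05, "court": 0.9},
}

# Hypothetical per-collection topic mixtures P(topic | collection).
COLLECTION_TOPICS = {
    "news_sports": {"sports": 0.9, "law": 0.1},
    "legal_archive": {"sports": 0.1, "law": 0.9},
}


def query_log_likelihood(query_terms, topic_mix):
    """Return log P(query | collection) under a topic mixture."""
    total = 0.0
    for term in query_terms:
        # Marginalise over topics: P(w | c) = sum_k P(w | k) * P(k | c).
        p_term = sum(
            TOPIC_WORD[topic].get(term, 0.0) * weight
            for topic, weight in topic_mix.items()
        )
        total += math.log(p_term) if p_term > 0 else float("-inf")
    return total


def rank_collections(query_terms):
    """Rank collection names by descending query log-likelihood."""
    scores = {
        name: query_log_likelihood(query_terms, mix)
        for name, mix in COLLECTION_TOPICS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank_collections(["goal"]))
print(rank_collections(["court"]))
```

Because every collection is described as a mixture over shared topics, a term that never appeared in a collection's sample can still receive non-zero probability through those topics, which is one way to picture the implicit smoothing described in the abstract.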

Keywords

Distributed information retrieval, Topic models, Retrieval, Collection selection

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer and Information Sciences, University of Strathclyde, Glasgow, Scotland, UK
  2. Faculty of Informatics, University of Lugano, Lugano, Switzerland
