Adaptive Query-Based Sampling of Distributed Collections

  • Mark Baillie
  • Leif Azzopardi
  • Fabio Crestani
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)


In a Distributed Information Retrieval system, a description of each remote information resource, archive or repository is usually stored centrally to facilitate resource selection. Acquiring precise resource descriptions is therefore an important phase in Distributed Information Retrieval, because the quality of these representations affects selection accuracy and, ultimately, retrieval performance. While Query-Based Sampling is currently used for content discovery of uncooperative resources, applying the technique depends on heuristic guidelines to determine when a sufficiently accurate representation of each remote resource has been obtained. In this paper we address this shortcoming by using the Predictive Likelihood to indicate both the quality of an acquired resource description estimate and the point during Query-Based Sampling at which a sufficiently good representation of a resource has been obtained.
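
To make the stopping idea concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' implementation. It assumes the resource description is a unigram language model estimated from the sampled documents, that the Predictive Likelihood is the average log-probability of a held-out set of query terms under that model, and that sampling stops once this value changes by less than a tolerance between batches. The function names, the Jelinek-Mercer mixing weight `lam`, the background collection and the `tolerance` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def predictive_log_likelihood(sample_counts, background_counts, query_terms, lam=0.5):
    """Average log-probability of query_terms under a smoothed unigram model.

    Assumption: the sample model is mixed with a background model
    (Jelinek-Mercer smoothing) so unseen terms never get zero probability.
    """
    sample_total = sum(sample_counts.values())
    background_total = sum(background_counts.values())
    vocab_size = len(background_counts)
    log_prob = 0.0
    for term in query_terms:
        p_sample = sample_counts.get(term, 0) / sample_total if sample_total else 0.0
        # Add-one smoothing keeps the background probability strictly positive.
        p_background = (background_counts.get(term, 0) + 1) / (background_total + vocab_size + 1)
        log_prob += math.log(lam * p_sample + (1 - lam) * p_background)
    return log_prob / len(query_terms)


def adaptive_sample(batches, background_counts, query_terms, tolerance=0.01):
    """Consume batches of sampled documents until the predictive
    log-likelihood changes by less than `tolerance` between batches.

    `batches` stands in for the documents returned by successive rounds of
    query-based sampling; each document is a whitespace-tokenised string.
    Returns the term counts of the sample and the number of batches used.
    """
    sample_counts = Counter()
    previous = None
    used = 0
    for batch in batches:
        for doc in batch:
            sample_counts.update(doc.split())
        used += 1
        current = predictive_log_likelihood(sample_counts, background_counts, query_terms)
        if previous is not None and abs(current - previous) < tolerance:
            break  # the estimate has stabilised; stop sampling this resource
        previous = current
    return sample_counts, used


if __name__ == "__main__":
    background = Counter({"a": 10, "b": 5, "c": 5, "d": 5})
    batches = [["a b", "a c"], ["a b", "a c"], ["d d"]]
    counts, used = adaptive_sample(batches, background, ["a", "b"])
    print(used, dict(counts))
```

In this sketch the stopping point adapts to each resource: a collection whose sampled term distribution stabilises quickly needs fewer batches than one that keeps introducing new vocabulary, which is the contrast with a fixed, heuristic sample size.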


Language Model; Resource Selection; Selection Accuracy; Comparable Indication; Remote Resource
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Mark Baillie (1)
  • Leif Azzopardi (1)
  • Fabio Crestani (1)
  1. Department of Computing and Information Sciences, University of Strathclyde, Glasgow, UK
