
Information Retrieval, Volume 3, Issue 3, pp 253–271

Retrieving Information from a Distributed Heterogeneous Document Collection

  • Christoph Baumgarten

Abstract

This paper describes a probabilistic model for optimum information retrieval in a distributed heterogeneous environment.

The model assumes the collection of documents offered by the environment to be partitioned into subcollections. Documents as well as subcollections have to be indexed, where indexing methods using different indexing vocabularies can be employed. A query provided by a user is answered in terms of a ranked list of documents. The model determines a procedure for ranking the documents that stems from the Probability Ranking Principle: For each subcollection, the subcollection's documents are ranked; the resulting ranked lists are combined into a final ranked list of documents, where the ordering is determined by the documents' probabilities of being relevant with respect to the user's query. Various probabilistic ranking methods may be involved in the distributed ranking process. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model.
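The ranking procedure described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `min_score` threshold, and the per-subcollection scores are hypothetical, and the sketch assumes that each subcollection already supplies estimated probabilities of relevance that are comparable across subcollections.

```python
import heapq
import itertools
from typing import Dict, List, Tuple

# A ranked list: (document id, estimated probability of relevance), best first.
Ranked = List[Tuple[str, float]]


def rank_subcollection(probabilities: Dict[str, float]) -> Ranked:
    """Rank one subcollection's documents by estimated probability of relevance."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


def select_subcollections(scores: Dict[str, float], min_score: float) -> List[str]:
    """Hypothetical selection criterion: keep subcollections whose score
    (an assumed per-subcollection estimate of usefulness) reaches min_score."""
    return [name for name, score in scores.items() if score >= min_score]


def fuse(ranked_lists: List[Ranked], top_k: int) -> Ranked:
    """Combine per-subcollection rankings into one list ordered by probability.

    heapq.merge requires every input to be sorted in the merge order,
    which rank_subcollection guarantees (descending probability).
    """
    merged = heapq.merge(*ranked_lists, key=lambda item: item[1], reverse=True)
    return list(itertools.islice(merged, top_k))


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the paper.
    subcollections = {
        "news": {"n1": 0.9, "n2": 0.4},
        "patents": {"p1": 0.7, "p2": 0.1},
        "recipes": {"r1": 0.05},
    }
    subcollection_scores = {"news": 0.8, "patents": 0.6, "recipes": 0.1}

    chosen = select_subcollections(subcollection_scores, min_score=0.5)
    rankings = [rank_subcollection(subcollections[name]) for name in chosen]
    for doc_id, probability in fuse(rankings, top_k=3):
        print(doc_id, probability)
```

Merging already-sorted lists lazily means the final ranking can be cut off at `top_k` without fully sorting every subcollection's results together. The sketch only works if the probabilities from different ranking methods are on a common scale, which is the assumption stated above.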

The property that different ranking methods and indexing vocabularies can be used is important when the subcollections are heterogeneous with respect to their content.

The model's applicability is confirmed experimentally. When the degrees of freedom provided by the model are exploited, the experiments indicate that the model even outperforms comparable models for the non-distributed case with respect to retrieval effectiveness.

Keywords: distributed information retrieval, heterogeneity, probability ranking principle



Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Christoph Baumgarten, Eurospider Information Technology AG, Zurich, Switzerland
