Artificial Intelligence Review, Volume 26, Issue 1–2, pp 23–34

Probabilistic data fusion on a large document collection

  • David Lillis
  • Fergus Toolan
  • Rem Collier
  • John Dunnion
Article

Abstract

Data fusion is the process of combining the output of a number of Information Retrieval (IR) algorithms into a single result set in order to achieve greater retrieval performance. ProbFuse is a data fusion algorithm that uses the history of the underlying IR algorithms to estimate the probability that subsequent result sets include relevant documents in particular positions. In a number of previous experiments it has been shown to outperform CombMNZ, the standard data fusion algorithm against which performance is compared. This paper builds upon that work and applies probFuse to the much larger Web Track document collection from the 2004 Text REtrieval Conference. The performance of probFuse is compared against that of CombMNZ using a number of evaluation measures, and probFuse is shown to achieve substantial performance improvements.
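
To make the baseline concrete, the following is a minimal sketch of CombMNZ as it is commonly described: each document's normalised scores are summed across the input result sets, and the sum is multiplied by the number of result sets that returned the document. The abstract does not specify a normalisation scheme, so min-max normalisation is an assumption here, and the function names `min_max_normalise` and `comb_mnz` are illustrative only. This is not the authors' implementation, and it does not cover probFuse itself.

```python
from collections import defaultdict


def min_max_normalise(scores):
    """Rescale one system's raw scores to [0, 1] (assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as top-scored.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(result_sets):
    """Fuse result sets (each a dict of doc id -> raw score) with CombMNZ.

    Returns (doc id, fused score) pairs sorted by descending fused score.
    """
    summed = defaultdict(float)  # sum of normalised scores per document
    hits = defaultdict(int)      # number of result sets returning the document
    for scores in result_sets:
        if not scores:
            continue  # an empty result set contributes nothing
        for doc, s in min_max_normalise(scores).items():
            summed[doc] += s
            hits[doc] += 1
    fused = {doc: summed[doc] * hits[doc] for doc in summed}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    system_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
    system_b = {"d1": 10.0, "d4": 5.0}
    for doc, score in comb_mnz([system_a, system_b]):
        print(doc, score)
```

A document returned by several systems is rewarded twice, once through the summed score and once through the multiplier, which is why CombMNZ favours documents on which the input systems agree.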

Keywords

Data fusion, Information retrieval, ProbFuse

Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • David Lillis (1)
  • Fergus Toolan (2)
  • Rem Collier (1)
  • John Dunnion (1)
  1. School of Computer Science and Informatics, University College Dublin, Dublin 4, Ireland
  2. Faculty of Computing Science, Griffith College Dublin, Dublin 8, Ireland