Information Retrieval, Volume 10, Issue 3, pp 297–319

Result merging methods in distributed information retrieval with overlapping databases

Abstract

In distributed information retrieval systems, document overlaps occur frequently among different component databases. This paper presents an experimental investigation and evaluation of a group of result merging methods, including the shadow document method and the multi-evidence method, in the environment of overlapping databases. We assume that, apart from the result lists returned by each server (with either rankings or scores), no extra information about the retrieval servers and text databases is available, which is the usual case for many applications on the Internet and the Web.

The experimental results show that the shadow document method and the multi-evidence method are the two best methods when overlap is high, while Round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization is a better option than linear regression normalization for result merging in a heterogeneous environment.
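The merging strategies compared above can be illustrated concretely. The sketch below is not the paper's implementation; it assumes each server returns a list of `(doc_id, score)` pairs (function names are illustrative) and shows [0,1] linear normalization followed by a multi-evidence-style merge, in which a document retrieved by several servers accumulates its normalized scores, together with a round-robin baseline that only uses rankings.

```python
def minmax_normalize(results):
    """[0,1] linear normalization: map one server's raw scores onto [0,1]."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate list: all scores equal
        return [(d, 1.0) for d, _ in results]
    return [(d, (s - lo) / (hi - lo)) for d, s in results]

def multi_evidence_merge(result_lists):
    """Sum normalized scores; overlapping documents accumulate evidence."""
    merged = {}
    for results in result_lists:
        for doc, score in minmax_normalize(results):
            merged[doc] = merged.get(doc, 0.0) + score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

def round_robin_merge(result_lists):
    """Interleave ranked lists one rank at a time, skipping duplicates."""
    merged, seen = [], set()
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank][0] not in seen:
                seen.add(results[rank][0])
                merged.append(results[rank][0])
    return merged
```

For example, merging `[("d1", 12.0), ("d2", 8.0), ("d3", 4.0)]` with `[("d2", 0.9), ("d4", 0.5)]` places the overlapping document `d2` first under the multi-evidence merge, since its evidence from both servers is summed, whereas round-robin simply interleaves the two rankings.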

Keywords

Result merging · Distributed information retrieval · Overlapping databases


Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. School of Computing and Mathematics, University of Ulster, Northern Ireland, UK