On the Usage of Global Document Occurrences in Peer-to-Peer Information Systems

  • Odysseas Papapetrou
  • Sebastian Michel
  • Matthias Bender
  • Gerhard Weikum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3760)


There exist a number of approaches for query processing in Peer-to-Peer information systems that efficiently retrieve relevant information from distributed peers. However, very few of them take into consideration the overlap between peers: as the most popular resources (e.g., documents or files) are often present at most of the peers, a large fraction of the documents eventually received by the query initiator are duplicates. We develop a technique based on the notion of global document occurrences (GDO) that, when processing a query, penalizes frequent documents increasingly as more and more peers contribute their local results. We argue that the additional effort to create and maintain the GDO information is reasonably low, as the necessary information can be piggybacked onto the existing communication. Early experiments indicate that our approach significantly decreases the number of peers that have to be involved in a query to reach a certain level of recall and, thus, decreases user-perceived latency and the wastage of network resources.
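The idea of penalizing widely replicated documents more strongly as more peers contribute can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the formula used in the paper: the function names, the input layout (a mapping from peer id to the document ids it holds), and in particular the penalty itself, which multiplies a document's score by the probability that none of the already-contacted peers holds a copy, assuming copies are spread uniformly at random.

```python
from collections import Counter


def global_document_occurrences(peer_collections):
    """Count, for each document id, how many peers hold a copy (its GDO)."""
    gdo = Counter()
    for docs in peer_collections.values():
        # A peer counts once per document, even if it lists the id twice.
        gdo.update(set(docs))
    return gdo


def penalized_score(score, gdo, num_peers, peers_contacted):
    """Hypothetical GDO penalty (illustrative assumption, not the paper's formula).

    Scales `score` by the probability that none of the `peers_contacted`
    peers already holds the document, if each peer held it independently
    with probability gdo / num_peers.
    """
    p_held = gdo / num_peers
    return score * (1.0 - p_held) ** peers_contacted


if __name__ == "__main__":
    peers = {
        "p1": ["a", "b"],
        "p2": ["a", "c"],
        "p3": ["a"],
        "p4": ["d"],
    }
    gdo = global_document_occurrences(peers)
    for contacted in range(3):
        popular = penalized_score(1.0, gdo["a"], len(peers), contacted)
        rare = penalized_score(1.0, gdo["d"], len(peers), contacted)
        print(contacted, popular, rare)
```

In this sketch no penalty applies before any peer has been contacted, and the score of the popular document "a" shrinks much faster than that of the rare document "d" as the number of contacted peers grows, which mirrors the behavior the abstract describes.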


Keywords: Information Retrieval, Query Processing, Query Term, Local Index, Query Execution



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Odysseas Papapetrou (1)
  • Sebastian Michel (1)
  • Matthias Bender (1)
  • Gerhard Weikum (1)
  1. Max-Planck Institut für Informatik, Saarbrücken, Germany
